suod.models package

Submodules

suod.models.base module

Base class and functions of SUOD (Scalable Unsupervised Outlier Detection)

class suod.models.base.SUOD(base_estimators, contamination=0.1, n_jobs=None, rp_clf_list=None, rp_ng_clf_list=None, rp_flag_global=True, target_dim_frac=0.5, jl_method='basic', bps_flag=True, approx_clf_list=None, approx_ng_clf_list=None, approx_flag_global=True, approx_clf=None, cost_forecast_loc_fit=None, cost_forecast_loc_pred=None, verbose=False)[source]

Bases: object

SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised outlier detector training and prediction. The corresponding paper is under review at KDD 2020.

Parameters
  • base_estimators (list, length must be greater than 1) – A list of base estimators. Each estimator must implement certain methods, e.g., fit and predict.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • n_jobs (optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • rp_clf_list (list, optional (default=None)) – The list of outlier detection models to which random projection is applied. The detector names should be consistent with PyOD.

  • rp_ng_clf_list (list, optional (default=None)) – The list of outlier detection models to which random projection is NOT applied. The detector names should be consistent with PyOD.

  • rp_flag_global (bool, optional (default=True)) – If set to False, random projection is turned off for all base models.

  • target_dim_frac (float in (0., 1), optional (default=0.5)) – The target compression ratio, i.e., the fraction of the original dimensionality kept after projection.

  • jl_method (string, optional (default = 'basic')) –

    The JL projection method:

    • "basic": each component of the transformation matrix is drawn at random from N(0,1).

    • "discrete": each component of the transformation matrix is drawn at random from {-1,1}.

    • "circulant": the first row of the transformation matrix is drawn at random from N(0,1), and each subsequent row is obtained from the previous one by a one-left shift.

    • "toeplitz": the first row and column of the transformation matrix are drawn at random from N(0,1), and each diagonal takes a constant value from these first vectors.

  • bps_flag (bool, optional (default=True)) – If set to False, balanced parallel scheduling is turned off.

  • approx_clf_list (list, optional (default=None)) – The list of outlier detection models to which pseudo-supervised approximation is applied. The detector names should be consistent with PyOD.

  • approx_ng_clf_list (list, optional (default=None)) – The list of outlier detection models to which pseudo-supervised approximation is NOT applied. The detector names should be consistent with PyOD.

  • approx_flag_global (bool, optional (default=True)) – If set to False, pseudo-supervised approximation is turned off.

  • approx_clf (object, optional (default: sklearn RandomForestRegressor)) – The supervised model used to approximate unsupervised models.

  • cost_forecast_loc_fit (str, optional) – The location of the pretrained cost forecast model used for training (fit).

  • cost_forecast_loc_pred (str, optional) – The location of the pretrained cost forecast model used for prediction.

  • verbose (bool, optional (default=False)) – Controls the verbosity of the building process.

approximate(X)[source]

Use the supervised regressor (random forest by default) to approximate unsupervised fitted outlier detectors.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples. The same feature space of the unsupervised outlier detector will be used.

Returns

self – The estimator with the fitted unsupervised models approximated.

Return type

object
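The idea behind approximate can be sketched without SUOD itself: train a supervised regressor to reproduce an unsupervised detector's decision scores, then use the fast regressor at prediction time. The distance-to-mean "detector" below is a hypothetical stand-in, not part of the library:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
X = rng.rand(300, 5)

# Stand-in for an unsupervised detector's decision scores: the
# Euclidean distance to the data mean (purely illustrative).
scores = np.linalg.norm(X - X.mean(axis=0), axis=1)

# Pseudo-supervised approximation: fit a supervised regressor to
# reproduce the unsupervised scores.
approximator = RandomForestRegressor(n_estimators=50, random_state=42)
approximator.fit(X, scores)

approx_scores = approximator.predict(X)
# In-sample, the approximator should track the unsupervised scores closely.
corr = np.corrcoef(scores, approx_scores)[0, 1]
```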

decision_function(X)[source]

Predict raw anomaly scores of X using the fitted detectors.

The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns

anomaly_scores – The anomaly score of the input samples.

Return type

numpy array of shape (n_samples,)

fit(X)[source]

Fit all base estimators.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.

Returns

self – Fitted estimator.

Return type

object

get_params(deep=True)[source]

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

predict(X)[source]

Predict the class labels for the provided data.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.

Returns

outlier_labels – For each observation, indicates whether it should be considered an outlier according to each fitted base estimator: 0 stands for inliers and 1 for outliers.

Return type

numpy array of shape (n_samples, n_estimators)

predict_proba(X)[source]

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0, 1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.

Returns

outlier_probability – The outlier probability of each input sample, ranging in [0, 1].

Return type

numpy array of shape (n_samples,)
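Approach 1 above (min-max conversion) can be sketched as follows; `scores_to_proba` is a hypothetical helper for illustration, not part of the library:

```python
import numpy as np

def scores_to_proba(train_scores, new_scores):
    """Min-max convert raw outlier scores to [0, 1] probabilities,
    using the fitted (training) scores' range: a sketch of approach 1."""
    lo, hi = train_scores.min(), train_scores.max()
    proba = (new_scores - lo) / (hi - lo)
    # Scores outside the training range are clipped into [0, 1].
    return np.clip(proba, 0, 1)

train_scores = np.array([0.2, 0.5, 1.5, 3.0])
test_scores = np.array([0.2, 3.0, 5.0])
print(scores_to_proba(train_scores, test_scores))  # → [0. 1. 1.]
```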

set_params(**params)[source]

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns

self

Return type

object

suod.models.cost_predictor module

Cost predictor function for forecasting base model training and prediction cost.

suod.models.cost_predictor.build_cost_predictor(file_name, output_file, save_to_local=True)[source]

Build the cost predictor from scratch. In general, this does not need to be called by users.

Parameters
  • file_name (string) – The training table of algorithm performance.

  • output_file (string) – The location at which to save the trained cost predictor.

  • save_to_local (bool, optional (default=True)) – Whether to save the trained predictor to a local file.

suod.models.cost_predictor.indices_to_one_hot(data, nb_classes)[source]

Convert an iterable of indices to one-hot encoded labels.

Parameters
  • data (list) – The raw data.

  • nb_classes (int) – The number of targeted classes.
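A minimal numpy sketch of such a helper (the library's own implementation may differ):

```python
import numpy as np

def indices_to_one_hot(data, nb_classes):
    """One-hot encode an iterable of class indices by indexing
    rows of the identity matrix."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

print(indices_to_one_hot([0, 2, 1], 3))
# → [[1. 0. 0.]
#    [0. 0. 1.]
#    [0. 1. 0.]]
```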

suod.models.jl_projection module

Johnson–Lindenstrauss process. Part of the code is adapted from https://github.com/PTAug/jlt-python

suod.models.jl_projection.jl_fit_transform(X, objective_dim, method='basic')[source]

Fit and transform the input data by the Johnson–Lindenstrauss process. See [BJL84] for details.

Parameters
  • X (numpy array of shape (n_samples, n_features)) – The input samples.

  • objective_dim (int) – The expected output dimension.

  • method (string, optional (default = 'basic')) –

    The JL projection method:

    • "basic": each component of the transformation matrix is drawn at random from N(0,1).

    • "discrete": each component of the transformation matrix is drawn at random from {-1,1}.

    • "circulant": the first row of the transformation matrix is drawn at random from N(0,1), and each subsequent row is obtained from the previous one by a one-left shift.

    • "toeplitz": the first row and column of the transformation matrix are drawn at random from N(0,1), and each diagonal takes a constant value from these first vectors.

Returns

  • X_transformed (numpy array of shape (n_samples, objective_dim)) – The dataset after the JL projection.

  • jl_transformer (object) – Transformer instance.
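The "basic" method can be sketched with plain numpy; the 1/sqrt(k) scaling below is the conventional JL normalization and is an assumption about, not a copy of, the library's implementation:

```python
import numpy as np

def jl_basic_fit_transform(X, objective_dim, random_state=0):
    """Sketch of the 'basic' JL projection: a dense random matrix
    whose components are drawn i.i.d. from N(0, 1)."""
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    transformer = rng.normal(size=(n_features, objective_dim))
    # Scale by 1/sqrt(k) so pairwise distances are preserved in expectation.
    X_transformed = X @ transformer / np.sqrt(objective_dim)
    return X_transformed, transformer

X = np.random.rand(100, 20)
X_new, P = jl_basic_fit_transform(X, objective_dim=5)
print(X_new.shape)  # → (100, 5)
```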

suod.models.jl_projection.jl_transform(X, jl_transformer)[source]

Use the fitted transformer to conduct JL projection.

Parameters
  • X (numpy array of shape (n_samples, n_features)) – The input samples.

  • jl_transformer (object) – Fitted transformer instance.

Returns

X_transformed – Transformed matrix.

Return type

numpy array of shape (n_samples, reduced_dimensions)

suod.models.parallel_processes module

suod.models.parallel_processes.balanced_scheduling(time_cost_pred, n_estimators, n_jobs, verbose=False)[source]

Conduct balanced scheduling based on the sum of ranks, for both training and prediction. The algorithm enforces an approximately equal sum of ranks among workers.

Parameters
  • time_cost_pred (list) – The list of time cost by the cost predictor. The length is equal to the number of base detectors.

  • n_estimators (int) – The number of base estimators.

  • n_jobs (optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • verbose (bool, optional (default=False)) – Controls the verbosity of the building process.

Returns

  • n_estimators_list (list) – The number of estimators assigned to each worker.

  • starts (list) – The boundary indices of the base detectors to be scheduled. For instance, base detectors with indices in [starts[k], starts[k+1]) will be assigned to worker k.

  • n_jobs (int) – The actual number of jobs to run in parallel.
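The idea can be sketched as a greedy rank-sum balancer: rank detectors by forecast cost, then assign each detector (costliest first) to the worker with the smallest current rank sum. `balanced_scheduling_sketch` is a hypothetical illustration; the library's actual algorithm may differ in details:

```python
import numpy as np

def balanced_scheduling_sketch(time_cost_pred, n_jobs):
    """Greedy rank-sum balancing over n_jobs workers."""
    order = np.argsort(time_cost_pred)[::-1]            # costliest first
    ranks = np.argsort(np.argsort(time_cost_pred)) + 1  # rank 1..n per detector
    assignments = [[] for _ in range(n_jobs)]
    rank_sums = [0] * n_jobs
    for idx in order:
        # Give the next detector to the currently least-loaded worker.
        worker = int(np.argmin(rank_sums))
        assignments[worker].append(int(idx))
        rank_sums[worker] += int(ranks[idx])
    return assignments, rank_sums

costs = [5.0, 1.0, 3.0, 2.0, 4.0, 6.0]
assignments, rank_sums = balanced_scheduling_sketch(costs, n_jobs=2)
print(rank_sums)  # the two workers end up with nearly equal rank sums
```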

suod.models.parallel_processes.cost_forecast_meta(clf, X, base_estimator_names)[source]

Forecast model cost by pretrained cost estimator.

Parameters
  • clf (object, sklearn regressor) – A random forest regressor trained to forecast model cost.

  • X (numpy array of shape (n_samples, n_features)) – The input samples.

  • base_estimator_names (list of str) – The list of outlier detection model names as strings.

Returns

time_cost_pred – The forecast cost of each outlier detection model, in seconds.

Return type

numpy array

suod.models.parallel_processes.indices_to_one_hot(data, nb_classes)[source]

Convert an iterable of indices to one-hot encoded labels.

Module contents

References

BJL84

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

BKKSZ11

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.

BZNL19

Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.