suod.models package¶
Subpackages¶
Submodules¶
suod.models.base module¶
Base class and functions of SUOD (Scalable Unsupervised Outlier Detection)
- class suod.models.base.SUOD(base_estimators, contamination=0.1, n_jobs=None, rp_clf_list=None, rp_ng_clf_list=None, rp_flag_global=True, target_dim_frac=0.5, jl_method='basic', bps_flag=True, approx_clf_list=None, approx_ng_clf_list=None, approx_flag_global=True, approx_clf=None, cost_forecast_loc_fit=None, cost_forecast_loc_pred=None, verbose=False)[source]¶
Bases:
object
SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised outlier detector training and prediction. The corresponding paper is under review in KDD 2020.
- Parameters
base_estimators (list, length must be greater than 1) – A list of base estimators. Certain methods must be present, e.g., fit and predict.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
n_jobs (optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of jobs that can actually run in parallel.
rp_clf_list (list, optional (default=None)) – The list of outlier detection models to use random projection. The detector name should be consistent with PyOD.
rp_ng_clf_list (list, optional (default=None)) – The list of outlier detection models NOT to use random projection. The detector name should be consistent with PyOD.
rp_flag_global (bool, optional (default=True)) – If set to False, random projection is turned off for all base models.
target_dim_frac (float in (0., 1), optional (default=0.5)) – The target compression ratio.
jl_method (string, optional (default = 'basic')) –
The JL projection method:
"basic": each component of the transformation matrix is drawn at random from N(0, 1).
"discrete": each component of the transformation matrix is drawn at random from {-1, 1}.
"circulant": the first row of the transformation matrix is drawn at random from N(0, 1), and each subsequent row is obtained from the previous one by a one-left shift.
"toeplitz": the first row and column of the transformation matrix are drawn at random from N(0, 1), and each diagonal takes a constant value from these first vectors.
bps_flag (bool, optional (default=True)) – If set to False, balanced parallel scheduling is turned off.
approx_clf_list (list, optional (default=None)) – The list of outlier detection models to use pseudo-supervised approximation. The detector name should be consistent with PyOD.
approx_ng_clf_list (list, optional (default=None)) – The list of outlier detection models NOT to use pseudo-supervised approximation. The detector name should be consistent with PyOD.
approx_flag_global (bool, optional (default=True)) – If set to False, pseudo-supervised approximation is turned off.
approx_clf (object, optional (default: sklearn RandomForestRegressor)) – The supervised model used to approximate unsupervised models.
cost_forecast_loc_fit (str, optional) – The location of the pretrained cost forecast model for training (fit).
cost_forecast_loc_pred (str, optional) – The location of the pretrained cost forecast model for prediction.
verbose (bool, optional (default=False)) – Controls the verbosity of the building process.
- approximate(X)[source]¶
Use the supervised regressor (random forest by default) to approximate unsupervised fitted outlier detectors.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples. The same feature space of the unsupervised outlier detector will be used.
- Returns
self – The estimator after approximation.
- Return type
object
- decision_function(X)[source]¶
Predict raw anomaly scores of X using the fitted detectors.
The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- Returns
anomaly_scores – The anomaly score of the input samples.
- Return type
numpy array of shape (n_samples,)
- fit(X)[source]¶
Fit all base estimators.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.
- Returns
self – Fitted estimator.
- Return type
object
- get_params(deep=True)[source]¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- predict(X)[source]¶
Predict the class labels for the provided data.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.
- Returns
outlier_labels – For each observation, indicates whether or not it should be considered an outlier according to the fitted model: 0 stands for inliers and 1 for outliers.
- Return type
numpy array of shape (n_samples, n_estimators)
- predict_proba(X)[source]¶
Predict the probability of a sample being an outlier. Two approaches are possible:
use min-max conversion to linearly transform the outlier scores into the range [0, 1]; the model must be fitted first.
use unifying scores, see [BKKSZ11].
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.
- Returns
outlier_probability – The outlier probability of the input samples, ranging in [0, 1].
- Return type
numpy array of shape (n_samples,)
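The min-max conversion mentioned above can be sketched in NumPy as follows (a standalone illustration of the idea, not the package's internal code; the function name is made up):

```python
import numpy as np

def minmax_probability(train_scores, test_scores):
    """Linearly rescale outlier scores into [0, 1] using the
    training-score range, clipping scores outside that range."""
    lo, hi = train_scores.min(), train_scores.max()
    proba = (test_scores - lo) / (hi - lo)
    return np.clip(proba, 0.0, 1.0)

train_scores = np.array([0.0, 2.0, 4.0, 10.0])
test_scores = np.array([1.0, 5.0, 12.0])
print(minmax_probability(train_scores, test_scores))  # [0.1 0.5 1. ]
```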
- set_params(**params)[source]¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
- Returns
self
- Return type
object
suod.models.cost_predictor module¶
Cost predictor function for forecasting base model training and prediction cost.
suod.models.jl_projection module¶
Johnson–Lindenstrauss process. Part of the code is adapted from https://github.com/PTAug/jlt-python
- suod.models.jl_projection.jl_fit_transform(X, objective_dim, method='basic')[source]¶
Fit and transform the input data by Johnson–Lindenstrauss process. See [BJL84] for details.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.
objective_dim (int) – The expected output dimension.
method (string, optional (default = 'basic')) –
The JL projection method:
"basic": each component of the transformation matrix is drawn at random from N(0, 1).
"discrete": each component of the transformation matrix is drawn at random from {-1, 1}.
"circulant": the first row of the transformation matrix is drawn at random from N(0, 1), and each subsequent row is obtained from the previous one by a one-left shift.
"toeplitz": the first row and column of the transformation matrix are drawn at random from N(0, 1), and each diagonal takes a constant value from these first vectors.
- Returns
X_transformed (numpy array of shape (n_samples, objective_dim)) – The dataset after the JL projection.
jl_transformer (object) – Transformer instance.
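The "basic" and "circulant" constructions described above can be sketched in NumPy (an independent illustration of the matrix construction, not the adapted code; the 1/sqrt(k) factor is the standard JL normalization so that squared distances are preserved in expectation):

```python
import numpy as np

def basic_jl_matrix(n_features, objective_dim, rng):
    # Each entry drawn i.i.d. from N(0, 1).
    return rng.standard_normal((objective_dim, n_features))

def circulant_jl_matrix(n_features, objective_dim, rng):
    # First row drawn from N(0, 1); each subsequent row is the
    # previous one shifted left by one position.
    first = rng.standard_normal(n_features)
    return np.stack([np.roll(first, -i) for i in range(objective_dim)])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
A = basic_jl_matrix(20, 5, rng)
X_low = X @ A.T / np.sqrt(5)  # project 20 dims down to 5
print(X_low.shape)  # (100, 5)
```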
- suod.models.jl_projection.jl_transform(X, jl_transformer)[source]¶
Use the fitted transformer to conduct JL projection.
- Parameters
X (numpy array of shape (n_samples, n_features)) – The input samples.
jl_transformer (object) – Fitted transformer instance.
- Returns
X_transformed – Transformed matrix.
- Return type
numpy array of shape (n_samples, reduced_dimensions)
suod.models.parallel_processes module¶
- suod.models.parallel_processes.balanced_scheduling(time_cost_pred, n_estimators, n_jobs, verbose=False)[source]¶
Conduct balanced scheduling based on the sum of ranks, for both training and prediction. The algorithm enforces equal rank sums among workers.
- Parameters
time_cost_pred (list) – The list of time cost by the cost predictor. The length is equal to the number of base detectors.
n_estimators (int) – The number of base estimators.
n_jobs (optional (default=1)) – The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
verbose (bool, optional (default=False)) – Controls the verbosity of the building process.
- Returns
n_estimators_list (list) – The number of estimators assigned to each worker.
starts (list) – The starting indices of base detectors for each worker. For instance, base detectors with indices in [starts[k], starts[k+1]) will be assigned to worker k.
n_jobs – The actual usable number of jobs to run in parallel.
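The rank-sum balancing idea can be sketched as follows (a simplified illustration, not the package's exact algorithm): rank detectors by forecast cost and deal them to workers in snake order, so each worker's rank sum is approximately equal.

```python
import numpy as np

def balanced_schedule(time_cost_pred, n_jobs):
    """Assign detector indices to workers so that the sum of cost
    ranks per worker is (approximately) equal."""
    order = np.argsort(time_cost_pred)[::-1]  # most expensive first
    assignment = [[] for _ in range(n_jobs)]
    for round_start in range(0, len(order), n_jobs):
        chunk = order[round_start:round_start + n_jobs]
        if (round_start // n_jobs) % 2 == 1:
            chunk = chunk[::-1]  # snake order to balance rank sums
        for worker, idx in enumerate(chunk):
            assignment[worker].append(int(idx))
    return assignment

costs = [5.0, 1.0, 3.0, 4.0, 2.0, 6.0]  # forecast cost per detector
print(balanced_schedule(costs, 2))  # [[5, 2, 4], [0, 3, 1]]
```

With two workers, the rank sums (and here also the total costs, 11.0 vs. 10.0) come out nearly equal.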
- suod.models.parallel_processes.cost_forecast_meta(clf, X, base_estimator_names)[source]¶
Forecast model cost by pretrained cost estimator.
- Parameters
clf (object, sklearn regressor) – The random forest regressor trained to forecast model cost.
X (numpy array of shape (n_samples, n_features)) – The input samples.
base_estimator_names (list of str) – The list of outlier detection model names in string format.
- Returns
time_cost_pred
- Return type
numpy array of forecast outlier detection model costs in seconds.
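A toy sketch of the forecasting idea, assuming scikit-learn is available; the meta-features and cost values below are entirely made up and only show the shape of the interface, not the features the pretrained predictor actually uses:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical meta-features: (n_samples, n_features, model_type_code),
# with observed training cost in seconds as the target.
meta_X = np.array([[1000, 10, 0], [5000, 10, 0],
                   [1000, 10, 1], [5000, 10, 1]], dtype=float)
meta_y = np.array([0.5, 2.5, 1.0, 5.0])

clf = RandomForestRegressor(n_estimators=20, random_state=0)
clf.fit(meta_X, meta_y)

# Forecast the cost of two hypothetical detectors on a 2000 x 10 dataset.
queries = np.array([[2000, 10, 0], [2000, 10, 1]], dtype=float)
time_cost_pred = clf.predict(queries)
print(time_cost_pred.shape)  # (2,)
```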
Module contents¶
References
- BJL84
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
- BKKSZ11
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.
- BZNL19
Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.