suod.models package
Subpackages
Submodules
suod.models.base module
suod.models.cost_predictor module
suod.models.jl_projection module
Johnson–Lindenstrauss process. Part of the code is adapted from https://github.com/PTAug/jlt-python
- suod.models.jl_projection.jl_fit_transform(X, objective_dim, method='basic')
Fit a Johnson–Lindenstrauss (JL) projection and apply it to the input data. See [BJL84] for details.
Parameters
- X : numpy array of shape (n_samples, n_features)
The input samples.
- objective_dim : int
The expected output dimension.
- method : str, optional (default='basic')
The JL projection method:
“basic”: each component of the transformation matrix is drawn at random from N(0, 1).
“discrete”: each component of the transformation matrix is drawn at random from {-1, 1}.
“circulant”: the first row of the transformation matrix is drawn at random from N(0, 1), and each subsequent row is obtained from the previous one by a one-position left shift.
“toeplitz”: the first row and first column of the transformation matrix are drawn at random from N(0, 1), and each diagonal takes a constant value from these first vectors.
Returns
- X_transformed : numpy array of shape (n_samples, objective_dim)
The dataset after the JL projection.
- jl_transformer : object
Transformer instance.
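As an illustration of the four method options, the following is a minimal NumPy sketch of how such transformation matrices could be built. It is not the suod implementation: the helper names jl_matrix and project, the seed argument, and the 1/sqrt(objective_dim) scaling (a common JL convention) are assumptions made for this example.

```python
import numpy as np


def jl_matrix(n_features, objective_dim, method="basic", seed=0):
    """Hypothetical helper: build an (objective_dim, n_features) JL matrix."""
    rng = np.random.default_rng(seed)
    k, d = objective_dim, n_features
    if method == "basic":
        # Every entry is an independent N(0, 1) draw.
        return rng.standard_normal((k, d))
    if method == "discrete":
        # Every entry is -1 or +1 with equal probability.
        return rng.choice([-1.0, 1.0], size=(k, d))
    if method == "circulant":
        # Only the first row is random; row i is row i-1 shifted left by one.
        first_row = rng.standard_normal(d)
        return np.stack([np.roll(first_row, -i) for i in range(k)])
    if method == "toeplitz":
        # Random first row and first column; entry (i, j) depends only on j - i,
        # so every diagonal is constant. The corner value comes from first_col.
        first_row = rng.standard_normal(d)
        first_col = rng.standard_normal(k)
        diag_values = np.concatenate([first_col[::-1], first_row[1:]])
        i, j = np.indices((k, d))
        return diag_values[j - i + k - 1]
    raise ValueError(f"unknown method: {method!r}")


def project(X, matrix):
    """Project X from n_features down to objective_dim (assumed scaling)."""
    objective_dim = matrix.shape[0]
    return X @ matrix.T / np.sqrt(objective_dim)


X = np.random.default_rng(1).standard_normal((5, 6))
for name in ("basic", "discrete", "circulant", "toeplitz"):
    print(name, project(X, jl_matrix(6, 3, name)).shape)
```

The circulant and toeplitz variants need far fewer random draws than the basic one (one or two vectors instead of a full matrix), which is the usual reason for offering them.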
- suod.models.jl_projection.jl_transform(X, jl_transformer)
Apply a previously fitted transformer to conduct the JL projection.
Parameters
- X : numpy array of shape (n_samples, n_features)
The input samples.
- jl_transformer : object
Fitted transformer instance.
Returns
- X_transformed : numpy array of shape (n_samples, reduced_dimensions)
Transformed matrix.
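The split between a fit step and a transform step matters because training and test data must be projected with the same random matrix. The sketch below illustrates that pattern with plain NumPy. The functions fit_projection and apply_projection are hypothetical stand-ins written for this example, not the suod functions, and the Gaussian matrix with 1/sqrt(objective_dim) scaling is an assumption.

```python
import numpy as np


def fit_projection(X, objective_dim, rng):
    """Hypothetical fit step: draw the random matrix once and apply it."""
    n_features = X.shape[1]
    transformer = rng.standard_normal((n_features, objective_dim))
    transformer /= np.sqrt(objective_dim)
    return X @ transformer, transformer


def apply_projection(X, transformer):
    """Hypothetical transform step: reuse the matrix drawn during fitting."""
    return X @ transformer


rng = np.random.default_rng(42)
X_train = rng.standard_normal((100, 200))
X_test = rng.standard_normal((20, 200))

X_train_low, transformer = fit_projection(X_train, 50, rng)
X_test_low = apply_projection(X_test, transformer)
print(X_train_low.shape, X_test_low.shape)  # (100, 50) (20, 50)

# Because both sets share one matrix, distances between a training point and
# a test point are approximately preserved in the reduced space.
original = np.linalg.norm(X_train[0] - X_test[0])
reduced = np.linalg.norm(X_train_low[0] - X_test_low[0])
print("distance ratio after projection:", reduced / original)
```

Drawing a fresh matrix for the test data would place it in an unrelated subspace, so detectors fitted on the projected training data would no longer apply.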
suod.models.parallel_processes module
- suod.models.parallel_processes.balanced_scheduling(time_cost_pred, n_estimators, n_jobs, verbose=False)
Conduct balanced scheduling based on the sum of ranks, for both training and prediction. The algorithm balances the workload by making the sum of ranks as equal as possible across workers.
Parameters
- time_cost_pred : list
The list of time costs forecast by the cost predictor. Its length is equal to the number of base detectors.
- n_estimators : int
The number of base estimators.
- n_jobs : int
The number of jobs to run in parallel for both fit and predict. If -1, the number of jobs is set to the number of cores.
- verbose : bool, optional (default=False)
Controls the verbosity of the building process.
Returns
- n_estimators_list : list
The number of estimators for each worker.
- starts : list
The boundary indices of the base detectors to be scheduled. For instance, the base detectors in the index range starts[k] to starts[k+1] are assigned to worker k.
- n_jobs : int
The actual usable number of jobs to run in parallel.
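To make the three return values concrete, here is a small pure-Python sketch of rank-based scheduling. It is an illustration under stated assumptions, not the suod code: the function names rank_costs and balanced_chunks are hypothetical, detectors are kept in their given order and split into contiguous chunks, and a chunk is closed greedily once its rank sum reaches the per-worker target.

```python
def rank_costs(costs):
    """Return 1-based ranks of the costs (ties broken by position)."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    ranks = [0] * len(costs)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def balanced_chunks(time_cost_pred, n_jobs):
    """Hypothetical scheduler: contiguous chunks with similar rank sums."""
    ranks = rank_costs(time_cost_pred)
    n_jobs = min(n_jobs, len(ranks))
    target = sum(ranks) / n_jobs  # ideal rank sum per worker

    starts = [0]
    running = 0
    for i, rank in enumerate(ranks):
        running += rank
        more_workers = len(starts) < n_jobs
        more_detectors = i + 1 < len(ranks)
        if running >= target and more_workers and more_detectors:
            starts.append(i + 1)  # next worker begins at detector i + 1
            running = 0
    starts.append(len(ranks))

    n_estimators_list = [b - a for a, b in zip(starts, starts[1:])]
    return n_estimators_list, starts, len(n_estimators_list)


costs = [0.1, 5.0, 0.2, 3.0, 0.3, 4.0]  # forecast seconds per detector
n_estimators_list, starts, used_jobs = balanced_chunks(costs, n_jobs=2)
print(n_estimators_list, starts, used_jobs)  # [4, 2] [0, 4, 6] 2
```

Here worker 0 receives detectors 0 to 3 and worker 1 receives detectors 4 and 5. Scheduling on ranks rather than raw seconds makes the split less sensitive to errors in the absolute cost forecasts.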
- suod.models.parallel_processes.cost_forecast_meta(clf, X, base_estimator_names)
Forecast the cost of each model with a pretrained cost estimator.
Parameters
- clf : object, sklearn regressor
Random forest regressor trained to forecast model cost.
- X : numpy array of shape (n_samples, n_features)
The input samples.
- base_estimator_names : list of str
The list of outlier detection model names in string format.
Returns
- time_cost_pred : numpy array
The forecast cost of each outlier detection model, in seconds.
Module contents
References
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.
Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.