suod.models package

Subpackages

Submodules

suod.models.base module

suod.models.cost_predictor module

suod.models.jl_projection module

Johnson–Lindenstrauss process. Part of the code is adapted from https://github.com/PTAug/jlt-python

suod.models.jl_projection.jl_fit_transform(X, objective_dim, method='basic')

Fit and transform the input data using the Johnson–Lindenstrauss process. See [BJL84] for details.

Parameters

X : numpy array of shape (n_samples, n_features)

The input samples.

objective_dim : int

The expected output dimension.

method : string, optional (default='basic')

The JL projection method:

  • “basic”: each component of the transformation matrix is taken at random in N(0,1).

  • “discrete”: each component of the transformation matrix is taken at random in {-1,1}.

  • “circulant”: the first row of the transformation matrix is taken at random in N(0,1), and each subsequent row is obtained from the previous one by a one-position left shift.

  • “toeplitz”: the first row and first column of the transformation matrix are taken at random in N(0,1), and each diagonal has a constant value taken from these first vectors.

Returns

X_transformed : numpy array of shape (n_samples, objective_dim)

The dataset after the JL projection.

jl_transformer : object

Transformer instance.
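The four method options described above for jl_fit_transform can be illustrated with a short NumPy sketch. This is not the library's implementation: the helper name jl_matrix_sketch, the seed argument, the (objective_dim, n_features) matrix layout, and the 1/sqrt(objective_dim) scaling in the usage lines are all assumptions made for illustration.

```python
import numpy as np


def jl_matrix_sketch(n_features, objective_dim, method="basic", seed=0):
    """Build an (objective_dim, n_features) projection matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    if method == "basic":
        # Every entry drawn independently from N(0, 1).
        return rng.standard_normal((objective_dim, n_features))
    if method == "discrete":
        # Every entry drawn uniformly from {-1, 1}.
        return rng.choice([-1.0, 1.0], size=(objective_dim, n_features))
    if method == "circulant":
        # Random first row; row i is the first row shifted left by i positions.
        first_row = rng.standard_normal(n_features)
        return np.stack([np.roll(first_row, -i) for i in range(objective_dim)])
    if method == "toeplitz":
        # Random first row and first column; entry (i, j) depends only on i - j,
        # so every diagonal is constant.
        first_row = rng.standard_normal(n_features)
        first_col = rng.standard_normal(objective_dim)
        first_col[0] = first_row[0]  # the two vectors share the top-left entry
        i, j = np.indices((objective_dim, n_features))
        below = first_col[np.clip(i - j, 0, objective_dim - 1)]
        above = first_row[np.clip(j - i, 0, n_features - 1)]
        return np.where(i >= j, below, above)
    raise ValueError(f"unknown method: {method!r}")


X = np.random.default_rng(1).standard_normal((100, 50))
T = jl_matrix_sketch(n_features=50, objective_dim=10, method="toeplitz")
X_transformed = X @ T.T / np.sqrt(10)  # assumed scaling; shape (100, 10)
```

The circulant and Toeplitz variants need far fewer random draws than the dense Gaussian matrix (one or two vectors instead of a full matrix), which is the usual motivation for structured projections.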

suod.models.jl_projection.jl_transform(X, jl_transformer)

Use the fitted transformer to conduct JL projection.

Parameters

X : numpy array of shape (n_samples, n_features)

The input samples.

jl_transformer : object

Fitted transformer instance.

Returns

X_transformed : numpy array of shape (n_samples, reduced_dimensions)

Transformed matrix.
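A fitted transformer is passed in because new data has to be projected with the same random matrix that was drawn at fit time. The sketch below illustrates that fit-once, reuse-later pattern with plain NumPy rather than with the library itself. It assumes the transformer is a Gaussian matrix applied by matrix multiplication, and the names fit_transform_sketch and transform_sketch are hypothetical.

```python
import numpy as np


def fit_transform_sketch(X, objective_dim, seed=0):
    """Draw a Gaussian projection matrix and apply it (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(objective_dim) keeps squared distances unbiased.
    transformer = rng.standard_normal((X.shape[1], objective_dim))
    transformer /= np.sqrt(objective_dim)
    return X @ transformer, transformer


def transform_sketch(X, transformer):
    """Reuse a previously drawn matrix on new data (hypothetical helper)."""
    return X @ transformer


rng = np.random.default_rng(42)
X_train = rng.standard_normal((200, 1000))
X_test = rng.standard_normal((50, 1000))

X_train_low, transformer = fit_transform_sketch(X_train, objective_dim=200)
X_test_low = transform_sketch(X_test, transformer)

# Pairwise distances should be roughly preserved by the projection.
d_high = np.linalg.norm(X_train[0] - X_train[1])
d_low = np.linalg.norm(X_train_low[0] - X_train_low[1])
print(X_test_low.shape, d_low / d_high)
```

Drawing a fresh matrix for the test data would place it in a different low-dimensional space from the training data, so scores computed across the two would not be comparable.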

suod.models.parallel_processes module

suod.models.parallel_processes.balanced_scheduling(time_cost_pred, n_estimators, n_jobs, verbose=False)

Conduct balanced scheduling based on the sum of ranks, for both training and prediction. The algorithm balances the load by making the sum of ranks as equal as possible across workers.

Parameters

time_cost_pred : list

The list of time costs forecast by the cost predictor. Its length is equal to the number of base detectors.

n_estimators : int

The number of base estimators.

n_jobs : int, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

verbose : bool, optional (default=False)

Controls the verbosity of the building process.

Returns

n_estimators_list : list

The number of estimators for each worker.

starts : list

The start indices of the base detectors to be scheduled. For instance, the base detectors in the index range from starts[k] to starts[k+1] will be assigned to worker k.

n_jobs : int

The actual usable number of jobs to run in parallel.
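The idea of balancing workers by rank sum rather than by estimator count can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper name rank_sum_schedule_sketch is hypothetical, estimators are assigned in contiguous index chunks, and ties in the forecast costs are broken arbitrarily.

```python
import numpy as np


def rank_sum_schedule_sketch(time_cost_pred, n_jobs):
    """Split estimators into contiguous chunks with similar rank sums (hypothetical helper)."""
    costs = np.asarray(time_cost_pred, dtype=float)
    n_estimators = len(costs)
    n_jobs = min(n_jobs, n_estimators)  # cannot use more workers than estimators

    # Rank 1 = cheapest forecast, rank n = most expensive (ties broken arbitrarily).
    ranks = np.argsort(np.argsort(costs)) + 1
    cumulative = np.cumsum(ranks)

    # Cut after the first estimator whose cumulative rank reaches each worker's share.
    targets = cumulative[-1] * np.arange(1, n_jobs) / n_jobs
    cuts = (np.searchsorted(cumulative, targets) + 1).tolist()

    starts = [0] + cuts + [n_estimators]
    n_estimators_list = [starts[k + 1] - starts[k] for k in range(n_jobs)]
    return n_estimators_list, starts, n_jobs


costs = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0, 7.0, 8.0]
n_estimators_list, starts, n_jobs = rank_sum_schedule_sketch(costs, n_jobs=2)
print(n_estimators_list, starts)  # [6, 2] [0, 6, 8]
```

In this example the two workers receive rank sums of 21 and 15, whereas an equal-count split of four estimators each would give 10 and 26. The sketch does not guard against empty chunks, which can occur when n_jobs is close to the number of estimators.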

suod.models.parallel_processes.cost_forecast_meta(clf, X, base_estimator_names)

Forecast model cost using a pretrained cost estimator.

Parameters

clf : object, sklearn regressor

Random forest regressor trained to forecast model cost.

X : numpy array of shape (n_samples, n_features)

The input samples.

base_estimator_names : list of str

The list of outlier detection model names, in string format.

Returns

time_cost_pred : numpy array

The forecast cost of each outlier detection model, in seconds.

suod.models.parallel_processes.indices_to_one_hot(data, nb_classes)

Convert an iterable of indices to one-hot encoded labels.
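One common way to perform this conversion in NumPy is to index the rows of an identity matrix. The sketch below shows that approach; the helper name one_hot_sketch is hypothetical, and the code is an illustration rather than a copy of the library function.

```python
import numpy as np


def one_hot_sketch(data, nb_classes):
    """One-hot encode an iterable of integer indices (hypothetical helper)."""
    targets = np.asarray(list(data), dtype=int).reshape(-1)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(nb_classes)[targets]


encoded = one_hot_sketch([0, 2, 1], nb_classes=3)
print(encoded)
```

Each input index must be smaller than nb_classes; otherwise the row lookup raises an IndexError.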

Module contents

References

[BJL84]

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[BKKSZ11]

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.

[BZNL19]

Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: A Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20:1–7, 2019.