pytranskit.optrans.decomposition package

CCA

class pytranskit.optrans.decomposition.cca.CCA(n_components=1, scale=True, max_iter=500, tol=1e-06, copy=True)[source]

Bases: object

Canonical Correlation Analysis.

This is a wrapper for scikit-learn’s CCA class, which allows it to be used in a similar manner to PLDA and PCA.

Parameters:
  • n_components (int (default=1)) – Number of components to keep.

  • scale (bool (default=True)) – Whether to scale the data.

  • max_iter (int (default=500)) – The maximum number of iterations of the NIPALS inner loop.

  • tol (float (default=1e-6)) – The tolerance used in the iterative algorithm.

  • copy (bool (default=True)) – Whether the deflation should be done on a copy. Leave the default value of True unless you don’t care about side effects.

components_

X block weights vectors.

Type:

array, shape (n_components, n_features)

components_y_

Y block weights vectors.

Type:

array, shape (n_components, n_targets)

explained_variance_

The amount of variance explained by each of the selected weights for the X data.

Type:

array, shape (n_components,)

explained_variance_y_

The amount of variance explained by each of the selected weights for the Y data.

Type:

array, shape (n_components,)

mean_

Per-feature empirical mean of X, estimated from the training set.

Type:

array, shape (n_features,)

mean_y_

Per-feature empirical mean of Y, estimated from the training set.

Type:

array, shape (n_targets,)

n_components_

The number of components.

Type:

int

References

scikit-learn’s documentation on CCA: http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html

Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case. Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.

fit(X, Y)[source]

Fit model to data.

Parameters:
  • X (array, shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • Y (array, shape (n_samples, n_targets)) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.

fit_transform(X, Y)[source]

Learn and apply the dimension reduction on the train data.

Parameters:
  • X (array, shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • Y (array, shape (n_samples, n_targets)) – Target vectors, where n_samples is the number of samples and n_targets is the number of response variables.

Returns:

  • X_new (array, shape (n_samples, n_components)) – Transformed X data.

  • Y_new (array, shape (n_samples, n_components)) – Transformed Y data.

inverse_transform(X, Y=None)[source]

Transform data back to its original space.

Note: This is not exact!

Parameters:
  • X (array, shape (n_samples, n_components)) – Transformed X data.

  • Y (array, shape (n_samples, n_components) or None (default=None)) – Transformed Y data. If Y=None, only the X data are transformed back to the original space.

Returns:

  • X_original (array, shape (n_samples, n_features)) – X data transformed back into original space.

  • Y_original (array, shape (n_samples, n_targets)) – Y data transformed back into original space. If Y=None, only X_original is returned.

score(X, Y)[source]

Return Pearson product-moment correlation coefficients for each component.

The values of R are between -1 and 1, inclusive.

Note: This is different from sklearn.cross_decomposition.CCA.score(), which returns the coefficient of determination of the prediction.

Parameters:
  • X (array, shape (n_samples, n_features)) – Input X data.

  • Y (array, shape (n_samples, n_targets) or None (default=None)) – Input Y data.

Returns:

score – Pearson product-moment correlation coefficients. If n_components=1, a single value is returned, else an array of correlation coefficients is returned.

Return type:

float or array, shape (n_components,)
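
To make the return value concrete, here is a minimal NumPy sketch of a per-component Pearson correlation between already-transformed X and Y scores. The helper name component_correlations is hypothetical and is not part of pytranskit. It only mirrors the documented return convention: a single float for one component, otherwise an array.

```python
import numpy as np

def component_correlations(X_new, Y_new):
    """Pearson r between matching columns of two score matrices.

    Hypothetical helper for illustration; not pytranskit's implementation.
    """
    X_new = np.asarray(X_new, dtype=float)
    Y_new = np.asarray(Y_new, dtype=float)
    # Treat 1-D inputs as a single component.
    if X_new.ndim == 1:
        X_new = X_new.reshape(-1, 1)
    if Y_new.ndim == 1:
        Y_new = Y_new.reshape(-1, 1)
    r = np.array([
        np.corrcoef(X_new[:, k], Y_new[:, k])[0, 1]
        for k in range(X_new.shape[1])
    ])
    # Single float for one component, array otherwise.
    return float(r[0]) if r.size == 1 else r

# Synthetic scores: the first pair is perfectly correlated,
# the second pair is perfectly anti-correlated.
rng = np.random.default_rng(0)
x_scores = rng.normal(size=(50, 2))
y_scores = np.column_stack([2.0 * x_scores[:, 0] + 1.0, -x_scores[:, 1]])
r = component_correlations(x_scores, y_scores)
```

Each entry of r lies in [-1, 1], matching the range stated above.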

transform(X, Y=None)[source]

Apply the dimension reduction learned on the train data.

Parameters:
  • X (array, shape (n_samples, n_features)) – Input X data.

  • Y (array, shape (n_samples, n_targets) or None (default=None)) – Input Y data. If Y=None, then only the transformed X data are returned.

Returns:

  • X_new (array, shape (n_samples, n_components)) – Transformed X data.

  • Y_new (array, shape (n_samples, n_components)) – Transformed Y data. If Y=None, only X_new is returned.

pytranskit.optrans.decomposition.cca.CanonCorr

alias of CCA

pytranskit.optrans.decomposition.cca.check_array(array, ndim=None, dtype='numeric', force_all_finite=True, force_strictly_positive=False)[source]

Input validation on an array, list, or similar.

Parameters:
  • array (object) – Input object to check/convert

  • ndim (int or None (default=None)) – Number of dimensions that array should have. If None, the dimensions are not checked

  • dtype (string, type, list of types or None (default='numeric')) – Data type of result. If None, the dtype of the input is preserved. If ‘numeric’, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

  • force_all_finite (boolean (default=True)) – Whether to raise an error on np.inf and np.nan in array

  • force_strictly_positive (boolean (default=False)) – Whether to raise an error if any array elements are <= 0

Returns:

array_converted – The converted and validated array.

Return type:

object
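
The checks described above can be sketched roughly as follows. This is an illustrative re-implementation under stated assumptions, not pytranskit's code: the name check_array_sketch is hypothetical, the dtype handling is reduced to converting object arrays to float, and raising ValueError on failure is an assumption.

```python
import numpy as np

def check_array_sketch(array, ndim=None, force_all_finite=True,
                       force_strictly_positive=False):
    """Illustrative validation routine (hypothetical, simplified)."""
    arr = np.asarray(array)
    # Simplified 'numeric' behaviour: only object arrays are converted.
    if arr.dtype == object:
        arr = arr.astype(np.float64)
    if ndim is not None and arr.ndim != ndim:
        raise ValueError(f"expected {ndim} dimensions, got {arr.ndim}")
    if force_all_finite and not np.all(np.isfinite(arr)):
        raise ValueError("array contains NaN or infinity")
    if force_strictly_positive and np.any(arr <= 0):
        raise ValueError("array contains elements <= 0")
    return arr

# A valid, strictly positive 2-D input passes through unchanged.
ok = check_array_sketch([[1.0, 2.0], [3.0, 4.0]], ndim=2,
                        force_strictly_positive=True)

# A negative entry is rejected when force_strictly_positive=True.
try:
    check_array_sketch([[1.0, -2.0]], force_strictly_positive=True)
    rejected = False
except ValueError:
    rejected = True
```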

PLDA

class pytranskit.optrans.decomposition.plda.BaseEstimator[source]

Bases: object

Base class for all estimators in scikit-learn.

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
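
The <component>__<parameter> convention can be illustrated with a small standalone sketch. The classes MiniEstimator, Reducer, and Wrapper below are hypothetical stand-ins written for this example. They imitate the documented behavior of get_params and set_params and are not the scikit-learn implementation.

```python
import inspect

class MiniEstimator:
    """Toy base class imitating get_params/set_params (hypothetical)."""

    def get_params(self, deep=True):
        # Parameter names are read from the explicit __init__ keywords.
        names = [p for p in inspect.signature(type(self).__init__).parameters
                 if p != "self"]
        params = {}
        for name in names:
            value = getattr(self, name)
            params[name] = value
            # With deep=True, expose sub-estimator parameters as name__sub.
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params(deep=True).items():
                    params[f"{name}__{sub_name}"] = sub_value
        return params

    def set_params(self, **params):
        valid = self.get_params(deep=False)
        for key, value in params.items():
            name, delim, sub_key = key.partition("__")
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            if delim:
                # Forward the remainder of the key to the nested object.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self

class Reducer(MiniEstimator):
    def __init__(self, alpha=1.0, n_components=None):
        self.alpha = alpha
        self.n_components = n_components

class Wrapper(MiniEstimator):
    def __init__(self, reducer=None):
        self.reducer = reducer

model = Wrapper(reducer=Reducer())
model.set_params(reducer__alpha=0.5)   # updates the nested object
params = model.get_params()
```

This also shows why parameters must be explicit keyword arguments of __init__: the parameter names are discovered by inspecting that signature.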

class pytranskit.optrans.decomposition.plda.PLDA(alpha=1.0, n_components=None)[source]

Bases: BaseEstimator

Penalized Linear Discriminant Analysis.

This is both a dimensionality reduction method and a linear classifier.

Parameters:
  • alpha (scalar (default=1.)) – Parameter that controls the proportion of LDA vs PCA. If alpha=0, PLDA functions like LDA. If alpha is large, PLDA functions more like PCA.

  • n_components (int or None (default=None)) – Number of components to keep. If n_components is not set, all components are kept: n_components == min(n_samples, n_features).

components_

Axes in the feature space. The components are sorted by the explained variance.

Type:

array, shape (n_components, n_features)

explained_variance_

The amount of variance explained by each of the selected components.

Type:

array, shape (n_components,)

explained_variance_ratio_

Proportion of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of explained variance ratios is equal to 1.0.

Type:

array, shape (n_components,)

mean_

Per-feature empirical mean, estimated from the training set.

Type:

array, shape (n_features,)

n_components_

The number of components.

Type:

int

coef_

Weight vector(s).

Type:

array, shape (n_features,) or (n_classes, n_features)

intercept_

Intercept term.

Type:

array, shape (n_features,)

class_means_

Class means, estimated from the training set.

Type:

array, shape (n_classes, n_features)

classes_

Unique class labels.

Type:

array, shape (n_classes,)

References

W. Wang et al. Penalized Fisher Discriminant Analysis and its Application to Image-Based Morphometry. Pattern Recognit. Lett., 32(15):2128-35, 2011
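
As a rough illustration of how alpha can trade off LDA against PCA, the sketch below solves the generalized eigenproblem S_T v = λ (S_W + alpha·I) v, where S_T is the total scatter and S_W the within-class scatter. That formulation is an assumption made for this example, and the function name penalized_lda_directions is hypothetical. It is not code taken from pytranskit.

```python
import numpy as np
from scipy.linalg import eigh

def penalized_lda_directions(X, y, alpha=1.0, n_components=None):
    """Unit-norm directions from S_T v = lambda (S_W + alpha I) v (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    Xc = X - X.mean(axis=0)
    S_t = Xc.T @ Xc / n_samples            # total scatter
    S_w = np.zeros((n_features, n_features))
    for label in np.unique(y):
        Xk = X[y == label]
        Xk = Xk - Xk.mean(axis=0)
        S_w += Xk.T @ Xk                   # within-class scatter
    S_w /= n_samples
    w, v = eigh(S_t, S_w + alpha * np.eye(n_features))
    v = v[:, np.argsort(w)[::-1]]          # largest eigenvalue first
    if n_components is not None:
        v = v[:, :n_components]
    v = v / np.linalg.norm(v, axis=0)      # unit-length directions
    return v.T                             # shape (n_components, n_features)

# Feature 0 has the largest variance; the classes differ along feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.5])
y = np.repeat([0, 1], 100)
X[y == 1, 1] += 3.0

lda_like = penalized_lda_directions(X, y, alpha=0.0, n_components=1)
pca_like = penalized_lda_directions(X, y, alpha=1e8, n_components=1)
```

With alpha=0 the leading direction follows the class separation (feature 1), while a very large alpha makes the penalty dominate and the leading direction follows the largest variance (feature 0).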

decision_function(X)[source]

Predict confidence scores for samples.

The confidence score for a sample is the signed distance of that sample to the hyperplane.

Parameters:

X (array, shape (n_samples, n_features)) – Input data.

Returns:

scores – Confidence scores per (sample, class) combination. In the binary case, the confidence score for self.classes_[1], where >0 means this class would be predicted.

Return type:

array, shape (n_samples,) if n_classes == 2, else (n_samples, n_classes)
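
The shape convention above can be sketched with a generic linear classifier, where the score is X @ coef.T + intercept. The helper names below are hypothetical, and the formula is the usual linear-model convention rather than a statement about PLDA's internals. The score equals the signed distance to the hyperplane only up to the norm of the weight vector.

```python
import numpy as np

def linear_decision_scores(X, coef, intercept):
    """Scores of a linear classifier (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    coef = np.atleast_2d(coef)             # (1, n_features) in the binary case
    scores = X @ coef.T + intercept
    # Binary case: flatten to shape (n_samples,).
    return scores.ravel() if scores.shape[1] == 1 else scores

def predict_from_scores(scores, classes):
    """Map scores to labels: sign test if binary, argmax otherwise."""
    classes = np.asarray(classes)
    if scores.ndim == 1:
        return classes[(scores > 0).astype(int)]
    return classes[np.argmax(scores, axis=1)]

X = np.array([[2.0, 0.0], [-1.0, 0.5]])
scores = linear_decision_scores(X, coef=[1.0, -1.0], intercept=0.0)
labels = predict_from_scores(scores, classes=["a", "b"])
```

Here the first sample has a positive score and is assigned the second class, and the second sample has a negative score and is assigned the first class.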

fit(X, y)[source]

Fit PLDA model according to the given training data and parameters.

Parameters:
  • X (array, shape (n_samples, n_features)) – Training data.

  • y (array, shape (n_samples,)) – Target values.

fit_transform(X, y)[source]

Fit the model with X and transform X.

Parameters:
  • X (array, shape (n_samples, n_features)) – Training data.

  • y (array, shape (n_samples,)) – Target values.

Returns:

X_new – Transformed data.

Return type:

array, shape (n_samples, n_components)

inverse_transform(X)[source]

Transform data back to its original space.

Note: If n_components is less than the maximum, information will be lost, so reconstructed data will not exactly match the original data.

Parameters:

X (array, shape (n_samples, n_components)) – New data.

Returns:

X_original – Data transformed back into original space.

Return type:

array, shape (n_samples, n_features)
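
The information loss mentioned in the note can be seen in a small NumPy sketch. It assumes orthonormal component axes obtained from an SVD as a stand-in for fitted components. PLDA's own components_ need not be orthonormal, so this illustrates the principle only. The functions project and reconstruct are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
mean = X.mean(axis=0)

# Orthonormal axes from an SVD stand in for fitted components (assumption).
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def project(X, components, mean):
    """Center the data and project onto the component axes."""
    return (X - mean) @ components.T

def reconstruct(X_new, components, mean):
    """Map projected data back to the original feature space."""
    return X_new @ components + mean

# All 4 components: the round trip is exact up to floating-point error.
full = reconstruct(project(X, Vt, mean), Vt, mean)
# Only 2 components: the discarded directions cannot be recovered.
partial = reconstruct(project(X, Vt[:2], mean), Vt[:2], mean)

err_full = np.abs(full - X).max()
err_partial = np.abs(partial - X).max()
```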

predict(X)[source]

Predict class labels for samples in X.

Parameters:

X (array, shape (n_samples, n_features)) – Input data.

Returns:

C – Predicted class label per sample.

Return type:

array, shape (n_samples,)

predict_log_proba(X)[source]

Estimate log probability.

Parameters:

X (array, shape (n_samples, n_features)) – Input data.

Returns:

C – Estimated log probabilities.

Return type:

array, shape (n_samples, n_classes)

predict_proba(X)[source]

Estimate probability.

Parameters:

X (array, shape (n_samples, n_features)) – Input data.

Returns:

C – Estimated probabilities.

Return type:

array, shape (n_samples, n_classes)

predict_transformed(X_trans)[source]

Predict class labels for data that have already been transformed by self.transform(X).

This is useful for plotting classification boundaries. Note: Due to arithmetic discrepancies, this may return slightly different class labels from self.predict(X).

Parameters:

X_trans (array, shape (n_samples, n_components)) – Test samples that have already been transformed into PLDA space.

Returns:

y – Predicted class labels for X_trans.

Return type:

array, shape (n_samples,)

score(X, y, sample_weight=None)[source]

Returns the mean accuracy on the given test data and labels.

Parameters:
  • X (array, shape (n_samples, n_features)) – Test samples.

  • y (array, shape (n_samples,)) – True labels for X.

  • sample_weight (array, shape (n_samples,), optional) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float

transform(X)[source]

Transform data.

Parameters:

X (array, shape (n_samples, n_features)) – Input data.

Returns:

X_new – Transformed data.

Return type:

array, shape (n_samples, n_components)

pytranskit.optrans.decomposition.plda.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)[source]

Accuracy classification score.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.

  • normalize (bool, default=True) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).

The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type:

float

See also

balanced_accuracy_score

Compute the balanced accuracy to deal with imbalanced datasets.

jaccard_score

Compute the Jaccard similarity coefficient score.

hamming_loss

Compute the average Hamming loss or Hamming distance between two sets of samples.

zero_one_loss

Compute the Zero-one classification loss. By default, the function will return the percentage of imperfectly predicted subsets.

Notes

In binary classification, this function is equal to the jaccard_score function.

Examples

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

In the multilabel case with binary label indicators:

>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5

pytranskit.optrans.decomposition.plda.check_array(array, ndim=None, dtype='numeric', force_all_finite=True, force_strictly_positive=False)[source]

Input validation on an array, list, or similar.

Parameters:
  • array (object) – Input object to check/convert

  • ndim (int or None (default=None)) – Number of dimensions that array should have. If None, the dimensions are not checked

  • dtype (string, type, list of types or None (default='numeric')) – Data type of result. If None, the dtype of the input is preserved. If ‘numeric’, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

  • force_all_finite (boolean (default=True)) – Whether to raise an error on np.inf and np.nan in array

  • force_strictly_positive (boolean (default=False)) – Whether to raise an error if any array elements are <= 0

Returns:

array_converted – The converted and validated array.

Return type:

object

pytranskit.optrans.decomposition.plda.eigh(a, b=None, lower=True, eigvals_only=False, overwrite_a=False, overwrite_b=False, turbo=True, eigvals=None, type=1, check_finite=True, subset_by_index=None, subset_by_value=None, driver=None)[source]

Solve a standard or generalized eigenvalue problem for a complex Hermitian or real symmetric matrix.

Find the eigenvalues array w and, optionally, the eigenvectors array v of array a, where b is positive definite, such that every eigenvalue λ (i-th entry of w) and its eigenvector vi (i-th column of v) satisfy:

              a @ vi = λ * b @ vi
vi.conj().T @ a @ vi = λ
vi.conj().T @ b @ vi = 1

In the standard problem, b is assumed to be the identity matrix.

Parameters:
  • a ((M, M) array_like) – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed.

  • b ((M, M) array_like, optional) – A complex Hermitian or real symmetric positive definite matrix. If omitted, the identity matrix is assumed.

  • lower (bool, optional) – Whether the pertinent array data is taken from the lower or upper triangle of a and, if applicable, b. (Default: lower)

  • eigvals_only (bool, optional) – Whether to calculate only eigenvalues and no eigenvectors. (Default: both are calculated)

  • subset_by_index (iterable, optional) – If provided, this two-element iterable defines the start and the end indices of the desired eigenvalues (ascending order and 0-indexed). To return only the second smallest to fifth smallest eigenvalues, [1, 4] is used. [n-3, n-1] returns the largest three. Only available with “evr”, “evx”, and “gvx” drivers. The entries are directly converted to integers via int().

  • subset_by_value (iterable, optional) – If provided, this two-element iterable defines the half-open interval (a, b]; only the eigenvalues within this interval, if any, are returned. Only available with “evr”, “evx”, and “gvx” drivers. Use np.inf for the unconstrained ends.

  • driver (str, optional) – Defines which LAPACK driver should be used. Valid options are “ev”, “evd”, “evr”, “evx” for standard problems and “gv”, “gvd”, “gvx” for generalized (where b is not None) problems. See the Notes section.

  • type (int, optional) –

    For the generalized problems, this keyword specifies the problem type to be solved for w and v (only takes 1, 2, 3 as possible inputs):

    1 =>     a @ v = w @ b @ v
    2 => a @ b @ v = w @ v
    3 => b @ a @ v = w @ v
    

    This keyword is ignored for standard problems.

  • overwrite_a (bool, optional) – Whether to overwrite data in a (may improve performance). Default is False.

  • overwrite_b (bool, optional) – Whether to overwrite data in b (may improve performance). Default is False.

  • check_finite (bool, optional) – Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs.

  • turbo (bool, optional) – Deprecated since v1.5.0, use the driver=“gvd” keyword instead. Use the divide and conquer algorithm (faster but expensive in memory, only for the generalized eigenvalue problem and if the full set of eigenvalues is requested). Has no significant effect if eigenvectors are not requested.

  • eigvals (tuple (lo, hi), optional) – Deprecated since v1.5.0, use the subset_by_index keyword instead. Indexes of the smallest and largest (in ascending order) eigenvalues and corresponding eigenvectors to be returned: 0 <= lo <= hi <= M-1. If omitted, all eigenvalues and eigenvectors are returned.

Returns:

  • w ((N,) ndarray) – The N (1<=N<=M) selected eigenvalues, in ascending order, each repeated according to its multiplicity.

  • v ((M, N) ndarray) – The eigenvectors, where column v[:, i] corresponds to the eigenvalue w[i]. Only returned if eigvals_only == False.

Raises:

LinAlgError – If the eigenvalue computation does not converge, an error occurred, or the b matrix is not positive definite. Note that if the input matrices are not symmetric or Hermitian, no error will be reported but the results will be wrong.

See also

eigvalsh

eigenvalues of symmetric or Hermitian arrays

eig

eigenvalues and right eigenvectors for non-symmetric arrays

eigh_tridiagonal

eigenvalues and right eigenvectors for symmetric/Hermitian tridiagonal matrices

Notes

This function does not check whether the input array is Hermitian/symmetric, in order to allow for representing arrays with only their upper/lower triangular parts. Also note that the finiteness check applies to the whole array and is unaffected by the “lower” keyword, even though only one triangle is used in the computation.

This function uses LAPACK drivers for computations in all possible keyword combinations, prefixed with sy if arrays are real and he if complex, e.g., a float array with “evr” driver is solved via “syevr”, complex arrays with “gvx” driver problem is solved via “hegvx” etc.

As a brief summary, the slowest and the most robust driver is the classical <sy/he>ev which uses symmetric QR. <sy/he>evr is seen as the optimal choice for the most general cases. However, there are certain occasions that <sy/he>evd computes faster at the expense of more memory usage. <sy/he>evx, while still being faster than <sy/he>ev, often performs worse than the rest except when very few eigenvalues are requested for large arrays though there is still no performance guarantee.

For the generalized problem, normalization with respect to the given type argument:

type 1 and 3 :      v.conj().T @ a @ v = w
type 2       : inv(v).conj().T @ a @ inv(v) = w

type 1 or 2  :      v.conj().T @ b @ v  = I
type 3       : v.conj().T @ inv(b) @ v  = I
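
The type 1 normalization can be checked numerically. The sketch below builds a random real symmetric a and a symmetric positive definite b (both made up for the example) and verifies the identities listed above using scipy.linalg.eigh with its default type=1.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
a = M + M.T                        # real symmetric
N = rng.normal(size=(4, 4))
b = N @ N.T + 4.0 * np.eye(4)      # symmetric positive definite

w, v = eigh(a, b)                  # generalized problem, type=1 (default)

# Each column satisfies a @ v_i = w_i * b @ v_i.
residual = a @ v - b @ v @ np.diag(w)

# Normalization for type 1: v^H a v = diag(w) and v^H b v = I.
ok_a = np.allclose(v.conj().T @ a @ v, np.diag(w))
ok_b = np.allclose(v.conj().T @ b @ v, np.eye(4))
```

Both ok_a and ok_b evaluate to True, so the eigenvectors are b-orthonormal rather than orthonormal in the Euclidean sense.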

Examples

>>> import numpy as np
>>> from scipy.linalg import eigh
>>> A = np.array([[6, 3, 1, 5], [3, 0, 5, 1], [1, 5, 6, 2], [5, 1, 2, 2]])
>>> w, v = eigh(A)
>>> np.allclose(A @ v - v @ np.diag(w), np.zeros((4, 4)))
True

Request only the eigenvalues

>>> w = eigh(A, eigvals_only=True)

Request eigenvalues that are less than 10.

>>> A = np.array([[34, -4, -10, -7, 2],
...               [-4, 7, 2, 12, 0],
...               [-10, 2, 44, 2, -19],
...               [-7, 12, 2, 79, -34],
...               [2, 0, -19, -34, 29]])
>>> eigh(A, eigvals_only=True, subset_by_value=[-np.inf, 10])
array([6.69199443e-07, 9.11938152e+00])

Request the second smallest eigenvalue and its eigenvector

>>> w, v = eigh(A, subset_by_index=[1, 1])
>>> w
array([9.11938152])
>>> v.shape  # only a single column is returned
(5, 1)