feature_grouper
API reference
A set of functions and an sklearn transformer class for finding clusters of correlated features and grouping them together into feature groups.
class FeatureGrouper(threshold=0.5, copy=True)
Hierarchical clustering-based dimensionality reduction.
Calculates the correlation matrix of all features in X, applies hierarchical clustering to form flat clusters of highly correlated features, then generates and applies a loading matrix that weights the input features within each cluster evenly.
Input features should be normalized beforehand (i.e., expressed as z-scores).
Parameters:
- threshold – float. The minimum correlation similarity threshold to group descendants of a cluster node into the same flat cluster.
- copy – bool. If False, data passed to transform are overwritten.
Variables:
- components_ – array, shape (n_components, n_features). The loading matrix obtained from clustering and weighting correlated features.
- n_components_ – int. The number of components that were estimated from the data.
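Because the transformer expects z-scored inputs, the columns can be standardized before fitting. The following is a minimal sketch using NumPy only; `standardize` is a hypothetical helper for illustration and is not part of feature_grouper. scikit-learn's `StandardScaler` would serve the same purpose.

```python
import numpy as np


def standardize(X):
    """Convert each column of X to z-scores (hypothetical helper).

    Assumes no column is constant; a zero standard deviation would
    cause a division by zero.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


X_raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X = standardize(X_raw)  # each column now has mean 0 and standard deviation 1
```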
fit(X, y=None)
Fit the model with X.
Parameters:
- X – array-like, shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.
cluster(X, threshold=0.5)
Find clusters of correlated features by applying hierarchical clustering to the correlation matrix of X.
Parameters:
- X – array-like, shape (n_samples, n_features). Input data, where n_samples is the number of samples and n_features is the number of features.
- threshold – float. The minimum correlation similarity threshold to group descendants of a cluster node into the same flat cluster.
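To illustrate the kind of procedure described above, here is a rough sketch built on SciPy's hierarchical clustering routines. It is not the library's implementation: the name `cluster_sketch`, the conversion of correlation to distance as 1 - correlation, and the choice of complete linkage are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_sketch(X, threshold=0.5):
    """Label each column of X with a flat cluster id (illustrative only)."""
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
    # Assumption: a correlation similarity s maps to a distance of 1 - s.
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)  # remove floating-point noise on the diagonal
    condensed = squareform(dist, checks=False)
    # Assumption: complete linkage, so every pair of features inside a
    # cluster has a correlation of at least `threshold`.
    Z = linkage(condensed, method="complete")
    return fcluster(Z, t=1.0 - threshold, criterion="distance")


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Columns 0 and 1 are noisy copies of `a`; column 2 is independent.
X_demo = np.column_stack([
    a + 0.1 * rng.normal(size=500),
    a + 0.1 * rng.normal(size=500),
    b,
])
labels_demo = cluster_sketch(X_demo, threshold=0.5)
```

In this sketch the first two columns receive the same label and the third receives a different one, since only the first pair is correlated above the threshold.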
make_loadings(labels, threshold=0.5)
Generate a loading matrix from the feature cluster labels, given a minimum correlation similarity threshold.
Apply the loading matrix to the original data with np.matmul or the @ operator.
Example:
>>> import numpy as np
>>> import feature_grouper
>>> threshold = 0.5
>>> clusters = feature_grouper.cluster(X, threshold)
>>> loading_matrix = feature_grouper.make_loadings(clusters, threshold)
>>> X_transformed = X @ loading_matrix
Parameters:
- labels – array-like, shape (n,). A NumPy 1-D array containing the cluster label for each column of the original dataset.
- threshold – float. The minimum correlation similarity threshold that was used to cluster the features.
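As a rough picture of what an evenly weighted loading matrix looks like, the sketch below builds one from cluster labels using plain averaging (a weight of 1/n for each of the n features in a cluster). This is an assumption made for illustration: `make_loadings_sketch` is a hypothetical name, and since the real make_loadings also takes the threshold, its weights may be scaled differently.

```python
import numpy as np


def make_loadings_sketch(labels):
    """Build an (n_features, n_clusters) matrix that averages each cluster."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    loadings = np.zeros((labels.size, cluster_ids.size))
    for j, cid in enumerate(cluster_ids):
        members = labels == cid
        # Assumption: every feature in a cluster gets the same weight, 1/n.
        loadings[members, j] = 1.0 / members.sum()
    return loadings


labels = np.array([1, 1, 2])  # features 0 and 1 share a cluster
W = make_loadings_sketch(labels)
X_small = np.array([[1.0, 3.0, 5.0],
                    [2.0, 4.0, 6.0]])
X_small_transformed = X_small @ W  # equivalent to np.matmul(X_small, W)
# X_small_transformed is [[2., 5.], [3., 6.]]
```

Each output column is the mean of the input columns in one cluster, so three features reduce to two components here.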