feature_grouper API reference

A set of functions and a scikit-learn transformer class for finding clusters of correlated features and combining each cluster into a feature group.

class FeatureGrouper(threshold=0.5, copy=True)

Hierarchical clustering-based dimensionality reduction.

Computes the correlation matrix of all features in X, applies hierarchical clustering to form flat clusters of highly correlated features, then generates and applies a loading matrix that weights the input features within each cluster evenly.

Input features should be standardized (i.e., converted to z-scores).
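As a minimal sketch of that preprocessing step, the snippet below converts each column of a NumPy array to z-scores. The helper name zscore_columns is hypothetical and not part of feature_grouper, and the handling of constant columns is an assumption made for illustration.

```python
import numpy as np

def zscore_columns(X):
    """Standardize each column to zero mean and unit variance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: constant columns are left at zero instead of dividing by zero.
    std[std == 0] = 1.0
    return (X - mean) / std

X_raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_z = zscore_columns(X_raw)
```

scikit-learn's StandardScaler performs the same column-wise standardization and may be more convenient inside a pipeline.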

Parameters:
  • threshold (float) – The minimum correlation similarity threshold to group descendants of a cluster node into the same flat cluster.
  • copy (bool) – If False, data passed to transform are overwritten.
Variables:
  • components_ (array, shape (n_components, n_features)) – The loading matrix obtained from clustering and weighting correlated features.
  • n_components_ (int) – The number of components that were estimated from the data.
fit(X, y=None)

Fit the model with X.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
inverse_transform(X)

Transform data back to its original space. In other words, return an input X_original whose transform would be X.

Parameters:
  • X (array-like, shape (n_samples, n_components)) – New data, where n_samples is the number of samples and n_components is the number of components.
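The round-trip property stated above can be checked numerically. The sketch below is not the library's implementation: it assumes that transform computes X @ W.T for a loading matrix W of shape (n_components, n_features), and it uses the Moore-Penrose pseudo-inverse as one possible way to map back. The matrix W and both function names are hypothetical.

```python
import numpy as np

# Hypothetical loading matrix, shape (n_components, n_features):
# component 0 averages features 0 and 1, component 1 passes feature 2 through.
W = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

def transform_sketch(X):
    # Assumption: the reduction is a plain matrix product with W.T.
    return X @ W.T

def inverse_transform_sketch(X_reduced):
    # pinv(W.T) @ W.T is the identity when W has full row rank,
    # so transforming the result returns X_reduced.
    return X_reduced @ np.linalg.pinv(W.T)

X_reduced = np.array([[1.0, 2.0], [3.0, 4.0]])
X_original = inverse_transform_sketch(X_reduced)
```

Because several features collapse into one component, the reconstruction is only one of many inputs that map to the same reduced data; information lost by grouping is not recovered.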
transform(X)

Apply dimensionality reduction on X.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – New data, where n_samples is the number of samples and n_features is the number of features.
cluster(X, threshold=0.5)

Find clusters of correlated features from a correlation matrix using hierarchical clustering.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – New data, where n_samples is the number of samples and n_features is the number of features.
  • threshold (float) – The minimum correlation similarity threshold to group descendants of a cluster node into the same flat cluster.
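For intuition, correlation-based hierarchical clustering of this kind can be sketched with SciPy. This is an illustration under stated assumptions, not the library's code: the function name cluster_sketch is hypothetical, correlation r is converted to a distance as 1 - r, and complete linkage is used so that cutting the tree at 1 - threshold keeps every pair within a cluster at or above the threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_sketch(X, threshold=0.5):
    """Label each column of X with a flat cluster id (illustrative only)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # Assumption: correlation r becomes a distance of 1 - r.
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    # Assumption: complete linkage, cut at distance 1 - threshold.
    tree = linkage(condensed, method="complete")
    return fcluster(tree, t=1.0 - threshold, criterion="distance")

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 4))
# Columns 0-1 follow base[:, 0]; columns 2-3 follow base[:, 1].
X_demo = np.column_stack([base[:, 0], base[:, 0], base[:, 1], base[:, 1]]) + noise
labels = cluster_sketch(X_demo, threshold=0.5)
```

In this synthetic dataset the first two columns end up sharing one label and the last two another, since the two underlying signals are generated independently.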
make_loadings(labels, threshold=0.5)

Generate a loading matrix from the feature cluster labels, given a minimum correlation similarity threshold.

Apply the loading matrix to the original data with np.matmul or the @ operator.

Example:
>>> import numpy as np
>>> import feature_grouper
>>> threshold = 0.5
>>> clusters = feature_grouper.cluster(X, threshold)
>>> loading_matrix = feature_grouper.make_loadings(clusters, threshold)
>>> X_transformed = X @ loading_matrix
Parameters:
  • labels (array-like, shape (n,)) – A 1-D NumPy array containing the cluster label for each column of the original dataset.
  • threshold (float) – The minimum correlation similarity threshold that was used to cluster the features.
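To illustrate how cluster labels can be turned into an evenly weighted loading matrix, here is a small sketch. It is not the library's implementation: the name make_loadings_sketch is hypothetical, it assumes that even weighting means each member of a cluster receives a weight of 1 / (cluster size), it orients the matrix as (n_features, n_clusters) so that X @ loadings works, and it ignores the threshold argument.

```python
import numpy as np

def make_loadings_sketch(labels):
    """Build an (n_features, n_clusters) matrix with equal weights per cluster."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    loadings = np.zeros((labels.size, cluster_ids.size))
    for col, cluster_id in enumerate(cluster_ids):
        members = labels == cluster_id
        # Assumption: each member of a cluster gets weight 1 / cluster size.
        loadings[members, col] = 1.0 / members.sum()
    return loadings

labels = np.array([1, 1, 2])
loadings = make_loadings_sketch(labels)
X_demo = np.array([[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]])
# Each output column is the mean of the input columns in one cluster.
X_grouped = X_demo @ loadings
```

Under these assumptions each grouped feature is simply the average of its cluster's columns; a different scaling (for example 1 / sqrt(cluster size)) would change the magnitude of the output but not the grouping.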