ml_grid.pipeline.embeddings

Module for applying dimensionality reduction techniques (embeddings).

It is designed for automated data pipelines that prepare features for binary classification, focusing on methods suitable for sparse, high-dimensional data with reproducible transforms.
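A minimal end-to-end sketch of the intended workflow (array shapes are illustrative):

>>> import numpy as np
>>> from ml_grid.pipeline.embeddings import (
...     apply_embedding, create_embedding_pipeline, transform_new_data)
>>> X_train, X_test = np.random.rand(100, 500), np.random.rand(25, 500)
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_train_red = apply_embedding(X_train, pipe)   # fits, then transforms
>>> X_test_red = transform_new_data(X_test, pipe)  # transform only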

Attributes

EmbeddingMethod

Functions

create_embedding_pipeline(→ sklearn.pipeline.Pipeline)

Creates a scikit-learn pipeline for dimensionality reduction.

apply_embedding(→ numpy.ndarray)

Applies a pre-configured embedding pipeline to the data.

transform_new_data(→ numpy.ndarray)

Transforms new data using an already-fitted pipeline.

get_method_recommendation(→ Dict[str, Any])

Recommends the best embedding method based on data characteristics.

get_explained_variance(→ Optional[numpy.ndarray])

Extracts explained variance information from a fitted pipeline, if available.

recommend_n_components(→ Optional[int])

Recommends the number of components needed to retain a target variance level.

Module Contents

ml_grid.pipeline.embeddings.EmbeddingMethod
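The alias itself is not expanded on this page. Judging from the method names accepted by create_embedding_pipeline, a plausible reconstruction (an assumption, not the verbatim definition) is:

>>> from typing import Literal
>>> # Hypothetical alias over the documented method names:
>>> EmbeddingMethod = Literal[
...     "svd", "pca", "nmf", "lda", "random_gaussian",
...     "random_sparse", "select_kbest_f", "select_kbest_mi"]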
ml_grid.pipeline.embeddings.create_embedding_pipeline(method: EmbeddingMethod = 'svd', n_components: int = 64, scale: bool = True, **kwargs: Any) → sklearn.pipeline.Pipeline

Creates a scikit-learn pipeline for dimensionality reduction.

This function constructs a pipeline suited to automated preprocessing for classification tasks. All methods support the fit/transform pattern, enabling proper train/test separation.

Parameters:
  • method (EmbeddingMethod) – The embedding method to use. Defaults to "svd".
      - "svd": TruncatedSVD - best for sparse matrices (TF-IDF, count vectors)
      - "pca": PCA - standard choice for dense data
      - "nmf": Non-negative Matrix Factorization - for non-negative sparse data
      - "lda": Linear Discriminant Analysis - supervised, maximizes class separation
      - "random_gaussian": Gaussian Random Projection - fast, preserves pairwise distances
      - "random_sparse": Sparse Random Projection - very fast for sparse data
      - "select_kbest_f": F-statistic feature selection - supervised, captures linear relationships
      - "select_kbest_mi": Mutual information feature selection - supervised, captures non-linear relationships

  • n_components (int, optional) – Target number of dimensions. Note that LDA is limited to n_classes - 1 components (at most 1 for binary classification). Defaults to 64.

  • scale (bool, optional) – Whether to apply StandardScaler before embedding. Note that scaling converts sparse input to dense; set this to False for sparse data. Defaults to True.

  • **kwargs – Additional keyword arguments passed to the embedding method:
      - SVD: n_iter (int), random_state (int)
      - PCA: random_state (int), svd_solver (str)
      - NMF: init (str), max_iter (int), random_state (int)
      - Random Projection: eps (float), random_state (int)
      - SelectKBest: no additional parameters typically needed

Returns:

A scikit-learn pipeline configured with the specified steps.

Return type:

Pipeline

Raises:

ValueError – If an unsupported embedding method is provided.

Examples

>>> # Sparse TF-IDF data (no scaling)
>>> pipe = create_embedding_pipeline("svd", n_components=128, scale=False)
>>> # Dense numerical features
>>> pipe = create_embedding_pipeline("pca", n_components=50, scale=True)
>>> # Supervised feature selection
>>> pipe = create_embedding_pipeline("select_kbest_f", n_components=100)
>>> # Fast random projection for very high dims
>>> pipe = create_embedding_pipeline("random_sparse", n_components=200,
...                                   scale=False, random_state=42)
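Because scaling densifies sparse input (see the scale note above), here is a sketch of the sparse path using SciPy, assuming apply_embedding passes sparse matrices through to the pipeline:

>>> from scipy.sparse import random as sparse_random
>>> X_sparse = sparse_random(1000, 5000, density=0.01, format="csr",
...                          random_state=42)
>>> pipe = create_embedding_pipeline("svd", n_components=64, scale=False)
>>> X_reduced = apply_embedding(X_sparse, pipe)  # TruncatedSVD handles sparse input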
ml_grid.pipeline.embeddings.apply_embedding(X: pandas.DataFrame | numpy.ndarray, pipeline: sklearn.pipeline.Pipeline, y: pandas.Series | numpy.ndarray | None = None) → numpy.ndarray

Applies a pre-configured embedding pipeline to the data.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – The input feature data.

  • pipeline (Pipeline) – The scikit-learn pipeline to apply.

  • y (Optional[Union[pd.Series, np.ndarray]], optional) – Target labels, required for supervised methods (lda, select_kbest_*). Defaults to None.

Returns:

The transformed data with reduced dimensionality.

Return type:

np.ndarray

Raises:

ValueError – If a supervised method is used without providing labels.

Examples

>>> # Unsupervised
>>> X = np.random.rand(100, 500)
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_reduced = apply_embedding(X, pipe)
>>> # Supervised
>>> y = np.random.randint(0, 2, 100)
>>> pipe = create_embedding_pipeline("lda", n_components=1)
>>> X_reduced = apply_embedding(X, pipe, y=y)
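Per the Raises entry above, supervised methods fail without labels; a sketch of the failure mode:

>>> pipe = create_embedding_pipeline("select_kbest_f", n_components=10)
>>> try:
...     apply_embedding(X, pipe)  # no y supplied for a supervised method
... except ValueError:
...     print("labels required")
labels required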
ml_grid.pipeline.embeddings.transform_new_data(X: pandas.DataFrame | numpy.ndarray, fitted_pipeline: sklearn.pipeline.Pipeline) → numpy.ndarray

Transforms new data using an already-fitted pipeline.

Reusing the fitted pipeline in this way is critical for proper train/test separation in production pipelines.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – New data to transform.

  • fitted_pipeline (Pipeline) – A pipeline that has already been fitted.

Returns:

The transformed data.

Return type:

np.ndarray

Examples

>>> # Fit on training data
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_train_reduced = apply_embedding(X_train, pipe)
>>> # Transform test data with same fitted pipeline
>>> X_test_reduced = transform_new_data(X_test, pipe)
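A fuller sketch using scikit-learn's splitter (shapes are illustrative):

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X = np.random.rand(200, 300)
>>> X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
>>> pipe = create_embedding_pipeline("pca", n_components=10)
>>> X_train_reduced = apply_embedding(X_train, pipe)   # fit on train only
>>> X_test_reduced = transform_new_data(X_test, pipe)  # no refitting
>>> X_test_reduced.shape
(50, 10)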
ml_grid.pipeline.embeddings.get_method_recommendation(is_sparse: bool, has_labels: bool, n_features: int, n_samples: int, is_nonnegative: bool = False) → Dict[str, Any]

Recommends the best embedding method based on data characteristics.

Parameters:
  • is_sparse (bool) – Whether the data is sparse (e.g., TF-IDF, one-hot encoded).

  • has_labels (bool) – Whether labels are available for supervised methods.

  • n_features (int) – Number of input features.

  • n_samples (int) – Number of samples.

  • is_nonnegative (bool, optional) – Whether all data values are non-negative. Defaults to False.

Returns:

Dictionary containing:
  • method: Recommended method name

  • scale: Whether to apply scaling

  • rationale: Explanation of recommendation

  • alternatives: List of other suitable methods

Return type:

Dict[str, Any]

Examples

>>> # Sparse TF-IDF data for classification
>>> rec = get_method_recommendation(
...     is_sparse=True, has_labels=True,
...     n_features=10000, n_samples=5000
... )
>>> print(rec['method'])
svd
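The four documented keys can be inspected directly; a sketch for dense, unlabeled data (the exact recommendation returned is illustrative):

>>> rec = get_method_recommendation(
...     is_sparse=False, has_labels=False,
...     n_features=200, n_samples=10000
... )
>>> sorted(rec.keys())
['alternatives', 'method', 'rationale', 'scale']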
ml_grid.pipeline.embeddings.get_explained_variance(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray | None = None) → numpy.ndarray | None

Extracts explained variance information from a fitted pipeline, if available.

Only works with methods that expose an explained_variance_ratio_ attribute (PCA, SVD).

Parameters:
  • pipeline (Pipeline) – A fitted pipeline.

  • X (Optional[Union[pd.DataFrame, np.ndarray]]) – Data used to fit the pipeline if it has not been fitted yet.

Returns:

Array of explained variance ratios, or None if not applicable.

Return type:

Optional[np.ndarray]

Examples

>>> pipe = create_embedding_pipeline("pca", n_components=10)
>>> X_reduced = apply_embedding(X, pipe)
>>> variance = get_explained_variance(pipe)
>>> print(f"Total variance explained: {variance.sum():.2%}")
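The per-component ratios also yield a cumulative curve, useful for picking a variance target by hand (this mirrors what recommend_n_components below automates):

>>> import numpy as np
>>> cumulative = np.cumsum(variance)
>>> n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
>>> print(f"{n_for_95} components reach 95% cumulative variance")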
ml_grid.pipeline.embeddings.recommend_n_components(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray, variance_threshold: float = 0.95, y: pandas.Series | numpy.ndarray | None = None) → int | None

Recommends the number of components needed to retain a target variance level.

Only works with methods that provide explained variance (PCA, SVD).

Parameters:
  • pipeline (Pipeline) – A pipeline with a variance-based method.

  • X (Union[pd.DataFrame, np.ndarray]) – The data to analyze.

  • variance_threshold (float) – Target cumulative variance to retain (0-1).

  • y (Optional[Union[pd.Series, np.ndarray]]) – Target labels, required for supervised methods.

Returns:

Recommended number of components, or None if not applicable.

Return type:

Optional[int]

Examples

>>> # Fit with high n_components first
>>> pipe = create_embedding_pipeline("pca", n_components=100)
>>> _ = apply_embedding(X, pipe)
>>> n_opt = recommend_n_components(pipe, X, variance_threshold=0.95)
>>> print(f"Use {n_opt} components for 95% variance")
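A natural follow-up is to rebuild the pipeline at the recommended size (a sketch; guards against the None case documented above):

>>> if n_opt is not None:
...     final_pipe = create_embedding_pipeline("pca", n_components=n_opt)
...     X_final = apply_embedding(X, final_pipe)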