ml_grid.pipeline.embeddings

Module for applying dimensionality reduction techniques (embeddings).

It is designed for automated data pipelines that prepare features for binary classification, focusing on methods suitable for sparse, high-dimensional data with reproducible transforms.
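A minimal end-to-end sketch of the intended workflow (array shapes are illustrative):

>>> import numpy as np
>>> from ml_grid.pipeline.embeddings import (
...     apply_embedding, create_embedding_pipeline, transform_new_data)
>>> X_train, X_test = np.random.rand(100, 500), np.random.rand(25, 500)
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_train_red = apply_embedding(X_train, pipe)   # fits, then transforms
>>> X_test_red = transform_new_data(X_test, pipe)  # transform only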

Attributes

EmbeddingMethod

Functions

create_embedding_pipeline(→ sklearn.pipeline.Pipeline)

Creates a scikit-learn pipeline for dimensionality reduction.

apply_embedding(→ numpy.ndarray)

Applies a pre-configured embedding pipeline to the data.

transform_new_data(→ numpy.ndarray)

Transforms new data using an already-fitted pipeline.

get_method_recommendation(→ Dict[str, Any])

Recommends the best embedding method based on data characteristics.

get_explained_variance(→ Optional[numpy.ndarray])

Extracts explained variance information from a fitted pipeline, if available.

recommend_n_components(→ Optional[int])

Recommends the number of components needed to retain a target variance level.

Module Contents

ml_grid.pipeline.embeddings.EmbeddingMethod
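The alias itself is not expanded on this page. Judging from the method names accepted by create_embedding_pipeline, a plausible reconstruction (an assumption, not the verbatim definition) is:

>>> from typing import Literal
>>> # Hypothetical alias over the documented method names:
>>> EmbeddingMethod = Literal[
...     "svd", "pca", "nmf", "lda", "random_gaussian",
...     "random_sparse", "select_kbest_f", "select_kbest_mi"]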
ml_grid.pipeline.embeddings.create_embedding_pipeline(method: EmbeddingMethod = 'svd', n_components: int = 64, scale: bool = True, **kwargs: Any) → sklearn.pipeline.Pipeline

Creates a scikit-learn pipeline for dimensionality reduction.

This function constructs a pipeline suited to automated preprocessing for classification tasks. All methods support the fit/transform pattern, enabling proper train/test separation.

Parameters:
  • method (EmbeddingMethod) – The embedding method to use. Defaults to "svd".
      - "svd": TruncatedSVD - best for sparse matrices (TF-IDF, count vectors)
      - "pca": PCA - standard choice for dense data
      - "nmf": Non-negative Matrix Factorization - for non-negative sparse data
      - "lda": Linear Discriminant Analysis - supervised, maximizes class separation
      - "random_gaussian": Gaussian Random Projection - fast, preserves pairwise distances
      - "random_sparse": Sparse Random Projection - very fast for sparse data
      - "select_kbest_f": F-statistic feature selection - supervised, captures linear relationships
      - "select_kbest_mi": Mutual information feature selection - supervised, captures non-linear relationships

  • n_components (int, optional) – Target number of dimensions. Note that LDA is limited to n_classes - 1 components (at most 1 for binary classification). Defaults to 64.

  • scale (bool, optional) – Whether to apply StandardScaler before embedding. Note that scaling converts sparse input to dense; set this to False for sparse data. Defaults to True.

  • **kwargs – Additional keyword arguments passed to the embedding method:
      - SVD: n_iter (int), random_state (int)
      - PCA: random_state (int), svd_solver (str)
      - NMF: init (str), max_iter (int), random_state (int)
      - Random Projection: eps (float), random_state (int)
      - SelectKBest: no additional parameters typically needed

Returns:

A scikit-learn pipeline configured with the specified steps.

Return type:

Pipeline

Raises:

ValueError – If an unsupported embedding method is provided.

Examples

>>> # Sparse TF-IDF data (no scaling)
>>> pipe = create_embedding_pipeline("svd", n_components=128, scale=False)
>>> # Dense numerical features
>>> pipe = create_embedding_pipeline("pca", n_components=50, scale=True)
>>> # Supervised feature selection
>>> pipe = create_embedding_pipeline("select_kbest_f", n_components=100)
>>> # Fast random projection for very high dims
>>> pipe = create_embedding_pipeline("random_sparse", n_components=200,
...                                   scale=False, random_state=42)
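Because scaling densifies sparse input (see the scale note above), here is a sketch of the sparse path using SciPy, assuming apply_embedding passes sparse matrices through to the pipeline:

>>> from scipy.sparse import random as sparse_random
>>> X_sparse = sparse_random(1000, 5000, density=0.01, format="csr",
...                          random_state=42)
>>> pipe = create_embedding_pipeline("svd", n_components=64, scale=False)
>>> X_reduced = apply_embedding(X_sparse, pipe)  # TruncatedSVD handles sparse input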
ml_grid.pipeline.embeddings.apply_embedding(X: pandas.DataFrame | numpy.ndarray, pipeline: sklearn.pipeline.Pipeline, y: pandas.Series | numpy.ndarray | None = None) → numpy.ndarray

Applies a pre-configured embedding pipeline to the data.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – The input feature data.

  • pipeline (Pipeline) – The scikit-learn pipeline to apply.

  • y (Optional[Union[pd.Series, np.ndarray]], optional) – Target labels, required for supervised methods (lda, select_kbest_*). Defaults to None.

Returns:

The transformed data with reduced dimensionality.

Return type:

np.ndarray

Raises:

ValueError – If a supervised method is used without providing labels.

Examples

>>> # Unsupervised
>>> X = np.random.rand(100, 500)
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_reduced = apply_embedding(X, pipe)
>>> # Supervised
>>> y = np.random.randint(0, 2, 100)
>>> pipe = create_embedding_pipeline("lda", n_components=1)
>>> X_reduced = apply_embedding(X, pipe, y=y)
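Per the Raises entry above, supervised methods fail without labels; a sketch of the failure mode:

>>> pipe = create_embedding_pipeline("select_kbest_f", n_components=10)
>>> try:
...     apply_embedding(X, pipe)  # no y supplied for a supervised method
... except ValueError:
...     print("labels required")
labels required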
ml_grid.pipeline.embeddings.transform_new_data(X: pandas.DataFrame | numpy.ndarray, fitted_pipeline: sklearn.pipeline.Pipeline) → numpy.ndarray

Transforms new data using an already-fitted pipeline.

Reusing the fitted pipeline in this way is critical for proper train/test separation in production pipelines.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – New data to transform.

  • fitted_pipeline (Pipeline) – A pipeline that has already been fitted.

Returns:

The transformed data.

Return type:

np.ndarray

Examples

>>> # Fit on training data
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_train_reduced = apply_embedding(X_train, pipe)
>>> # Transform test data with same fitted pipeline
>>> X_test_reduced = transform_new_data(X_test, pipe)
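A fuller sketch using scikit-learn's splitter (shapes are illustrative):

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X = np.random.rand(200, 300)
>>> X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
>>> pipe = create_embedding_pipeline("pca", n_components=10)
>>> X_train_reduced = apply_embedding(X_train, pipe)   # fit on train only
>>> X_test_reduced = transform_new_data(X_test, pipe)  # no refitting
>>> X_test_reduced.shape
(50, 10)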
ml_grid.pipeline.embeddings.get_method_recommendation(is_sparse: bool, has_labels: bool, n_features: int, n_samples: int, is_nonnegative: bool = False) → Dict[str, Any]

Recommends the best embedding method based on data characteristics.

Parameters:
  • is_sparse (bool) – Whether the data is sparse (e.g., TF-IDF, one-hot encoded).

  • has_labels (bool) – Whether labels are available for supervised methods.

  • n_features (int) – Number of input features.

  • n_samples (int) – Number of samples.

  • is_nonnegative (bool, optional) – Whether all data values are non-negative. Defaults to False.

Returns:

Dictionary containing:
  • method: Recommended method name

  • scale: Whether to apply scaling

  • rationale: Explanation of recommendation

  • alternatives: List of other suitable methods

Return type:

Dict[str, Any]

Examples

>>> # Sparse TF-IDF data for classification
>>> rec = get_method_recommendation(
...     is_sparse=True, has_labels=True,
...     n_features=10000, n_samples=5000
... )
>>> print(rec['method'])
svd
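The four documented keys can be inspected directly; a sketch for dense, unlabeled data (the exact recommendation returned is illustrative):

>>> rec = get_method_recommendation(
...     is_sparse=False, has_labels=False,
...     n_features=200, n_samples=10000
... )
>>> sorted(rec.keys())
['alternatives', 'method', 'rationale', 'scale']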
ml_grid.pipeline.embeddings.get_explained_variance(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray | None = None) → numpy.ndarray | None

Extracts explained variance information from a fitted pipeline, if available.

Only works with methods that expose an explained_variance_ratio_ attribute (PCA, SVD).

Parameters:
  • pipeline (Pipeline) – A fitted pipeline.

  • X (Optional[Union[pd.DataFrame, np.ndarray]]) – Data used to fit the pipeline if it has not been fitted yet.

Returns:

Array of explained variance ratios, or None if not applicable.

Return type:

Optional[np.ndarray]

Examples

>>> pipe = create_embedding_pipeline("pca", n_components=10)
>>> X_reduced = apply_embedding(X, pipe)
>>> variance = get_explained_variance(pipe)
>>> print(f"Total variance explained: {variance.sum():.2%}")
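The per-component ratios also yield a cumulative curve, useful for picking a variance target by hand (this mirrors what recommend_n_components below automates):

>>> import numpy as np
>>> cumulative = np.cumsum(variance)
>>> n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
>>> print(f"{n_for_95} components reach 95% cumulative variance")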
ml_grid.pipeline.embeddings.recommend_n_components(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray, variance_threshold: float = 0.95, y: pandas.Series | numpy.ndarray | None = None) → int | None

Recommends the number of components needed to retain a target variance level.

Only works with methods that provide explained variance (PCA, SVD).

Parameters:
  • pipeline (Pipeline) – A pipeline with a variance-based method.

  • X (Union[pd.DataFrame, np.ndarray]) – The data to analyze.

  • variance_threshold (float) – Target cumulative variance to retain (0-1).

  • y (Optional[Union[pd.Series, np.ndarray]]) – Target labels, required for supervised methods.

Returns:

Recommended number of components, or None if not applicable.

Return type:

Optional[int]

Examples

>>> # Fit with high n_components first
>>> pipe = create_embedding_pipeline("pca", n_components=100)
>>> _ = apply_embedding(X, pipe)
>>> n_opt = recommend_n_components(pipe, X, variance_threshold=0.95)
>>> print(f"Use {n_opt} components for 95% variance")
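A natural follow-up is to rebuild the pipeline at the recommended size (a sketch; guards against the None case documented above):

>>> if n_opt is not None:
...     final_pipe = create_embedding_pipeline("pca", n_components=n_opt)
...     X_final = apply_embedding(X, final_pipe)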