ml_grid.pipeline.embeddings
Module for applying dimensionality reduction techniques (embeddings).
Designed for automated data pipelines that prepare features for binary classification. Focuses on methods suitable for sparse, high-dimensional data with reproducible transforms.
Attributes
Functions
create_embedding_pipeline | Creates a scikit-learn pipeline for dimensionality reduction.
apply_embedding | Applies a pre-configured embedding pipeline to the data.
transform_new_data | Transforms new data using an already-fitted pipeline.
get_method_recommendation | Recommends the best embedding method based on data characteristics.
get_explained_variance | Extracts explained variance information from a fitted pipeline, if available.
recommend_n_components | Recommends the number of components needed to retain a target variance level.
Module Contents
- ml_grid.pipeline.embeddings.create_embedding_pipeline(method: EmbeddingMethod = 'svd', n_components: int = 64, scale: bool = True, **kwargs: Any) -> sklearn.pipeline.Pipeline
Creates a scikit-learn pipeline for dimensionality reduction.
This function constructs a pipeline for automated preprocessing ahead of classification. Every method supports the fit/transform pattern, so the reduction can be fitted on training data and then applied unchanged to test data.
- Parameters:
method (EmbeddingMethod) – The embedding method to use. Defaults to “svd”.
- “svd”: TruncatedSVD. Best for sparse matrices (TF-IDF, count vectors).
- “pca”: PCA. Standard choice for dense data.
- “nmf”: Non-negative Matrix Factorization. For non-negative sparse data.
- “lda”: Linear Discriminant Analysis. Supervised; maximizes class separation.
- “random_gaussian”: Gaussian Random Projection. Fast; preserves distances.
- “random_sparse”: Sparse Random Projection. Very fast for sparse data.
- “select_kbest_f”: F-statistic feature selection. Supervised; captures linear relationships.
- “select_kbest_mi”: Mutual information feature selection. Supervised; captures non-linear relationships.
n_components (int, optional) – Target number of dimensions. Note: LDA is limited to n_classes - 1 (max 1 for binary). Defaults to 64.
scale (bool, optional) – Whether to apply StandardScaler before embedding. Note: Scaling converts sparse to dense - set False for sparse data. Defaults to True.
**kwargs – Additional keyword arguments for the embedding method:
- SVD: n_iter (int), random_state (int)
- PCA: random_state (int), svd_solver (str)
- NMF: init (str), max_iter (int), random_state (int)
- Random Projection: eps (float), random_state (int)
- SelectKBest: no additional parameters are typically needed
- Returns:
A scikit-learn pipeline configured with the specified steps.
- Return type:
Pipeline
- Raises:
ValueError – If an unsupported embedding method is provided.
Examples
>>> # Sparse TF-IDF data (no scaling)
>>> pipe = create_embedding_pipeline("svd", n_components=128, scale=False)
>>> # Dense numerical features
>>> pipe = create_embedding_pipeline("pca", n_components=50, scale=True)
>>> # Supervised feature selection
>>> pipe = create_embedding_pipeline("select_kbest_f", n_components=100)
>>> # Fast random projection for very high dims
>>> pipe = create_embedding_pipeline("random_sparse", n_components=200,
...                                  scale=False, random_state=42)
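For orientation, a pipeline of this kind can be approximated directly with scikit-learn. The sketch below is a hypothetical stand-in, not the module's implementation: the helper name `build_pipeline_sketch`, the step names, and the restriction to two methods are all assumptions made for illustration.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline_sketch(method="svd", n_components=64, scale=True, **kwargs):
    """Hypothetical stand-in for create_embedding_pipeline (two methods only)."""
    reducers = {"svd": TruncatedSVD, "pca": PCA}
    if method not in reducers:
        raise ValueError(f"Unsupported embedding method: {method!r}")
    steps = []
    if scale:
        # Optional scaling step; the reference above advises scale=False
        # for sparse input.
        steps.append(("scaler", StandardScaler()))
    # Extra keyword arguments (e.g. random_state) go to the reducer.
    steps.append(("embedding", reducers[method](n_components=n_components, **kwargs)))
    return Pipeline(steps)


pipe = build_pipeline_sketch("svd", n_components=8, scale=False, random_state=0)
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['embedding']
```

Keeping the scaler and the reducer in one `Pipeline` object means both are fitted together and later applied together, which is what makes the train/test separation described above straightforward.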
- ml_grid.pipeline.embeddings.apply_embedding(X: pandas.DataFrame | numpy.ndarray, pipeline: sklearn.pipeline.Pipeline, y: pandas.Series | numpy.ndarray | None = None) -> numpy.ndarray
Applies a pre-configured embedding pipeline to the data.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – The input feature data.
pipeline (Pipeline) – The scikit-learn pipeline to apply.
y (Optional[Union[pd.Series, np.ndarray]], optional) – Target labels, required for supervised methods (lda, select_kbest_*). Defaults to None.
- Returns:
The transformed data with reduced dimensionality.
- Return type:
np.ndarray
- Raises:
ValueError – If a supervised method is used without providing labels.
Examples
>>> import numpy as np
>>> # Unsupervised
>>> X = np.random.rand(100, 500)
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_reduced = apply_embedding(X, pipe)
>>> # Supervised
>>> y = np.random.randint(0, 2, 100)
>>> pipe = create_embedding_pipeline("lda", n_components=1)
>>> X_reduced = apply_embedding(X, pipe, y=y)
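The label requirement for supervised methods can be illustrated with plain scikit-learn. This is a minimal sketch under stated assumptions: `apply_embedding_sketch` and `SUPERVISED_TYPES` are hypothetical names, and the way the real function detects a supervised method is not shown in this reference.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Estimator types treated as supervised in this sketch (an assumption).
SUPERVISED_TYPES = (LinearDiscriminantAnalysis, SelectKBest)


def apply_embedding_sketch(X, pipeline, y=None):
    """Hypothetical stand-in: fit the pipeline and return the reduced data."""
    final_step = pipeline.steps[-1][1]
    if y is None and isinstance(final_step, SUPERVISED_TYPES):
        raise ValueError("Supervised embedding methods require labels (y).")
    return np.asarray(pipeline.fit_transform(X, y))


rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = np.array([0, 1] * 50)  # balanced binary labels

# With two classes, LDA can produce at most n_classes - 1 = 1 component.
lda_pipe = Pipeline([("embedding", LinearDiscriminantAnalysis(n_components=1))])
X_reduced = apply_embedding_sketch(X, lda_pipe, y=y)
print(X_reduced.shape)  # (100, 1)
```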
- ml_grid.pipeline.embeddings.transform_new_data(X: pandas.DataFrame | numpy.ndarray, fitted_pipeline: sklearn.pipeline.Pipeline) -> numpy.ndarray
Transforms new data using an already-fitted pipeline.
Critical for proper train/test separation in production pipelines.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – New data to transform.
fitted_pipeline (Pipeline) – A pipeline that has already been fitted.
- Returns:
The transformed data.
- Return type:
np.ndarray
Examples
>>> # Fit on training data
>>> pipe = create_embedding_pipeline("svd", n_components=64)
>>> X_train_reduced = apply_embedding(X_train, pipe)
>>> # Transform test data with same fitted pipeline
>>> X_test_reduced = transform_new_data(X_test, pipe)
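The train/test separation described here amounts to calling `fit_transform` once on training data and only `transform` afterwards. The sketch below shows that pattern with scikit-learn alone; `transform_new_data_sketch` is a hypothetical helper, and the fitted-state check it performs is an assumption about what such a function might reasonably do.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted


def transform_new_data_sketch(X, fitted_pipeline):
    """Hypothetical stand-in: apply an already-fitted pipeline without refitting."""
    # Raises NotFittedError if the final step has not been fitted.
    check_is_fitted(fitted_pipeline.steps[-1][1])
    return np.asarray(fitted_pipeline.transform(X))


rng = np.random.default_rng(0)
X_train = rng.random((80, 30))
X_test = rng.random((20, 30))

pipe = Pipeline([("embedding", TruncatedSVD(n_components=5, random_state=0))])
X_train_reduced = pipe.fit_transform(X_train)  # projection learned on train only
components_before = pipe.named_steps["embedding"].components_.copy()

X_test_reduced = transform_new_data_sketch(X_test, pipe)  # no refitting
print(X_train_reduced.shape, X_test_reduced.shape)  # (80, 5) (20, 5)
```

Because the test data never reaches `fit`, the learned components stay exactly as they were after training, which avoids leaking test-set statistics into the embedding.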
- ml_grid.pipeline.embeddings.get_method_recommendation(is_sparse: bool, has_labels: bool, n_features: int, n_samples: int, is_nonnegative: bool = False) -> Dict[str, Any]
Recommends the best embedding method based on data characteristics.
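- Parameters:
is_sparse (bool) – Whether the input data is sparse.
has_labels (bool) – Whether target labels are available.
n_features (int) – Number of input features.
n_samples (int) – Number of samples.
is_nonnegative (bool, optional) – Whether all feature values are non-negative. Defaults to False.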
- Returns:
- Dictionary containing:
method: Recommended method name
scale: Whether to apply scaling
rationale: Explanation of recommendation
alternatives: List of other suitable methods
- Return type:
Dict[str, Any]
Examples
>>> # Sparse TF-IDF data for classification
>>> rec = get_method_recommendation(
...     is_sparse=True, has_labels=True,
...     n_features=10000, n_samples=5000
... )
>>> print(rec['method'])
svd
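The decision rules behind the recommendation are not listed in this reference. Purely to illustrate the shape of the returned dictionary, here is a simplified rule-based sketch; the function name `recommend_method_sketch`, the rules, and the omission of `n_features` and `n_samples` are all assumptions, chosen only to be consistent with the method descriptions given for `create_embedding_pipeline`.

```python
from typing import Any, Dict


def recommend_method_sketch(is_sparse: bool, has_labels: bool,
                            is_nonnegative: bool = False) -> Dict[str, Any]:
    """Hypothetical, simplified rules; not the library's actual logic."""
    if is_sparse:
        method, scale = "svd", False
        rationale = "TruncatedSVD accepts sparse input without densifying it."
        alternatives = ["random_sparse"]
        if is_nonnegative:
            alternatives.append("nmf")
    else:
        method, scale = "pca", True
        rationale = "PCA is the standard choice for dense data."
        alternatives = ["random_gaussian"]
    if has_labels:
        # Supervised selectors only make sense when labels exist.
        alternatives += ["select_kbest_f", "select_kbest_mi"]
    return {
        "method": method,
        "scale": scale,
        "rationale": rationale,
        "alternatives": alternatives,
    }


rec = recommend_method_sketch(is_sparse=True, has_labels=True)
print(rec["method"])  # svd
```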
- ml_grid.pipeline.embeddings.get_explained_variance(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray | None = None) -> numpy.ndarray | None
Extracts explained variance information from a fitted pipeline, if available.
Only works with methods that expose an explained_variance_ratio_ attribute (PCA, SVD).
- Parameters:
pipeline (Pipeline) – A fitted pipeline.
X (Optional[Union[pd.DataFrame, np.ndarray]]) – Data used to fit the pipeline if it has not already been fitted.
- Returns:
Array of explained variance ratios, or None if not applicable.
- Return type:
Optional[np.ndarray]
Examples
>>> pipe = create_embedding_pipeline("pca", n_components=10)
>>> X_reduced = apply_embedding(X, pipe)
>>> variance = get_explained_variance(pipe)
>>> print(f"Total variance explained: {variance.sum():.2%}")
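Since only some reducers expose `explained_variance_ratio_`, one plausible way to implement the "or None if not applicable" behavior is to look the attribute up on the final pipeline step. The sketch below assumes that approach; `explained_variance_sketch` is a hypothetical name and the real function may work differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.random_projection import GaussianRandomProjection


def explained_variance_sketch(pipeline):
    """Hypothetical stand-in: return the ratios, or None if unavailable."""
    final_step = pipeline.steps[-1][1]
    return getattr(final_step, "explained_variance_ratio_", None)


rng = np.random.default_rng(0)
X = rng.random((50, 10))

pca_pipe = Pipeline([("embedding", PCA(n_components=3))]).fit(X)
rp_pipe = Pipeline(
    [("embedding", GaussianRandomProjection(n_components=3, random_state=0))]
).fit(X)

variance = explained_variance_sketch(pca_pipe)
print(f"Total variance explained: {variance.sum():.2%}")
print(explained_variance_sketch(rp_pipe))  # None: random projection has no such attribute
```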
- ml_grid.pipeline.embeddings.recommend_n_components(pipeline: sklearn.pipeline.Pipeline, X: pandas.DataFrame | numpy.ndarray, variance_threshold: float = 0.95, y: pandas.Series | numpy.ndarray | None = None) -> int | None
Recommends the number of components needed to retain a target variance level.
Only works with methods that provide explained variance (PCA, SVD).
- Parameters:
pipeline (Pipeline) – A pipeline with a variance-based method.
X (Union[pd.DataFrame, np.ndarray]) – The data to analyze.
variance_threshold (float) – Target cumulative variance to retain (0-1).
y (Optional[Union[pd.Series, np.ndarray]]) – Labels if needed.
- Returns:
Recommended number of components, or None if not applicable.
- Return type:
Optional[int]
Examples
>>> # Fit with high n_components first
>>> pipe = create_embedding_pipeline("pca", n_components=100)
>>> X_reduced = apply_embedding(X, pipe)
>>> n_opt = recommend_n_components(pipe, X, variance_threshold=0.95)
>>> print(f"Use {n_opt} components for 95% variance")
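A common way to turn explained variance ratios into a component count is to take the smallest k whose cumulative ratio reaches the threshold. The sketch below implements that rule in plain Python; `n_components_for_variance` is a hypothetical helper, and returning None when the threshold is never reached is an assumption rather than documented behavior.

```python
from itertools import accumulate
from typing import Iterable, Optional


def n_components_for_variance(ratios: Iterable[float],
                              threshold: float = 0.95) -> Optional[int]:
    """Smallest k such that the first k ratios sum to at least `threshold`."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return None  # threshold not reachable with the fitted components


# Illustrative ratios, sorted in decreasing order as PCA reports them.
ratios = [0.6, 0.25, 0.1, 0.05]
print(n_components_for_variance(ratios, threshold=0.9))  # 3
print(n_components_for_variance(ratios, threshold=0.5))  # 1
```

This is why the example above fits with a generous `n_components` first: the cumulative sum can only reach the threshold if enough components were computed in the initial fit.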