ml_grid.pipeline.embeddings
===========================

.. py:module:: ml_grid.pipeline.embeddings

.. autoapi-nested-parse::

   Module for applying dimensionality reduction techniques (embeddings).

   Designed for automated data pipelines that prepare features for binary
   classification. Focuses on methods suitable for sparse, high-dimensional
   data with reproducible transforms.

Attributes
----------

.. autoapisummary::

   ml_grid.pipeline.embeddings.EmbeddingMethod

Functions
---------

.. autoapisummary::

   ml_grid.pipeline.embeddings.create_embedding_pipeline
   ml_grid.pipeline.embeddings.apply_embedding
   ml_grid.pipeline.embeddings.transform_new_data
   ml_grid.pipeline.embeddings.get_method_recommendation
   ml_grid.pipeline.embeddings.get_explained_variance
   ml_grid.pipeline.embeddings.recommend_n_components

Module Contents
---------------

.. py:data:: EmbeddingMethod

.. py:function:: create_embedding_pipeline(method: EmbeddingMethod = 'svd', n_components: int = 64, scale: bool = True, **kwargs: Any) -> sklearn.pipeline.Pipeline

   Creates a scikit-learn pipeline for dimensionality reduction.

   This function constructs a pipeline optimized for automated preprocessing in
   classification pipelines. All methods support the fit/transform pattern for
   proper train/test separation.

   :param method: The embedding method to use:

                  - "svd": TruncatedSVD - best for sparse matrices (TF-IDF, count vectors)
                  - "pca": PCA - standard choice for dense data
                  - "nmf": Non-negative Matrix Factorization - for non-negative sparse data
                  - "lda": Linear Discriminant Analysis - supervised, maximizes class separation
                  - "random_gaussian": Gaussian Random Projection - fast, preserves distances
                  - "random_sparse": Sparse Random Projection - very fast for sparse data
                  - "select_kbest_f": F-statistic feature selection - supervised, linear relationships
                  - "select_kbest_mi": Mutual information feature selection - supervised, non-linear

                  Defaults to "svd".
   :type method: EmbeddingMethod
   :param n_components: Target number of dimensions. Note: LDA is limited to
                        n_classes - 1 components (at most 1 for binary).
                        Defaults to 64.
   :type n_components: int, optional
   :param scale: Whether to apply StandardScaler before embedding. Note:
                 scaling converts sparse input to dense - set False for sparse
                 data. Defaults to True.
   :type scale: bool, optional
   :param \*\*kwargs: Additional keyword arguments for the embedding method:

                      - SVD: n_iter (int), random_state (int)
                      - PCA: random_state (int), svd_solver (str)
                      - NMF: init (str), max_iter (int), random_state (int)
                      - Random Projection: eps (float), random_state (int)
                      - SelectKBest: no additional params typically needed
   :returns: A scikit-learn pipeline configured with the specified steps.
   :rtype: Pipeline
   :raises ValueError: If an unsupported embedding method is provided.

   .. rubric:: Examples

   >>> # Sparse TF-IDF data (no scaling)
   >>> pipe = create_embedding_pipeline("svd", n_components=128, scale=False)

   >>> # Dense numerical features
   >>> pipe = create_embedding_pipeline("pca", n_components=50, scale=True)

   >>> # Supervised feature selection
   >>> pipe = create_embedding_pipeline("select_kbest_f", n_components=100)

   >>> # Fast random projection for very high dims
   >>> pipe = create_embedding_pipeline("random_sparse", n_components=200,
   ...                                  scale=False, random_state=42)
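   Beyond the one-liners above, a minimal end-to-end sketch of the fit/transform
   contract on sparse input (the data here is synthetic and illustrative; the
   pipeline's own ``fit_transform`` is used because the input is a SciPy sparse
   matrix):

   .. code-block:: python

      from scipy import sparse

      # Hypothetical sparse count matrix: 1,000 samples x 5,000 features.
      X_sparse = sparse.random(1_000, 5_000, density=0.01, format="csr", random_state=0)

      # scale=False keeps the matrix sparse; TruncatedSVD accepts sparse input
      # directly, and random_state is forwarded to it via **kwargs.
      pipe = create_embedding_pipeline("svd", n_components=64, scale=False, random_state=0)

      X_reduced = pipe.fit_transform(X_sparse)
      assert X_reduced.shape == (1_000, 64)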
.. py:function:: apply_embedding(X: Union[pandas.DataFrame, numpy.ndarray], pipeline: sklearn.pipeline.Pipeline, y: Optional[Union[pandas.Series, numpy.ndarray]] = None) -> numpy.ndarray

   Applies a pre-configured embedding pipeline to the data.

   :param X: The input feature data.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param pipeline: The scikit-learn pipeline to apply.
   :type pipeline: Pipeline
   :param y: Target labels, required for supervised methods (lda,
             select_kbest_*). Defaults to None.
   :type y: Optional[Union[pd.Series, np.ndarray]], optional
   :returns: The transformed data with reduced dimensionality.
   :rtype: np.ndarray
   :raises ValueError: If a supervised method is used without providing labels.

   .. rubric:: Examples

   >>> # Unsupervised
   >>> X = np.random.rand(100, 500)
   >>> pipe = create_embedding_pipeline("svd", n_components=64)
   >>> X_reduced = apply_embedding(X, pipe)

   >>> # Supervised
   >>> y = np.random.randint(0, 2, 100)
   >>> pipe = create_embedding_pipeline("lda", n_components=1)
   >>> X_reduced = apply_embedding(X, pipe, y=y)

.. py:function:: transform_new_data(X: Union[pandas.DataFrame, numpy.ndarray], fitted_pipeline: sklearn.pipeline.Pipeline) -> numpy.ndarray

   Transforms new data using an already-fitted pipeline.

   Critical for proper train/test separation in production pipelines.

   :param X: New data to transform.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param fitted_pipeline: A pipeline that has already been fitted.
   :type fitted_pipeline: Pipeline
   :returns: The transformed data.
   :rtype: np.ndarray

   .. rubric:: Examples

   >>> # Fit on training data
   >>> pipe = create_embedding_pipeline("svd", n_components=64)
   >>> X_train_reduced = apply_embedding(X_train, pipe)

   >>> # Transform test data with the same fitted pipeline
   >>> X_test_reduced = transform_new_data(X_test, pipe)

.. py:function:: get_method_recommendation(is_sparse: bool, has_labels: bool, n_features: int, n_samples: int, is_nonnegative: bool = False) -> Dict[str, Any]

   Recommends the best embedding method based on data characteristics.

   :param is_sparse: Whether the data is sparse (e.g., TF-IDF, one-hot encoded).
   :type is_sparse: bool
   :param has_labels: Whether labels are available for supervised methods.
   :type has_labels: bool
   :param n_features: Number of input features.
   :type n_features: int
   :param n_samples: Number of samples.
   :type n_samples: int
   :param is_nonnegative: Whether all data values are non-negative.
   :type is_nonnegative: bool
   :returns: Dictionary containing:

             - method: Recommended method name
             - scale: Whether to apply scaling
             - rationale: Explanation of the recommendation
             - alternatives: List of other suitable methods
   :rtype: Dict[str, Any]

   .. rubric:: Examples

   >>> # Sparse TF-IDF data for classification
   >>> rec = get_method_recommendation(
   ...     is_sparse=True, has_labels=True,
   ...     n_features=10000, n_samples=5000
   ... )
   >>> rec['method']
   'svd'
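   A sketch of wiring the recommendation into pipeline construction (synthetic
   data; per the example above, this sparse/labelled combination recommends
   "svd", and the returned ``method`` and ``scale`` values are assumed to feed
   directly into ``create_embedding_pipeline``):

   .. code-block:: python

      import numpy as np
      from scipy import sparse

      # Hypothetical sparse design matrix with binary labels.
      X = sparse.random(5_000, 10_000, density=0.001, format="csr", random_state=0)
      y = np.random.randint(0, 2, 5_000)

      rec = get_method_recommendation(
          is_sparse=True, has_labels=True,
          n_features=X.shape[1], n_samples=X.shape[0],
      )
      print(rec["rationale"])  # human-readable explanation of the choice

      # 'method' and 'scale' map onto create_embedding_pipeline's arguments,
      # so the recommendation can drive construction unmodified.
      pipe = create_embedding_pipeline(rec["method"], n_components=64, scale=rec["scale"])
      X_reduced = pipe.fit_transform(X, y)  # y is ignored by unsupervised steps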
.. py:function:: get_explained_variance(pipeline: sklearn.pipeline.Pipeline, X: Optional[Union[pandas.DataFrame, numpy.ndarray]] = None) -> Optional[numpy.ndarray]

   Extracts explained variance information from a fitted pipeline, if available.

   Only works with methods that expose an ``explained_variance_ratio_``
   attribute (PCA, SVD).

   :param pipeline: A fitted pipeline.
   :type pipeline: Pipeline
   :param X: Data to compute variance on if the pipeline is not already fitted.
   :type X: Optional[Union[pd.DataFrame, np.ndarray]]
   :returns: Array of explained variance ratios, or None if not applicable.
   :rtype: Optional[np.ndarray]

   .. rubric:: Examples

   >>> pipe = create_embedding_pipeline("pca", n_components=10)
   >>> X_reduced = apply_embedding(X, pipe)
   >>> variance = get_explained_variance(pipe)
   >>> print(f"Total variance explained: {variance.sum():.2%}")

.. py:function:: recommend_n_components(pipeline: sklearn.pipeline.Pipeline, X: Union[pandas.DataFrame, numpy.ndarray], variance_threshold: float = 0.95, y: Optional[Union[pandas.Series, numpy.ndarray]] = None) -> Optional[int]

   Recommends the number of components needed to retain a target variance level.

   Only works with methods that provide explained variance (PCA, SVD).

   :param pipeline: A pipeline with a variance-based method.
   :type pipeline: Pipeline
   :param X: The data to analyze.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param variance_threshold: Target cumulative variance to retain (0-1).
   :type variance_threshold: float
   :param y: Labels, if needed.
   :type y: Optional[Union[pd.Series, np.ndarray]]
   :returns: Recommended number of components, or None if not applicable.
   :rtype: Optional[int]

   .. rubric:: Examples

   >>> # Fit with a high n_components first
   >>> pipe = create_embedding_pipeline("pca", n_components=100)
   >>> apply_embedding(X, pipe)
   >>> n_opt = recommend_n_components(pipe, X, variance_threshold=0.95)
   >>> print(f"Use {n_opt} components for 95% variance")
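   Putting the two-stage workflow together: over-provision components once, ask
   for a recommendation, then rebuild a leaner pipeline. A minimal sketch with
   synthetic data (variable names are illustrative):

   .. code-block:: python

      import numpy as np

      # Hypothetical dense training features.
      X_train = np.random.rand(1_000, 300)

      # Stage 1: probe with a deliberately large n_components.
      probe = create_embedding_pipeline("pca", n_components=100)
      apply_embedding(X_train, probe)

      # Stage 2: smallest component count retaining 95% cumulative variance.
      n_opt = recommend_n_components(probe, X_train, variance_threshold=0.95)

      # Stage 3: refit the production pipeline at the leaner size.
      if n_opt is not None:
          final = create_embedding_pipeline("pca", n_components=n_opt)
          X_reduced = apply_embedding(X_train, final)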