ml_grid.pipeline.embeddings
===========================

.. py:module:: ml_grid.pipeline.embeddings

.. autoapi-nested-parse::

   Module for applying dimensionality reduction techniques (embeddings).

   Designed for automated data pipelines that prepare features for binary
   classification. Focuses on methods suitable for sparse, high-dimensional
   data with reproducible transforms.

Attributes
----------

.. autoapisummary::

   ml_grid.pipeline.embeddings.EmbeddingMethod

Functions
---------

.. autoapisummary::

   ml_grid.pipeline.embeddings.create_embedding_pipeline
   ml_grid.pipeline.embeddings.apply_embedding
   ml_grid.pipeline.embeddings.transform_new_data
   ml_grid.pipeline.embeddings.get_method_recommendation
   ml_grid.pipeline.embeddings.get_explained_variance
   ml_grid.pipeline.embeddings.recommend_n_components

Module Contents
---------------

.. py:data:: EmbeddingMethod

.. py:function:: create_embedding_pipeline(method: EmbeddingMethod = 'svd', n_components: int = 64, scale: bool = True, **kwargs: Any) -> sklearn.pipeline.Pipeline

   Creates a scikit-learn pipeline for dimensionality reduction.

   This function constructs a pipeline optimized for automated preprocessing in
   classification pipelines. All methods support the fit/transform pattern for
   proper train/test separation.

   :param method: The embedding method to use:

                  - "svd": TruncatedSVD - best for sparse matrices (TF-IDF, count vectors)
                  - "pca": PCA - standard choice for dense data
                  - "nmf": Non-negative Matrix Factorization - for non-negative sparse data
                  - "lda": Linear Discriminant Analysis - supervised, maximizes class separation
                  - "random_gaussian": Gaussian Random Projection - fast, preserves distances
                  - "random_sparse": Sparse Random Projection - very fast for sparse data
                  - "select_kbest_f": F-statistic feature selection - supervised, linear relationships
                  - "select_kbest_mi": Mutual information feature selection - supervised, non-linear

                  Defaults to "svd".
   :type method: EmbeddingMethod
   :param n_components: Target number of dimensions. Note: LDA is limited to
                        n_classes - 1 components (at most 1 for binary).
                        Defaults to 64.
   :type n_components: int, optional
   :param scale: Whether to apply StandardScaler before embedding. Note:
                 scaling converts sparse input to dense - set False for sparse
                 data. Defaults to True.
   :type scale: bool, optional
   :param \*\*kwargs: Additional keyword arguments for the embedding method:

                      - SVD: n_iter (int), random_state (int)
                      - PCA: random_state (int), svd_solver (str)
                      - NMF: init (str), max_iter (int), random_state (int)
                      - Random Projection: eps (float), random_state (int)
                      - SelectKBest: no additional params typically needed
   :returns: A scikit-learn pipeline configured with the specified steps.
   :rtype: Pipeline
   :raises ValueError: If an unsupported embedding method is provided.

   .. rubric:: Examples

   >>> # Sparse TF-IDF data (no scaling)
   >>> pipe = create_embedding_pipeline("svd", n_components=128, scale=False)

   >>> # Dense numerical features
   >>> pipe = create_embedding_pipeline("pca", n_components=50, scale=True)

   >>> # Supervised feature selection
   >>> pipe = create_embedding_pipeline("select_kbest_f", n_components=100)

   >>> # Fast random projection for very high dims
   >>> pipe = create_embedding_pipeline("random_sparse", n_components=200,
   ...                                  scale=False, random_state=42)
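   Beyond the one-liners above, a minimal end-to-end sketch of the fit/transform
   contract on sparse input (the data here is synthetic and illustrative; the
   pipeline's own ``fit_transform`` is used because the input is a SciPy sparse
   matrix):

   .. code-block:: python

      from scipy import sparse

      # Hypothetical sparse count matrix: 1,000 samples x 5,000 features.
      X_sparse = sparse.random(1_000, 5_000, density=0.01, format="csr", random_state=0)

      # scale=False keeps the matrix sparse; TruncatedSVD accepts sparse input
      # directly, and random_state is forwarded to it via **kwargs.
      pipe = create_embedding_pipeline("svd", n_components=64, scale=False, random_state=0)

      X_reduced = pipe.fit_transform(X_sparse)
      assert X_reduced.shape == (1_000, 64)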
.. py:function:: apply_embedding(X: Union[pandas.DataFrame, numpy.ndarray], pipeline: sklearn.pipeline.Pipeline, y: Optional[Union[pandas.Series, numpy.ndarray]] = None) -> numpy.ndarray

   Applies a pre-configured embedding pipeline to the data.

   :param X: The input feature data.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param pipeline: The scikit-learn pipeline to apply.
   :type pipeline: Pipeline
   :param y: Target labels, required for supervised methods (lda,
             select_kbest_*). Defaults to None.
   :type y: Optional[Union[pd.Series, np.ndarray]], optional
   :returns: The transformed data with reduced dimensionality.
   :rtype: np.ndarray
   :raises ValueError: If a supervised method is used without providing labels.

   .. rubric:: Examples

   >>> # Unsupervised
   >>> X = np.random.rand(100, 500)
   >>> pipe = create_embedding_pipeline("svd", n_components=64)
   >>> X_reduced = apply_embedding(X, pipe)

   >>> # Supervised
   >>> y = np.random.randint(0, 2, 100)
   >>> pipe = create_embedding_pipeline("lda", n_components=1)
   >>> X_reduced = apply_embedding(X, pipe, y=y)

.. py:function:: transform_new_data(X: Union[pandas.DataFrame, numpy.ndarray], fitted_pipeline: sklearn.pipeline.Pipeline) -> numpy.ndarray

   Transforms new data using an already-fitted pipeline.

   Critical for proper train/test separation in production pipelines.

   :param X: New data to transform.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param fitted_pipeline: A pipeline that has already been fitted.
   :type fitted_pipeline: Pipeline
   :returns: The transformed data.
   :rtype: np.ndarray

   .. rubric:: Examples

   >>> # Fit on training data
   >>> pipe = create_embedding_pipeline("svd", n_components=64)
   >>> X_train_reduced = apply_embedding(X_train, pipe)

   >>> # Transform test data with the same fitted pipeline
   >>> X_test_reduced = transform_new_data(X_test, pipe)

.. py:function:: get_method_recommendation(is_sparse: bool, has_labels: bool, n_features: int, n_samples: int, is_nonnegative: bool = False) -> Dict[str, Any]

   Recommends the best embedding method based on data characteristics.

   :param is_sparse: Whether the data is sparse (e.g., TF-IDF, one-hot encoded).
   :type is_sparse: bool
   :param has_labels: Whether labels are available for supervised methods.
   :type has_labels: bool
   :param n_features: Number of input features.
   :type n_features: int
   :param n_samples: Number of samples.
   :type n_samples: int
   :param is_nonnegative: Whether all data values are non-negative.
   :type is_nonnegative: bool
   :returns: Dictionary containing:

             - method: Recommended method name
             - scale: Whether to apply scaling
             - rationale: Explanation of the recommendation
             - alternatives: List of other suitable methods
   :rtype: Dict[str, Any]

   .. rubric:: Examples

   >>> # Sparse TF-IDF data for classification
   >>> rec = get_method_recommendation(
   ...     is_sparse=True, has_labels=True,
   ...     n_features=10000, n_samples=5000
   ... )
   >>> rec['method']
   'svd'
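   A sketch of wiring the recommendation into pipeline construction (synthetic
   data; per the example above, this sparse/labelled combination recommends
   "svd", and the returned ``method`` and ``scale`` values are assumed to feed
   directly into ``create_embedding_pipeline``):

   .. code-block:: python

      import numpy as np
      from scipy import sparse

      # Hypothetical sparse design matrix with binary labels.
      X = sparse.random(5_000, 10_000, density=0.001, format="csr", random_state=0)
      y = np.random.randint(0, 2, 5_000)

      rec = get_method_recommendation(
          is_sparse=True, has_labels=True,
          n_features=X.shape[1], n_samples=X.shape[0],
      )
      print(rec["rationale"])  # human-readable explanation of the choice

      # 'method' and 'scale' map onto create_embedding_pipeline's arguments,
      # so the recommendation can drive construction unmodified.
      pipe = create_embedding_pipeline(rec["method"], n_components=64, scale=rec["scale"])
      X_reduced = pipe.fit_transform(X, y)  # y is ignored by unsupervised steps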
.. py:function:: get_explained_variance(pipeline: sklearn.pipeline.Pipeline, X: Optional[Union[pandas.DataFrame, numpy.ndarray]] = None) -> Optional[numpy.ndarray]

   Extracts explained variance information from a fitted pipeline, if available.

   Only works with methods that expose an ``explained_variance_ratio_``
   attribute (PCA, SVD).

   :param pipeline: A fitted pipeline.
   :type pipeline: Pipeline
   :param X: Data to compute variance on if the pipeline is not already fitted.
   :type X: Optional[Union[pd.DataFrame, np.ndarray]]
   :returns: Array of explained variance ratios, or None if not applicable.
   :rtype: Optional[np.ndarray]

   .. rubric:: Examples

   >>> pipe = create_embedding_pipeline("pca", n_components=10)
   >>> X_reduced = apply_embedding(X, pipe)
   >>> variance = get_explained_variance(pipe)
   >>> print(f"Total variance explained: {variance.sum():.2%}")

.. py:function:: recommend_n_components(pipeline: sklearn.pipeline.Pipeline, X: Union[pandas.DataFrame, numpy.ndarray], variance_threshold: float = 0.95, y: Optional[Union[pandas.Series, numpy.ndarray]] = None) -> Optional[int]

   Recommends the number of components needed to retain a target variance level.

   Only works with methods that provide explained variance (PCA, SVD).

   :param pipeline: A pipeline with a variance-based method.
   :type pipeline: Pipeline
   :param X: The data to analyze.
   :type X: Union[pd.DataFrame, np.ndarray]
   :param variance_threshold: Target cumulative variance to retain (0-1).
   :type variance_threshold: float
   :param y: Labels, if needed.
   :type y: Optional[Union[pd.Series, np.ndarray]]
   :returns: Recommended number of components, or None if not applicable.
   :rtype: Optional[int]

   .. rubric:: Examples

   >>> # Fit with a high n_components first
   >>> pipe = create_embedding_pipeline("pca", n_components=100)
   >>> apply_embedding(X, pipe)
   >>> n_opt = recommend_n_components(pipe, X, variance_threshold=0.95)
   >>> print(f"Use {n_opt} components for 95% variance")
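   Putting the two-stage workflow together: over-provision components once, ask
   for a recommendation, then rebuild a leaner pipeline. A minimal sketch with
   synthetic data (variable names are illustrative):

   .. code-block:: python

      import numpy as np

      # Hypothetical dense training features.
      X_train = np.random.rand(1_000, 300)

      # Stage 1: probe with a deliberately large n_components.
      probe = create_embedding_pipeline("pca", n_components=100)
      apply_embedding(X_train, probe)

      # Stage 2: smallest component count retaining 95% cumulative variance.
      n_opt = recommend_n_components(probe, X_train, variance_threshold=0.95)

      # Stage 3: refit the production pipeline at the leaner size.
      if n_opt is not None:
          final = create_embedding_pipeline("pca", n_components=n_opt)
          X_reduced = apply_embedding(X_train, final)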