ml_grid.pipeline.data
=====================

.. py:module:: ml_grid.pipeline.data


Classes
-------

.. autoapisummary::

   ml_grid.pipeline.data.pipe


Module Contents
---------------

.. py:class:: pipe(file_name: str, drop_term_list: List[str], local_param_dict: Dict[str, Any], base_project_dir: str, param_space_index: int, additional_naming: Optional[str] = None, test_sample_n: int = 0, column_sample_n: int = 0, time_series_mode: bool = False, model_class_dict: Optional[Dict[str, bool]] = None, outcome_var_override: Optional[str] = None)

   Initializes the data pipeline object.

   The constructor reads the input data, applies cleaning and feature
   engineering steps according to the provided parameters, and splits
   the data into training and testing sets.

   :param file_name: The path to the input CSV file.
   :type file_name: str
   :param drop_term_list: A list of substrings to identify columns
                          to drop.
   :type drop_term_list: List[str]
   :param local_param_dict: A dictionary of parameters for this
                            specific pipeline run.
   :type local_param_dict: Dict[str, Any]
   :param base_project_dir: The root directory for the project.
   :type base_project_dir: str
   :param param_space_index: The index of the current parameter space
                             permutation.
   :type param_space_index: int
   :param additional_naming: An additional string to append to log
                             folder names. Defaults to None.
   :type additional_naming: Optional[str], optional
   :param test_sample_n: The number of rows to sample from the
                         dataset for testing. Defaults to 0 (no sampling).
   :type test_sample_n: int, optional
   :param column_sample_n: The number of columns to sample.
                           Defaults to 0 (no sampling).
   :type column_sample_n: int, optional
   :param time_series_mode: Flag to enable time-series specific
                            data processing. Defaults to False.
   :type time_series_mode: bool, optional
   :param model_class_dict: A dictionary specifying which model classes
                            to include. Defaults to None.
   :type model_class_dict: Optional[Dict[str, bool]], optional
   :param outcome_var_override: A specific outcome variable name to use,
                                overriding the one from
                                ``local_param_dict``. Defaults to None.
   :type outcome_var_override: Optional[str], optional
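
   The ``drop_term_list``, ``test_sample_n`` and ``column_sample_n``
   parameters can be illustrated with a minimal standard-library sketch.
   This is not the pipeline's actual implementation: the helper names
   ``columns_matching_terms`` and ``sample_or_all`` are hypothetical,
   case-sensitive substring matching is an assumption, and so is the
   choice to return everything when ``n`` is at least the number of items.

   .. code-block:: python

      import random
      from typing import List, Sequence


      def columns_matching_terms(columns: Sequence[str],
                                 drop_term_list: Sequence[str]) -> List[str]:
          """Return the columns whose name contains any drop term (assumed rule)."""
          return [c for c in columns if any(term in c for term in drop_term_list)]


      def sample_or_all(items: Sequence[str], n: int, seed: int = 0) -> List[str]:
          """Return ``n`` random items, or all items when ``n`` is 0 (no sampling)."""
          if n <= 0 or n >= len(items):
              return list(items)
          return random.Random(seed).sample(list(items), n)


      # Hypothetical column names, for illustration only.
      columns = ["age", "bmi_mean", "client_idcode", "bmi_std"]
      to_drop = columns_matching_terms(columns, ["idcode", "_std"])
      kept = [c for c in columns if c not in to_drop]
      sampled = sample_or_all(kept, 0)  # 0 keeps every remaining column

   Here ``to_drop`` holds the two columns that contain a drop term, and
   ``sampled`` is identical to ``kept`` because a value of 0 disables sampling.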


   .. py:attribute:: base_project_dir
      :type:  str

      The root directory for the project, used for saving logs and models.


   .. py:attribute:: additional_naming
      :type:  Optional[str]

      An optional string to append to log folder names for better identification.


   .. py:attribute:: local_param_dict
      :type:  Dict[str, Any]

      A dictionary of parameters for this specific pipeline run.


   .. py:attribute:: global_params
      :type:  ml_grid.util.global_params.global_parameters

      A reference to the global parameters singleton instance.


   .. py:attribute:: verbose
      :type:  int

      The verbosity level for logging, inherited from global parameters.


   .. py:attribute:: param_space_index
      :type:  int

      The index of the current parameter space permutation being run.


   .. py:attribute:: time_series_mode
      :type:  bool

      A flag indicating if the pipeline is running in time-series mode.


   .. py:attribute:: model_class_dict
      :type:  Optional[Dict[str, bool]]

      A dictionary specifying which model classes to include in the run.


   .. py:attribute:: df
      :type:  pandas.DataFrame

      The raw input DataFrame after being read from the source file.


   .. py:attribute:: all_df_columns
      :type:  List[str]

      A list of all column names from the original raw DataFrame.


   .. py:attribute:: orignal_feature_names
      :type:  List[str]

      A copy of the original feature names before any processing.


   .. py:attribute:: pertubation_columns
      :type:  List[str]

      A list of columns selected for inclusion based on ``local_param_dict``.


   .. py:attribute:: drop_list
      :type:  List[str]

      A list of columns identified to be dropped due to various cleaning steps.


   .. py:attribute:: outcome_variable
      :type:  str

      The name of the target variable for the current pipeline run.
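
      How ``outcome_var_override`` could take precedence over the value
      from ``local_param_dict`` is sketched below. The function
      ``resolve_outcome`` and the dictionary key ``"outcome_variable"`` are
      hypothetical; the real key used by the pipeline is not documented here.

      .. code-block:: python

         from typing import Any, Dict, Optional


         def resolve_outcome(local_param_dict: Dict[str, Any],
                             outcome_var_override: Optional[str] = None) -> str:
             """Prefer the override; otherwise fall back to the run parameters."""
             if outcome_var_override is not None:
                 return outcome_var_override
             # "outcome_variable" is an assumed key name, for illustration only.
             return local_param_dict["outcome_variable"]


         params = {"outcome_variable": "outcome_a"}
         default_outcome = resolve_outcome(params)
         overridden_outcome = resolve_outcome(params, "outcome_b")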


   .. py:attribute:: final_column_list
      :type:  List[str]

      The final list of feature columns to be used after all filtering.
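
      One plausible way to derive such a list is to remove the dropped
      columns and the outcome variable from the candidate columns while
      preserving their order. The helper ``build_final_columns`` is
      hypothetical, and the pipeline may apply further filters not shown here.

      .. code-block:: python

         from typing import List, Sequence


         def build_final_columns(candidate_columns: Sequence[str],
                                 drop_list: Sequence[str],
                                 outcome_variable: str) -> List[str]:
             """Keep candidates that are neither dropped nor the outcome column."""
             excluded = set(drop_list) | {outcome_variable}
             return [c for c in candidate_columns if c not in excluded]


         # Hypothetical column names, for illustration only.
         final_columns = build_final_columns(
             ["age", "bmi_mean", "bmi_std", "outcome_a"],
             drop_list=["bmi_std"],
             outcome_variable="outcome_a",
         )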


   .. py:attribute:: X
      :type:  pandas.DataFrame

      The feature matrix (DataFrame) after all cleaning and selection steps.


   .. py:attribute:: y
      :type:  pandas.Series

      The target variable (Series) corresponding to the feature matrix ``X``.


   .. py:attribute:: X_train
      :type:  pandas.DataFrame

      The training feature set.


   .. py:attribute:: X_test
      :type:  pandas.DataFrame

      The validation/testing feature set.


   .. py:attribute:: y_train
      :type:  pandas.Series

      The training target set.


   .. py:attribute:: y_test
      :type:  pandas.Series

      The validation/testing target set.


   .. py:attribute:: X_test_orig
      :type:  pandas.DataFrame

      The original, held-out test set for final validation.


   .. py:attribute:: y_test_orig
      :type:  pandas.Series

      The target variable for the original, held-out test set.
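
      The relationship between the train, validation/testing and original
      held-out sets can be pictured as a two-stage split: a held-out portion
      is set aside first, and the remainder is split again. The sketch below
      uses only the standard library; the function ``two_stage_split``, the
      25% fractions and the seed are assumptions, not the pipeline's settings.

      .. code-block:: python

         import random
         from typing import List, Sequence, Tuple


         def two_stage_split(rows: Sequence[int],
                             holdout_frac: float = 0.25,
                             test_frac: float = 0.25,
                             seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
             """Return (train, test, held-out) rows from a shuffled two-stage split."""
             indices = list(range(len(rows)))
             random.Random(seed).shuffle(indices)

             # Stage 1: set aside the held-out set (analogous to X_test_orig).
             n_holdout = int(len(indices) * holdout_frac)
             holdout_idx, remaining = indices[:n_holdout], indices[n_holdout:]

             # Stage 2: split the remainder (analogous to X_test and X_train).
             n_test = int(len(remaining) * test_frac)
             test_idx, train_idx = remaining[:n_test], remaining[n_test:]

             def pick(idx: List[int]) -> List[int]:
                 return [rows[i] for i in idx]

             return pick(train_idx), pick(test_idx), pick(holdout_idx)


         train, test, test_orig = two_stage_split(list(range(100)))

      The three returned lists are disjoint and together cover every input row.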


   .. py:attribute:: model_class_list
      :type:  List[Any]

      A list of instantiated model class objects to be evaluated in this run.
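
      A ``Dict[str, bool]`` such as ``model_class_dict`` is naturally read
      as a set of on/off switches. The sketch below selects the enabled
      names; the helper ``enabled_model_names`` and the example keys are
      hypothetical, and treating ``None`` as "use the defaults" is an assumption.

      .. code-block:: python

         from typing import Dict, List, Optional


         def enabled_model_names(
             model_class_dict: Optional[Dict[str, bool]],
         ) -> Optional[List[str]]:
             """Return the names flagged True, or None when no dict is given."""
             if model_class_dict is None:
                 return None  # assumption: the caller falls back to its defaults
             return [name for name, enabled in model_class_dict.items() if enabled]


         # Example keys are illustrative, not the pipeline's actual identifiers.
         selected = enabled_model_names({"LogisticRegression": True, "SVC": False})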


   .. py:attribute:: logging_paths_obj