ml_grid.pipeline.data
=====================

.. py:module:: ml_grid.pipeline.data


Classes
-------

.. autoapisummary::

   ml_grid.pipeline.data.pipe


Module Contents
---------------

.. py:class:: pipe(file_name: str, drop_term_list: List[str], local_param_dict: Dict[str, Any], base_project_dir: str, param_space_index: int, additional_naming: Optional[str] = None, test_sample_n: int = 0, column_sample_n: int = 0, time_series_mode: bool = False, model_class_dict: Optional[Dict[str, bool]] = None, outcome_var_override: Optional[str] = None)

   Initializes the data pipeline object.

   This method reads data, applies various cleaning and feature engineering
   steps based on the provided parameters, and splits the data into training
   and testing sets.

   :param file_name: The path to the input CSV file.
   :type file_name: str
   :param drop_term_list: A list of substrings to identify columns to drop.
   :type drop_term_list: List[str]
   :param local_param_dict: A dictionary of parameters for this specific
                            pipeline run.
   :type local_param_dict: Dict[str, Any]
   :param base_project_dir: The root directory for the project.
   :type base_project_dir: str
   :param param_space_index: The index of the current parameter space
                             permutation.
   :type param_space_index: int
   :param additional_naming: Additional string to append to log folder names.
                             Defaults to None.
   :type additional_naming: Optional[str], optional
   :param test_sample_n: The number of rows to sample from the dataset for
                         testing. Defaults to 0 (no sampling).
   :type test_sample_n: int, optional
   :param column_sample_n: The number of columns to sample. Defaults to 0
                           (no sampling).
   :type column_sample_n: int, optional
   :param time_series_mode: Flag to enable time-series specific data
                            processing. Defaults to False.
   :type time_series_mode: bool, optional
   :param model_class_dict: A dictionary specifying which model classes to
                            include. Defaults to None.
   :type model_class_dict: Optional[Dict[str, bool]], optional
   :param outcome_var_override: A specific outcome variable name to use,
                                overriding the one from `local_param_dict`.
                                Defaults to None.
   :type outcome_var_override: Optional[str], optional


   .. py:attribute:: base_project_dir
      :type: str

      The root directory for the project, used for saving logs and models.


   .. py:attribute:: additional_naming
      :type: Optional[str]

      An optional string to append to log folder names for better identification.


   .. py:attribute:: local_param_dict
      :type: Dict[str, Any]

      A dictionary of parameters for this specific pipeline run.


   .. py:attribute:: global_params
      :type: ml_grid.util.global_params.global_parameters

      A reference to the global parameters singleton instance.


   .. py:attribute:: verbose
      :type: int

      The verbosity level for logging, inherited from global parameters.


   .. py:attribute:: param_space_index
      :type: int

      The index of the current parameter space permutation being run.


   .. py:attribute:: time_series_mode
      :type: bool

      A flag indicating if the pipeline is running in time-series mode.


   .. py:attribute:: model_class_dict
      :type: Optional[Dict[str, bool]]

      A dictionary specifying which model classes to include in the run.


   .. py:attribute:: df
      :type: pandas.DataFrame

      The raw input DataFrame after being read from the source file.


   .. py:attribute:: all_df_columns
      :type: List[str]

      A list of all column names from the original raw DataFrame.


   .. py:attribute:: orignal_feature_names
      :type: List[str]

      A copy of the original feature names before any processing.


   .. py:attribute:: pertubation_columns
      :type: List[str]

      A list of columns selected for inclusion based on `local_param_dict`.


   .. py:attribute:: drop_list
      :type: List[str]

      A list of columns identified to be dropped due to various cleaning steps.


   .. py:attribute:: outcome_variable
      :type: str

      The name of the target variable for the current pipeline run.


   .. py:attribute:: final_column_list
      :type: List[str]

      The final list of feature columns to be used after all filtering.


   .. py:attribute:: X
      :type: pandas.DataFrame

      The feature matrix (DataFrame) after all cleaning and selection steps.


   .. py:attribute:: y
      :type: pandas.Series

      The target variable (Series) corresponding to the feature matrix `X`.


   .. py:attribute:: X_train
      :type: pandas.DataFrame

      The training feature set.


   .. py:attribute:: X_test
      :type: pandas.DataFrame

      The validation/testing feature set.


   .. py:attribute:: y_train
      :type: pandas.Series

      The training target set.


   .. py:attribute:: y_test
      :type: pandas.Series

      The validation/testing target set.


   .. py:attribute:: X_test_orig
      :type: pandas.DataFrame

      The original, held-out test set for final validation.


   .. py:attribute:: y_test_orig
      :type: pandas.Series

      The target variable for the original, held-out test set.


   .. py:attribute:: model_class_list
      :type: List[Any]

      A list of instantiated model class objects to be evaluated in this run.


   .. py:attribute:: logging_paths_obj