ml_grid.pipeline.data

Classes

pipe

Initializes the data pipeline object.

Module Contents

class ml_grid.pipeline.data.pipe(file_name: str, drop_term_list: List[str], local_param_dict: Dict[str, Any], base_project_dir: str, param_space_index: int, additional_naming: str | None = None, test_sample_n: int = 0, column_sample_n: int = 0, time_series_mode: bool = False, model_class_dict: Dict[str, bool] | None = None, outcome_var_override: str | None = None)

Initializes the data pipeline object.

The constructor reads the input data, applies cleaning and feature-engineering steps according to the supplied parameters, and splits the result into training and testing sets.

Parameters:
  • file_name (str) – The path to the input CSV file.

  • drop_term_list (List[str]) – A list of substrings to identify columns to drop.

  • local_param_dict (Dict[str, Any]) – A dictionary of parameters for this specific pipeline run.

  • base_project_dir (str) – The root directory for the project.

  • param_space_index (int) – The index of the current parameter space permutation.

  • additional_naming (Optional[str], optional) – Additional string to append to log folder names. Defaults to None.

  • test_sample_n (int, optional) – The number of rows to sample from the dataset for testing. Defaults to 0 (no sampling).

  • column_sample_n (int, optional) – The number of columns to sample. Defaults to 0 (no sampling).

  • time_series_mode (bool, optional) – Flag to enable time-series specific data processing. Defaults to False.

  • model_class_dict (Optional[Dict[str, bool]], optional) – A dictionary specifying which model classes to include. Defaults to None.

  • outcome_var_override (Optional[str], optional) – A specific outcome variable name to use, overriding the one from local_param_dict. Defaults to None.
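
As a rough illustration of the signature above, the constructor arguments might be assembled as follows. This is only a sketch: the file path, the drop terms, and the key inside local_param_dict are placeholders invented for this example, since the keys the pipeline actually expects are not documented in this section. The call itself is left commented out so the snippet runs without ml_grid or a real CSV file.

```python
from typing import Any, Dict, List

# Hypothetical values: none of these paths, terms, or keys come from the
# ml_grid documentation; substitute the ones your project uses.
drop_term_list: List[str] = ["_id", "timestamp"]
local_param_dict: Dict[str, Any] = {"example_setting": True}  # placeholder key

pipe_kwargs: Dict[str, Any] = {
    # Required arguments.
    "file_name": "data/input.csv",
    "drop_term_list": drop_term_list,
    "local_param_dict": local_param_dict,
    "base_project_dir": "experiments/",
    "param_space_index": 0,
    # Optional arguments, shown with their documented defaults.
    "additional_naming": None,
    "test_sample_n": 0,       # 0 means no row sampling
    "column_sample_n": 0,     # 0 means no column sampling
    "time_series_mode": False,
    "model_class_dict": None,
    "outcome_var_override": None,
}

# With ml_grid installed and a real CSV at file_name, the call would be:
# from ml_grid.pipeline.data import pipe
# ml_grid_object = pipe(**pipe_kwargs)
```

Keeping the arguments in a dictionary makes it easy to log exactly what a given run was started with.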

base_project_dir: str

The root directory for the project, used for saving logs and models.

additional_naming: str | None

An optional string to append to log folder names for better identification.

local_param_dict: Dict[str, Any]

A dictionary of parameters for this specific pipeline run.

global_params: ml_grid.util.global_params.global_parameters

A reference to the global parameters singleton instance.

verbose: int

The verbosity level for logging, inherited from global parameters.

param_space_index: int

The index of the current parameter space permutation being run.
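
One way to picture a "parameter space permutation" and its index is as a position in an enumerated grid of settings. The sketch below assumes a simple Cartesian product over a hypothetical parameter space; the keys, values, and ordering are illustrative and may not match how ml_grid enumerates its own space.

```python
from itertools import product
from typing import Any, Dict, List

# Hypothetical parameter space; these keys and values are illustrative only.
param_space: Dict[str, List[Any]] = {
    "scale": [True, False],
    "resample": [None, "undersampling"],
}


def enumerate_permutations(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return every combination of the values in `space`, in a stable order."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


permutations = enumerate_permutations(param_space)

# The index picks out one permutation, which then plays the role of
# local_param_dict for that run.
param_space_index = 2
local_param_dict = permutations[param_space_index]
```

Under these assumptions, index 2 selects the combination with scale set to False and resample set to None.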

time_series_mode: bool

A flag indicating if the pipeline is running in time-series mode.

model_class_dict: Dict[str, bool] | None

A dictionary specifying which model classes to include in the run.
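
Given the Dict[str, bool] type, a natural reading is that each key names a model class and the boolean says whether to include it. The helper below is a hypothetical sketch of that reading, with made-up model names. How the library treats None is not stated here, so the sketch returns an empty list purely as a placeholder.

```python
from typing import Dict, List, Optional


def enabled_model_names(model_class_dict: Optional[Dict[str, bool]]) -> List[str]:
    """Return the names whose flag is True (empty list if no dict is given)."""
    if model_class_dict is None:
        # Placeholder behaviour: the real default is not documented here.
        return []
    return [name for name, include in model_class_dict.items() if include]


# Hypothetical model names, used only to exercise the helper.
example_model_class_dict = {
    "LogisticRegression": True,
    "SVC": False,
    "RandomForestClassifier": True,
}
selected = enabled_model_names(example_model_class_dict)
```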

df: pandas.DataFrame

The raw input DataFrame after being read from the source file.

all_df_columns: List[str]

A list of all column names from the original raw DataFrame.

orignal_feature_names: List[str]

A copy of the original feature names before any processing.

pertubation_columns: List[str]

A list of columns selected for inclusion based on local_param_dict.

drop_list: List[str]

A list of columns identified to be dropped due to various cleaning steps.
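
One of the cleaning steps that can feed this list is the substring matching described for drop_term_list. A minimal sketch of that idea follows, using a plain list of column names in place of a DataFrame. The helper name and column names are hypothetical, and case-sensitive matching is an assumption.

```python
from typing import List


def columns_matching_terms(columns: List[str], drop_term_list: List[str]) -> List[str]:
    """Return the columns whose name contains at least one drop term."""
    return [
        column
        for column in columns
        if any(term in column for term in drop_term_list)
    ]


# Hypothetical column names standing in for df.columns.
columns = ["age", "patient_id", "bmi_mean", "admission_date", "outcome_var_1"]
drop_list = columns_matching_terms(columns, ["_id", "date"])
```

Here "patient_id" and "admission_date" match a term and would be marked for dropping, while the other columns are left alone.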

outcome_variable: str

The name of the target variable for the current pipeline run.

final_column_list: List[str]

The final list of feature columns to be used after all filtering.
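
How the attributes above combine into the final list is not spelled out in this section. One plausible reading, sketched here as an assumption rather than the library's actual logic, is that the final list keeps the selected columns that are neither marked for dropping nor the outcome variable itself.

```python
from typing import List


def build_final_column_list(
    pertubation_columns: List[str],
    drop_list: List[str],
    outcome_variable: str,
) -> List[str]:
    """Keep selected columns that are not dropped and are not the outcome."""
    excluded = set(drop_list) | {outcome_variable}
    return [column for column in pertubation_columns if column not in excluded]


# Hypothetical inputs, used only to exercise the helper.
final_column_list = build_final_column_list(
    pertubation_columns=["age", "bmi_mean", "patient_id", "outcome_var_1"],
    drop_list=["patient_id"],
    outcome_variable="outcome_var_1",
)
```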

X: pandas.DataFrame

The feature matrix (DataFrame) after all cleaning and selection steps.

y: pandas.Series

The target variable (Series) corresponding to the feature matrix X.

X_train: pandas.DataFrame

The training feature set.

X_test: pandas.DataFrame

The validation/testing feature set.

y_train: pandas.Series

The training target set.

y_test: pandas.Series

The validation/testing target set.

X_test_orig: pandas.DataFrame

The feature set of the original, held-out test data, reserved for final validation.

y_test_orig: pandas.Series

The target variable for the original, held-out test set.
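
The six split attributes above describe a three-way partition: a training set, a validation/testing set, and an original held-out set. The sketch below shows one way such a partition can be produced over row indices. The fractions, the shuffling, and the function name are assumptions made for illustration; this section does not state how ml_grid performs the split.

```python
import random
from typing import List, Tuple


def three_way_split(
    n_rows: int,
    holdout_fraction: float = 0.25,
    validation_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split row indices into train, validation, and held-out index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # First carve off the held-out rows (the role of X_test_orig / y_test_orig).
    n_holdout = int(n_rows * holdout_fraction)
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]

    # Then split the remainder into validation (X_test / y_test)
    # and training (X_train / y_train) rows.
    n_validation = int(len(remaining) * validation_fraction)
    validation = remaining[:n_validation]
    train = remaining[n_validation:]
    return train, validation, holdout


train_idx, test_idx, test_orig_idx = three_way_split(100)
```

The three index lists are disjoint and together cover every row, so the held-out rows are never seen during training or validation.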

model_class_list: List[Any]

A list of instantiated model class objects to be evaluated in this run.

logging_paths_obj