ml_grid.pipeline.data
Classes
pipe: Initializes the data pipeline object.
Module Contents
- class ml_grid.pipeline.data.pipe(file_name: str, drop_term_list: List[str], local_param_dict: Dict[str, Any], base_project_dir: str, param_space_index: int, additional_naming: str | None = None, test_sample_n: int = 0, column_sample_n: int = 0, time_series_mode: bool = False, model_class_dict: Dict[str, bool] | None = None, outcome_var_override: str | None = None)
Initializes the data pipeline object.
On construction, it reads the input data, applies cleaning and feature engineering steps according to the provided parameters, and splits the result into training and testing sets.
- Parameters:
file_name (str) – The path to the input CSV file.
drop_term_list (List[str]) – A list of substrings to identify columns to drop.
local_param_dict (Dict[str, Any]) – A dictionary of parameters for this specific pipeline run.
base_project_dir (str) – The root directory for the project.
param_space_index (int) – The index of the current parameter space permutation.
additional_naming (Optional[str], optional) – Additional string to append to log folder names. Defaults to None.
test_sample_n (int, optional) – The number of rows to sample from the dataset for testing. Defaults to 0 (no sampling).
column_sample_n (int, optional) – The number of columns to sample. Defaults to 0 (no sampling).
time_series_mode (bool, optional) – Flag to enable time-series specific data processing. Defaults to False.
model_class_dict (Optional[Dict[str, bool]], optional) – A dictionary specifying which model classes to include. Defaults to None.
outcome_var_override (Optional[str], optional) – A specific outcome variable name to use, overriding the one from local_param_dict. Defaults to None.
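The substring matching described for drop_term_list and the "0 means no sampling" convention of test_sample_n and column_sample_n can be illustrated with a minimal standard-library sketch. This is not the ml_grid implementation: the helper names columns_matching_terms and sample_names, the example column names, and the choice of case-sensitive matching are all assumptions made for illustration.

```python
import random
from typing import List


def columns_matching_terms(columns: List[str], drop_term_list: List[str]) -> List[str]:
    """Return the columns whose name contains any of the given substrings.

    Assumption: plain, case-sensitive substring containment.
    """
    return [c for c in columns if any(term in c for term in drop_term_list)]


def sample_names(names: List[str], n: int, seed: int = 0) -> List[str]:
    """Sample n names without replacement; n == 0 means "no sampling".

    Assumption: asking for at least as many names as exist also keeps everything.
    """
    if n == 0 or n >= len(names):
        return list(names)
    return random.Random(seed).sample(names, n)


# Hypothetical column names, used only for this example.
columns = ["age", "bmi_mean", "client_idcode", "hb_mean", "outcome_var_1"]

to_drop = columns_matching_terms(columns, ["idcode"])
kept = [c for c in columns if c not in to_drop]

all_columns = sample_names(columns, 0)   # 0 keeps every column
two_columns = sample_names(columns, 2)   # a reproducible sample of two columns
```

Here to_drop contains only "client_idcode", since it is the only name containing the substring "idcode".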
- additional_naming: str | None
An optional string to append to log folder names for better identification.
- local_param_dict: Dict[str, Any]
A dictionary of parameters for this specific pipeline run.
- global_params: ml_grid.util.global_params.global_parameters
A reference to the global parameters singleton instance.
- model_class_dict: Dict[str, bool] | None
A dictionary specifying which model classes to include in the run.
- df: pandas.DataFrame
The raw input DataFrame after being read from the source file.
- orignal_feature_names: List[str]
A copy of the original feature names before any processing.
- pertubation_columns: List[str]
A list of columns selected for inclusion based on local_param_dict.
- drop_list: List[str]
A list of columns identified to be dropped due to various cleaning steps.
- final_column_list: List[str]
The final list of feature columns to be used after all filtering.
- X: pandas.DataFrame
The feature matrix (DataFrame) after all cleaning and selection steps.
- y: pandas.Series
The target variable (Series) corresponding to the feature matrix X.
- X_train: pandas.DataFrame
The training feature set.
- X_test: pandas.DataFrame
The validation/testing feature set.
- y_train: pandas.Series
The training target set.
- y_test: pandas.Series
The validation/testing target set.
- X_test_orig: pandas.DataFrame
The original, held-out test set for final validation.
- y_test_orig: pandas.Series
The target variable for the original, held-out test set.
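The relationship between the train, test, and test_orig attributes above suggests a two-stage split: a held-out set is set aside first, and the remainder is divided into training and validation/testing parts. The sketch below shows that idea on row indices using only the standard library. The function name two_stage_split, both fractions, and the shuffling strategy are assumptions for illustration; the source does not state how ml_grid performs its split.

```python
import random
from typing import List, Tuple


def two_stage_split(
    n_rows: int,
    holdout_fraction: float = 0.25,  # assumed value, not taken from ml_grid
    test_fraction: float = 0.25,     # assumed value, not taken from ml_grid
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, test, holdout) row indices from a two-stage split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Stage 1: set aside the held-out rows (the role of X_test_orig / y_test_orig).
    n_holdout = int(n_rows * holdout_fraction)
    holdout_idx = indices[:n_holdout]
    remaining = indices[n_holdout:]

    # Stage 2: split the remaining rows into validation/testing and training parts
    # (the roles of X_test / y_test and X_train / y_train).
    n_test = int(len(remaining) * test_fraction)
    test_idx = remaining[:n_test]
    train_idx = remaining[n_test:]
    return train_idx, test_idx, holdout_idx


train_idx, test_idx, holdout_idx = two_stage_split(100)
```

With 100 rows and these assumed fractions, the split yields 57 training rows, 18 validation/testing rows, and 25 held-out rows, with no row appearing in more than one set.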