ml_grid.pipeline.data_constant_columns
Functions
- remove_constant_columns – Identifies columns in a DataFrame where all values are the same.
- remove_constant_columns_with_debug – Removes constant columns from training and testing datasets.
Module Contents
- ml_grid.pipeline.data_constant_columns.remove_constant_columns(X: pandas.DataFrame, drop_list: List[str] | None = None, verbose: int = 1) → List[str]
Identifies columns in a DataFrame where all values are the same.
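- Parameters:
X (pd.DataFrame) – DataFrame to check for constant columns.
drop_list (Optional[List[str]], optional) – Existing list of columns to drop, to which constant columns are added. Defaults to None.
verbose (int, optional) – Controls the verbosity of output messages. Defaults to 1.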
- Returns:
Updated list of columns to drop, including constant columns.
- Return type:
List[str]
- Raises:
AssertionError – If X is None.
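The check described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the helper name `find_constant_columns` is hypothetical, and counting NaN as a value (`dropna=False`) is an assumption about how "all values are the same" is interpreted.

```python
from typing import List, Optional

import pandas as pd


def find_constant_columns(
    X: pd.DataFrame, drop_list: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical helper: extend drop_list with the constant columns of X."""
    assert X is not None, "X must not be None"
    # Copy so the caller's list is not mutated (a design choice of this sketch).
    result = list(drop_list) if drop_list is not None else []
    for col in X.columns:
        # Assumption: NaN counts as a value, so an all-NaN column is constant.
        if X[col].nunique(dropna=False) <= 1 and col not in result:
            result.append(col)
    return result


df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": ["x", "x", "x"]})
to_drop = find_constant_columns(df, drop_list=["already_flagged"])
print(to_drop)
```

The returned list can then be passed to `df.drop(columns=..., errors="ignore")`, which skips names that are not present in the frame.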
- ml_grid.pipeline.data_constant_columns.remove_constant_columns_with_debug(X_train: pandas.DataFrame | numpy.ndarray, X_test: pandas.DataFrame | numpy.ndarray, X_test_orig: pandas.DataFrame | numpy.ndarray, verbosity: int = 2) → Tuple[pandas.DataFrame | numpy.ndarray, pandas.DataFrame | numpy.ndarray, pandas.DataFrame | numpy.ndarray]
Removes constant columns from training and testing datasets.
This function identifies columns that have zero variance in the training set and removes them from all provided datasets (X_train, X_test, X_test_orig). It supports both pandas DataFrames and NumPy arrays, including 3D arrays for time series data.
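IMPORTANT: Only X_train is checked for constant columns, so that no information from the test data influences which columns are kept (this prevents data leakage). A column is treated as constant if it has at most one unique value in X_train; the same columns are then dropped from X_test and X_test_orig, even if they vary there.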
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Training feature data.
X_test (Union[pd.DataFrame, np.ndarray]) – Testing feature data.
X_test_orig (Union[pd.DataFrame, np.ndarray]) – Original (unsplit) testing feature data.
verbosity (int, optional) – Controls the verbosity of debug messages. Defaults to 2.
- Returns:
A tuple containing the modified X_train, X_test, and X_test_orig datasets with constant columns removed.
- Return type:
Tuple[Union[pd.DataFrame, np.ndarray], …]
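The train-only rule can be sketched as follows. This is an illustrative approximation, not the function's source: the name `drop_train_constant_columns` is hypothetical, the 3D layout `(samples, features, timesteps)` is an assumption, and the NumPy branch assumes at least one training sample and no NaN values.

```python
from typing import Tuple, Union

import numpy as np
import pandas as pd

ArrayLike = Union[pd.DataFrame, np.ndarray]


def drop_train_constant_columns(
    X_train: ArrayLike, X_test: ArrayLike, X_test_orig: ArrayLike
) -> Tuple[ArrayLike, ArrayLike, ArrayLike]:
    """Hypothetical sketch: find constant features in X_train, drop everywhere."""
    if isinstance(X_train, pd.DataFrame):
        constant = [
            c for c in X_train.columns if X_train[c].nunique(dropna=False) <= 1
        ]
        return (
            X_train.drop(columns=constant),
            X_test.drop(columns=constant),
            X_test_orig.drop(columns=constant),
        )

    arr = np.asarray(X_train)
    if arr.ndim == 3:
        # Assumed layout: (samples, features, timesteps). Gather every value
        # of a feature, across samples and timesteps, into one row.
        flat = arr.transpose(1, 0, 2).reshape(arr.shape[1], -1)
    else:
        flat = arr.T  # one row per feature
    # A feature is constant if all of its values equal its first value.
    keep = ~(flat == flat[:, [0]]).all(axis=1)
    return (
        arr[:, keep],
        np.asarray(X_test)[:, keep],
        np.asarray(X_test_orig)[:, keep],
    )


train = pd.DataFrame({"a": [0, 0, 0], "b": [1, 2, 3], "c": [1, 2, 3]})
test = pd.DataFrame({"a": [0, 1], "b": [4, 5], "c": [7, 7]})
tr, te, te_orig = drop_train_constant_columns(train, test, test.copy())

train3 = np.zeros((4, 2, 3))
train3[:, 1, :] = np.arange(12).reshape(4, 3)  # feature 1 varies, feature 0 does not
test3 = np.ones((2, 2, 3))
tr3, te3, te3_orig = drop_train_constant_columns(train3, test3, test3.copy())
```

In the DataFrame example, column "a" is dropped from the test frames even though it varies there, while column "c" is kept even though it is constant in the test data, because only X_train decides.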