ml_grid.pipeline.data_train_test_split
Functions
get_data_split(X, y, local_param_dict) | Splits data into train and test sets, with optional resampling.
is_valid_shape(input_data) | Checks if the input data is a 2-dimensional array or DataFrame.
Module Contents
- ml_grid.pipeline.data_train_test_split.get_data_split(X: pandas.DataFrame, y: pandas.Series, local_param_dict: Dict[str, Any]) → Tuple[pandas.DataFrame, pandas.DataFrame, pandas.Series, pandas.Series, pandas.DataFrame, pandas.Series]
Splits data into train and test sets, with optional resampling.
This function splits the input data (X, y) into training and testing sets. Depending on the ‘resample’ key in local_param_dict, it applies undersampling, oversampling, or no resampling. The data is first split into a preliminary training set and a held-out test set. The preliminary training set is then split again to produce the final train/test sets used for model evaluation, while the held-out test set is returned unchanged (as X_test_orig and y_test_orig) for final validation.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series) – The target variable.
local_param_dict (Dict[str, Any]) – A dictionary of parameters, including the ‘resample’ strategy (‘undersample’, ‘oversample’, or None).
- Returns:
- A tuple containing:
X_train: Features for training.
X_test: Features for testing.
y_train: Target variable for training.
y_test: Target variable for testing.
X_test_orig: Original features for validation.
y_test_orig: Original target variable for validation.
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series, pd.DataFrame, pd.Series]
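The two-stage split described above can be sketched as follows. This is an illustration only, not the module's actual implementation: the names `resample_classes` and `split_twice`, the 25% test fraction, the fixed seed, and the choice to resample before splitting are all assumptions, and simple random resampling with NumPy stands in for whatever resampler the library uses.

```python
from typing import Any, Dict, Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def resample_classes(
    X: pd.DataFrame, y: pd.Series, strategy: Optional[str], seed: int = 0
) -> Tuple[pd.DataFrame, pd.Series]:
    """Hypothetical helper: balance classes by random under- or oversampling."""
    if strategy is None:
        return X, y
    if strategy not in ("undersample", "oversample"):
        raise ValueError(f"unknown resample strategy: {strategy!r}")

    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    target = counts.min() if strategy == "undersample" else counts.max()

    positions = []
    for label, n in counts.items():
        pos = np.flatnonzero((y == label).to_numpy())
        if strategy == "undersample":
            # Keep a random subset so every class has `target` rows.
            pos = rng.choice(pos, size=target, replace=False)
        else:
            # Keep all rows and add random duplicates up to `target`.
            extra = rng.choice(pos, size=target - n, replace=True)
            pos = np.concatenate([pos, extra])
        positions.append(pos)

    keep = np.concatenate(positions)
    return X.iloc[keep].reset_index(drop=True), y.iloc[keep].reset_index(drop=True)


def split_twice(
    X: pd.DataFrame,
    y: pd.Series,
    local_param_dict: Dict[str, Any],
    test_size: float = 0.25,  # assumed fraction, not taken from the library
    seed: int = 0,
):
    """Sketch of a two-stage split that returns the same six-element tuple."""
    # Assumption: resampling is applied before any splitting.
    X, y = resample_classes(X, y, local_param_dict.get("resample"), seed)

    # Stage 1: hold out the "original" test set for final validation.
    X_prelim, X_test_orig, y_prelim, y_test_orig = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Stage 2: split the preliminary training set for model evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X_prelim, y_prelim, test_size=test_size, random_state=seed
    )
    return X_train, X_test, y_train, y_test, X_test_orig, y_test_orig
```

Calling `split_twice(X, y, {"resample": "undersample"})` returns the six objects in the order listed under Returns. The rows in `X_test_orig` never appear in `X_train` or `X_test`, which is what makes that set usable for final validation.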
- ml_grid.pipeline.data_train_test_split.is_valid_shape(input_data: numpy.ndarray | pandas.DataFrame) → bool
Checks if the input data is a 2-dimensional array or DataFrame.
This is used to validate data before resampling, as some resampling techniques may not work with other data shapes.
- Parameters:
input_data (Union[np.ndarray, pd.DataFrame]) – The data to check.
- Returns:
True if the data is 2-dimensional, False otherwise.
- Return type:
bool
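A check of this kind can be written in a few lines. The sketch below is an assumption about how such a test might look, not the library's code; the name `is_two_dimensional` is hypothetical. It relies on the `ndim` attribute that both `numpy.ndarray` and `pandas.DataFrame` expose.

```python
from typing import Any

import numpy as np
import pandas as pd


def is_two_dimensional(input_data: Any) -> bool:
    """Hypothetical stand-in: True only for 2-D arrays and DataFrames."""
    if not isinstance(input_data, (np.ndarray, pd.DataFrame)):
        return False
    # A DataFrame always has ndim == 2; an ndarray can have any rank.
    return input_data.ndim == 2
```

Under this sketch, a 1-D array of targets or a 3-D array is rejected, while a feature matrix or DataFrame passes.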