Data Workflow

This guide explains how data flows through the ensemble genetic algorithm pipeline, from initial input to final model evaluation. It covers supported formats, preprocessing steps, train/test splitting strategies, and cross-validation workflows.

See also: Configuration Guide for hyperparameter configuration examples that impact data handling.


Data Input Formats Supported

Primary Format: CSV (Comma-Separated Values)

The framework is designed around a single primary data format:

Requirements

| Requirement | Details |
| --- | --- |
| File extension | .csv only |
| Row limit | None (optimized for large datasets) |
| Encoding | UTF-8 with BOM support |
| Header row | Required (first row must contain column names) |

Sample CSV Structure

patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1
P001,45,1,2.3,0.8,0
P002,32,0,1.9,0.5,1
P003,67,1,3.1,1.2,0

Alternative Input Methods

1. Programmatic Data Loading

For in-memory data frames (Notebooks):

import pandas as pd
from ml_grid.pipeline.read_in import read

# Load a CSV file into a DataFrame
df = pd.read_csv("data/my_data.csv")

# Or create synthetic data for testing
from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator

synth_gen = SyntheticDataGenerator(n_features=10, n_samples=100)
df = synth_gen.generate_df()

2. Sampling for Testing

Use the test_sample_n and column_sample_n parameters:

# Sample 100 rows and 5 columns for debugging
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    test_sample_n=100,
    column_sample_n=5,
    ...
)

# Sample columns only (the outcome_var_1 column is always included)
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    column_sample_n=5,  # Includes 'age', 'male' + 3 random columns
    ...
)
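
The sampling behaviour can be approximated with plain pandas. The sketch below is an illustration only: the helper name `sample_for_debugging` and its defaults are hypothetical, not the framework's API, and it guarantees only that the outcome column is kept (it does not reproduce the fixed 'age' and 'male' columns mentioned above).

```python
import pandas as pd


def sample_for_debugging(df, test_sample_n=None, column_sample_n=None,
                         outcome_suffix="outcome_var_1", random_state=1):
    """Return a smaller frame: up to test_sample_n rows and column_sample_n features."""
    outcome_cols = [c for c in df.columns if c.endswith(outcome_suffix)]
    feature_cols = [c for c in df.columns if c not in outcome_cols]

    # Sample feature columns only; the outcome column is always retained.
    if column_sample_n is not None and column_sample_n < len(feature_cols):
        feature_cols = (
            pd.Series(feature_cols)
            .sample(n=column_sample_n, random_state=random_state)
            .tolist()
        )
    out = df[feature_cols + outcome_cols]

    # Sample rows after the column subset has been chosen.
    if test_sample_n is not None and test_sample_n < len(out):
        out = out.sample(n=test_sample_n, random_state=random_state)
    return out


demo = pd.DataFrame({
    "age": range(20),
    "male": [0, 1] * 10,
    "factor_A": [0.1 * i for i in range(20)],
    "factor_B": [0.2 * i for i in range(20)],
    "disease_outcome_var_1": [0, 1] * 10,
})
small = sample_for_debugging(demo, test_sample_n=5, column_sample_n=2)
print(small.shape)
```

Fixing `random_state` keeps the debugging subset identical between runs, which makes failures easier to reproduce.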

Data Validation

The input pipeline performs automatic validation:

| Check | Description |
| --- | --- |
| File existence | Ensures CSV path is valid |
| Numerical types | Verifies all columns can be converted to float |
| Outcome variable | Checks for outcome_var_1 suffix and binary values |
| Missing headers | Validates column names are present |
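
As a rough illustration of these checks, the following sketch runs them on an in-memory CSV. The function `validate_input` is hypothetical and written for this guide; the framework's own validation may differ in detail.

```python
import io

import pandas as pd


def validate_input(df, outcome_suffix="outcome_var_1"):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []

    # Header check: pandas names empty header cells "Unnamed: <n>".
    if any(str(c).startswith("Unnamed:") for c in df.columns):
        problems.append("missing column name in header")

    # Outcome check: one column with the expected suffix, holding only 0/1.
    outcome_cols = [c for c in df.columns if str(c).endswith(outcome_suffix)]
    if len(outcome_cols) != 1:
        problems.append(
            f"expected one '*{outcome_suffix}' column, found {len(outcome_cols)}"
        )
    elif not set(df[outcome_cols[0]].dropna().unique()) <= {0, 1}:
        problems.append("outcome column is not binary (0/1)")

    # Numeric check: every column must be convertible to float.
    for col in df.columns:
        try:
            df[col].astype(float)
        except (ValueError, TypeError):
            problems.append(f"column '{col}' is not numeric")
    return problems


csv_text = "patient_ID,age,male,disease_outcome_var_1\nP001,45,1,0\nP002,32,0,1\n"
df = pd.read_csv(io.StringIO(csv_text))
print(validate_input(df))                                # reports patient_ID
print(validate_input(df.drop(columns=["patient_ID"])))   # no problems
```

Identifier columns such as patient_ID fail the numeric check, so they need to be excluded before modelling.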


Preprocessing Steps

The data pipeline automatically executes a sequence of preprocessing steps. These can be configured via the grid_params section in config.yml.

Step 1: Initial Feature Selection

Filters columns based on configuration:

# Configuration in config.yml
grid_params:
  # Include/exclude features based on column names containing these words
  feature_toggles: {
    "include": ["factor", "biochemical"],
    "exclude": ["ID", "date"]
  }
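
A minimal sketch of how such toggles could be applied to column names. It assumes case-sensitive substring matching, that an exclude match overrides an include match, and that the outcome column is always kept; `select_features` is a hypothetical helper, not part of the framework.

```python
def select_features(columns, include, exclude, outcome_suffix="outcome_var_1"):
    """Keep columns whose name contains an include word and no exclude word."""
    kept = []
    for col in columns:
        if col.endswith(outcome_suffix):
            kept.append(col)  # the outcome column is never filtered out
            continue
        if any(word in col for word in exclude):
            continue
        if any(word in col for word in include):
            kept.append(col)
    return kept


columns = ["patient_ID", "age", "male", "factor_A", "factor_B", "disease_outcome_var_1"]
selected = select_features(
    columns, include=["factor", "biochemical"], exclude=["ID", "date"]
)
print(selected)
```

Under these assumptions, columns that match neither list (here age and male) are dropped, so the include list has to name every feature family that should be kept.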

Step 2: Correlation Filtering

Removes highly correlated features to reduce multicollinearity.

How It Works

  1. Computes Pearson correlation matrix for all features

  2. Iteratively removes one feature from each high-correlation pair (> threshold)

  3. Keeps the feature with lower variance (more stable)
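
The three steps above can be sketched with pandas as follows. This is an illustrative greedy implementation under the stated rule (drop the higher-variance member of each pair); the function name `correlated_columns_to_drop` is hypothetical and the framework's internal logic may differ.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df, threshold=0.95):
    """Return columns to drop so that no remaining pair has |r| > threshold."""
    corr = df.corr(method="pearson").abs()
    variances = df.var()
    to_drop = []
    cols = list(df.columns)
    for i, a in enumerate(cols):
        if a in to_drop:
            continue
        for b in cols[i + 1:]:
            if b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                # Keep the lower-variance feature of the pair, drop the other.
                loser = a if variances[a] > variances[b] else b
                to_drop.append(loser)
                if loser == a:
                    break
    return to_drop


rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "x": x,
    "x_scaled": 3 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
    "noise": rng.normal(size=200),
})
print(correlated_columns_to_drop(features, threshold=0.95))
```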

Configuration

grid_params:
  corr: [0.95, 0.98]              # Correlation thresholds to test

| Threshold | Use Case |
| --- | --- |
| 0.90 | Strict: removes more redundant features, leaving fewer columns |
| 0.95 | Standard balance for most datasets |
| 0.98 | Lenient: keeps most features but risks residual multicollinearity |

Visualization Method

After preprocessing, use:

from ml_grid.pipeline.data_correlation_matrix import handle_correlation_matrix

corr_drops = handle_correlation_matrix(
    local_param_dict={"corr": 0.95},
    drop_list=[],
    df=original_df
)

print(f"Dropped due to correlation: {corr_drops}")

Step 3: Missing Value Handling

Columns with excessive missing data are removed.

Configuration

grid_params:
  percent_missing: [100, 99, 95]  # Thresholds for column removal

| Threshold | Behavior |
| --- | --- |
| 100 | Remove columns with >100% missing values. No column can exceed 100%, so this effectively disables filtering |
| 99 | Remove columns with >99% missing values (drops only nearly empty columns) |
| 95 | Strictest of the three: removes columns with >95% missing values |

Imputation Strategy

Remaining missing values are handled internally:

  • Numerical features: Mean imputation via sklearn.impute.SimpleImputer(strategy="mean")

  • Base learners: Many algorithms (XGBoost, etc.) handle NaN natively
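
To make the threshold and imputation behaviour concrete, here is a small sketch. The helper `drop_sparse_columns` is hypothetical, and the example fits the imputer on the whole frame only for brevity; in a real pipeline it would normally be fitted on the training split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def drop_sparse_columns(df, percent_missing):
    """Drop columns whose share of missing values exceeds percent_missing (0-100)."""
    missing_pct = df.isna().mean() * 100
    return df.loc[:, missing_pct <= percent_missing]


raw = pd.DataFrame({
    "age": [45.0, 32.0, np.nan, 67.0],               # 25% missing
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "empty": [np.nan] * 4,                           # 100% missing
})

# A threshold of 50 removes both sparse columns; 100 would keep all three.
kept = drop_sparse_columns(raw, percent_missing=50)

# Fill the remaining gap in "age" with the column mean.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(kept), columns=kept.columns)
print(imputed)
```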

Step 4: Constant Column Removal

Features with zero variance across all samples are removed.

Why This Matters

Constant columns provide no predictive signal and waste computational resources.

# Example of constant column detection
import pandas as pd

df_with_constant = pd.DataFrame({
    "feature1": [1, 2, 3, 4],     # Variable
    "feature2": [5, 5, 5, 5],     # Constant - will be dropped
    "outcome_var_1": [0, 1, 0, 1]
})

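# Detect columns with a single unique value, then drop them
constant_cols = [c for c in df_with_constant.columns if df_with_constant[c].nunique() <= 1]
df_clean = df_with_constant.drop(columns=constant_cols)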

Step 5: Feature Scaling

Standardizes features to zero mean and unit variance (Optional).

Configuration

grid_params:
  scale: [true, false]            # Whether to apply StandardScaler

When to Use Scaling

| Scenario | Recommended |
| --- | --- |
| Neural Networks | scale: true |
| SVMs | scale: true |
| Distance-based methods (KNN) | scale: true |
| Tree-based models (RF, XGBoost) | scale: false (not required) |
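
A short sketch of standardization with scikit-learn's StandardScaler. It assumes the scaler is fitted on the training split only and then reused for the test split, which is the usual way to avoid leaking test statistics; the arrays are made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero per column
print(X_test_scaled)
```

Test values outside the training range (400 in the second column) map to values well above 1, which is expected.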

Step 6: Feature Importance Selection

Selects top features based on importance scores.

Configuration

grid_params:
  n_features: [50, 100, "all"]    # Number of features to retain

Available Methods

The framework supports multiple importance scoring methods:

| Method | Base Model | When to Use |
| --- | --- | --- |
| RandomForest | Random Forest | General-purpose, robust |
| XGBoost | Gradient Boosting | High-performance scenarios |
| LogisticRegression | Logistic Regression | Linear relationships |

Feature Selection Process

  1. Train temporary model on all features

  2. Extract importance scores

  3. Rank features by importance

  4. Select top n_features
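
The selection process above can be sketched with a random forest as the temporary model. The function `top_n_features` is a hypothetical illustration using synthetic data, not the framework's implementation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_n_features(X, y, n_features, random_state=1):
    """Rank columns by random-forest importance and return the top n names."""
    if n_features == "all" or n_features >= X.shape[1]:
        return list(X.columns)
    # Steps 1-2: train a temporary model and read its importance scores.
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    ranked = pd.Series(model.feature_importances_, index=X.columns)
    # Steps 3-4: rank by importance and keep the top n_features.
    return ranked.sort_values(ascending=False).head(n_features).index.tolist()


X_arr, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=1
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
print(top_n_features(X, y, n_features=4))
```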


Train/Test Split Options

The framework implements a stratified hold-out validation strategy with three splits.

Standard Stratified Split

graph LR
    A[100% Data] --> B[75% Training Set]
    A --> C[25% Validation _orig]
    B --> D[75% of 75% = ~56.3% Training]
    B --> E[25% of 75% = ~18.8% Test]
    
    style D fill:#c9f7c9
    style E fill:#ffd7b5
    style C fill:#ffcccc
    
    subgraph "Final Splits"
        D[X_train]
        E[X_test]
        C[X_test_orig]
    end

Data Split Naming Convention

| Split | Purpose | Usage |
| --- | --- | --- |
| X_train, y_train | Training new models | During GA evolution |
| X_test, y_test | Evaluation during GA | Fitness calculation |
| X_test_orig, y_test_orig | Final validation | Unseen data evaluation |

Split Statistics

For a 1,000-sample dataset:

  • Training: ~563 samples (used for GA training)

  • Testing: ~188 samples (for GA fitness evaluation)

  • Validation (_orig): ~250 samples (final hold-out)
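
These figures can be reproduced with two chained calls to scikit-learn's train_test_split. The sketch assumes the order shown in the diagram (hold out the _orig set first, then split the remainder); the intermediate names `X_dev` and `y_dev` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0, 1] * 500)

# Stage 1: hold out 25% as the final validation set (_orig).
X_dev, X_test_orig, y_dev, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
# Stage 2: split the remaining 75% again into GA train and GA test.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=1, stratify=y_dev
)
print(len(X_train), len(X_test), len(X_test_orig))
```

Because scikit-learn rounds the test share up, the 750 remaining rows split into 562 training and 188 test rows, which matches the approximate figures above.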

Resampling Strategies

Controlled via resample parameter: null, "undersample", "oversample"

1. No Resampling (null)

Standard stratified split with class distribution preserved:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

Use Cases: Balanced datasets, no class imbalance issues

2. Undersampling ("undersample")

Reduces majority class before splitting:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Then perform stratified split

Use Cases: Severe class imbalance (e.g., 1% positive class)

3. Oversampling ("oversample")

Duplicates randomly chosen minority-class samples (RandomOverSampler) after the train/test split. Unlike SMOTE, no synthetic samples are generated:

from imblearn.over_sampling import RandomOverSampler

# Split first to prevent data leakage
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Oversample ONLY the training set
ros = RandomOverSampler(random_state=1)
X_train_orig, y_train_orig = ros.fit_resample(X_train_orig, y_train_orig)

Use Cases: Moderate imbalance, want to preserve all original data


Cross-Validation Workflows

While the main GA uses a single hold-out validation set for efficiency, cross-validation can be integrated for final model evaluation.

Standard CV Strategy (For Final Evaluation)

After finding the best ensemble via GA, evaluate it using repeated 10-fold CV (10 folds, 3 repeats):

from sklearn.model_selection import RepeatedKFold, cross_validate

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = cross_validate(
    best_ensemble_model,
    X_train,
    y_train,
    cv=cv,
    scoring=['roc_auc', 'accuracy', 'precision', 'recall'],
    n_jobs=-1
)

Genetic Algorithm-Specific CV Integration

The GA fitness function can incorporate CV scores for more robust evaluation:

| Approach | Description | Trade-offs |
| --- | --- | --- |
| Single split (default) | Use hold-out validation for speed | Faster but noisier estimate |
| K-fold CV in fitness | Average over K splits per individual | More accurate but roughly K times slower |
| 3-fold CV only for best N | Use CV only on top candidates | Balance of accuracy/speed |
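
The first two approaches can be compared with a small sketch. The functions `fitness_holdout` and `fitness_kfold` are hypothetical, and a logistic regression on synthetic data stands in for a GA individual; the framework's actual fitness function is not shown in this guide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def fitness_holdout(model, X_train, y_train, X_test, y_test):
    """Single-split fitness: fit once, score AUC on the hold-out set."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def fitness_kfold(model, X, y, k=3, random_state=1):
    """K-fold fitness: mean AUC over k stratified folds (k model fits)."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=random_state)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()


X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
candidate = LogisticRegression(max_iter=1000)  # stand-in for one GA individual
print(fitness_holdout(candidate, X_train, y_train, X_test, y_test))
print(fitness_kfold(candidate, X_train, y_train, k=3))
```

The K-fold variant fits the model k times per individual, which is where the extra cost in the table comes from.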

Parameter Grid Cross-Validation

After GA finds optimal base learners, fine-tune hyperparameters via GridSearchCV:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from ml_grid.pipeline.grid_search_cross_validate import grid_search_crossvalidate

# Candidate hyperparameter values to search over
best_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15]
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=best_params,
    cv=5,
    scoring='roc_auc'
)

grid_search.fit(X_train, y_train)

Cross-Validation Visualization

graph TD
    A[All Data] --> B[K-Fold Split]
    
    subgraph Fold 1
        B --> C1[Train set]
        B --> D1[Test set - Fold 1]
        C1 --> E1[Model Training]
        E1 --> F1[Predictions]
        F1 --> G1[AUC Score]
    end
    
    subgraph Fold 2
        B --> C2[Train set]
        B --> D2[Test set - Fold 2]
        C2 --> E2[Model Training]
        E2 --> F2[Predictions]
        F2 --> G2[AUC Score]
    end
    
    subgraph ... more folds
    end
    
    G1 --> H[Average AUC across K folds]
    G2 --> H
    
    style H fill:#c9f7c9,stroke:#333

Data Pipeline Diagrams

Complete Data Flow

flowchart TB
    subgraph "Input Layer"
        A[CSV File] --> B[data_pipe]
        B --> C{Sampling enabled?}
        C -->|Yes| D[test_sample_n / column_sample_n]
        C -->|No| E[Full dataset load]
    end
    
    subgraph "Preprocessing Pipeline"
        D --> F[Initial features]
        E --> F
        
        F --> G[Drop correlated features<br/>corr threshold]
        G --> H[Drop missing values<br/>percent_missing]
        H --> I[Remove other outcomes]
        I --> J[Kill constant columns]
    end
    
    subgraph "Safety Net"
        J --> K{Any features left?}
        K -->|No| L[Retain minimal set]
        K -->|Yes| M[Use filtered set]
    end
    
    subgraph "Train/Test Split"
        L --> N[75/25 split]
        M --> N
        
        N --> O[X_train / y_train]
        N --> P[X_test / y_test]
        N --> Q[X_test_orig / y_test_orig]
        
        O --> R{Resample?}
        R -->|undersample| S[Under-sample all]
        R -->|oversample| T[Over-sample train only]
        R -->|null| U[No resampling]
    end
    
    subgraph "Final Processing"
        S --> V[Post-split cleaning]
        T --> V
        U --> V
        
        V --> W{Scale?}
        W -->|Yes| X[StandardScaler]
        W -->|No| Y[Skip scaling]
        
        X --> Z[Feature selection if requested]
        Y --> Z
        
        Z --> AA[X_train_final]
        Z --> AB[X_test_final]
        Z --> AC[X_test_orig_final]
    end
    
    style A fill:#e1f5fe,stroke:#333
    style F fill:#fff9c4,stroke:#333
    style O fill:#b9f6ca,stroke:#333
    style P fill:#ffd7b5,stroke:#333
    style Q fill:#ffcccc,stroke:#333

Feature Transformation Log

The pipeline maintains a detailed log of feature changes:

| Step | Features Before | Features After | Removed | Reason |
| --- | --- | --- | --- | --- |
| Initial Load | 100 | 100 | 0 | Base count |
| Drop Correlated | 100 | 92 | 8 | corr > 0.95 |
| Drop Missing | 92 | 90 | 2 | >99% missing |
| Drop Other Outcomes | 90 | 90 | 0 | Not present |
| Drop Constants | 90 | 85 | 5 | Zero variance |
| Split | 85 | 85 | 0 | No feature removal |
| Post-Split Clean | 85 | 83 | 2 | Became constant after split |
| Final Features | - | 83 | 17 | Total |
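
A log of this shape could be kept with a small helper class. The sketch below is purely illustrative: `FeatureLog` and its methods are hypothetical names, and the column lists are invented for the example.

```python
class FeatureLog:
    """Record how many features each pipeline step removes."""

    def __init__(self):
        self.rows = []

    def record(self, step, before, after, reason):
        """Store one log row and return the surviving column list."""
        self.rows.append({
            "step": step,
            "before": len(before),
            "after": len(after),
            "removed": sorted(set(before) - set(after)),
            "reason": reason,
        })
        return after

    def total_removed(self):
        return sum(len(row["removed"]) for row in self.rows)


log = FeatureLog()
cols = ["age", "male", "factor_A", "factor_B", "const"]
cols = log.record("Drop Correlated", cols, ["age", "male", "factor_A", "const"], "corr > 0.95")
cols = log.record("Drop Constants", cols, ["age", "male", "factor_A"], "Zero variance")
for row in log.rows:
    print(row)
print(log.total_removed())
```

Recording the removed names, not just the counts, makes it possible to trace why a specific feature is absent from the final set.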


Summary

This data workflow guide covered:

  1. Input formats: CSV requirements and alternative loading methods

  2. Preprocessing steps: Correlation filtering, missing value handling, scaling

  3. Train/test splits: Stratified hold-out with 3-way split options

  4. Resampling strategies: Undersample/oversample configuration

  5. Cross-validation workflows: Integration for final evaluation

The data pipeline is designed to be fully automated and configurable via config.yml, making it easy to adapt to different dataset characteristics while maintaining reproducibility through random state management.

See Implementation Guide for complete implementation walkthrough with setup, execution examples, and result interpretation.