Data Workflow
This guide explains how data flows through the ensemble genetic algorithm pipeline, from initial input to final model evaluation. It covers supported formats, preprocessing steps, train/test splitting strategies, and cross-validation workflows.
See also: Configuration Guide for hyperparameter configuration examples that impact data handling.
Data Input Formats Supported
Primary Format: CSV (Comma-Separated Values)
The framework is designed around a single primary data format:
Requirements
| Requirement | Details |
|---|---|
| File extension | |
| Row limit | None (optimized for large datasets) |
| Encoding | UTF-8 with BOM support |
| Header row | Required (first row must contain column names) |
Sample CSV Structure
patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1
P001,45,1,2.3,0.8,0
P002,32,0,1.9,0.5,1
P003,67,1,3.1,1.2,0
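As a minimal sketch, a file meeting these requirements can be read with pandas as shown here. The in-memory bytes stand in for a file on disk, and the choice of the `utf-8-sig` codec is an assumption about how BOM support might be handled, not a description of the framework's own loader.

```python
import io

import pandas as pd

# The sample rows above, encoded with a UTF-8 BOM to mimic a spreadsheet export.
raw = (
    "patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1\n"
    "P001,45,1,2.3,0.8,0\n"
    "P002,32,0,1.9,0.5,1\n"
    "P003,67,1,3.1,1.2,0\n"
).encode("utf-8-sig")

# "utf-8-sig" also reads plain UTF-8 and strips the BOM when one is present.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

# The first row is used as the header, so column names come from the file.
print(df.columns.tolist())
print(df.shape)
```

Without the BOM-aware codec, the first column name could carry an invisible prefix character and fail later name-based lookups.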
Alternative Input Methods
1. Programmatic Data Loading
For in-memory data frames (Notebooks):
import pandas as pd
from ml_grid.pipeline.read_in import read
# Load from DataFrame
df = pd.read_csv("data/my_data.csv")
# Or create synthetic data for testing
from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator
synth_gen = SyntheticDataGenerator(n_features=10, n_samples=100)
df = synth_gen.generate_df()
2. Sampling for Testing
Use the test_sample_n and column_sample_n parameters:
# Sample 100 rows for debugging
ml_grid_object = data_pipe(
file_name="data/large_dataset.csv",
test_sample_n=100,
column_sample_n=5,
...
)
# Sample specific columns (ensures outcome_var_1 is always included)
ml_grid_object = data_pipe(
file_name="data/large_dataset.csv",
column_sample_n=5, # Includes 'age', 'male' + 3 random columns
...
)
Data Validation
The input pipeline performs automatic validation:
| Check | Description |
|---|---|
| File existence | Ensures CSV path is valid |
| Numerical types | Verifies all columns can be converted to float |
| Outcome variable | Checks for |
| Missing headers | Validates column names are present |
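The checks in the table could be approximated as follows. This is a hedged sketch, not the pipeline's validation code: `validate_frame`, `validate_csv`, and the default outcome column name `outcome_var_1` are hypothetical names chosen for illustration.

```python
from pathlib import Path

import pandas as pd


def validate_frame(df: pd.DataFrame, outcome_col: str = "outcome_var_1") -> list:
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    # Missing headers: pandas names blank header cells "Unnamed: <n>".
    if any(str(col).startswith("Unnamed") for col in df.columns):
        problems.append("missing column name(s)")
    # Outcome variable: assumed here to mean the column must exist.
    if outcome_col not in df.columns:
        problems.append(f"outcome column {outcome_col!r} not found")
    # Numerical types: every column must be convertible to float.
    for col in df.columns:
        try:
            df[col].astype(float)
        except (ValueError, TypeError):
            problems.append(f"column {col!r} is not numeric")
    return problems


def validate_csv(path: str, outcome_col: str = "outcome_var_1") -> list:
    """File existence check first, then the frame-level checks."""
    if not Path(path).is_file():
        return [f"file not found: {path}"]
    return validate_frame(pd.read_csv(path), outcome_col)


good = pd.DataFrame({"age": [45, 32], "outcome_var_1": [0, 1]})
bad = pd.DataFrame({"age": ["x", "y"]})
print(validate_frame(good))
print(validate_frame(bad))
```

Returning a list of problems rather than raising on the first failure lets a caller report every issue with a file in one pass.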
Preprocessing Steps
The data pipeline automatically executes a sequence of preprocessing steps. These can be configured via the grid_params section in config.yml.
Step 1: Initial Feature Selection
Filters columns based on configuration:
# Configuration in config.yml
grid_params:
  # Include/exclude features based on column names containing these words
  feature_toggles: {
    "include": ["factor", "biochemical"],
    "exclude": ["ID", "date"]
  }
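To illustrate how such toggles might behave, here is a small sketch. It assumes case-sensitive substring matching and that exclusions win over inclusions; the helper name `select_columns` is hypothetical, and the real matching rules may differ.

```python
def select_columns(columns, include, exclude):
    """Keep columns matching any include word and no exclude word."""
    kept = []
    for col in columns:
        # With a non-empty include list, a column must match at least one word.
        if include and not any(word in col for word in include):
            continue
        # Any exclude match removes the column, even if it was included.
        if any(word in col for word in exclude):
            continue
        kept.append(col)
    return kept


columns = ["patient_ID", "age", "factor_A", "factor_B", "visit_date", "biochemical_1"]
selected = select_columns(
    columns,
    include=["factor", "biochemical"],
    exclude=["ID", "date"],
)
print(selected)  # ['factor_A', 'factor_B', 'biochemical_1']
```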
Step 2: Correlation Filtering
Removes highly correlated features to reduce multicollinearity.
How It Works
Computes Pearson correlation matrix for all features
Iteratively removes one feature from each high-correlation pair (> threshold)
Keeps the feature with lower variance (more stable)
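The three steps above can be sketched as a greedy pass over the correlation matrix. This is an illustrative approximation rather than the code behind `handle_correlation_matrix`; the function name `correlated_drop_list` and the use of absolute correlation are assumptions.

```python
import numpy as np
import pandas as pd


def correlated_drop_list(df: pd.DataFrame, threshold: float) -> list:
    """Return columns to drop so that no remaining pair exceeds the threshold."""
    corr = df.corr(method="pearson").abs()  # absolute value is an assumption
    variances = df.var()
    cols = list(df.columns)
    dropped = []
    for i, a in enumerate(cols):
        if a in dropped:
            continue
        for b in cols[i + 1:]:
            # Skip pairs already resolved, below threshold, or undefined (NaN).
            if b in dropped or not corr.loc[a, b] > threshold:
                continue
            # Keep the lower-variance feature of the pair and drop the other.
            loser = a if variances[a] > variances[b] else b
            dropped.append(loser)
            if loser == a:
                break  # a has been dropped, so stop comparing it
    return dropped


rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 3 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
    "noise": rng.normal(size=200),
})
print(correlated_drop_list(df, threshold=0.95))  # ['x_scaled']
```

Here `x_scaled` is dropped because it is almost perfectly correlated with `x` and has the higher variance of the two.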
Configuration
grid_params:
  corr: [0.95, 0.98] # Correlation thresholds to test
| Threshold | Use Case |
|---|---|
| | Very strict - keeps most features but risks multicollinearity |
| | Standard balance for most datasets |
| | Aggressive - removes redundant features, fewer columns |
Visualization Method
After preprocessing, use:
from ml_grid.pipeline.data_correlation_matrix import handle_correlation_matrix
corr_drops = handle_correlation_matrix(
local_param_dict={"corr": 0.95},
drop_list=[],
df=original_df
)
print(f"Dropped due to correlation: {corr_drops}")
Step 3: Missing Value Handling
Columns with excessive missing data are removed.
Configuration
grid_params:
  percent_missing: [100, 99, 95] # Thresholds for column removal
| Threshold | Behavior |
|---|---|
| | Removes columns with >100% missing values. No column can exceed 100%, so every column is kept and filtering is effectively disabled |
| | Remove columns with >99.8% missing values (allows rare NaNs) |
| | Aggressive - removes even moderately missing columns |
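A sketch of how a percentage threshold like this could be applied. The helper name `drop_sparse_columns` is hypothetical, and the strict "greater than" comparison follows the wording of the table rather than confirmed pipeline behavior.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, percent_missing: float) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds percent_missing."""
    missing_pct = df.isna().mean() * 100
    keep = missing_pct[missing_pct <= percent_missing].index
    return df[keep]


df = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],
    "half_missing": [1.0, np.nan, 3.0, np.nan],
    "all_missing": [np.nan] * 4,
})

print(list(drop_sparse_columns(df, 100).columns))  # nothing exceeds 100%, all kept
print(list(drop_sparse_columns(df, 99).columns))   # drops 'all_missing'
print(list(drop_sparse_columns(df, 40).columns))   # also drops 'half_missing'
```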
Imputation Strategy
Remaining missing values are handled internally:
Numerical features: Mean imputation via sklearn.impute.SimpleImputer(strategy="mean")
Base learners: Many algorithms (XGBoost, etc.) handle NaN natively
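A short example of mean imputation with scikit-learn's `SimpleImputer`. Fitting on the training split only and reusing the learned means on the test split is an assumption about good practice here, not a confirmed detail of the pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

# Learn column means from the training data only.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)

# Reuse the training means for the test data to avoid leaking test statistics.
X_test_imp = imputer.transform(X_test)

print(imputer.statistics_)  # column means: 2.0 and 15.0
print(X_test_imp)
```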
Step 4: Constant Column Removal
Features with zero variance across all samples are removed.
Why This Matters
Constant columns provide no predictive signal and waste computational resources.
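# Example of constant column detection
import pandas as pd
df_with_constant = pd.DataFrame({
    "feature1": [1, 2, 3, 4], # Variable
    "feature2": [5, 5, 5, 5], # Constant - will be dropped
    "outcome_var_1": [0, 1, 0, 1]
})
# Detect columns with a single unique value
constant_cols = [c for c in df_with_constant.columns if df_with_constant[c].nunique() <= 1]
# After constant removal ("feature2" is dropped):
df_clean = df_with_constant.drop(columns=constant_cols)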
Step 5: Feature Scaling
Optionally standardizes features to zero mean and unit variance.
Configuration
grid_params:
  scale: [true, false] # Whether to apply StandardScaler
When to Use Scaling
| Scenario | Recommended |
|---|---|
| Neural Networks | Yes |
| SVMs | Yes |
| Distance-based methods (KNN) | Yes |
| Tree-based models (RF, XGBoost) | Not required |
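For reference, a minimal example of standardization with scikit-learn's `StandardScaler`. Fitting the scaler on the training split and applying it unchanged to the test split is assumed here as the usual way to avoid leakage; the toy arrays are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Fit on training data only, so test data never influences the mean or variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```

Both training columns end up on the same scale even though the raw values differ by two orders of magnitude, which is what distance-based and gradient-based learners benefit from.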
Step 6: Feature Importance Selection
Selects top features based on importance scores.
Configuration
grid_params:
  n_features: [50, 100, "all"] # Number of features to retain
Available Methods
The framework supports multiple importance scoring methods:
| Method | Base Model | When to Use |
|---|---|---|
| | Random Forest | General-purpose, robust |
| | Gradient Boosting | High-performance scenarios |
| | Logistic Regression | Linear relationships |
Feature Selection Process
Train temporary model on all features
Extract importance scores
Rank features by importance
Select top n_features
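The four-step process above can be sketched with a Random Forest as the temporary model. The helper `top_features` and the synthetic dataset are illustrative assumptions; the framework's own method names and defaults are not shown in this guide.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 10 features, 3 of them informative.
X_arr, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])


def top_features(X, y, n_features):
    """Return the names of the n_features most important columns."""
    if n_features == "all" or n_features >= X.shape[1]:
        return list(X.columns)
    # 1. Train a temporary model on all features.
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # 2. Extract importance scores.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    # 3. Rank by importance, 4. keep the top n_features.
    return importances.sort_values(ascending=False).head(n_features).index.tolist()


selected = top_features(X, y, 3)
print(selected)
```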
Train/Test Split Options
The framework implements a stratified hold-out validation strategy with three splits.
Standard Stratified Split
graph LR
A[100% Data] --> B[75% Training Set]
A --> C[25% Validation _orig]
B --> D[75% of 75% = ~56.3% Training]
B --> E[25% of 75% = ~18.8% Test]
style D fill:#c9f7c9
style E fill:#ffd7b5
style C fill:#ffcccc
subgraph "Final Splits"
D[X_train]
E[X_test]
C[X_test_orig]
end
Data Split Naming Convention
| Split | Purpose | Usage |
|---|---|---|
| | Training new models | During GA evolution |
| | Evaluation during GA | Fitness calculation |
| | Final validation | Unseen data evaluation |
Split Statistics
For a 1,000-sample dataset:
Training: ~563 samples (used for GA training)
Testing: ~188 samples (for GA fitness evaluation)
Validation (_orig): ~250 samples (final hold-out)
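These proportions can be checked by applying two successive 75/25 stratified splits to a toy 1,000-sample dataset. The variable names mirror the split names used in this guide, but the code is a sketch of the idea rather than the pipeline's implementation; exact counts can differ by one sample from the rounded figures above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 1,000 samples with perfectly balanced binary labels.
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 1], 500)

# First split: hold out 25% as the final validation set (_orig).
X_rest, X_test_orig, y_rest, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Second split: divide the remaining 75% into GA training and GA test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

print(len(X_train), len(X_test), len(X_test_orig))
```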
Resampling Strategies
Controlled via resample parameter: null, "undersample", "oversample"
1. No Resampling (null)
Standard stratified split with class distribution preserved:
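from sklearn.model_selection import train_test_split
# Stratified 75/25 split that preserves the class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)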
Use Cases: Balanced datasets, no class imbalance issues
2. Undersampling ("undersample")
Reduces majority class before splitting:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
# Then perform stratified split
Use Cases: Severe class imbalance (e.g., 1% positive class)
3. Oversampling ("oversample")
Randomly duplicates minority-class samples (RandomOverSampler, not SMOTE) in the training set only, after the train/test split:
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
# Split first to prevent data leakage
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1
)
# Oversample ONLY the training set
ros = RandomOverSampler(random_state=1)
X_train_orig, y_train_orig = ros.fit_resample(X_train_orig, y_train_orig)
Use Cases: Moderate imbalance, want to preserve all original data
Cross-Validation Workflows
While the main GA uses a single hold-out validation set for efficiency, cross-validation can be integrated for final model evaluation.
Standard CV Strategy (For Final Evaluation)
After finding the best ensemble via GA, evaluate it with repeated 10-fold CV (10 folds, 3 repeats):
from sklearn.model_selection import RepeatedKFold, cross_validate
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_validate(
best_ensemble_model,
X_train,
y_train,
cv=cv,
scoring=['roc_auc', 'accuracy', 'precision', 'recall'],
n_jobs=-1
)
Genetic Algorithm-Specific CV Integration
The GA fitness function can incorporate CV scores for more robust evaluation:
| Approach | Description | Trade-offs |
|---|---|---|
| Single split (default) | Use hold-out validation for speed | Faster but noisier estimate |
| K-fold CV in fitness | Average K splits per individual | More stable estimate, but roughly K times slower (e.g., 10x for K=10) |
| 3-fold CV only for best N | Use CV only on top candidates | Balance of accuracy/speed |
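The three approaches in the table can be contrasted in a few lines. This sketch uses logistic regression models as a stand-in population and hypothetical function names (`fitness_single_split`, `fitness_kfold`); it does not reflect how the GA's fitness function is actually wired.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)


def fitness_single_split(model):
    """Fast, noisier fitness: one fit, scored on the hold-out test set."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def fitness_kfold(model, k=3):
    """Slower, steadier fitness: mean AUC over k folds of the training set."""
    return cross_val_score(model, X_train, y_train, cv=k, scoring="roc_auc").mean()


# Stand-in population; in the real pipeline these would be GA individuals.
population = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]

# Hybrid approach: rank everyone cheaply, then cross-validate only the top N.
top_n = 2
ranked = sorted(population, key=fitness_single_split, reverse=True)
cv_scores = [fitness_kfold(model) for model in ranked[:top_n]]
print(cv_scores)
```

The hybrid variant costs one fit per individual plus `k` extra fits for each of the top `top_n` candidates, instead of `k` fits for every individual.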
Parameter Grid Cross-Validation
After GA finds optimal base learners, fine-tune hyperparameters via GridSearchCV:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from ml_grid.pipeline.grid_search_cross_validate import grid_search_crossvalidate
# Candidate hyperparameter values to search (not the final best parameters)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15]
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc'
)
grid_search.fit(X_train, y_train)
# Best combination found by the search
print(grid_search.best_params_)
Cross-Validation Visualization
graph TD
A[All Data] --> B[K-Fold Split]
subgraph Fold 1
B --> C1[Train set]
B --> D1[Test set - Fold 1]
C1 --> E1[Model Training]
E1 --> F1[Predictions]
F1 --> G1[AUC Score]
end
subgraph Fold 2
B --> C2[Train set]
B --> D2[Test set - Fold 2]
C2 --> E2[Model Training]
E2 --> F2[Predictions]
F2 --> G2[AUC Score]
end
subgraph ... more folds
end
G1 --> H[Average AUC across K folds]
G2 --> H
style H fill:#c9f7c9,stroke:#333
Data Pipeline Diagrams
Complete Data Flow
flowchart TB
subgraph "Input Layer"
A[CSV File] --> B[data_pipe]
B --> C{Sampling enabled?}
C -->|Yes| D[test_sample_n / column_sample_n]
C -->|No| E[Full dataset load]
end
subgraph "Preprocessing Pipeline"
D --> F[Initial features]
E --> F
F --> G[Drop correlated features<br/>corr threshold]
G --> H[Drop missing values<br/>percent_missing]
H --> I[Remove other outcomes]
I --> J[Kill constant columns]
end
subgraph "Safety Net"
J --> K{Any features left?}
K -->|No| L[Retain minimal set]
K -->|Yes| M[Use filtered set]
end
subgraph "Train/Test Split"
L --> N[75/25 split]
M --> N
N --> O[X_train / y_train]
N --> P[X_test / y_test]
N --> Q[X_test_orig / y_test_orig]
O --> R{Resample?}
R -->|undersample| S[Under-sample all]
R -->|oversample| T[Over-sample train only]
R -->|null| U[No resampling]
end
subgraph "Final Processing"
S --> V[Post-split cleaning]
T --> V
U --> V
V --> W{Scale?}
W -->|Yes| X[StandardScaler]
W -->|No| Y[Skip scaling]
X --> Z[Feature selection if requested]
Y --> Z
Z --> AA[X_train_final]
Z --> AB[X_test_final]
Z --> AC[X_test_orig_final]
end
style A fill:#e1f5fe,stroke:#333
style F fill:#fff9c4,stroke:#333
style O fill:#b9f6ca,stroke:#333
style P fill:#ffd7b5,stroke:#333
style Q fill:#ffcccc,stroke:#333
Feature Transformation Log
The pipeline maintains a detailed log of feature changes:
| Step | Features Before | Features After | Removed | Reason |
|---|---|---|---|---|
| Initial Load | 100 | 100 | 0 | Base count |
| Drop Correlated | 100 | 92 | 8 | corr > 0.95 |
| Drop Missing | 92 | 90 | 2 | >99% missing |
| Drop Other Outcomes | 90 | 90 | 0 | Not present |
| Drop Constants | 90 | 85 | 5 | Zero variance |
| Split | 85 | 85 | 0 | No feature removal |
| Post-Split Clean | 85 | 83 | 2 | Became constant after split |
| Final Features | - | 83 | 17 | Total |
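A log like this one could be assembled by recording the feature count before and after each step. The `record` helper below is a hypothetical illustration that reuses the counts from the table; it is not the pipeline's logging code.

```python
import pandas as pd

log_rows = []


def record(step, before, after, reason):
    """Append one row describing how a step changed the feature count."""
    log_rows.append({
        "Step": step,
        "Features Before": before,
        "Features After": after,
        "Removed": before - after,
        "Reason": reason,
    })
    return after  # becomes the "before" count of the next step


n = 100
n = record("Initial Load", n, 100, "Base count")
n = record("Drop Correlated", n, 92, "corr > 0.95")
n = record("Drop Missing", n, 90, ">99% missing")
n = record("Drop Other Outcomes", n, 90, "Not present")
n = record("Drop Constants", n, 85, "Zero variance")
n = record("Split", n, 85, "No feature removal")
n = record("Post-Split Clean", n, 83, "Became constant after split")

log = pd.DataFrame(log_rows)
print(log.to_string(index=False))
print("Final features:", n, "| Total removed:", log["Removed"].sum())
```

Chaining each step's output count into the next step's input makes the totals self-checking: the removals sum to 17, matching the drop from 100 to 83 features.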
Summary
This data workflow guide covered:
Input formats: CSV requirements and alternative loading methods
Preprocessing steps: Correlation filtering, missing value handling, scaling
Train/test splits: Stratified hold-out with 3-way split options
Resampling strategies: Undersample/oversample configuration
Cross-validation workflows: Integration for final evaluation
The data pipeline is designed to be fully automated and configurable via config.yml, making it easy to adapt to different dataset characteristics while maintaining reproducibility through random state management.
See Implementation Guide for complete implementation walkthrough with setup, execution examples, and result interpretation.