# Data Workflow
This guide explains how data flows through the ensemble genetic algorithm pipeline, from initial input to final model evaluation. It covers supported formats, preprocessing steps, train/test splitting strategies, and cross-validation workflows.
See also: {doc}`./Configuration_Guide` for hyperparameter configuration examples that impact data handling.
---
## Data Input Formats Supported
### Primary Format: CSV (Comma-Separated Values)
The framework is designed around a single primary data format: CSV.
#### Requirements
| Requirement | Details |
|------------|---------|
| **File extension** | `.csv` only |
| **Row limit** | None (optimized for large datasets) |
| **Encoding** | UTF-8 with BOM support |
| **Header row** | Required (first row must contain column names) |
#### Sample CSV Structure
```csv
patient_ID,age,male,factor_A,factor_B,disease_outcome_var_1
P001,45,1,2.3,0.8,0
P002,32,0,1.9,0.5,1
P003,67,1,3.1,1.2,0
```
### Alternative Input Methods
#### 1. Programmatic Data Loading
For in-memory data frames (Notebooks):
```python
import pandas as pd
from ml_grid.pipeline.read_in import read
from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator

# Load a CSV file into a DataFrame
df = pd.read_csv("data/my_data.csv")

# Or create synthetic data for testing
synth_gen = SyntheticDataGenerator(n_features=10, n_samples=100)
df = synth_gen.generate_df()
```
#### 2. Sampling for Testing
Use the `test_sample_n` and `column_sample_n` parameters:
```python
# Sample 100 rows for debugging
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    test_sample_n=100,
    column_sample_n=5,
    ...
)

# Sample specific columns (ensures outcome_var_1 is always included)
ml_grid_object = data_pipe(
    file_name="data/large_dataset.csv",
    column_sample_n=5,  # Includes 'age', 'male' + 3 random columns
    ...
)
```
### Data Validation
The input pipeline performs automatic validation:
| Check | Description |
|-------|-------------|
| **File existence** | Ensures CSV path is valid |
| **Numerical types** | Verifies all columns can be converted to float |
| **Outcome variable** | Checks for `outcome_var_1` suffix and binary values |
| **Missing headers** | Validates column names are present |
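
The checks in the table can be approximated with plain pandas. The sketch below is illustrative only: `validate_frame` and its `id_cols` argument are hypothetical names, not part of `ml_grid`, and exempting identifier columns from the float check is an assumption.

```python
import io

import pandas as pd


def validate_frame(df, id_cols=("patient_ID",)):
    """Return a list of problems found (hypothetical helper, not the ml_grid API)."""
    problems = []
    # pandas names empty header cells "Unnamed: N"
    if any(str(c).startswith("Unnamed") for c in df.columns):
        problems.append("missing column names in header row")
    outcome_cols = [c for c in df.columns if str(c).endswith("outcome_var_1")]
    if not outcome_cols:
        problems.append("no column ending in 'outcome_var_1'")
    for col in outcome_cols:
        if not set(df[col].dropna().unique()) <= {0, 1}:
            problems.append(f"{col} is not binary")
    for col in df.columns:
        if col in id_cols:
            continue  # assumption: identifier columns are exempt from the float check
        coerced = pd.to_numeric(df[col], errors="coerce")
        if coerced.isna().sum() > df[col].isna().sum():
            problems.append(f"{col} contains non-numeric values")
    return problems


# In-memory stand-in for a CSV file that starts with a UTF-8 byte-order mark
raw = "\ufeffpatient_ID,age,male,factor_A,disease_outcome_var_1\nP001,45,1,2.3,0\nP002,32,0,1.9,1\n"
df = pd.read_csv(io.BytesIO(raw.encode("utf-8")), encoding="utf-8-sig")
problems = validate_frame(df)
print(problems or "all checks passed")
```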
---
## Preprocessing Steps
The data pipeline automatically executes a sequence of preprocessing steps. These can be configured via the `grid_params` section in `config.yml`.
### Step 1: Initial Feature Selection
Filters columns based on configuration:
```yaml
# Configuration in config.yml
grid_params:
  # Include/exclude features based on column names containing these words
  feature_toggles:
    include: ["factor", "biochemical"]
    exclude: ["ID", "date"]
```
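
To make the toggle semantics concrete, here is a minimal sketch of substring-based column filtering. It assumes case-sensitive matching, that `exclude` takes precedence over `include`, and that the outcome column is always kept; `select_columns` is a hypothetical helper, and the real pipeline may resolve these cases differently.

```python
def select_columns(columns, include, exclude, outcome_suffix="outcome_var_1"):
    """Keep columns whose names contain an include word and no exclude word."""
    selected = []
    for col in columns:
        if col.endswith(outcome_suffix):
            selected.append(col)  # assumption: the outcome column is never filtered out
            continue
        if any(word in col for word in exclude):
            continue  # assumption: exclude wins over include
        if any(word in col for word in include):
            selected.append(col)
    return selected


columns = ["patient_ID", "age", "factor_A", "factor_B", "visit_date", "disease_outcome_var_1"]
kept = select_columns(columns, include=["factor", "biochemical"], exclude=["ID", "date"])
print(kept)
```

Under these assumptions `age` is dropped because it matches no include word, which may or may not be the behaviour you want for demographic columns.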
### Step 2: Correlation Filtering
Removes highly correlated features to reduce multicollinearity.
#### How It Works
1. Computes Pearson correlation matrix for all features
2. Iteratively removes one feature from each high-correlation pair (> threshold)
3. Keeps the feature with lower variance (more stable)
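
The three steps above can be sketched as follows. This is an illustrative reimplementation, not the code behind `handle_correlation_matrix`; the helper name `correlated_columns_to_drop` and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df, threshold=0.95):
    """Return columns to drop so that no remaining pair has |r| above threshold."""
    corr = df.corr(method="pearson").abs()
    variances = df.var()
    to_drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # one member of this pair is already gone
            if corr.loc[a, b] > threshold:
                # keep the lower-variance feature, drop the other one
                to_drop.add(a if variances[a] > variances[b] else b)
    return sorted(to_drop)


rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "factor_A": x,
    "factor_A_scaled": 2 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of factor_A
    "factor_B": rng.normal(size=200),                             # independent feature
})
drops = correlated_columns_to_drop(df, threshold=0.95)
print(drops)
```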
#### Configuration
```yaml
grid_params:
  corr: [0.95, 0.98]  # Correlation thresholds to test
```
| Threshold | Use Case |
|-----------|----------|
| `0.90` | Aggressive - drops one feature from any pair correlated above 0.90, leaving fewer columns |
| `0.95` | Standard balance for most datasets |
| `0.98` | Lenient - keeps most features but tolerates some multicollinearity |
#### Visualization Method
After preprocessing, use:
```python
from ml_grid.pipeline.data_correlation_matrix import handle_correlation_matrix
corr_drops = handle_correlation_matrix(
    local_param_dict={"corr": 0.95},
    drop_list=[],
    df=original_df,
)
print(f"Dropped due to correlation: {corr_drops}")
```
### Step 3: Missing Value Handling
Columns with excessive missing data are removed.
#### Configuration
```yaml
grid_params:
  percent_missing: [100, 99, 95]  # Thresholds for column removal
```
| Threshold | Behavior |
|-----------|----------|
| `100` | Removes only columns with more than 100% missing values, which cannot occur, so filtering is effectively disabled |
| `99.8` | Removes columns with >99.8% missing values, i.e. columns that are almost entirely empty |
| `95` | Strictest of the three: removes columns with >95% missing values |
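
A minimal sketch of the threshold logic, assuming a strict "greater than" comparison as the table suggests. `columns_exceeding_missing` is a hypothetical helper, not an `ml_grid` function.

```python
import numpy as np
import pandas as pd


def columns_exceeding_missing(df, percent_missing):
    """Return columns whose share of NaN values is strictly above the threshold."""
    pct = df.isna().mean() * 100
    return pct[pct > percent_missing].index.tolist()


df = pd.DataFrame({
    "factor_A": [1.0, 2.0, 3.0, 4.0],            # 0% missing
    "factor_B": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "factor_C": [np.nan] * 4,                    # 100% missing
})

drop_at_100 = columns_exceeding_missing(df, 100)  # nothing can exceed 100%
drop_at_99 = columns_exceeding_missing(df, 99)
drop_at_50 = columns_exceeding_missing(df, 50)
print(drop_at_100, drop_at_99, drop_at_50)
```

Note that a threshold of `100` never removes anything here, even for the fully empty `factor_C`, which is why that setting acts as "filtering off".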
#### Imputation Strategy
Remaining missing values are handled internally:
- **Numerical features**: Mean imputation via `sklearn.impute.SimpleImputer(strategy="mean")`
- **Base learners**: Many algorithms (XGBoost, etc.) handle NaN natively
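
As a small illustration of mean imputation, the sketch below fits the imputer on training rows and reuses the learned means for test rows. Where exactly the pipeline applies imputation relative to the split is not stated in this guide, so treat the fit-on-train ordering as an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # learns column means 2.0 and 15.0
X_test_imputed = imputer.transform(X_test)        # fills test NaNs with the training means
print(X_test_imputed)
```

Reusing the training means on the test rows keeps information from held-out data out of the preprocessing step.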
### Step 4: Constant Column Removal
Features with zero variance across all samples are removed.
#### Why This Matters
Constant columns provide no predictive signal and waste computational resources.
```python
# Example of constant column detection
import pandas as pd

df_with_constant = pd.DataFrame({
    "feature1": [1, 2, 3, 4],       # Variable
    "feature2": [5, 5, 5, 5],       # Constant - will be dropped
    "outcome_var_1": [0, 1, 0, 1],
})

# Detect columns with a single unique value, then drop them
constant_cols = [c for c in df_with_constant.columns if df_with_constant[c].nunique() <= 1]
df_clean = df_with_constant.drop(columns=constant_cols)  # drops "feature2"
```
### Step 5: Feature Scaling
Standardizes features to zero mean and unit variance (Optional).
#### Configuration
```yaml
grid_params:
  scale: [true, false]  # Whether to apply StandardScaler
```
#### When to Use Scaling
| Scenario | Recommended |
|----------|-------------|
| Neural Networks | `scale: true` |
| SVMs | `scale: true` |
| Distance-based methods (KNN) | `scale: true` |
| Tree-based models (RF, XGBoost) | `scale: false` (not required) |
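
For reference, this is what standardization does to a toy matrix. The sketch fits `StandardScaler` on the training rows only and applies the same transform to the test rows; that ordering is an assumption about good practice, not a statement of how `ml_grid` wires it internally.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # each column now has mean 0 and unit variance
X_test_scaled = scaler.transform(X_test)        # scaled with the training mean and std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, both columns contribute on the same numeric range, which matters for distance-based and gradient-based learners and is irrelevant to tree splits.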
### Step 6: Feature Importance Selection
Selects top features based on importance scores.
#### Configuration
```yaml
grid_params:
  n_features: [50, 100, "all"]  # Number of features to retain
```
#### Available Methods
The framework supports multiple importance scoring methods:
| Method | Base Model | When to Use |
|--------|-----------|-------------|
| `RandomForest` | Random Forest | General-purpose, robust |
| `XGBoost` | Gradient Boosting | High-performance scenarios |
| `LogisticRegression` | Logistic Regression | Linear relationships |
#### Feature Selection Process
1. Train temporary model on all features
2. Extract importance scores
3. Rank features by importance
4. Select top `n_features`
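
The four steps above can be sketched with a random forest as the temporary model. `top_features` is a hypothetical helper and the synthetic data is for illustration only; the framework's own selection code may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"factor_{i}" for i in range(10)])


def top_features(X, y, n_features):
    """Rank columns by random-forest importance and return the top n_features names."""
    if n_features == "all" or n_features >= X.shape[1]:
        return list(X.columns)  # nothing to select
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    return ranked.head(n_features).index.tolist()


selected = top_features(X, y, n_features=3)
print(selected)
```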
---
## Train/Test Split Options
The framework implements a **stratified hold-out validation** strategy with three splits.
### Standard Stratified Split
```mermaid
graph LR
A[100% Data] --> B[75% Training Set]
A --> C[25% Validation _orig]
B --> D[75% of 75% = ~56.3% Training]
B --> E[25% of 75% = ~18.8% Test]
style D fill:#c9f7c9
style E fill:#ffd7b5
style C fill:#ffcccc
subgraph "Final Splits"
D[X_train]
E[X_test]
C[X_test_orig]
end
```
#### Data Split Naming Convention
| Split | Purpose | Usage |
|-------|---------|-------|
| `X_train`, `y_train` | Training new models | During GA evolution |
| `X_test`, `y_test` | Evaluation during GA | Fitness calculation |
| `X_test_orig`, `y_test_orig` | Final validation | Unseen data evaluation |
#### Split Statistics
For a 1,000-sample dataset:
- **Training**: ~563 samples (used for GA training)
- **Testing**: ~188 samples (for GA fitness evaluation)
- **Validation** (`_orig`): ~250 samples (final hold-out)
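
The three-way split can be reproduced with two consecutive calls to `train_test_split`. This is a sketch on synthetic data; the intermediate names `X_dev` and `y_dev` are illustrative and not taken from the pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 700 + [1] * 300)  # 30% positive class

# First split: hold out 25% as the final validation set (_orig)
X_dev, X_test_orig, y_dev, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
# Second split: divide the remaining 75% into GA training and GA test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=1, stratify=y_dev
)

sizes = {"train": len(X_train), "test": len(X_test), "test_orig": len(X_test_orig)}
print(sizes)
print(round(y_train.mean(), 3), round(y_test.mean(), 3), round(y_test_orig.mean(), 3))
```

Because scikit-learn rounds split sizes to whole samples, the training count can differ by one from the ~563 figure quoted above, while the class ratio stays close to 30% in every split.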
### Resampling Strategies
Controlled via `resample` parameter: `null`, `"undersample"`, `"oversample"`
#### 1. No Resampling (`null`)
Standard stratified split with class distribution preserved:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
```
**Use Cases**: Balanced datasets, no class imbalance issues
#### 2. Undersampling (`"undersample"`)
Reduces majority class before splitting:
```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
# Then perform stratified split
```
**Use Cases**: Severe class imbalance (e.g., 1% positive class)
#### 3. Oversampling (`"oversample"`)
Duplicates randomly chosen minority-class samples (`RandomOverSampler`, not SMOTE) after the train/test split, so the held-out data contains no resampled rows:
```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Split first to prevent data leakage
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.25, random_state=1
)
# Oversample ONLY the training set
ros = RandomOverSampler(random_state=1)
X_train_orig, y_train_orig = ros.fit_resample(X_train_orig, y_train_orig)
```
**Use Cases**: Moderate imbalance, want to preserve all original data
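
To see the effect on class counts without installing `imbalanced-learn`, the same split-then-oversample idea can be sketched with `sklearn.utils.resample`. This swaps in a different function from the one the pipeline uses and is meant only to show that the test set stays untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)  # 10% positive class

# Split first so that duplicated rows can never appear in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Draw minority rows with replacement until both classes have equal counts
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=1,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

print(np.bincount(y_train), np.bincount(y_train_bal), np.bincount(y_test))
```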
---
## Cross-Validation Workflows
While the main GA uses a single hold-out validation set for efficiency, cross-validation can be integrated for final model evaluation.
### Standard CV Strategy (For Final Evaluation)
After finding the best ensemble via GA, evaluate it using repeated 10-fold CV (3 repeats, 30 fits in total):
```python
from sklearn.model_selection import RepeatedKFold, cross_validate
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_validate(
    best_ensemble_model,
    X_train,
    y_train,
    cv=cv,
    scoring=['roc_auc', 'accuracy', 'precision', 'recall'],
    n_jobs=-1,
)
```
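
`cross_validate` returns a dictionary with one array per metric, keyed as `test_<metric>`. The sketch below summarises those arrays as mean and standard deviation; a `LogisticRegression` on synthetic data stands in for the GA's best ensemble, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000)  # stand-in for best_ensemble_model

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_validate(model, X_train, y_train, cv=cv, scoring=["roc_auc", "accuracy"])

# One (mean, std) pair per metric, computed over the 30 folds
summary = {
    name: (float(np.mean(vals)), float(np.std(vals)))
    for name, vals in scores.items()
    if name.startswith("test_")
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether two ensembles differ by more than fold-to-fold noise.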
### Genetic Algorithm-Specific CV Integration
The GA fitness function can incorporate CV scores for more robust evaluation:
| Approach | Description | Trade-offs |
|----------|-------------|------------|
| **Single split** (default) | Use hold-out validation for speed | Faster but noisier estimate |
| **K-fold CV in fitness** | Average K splits per individual | More accurate but roughly K times slower (10x for K = 10) |
| **3-fold CV only for best N** | Use CV only on top candidates | Balance of accuracy/speed |
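
The third row of the table can be sketched as a two-stage evaluation: score every individual on the hold-out split, then re-score only the best few with 3-fold CV. The function names and the toy "population" of logistic-regression models are hypothetical and only stand in for GA individuals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)


def fitness_single_split(model):
    """Cheap fitness: one fit, AUC on the hold-out test split."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def fitness_kfold(model, k=3):
    """More stable fitness: mean AUC over k folds of the training data (k fits)."""
    return float(cross_val_score(model, X_train, y_train, cv=k, scoring="roc_auc").mean())


# Hypothetical population: candidates that differ only in regularisation strength
population = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]

quick = [fitness_single_split(m) for m in population]          # stage 1: all individuals
top_n = np.argsort(quick)[::-1][:2]                             # indices of the best two
refined = {int(i): fitness_kfold(population[i]) for i in top_n}  # stage 2: CV on the best two
print(quick, refined)
```

With four individuals this costs four single fits plus six CV fits, compared with twelve fits if every individual were scored with 3-fold CV.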
### Parameter Grid Cross-Validation
After GA finds optimal base learners, fine-tune hyperparameters via GridSearchCV:
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from ml_grid.pipeline.grid_search_cross_validate import grid_search_crossvalidate

# Hyperparameter grid to search for the selected base learner
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15],
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
)
grid_search.fit(X_train, y_train)
```
### Cross-Validation Visualization
```mermaid
graph TD
A[All Data] --> B[K-Fold Split]
subgraph Fold 1
B --> C1[Train set]
B --> D1[Test set - Fold 1]
C1 --> E1[Model Training]
E1 --> F1[Predictions]
F1 --> G1[AUC Score]
end
subgraph Fold 2
B --> C2[Train set]
B --> D2[Test set - Fold 2]
C2 --> E2[Model Training]
E2 --> F2[Predictions]
F2 --> G2[AUC Score]
end
subgraph ... more folds
end
G1 --> H[Average AUC across K folds]
G2 --> H
style H fill:#c9f7c9,stroke:#333
```
---
## Data Pipeline Diagrams
### Complete Data Flow
```mermaid
flowchart TB
subgraph "Input Layer"
A[CSV File] --> B[data_pipe]
B --> C{Sampling enabled?}
C -->|Yes| D[test_sample_n / column_sample_n]
C -->|No| E[Full dataset load]
end
subgraph "Preprocessing Pipeline"
D --> F[Initial features]
E --> F
F --> G[Drop correlated features<br/>corr threshold]
G --> H[Drop missing values<br/>percent_missing]
H --> I[Remove other outcomes]
I --> J[Kill constant columns]
end
subgraph "Safety Net"
J --> K{Any features left?}
K -->|No| L[Retain minimal set]
K -->|Yes| M[Use filtered set]
end
subgraph "Train/Test Split"
L --> N[75/25 split]
M --> N
N --> O[X_train / y_train]
N --> P[X_test / y_test]
N --> Q[X_test_orig / y_test_orig]
O --> R{Resample?}
R -->|undersample| S[Under-sample all]
R -->|oversample| T[Over-sample train only]
R -->|null| U[No resampling]
end
subgraph "Final Processing"
S --> V[Post-split cleaning]
T --> V
U --> V
V --> W{Scale?}
W -->|Yes| X[StandardScaler]
W -->|No| Y[Skip scaling]
X --> Z[Feature selection if requested]
Y --> Z
Z --> AA[X_train_final]
Z --> AB[X_test_final]
Z --> AC[X_test_orig_final]
end
style A fill:#e1f5fe,stroke:#333
style F fill:#fff9c4,stroke:#333
style O fill:#b9f6ca,stroke:#333
style P fill:#ffd7b5,stroke:#333
style Q fill:#ffcccc,stroke:#333
```
### Feature Transformation Log
The pipeline maintains a detailed log of feature changes. For example, for a dataset that starts with 100 features:
| Step | Features Before | Features After | Removed | Reason |
|------|----------------|----------------|---------|--------|
| Initial Load | 100 | 100 | 0 | Base count |
| Drop Correlated | 100 | 92 | 8 | corr > 0.95 |
| Drop Missing | 92 | 90 | 2 | >99% missing |
| Drop Other Outcomes | 90 | 90 | 0 | Not present |
| Drop Constants | 90 | 85 | 5 | Zero variance |
| Split | 85 | 85 | 0 | No feature removal |
| Post-Split Clean | 85 | 83 | 2 | Became constant after split |
| Final Features | - | **83** | **17** | Total |
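
A log like the one above can be built by recording the column count before and after each step. The `FeatureLog` class below is a hypothetical sketch of that bookkeeping and does not mirror the pipeline's internal logger.

```python
import pandas as pd


class FeatureLog:
    """Hypothetical helper that records how each step changes the feature set."""

    def __init__(self):
        self.rows = []

    def record(self, step, before, after, reason):
        """Store one table row and return the surviving features for chaining."""
        self.rows.append({
            "Step": step,
            "Features Before": len(before),
            "Features After": len(after),
            "Removed": len(before) - len(after),
            "Reason": reason,
        })
        return after

    def to_frame(self):
        return pd.DataFrame(self.rows)


features = [f"factor_{i}" for i in range(10)]
log = FeatureLog()
step1 = log.record("Drop Correlated", features, features[:8], "corr > 0.95")
step2 = log.record("Drop Constants", step1, step1[:7], "Zero variance")
table = log.to_frame()
print(table.to_string(index=False))
```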
---
## Summary
This data workflow guide covered:
1. **Input formats**: CSV requirements and alternative loading methods
2. **Preprocessing steps**: Correlation filtering, missing value handling, scaling
3. **Train/test splits**: Stratified hold-out with 3-way split options
4. **Resampling strategies**: Undersample/oversample configuration
5. **Cross-validation workflows**: Integration for final evaluation
The data pipeline is designed to be fully automated and configurable via `config.yml`, making it easy to adapt to different dataset characteristics while maintaining reproducibility through random state management.
See {doc}`./Implementation_Guide` for complete implementation walkthrough with setup, execution examples, and result interpretation.