Implementation Guide
This guide provides a step-by-step walkthrough for running an ensemble genetic algorithm experiment on your dataset. It covers installation, setup, data preparation, execution with different configurations, and result interpretation.
See also: Data Workflow for detailed data preprocessing workflows.
Prerequisites
Python >=3.12 (required)
Virtual environment management (e.g., venv, conda)
Access to a machine with sufficient compute resources (CPU/GPU)
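The interpreter requirement can be checked before installing anything. The snippet below is a minimal sketch; the helper name `meets_minimum_python` is illustrative and not part of the project.

```python
import sys

MIN_PYTHON = (3, 12)  # minimum version stated in the prerequisites


def meets_minimum_python(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) version satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum_python() else "too old"
    print(f"Python {current}: {status}")
```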
For a complete architecture overview, see Architectural Overview.
Installation
The project can be installed in two ways:
Method 1: Manual Installation (Recommended for Customization)
# Clone the repository
git clone <repository-url>
cd ensemble_genetic_algorithm
# Create and activate virtual environment
python -m venv ga_env
source ga_env/bin/activate # Linux/Mac
# or
ga_env\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Method 2: Automated Setup Script
chmod +x setup.sh && ./setup.sh
This script automatically:
Creates a ga_env virtual environment
Installs all required packages
Configures the environment
Dependencies
The core dependencies include:
| Package | Version | Purpose |
|---|---|---|
|  | >=1.3 | Data preprocessing, model validation |
|  | >=1.24 | Numerical computations |
|  | >=2.0 | Data manipulation |
|  | Latest | Genetic algorithm framework |
|  | Latest | Resampling techniques (SMOTE, etc.) |
|  | Latest | Machine learning models |
Step-by-Step Setup and Configuration
1. Prepare Your Dataset
Ensure your dataset meets the following requirements:
File Format
CSV format with headers
All values must be numeric (integers or floats)
No strings, categories, or object types
Outcome Variable Requirements
Must be binary (e.g., 0 and 1)
Column name must end with _outcome_var_1
✅ Valid: disease_outcome_var_1, readmission_outcome_var_1
❌ Invalid: outcome, target
Example CSV Structure
feature_A,feature_B,age,male,disease_outcome_var_1
0.5,2.3,45,1,0
0.8,1.9,32,0,1
0.3,3.1,67,1,0
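The requirements above can be checked mechanically before a run. The following is a minimal sketch using only the standard library; `validate_dataset` is a hypothetical helper, not a project function. It assumes the outcome is coded as 0/1 and treats empty cells as missing values to be handled later by the pipeline's filters.

```python
import csv
import io

OUTCOME_SUFFIX = "_outcome_var_1"  # suffix required by the guide


def validate_dataset(csv_text):
    """Return a list of problems found in CSV text; an empty list means it passed."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if len(rows) < 2:
        return ["file needs a header row and at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []

    # Outcome columns are recognised purely by their name suffix.
    outcome_cols = [name for name in header if name.endswith(OUTCOME_SUFFIX)]
    if not outcome_cols:
        problems.append(f"no column name ends with {OUTCOME_SUFFIX!r}")

    for col_index, name in enumerate(header):
        values = []
        for row in data:
            cell = row[col_index].strip() if col_index < len(row) else ""
            if cell == "":
                continue  # assumption: empty cells are missing values
            try:
                values.append(float(cell))
            except ValueError:
                problems.append(f"column {name!r} has a non-numeric value")
                break
        else:
            # Reached only when every non-empty value parsed as a number.
            if name in outcome_cols and not set(values) <= {0.0, 1.0}:
                problems.append(f"outcome column {name!r} is not binary 0/1")
    return problems


example = """feature_A,feature_B,age,male,disease_outcome_var_1
0.5,2.3,45,1,0
0.8,1.9,32,0,1
0.3,3.1,67,1,0
"""
print(validate_dataset(example))  # an empty list means no problems were found
```

For a file on disk, the same function can be called with the result of `open(path).read()`.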
2. Configure Your Experiment
Create a config.yml file in the project root directory:
global_params:
  input_csv_path: "data/my_dataset.csv"   # Path to your CSV
  n_iter: 20                              # Number of grid iterations
  model_list: ["logisticRegression", "randomForest", "XGBoost", "gradientBoosting"]  # Base learners to use
  verbose: 2                              # Logging level (1-5)
  base_project_dir: "experiments/"        # Output directory
  testing: false                          # Set true for quick tests

ga_params:
  nb_params: [8, 16]                      # Number of base learners per ensemble
  pop_params: [50]                        # Population size
  g_params: [100]                         # Number of generations

grid_params:
  weighted: ["unweighted"]                        # Weighting method
  resample: ["undersample", "oversample", null]   # Sampling strategy
  corr: [0.95]                                    # Correlation threshold for feature removal
  percent_missing: [100]                          # Max % missing values allowed
  scale: [true, false]                            # Whether to standardize features
Key Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| input_csv_path | str | Path to your dataset |
| n_iter | int | Number of grid search iterations (higher = more thorough but slower) |
| model_list | list | List of model class names to include in base learner pool |
| verbose | int | Logging level: 1=info, 2=debug, 3=pathological debug |
| base_project_dir | str | Directory for saving experiment results |
| nb_params | list[int] | List of possible numbers of base learners per ensemble |
| pop_params | list[int] | Population sizes for the GA |
| g_params | list[int] | Number of generations to evolve |
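When choosing n_iter, it helps to know how many distinct settings the option lists define. The sketch below counts the Cartesian product of the example values from this guide. Whether the project's Grid enumerates this space exhaustively or samples from it is not stated here, so treat the helpers as illustrative only.

```python
from itertools import product
from math import prod

# Values copied from the example config.yml in this guide.
ga_params = {"nb_params": [8, 16], "pop_params": [50], "g_params": [100]}
grid_params = {
    "weighted": ["unweighted"],
    "resample": ["undersample", "oversample", None],
    "corr": [0.95],
    "percent_missing": [100],
    "scale": [True, False],
}


def count_combinations(*param_dicts):
    """Number of distinct settings in the Cartesian product of all option lists."""
    return prod(len(options) for params in param_dicts for options in params.values())


def enumerate_combinations(*param_dicts):
    """Yield each distinct setting as a flat dict (not the project's Grid class)."""
    merged = {key: options for params in param_dicts for key, options in params.items()}
    keys = list(merged)
    for values in product(*(merged[key] for key in keys)):
        yield dict(zip(keys, values))


total = count_combinations(ga_params, grid_params)
print(total)  # 12 for the example config (2 GA settings x 6 grid settings)
```

With 12 distinct settings, the example n_iter of 20 is larger than the space itself, which is worth keeping in mind when sizing a run.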
3. Run the Experiment
Create a Python script (e.g., run_experiment.py) or use Jupyter Notebook:
from tqdm import tqdm

from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
from ml_grid.pipeline.data import pipe as data_pipe
from ml_grid.pipeline.main_ga import run as ga_run

# Load configuration and initialize global parameters
global_params = global_parameters(config_path='config.yml')

# Create grid of hyperparameter combinations
grid = Grid(global_params=global_params)

# Iterate through parameter space
for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)

    # Execute data pipeline for this configuration
    ml_grid_object = data_pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        drop_term_list=[],  # Terms to drop from feature selection (optional)
        local_param_dict=local_param_dict,
        base_project_dir=global_params.base_project_dir,
        param_space_index=i,
    )

    # Run the genetic algorithm
    ga_run(
        ml_grid_object=ml_grid_object,
        local_param_dict=local_param_dict,
        global_params=global_params,
    ).execute()
Data Preparation Workflow
The data pipeline performs automatic preprocessing with configurable steps:
Pipeline Steps
Data Loading
Reads CSV file (supports sampling for debugging)
Validates dataset structure and outcome variable format
Initial Feature Selection
Filters columns based on configured feature toggles
Identifies target outcome variable (outcome_var_1)
Feature Filtering
Removes highly correlated features (configurable via corr)
Removes columns with excessive missing data (configurable via percent_missing)
Removes other outcome variables that aren't the target
Safety Net Activation
If all features are pruned, retains a minimum set for model training
Prevents pipeline failure due to overly aggressive filtering
Data Splitting
75% → Train (further split into train/validation)
25% → Hold-out validation (_orig sets)
Post-Split Cleaning
Removes constant columns that arise from splitting
Handles data leakage prevention
Feature Scaling (optional)
Standardizes features if scale=true
Applies same scaler to train/test/validation sets
Feature Importance Selection (optional)
Selects top n_features based on importance scoring
Uses multiple methods: Random Forest, XGBoost, etc.
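To make the correlation filtering step concrete, here is a minimal standard-library sketch. The guide does not describe which member of a correlated pair the pipeline drops, so this version makes an assumption: it walks the columns in order and keeps a column only if it is not too correlated with any column already kept. The function names and sample data are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Return the names of columns kept after greedy pairwise filtering.

    `columns` maps a feature name to its list of values. A column is dropped
    when its absolute correlation with an already-kept column exceeds `threshold`.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) <= threshold for other in kept):
            kept.append(name)
    return kept


features = {
    "age": [45, 32, 67, 51, 29],
    "age_months": [540, 384, 804, 612, 348],  # exact multiple of age
    "feature_A": [0.5, 0.8, 0.3, 0.9, 0.1],
}
print(drop_correlated(features, threshold=0.95))  # ['age', 'feature_A']
```

Raising the threshold toward 1.0 removes fewer columns; lowering it removes more.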
Data Split Strategies
The pipeline supports three resampling strategies via the resample parameter:
| Strategy | Description | When to Use |
|---|---|---|
| null | Standard stratified split | Balanced datasets |
| undersample | Randomly removes majority class samples | Severe class imbalance |
| oversample | Adds synthetic minority class samples (SMOTE) | Moderate class imbalance |
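As an illustration of what random undersampling does, the sketch below drops majority-class rows until the classes are balanced. It is not the project's implementation; balancing to a 1:1 ratio and the helper name `random_undersample` are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


labels = [0] * 8 + [1] * 2              # 8 majority rows, 2 minority rows
rows = [[float(i)] for i in range(10)]
X_bal, y_bal = random_undersample(rows, labels)
print(Counter(y_bal))                   # both classes now have 2 rows
```

Every minority row survives, while most of the majority rows are discarded, which is why this strategy costs data on small datasets.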
Visualizing the Data Pipeline
graph TD
A[Raw CSV] --> B[data_pipe]
B --> C{Initial loading}
C --> D[Feature selection<br/>drop_term_list]
D --> E[Filter by correlation]
E --> F[Filter missing values]
F --> G[Remove other outcomes]
G --> H[Drop constant columns]
H --> I[Safety net check]
I --> J[Create X/y]
J --> K[Train/Test/Validation split]
K --> L{Resample strategy}
L -->|None| M[Standard split]
L -->|Undersample| N[Under-sample all]
L -->|Oversample| O[Over-sample train only]
M --> P[Post-split cleaning]
N --> P
O --> P
P --> Q{Scale?}
Q -->|Yes| R[StandardScaler]
Q -->|No| S[Skip scaling]
R --> T{Feature selection?}
S --> T
T -->|Yes| U[Select top n_features]
T -->|No| V[Use all features]
U --> W[Align indices]
V --> W
W --> X[Final splits stored]
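For the scaling branch of the diagram, one common reading of "same scaler" is that the statistics are learned from the training split and then reused unchanged on the other splits. The guide does not confirm that the pipeline fits on train only, so the sketch below is an assumption; it mirrors standardization with the population standard deviation.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn mean and standard deviation from the training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column) or 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def transform(column, scaler):
    """Apply previously learned statistics to any split."""
    mu, sigma = scaler
    return [(value - mu) / sigma for value in column]


train_age = [30.0, 40.0, 50.0]
test_age = [40.0, 60.0]

scaler = fit_scaler(train_age)               # statistics come from train only
train_scaled = transform(train_age, scaler)  # mean 0, standard deviation 1
test_scaled = transform(test_age, scaler)    # reuses the train statistics
```

Reusing the training statistics keeps information about the hold-out rows from influencing the transformation.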
Running GA with Different Configurations
Configuration 1: Quick Test Run (Debug Mode)
global_params:
  testing: true     # Activate quick test mode
  verbose: 3        # High verbosity for debugging

grid_params:
  corr: [0.95]
  resample: [null]
This configuration:
Uses smaller grid sizes
Reduces population and generation counts
Increases logging for troubleshooting
Configuration 2: Performance Optimization Run
global_params:
  n_iter: 5
  verbose: 1

grid_params:
  corr: [0.95, 0.98]   # Two thresholds to compare (0.98 removes fewer features)
  resample: [null]     # No resampling for speed
  scale: [true]        # Standardize features
Configuration 3: Thorough Hyperparameter Search
global_params:
  n_iter: 50                 # Increase iterations
  verbose: 2

ga_params:
  nb_params: [8, 16, 32]     # Larger ensemble sizes
  pop_params: [100, 200]     # Larger population
  g_params: [200, 300]       # More generations

grid_params:
  resample: [null, "undersample", "oversample"]
  corr: [0.90, 0.95]         # Multiple thresholds
Configuration 4: Medical Dataset (High Missingness)
global_params:
  verbose: 2

grid_params:
  percent_missing: [99]        # Allow high missingness
  resample: ["undersample"]    # Handle class imbalance
  corr: [0.95]
Configuration 5: High-Dimensional Genomic Data
global_params:
  verbose: 2

grid_params:
  scale: [true]                    # Essential for genomic data
  n_features: [100, 500, "all"]    # Feature importance selection
  corr: [0.98]                     # Less aggressive filtering

ga_params:
  pop_params: [100]                # Larger population for feature diversity
  g_params: [200]
Interpreting Results and Evaluating Models
Output Files Structure
After running, your base_project_dir will contain:
experiments/
├── final_grid_score_log.csv # Main results file
├── progress_logs/ # Per-iteration logs
│ └── *_progress.png # Fitness evolution plots
└── best_pop=*_g=*_nb=*.pkl # Saved best ensembles
Interpreting final_grid_score_log.csv
This CSV contains all experiment results. Key columns:
| Column | Description |
|---|---|
| method_name | The base learner algorithm name |
|  | Parameter grid size evaluated |
| AUC | Area Under the ROC Curve (validation set) |
|  | AUC on training set (to detect overfitting) |
| Best params | Best hyperparameters found |
|  | Execution time in minutes |
|  | Random Forest-based feature ranking |
Example Result Row
method_name,AUC,ACC,Best params,n_features
randomForest,0.87,0.82,"{'max_depth': 15}",50
XGBoost,0.89,0.84,"{'learning_rate': 0.1}",30
Evaluating Model Performance
1. Check for Overfitting
Compare training vs validation AUC:
Good: Training AUC ≈ Validation AUC
Overfitting: Training AUC >> Validation AUC
Underfitting: Both scores are low
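The comparison above can be turned into a small check. The sketch below first computes AUC directly from its definition (the chance that a random positive is scored above a random negative) and then labels a run from its training and validation AUC. The thresholds `max_gap` and `min_auc` are hypothetical defaults, not values used by the project.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined when only one class is present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def diagnose_fit(train_auc, val_auc, max_gap=0.05, min_auc=0.6):
    """Label a run using hypothetical thresholds; tune them for your data."""
    if train_auc - val_auc > max_gap:
        return "overfitting"
    if train_auc < min_auc and val_auc < min_auc:
        return "underfitting"
    return "good"


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(diagnose_fit(train_auc=0.99, val_auc=0.80))     # overfitting
```

The pairwise loop is quadratic in the number of rows, which is fine for illustration but slow for large validation sets.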
2. Compare Across Configurations
Use the param_space_index column to identify which grid search iteration each row represents.
import pandas as pd
# Load results
results = pd.read_csv("experiments/final_grid_score_log.csv")
# Find best performing configuration
best_config = results.loc[results['AUC'].idxmax()]
print(f"Best AUC: {best_config['AUC']}")
print(f"Algorithm: {best_config['method_name']}")
print(f"Parameters: {best_config['Best params']}")
3. Analyze Final Ensemble (from Best Run)
Use the GA_results_explorer class to deep-dive into top-performing ensembles:
from ml_grid.util.GA_results_explorer import GA_results_explorer
explorer = GA_results_explorer(
    base_project_dir="experiments/",
    config_path='config.yml'
)
# Get all results sorted by AUC
results_df = explorer.get_sorted_results()
# Create visualizations
explorer.plot_result_distributions()
explorer.plot_auc_distribution()
explorer.plot_best_models_per_param_space()
Visual Interpretation
Fitness Evolution Plot
Each experiment generates a progress_logs/*_progress.png file showing:
X-axis: Generation number
Y-axis: Population best fitness (AUC score)
Interpretation:
Steep initial rise + plateau = Good convergence
Plateau early = May need more generations or larger population
Oscillating = Possible overfitting or noisy evaluation
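The same reading can be done numerically if the best fitness per generation is available as a list. The guide does not show how to extract that history from the logs, so the `history` values below are made up for illustration and the helper name is hypothetical.

```python
def last_improvement(best_fitness, tolerance=1e-4):
    """Index of the last generation that beat the running best by more than `tolerance`."""
    running_best = best_fitness[0]
    last = 0
    for generation, value in enumerate(best_fitness[1:], start=1):
        if value > running_best + tolerance:
            running_best = value
            last = generation
    return last


# Hypothetical best-AUC-per-generation trace: steep rise, then a plateau.
history = [0.70, 0.78, 0.83, 0.85, 0.85, 0.85, 0.85, 0.85]
stalled_for = len(history) - 1 - last_improvement(history)
print(stalled_for)  # 4 generations without improvement
```

A plateau that starts very early relative to the total number of generations corresponds to the "Plateau early" case above.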
AUC Distribution Plot
Shows performance distribution across all grid search iterations.
Advanced: Manual Ensemble Evaluation
To evaluate the best ensemble on a hold-out test set:
import pickle

from ml_grid.util.evaluate_ensemble_methods import evaluate_ensemble_methods

# Load best ensemble from previous run
with open("experiments/best_pop=50_g=100_nb=8.pkl", "rb") as f:
    best_ensemble = pickle.load(f)

# Evaluate on hold-out data.
# X_train, y_train, X_test_orig and y_test_orig are not defined in this snippet;
# they are assumed to be the splits produced by the data pipeline for the same run.
evaluator = evaluate_ensemble_methods(best_ensemble)
auc_train = evaluator.evaluate_auc(X_train, y_train)
auc_test = evaluator.evaluate_auc(X_test_orig, y_test_orig)

print(f"Training AUC: {auc_train}")
print(f"Hold-out Test AUC: {auc_test}")
Best Practices
1. Start Small
Begin with:
n_iter=5
pop_params=[20], g_params=[50]
Single model in model_list
One resample strategy
Once confident in the setup, scale up.
2. Monitor Resource Usage
Large populations and generations consume significant RAM:
| Population Size | Generations | Estimated RAM |
|---|---|---|
| 50 | 100 | ~2 GB |
| 100 | 200 | ~8 GB |
| 200 | 300 | ~24 GB |
Use n_iter to control total iterations.
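A rough way to compare the cost of different settings is to count fitness evaluations. The sketch below assumes every individual is evaluated once per generation, which may not match the project's GA exactly (caching or elitism would change the count), so read the numbers as relative estimates only.

```python
def evaluation_budget(pop_size, generations, n_iter=1):
    """Rough count of fitness evaluations: individuals x generations x grid iterations."""
    return pop_size * generations * n_iter


# The three rows from the table above, for a single grid iteration.
for pop_size, generations in [(50, 100), (100, 200), (200, 300)]:
    print(pop_size, generations, evaluation_budget(pop_size, generations))
```

Going from the first row to the last multiplies the evaluation count by 12, before n_iter is taken into account.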
3. Validate Data First
Run a quick sanity check on your data:
import pandas as pd
df = pd.read_csv("data/my_dataset.csv")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Data types:\n{df.dtypes}")
4. Use testing=True for Development
This reduces all grid sizes and speeds up iterative development.
Troubleshooting
Common Errors
“No features available to select for safety net”
Cause: Too many features have been dropped during filtering
Solution: Relax correlation threshold (corr) or missing percentage (percent_missing)
“AUC undefined (only one class in y_true)”
Cause: Test set contains only one class
Solution: Increase the test split size (currently a 75%/25% train/test split) or use oversample
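A pre-flight check can catch this before a long run. The sketch below is a standalone illustration, not the pipeline's own splitting code: it performs a simple stratified split and confirms that both splits contain both classes. The helper names and the rule of reserving at least one test row per class are assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))  # at least one per class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def has_both_classes(labels, indices):
    """True when the selected rows contain at least two distinct labels."""
    return len({labels[i] for i in indices}) >= 2


labels = [0] * 37 + [1] * 3  # rare positive class
train_idx, test_idx = stratified_split(labels)
print(has_both_classes(labels, test_idx))  # True
```

A class with a single row would end up entirely in the test split under this rule, so very rare outcomes still need more data rather than a different split.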
MemoryError during execution
Cause: Too-large population or generation count
Solution: Reduce pop_params and g_params
Summary
This implementation guide covered:
Installation methods for Python >=3.12
Data preparation requirements (CSV format, binary outcome)
Step-by-step configuration via config.yml
Running experiments with multiple preset configurations
Interpreting results from final_grid_score_log.csv
Visualizing fitness evolution and ensemble performance
All experiment outputs are logged to the base_project_dir directory for post-analysis using the GA_results_explorer.