Implementation Guide

This guide provides a step-by-step walkthrough for running an ensemble genetic algorithm experiment on your dataset. It covers installation, setup, data preparation, execution with different configurations, and result interpretation.

See also: Data Workflow for detailed data preprocessing workflows.


Prerequisites

  • Python >=3.12 (required)

  • Virtual environment management (e.g., venv, conda)

  • Access to a machine with sufficient compute resources (CPU/GPU)
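A run script can check the interpreter version before doing any work. The following is a minimal sketch of such a guard; the helper name `meets_minimum` is illustrative and not part of the project.

```python
import sys

def meets_minimum(minimum=(3, 12), current=None):
    """Return True if `current` (default: the running interpreter) is at least `minimum`."""
    if current is None:
        current = sys.version_info
    # Compare only (major, minor); patch releases do not matter here.
    return tuple(current[:2]) >= tuple(minimum)

if not meets_minimum():
    print(f"Python 3.12 or newer is required; found {sys.version.split()[0]}")
```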

For a complete architecture overview, see Architectural Overview.

Installation

The project can be installed in two ways:

Method 2: Automated Setup Script

chmod +x setup.sh && ./setup.sh

This script automatically:

  • Creates a ga_env virtual environment

  • Installs all required packages

  • Configures the environment

Dependencies

The core dependencies include:

| Package | Version | Purpose |
| --- | --- | --- |
| scikit-learn | >=1.3 | Data preprocessing, model validation |
| numpy | >=1.24 | Numerical computations |
| pandas | >=2.0 | Data manipulation |
| DEAP | Latest | Genetic algorithm framework |
| imblearn | Latest | Resampling techniques (SMOTE, etc.) |
| sklearn | Latest | Machine learning models |
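After setup, it can help to confirm which versions actually ended up in the environment. The sketch below uses the standard library's `importlib.metadata`. The distribution names are assumptions (for example, `imblearn` is published as `imbalanced-learn` and `sklearn` as `scikit-learn`); adjust them to match the project's own requirements file.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names for the dependencies listed above.
CORE_PACKAGES = ["scikit-learn", "numpy", "pandas", "deap", "imbalanced-learn"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(CORE_PACKAGES).items():
    print(f"{name:<18} {ver or 'NOT INSTALLED'}")
```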


Step-by-Step Setup and Configuration

1. Prepare Your Dataset

Ensure your dataset meets the following requirements:

File Format

  • CSV format with headers

  • All values must be numeric (integers or floats)

  • No strings, categories, or object types

Outcome Variable Requirements

  • Must be binary (e.g., 0 and 1)

  • Column name must end with _outcome_var_1

    • ✅ Valid: disease_outcome_var_1, readmission_outcome_var_1

    • ❌ Invalid: outcome, target

Example CSV Structure

feature_A,feature_B,age,male,disease_outcome_var_1
0.5,2.3,45,1,0
0.8,1.9,32,0,1
0.3,3.1,67,1,0
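The requirements above can be checked before an experiment starts. The following sketch uses only the standard library and is not part of the project. It assumes that "binary" means the values 0 and 1 and that empty cells count as missing values rather than errors.

```python
import csv
import io

OUTCOME_SUFFIX = "_outcome_var_1"

def validate_dataset(csv_text):
    """Return a list of problems found; an empty list means the checks passed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["file has no data rows"]
    problems = []
    columns = list(rows[0].keys())
    outcome_cols = [c for c in columns if c.endswith(OUTCOME_SUFFIX)]
    if not outcome_cols:
        problems.append(f"no column ends with {OUTCOME_SUFFIX}")
    for col in columns:
        # Empty cells are treated as missing values and skipped.
        values = [row[col] for row in rows if row[col] not in ("", None)]
        try:
            numeric = [float(v) for v in values]
        except ValueError:
            problems.append(f"column {col!r} contains non-numeric values")
            continue
        if col in outcome_cols and not set(numeric) <= {0.0, 1.0}:
            problems.append(f"outcome column {col!r} is not binary (0/1)")
    return problems

sample = """feature_A,feature_B,age,male,disease_outcome_var_1
0.5,2.3,45,1,0
0.8,1.9,32,0,1
0.3,3.1,67,1,0
"""
print(validate_dataset(sample))  # []
```

For a file on disk, pass `open(path, newline="").read()` as `csv_text`.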

2. Configure Your Experiment

Create a config.yml file in the project root directory:

global_params:
  input_csv_path: "data/my_dataset.csv"      # Path to your CSV
  n_iter: 20                                  # Number of grid iterations
  model_list: [
    "logisticRegression", 
    "randomForest", 
    "XGBoost",
    "gradientBoosting"
  ]                                            # Base learners to use
  verbose: 2                                  # Logging level (see Key Configuration Parameters)
  base_project_dir: "experiments/"            # Output directory
  testing: false                              # Set true for quick tests

ga_params:
  nb_params: [8, 16]                          # Number of base learners per ensemble
  pop_params: [50]                            # Population size
  g_params: [100]                             # Number of generations

grid_params:
  weighted: ["unweighted"]                    # Weighting method
  resample: ["undersample", "oversample", null]  # Sampling strategy
  corr: [0.95]                                # Correlation threshold for feature removal
  percent_missing: [100]                      # Max % missing values allowed
  scale: [true, false]                        # Whether to standardize features

Key Configuration Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| input_csv_path | str | Path to your dataset |
| n_iter | int | Number of grid search iterations (higher = more thorough but slower) |
| model_list | list | List of model class names to include in base learner pool |
| verbose | int | Logging level: 1=info, 2=debug, 3=pathological debug |
| base_project_dir | str | Directory for saving experiment results |
| nb_params | list[int] | List of possible numbers of base learners per ensemble |
| pop_params | list[int] | Population sizes for the GA |
| g_params | list[int] | Number of generations to evolve |
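Each list-valued parameter multiplies the number of distinct configurations. This guide does not show how `Grid` enumerates or samples its settings, so the sketch below assumes a plain Cartesian product over the example values, with the GA and grid parameters combined into one space.

```python
from itertools import product

# Values copied from the example config.yml.
ga_params = {"nb_params": [8, 16], "pop_params": [50], "g_params": [100]}
grid_params = {
    "weighted": ["unweighted"],
    "resample": ["undersample", "oversample", None],
    "corr": [0.95],
    "percent_missing": [100],
    "scale": [True, False],
}

def expand(space):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]

# Assumption: GA and grid parameters form a single search space.
combined = expand({**ga_params, **grid_params})
print(len(combined))  # 2 * 1 * 1 * 1 * 3 * 1 * 1 * 2 = 12
```

If the count is smaller than `n_iter`, it is worth checking how `Grid` handles the difference before starting a long run.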

3. Run the Experiment

Create a Python script (e.g., run_experiment.py) or use Jupyter Notebook:

from tqdm import tqdm
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
from ml_grid.pipeline.data import pipe as data_pipe
from ml_grid.pipeline.main_ga import run as ga_run

# Load configuration and initialize global parameters
global_params = global_parameters(config_path='config.yml')

# Create grid of hyperparameter combinations
grid = Grid(global_params=global_params)

# Iterate through parameter space
for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)
    
    # Execute data pipeline for this configuration
    ml_grid_object = data_pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        drop_term_list=[],  # Terms to drop from feature selection (optional)
        local_param_dict=local_param_dict,
        base_project_dir=global_params.base_project_dir,
        param_space_index=i,
    )
    
    # Run the genetic algorithm
    ga_run(
        ml_grid_object=ml_grid_object,
        local_param_dict=local_param_dict,
        global_params=global_params
    ).execute()

Data Preparation Workflow

The data pipeline performs automatic preprocessing with configurable steps:

Pipeline Steps

  1. Data Loading

    • Reads CSV file (supports sampling for debugging)

    • Validates dataset structure and outcome variable format

  2. Initial Feature Selection

    • Filters columns based on configured feature toggles

    • Identifies target outcome variable (outcome_var_1)

  3. Feature Filtering

    • Removes highly correlated features (configurable via corr)

    • Removes columns with excessive missing data (configurable via percent_missing)

    • Removes other outcome variables that aren’t the target

  4. Safety Net Activation

    • If all features are pruned, retains a minimum set for model training

    • Prevents pipeline failure due to overly aggressive filtering

  5. Data Splitting

    • 75% → Train (further split into train/validation)

    • 25% → Hold-out validation (_orig sets)

  6. Post-Split Cleaning

    • Removes constant columns that arise from splitting

    • Handles data leakage prevention

  7. Feature Scaling (optional)

    • Standardizes features if scale=true

    • Applies same scaler to train/test/validation sets

  8. Feature Importance Selection (optional)

    • Selects top n_features based on importance scoring

    • Uses multiple methods: Random Forest, XGBoost, etc.
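To make the correlation step concrete, here is a small pure-Python sketch of threshold-based filtering. It is not the pipeline's implementation. In particular, the rule for which column of a correlated pair is kept (here, the one seen first) is an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every previously kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exact multiple of "a", so r = 1.0
    "c": [4.0, 1.0, 3.0, 2.0],   # weakly (negatively) correlated with "a"
}
print(drop_correlated(features, threshold=0.95))  # ['a', 'c']
```

A lower `corr` value removes more features, and a higher value removes fewer.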

Data Split Strategies

The pipeline supports three resampling strategies via the resample parameter:

| Strategy | Description | When to Use |
| --- | --- | --- |
| null / None | Standard stratified split | Balanced datasets |
| "undersample" | Randomly removes majority class samples | Severe class imbalance |
| "oversample" | Adds synthetic minority class samples (SMOTE) | Moderate class imbalance |
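As an illustration of what random undersampling does to the class counts, here is a minimal standard-library sketch. It is not the pipeline's own code, which this guide does not show, and the function name and seed handling are illustrative.

```python
import random
from collections import Counter

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until every class has the minority count."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    # Sample `target` indices per class, then restore the original row order.
    keep = sorted(
        index
        for indices in by_class.values()
        for index in rng.sample(indices, target)
    )
    return [rows[i] for i in keep], [labels[i] for i in keep]

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # 7 negatives, 3 positives
X_bal, y_bal = undersample(X, y)
print(Counter(y_bal))                # both classes now have 3 samples
```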

Visualizing the Data Pipeline

graph TD
    A[Raw CSV] --> B[data_pipe]
    B --> C{Initial loading}
    C --> D[Feature selection<br/>drop_term_list]
    D --> E[Filter by correlation]
    E --> F[Filter missing values]
    F --> G[Remove other outcomes]
    G --> H[Drop constant columns]
    H --> I[Safety net check]
    I --> J[Create X/y]
    J --> K[Train/Test/Validation split]
    K --> L{Resample strategy}
    L -->|None| M[Standard split]
    L -->|Undersample| N[Under-sample all]
    L -->|Oversample| O[Over-sample train only]
    M --> P[Post-split cleaning]
    N --> P
    O --> P
    P --> Q{Scale?}
    Q -->|Yes| R[StandardScaler]
    Q -->|No| S[Skip scaling]
    R --> T{Feature selection?}
    S --> T
    T -->|Yes| U[Select top n_features]
    T -->|No| V[Use all features]
    U --> W[Align indices]
    V --> W
    W --> X[Final splits stored]

Running GA with Different Configurations

Configuration 1: Quick Test Run (Debug Mode)

global_params:
  testing: true                              # Activate quick test mode
  verbose: 3                                 # High verbosity for debugging

grid_params:
  corr: [0.95]
  resample: [null]

This configuration:

  • Uses smaller grid sizes

  • Reduces population and generation counts

  • Increases logging for troubleshooting

Configuration 2: Performance Optimization Run

global_params:
  n_iter: 5
  verbose: 1

grid_params:
  corr: [0.95, 0.98]                         # Correlation thresholds to try (higher = fewer features removed)
  resample: [null]                           # No resampling for speed
  scale: [true]                              # Standardize features

Configuration 4: Medical Dataset (High Missingness)

global_params:
  verbose: 2

grid_params:
  percent_missing: [99]                      # Allow high missingness
  resample: ["undersample"]                  # Handle class imbalance
  corr: [0.95]

Configuration 5: High-Dimensional Genomic Data

global_params:
  verbose: 2

grid_params:
  scale: [true]                              # Essential for genomic data
  n_features: [100, 500, "all"]             # Feature importance selection
  corr: [0.98]                               # Less aggressive filtering
  
ga_params:
  pop_params: [100]                          # Larger population for feature diversity
  g_params: [200]
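Population size and generation count largely determine how long a run takes. The sketch below gives a rough upper bound on the number of ensemble evaluations. It assumes every individual is evaluated once per generation. Many GA implementations (DEAP's built-in algorithms, for instance) only re-evaluate individuals that changed, so actual counts may be lower.

```python
def max_fitness_evaluations(pop_size, generations, n_iter=1):
    """Upper bound: the initial population plus one full re-evaluation
    per generation, repeated for each grid iteration."""
    return n_iter * pop_size * (generations + 1)

print(max_fitness_evaluations(50, 100))    # 5050
print(max_fitness_evaluations(100, 200))   # 20100
```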

Interpreting Results and Evaluating Models

Output Files Structure

After running, your base_project_dir will contain:

experiments/
├── final_grid_score_log.csv                 # Main results file
├── progress_logs/                           # Per-iteration logs
│   └── *_progress.png                       # Fitness evolution plots
└── best_pop=*_g=*_nb=*.pkl                 # Saved best ensembles
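Saved ensembles can be located from the filename pattern in the tree above. The sketch below assumes the three wildcards are integers, as in `best_pop=50_g=100_nb=8.pkl`. The helper names are illustrative.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"best_pop=(\d+)_g=(\d+)_nb=(\d+)\.pkl$")

def parse_ensemble_name(path):
    """Return (pop, g, nb) parsed from a saved-ensemble filename, or None."""
    match = PATTERN.search(Path(path).name)
    return tuple(int(group) for group in match.groups()) if match else None

def list_saved_ensembles(base_project_dir):
    """Map each matching .pkl under base_project_dir to its (pop, g, nb) tuple."""
    return {
        str(p): parse_ensemble_name(p)
        for p in sorted(Path(base_project_dir).glob("best_pop=*_g=*_nb=*.pkl"))
        if parse_ensemble_name(p) is not None
    }

print(parse_ensemble_name("experiments/best_pop=50_g=100_nb=8.pkl"))  # (50, 100, 8)
```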

Interpreting final_grid_score_log.csv

This CSV contains all experiment results. Key columns:

| Column | Description |
| --- | --- |
| method_name | The base learner algorithm name |
| PG | Parameter grid size evaluated |
| AUC | Area Under the ROC Curve (validation set) |
| AUC_train | AUC on training set (to detect overfitting) |
| Best params | Best hyperparameters found |
| Run time | Execution time in minutes |
| Feature importance score | Random Forest-based feature ranking |

Example Result Rows

method_name,AUC,ACC,Best params,n_features
randomForest,0.87,0.82,"{'max_depth': 15}",50
XGBoost,0.89,0.84,"{'learning_rate': 0.1}",30

Evaluating Model Performance

1. Check for Overfitting

Compare training vs validation AUC:

  • Good: Training AUC ≈ Validation AUC

  • Overfitting: Training AUC >> Validation AUC

  • Underfitting: Both scores are low
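The comparison can be automated by flagging rows where the training AUC exceeds the validation AUC by more than a chosen margin. In this sketch the 0.05 margin and the sample rows are illustrative only. The column names follow the results table.

```python
import csv
import io

def flag_overfitting(rows, max_gap=0.05):
    """Return (method_name, gap) pairs where AUC_train exceeds AUC by more than max_gap."""
    flagged = []
    for row in rows:
        gap = float(row["AUC_train"]) - float(row["AUC"])
        if gap > max_gap:
            flagged.append((row["method_name"], round(gap, 3)))
    return flagged

# Illustrative rows only; real rows come from final_grid_score_log.csv.
sample = """method_name,AUC,AUC_train
randomForest,0.81,0.99
logisticRegression,0.78,0.80
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(flag_overfitting(rows))  # [('randomForest', 0.18)]
```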

2. Compare Across Configurations

Use the param_space_index column to identify which grid search iteration each row represents.

import pandas as pd

# Load results
results = pd.read_csv("experiments/final_grid_score_log.csv")

# Find best performing configuration
best_config = results.loc[results['AUC'].idxmax()]

print(f"Best AUC: {best_config['AUC']}")
print(f"Algorithm: {best_config['method_name']}")
print(f"Parameters: {best_config['Best params']}")
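To compare configurations rather than find a single overall winner, the rows can be grouped by `param_space_index`. This is a small sketch with illustrative rows and a hypothetical helper name.

```python
def best_per_config(rows, key="param_space_index", metric="AUC"):
    """Return the highest-scoring row for each distinct value of `key`."""
    best = {}
    for row in rows:
        group = row[key]
        if group not in best or float(row[metric]) > float(best[group][metric]):
            best[group] = row
    return best

# Illustrative rows; in practice read them with csv.DictReader or
# pandas.read_csv(...).to_dict("records").
rows = [
    {"param_space_index": "0", "method_name": "randomForest", "AUC": "0.87"},
    {"param_space_index": "0", "method_name": "XGBoost", "AUC": "0.89"},
    {"param_space_index": "1", "method_name": "logisticRegression", "AUC": "0.80"},
]
for index, row in best_per_config(rows).items():
    print(index, row["method_name"], row["AUC"])
```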

3. Analyze Final Ensemble (from Best Run)

Use the GA_results_explorer class to deep-dive into top-performing ensembles:

from ml_grid.util.GA_results_explorer import GA_results_explorer

explorer = GA_results_explorer(
    base_project_dir="experiments/",
    config_path='config.yml'
)

# Get all results sorted by AUC
results_df = explorer.get_sorted_results()

# Create visualizations
explorer.plot_result_distributions()
explorer.plot_auc_distribution()
explorer.plot_best_models_per_param_space()

Visual Interpretation

Fitness Evolution Plot

Each experiment generates a progress_logs/*_progress.png file showing:

X-axis: Generation number
Y-axis: Population best fitness (AUC score)

Interpretation:

  • Steep initial rise + plateau = Good convergence

  • Early plateau at a low score = May need more generations or larger population

  • Oscillating = Possible overfitting or noisy evaluation
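If the per-generation best fitness is available as a list, a plateau can also be detected numerically. The sketch below counts how many generations have passed since the last meaningful improvement. The `min_delta` tolerance and the sample history are illustrative.

```python
def generations_since_improvement(best_fitness, min_delta=1e-4):
    """Count trailing generations in which the best fitness did not
    exceed the running best by more than min_delta."""
    running_best = best_fitness[0]
    last_improved = 0
    for generation, value in enumerate(best_fitness):
        if value > running_best + min_delta:
            running_best = value
            last_improved = generation
    return len(best_fitness) - 1 - last_improved

history = [0.70, 0.78, 0.83, 0.85, 0.85, 0.85, 0.85]   # illustrative values
print(generations_since_improvement(history))  # 3
```

A count that is large relative to `g_params` suggests the run converged well before the final generation.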

AUC Distribution Plot

Shows performance distribution across all grid search iterations.

Advanced: Manual Ensemble Evaluation

To evaluate the best ensemble on a hold-out test set:

import pickle
from ml_grid.util.evaluate_ensemble_methods import evaluate_ensemble_methods

# Load best ensemble from previous run
with open("experiments/best_pop=50_g=100_nb=8.pkl", "rb") as f:
    best_ensemble = pickle.load(f)

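# Evaluate on hold-out data.
# X_train, y_train, X_test_orig and y_test_orig are not defined in this snippet;
# they are the training and hold-out (_orig) splits produced by the data pipeline.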
evaluator = evaluate_ensemble_methods(best_ensemble)

auc_train = evaluator.evaluate_auc(X_train, y_train)
auc_test = evaluator.evaluate_auc(X_test_orig, y_test_orig)

print(f"Training AUC: {auc_train}")
print(f"Hold-out Test AUC: {auc_test}")

Best Practices

1. Start Small

Begin with:

  • n_iter=5

  • pop_params=[20], g_params=[50]

  • Single model in model_list

  • One resample strategy

Once confident in the setup, scale up.

2. Monitor Resource Usage

Large population sizes and high generation counts consume significant RAM:

| Population Size | Generations | Estimated RAM |
| --- | --- | --- |
| 50 | 100 | ~2 GB |
| 100 | 200 | ~8 GB |
| 200 | 300 | ~24 GB |

Use n_iter to control total iterations.

3. Validate Data First

Run a quick sanity check on your data:

import pandas as pd

df = pd.read_csv("data/my_dataset.csv")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Data types:\n{df.dtypes}")

4. Use testing=True for Development

This reduces all grid sizes and speeds up iterative development.


Troubleshooting

Common Errors

“No features available to select for safety net”

  • Cause: Too many features have been dropped during filtering

  • Solution: Relax correlation threshold (corr) or missing percentage (percent_missing)

“AUC undefined (only one class in y_true)”

  • Cause: Test set contains only one class

  • Solution: Enlarge the test portion of the train/test split (currently 75%/25%) so that it contains both classes, or use oversample

MemoryError during execution

  • Cause: Population size or generation count is too large

  • Solution: Reduce pop_params and g_params
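The "AUC undefined" error can be caught before training by checking that every split contains both classes. This is a minimal sketch. The split names and label lists are illustrative and should be replaced with the label arrays from your own splits.

```python
from collections import Counter

def check_both_classes(splits):
    """Return names of splits whose labels do not contain at least two classes."""
    return [name for name, labels in splits.items() if len(Counter(labels)) < 2]

# Illustrative label lists only.
splits = {
    "train": [0, 1, 0, 0, 1],
    "test": [0, 0, 0],
}
print(check_both_classes(splits))  # ['test']
```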


Summary

This implementation guide covered:

  1. Installation methods for Python >=3.12

  2. Data preparation requirements (CSV format, binary outcome)

  3. Step-by-step configuration via config.yml

  4. Running experiments with multiple preset configurations

  5. Interpreting results from final_grid_score_log.csv

  6. Visualizing fitness evolution and ensemble performance

All experiment outputs are logged to the base_project_dir directory for post-analysis using the GA_results_explorer.