Technical Deep Dive

This section provides an in-depth description of the theoretical background, methodology, and implementation details of the Ensemble Genetic Algorithm project.

Table of Contents

Overview
Theoretical Background
Implementation Details
Performance Optimization
Architecture Diagrams

Overview

This genetic algorithm is designed to evolve an optimal ensemble of machine learning classifiers for binary classification tasks. It applies a grid search over the feature space and genetic algorithm hyperparameters to find the best-performing model ensemble.

Key Design Principles

Modularity: The framework separates data preprocessing, model generation, and evolutionary operators.
Extensibility: New base learners and weighting methods can be easily integrated.
Efficiency: Model caching and early stopping mechanisms reduce computational overhead.
Robustness: Comprehensive error handling and logging ensure reliable execution.

Theoretical Background

Machine Learning Optimization Framework

A classification problem can be expressed as learning a function f: X onto y. A binary classification problem is where y is binary, 0 or 1. Several learning algorithms have been developed to this end by researchers in mathematical statistics and more recently machine learning (Bishop and Nasrabadi, 2006) (Haykin, 1998).

Given a set of learning algorithms A and a limited dataset D = {(x1, y1), …, (xn, yn)}, the objective of model selection is to identify the algorithm A∗ ∈ A that achieves optimal generalization performance. Generalization performance is assessed by splitting D into disjoint training and validation sets D(i)_train and D(i)_valid.

The model selection problem can be formally expressed as:

𝐴∗ = arg min (𝐴∈𝒜𝑘) ∑ 𝑖 𝐿(𝐴, 𝐷(𝑖)train, 𝐷(𝑖)valid)

where L(A, D(i)_train, D(i)_valid) represents the loss (e.g., misclassification rate) attained by A when trained on D(i)_train and assessed on D(i)_valid.

Genetic Algorithm Fundamentals

A genetic algorithm is an optimization technique inspired by the process of natural selection. It’s used in machine learning to find optimal solutions to complex problems by mimicking the evolutionary process:

Population Initialization: Starts with a population of potential solutions represented as individuals
Selection: Individuals are selected based on their fitness (ability to solve the problem)
Crossover: Selected individuals recombine to create offspring
Mutation: Offspring undergo random mutations
Replacement: New generation replaces the old population

The process repeats over multiple generations, tending to improve overall fitness.

Implementation Details

Core Components Architecture

# Main pipeline flow
data.pipe() -> ml_grid_object -> main_ga.run().execute()

1. Data Pipeline (`ml_grid.pipeline.data.pipe`)

Factory function that creates experiment objects for each grid search iteration:

Input: Dataset path, hyperparameter set, global configuration
Process:
- Loads and validates dataset
- Splits into train/validation/test sets
- Applies feature selection and preprocessing
- Creates transformed datasets for this iteration’s configuration
Output: ml_grid_object encapsulating all experiment settings

2. Genetic Algorithm Runner (`main_ga.run`)

The primary engine for running the ensemble evolution. Key methods:

Method	Description
`__init__()`	Initializes DEAP toolbox, GA parameters, logging paths
`execute()`	Runs evolutionary loop for all GA hyperparameter combinations
`evaluate()`	Evaluates fitness of individual ensembles

Evolutionary Loop:

Initialize population -> Evaluate fitness ->
Select parents -> Crossover -> Mutation ->
Evaluate offspring -> Replace population

3. Ensemble Generation (`ensemble_generator_ga`)

Creates candidate ensembles by:

Randomly selecting base learners from the pool
Creating instances with randomized hyperparameters
Training on training data with selected feature subsets
Evaluating on validation data to get fitness score

4. Weighting Methods

Three ensemble weighting strategies are available:

Method	Description	Complexity	Performance
`unweighted`	Simple average of predictions	O(n)	Fast, good baseline
`de`	Differential Evolution finds optimal weights	O(n×iter)	Better accuracy
`ann`	Neural network learns non-linear combination	High	Potentially best

Key Data Structures

ml_grid_object

Container for a single grid search iteration:

X_train, y_train: Training data
X_val, y_val: Validation data (for GA fitness)
X_test, y_test: Test data (for final evaluation)
model_class_list: List of base learner generators
local_param_dict: Current iteration’s hyperparameters

Individual Representation

Each ensemble is represented as a chromosome:

individual = [
    [algorithm, feature_subset, trained_model],
    [algorithm, feature_subset, trained_model],
    ...
]

See the API Reference for comprehensive API documentation of all classes and methods.

Performance Optimization

Model Caching Strategy

Benefits:

Avoids retraining when data preprocessing changes but models remain the same
Can reduce runtime by >90% for subsequent runs with similar configurations

Usage:

First run: store_base_learners=True in grid_params
Subsequent runs: use_stored_base_learners=True

Early Stopping

Implementation stops evolution when MCC score doesn’t improve over 5 consecutive generation evaluations.

Configuration (in config.yml):

global_params:
    gen_eval_score_threshold_early_stopping: 5  # Default

Fitness Evaluation

Primary metric: Area Under ROC Curve (AUC) Secondary metric: Matthews Correlation Coefficient (MCC) for model selection

GPU Acceleration

PyTorch models automatically use available CUDA devices:

# Check GPU availability
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

Architecture Diagrams

System Flow Diagram

┌─────────────────┐     ┌─────────────────────┐     ┌──────────────────┐
│   config.yml    │────>│   global_params     │────>│  Grid Search Loop│
└─────────────────┘     └─────────────────────┘     └────────┬─────────┘
                                                             │
                                             ┌──────────────▼──────────────┐
                                             │   data.pipe()                 │
                                             │   - Load dataset              │
                                             │   - Split data                │
                                             │   - Apply preprocessing       │
                                             └──────────────┬──────────────┘
                                                            │
                                            ┌──────────────▼──────────────┐
                                            │  ml_grid_object               │
                                            │  (per iteration)              │
                                            └──────────────┬──────────────┘
                                                           │
                                          ┌────────────────▼────────────────┐
                                          │   main_ga.run().execute()         │
                                          │   - DEAP initialization           │
                                          │   - Population generation         │
                                          │   - Evolutionary loop             │
                                          │   - Fitness evaluation            │
                                          └────────────────┬────────────────┘
                                                         │
                                      ┌────────────────────▼────────────────────┐
                                      │   Results Saving                          │
                                      │   - final_grid_score_log.csv              │
                                      │   - best_models.pkl                       │
                                      │   - plots/                                │
                                      └───────────────────────────────────────────┘

Genetic Algorithm Pipeline

Initialization (Pop size = 96)
    ↓
Evaluation (AUC on validation set)
    ↓
Selection (Tournament, size=3)
    ↓
Crossover (2-point, prob=0.8)
    ↓
Mutation (gene swap, prob=0.2)
    ↓
Replacement (generational)
    ↓
Early Stopping Check? ──No──> Continue? ──Yes──> Next Generation
     ↓ Yes                          ↓ No
Terminate with Best Solution

Weighting Method Comparison

Approach	Equation	Pros	Cons
Unweighted	ŷ = mean([ŷ₁, …, ŷₙ])	Simple, fast	Ignores model quality differences
Differential Evolution	ŷ = Σ(wᵢ × ŷᵢ), w=DE-optimal	Accuracy improvement	Slower (~10×)
ANN Weighting	ŷ = ANN([ŷ₁, …, ŷₙ])	Captures non-linear interactions	Most complex, may overfit

Performance Benchmarks

Baseline Performance (Typical Dataset)

Population Size	Generations	Runtime (min)	Best AUC	Models Evaluated
32	50	~15	0.78	1,600
64	100	~60	0.82	6,400
128	128	~240	0.85	16,384

Optimization Effects

Model Caching: Reduces runtime by ~90% for similar config runs
Early Stopping: Saves ~20-40 generations on average
GPU Acceleration: 3-5× speedup for PyTorch models

Troubleshooting Implementation Issues

Common issues with `main_ga.run`:

Memory Errors: Reduce population size or disable model caching
Slow Performance: Use simpler weighting method, reduce generations
Poor Convergence: Increase mutation rate or population diversity

See Troubleshooting Guide for detailed solutions.

References

Agius, R., Brieghel, C., Andersen, M.A., Pearson, A.T., Ledergerber, B., Cozzi-Lepri, A., Louzoun, Y., Andersen, C.L., Bergstedt, J., von Stemann, J.H., Jorgensen, M., Tang, M.E., Fontes, M., Bahlo, J., Herling, C.D., Hallek, M., Lundgren, J., MacPherson, C.R., Larsen, J. and Niemann, C.U. (2020) ‘Machine learning can identify newly diagnosed patients with CLL at high risk of infection’, Nature communications, 11(1), pp. 363-8. doi: 10.1038/s41467-019-14225-8
Bishop, C.M. and Nasrabadi, N.M. (2006) Pattern recognition and machine learning, Springer.
Haykin, S. (1998) Neural networks: a comprehensive foundation, Prentice Hall PTR.
Komer, B., Bergstra, J., and Eliasmith, C. (2014) Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn, Citeseer Austin, TX, pp. 50.
Thornton, C., Hutter, F., Hoos, H.H., and Leyton-Brown, K. (2013) Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, pp. 847.
Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008) Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for?, Queue, 6(2), pp. 40-53.
Fortin, F., De Rainville, F., Gardner, M.G., Parizeau, M., and Gagné, C. (2012) DEAP: Evolutionary algorithms made easy, The Journal of Machine Learning Research, 13(1), pp. 2171-2175.
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., and Bright, J. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature methods, 17(3), pp. 261-272.