Technical Deep Dive

This section provides an in-depth description of the theoretical background, methodology, and implementation details of the Ensemble Genetic Algorithm project.


Table of Contents


Overview

This genetic algorithm is designed to evolve an optimal ensemble of machine learning classifiers for binary classification tasks. It applies a grid search over the feature space and genetic algorithm hyperparameters to find the best-performing model ensemble.

Key Design Principles

  1. Modularity: The framework separates data preprocessing, model generation, and evolutionary operators.

  2. Extensibility: New base learners and weighting methods can be easily integrated.

  3. Efficiency: Model caching and early stopping mechanisms reduce computational overhead.

  4. Robustness: Comprehensive error handling and logging ensure reliable execution.


Theoretical Background

Machine Learning Optimization Framework

A classification problem can be expressed as learning a function f: X onto y. A binary classification problem is where y is binary, 0 or 1. Several learning algorithms have been developed to this end by researchers in mathematical statistics and more recently machine learning (Bishop and Nasrabadi, 2006) (Haykin, 1998).

Given a set of learning algorithms A and a limited dataset D = {(x1, y1), …, (xn, yn)}, the objective of model selection is to identify the algorithm A∗ ∈ A that achieves optimal generalization performance. Generalization performance is assessed by splitting D into disjoint training and validation sets D(i)_train and D(i)_valid.

The model selection problem can be formally expressed as:

𝐴∗ = arg min (𝐴∈𝒜𝑘) ∑ 𝑖 𝐿(𝐴, 𝐷(𝑖)train, 𝐷(𝑖)valid)

where L(A, D(i)_train, D(i)_valid) represents the loss (e.g., misclassification rate) attained by A when trained on D(i)_train and assessed on D(i)_valid.

Genetic Algorithm Fundamentals

A genetic algorithm is an optimization technique inspired by the process of natural selection. It’s used in machine learning to find optimal solutions to complex problems by mimicking the evolutionary process:

  1. Population Initialization: Starts with a population of potential solutions represented as individuals

  2. Selection: Individuals are selected based on their fitness (ability to solve the problem)

  3. Crossover: Selected individuals recombine to create offspring

  4. Mutation: Offspring undergo random mutations

  5. Replacement: New generation replaces the old population

The process repeats over multiple generations, tending to improve overall fitness.


Implementation Details

Core Components Architecture

# Main pipeline flow
data.pipe() -> ml_grid_object -> main_ga.run().execute()

1. Data Pipeline (ml_grid.pipeline.data.pipe)

Factory function that creates experiment objects for each grid search iteration:

  • Input: Dataset path, hyperparameter set, global configuration

  • Process:

    • Loads and validates dataset

    • Splits into train/validation/test sets

    • Applies feature selection and preprocessing

    • Creates transformed datasets for this iteration’s configuration

  • Output: ml_grid_object encapsulating all experiment settings

2. Genetic Algorithm Runner (main_ga.run)

The primary engine for running the ensemble evolution. Key methods:

Method

Description

__init__()

Initializes DEAP toolbox, GA parameters, logging paths

execute()

Runs evolutionary loop for all GA hyperparameter combinations

evaluate()

Evaluates fitness of individual ensembles

Evolutionary Loop:

Initialize population -> Evaluate fitness ->
Select parents -> Crossover -> Mutation ->
Evaluate offspring -> Replace population

3. Ensemble Generation (ensemble_generator_ga)

Creates candidate ensembles by:

  1. Randomly selecting base learners from the pool

  2. Creating instances with randomized hyperparameters

  3. Training on training data with selected feature subsets

  4. Evaluating on validation data to get fitness score

4. Weighting Methods

Three ensemble weighting strategies are available:

Method

Description

Complexity

Performance

unweighted

Simple average of predictions

O(n)

Fast, good baseline

de

Differential Evolution finds optimal weights

O(n×iter)

Better accuracy

ann

Neural network learns non-linear combination

High

Potentially best

Key Data Structures

ml_grid_object

Container for a single grid search iteration:

  • X_train, y_train: Training data

  • X_val, y_val: Validation data (for GA fitness)

  • X_test, y_test: Test data (for final evaluation)

  • model_class_list: List of base learner generators

  • local_param_dict: Current iteration’s hyperparameters

Individual Representation

Each ensemble is represented as a chromosome:

individual = [
    [algorithm, feature_subset, trained_model],
    [algorithm, feature_subset, trained_model],
    ...
]

See the API Reference for comprehensive API documentation of all classes and methods.


Performance Optimization

Model Caching Strategy

Benefits:

  • Avoids retraining when data preprocessing changes but models remain the same

  • Can reduce runtime by >90% for subsequent runs with similar configurations

Usage:

  1. First run: store_base_learners=True in grid_params

  2. Subsequent runs: use_stored_base_learners=True

Early Stopping

Implementation stops evolution when MCC score doesn’t improve over 5 consecutive generation evaluations.

Configuration (in config.yml):

global_params:
    gen_eval_score_threshold_early_stopping: 5  # Default

Fitness Evaluation

Primary metric: Area Under ROC Curve (AUC) Secondary metric: Matthews Correlation Coefficient (MCC) for model selection

GPU Acceleration

PyTorch models automatically use available CUDA devices:

# Check GPU availability
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

Architecture Diagrams

System Flow Diagram

┌─────────────────┐     ┌─────────────────────┐     ┌──────────────────┐
│   config.yml    │────>│   global_params     │────>│  Grid Search Loop│
└─────────────────┘     └─────────────────────┘     └────────┬─────────┘
                                                             │
                                             ┌──────────────▼──────────────┐
                                             │   data.pipe()                 │
                                             │   - Load dataset              │
                                             │   - Split data                │
                                             │   - Apply preprocessing       │
                                             └──────────────┬──────────────┘
                                                            │
                                            ┌──────────────▼──────────────┐
                                            │  ml_grid_object               │
                                            │  (per iteration)              │
                                            └──────────────┬──────────────┘
                                                           │
                                          ┌────────────────▼────────────────┐
                                          │   main_ga.run().execute()         │
                                          │   - DEAP initialization           │
                                          │   - Population generation         │
                                          │   - Evolutionary loop             │
                                          │   - Fitness evaluation            │
                                          └────────────────┬────────────────┘
                                                         │
                                      ┌────────────────────▼────────────────────┐
                                      │   Results Saving                          │
                                      │   - final_grid_score_log.csv              │
                                      │   - best_models.pkl                       │
                                      │   - plots/                                │
                                      └───────────────────────────────────────────┘

Genetic Algorithm Pipeline

Initialization (Pop size = 96)
    ↓
Evaluation (AUC on validation set)
    ↓
Selection (Tournament, size=3)
    ↓
Crossover (2-point, prob=0.8)
    ↓
Mutation (gene swap, prob=0.2)
    ↓
Replacement (generational)
    ↓
Early Stopping Check? ──No──> Continue? ──Yes──> Next Generation
     ↓ Yes                          ↓ No
Terminate with Best Solution

Weighting Method Comparison

Approach

Equation

Pros

Cons

Unweighted

ŷ = mean([ŷ₁, …, ŷₙ])

Simple, fast

Ignores model quality differences

Differential Evolution

ŷ = Σ(wᵢ × ŷᵢ), w=DE-optimal

Accuracy improvement

Slower (~10×)

ANN Weighting

ŷ = ANN([ŷ₁, …, ŷₙ])

Captures non-linear interactions

Most complex, may overfit


Performance Benchmarks

Baseline Performance (Typical Dataset)

Population Size

Generations

Runtime (min)

Best AUC

Models Evaluated

32

50

~15

0.78

1,600

64

100

~60

0.82

6,400

128

128

~240

0.85

16,384

Optimization Effects

  • Model Caching: Reduces runtime by ~90% for similar config runs

  • Early Stopping: Saves ~20-40 generations on average

  • GPU Acceleration: 3-5× speedup for PyTorch models


Troubleshooting Implementation Issues

Common issues with main_ga.run:

  1. Memory Errors: Reduce population size or disable model caching

  2. Slow Performance: Use simpler weighting method, reduce generations

  3. Poor Convergence: Increase mutation rate or population diversity

See Troubleshooting Guide for detailed solutions.


References

  1. Agius, R., Brieghel, C., Andersen, M.A., Pearson, A.T., Ledergerber, B., Cozzi-Lepri, A., Louzoun, Y., Andersen, C.L., Bergstedt, J., von Stemann, J.H., Jorgensen, M., Tang, M.E., Fontes, M., Bahlo, J., Herling, C.D., Hallek, M., Lundgren, J., MacPherson, C.R., Larsen, J. and Niemann, C.U. (2020) ‘Machine learning can identify newly diagnosed patients with CLL at high risk of infection’, Nature communications, 11(1), pp. 363-8. doi: 10.1038/s41467-019-14225-8

  2. Bishop, C.M. and Nasrabadi, N.M. (2006) Pattern recognition and machine learning, Springer.

  3. Haykin, S. (1998) Neural networks: a comprehensive foundation, Prentice Hall PTR.

  4. Komer, B., Bergstra, J., and Eliasmith, C. (2014) Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn, Citeseer Austin, TX, pp. 50.

  5. Thornton, C., Hutter, F., Hoos, H.H., and Leyton-Brown, K. (2013) Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, pp. 847.

  6. Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008) Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for?, Queue, 6(2), pp. 40-53.

  7. Fortin, F., De Rainville, F., Gardner, M.G., Parizeau, M., and Gagné, C. (2012) DEAP: Evolutionary algorithms made easy, The Journal of Machine Learning Research, 13(1), pp. 2171-2175.

  8. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., and Bright, J. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature methods, 17(3), pp. 261-272.