Architectural Overview
This guide provides a high-level overview of the project’s architecture, explaining the roles of its key components and how they interact. Understanding this structure will help you navigate, use, and extend the framework effectively.
Pipeline Visualization
The complete experiment pipeline can be visualized in three stages:
1. Data Flow Pipeline (Data Preparation)
flowchart TB
subgraph "Input"
A[CSV File] --> B[data.pipe]
end
subgraph "Feature Engineering"
B --> C{Initial Load}
C --> D[Drop Correlated Features<br/>corr threshold]
D --> E[Drop Missing Values<br/>percent_missing]
E --> F[Remove Constants]
F --> G[Safety Net Check]
end
subgraph "Train/Test Split"
G --> H[75/25 Stratified Split]
H --> I[X_train / y_train]
H --> J[X_test / y_test]
H --> K[X_test_orig / y_test_orig]
I --> L{Resample?}
L -->|undersample| M[Under-sample all]
L -->|oversample| N[Over-sample train only]
L -->|null| O[No resampling]
end
subgraph "Final Processing"
M --> P[Post-split cleaning]
N --> P
O --> P
P --> Q{Scale?}
Q -->|Yes| R[StandardScaler]
Q -->|No| S[Skip scaling]
R --> T[Feature Selection]
S --> T
T --> U[X_train_final]
T --> V[X_test_final]
end
style B fill:#e1f5fe,stroke:#333
style I fill:#b9f6ca,stroke:#333
style J fill:#ffd7b5,stroke:#333
See Data Workflow for detailed data preprocessing steps.
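The split and resampling branch of the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `stratified_split` and `oversample_minority` are hypothetical, rows are represented only by their indices, and the 25% test fraction mirrors the 75/25 split shown above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def oversample_minority(train_idx, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        balanced.extend(indices + extra)
    return balanced

labels = [0] * 80 + [1] * 20          # imbalanced toy target
train_idx, test_idx = stratified_split(labels)
train_balanced = oversample_minority(train_idx, labels)  # training rows only
```

Because only `train_idx` is resampled in this sketch, the test indices keep the original class ratio.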
2. Genetic Algorithm Evolution
flowchart TB
subgraph "Initialization"
A[Population Size N] --> B[Create Initial Individuals]
B --> C["Evaluate Fitness (AUC)"]
end
subgraph "Evolution Loop"
C --> D[Select Top Performers<br/> Tournament Selection]
D --> E["Crossover (cxTwoPoint)<br/> Combine Parent Features"]
E --> F["Mutation (mutFlipBit)<br/> Random Bit Flip"]
F --> G[Evaluate Offspring Fitness]
end
subgraph "Generation Progress"
C --> H[Track Best Individual]
G --> I[Record AUC per Generation]
I --> J[Early Stopping<br/> if no improvement]
end
H --> K{Max Generations?}
J --> L[Continue OR Stop]
style B fill:#fff9c4,stroke:#333
style C fill:#e1f5fe,stroke:#333
See Genetic Algorithm Python API Reference for detailed genetic algorithm APIs.
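The three operators named in the diagram can be illustrated with small pure-Python functions. This is a sketch of the general techniques only: it assumes individuals are lists of 0/1 bits, the function names are hypothetical, and the details may differ from the library operators the project actually calls.

```python
import random

def tournament_select(population, fitnesses, t_size, rng):
    """Pick the fittest of t_size randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), t_size)
    best = max(contenders, key=lambda i: fitnesses[i])
    return list(population[best])

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the slice between two cut points (the idea behind cxTwoPoint)."""
    size = min(len(parent_a), len(parent_b))
    lo, hi = sorted(rng.sample(range(1, size), 2))
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b

def flip_bit_mutation(individual, indpb, rng):
    """Flip each bit independently with probability indpb (the idea behind mutFlipBit)."""
    return [1 - bit if rng.random() < indpb else bit for bit in individual]

rng = random.Random(1)
child_a, child_b = two_point_crossover([0] * 8, [1] * 8, rng)
mutated = flip_bit_mutation(child_a, indpb=0.1, rng=rng)
```

The functions return new lists rather than modifying their inputs, which keeps the parents available for comparison in later steps.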
3. Complete Experiment Flow
flowchart TB
subgraph "Phase 1: Configuration"
A[config.yml] --> B[global_parameters]
B --> C[Grid search space generation<br/>grid_param_space_ga.Grid]
C --> D[n_iter configurations]
end
subgraph "Phase 2: Data Pipeline"
D --> E[data_pipe for config_1]
D --> F[data_pipe for config_2]
E --> G[X_train, X_test, X_test_orig]
F --> H[X_train, X_test, X_test_orig]
end
subgraph "Phase 3: GA Evolution"
G --> I[main_ga.run.execute]
H --> J[main_ga.run.execute]
I --> K[Evolve ensemble for config_1]
J --> L[Evolve ensemble for config_2]
end
subgraph "Phase 4: Output"
K --> M[final_grid_score_log.csv]
L --> M
M --> N[Analysis with GA_results_explorer]
end
style A fill:#e1f5fe,stroke:#333
style G fill:#b9f6ca,stroke:#333
style K fill:#ffd7b5,stroke:#333
Core Components
The project is designed with a modular architecture, separating concerns into distinct components. The main workflow revolves around the following parts:
1. The Orchestrator (main.py or example_usage.ipynb)
This is the main entry point for running an experiment. Its primary responsibilities are:
Configuration: Loading all experiment settings from a config.yml file into a central global_params object. This includes paths, iteration counts, and model lists.
Experiment Loop: Iterating n_iter times to run the genetic algorithm with different hyperparameter settings.
Post-Processing: Calling the analysis and evaluation components after the main loop is complete.
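The experiment loop might be organized roughly as follows. This is a hedged sketch, not the contents of main.py: `run_experiment`, `sample_settings`, `build_data`, and `run_ga` are hypothetical names, with the last three standing in for the configuration layer, the data pipeline, and the GA engine.

```python
import csv
import io

def run_experiment(n_iter, sample_settings, build_data, run_ga, log_file):
    """Run n_iter iterations and write one result row per iteration."""
    writer = None
    for iteration in range(n_iter):
        settings = sample_settings(iteration)   # one hyperparameter combination
        data = build_data(settings)             # stand-in for the data pipeline
        auc = run_ga(data, settings)            # stand-in for the GA engine
        row = {"iteration": iteration, **settings, "auc": auc}
        if writer is None:
            # The header is taken from the first row's keys.
            writer = csv.DictWriter(log_file, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)

log = io.StringIO()  # in-memory stand-in for final_grid_score_log.csv
run_experiment(
    n_iter=3,
    sample_settings=lambda i: {"scale": i % 2 == 0},
    build_data=lambda settings: {"n_rows": 100},
    run_ga=lambda data, settings: 0.5,
    log_file=log,
)
```

Passing the three stages in as callables keeps the loop itself free of any modelling logic, which matches the separation of concerns described above.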
2. The Configuration Layer (config.yml and grid_param_space_ga.py)
This layer defines the entire hyperparameter search space. The system uses a layered approach:
config.yml: The primary way to set parameters. You define the search space for the GA (ga_params) and the grid search (grid_params) here.
grid_param_space_ga.py: This file contains the default, hardcoded search space. Any values set in config.yml will override these defaults.
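The override behaviour amounts to a dictionary merge in which user values win. The sketch below illustrates that idea only: the key names (`resample`, `scale`, `corr`) and the function `resolve_grid` are assumptions, and rejecting unknown keys is a design choice of this example rather than documented behaviour.

```python
DEFAULT_GRID = {  # stand-in for the hardcoded defaults
    "resample": ["undersample", "oversample", None],
    "scale": [True, False],
    "corr": [0.9, 0.95],
}

def resolve_grid(defaults, overrides):
    """Return a new grid where each overridden key replaces the default list."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown grid parameters: {sorted(unknown)}")
    return {**defaults, **overrides}

user_grid = {"corr": [0.98]}   # e.g. values parsed from grid_params in config.yml
grid = resolve_grid(DEFAULT_GRID, user_grid)
```

Keys absent from `user_grid` fall back to their defaults, and `DEFAULT_GRID` itself is left unmodified.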
The grid includes combinations of settings for:
Data preprocessing (e.g., resampling, scaling).
Feature selection (e.g., correlation thresholds).
Genetic algorithm parameters (e.g., population size, mutation rate).
Grid Parameter Combinations
For each of the n_iter iterations, a unique combination is drawn from the Cartesian product of the following options:
| Parameter | Options | Effect |
|---|---|---|
| Correlation threshold | [0.9, 0.95] | Lower = more aggressive removal of correlated features |
| Resampling | ["undersample", "oversample", null] | Handles class imbalance |
| Scaling | [true, false] | Standardizes features for NN/SVM |
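Drawing unique combinations from a Cartesian product can be done with `itertools.product` and sampling without replacement. The grid keys and the function name `sample_combinations` below are illustrative assumptions; the project's own grid class may work differently.

```python
import itertools
import random

grid = {
    "corr": [0.9, 0.95],
    "resample": ["undersample", "oversample", None],
    "scale": [True, False],
}

def sample_combinations(grid, n_iter, seed=0):
    """Draw up to n_iter distinct settings from the Cartesian product of the grid."""
    names = list(grid)
    all_combos = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    rng = random.Random(seed)
    # Sampling without replacement guarantees no combination repeats.
    return rng.sample(all_combos, min(n_iter, len(all_combos)))

combos = sample_combinations(grid, n_iter=5)
```

With two, three, and two options respectively, this toy grid has 2 × 3 × 2 = 12 combinations, so `n_iter` values above 12 are capped at the full product.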
3. The Data Pipeline (ml_grid.pipeline.data.pipe)
This is a factory function that creates the central ml_grid_object for a single grid search iteration. It takes a specific set of hyperparameters from the configuration layer and performs initial data setup, including:
Loading the dataset.
Splitting data into training, validation, and hold-out test sets.
Applying initial data sampling or feature subset selection.
Pipeline Method Sequence
data_pipe.__init__()
├── _load_data() # Load CSV with optional sampling
├── _initial_feature_selection() # Filter columns by name/toggles
├── _apply_safety_net() # Retain minimal features if needed
├── _create_xy() # Create feature/target matrices
├── _split_data() # Train/test/validation split
├── _post_split_cleaning() # Remove post-split constants
├── _scale_features() # StandardScaler (optional)
└── _select_features_by_importance() # Final feature selection
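The post-split steps in this sequence can be illustrated as follows. This is a sketch under assumptions: columns are held in plain dicts of lists rather than DataFrames, the function name is hypothetical, and fitting the scaling statistics on the training rows only is a common practice assumed here rather than something this page confirms.

```python
import statistics

def post_split_clean_and_scale(X_train, X_test, scale=True):
    """Drop columns that are constant in the training rows, then optionally standardize."""
    kept = [col for col, values in X_train.items() if len(set(values)) > 1]
    train_out, test_out = {}, {}
    for col in kept:
        if scale:
            mean = statistics.fmean(X_train[col])
            std = statistics.pstdev(X_train[col])  # population std, as StandardScaler uses
            train_out[col] = [(v - mean) / std for v in X_train[col]]
            test_out[col] = [(v - mean) / std for v in X_test[col]]  # train statistics reused
        else:
            train_out[col] = list(X_train[col])
            test_out[col] = list(X_test[col])
    return train_out, test_out

X_train = {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]}
X_test = {"a": [2.0, 4.0], "b": [5.0, 6.0]}
train_scaled, test_scaled = post_split_clean_and_scale(X_train, X_test)
```

Column `b` is constant in the training rows, so it is removed from both splits even though it varies in the test rows.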
4. The ml_grid_object
This object is the heart of a single grid search iteration. It acts as a container that encapsulates everything needed for one full execution of the genetic algorithm:
Data splits (train, validation, test).
The specific hyperparameters for the current run, drawn from the global_params object.
Paths for saving logs, models, and results.
The list of base learner models to be used.
This object is passed to the core GA engine.
Key Attributes
| Attribute | Type | Purpose |
|---|---|---|
| X_train / y_train | pd.DataFrame/Series | Training data for GA evolution |
| X_test / y_test | pd.DataFrame/Series | Evaluation data during evolution |
| X_test_orig / y_test_orig | pd.DataFrame/Series | Hold-out validation (final evaluation) |
| modelFuncList | list | Base learner generators to use |
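As a rough mental model, the container could be pictured as a dataclass. The class name `GridRunContainer` and the fields `local_params` and `log_dir` are hypothetical; only the data split names and `modelFuncList` appear elsewhere on this page, and the real object may be structured quite differently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class GridRunContainer:
    """Hypothetical stand-in for the per-iteration container."""
    X_train: Any
    y_train: Any
    X_test: Any
    y_test: Any
    X_test_orig: Any
    y_test_orig: Any
    local_params: Dict[str, Any] = field(default_factory=dict)   # settings for this run
    log_dir: str = "logs"                                        # where results are saved
    modelFuncList: List[Callable[..., Any]] = field(default_factory=list)

container = GridRunContainer(
    X_train=[[0.1], [0.9]], y_train=[0, 1],
    X_test=[[0.4]], y_test=[0],
    X_test_orig=[[0.7]], y_test_orig=[1],
    local_params={"scale": True},
)
```

Using `default_factory` gives each instance its own dict and list, so settings from one iteration cannot leak into the next.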
5. The Core GA Engine (main_ga.py)
This is where the evolutionary process happens. It receives the ml_grid_object and executes the genetic algorithm:
It uses the modelFuncList to create a population of diverse base learners.
It evolves ensembles (individuals) over multiple generations using selection, crossover, and mutation.
It evaluates the fitness of each ensemble using the validation data within the ml_grid_object.
It logs the results of the run to final_grid_score_log.csv.
Evolution Process
Initialize: Create population of size pop_params[i]
Evaluate: Calculate AUC for each individual
Select: Tournament selection (size t_size)
Mate: Crossover with probability cxpb
Mutate: Mutation with probability mutpb
Replace: New generation replaces old population
Repeat: Until max generations or early stopping
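The replace-and-repeat part of this process, including early stopping, can be sketched as a generic loop. The names `evolve`, `make_next_generation`, and `patience` are assumptions for illustration; the selection, crossover, and mutation steps are passed in as a single callable, so nothing here reflects how main_ga.py implements them.

```python
def evolve(population, evaluate, make_next_generation, max_generations, patience):
    """Run until max_generations, or until `patience` generations pass without improvement."""
    best_score = max(evaluate(ind) for ind in population)
    history = [best_score]          # best fitness recorded per generation
    stale = 0
    for _ in range(max_generations):
        # Select, mate, and mutate; the result replaces the old population.
        population = make_next_generation(population)
        gen_best = max(evaluate(ind) for ind in population)
        history.append(gen_best)
        if gen_best > best_score:
            best_score, stale = gen_best, 0
        else:
            stale += 1
            if stale >= patience:
                break               # early stopping
    return best_score, history

# Toy run: fitness rises by one per generation and plateaus at 3.
best, history = evolve(
    population=[0],
    evaluate=lambda ind: min(ind, 3),
    make_next_generation=lambda pop: [ind + 1 for ind in pop],
    max_generations=50,
    patience=2,
)
```

In the toy run the loop stops after two generations without improvement, well before the 50-generation cap.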
6. Model Generators (ml_grid/model_classes_ga/)
Each file in this directory defines a “generator” for a specific machine learning model (e.g., XGBoost, LogisticRegression). A generator is a class that knows how to:
Define the hyperparameter search space for its model (using hyperopt).
Instantiate its model with a given set of hyperparameters.
The GA engine uses these generators to create the base learners that form the building blocks of the ensembles. This design makes the framework highly extensible. See Adding a New Base Learner.
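A generator along these lines might look like the following sketch. Everything in it is illustrative: `ThresholdClassifier` and `ThresholdGenerator` are hypothetical classes, and plain lists of options stand in for the hyperopt expressions a real generator would use.

```python
import random
from typing import Any, Dict

class ThresholdClassifier:
    """Toy model: predicts 1 when the chosen feature exceeds a threshold."""
    def __init__(self, feature_index: int, threshold: float):
        self.feature_index = feature_index
        self.threshold = threshold

    def predict(self, rows):
        return [int(row[self.feature_index] > self.threshold) for row in rows]

class ThresholdGenerator:
    """Hypothetical generator exposing a search space and a model factory."""
    def hyperparameter_search_space(self) -> Dict[str, Any]:
        # Plain lists stand in for hyperopt search-space expressions.
        return {"feature_index": [0, 1], "threshold": [0.0, 0.5, 1.0]}

    def generate_model(self, params: Dict[str, Any]):
        return ThresholdClassifier(**params)

generator = ThresholdGenerator()
space = generator.hyperparameter_search_space()
rng = random.Random(0)
params = {name: rng.choice(options) for name, options in space.items()}
model = generator.generate_model(params)
```

Because the caller only needs these two methods, a new model type can be added without touching the code that samples parameters or builds ensembles.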
Base Learner Interface
All model generators implement:
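from typing import Dict

class BaseGenerator:
    def hyperparameter_search_space(self) -> Dict:
        """Returns hyperparameter search space for this model."""
        raise NotImplementedError  # each model generator overrides this

    def generate_model(self, params: Dict):
        """Instantiates model with given hyperparameters."""
        raise NotImplementedError  # each model generator overrides this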
7. The Analysis Layer (GA_results_explorer)
After all grid search iterations are complete, this class is used to analyze the results. It reads the final_grid_score_log.csv file and generates a suite of plots to help you understand:
Which hyperparameters were most impactful.
Which base learners performed best.
The convergence behavior of the GA.
See Interpreting Experiment Results.
Analysis Outputs
| Output | Description |
|---|---|
| final_grid_score_log.csv | All experiment results with AUC, params |
| | Fitness evolution per iteration |
| | Saved best ensembles for deployment |
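A first look at which settings were most impactful can be taken with the standard `csv` module. The column names (`resample`, `scale`, `auc`) and the toy rows below are assumptions made for this sketch; the real log's schema may differ, and this is not the GA_results_explorer API.

```python
import csv
import io
from collections import defaultdict

def mean_auc_by(rows, parameter):
    """Average the auc column for each distinct value of one parameter column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[parameter]].append(float(row["auc"]))
    return {value: sum(scores) / len(scores) for value, scores in groups.items()}

# Toy log content standing in for the results file.
log_text = (
    "resample,scale,auc\n"
    "undersample,True,0.70\n"
    "oversample,True,0.80\n"
    "oversample,False,0.76\n"
)
rows = list(csv.DictReader(io.StringIO(log_text)))
summary = mean_auc_by(rows, "resample")
best_setting = max(summary, key=summary.get)
```

Grouping by one column at a time gives a quick marginal view; it does not account for interactions between parameters.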
8. The Validation Layer (EnsembleEvaluator)
This is the final step. The EnsembleEvaluator takes the best models identified during the experiment and evaluates them on the hold-out test set—data that was never seen during the entire GA process. This provides a final, unbiased measure of the models’ generalization performance.
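AUC is the fitness metric used during evolution, and a hold-out evaluation could reasonably report it as well. The sketch below computes AUC from its pairwise definition; the function name `roc_auc` is hypothetical and this page does not confirm which metrics EnsembleEvaluator reports.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hold-out labels and the scores a model assigned to them (toy values).
holdout_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of rows, so it suits small illustrations; library implementations use sorting-based methods instead.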
Data Flow Diagrams
Feature Transformation Log
The data pipeline tracks feature changes at each step:
Step | Before | After | Removed | Reason
------------------------|--------|-------|---------|--------------------
Initial Load | 100 | 100 | 0 | -
Drop Correlated | 100 | 92 | 8 | corr > 0.95
Drop Missing | 92 | 90 | 2 | >99% missing
Drop Constants | 90 | 85 | 5 | Zero variance
Post-Split Clean | 85 | 83 | 2 | Became constant after split
Final | | | |
------------------------|--------|-------|---------|--------------------
Total dropped: | | | 17 | from initial 100 features
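A log like the one above can be produced by counting columns before and after each filtering step. The sketch below assumes columns are stored as a dict of lists with `None` marking missing values; `run_with_log`, `mostly_present`, and `not_constant` are hypothetical helpers, not the pipeline's own methods.

```python
def run_with_log(columns, steps):
    """Apply each (name, keep_fn) step and record feature counts before and after."""
    log = []
    for name, keep_fn in steps:
        before = len(columns)
        columns = {col: values for col, values in columns.items() if keep_fn(values)}
        log.append({"step": name, "before": before,
                    "after": len(columns), "removed": before - len(columns)})
    return columns, log

def mostly_present(values, max_missing=0.99):
    """Keep a column unless more than max_missing of its values are None."""
    return sum(v is None for v in values) / len(values) <= max_missing

def not_constant(values):
    """Keep a column only if it has more than one distinct non-missing value."""
    return len({v for v in values if v is not None}) > 1

columns = {"age": [30, 41, 58], "site": [1, 1, 1], "empty": [None, None, None]}
kept, log = run_with_log(columns, [("Drop Missing", mostly_present),
                                   ("Drop Constants", not_constant)])
```

Each log entry carries the same Before, After, and Removed counts as the table, so the total dropped is the sum of the `removed` values.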
Hyperparameter Grid Visualization
The grid defines a multi-dimensional search space:
graph TD
subgraph "Weighted"
A["ann"] --> D[Combination 1]
B["de"] --> E[Combination 2]
C["unweighted"] --> F[Combination 3]
end
subgraph "Resample"
G["undersample"] --> D
H["oversample"] --> E
I[null] --> F
end
subgraph "Correlation"
J[0.95] --> D
K[0.98] --> E
L[0.90] --> F
end
style D fill:#c9f7c9,stroke:#333
style E fill:#c9f7c9,stroke:#333
style F fill:#c9f7c9,stroke:#333
This modular structure allows each part of the system to be understood and modified independently, from adding a new model to changing the analysis plots.