API Reference
This document provides a comprehensive reference for the public API of the Ensemble Genetic Algorithm project.
Core Entry Points
ml_grid.pipeline.main_ga.run
The primary orchestrator for running the genetic algorithm evolution process.
Class: run
Orchestrates the main Genetic Algorithm (GA) evolution process.
Module: ml_grid.pipeline.main_ga
Instantiation Signature:
main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)
Usage:
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters
# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)
# Execute GA; execute() returns the list of errors encountered
result = main_ga.run(
    ml_grid_object,
    local_param_dict={'cxpb': 0.8},
    global_params=global_params
).execute()
See also: Configuration Guide for comprehensive configuration options and Pipeline_API.md for pipeline workflow details.
Attributes:
| Attribute | Type | Description |
|---|---|---|
| | | Configuration object with experiment settings |
| | | Experiment object containing data and hyperparameters |
| | | Logging verbosity level |
| | | Flag for error handling behavior |
| | | List of ensemble sizes to try |
| | | List of population sizes to try |
| | | List of generation counts to try |
| | | Path for storing experiment logs and artifacts |
Methods:
execute() → List[List]
Executes the full genetic algorithm process for all GA parameter combinations.
Returns: List[List]
A list of errors encountered during execution
Each item contains:
[model_implementation, exception, traceback]
Behavior:
Iterates through grid of GA hyperparameters (nb, pop, g)
Registers genetic operators with DEAP toolbox
Creates initial population of candidate ensembles
Runs evolutionary loop (selection, crossover, mutation)
Tracks best-performing ensemble per configuration
Implements early stopping if performance stagnates
Evaluates final ensemble on hold-out validation set
Logs all results to disk
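The first and last steps of the behavior list above (iterating over the hyperparameter grid and collecting errors rather than aborting) can be sketched with the standard library. This is a minimal illustration, not the project's implementation: `run_one_configuration` and `execute_sketch` are hypothetical names, the failure is simulated, and the example lists mirror the defaults given in the Configuration YAML Schema section.

```python
import itertools
import traceback

# Example grids; values mirror the defaults listed in the YAML schema.
nb_params = [8, 16, 24]   # ensemble sizes
pop_params = [64, 128]    # population sizes
g_params = [100]          # generation counts


def run_one_configuration(nb, pop, g):
    """Hypothetical stand-in for one GA run; fails for one combination."""
    if nb == 16 and pop == 128:
        raise RuntimeError("simulated failure")
    return {"nb": nb, "pop": pop, "g": g}


def execute_sketch():
    """Try every (nb, pop, g) combination and collect errors instead of aborting."""
    results, errors = [], []
    for nb, pop, g in itertools.product(nb_params, pop_params, g_params):
        try:
            results.append(run_one_configuration(nb, pop, g))
        except Exception as exc:
            # Same three-field shape as the documented error items; the first
            # field here is only a label for the failing configuration.
            errors.append([f"nb={nb},pop={pop},g={g}", exc, traceback.format_exc()])
    return results, errors


results, errors = execute_sketch()
print(len(results), "configurations succeeded;", len(errors), "failed")
```

Collecting errors this way lets a long grid run finish even when a single configuration fails, at the cost of having to inspect the returned list afterwards.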
Configuration APIs
ml_grid.util.global_params
Central configuration object for experiments.
Class: global_parameters
Controls overall experiment behavior and settings.
Module: ml_grid.util.global_params
Instantiation Signature:
global_parameters(
    config_path=None,
    **kwargs
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | | None | Path to YAML configuration file |
| **kwargs | any | - | Runtime parameter overrides |
Available Parameters (from config file):
| Parameter | Type | Example | Description |
|---|---|---|---|
| input_csv_path | str | | Path to input dataset |
| n_iter | int | 20 | Number of grid search iterations |
| model_list | List[str] | | Base learners to use |
| verbose | int | 2 | Logging verbosity (0-15) |
| grid_n_jobs | int | 8 | Parallel jobs for grid search |
| base_project_dir | str | | Output directory |
| testing | bool | False | Use smaller test grid |
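One way to picture how file values and runtime overrides combine is a simple precedence chain: built-in defaults, then the config file, then `**kwargs`. The class below is a hypothetical stand-in, not the real `global_parameters`; the precedence order and the `file_values` argument (a dict standing in for a parsed YAML file) are assumptions, and the default values are the ones listed in the Configuration YAML Schema section.

```python
# Defaults taken from the YAML schema in this document.
DEFAULTS = {
    "n_iter": 20,
    "verbose": 2,
    "grid_n_jobs": 8,
    "base_project_dir": "HFE_GA_experiments",
    "testing": False,
    "test_sample_n": 0,
}


class GlobalParametersSketch:
    """Hypothetical illustration of default -> file -> kwargs precedence."""

    def __init__(self, file_values=None, **kwargs):
        merged = dict(DEFAULTS)
        merged.update(file_values or {})  # values parsed from the config file
        merged.update(kwargs)             # runtime overrides win
        for name, value in merged.items():
            setattr(self, name, value)


params = GlobalParametersSketch(
    file_values={"input_csv_path": "data/dataset.csv", "n_iter": 50},
    verbose=0,
)
print(params.n_iter, params.verbose, params.grid_n_jobs)  # 50 0 8
```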
ml_grid.util.grid_param_space_ga.Grid
Defines the hyperparameter search space for experiments.
Class: Grid
Creates parameter grids for systematic exploration of the configuration space.
Module: ml_grid.util.grid_param_space_ga
Instantiation Signature:
Grid(
    global_params,
    config_path=None,
    test_grid=False
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| global_params | | Required | Experiment configuration |
| config_path | | None | Override config file path |
| test_grid | | False | Use smaller test grid |
Attributes:
| Attribute | Type | Description |
|---|---|---|
| settings_list_iterator | | Yields parameter combinations |
| | | Ensemble sizes: |
| | | Population sizes: |
| | | Generation counts: |
Usage:
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
global_params = global_parameters(config_path='config.yml')
grid = Grid(global_params=global_params)
# Iterate through parameter combinations
for i in range(20):
    params = next(grid.settings_list_iterator)
    # params contains: weighted, resample, corr, etc.
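Calling `next()` a fixed number of times raises `StopIteration` if the iterator is finite and runs out first. The sketch below builds a stand-in iterator from a small, made-up parameter space and uses `itertools.islice` to take at most `n_iter` settings safely. The keys `weighted`, `resample`, and `corr` come from the usage comment above; the candidate values, the shuffling, and the helper name `make_settings_iterator` are assumptions for illustration only.

```python
import itertools
import random

# Made-up search space for illustration; not the library's real grid.
param_space = {
    "weighted": ["unweighted", "de"],
    "resample": [None, "undersample"],
    "corr": [0.9, 0.95],
}


def make_settings_iterator(space, seed=0):
    """Return an iterator over all parameter combinations in shuffled order."""
    keys = list(space)
    combos = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    random.Random(seed).shuffle(combos)
    return iter(combos)


settings_list_iterator = make_settings_iterator(param_space)

# islice stops quietly when the iterator is exhausted, unlike bare next().
n_iter = 20
settings = list(itertools.islice(settings_list_iterator, n_iter))
print(len(settings))  # 8 combinations exist, so only 8 are returned
```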
Pipeline Classes
Data Pipeline
Class: pipe
The main data processing pipeline for an ML grid experiment.
Module: ml_grid.pipeline.data
Instantiation Signature:
data.pipe(
    global_params,
    file_name,
    drop_term_list,
    local_param_dict,
    base_project_dir,
    param_space_index,
    additional_naming=None,
    test_sample_n=0,
    column_sample_n=0,
    config_dict=None,
    testing=False,
    multiprocessing_ensemble=False
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| global_params | | Required | Global configuration object |
| file_name | | Required | Path to input CSV file |
| drop_term_list | | Required | List of substrings for column removal |
| local_param_dict | | Required | Current iteration's parameters |
| base_project_dir | | Required | Output directory path |
| param_space_index | | Required | Index in parameter grid |
| additional_naming | | None | Optional string for log folder identification |
| test_sample_n | | 0 | Number of rows to sample (0 = all) |
| column_sample_n | | 0 | Number of columns to sample (0 = all) |
| config_dict | | None | GA configuration options |
| testing | | False | Enable testing/debug mode |
| multiprocessing_ensemble | | False | Enable multiprocessing for ensemble |
Returns: ml_grid_object
Pipeline Steps:
The pipe class performs the following steps:
Load data from CSV file with optional sampling
Select features based on configuration
Apply safety net if all features have been pruned
Create X and y variables
Split data into train/test/validation sets
Apply post-split cleaning to prevent data leakage
Optionally scale features using StandardScaler
Select features by importance if configured
Attributes:
| Attribute | Type | Description |
|---|---|---|
| | | Main DataFrame holding data |
| | | Feature DataFrames for different splits |
| | | Target Series for different splits |
| | | List of columns to remove |
| | | Final feature column names |
| | | List of model generator functions |
| feature_transformation_log | | Log of feature counts at each pipeline step |
Edge Cases and Special Behavior:
Handling Empty Feature Sets
The data pipeline includes safety mechanisms to prevent failures when all features are pruned during data cleaning:
Safety Net: If the feature selection process removes all features (e.g., due to the correlation threshold corr, the missing-value threshold percent_missing, or constant column removal), the pipeline activates a safety net that retains 1-2 numeric non-outcome columns from the original dataset.
NoFeaturesError Exception: If the safety net cannot retain any features (e.g., the dataset is empty or contains only the outcome variable), a NoFeaturesError exception is raised with a message indicating the root cause.
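The safety-net behavior can be illustrated with plain Python. This is a sketch under assumptions: columns are represented as a dict of lists, `NoFeaturesError` is redefined locally as a placeholder for the library's exception, and the rule of retaining the first two numeric columns is illustrative rather than the library's exact selection logic.

```python
class NoFeaturesError(Exception):
    """Local placeholder for the library's exception of the same name."""


def apply_safety_net(original_columns, surviving_features, outcome, keep=2):
    """Return surviving features, falling back to a few original numeric columns."""
    if surviving_features:
        return list(surviving_features)
    # Fallback: numeric, non-outcome columns from the original data.
    numeric = [
        name
        for name, values in original_columns.items()
        if name != outcome
        and values
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    ]
    if not numeric:
        raise NoFeaturesError("No numeric non-outcome columns available to retain.")
    return numeric[:keep]


data = {
    "age": [34, 51, 29],
    "bmi": [22.1, 30.4, 27.8],
    "notes": ["a", "b", "c"],
    "outcome": [0, 1, 0],
}

# All features were pruned upstream, so the safety net restores two columns.
print(apply_safety_net(data, [], outcome="outcome"))  # ['age', 'bmi']
```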
Empty Selection via the n_features Parameter
When the n_features parameter in local_param_dict results in zero selected features:
Scenario: The feature importance method (ANOVA F-test or Markov Blanket) selects fewer features than requested, possibly resulting in an empty set if no features pass statistical significance tests.
Behavior:
If the selected feature count reaches zero during _select_features_by_importance(), a NoFeaturesError is raised with the message: "Feature importance selection removed all features."
This occurs after post-split cleaning, which may eliminate columns that became constant.
Handling Edge Cases Programmatically
Example of how to handle empty feature scenarios:
import logging
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters
from ml_grid.pipeline.data import NoFeaturesError
logger = logging.getLogger(__name__)
try:
    # Setup
    global_params = global_parameters(config_path='config.yml')
    # Feature selection settings for this run
    local_param_dict = {
        'corr': 0.95,  # Moderate correlation threshold
        'percent_missing': 100.0,  # Allow some missing values (90-100%)
        'n_features': 3,  # Request 3 features, but may get fewer
        'feature_selection_method': 'anova'  # or 'markov_blanket'
    }
    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict=local_param_dict,
        base_project_dir="HFE_GA_experiments",
        param_space_index=0,
    )
except NoFeaturesError as e:
    # Handle the edge case gracefully
    logger.warning(f"Feature selection failed: {e}")
    # Fallback strategies:
    # 1. Relax feature selection parameters
    relaxed_params = {
        'n_features': 'all',  # Use all remaining features
        'corr': 0.99,  # Higher correlation threshold drops fewer columns
        'percent_missing': 100.0  # Most lenient missing-value threshold
    }
    # 2. Or disable feature importance selection temporarily
    safe_params = {
        'n_features': 'all',  # Explicitly request all
        'feature_selection_method': None
    }
except ValueError as e:
    # Handle data type errors (non-numeric columns, string values)
    logger.error(f"Data validation error: {e}")
Best Practices for Robust Experiments:
Use Safety Net: Keep the default safety net enabled, and set n_features to a value other than 'all' only when you have sufficient features.
Monitor Feature Logs: Check the feature_transformation_log attribute after pipeline execution to track feature counts at each step.
Gradual Aggression: Start with lenient thresholds (higher percent_missing, higher corr, so that fewer columns are dropped) and gradually increase aggressiveness.
Validate Input Data: Ensure your dataset has sufficient numeric features beyond the outcome variable before running experiments.
Accessing Feature Transformation Log
After pipeline execution, examine which features were removed at each step:
# After creating ml_grid_object
print(ml_grid_object.feature_transformation_log)
# Example output:
# step features_before features_after features_changed description
# 0 Initial Load 50 50 0 Initial data loaded.
# 1 Feature Selection 50 48 -2 Selected columns based on feature toggles
# 2 Drop Correlated 48 45 -3 Dropped columns with correlation > 0.95
# 3 Drop Missing 45 45 0 Dropped columns with > 99% missing
# 4 Drop Other Outcomes 45 45 0 Removed other potential outcome variables
# 5 Drop Constants 45 38 -7 Removed constant columns
This log helps diagnose why features were removed and enables better parameter tuning for future runs.
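A log with the columns shown above can be built by recording the feature count before and after each step. The helper below is a hypothetical sketch: the function name `log_step`, the list-of-dicts storage, and the counts are assumptions, while the column names follow the example output.

```python
feature_transformation_log = []


def log_step(step, before, after, description):
    """Append one row with the same columns as the example output above."""
    feature_transformation_log.append(
        {
            "step": step,
            "features_before": before,
            "features_after": after,
            "features_changed": after - before,
            "description": description,
        }
    )


log_step("Initial Load", 50, 50, "Initial data loaded.")
log_step("Drop Correlated", 50, 47, "Dropped columns with correlation > 0.95")
log_step("Drop Constants", 47, 40, "Removed constant columns")

# Find the step that removed the most features.
worst = min(feature_transformation_log, key=lambda row: row["features_changed"])
print(worst["step"], worst["features_changed"])  # Drop Constants -7
```

Sorting or filtering on `features_changed` in this way points directly at the threshold that is worth relaxing first.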
GA Pipeline
Class: run
The primary orchestrator for running the genetic algorithm evolution process.
Module: ml_grid.pipeline.main_ga
Instantiation Signature:
main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)
Usage:
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters
# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)
# Execute GA; execute() returns the list of errors encountered
result = main_ga.run(
    ml_grid_object,
    local_param_dict={'cxpb': 0.8},
    global_params=global_params
).execute()
Attributes:
| Attribute | Type | Description |
|---|---|---|
| | | Configuration object with experiment settings |
| | | Experiment object containing data and hyperparameters |
| | | Logging verbosity level (0-15) |
| | | Flag for error handling behavior |
| | | List of ensemble sizes to try |
| | | List of population sizes to try |
| | | List of generation counts to try |
| | | Path for storing experiment logs and artifacts |
Methods:
execute() → List[List]
Executes the full genetic algorithm process for all GA parameter combinations.
Returns: List[List]
A list of errors encountered during execution
Each item contains:
[model_implementation, exception, traceback]
Behavior:
Iterates through grid of GA hyperparameters (nb, pop, g)
Registers genetic operators with DEAP toolbox
Creates initial population of candidate ensembles
Runs evolutionary loop (selection, crossover, mutation)
Tracks best-performing ensemble per configuration
Implements early stopping if performance stagnates
Evaluates final ensemble on hold-out validation set
Logs all results to disk
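The evolutionary loop and early-stopping steps in the list above can be sketched in pure Python. This is a toy illustration rather than the pipeline's DEAP-based implementation: `cxpb` appears in this document's usage example, but `mutpb`, `patience`, the tournament size of 3, the one-point crossover, and the toy fitness function are all assumptions.

```python
import random


def evolve(fitness, n_genes, gene_pool, pop_size=20, generations=50,
           cxpb=0.8, mutpb=0.2, patience=5, seed=0):
    """Toy GA: tournament selection, one-point crossover, single-gene mutation."""
    rng = random.Random(seed)
    population = [[rng.choice(gene_pool) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    stale = 0
    generations_run = 0
    for _ in range(generations):
        generations_run += 1
        # Selection: each slot is filled by the winner of a 3-way tournament.
        parents = [max(rng.sample(population, 3), key=fitness) for _ in range(pop_size)]
        offspring = [list(p) for p in parents]
        # Crossover: swap tails of neighbouring pairs with probability cxpb.
        for a, b in zip(offspring[::2], offspring[1::2]):
            if rng.random() < cxpb:
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: replace one gene with probability mutpb.
        for child in offspring:
            if rng.random() < mutpb:
                child[rng.randrange(n_genes)] = rng.choice(gene_pool)
        population = offspring
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = list(champion), 0
        else:
            stale += 1
        # Early stopping: give up after `patience` generations without improvement.
        if stale >= patience:
            break
    return best, generations_run


# Toy fitness: genes are integer "learner ids" and larger sums score higher.
best, generations_run = evolve(fitness=sum, n_genes=4, gene_pool=list(range(10)))
print(best, sum(best), generations_run)
```

In the real pipeline the fitness would be an ensemble score such as AUC; the stagnation counter is the part that corresponds to early stopping.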
Model Classes
Base Learner Interface
All base learner generators follow a consistent interface pattern (see Model_API.md for comprehensive model generator reference).
Generation Function Signature:
def model_nameModelGenerator(
    ml_grid_object: Any,
    local_param_dict: Dict
) -> Tuple[float, ModelClass, List[str], int, float, np.ndarray]:
    """Generates, trains, and evaluates a model.

    Args:
        ml_grid_object: Contains X_train, y_train, X_test, y_test and config
        local_param_dict: Parameters for this specific run

    Returns:
        Tuple of (mccscore, model, feature_names, train_time, auc_score, y_pred)
    """
Return Values:
| Index | Type | Description |
|---|---|---|
| 0 | float | Matthews Correlation Coefficient (MCC) |
| 1 | ModelClass | Trained model object |
| 2 | List[str] | List of feature names used for training |
| 3 | int | Model training time in seconds |
| 4 | float | ROC AUC score |
| 5 | np.ndarray | Model predictions on test set |
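To make the six-element return value concrete, the sketch below defines a dummy generator that follows the same interface and unpacks its result. Everything here is illustrative: `dummyModelGenerator` and its threshold-rule "model" are hypothetical, a plain list stands in for `np.ndarray`, the AUC is a fixed placeholder, and the MCC is computed by hand from the confusion matrix.

```python
import math
import time
from types import SimpleNamespace


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels in {0, 1}; returns 0.0 when undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def dummyModelGenerator(ml_grid_object, local_param_dict):
    """Hypothetical generator: applies a threshold rule to the first feature."""
    start = time.time()
    threshold = local_param_dict.get("threshold", 0.5)

    def model(rows):
        return [int(row[0] > threshold) for row in rows]

    y_pred = model(ml_grid_object.X_test)
    mccscore = matthews_corrcoef(ml_grid_object.y_test, y_pred)
    train_time = int(time.time() - start)
    auc_score = 0.5  # placeholder; a real generator would compute ROC AUC
    return mccscore, model, ["feature_0"], train_time, auc_score, y_pred


ml_grid_object = SimpleNamespace(
    X_test=[[0.1], [0.9], [0.8], [0.2]],
    y_test=[0, 1, 1, 0],
)
mccscore, model, feature_names, train_time, auc_score, y_pred = dummyModelGenerator(
    ml_grid_object, {"threshold": 0.5}
)
print(mccscore, feature_names, y_pred)  # 1.0 ['feature_0'] [0, 1, 1, 0]
```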
Available Models:
AdaBoostClassifierModelGenerator
DecisionTreeClassifierModelGenerator
elasticNeuralNetworkModelGenerator
extraTreesModelGenerator
GaussianNB_ModelGenerator
GradientBoostingClassifier_ModelGenerator
kNearestNeighborsModelGenerator
logisticRegressionModelGenerator
MLPClassifier_ModelGenerator
perceptronModelGenerator
Pytorch_binary_class_ModelGenerator
QuadraticDiscriminantAnalysis_ModelGenerator
randomForestModelGenerator
SVC_ModelGenerator
XGBoostModelGenerator
Classification Models
Function: logisticRegressionModelGenerator
Generates, trains, and evaluates a logistic regression classifier.
Module: ml_grid.model_classes_ga.logistic_regression_model
Generation Signature:
lr_generator = logisticRegressionModelGenerator(
    ml_grid_object,
    local_param_dict
)
Function: randomForestModelGenerator
Generates, trains, and evaluates a random forest classifier.
Module: ml_grid.model_classes_ga.randomForest_model
Generation Signature:
rf_generator = randomForestModelGenerator(
    ml_grid_object,
    local_param_dict
)
Function: XGBoostModelGenerator
Generates, trains, and evaluates an XGBoost classifier.
Module: ml_grid.model_classes_ga.XGBoost_model
Generation Signature:
xgb_generator = XGBoostModelGenerator(
    ml_grid_object,
    local_param_dict
)
Genetic Algorithm APIs
Evaluation Methods
Function: get_y_pred_resolver
Resolves and generates predictions for ensemble evaluation.
Module: ml_grid.pipeline.evaluate_methods_ga
Signature:
y_pred = get_y_pred_resolver(
    individual,
    ml_grid_object,
    valid=False
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| individual | | Required | Ensemble configuration (DEAP individual format) |
| ml_grid_object | | Required | Experiment object with data splits |
| valid | | False | If True, predict on validation set |
Returns: Union[List, np.ndarray]
Final ensemble predictions
Function: evaluate_weighted_ensemble_auc
Main fitness evaluation function for genetic algorithm.
Module: ml_grid.pipeline.evaluate_methods_ga
Signature:
fitness = evaluate_weighted_ensemble_auc(
    individual,
    ml_grid_object
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| individual | | Ensemble to evaluate |
| ml_grid_object | | Experiment object with data and configuration |
Returns: Tuple[float]
Single-element tuple containing fitness score (AUC or diversity-penalized AUC)
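DEAP expects fitness values as tuples, which is why a single score is wrapped in a one-element tuple. The sketch below shows that shape with a pure-Python, pairwise-comparison ROC AUC. It is not the library's `evaluate_weighted_ensemble_auc`: the simple score-averaging ensemble, the helper names, and the sample scores are assumptions, and no diversity penalty is applied.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def evaluate_ensemble_sketch(member_scores, y_true):
    """Average member scores, then return the AUC as a one-element tuple."""
    n_members = len(member_scores)
    averaged = [sum(col) / n_members for col in zip(*member_scores)]
    return (roc_auc(y_true, averaged),)


y_true = [0, 0, 1, 1]
member_scores = [
    [0.1, 0.4, 0.35, 0.8],  # scores from a first hypothetical base learner
    [0.2, 0.3, 0.6, 0.7],   # scores from a second hypothetical base learner
]
fitness = evaluate_ensemble_sketch(member_scores, y_true)
print(fitness)  # (1.0,)
```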
Mutation Methods
Function: baseLearnerGenerator
Generates a random base learner.
Module: ml_grid.pipeline.mutate_methods
Signature:
new_learner = baseLearnerGenerator(ml_grid_object)
Function: mutateEnsemble
Mutates an ensemble by replacing one base learner.
Module: ml_grid.pipeline.mutate_methods
Signature:
mutated_individual = mutateEnsemble(
    individual,
    ml_grid_object
)
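Replacing one base learner can be pictured as swapping a single position in a list. The sketch is hypothetical: an individual is modelled as a plain list of learner names, `learner_factory` stands in for `baseLearnerGenerator`, and the pool of names is made up.

```python
import random


def mutate_ensemble_sketch(individual, learner_factory, rng=random):
    """Return a copy of `individual` with one randomly chosen member replaced."""
    mutated = list(individual)
    position = rng.randrange(len(mutated))
    mutated[position] = learner_factory()
    return mutated


rng = random.Random(42)
pool = ["logistic", "random_forest", "xgboost", "svc"]
individual = ["logistic", "logistic", "svc"]

mutated = mutate_ensemble_sketch(individual, lambda: rng.choice(pool), rng=rng)
print(individual, "->", mutated)
```

Working on a copy keeps the parent intact, which matters when the same parent is selected more than once in a generation.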
Weighting Methods
Function: get_unweighted_ensemble_predictions
Generates predictions by majority voting (mode).
Module: ml_grid.ga_functions.ga_unweighted
Signature:
predictions = get_unweighted_ensemble_predictions(
    best,
    ml_grid_object,
    valid=False
)
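Majority voting takes, for each sample, the label predicted by the most base learners. A minimal standard-library sketch follows; the function name `majority_vote` and the sample predictions are illustrative and not part of the library.

```python
from collections import Counter


def majority_vote(member_predictions):
    """Combine per-learner label lists by taking the mode for each sample."""
    combined = []
    for votes in zip(*member_predictions):
        # most_common(1) returns the label with the highest count; ties go to
        # the label that appeared first among the votes.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


member_predictions = [
    [1, 0, 1, 0],  # learner A
    [1, 1, 0, 0],  # learner B
    [0, 1, 1, 0],  # learner C
]
print(majority_vote(member_predictions))  # [1, 1, 1, 0]
```

Using an odd number of learners avoids ties in binary classification.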
Function: find_ensemble_weights_de
Finds optimal weights for ensemble using Differential Evolution.
Module: ml_grid.ga_functions.ga_ensemble_weight_finder_de
Signature:
weights = find_ensemble_weights_de(
    ensemble,
    ml_grid_object,
    valid=False
)
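The idea of searching for ensemble weights with Differential Evolution can be sketched without external dependencies. The code below is a minimal DE/rand/1/bin maximiser written for illustration, not the library's `find_ensemble_weights_de`. The accuracy objective, the 0.5 decision threshold, the control parameters `f` and `cr`, and the sample scores are all assumptions. In practice a tested optimiser such as `scipy.optimize.differential_evolution` (which minimises, so the objective would be negated) is a common choice.

```python
import random


def weighted_accuracy(weights, member_scores, y_true):
    """Accuracy of thresholding the weight-averaged member scores at 0.5."""
    total = sum(weights) or 1.0
    correct = 0
    for column, label in zip(zip(*member_scores), y_true):
        blended = sum(w * s for w, s in zip(weights, column)) / total
        correct += int((blended >= 0.5) == bool(label))
    return correct / len(y_true)


def differential_evolution_sketch(objective, dim, pop_size=15, generations=40,
                                  f=0.8, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin maximiser over the box [0, 1]^dim."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            others = [p for j, p in enumerate(population) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for k in range(dim):
                if k == forced or rng.random() < cr:
                    value = a[k] + f * (b[k] - c[k])
                    trial.append(min(1.0, max(0.0, value)))  # clip to bounds
                else:
                    trial.append(population[i][k])
            trial_score = objective(trial)
            if trial_score >= scores[i]:  # greedy replacement
                population[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


y_true = [0, 1, 1, 0, 1]
member_scores = [
    [0.2, 0.9, 0.8, 0.1, 0.7],  # a reliable hypothetical learner
    [0.9, 0.1, 0.2, 0.8, 0.3],  # an unreliable hypothetical learner
]
weights, score = differential_evolution_sketch(
    lambda w: weighted_accuracy(w, member_scores, y_true), dim=2
)
print([round(w, 2) for w in weights], score)
```

The search should favour the first learner, since its scores alone already separate the two classes.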
Result Analysis APIs
ml_grid.util.GA_results_explorer
Analyzes and visualizes experiment results.
Class: GA_results_explorer
Parses and visualizes GA experiment outcomes.
Module: ml_grid.util.GA_results_explorer
Instantiation:
from ml_grid.util.GA_results_explorer import GA_results_explorer
explorer = GA_results_explorer(
    base_project_dir="HFE_GA_experiments",
)
Methods:
| Method | Parameters | Description |
|---|---|---|
| | - | Generate fitness convergence plot |
| | - | Performance vs. ensemble size |
| | - | Feature/base learner importance |
ml_grid.util.evaluate_ensemble_methods
Ensemble evaluation utilities (see Model_API.md for base learner interface and Pipeline_API.md for GA pipeline context).
Class: EnsembleEvaluator
Evaluates ensembles on hold-out data.
Module: ml_grid.util.evaluate_ensemble_methods
Instantiation:
from ml_grid.util.evaluate_ensemble_methods import EnsembleEvaluator
evaluator = EnsembleEvaluator(
    base_project_dir="HFE_GA_experiments",
    X_train=None, y_train=None,
    X_test=None, y_test=None,
    store_base_learners=True,
)
Methods:
| Method | Description |
|---|---|
| | Evaluate best ensemble on test set |
Error Handling Reference
Common Exceptions
| Exception | Cause | Resolution |
|---|---|---|
| | Package not installed | Run |
| | Dataset not found | Check |
| | Missing values | Adjust |
Configuration YAML Schema
Complete list of configurable parameters:
global_params:
  input_csv_path: str         # Required, path to dataset
  n_iter: int                 # Default: 20, grid search iterations
  model_list: List[str]       # Required, base learner names
  verbose: int                # Default: 2, logging level (0-15)
  grid_n_jobs: int            # Default: 8, parallel jobs
  base_project_dir: str       # Default: "HFE_GA_experiments"
  testing: bool               # Default: False, use smaller grid
  test_sample_n: int          # Default: 0, no sampling
ga_params:
  nb_params: List[int]        # Default: [8, 16, 24], ensemble sizes
  pop_params: List[int]       # Default: [64, 128], population sizes
  g_params: List[int]         # Default: [100], generation counts
grid_params:
  weighted: List[str]         # Default: ["unweighted"], methods
  resample: List[str | None]  # Default: [None], class imbalance handling
  corr: List[float]           # Default: [0.95], feature correlation threshold
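Before launching a long run it can help to check a parsed configuration against the schema above. The sketch below assumes the YAML has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`); the `validate_global_params` helper is hypothetical, covers only the `global_params` section, and the model names in the example are made up.

```python
REQUIRED = {"input_csv_path": str, "model_list": list}
OPTIONAL_DEFAULTS = {
    "n_iter": 20,
    "verbose": 2,
    "grid_n_jobs": 8,
    "base_project_dir": "HFE_GA_experiments",
    "testing": False,
    "test_sample_n": 0,
}


def validate_global_params(section):
    """Check required keys and types, then fill in documented defaults."""
    for key, expected_type in REQUIRED.items():
        if key not in section:
            raise KeyError(f"Missing required parameter: {key}")
        if not isinstance(section[key], expected_type):
            raise TypeError(f"{key} must be of type {expected_type.__name__}")
    resolved = dict(OPTIONAL_DEFAULTS)
    resolved.update(section)
    if not 0 <= resolved["verbose"] <= 15:
        raise ValueError("verbose must be between 0 and 15")
    return resolved


config = {
    "global_params": {
        "input_csv_path": "data/dataset.csv",
        "model_list": ["logisticRegression", "randomForest"],  # made-up names
        "verbose": 1,
    }
}
resolved = validate_global_params(config["global_params"])
print(resolved["n_iter"], resolved["verbose"])  # 20 1
```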
Python Requirement: Python >=3.12
This API documentation corresponds to version v1.0+ of the ensemble_genetic_algorithm package.
For the latest API reference, please visit our online documentation.