API Reference

This document provides a comprehensive reference for the public API of the Ensemble Genetic Algorithm project.

Core Entry Points

`ml_grid.pipeline.main_ga.run`

The primary orchestrator for running the genetic algorithm evolution process.

Class: `run`

Orchestrates the main Genetic Algorithm (GA) evolution process.

Module: ml_grid.pipeline.main_ga

Instantiation Signature:

main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)

Usage:

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()

See also: Configuration Guide for comprehensive configuration options and Pipeline_API.md for pipeline workflow details.

Attributes:

Attribute	Type	Description
`global_params`	`global_parameters`	Configuration object with experiment settings
`ml_grid_object`	`Any`	Experiment object containing data and hyperparameters
`verbose`	`int`	Logging verbosity level
`error_raise`	`bool`	Flag for error handling behavior
`nb_params`	`List[int]`	List of ensemble sizes to try
`pop_params`	`List[int]`	List of population sizes to try
`g_params`	`List[int]`	List of generation counts to try
`log_folder_path`	`str`	Path for storing experiment logs and artifacts

Methods:

`execute()` → `List[List]`

Executes the full genetic algorithm process for all GA parameter combinations.

Returns: List[List]

A list of errors encountered during execution
Each item contains: [model_implementation, exception, traceback]
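
The returned error list can be inspected after a run. The following is a minimal sketch, assuming each item follows the `[model_implementation, exception, traceback]` layout described above. The `summarize_errors` helper and the sample data are hypothetical and not part of the package.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_errors(errors: List[List[Any]]) -> Dict[str, int]:
    """Count errors by exception type name (item[1] is the exception)."""
    counts = Counter(type(item[1]).__name__ for item in errors)
    return dict(counts)


# Hypothetical sample in the documented item layout.
sample_errors = [
    ["logisticRegression", ValueError("Input contains NaN"), "Traceback ..."],
    ["randomForest", ValueError("Input contains NaN"), "Traceback ..."],
    ["XGBoost", MemoryError(), "Traceback ..."],
]
summary = summarize_errors(sample_errors)
print(summary)
```

An empty list means no errors were collected, so the helper returns an empty dictionary in that case.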

Behavior:

Iterates through grid of GA hyperparameters (nb, pop, g)
Registers genetic operators with DEAP toolbox
Creates initial population of candidate ensembles
Runs evolutionary loop (selection, crossover, mutation)
Tracks best-performing ensemble per configuration
Implements early stopping if performance stagnates
Evaluates final ensemble on hold-out validation set
Logs all results to disk
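
To illustrate the first step, the grid of GA hyperparameters can be pictured as the Cartesian product of the three lists. This is only a sketch: the values are the defaults listed for `Grid` elsewhere in this document, and the iteration order used by the package itself is an assumption.

```python
from itertools import product

# Default lists as documented for the Grid attributes.
nb_params = [8, 16, 24]   # ensemble sizes
pop_params = [64, 128]    # population sizes
g_params = [100]          # generation counts

# One GA run per (nb, pop, g) combination.
ga_grid = list(product(nb_params, pop_params, g_params))
for nb, pop, g in ga_grid:
    print(f"ensemble size={nb}, population={pop}, generations={g}")
```

With these defaults the grid contains 3 x 2 x 1 = 6 configurations.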

Configuration APIs

`ml_grid.util.global_params`

Central configuration object for experiments.

Class: `global_parameters`

Controls overall experiment behavior and settings.

Module: ml_grid.util.global_params

Instantiation Signature:

global_parameters(
    config_path=None,
    **kwargs
)

Parameters:

Parameter	Type	Default	Description
`config_path`	`str | None`	`None`	Path to YAML configuration file
`**kwargs`	`Any`	-	Runtime parameter overrides

Available Parameters (from config file):

Parameter	Type	Example	Description
`input_csv_path`	`str`	`"data/my.csv"`	Path to input dataset
`n_iter`	`int`	20	Number of grid search iterations
`model_list`	`List[str]`	`["logisticRegression"]`	Base learners to use
`verbose`	`int`	2	Logging verbosity (0-15)
`grid_n_jobs`	`int`	8	Parallel jobs for grid search
`base_project_dir`	`str`	`"HFE_GA_experiments"`	Output directory
`testing`	`bool`	False	Use smaller test grid
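
Since `**kwargs` are described as runtime overrides, one plausible precedence is defaults, then config file values, then keyword arguments. The sketch below illustrates that merge order with plain dictionaries. The `resolve_settings` function and the `DEFAULTS` mapping are hypothetical, and the actual precedence inside `global_parameters` is an assumption.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults, using values listed in this document.
DEFAULTS: Dict[str, Any] = {
    "n_iter": 20,
    "verbose": 2,
    "grid_n_jobs": 8,
    "testing": False,
}


def resolve_settings(file_config: Optional[Dict[str, Any]] = None,
                     **overrides: Any) -> Dict[str, Any]:
    """Merge settings so that: defaults < config file values < overrides."""
    settings = dict(DEFAULTS)
    settings.update(file_config or {})
    settings.update(overrides)
    return settings


# The dict stands in for a parsed config.yml; verbose=3 is a runtime override.
settings = resolve_settings({"n_iter": 50, "verbose": 1}, verbose=3)
print(settings)
```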

`ml_grid.util.grid_param_space_ga.Grid`

Defines the hyperparameter search space for experiments.

Class: `Grid`

Creates parameter grids for systematic exploration of the configuration space.

Module: ml_grid.util.grid_param_space_ga

Instantiation Signature:

Grid(
    global_params,
    config_path=None,
    test_grid=False
)

Parameters:

Parameter	Type	Default	Description
`global_params`	`global_parameters`	Required	Experiment configuration
`config_path`	`str | None`	`None`	Override config file path
`test_grid`	`bool`	`False`	Use smaller test grid

Attributes:

Attribute	Type	Description
`settings_list_iterator`	`Iterator[Dict]`	Yields parameter combinations
`nb_params`	`List[int]`	Ensemble sizes: `[8, 16, 24]`
`pop_params`	`List[int]`	Population sizes: `[64, 128]`
`g_params`	`List[int]`	Generation counts: `[100]`

Usage:

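from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid

global_params = global_parameters(config_path='config.yml')
grid = Grid(global_params=global_params)

# Draw the first 20 parameter combinations from the iterator
for i in range(20):
    params = next(grid.settings_list_iterator)
    # params contains: weighted, resample, corr, etc.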
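
For intuition, an iterator of parameter combinations can be built from a dictionary of lists. This is a sketch rather than the package's implementation: `iter_settings`, the shuffling, and the `"undersample"` value are assumptions, while the key names follow the `grid_params` section of the configuration schema.

```python
import random
from itertools import islice, product
from typing import Any, Dict, Iterator, List


def iter_settings(space: Dict[str, List[Any]], seed: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield every combination of the search space once, in shuffled order."""
    keys = list(space)
    combos = [dict(zip(keys, values)) for values in product(*space.values())]
    random.Random(seed).shuffle(combos)
    yield from combos


# Hypothetical search space with 1 x 2 x 2 = 4 combinations.
space = {
    "weighted": ["unweighted"],
    "resample": [None, "undersample"],
    "corr": [0.9, 0.95],
}

# islice stops cleanly if the iterator holds fewer items than requested.
first_three = list(islice(iter_settings(space), 3))
for params in first_three:
    print(params)
```

Using `islice` instead of repeated `next()` calls avoids a `StopIteration` when the requested count exceeds the number of combinations.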

Pipeline Classes

Data Pipeline

Class: `pipe`

The main data processing pipeline for an ML grid experiment.

Module: ml_grid.pipeline.data

Instantiation Signature:

data.pipe(
    global_params,
    file_name,
    drop_term_list,
    local_param_dict,
    base_project_dir,
    param_space_index,
    additional_naming=None,
    test_sample_n=0,
    column_sample_n=0,
    config_dict=None,
    testing=False,
    multiprocessing_ensemble=False
)

Parameters:

Parameter	Type	Default	Description
`global_params`	`global_parameters`	Required	Global configuration object
`file_name`	`str`	Required	Path to input CSV file
`drop_term_list`	`List[str]`	Required	List of substrings for column removal
`local_param_dict`	`Dict`	Required	Current iteration’s parameters
`base_project_dir`	`str`	Required	Output directory path
`param_space_index`	`int`	Required	Index in parameter grid
`additional_naming`	`str | None`	`None`	Optional string for log folder identification
`test_sample_n`	`int`	0	Number of rows to sample (0 = all)
`column_sample_n`	`int`	0	Number of columns to sample (0 = all)
`config_dict`	`Dict | None`	`None`	GA configuration options
`testing`	`bool`	False	Enable testing/debug mode
`multiprocessing_ensemble`	`bool`	False	Enable multiprocessing for ensemble

Returns: ml_grid_object

Pipeline Steps:

The pipe class performs the following steps:

Load data from CSV file with optional sampling
Select features based on configuration
Apply safety net if all features have been pruned
Create X and y variables
Split data into train/test/validation sets
Apply post-split cleaning to prevent data leakage
Optionally scale features using StandardScaler
Select features by importance if configured
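
The post-split cleaning step can be sketched as follows: columns are judged using the training split only, and the same decision is then applied to the test split, so no information flows from test to train. The column-dictionary representation and the `drop_train_constants` helper are illustrative assumptions, not the package's code.

```python
from typing import Dict, List, Tuple

Columns = Dict[str, List[float]]


def drop_train_constants(train: Columns, test: Columns) -> Tuple[Columns, Columns]:
    """Drop columns that are constant in the training split from both splits."""
    keep = [name for name, values in train.items() if len(set(values)) > 1]
    return ({name: train[name] for name in keep},
            {name: test[name] for name in keep})


# "flag" varies in the test split but is constant in train, so it is removed.
train = {"age": [30.0, 41.0, 55.0], "flag": [1.0, 1.0, 1.0]}
test = {"age": [62.0, 28.0], "flag": [0.0, 1.0]}
clean_train, clean_test = drop_train_constants(train, test)
print(list(clean_train), list(clean_test))
```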

Attributes:

Attribute	Type	Description
`df`	`pd.DataFrame`	Main DataFrame holding data
`X_train`, `X_test`, `X_test_orig`	`pd.DataFrame`	Feature DataFrames for different splits
`y_train`, `y_test`, `y_test_orig`	`pd.Series`	Target Series for different splits
`drop_list`	`List[str]`	List of columns to remove
`final_column_list`	`List[str]`	Final feature column names
`model_class_list`	`List`	List of model generator functions
`feature_transformation_log`	`pd.DataFrame`	Log of feature counts at each pipeline step

Edge Cases and Special Behavior:

Handling Empty Feature Sets

The data pipeline includes safety mechanisms to prevent failures when all features are pruned during data cleaning:

Safety Net: If feature selection removes every feature (for example, because of a strict correlation threshold `corr`, a strict missing-value threshold `percent_missing`, or constant-column removal), the pipeline activates a safety net that retains one or two numeric non-outcome columns from the original dataset.
`NoFeaturesError` Exception: If the safety net cannot retain any feature (for example, the dataset is empty or contains only the outcome variable), a `NoFeaturesError` is raised with a message indicating the root cause.
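
The two behaviors above can be sketched in a few lines. This is a hedged illustration: `apply_safety_net`, its arguments, and the local `NoFeaturesError` class are stand-ins written for this example, and the real pipeline may choose fallback columns differently.

```python
from typing import List


class NoFeaturesError(Exception):
    """Local stand-in for the package's exception of the same name."""


def apply_safety_net(selected: List[str], original_numeric: List[str],
                     outcome: str, keep: int = 2) -> List[str]:
    """Return the selected features, or a small fallback set if none survived."""
    if selected:
        return selected
    # Fall back to the first few numeric columns that are not the outcome.
    fallback = [col for col in original_numeric if col != outcome][:keep]
    if not fallback:
        raise NoFeaturesError("No numeric non-outcome columns available.")
    return fallback


rescued = apply_safety_net([], ["age", "bmi", "outcome", "hr"], outcome="outcome")
print(rescued)
```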

Empty Selection via n_features Parameter

When the `n_features` parameter in `local_param_dict` results in zero features being selected:

Scenario: The feature importance method (ANOVA F-test or Markov Blanket) selects fewer features than requested, and may select none if no feature passes its statistical significance test.
Behavior:
- If the selected feature count reaches zero during `_select_features_by_importance()`, a `NoFeaturesError` is raised with the message: "Feature importance selection removed all features."
- This check runs after post-split cleaning, which may itself eliminate columns that became constant.

Handling Edge Cases Programmatically

Example of how to handle empty feature scenarios:

import logging

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters
from ml_grid.pipeline.data import NoFeaturesError

logger = logging.getLogger(__name__)

try:
    # Setup with feature selection enabled
    global_params = global_parameters(config_path='config.yml')
    
    # Thresholds assume columns above the threshold are dropped,
    # as in the feature transformation log output
    local_param_dict = {
        'corr': 0.95,              # Drop columns correlated above 0.95
        'percent_missing': 99.0,   # Drop columns with more than 99% missing values
        'n_features': 3,            # Request 3 features, but may get fewer
        'feature_selection_method': 'anova'  # or 'markov_blanket'
    }
    
    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict=local_param_dict,
        base_project_dir="HFE_GA_experiments",
        param_space_index=0,
    )
    
except NoFeaturesError as e:
    # Handle the edge case gracefully
    logger.warning(f"Feature selection failed: {e}")
    
    # Fallback strategies (pass one of these as local_param_dict on a retry):
    # 1. Relax feature selection parameters
    relaxed_params = {
        'n_features': 'all',         # Use all remaining features
        'corr': 0.99,                # Higher threshold drops fewer correlated columns
        'percent_missing': 100.0     # Higher threshold drops fewer sparse columns
    }
    
    # 2. Or disable feature importance selection temporarily
    safe_params = {
        'n_features': 'all',         # Explicitly request all
        'feature_selection_method': None
    }
    
except ValueError as e:
    # Handle data type errors (non-numeric columns, string values)
    logger.error(f"Data validation error: {e}")

Best Practices for Robust Experiments:

Use Safety Net: Rely on the default safety net, and restrict `n_features` (any value other than `'all'`) only when the dataset has enough features to survive cleaning.
Monitor Feature Logs: Check the `feature_transformation_log` attribute after pipeline execution to track feature counts at each step.
Gradual Aggression: Start with lenient thresholds (higher `percent_missing`, higher `corr`) and tighten them gradually.
Validate Input Data: Ensure your dataset has sufficient numeric features beyond the outcome variable before running experiments.

Accessing Feature Transformation Log

After pipeline execution, examine which features were removed at each step:

# After creating ml_grid_object
print(ml_grid_object.feature_transformation_log)

# Example output:
#                   step  features_before  features_after  features_changed  description
# 0         Initial Load               50              50                 0  Initial data loaded.
# 1    Feature Selection               50              48                -2  Selected columns based on feature toggles
# 2      Drop Correlated               48              45                -3  Dropped columns with correlation > 0.95
# 3         Drop Missing               45              45                 0  Dropped columns with > 99% missing
# 4  Drop Other Outcomes               45              45                 0  Removed other potential outcome variables
# 5       Drop Constants               45              38                -7  Removed constant columns

This log helps diagnose why features were removed and enables better parameter tuning for future runs.
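
As a small illustration of that diagnosis, the step that removed the most features can be located programmatically. The sketch works on a list of row dictionaries. With a real run these could plausibly be obtained via `ml_grid_object.feature_transformation_log.to_dict("records")`, which is an assumption about usage, and the `largest_drop` helper is hypothetical.

```python
from typing import Any, Dict, List


def largest_drop(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the log row with the most negative features_changed value."""
    return min(records, key=lambda row: row["features_changed"])


# Rows taken from the example output above.
records = [
    {"step": "Feature Selection", "features_changed": -2},
    {"step": "Drop Correlated", "features_changed": -3},
    {"step": "Drop Missing", "features_changed": 0},
    {"step": "Drop Constants", "features_changed": -7},
]
worst = largest_drop(records)
print(worst["step"])
```

In the example output, constant-column removal is the dominant source of feature loss, which suggests inspecting the input data before adjusting `corr` or `percent_missing`.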

GA Pipeline

Class: `run`

The primary orchestrator for running the genetic algorithm evolution process.

Module: ml_grid.pipeline.main_ga

Instantiation Signature:

main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)

Usage:

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()

Attributes:

Attribute	Type	Description
`global_params`	`global_parameters`	Configuration object with experiment settings
`ml_grid_object`	`Any`	Experiment object containing data and hyperparameters
`verbose`	`int`	Logging verbosity level (0-15)
`error_raise`	`bool`	Flag for error handling behavior
`nb_params`	`List[int]`	List of ensemble sizes to try
`pop_params`	`List[int]`	List of population sizes to try
`g_params`	`List[int]`	List of generation counts to try
`log_folder_path`	`str`	Path for storing experiment logs and artifacts

Methods:

`execute()` → `List[List]`

Executes the full genetic algorithm process for all GA parameter combinations.

Returns: List[List]

A list of errors encountered during execution
Each item contains: [model_implementation, exception, traceback]

Behavior:

Iterates through grid of GA hyperparameters (nb, pop, g)
Registers genetic operators with DEAP toolbox
Creates initial population of candidate ensembles
Runs evolutionary loop (selection, crossover, mutation)
Tracks best-performing ensemble per configuration
Implements early stopping if performance stagnates
Evaluates final ensemble on hold-out validation set
Logs all results to disk
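
The early stopping step can be pictured as a stagnation check on the best fitness per generation. The sketch below is one common formulation. The `should_stop` helper, the `patience` and `min_delta` parameters, and their default values are assumptions and may differ from the criterion used by the package.

```python
from typing import List


def should_stop(best_per_generation: List[float], patience: int = 5,
                min_delta: float = 1e-4) -> bool:
    """Return True when the last `patience` generations did not improve the
    best fitness seen before them by more than `min_delta`."""
    if len(best_per_generation) <= patience:
        return False  # Not enough history to judge stagnation.
    earlier_best = max(best_per_generation[:-patience])
    recent_best = max(best_per_generation[-patience:])
    return recent_best - earlier_best <= min_delta


improving = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
stagnant = [0.5, 0.7, 0.7, 0.7, 0.7]
print(should_stop(improving, patience=3), should_stop(stagnant, patience=3))
```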

Model Classes

Base Learner Interface

All base learner generators follow a consistent interface pattern (see Model_API.md for comprehensive model generator reference).

Generation Function Signature:

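from typing import Any, Dict, List, Tuple

import numpy as np

ModelClass = Any  # Placeholder for the concrete estimator type


# "model_name" is a placeholder for the learner name, e.g. logisticRegression
def model_nameModelGenerator(
    ml_grid_object: Any,
    local_param_dict: Dict
) -> Tuple[float, ModelClass, List[str], int, float, np.ndarray]:
    """Generates, trains, and evaluates a model.
    
    Args:
        ml_grid_object: Contains X_train, y_train, X_test, y_test and config
        local_param_dict: Parameters for this specific run
        
    Returns:
        Tuple of (mccscore, model, feature_names, train_time, auc_score, y_pred)
    """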

Return Values:

Index	Type	Description
0	`float`	Matthews Correlation Coefficient (MCC)
1	`ModelClass`	Trained model object
2	`List[str]`	List of feature names used for training
3	`int`	Model training time in seconds
4	`float`	ROC AUC score
5	`np.ndarray`	Model predictions on test set
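
Because the result is a positional tuple, unpacking it into named variables keeps calling code readable. The sketch below uses a hypothetical `stubModelGenerator` that returns fixed values in the documented order. It stands in for a real generator and uses a plain list where a real generator would return an `np.ndarray`.

```python
from typing import Any, Dict, List, Tuple


def stubModelGenerator(
    ml_grid_object: Any, local_param_dict: Dict
) -> Tuple[float, Any, List[str], int, float, List[int]]:
    """Hypothetical stand-in returning fixed values in the documented order."""
    return 0.42, object(), ["age", "bmi"], 3, 0.81, [0, 1, 1, 0]


# Unpack by position: indices 0 to 5 of the return values table.
mcc, model, feature_names, train_time, auc, y_pred = stubModelGenerator(None, {})
print(f"MCC={mcc:.2f} AUC={auc:.2f} features={feature_names}")
```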

Available Models:

AdaBoostClassifierModelGenerator
DecisionTreeClassifierModelGenerator
elasticNeuralNetworkModelGenerator
extraTreesModelGenerator
GaussianNB_ModelGenerator
GradientBoostingClassifier_ModelGenerator
kNearestNeighborsModelGenerator
logisticRegressionModelGenerator
MLPClassifier_ModelGenerator
perceptronModelGenerator
Pytorch_binary_class_ModelGenerator
QuadraticDiscriminantAnalysis_ModelGenerator
randomForestModelGenerator
SVC_ModelGenerator
XGBoostModelGenerator

Classification Models

Function: `logisticRegressionModelGenerator`

Generates, trains, and evaluates a logistic regression classifier.

Module: ml_grid.model_classes_ga.logistic_regression_model

Generation Signature:

lr_generator = logisticRegressionModelGenerator(
    ml_grid_object,
    local_param_dict
)

Function: `randomForestModelGenerator`

Generates, trains, and evaluates a random forest classifier.

Module: ml_grid.model_classes_ga.randomForest_model

Generation Signature:

rf_generator = randomForestModelGenerator(
    ml_grid_object,
    local_param_dict
)

Function: `XGBoostModelGenerator`

Generates, trains, and evaluates an XGBoost classifier.

Module: ml_grid.model_classes_ga.XGBoost_model

Generation Signature:

xgb_generator = XGBoostModelGenerator(
    ml_grid_object,
    local_param_dict
)

Genetic Algorithm APIs

Evaluation Methods

Function: `get_y_pred_resolver`

Resolves and generates predictions for ensemble evaluation.

Module: ml_grid.pipeline.evaluate_methods_ga

Signature:

y_pred = get_y_pred_resolver(
    individual,
    ml_grid_object,
    valid=False
)

Parameters:

Parameter	Type	Default	Description
`individual`	`List`	Required	Ensemble configuration (DEAP individual format)
`ml_grid_object`	`Any`	Required	Experiment object with data splits
`valid`	`bool`	`False`	If True, predict on validation set

Returns: Union[List, np.ndarray]

Final ensemble predictions

Function: `evaluate_weighted_ensemble_auc`

Main fitness evaluation function for the genetic algorithm.

Module: ml_grid.pipeline.evaluate_methods_ga

Signature:

fitness = evaluate_weighted_ensemble_auc(
    individual,
    ml_grid_object
)

Parameters:

Parameter	Type	Description
`individual`	`List`	Ensemble to evaluate
`ml_grid_object`	`Any`	Experiment object with data and configuration

Returns: Tuple[float]

Single-element tuple containing fitness score (AUC or diversity-penalized AUC)
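
DEAP represents fitness values as tuples, which is why a single objective is returned as a one-element tuple. The sketch below shows the shape of such a return value using a pairwise ROC AUC computed in plain Python. The `auc_fitness` function is hypothetical and does not include the diversity penalty mentioned above.

```python
from typing import Sequence, Tuple


def auc_fitness(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float]:
    """Pairwise ROC AUC wrapped in a one-element tuple.

    AUC is the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label.")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return (wins / (len(positives) * len(negatives)),)


fitness = auc_fitness([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(fitness)
```

Three of the four positive/negative pairs are ordered correctly in this example, giving an AUC of 0.75.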

Mutation Methods

Function: `baseLearnerGenerator`

Generates a random base learner.

Module: ml_grid.pipeline.mutate_methods

Signature:

new_learner = baseLearnerGenerator(ml_grid_object)

Function: `mutateEnsemble`

Mutates an ensemble by replacing one base learner.

Module: ml_grid.pipeline.mutate_methods

Signature:

mutated_individual = mutateEnsemble(
    individual,
    ml_grid_object
)
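
The idea of replacing one base learner can be sketched independently of the package. In this illustration, `mutate_one` and its `make_learner` callback are hypothetical. Whether the real `mutateEnsemble` modifies the individual in place or returns a copy is not stated here, so the sketch returns a copy.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def mutate_one(individual: List[T], make_learner: Callable[[], T],
               rng: random.Random) -> List[T]:
    """Return a copy of the ensemble with one random position replaced."""
    mutated = list(individual)
    position = rng.randrange(len(mutated))
    mutated[position] = make_learner()
    return mutated


# Strings stand in for trained base learners.
ensemble = ["lr", "rf", "xgb", "svc"]
mutated = mutate_one(ensemble, lambda: "new_learner", random.Random(0))
print(mutated)
```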

Weighting Methods

Function: `get_unweighted_ensemble_predictions`

Generates predictions by majority voting (mode).

Module: ml_grid.ga_functions.ga_unweighted

Signature:

predictions = get_unweighted_ensemble_predictions(
    best,
    ml_grid_object,
    valid=False
)
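
Majority voting takes, for each sample, the most frequent label across the base learners. A minimal sketch follows, assuming each learner contributes one vector of class labels. The `majority_vote` helper and its tie-breaking rule are illustrative choices, not the package's documented behavior.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Column-wise mode over per-learner prediction vectors."""
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Ties resolve to the smallest label so the result is deterministic.
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted


# One row per base learner, one column per sample.
per_learner = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]
ensemble_pred = majority_vote(per_learner)
print(ensemble_pred)
```

An odd number of base learners avoids ties in binary classification; with an even number, the tie-breaking rule matters.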

Function: `find_ensemble_weights_de`

Finds optimal weights for an ensemble using Differential Evolution.

Module: ml_grid.ga_functions.ga_ensemble_weight_finder_de

Signature:

weights = find_ensemble_weights_de(
    ensemble,
    ml_grid_object,
    valid=False
)

Result Analysis APIs

`ml_grid.util.GA_results_explorer`

Analyzes and visualizes experiment results.

Class: `GA_results_explorer`

Parses and visualizes GA experiment outcomes.

Module: ml_grid.util.GA_results_explorer

Instantiation:

from ml_grid.util.GA_results_explorer import GA_results_explorer

explorer = GA_results_explorer(
    base_project_dir="HFE_GA_experiments",
)

Methods:

Method	Parameters	Description
`plot_convergence()`	-	Generate fitness convergence plot
`plot_ensemble_size_performance()`	-	Performance vs. ensemble size
`plot_base_learner_importance()`	-	Feature/base learner importance

`ml_grid.util.evaluate_ensemble_methods`

Ensemble evaluation utilities (see Model_API.md for base learner interface and Pipeline_API.md for GA pipeline context).

Class: `EnsembleEvaluator`

Evaluates ensembles on hold-out data.

Module: ml_grid.util.evaluate_ensemble_methods

Instantiation:

from ml_grid.util.evaluate_ensemble_methods import EnsembleEvaluator

evaluator = EnsembleEvaluator(
    base_project_dir="HFE_GA_experiments",
    X_train=None, y_train=None,
    X_test=None, y_test=None,
    store_base_learners=True,
)

Methods:

Method	Description
`evaluate()`	Evaluate best ensemble on test set

Error Handling Reference

Common Exceptions

Exception	Cause	Resolution
`ModuleNotFoundError`	Package not installed	Run `pip install .`
`FileNotFoundError`	Dataset not found	Check `input_csv_path` in config
`ValueError: Input contains NaN`	Missing values	Adjust `percent_missing` threshold

Configuration YAML Schema

Complete list of configurable parameters:

global_params:
  input_csv_path: str              # Required, path to dataset
  n_iter: int                      # Default: 20, grid search iterations
  model_list: List[str]            # Required, base learner names
  verbose: int                     # Default: 2, logging level (0-15)
  grid_n_jobs: int                 # Default: 8, parallel jobs
  base_project_dir: str            # Default: "HFE_GA_experiments"
  testing: bool                    # Default: False, use smaller grid
  test_sample_n: int               # Default: 0, no sampling

ga_params:
  nb_params: List[int]             # Default: [8, 16, 24], ensemble sizes
  pop_params: List[int]            # Default: [64, 128], population sizes
  g_params: List[int]              # Default: [100], generation counts

grid_params:
  weighted: List[str]              # Default: ["unweighted"], methods
  resample: List[str | None]       # Default: [None], class imbalance handling
  corr: List[float]                # Default: [0.95], feature correlation

Python Requirement: Python >=3.12

This API documentation corresponds to version v1.0+ of the ensemble_genetic_algorithm package.

For the latest API reference, please visit our online documentation.
