API Reference

This document provides a comprehensive reference for the public API of the Ensemble Genetic Algorithm project.



Core Entry Points

ml_grid.pipeline.main_ga.run

The primary orchestrator for running the genetic algorithm evolution process.

Class: run

Orchestrates the main Genetic Algorithm (GA) evolution process.

Module: ml_grid.pipeline.main_ga

Instantiation Signature:

main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)

Usage:

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()

See also: Configuration Guide for comprehensive configuration options and Pipeline_API.md for pipeline workflow details.

Attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| global_params | global_parameters | Configuration object with experiment settings |
| ml_grid_object | Any | Experiment object containing data and hyperparameters |
| verbose | int | Logging verbosity level |
| error_raise | bool | Flag for error handling behavior |
| nb_params | List[int] | List of ensemble sizes to try |
| pop_params | List[int] | List of population sizes to try |
| g_params | List[int] | List of generation counts to try |
| log_folder_path | str | Path for storing experiment logs and artifacts |

Methods:

execute() -> List[List]

Executes the full genetic algorithm process for all GA parameter combinations.

Returns: List[List]

  • A list of errors encountered during execution

  • Each item contains: [model_implementation, exception, traceback]
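Because execute() returns errors instead of raising them, callers may want to inspect the list after a run. The following is a minimal sketch, assuming each entry is the three-item list described above; the helper summarize_errors and the sample data are hypothetical and not part of ml_grid.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_errors(errors: List[List[Any]]) -> Dict[str, int]:
    """Count errors per exception type from [model, exception, traceback] entries."""
    counts: Counter = Counter()
    for _model, exception, _traceback in errors:
        counts[type(exception).__name__] += 1
    return dict(counts)


# Hypothetical sample in the documented [model_implementation, exception, traceback] shape.
sample_errors = [
    ["logisticRegression", ValueError("Input contains NaN"), "Traceback ..."],
    ["randomForest", ValueError("Input contains NaN"), "Traceback ..."],
    ["XGBoost", MemoryError(), "Traceback ..."],
]
summary = summarize_errors(sample_errors)
print(summary)  # {'ValueError': 2, 'MemoryError': 1}
```

An empty summary means the run completed without recorded errors.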

Behavior:

  1. Iterates through grid of GA hyperparameters (nb, pop, g)

  2. Registers genetic operators with DEAP toolbox

  3. Creates initial population of candidate ensembles

  4. Runs evolutionary loop (selection, crossover, mutation)

  5. Tracks best-performing ensemble per configuration

  6. Implements early stopping if performance stagnates

  7. Evaluates final ensemble on hold-out validation set

  8. Logs all results to disk
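Step 1 above iterates over every (nb, pop, g) combination. A minimal sketch of that outer loop follows, assuming the three lists are combined as a full Cartesian product; the helper name iter_ga_configs is hypothetical, and the real implementation may order or filter combinations differently.

```python
from itertools import product
from typing import Dict, Iterator, List


def iter_ga_configs(
    nb_params: List[int],
    pop_params: List[int],
    g_params: List[int],
) -> Iterator[Dict[str, int]]:
    """Yield one dict per (ensemble size, population size, generations) combination."""
    for nb, pop, g in product(nb_params, pop_params, g_params):
        yield {"nb": nb, "pop": pop, "g": g}


# Default values as listed for the Grid attributes in this reference.
configs = list(iter_ga_configs([8, 16, 24], [64, 128], [100]))
print(len(configs))   # 6
print(configs[0])     # {'nb': 8, 'pop': 64, 'g': 100}
```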


Configuration APIs

ml_grid.util.global_params

Central configuration object for experiments.

Class: global_parameters

Controls overall experiment behavior and settings.

Module: ml_grid.util.global_params

Instantiation Signature:

global_parameters(
    config_path=None,
    **kwargs
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config_path | str \| None | None | Path to YAML configuration file |
| \*\*kwargs | Any | - | Runtime parameter overrides |
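The kwargs are described as runtime overrides, which suggests they take precedence over values read from the config file. The sketch below illustrates that precedence with plain dictionaries. merge_settings and DEFAULTS are hypothetical stand-ins, and the real class may resolve settings differently.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults, using example values from this reference.
DEFAULTS: Dict[str, Any] = {"n_iter": 20, "verbose": 2, "grid_n_jobs": 8, "testing": False}


def merge_settings(file_settings: Optional[Dict[str, Any]] = None,
                   **overrides: Any) -> Dict[str, Any]:
    """Layer settings: defaults, then config-file values, then runtime overrides."""
    merged = dict(DEFAULTS)
    merged.update(file_settings or {})
    merged.update(overrides)
    return merged


settings = merge_settings({"n_iter": 50, "verbose": 1}, verbose=9)
print(settings["n_iter"], settings["verbose"], settings["grid_n_jobs"])  # 50 9 8
```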

Available Parameters (from config file):

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| input_csv_path | str | "data/my.csv" | Path to input dataset |
| n_iter | int | 20 | Number of grid search iterations |
| model_list | List[str] | ["logisticRegression"] | Base learners to use |
| verbose | int | 2 | Logging verbosity (0-15) |
| grid_n_jobs | int | 8 | Parallel jobs for grid search |
| base_project_dir | str | "HFE_GA_experiments" | Output directory |
| testing | bool | False | Use smaller test grid |


ml_grid.util.grid_param_space_ga.Grid

Defines the hyperparameter search space for experiments.

Class: Grid

Creates parameter grids for systematic exploration of the configuration space.

Module: ml_grid.util.grid_param_space_ga

Instantiation Signature:

Grid(
    global_params,
    config_path=None,
    test_grid=False
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| global_params | global_parameters | Required | Experiment configuration |
| config_path | str \| None | None | Override config file path |
| test_grid | bool | False | Use smaller test grid |

Attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| settings_list_iterator | Iterator[Dict] | Yields parameter combinations |
| nb_params | List[int] | Ensemble sizes: [8, 16, 24] |
| pop_params | List[int] | Population sizes: [64, 128] |
| g_params | List[int] | Generation counts: [100] |

Usage:

from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid

global_params = global_parameters(config_path='config.yml')
grid = Grid(global_params=global_params)

# Draw 20 parameter combinations from the iterator
for _ in range(20):
    params = next(grid.settings_list_iterator)
    # params contains: weighted, resample, corr, etc.
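Calling next() in a fixed-length loop raises StopIteration if the iterator holds fewer combinations than requested. A minimal sketch of a bounded draw with itertools.islice follows. It uses a hypothetical stand-in iterator in place of grid.settings_list_iterator, and whether the real iterator is finite is an assumption here.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List

# Stand-in for grid.settings_list_iterator; keys mirror the usage comment above,
# values are made up for illustration.
stand_in_iterator: Iterator[Dict[str, Any]] = iter([
    {"weighted": "unweighted", "resample": None, "corr": 0.95},
    {"weighted": "unweighted", "resample": None, "corr": 0.90},
    {"weighted": "unweighted", "resample": None, "corr": 0.85},
])

# islice stops quietly when the iterator is exhausted, so asking for 20
# combinations from a 3-item iterator yields 3 instead of raising.
drawn: List[Dict[str, Any]] = list(islice(stand_in_iterator, 20))
print(len(drawn))  # 3
```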

Pipeline Classes

Data Pipeline

Class: pipe

The main data processing pipeline for an ML grid experiment.

Module: ml_grid.pipeline.data

Instantiation Signature:

data.pipe(
    global_params,
    file_name,
    drop_term_list,
    local_param_dict,
    base_project_dir,
    param_space_index,
    additional_naming=None,
    test_sample_n=0,
    column_sample_n=0,
    config_dict=None,
    testing=False,
    multiprocessing_ensemble=False
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| global_params | global_parameters | Required | Global configuration object |
| file_name | str | Required | Path to input CSV file |
| drop_term_list | List[str] | Required | List of substrings for column removal |
| local_param_dict | Dict | Required | Current iteration's parameters |
| base_project_dir | str | Required | Output directory path |
| param_space_index | int | Required | Index in parameter grid |
| additional_naming | str \| None | None | Optional string for log folder identification |
| test_sample_n | int | 0 | Number of rows to sample (0 = all) |
| column_sample_n | int | 0 | Number of columns to sample (0 = all) |
| config_dict | Dict \| None | None | GA configuration options |
| testing | bool | False | Enable testing/debug mode |
| multiprocessing_ensemble | bool | False | Enable multiprocessing for ensemble |

Returns: ml_grid_object

Pipeline Steps:

The pipe class performs the following steps:

  1. Load data from CSV file with optional sampling

  2. Select features based on configuration

  3. Apply safety net if all features have been pruned

  4. Create X and y variables

  5. Split data into train/test/validation sets

  6. Apply post-split cleaning to prevent data leakage

  7. Optionally scale features using StandardScaler

  8. Select features by importance if configured
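Steps 5 to 7 matter for leakage: any statistic used to transform the data should be learned from the training split only. The sketch below illustrates that idea for standardization in pure Python. It is not the pipeline's code; fit_scaler and transform are hypothetical helpers that mimic what scikit-learn's StandardScaler does with fit and transform (population standard deviation, with constant columns left unscaled).

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_scaler(train: List[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from training values only."""
    std = pstdev(train)
    return fmean(train), (std if std > 0 else 1.0)  # guard constant columns


def transform(values: List[float], mean: float, std: float) -> List[float]:
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mean, std = fit_scaler(train_col)          # statistics come from train only
scaled_test = transform(test_col, mean, std)
print(mean, round(std, 4))                 # 5.0 2.2361
print([round(v, 4) for v in scaled_test])  # [0.0, 2.2361]
```

Fitting on the combined data instead would let test-set values influence the mean and scale, which is the leakage the post-split steps are meant to avoid.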

Attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| df | pd.DataFrame | Main DataFrame holding data |
| X_train, X_test, X_test_orig | pd.DataFrame | Feature DataFrames for different splits |
| y_train, y_test, y_test_orig | pd.Series | Target Series for different splits |
| drop_list | List[str] | List of columns to remove |
| final_column_list | List[str] | Final feature column names |
| model_class_list | List | List of model generator functions |
| feature_transformation_log | pd.DataFrame | Log of feature counts at each pipeline step |

Edge Cases and Special Behavior:

Handling Empty Feature Sets

The data pipeline includes safety mechanisms to prevent failures when all features are pruned during data cleaning:

  1. Safety Net: If feature selection removes every feature (for example, because of the correlation threshold corr, the missing-value threshold percent_missing, or constant-column removal), the pipeline activates a safety net that retains one or two numeric, non-outcome columns from the original dataset.

  2. NoFeaturesError Exception: If the safety net cannot retain any features (e.g., the dataset is empty or contains only the outcome variable), a NoFeaturesError exception is raised with a clear message indicating the root cause.
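The two mechanisms above can be sketched together. This is an illustration of the described behavior using plain Python containers, not the pipeline's implementation. apply_safety_net, its keep parameter, and the local NoFeaturesError class are hypothetical stand-ins.

```python
from numbers import Number
from typing import Dict, List, Sequence


class NoFeaturesError(Exception):
    """Local stand-in for the pipeline's exception of the same name."""


def apply_safety_net(columns: Dict[str, Sequence], selected: List[str],
                     outcome: str, keep: int = 2) -> List[str]:
    """Return selected features, falling back to numeric non-outcome columns."""
    if selected:
        return selected  # nothing to rescue
    numeric = [
        name for name, values in columns.items()
        if name != outcome and values
        and all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]
    if not numeric:
        raise NoFeaturesError("No numeric non-outcome columns available to retain.")
    return numeric[:keep]


table = {"age": [61, 47], "bmi": [27.1, 31.4], "notes": ["a", "b"], "outcome": [0, 1]}
print(apply_safety_net(table, selected=[], outcome="outcome"))  # ['age', 'bmi']
```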

Empty Selection via max_features Parameter

When the n_features parameter in local_param_dict results in zero features being selected:

  • Scenario: The feature importance method (ANOVA F-test or Markov Blanket) selects fewer features than requested, possibly resulting in an empty set if no features pass statistical significance tests.

  • Behavior:

    • If the selected feature count reaches zero during _select_features_by_importance(), a NoFeaturesError is raised with message: "Feature importance selection removed all features."

    • This occurs after post-split cleaning, which may eliminate columns that became constant

Handling Edge Cases Programmatically

Example of how to handle empty feature scenarios:

import logging

from ml_grid.pipeline import data
from ml_grid.pipeline.data import NoFeaturesError
from ml_grid.util.global_params import global_parameters

logger = logging.getLogger(__name__)

try:
    global_params = global_parameters(config_path='config.yml')

    # Feature selection settings for this run
    local_param_dict = {
        'corr': 0.95,               # Correlation threshold for dropping columns
        'percent_missing': 100.0,   # Missing-value threshold, in percent
        'n_features': 3,            # Request 3 features, but may get fewer
        'feature_selection_method': 'anova'  # or 'markov_blanket'
    }

    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict=local_param_dict,
        base_project_dir="HFE_GA_experiments",
        param_space_index=0,
    )
    
except NoFeaturesError as e:
    # Handle the edge case gracefully
    logger.warning(f"Feature selection failed: {e}")
    
    # Fallback strategies:
    # 1. Relax feature selection parameters
    relaxed_params = {
        'n_features': 'all',         # Use all remaining features
        'corr': 0.99,                # Higher threshold drops fewer correlated columns
        'percent_missing': 100.0     # Most lenient missing-value threshold
    }
    
    # 2. Or disable feature importance selection temporarily
    safe_params = {
        'n_features': 'all',         # Explicitly request all
        'feature_selection_method': None
    }
    
except ValueError as e:
    # Handle data type errors (non-numeric columns, string values)
    logger.error(f"Data validation error: {e}")

Best Practices for Robust Experiments:

  1. Use Safety Net: Keep the default safety net enabled, and set n_features to a value other than 'all' only when the dataset has enough features to survive selection.

  2. Monitor Feature Logs: Check the feature_transformation_log attribute after pipeline execution to track feature counts at each step.

  3. Gradual Aggression: Start with lenient thresholds (higher percent_missing, higher corr, so fewer columns are dropped) and tighten them gradually.

  4. Validate Input Data: Ensure your dataset has sufficient numeric features beyond the outcome variable before running experiments.

Accessing Feature Transformation Log

After pipeline execution, examine which features were removed at each step:

# After creating ml_grid_object
print(ml_grid_object.feature_transformation_log)

# Example output:
#          step  features_before  features_after  features_changed        description
# 0   Initial Load             50              50                 0     Initial data loaded.
# 1 Feature Selection           50              48                -2  Selected columns based on feature toggles
# 2    Drop Correlated           48              45                -3  Dropped columns with correlation > 0.95
# 3      Drop Missing            45              45                 0  Dropped columns with > 99% missing
# 4   Drop Other Outcomes        45              45                 0  Removed other potential outcome variables
# 5     Drop Constants           45              38                -7         Removed constant columns

This log helps diagnose why features were removed and enables better parameter tuning for future runs.
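A small sketch of using such a log to find the step that removed the most features. It works on a list of dicts that mirrors the example output above, so it runs without pandas. With a real DataFrame, calling to_dict('records') on ml_grid_object.feature_transformation_log should give the same shape, assuming the column names shown above. The helper largest_drop is hypothetical.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Rows copied from the example output above.
log_rows: List[Row] = [
    {"step": "Initial Load", "features_before": 50, "features_after": 50},
    {"step": "Feature Selection", "features_before": 50, "features_after": 48},
    {"step": "Drop Correlated", "features_before": 48, "features_after": 45},
    {"step": "Drop Missing", "features_before": 45, "features_after": 45},
    {"step": "Drop Other Outcomes", "features_before": 45, "features_after": 45},
    {"step": "Drop Constants", "features_before": 45, "features_after": 38},
]


def largest_drop(rows: List[Row]) -> Row:
    """Return the row whose step removed the most features."""
    return max(rows, key=lambda r: r["features_before"] - r["features_after"])


worst = largest_drop(log_rows)
print(worst["step"], worst["features_before"] - worst["features_after"])  # Drop Constants 7
```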


GA Pipeline

Class: run

The primary orchestrator for running the genetic algorithm evolution process.

Module: ml_grid.pipeline.main_ga

Instantiation Signature:

main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)

Usage:

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    base_project_dir="HFE_GA_experiments",
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()

Attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| global_params | global_parameters | Configuration object with experiment settings |
| ml_grid_object | Any | Experiment object containing data and hyperparameters |
| verbose | int | Logging verbosity level (0-15) |
| error_raise | bool | Flag for error handling behavior |
| nb_params | List[int] | List of ensemble sizes to try |
| pop_params | List[int] | List of population sizes to try |
| g_params | List[int] | List of generation counts to try |
| log_folder_path | str | Path for storing experiment logs and artifacts |

Methods:

execute() -> List[List]

Executes the full genetic algorithm process for all GA parameter combinations.

Returns: List[List]

  • A list of errors encountered during execution

  • Each item contains: [model_implementation, exception, traceback]

Behavior:

  1. Iterates through grid of GA hyperparameters (nb, pop, g)

  2. Registers genetic operators with DEAP toolbox

  3. Creates initial population of candidate ensembles

  4. Runs evolutionary loop (selection, crossover, mutation)

  5. Tracks best-performing ensemble per configuration

  6. Implements early stopping if performance stagnates

  7. Evaluates final ensemble on hold-out validation set

  8. Logs all results to disk
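Step 6 can be pictured as a patience counter over the best fitness seen in each generation. The sketch below shows that general technique under stated assumptions; it is not the project's actual stopping rule, and the function name generations_run and the patience and min_delta parameters are hypothetical.

```python
from typing import List


def generations_run(best_fitness_per_gen: List[float], patience: int = 3,
                    min_delta: float = 1e-4) -> int:
    """Return how many generations run before stagnation triggers a stop."""
    best = float("-inf")
    stale = 0
    for gen, fitness in enumerate(best_fitness_per_gen, start=1):
        if fitness > best + min_delta:
            best = fitness
            stale = 0          # improvement: reset the counter
        else:
            stale += 1         # no meaningful improvement this generation
        if stale >= patience:
            return gen         # stop early
    return len(best_fitness_per_gen)


history = [0.70, 0.74, 0.75, 0.75, 0.75, 0.75, 0.80]
print(generations_run(history, patience=3))  # 6
```

In this example the late jump to 0.80 is never reached, which is the usual trade-off of patience-based stopping: a shorter run at the risk of missing a late improvement.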


Model Classes

Base Learner Interface

All base learner generators follow a consistent interface pattern (see Model_API.md for comprehensive model generator reference).

Generation Function Signature:

def model_nameModelGenerator(
    ml_grid_object: Any,
    local_param_dict: Dict
) -> Tuple[float, ModelClass, List[str], int, float, np.ndarray]:
    """Generates, trains, and evaluates a model.
    
    Args:
        ml_grid_object: Contains X_train, y_train, X_test, y_test and config
        local_param_dict: Parameters for this specific run
        
    Returns:
        Tuple of (mccscore, model, feature_names, train_time, auc_score, y_pred)
    """

Return Values:

| Index | Type | Description |
| --- | --- | --- |
| 0 | float | Matthews Correlation Coefficient (MCC) |
| 1 | ModelClass | Trained model object |
| 2 | List[str] | List of feature names used for training |
| 3 | int | Model training time in seconds |
| 4 | float | ROC AUC score |
| 5 | np.ndarray | Model predictions on test set |
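Since the return value is positional, unpacking it into named variables keeps call sites readable. A minimal sketch, using a hypothetical stub_model_generator that returns fixed values in the documented order (a plain list stands in for the np.ndarray of predictions so the example has no dependencies):

```python
from typing import Any, Dict, List, Tuple


def stub_model_generator(ml_grid_object: Any, local_param_dict: Dict) -> Tuple[
        float, Any, List[str], int, float, List[int]]:
    """Hypothetical stand-in that returns fixed values in the documented order."""
    return (0.42, object(), ["age", "bmi"], 3, 0.81, [0, 1, 1, 0])


# Unpack by position, matching indices 0-5 in the table above.
mcc, model, feature_names, train_time, auc, y_pred = stub_model_generator(None, {})
print(f"MCC={mcc:.2f} AUC={auc:.2f} features={feature_names} time={train_time}s")
```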

Available Models:

  • AdaBoostClassifierModelGenerator

  • DecisionTreeClassifierModelGenerator

  • elasticNeuralNetworkModelGenerator

  • extraTreesModelGenerator

  • GaussianNB_ModelGenerator

  • GradientBoostingClassifier_ModelGenerator

  • kNearestNeighborsModelGenerator

  • logisticRegressionModelGenerator

  • MLPClassifier_ModelGenerator

  • perceptronModelGenerator

  • Pytorch_binary_class_ModelGenerator

  • QuadraticDiscriminantAnalysis_ModelGenerator

  • randomForestModelGenerator

  • SVC_ModelGenerator

  • XGBoostModelGenerator


Classification Models

Function: logisticRegressionModelGenerator

Generates, trains, and evaluates a logistic regression classifier.

Module: ml_grid.model_classes_ga.logistic_regression_model

Generation Signature:

lr_generator = logisticRegressionModelGenerator(
    ml_grid_object,
    local_param_dict
)

Function: randomForestModelGenerator

Generates, trains, and evaluates a random forest classifier.

Module: ml_grid.model_classes_ga.randomForest_model

Generation Signature:

rf_generator = randomForestModelGenerator(
    ml_grid_object,
    local_param_dict
)

Function: XGBoostModelGenerator

Generates, trains, and evaluates an XGBoost classifier.

Module: ml_grid.model_classes_ga.XGBoost_model

Generation Signature:

xgb_generator = XGBoostModelGenerator(
    ml_grid_object,
    local_param_dict
)

Genetic Algorithm APIs

Evaluation Methods

Function: get_y_pred_resolver

Resolves and generates predictions for ensemble evaluation.

Module: ml_grid.pipeline.evaluate_methods_ga

Signature:

y_pred = get_y_pred_resolver(
    individual,
    ml_grid_object,
    valid=False
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| individual | List | Required | Ensemble configuration (DEAP individual format) |
| ml_grid_object | Any | Required | Experiment object with data splits |
| valid | bool | False | If True, predict on validation set |

Returns: Union[List, np.ndarray]

  • Final ensemble predictions

Function: evaluate_weighted_ensemble_auc

Main fitness evaluation function for genetic algorithm.

Module: ml_grid.pipeline.evaluate_methods_ga

Signature:

fitness = evaluate_weighted_ensemble_auc(
    individual,
    ml_grid_object
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| individual | List | Ensemble to evaluate |
| ml_grid_object | Any | Experiment object with data and configuration |

Returns: Tuple[float]

  • Single-element tuple containing fitness score (AUC or diversity-penalized AUC)
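The one-element tuple follows DEAP's convention that fitness values are a sequence, even for a single objective. The sketch below computes a plain ROC AUC from scores and wraps it that way. It is an illustration only: roc_auc and evaluate_scores are hypothetical helpers, and the diversity penalty mentioned above is not shown because its formula is not given here.

```python
from typing import Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_scores(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float]:
    """Return fitness as a one-element tuple, the shape DEAP expects."""
    return (roc_auc(y_true, scores),)


fitness = evaluate_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(fitness)  # (0.75,)
```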


Mutation Methods

Function: baseLearnerGenerator

Generates a random base learner.

Module: ml_grid.pipeline.mutate_methods

Signature:

new_learner = baseLearnerGenerator(ml_grid_object)

Function: mutateEnsemble

Mutates an ensemble by replacing one base learner.

Module: ml_grid.pipeline.mutate_methods

Signature:

mutated_individual = mutateEnsemble(
    individual,
    ml_grid_object
)
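A minimal sketch of the replace-one-learner idea described for mutateEnsemble, assuming an ensemble can be treated as a flat list and that the replacement position is chosen uniformly at random. mutate_one_learner and the string learner labels are hypothetical; the real function works on DEAP individuals and draws replacements from the experiment object.

```python
import random
from typing import Callable, List, Optional


def mutate_one_learner(individual: List[str], new_learner: Callable[[], str],
                       rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of the ensemble with one randomly chosen position replaced."""
    rng = rng or random.Random()
    mutated = list(individual)             # leave the input untouched
    index = rng.randrange(len(mutated))    # pick the slot to replace
    mutated[index] = new_learner()
    return mutated


ensemble = ["lr", "rf", "xgb", "knn"]
mutated = mutate_one_learner(ensemble, lambda: "svc", rng=random.Random(0))
changed = [i for i, (a, b) in enumerate(zip(ensemble, mutated)) if a != b]
print(len(mutated), len(changed))  # 4 1
```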

Weighting Methods

Function: get_unweighted_ensemble_predictions

Generates predictions by majority voting (mode).

Module: ml_grid.ga_functions.ga_unweighted

Signature:

predictions = get_unweighted_ensemble_predictions(
    best,
    ml_grid_object,
    valid=False
)
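Majority voting takes, for each sample, the label predicted by the most base learners. A minimal sketch of that rule on plain lists follows; majority_vote and the sample predictions are hypothetical, and the real function derives the per-learner predictions from the ensemble and the experiment object.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-learner label vectors by taking the mode of each sample."""
    if not predictions:
        raise ValueError("Need predictions from at least one base learner.")
    voted = []
    for sample_votes in zip(*predictions):   # one tuple of votes per sample
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Three hypothetical base learners, four samples each.
learner_preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(learner_preds))  # [1, 1, 1, 0]
```

With binary labels, an odd number of learners avoids ties; in this sketch a tie would go to the label that appears first among the votes.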

Function: find_ensemble_weights_de

Finds optimal weights for ensemble using Differential Evolution.

Module: ml_grid.ga_functions.ga_ensemble_weight_finder_de

Signature:

weights = find_ensemble_weights_de(
    ensemble,
    ml_grid_object,
    valid=False
)
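The Differential Evolution search itself is not reproduced here. The sketch below only shows how a weight vector, once found, might be applied: a weighted average of per-learner probabilities followed by a threshold. Whether the project combines predictions exactly this way is an assumption, and weighted_vote, its threshold parameter, and the sample data are hypothetical.

```python
from typing import List, Sequence


def weighted_vote(probabilities: Sequence[Sequence[float]], weights: Sequence[float],
                  threshold: float = 0.5) -> List[int]:
    """Average per-learner probabilities with normalized weights, then threshold."""
    if len(probabilities) != len(weights):
        raise ValueError("One weight is required per base learner.")
    total = sum(weights)
    if total <= 0:
        raise ValueError("Weights must sum to a positive value.")
    norm = [w / total for w in weights]
    combined = [
        sum(w * p for w, p in zip(norm, sample_probs))
        for sample_probs in zip(*probabilities)
    ]
    return [int(score >= threshold) for score in combined]


probs = [
    [0.9, 0.2, 0.6],   # learner A
    [0.4, 0.3, 0.7],   # learner B
    [0.1, 0.8, 0.2],   # learner C
]
print(weighted_vote(probs, weights=[3.0, 1.0, 1.0]))  # [1, 0, 1]
```

An optimizer would search over the weights argument to maximize a validation metric such as AUC.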

Result Analysis APIs

ml_grid.util.GA_results_explorer

Analyzes and visualizes experiment results.

Class: GA_results_explorer

Parses and visualizes GA experiment outcomes.

Module: ml_grid.util.GA_results_explorer

Instantiation:

from ml_grid.util.GA_results_explorer import GA_results_explorer

explorer = GA_results_explorer(
    base_project_dir="HFE_GA_experiments",
)

Methods:

| Method | Parameters | Description |
| --- | --- | --- |
| plot_convergence() | - | Generate fitness convergence plot |
| plot_ensemble_size_performance() | - | Performance vs. ensemble size |
| plot_base_learner_importance() | - | Feature/base learner importance |


ml_grid.util.evaluate_ensemble_methods

Ensemble evaluation utilities (see Model_API.md for base learner interface and Pipeline_API.md for GA pipeline context).

Class: EnsembleEvaluator

Evaluates ensembles on hold-out data.

Module: ml_grid.util.evaluate_ensemble_methods

Instantiation:

from ml_grid.util.evaluate_ensemble_methods import EnsembleEvaluator

evaluator = EnsembleEvaluator(
    base_project_dir="HFE_GA_experiments",
    X_train=None, y_train=None,
    X_test=None, y_test=None,
    store_base_learners=True,
)

Methods:

| Method | Description |
| --- | --- |
| evaluate() | Evaluate best ensemble on test set |

Error Handling Reference

Common Exceptions

| Exception | Cause | Resolution |
| --- | --- | --- |
| ModuleNotFoundError | Package not installed | Run pip install . |
| FileNotFoundError | Dataset not found | Check input_csv_path in config |
| ValueError: Input contains NaN | Missing values | Adjust percent_missing threshold |


Configuration YAML Schema

Complete list of configurable parameters:

global_params:
  input_csv_path: str              # Required, path to dataset
  n_iter: int                      # Default: 20, grid search iterations
  model_list: List[str]            # Required, base learner names
  verbose: int                     # Default: 2, logging level (0-15)
  grid_n_jobs: int                 # Default: 8, parallel jobs
  base_project_dir: str            # Default: "HFE_GA_experiments"
  testing: bool                    # Default: False, use smaller grid
  test_sample_n: int               # Default: 0, no sampling

ga_params:
  nb_params: List[int]             # Default: [8, 16, 24], ensemble sizes
  pop_params: List[int]            # Default: [64, 128], population sizes
  g_params: List[int]              # Default: [100], generation counts

grid_params:
  weighted: List[str]              # Default: ["unweighted"], methods
  resample: List[str | None]       # Default: [None], class-imbalance handling
  corr: List[float]                # Default: [0.95], feature correlation
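Two keys in the schema above are marked Required. A minimal pre-flight check on an already parsed config can catch a missing one before an experiment starts. The sketch assumes the file has been loaded into a nested dict (for example with yaml.safe_load from PyYAML); missing_required is a hypothetical helper, not part of ml_grid.

```python
from typing import Any, Dict, List

REQUIRED_GLOBAL_KEYS = ("input_csv_path", "model_list")  # marked Required above


def missing_required(config: Dict[str, Any]) -> List[str]:
    """Return the required global_params keys that are absent or empty."""
    section = config.get("global_params") or {}
    return [key for key in REQUIRED_GLOBAL_KEYS if not section.get(key)]


# A parsed config as a dict; values are examples from this reference.
config = {
    "global_params": {"input_csv_path": "data/my.csv", "n_iter": 20},
    "ga_params": {"nb_params": [8, 16, 24]},
}
print(missing_required(config))  # ['model_list']
```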

Python Requirement: Python >=3.12

This API documentation corresponds to version v1.0+ of the ensemble_genetic_algorithm package.

For the latest API reference, please visit our online documentation.
