# API Reference

This document provides a comprehensive reference for the public API of the **Ensemble Genetic Algorithm** project.

---

## Related Documentation

For additional context and guidance, refer to the following documentation:

- [Installation](Installation.md) - Set up the ensemble_genetic_algorithm package
- [Configuration Guide](Configuration_Guide.md) - Configure experiments with YAML settings
- [Pipeline API](Pipeline_API.md) - Deep dive into data and GA pipeline workflows
- [Model API](Model_API.md) - Comprehensive reference for model generators
- [GA Python API](GA_Python_API.md) - Genetic algorithm implementation details
- [Technical Deep Dive](Technical_Deep_Dive.md) - Architecture and algorithm explanation
- [Troubleshooting](Troubleshooting.md) - Common issues and solutions

---

## Table of Contents

- [Core Entry Points](#core-entry-points)
- [Configuration APIs](#configuration-apis)
- [Pipeline Classes](#pipeline-classes)
  - [Data Pipeline](#data-pipeline)
  - [GA Pipeline](#ga-pipeline)
- [Model Classes](#model-classes)
  - [Base Learner Interface](#base-learner-interface)
  - [Classification Models](#classification-models)
- [Genetic Algorithm APIs](#genetic-algorithm-apis)
  - [Evaluation Methods](#evaluation-methods)
  - [Mutation Methods](#mutation-methods)
  - [Weighting Methods](#weighting-methods)
- [Result Analysis APIs](#result-analysis-apis)
- [Error Handling Reference](#error-handling-reference)
- [Configuration YAML Schema](#configuration-yaml-schema)

---

## Core Entry Points

### `ml_grid.pipeline.main_ga.run`

The primary orchestrator for running the genetic algorithm evolution process.

#### Class: `run`

Orchestrates the main Genetic Algorithm (GA) evolution process.

**Module**: `ml_grid.pipeline.main_ga`

**Instantiation Signature**:
```python
main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)
```

**Usage**:
```python
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()
```

See also: [Configuration Guide](Configuration_Guide.md) for comprehensive configuration options and [Pipeline_API.md](Pipeline_API.md) for pipeline workflow details.

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `global_params` | `global_parameters` | Configuration object with experiment settings |
| `ml_grid_object` | `Any` | Experiment object containing data and hyperparameters |
| `verbose` | `int` | Logging verbosity level |
| `error_raise` | `bool` | Flag for error handling behavior |
| `nb_params` | `List[int]` | List of ensemble sizes to try |
| `pop_params` | `List[int]` | List of population sizes to try |
| `g_params` | `List[int]` | List of generation counts to try |
| `log_folder_path` | `str` | Path for storing experiment logs and artifacts |

**Methods**:

##### `execute()` → `List[List]`

Executes the full genetic algorithm process for all GA parameter combinations.

**Returns**: `List[List]`
- A list of errors encountered during execution
- Each item contains: `[model_implementation, exception, traceback]`

**Behavior**:
1. Iterates through grid of GA hyperparameters (nb, pop, g)
2. Registers genetic operators with DEAP toolbox
3. Creates initial population of candidate ensembles
4. Runs evolutionary loop (selection, crossover, mutation)
5. Tracks best-performing ensemble per configuration
6. Implements early stopping if performance stagnates
7. Evaluates final ensemble on hold-out validation set
8. Logs all results to disk
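
The control flow above can be illustrated with a toy, standard-library-only sketch. It is not the project's implementation and does not use DEAP: the individuals are bit lists, the fitness is simply the number of ones (a stand-in for an ensemble score), and the names `evolve` and `patience` are hypothetical. It only shows the shape of the loop: a grid over `(nb, pop, g)`, selection, crossover, mutation, best-so-far tracking, and early stopping.

```python
import itertools
import random


def evolve(nb, pop_size, generations, patience=5, seed=0):
    """Toy GA over bit lists of length nb; fitness = number of ones."""
    rng = random.Random(seed)
    fitness = sum  # stand-in for an ensemble quality score
    population = [[rng.randint(0, 1) for _ in range(nb)] for _ in range(pop_size)]
    best = max(population, key=fitness)[:]
    stale = 0
    for _ in range(generations):
        # Selection: binary tournament, one winner per slot
        parents = [max(rng.sample(population, 2), key=fitness) for _ in range(pop_size)]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, nb)  # one-point crossover
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        for child in children:
            if rng.random() < 0.2:  # mutation: flip one position
                i = rng.randrange(nb)
                child[i] = 1 - child[i]
        population = children
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion[:], 0  # track the best individual so far
        else:
            stale += 1
        if stale >= patience:  # early stopping when progress stagnates
            break
    return best, fitness(best)


# Grid over (ensemble size, population size, generation count)
results = {}
for nb, pop, g in itertools.product([8, 16], [64], [100]):
    results[(nb, pop, g)] = evolve(nb, pop, g)
```

In the real pipeline the individual is a list of base learners and the fitness comes from the evaluation functions documented under Genetic Algorithm APIs.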

---

## Configuration APIs

### `ml_grid.util.global_params`

Central configuration object for experiments.

#### Class: `global_parameters`

Controls overall experiment behavior and settings.

**Module**: `ml_grid.util.global_params`

**Instantiation Signature**:
```python
global_parameters(
    config_path=None,
    **kwargs
)
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `config_path` | `str \| None` | `None` | Path to YAML configuration file |
| `**kwargs` | `Any` | - | Runtime parameter overrides |

**Available Parameters** (from config file):

| Parameter | Type | Example | Description |
|-----------|------|---------|-------------|
| `input_csv_path` | `str` | `"data/my.csv"` | Path to input dataset |
| `n_iter` | `int` | 20 | Number of grid search iterations |
| `model_list` | `List[str]` | `["logisticRegression"]` | Base learners to use |
| `verbose` | `int` | 2 | Logging verbosity (0-15) |
| `grid_n_jobs` | `int` | 8 | Parallel jobs for grid search |
| `base_project_dir` | `str` | `"HFE_GA_experiments"` | Output directory |
| `testing` | `bool` | False | Use smaller test grid |
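
As a rough illustration of how file settings and runtime overrides might combine, the sketch below merges three layers of plain dictionaries. The precedence order (keyword overrides over file values over defaults) is an assumption inferred from the phrase "runtime parameter overrides", and `resolve_settings` is a hypothetical helper, not part of `global_parameters`. The default values are the ones listed in the YAML schema at the end of this document.

```python
# Defaults as listed in the Configuration YAML Schema section
DEFAULTS = {
    "n_iter": 20,
    "verbose": 2,
    "grid_n_jobs": 8,
    "base_project_dir": "HFE_GA_experiments",
    "testing": False,
}


def resolve_settings(file_settings=None, **overrides):
    """Hypothetical merge: defaults < config file values < keyword overrides."""
    settings = dict(DEFAULTS)
    settings.update(file_settings or {})
    settings.update(overrides)
    return settings


# Values that would normally be read from the YAML file
file_settings = {"input_csv_path": "data/my.csv", "n_iter": 50}
settings = resolve_settings(file_settings, verbose=1)
```

Here `n_iter` comes from the file, `verbose` from the keyword override, and `grid_n_jobs` falls back to its default.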

---

### `ml_grid.util.grid_param_space_ga.Grid`

Defines the hyperparameter search space for experiments.

#### Class: `Grid`

Creates parameter grids for systematic exploration of the configuration space.

**Module**: `ml_grid.util.grid_param_space_ga`

**Instantiation Signature**:
```python
Grid(
    global_params,
    config_path=None,
    test_grid=False
)
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `global_params` | `global_parameters` | Required | Experiment configuration |
| `config_path` | `str \| None` | `None` | Override config file path |
| `test_grid` | `bool` | `False` | Use smaller test grid |

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `settings_list_iterator` | `Iterator[Dict]` | Yields parameter combinations |
| `nb_params` | `List[int]` | Ensemble sizes: `[8, 16, 24]` |
| `pop_params` | `List[int]` | Population sizes: `[64, 128]` |
| `g_params` | `List[int]` | Generation counts: `[100]` |

**Usage**:
```python
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid

global_params = global_parameters(config_path='config.yml')
grid = Grid(global_params=global_params)

# Iterate through the first 20 parameter combinations
for i in range(20):
    params = next(grid.settings_list_iterator)
    # params contains: weighted, resample, corr, etc.
```
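
To make the idea of a settings iterator concrete, here is a minimal sketch that expands a dictionary of candidate values into parameter combinations. The function `settings_iterator`, the shuffling step, and the second `corr` value are assumptions for illustration; how `Grid` actually builds and orders its iterator is not specified here.

```python
import itertools
import random


def settings_iterator(space, seed=0):
    """Yield one dict per combination of the candidate values in `space`."""
    keys = list(space)
    combos = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    random.Random(seed).shuffle(combos)  # assumed: combinations are visited in random order
    return iter(combos)


space = {
    "weighted": ["unweighted"],
    "resample": [None],
    "corr": [0.9, 0.95],  # 0.9 is an illustrative extra value
}

# islice stops cleanly if fewer than 20 combinations exist
sampled = list(itertools.islice(settings_iterator(space), 20))
```

Using `itertools.islice` instead of repeated `next()` calls avoids a `StopIteration` when the grid holds fewer combinations than requested.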

---

## Pipeline Classes

### Data Pipeline

#### Class: `pipe`

The main data processing pipeline for an ML grid experiment.

**Module**: `ml_grid.pipeline.data`

**Instantiation Signature**:
```python
data.pipe(
    global_params,
    file_name,
    drop_term_list,
    local_param_dict,
    base_project_dir,
    param_space_index,
    additional_naming=None,
    test_sample_n=0,
    column_sample_n=0,
    config_dict=None,
    testing=False,
    multiprocessing_ensemble=False
)
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `global_params` | `global_parameters` | Required | Global configuration object |
| `file_name` | `str` | Required | Path to input CSV file |
| `drop_term_list` | `List[str]` | Required | List of substrings for column removal |
| `local_param_dict` | `Dict` | Required | Current iteration's parameters |
| `base_project_dir` | `str` | Required | Output directory path |
| `param_space_index` | `int` | Required | Index in parameter grid |
| `additional_naming` | `str \| None` | `None` | Optional string for log folder identification |
| `test_sample_n` | `int` | 0 | Number of rows to sample (0 = all) |
| `column_sample_n` | `int` | 0 | Number of columns to sample (0 = all) |
| `config_dict` | `Dict \| None` | `None` | GA configuration options |
| `testing` | `bool` | False | Enable testing/debug mode |
| `multiprocessing_ensemble` | `bool` | False | Enable multiprocessing for ensemble |

**Returns**: `ml_grid_object`

**Pipeline Steps**:

The `pipe` class performs the following steps:
1. Load data from CSV file with optional sampling
2. Select features based on configuration
3. Apply safety net if all features have been pruned
4. Create X and y variables
5. Split data into train/test/validation sets
6. Apply post-split cleaning to prevent data leakage
7. Optionally scale features using StandardScaler
8. Select features by importance if configured
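
The ordering of steps 5 to 7 matters for leakage: anything learned from the data (which columns are constant, scaling statistics) should come from the training rows only. The sketch below shows that ordering with plain Python lists. All helper names are hypothetical and the real pipeline works on pandas DataFrames; the scaler mirrors StandardScaler's use of the population standard deviation.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle and split rows into (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def constant_columns(rows, columns):
    """Columns that take a single value in the given rows."""
    return [c for c in columns if len({r[c] for r in rows}) <= 1]


def fit_scaler(rows, columns):
    """Per-column (mean, population std); std of 0 is replaced by 1."""
    stats = {}
    for c in columns:
        values = [r[c] for r in rows]
        stats[c] = (statistics.fmean(values), statistics.pstdev(values) or 1.0)
    return stats


def scale(row, stats):
    return {c: (row[c] - mean) / std for c, (mean, std) in stats.items()}


rows = [{"a": float(i), "b": 1.0} for i in range(8)]
columns = ["a", "b"]

train, test = split_rows(rows)
dropped = constant_columns(train, columns)  # decided on training rows only
kept = [c for c in columns if c not in dropped]
stats = fit_scaler(train, kept)             # fitted on training rows only
train_scaled = [scale(r, stats) for r in train]
test_scaled = [scale(r, stats) for r in test]  # reuses training statistics
```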

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `df` | `pd.DataFrame` | Main DataFrame holding data |
| `X_train`, `X_test`, `X_test_orig` | `pd.DataFrame` | Feature DataFrames for different splits |
| `y_train`, `y_test`, `y_test_orig` | `pd.Series` | Target Series for different splits |
| `drop_list` | `List[str]` | List of columns to remove |
| `final_column_list` | `List[str]` | Final feature column names |
| `model_class_list` | `List` | List of model generator functions |
| `feature_transformation_log` | `pd.DataFrame` | Log of feature counts at each pipeline step |

**Edge Cases and Special Behavior**:

##### Handling Empty Feature Sets

The data pipeline includes safety mechanisms to prevent failures when all features are pruned during data cleaning:

1. **Safety Net**: If the feature selection process removes all features (e.g., due to high correlation `corr` threshold, missing value threshold `percent_missing`, or constant column removal), the pipeline activates a safety net that retains at least 1-2 numeric non-outcome columns from the original dataset.

2. **NoFeaturesError Exception**: If the safety net cannot retain any features (e.g., the dataset is empty or contains only the outcome variable), a `NoFeaturesError` exception is raised with a clear message indicating the root cause.
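
The two behaviours above can be sketched as a single fallback function. This is an illustration under assumptions, not the pipeline's code: `apply_safety_net` and its arguments are hypothetical, and `NoFeaturesError` is redefined locally as a stand-in so the example is self-contained.

```python
class NoFeaturesError(Exception):
    """Local stand-in for the pipeline's exception of the same name."""


def apply_safety_net(surviving, original_columns, numeric_columns, outcome, keep=2):
    """Return surviving features, or fall back to a few numeric non-outcome columns."""
    if surviving:
        return list(surviving)
    candidates = [
        c for c in original_columns if c in numeric_columns and c != outcome
    ]
    if not candidates:
        raise NoFeaturesError("No numeric non-outcome columns available to retain.")
    return candidates[:keep]


original = ["age", "name", "bmi", "outcome"]
numeric = {"age", "bmi", "outcome"}

# Pruning removed everything, so the safety net restores up to two columns
rescued = apply_safety_net([], original, numeric, outcome="outcome")

# A dataset holding only the outcome variable cannot be rescued
try:
    apply_safety_net([], ["outcome"], {"outcome"}, outcome="outcome")
    rescue_failed = False
except NoFeaturesError:
    rescue_failed = True
```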

##### Empty Selection via the `n_features` Parameter

When the `n_features` parameter in `local_param_dict` results in zero selected features:

- **Scenario**: The feature importance method (ANOVA F-test or Markov Blanket) selects fewer features than requested, possibly resulting in an empty set if no features pass statistical significance tests.

- **Behavior**: 
  - If the selected feature count reaches zero during `_select_features_by_importance()`, a `NoFeaturesError` is raised with message: `"Feature importance selection removed all features."`
  - This occurs after post-split cleaning, which may eliminate columns that became constant

##### Handling Edge Cases Programmatically

Example of how to handle empty feature scenarios:

```python
import logging

from ml_grid.pipeline import data
from ml_grid.pipeline.data import NoFeaturesError
from ml_grid.util.global_params import global_parameters

logger = logging.getLogger(__name__)

try:
    global_params = global_parameters(config_path='config.yml')

    # Initial configuration; importance-based selection may return
    # fewer features than requested
    local_param_dict = {
        'corr': 0.95,               # Drop columns correlated above this value
        'percent_missing': 100.0,   # Drop columns with more than this % missing
        'n_features': 3,            # Request 3 features, but may get fewer
        'feature_selection_method': 'anova'  # or 'markov_blanket'
    }

    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict=local_param_dict,
        param_space_index=0,
    )

except NoFeaturesError as e:
    # Handle the edge case gracefully
    logger.warning(f"Feature selection failed: {e}")

    # Fallback strategies (retry data.pipe with one of these dicts):
    # 1. Relax feature selection parameters
    relaxed_params = {
        'n_features': 'all',         # Use all remaining features
        'corr': 0.99,                # Higher threshold drops fewer correlated columns (illustrative value)
        'percent_missing': 100.0     # Keep columns regardless of missingness
    }

    # 2. Or disable feature importance selection temporarily
    safe_params = {
        'n_features': 'all',         # Explicitly request all
        'feature_selection_method': None
    }

except ValueError as e:
    # Handle data type errors (non-numeric columns, string values)
    logger.error(f"Data validation error: {e}")
```

**Best Practices for Robust Experiments**:

1. **Use Safety Net**: Leave the default safety net enabled, and set `n_features` to a value other than `'all'` only when the dataset has enough features to select from.
2. **Monitor Feature Logs**: Check the `feature_transformation_log` attribute after pipeline execution to track feature counts at each step.
3. **Gradual Aggression**: Start with lenient thresholds (higher `percent_missing`, higher `corr`, so that fewer columns are dropped) and tighten them gradually.
4. **Validate Input Data**: Ensure your dataset has sufficient numeric features beyond the outcome variable before running experiments.

##### Accessing Feature Transformation Log

After pipeline execution, examine which features were removed at each step:

```python
# After creating ml_grid_object
print(ml_grid_object.feature_transformation_log)

# Example output:
#                   step  features_before  features_after  features_changed                                description
# 0         Initial Load               50              50                 0                       Initial data loaded.
# 1    Feature Selection               50              48                -2  Selected columns based on feature toggles
# 2      Drop Correlated               48              45                -3    Dropped columns with correlation > 0.95
# 3         Drop Missing               45              45                 0         Dropped columns with > 99% missing
# 4  Drop Other Outcomes               45              45                 0  Removed other potential outcome variables
# 5       Drop Constants               45              38                -7                   Removed constant columns

This log helps diagnose why features were removed and enables better parameter tuning for future runs.
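
As a small illustration of that diagnosis, the sketch below finds the step that removed the most features. It works on a list of dictionaries holding the numbers from the example output above; with the real DataFrame, a list in this shape could be obtained through pandas' `DataFrame.to_dict("records")`. The variable names are illustrative.

```python
# Rows copied from the example output above (description column omitted)
log = [
    {"step": "Initial Load", "features_before": 50, "features_after": 50},
    {"step": "Feature Selection", "features_before": 50, "features_after": 48},
    {"step": "Drop Correlated", "features_before": 48, "features_after": 45},
    {"step": "Drop Missing", "features_before": 45, "features_after": 45},
    {"step": "Drop Other Outcomes", "features_before": 45, "features_after": 45},
    {"step": "Drop Constants", "features_before": 45, "features_after": 38},
]

# Features removed at each step
losses = {row["step"]: row["features_before"] - row["features_after"] for row in log}

# The step responsible for the largest reduction
worst_step = max(losses, key=losses.get)

# Net reduction across the whole pipeline
total_removed = log[0]["features_before"] - log[-1]["features_after"]
```

For this log, constant-column removal is the dominant cause of feature loss, which points at the data rather than at the `corr` or `percent_missing` thresholds.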

---

### GA Pipeline

#### Class: `run`

The primary orchestrator for running the genetic algorithm evolution process.

**Module**: `ml_grid.pipeline.main_ga`

**Instantiation Signature**:
```python
main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)
```

**Usage**:
```python
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object, 
    local_param_dict={'cxpb': 0.8}, 
    global_params=global_params
).execute()
```

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `global_params` | `global_parameters` | Configuration object with experiment settings |
| `ml_grid_object` | `Any` | Experiment object containing data and hyperparameters |
| `verbose` | `int` | Logging verbosity level (0-15) |
| `error_raise` | `bool` | Flag for error handling behavior |
| `nb_params` | `List[int]` | List of ensemble sizes to try |
| `pop_params` | `List[int]` | List of population sizes to try |
| `g_params` | `List[int]` | List of generation counts to try |
| `log_folder_path` | `str` | Path for storing experiment logs and artifacts |

**Methods**:

##### `execute()` → `List[List]`

Executes the full genetic algorithm process for all GA parameter combinations.

**Returns**: `List[List]`
- A list of errors encountered during execution
- Each item contains: `[model_implementation, exception, traceback]`
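
A returned error list in this shape can be summarized per model, for example as sketched below. The concrete item types are assumptions: the sketch treats `model_implementation` as a callable (or any object with a readable name), `exception` as an exception instance, and `traceback` as a formatted string. `summarize_errors` and `fake_generator` are hypothetical names.

```python
import traceback


def summarize_errors(errors):
    """Group error messages by the name of the model implementation."""
    summary = {}
    for model_implementation, exception, _tb in errors:
        name = getattr(model_implementation, "__name__", str(model_implementation))
        summary.setdefault(name, []).append(f"{type(exception).__name__}: {exception}")
    return summary


def fake_generator():
    """Stand-in for a model generator that failed."""


# Build a stand-in error list in the documented [model, exception, traceback] shape
try:
    raise ValueError("Input contains NaN")
except ValueError as exc:
    errors = [[fake_generator, exc, traceback.format_exc()]]

summary = summarize_errors(errors)
```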

**Behavior**:
1. Iterates through grid of GA hyperparameters (nb, pop, g)
2. Registers genetic operators with DEAP toolbox
3. Creates initial population of candidate ensembles
4. Runs evolutionary loop (selection, crossover, mutation)
5. Tracks best-performing ensemble per configuration
6. Implements early stopping if performance stagnates
7. Evaluates final ensemble on hold-out validation set
8. Logs all results to disk

---

## Model Classes

### Base Learner Interface

All base learner generators follow a consistent interface pattern (see [Model_API.md](Model_API.md) for comprehensive model generator reference).

**Generation Function Signature**:
```python
def model_nameModelGenerator(
    ml_grid_object: Any,
    local_param_dict: Dict
) -> Tuple[float, ModelClass, List[str], int, float, np.ndarray]:
    """Generates, trains, and evaluates a model.
    
    Args:
        ml_grid_object: Contains X_train, y_train, X_test, y_test and config
        local_param_dict: Parameters for this specific run
        
    Returns:
        Tuple of (mccscore, model, feature_names, train_time, auc_score, y_pred)
    """
```

**Return Values**:

| Index | Type | Description |
|-------|------|-------------|
| 0 | `float` | Matthews Correlation Coefficient (MCC) |
| 1 | `ModelClass` | Trained model object |
| 2 | `List[str]` | List of feature names used for training |
| 3 | `int` | Model training time in seconds |
| 4 | `float` | ROC AUC score |
| 5 | `np.ndarray` | Model predictions on test set |
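
For readers unfamiliar with the first return value, the following is a minimal pure-Python sketch of the standard Matthews Correlation Coefficient formula for binary labels. It is shown only to clarify what the score measures; the generators themselves may compute it differently, for instance with a library function.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0.0 when any marginal count is zero
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
mcc = matthews_corrcoef(y_true, y_pred)  # tp=2, tn=2, fp=1, fn=1
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).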

**Available Models**:
- `AdaBoostClassifierModelGenerator`
- `DecisionTreeClassifierModelGenerator`
- `elasticNeuralNetworkModelGenerator`
- `extraTreesModelGenerator`
- `GaussianNB_ModelGenerator`
- `GradientBoostingClassifier_ModelGenerator`
- `kNearestNeighborsModelGenerator`
- `logisticRegressionModelGenerator`
- `MLPClassifier_ModelGenerator`
- `perceptronModelGenerator`
- `Pytorch_binary_class_ModelGenerator`
- `QuadraticDiscriminantAnalysis_ModelGenerator`
- `randomForestModelGenerator`
- `SVC_ModelGenerator`
- `XGBoostModelGenerator`

---

### Classification Models

#### Function: `logisticRegressionModelGenerator`

Generates, trains, and evaluates a logistic regression classifier.

**Module**: `ml_grid.model_classes_ga.logistic_regression_model`

**Generation Signature**:
```python
lr_generator = logisticRegressionModelGenerator(
    ml_grid_object,
    local_param_dict
)
```

#### Function: `randomForestModelGenerator`

Generates, trains, and evaluates a random forest classifier.

**Module**: `ml_grid.model_classes_ga.randomForest_model`

**Generation Signature**:
```python
rf_generator = randomForestModelGenerator(
    ml_grid_object,
    local_param_dict
)
```

#### Function: `XGBoostModelGenerator`

Generates, trains, and evaluates an XGBoost classifier.

**Module**: `ml_grid.model_classes_ga.XGBoost_model`

**Generation Signature**:
```python
xgb_generator = XGBoostModelGenerator(
    ml_grid_object,
    local_param_dict
)
```

---

## Genetic Algorithm APIs

### Evaluation Methods

#### Function: `get_y_pred_resolver`

Resolves and generates predictions for ensemble evaluation.

**Module**: `ml_grid.pipeline.evaluate_methods_ga`

**Signature**:
```python
y_pred = get_y_pred_resolver(
    individual,
    ml_grid_object,
    valid=False
)
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `individual` | `List` | Required | Ensemble configuration (DEAP individual format) |
| `ml_grid_object` | `Any` | Required | Experiment object with data splits |
| `valid` | `bool` | `False` | If True, predict on validation set |

**Returns**: `Union[List, np.ndarray]`
- Final ensemble predictions

#### Function: `evaluate_weighted_ensemble_auc`

Main fitness evaluation function for genetic algorithm.

**Module**: `ml_grid.pipeline.evaluate_methods_ga`

**Signature**:
```python
fitness = evaluate_weighted_ensemble_auc(
    individual,
    ml_grid_object
)
```

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `individual` | `List` | Ensemble to evaluate |
| `ml_grid_object` | `Any` | Experiment object with data and configuration |

**Returns**: `Tuple[float]`
- Single-element tuple containing fitness score (AUC or diversity-penalized AUC)
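
The single-element tuple matches DEAP's convention that fitness values are sequences, even for single-objective problems. The sketch below shows that return shape with a simple rank-based AUC over averaged member scores. It is an assumption-laden illustration: `roc_auc` and `evaluate_auc_fitness` are hypothetical names, the averaging rule is a placeholder, and the diversity penalty is not modelled.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def evaluate_auc_fitness(member_scores, y_true):
    """Average the members' scores and return the AUC as a 1-tuple."""
    n_members = len(member_scores)
    averaged = [sum(column) / n_members for column in zip(*member_scores)]
    return (roc_auc(y_true, averaged),)  # trailing comma: tuple, not float


y_true = [0, 0, 1, 1]
member_scores = [
    [0.1, 0.4, 0.35, 0.8],  # scores from a first base learner
    [0.2, 0.3, 0.6, 0.7],   # scores from a second base learner
]
fitness = evaluate_auc_fitness(member_scores, y_true)
single_member_auc = roc_auc(y_true, member_scores[0])
```

In this toy data the first learner alone misranks one pair, while the averaged ensemble ranks every positive above every negative.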

---

### Mutation Methods

#### Function: `baseLearnerGenerator`

Generates a random base learner.

**Module**: `ml_grid.pipeline.mutate_methods`

**Signature**:
```python
new_learner = baseLearnerGenerator(ml_grid_object)
```

#### Function: `mutateEnsemble`

Mutates an ensemble by replacing one base learner.

**Module**: `ml_grid.pipeline.mutate_methods`

**Signature**:
```python
mutated_individual = mutateEnsemble(
    individual,
    ml_grid_object
)
```
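
The replace-one-learner idea can be sketched in a few lines. This is a simplified stand-in rather than the project's `mutateEnsemble`: individuals are plain lists of labels, `make_learner` plays the role of `baseLearnerGenerator`, and both function names below are hypothetical.

```python
import random


def mutate_ensemble(individual, make_learner, rng=random):
    """Return a copy of `individual` with one randomly chosen member replaced."""
    mutant = list(individual)
    index = rng.randrange(len(mutant))
    mutant[index] = make_learner()
    return mutant


def make_learner():
    """Stand-in for a freshly generated base learner."""
    return "new_learner"


ensemble = ["logistic_regression", "random_forest", "xgboost"]
mutated = mutate_ensemble(ensemble, make_learner, rng=random.Random(0))

# Exactly one position differs from the original ensemble
changed_positions = [i for i, (a, b) in enumerate(zip(ensemble, mutated)) if a != b]
```

Copying before mutating keeps the parent intact, which matters when the same individual is still referenced elsewhere in the population.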

---

### Weighting Methods

#### Function: `get_unweighted_ensemble_predictions`

Generates predictions by majority voting (mode).

**Module**: `ml_grid.ga_functions.ga_unweighted`

**Signature**:
```python
predictions = get_unweighted_ensemble_predictions(
    best,
    ml_grid_object,
    valid=False
)
```
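
Majority voting over base-learner predictions can be sketched as follows. The helper `majority_vote` is hypothetical and operates on plain lists; the project function takes an ensemble and the experiment object instead.

```python
from collections import Counter


def majority_vote(member_predictions):
    """Per-sample mode across members; each inner list is one member's predictions."""
    return [
        Counter(column).most_common(1)[0][0]
        for column in zip(*member_predictions)
    ]


member_predictions = [
    [1, 0, 1, 1],  # member 1
    [1, 1, 0, 1],  # member 2
    [0, 0, 1, 1],  # member 3
]
ensemble_prediction = majority_vote(member_predictions)
```

With an even number of members a tie is possible; in this sketch the label seen first among the tied ones wins, so an odd ensemble size avoids ambiguity.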

#### Function: `find_ensemble_weights_de`

Finds optimal weights for ensemble using Differential Evolution.

**Module**: `ml_grid.ga_functions.ga_ensemble_weight_finder_de`

**Signature**:
```python
weights = find_ensemble_weights_de(
    ensemble,
    ml_grid_object,
    valid=False
)
```
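
To show what a Differential Evolution weight search involves, here is a compact pure-Python sketch of the classic rand/1/bin scheme. Everything in it is an assumption for illustration: the function names, the squared-error objective, and the hyperparameters are not taken from `find_ensemble_weights_de`, whose objective and optimizer settings are not documented here. A production version would more likely call an existing optimizer such as SciPy's `differential_evolution`.

```python
import random


def ensemble_error(weights, member_scores, y_true):
    """Mean squared error of the weight-normalized blend of member scores."""
    total = sum(weights) or 1.0
    blended = [
        sum(w * s for w, s in zip(weights, column)) / total
        for column in zip(*member_scores)
    ]
    return sum((b - t) ** 2 for b, t in zip(blended, y_true)) / len(y_true)


def find_weights_de(member_scores, y_true, pop_size=20, generations=50,
                    f=0.8, cr=0.9, seed=0):
    """Differential Evolution (rand/1/bin) over weights in [0, 1]."""
    rng = random.Random(seed)
    dim = len(member_scores)

    def loss(w):
        return ensemble_error(w, member_scores, y_true)

    # Seed the population with uniform weights so the result is never worse
    population = [[1.0] * dim]
    population += [[rng.random() for _ in range(dim)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        for i, target in enumerate(population):
            others = [p for j, p in enumerate(population) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = [
                min(1.0, max(0.0, a[k] + f * (b[k] - c[k])))
                if (rng.random() < cr or k == forced) else target[k]
                for k in range(dim)
            ]
            if loss(trial) <= loss(target):  # greedy one-to-one replacement
                population[i] = trial
    best = min(population, key=loss)
    total = sum(best) or 1.0
    return [w / total for w in best]


y_true = [0, 0, 1, 1, 1, 0]
member_scores = [
    [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    [0.6, 0.5, 0.4, 0.5, 0.6, 0.5],
    [0.2, 0.4, 0.7, 0.9, 0.6, 0.1],
]
weights = find_weights_de(member_scores, y_true)
```

Because replacement is greedy and the uniform weighting is part of the initial population, the returned weights can only match or improve on a plain average under this objective.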


---

## Result Analysis APIs

### `ml_grid.util.GA_results_explorer`

Analyzes and visualizes experiment results.

#### Class: `GA_results_explorer`

Parses and visualizes GA experiment outcomes.

**Module**: `ml_grid.util.GA_results_explorer`

**Instantiation**:
```python
from ml_grid.util.GA_results_explorer import GA_results_explorer

explorer = GA_results_explorer(
    base_project_dir="HFE_GA_experiments",
)
```

**Methods**:

| Method | Parameters | Description |
|--------|------------|-------------|
| `plot_convergence()` | - | Generate fitness convergence plot |
| `plot_ensemble_size_performance()` | - | Performance vs. ensemble size |
| `plot_base_learner_importance()` | - | Feature/base learner importance |

---

### `ml_grid.util.evaluate_ensemble_methods`

Ensemble evaluation utilities (see [Model_API.md](Model_API.md) for base learner interface and [Pipeline_API.md](Pipeline_API.md) for GA pipeline context).

#### Class: `EnsembleEvaluator`

Evaluates ensembles on hold-out data.

**Module**: `ml_grid.util.evaluate_ensemble_methods`

**Instantiation**:
```python
from ml_grid.util.evaluate_ensemble_methods import EnsembleEvaluator

evaluator = EnsembleEvaluator(
    base_project_dir="HFE_GA_experiments",
    X_train=None, y_train=None,
    X_test=None, y_test=None,
    store_base_learners=True,
)
```

**Methods**:

| Method | Description |
|--------|-------------|
| `evaluate()` | Evaluate best ensemble on test set |


---

## Error Handling Reference

### Common Exceptions

| Exception | Cause | Resolution |
|-----------|-------|------------|
| `ModuleNotFoundError` | Package not installed | Run `pip install .` |
| `FileNotFoundError` | Dataset not found | Check `input_csv_path` in config |
| `ValueError: Input contains NaN` | Missing values | Adjust `percent_missing` threshold |

---

## Configuration YAML Schema

Complete list of configurable parameters:

```yaml
global_params:
  input_csv_path: str              # Required, path to dataset
  n_iter: int                      # Default: 20, grid search iterations
  model_list: List[str]            # Required, base learner names
  verbose: int                     # Default: 2, logging level (0-15)
  grid_n_jobs: int                 # Default: 8, parallel jobs
  base_project_dir: str            # Default: "HFE_GA_experiments"
  testing: bool                    # Default: False, use smaller grid
  test_sample_n: int               # Default: 0, no sampling

ga_params:
  nb_params: List[int]             # Default: [8, 16, 24], ensemble sizes
  pop_params: List[int]            # Default: [64, 128], population sizes
  g_params: List[int]              # Default: [100], generation counts

grid_params:
  weighted: List[str]              # Default: ["unweighted"], methods
  resample: List[str | None]       # Default: [None], class-imbalance handling
  corr: List[float]                # Default: [0.95], feature correlation
```

---

**Python Requirement**: Python >=3.12

This API documentation corresponds to version **v1.0+** of the ensemble_genetic_algorithm package.

For the latest API reference, please visit our [online documentation](https://ensemble-genetic-algorithm.readthedocs.io/).
