# Pipeline API Reference

This document provides a comprehensive API reference for the data pipeline classes in the Ensemble Genetic Algorithm project.


## Pipeline Classes

### `ml_grid.pipeline.main_ga.run`

The primary orchestrator for running the genetic algorithm evolution process.

#### Class: `run`

Orchestrates the main Genetic Algorithm (GA) evolution process.

**Module**: `ml_grid.pipeline.main_ga`

**Instantiation Signature**:
```python
main_ga.run(
    ml_grid_object,
    local_param_dict,
    global_params
)
```

**Usage**:
```python
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_parameters import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    param_space_index=0,
)

# Execute GA
result = main_ga.run(
    ml_grid_object,
    local_param_dict={'cxpb': 0.8},
    global_params=global_params
).execute()
```

See also: [Model_API.md](Model_API.md) for model generation patterns and base learner interface details.

See also: [GA_Python_API.md](GA_Python_API.md) for genetic algorithm evaluation methods.

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `global_params` | `global_parameters` | Configuration object with experiment settings |
| `ml_grid_object` | `Any` | Experiment object containing data and hyperparameters |
| `verbose` | `int` | Logging verbosity level (0-15) |
| `error_raise` | `bool` | Flag for error handling behavior |
| `nb_params` | `List[int]` | List of ensemble sizes to try |
| `pop_params` | `List[int]` | List of population sizes to try |
| `g_params` | `List[int]` | List of generation counts to try |
| `log_folder_path` | `str` | Path for storing experiment logs and artifacts |

**Methods**:

##### `execute()` → `List[List]`

Executes the full genetic algorithm process for all GA parameter combinations.

**Returns**: `List[List]`
- A list of errors encountered during execution
- Each item contains: `[model_implementation, exception, traceback]`
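
A minimal sketch of how the returned error list might be inspected, assuming each entry really is a three-item list of `[model_implementation, exception, traceback]` as described above. The helper name `summarize_errors` and the sample entries are hypothetical and not part of the package.

```python
from collections import Counter


def summarize_errors(errors):
    """Count failures per exception type in an execute()-style error list.

    Each entry is assumed to be [model_implementation, exception, traceback].
    """
    counts = Counter()
    for _model_implementation, exception, _traceback in errors:
        counts[type(exception).__name__] += 1
    return dict(counts)


# Hypothetical entries standing in for a real execute() result.
sample_errors = [
    ["logisticRegression", ValueError("bad input"), "Traceback ..."],
    ["randomForest", MemoryError(), "Traceback ..."],
    ["randomForest", ValueError("empty fold"), "Traceback ..."],
]

print(summarize_errors(sample_errors))
```

An empty list means no model raised during the run, so a caller can use `if not errors:` as a quick success check.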

**Behavior**:
1. Iterates through grid of GA hyperparameters (nb, pop, g)
2. Registers genetic operators with DEAP toolbox
3. Creates initial population of candidate ensembles
4. Runs evolutionary loop (selection, crossover, mutation)
5. Tracks best-performing ensemble per configuration
6. Implements early stopping if performance stagnates
7. Evaluates final ensemble on hold-out validation set
8. Logs all results to disk
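
Steps 1 and 6 above can be sketched in plain Python. This is an illustration only, not the package's implementation: the helper names `ga_grid` and `has_stagnated`, and the `patience` parameter, are assumptions introduced here, and the real stopping rule may differ.

```python
from itertools import product


def ga_grid(nb_params, pop_params, g_params):
    """Yield one (nb, pop, g) tuple per GA configuration."""
    yield from product(nb_params, pop_params, g_params)


def has_stagnated(best_scores, patience):
    """Return True if the best score did not improve in the last `patience` generations."""
    if len(best_scores) <= patience:
        return False
    recent_best = max(best_scores[-patience:])
    earlier_best = max(best_scores[:-patience])
    return recent_best <= earlier_best


# Hypothetical grid values: 3 ensemble sizes x 2 population sizes x 1 generation count.
configurations = list(ga_grid([8, 16, 24], [64, 128], [100]))
print(len(configurations))

# Hypothetical per-generation best fitness values.
print(has_stagnated([0.5, 0.6, 0.6, 0.6], patience=2))
```

Because every combination in the grid triggers a full evolutionary run, the total cost grows with the product of the three list lengths.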

---

### `ml_grid.pipeline.main.run`

Legacy grid search orchestrator for traditional machine learning models.

#### Class: `run`

Orchestrates grid search cross-validation for predefined models (legacy module).

**Module**: `ml_grid.pipeline.main`

**Instantiation Signature**:
```python
main.run(
    ml_grid_object,
    local_param_dict
)
```

**Usage**:
```python
from ml_grid.pipeline import data, main
from ml_grid.util.global_parameters import global_parameters

# Setup
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={},
    param_space_index=0,
)

# Execute grid search
result = main.run(
    ml_grid_object,
    local_param_dict={'param_space_size': 'medium'}
).execute()
```

See also: [Model_API.md](Model_API.md) for model generation patterns.


**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `global_params` | `global_parameters` | Configuration object with experiment settings |
| `ml_grid_object` | `Any` | Experiment object containing data and hyperparameters |
| `model_class_list` | `List[Any]` | List of instantiated model classes to evaluate |
| `verbose` | `int` | Logging verbosity level |

**Methods**:

##### `execute()` → `List[List]`

Executes grid search for all configured models.

**Returns**: `List[List]`
- A list of errors encountered during execution
- Each item contains: `[model_implementation, exception, traceback]`

---

## Data Processing

### `ml_grid.pipeline.data.pipe`

Constructs the experiment data object consumed by the pipeline orchestrators.

See also: {doc}`./Model_API` for base learner generator interface and model generation patterns.

#### Class: `pipe`

The main data processing pipeline for an ML grid experiment.

**Module**: `ml_grid.pipeline.data`

**Instantiation Signature**:
```python
data.pipe(
    global_params,
    file_name,
    drop_term_list,
    local_param_dict,
    base_project_dir,
    param_space_index,
    additional_naming=None,
    test_sample_n=0,
    column_sample_n=0,
    config_dict=None,
    testing=False,
    multiprocessing_ensemble=False
)
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `global_params` | `global_parameters` | Required | Global configuration object |
| `file_name` | `str` | Required | Path to input CSV file |
| `drop_term_list` | `List[str]` | Required | List of substrings for column removal |
| `local_param_dict` | `Dict` | Required | Current iteration’s parameters |
| `base_project_dir` | `str` | Required | Output directory path |
| `param_space_index` | `int` | Required | Index in parameter grid |
| `additional_naming` | `str \| None` | `None` | Optional string for log folder identification |
| `test_sample_n` | `int` | `0` | Number of rows to sample (0 = all) |
| `column_sample_n` | `int` | `0` | Number of columns to sample (0 = all) |
| `config_dict` | `Dict \| None` | `None` | GA configuration options |
| `testing` | `bool` | `False` | Enable testing/debug mode |
| `multiprocessing_ensemble` | `bool` | `False` | Enable multiprocessing for ensemble |
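
To illustrate what "substrings for column removal" could mean for `drop_term_list`, here is a minimal sketch. It assumes plain, case-sensitive substring matching against column names; the package's actual matching rules are not stated here, and the helper `columns_to_drop` and the column names are hypothetical.

```python
def columns_to_drop(columns, drop_term_list):
    """Return the columns whose name contains any of the given substrings."""
    return [
        col for col in columns
        if any(term in col for term in drop_term_list)
    ]


# Hypothetical column names for illustration.
columns = ["age", "bmi_mean", "bmi_std", "client_idcode", "outcome_var_1"]
dropped = columns_to_drop(columns, drop_term_list=["bmi", "idcode"])
kept = [col for col in columns if col not in dropped]

print(dropped)
print(kept)
```

With an empty `drop_term_list`, as in the usage examples in this document, no column matches and nothing is removed.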

See also: Configuration Guide for comprehensive configuration options including parameter grids and hyperparameter search strategies.

See also: Data_Workflow for comprehensive data preprocessing details including feature scaling, correlation filtering, and train/test split strategies.

**Returns**: `ml_grid_object`

**Contents of `ml_grid_object`**:
- `X_train`, `y_train`: Training data splits
- `X_val`, `y_val`: Validation data (for GA fitness)
- `X_test`, `y_test`: Test data (for final evaluation)
- `model_class_list`: List of base learner generators
- `local_param_dict`: Configuration for this iteration
- `logging_paths_obj`: Paths for saving results

**Pipeline Steps**:

The `pipe` class performs the following steps:

1. Load data from CSV file with optional sampling
2. Select features based on configuration
3. Apply safety net if all features have been pruned
4. Create X and y variables
5. Split data into train/test/validation sets
6. Apply post-split cleaning to prevent data leakage
7. Optionally scale features using StandardScaler
8. Select features by importance if configured
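
Steps 5 to 7 share one idea: anything learned from the data (here, scaling statistics) should come from the training split alone. The sketch below shows that ordering using only the standard library. The split fractions, seed, and helper names are assumptions for illustration; the pipeline itself uses `StandardScaler` and its own split logic.

```python
import random
import statistics


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against constant columns
    return mean, std


def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


values = [float(v) for v in range(100)]
train, val, test = three_way_split(values)
mean, std = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test reuses the train statistics
```

Fitting the scaler before splitting would let test-set values influence the mean and standard deviation, which is the leakage step 6 is meant to prevent.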

**Attributes**:

| Attribute | Type | Description |
|-----------|------|-------------|
| `df` | `pd.DataFrame` | Main DataFrame holding data |
| `X_train`, `X_test`, `X_test_orig` | `pd.DataFrame` | Feature DataFrames for different splits |
| `y_train`, `y_test`, `y_test_orig` | `pd.Series` | Target Series for different splits |
| `drop_list` | `List[str]` | List of columns to remove |
| `final_column_list` | `List[str]` | Final feature column names |
| `model_class_list` | `List` | List of model generator functions |

#### Error Handling Examples

Common pipeline failures and their solutions:

```python
import multiprocessing
import platform
import sys

from ml_grid.pipeline import data
from ml_grid.util.global_parameters import global_parameters

# Scenario 1: File not found
try:
    ml_grid_object = data.pipe(
        global_params=global_parameters(config_path='config.yml'),
        file_name="data/nonexistent.csv",
        drop_term_list=[],
        local_param_dict={},
        param_space_index=0,
    )
except FileNotFoundError as e:
    print(f"Data file not found: {e}")

# Scenario 2: Invalid configuration values
try:
    ml_grid_object = data.pipe(
        global_params=global_parameters(config_path='config.yml'),
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict={'invalid_param': True},
        param_space_index=0,
    )
except KeyError as e:
    print(f"Invalid parameter in local_param_dict: {e}")

# Scenario 3: Data validation errors (non-numeric columns)
try:
    ml_grid_object = data.pipe(
        global_params=global_parameters(config_path='config.yml'),
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict={},
        param_space_index=0,
        testing=False,  # Validation runs in non-testing mode
    )
except ValueError as e:
    print(f"Data validation failed: {e}")

# Scenario 4: Memory errors with large datasets
try:
    ml_grid_object = data.pipe(
        global_params=global_parameters(config_path='config.yml'),
        file_name="data/large_dataset.csv",
        drop_term_list=[],
        local_param_dict={},
        param_space_index=0,
        test_sample_n=1000,  # Sample for debugging
    )
except MemoryError as e:
    print(f"Insufficient memory: {e}")
    sys.exit(1)

# Scenario 5: Multiprocessing issues on Windows
if platform.system() == 'Windows':
    multiprocessing.freeze_support()
```

#### Edge Cases and Limitations

**`multiprocessing_ensemble` Parameter**:

| Aspect | Details |
|--------|---------|
| Default value | `False` (single-process execution) |
| Implementation | Uses Python’s `multiprocessing.Pool` for parallel model evaluation |
| Platform compatibility | Requires picklable objects; may fail on Windows without an `if __name__ == "__main__"` guard |
| Memory impact | Each worker process holds a copy of data in memory |

**Known Limitations**:

1. Process overhead: For small ensembles (< 4 models), multiprocessing adds overhead without benefit
2. Windows compatibility: Requires proper main block protection to avoid infinite process spawning
3. Resource contention: Multiple ensemble processes may compete for CPU/GPU resources
4. Serialization limitations: Objects with unpicklable attributes (e.g., lambda functions, file handles) will fail
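
The serialization limitation can be checked ahead of time. The following is a small sketch, not part of the package, that tests whether an object survives `pickle.dumps`, which is what `multiprocessing` relies on to send work to a pool. The helper name `is_picklable` is hypothetical.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain data and built-in functions serialize without trouble.
print(is_picklable({"nb_params": [8, 16], "pop_params": [64]}))
print(is_picklable(len))

# Lambdas and generators cannot be pickled by the standard library.
print(is_picklable(lambda x: x * x))
print(is_picklable(n for n in range(3)))
```

Running a check like this on model generators or parameter objects before enabling `multiprocessing_ensemble=True` can surface the problem earlier than a failure inside a worker process.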

**When to use `multiprocessing_ensemble=True`**:

- Large population sizes (> 128 individuals)
- Hundreds of base learners evaluated in parallel
- Long-running model training (typically > 30 seconds per model)

**When to keep `multiprocessing_ensemble=False`**:

- Small-scale experiments (< 50 models)
- Resource-constrained environments
- Debugging/development (single-process is easier to debug)

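```python
# Example: Enabling multiprocessing behind a main guard
import multiprocessing

from ml_grid.pipeline import data
from ml_grid.util.global_parameters import global_parameters


def main():
    global_params = global_parameters(config_path='config.yml')
    return data.pipe(
        global_params=global_params,
        file_name="data/dataset.csv",
        drop_term_list=[],
        local_param_dict={},
        param_space_index=0,
        multiprocessing_ensemble=True,  # Enable for large-scale runs
    )


# The main guard stops worker processes from re-running this script on
# platforms that spawn new interpreters (Windows, and macOS by default).
if __name__ == "__main__":
    # freeze_support() only has an effect in frozen Windows executables;
    # it is a harmless no-op everywhere else.
    multiprocessing.freeze_support()
    ml_grid_object = main()
```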

## Feature Selection

### `ml_grid.pipeline.get_feature_selection_class_ga.feature_selection_methods_class`

Manages feature selection methods for the pipeline.

#### Class: `feature_selection_methods_class`

Provides various feature selection techniques.

**Module**: `ml_grid.pipeline.get_feature_selection_class_ga`

**Instantiation Signature**:
```python
feature_selection_methods_class(ml_grid_object)
```

**Methods**:

#### Error Handling Examples

Feature selection failures and edge cases:

```python
from ml_grid.pipeline.get_feature_selection_class_ga import feature_selection_methods_class

# Scenario 1: Insufficient data points after preprocessing
try:
    fs_method = feature_selection_methods_class(ml_grid_object)
    X_train_selected, X_test_selected = fs_method.get_featured_selected_training_data(
        method="anova"
    )
except ValueError as e:
    print(f"Insufficient samples for feature selection: {e}")
    # Fallback to all features or reduce dimensionality
    X_train_selected = ml_grid_object.X_train.copy()
    X_test_selected = ml_grid_object.X_test.copy()

# Scenario 2: Invalid method parameter
# Build the selector outside the try block so the fallback can reuse it
fs_method = feature_selection_methods_class(ml_grid_object)
try:
    X_train_selected, X_test_selected = fs_method.get_featured_selected_training_data(
        method="invalid_method"
    )
except KeyError as e:
    print(f"Invalid feature selection method: {e}")
    # Use default method (anova)
    X_train_selected, X_test_selected = fs_method.get_featured_selected_training_data()

# Scenario 3: Constant features after feature selection
def safe_feature_selection(ml_grid_object, method="anova"):
    """Perform feature selection with safety checks."""
    try:
        fs_method = feature_selection_methods_class(ml_grid_object)
        X_train, X_test = fs_method.get_featured_selected_training_data(method=method)

        # Warn about any constant features that were retained
        for col in X_train.columns:
            if len(X_train[col].unique()) <= 1:
                print(f"Warning: Constant feature '{col}' retained")

        return X_train, X_test
    except Exception as e:
        print(f"Feature selection failed, using all features: {e}")
        return ml_grid_object.X_train.copy(), ml_grid_object.X_test.copy()

# Usage with fallback
X_train_selected, X_test_selected = safe_feature_selection(ml_grid_object)

# Scenario 4: Zero variance columns after preprocessing
def remove_constant_features(df):
    """Remove constant columns to prevent feature selection errors."""
    non_const_cols = [col for col in df.columns if len(df[col].unique()) > 1]
    const_cols = [col for col in df.columns if col not in non_const_cols]

    if const_cols:
        print(f"Removing constant features: {const_cols}")

    return df[non_const_cols]

# Preprocess before feature selection, keeping both splits on the same columns
X_train_clean = remove_constant_features(ml_grid_object.X_train)
ml_grid_object.X_train = X_train_clean
ml_grid_object.X_test = ml_grid_object.X_test[X_train_clean.columns]

fs_method = feature_selection_methods_class(ml_grid_object)
X_train_selected, X_test_selected = fs_method.get_featured_selected_training_data()
```

## Configuration APIs

### `ml_grid.util.config.load_config`

Loads YAML configuration files.

See also: Configuration Guide for comprehensive configuration examples and best practices.

#### Function: `load_config`

Loads a YAML configuration file from the given path.

**Module**: `ml_grid.util.config`

**Signature**:
```python
config = load_config(config_path="config.yml")
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `config_path` | `str` | `"config.yml"` | Path to YAML configuration file |

**Returns**: `Dict[str, Any]`
- Configuration dictionary containing `global_params` and `grid_params` sections

**Example `config.yml`**:
```yaml
global_params:
  input_csv_path: "data/dataset.csv"
  n_iter: 20
  model_list: ["logisticRegression", "randomForest"]
  verbose: 3
  base_project_dir: "HFE_GA_experiments"

grid_params:
  weighted: ["unweighted"]
  resample: [null]
  corr: [0.95]

ga_params:
  nb_params: [8, 16, 24]
  pop_params: [64, 128]
  g_params: [100]
```

#### Error Handling Examples

Configuration loading failures and edge cases:

```python
import yaml

from ml_grid.util.config import load_config, merge_configs

# Scenario 1: Config file not found
try:
    config = load_config(config_path="nonexistent.yml")
except FileNotFoundError as e:
    print(f"Configuration file not found: {e}")
    # Retry with an explicit path (or supply default values instead)
    config = load_config(config_path="config.yml")

# Scenario 2: YAML syntax error
try:
    config = load_config(config_path="config.yml")
except yaml.YAMLError as e:
    print(f"YAML parsing error in config file: {e}")
    # Use defaults or fix YAML syntax

# Scenario 3: Invalid parameter types
def validate_config(config):
    """Validate configuration values before use."""
    required_keys = {'global_params', 'ga_params'}
    for key in required_keys:
        if key not in config:
            raise ValueError(f"Missing required section: {key}")

    # Validate numeric parameters
    if 'n_iter' in config['global_params']:
        if not isinstance(config['global_params']['n_iter'], int):
            raise TypeError("'n_iter' must be an integer")
        if config['global_params']['n_iter'] < 1:
            raise ValueError("'n_iter' must be positive")

# Scenario 4: Missing required parameters
def load_config_with_defaults(config_path="config.yml"):
    """Load config with fallback to defaults."""
    default_config = {
        'global_params': {
            'n_iter': 20,
            'verbose': 2,
            'base_project_dir': 'experiments/',
        },
        'ga_params': {
            'nb_params': [4, 8],
            'pop_params': [50, 100],
            'g_params': [100]
        }
    }

    try:
        user_config = load_config(config_path)
        return merge_configs(default_config, user_config)
    except FileNotFoundError:
        return default_config

# Usage with error handling
try:
    config = load_config_with_defaults()
except Exception as e:
    print(f"Configuration error: {e}")

# Scenario 5: Merging deeply nested configs with a depth limit
def safe_merge(default, user, depth=0, max_depth=10):
    """Merge configurations, refusing to recurse past max_depth levels."""
    if depth > max_depth:
        raise ValueError("Maximum merge depth exceeded")

    result = default.copy()
    for key in user:
        # Recurse only when both sides hold a dict for this key
        if isinstance(user[key], dict) and isinstance(result.get(key), dict):
            result[key] = safe_merge(result[key], user[key], depth + 1, max_depth)
        else:
            result[key] = user[key]
    return result

# Scenario 6: Catching a merge that exceeds the depth limit
try:
    config_a = {'a': 1}
    config_b = {'b': 2}
    merged = safe_merge(config_a, config_b)
except ValueError as e:
    print(f"Merge conflict: {e}")
```

### `ml_grid.util.config.merge_configs`

Merges user configuration into default settings.

#### Function: `merge_configs`

Recursively merges a user-defined configuration into the default one.

**Module**: `ml_grid.util.config`

**Signature**:
```python
merged_config = merge_configs(default, user)
```

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `default` | `Dict` | Default configuration dictionary |
| `user` | `Dict` | User-defined configuration dictionary |

**Returns**: `Dict`
- Merged configuration dictionary
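
To make "recursively merges" concrete, here is a stand-in sketch of a recursive dictionary merge. It is not the source of `merge_configs`: the function name `merge_sketch` is hypothetical, and the choice to replace lists wholesale rather than concatenate them is an assumption, since the behavior for list values is not documented here.

```python
import copy


def merge_sketch(default, user):
    """Return a new dict in which values from `user` override `default`.

    Nested dicts are merged key by key; any other value is replaced whole.
    Neither input is modified.
    """
    merged = copy.deepcopy(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_sketch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


default = {
    "global_params": {"n_iter": 20, "verbose": 2},
    "ga_params": {"g_params": [100]},
}
user = {
    "global_params": {"verbose": 3},
    "ga_params": {"g_params": [50, 100]},
}
merged = merge_sketch(default, user)
print(merged)
```

In this sketch `n_iter` is inherited from the defaults because the user did not set it, while `verbose` and `g_params` take the user's values.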


## Utility APIs

### Logging Configuration

#### Function: `ml_grid.util.logger_setup.setup_logger`

Standardized logging setup.

**Signature**:
```python
from ml_grid.util.logger_setup import setup_logger

logger = setup_logger()
```

## Version Information

This API documentation corresponds to version v1.0+ of the `ensemble_genetic_algorithm` package.

**Python Requirement**: Python >=3.12

To check your installed version:
```bash
pip show ensemble_genetic_algorithm
```
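
The same check can be done from Python with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `ensemble_genetic_algorithm` shown above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


pkg_version = installed_version("ensemble_genetic_algorithm")
python_ok = sys.version_info >= (3, 12)

print(f"ensemble_genetic_algorithm version: {pkg_version}")
print(f"Python >= 3.12: {python_ok}")
```

A result of `None` means the package is not installed in the current environment.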