Troubleshooting Guide

This guide provides solutions to common errors and issues you might encounter while using the Ensemble Genetic Algorithm project. For proactive advice on running experiments efficiently, see Best Practices and Tips.

Environment and Setup Issues

Problem: `ModuleNotFoundError: No module named '...'`

This is the most common issue and usually means the project’s virtual environment is not active or was not set up correctly.

Solutions:

Activate the Environment: Make sure you have activated the correct virtual environment before running any scripts.

```
source ga_env/bin/activate  # If you used setup.sh
# OR
source .venv/bin/activate   # If you installed manually
```

Verify Python Version: Ensure you are using Python >=3.12 as required by pyproject.toml.
```
python --version  # Should show Python 3.12.x or higher
```
Re-install Dependencies: If the error persists, your installation might be incomplete. Re-install the dependencies.
```
pip install .
```
Check for Missing Optional Dependencies: If you’re using GPU support, ensure PyTorch with CUDA is installed:
```
pip list | grep torch
```
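
The checks above can also be scripted. The following is a minimal sketch and not part of the project: the helper name `check_environment` and its default module list are assumptions chosen for illustration. Only the Python 3.12 requirement and the `ml_grid` package name come from this guide.

```python
import importlib.util
import sys


def check_environment(modules=("ml_grid", "torch"), min_version=(3, 12)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # This guide states that pyproject.toml requires Python >= 3.12.
    if sys.version_info < min_version:
        found = ".".join(map(str, sys.version_info[:3]))
        wanted = ".".join(map(str, min_version))
        problems.append(f"Python {found} found, {wanted}+ required")

    # find_spec looks a module up without importing it.
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            problems.append(f"module '{name}' is not installed in this environment")

    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    if sys.prefix == sys.base_prefix:
        problems.append("no virtual environment appears to be active")

    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

Run it with the same interpreter you use for experiments. If it reports a missing module that `pip list` shows as installed, `pip` and `python` probably belong to different environments.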

Problem: `ImportError: cannot import name '...' from 'ml_grid'`

This indicates that the package is not properly installed or your Python path is incorrect.

Solutions:

Reinstall in Editable Mode: Run from the project root:
```
pip install -e .
```

Verify Installation: Check if the module can be imported:

```
python -c "from ml_grid.pipeline import main_ga; print('Import successful')"
```

Problem: GPU is not being used by PyTorch models

Solutions:

Check Installation: Ensure you installed the GPU-enabled version of PyTorch. The easiest way is to re-run the setup script with the --gpu flag: ./setup.sh --gpu.
Verify CUDA: From your terminal, run nvidia-smi. This command should list your NVIDIA GPU. If it doesn’t, you may have an issue with your NVIDIA drivers.
Check Environment Variables: Make sure the CUDA_VISIBLE_DEVICES environment variable is not set to -1, as this explicitly disables GPU access.

Verify PyTorch CUDA:

```
python -c "import torch; print(torch.cuda.is_available())"
```

This should print `True`.
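
The GPU checks above can be combined into one diagnostic. This is a sketch under stated assumptions: `diagnose_gpu` is a hypothetical helper, not a project function. It uses only standard PyTorch attributes (`torch.cuda.is_available`, `torch.version.cuda`) and does not replace running `nvidia-smi` to check the driver.

```python
import os


def diagnose_gpu(env=None):
    """Return a short message describing why PyTorch may not see a GPU."""
    env = os.environ if env is None else env

    # CUDA_VISIBLE_DEVICES=-1 (or an empty value) hides every GPU from CUDA.
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None and visible.strip() in ("", "-1"):
        return f"CUDA_VISIBLE_DEVICES={visible!r} hides all GPUs; unset it"

    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        count = torch.cuda.device_count()
        return f"OK: {count} GPU(s), first is {torch.cuda.get_device_name(0)}"

    # torch.version.cuda is None when the installed build has no CUDA support.
    if torch.version.cuda is None:
        return "PyTorch was built without CUDA; reinstall a GPU-enabled build"
    return "PyTorch has CUDA support but found no usable GPU; check drivers"


if __name__ == "__main__":
    print(diagnose_gpu())
```

The environment variable is checked first because it is the cheapest thing to rule out and needs no PyTorch import.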

Runtime and Performance Issues

Problem: The experiment runs out of memory (`MemoryError` or `CUDA out of memory`)

Solutions:

Reduce Population Size: In your config.yml, use smaller values for pop_params under the ga_params section (e.g., [32] instead of [64, 128]).
Reduce Data Size: For testing, either use a smaller input CSV file or set testing: True in your config.yml under global_params. You can also set test_sample_n to a small number (e.g., 1000) to sample your data.
Disable Model Caching: In your config.yml, set store_base_learners: False to avoid storing trained models in memory.
GPU-Specific: If using GPU, reduce batch sizes in PyTorch model hyperparameters or use gradient accumulation.

Problem: The experiment is running very slowly

Solutions:

Start Small: For initial runs, set n_iter to a low number (e.g., 1-3) and testing: True in your config.yml under global_params.
Use Model Caching: For subsequent runs, set use_stored_base_learners: True in grid_params to avoid retraining models.
Simplify Weighting: In your config.yml, limit the weighted list under grid_params to ["unweighted"] for fast runs. 'de' and 'ann' are much slower.
Reduce Generations: Lower the values in g_params (e.g., [50]) for quicker experiments.

Problem: The genetic algorithm’s fitness score is not improving (the convergence plot is flat)

Solutions:

Increase Mutation/Crossover: The search might be stuck. In your config.yml, try increasing the mutpb (mutation rate) or cxpb (crossover rate) under grid_params to encourage more exploration.
Increase Population Size: In config.yml, use larger values for pop_params under ga_params to introduce more diversity.
Check Model Suitability: The base learners in your model_list (in config.yml) may not be a good fit for your data. Try adding or swapping in different types of models.
Feature Selection: Adjust the feature_selection_method or correlation threshold (corr) to allow different feature subsets.

GA Convergence Failures

A convergence failure means the evolutionary process does not find an optimal or satisfactory solution within the allotted generations. Typical causes are low population diversity, a difficult fitness landscape, or poorly chosen parameter settings.

Understanding Convergence

A genetic algorithm is considered to have converged when:

The best-performing individual’s fitness score plateaus over multiple generations
The population stabilizes around a small set of high-fitness solutions
No significant improvement occurs after a defined number of generations
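
The third criterion can be made concrete with a small check on the best-fitness history. This is an illustrative sketch, not the project's own stopping rule: the function name, the `patience` and `min_delta` parameters, and the assumption that higher fitness is better are all choices made for this example.

```python
def has_plateaued(best_fitness, patience=10, min_delta=1e-4):
    """Return True if the best fitness gained less than `min_delta`
    over the last `patience` generations.

    `best_fitness` holds one best score per generation; higher is assumed better.
    """
    if len(best_fitness) <= patience:
        return False  # not enough history to judge

    best_before = max(best_fitness[:-patience])
    best_recent = max(best_fitness[-patience:])
    return best_recent - best_before < min_delta


history = [0.60, 0.65, 0.70, 0.70, 0.70, 0.70]
print(has_plateaued(history, patience=3))  # True: no gain in the last 3 generations
```

A plateau at a high score is normal convergence. The same plateau at a poor score, or one reached within the first few generations, is what the rest of this section treats as a failure.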

Common Signs of Convergence Failure

| Symptom | Description |
| --- | --- |
| Flat Fitness Plot | Best fitness score remains constant or fluctuates randomly across generations without trend |
| Rapid premature convergence | Population converges too quickly to suboptimal solutions (typically within first 10-20 generations) |
| No improvement after burn-in | No performance gains despite extended generation counts (>50) |
| High diversity with no direction | Constant population reorganization without upward trend in fitness |

Diagnosing Convergence Failures

1. Check Initial Population Diversity

```
from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters

# Run a short experiment and inspect population diversity
global_params = global_parameters(config_path='config.yml')
ml_grid_object = data.pipe(
    global_params=global_params,
    file_name="data/dataset.csv",
    drop_term_list=[],
    local_param_dict={'pop': 64},
    param_space_index=0,
)

result = main_ga.run(
    ml_grid_object,
    local_param_dict={'cxpb': 0.8, 'mutpb': 0.2},
    global_params=global_params
).execute()

# Check logs for population statistics
# Look for: "Population diversity" and "Best fitness" logs
```

2. Analyze Feature Transformation Log

Low feature counts can limit solution quality:

```
print(ml_grid_object.feature_transformation_log)

# If features_after is very low (<3), consider relaxing thresholds
# or removing the n_features parameter to use all available features
```

3. Inspect Genetic Operator Parameters

Verify your configuration includes balanced exploration/exploitation:

```
grid_params:
  cxpb: 0.8   # Crossover rate, typically 0.6-0.9; too low reduces exploration
  mutpb: 0.2  # Mutation rate, typically 0.1-0.5; too high prevents convergence
```
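
The recommended ranges can be checked in code before a long run. The sketch below is illustrative only: `check_operator_rates` is a hypothetical helper, and it takes a plain dictionary such as the `local_param_dict` used in the diagnostic example. The ranges are the ones quoted in this guide, not hard limits enforced by the project.

```python
# Suggested ranges quoted in this guide: (low, high), inclusive.
RECOMMENDED = {"cxpb": (0.6, 0.9), "mutpb": (0.1, 0.5)}


def check_operator_rates(params):
    """Return warnings for cxpb/mutpb values that are missing or out of range."""
    warnings = []
    for name, (low, high) in RECOMMENDED.items():
        if name not in params:
            warnings.append(f"{name} is not set")
            continue
        value = params[name]
        if not low <= value <= high:
            warnings.append(
                f"{name}={value} is outside the suggested range {low}-{high}"
            )
    return warnings


print(check_operator_rates({"cxpb": 0.3, "mutpb": 0.2}))
```

Values outside these ranges are not necessarily wrong, so the helper returns warnings instead of raising an exception.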

Diagnostic Commands and Checks

Enable Verbose Logging:

```
global_params:
  verbose: 3  # or 4 for maximum GA debugging output
```

Monitor Key Metrics in Logs:

- Population diversity (should remain >30% throughout)
- Best fitness score trend (should show gradual improvement)
- Number of unique individuals per generation

Check Early Generation Performance: If fitness plateaus after <10 generations, the population likely lacks sufficient genetic diversity.
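
If you have access to the individuals of a generation, one simple diversity measure is the fraction of distinct individuals. This is a sketch under assumptions: the project may compute "population diversity" differently in its logs, and the representation of an individual as a sequence of genes is assumed here for illustration.

```python
def unique_fraction(population):
    """Fraction of distinct individuals in a population (1.0 means all differ)."""
    if not population:
        return 0.0
    # Convert each individual to a tuple so list-based genomes become hashable.
    distinct = {tuple(individual) for individual in population}
    return len(distinct) / len(population)


population = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
print(unique_fraction(population))  # 0.75: three distinct individuals out of four
```

Tracking this value per generation shows whether diversity collapses early (premature convergence) or stays high while fitness stays flat (search without direction).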

Solutions for Convergence Failures

Solution 1: Adjust Genetic Operators

```
ga_params:
  pop_params: [64, 128]  # increase to 256 if memory allows
  g_params: [100, 200]   # allow more generations if fitness has not improved
grid_params:
  cxpb: 0.7              # moderate crossover for balance
  mutpb: 0.3             # sufficient mutation to maintain diversity
```

Solution 2: Enable Weighted Ensemble Search

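```
grid_params:
  weighted: ["unweighted", "de"]  # allow ensemble weighting exploration
```

Adding "de" lets the search also optimize ensemble weights, at the cost of much slower runs. Unweighted mode explores a smaller search space but provides more stable convergence.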

Solution 3: Modify Feature Selection Strategy

```
grid_params:
  feature_selection_method: null  # use all features
  corr: 0.75                      # reduce from default 0.95
```

Solution 4: Use Multiple Starting Points

Run several independent experiments with different random seeds:

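```
import random

import numpy as np

# Note: PYTHONHASHSEED only takes effect when set before the interpreter starts
# (e.g. `PYTHONHASHSEED=42 python script.py`); assigning os.environ at runtime
# does not change hashing in the current process.

# Try multiple starting configurations
for seed in [42, 123, 456, 789]:
    # Also seed the global RNGs, in case the 'seed' key alone is not enough
    random.seed(seed)
    np.random.seed(seed)
    local_param_dict = {
        'seed': seed,
        'cxpb': 0.8,
        'mutpb': 0.3
    }
    # Run experiment with each seed, e.g. via main_ga.run(...) as in the
    # diagnostic example above
```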
When GA Truly Fails

If convergence fails despite these adjustments:

Verify Data Quality: Check for label noise, class imbalance, or insufficient samples
Check Base Learner Diversity: Ensure model_list contains disparate model types (e.g., tree-based + linear models)
Consider Problem Suitability: Some problems may not benefit from ensemble optimization via GA
Try Alternative Optimization: Consider random search or Bayesian optimization for simpler configuration spaces

Configuration Errors

Problem: Unknown parameter in config.yml

Solutions:

Check Section Names: Ensure parameters are under the correct section (global_params, ga_params, grid_params).
Refer to Hyperparameter Reference: Check the Hyperparameter Reference Guide for valid parameter names and acceptable values.
Use config.yml.example: Copy parameters from the example file rather than typing them manually.
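
Misplaced or misspelled keys can be caught with a small check on the loaded configuration. This is a sketch, not the project's validator: `find_unknown_keys` is a hypothetical helper, and the `KNOWN` table lists only parameters whose section is stated in this guide, so it is deliberately incomplete. The input is a plain dictionary, for example the result of `yaml.safe_load` on your config.yml if PyYAML is installed.

```python
import difflib

# Parameters whose section is named in this guide. NOT the project's full schema.
KNOWN = {
    "global_params": {"testing", "verbose"},
    "ga_params": {"pop_params", "g_params"},
    "grid_params": {"weighted", "cxpb", "mutpb", "corr",
                    "feature_selection_method", "use_stored_base_learners"},
}


def _hint(word, candidates):
    """Return a ' (did you mean ...?)' suffix, or an empty string."""
    match = difflib.get_close_matches(word, list(candidates), n=1)
    return f" (did you mean '{match[0]}'?)" if match else ""


def find_unknown_keys(config, known=KNOWN):
    """Return messages for sections or keys that are not listed in `known`."""
    messages = []
    for section, values in config.items():
        if section not in known:
            messages.append(f"unknown section '{section}'" + _hint(section, known))
            continue
        for key in values or {}:
            if key in known[section]:
                continue
            # A key that is valid elsewhere is probably under the wrong section.
            homes = [s for s, keys in known.items() if key in keys]
            if homes:
                messages.append(f"'{key}' belongs under '{homes[0]}', not '{section}'")
            else:
                messages.append(
                    f"unknown key '{section}.{key}'" + _hint(key, known[section])
                )
    return messages


config = {"global_params": {"verbose": 3, "cxpb": 0.7}, "grid_params": {"mutbp": 0.3}}
for message in find_unknown_keys(config):
    print(message)
```

Extend `KNOWN` from the Hyperparameter Reference Guide before relying on the output. A key reported as unknown here may simply be missing from this short list.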

Problem: Grid search takes too long or explores the wrong space

Solutions:

Reduce Parameter Combinations: Remove high-cardinality options from your lists in config.yml.
Check n_iter Value: This controls how many random combinations are sampled, not the total grid size.
Use Testing Mode: Set testing: True for a smaller default grid.
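
The difference between the full grid and the sampled combinations is easy to see with a small calculation. The search space below is hypothetical (its values echo examples from this guide), and `random.sample` stands in for whatever sampling the project performs internally, which may differ.

```python
import itertools
import math
import random

# Hypothetical search space for illustration only.
space = {
    "pop_params": [32, 64, 128],
    "g_params": [50, 100, 200],
    "weighted": ["unweighted", "de", "ann"],
    "cxpb": [0.6, 0.7, 0.8, 0.9],
}

# Full grid size is the product of the list lengths: 3 * 3 * 3 * 4 = 108.
total = math.prod(len(options) for options in space.values())

# n_iter limits how many of those combinations are actually tried.
n_iter = 3
rng = random.Random(0)
all_combinations = list(itertools.product(*space.values()))
sampled = rng.sample(all_combinations, k=min(n_iter, total))

print(f"full grid: {total} combinations, sampled this run: {len(sampled)}")
for combo in sampled:
    print(dict(zip(space, combo)))
```

Each extra option in any list multiplies the full grid size, so removing one high-cardinality list shrinks the space far more than lowering a single value does.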

Common Workflow Issues

Problem: Results directory is empty or missing expected files

Solutions:

Check Log Output: Look for error messages before the script exits.
Verify Write Permissions: Ensure your user has write access to base_project_dir.
Check Experiment Completion: The experiment must complete all generations and final evaluation to save key artifacts.
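
Write access can be verified before starting a long experiment. This is a minimal sketch: `can_write` is a hypothetical helper, and the path you pass should be whatever your configuration uses for base_project_dir.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a file can actually be created inside `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # Creating a real temporary file is a more direct test than
        # inspecting permission bits; the file is removed automatically.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(can_write("."))
```

A `False` result for an existing directory usually points to ownership or mount options, not to a problem in the experiment itself.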

Problem: Plots are missing or cannot be generated

Solutions:

Ensure Evaluation Step: Run with --evaluate flag if using command line.
Check Backend: On headless servers, select a non-interactive matplotlib backend by calling matplotlib.use('Agg') before importing matplotlib.pyplot.

Getting Help

If the above solutions do not resolve your issue:

Consult the Home page for project overview links
Review code comments and docstrings in ml_grid/pipeline/main_ga.py
Check the repository’s issue tracker for similar problems
Provide detailed error messages including Python version, OS, and installed packages when seeking help
