Configuration Guide
This guide explains how to customize your experiments. The project uses a layered configuration system, which gives you flexibility in how you define settings. The order of precedence is:
1. Runtime Arguments (Highest precedence): Parameters passed directly when initializing global_parameters in a script.
2. config.yml File: A central YAML file in your project root for most customizations.
3. Hardcoded Defaults (Lowest precedence): The default values set within the package source code.
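The precedence order can be pictured as three dictionaries merged from lowest to highest priority. The following is a minimal sketch of that idea, not the package's actual loader: `resolve_settings` and `HARDCODED_DEFAULTS` are hypothetical names introduced here for illustration, and only three settings are shown.

```python
from typing import Any, Dict, Optional

# Hypothetical subset of defaults, for illustration only.
# The real defaults live in the package source code.
HARDCODED_DEFAULTS: Dict[str, Any] = {"n_iter": 20, "verbose": 2, "testing": False}


def resolve_settings(
    file_settings: Optional[Dict[str, Any]] = None,
    **runtime_overrides: Any,
) -> Dict[str, Any]:
    """Merge the three layers; each later layer overwrites the earlier one."""
    resolved = dict(HARDCODED_DEFAULTS)   # lowest precedence
    resolved.update(file_settings or {})  # values read from config.yml
    resolved.update(runtime_overrides)    # highest precedence
    return resolved


# config.yml sets n_iter and verbose; the script overrides n_iter again.
settings = resolve_settings({"n_iter": 50, "verbose": 1}, n_iter=5)
print(settings)  # {'n_iter': 5, 'verbose': 1, 'testing': False}
```

Here `n_iter` comes from the runtime argument, `verbose` from the file, and `testing` falls through to the default.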
Configuration Reference
The configuration system supports three types of settings:
| Category | Parameters | Purpose |
|---|---|---|
| global_params | ~20 parameters | Experiment-wide settings |
| ga_params | 3 parameters | Genetic algorithm hyperparameters |
| grid_params | 30+ parameters | Data preprocessing & grid search space |
Complete Default Configuration
# All default values for reference
global_params:
  input_csv_path: "synthetic_sample_100_features_4.csv"
  n_iter: 20
  model_list: [
    "logisticRegression", "randomForest", "XGBoost",
    "gradientBoosting", "elasticNeuralNetwork",
    "adaboostClassifier", "decisionTreeClassifier",
    "extraTrees", "gaussianNB", "kNearestNeighbors",
    "mlpClassifier", "quadraticDiscriminantAnalysis"
  ]
  verbose: 2
  grid_n_jobs: -1
  base_project_dir: "experiments/"
  testing: false
  test_sample_n: 0
  error_raise: true

ga_params:
  nb_params: [4, 8, 16, 32, 64]
  pop_params: [32, 64, 128]
  g_params: [128]

grid_params:
  weighted: ["ann", "de", "unweighted"]
  use_stored_base_learners: [false]
  store_base_learners: [false]
  resample: ["undersample", "oversample", null]
  scale: [true]
  n_features: ["all"]
  param_space_size: ["medium"]
  n_unique_out: [10]
  outcome_var_n: ["1"]
  div_p: [0]
  percent_missing: [99.9, 99.8, 99.7]
  corr: [0.9, 0.99]
  feature_selection_method: ["anova", "markov_blanket"]
  cxpb: [0.25, 0.5, 0.75]
  mutpb: [0.2, 0.4, 0.8]
  indpb: [0.025, 0.05, 0.075]
  t_size: [3, 6, 9]
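To get a feel for how large the default search space is, the sketch below multiplies the number of options for each default grid_params entry. It assumes the space is the full Cartesian product of the listed values, which is an assumption made for illustration rather than a statement about how the package builds its grid.

```python
from math import prod

# Default grid_params values, copied from the default configuration.
default_grid_params = {
    "weighted": ["ann", "de", "unweighted"],
    "use_stored_base_learners": [False],
    "store_base_learners": [False],
    "resample": ["undersample", "oversample", None],
    "scale": [True],
    "n_features": ["all"],
    "param_space_size": ["medium"],
    "n_unique_out": [10],
    "outcome_var_n": ["1"],
    "div_p": [0],
    "percent_missing": [99.9, 99.8, 99.7],
    "corr": [0.9, 0.99],
    "feature_selection_method": ["anova", "markov_blanket"],
    "cxpb": [0.25, 0.5, 0.75],
    "mutpb": [0.2, 0.4, 0.8],
    "indpb": [0.025, 0.05, 0.075],
    "t_size": [3, 6, 9],
}

# Size of the full Cartesian product: one factor per parameter.
n_combinations = prod(len(values) for values in default_grid_params.values())
print(n_combinations)  # 8748
```

Under that assumption, the default n_iter of 20 covers only a small fraction of the combinations, so trimming lists you do not need to vary makes each iteration count for more.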
See Data Workflow for data preprocessing options that interact with configuration settings.
The config.yml File
This is the recommended method for most configuration. It is safe from being overwritten by package updates and keeps all your settings in one place.
1. Create the File: Copy config.yml.example from the repository root to a new file named config.yml.
2. Edit: Uncomment and change the parameters you wish to modify. Any parameter you don't specify will use its default value.
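The copy step can also be scripted. This is a small sketch using only the standard library; `ensure_config` is a hypothetical helper written for this guide, not part of the package. It never overwrites an existing config.yml.

```python
import shutil
from pathlib import Path


def ensure_config(project_root: str = ".") -> Path:
    """Create config.yml from config.yml.example unless it already exists."""
    root = Path(project_root)
    example = root / "config.yml.example"
    target = root / "config.yml"

    if target.exists():
        # Keep the user's edits; do nothing.
        return target
    if not example.exists():
        raise FileNotFoundError(f"No config.yml.example found in {root.resolve()}")

    shutil.copyfile(example, target)
    return target
```

Calling `ensure_config()` from the repository root returns the path to config.yml, which you can then open and edit.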
The config.yml is split into three main sections:
1. global_params
These settings control the overall behavior of the experiment, such as file paths, number of iterations, and logging verbosity.
global_params:
  # Path to your dataset (required)
  input_csv_path: "data/my_dataset.csv"

  # Number of grid search iterations (higher = more thorough but slower)
  n_iter: 20

  # List of models for base learner pool (names from ml_grid/model_classes_ga/)
  model_list:
    - logisticRegression
    - randomForest
    - XGBoost

  # Verbosity level: 1=info, 2=debug, 3=pathological debug
  verbose: 2

  # Parallel jobs for grid search (-1 = all CPUs)
  grid_n_jobs: 8

  # Output directory for experiment results
  base_project_dir: "HFE_GA_experiments/"

  # Quick test mode (reduces all grid sizes)
  testing: false

  # Sample dataset rows for debugging (0=all)
  test_sample_n: 1000

  # Raise exceptions or continue on errors
  error_raise: true
Detailed Parameter Descriptions
| Parameter | Default | Description |
|---|---|---|
| input_csv_path | required | Path to dataset CSV file. Must end with |
| n_iter | 20 | Number of grid search iterations (total experiment runs). |
| model_list | Full list | Names of model classes to use in base learner pool. |
| verbose | 2 | Logging level: 1=minimal, 2=standard, 3=detailed. |
| grid_n_jobs | -1 | Parallel jobs for CV (-1 = all cores). |
| base_project_dir | "experiments/" | Directory where results are saved. |
| testing | false | Quick test mode (reduces populations and generations). |
| test_sample_n | 0 | Number of rows to sample from dataset for quick testing. |
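Before launching a long run it can help to sanity-check these values. The sketch below is one possible pre-flight check based only on the descriptions in the table above; `validate_global_params` is a hypothetical helper, and the package may apply different or additional rules.

```python
from typing import Any, Dict, List


def _is_int(value: Any) -> bool:
    """True for real integers; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


def validate_global_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems (empty list = looks fine)."""
    problems: List[str] = []

    if not params.get("input_csv_path"):
        problems.append("input_csv_path is required")

    n_iter = params.get("n_iter", 20)
    if not _is_int(n_iter) or n_iter < 1:
        problems.append("n_iter must be a positive integer")

    verbose = params.get("verbose", 2)
    if not _is_int(verbose) or verbose not in (1, 2, 3):
        problems.append("verbose must be 1, 2, or 3")

    jobs = params.get("grid_n_jobs", -1)
    if not _is_int(jobs) or (jobs != -1 and jobs < 1):
        problems.append("grid_n_jobs must be -1 or a positive integer")

    sample_n = params.get("test_sample_n", 0)
    if not _is_int(sample_n) or sample_n < 0:
        problems.append("test_sample_n must be 0 (all rows) or a positive integer")

    return problems


print(validate_global_params({"input_csv_path": "data/my_dataset.csv", "verbose": 5}))
# ['verbose must be 1, 2, or 3']
```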
2. ga_params
These control the core genetic algorithm process.
ga_params:
  # Number of base learners per ensemble (list = grid search over values)
  nb_params: [8, 16]

  # Population size - larger = more diversity but slower
  pop_params: [50, 100]

  # Number of generations to evolve (more = better convergence)
  g_params: [100, 200]
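Because each ga_params entry is a list that is searched over, the number of GA configurations grows multiplicatively. The sketch below enumerates the combinations for the values shown above and attaches a rough cost proxy. Treating population size times generations as the cost of a run is an assumption made here for illustration; the real cost also depends on the models and data.

```python
from itertools import product

# Values from the ga_params example above.
ga_params = {
    "nb_params": [8, 16],
    "pop_params": [50, 100],
    "g_params": [100, 200],
}

# Every combination of one value per parameter.
combos = [dict(zip(ga_params, values)) for values in product(*ga_params.values())]
print(len(combos))  # 8

# Rough cost proxy (assumption): individuals evaluated ~ population x generations.
budgets = [c["pop_params"] * c["g_params"] for c in combos]
print(min(budgets), max(budgets))  # 5000 20000
```

Adding a single extra value to one list increases the number of combinations by half here (from 8 to 12), so it is worth widening one parameter at a time.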
GA Performance Guidelines
| Population Size | Generations | Use Case |
|---|---|---|
| 32-50 | 50-100 | Quick prototyping |
| 100-200 | 100-200 | Standard production runs |
| 200+ | 200+ | Thorough search (high compute) |
3. grid_params
This defines the hyperparameter search space for each grid search iteration.
grid_params:
  # Weighting methods for ensemble predictions
  weighted: ["unweighted"]  # Options: "ann", "de", "unweighted"

  # Whether to resample data (handles class imbalance)
  resample: ["undersample", null]  # "oversample", "undersample", or null

  # Correlation threshold - remove highly correlated features
  corr: [0.95, 0.98]  # Features correlated above the threshold are removed; higher = fewer removed

  # Maximum % missing values allowed (above = drop column)
  percent_missing: [99]  # 99 = keep columns with ≤1% missing

  # Scale features to zero mean, unit variance
  scale: [true]  # true for NN/SVM, false for trees

  # Number of features to retain (or "all" for all)
  n_features: ["all", 50, 100]

  # Feature selection method for importance scoring
  feature_selection_method: ["anova", "markov_blanket"]
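Each grid search iteration uses one concrete setting drawn from this space. The sketch below shows one way such settings could be produced: build the Cartesian product and sample from it without replacement. `sample_settings` is a hypothetical function written for this guide; it illustrates the idea and is not the package's Grid class.

```python
import random
from itertools import product
from typing import Any, Dict, List

# Values from the grid_params example above (YAML null becomes Python None).
grid_params: Dict[str, List[Any]] = {
    "weighted": ["unweighted"],
    "resample": ["undersample", None],
    "corr": [0.95, 0.98],
    "percent_missing": [99],
    "scale": [True],
    "n_features": ["all", 50, 100],
    "feature_selection_method": ["anova", "markov_blanket"],
}


def sample_settings(
    space: Dict[str, List[Any]], n_iter: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Draw up to n_iter distinct settings from the full Cartesian product."""
    all_settings = [dict(zip(space, values)) for values in product(*space.values())]
    rng = random.Random(seed)  # seeded so repeated runs draw the same settings
    return rng.sample(all_settings, k=min(n_iter, len(all_settings)))


settings = sample_settings(grid_params, n_iter=5)
for setting in settings:
    print(setting)
```

This example space has 24 combinations, so five iterations explore roughly a fifth of it.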
Advanced Configuration Examples
Example 1: Medical Dataset with High Missingness
global_params:
  input_csv_path: "data/medical_records.csv"
  n_iter: 30  # More iterations for complex medical data
  verbose: 2
  model_list:
    - logisticRegression  # Interpretable models preferred
    - randomForest        # Handles missing values well
    - extraTrees          # Robust to outliers

grid_params:
  percent_missing: [95]      # Allow some missing data
  corr: [0.90]               # Drop features correlated above 0.90
  resample: ["undersample"]  # Often imbalanced in medical data

ga_params:
  pop_params: [100]  # Larger population for reliability
  g_params: [200]
Example 2: Genomic Data (High Dimensionality)
global_params:
  input_csv_path: "data/genomic_data.csv"

grid_params:
  scale: [true]                  # Essential for genomic features
  corr: [0.98]                   # Keep most features
  n_features: [100, 500, "all"]  # Feature selection crucial here

ga_params:
  pop_params: [200]  # Larger population for diversity
  g_params: [300]
Example 3: Balanced Cybersecurity Dataset
grid_params:
  resample: [null]  # No need to resample if balanced
  scale: [false]    # Often already normalized

ga_params:
  nb_params: [4, 8]  # Smaller ensembles
  pop_params: [50]
  g_params: [100]    # Faster convergence possible
Programmatic Configuration (In Scripts/Notebooks)
For quick tests or dynamic settings, you can override any parameter at runtime by passing it as a keyword argument to global_parameters. These arguments will take precedence over both the config.yml file and the hardcoded defaults.
from tqdm import tqdm

from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
from ml_grid.pipeline.data import pipe as data_pipe
from ml_grid.pipeline.main_ga import run as ga_run

# Load config.yml first, then override specific parameters
global_params = global_parameters(
    config_path='config.yml',
    input_csv_path="data/another_dataset.csv",  # Override from config file
    n_iter=5,  # Quick test instead of 20
    model_list=["logisticRegression", "XGBoost"]  # Simplified for speed
)

# Create parameter grid (loads default + merged user config)
grid = Grid(global_params=global_params, sample_n=100)

# Run experiments
for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)

    ml_grid_object = data_pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        drop_term_list=[],  # Optional: terms to drop from features
        local_param_dict=local_param_dict,
        base_project_dir=global_params.base_project_dir,
        param_space_index=i,
    )

    ga_run(
        ml_grid_object=ml_grid_object,
        local_param_dict=local_param_dict,
        global_params=global_params
    ).execute()
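Calling `next()` on a plain Python iterator raises StopIteration once it runs out of items, which would happen in a loop like the one above if n_iter were larger than the number of available settings. The sketch below shows a defensive alternative using `itertools.islice`. It assumes `grid.settings_list_iterator` behaves like an ordinary iterator; `iter_runs` and the stand-in settings are hypothetical and exist only so the example runs on its own.

```python
from itertools import islice
from typing import Any, Dict, Iterator, Tuple


def iter_runs(
    settings_iterator: Iterator[Dict[str, Any]], n_iter: int
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (index, settings) for at most n_iter items, stopping early if exhausted."""
    return enumerate(islice(settings_iterator, n_iter))


# Stand-in for grid.settings_list_iterator with only two settings available.
fake_settings = iter([{"resample": None}, {"resample": "undersample"}])

runs = list(iter_runs(fake_settings, n_iter=5))
print(runs)  # [(0, {'resample': None}), (1, {'resample': 'undersample'})]
```

The loop body stays the same: the index plays the role of `param_space_index` and the dictionary plays the role of `local_param_dict`.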
Runtime Overrides Table
| Override | Use Case |
|---|---|
| verbose | Debugging (detailed logs) |
| n_iter | Quick prototype |
| model_list | Test specific models |
| test_sample_n | Debug with subset |
This level of configuration gives you full control over the scope and depth of your hyperparameter search.
Dynamic Configuration from Data
You can also infer parameters based on dataset characteristics:
import pandas as pd

from ml_grid.util.global_params import global_parameters

# Load a sample to infer parameters
df = pd.read_csv("data/my_data.csv", nrows=1000)

# Detect class imbalance: share of the rarest class (0.5 = balanced binary outcome)
minority_share = df['outcome_var_1'].value_counts(normalize=True).min()

# Adjust configuration based on imbalance
if minority_share < 0.20:
    inferred = {
        "resample": ["undersample"],
        "n_iter": 30
    }
elif minority_share < 0.40:
    inferred = {
        "resample": ["oversample"],
        "n_iter": 25
    }
else:
    inferred = {
        "resample": [None],
        "n_iter": 20
    }

# Use inferred parameters. Only n_iter is passed here; inferred["resample"]
# is a grid_params setting and still needs to be applied there (e.g. in config.yml).
global_params = global_parameters(
    config_path='config.yml',
    n_iter=inferred["n_iter"]
)
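The threshold logic above is easier to test and reuse when wrapped in a function. The sketch below does this with the standard library only, so it runs without pandas or a dataset. `infer_resampling` is a hypothetical helper; the 0.20 and 0.40 cut-offs and the n_iter values are the ones used in the example above.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Tuple


def infer_resampling(labels: Iterable[Hashable]) -> Tuple[Optional[str], int]:
    """Return (resample strategy, n_iter) from the share of the rarest class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")

    minority_share = min(counts.values()) / sum(counts.values())

    if minority_share < 0.20:
        return "undersample", 30
    if minority_share < 0.40:
        return "oversample", 25
    return None, 20


# 1 positive in 10 rows: minority share is 0.1.
print(infer_resampling([0] * 9 + [1]))  # ('undersample', 30)
```

With a DataFrame, the same function can be fed the outcome column directly, for example `infer_resampling(df['outcome_var_1'])`.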
See Implementation Guide for how to execute experiments with these configurations.