Configuration Guide

This guide explains how to customize your experiments. The project uses a layered configuration system, which gives you flexibility in how you define settings. The order of precedence is:

1. Runtime Arguments (highest precedence): Parameters passed directly when initializing `global_parameters` in a script.
2. `config.yml` File: A central YAML file in your project root for most customizations.
3. Hardcoded Defaults (lowest precedence): The default values set within the package source code.
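
To make the precedence order concrete, here is a minimal sketch of how three layers of settings can be merged. It is not the package's actual implementation: `resolve_settings`, `HARDCODED_DEFAULTS`, and the default values shown are hypothetical and exist only for illustration. YAML loading is left out so the example runs with the standard library alone.

```python
from typing import Any, Dict

# Hypothetical defaults, standing in for the values set in the package source.
HARDCODED_DEFAULTS: Dict[str, Any] = {
    "input_csv_path": "data/default.csv",
    "n_iter": 10,
    "verbose": 0,
}


def resolve_settings(
    defaults: Dict[str, Any],
    file_settings: Dict[str, Any],
    runtime_overrides: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge the three layers; later layers win over earlier ones."""
    resolved = dict(defaults)           # lowest precedence
    resolved.update(file_settings)      # values read from config.yml
    resolved.update(runtime_overrides)  # highest precedence
    return resolved


# Values as they might appear under global_params in config.yml.
file_settings = {"n_iter": 20, "verbose": 2}

settings = resolve_settings(HARDCODED_DEFAULTS, file_settings, {"n_iter": 5})
print(settings["n_iter"])          # 5 (runtime argument wins)
print(settings["verbose"])         # 2 (taken from the file layer)
print(settings["input_csv_path"])  # data/default.csv (falls back to the default)
```

A setting defined in more than one layer always takes the value from the highest layer that mentions it; a setting mentioned nowhere keeps its default.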

The `config.yml` File

This is the recommended method for most configuration. It is safe from being overwritten by package updates and keeps all your settings in one place.

Create the File: Copy config.yml.example from the repository root to a new file named config.yml in your project root.
Edit: Uncomment and change the parameters you wish to modify. Any parameter you don’t specify will use its default value.
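
If you prefer to script the first step, the copy can be sketched in Python as follows. The helper name `create_config` is hypothetical and not part of the package. The sketch deliberately refuses to overwrite an existing config.yml, and the demonstration runs in a temporary directory standing in for your project root.

```python
import shutil
import tempfile
from pathlib import Path


def create_config(project_root: Path) -> Path:
    """Copy config.yml.example to config.yml, keeping any existing config.yml."""
    example = project_root / "config.yml.example"
    target = project_root / "config.yml"
    if not target.exists():
        shutil.copyfile(example, target)
    return target


# Demonstration in a throwaway directory that stands in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.yml.example").write_text("# global_params:\n#   n_iter: 20\n")
    config_path = create_config(root)
    created = config_path.exists()
    print(created)  # True
```

Skipping the copy when config.yml already exists means that rerunning the helper never discards edits you have already made.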

The config.yml is split into three main sections:

1. `global_params` (in `config.yml`)

These settings control the overall behavior of the experiment, such as file paths, number of iterations, and logging verbosity.

global_params:
  # Path to your dataset
  input_csv_path: "data/my_dataset.csv"
  # Number of grid search iterations to run
  n_iter: 20
  # List of models to include in the base learner pool
  model_list: ["logisticRegression", "randomForest", "XGBoost"]
  # Verbosity level for console output
  verbose: 2
  # Number of parallel jobs for grid search
  grid_n_jobs: 8
  # The root directory for saving project outputs
  base_project_dir: "HFE_GA_experiments/"
  # Use a smaller, faster grid for testing and debugging
  testing: False
  # Number of rows to sample from the dataset for quick tests (0 = use all)
  test_sample_n: 0

2. `ga_params`

These settings control the core genetic algorithm: the number of base learners per ensemble, the population size, and the number of generations.

ga_params:
  nb_params: [8, 16]       # Num base learners per ensemble
  pop_params: [50]         # Population size
  g_params: [100]          # Num generations

3. `grid_params`

This section defines the hyperparameter search space for each grid search iteration. You can override entire lists or specific values.

grid_params:
  weighted: ["unweighted"] # Only use unweighted for a faster run
  # Note: YAML reads a bare None as the string "None"; YAML's own null literal is null or ~
  resample: ["undersample", None]
  corr: [0.95]
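
To see why shortening a list speeds up a run, the sketch below expands the grid above into its Cartesian product. This is an illustration only: `expand_grid` is a hypothetical helper, and this guide does not say whether the package enumerates the full product or samples from it. Either way, the lengths of the lists determine how many distinct settings exist.

```python
from itertools import product
from typing import Any, Dict, List

# The grid_params section above, written as a Python dict.
grid_params: Dict[str, List[Any]] = {
    "weighted": ["unweighted"],
    "resample": ["undersample", None],
    "corr": [0.95],
}


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return every combination of one value per key."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combinations = expand_grid(grid_params)
print(len(combinations))  # 2
for combo in combinations:
    print(combo)
```

With one value for `weighted`, two for `resample`, and one for `corr`, the space holds 1 × 2 × 1 = 2 distinct settings; each extra value in any list multiplies that count.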

Programmatic Configuration (In Scripts/Notebooks)

For quick tests or dynamic settings, you can override any parameter at runtime by passing it as a keyword argument to global_parameters. These arguments will take precedence over both the config.yml file and the hardcoded defaults.

from tqdm import tqdm
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
from ml_grid.pipeline import data, main_ga

# This will load from config.yml first, then apply the overrides below
global_params = global_parameters(
    config_path='config.yml',
    input_csv_path="data/another_dataset.csv", # Override path from config
    n_iter=5,                                  # Override n_iter for a quick run
    verbose=3                                  # Override verbosity
)

# The main loop is then executed as shown in the Quickstart section
grid = Grid(
    global_params=global_params,
    config_path='config.yml'
)

for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)
    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        local_param_dict=local_param_dict,
        base_project_dir=global_params.base_project_dir,
        param_space_index=i,
    )
    main_ga.run(
        ml_grid_object,
        local_param_dict=local_param_dict,
        global_params=global_params,
    ).execute()

This level of configuration gives you full control over the scope and depth of your hyperparameter search.
