Configuration Guide

This guide explains how to customize your experiments. The project uses a layered configuration system, so the same setting can be supplied in more than one place. The order of precedence, illustrated in the sketch after this list, is:

  1. Runtime Arguments (Highest precedence): Parameters passed directly when initializing global_parameters in a script.

  2. config.yml File: A central YAML file in your project root for most customizations.

  3. Hardcoded Defaults (Lowest precedence): The default values set within the package source code.
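
For example, here is a minimal sketch of how precedence resolves, assuming a config.yml that sets n_iter: 20 (as in the example further down):

from ml_grid.util.global_params import global_parameters

# config.yml sets n_iter: 20, but the runtime keyword argument wins
params = global_parameters(config_path='config.yml', n_iter=5)
print(params.n_iter)  # -> 5: the runtime argument, not 20 (config.yml) or the hardcoded default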


The config.yml File

This is the recommended method for most configuration: it keeps all your settings in one place and will not be overwritten by package updates.

  1. Create the File: Copy the config.yml.example from the repository root to a new file named config.yml.

  2. Edit: Uncomment and change the parameters you wish to modify. Any parameter you don’t specify will use its default value (see the minimal example below).
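
For instance, a config.yml that overrides only two global parameters (every other setting keeps its default) could be as short as:

global_params:
  n_iter: 10
  verbose: 1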

The config.yml is split into three main sections:

1. global_params (in config.yml)

These settings control the overall behavior of the experiment, such as file paths, number of iterations, and logging verbosity.

global_params:
  # Path to your dataset
  input_csv_path: "data/my_dataset.csv"
  # Number of grid search iterations to run
  n_iter: 20
  # List of models to include in the base learner pool
  model_list: ["logisticRegression", "randomForest", "XGBoost"]
  # Verbosity level for console output
  verbose: 2
  # Number of parallel jobs for grid search
  grid_n_jobs: 8
  # The root directory for saving project outputs
  base_project_dir: "HFE_GA_experiments/"
  # Use a smaller, faster grid for testing and debugging
  testing: False
  # Number of rows to sample from the dataset for quick tests (0 = use all)
  test_sample_n: 0
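
For quick debugging runs you can combine the two testing switches, for example:

global_params:
  testing: True        # use the smaller, faster grid
  test_sample_n: 500   # work on a 500-row sample instead of the full dataset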

2. ga_params

These control the core genetic algorithm process.

ga_params:
  nb_params: [8, 16]       # Num base learners per ensemble
  pop_params: [50]         # Population size
  g_params: [100]          # Num generations
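
The values are given as lists; where a list holds more than one entry (as nb_params does above), the grid appears to sweep over each value, while a single-entry list pins that setting. For example, to also compare two population sizes you might write:

ga_params:
  nb_params: [8, 16]
  pop_params: [50, 100]   # try two population sizes
  g_params: [100]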

3. grid_params

This defines the hyperparameter search space from which each grid search iteration draws its settings. You can override entire lists or specific values.

grid_params:
  weighted: ["unweighted"] # Only use unweighted for a faster run
  resample: ["undersample", null]  # YAML null maps to Python None
  corr: [0.95]

Programmatic Configuration (In Scripts/Notebooks)

For quick tests or dynamic settings, you can override any parameter at runtime by passing it as a keyword argument to global_parameters. These arguments will take precedence over both the config.yml file and the hardcoded defaults.

from tqdm import tqdm
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid
from ml_grid.pipeline import data, main_ga

# This will load from config.yml first, then apply the overrides below
global_params = global_parameters(
    config_path='config.yml',
    input_csv_path="data/another_dataset.csv", # Override path from config
    n_iter=5,                                  # Override n_iter for a quick run
    verbose=3                                  # Override verbosity
)

# The main loop is then executed as shown in the Quickstart section
grid = Grid(
    global_params=global_params,
    config_path='config.yml'
)

for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)
    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        local_param_dict=local_param_dict,
        base_project_dir=global_params.base_project_dir,
        param_space_index=i,
    )
    main_ga.run(ml_grid_object, local_param_dict=local_param_dict, global_params=global_params).execute()
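
Each of the n_iter iterations above draws one combination of settings (local_param_dict) from the configured search space via grid.settings_list_iterator. To sanity-check what a combination looks like before committing to a full run, you can peek at the iterator first (note that this consumes one combination, and assumes the iterator yields plain dictionaries, as the name local_param_dict suggests):

preview = next(grid.settings_list_iterator)
print(preview)  # one iteration's settings; exact keys depend on your grid_params and defaults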

This level of configuration gives you full control over the scope and depth of your hyperparameter search.