core

Core module for ML results aggregation and management.

Handles loading, aggregating, and basic processing of results data.

Classes

ResultsAggregator

Loads and aggregates ML results from experiment run folders.

DataValidator

A utility class for validating and checking the quality of results data.

Functions

get_clean_data(→ pandas.DataFrame)

Returns cleaned results data for analysis, optionally removing rows marked as failed.

stratify_by_outcome(→ Dict[str, Any])

Applies a function to data stratified by outcome variable.

Module Contents

class core.ResultsAggregator(root_folder: str, feature_names_csv: str | None = None)

Loads and aggregates ML results from the experiment run subfolders under a root folder.

Parameters:
  • root_folder (str) – The path to the master root folder containing experiment run subfolders.

  • feature_names_csv (Optional[str], optional) – The path to a CSV file whose headers are the original feature names. This is required for decoding feature lists. Defaults to None.

root_folder

feature_names: List[str] | None = None

aggregated_data: pandas.DataFrame | None = None

load_feature_names(feature_names_csv: str) → None

Loads feature names from the column headers of a CSV file.

Parameters:

feature_names_csv (str) – The path to the CSV file.
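
As a rough illustration of what load_feature_names describes, the sketch below reads only the header row of a CSV file using the standard library. The helper name read_header_names is hypothetical, and the real method may well use pandas and store the result on feature_names instead of returning it.

```python
import csv
from typing import List


def read_header_names(csv_path: str) -> List[str]:
    """Return the column headers (first row) of a CSV file.

    Hypothetical stand-in for load_feature_names; not the module's own code.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # Only the first row is read; an empty file yields an empty list.
        return next(reader, [])
```

Reading just the header keeps the call cheap even when the CSV holds a large feature matrix.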

get_available_runs() → List[str]

Gets a list of available timestamped run folders.

Returns:

A sorted list of valid run folder names.

Return type:

List[str]

Raises:

ValueError – If the root folder does not exist.
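
The behaviour described for get_available_runs can be sketched as follows. This reference does not state which folder names count as valid, so the timestamp pattern below (YYYY-MM-DD_HH-MM-SS) and the helper name list_run_folders are assumptions made purely for illustration.

```python
import os
import re
from typing import List

# Assumed naming convention; the actual pattern is not given in this reference.
RUN_FOLDER_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}$")


def list_run_folders(root_folder: str) -> List[str]:
    """Return sorted names of subfolders that look like timestamped runs."""
    if not os.path.isdir(root_folder):
        raise ValueError(f"Root folder does not exist: {root_folder}")
    return sorted(
        name
        for name in os.listdir(root_folder)
        # Keep directories only, and only those matching the assumed pattern.
        if os.path.isdir(os.path.join(root_folder, name))
        and RUN_FOLDER_PATTERN.match(name)
    )
```

With zero-padded timestamps like these, a plain lexical sort also orders the runs chronologically.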

load_single_run(timestamp_folder: str) → pandas.DataFrame

Loads results from a specific timestamped run folder.

Parameters:

timestamp_folder (str) – The name of the run folder.

Returns:

A DataFrame containing the results for that run.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If the log file does not exist in the folder.

aggregate_all_runs() → pandas.DataFrame

Aggregates results from all available runs in the root folder.

Returns:

A single DataFrame containing all aggregated results.

Return type:

pd.DataFrame

Raises:

ValueError – If no valid runs are found.

aggregate_specific_runs(run_names: List[str]) → pandas.DataFrame

Aggregates results from a specified list of run folders.

Parameters:

run_names (List[str]) – A list of run folder names to aggregate.

Returns:

A single DataFrame containing the aggregated results.

Return type:

pd.DataFrame

Raises:

ValueError – If no data could be loaded from the specified runs.
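
A minimal sketch of the aggregation step, assuming each run has already been loaded into its own DataFrame. The helper combine_runs and the run_timestamp column are hypothetical; whether the real method tags rows with their run of origin is not stated here.

```python
from typing import Dict, List

import pandas as pd


def combine_runs(
    frames_by_run: Dict[str, pd.DataFrame], run_names: List[str]
) -> pd.DataFrame:
    """Concatenate the frames of the requested runs into one DataFrame."""
    loaded = []
    for name in run_names:
        frame = frames_by_run.get(name)
        if frame is None or frame.empty:
            continue  # skip runs that are missing or contain no rows
        # Assumption: record the run name so rows stay traceable after merging.
        loaded.append(frame.assign(run_timestamp=name))
    if not loaded:
        raise ValueError("No data could be loaded from the specified runs.")
    return pd.concat(loaded, ignore_index=True)
```

Skipping unreadable runs and failing only when nothing at all was loaded mirrors the ValueError condition documented above.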

get_summary_stats(data: pandas.DataFrame | None = None) → pandas.DataFrame

Gets summary statistics for the aggregated results.

Parameters:

data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.

Returns:

A DataFrame containing descriptive statistics.

Return type:

pd.DataFrame

Raises:

ValueError – If no data is available.

get_outcome_variables(data: pandas.DataFrame | None = None) → List[str]

Gets a list of unique outcome variables from the data.

Parameters:

data (Optional[pd.DataFrame], optional) – The DataFrame to inspect. If None, uses the internally stored aggregated data. Defaults to None.

Returns:

A sorted list of unique outcome variable names.

Return type:

List[str]

Raises:

ValueError – If no data is available or the ‘outcome_variable’ column is missing.
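
The lookup that get_outcome_variables describes might look like the sketch below. The helper unique_outcomes is hypothetical, and dropping missing values before sorting is an assumption.

```python
from typing import List

import pandas as pd


def unique_outcomes(data: pd.DataFrame) -> List[str]:
    """Return the sorted unique values of the 'outcome_variable' column."""
    if "outcome_variable" not in data.columns:
        raise ValueError("'outcome_variable' column is missing.")
    # Assumption: missing values are ignored rather than reported as an outcome.
    return sorted(data["outcome_variable"].dropna().unique().tolist())
```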

get_data_by_outcome(outcome_variable: str, data: pandas.DataFrame | None = None) → pandas.DataFrame

Filters the data for a specific outcome variable.

Parameters:
  • outcome_variable (str) – The outcome variable to filter by.

  • data (Optional[pd.DataFrame], optional) – The DataFrame to filter. If None, uses the internally stored aggregated data. Defaults to None.

Returns:

A new DataFrame containing only the data for the specified outcome.

Return type:

pd.DataFrame

Raises:

ValueError – If no data is available, the ‘outcome_variable’ column is missing, or no data is found for the specified outcome.
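
The filtering and error conditions listed for get_data_by_outcome can be sketched with a boolean mask. The helper filter_by_outcome is a hypothetical stand-in, not the method itself.

```python
import pandas as pd


def filter_by_outcome(data: pd.DataFrame, outcome_variable: str) -> pd.DataFrame:
    """Return a new DataFrame holding only the rows for one outcome."""
    if "outcome_variable" not in data.columns:
        raise ValueError("'outcome_variable' column is missing.")
    # .copy() makes the result independent of the input frame.
    subset = data[data["outcome_variable"] == outcome_variable].copy()
    if subset.empty:
        raise ValueError(f"No data found for outcome: {outcome_variable}")
    return subset
```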

get_outcome_summary(data: pandas.DataFrame | None = None) → pandas.DataFrame

Gets summary statistics stratified by outcome variable.

Parameters:

data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.

Returns:

A multi-index DataFrame with summary statistics for each outcome variable.

Return type:

pd.DataFrame

Raises:

ValueError – If no data is available or the ‘outcome_variable’ column is missing.
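
One plausible way to produce a per-outcome summary is a pandas groupby followed by describe(). The sketch below is an assumption about the approach: it yields one row per outcome with (column, statistic) MultiIndex columns, which may differ from the exact layout the method returns.

```python
import pandas as pd


def summarize_by_outcome(data: pd.DataFrame) -> pd.DataFrame:
    """Descriptive statistics of the numeric columns, one row per outcome."""
    if "outcome_variable" not in data.columns:
        raise ValueError("'outcome_variable' column is missing.")
    numeric = data.select_dtypes(include="number")
    # Grouping by the outcome Series aligns on the index; describe() then
    # returns count, mean, std, min, quartiles and max for every column.
    return numeric.groupby(data["outcome_variable"]).describe()
```

An individual statistic can then be read with a (column, statistic) pair, for example summary.loc["a", ("auc", "mean")] for a hypothetical auc column.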

class core.DataValidator

A utility class for validating and checking the quality of results data.

static validate_data_structure(df: pandas.DataFrame) → Dict[str, Any]

Validates the structure and quality of a results DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame to validate.

Returns:

A dictionary containing the validation report.

Return type:

Dict[str, Any]

static print_validation_report(validation_report: Dict[str, Any]) → None

Prints a formatted validation report to the console.

Parameters:

validation_report (Dict[str, Any]) – The validation report dictionary generated by validate_data_structure.

core.get_clean_data(df: pandas.DataFrame, remove_failed: bool = True) → pandas.DataFrame

Returns cleaned results data for analysis, optionally removing rows marked as failed.

Parameters:
  • df (pd.DataFrame) – The input DataFrame.

  • remove_failed (bool, optional) – If True, removes rows where the ‘failed’ column is 1. Defaults to True.

Returns:

The cleaned DataFrame.

Return type:

pd.DataFrame
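
The remove_failed behaviour can be sketched as a simple row filter. The helper drop_failed_rows is hypothetical and covers only the documented 'failed' rule; any further cleaning the real function performs is not shown, and tolerating a missing 'failed' column is an assumption.

```python
import pandas as pd


def drop_failed_rows(df: pd.DataFrame, remove_failed: bool = True) -> pd.DataFrame:
    """Return a copy of df, optionally without rows where 'failed' is 1."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    # Assumption: if there is no 'failed' column, nothing is removed.
    if remove_failed and "failed" in cleaned.columns:
        cleaned = cleaned[cleaned["failed"] != 1]
    return cleaned
```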

core.stratify_by_outcome(df: pandas.DataFrame, func: callable, *args: Any, **kwargs: Any) → Dict[str, Any]

Applies a function to data stratified by outcome variable.

Parameters:
  • df (pd.DataFrame) – DataFrame with an ‘outcome_variable’ column.

  • func (callable) – The function to apply to each outcome’s data subset.

  • *args (Any) – Positional arguments to pass to the function.

  • **kwargs (Any) – Keyword arguments to pass to the function.

Returns:

A dictionary with outcome variables as keys and the results of the function as values.

Return type:

Dict[str, Any]

Raises:

ValueError – If the ‘outcome_variable’ column is not found.
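
The stratified application described above amounts to a groupby loop that collects one result per outcome. The sketch below uses a hypothetical helper, apply_per_outcome, and assumes that each subset is passed to func as its first argument.

```python
from typing import Any, Callable, Dict

import pandas as pd


def apply_per_outcome(
    df: pd.DataFrame, func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Dict[str, Any]:
    """Call func on each outcome's subset and collect the results by outcome."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column is not found.")
    return {
        # Each subset is a DataFrame holding the rows of a single outcome.
        outcome: func(subset, *args, **kwargs)
        for outcome, subset in df.groupby("outcome_variable")
    }
```

For example, apply_per_outcome(df, len) would return the number of rows recorded for each outcome.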