core

Core module for ML results aggregation and management.

Handles loading, aggregating, and basic processing of results data.

Classes

`ResultsAggregator`	Loads and aggregates results from experiment run folders.
`DataValidator`	A utility class for validating and checking the quality of results data.

Functions

`get_clean_data`(→ pandas.DataFrame)	Returns cleaned results data for analysis, optionally removing failed rows.
`stratify_by_outcome`(→ Dict[str, Any])	Applies a function to data stratified by outcome variable.

Module Contents

class core.ResultsAggregator(root_folder: str, feature_names_csv: str | None = None)

Initializes the ResultsAggregator.

Parameters:

root_folder (str) – The path to the master root folder containing experiment run subfolders.
feature_names_csv (Optional[str], optional) – The path to a CSV file whose headers are the original feature names. This is required for decoding feature lists. Defaults to None.

root_folder

feature_names: List[str] | None = None

aggregated_data: pandas.DataFrame | None = None

load_feature_names(feature_names_csv: str) → None

Loads feature names from the column headers of a CSV file.

Parameters: feature_names_csv (str) – The path to the CSV file.
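As an illustration of reading feature names from CSV headers, here is a minimal standard-library sketch. The helper `read_header_names` and the sample column names are hypothetical and are not part of `core`; the module's own implementation may differ.

```python
import csv
import io
from typing import List


def read_header_names(csv_file) -> List[str]:
    """Return the column headers (first row) of an open CSV file."""
    reader = csv.reader(csv_file)
    # The first row holds the feature names; an empty file yields [].
    return next(reader, [])


# In-memory stand-in for a feature-names CSV (hypothetical column names).
sample = io.StringIO("age,bmi,glucose\n52,27.1,5.4\n")
feature_names = read_header_names(sample)
print(feature_names)
```

Only the header row is consumed, so the data rows of the file do not need to be loaded.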

get_available_runs() → List[str]

Gets a list of available timestamped run folders.

Returns: A sorted list of valid run folder names.
Return type: List[str]
Raises: ValueError – If the root folder does not exist.
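The documented behaviour (sorted folder names, `ValueError` for a missing root) can be sketched as follows. The folder-name pattern and the helper `list_run_folders` are assumptions made for illustration; the reference does not state what makes a run folder valid.

```python
import os
import re
import tempfile
from typing import List

# Assumed naming scheme for run folders (e.g. "2024-01-31_120000");
# the real module may accept a different format.
RUN_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{6}$")


def list_run_folders(root_folder: str) -> List[str]:
    """Return sorted names of subfolders that look like timestamped runs."""
    if not os.path.isdir(root_folder):
        raise ValueError(f"Root folder does not exist: {root_folder}")
    return sorted(
        name
        for name in os.listdir(root_folder)
        if os.path.isdir(os.path.join(root_folder, name))
        and RUN_PATTERN.match(name)
    )


# Build a throwaway root with two run folders and one unrelated folder.
with tempfile.TemporaryDirectory() as root:
    for name in ("2024-02-01_090000", "2024-01-31_120000", "notes"):
        os.mkdir(os.path.join(root, name))
    runs = list_run_folders(root)

print(runs)
```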

load_single_run(timestamp_folder: str) → pandas.DataFrame

Loads results from a specific timestamped run folder.

Parameters: timestamp_folder (str) – The name of the run folder.
Returns: A DataFrame containing the results for that run.
Return type: pd.DataFrame
Raises: FileNotFoundError – If the log file does not exist in the folder.

aggregate_all_runs() → pandas.DataFrame

Aggregates results from all available runs in the root folder.

Returns: A single DataFrame containing all aggregated results.
Return type: pd.DataFrame
Raises: ValueError – If no valid runs are found.

aggregate_specific_runs(run_names: List[str]) → pandas.DataFrame

Aggregates results from a specified list of run folders.

Parameters: run_names (List[str]) – A list of run folder names to aggregate.
Returns: A single DataFrame containing the aggregated results.
Return type: pd.DataFrame
Raises: ValueError – If no data could be loaded from the specified runs.
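One plausible way to combine per-run results into a single DataFrame is `pandas.concat`. The sketch below is an assumption about the general approach, not the module's code: the `run` column, the `auc` metric, and the helper `combine_runs` are all hypothetical.

```python
from typing import Dict

import pandas as pd

# Hypothetical per-run results; in practice each frame would come from
# a call such as load_single_run().
run_frames = {
    "run_a": pd.DataFrame({"outcome_variable": ["y1", "y2"], "auc": [0.81, 0.74]}),
    "run_b": pd.DataFrame({"outcome_variable": ["y1"], "auc": [0.79]}),
}


def combine_runs(frames: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-run frames into one, tagging each row with its run name."""
    tagged = [df.assign(run=name) for name, df in frames.items() if not df.empty]
    if not tagged:
        raise ValueError("No data could be loaded from the specified runs.")
    return pd.concat(tagged, ignore_index=True)


combined = combine_runs(run_frames)
print(combined)
```

Tagging rows with their run of origin keeps later per-run comparisons possible after the frames are stacked.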

get_summary_stats(data: pandas.DataFrame | None = None) → pandas.DataFrame

Gets summary statistics for the aggregated results.

Parameters: data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.
Returns: A DataFrame containing descriptive statistics.
Return type: pd.DataFrame
Raises: ValueError – If no data is available.

get_outcome_variables(data: pandas.DataFrame | None = None) → List[str]

Gets a list of unique outcome variables from the data.

Parameters: data (Optional[pd.DataFrame], optional) – The DataFrame to inspect. If None, uses the internally stored aggregated data. Defaults to None.
Returns: A sorted list of unique outcome variable names.
Return type: List[str]
Raises: ValueError – If no data is available or the ‘outcome_variable’ column is missing.

get_data_by_outcome(outcome_variable: str, data: pandas.DataFrame | None = None) → pandas.DataFrame

Filters the data for a specific outcome variable.

Parameters:

outcome_variable (str) – The outcome variable to filter by.
data (Optional[pd.DataFrame], optional) – The DataFrame to filter. If None, uses the internally stored aggregated data. Defaults to None.

Returns:

A new DataFrame containing only the data for the specified outcome.

Return type:

pd.DataFrame

Raises:

ValueError – If no data is available, the ‘outcome_variable’ column is missing, or no data is found for the specified outcome.
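Listing the outcome variables and filtering by one of them can be pictured with plain pandas operations. This is a sketch under assumptions: the helpers `outcomes_in` and `rows_for_outcome`, the `auc` column, and the sample values are hypothetical, and only the `outcome_variable` column name comes from the reference.

```python
from typing import List

import pandas as pd

# Hypothetical aggregated results.
results = pd.DataFrame(
    {
        "outcome_variable": ["mortality", "readmission", "mortality"],
        "auc": [0.81, 0.70, 0.84],
    }
)


def outcomes_in(df: pd.DataFrame) -> List[str]:
    """Return the sorted unique values of the 'outcome_variable' column."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column is missing.")
    return sorted(df["outcome_variable"].dropna().unique().tolist())


def rows_for_outcome(df: pd.DataFrame, outcome: str) -> pd.DataFrame:
    """Return a new frame holding only the rows for one outcome."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column is missing.")
    subset = df[df["outcome_variable"] == outcome].copy()
    if subset.empty:
        raise ValueError(f"No data found for outcome: {outcome}")
    return subset


print(outcomes_in(results))
print(rows_for_outcome(results, "mortality"))
```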

get_outcome_summary(data: pandas.DataFrame | None = None) → pandas.DataFrame

Gets summary statistics stratified by outcome variable.

Parameters: data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.
Returns: A multi-index DataFrame with summary statistics for each outcome variable.
Return type: pd.DataFrame
Raises: ValueError – If no data is available or the ‘outcome_variable’ column is missing.
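A stratified summary of this kind is commonly produced with `groupby(...).describe()`, which yields MultiIndex columns of (metric, statistic) pairs. Whether the module builds its result this way, and whether its multi-index sits on the rows or the columns, is not stated in the reference; the `auc` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical aggregated results with a single numeric metric.
results = pd.DataFrame(
    {
        "outcome_variable": ["a", "a", "b", "b"],
        "auc": [0.8, 0.9, 0.6, 0.7],
    }
)

# One row per outcome; columns are (metric, statistic) pairs.
summary = results.groupby("outcome_variable")[["auc"]].describe()
print(summary)

# Individual statistics are addressed with a (metric, statistic) tuple.
mean_auc_a = summary.loc["a", ("auc", "mean")]
print(mean_auc_a)
```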

class core.DataValidator

A utility class for validating and checking the quality of results data.

static validate_data_structure(df: pandas.DataFrame) → Dict[str, Any]

Validates the structure and quality of a results DataFrame.

Parameters: df (pd.DataFrame) – The DataFrame to validate.
Returns: A dictionary containing the validation report.
Return type: Dict[str, Any]

static print_validation_report(validation_report: Dict[str, Any]) → None

Prints a formatted validation report to the console.

Parameters: validation_report (Dict[str, Any]) – The validation report dictionary generated by validate_data_structure.

core.get_clean_data(df: pandas.DataFrame, remove_failed: bool = True) → pandas.DataFrame

Returns cleaned results data for analysis, optionally removing rows marked as failed.

Parameters:

df (pd.DataFrame) – The input DataFrame.
remove_failed (bool, optional) – If True, removes rows where the ‘failed’ column is 1. Defaults to True.

Returns:

The cleaned DataFrame.

Return type:

pd.DataFrame
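The effect of `remove_failed` can be pictured with a boolean mask on the ‘failed’ column. The sketch below covers only that documented step; the helper `drop_failed_rows`, the sample columns, and the index reset are assumptions, and the real function may apply further cleaning.

```python
import pandas as pd

# Hypothetical results where the second run is marked as failed.
results = pd.DataFrame(
    {
        "outcome_variable": ["y1", "y1", "y2"],
        "auc": [0.81, 0.0, 0.74],
        "failed": [0, 1, 0],
    }
)


def drop_failed_rows(df: pd.DataFrame, remove_failed: bool = True) -> pd.DataFrame:
    """Return a copy of df, without rows where 'failed' is 1 if requested."""
    if remove_failed and "failed" in df.columns:
        return df[df["failed"] != 1].reset_index(drop=True)
    return df.copy()


clean = drop_failed_rows(results)
print(clean)
```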

core.stratify_by_outcome(df: pandas.DataFrame, func: callable, *args: Any, **kwargs: Any) → Dict[str, Any]

Applies a function to data stratified by outcome variable.

Parameters:

df (pd.DataFrame) – DataFrame with an ‘outcome_variable’ column.
func (callable) – The function to apply to each outcome’s data subset.
*args (Any) – Positional arguments to pass to the function.
**kwargs (Any) – Keyword arguments to pass to the function.

Returns:

A dictionary with outcome variables as keys and the results of the function as values.

Return type:

Dict[str, Any]

Raises:

ValueError – If the ‘outcome_variable’ column is not found.
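The stratify-then-apply pattern described here can be sketched with `DataFrame.groupby`. The helper `apply_per_outcome` mirrors the documented signature, but it is an illustrative stand-in rather than the module's code; `mean_of` and the sample data are hypothetical.

```python
from typing import Any, Callable, Dict

import pandas as pd


def apply_per_outcome(
    df: pd.DataFrame, func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Dict[str, Any]:
    """Call func on each outcome's subset and collect the results by outcome."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column not found.")
    return {
        outcome: func(group, *args, **kwargs)
        for outcome, group in df.groupby("outcome_variable")
    }


def mean_of(subset: pd.DataFrame, column: str) -> float:
    """Example function: mean of one column within a subset."""
    return float(subset[column].mean())


# Hypothetical results for two outcomes.
results = pd.DataFrame(
    {"outcome_variable": ["a", "a", "b"], "auc": [0.5, 1.0, 0.25]}
)

per_outcome = apply_per_outcome(results, mean_of, column="auc")
print(per_outcome)  # {'a': 0.75, 'b': 0.25}
```

Extra positional and keyword arguments are forwarded unchanged, so the same function can be reused across outcomes with different settings.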