core
Core module for ML results aggregation and management.
Handles loading, aggregating, and basic processing of results data.
Classes
ResultsAggregator | Initializes the ResultsAggregator.
DataValidator | A utility class for validating and checking the quality of results data.
Functions
get_clean_data | A utility function to get clean data for analysis.
stratify_by_outcome | Applies a function to data stratified by outcome variable.
Module Contents
- class core.ResultsAggregator(root_folder: str, feature_names_csv: str | None = None)
Initializes the ResultsAggregator.
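- Parameters:
root_folder (str) – The path to the root folder that contains the timestamped run folders.
feature_names_csv (str | None, optional) – The path to a CSV file whose column headers provide the feature names. Defaults to None.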
- aggregated_data: pandas.DataFrame | None = None
- load_feature_names(feature_names_csv: str) → None
Loads feature names from the column headers of a CSV file.
- Parameters:
feature_names_csv (str) – The path to the CSV file.
- get_available_runs() → List[str]
Gets a list of available timestamped run folders.
- Returns:
A sorted list of valid run folder names.
- Return type:
List[str]
- Raises:
ValueError – If the root folder does not exist.
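The folder discovery described above can be sketched with the standard library alone. This is an illustration, not the module's implementation: the helper name `list_run_folders` and the `YYYY-MM-DD_HH-MM-SS` naming pattern are assumptions, since the page does not state which folder names count as valid runs.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Hypothetical naming scheme; the pattern core actually accepts is not documented here.
RUN_FOLDER_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}$")


def list_run_folders(root_folder: str) -> List[str]:
    """Return the sorted names of sub-folders that look like timestamped runs."""
    root = Path(root_folder)
    if not root.is_dir():
        # Mirrors the documented behaviour: a missing root folder is a ValueError.
        raise ValueError(f"Root folder does not exist: {root_folder}")
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and RUN_FOLDER_PATTERN.match(entry.name)
    )


# Build a throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2024-02-01_09-30-00", "2024-01-15_18-05-42", "notes"):
        (Path(tmp) / name).mkdir()
    runs = list_run_folders(tmp)
```

With zero-padded timestamps, the lexicographic sort also orders the runs chronologically, and the non-matching `notes` folder is ignored.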
- load_single_run(timestamp_folder: str) → pandas.DataFrame
Loads results from a specific timestamped run folder.
- Parameters:
timestamp_folder (str) – The name of the run folder.
- Returns:
A DataFrame containing the results for that run.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the log file does not exist in the folder.
- aggregate_all_runs() → pandas.DataFrame
Aggregates results from all available runs in the root folder.
- Returns:
A single DataFrame containing all aggregated results.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no valid runs are found.
- aggregate_specific_runs(run_names: List[str]) → pandas.DataFrame
Aggregates results from a specified list of run folders.
- Parameters:
run_names (List[str]) – A list of run folder names to aggregate.
- Returns:
A single DataFrame containing the aggregated results.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no data could be loaded from the specified runs.
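A minimal sketch of the aggregation step follows, assuming each run has already been loaded into its own DataFrame. The helper `combine_runs`, the `run` provenance column, and the `auc` metric column are hypothetical and used only for illustration; whether the real method skips unloadable runs or records the run name is not stated here.

```python
from typing import Dict, List

import pandas as pd


def combine_runs(run_frames: Dict[str, pd.DataFrame], run_names: List[str]) -> pd.DataFrame:
    """Concatenate the frames of the requested runs into a single DataFrame."""
    frames = []
    for name in run_names:
        frame = run_frames.get(name)
        if frame is None or frame.empty:
            continue  # assumption: runs that cannot be loaded are skipped
        # Tag each row with its run name (hypothetical provenance column).
        frames.append(frame.assign(run=name))
    if not frames:
        raise ValueError("No data could be loaded from the specified runs.")
    return pd.concat(frames, ignore_index=True)


# Toy per-run results; 'auc' is an illustrative metric column.
run_frames = {
    "run_a": pd.DataFrame({"outcome_variable": ["y1", "y2"], "auc": [0.71, 0.64]}),
    "run_b": pd.DataFrame({"outcome_variable": ["y1"], "auc": [0.75]}),
}
combined = combine_runs(run_frames, ["run_a", "run_b", "run_missing"])
```

Passing `ignore_index=True` to `pandas.concat` gives the combined frame a fresh 0..n-1 index, so rows from different runs do not share index labels.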
- get_summary_stats(data: pandas.DataFrame | None = None) → pandas.DataFrame
Gets summary statistics for the aggregated results.
- Parameters:
data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.
- Returns:
A DataFrame containing descriptive statistics.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no data is available.
- get_outcome_variables(data: pandas.DataFrame | None = None) → List[str]
Gets a list of unique outcome variables from the data.
- Parameters:
data (Optional[pd.DataFrame], optional) – The DataFrame to inspect. If None, uses the internally stored aggregated data. Defaults to None.
- Returns:
A sorted list of unique outcome variable names.
- Return type:
List[str]
- Raises:
ValueError – If no data is available or the ‘outcome_variable’ column is missing.
- get_data_by_outcome(outcome_variable: str, data: pandas.DataFrame | None = None) → pandas.DataFrame
Filters the data for a specific outcome variable.
- Parameters:
outcome_variable (str) – The outcome variable to filter by.
data (Optional[pd.DataFrame], optional) – The DataFrame to filter. If None, uses the internally stored aggregated data. Defaults to None.
- Returns:
A new DataFrame containing only the data for the specified outcome.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no data is available, the ‘outcome_variable’ column is missing, or no data is found for the specified outcome.
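The listing and filtering behaviour documented for get_outcome_variables and get_data_by_outcome can be approximated in plain pandas. The functions `list_outcomes` and `filter_by_outcome` below are hypothetical stand-ins written for this example, and the `auc` column is illustrative.

```python
from typing import List

import pandas as pd


def list_outcomes(df: pd.DataFrame) -> List[str]:
    """Return the sorted unique values of the 'outcome_variable' column."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column is missing.")
    return sorted(df["outcome_variable"].dropna().unique().tolist())


def filter_by_outcome(df: pd.DataFrame, outcome_variable: str) -> pd.DataFrame:
    """Return a new DataFrame holding only the rows for one outcome."""
    if df is None or df.empty:
        raise ValueError("No data is available.")
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column is missing.")
    subset = df[df["outcome_variable"] == outcome_variable].copy()
    if subset.empty:
        raise ValueError(f"No data found for outcome: {outcome_variable}")
    return subset


results = pd.DataFrame(
    {"outcome_variable": ["y2", "y1", "y1"], "auc": [0.64, 0.71, 0.75]}
)
outcomes = list_outcomes(results)
y1_rows = filter_by_outcome(results, "y1")
```

Calling `.copy()` on the filtered subset matches the documented "new DataFrame" return value, so later edits to the subset do not touch the source frame.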
- get_outcome_summary(data: pandas.DataFrame | None = None) → pandas.DataFrame
Gets summary statistics stratified by outcome variable.
- Parameters:
data (Optional[pd.DataFrame], optional) – The DataFrame to summarize. If None, uses the internally stored aggregated data. Defaults to None.
- Returns:
A multi-index DataFrame with summary statistics for each outcome variable.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no data is available or the ‘outcome_variable’ column is missing.
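One way to obtain per-outcome summary statistics with a multi-index is to group by the outcome column and call `describe()`. The sketch below shows that approach; it is not confirmed to be what get_outcome_summary does internally, the metric columns `auc` and `f1` are illustrative, and the real method may put the multi-index on the rows rather than the columns.

```python
import pandas as pd

# Toy results with two illustrative metric columns.
results = pd.DataFrame(
    {
        "outcome_variable": ["y1", "y1", "y2", "y2"],
        "auc": [0.70, 0.80, 0.60, 0.62],
        "f1": [0.55, 0.65, 0.40, 0.44],
    }
)

if "outcome_variable" not in results.columns:
    raise ValueError("'outcome_variable' column is missing.")

# One row per outcome; columns form a (metric, statistic) MultiIndex.
summary = results.groupby("outcome_variable").describe()

# Look up a single statistic for one outcome.
y1_mean_auc = summary.loc["y1", ("auc", "mean")]
```

Each cell is addressed by an outcome and a `(metric, statistic)` pair, which is convenient for pulling a single statistic across all outcomes, for example `summary[("auc", "mean")]`.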
- class core.DataValidator
A utility class for validating and checking the quality of results data.
- static validate_data_structure(df: pandas.DataFrame) → Dict[str, Any]
Validates the structure and quality of a results DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame to validate.
- Returns:
A dictionary containing the validation report.
- Return type:
Dict[str, Any]
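The contents of the validation report are not specified on this page. As a rough illustration of the kind of checks such a report could hold, the sketch below builds a dictionary from a few common structural checks. Every key name, the `REQUIRED_COLUMNS` tuple, and the function `build_validation_report` are assumptions made for this example.

```python
from typing import Any, Dict

import pandas as pd

# Hypothetical set of required columns; the real validator may check others.
REQUIRED_COLUMNS = ("outcome_variable", "failed")


def build_validation_report(df: pd.DataFrame) -> Dict[str, Any]:
    """Collect simple structural and quality checks into a dictionary."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    # Keep only the columns that actually contain missing values.
    null_counts = {col: int(n) for col, n in df.isna().sum().items() if n > 0}
    return {
        "n_rows": int(len(df)),
        "missing_columns": missing,
        "null_counts": null_counts,
        "n_duplicate_rows": int(df.duplicated().sum()),
        "is_valid": not missing and len(df) > 0,
    }


results = pd.DataFrame(
    {"outcome_variable": ["y1", "y1", None], "auc": [0.7, 0.7, 0.6]}
)
report = build_validation_report(results)
```

Returning plain Python types (`int`, `list`, `dict`, `bool`) keeps the report easy to log or serialize as JSON.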
- core.get_clean_data(df: pandas.DataFrame, remove_failed: bool = True) → pandas.DataFrame
A utility function to get clean data for analysis.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
remove_failed (bool, optional) – If True, removes rows where the ‘failed’ column is 1. Defaults to True.
- Returns:
The cleaned DataFrame.
- Return type:
pd.DataFrame
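The documented effect of remove_failed can be reproduced with a boolean mask. The function `drop_failed_rows` is a hypothetical stand-in, and any cleaning steps that get_clean_data performs beyond removing failed rows are not covered here.

```python
import pandas as pd


def drop_failed_rows(df: pd.DataFrame, remove_failed: bool = True) -> pd.DataFrame:
    """Return a copy of df, optionally without rows whose 'failed' value is 1."""
    cleaned = df.copy()
    if remove_failed and "failed" in cleaned.columns:
        cleaned = cleaned[cleaned["failed"] != 1]
    return cleaned


results = pd.DataFrame(
    {
        "outcome_variable": ["y1", "y1", "y2"],
        "auc": [0.71, 0.50, 0.64],  # illustrative metric column
        "failed": [0, 1, 0],
    }
)
clean = drop_failed_rows(results)
untouched = drop_failed_rows(results, remove_failed=False)
```

Working on a copy leaves the caller's DataFrame unchanged, so the raw results stay available for inspecting the failed runs.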
- core.stratify_by_outcome(df: pandas.DataFrame, func: callable, *args: Any, **kwargs: Any) → Dict[str, Any]
Applies a function to data stratified by outcome variable.
- Parameters:
df (pd.DataFrame) – DataFrame with an ‘outcome_variable’ column.
func (callable) – The function to apply to each outcome’s data subset.
*args (Any) – Positional arguments to pass to the function.
**kwargs (Any) – Keyword arguments to pass to the function.
- Returns:
A dictionary with outcome variables as keys and the results of the function as values.
- Return type:
Dict[str, Any]
- Raises:
ValueError – If the ‘outcome_variable’ column is not found.
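The stratified application described above amounts to a group-by followed by one function call per group. The sketch below shows that pattern with hypothetical names (`apply_per_outcome`, `mean_of`) and an illustrative `auc` column; it assumes the function receives each outcome's subset as its first argument, which the parameter description suggests but does not state outright.

```python
from typing import Any, Callable, Dict

import pandas as pd


def apply_per_outcome(
    df: pd.DataFrame, func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Dict[str, Any]:
    """Call func once per outcome subset and collect the results by outcome."""
    if "outcome_variable" not in df.columns:
        raise ValueError("'outcome_variable' column not found.")
    return {
        outcome: func(group, *args, **kwargs)
        for outcome, group in df.groupby("outcome_variable")
    }


def mean_of(frame: pd.DataFrame, column: str) -> float:
    """Example function: the mean of one column within a subset."""
    return float(frame[column].mean())


results = pd.DataFrame(
    {"outcome_variable": ["y1", "y1", "y2"], "auc": [0.70, 0.80, 0.60]}
)
means = apply_per_outcome(results, mean_of, "auc")
```

Extra positional and keyword arguments are forwarded unchanged, so the same helper works for any function whose first argument is a DataFrame.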