filters
Data filtering and querying module for ML results analysis. Provides flexible filtering capabilities with outcome variable stratification.
Classes
- ResultsFilter(data) – Initializes the ResultsFilter with a results DataFrame.
- OutcomeComparator(data) – Initializes the OutcomeComparator with results data.
Module Contents
- class filters.ResultsFilter(data: pandas.DataFrame)[source]
Initializes the ResultsFilter with a results DataFrame.
- Parameters:
data (pd.DataFrame) – The DataFrame containing experiment results.
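Example of constructing a ResultsFilter (a minimal sketch; the column names and values in this frame are illustrative assumptions, not a documented schema):

    import pandas as pd
    from filters import ResultsFilter

    # Illustrative experiment results. Real frames come from the experiment
    # pipeline and may contain additional columns.
    results = pd.DataFrame({
        "algorithm": ["xgboost", "logistic_regression", "xgboost"],
        "outcome_variable": ["mortality", "mortality", "readmission"],
        "auc": [0.81, 0.74, 0.68],
        "n_features": [12, 30, 12],
        "X_train_size": [900, 900, 450],
    })

    rf = ResultsFilter(results)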
- filter_by_algorithm(algorithms: str | List[str]) pandas.DataFrame [source]
Filters the data by one or more algorithm names.
- filter_by_outcome(outcomes: str | List[str]) pandas.DataFrame [source]
Filters the data by one or more outcome variables.
- Parameters:
outcomes (Union[str, List[str]]) – A single outcome variable name or a list of names to filter by.
- Returns:
A DataFrame containing only the rows for the specified outcomes.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the ‘outcome_variable’ column is not in the data.
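A short sketch of filter_by_algorithm and filter_by_outcome, continuing the example above. Both calls return plain DataFrames, so combining them here goes through an intermediate ResultsFilter; whether the class also supports in-place chaining is not documented in this section:

    xgb_rows = rf.filter_by_algorithm("xgboost")
    mortality_rows = rf.filter_by_outcome("mortality")

    # Combine both filters via an intermediate ResultsFilter.
    xgb_mortality = ResultsFilter(xgb_rows).filter_by_outcome("mortality")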
- filter_by_metric_threshold(metric: str, threshold: float, above: bool = True) pandas.DataFrame [source]
Filters the data based on a threshold applied to a given metric.
- Parameters:
- metric (str) – The name of the metric column to apply the threshold to.
- threshold (float) – The threshold value to compare against.
- above (bool, optional) – If True, keep rows where the metric is above the threshold; if False, keep rows below it. Defaults to True.
- Returns:
The filtered DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the specified metric column is not in the data.
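A minimal sketch for filter_by_metric_threshold, reusing the rf instance from the constructor example above; the cutoff value is arbitrary:

    # Keep configurations whose AUC clears 0.75; pass above=False to keep the rest.
    strong_runs = rf.filter_by_metric_threshold(metric="auc", threshold=0.75, above=True)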
- filter_by_run_timestamp(timestamps: str | List[str]) pandas.DataFrame [source]
Filters data by one or more run timestamps.
- filter_successful_runs() pandas.DataFrame [source]
Filters out failed runs from the data.
- Returns:
A DataFrame containing only successful runs.
- Return type:
pd.DataFrame
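A sketch of filter_by_run_timestamp and filter_successful_runs, continuing the example; the timestamp strings are hypothetical, and both calls assume the results frame also records run timestamps and a success indicator (columns not shown in the illustrative frame above):

    # Restrict to runs recorded at specific timestamps.
    recent = rf.filter_by_run_timestamp(["2024-01-15_10-30-00", "2024-01-16_09-00-00"])

    # Drop failed runs before computing summaries.
    ok_runs = rf.filter_successful_runs()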
- filter_by_feature_count(min_features: int | None = None, max_features: int | None = None) pandas.DataFrame [source]
Filters data by the number of features used.
- Parameters:
- min_features (int, optional) – The minimum number of features used. Defaults to None.
- max_features (int, optional) – The maximum number of features used. Defaults to None.
- Returns:
The filtered DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the ‘n_features’ column is not in the data.
- filter_by_sample_size(min_train_size: int | None = None, max_train_size: int | None = None) pandas.DataFrame [source]
Filters data by the training sample size.
- Parameters:
- min_train_size (int, optional) – The minimum training set size. Defaults to None.
- max_train_size (int, optional) – The maximum training set size. Defaults to None.
- Returns:
The filtered DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the ‘X_train_size’ column is not in the data.
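A sketch of the feature-count and sample-size filters, continuing the example; the bounds are arbitrary:

    # Configurations that used between 5 and 20 features (per 'n_features').
    mid_sized_models = rf.filter_by_feature_count(min_features=5, max_features=20)

    # Configurations trained on at least 500 samples (per 'X_train_size').
    well_powered = rf.filter_by_sample_size(min_train_size=500)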
- get_top_performers(metric: str = 'auc', n: int = 10, stratify_by_outcome: bool = False) pandas.DataFrame | Dict[str, pandas.DataFrame] [source]
Gets the top N performing configurations.
- Parameters:
- metric (str, optional) – The performance metric to rank by. Defaults to ‘auc’.
- n (int, optional) – The number of top configurations to return. Defaults to 10.
- stratify_by_outcome (bool, optional) – If True, return the top performers separately for each outcome variable. Defaults to False.
- Returns:
A DataFrame of top performers, or a dictionary of DataFrames if stratified.
- Return type:
Union[pd.DataFrame, Dict[str, pd.DataFrame]]
- Raises:
ValueError – If stratifying by outcome and ‘outcome_variable’ column is missing.
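A sketch for get_top_performers, continuing the example; that the stratified result is keyed by outcome variable is an inference from the return description:

    # Ten best configurations overall, ranked by AUC.
    top10 = rf.get_top_performers(metric="auc", n=10)

    # Top five per outcome variable, returned as an {outcome: DataFrame} mapping.
    top_by_outcome = rf.get_top_performers(metric="auc", n=5, stratify_by_outcome=True)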
- get_algorithm_performance_summary(metric: str = 'auc', stratify_by_outcome: bool = False) pandas.DataFrame | Dict[str, pandas.DataFrame] [source]
Gets a performance summary grouped by algorithm.
- Parameters:
- metric (str, optional) – The performance metric to summarize. Defaults to ‘auc’.
- stratify_by_outcome (bool, optional) – If True, return one summary per outcome variable. Defaults to False.
- Returns:
A summary DataFrame, or a dictionary of summary DataFrames if stratified.
- Return type:
Union[pd.DataFrame, Dict[str, pd.DataFrame]]
- Raises:
ValueError – If stratifying by outcome and ‘outcome_variable’ column is missing.
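A sketch for get_algorithm_performance_summary, continuing the example:

    # Per-algorithm summary of AUC across all runs.
    summary = rf.get_algorithm_performance_summary(metric="auc")

    # One summary DataFrame per outcome variable.
    summary_by_outcome = rf.get_algorithm_performance_summary(
        metric="auc", stratify_by_outcome=True
    )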
- find_feature_usage_patterns(min_frequency: int = 5, stratify_by_outcome: bool = False) Dict[str, int] | Dict[str, Dict[str, int]] [source]
Finds common feature usage patterns.
This method requires the ‘decoded_features’ column to be present.
- Parameters:
- min_frequency (int, optional) – The minimum number of configurations a feature must appear in to be counted. Defaults to 5.
- stratify_by_outcome (bool, optional) – If True, count feature usage separately for each outcome variable. Defaults to False.
- Returns:
A dictionary of feature counts, or a nested dictionary if stratified.
- Return type:
Union[Dict[str, int], Dict[str, Dict[str, int]]]
- Raises:
ValueError – If ‘decoded_features’ or (if stratifying) ‘outcome_variable’ column is missing.
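A sketch for find_feature_usage_patterns, continuing the example. It assumes the results frame also carries the required ‘decoded_features’ column; the exact format of that column is not documented in this section:

    # Features that appear in at least 10 configurations.
    common = rf.find_feature_usage_patterns(min_frequency=10)

    # Counted separately per outcome variable, as nested dictionaries.
    common_by_outcome = rf.find_feature_usage_patterns(
        min_frequency=10, stratify_by_outcome=True
    )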
- compare_algorithms_across_outcomes(algorithms: List[str] | None = None, metric: str = 'auc') pandas.DataFrame [source]
Compares algorithm performance across different outcome variables.
- Parameters:
- algorithms (List[str], optional) – The algorithms to compare. Defaults to None.
- metric (str, optional) – The performance metric to compare. Defaults to ‘auc’.
- Returns:
A pivot table with algorithms as rows and outcomes as columns, showing the mean performance.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ‘outcome_variable’ column is missing.
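A sketch for compare_algorithms_across_outcomes, continuing the example; the algorithm names match the illustrative frame:

    # Mean AUC for the selected algorithms on every outcome
    # (algorithms as rows, outcomes as columns).
    pivot = rf.compare_algorithms_across_outcomes(
        algorithms=["xgboost", "logistic_regression"], metric="auc"
    )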
- get_outcome_difficulty_ranking(metric: str = 'auc') pandas.DataFrame [source]
Ranks outcome variables by difficulty based on average performance.
- Parameters:
metric (str, optional) – The performance metric to use for ranking. Defaults to ‘auc’.
- Returns:
A DataFrame with outcomes ranked by difficulty.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ‘outcome_variable’ column is missing.
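A sketch for get_outcome_difficulty_ranking, continuing the example:

    # Rank outcomes by how hard they are to predict, judged by average AUC.
    ranking = rf.get_outcome_difficulty_ranking(metric="auc")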
- filter_by_cross_outcome_performance(metric: str = 'auc', min_outcomes: int = 2, percentile_threshold: float = 75) pandas.DataFrame [source]
Finds algorithms/configurations that perform well across multiple outcomes.
- Parameters:
- metric (str, optional) – The performance metric to evaluate. Defaults to ‘auc’.
- min_outcomes (int, optional) – The minimum number of outcome variables on which a configuration must perform well. Defaults to 2.
- percentile_threshold (float, optional) – The performance percentile a configuration must reach on an outcome to count as performing well. Defaults to 75.
- Returns:
A DataFrame of configurations that perform well across multiple outcomes.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ‘outcome_variable’ column is missing.
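A sketch for filter_by_cross_outcome_performance, continuing the example; the interpretation in the comment (a per-outcome percentile cutoff) is an assumption based on the parameter names:

    # Configurations that reach at least the 75th percentile of AUC
    # on two or more outcome variables.
    robust = rf.filter_by_cross_outcome_performance(
        metric="auc", min_outcomes=2, percentile_threshold=75
    )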
- class filters.OutcomeComparator(data: pandas.DataFrame)[source]
Initializes the OutcomeComparator with results data.
- Parameters:
data (pd.DataFrame) – The DataFrame containing experiment results.
- Raises:
ValueError – If ‘outcome_variable’ column is missing.
- get_outcome_characteristics() pandas.DataFrame [source]
Gets characteristics of each outcome variable based on metadata.
- Returns:
A DataFrame summarizing the characteristics of each outcome.
- Return type:
pd.DataFrame
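A sketch for OutcomeComparator and get_outcome_characteristics, reusing the illustrative results frame from the ResultsFilter example above (the constructor requires an ‘outcome_variable’ column):

    from filters import OutcomeComparator

    oc = OutcomeComparator(results)

    # Metadata-based summary of each outcome variable.
    characteristics = oc.get_outcome_characteristics()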
- find_similar_outcomes(reference_outcome: str, similarity_metrics: List[str] | None = None) pandas.DataFrame [source]
Finds outcomes with similar performance patterns to a reference outcome.
- Parameters:
- reference_outcome (str) – The outcome variable to compare other outcomes against.
- similarity_metrics (List[str], optional) – The performance metrics used to compute similarity. Defaults to None.
- Returns:
A DataFrame with similarity scores for other outcomes.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no data is found for the reference outcome.
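A sketch for find_similar_outcomes, continuing the OutcomeComparator example; the metric list is illustrative, and what is used when similarity_metrics is left as None is not documented in this section:

    # Score how closely other outcomes' performance patterns match 'mortality'.
    similar = oc.find_similar_outcomes(reference_outcome="mortality")

    # Restrict the comparison to specific metrics.
    similar_auc = oc.find_similar_outcomes(
        reference_outcome="mortality", similarity_metrics=["auc"]
    )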