ml_grid.util.synthetic_data_generator

Module for generating synthetic datasets that mimic the structure of real-world data used in the ml-grid pipeline.

Attributes

missing_pickle_filename

Classes

SyntheticDataGenerator

Initializes the SyntheticDataGenerator with specified parameters.

Functions

generate_synthetic_data(→ tuple[pandas.DataFrame, ...)

A convenience function to generate a synthetic dataset.

Module Contents

class ml_grid.util.synthetic_data_generator.SyntheticDataGenerator(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True)

Initializes the SyntheticDataGenerator with specified parameters.

Parameters:

n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
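percent_important_features (float) – Share of features that should be predictive of the outcome, expressed as a fraction between 0 and 1 (the default 0.1 is 10%).
percent_binary_features (float) – Share of features to be binary, expressed as a fraction between 0 and 1 (the default 0.15 is 15%).
percent_int_features (float) – Share of features to be integer-based, expressed as a fraction between 0 and 1 (the default 0.2 is 20%).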
verbose (bool) – If True, prints generation status messages.
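
As a rough illustration of how the fractional parameters above relate to column counts, the following sketch computes the number of features each fraction would select. The helper `expected_feature_counts` is hypothetical and not part of ml-grid; it assumes counts are rounded down and that all three fractions apply to `n_features`, which the library may handle differently.

```python
import math


def expected_feature_counts(
    n_features: int = 150,
    percent_important_features: float = 0.1,
    percent_binary_features: float = 0.15,
    percent_int_features: float = 0.2,
    feature_strength: float = 0.8,
) -> dict[str, int]:
    """Hypothetical helper: estimate how many columns each fraction selects."""
    fractions = {
        "important": percent_important_features,
        "binary": percent_binary_features,
        "integer": percent_int_features,
    }
    # Validate every fraction, and feature_strength, against the 0..1 range.
    for name, value in {**fractions, "feature_strength": feature_strength}.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # Assumption: counts are rounded down; the library may round differently.
    return {name: math.floor(n_features * frac) for name, frac in fractions.items()}
```

Under these assumptions, the default arguments give 15 important, 22 binary, and 30 integer features out of 150.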

n_rows = 1000

n_features = 150

n_outcome_vars = 3

feature_strength = 0.8

percent_important_features = 0.1

percent_binary_features = 0.15

percent_int_features = 0.2

logger

generate() → tuple[pandas.DataFrame, dict[str, list[str]]]

Generates and returns the synthetic DataFrame and a map of important features.

Returns:

The fully generated synthetic dataset.
A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
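
A minimal sketch of the class-based workflow, using only the constructor arguments and the `generate()` signature documented above. The function names `build_dataset` and `count_important` are hypothetical helpers introduced for illustration, and the small sizes are arbitrary. The import is done lazily, so the block loads without ml-grid installed, but calling `build_dataset()` does require it.

```python
def build_dataset(n_rows: int = 200, n_features: int = 20):
    """Instantiate the generator and call generate(); requires ml-grid."""
    # Lazy import: only needed when the function is actually called.
    from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator

    generator = SyntheticDataGenerator(
        n_rows=n_rows,
        n_features=n_features,
        n_outcome_vars=2,
        verbose=False,
    )
    # generate() returns the DataFrame and the outcome -> features map.
    df, important_features = generator.generate()
    return df, important_features


def count_important(important_features: dict[str, list[str]]) -> dict[str, int]:
    """Hypothetical helper: number of important features per outcome."""
    return {outcome: len(cols) for outcome, cols in important_features.items()}
```

Passing the second element of the returned tuple to `count_important` gives a quick check on how many predictive features each outcome received.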

ml_grid.util.synthetic_data_generator.generate_synthetic_data(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]

A convenience function to generate a synthetic dataset.

This function instantiates the SyntheticDataGenerator, calls its generate method, and returns the resulting DataFrame together with the map of important features.

Parameters:

n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
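percent_important_features (float) – Share of features that should be predictive of the outcome, expressed as a fraction between 0 and 1 (the default 0.1 is 10%).
percent_binary_features (float) – Share of features to be binary, expressed as a fraction between 0 and 1 (the default 0.15 is 15%).
percent_int_features (float) – Share of features to be integer-based, expressed as a fraction between 0 and 1 (the default 0.2 is 20%).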
verbose (bool) – If True, enables logging of generation status.

Returns:

The generated synthetic dataset.
A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
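
The convenience function can be called in one step, as sketched below. `generate_small_example` and `split_columns` are hypothetical names introduced for illustration. `split_columns` assumes that the keys of the returned dictionary are the outcome column names and that they appear among the DataFrame's columns; the documentation above does not confirm this. Calling `generate_small_example()` requires ml-grid to be installed.

```python
def generate_small_example():
    """Call the convenience function with small sizes; requires ml-grid."""
    # Lazy import: only needed when the function is actually called.
    from ml_grid.util.synthetic_data_generator import generate_synthetic_data

    return generate_synthetic_data(
        n_rows=200,
        n_features=20,
        n_outcome_vars=2,
        feature_strength=0.8,
        verbose=False,
    )


def split_columns(
    all_columns: list[str],
    important_features: dict[str, list[str]],
    outcome: str,
) -> tuple[list[str], list[str]]:
    """Hypothetical helper: split feature columns for one outcome.

    Returns (important, other), preserving the order of all_columns.
    Assumption: the dictionary keys are the outcome column names.
    """
    wanted = set(important_features[outcome])
    outcomes = set(important_features)
    important = [c for c in all_columns if c in wanted]
    other = [c for c in all_columns if c not in wanted and c not in outcomes]
    return important, other
```

With a real result, `split_columns(list(df.columns), important_features, outcome)` separates the columns flagged as predictive for one outcome from the remaining feature columns, which is handy when checking that a feature-selection step recovers the planted signal.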

ml_grid.util.synthetic_data_generator.missing_pickle_filename = 'percent_missing_synthetic_data_generated.pkl'