ml_grid.util.synthetic_data_generator
Module for generating synthetic datasets that mimic the structure of real-world data used in the ml-grid pipeline.
Attributes
Classes
SyntheticDataGenerator: Initializes the SyntheticDataGenerator with specified parameters.
Functions
generate_synthetic_data: A convenience function to generate a synthetic dataset.
Module Contents
- class ml_grid.util.synthetic_data_generator.SyntheticDataGenerator(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True)[source]
Initializes the SyntheticDataGenerator with specified parameters.
- Parameters:
n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
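percent_important_features (float) – Percentage of features that should be predictive of the outcome, given as a fraction (the default 0.1 means 10%).
percent_binary_features (float) – Percentage of features to be binary, given as a fraction (the default 0.15 means 15%).
percent_int_features (float) – Percentage of features to be integer-based, given as a fraction (the default 0.2 means 20%).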
verbose (bool) – If True, prints generation status messages.
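As a rough illustration of the constructor parameters above, the sketch below checks a parameter set before building the generator. It assumes that ml_grid is installed, that "between 0 and 1" is inclusive, that the percent_* arguments are fractions in the same range, and that `generate()` takes no arguments and returns the same pair as `generate_synthetic_data`. The helpers `validate_params` and `build_and_generate` are hypothetical names introduced here, not part of ml_grid.

```python
from __future__ import annotations


def validate_params(feature_strength: float, **fractions: float) -> None:
    """Raise ValueError if any value lies outside [0, 1] (assumed inclusive)."""
    if not 0.0 <= feature_strength <= 1.0:
        raise ValueError(f"feature_strength must be between 0 and 1, got {feature_strength}")
    for name, value in fractions.items():
        # Assumption: the percent_* parameters are fractions, as their defaults suggest.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")


def build_and_generate():
    """Construct the generator with documented keyword arguments and run it."""
    # Imported lazily so validate_params can be used without ml_grid installed.
    from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator

    params = {
        "n_rows": 500,
        "n_features": 40,
        "n_outcome_vars": 2,
        "feature_strength": 0.8,
        "percent_important_features": 0.1,
        "percent_binary_features": 0.15,
        "percent_int_features": 0.2,
        "verbose": False,
    }
    validate_params(
        params["feature_strength"],
        percent_important_features=params["percent_important_features"],
        percent_binary_features=params["percent_binary_features"],
        percent_int_features=params["percent_int_features"],
    )
    generator = SyntheticDataGenerator(**params)
    # Assumption: generate() needs no arguments; its signature is not documented here.
    return generator.generate()
```

Checking the values up front is a design choice of this sketch; the class may already perform its own validation.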
- ml_grid.util.synthetic_data_generator.generate_synthetic_data(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]][source]
A convenience function to generate a synthetic dataset.
This function instantiates the SyntheticDataGenerator, calls its generate method, and returns the resulting DataFrame together with a dictionary of the important features for each outcome variable.
- Parameters:
n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
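percent_important_features (float) – Percentage of features that should be predictive of the outcome, given as a fraction (the default 0.1 means 10%).
percent_binary_features (float) – Percentage of features to be binary, given as a fraction (the default 0.15 means 15%).
percent_int_features (float) – Percentage of features to be integer-based, given as a fraction (the default 0.2 means 20%).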
verbose (bool) – If True, enables logging of generation status.
- Returns:
A tuple containing the generated synthetic dataset and a dictionary mapping each outcome variable to its list of important features.
- Return type:
tuple[pandas.DataFrame, dict[str, list[str]]]
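A minimal usage sketch of the convenience function, assuming ml_grid is installed and that the returned dictionary is keyed by outcome variable as described above. The helper `important_feature_counts` and the wrapper `demo` are hypothetical names introduced for illustration; they are not part of ml_grid.

```python
from __future__ import annotations


def important_feature_counts(important_features: dict[str, list[str]]) -> dict[str, int]:
    """Count the important features recorded for each outcome variable."""
    return {outcome: len(features) for outcome, features in important_features.items()}


def demo() -> None:
    """Generate a small dataset and summarise what came back."""
    # Imported lazily so the helper above can be used without ml_grid installed.
    from ml_grid.util.synthetic_data_generator import generate_synthetic_data

    # Keyword arguments mirror the documented signature; sizes are reduced for a quick run.
    df, important_features = generate_synthetic_data(
        n_rows=200,
        n_features=20,
        n_outcome_vars=2,
        feature_strength=0.8,
        percent_important_features=0.1,
        verbose=False,
    )
    print(df.shape)  # the exact column count depends on how outcome columns are stored
    print(important_feature_counts(important_features))
```

Calling `demo()` prints the shape of the generated DataFrame and the number of important features per outcome, which gives a quick check that the feature mapping matches the requested settings.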