ml_grid.util.synthetic_data_generator
Module for generating synthetic datasets that mimic the structure of real-world data used in the ml-grid pipeline.
Attributes
Classes
SyntheticDataGenerator | Initializes the SyntheticDataGenerator with specified parameters.
SyntheticTSDataGenerator | Initializes the SyntheticTSDataGenerator.
Functions
generate_synthetic_ts_data | A convenience function to generate a synthetic longitudinal dataset.
generate_synthetic_data | A convenience function to generate a synthetic dataset.
Module Contents
- class ml_grid.util.synthetic_data_generator.SyntheticDataGenerator(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True)
Initializes the SyntheticDataGenerator with specified parameters.
- Parameters:
n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
percent_important_features (float) – Fraction of features that should be predictive of the outcome (e.g. 0.1 for 10%).
percent_binary_features (float) – Percentage of features to be binary.
percent_int_features (float) – Percentage of features to be integer-based.
verbose (bool) – If True, prints generation status messages.
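As a rough illustration of how the fractional parameters might translate into column counts, the sketch below multiplies each fraction by n_features and truncates. The helper name plan_feature_counts, the truncation rule, and the treatment of the remaining columns as continuous are assumptions made for illustration, not confirmed library behaviour. Only the range check on feature_strength comes from the parameter description above.

```python
def plan_feature_counts(
    n_features: int = 150,
    feature_strength: float = 0.8,
    percent_important_features: float = 0.1,
    percent_binary_features: float = 0.15,
    percent_int_features: float = 0.2,
) -> dict[str, int]:
    """Hypothetical helper: estimate how many columns of each kind to expect."""
    # Documented constraint: feature_strength must be between 0 and 1.
    if not 0.0 <= feature_strength <= 1.0:
        raise ValueError("feature_strength must be between 0 and 1")
    # Assumption: each fraction is applied to n_features and truncated.
    n_binary = int(n_features * percent_binary_features)
    n_int = int(n_features * percent_int_features)
    return {
        "important": int(n_features * percent_important_features),
        "binary": n_binary,
        "integer": n_int,
        # Assumption: whatever is left over is continuous (float) data.
        "continuous": n_features - n_binary - n_int,
    }


# Counts implied by the documented defaults, under the assumptions above.
counts = plan_feature_counts()
```

Under these assumptions the defaults would give 15 important, 22 binary, and 30 integer columns out of 150. The real generator may round differently.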
- class ml_grid.util.synthetic_data_generator.SyntheticTSDataGenerator(n_instances: int = 200, n_timepoints: int = 50, n_features: int = 100, n_outcome_vars: int = 1, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_missing: float = 0.1, start_date: str = '2022-01-01', verbose: bool = True)
Initializes the SyntheticTSDataGenerator.
- Parameters:
n_instances (int) – Number of unique patients.
n_timepoints (int) – Number of daily timestamped rows per patient.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of binary outcome columns to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
percent_important_features (float) – Fraction of features that should be predictive of each outcome.
percent_missing (float) – Approximate percentage of feature values to set to NaN.
start_date (str) – ISO date string for the first timestamp (e.g. "2022-01-01").
verbose (bool) – If True, enables logging of generation status.
- generate() → tuple[pandas.DataFrame, dict[str, list[str]]]
Generates and returns the synthetic longitudinal DataFrame.
The output is a long-format 2D DataFrame with one row per (client_idcode, timestamp) pair, matching the structure of the real ml-grid time-series data exactly. Each patient has exactly n_timepoints consecutive daily rows. Outcome labels are generated per-row using the same signal/noise + median-threshold approach as SyntheticDataGenerator.
Column order:
client_idcode | timestamp | <features> | <outcome_vars>
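The signal/noise + median-threshold labelling mentioned above can be sketched in plain Python. The exact formula used by the library is not given here, so this assumes a score of feature_strength × signal + (1 − feature_strength) × noise, where the signal is the mean of the important features. The function name median_threshold_labels and the toy data are hypothetical.

```python
import random
import statistics


def median_threshold_labels(rows, important_idx, feature_strength=0.8, seed=0):
    """Hypothetical sketch: binary labels from a signal/noise mix cut at the median."""
    rng = random.Random(seed)
    scores = []
    for row in rows:
        # Signal: mean of the features designated as important (assumption).
        signal = sum(row[i] for i in important_idx) / len(important_idx)
        # Noise: independent standard-normal draw (assumption).
        noise = rng.gauss(0.0, 1.0)
        scores.append(feature_strength * signal + (1.0 - feature_strength) * noise)
    # Thresholding at the median yields roughly balanced classes.
    cutoff = statistics.median(scores)
    return [1 if score > cutoff else 0 for score in scores]


# Toy data: 200 rows of 10 standard-normal features.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]
labels = median_threshold_labels(rows, important_idx=[0, 3, 7])
```

A higher feature_strength makes the labels depend more on the important columns and less on noise, which is the role the parameter description gives it.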
- ml_grid.util.synthetic_data_generator.generate_synthetic_ts_data(n_instances: int = 200, n_timepoints: int = 50, n_features: int = 100, n_outcome_vars: int = 1, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_missing: float = 0.1, start_date: str = '2022-01-01', verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]
A convenience function to generate a synthetic longitudinal dataset.
The returned DataFrame has one row per (client_idcode, timestamp) pair, matching the structure of real ml-grid time-series data exactly.
- Parameters:
n_instances (int) – Number of unique patients.
n_timepoints (int) – Number of daily timestamped rows per patient.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of binary outcome columns to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
percent_important_features (float) – Fraction of features that should be predictive of each outcome.
percent_missing (float) – Approximate percentage of feature values to set to NaN.
start_date (str) – ISO date string for the first timestamp.
verbose (bool) – If True, enables logging of generation status.
- Returns:
The generated longitudinal dataset.
A dictionary mapping each outcome variable to its important features.
- Return type:
tuple[pandas.DataFrame, dict[str, list[str]]]
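To make the long-format row layout concrete, the sketch below builds only the (client_idcode, timestamp) index using the standard library. It assumes integer patient identifiers starting at 0 and that every patient's series begins on start_date. Both are illustrative assumptions, and long_format_index is a hypothetical name, not part of ml-grid.

```python
from datetime import date, timedelta


def long_format_index(n_instances=200, n_timepoints=50, start_date="2022-01-01"):
    """Hypothetical sketch: one (client_idcode, timestamp) pair per output row."""
    start = date.fromisoformat(start_date)
    return [
        # Assumption: patients are numbered 0..n_instances-1 and all start
        # on the same day, with one row per consecutive calendar day.
        (patient, start + timedelta(days=day))
        for patient in range(n_instances)
        for day in range(n_timepoints)
    ]


# Three patients with four daily rows each gives 3 * 4 = 12 rows.
index = long_format_index(n_instances=3, n_timepoints=4)
```

With the documented defaults (200 patients, 50 timepoints) this layout implies 10,000 rows in the returned DataFrame.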
- ml_grid.util.synthetic_data_generator.generate_synthetic_data(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]
A convenience function to generate a synthetic dataset.
This function instantiates the SyntheticDataGenerator, calls its generate method, and returns the resulting DataFrame together with the dictionary of important features.
- Parameters:
n_rows (int) – Number of rows for the synthetic dataset.
n_features (int) – Number of feature columns to generate.
n_outcome_vars (int) – Number of outcome variables to generate.
feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.
percent_important_features (float) – Fraction of features that should be predictive of the outcome (e.g. 0.1 for 10%).
percent_binary_features (float) – Percentage of features to be binary.
percent_int_features (float) – Percentage of features to be integer-based.
verbose (bool) – If True, enables logging of generation status.
- Returns:
The generated synthetic dataset.
A dictionary mapping each outcome variable to its list of important features.
- Return type:
tuple[pandas.DataFrame, dict[str, list[str]]]
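One simple sanity check on the returned pair is to confirm that every feature listed as important for an outcome is actually a column of the dataset. The helper below is a minimal sketch of that check. Its name and the column and outcome names in the example are hypothetical and are not taken from the library.

```python
def missing_important_features(columns, important_features):
    """Return, per outcome, any listed feature that is not among the columns."""
    known = set(columns)
    return {
        outcome: [name for name in features if name not in known]
        for outcome, features in important_features.items()
    }


# Hypothetical column and outcome names, used only for illustration.
columns = ["feature_0", "feature_1", "feature_2", "outcome_var_1"]
important = {"outcome_var_1": ["feature_0", "feature_2"]}
problems = missing_important_features(columns, important)
```

With ml-grid installed, the inputs would come from a call such as `df, important = generate_synthetic_data(n_rows=100, verbose=False)`, passing `list(df.columns)` as the first argument. An empty list for every outcome means the mapping is consistent with the DataFrame.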