ml_grid.util.synthetic_data_generator

Module for generating synthetic datasets that mimic the structure of real-world data used in the ml-grid pipeline.

Attributes

missing_pickle_filename

Classes

SyntheticDataGenerator

Initializes the SyntheticDataGenerator with specified parameters.

SyntheticTSDataGenerator

Initializes the SyntheticTSDataGenerator.

Functions

generate_synthetic_ts_data(→ tuple[pandas.DataFrame, ...)

A convenience function to generate a synthetic longitudinal dataset.

generate_synthetic_data(→ tuple[pandas.DataFrame, ...)

A convenience function to generate a synthetic dataset.

Module Contents

class ml_grid.util.synthetic_data_generator.SyntheticDataGenerator(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True)

Initializes the SyntheticDataGenerator with specified parameters.

Parameters:
  • n_rows (int) – Number of rows for the synthetic dataset.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of outcome variables to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Fraction of features (0 to 1) that should be predictive of the outcome.

  • percent_binary_features (float) – Fraction of features (0 to 1) to be binary.

  • percent_int_features (float) – Fraction of features (0 to 1) to be integer-based.

  • verbose (bool) – If True, enables logging of generation status.
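The percent_* defaults (0.1, 0.15, 0.2) read as fractions of n_features rather than percentages. The sketch below shows how such fractions could translate into column counts. It is only an illustration: the helper name feature_counts, the truncation with int(), and the idea that the remaining columns are continuous are all assumptions, not documented behaviour of the class.

```python
def feature_counts(n_features=150, percent_important_features=0.1,
                   percent_binary_features=0.15, percent_int_features=0.2):
    """Convert fraction-style parameters into whole column counts.

    Assumption: counts are truncated with int(); the real generator
    may round differently.
    """
    n_binary = int(n_features * percent_binary_features)
    n_int = int(n_features * percent_int_features)
    return {
        "important": int(n_features * percent_important_features),
        "binary": n_binary,
        "int": n_int,
        # Assumption: whatever is neither binary nor integer is continuous.
        "continuous": n_features - n_binary - n_int,
    }


counts = feature_counts()
```

With the documented defaults, this gives a quick sanity check on how many columns of each kind to expect before generating a large dataset.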

n_rows = 1000
n_features = 150
n_outcome_vars = 3
feature_strength = 0.8
percent_important_features = 0.1
percent_binary_features = 0.15
percent_int_features = 0.2
logger
generate() → tuple[pandas.DataFrame, dict[str, list[str]]]

Generates and returns the synthetic DataFrame and a map of important features.

Returns:

  • The fully generated synthetic dataset.

  • A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
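The SyntheticTSDataGenerator documentation describes the labelling used by this class as a signal/noise plus median-threshold approach. The exact formula is not given here, so the following is a minimal sketch under stated assumptions: the signal is the mean of the important features, the noise is standard Gaussian, and feature_strength is used as a linear mixing weight. The function name make_binary_outcome is hypothetical.

```python
import random
import statistics


def make_binary_outcome(rows, important_idx, feature_strength=0.8, seed=0):
    """Label each row 1 or 0 by thresholding a noisy score at its median."""
    rng = random.Random(seed)
    scores = []
    for row in rows:
        # Assumed signal: the mean of the important feature values.
        signal = sum(row[i] for i in important_idx) / len(important_idx)
        noise = rng.gauss(0.0, 1.0)
        # Assumed mixing rule: feature_strength weights signal against noise.
        scores.append(feature_strength * signal + (1.0 - feature_strength) * noise)
    threshold = statistics.median(scores)
    # Thresholding at the median yields roughly balanced classes.
    return [1 if score > threshold else 0 for score in scores]


data_rng = random.Random(42)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]
labels = make_binary_outcome(rows, important_idx=[0, 3, 7])
```

Under this scheme, a feature_strength near 1 makes the outcome almost a deterministic function of the important features, while a value near 0 makes it mostly noise.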

class ml_grid.util.synthetic_data_generator.SyntheticTSDataGenerator(n_instances: int = 200, n_timepoints: int = 50, n_features: int = 100, n_outcome_vars: int = 1, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_missing: float = 0.1, start_date: str = '2022-01-01', verbose: bool = True)

Initializes the SyntheticTSDataGenerator.

Parameters:
  • n_instances (int) – Number of unique patients.

  • n_timepoints (int) – Number of daily timestamped rows per patient.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of binary outcome columns to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Fraction of features that should be predictive of each outcome.

  • percent_missing (float) – Approximate fraction (0 to 1) of feature values to set to NaN.

  • start_date (str) – ISO date string for the first timestamp (e.g. "2022-01-01").

  • verbose (bool) – If True, enables logging of generation status.

n_instances = 200
n_timepoints = 50
n_features = 100
n_outcome_vars = 1
feature_strength = 0.8
percent_important_features = 0.1
percent_missing = 0.1
start_date = '2022-01-01'
logger
generate() → tuple[pandas.DataFrame, dict[str, list[str]]]

Generates and returns the synthetic longitudinal DataFrame.

The output is a long-format DataFrame with one row per (client_idcode, timestamp) pair, matching the structure of the real ml-grid time-series data exactly. Each patient has exactly n_timepoints consecutive daily rows, so the DataFrame has n_instances × n_timepoints rows in total. Outcome labels are generated per row using the same signal/noise + median-threshold approach as SyntheticDataGenerator.

Column order: client_idcode | timestamp | <features> | <outcome_vars>

Returns:

  • The fully generated longitudinal dataset.

  • A dictionary mapping each outcome variable name to its list of important feature names used to construct it.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
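To make the long-format layout concrete, the sketch below builds only the (client_idcode, timestamp) skeleton with the standard library. It assumes every patient starts on start_date, and the ID format "P00000" is purely hypothetical; the real generator's identifiers are not documented here.

```python
from datetime import date, timedelta


def long_format_index(n_instances=200, n_timepoints=50, start_date="2022-01-01"):
    """Return one (client_idcode, timestamp) pair per patient per day."""
    first_day = date.fromisoformat(start_date)
    pairs = []
    for patient in range(n_instances):
        client_idcode = f"P{patient:05d}"  # hypothetical ID format
        for offset in range(n_timepoints):
            # Consecutive daily timestamps; assumes a shared start date.
            pairs.append((client_idcode, first_day + timedelta(days=offset)))
    return pairs


index = long_format_index(n_instances=3, n_timepoints=4)
```

The resulting list has n_instances × n_timepoints entries, which is the row count to expect from the generated DataFrame.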

ml_grid.util.synthetic_data_generator.generate_synthetic_ts_data(n_instances: int = 200, n_timepoints: int = 50, n_features: int = 100, n_outcome_vars: int = 1, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_missing: float = 0.1, start_date: str = '2022-01-01', verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]

A convenience function to generate a synthetic longitudinal dataset.

The returned DataFrame has one row per (client_idcode, timestamp) pair, matching the structure of real ml-grid time-series data exactly.

Parameters:
  • n_instances (int) – Number of unique patients.

  • n_timepoints (int) – Number of daily timestamped rows per patient.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of binary outcome columns to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Fraction of features that should be predictive of each outcome.

  • percent_missing (float) – Approximate fraction (0 to 1) of feature values to set to NaN.

  • start_date (str) – ISO date string for the first timestamp.

  • verbose (bool) – If True, enables logging of generation status.

Returns:

  • The generated longitudinal dataset.

  • A dictionary mapping each outcome variable to its important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
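The percent_missing parameter is described as approximate, which is what one would expect if each value is masked independently at random. The sketch below illustrates that idea with the standard library; the per-value Bernoulli masking and the helper name mask_missing are assumptions, not a description of the library's internals.

```python
import math
import random


def mask_missing(values, percent_missing=0.1, seed=0):
    """Replace each value with NaN independently with probability percent_missing."""
    rng = random.Random(seed)
    return [math.nan if rng.random() < percent_missing else value for value in values]


masked = mask_missing([float(i) for i in range(10_000)])
# The observed fraction is close to, but not exactly, percent_missing.
observed_fraction = sum(math.isnan(value) for value in masked) / len(masked)
```

Because the masking is random, the realised missing rate varies slightly from run to run unless a seed is fixed.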

ml_grid.util.synthetic_data_generator.generate_synthetic_data(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]

A convenience function to generate a synthetic dataset.

This function instantiates the SyntheticDataGenerator, calls its generate method, and returns the resulting DataFrame together with the map of important features.

Parameters:
  • n_rows (int) – Number of rows for the synthetic dataset.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of outcome variables to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Fraction of features (0 to 1) that should be predictive of the outcome.

  • percent_binary_features (float) – Fraction of features (0 to 1) to be binary.

  • percent_int_features (float) – Fraction of features (0 to 1) to be integer-based.

  • verbose (bool) – If True, enables logging of generation status.

Returns:

  • The generated synthetic dataset.

  • A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
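A typical use of the returned tuple is to confirm that every name in the important-features map actually appears in the DataFrame. The sketch below is one way to do that. It assumes the dictionary keys are outcome column names and the values are feature column names, which the documentation implies but does not state outright. The helpers check_important_features and generate_and_check are hypothetical, and the import is done lazily so the block loads without ml-grid installed.

```python
def check_important_features(columns, important_features):
    """Return names from the map that are not among the given column names."""
    known = set(columns)
    missing = []
    for outcome, features in important_features.items():
        # Assumption: both the outcome key and its features are column names.
        for name in [outcome, *features]:
            if name not in known:
                missing.append(name)
    return missing


def generate_and_check(**kwargs):
    """Generate a dataset and report any map entries missing from its columns."""
    # Lazy import: requires ml-grid to be installed only when this is called.
    from ml_grid.util.synthetic_data_generator import generate_synthetic_data

    df, important_features = generate_synthetic_data(**kwargs)
    return df, important_features, check_important_features(df.columns, important_features)
```

With ml-grid installed, a call such as generate_and_check(n_rows=500, verbose=False) should return an empty list as its third element if the assumption about the map holds.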

ml_grid.util.synthetic_data_generator.missing_pickle_filename = 'percent_missing_synthetic_data_generated.pkl'