ml_grid.util.synthetic_data_generator

Module for generating synthetic datasets that mimic the structure of real-world data used in the ml-grid pipeline.

Attributes

missing_pickle_filename

Classes

SyntheticDataGenerator

Initializes the SyntheticDataGenerator with specified parameters.

Functions

generate_synthetic_data(→ tuple[pandas.DataFrame, ...)

A convenience function to generate a synthetic dataset.

Module Contents

class ml_grid.util.synthetic_data_generator.SyntheticDataGenerator(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True)

Initializes the SyntheticDataGenerator with specified parameters.

Parameters:
  • n_rows (int) – Number of rows for the synthetic dataset.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of outcome variables to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Fraction of features (between 0 and 1, e.g. 0.1 for 10%) that should be predictive of the outcome.

  • percent_binary_features (float) – Fraction of features (between 0 and 1) to be binary.

  • percent_int_features (float) – Fraction of features (between 0 and 1) to be integer-based.

  • verbose (bool) – If True, prints generation status messages.
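The default values (0.1, 0.15, 0.2) suggest that the three percent_* parameters are fractions of n_features rather than values on a 0 to 100 scale. The sketch below works through the arithmetic for the defaults. The helper name implied_feature_counts is hypothetical, and the rounding rule (truncation) is an assumption; the documentation does not say how the generator rounds or whether the feature groups may overlap.

```python
def implied_feature_counts(
    n_features: int = 150,
    percent_important_features: float = 0.1,
    percent_binary_features: float = 0.15,
    percent_int_features: float = 0.2,
) -> dict[str, int]:
    """Approximate column counts implied by the fractional parameters.

    Assumption (not confirmed by the documentation): each fraction is
    multiplied by n_features and truncated to a whole number.
    """
    return {
        "important": int(n_features * percent_important_features),
        "binary": int(n_features * percent_binary_features),
        "integer": int(n_features * percent_int_features),
    }


# Counts implied by the documented defaults, under the truncation assumption.
counts = implied_feature_counts()
print(counts)
```

Treat these numbers as a rough guide to the shape of the output, not as a guarantee of what the generator produces.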

n_rows = 1000
n_features = 150
n_outcome_vars = 3
feature_strength = 0.8
percent_important_features = 0.1
percent_binary_features = 0.15
percent_int_features = 0.2
logger
generate() → tuple[pandas.DataFrame, dict[str, list[str]]]

Generates and returns the synthetic DataFrame and a map of important features.

Returns:

  • The fully generated synthetic dataset.

  • A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
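A minimal usage sketch for the class, based only on the signature documented above. The argument values are arbitrary, and summarize_feature_map is a hypothetical helper added for illustration. The import is guarded so the snippet still loads where ml_grid is not installed.

```python
def summarize_feature_map(important_features):
    """Return {outcome_name: number_of_important_features}."""
    return {outcome: len(cols) for outcome, cols in important_features.items()}


def main():
    try:
        from ml_grid.util.synthetic_data_generator import SyntheticDataGenerator
    except ImportError:
        print("ml_grid is not installed; skipping the live example.")
        return None

    # Small, quiet dataset; these values are illustrative, not recommended settings.
    generator = SyntheticDataGenerator(
        n_rows=200,
        n_features=20,
        n_outcome_vars=2,
        verbose=False,
    )

    # generate() returns a (DataFrame, dict) tuple, so unpack both parts.
    df, important_features = generator.generate()
    print(df.shape)
    print(summarize_feature_map(important_features))
    return df, important_features


if __name__ == "__main__":
    main()
```

Keeping the second element of the tuple is what makes the synthetic data useful for testing: it records which features were built to carry signal for each outcome.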

ml_grid.util.synthetic_data_generator.generate_synthetic_data(n_rows: int = 1000, n_features: int = 150, n_outcome_vars: int = 3, feature_strength: float = 0.8, percent_important_features: float = 0.1, percent_binary_features: float = 0.15, percent_int_features: float = 0.2, verbose: bool = True) → tuple[pandas.DataFrame, dict[str, list[str]]]

A convenience function to generate a synthetic dataset.

This function instantiates the SyntheticDataGenerator, calls its generate method, and returns the resulting DataFrame together with the map of important features.

Parameters:
  • n_rows (int) – Number of rows for the synthetic dataset.

  • n_features (int) – Number of feature columns to generate.

  • n_outcome_vars (int) – Number of outcome variables to generate.

  • feature_strength (float) – Strength of the signal from important features. Must be between 0 and 1.

  • percent_important_features (float) – Percentage of features that should be predictive of the outcome.

  • percent_binary_features (float) – Percentage of features to be binary.

  • percent_int_features (float) – Percentage of features to be integer-based.

  • verbose (bool) – If True, enables logging of generation status.

Returns:

  • The generated synthetic dataset.

  • A dictionary mapping each outcome variable to its list of important features.

Return type:

tuple[pd.DataFrame, dict[str, list[str]]]
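The convenience function can be called in a single step, as in the sketch below. The helper split_columns is hypothetical; it assumes the keys of the returned dictionary are the outcome column names in the DataFrame, which the documentation implies but does not state outright. The import is guarded so the snippet still loads without ml_grid.

```python
def split_columns(all_columns, important_features, outcome):
    """Partition columns into signal and noise for one outcome.

    Assumption: the keys of important_features are outcome column names.
    Any other non-feature columns the generator adds would land in noise.
    """
    outcomes = set(important_features)
    wanted = set(important_features[outcome])
    signal = [c for c in all_columns if c in wanted]
    noise = [c for c in all_columns if c not in wanted and c not in outcomes]
    return signal, noise


def main():
    try:
        from ml_grid.util.synthetic_data_generator import generate_synthetic_data
    except ImportError:
        print("ml_grid is not installed; skipping the live example.")
        return None

    # Illustrative values only; every argument has a documented default.
    df, feature_map = generate_synthetic_data(
        n_rows=500,
        n_features=40,
        n_outcome_vars=1,
        verbose=False,
    )

    first_outcome = next(iter(feature_map))
    signal, noise = split_columns(list(df.columns), feature_map, first_outcome)
    print(first_outcome, len(signal), len(noise))
    return df, feature_map


if __name__ == "__main__":
    main()
```

A split like this is one way to check whether a feature-selection step recovers the columns that were generated with signal.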

ml_grid.util.synthetic_data_generator.missing_pickle_filename = 'percent_missing_synthetic_data_generated.pkl'