pat2vec.util.impute_data_for_pipe

Functions

`mean_impute_dataframe`(data, y_vars[, ...])	Splits data, imputes missing numeric values with the mean, and recombines.
`save_missing_percentage`(df[, output_file])	Calculates and saves the percentage of missing values for each column.

pat2vec.util.impute_data_for_pipe.mean_impute_dataframe(data, y_vars, test_size=0.25, val_size=0.25, random_state=1, seed=1234)

Splits data, imputes missing numeric values with the mean, and recombines.

This function performs a standard machine learning preprocessing step. It first splits the input DataFrame into training, validation, and test sets. It then calculates the mean for each numeric feature based on the training set and uses this mean to impute missing values in all three sets. Finally, it recombines the imputed sets into a single DataFrame.

Parameters:

data (DataFrame) – The input DataFrame containing features and target variables.
y_vars (Union[str, List[str]]) – The name of the target variable column(s).
test_size (float) – The proportion of the dataset to allocate to the test split.
val_size (float) – The proportion of the training dataset to allocate to the validation split.
random_state (int) – Seed for the train-test split, so that the split is reproducible.
seed (int) – Seed for Python’s random module.

Return type:

DataFrame

Returns:

The full DataFrame with missing numeric values imputed.
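
The split, impute, and recombine steps described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the helper name `sketch_mean_impute` is hypothetical, the split uses `DataFrame.sample` rather than whatever splitter pat2vec uses internally, the `seed` parameter is omitted, and the index of `data` is assumed to be unique.

```python
import pandas as pd


def sketch_mean_impute(data, y_vars, test_size=0.25, val_size=0.25, random_state=1):
    """Hypothetical helper: split, fill numeric NaNs with training means, recombine."""
    y_cols = [y_vars] if isinstance(y_vars, str) else list(y_vars)

    # Hold out the test split, then carve the validation split out of the remainder.
    # Assumes a unique index so that drop(index=...) removes exactly the sampled rows.
    test = data.sample(frac=test_size, random_state=random_state)
    rest = data.drop(index=test.index)
    val = rest.sample(frac=val_size, random_state=random_state)
    train = rest.drop(index=val.index)

    # Means are computed from the training rows only, over numeric feature columns.
    numeric_cols = train.drop(columns=y_cols).select_dtypes(include="number").columns
    train_means = train[numeric_cols].mean()

    # Fill every split with the same training means, then restore the original row order.
    parts = [part.fillna(train_means) for part in (train, val, test)]
    return pd.concat(parts).loc[data.index]
```

Computing the means on the training rows alone keeps information from the validation and test rows out of the imputed values. In this sketch, target columns and non-numeric columns are left untouched.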

pat2vec.util.impute_data_for_pipe.save_missing_percentage(df, output_file='percent_missing.pkl')

Calculates and saves the percentage of missing values for each column.

Parameters:

df (DataFrame) – The input DataFrame to analyze.
output_file (str) – The path to save the resulting dictionary as a pickle file.

Return type:

Dict[str, float]

Returns:

A dictionary where keys are column names and values are the percentage of missing values.
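
A minimal sketch of the same idea, using only pandas and the standard library. The helper name `sketch_missing_percentage` is hypothetical, and the 0 to 100 scale is an assumption; the documentation above does not say whether the library reports percentages or fractions.

```python
import pickle

import pandas as pd


def sketch_missing_percentage(df, output_file="percent_missing.pkl"):
    """Hypothetical helper: per-column missing percentage, saved as a pickled dict."""
    # isna().mean() gives the fraction of missing entries per column;
    # multiplying by 100 converts it to a percentage (assumed scale).
    percent_missing = (df.isna().mean() * 100).to_dict()

    # Persist the dictionary so later steps can reload it.
    with open(output_file, "wb") as fh:
        pickle.dump(percent_missing, fh)

    return percent_missing
```

The saved file can be read back with `pickle.load` on a file opened in binary mode, which returns the same dictionary of column names to percentages.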