pat2vec.util.post_processing

Functions

`aggregate_dataframe_mean`(df[, group_by_column])	Aggregates a DataFrame by a grouping column.
`check_list_presence`(df, column, lst[, ...])	Checks if any string in a list is present in a specified DataFrame column, optionally after applying annotation filters.
`collapse_df_to_mean`(df[, output_filename, ...])	Collapses a DataFrame to calculate mean values for numeric columns and retains the first non-numeric value for non-numeric columns for each unique client_idcode.
`convert_true_to_float`(df[, columns])	Converts 'True' strings to 1.0 and ensures columns are float type.
`copy_files_and_dirs`(source_root, ...[, ...])	Copies specified directories and files from a source project location to a new destination.
`count_files`(path)	Recursively counts the number of files in a directory.
`drop_columns_with_all_nan`(df)	Drops columns from a DataFrame where all values are NaN or None.
`extract_datetime_from_binary_columns`(df)	Extracts datetime values from binary columns representing dates in a DataFrame.
`extract_datetime_from_binary_columns_chunk_reader`(...)	Extracts datetime values from binary columns representing dates from a CSV file.
`extract_datetime_to_column`(df[, drop])	Extracts datetime information from specified columns and creates a new column.
`extract_types_from_csv`(directory)	Extracts all unique 'types' from CSV files within a given directory and its subdirectories.
`filter_and_select_rows`(dataframe, filter_list)	Filter a dataframe based on a filter_column and filter_list, and return either the earliest or latest rows.
`filter_and_update_csv`(target_directory, ...)	Filters and updates CSV files in a target directory based on patient IPW records.
`filter_annot_dataframe2`(dataframe, filter_args)	Filter a DataFrame based on specified filter arguments.
`filter_dataframe_by_cui`(dataframe, filter_list)	Filter an annotation DataFrame based on a list of CUI codes and a specified mode.
`filter_dataframe_n_lists`(df, column_name, ...)	Filters a DataFrame to include rows where the value in a specified column is present in all of the provided lists.
`get_all_target_annots`(all_pat_list, n_lists)	Retrieves and filters target annotations for a list of patients.
`impute_dataframe`(df[, verbose, ...])	Imputes missing numeric values in a DataFrame based on patient ID and temporal order.
`impute_datetime`(df[, datetime_column, ...])	Imputes missing datetime values in a DataFrame based on patient ID and temporal order.
`join_icd10_OPC4S_codes_to_annot`(df[, inner])	Joins ICD-10 and OPCS-4 codes to an annotation DataFrame.
`join_icd10_codes_to_annot`(df[, inner])	Joins ICD-10 codes to an annotation DataFrame.
`missing_percentage_df`(dataframe)	Calculate the percentage of missing values in each column of a DataFrame.
`plot_missing_pattern_bloods`(dfb)	Plots the number of missing client_idcodes for the top 50 most frequent 'basicobs_itemname_analysed' values in the given dataframe dfb.
`process_chunk`(args)	Processes a chunk of CSV files, concatenating their data into a dictionary.
`produce_filtered_annotation_dataframe`([...])	Filter annotation dataframe based on specified criteria.
`remove_file_from_paths`(current_pat_idcode[, ...])	Removes patient-specific CSV files from various predefined project paths.
`retrieve_pat_annots_mct_epr`(client_idcode, ...)	Retrieves and merges annotation data for a single patient from multiple sources.
`save_missing_values_pickle`(df, out_file_path)	Calculates the percentage of missing values for each column in a DataFrame and saves the result as a pickle file.

pat2vec.util.post_processing.count_files(path)

Recursively counts the number of files in a directory.

Parameters: path (str) – The path to the directory.
Returns: The total number of files in the directory and its subdirectories.
Return type: int
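The recursive count described above can be approximated with the standard library. This is a minimal sketch, not the pat2vec implementation: the name count_files_sketch and the temporary directory layout are invented for illustration.

```python
import os
import tempfile


def count_files_sketch(path: str) -> int:
    """Count files under `path`, descending into subdirectories."""
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


# Throwaway tree: two files at the top level, one in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.csv", "b.csv", os.path.join("sub", "c.csv")):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("x\n")
    n_files = count_files_sketch(tmp)
```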

pat2vec.util.post_processing.extract_datetime_to_column(df, drop=True)

Extracts datetime information from specified columns and creates a new column.

This function scans for columns with names matching the pattern ‘(YYYY, MM, DD)_date_time_stamp’ and a value of 1. For each such occurrence, it parses the date from the column name and populates a new ‘extracted_datetime_stamp’ column.

Parameters:

df (DataFrame) – The DataFrame containing date-as-column features.
drop (bool) – If True, the original date_time_stamp columns are dropped.

Return type:

DataFrame

Returns:

The DataFrame with a new ‘extracted_datetime_stamp’ column.
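To illustrate the date-as-column layout described above, here is a minimal pandas sketch. It is an assumption about the behaviour, not the library's code: the helper names, the SUFFIX constant, and the example columns are hypothetical.

```python
import re

import pandas as pd

SUFFIX = "_date_time_stamp"


def parse_date_column(name: str) -> pd.Timestamp:
    """Parse '(2023, 1, 15)_date_time_stamp' into Timestamp('2023-01-15')."""
    year, month, day = (int(p) for p in re.findall(r"\d+", name[: -len(SUFFIX)]))
    return pd.Timestamp(year=year, month=month, day=day)


def extract_datetime_sketch(df: pd.DataFrame, drop: bool = True) -> pd.DataFrame:
    date_cols = [c for c in df.columns if str(c).endswith(SUFFIX)]
    # Start with all-missing timestamps, then fill rows flagged with 1.
    stamps = pd.Series(pd.NaT, index=df.index, dtype="datetime64[ns]")
    for col in date_cols:
        stamps[df[col] == 1] = parse_date_column(col)
    out = df.drop(columns=date_cols) if drop else df.copy()
    out["extracted_datetime_stamp"] = stamps
    return out


features = pd.DataFrame(
    {
        "client_idcode": ["P1", "P2"],
        "(2023, 1, 15)_date_time_stamp": [1, 0],
        "(2023, 2, 1)_date_time_stamp": [0, 1],
    }
)
result = extract_datetime_sketch(features)
```

With drop=True the sketch leaves only client_idcode and the new extracted_datetime_stamp column.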

pat2vec.util.post_processing.filter_annot_dataframe2(dataframe, filter_args)

Filter a DataFrame based on specified filter arguments.

Parameters:

dataframe (DataFrame) – The DataFrame to filter.
filter_args (Dict[str, Any]) – A dictionary containing filter arguments. Keys are column names, and values are the filter criteria. Special handling for ‘types’, ‘Time_Value’, ‘Presence_Value’, ‘Subject_Value’, ‘Time_Confidence’, ‘Presence_Confidence’, ‘Subject_Confidence’, and ‘acc’.

Return type:

DataFrame

Returns:

The filtered DataFrame.

pat2vec.util.post_processing.produce_filtered_annotation_dataframe(cui_filter=False, meta_annot_filter=False, pat_list=None, config_obj=None, filter_custom_args=None, cui_code_list=None, mct=False)

Filter annotation dataframe based on specified criteria.

Parameters:

cui_filter (bool) – Whether to filter by CUI codes.
meta_annot_filter (bool) – Whether to apply meta annotation filtering.
pat_list (Optional[List[str]]) – List of patient identifiers. If None, uses config_obj.all_patient_list.
config_obj (Optional[Any]) – Configuration object containing necessary parameters.
filter_custom_args (Optional[Dict[str, Any]]) – Custom filter arguments. If None, uses config_obj.filter_arguments.
cui_code_list (Optional[List[int]]) – List of CUI codes for filtering.
mct (bool) – If True, processes MCT annotation batches; otherwise, processes EPR.

Returns:

Filtered annotation dataframe.

Return type:

pd.DataFrame

pat2vec.util.post_processing.extract_types_from_csv(directory)

Extracts all unique ‘types’ from CSV files within a given directory and its subdirectories.

Parameters: directory (str) – The path to the directory to search for CSV files.
Return type: List[str]
Returns: A list of all unique ‘types’ found in the ‘types’ column of the CSV files.

pat2vec.util.post_processing.remove_file_from_paths(current_pat_idcode, project_name='new_project', verbosity=0, config_obj=None)

Removes patient-specific CSV files from various predefined project paths.

Parameters:

current_pat_idcode (str) – The unique identifier of the patient whose files are to be removed.
project_name (str) – The name of the project. Used if config_obj is None.
verbosity (int) – Verbosity level for printing messages.
config_obj (Optional[Any]) – A configuration object containing project paths. If provided, project_name is overridden by config_obj.proj_name. Defaults to None.

Return type:

None

pat2vec.util.post_processing.process_chunk(args)

Processes a chunk of CSV files, concatenating their data into a dictionary.

This helper function is designed for multiprocessing. It reads a specified range of files, extracts data for a given set of unique columns, and returns a dictionary where keys are column names and values are lists of data from those columns.

Parameters: args (tuple) – A tuple containing (part_chunk, all_files, part_size, unique_columns).
Return type: Dict[str, List[str]]
Returns: A dictionary with concatenated data for the specified unique columns.

pat2vec.util.post_processing.join_icd10_codes_to_annot(df, inner=False)

Joins ICD-10 codes to an annotation DataFrame.

This function merges the input DataFrame df with a predefined ICD-10 mapping DataFrame based on the ‘cui’ column in df and ‘referencedComponentId’ in the mapping.

Parameters:

df (DataFrame) – The annotation DataFrame.
inner (bool) – If True, performs an inner merge; otherwise, performs a left merge.

Return type:

DataFrame

Returns:

The DataFrame with ICD-10 codes joined.
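The difference between the left and inner merges can be seen in a small pandas sketch. The mapping table, the icd10_code column name, and all CUI and code values below are made-up placeholders; only the ‘cui’ and ‘referencedComponentId’ key names come from the description above.

```python
import pandas as pd

annotations = pd.DataFrame(
    {"client_idcode": ["P1", "P1", "P2"], "cui": [111, 222, 333]}
)

# Placeholder mapping table; the real one ships with the library.
mapping = pd.DataFrame(
    {"referencedComponentId": [111, 333], "icd10_code": ["A00", "B00"]}
)

# inner=False: every annotation row is kept, unmapped CUIs get NaN codes.
left_joined = annotations.merge(
    mapping, how="left", left_on="cui", right_on="referencedComponentId"
)

# inner=True: only annotations whose CUI appears in the mapping survive.
inner_joined = annotations.merge(
    mapping, how="inner", left_on="cui", right_on="referencedComponentId"
)
```

A left merge is the safer default when downstream code expects one output row per annotation; an inner merge is useful when only coded annotations matter.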

pat2vec.util.post_processing.join_icd10_OPC4S_codes_to_annot(df, inner=False)

Joins ICD-10 and OPCS-4 codes to an annotation DataFrame.

This function merges the input DataFrame df with a predefined ICD-10/OPCS-4 mapping DataFrame based on the ‘cui’ column in df and ‘conceptId’ in the mapping.

Parameters:

df (DataFrame) – The annotation DataFrame.
inner (bool) – If True, performs an inner merge; otherwise, performs a left merge.

Return type:

DataFrame

Returns:

The DataFrame with ICD-10 and OPCS-4 codes joined.

pat2vec.util.post_processing.filter_and_select_rows(dataframe, filter_list, verbosity=0, time_column='updatetime', filter_column='cui', mode='earliest', n_rows=1)

Filter a dataframe based on a filter_column and filter_list, and return either the earliest or latest rows.

Parameters:

dataframe (DataFrame) – Input dataframe.
filter_list (List[Any]) – List of values to filter the dataframe.
verbosity (int) – If > 0, print additional information during execution.
time_column (str) – Column representing time, used for sorting if specified.
filter_column (str) – Column used for filtering based on filter_list.
mode (str) – Either ‘earliest’ or ‘latest’ to specify the rows to return.
n_rows (int) – Number of rows to return if they exist.

Returns:

Filtered and selected rows from the input dataframe.

Return type:

pd.DataFrame
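A rough sketch of the filter-then-select logic, assuming rows are matched with isin and ordered by the time column. The function name and sample data are hypothetical, and the real function may order or break ties differently.

```python
import pandas as pd


def filter_and_select_sketch(
    df: pd.DataFrame,
    filter_list: list,
    time_column: str = "updatetime",
    filter_column: str = "cui",
    mode: str = "earliest",
    n_rows: int = 1,
) -> pd.DataFrame:
    # Keep only rows whose filter_column value is in filter_list.
    matched = df[df[filter_column].isin(filter_list)].copy()
    matched[time_column] = pd.to_datetime(matched[time_column])
    # Ascending order puts the earliest rows first, descending the latest.
    matched = matched.sort_values(time_column, ascending=(mode == "earliest"))
    return matched.head(n_rows)


annots = pd.DataFrame(
    {
        "cui": [1, 2, 1, 3],
        "updatetime": ["2021-03-01", "2020-01-01", "2020-06-01", "2019-01-01"],
    }
)
earliest = filter_and_select_sketch(annots, [1], mode="earliest")
latest = filter_and_select_sketch(annots, [1], mode="latest")
```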

pat2vec.util.post_processing.filter_dataframe_by_cui(dataframe, filter_list, filter_column='cui', mode='earliest', temporal='before', verbosity=0, time_column='updatetime')

Filter an annotation DataFrame based on a list of CUI codes and a specified mode.

Parameters:

dataframe (DataFrame) – The input DataFrame.
filter_list (List[int]) – List of CUI codes to filter the DataFrame.
filter_column (str) – The column whose values are matched against filter_list.
mode (str) – Specifies whether to consider the earliest or latest entry for each filter.
temporal (str) – Specifies whether to retain entries before or after the selected mode entry.
verbosity (int) – Verbosity level. 0 for no debug statements, higher values for more verbosity.
time_column (str) – The column containing time information.

Returns:

Filtered DataFrame based on the specified criteria.

Return type:

pd.DataFrame

pat2vec.util.post_processing.copy_files_and_dirs(source_root, source_name, destination, items_to_copy=None, loose_files=None)

Copies specified directories and files from a source project location to a new destination.

This function is useful for porting project files to a new location while preserving the directory structure. It can copy specific subdirectories and individual files.

Parameters:

source_root (str) – The root directory of the source project.
source_name (str) – The name of the source project directory (e.g., “new_project”).
destination (str) – The destination directory where the project will be copied.
items_to_copy (Optional[List[str]]) – A list of directory or file names (relative to source_name) to copy. If None, a default set of common project directories is copied. Defaults to None.
loose_files (List[str], optional) – A list of file names (relative to source_root) to copy directly to the destination root. If None, a default set of common loose files is copied. Defaults to None.

Return type:

None

Usage:

    project_root_source = "/home/cogstack/%USERNAME%/_data/HFE_5"
    project_name_source = "new_project"
    project_destination = "."

    copy_files_and_dirs(project_root_source, project_name_source, project_destination)

pat2vec.util.post_processing.filter_and_update_csv(target_directory, ipw_dataframe, filter_type='after', verbosity=False)

Filters and updates CSV files in a target directory based on patient IPW records.

This function iterates through each patient record in the ipw_dataframe, finds corresponding CSV files in the target_directory (and its subdirectories), and filters the rows in those CSV files based on a timestamp column and a filter date.

Parameters:

target_directory (str) – The root directory containing the CSV files to be filtered.
ipw_dataframe (pd.DataFrame) – A DataFrame containing patient IPW records, including ‘client_idcode’ and a timestamp column (e.g., ‘updatetime’).
filter_type (str, optional) – The type of filtering to apply: “after” (keep records after filter_date) or “before” (keep records before filter_date). Defaults to “after”.
verbosity (bool, optional) – If True, print verbose messages during processing. Defaults to False.

Return type:

None

pat2vec.util.post_processing.retrieve_pat_annots_mct_epr(client_idcode, config_obj, columns_epr=None, columns_mct=None, columns_to=None, columns_report=None, merge_columns=True)

Retrieves and merges annotation data for a single patient from multiple sources.

This function reads annotation data for a specified patient from four potential sources: EPR annotations, MCT annotations, textual observations annotations, and reports annotations. It loads the corresponding CSV files, optionally selecting specific columns, and concatenates them into a single DataFrame. It can also merge related columns (e.g., timestamps, content) to create a more unified dataset.

Parameters:

client_idcode (str) – The unique identifier for the patient.
config_obj (Any) – A configuration object containing paths to the various annotation batch files.
columns_epr (Optional[List[str]]) – A list of columns to load from the EPR annotations CSV.
columns_mct (Optional[List[str]]) – A list of columns to load from the MCT annotations CSV.
columns_to (Optional[List[str]]) – A list of columns to load from the textual observations annotations CSV.
columns_report (Optional[List[str]]) – A list of columns to load from the reports annotations CSV.
merge_columns (bool, optional) – If True, attempts to merge corresponding columns (e.g., timestamps, content) from the different sources into a unified set of columns. Defaults to True.

Returns:

A DataFrame containing the concatenated and optionally merged annotation data for the patient. Returns an empty DataFrame if no data is found for the patient in any of the sources.

Return type:

pd.DataFrame

pat2vec.util.post_processing.check_list_presence(df, column, lst, annot_filter_arguments=None)

Checks if any string in a list is present in a specified DataFrame column, optionally after applying annotation filters.

Parameters:

df (pd.DataFrame) – The input DataFrame.
column (str) – The name of the column to check for string presence.
lst (list) – A list of strings to search for.
annot_filter_arguments (dict, optional) – Arguments to filter the DataFrame before checking for list presence. Defaults to None.

Returns:

True if any string from lst is found in column (case-insensitive), False otherwise.

Return type:

bool
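The case-insensitive presence check might look like the sketch below. It assumes substring matching and skips the optional annotation filtering step; whether the library matches substrings or whole values is not stated here, and any_term_present and the pretty_name column are hypothetical names.

```python
import pandas as pd


def any_term_present(df: pd.DataFrame, column: str, terms: list) -> bool:
    # Drop missing values so that NaN is never compared as the text "nan".
    values = df[column].dropna().astype(str).str.lower()
    return any(
        values.str.contains(term.lower(), regex=False).any() for term in terms
    )


annots = pd.DataFrame(
    {"pretty_name": ["Heart Failure", "Diabetes mellitus", None]}
)
found = any_term_present(annots, "pretty_name", ["heart failure"])
not_found = any_term_present(annots, "pretty_name", ["asthma"])
```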

pat2vec.util.post_processing.filter_dataframe_n_lists(df, column_name, n_lists)

Filters a DataFrame to include rows where the value in a specified column is present in all of the provided lists.

Parameters:

df (DataFrame) – The input DataFrame.
column_name (str) – The name of the column to filter.
n_lists (List[List[Any]]) – A list of lists. A row is kept only if the value in column_name is present in every sublist within n_lists.

Return type:

DataFrame

Returns:

The filtered DataFrame.
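Keeping a row only when its value appears in every sublist is equivalent to filtering on the intersection of the sublists. A minimal sketch, assuming n_lists is non-empty (the function name is invented):

```python
import pandas as pd


def filter_n_lists_sketch(
    df: pd.DataFrame, column_name: str, n_lists: list
) -> pd.DataFrame:
    # Values present in every sublist: intersect the first with the rest.
    common = set(n_lists[0]).intersection(*n_lists[1:])
    return df[df[column_name].isin(common)]


annots = pd.DataFrame({"cui": [1, 2, 3, 4]})
kept = filter_n_lists_sketch(annots, "cui", [[1, 2, 3], [2, 3, 4]])
```

Computing the intersection once keeps the filter to a single isin pass, regardless of how many sublists are supplied.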

pat2vec.util.post_processing.get_all_target_annots(all_pat_list, n_lists, config_obj=None, annot_filter_arguments=None)

Retrieves and filters target annotations for a list of patients.

This function iterates through a list of patient IDs, retrieves their annotations, applies optional annotation filters, and then filters the annotations to include only those where the ‘cui’ (Concept Unique Identifier) is present in all of the provided n_lists. The results are concatenated into a single DataFrame and saved.

Parameters:

all_pat_list (List[str]) – A list of patient IDs to process.
n_lists (List[List[int]]) – A list of lists of CUI codes. Annotations are kept if their CUI is in all sublists.
config_obj (Optional[Any]) – A configuration object.
annot_filter_arguments (Optional[Dict[str, Any]]) – Arguments to filter annotations.

Returns:

A DataFrame containing all target annotations.

Return type:

pd.DataFrame

pat2vec.util.post_processing.extract_datetime_from_binary_columns(df)

Extracts datetime values from binary columns representing dates in a DataFrame.

Binary columns are expected to have names like (YYYY, MM, DD)_date_time_stamp, where a value of 1 indicates the presence of that date.

Parameters: df (DataFrame) – The DataFrame containing the binary columns with ‘_date_time_stamp’ in column names.
Return type: DataFrame
Returns: The DataFrame with a new ‘datetime’ column.

pat2vec.util.post_processing.extract_datetime_from_binary_columns_chunk_reader(filepath)

Extracts datetime values from binary columns representing dates from a CSV file.

This function reads a CSV file in chunks, extracts datetime values from binary columns (e.g., (YYYY, MM, DD)_date_time_stamp), and appends a ‘datetime’ column to the DataFrame.

Parameters: filepath (str) – The file path to the CSV file.
Return type: DataFrame
Returns: The last chunk of the DataFrame with the ‘datetime’ column appended.

pat2vec.util.post_processing.drop_columns_with_all_nan(df)

Drops columns from a DataFrame where all values are NaN or None.

Parameters:

df (DataFrame) – The input DataFrame.

Returns:

A tuple containing:
DataFrame: The DataFrame with all-NaN columns dropped.
pd.Index: An Index of the column names that were dropped.

Return type:

Tuple[DataFrame, pd.Index]
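The two-part return value (the cleaned DataFrame plus the dropped column names) can be sketched as follows. This is an illustration under assumptions, not the library's code; drop_all_nan_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def drop_all_nan_sketch(df: pd.DataFrame):
    # Boolean mask over columns: True where every value is NaN or None.
    all_nan_mask = df.isna().all().to_numpy()
    dropped = df.columns[all_nan_mask]
    return df.drop(columns=dropped), dropped


frame = pd.DataFrame(
    {"a": [1.0, np.nan], "b": [np.nan, np.nan], "c": [None, None]}
)
cleaned, dropped_cols = drop_all_nan_sketch(frame)
```

Unpacking the tuple lets the caller log or persist dropped_cols alongside the cleaned data.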

pat2vec.util.post_processing.save_missing_values_pickle(df, out_file_path, overwrite=False)

Calculates the percentage of missing values for each column in a DataFrame and saves the result as a pickle file.

Parameters:

df (DataFrame) – The input DataFrame.
out_file_path (str) – The full path to the output file (e.g., “path/to/data.csv”). The pickle file will be saved in the same directory with a _missing_dict.pickle suffix.
overwrite (bool) – If True, overwrites the pickle file if it already exists.

Return type:

None

pat2vec.util.post_processing.convert_true_to_float(df, columns=['census_black_african_caribbean_or_black_british', 'census_mixed_or_multiple_ethnic_groups', 'census_white', 'census_asian_or_asian_british', 'census_other_ethnic_group'])

Converts ‘True’ strings to 1.0 and ensures columns are float type.

Parameters:

df (pandas.DataFrame) – The DataFrame to operate on.
columns (list, optional) – List of column names to convert. Defaults to a predefined list of census columns.

Returns:

DataFrame with specified columns converted.

Return type:

pandas.DataFrame

pat2vec.util.post_processing.impute_datetime(df, datetime_column='datetime', patient_column='client_idcode', forward=True, backward=True, mean_impute=True, verbose=False)

Imputes missing datetime values in a DataFrame based on patient ID and temporal order.

This function sorts the DataFrame by patient ID and datetime, then applies forward-fill, backward-fill, and mean imputation (for remaining NaNs) to the datetime column.

Parameters:

df (DataFrame) – The input DataFrame.
datetime_column (str) – The name of the datetime column to impute.
patient_column (str) – The name of the patient ID column for grouping.
forward (bool) – If True, performs forward-fill imputation.
backward (bool) – If True, performs backward-fill imputation.
mean_impute (bool) – If True, performs mean imputation for any remaining NaNs.
verbose (bool) – If True, prints verbose messages.

Return type:

DataFrame

pat2vec.util.post_processing.impute_dataframe(df, verbose=True, datetime_column='datetime', patient_column='client_idcode', forward=True, backward=True, mean_impute=True)

Imputes missing numeric values in a DataFrame based on patient ID and temporal order.

This function sorts the DataFrame by patient ID and datetime, then applies forward-fill, backward-fill, and mean imputation (for remaining NaNs) to all numeric columns.

Parameters:

df (DataFrame) – The input DataFrame.
verbose (bool) – If True, prints verbose messages.
datetime_column (str) – The name of the datetime column for sorting.
patient_column (str) – The name of the patient ID column for grouping.
forward (bool) – If True, performs forward-fill imputation.
backward (bool) – If True, performs backward-fill imputation.
mean_impute (bool) – If True, performs mean imputation for any remaining NaNs.

Return type:

DataFrame
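The sort, forward-fill, backward-fill, mean-fill sequence can be sketched with pandas groupby. Two assumptions are made here that the description does not confirm: the fills run within each patient, and the final mean is the overall column mean rather than a per-patient mean. The function name and sample data are invented.

```python
import numpy as np
import pandas as pd


def impute_numeric_sketch(
    df: pd.DataFrame,
    datetime_column: str = "datetime",
    patient_column: str = "client_idcode",
) -> pd.DataFrame:
    out = df.sort_values([patient_column, datetime_column]).copy()
    numeric_cols = [
        c for c in out.select_dtypes(include="number").columns
        if c != patient_column
    ]
    # Carry the last known value forward, then the next known value back,
    # without leaking values across patients.
    out[numeric_cols] = out.groupby(patient_column)[numeric_cols].ffill()
    out[numeric_cols] = out.groupby(patient_column)[numeric_cols].bfill()
    # Anything still missing (patients with no value at all) gets the mean.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


obs = pd.DataFrame(
    {
        "client_idcode": ["A", "A", "A", "B"],
        "datetime": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01"]
        ),
        "value": [1.0, np.nan, 3.0, np.nan],
    }
)
imputed = impute_numeric_sketch(obs)
```

In this example patient A's gap is forward-filled from the earlier reading, while patient B, who has no readings, falls back to the column mean.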

pat2vec.util.post_processing.missing_percentage_df(dataframe)

Calculate the percentage of missing values in each column of a DataFrame.

Parameters: dataframe (DataFrame) – The input DataFrame.
Return type: DataFrame
Returns: A DataFrame containing ‘Column’ (column names) and ‘MissingPercentage’.
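A compact way to produce the same two-column shape is to take the mean of the missing-value mask. A minimal sketch with made-up data, assuming percentages on a 0 to 100 scale:

```python
import numpy as np
import pandas as pd


def missing_percentage_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # The mean of a boolean mask is the fraction of True values per column.
    pct = df.isna().mean() * 100
    return pd.DataFrame({"Column": pct.index, "MissingPercentage": pct.values})


frame = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0], "b": [1, 2, 3, 4]})
report = missing_percentage_sketch(frame)
```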

pat2vec.util.post_processing.aggregate_dataframe_mean(df, group_by_column='client_idcode')

Aggregates a DataFrame by a grouping column.

For each group, it calculates the mean for numeric columns and takes the first value for non-numeric columns.

Parameters:

df (DataFrame) – The input DataFrame to aggregate.
group_by_column (str) – The column to group by.

Return type:

DataFrame

Returns:

The aggregated DataFrame.
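The mean-for-numeric, first-for-everything-else rule maps naturally onto a per-column aggregation dictionary. The sketch below is one way to express it; the function name and sample columns are hypothetical.

```python
import pandas as pd


def aggregate_mean_sketch(
    df: pd.DataFrame, group_by_column: str = "client_idcode"
) -> pd.DataFrame:
    # Choose an aggregation per column based on its dtype.
    agg_map = {
        col: "mean" if pd.api.types.is_numeric_dtype(df[col]) else "first"
        for col in df.columns
        if col != group_by_column
    }
    return df.groupby(group_by_column, as_index=False).agg(agg_map)


visits = pd.DataFrame(
    {
        "client_idcode": ["A", "A", "B"],
        "age": [10, 20, 30],
        "sex": ["F", "F", "M"],
    }
)
per_patient = aggregate_mean_sketch(visits)
```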

pat2vec.util.post_processing.collapse_df_to_mean(df, output_filename='output.csv', client_idcode_string='client_idcode')

Collapses a DataFrame to one row per unique client_idcode: numeric columns are averaged and non-numeric columns keep their first value. The result is saved to output_filename.

Parameters:

df (DataFrame) – Input DataFrame containing client_idcode and other columns.
output_filename (str) – Name of the output file to save the processed DataFrame.
client_idcode_string (str) – Name of the client_idcode column in the DataFrame.

Return type:

None

pat2vec.util.post_processing.plot_missing_pattern_bloods(dfb)

Plots the number of missing client_idcodes for the top 50 most frequent ‘basicobs_itemname_analysed’ values in the given dataframe dfb.

Parameters: dfb (DataFrame) – DataFrame containing ‘client_idcode’ and ‘basicobs_itemname_analysed’.
Return type: None