pat2vec.util.helper_functions

Functions

clear_patient_features(patient_id, config_obj)

Deletes all features for a specific patient from the database.

ensure_index(connection, table_name, ...)

Ensures a database index exists for a given column.

extract_nhs_numbers(input_string)

Extracts all occurrences of "NHS" followed by a 10-digit number.

get_all_features(config_obj)

Retrieves all patient features from the configured backend.

get_df_from_db(config_obj, schema, table[, ...])

Generic helper to retrieve a DataFrame from the database backend.

get_search_client_idcode_list_from_nhs_number_list(...)

Retrieves a unique list of hospital IDs from a list of NHS numbers.

sanitize_for_path(text)

Sanitizes a string to be safe for use in a file/directory path.

save_annotations_to_db(df, patient_id, ...)

Saves an annotation DataFrame to the database.

save_patient_features(features_df, ...[, ...])

Saves the feature vector(s) for a single patient to the configured backend.

save_raw_patient_batch(df, patient_id, ...)

Saves a raw data batch for a patient to the database.

pat2vec.util.helper_functions.sanitize_for_path(text)

Sanitizes a string to be safe for use in a file/directory path.

Return type:

str

Parameters:

text (str)
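The replacement rules are not documented here. As a rough illustration only, the sketch below assumes that anything other than ASCII letters, digits, dots, underscores, and hyphens is replaced with an underscore. sanitize_for_path_sketch is a hypothetical stand-in, not the library's implementation, and its rules may differ from what sanitize_for_path actually does.

```python
import re


def sanitize_for_path_sketch(text: str) -> str:
    """Hypothetical stand-in: replace characters that are unsafe in paths."""
    # Assumption: keep ASCII letters, digits, '.', '_' and '-'; replace
    # every other run of characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", str(text))
    # Trim leading/trailing dots and underscores so names such as ".."
    # cannot occur, and fall back to "_" if nothing is left.
    cleaned = cleaned.strip("._")
    return cleaned or "_"


print(sanitize_for_path_sketch("ward 7/bed:12"))
```

Trimming the leading dots is a deliberate choice in this sketch: it prevents a value such as "../etc" from being interpreted as a parent-directory reference.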

pat2vec.util.helper_functions.extract_nhs_numbers(input_string)

Extracts all occurrences of “NHS” followed by a 10-digit number.

The function searches for the pattern “NHS” followed by a 10-digit number, which may contain spaces. It then cleans the extracted numbers by removing any spaces.

Parameters:

input_string (str) – The string to search for NHS numbers.

Return type:

List[str]

Returns:

A list of all extracted 10-digit NHS numbers as strings.

Examples

>>> extract_nhs_numbers("NHS 123 456 7890")
['1234567890']
>>> extract_nhs_numbers("NHS 123 456 7890 and NHS 098 765 4321")
['1234567890', '0987654321']
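
The documented behaviour (match "NHS", capture ten digits that may be separated by spaces, then strip the spaces) can be sketched with a regular expression. The pattern and the name extract_nhs_numbers_sketch are assumptions made for illustration; the library's actual pattern may differ.

```python
import re
from typing import List

# Assumption: "NHS", optional whitespace, then exactly ten digits that may
# be separated by whitespace. The library's actual pattern may differ.
_NHS_PATTERN = re.compile(r"NHS\s*((?:\d\s*){10})")


def extract_nhs_numbers_sketch(input_string: str) -> List[str]:
    """Hypothetical re-implementation of the documented behaviour."""
    matches = _NHS_PATTERN.findall(input_string)
    # Strip the internal (and any trailing) whitespace from each match.
    return [re.sub(r"\s+", "", match) for match in matches]


print(extract_nhs_numbers_sketch("NHS 123 456 7890 and NHS 098 765 4321"))
```

Note that this sketch only checks the shape of the number; it does not validate the NHS check digit.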

pat2vec.util.helper_functions.get_search_client_idcode_list_from_nhs_number_list(nhs_numbers, pat2vec_obj)

Retrieves a unique list of hospital IDs from a list of NHS numbers.

This function uses a pat2vec_obj to perform a cohort search against an index (e.g., ‘pims_apps*’) to find the corresponding ‘HospitalID’ for each ‘PatNHSNo’ in the provided list.

Parameters:
  • nhs_numbers (List[str]) – A list of NHS numbers to search for.

  • pat2vec_obj (Any) – An object with a cohort_searcher_with_terms_and_search method for querying the data source.

Return type:

List[str]

Returns:

A unique list of hospital IDs found for the given NHS numbers.
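
The de-duplication step can be sketched independently of the search backend. The example below assumes the search results are available as rows keyed by 'PatNHSNo' and 'HospitalID'. The helper unique_hospital_ids, the plain-dict row format, and the first-seen ordering are assumptions for illustration and are not part of the library.

```python
from typing import Dict, Iterable, List


def unique_hospital_ids(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Collect distinct, non-empty 'HospitalID' values in first-seen order."""
    ids = (row.get("HospitalID") for row in rows)
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(i for i in ids if i))


# Made-up rows standing in for search results; one patient appears twice.
rows = [
    {"PatNHSNo": "1234567890", "HospitalID": "H001"},
    {"PatNHSNo": "0987654321", "HospitalID": "H002"},
    {"PatNHSNo": "1234567890", "HospitalID": "H001"},
]
print(unique_hospital_ids(rows))
```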

pat2vec.util.helper_functions.ensure_index(connection, table_name, schema_name, column_name, engine_name)

Ensures a database index exists for a given column.

Return type:

None

Parameters:
  • connection (Any)

  • table_name (str)

  • schema_name (str | None)

  • column_name (str)

  • engine_name (str)
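
The idea of an idempotent "create the index only if it is missing" helper can be sketched for SQLite using the standard library. ensure_index_sketch is a hypothetical, SQLite-only stand-in: it omits schema_name and engine_name, and the idx_<table>_<column> naming convention is an assumption rather than the library's documented behaviour.

```python
import sqlite3


def ensure_index_sketch(connection: sqlite3.Connection,
                        table_name: str, column_name: str) -> None:
    """Hypothetical SQLite-only stand-in for an idempotent index helper."""
    # Identifiers cannot be bound as parameters, so validate them first.
    for name in (table_name, column_name):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # Assumed naming convention for the index; not taken from the library.
    index_name = f"idx_{table_name}_{column_name}"
    connection.execute(
        f'CREATE INDEX IF NOT EXISTS "{index_name}" '
        f'ON "{table_name}" ("{column_name}")'
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (client_idcode TEXT, value REAL)")
ensure_index_sketch(conn, "features", "client_idcode")
ensure_index_sketch(conn, "features", "client_idcode")  # second call is a no-op
index_names = [row[1] for row in conn.execute("PRAGMA index_list('features')")]
print(index_names)
```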

pat2vec.util.helper_functions.clear_patient_features(patient_id, config_obj)

Deletes all features for a specific patient from the database.

Return type:

None

Parameters:
  • patient_id (str)

  • config_obj (Any)

pat2vec.util.helper_functions.save_patient_features(features_df, patient_id, config_obj, overwrite=True)

Saves the feature vector(s) for a single patient to the configured backend.

If storage_backend is ‘database’, it appends/overwrites the features in a ‘features’ table within a ‘features’ schema.

If storage_backend is ‘file’, it saves the features to a CSV file in the current_pat_lines_path directory, preserving the original behavior.

Parameters:
  • features_df (DataFrame) – The DataFrame containing one or more feature vectors for the patient.

  • patient_id (str) – The unique identifier for the patient.

  • config_obj (Any) – The configuration object containing backend settings and paths.

  • overwrite (bool) – If True, delete existing features for the patient before saving. Defaults to True.

Raises:
  • ValueError – If an unknown storage_backend is specified.

  • Exception – Propagates exceptions from database operations.

Return type:

None
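
The overwrite semantics described above (delete the patient's existing rows, then append the new ones) can be sketched against an in-memory SQLite database. save_features_sketch, the long-format table layout, and the use of client_idcode as the patient column are assumptions for illustration; the library works with DataFrames and its own configured backend.

```python
import sqlite3
from typing import Iterable, Tuple


def save_features_sketch(connection: sqlite3.Connection, patient_id: str,
                         rows: Iterable[Tuple[str, float]],
                         overwrite: bool = True) -> None:
    """Hypothetical stand-in showing delete-then-append semantics."""
    with connection:  # commit on success, roll back and re-raise on error
        if overwrite:
            connection.execute(
                "DELETE FROM features WHERE client_idcode = ?", (patient_id,)
            )
        connection.executemany(
            "INSERT INTO features (client_idcode, feature, value) VALUES (?, ?, ?)",
            [(patient_id, name, value) for name, value in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (client_idcode TEXT, feature TEXT, value REAL)")
save_features_sketch(conn, "P001", [("age", 70.0)])
save_features_sketch(conn, "P001", [("age", 71.0)])                   # replaces
save_features_sketch(conn, "P001", [("bmi", 24.5)], overwrite=False)  # appends
stored = conn.execute(
    "SELECT feature, value FROM features WHERE client_idcode = 'P001' ORDER BY rowid"
).fetchall()
print(stored)
```

Running the delete and the insert in one transaction means a failed insert does not leave the patient with no features at all; database errors still propagate to the caller.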

pat2vec.util.helper_functions.save_raw_patient_batch(df, patient_id, table_name, config_obj, id_column='client_idcode')

Saves a raw data batch for a patient to the database.

Parameters:
  • df (DataFrame) – The DataFrame containing the raw data.

  • patient_id (str) – The patient identifier.

  • table_name (str) – The target table name (without schema prefix).

  • config_obj (Any) – The configuration object.

  • id_column (str) – The column name for the patient ID in this table.

Return type:

None

pat2vec.util.helper_functions.get_all_features(config_obj)

Retrieves all patient features from the configured backend.

If storage_backend is ‘database’, it reads the entire ‘features’ table.

If storage_backend is ‘file’, it reads and concatenates all individual patient CSV files from the current_pat_lines_path directory.

Return type:

DataFrame

Parameters:

config_obj (Any)
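
For the 'file' backend, reading and concatenating per-patient CSV files can be sketched as follows. To stay dependency-free, this sketch uses the standard-library csv module and returns a list of row dictionaries rather than a pandas DataFrame. read_all_feature_files and the file names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def read_all_feature_files(directory: Path) -> List[Dict[str, str]]:
    """Concatenate the rows of every *.csv file in a directory."""
    rows: List[Dict[str, str]] = []
    for csv_path in sorted(directory.glob("*.csv")):  # stable file order
        with csv_path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo with two made-up per-patient files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "P001.csv").write_text("client_idcode,age\nP001,70\n")
    (folder / "P002.csv").write_text("client_idcode,age\nP002,65\n")
    all_rows = read_all_feature_files(folder)

print(all_rows)
```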

pat2vec.util.helper_functions.get_df_from_db(config_obj, schema, table, patient_ids=None, patient_id_column='client_idcode', columns=None)

Generic helper to retrieve a DataFrame from the database backend.

This function handles database connections, dialect-specific table naming (e.g., for SQLite), and filtering by a list of patient IDs.

Parameters:
  • config_obj (Any) – The configuration object.

  • schema (str) – The database schema name (e.g., ‘raw_data’).

  • table (str) – The database table name (e.g., ‘raw_drugs’).

  • patient_ids (Optional[List[str]]) – An optional list of patient IDs to filter the DataFrame.

  • patient_id_column (str) – The name of the patient ID column.

  • columns (Optional[List[str]]) – An optional list of columns to select.

Return type:

DataFrame

Returns:

A pandas DataFrame with the requested data, or an empty DataFrame on error.
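
The filtering and error-handling behaviour described above can be sketched for SQLite with the standard library. fetch_rows_sketch is a hypothetical stand-in that returns a list of tuples instead of a DataFrame. The "<schema>_<table>" naming used for SQLite, and the treatment of an empty patient_ids list as "no filter", are assumptions and may not match the library.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple


def fetch_rows_sketch(connection: sqlite3.Connection, schema: str, table: str,
                      patient_ids: Optional[Sequence[str]] = None,
                      patient_id_column: str = "client_idcode",
                      columns: Optional[Sequence[str]] = None) -> List[Tuple]:
    """Hypothetical SQLite-only stand-in; returns [] on any database error."""
    names = [schema, table, patient_id_column, *(columns or [])]
    if not all(name.isidentifier() for name in names):
        raise ValueError("unsafe identifier")
    # Assumption: on SQLite the schema is folded into the table name.
    selected = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    query = f'SELECT {selected} FROM "{schema}_{table}"'
    params: List[str] = []
    if patient_ids:  # Assumption: None or an empty list means no filtering.
        placeholders = ", ".join("?" for _ in patient_ids)
        query += f' WHERE "{patient_id_column}" IN ({placeholders})'
        params = list(patient_ids)
    try:
        return connection.execute(query, params).fetchall()
    except sqlite3.Error:
        return []  # mirrors the documented "empty result on error"


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "raw_data_raw_drugs" (client_idcode TEXT, drug TEXT)')
conn.executemany(
    'INSERT INTO "raw_data_raw_drugs" VALUES (?, ?)',
    [("P001", "aspirin"), ("P002", "warfarin"), ("P003", "statin")],
)
subset = fetch_rows_sketch(conn, "raw_data", "raw_drugs",
                           patient_ids=["P001", "P003"], columns=["drug"])
missing = fetch_rows_sketch(conn, "raw_data", "no_such_table")
print(subset, missing)
```

Patient IDs are passed as bound parameters, while table and column names are validated as plain identifiers because they cannot be parameterised.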

pat2vec.util.helper_functions.save_annotations_to_db(df, patient_id, table_name, config_obj, id_column='client_idcode', schema_name='annotations')

Saves an annotation DataFrame to the database.

Return type:

None

Parameters:
  • df (DataFrame)

  • patient_id (str)

  • table_name (str)

  • config_obj (Any)

  • id_column (str)

  • schema_name (str)