pat2vec.util.pre_processing

Functions

calculate_age_append(df)

Calculates the age of clients and appends it as a new column.

demo_to_latest(demo_df)

Filters a demographics DataFrame to keep only the latest record per patient.

draw_document_samples(df, n)

Draws n random samples for each unique 'search_term' in a DataFrame.

get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy(...)

Searches for documents using a list of terms and returns the results.

search_cohort(patlist, pat2vec_obj, ...[, ...])

Searches for a cohort of patients' demographic data within a date range.

pat2vec.util.pre_processing.get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy(pat2vec_obj, term_list, overwrite=False, overwrite_search_term=None, append=False, verbose=0, mct=True, textual_obs=True, additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1)

Searches for documents using a list of terms and returns the results.

This function takes a list of terms, runs an iterative fuzzy search across multiple data sources (EPR, MCT, Textual Observations), and returns the combined search results as a pandas DataFrame. It also handles saving the results to a CSV file.

Parameters:
  • pat2vec_obj (Any) – A pat2vec object with necessary attributes set.

  • term_list (List[str]) – A list of terms to search for.

  • overwrite (bool) – Whether to overwrite the output file if it already exists.

  • overwrite_search_term (Optional[str]) – A term to override the search terms in term_list. Used for testing.

  • append (bool) – Whether to append to the output file if it exists.

  • verbose (int) – Verbosity level.

  • mct (bool) – If True, includes results from the MCT source.

  • textual_obs (bool) – If True, includes results from the textual observations source.

  • additional_filters (Optional[List[str]]) – A list of additional filters to apply to the search.

  • all_fields (bool) – Whether to include and return all fields in the search.

  • method (str) – The search method to use (‘fuzzy’, ‘phrase’, ‘exact’). Defaults to “fuzzy”.

  • fuzzy (int) – The fuzzy matching tolerance. Defaults to 2.

  • slop (int) – The slop for phrase matching. Defaults to 1.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.

pat2vec.util.pre_processing.draw_document_samples(df, n)

Draws n random samples for each unique ‘search_term’ in a DataFrame.

Parameters:
  • df (DataFrame) – DataFrame containing a ‘search_term’ column.

  • n (int) – The number of samples to draw for each unique search term. If a term has fewer than n rows, all its rows are returned.

Return type:

DataFrame

Returns:

A new DataFrame containing the sampled entries.
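
The per-term sampling described above can be approximated in plain pandas. This is a minimal sketch and not the library's implementation: the helper name sample_per_search_term, the seed argument, and the example 'body' column are assumptions added for illustration. Only the 'search_term' column and the "return all rows when a term has fewer than n" behaviour come from the description above.

```python
import pandas as pd


def sample_per_search_term(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw up to n random rows for each unique 'search_term' value.

    Hypothetical stand-in for draw_document_samples; `seed` is an
    illustrative addition that makes the draw reproducible.
    """
    sampled_groups = [
        # Cap the sample size at the group size so small groups are kept whole.
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in df.groupby("search_term")
    ]
    if not sampled_groups:
        # Nothing to sample: return an empty frame with the same columns.
        return df.iloc[0:0].copy()
    return pd.concat(sampled_groups).reset_index(drop=True)


# Made-up example data: three documents for one term, one for another.
docs = pd.DataFrame(
    {
        "search_term": ["aspirin", "aspirin", "aspirin", "statin"],
        "body": ["note a", "note b", "note c", "note d"],
    }
)
sampled = sample_per_search_term(docs, n=2)
```

Here 'aspirin' contributes two rows and 'statin' contributes its single row, because the requested n exceeds the number of rows available for that term.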

pat2vec.util.pre_processing.demo_to_latest(demo_df)

Filters a demographics DataFrame to keep only the latest record per patient.

Based on the ‘updatetime’ column, this function finds and returns the most recent entry for each unique ‘client_idcode’.

Parameters:

demo_df (DataFrame) – A DataFrame with patient demographic data, including ‘client_idcode’ and ‘updatetime’ columns.

Return type:

DataFrame

Returns:

A DataFrame containing only the latest record for each patient.
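
One way to express "latest record per patient" in pandas is to sort by 'updatetime' and keep the last row for each 'client_idcode'. The sketch below is an illustration under stated assumptions, not the library's code: the helper name latest_record_per_patient and the 'ward' column are hypothetical, and dropping rows with an unparseable 'updatetime' is an assumption about how such rows might be handled.

```python
import pandas as pd


def latest_record_per_patient(demo_df: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recent row per 'client_idcode', judged by 'updatetime'.

    Hypothetical stand-in for demo_to_latest.
    """
    df = demo_df.copy()
    # Parse timestamps; values that cannot be parsed become NaT.
    df["updatetime"] = pd.to_datetime(df["updatetime"], errors="coerce")
    # Assumption: rows without a usable timestamp cannot be ranked, so drop them.
    df = df.dropna(subset=["updatetime"])
    # After sorting oldest to newest, the last row per patient is the latest one.
    latest = df.sort_values("updatetime").drop_duplicates(
        subset="client_idcode", keep="last"
    )
    return latest.reset_index(drop=True)


# Made-up example data: patient P1 has two records, P2 has one.
demo = pd.DataFrame(
    {
        "client_idcode": ["P1", "P1", "P2"],
        "updatetime": ["2020-01-01", "2022-06-01", "2021-03-15"],
        "ward": ["A", "B", "C"],
    }
)
latest = latest_record_per_patient(demo)
```

For P1 only the 2022 record (ward B) survives; P2 keeps its single record.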

pat2vec.util.pre_processing.calculate_age_append(df)

Calculates the age of clients and appends it as a new column.

This function takes a DataFrame with a ‘client_dob’ (date of birth) column, calculates the current age for each client, and adds it as a new ‘age’ column. Rows with invalid or missing ‘client_dob’ are dropped.

Parameters:

df (DataFrame) – DataFrame containing client data with a ‘client_dob’ column.

Return type:

DataFrame

Returns:

The DataFrame with an additional ‘age’ column, excluding any rows that were dropped because ‘client_dob’ was invalid or missing.
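
The age calculation can be sketched in pandas as follows. This is a minimal illustration, not the library's implementation: the helper name append_age and its today argument are hypothetical (the documented function takes only df and uses the current date), and computing age in whole completed years is an assumption.

```python
from typing import Optional

import pandas as pd


def append_age(df: pd.DataFrame, today: Optional[pd.Timestamp] = None) -> pd.DataFrame:
    """Add an 'age' column (completed years) derived from 'client_dob'.

    Hypothetical stand-in for calculate_age_append; `today` is an
    illustrative addition so the result can be reproduced.
    """
    today = pd.Timestamp.today().normalize() if today is None else today
    out = df.copy()
    # Invalid or missing dates of birth become NaT and are then dropped.
    out["client_dob"] = pd.to_datetime(out["client_dob"], errors="coerce")
    out = out.dropna(subset=["client_dob"])
    dob = out["client_dob"]
    # True where this year's birthday has already happened on `today`.
    had_birthday = (dob.dt.month < today.month) | (
        (dob.dt.month == today.month) & (dob.dt.day <= today.day)
    )
    # Subtract one year for clients whose birthday is still to come this year.
    out["age"] = today.year - dob.dt.year - (~had_birthday).astype(int)
    return out


# Made-up example data: two valid dates of birth, one invalid, one missing.
clients = pd.DataFrame(
    {
        "client_idcode": ["P1", "P2", "P3", "P4"],
        "client_dob": ["1980-05-20", "not a date", None, "2000-01-01"],
    }
)
aged = append_age(clients, today=pd.Timestamp("2024-05-19"))
```

With the reference date fixed at 2024-05-19, P1 is one day short of a birthday and so is 43, P4 is 24, and P2 and P3 are dropped because their ‘client_dob’ values cannot be parsed.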

pat2vec.util.pre_processing.search_cohort(patlist, pat2vec_obj, start_year, start_month, start_day, end_year, end_month, end_day, additional_filters=None)

Searches for a cohort of patients’ demographic data within a date range.

Parameters:
  • patlist (List[str]) – List of patient IDs to search for.

  • pat2vec_obj (Any) – The main pat2vec object with a configured cohort searcher.

  • start_year (str) – Start year for the search.

  • start_month (str) – Start month for the search.

  • start_day (str) – Start day for the search.

  • end_year (str) – End year for the search.

  • end_month (str) – End month for the search.

  • end_day (str) – End day for the search.

  • additional_filters (Optional[List[str]]) – List of additional filter strings to append to the search query.

Return type:

DataFrame

Returns:

A DataFrame containing the demographic data for the specified cohort.

Raises:

ValueError – If pat2vec_obj is not provided.
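
Because the date range is passed as six separate string arguments, it can be convenient to derive them from two date objects. The helper below is a hypothetical convenience, not part of pat2vec. The documentation above only states that each component is a str, so formatting them as unpadded decimal strings is an assumption that should be checked against the configured cohort searcher.

```python
from datetime import date
from typing import Dict


def date_range_kwargs(start: date, end: date) -> Dict[str, str]:
    """Split two dates into the string keyword arguments of search_cohort.

    Hypothetical helper; unpadded strings such as "1" for January are an
    assumption, since only the str type is documented.
    """
    if start > end:
        raise ValueError("start must not be later than end")
    return {
        "start_year": str(start.year),
        "start_month": str(start.month),
        "start_day": str(start.day),
        "end_year": str(end.year),
        "end_month": str(end.month),
        "end_day": str(end.day),
    }


kwargs = date_range_kwargs(date(2020, 1, 1), date(2023, 12, 31))

# With a configured pat2vec object available, the call would then look like:
# demo_df = search_cohort(patlist, pat2vec_obj, **kwargs)
```

The final call is left commented out because it needs a pat2vec object with a configured cohort searcher, which this sketch does not create.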