pat2vec.util.get_dummy_data_cohort_searcher

Functions

cohort_searcher_with_terms_and_search_dummy(...)

Generates dummy data based on simulated Elasticsearch query parameters.

create_random_date_from_globals(start_year, ...)

Generates a random datetime within a given month-level range.

extract_date_range(date_string)

Extracts a date range from a string.

extract_search_term_obscatalogmasteritem_displayname(...)

Extracts a search term from an 'obscatalogmasteritem_displayname' query.

generate_appointments_data(num_rows, ...[, ...])

Generates dummy data for the 'pims_apps' index.

generate_basic_observations_data(num_rows, ...)

Generates dummy data for the 'basic_observations' index.

generate_basic_observations_textual_obs_data(...)

Generates dummy textual data for the 'basic_observations' index.

generate_bmi_data(num_rows, entered_list, ...)

Generates dummy data for BMI, Weight, and Height observations.

generate_core_o2_data(num_rows, ...[, ...])

Generates dummy data for CORE_SpO2 (oxygen saturation) observations.

generate_core_resus_data(num_rows, ...[, ...])

Generates dummy data for CORE_RESUS_STATUS observations.

generate_diagnostic_orders_data(num_rows, ...)

Generates dummy data for the 'diagnostic_orders' index.

generate_drug_orders_data(num_rows, ...[, ...])

Generates dummy data for the 'drug_orders' index.

generate_epr_documents_data(num_rows, ...[, ...])

Generates dummy data for the 'epr_documents' index.

generate_epr_documents_personal_data(...[, ...])

Generates dummy personal data for the 'epr_documents' index.

generate_hospital_site_data(num_rows, ...[, ...])

Generates dummy data for hospital site observations.

generate_observations_MRC_text_data(...[, ...])

Generates dummy MRC text data for the 'observations' index.

generate_observations_Reports_text_data(...)

Generates dummy report text data for the 'basic_observations' index.

generate_observations_data(num_rows, ...[, ...])

Generates dummy data for the 'observations' index.

generate_patient_timeline(client_idcode)

Generates a random patient timeline using a GPT-2 model.

generate_patient_timeline_faker(client_idcode)

Generates a fake patient timeline using the Faker library.

generate_uuid(prefix[, length])

Generates a UUID-like string with a given prefix.

generate_uuid_list(n, prefix[, length])

Generates a list of n UUID-like strings.

get_patient_timeline_dummy(client_idcode[, ...])

Retrieves a random patient timeline from a pre-generated CSV file.

maybe_nan(value[, probability])

Returns a value or NaN based on a probability.

run_generate_patient_timeline_and_append([...])

Generates and appends dummy patient timelines to a CSV file.

pat2vec.util.get_dummy_data_cohort_searcher.maybe_nan(value, probability=0.2)[source]

Returns a value or NaN based on a probability.

Parameters:
  • value (Any) – The value to potentially return.

  • probability (float) – The probability of returning np.nan instead of the value. Defaults to 0.2.

Return type:

Union[Any, float]

Returns:

The original value or np.nan.
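
The behaviour described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: the name `maybe_nan_sketch` is hypothetical, `math.nan` stands in for `np.nan`, and drawing from `random.random()` is an assumption.

```python
import math
import random


def maybe_nan_sketch(value, probability=0.2):
    """Return NaN with the given probability, otherwise return value unchanged."""
    # random.random() is uniform on [0.0, 1.0), so probability=0.0 never
    # yields NaN and probability=1.0 always does.
    if random.random() < probability:
        return math.nan
    return value


# With the default probability, about one entry in five is expected to be NaN.
sample = [maybe_nan_sketch(72.5) for _ in range(10)]
print(sample)
```

A helper of this kind is typically used to scatter missing values through dummy columns so that downstream code is exercised against incomplete data.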

pat2vec.util.get_dummy_data_cohort_searcher.create_random_date_from_globals(start_year, start_month, end_year, end_month)[source]

Generates a random datetime within a given month-level range.

Parameters:
  • start_year (int) – The starting year.

  • start_month (int) – The starting month.

  • end_year (int) – The ending year.

  • end_month (int) – The ending month.

Return type:

datetime

Returns:

A random datetime object within the specified range.
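
One way to draw a datetime from a month-level range is shown below. It is a sketch under stated assumptions rather than the library's code: the name `random_datetime_in_months` is hypothetical, the end month is assumed to be included in full, and an inverted range is assumed to raise `ValueError`.

```python
import random
from datetime import datetime, timedelta


def random_datetime_in_months(start_year, start_month, end_year, end_month):
    """Pick a uniformly random datetime between two month boundaries."""
    start = datetime(start_year, start_month, 1)
    # Assumption: the end month is included in full, so the exclusive upper
    # bound is the first instant of the following month.
    if end_month == 12:
        end = datetime(end_year + 1, 1, 1)
    else:
        end = datetime(end_year, end_month + 1, 1)
    if end <= start:
        raise ValueError("end month must not precede start month")
    span_seconds = int((end - start).total_seconds())
    # randrange excludes span_seconds, so the result is always before `end`.
    return start + timedelta(seconds=random.randrange(span_seconds))


print(random_datetime_in_months(2020, 1, 2020, 3))
```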

pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=True, fields_list=['client_idcode', 'document_guid', 'document_description', 'body_analysed', 'updatetime', 'clientvisit_visitidcode'])[source]

Generates dummy data for the ‘epr_documents’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • use_GPT (bool) – If True, uses a text generation model for the document body. Defaults to True.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy EPR document data.
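
The `generate_*_data` functions in this module share the same parameter pattern: `num_rows` rows per client in `entered_list`, dates drawn from the global month range, and output restricted to `fields_list`. The sketch below illustrates that pattern with plain dictionaries instead of a DataFrame. The function name, the placeholder values, and the inclusive treatment of the end month are all assumptions, not the library's behaviour.

```python
import random
from datetime import datetime, timedelta


def generate_rows_sketch(num_rows, entered_list, start_year, start_month,
                         end_year, end_month, fields_list):
    """Build num_rows placeholder records per client, keeping only fields_list."""
    start = datetime(start_year, start_month, 1)
    # Assumption: the end month is included, so the exclusive upper bound is
    # the first day of the month after it (December rolls over to January).
    end = datetime(end_year + end_month // 12, end_month % 12 + 1, 1)
    span_seconds = max(int((end - start).total_seconds()), 1)
    rows = []
    for client_id in entered_list:
        for i in range(num_rows):
            record = {
                "client_idcode": client_id,
                "document_guid": f"doc-{client_id}-{i}",  # placeholder value
                "body_analysed": "placeholder text",      # placeholder value
                "updatetime": start + timedelta(seconds=random.randrange(span_seconds)),
            }
            # Keep only the requested columns, in the requested order.
            rows.append({k: record[k] for k in fields_list if k in record})
    return rows


rows = generate_rows_sketch(2, ["P0000001", "P0000002"], 2020, 1, 2020, 6,
                            ["client_idcode", "updatetime"])
print(len(rows))
```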

pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_personal_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'client_firstname', 'client_lastname', 'client_dob', 'client_gendercode', 'client_racecode', 'client_deceaseddtm', 'updatetime'])[source]

Generates dummy personal data for the ‘epr_documents’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy personal data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_diagnostic_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]

Generates dummy data for the ‘diagnostic_orders’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy diagnostic order data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_drug_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]

Generates dummy data for the ‘drug_orders’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy drug order data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_MRC_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score', 'textualObs'])[source]

Generates dummy MRC text data for the ‘observations’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • use_GPT (bool) – If True, uses a text generation model for the document body. Defaults to False.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_Reports_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['basicobs_guid', 'client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_analysed', 'textualObs', 'updatetime', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]

Generates dummy report text data for the ‘basic_observations’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • use_GPT (bool) – If True, uses a text generation model for the document body. Defaults to False.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy report data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_appointments_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'Popular', 'AppointmentType', 'AttendanceReference', 'ClinicCode', 'ClinicDesc', 'Consultant', 'DateModified', 'DNA', 'HospitalID', 'PatNHSNo', 'Specialty', '_id', '_index', '_score', 'AppointmentDateTime', 'Attended', 'CancDesc', 'CancRefNo', 'ConsultantCode', 'DateCreated', 'Ethnicity', 'Gender', 'NHSNoStatusCode', 'NotSpec', 'PatDateOfBirth', 'PatForename', 'PatPostCode', 'PatSurname', 'PiMsPatRefNo', 'Primarykeyfieldname', 'Primarykeyfieldvalue', 'SessionCode', 'SpecialtyCode'])[source]

Generates dummy data for the ‘pims_apps’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy appointment data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, search_term, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]

Generates dummy data for the ‘observations’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • search_term (str) – The search term to use for the display name.

  • use_GPT (bool) – If True, uses a text generation model for the document body. Defaults to False.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'order_guid', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'clientvisit_visitidcode', 'updatetime'])[source]

Generates dummy data for the ‘basic_observations’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy basic observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_textual_obs_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'basicobs_guid', 'clientvisit_serviceguid', 'updatetime', 'textualObs'])[source]

Generates dummy textual data for the ‘basic_observations’ index.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy textual observation data.

pat2vec.util.get_dummy_data_cohort_searcher.extract_date_range(date_string)[source]

Extracts a date range from a string.

The expected format is “YYYY-MM-DD TO YYYY-MM-DD”.

Parameters:

date_string (str) – The string containing the date range.

Return type:

Optional[Tuple[int, int, int, int, int, int]]

Returns:

A tuple of six integers (start_year, start_month, start_day, end_year, end_month, end_day), or None if the pattern is not found.
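
A regular expression is one plausible way to parse the “YYYY-MM-DD TO YYYY-MM-DD” format. The following is a sketch only: the name `extract_date_range_sketch` and the exact pattern are assumptions, and the library's own pattern may accept or reject different inputs.

```python
import re
from typing import Optional, Tuple

# Two ISO-style dates separated by the literal word TO.
_DATE_RANGE = re.compile(r"(\d{4})-(\d{2})-(\d{2})\s+TO\s+(\d{4})-(\d{2})-(\d{2})")


def extract_date_range_sketch(
    date_string: str,
) -> Optional[Tuple[int, int, int, int, int, int]]:
    """Return (start_year, start_month, start_day, end_year, end_month, end_day) or None."""
    match = _DATE_RANGE.search(date_string)
    if match is None:
        return None
    # All six captured groups are digit strings, so int() cannot fail here.
    return tuple(int(part) for part in match.groups())


print(extract_date_range_sketch("updatetime:[2020-01-15 TO 2021-12-31]"))
# (2020, 1, 15, 2021, 12, 31)
print(extract_date_range_sketch("no dates here"))
# None
```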

pat2vec.util.get_dummy_data_cohort_searcher.cohort_searcher_with_terms_and_search_dummy(index_name, fields_list, term_name, entered_list, search_string)[source]

Generates dummy data based on simulated Elasticsearch query parameters.

This function acts as a stand-in for a real CogStack/Elasticsearch query, routing requests to different dummy data generator functions based on the index_name and search_string.

Parameters:
  • index_name (str) – The name of the target index (e.g., ‘epr_documents’).

  • fields_list (List[str]) – A list of fields to be returned in the DataFrame.

  • term_name (str) – The field name for the term-level query (e.g., ‘client_idcode’).

  • entered_list (List[str]) – The list of values for the term-level query.

  • search_string (str) – A string simulating a query string search, used for routing to the correct data generator.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the generated dummy data.
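
The routing idea can be illustrated with a small dispatcher. Everything in this sketch is hypothetical: the three stand-in generators, the `"CORE_SpO2"` substring check, the `ValueError` for unknown indices, and the use of lists of dictionaries in place of a DataFrame. It shows the general shape of routing on `index_name` and `search_string`, not the rules the library actually applies.

```python
from typing import Callable, List


def _dummy_documents(entered_list: List[str]) -> List[dict]:
    # Hypothetical stand-in for a document generator.
    return [{"client_idcode": c, "body_analysed": "placeholder"} for c in entered_list]


def _dummy_observations(entered_list: List[str]) -> List[dict]:
    # Hypothetical stand-in for a generic observations generator.
    return [{"client_idcode": c, "observation_valuetext_analysed": "placeholder"}
            for c in entered_list]


def _dummy_spo2(entered_list: List[str]) -> List[dict]:
    # Hypothetical stand-in for an oxygen saturation generator.
    return [{"client_idcode": c, "observation_valuetext_analysed": "98"}
            for c in entered_list]


def route_dummy_query(index_name, fields_list, term_name, entered_list, search_string):
    """Pick a generator from index_name and search_string, then project fields_list."""
    # term_name is accepted for signature parity but unused in this sketch.
    generator: Callable[[List[str]], List[dict]]
    if index_name == "epr_documents":
        generator = _dummy_documents
    elif index_name == "observations" and "CORE_SpO2" in search_string:
        generator = _dummy_spo2
    elif index_name == "observations":
        generator = _dummy_observations
    else:
        # Assumption: unknown indices are rejected; the real behaviour may differ.
        raise ValueError(f"no dummy generator for index {index_name!r}")
    # Fields a generator does not produce are filled with None.
    return [{field: row.get(field) for field in fields_list}
            for row in generator(entered_list)]


result = route_dummy_query(
    "observations",
    ["client_idcode", "observation_valuetext_analysed"],
    "client_idcode",
    ["P0000001", "P0000002"],
    'obscatalogmasteritem_displayname:("CORE_SpO2")',
)
print(result)
```

Because the dummy searcher takes the same five arguments as a real query function, test code can swap it in without changing call sites.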

pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline(client_idcode)[source]

Generates a random patient timeline using a GPT-2 model.

Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries.

Parameters:

client_idcode (str) – The client ID for the patient.

Return type:

str

Returns:

A string containing the patient’s dummy timeline.

pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline_faker(client_idcode)[source]

Generates a fake patient timeline using the Faker library.

Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries with fake sentences.

Parameters:

client_idcode (str) – The client ID for the patient.

Return type:

str

Returns:

A string containing the patient’s dummy timeline.

pat2vec.util.get_dummy_data_cohort_searcher.extract_search_term_obscatalogmasteritem_displayname(search_string)[source]

Extracts a search term from an ‘obscatalogmasteritem_displayname’ query.

This function uses a regular expression to find a term enclosed in parentheses following ‘obscatalogmasteritem_displayname:’. It cleans the term by removing quotes and stripping any trailing ‘AND’ or ‘OR’ clauses.

Parameters:

search_string (str) – The input query string.

Return type:

str

Returns:

The extracted search term, or the original string if no match is found.
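
The extraction steps described above could look roughly like this. The sketch is not the library's code: the function name, the regular expressions, and the reading of “trailing ‘AND’ or ‘OR’ clauses” as “everything from the first AND/OR inside the parentheses onwards” are assumptions.

```python
import re

# Capture whatever sits inside the parentheses after the field name.
_DISPLAYNAME = re.compile(r"obscatalogmasteritem_displayname:\s*\(([^)]*)\)")


def extract_display_name_term(search_string: str) -> str:
    """Return the cleaned term, or the input unchanged if the pattern is absent."""
    match = _DISPLAYNAME.search(search_string)
    if match is None:
        return search_string
    term = match.group(1)
    # Assumption: drop everything from the first AND/OR keyword onwards.
    term = re.split(r"\s+(?:AND|OR)\b", term, maxsplit=1)[0]
    # Remove single and double quotes, then surrounding whitespace.
    return term.replace('"', "").replace("'", "").strip()


query = 'obscatalogmasteritem_displayname:("CORE_SpO2") AND updatetime:[2020-01-01 TO 2021-01-01]'
print(extract_display_name_term(query))
# CORE_SpO2
```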

pat2vec.util.get_dummy_data_cohort_searcher.run_generate_patient_timeline_and_append(n=10, output_path='test_files/dummy_timeline.csv')[source]

Generates and appends dummy patient timelines to a CSV file.

This function creates n dummy patient timelines and appends them to a specified CSV file. If the file doesn’t exist, it will be created.

Parameters:
  • n (int) – The number of patient timelines to generate. Defaults to 10.

  • output_path (str) – The path to the output CSV file. Defaults to “test_files/dummy_timeline.csv”.

Raises:
  • FileNotFoundError – If the output_path does not exist and cannot be created.

  • Exception – For any other unexpected errors during timeline generation or file operations.

Return type:

None
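
The append-or-create behaviour can be sketched with the standard `csv` module. The function name and the `timeline` column are hypothetical, and the timelines are passed in as ready-made strings rather than generated, so this shows only the file handling.

```python
import csv
import os
import tempfile


def append_timelines_sketch(timelines, output_path):
    """Append one row per timeline, writing a header only when the file is new."""
    is_new = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
    # Mode "a" creates the file if it is missing; a missing parent directory
    # raises FileNotFoundError.
    with open(output_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["timeline"])  # hypothetical column name
        for text in timelines:
            writer.writerow([text])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy_timeline.csv")
    append_timelines_sketch(["timeline A"], path)
    append_timelines_sketch(["timeline B", "timeline C"], path)
    with open(path, newline="", encoding="utf-8") as handle:
        print(list(csv.reader(handle)))
```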

pat2vec.util.get_dummy_data_cohort_searcher.get_patient_timeline_dummy(client_idcode, output_path='test_files/dummy_timeline.csv')[source]

Retrieves a random patient timeline from a pre-generated CSV file.

Parameters:
  • client_idcode (str) – The client ID to search for (currently unused, as a random row is always selected).

  • output_path (str) – The path to the CSV file containing dummy timelines. Defaults to “test_files/dummy_timeline.csv”.

Return type:

Optional[str]

Returns:

The text of a random patient timeline, or None if the file is not found or is invalid.
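
Reading a random row back, with None for a missing or unusable file, might be sketched as follows. The function name and the `timeline` column are hypothetical, and treating an empty file or an absent column as “invalid” is an assumption about what that term covers.

```python
import csv
import os
import random
import tempfile
from typing import Optional


def random_timeline_sketch(output_path: str, column: str = "timeline") -> Optional[str]:
    """Return the text of one randomly chosen row, or None if none is available."""
    try:
        with open(output_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except FileNotFoundError:
        return None
    # Assumption: an empty file or a missing column counts as invalid.
    if not rows or column not in rows[0]:
        return None
    return random.choice(rows)[column]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy_timeline.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["timeline"], ["note one"], ["note two"]])
    print(random_timeline_sketch(path))
    print(random_timeline_sketch(os.path.join(tmp, "absent.csv")))
```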

pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid(prefix, length=7)[source]

Generates a UUID-like string with a given prefix.

Parameters:
  • prefix (str) – The prefix for the UUID, must be ‘P’ or ‘V’.

  • length (int) – The length of the random part of the string. Defaults to 7.

Return type:

str

Returns:

The generated UUID-like string.

pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid_list(n, prefix, length=7)[source]

Generates a list of n UUID-like strings.

Parameters:
  • n (int) – The number of UUIDs to generate.

  • prefix (str) – The prefix for each UUID.

  • length (int) – The length of the random part of each UUID. Defaults to 7.

Return type:

List[str]

Returns:

A list of generated UUID-like strings.
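
A sketch of both identifier helpers follows. The function names are hypothetical, and two details are assumptions because this page does not state them: the random part is drawn from uppercase letters and digits, and a prefix other than ‘P’ or ‘V’ raises `ValueError`.

```python
import random
import string

_ALPHABET = string.ascii_uppercase + string.digits  # assumed character set


def generate_uuid_sketch(prefix, length=7):
    """Return prefix followed by `length` random characters."""
    if prefix not in ("P", "V"):
        # Assumption: invalid prefixes are rejected with ValueError.
        raise ValueError("prefix must be 'P' or 'V'")
    return prefix + "".join(random.choices(_ALPHABET, k=length))


def generate_uuid_list_sketch(n, prefix, length=7):
    """Return n independently generated identifiers (uniqueness is not enforced)."""
    return [generate_uuid_sketch(prefix, length) for _ in range(n)]


print(generate_uuid_sketch("P"))
print(generate_uuid_list_sketch(3, "V", length=5))
```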

pat2vec.util.get_dummy_data_cohort_searcher.generate_hospital_site_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for hospital site observations.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy hospital site data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_bmi_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for BMI, Weight, and Height observations.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy BMI-related data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_core_o2_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for CORE_SpO2 (oxygen saturation) observations.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy SpO2 data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_core_resus_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for CORE_RESUS_STATUS observations.

Parameters:
  • num_rows (int) – Number of rows to generate for each client.

  • entered_list (List[str]) – List of client IDs to generate data for.

  • global_start_year (int) – Start year for the random date range.

  • global_start_month (int) – Start month for the random date range.

  • global_end_year (int) – End year for the random date range.

  • global_end_month (int) – End month for the random date range.

  • fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy resuscitation status data.