pat2vec.util.get_dummy_data_cohort_searcher

Functions

`cohort_searcher_with_terms_and_search_dummy`(...)	Generates dummy data based on simulated Elasticsearch query parameters.
`create_random_date_from_globals`(start_year, ...)	Generates a random datetime within a given month-level range.
`extract_date_range`(date_string)	Extracts a date range from a string.
`extract_search_term_obscatalogmasteritem_displayname`(...)	Extracts a search term from an 'obscatalogmasteritem_displayname' query.
`generate_appointments_data`(num_rows, ...[, ...])	Generates dummy data for the 'pims_apps' index.
`generate_basic_observations_data`(num_rows, ...)	Generates dummy data for the 'basic_observations' index.
`generate_basic_observations_textual_obs_data`(...)	Generates dummy textual data for the 'basic_observations' index.
`generate_bmi_data`(num_rows, entered_list, ...)	Generates dummy data for BMI, Weight, and Height observations.
`generate_core_o2_data`(num_rows, ...[, ...])	Generates dummy data for CORE_SpO2 (oxygen saturation) observations.
`generate_core_resus_data`(num_rows, ...[, ...])	Generates dummy data for CORE_RESUS_STATUS observations.
`generate_diagnostic_orders_data`(num_rows, ...)	Generates dummy data for the 'diagnostic_orders' index.
`generate_drug_orders_data`(num_rows, ...[, ...])	Generates dummy data for the 'drug_orders' index.
`generate_epr_documents_data`(num_rows, ...[, ...])	Generates dummy data for the 'epr_documents' index.
`generate_epr_documents_personal_data`(...[, ...])	Generates dummy personal data for the 'epr_documents' index.
`generate_hospital_site_data`(num_rows, ...[, ...])	Generates dummy data for hospital site observations.
`generate_observations_MRC_text_data`(...[, ...])	Generates dummy MRC text data for the 'observations' index.
`generate_observations_Reports_text_data`(...)	Generates dummy report text data for the 'basic_observations' index.
`generate_observations_data`(num_rows, ...[, ...])	Generates dummy data for the 'observations' index.
`generate_patient_timeline`(client_idcode)	Generates a random patient timeline using a GPT-2 model.
`generate_patient_timeline_faker`(client_idcode)	Generates a fake patient timeline using the Faker library.
`generate_uuid`(prefix[, length])	Generates a UUID-like string with a given prefix.
`generate_uuid_list`(n, prefix[, length])	Generates a list of n UUID-like strings.
`get_patient_timeline_dummy`(client_idcode[, ...])	Retrieves a random patient timeline from a pre-generated CSV file.
`maybe_nan`(value[, probability])	Returns a value or NaN based on a probability.
`run_generate_patient_timeline_and_append`([...])	Generates and appends dummy patient timelines to a CSV file.

pat2vec.util.get_dummy_data_cohort_searcher.maybe_nan(value, probability=0.2)[source]

Returns a value or NaN based on a probability.

Parameters:

value (Any) – The value to potentially return.
probability (float) – The probability of returning np.nan instead of the value. Defaults to 0.2.

Return type:

Union[Any, float]

Returns:

The original value or np.nan.
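
A minimal sketch of the behaviour described above, not the library's actual source. It assumes the decision is a single uniform random draw compared against probability. The name maybe_nan_sketch is hypothetical, and math.nan stands in for np.nan (both are float NaN values) so that the example needs only the standard library.

```python
import math
import random
from typing import Any, Union


def maybe_nan_sketch(value: Any, probability: float = 0.2) -> Union[Any, float]:
    """Return NaN with the given probability, otherwise return value unchanged."""
    # random.random() is uniform on [0.0, 1.0), so this comparison is true
    # with (approximately) the requested probability.
    if random.random() < probability:
        return math.nan
    return value


# With the default probability, roughly one entry in five becomes NaN.
readings = [maybe_nan_sketch(98.6) for _ in range(10)]
print(readings)
```

A helper of this kind is typically used to sprinkle missing values into generated columns so that downstream code is exercised against incomplete data.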

pat2vec.util.get_dummy_data_cohort_searcher.create_random_date_from_globals(start_year, start_month, end_year, end_month)[source]

Generates a random datetime within a given month-level range.

Parameters:

start_year (int) – The starting year.
start_month (int) – The starting month.
end_year (int) – The ending year.
end_month (int) – The ending month.

Return type:

datetime

Returns:

A random datetime object within the specified range.
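
One way to draw such a datetime is sketched below. The exact bounds used by the library are not stated here, so this sketch assumes the range runs from the first instant of the start month to the last second of the end month. The function name random_datetime_between_months is hypothetical.

```python
import calendar
import random
from datetime import datetime, timedelta


def random_datetime_between_months(
    start_year: int, start_month: int, end_year: int, end_month: int
) -> datetime:
    """Pick a uniformly random datetime between two months (inclusive)."""
    start = datetime(start_year, start_month, 1)
    # calendar.monthrange returns (weekday_of_first_day, days_in_month).
    last_day = calendar.monthrange(end_year, end_month)[1]
    end = datetime(end_year, end_month, last_day, 23, 59, 59)
    if end < start:
        raise ValueError("end month precedes start month")
    span_seconds = int((end - start).total_seconds())
    # random.randint includes both endpoints.
    return start + timedelta(seconds=random.randint(0, span_seconds))


print(random_datetime_between_months(2020, 1, 2020, 3))
```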

pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=True, fields_list=['client_idcode', 'document_guid', 'document_description', 'body_analysed', 'updatetime', 'clientvisit_visitidcode'])[source]

Generates dummy data for the ‘epr_documents’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the document body.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy EPR document data.
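
The generator functions in this module share the same shape: num_rows rows per client in entered_list, timestamps drawn from the global date range, and output restricted to fields_list. The sketch below illustrates that pattern only; it is not the library's implementation. The function name, the placeholder values, and the use of plain dictionaries instead of a DataFrame are all assumptions made to keep the example dependency-free.

```python
import random
import uuid
from datetime import datetime, timedelta
from typing import Dict, List


def generate_rows_sketch(
    num_rows: int,
    entered_list: List[str],
    start: datetime,
    end: datetime,
    fields_list: List[str],
) -> List[Dict[str, object]]:
    """Build num_rows placeholder records for every client ID."""
    span_seconds = int((end - start).total_seconds())
    rows = []
    for client_idcode in entered_list:
        for _ in range(num_rows):
            # Hypothetical placeholder values for each known column.
            full_row = {
                "client_idcode": client_idcode,
                "document_guid": uuid.uuid4().hex,
                "document_description": "dummy document",
                "body_analysed": "placeholder clinical text",
                "updatetime": start
                + timedelta(seconds=random.randint(0, span_seconds)),
                "clientvisit_visitidcode": uuid.uuid4().hex[:8],
            }
            # Keep only the requested columns; unknown names map to None.
            rows.append({field: full_row.get(field) for field in fields_list})
    return rows


rows = generate_rows_sketch(
    2,
    ["P001", "P002"],
    datetime(2020, 1, 1),
    datetime(2020, 12, 31),
    ["client_idcode", "updatetime"],
)
print(len(rows))
```

If pandas is available, pd.DataFrame(rows, columns=fields_list) turns these records into a DataFrame with the requested column order.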

pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_personal_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'client_firstname', 'client_lastname', 'client_dob', 'client_gendercode', 'client_racecode', 'client_deceaseddtm', 'updatetime'])[source]

Generates dummy personal data for the ‘epr_documents’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy personal data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_diagnostic_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]

Generates dummy data for the ‘diagnostic_orders’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy diagnostic order data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_drug_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]

Generates dummy data for the ‘drug_orders’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy drug order data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_MRC_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score', 'textualObs'])[source]

Generates dummy MRC text data for the ‘observations’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the generated text.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_Reports_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['basicobs_guid', 'client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_analysed', 'textualObs', 'updatetime', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]

Generates dummy report text data for the ‘basic_observations’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the generated text.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy report data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_appointments_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'Popular', 'AppointmentType', 'AttendanceReference', 'ClinicCode', 'ClinicDesc', 'Consultant', 'DateModified', 'DNA', 'HospitalID', 'PatNHSNo', 'Specialty', '_id', '_index', '_score', 'AppointmentDateTime', 'Attended', 'CancDesc', 'CancRefNo', 'ConsultantCode', 'DateCreated', 'Ethnicity', 'Gender', 'NHSNoStatusCode', 'NotSpec', 'PatDateOfBirth', 'PatForename', 'PatPostCode', 'PatSurname', 'PiMsPatRefNo', 'Primarykeyfieldname', 'Primarykeyfieldvalue', 'SessionCode', 'SpecialtyCode'])[source]

Generates dummy data for the ‘pims_apps’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy appointment data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, search_term, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]

Generates dummy data for the ‘observations’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
search_term (str) – The search term to use for the display name.
use_GPT (bool) – If True, uses a text generation model for the generated text.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'order_guid', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'clientvisit_visitidcode', 'updatetime'])[source]

Generates dummy data for the ‘basic_observations’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy basic observation data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_textual_obs_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'basicobs_guid', 'clientvisit_serviceguid', 'updatetime', 'textualObs'])[source]

Generates dummy textual data for the ‘basic_observations’ index.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy textual observation data.

pat2vec.util.get_dummy_data_cohort_searcher.extract_date_range(date_string)[source]

Extracts a date range from a string.

The expected format is “YYYY-MM-DD TO YYYY-MM-DD”.

Parameters:

date_string (str) – The string containing the date range.

Return type:

Optional[Tuple[int, int, int, int, int, int]]

Returns:

A tuple of six integers (start_year, start_month, start_day, end_year, end_month, end_day), or None if the pattern is not found.
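
The parsing described above can be sketched with a regular expression. The pattern below is an assumption based on the documented “YYYY-MM-DD TO YYYY-MM-DD” format, and the name extract_date_range_sketch is hypothetical; the library's own pattern may differ.

```python
import re
from typing import Optional, Tuple

# Six capture groups: start year/month/day, then end year/month/day.
DATE_RANGE_PATTERN = re.compile(
    r"(\d{4})-(\d{2})-(\d{2}) TO (\d{4})-(\d{2})-(\d{2})"
)


def extract_date_range_sketch(
    date_string: str,
) -> Optional[Tuple[int, int, int, int, int, int]]:
    """Return the six date components as integers, or None if absent."""
    match = DATE_RANGE_PATTERN.search(date_string)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(extract_date_range_sketch("updatetime:[2020-01-15 TO 2021-06-30]"))
# prints (2020, 1, 15, 2021, 6, 30)
```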

pat2vec.util.get_dummy_data_cohort_searcher.cohort_searcher_with_terms_and_search_dummy(index_name, fields_list, term_name, entered_list, search_string)[source]

Generates dummy data based on simulated Elasticsearch query parameters.

This function acts as a stand-in for a real CogStack/Elasticsearch query, routing requests to different dummy data generator functions based on the index_name and search_string.

Parameters:

index_name (str) – The name of the target index (e.g., ‘epr_documents’).
fields_list (List[str]) – A list of fields to be returned in the DataFrame.
term_name (str) – The field name for the term-level query (e.g., ‘client_idcode’).
entered_list (List[str]) – The list of values for the term-level query.
search_string (str) – A string simulating a query string search, used for routing to the correct data generator.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the generated dummy data.
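
The routing idea can be illustrated with a small dispatcher. This is a sketch under stated assumptions, not the library's routing table: the three generator functions, their placeholder values, and the rule that 'CORE_SpO2' in search_string selects the SpO2 generator are all hypothetical, and rows are plain dictionaries rather than a DataFrame.

```python
from typing import Dict, List

Row = Dict[str, object]


def _dummy_documents(entered_list: List[str]) -> List[Row]:
    # Hypothetical stand-in for a document generator.
    return [
        {"client_idcode": c, "body_analysed": "placeholder note"}
        for c in entered_list
    ]


def _dummy_spo2(entered_list: List[str]) -> List[Row]:
    # Hypothetical stand-in for an oxygen saturation generator.
    return [
        {
            "client_idcode": c,
            "obscatalogmasteritem_displayname": "CORE_SpO2",
            "observation_valuetext_analysed": "97",
        }
        for c in entered_list
    ]


def _dummy_observations(entered_list: List[str]) -> List[Row]:
    # Hypothetical fallback for any other observation query.
    return [
        {"client_idcode": c, "obscatalogmasteritem_displayname": "generic"}
        for c in entered_list
    ]


def dummy_searcher_sketch(
    index_name: str,
    fields_list: List[str],
    term_name: str,  # kept for signature parity; unused in this sketch
    entered_list: List[str],
    search_string: str,
) -> List[Row]:
    """Route on index_name first, then on the content of search_string."""
    if index_name == "epr_documents":
        generator = _dummy_documents
    elif index_name == "observations" and "CORE_SpO2" in search_string:
        generator = _dummy_spo2
    elif index_name == "observations":
        generator = _dummy_observations
    else:
        raise ValueError(f"No dummy generator for index {index_name!r}")
    rows = generator(entered_list)
    # Restrict each row to the requested fields.
    return [{field: row.get(field) for field in fields_list} for row in rows]
```

Because the dispatcher takes the same five arguments as a real query function, test code can swap one for the other without changing call sites.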

pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline(client_idcode)[source]

Generates a random patient timeline using a GPT-2 model.

Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries.

Parameters:

client_idcode (str) – The client ID for the patient.

Return type:

str

Returns:

A string containing the patient’s dummy timeline.

pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline_faker(client_idcode)[source]

Generates a fake patient timeline using the Faker library.

Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries with fake sentences.

Parameters:

client_idcode (str) – The client ID for the patient.

Return type:

str

Returns:

A string containing the patient’s dummy timeline.

pat2vec.util.get_dummy_data_cohort_searcher.extract_search_term_obscatalogmasteritem_displayname(search_string)[source]

Extracts a search term from an ‘obscatalogmasteritem_displayname’ query.

This function uses a regular expression to find a term enclosed in parentheses following ‘obscatalogmasteritem_displayname:’. It cleans the term by removing quotes and stripping any trailing ‘AND’ or ‘OR’ clauses.

Parameters:

search_string (str) – The input query string.

Return type:

str

Returns:

The extracted search term, or the original string if no match is found.
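
The extraction steps described above (match the parenthesised term, remove quotes, drop any trailing ‘AND’ or ‘OR’ clause) might look like the following. The regular expressions and the name extract_display_name_term are assumptions for illustration; the library's actual pattern is not shown here.

```python
import re

# Capture everything between the parentheses that follow the field name.
DISPLAY_NAME_PATTERN = re.compile(
    r"obscatalogmasteritem_displayname:\s*\(([^)]*)\)"
)


def extract_display_name_term(search_string: str) -> str:
    """Return the cleaned term, or the input unchanged if nothing matches."""
    match = DISPLAY_NAME_PATTERN.search(search_string)
    if match is None:
        return search_string
    # Remove single and double quotes around or inside the term.
    term = match.group(1).replace('"', "").replace("'", "")
    # Keep only the text before the first AND/OR clause, if any.
    term = re.split(r"\s+(?:AND|OR)\s+", term, maxsplit=1)[0]
    return term.strip()


query = 'obscatalogmasteritem_displayname:("CORE_SpO2") AND updatetime:[2020-01-01 TO 2021-01-01]'
print(extract_display_name_term(query))
# prints CORE_SpO2
```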

pat2vec.util.get_dummy_data_cohort_searcher.run_generate_patient_timeline_and_append(n=10, output_path='test_files/dummy_timeline.csv')[source]

Generates and appends dummy patient timelines to a CSV file.

This function creates n dummy patient timelines and appends them to a specified CSV file. If the file doesn’t exist, it will be created.

Parameters:

n (int) – The number of patient timelines to generate. Defaults to 10.
output_path (str) – The path to the output CSV file. Defaults to “test_files/dummy_timeline.csv”.

Raises:

FileNotFoundError – If the output_path does not exist and cannot be created.
Exception – For any other unexpected errors during timeline generation or file operations.

Return type:

None
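
The append-or-create behaviour can be sketched with the standard csv module. The column names client_idcode and body_analysed and the function name append_timelines_sketch are assumptions; the layout of the library's CSV file is not documented here. Opening in append mode creates a missing file, and raises FileNotFoundError when the parent directory does not exist, which is consistent with the Raises entry above.

```python
import csv
import os
from typing import Iterable, Tuple


def append_timelines_sketch(
    timelines: Iterable[Tuple[str, str]], output_path: str
) -> None:
    """Append (client_idcode, text) pairs to a CSV file, creating it if needed."""
    # Write the header only when the file is new or empty.
    write_header = (
        not os.path.exists(output_path) or os.path.getsize(output_path) == 0
    )
    # Mode "a" creates the file if it is missing and never truncates it.
    with open(output_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["client_idcode", "body_analysed"])
        for client_idcode, text in timelines:
            writer.writerow([client_idcode, text])
```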

pat2vec.util.get_dummy_data_cohort_searcher.get_patient_timeline_dummy(client_idcode, output_path='test_files/dummy_timeline.csv')[source]

Retrieves a random patient timeline from a pre-generated CSV file.

Parameters:

client_idcode (str) – The client ID to search for (currently unused, as a random row is always selected).
output_path (str) – The path to the CSV file containing dummy timelines.

Return type:

Optional[str]

Returns:

The text of a random patient timeline, or None if the file is not found or is invalid.
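
A sketch of the lookup described above, assuming the timelines live in a CSV column named body_analysed (a hypothetical name, as is get_random_timeline_sketch). As in the documented function, client_idcode is accepted but ignored, and a missing or malformed file yields None rather than an exception.

```python
import csv
import random
from typing import Optional


def get_random_timeline_sketch(
    client_idcode: str,
    output_path: str,
    text_column: str = "body_analysed",
) -> Optional[str]:
    """Return the text of one randomly chosen row, or None on failure."""
    try:
        with open(output_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except FileNotFoundError:
        return None
    # An empty file or a file without the expected column counts as invalid.
    if not rows or text_column not in rows[0]:
        return None
    return random.choice(rows)[text_column]
```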

pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid(prefix, length=7)[source]

Generates a UUID-like string with a given prefix.

Parameters:

prefix (str) – The prefix for the UUID, must be ‘P’ or ‘V’.
length (int) – The length of the random part of the string. Defaults to 7.

Return type:

str

Returns:

The generated UUID-like string.

pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid_list(n, prefix, length=7)[source]

Generates a list of n UUID-like strings.

Parameters:

n (int) – The number of UUIDs to generate.
prefix (str) – The prefix for each UUID.
length (int) – The length of the random part of each UUID.

Return type:

List[str]

Returns:

A list of generated UUID-like strings.
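
The two UUID helpers can be sketched together. Several details are assumptions rather than documented behaviour: the random part is drawn from uppercase letters and digits, an invalid prefix raises ValueError, and the names carry a _sketch suffix to mark them as illustrative.

```python
import random
import string
from typing import List

ALLOWED_PREFIXES = ("P", "V")


def generate_uuid_sketch(prefix: str, length: int = 7) -> str:
    """Return prefix followed by `length` random alphanumeric characters."""
    if prefix not in ALLOWED_PREFIXES:
        # Assumed behaviour: the documented contract only says 'P' or 'V'.
        raise ValueError(f"prefix must be one of {ALLOWED_PREFIXES}")
    alphabet = string.ascii_uppercase + string.digits
    return prefix + "".join(random.choices(alphabet, k=length))


def generate_uuid_list_sketch(n: int, prefix: str, length: int = 7) -> List[str]:
    """Return n identifiers built with generate_uuid_sketch."""
    return [generate_uuid_sketch(prefix, length) for _ in range(n)]


print(generate_uuid_list_sketch(3, "P"))
```

With short random parts, collisions become likely for large n, so a caller that needs unique IDs may want to deduplicate the result.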

pat2vec.util.get_dummy_data_cohort_searcher.generate_hospital_site_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for hospital site observations.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy hospital site data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_bmi_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for BMI, Weight, and Height observations.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy BMI-related data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_core_o2_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for CORE_SpO2 (oxygen saturation) observations.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy SpO2 data.

pat2vec.util.get_dummy_data_cohort_searcher.generate_core_resus_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]

Generates dummy data for CORE_RESUS_STATUS observations.

Parameters:

num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.

Return type:

DataFrame

Returns:

A pandas DataFrame with generated dummy resuscitation status data.