pat2vec.util.get_dummy_data_cohort_searcher
Functions
cohort_searcher_with_terms_and_search_dummy – Generates dummy data based on simulated Elasticsearch query parameters.
create_random_date_from_globals – Generates a random datetime within a given month-level range.
extract_date_range – Extracts a date range from a string.
extract_search_term_obscatalogmasteritem_displayname – Extracts a search term from an 'obscatalogmasteritem_displayname' query.
generate_appointments_data – Generates dummy data for the 'pims_apps' index.
generate_basic_observations_data – Generates dummy data for the 'basic_observations' index.
generate_basic_observations_textual_obs_data – Generates dummy textual data for the 'basic_observations' index.
generate_bmi_data – Generates dummy data for BMI, Weight, and Height observations.
generate_core_o2_data – Generates dummy data for CORE_SpO2 (oxygen saturation) observations.
generate_core_resus_data – Generates dummy data for CORE_RESUS_STATUS observations.
generate_diagnostic_orders_data – Generates dummy data for the 'diagnostic_orders' index.
generate_drug_orders_data – Generates dummy data for the 'drug_orders' index.
generate_epr_documents_data – Generates dummy data for the 'epr_documents' index.
generate_epr_documents_personal_data – Generates dummy personal data for the 'epr_documents' index.
generate_hospital_site_data – Generates dummy data for hospital site observations.
generate_observations_MRC_text_data – Generates dummy MRC text data for the 'observations' index.
generate_observations_Reports_text_data – Generates dummy report text data for the 'basic_observations' index.
generate_observations_data – Generates dummy data for the 'observations' index.
generate_patient_timeline – Generates a random patient timeline using a GPT-2 model.
generate_patient_timeline_faker – Generates a fake patient timeline using the Faker library.
generate_uuid – Generates a UUID-like string with a given prefix.
generate_uuid_list – Generates a list of n UUID-like strings.
get_patient_timeline_dummy – Retrieves a random patient timeline from a pre-generated CSV file.
maybe_nan – Returns a value or NaN based on a probability.
run_generate_patient_timeline_and_append – Generates and appends dummy patient timelines to a CSV file.
- pat2vec.util.get_dummy_data_cohort_searcher.maybe_nan(value, probability=0.2)[source]
Returns a value or NaN based on a probability.
- Parameters:
value (Any) – The value to potentially return.
probability (float) – The probability of returning np.nan instead of the value. Defaults to 0.2.
- Return type:
Union[Any, float]
- Returns:
The original value or np.nan.
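The behaviour described for maybe_nan can be illustrated with a minimal standard-library sketch. The name maybe_nan_sketch is hypothetical, and float("nan") stands in for np.nan; the library's actual implementation may differ.

```python
import math
import random


def maybe_nan_sketch(value, probability=0.2):
    """Return NaN with the given probability, otherwise return value unchanged."""
    # random.random() is uniform on [0.0, 1.0), so this comparison is true
    # with probability `probability`.
    if random.random() < probability:
        return float("nan")  # stand-in for np.nan, which is also a float NaN
    return value


# With probability 0.0 the value always survives; with 1.0 it never does.
kept = maybe_nan_sketch("reading", probability=0.0)
dropped = maybe_nan_sketch("reading", probability=1.0)
```

Because NaN never compares equal to itself, check the result with math.isnan rather than ==.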
- pat2vec.util.get_dummy_data_cohort_searcher.create_random_date_from_globals(start_year, start_month, end_year, end_month)[source]
Generates a random datetime within a given month-level range.
- Parameters:
start_year (int) – The starting year.
start_month (int) – The starting month.
end_year (int) – The ending year.
end_month (int) – The ending month.
- Return type:
datetime
- Returns:
A random datetime object within the specified range.
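One way to draw a datetime from a month-level range is sketched below. It assumes the range runs from the first instant of the start month to the last second of the end month; the documentation above does not state the exact boundaries, and random_datetime_in_month_range is a hypothetical name.

```python
import calendar
import random
from datetime import datetime, timedelta


def random_datetime_in_month_range(start_year, start_month, end_year, end_month):
    """Pick a random datetime between two month boundaries (illustrative sketch)."""
    # Assumption: inclusive range from the 1st of the start month at 00:00:00
    # to the last day of the end month at 23:59:59.
    start = datetime(start_year, start_month, 1)
    last_day = calendar.monthrange(end_year, end_month)[1]
    end = datetime(end_year, end_month, last_day, 23, 59, 59)
    if end < start:
        raise ValueError("end of range precedes start of range")
    span_seconds = int((end - start).total_seconds())
    # Offset the start by a whole number of seconds chosen uniformly.
    return start + timedelta(seconds=random.randint(0, span_seconds))


sample = random_datetime_in_month_range(2020, 1, 2020, 3)
```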
- pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=True, fields_list=['client_idcode', 'document_guid', 'document_description', 'body_analysed', 'updatetime', 'clientvisit_visitidcode'])[source]
Generates dummy data for the ‘epr_documents’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the document body.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy EPR document data.
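The generator functions in this module share one parameter pattern: num_rows rows per client ID, dates drawn from a month-level range, and fields_list selecting the output columns. The sketch below illustrates that pattern with plain dictionaries instead of a DataFrame. The function name dummy_rows, the placeholder values, and the choice to cap days at 28 are all assumptions made for illustration.

```python
import random
from datetime import datetime


def dummy_rows(num_rows, entered_list, start_year, start_month,
               end_year, end_month, fields_list):
    """Build num_rows dummy records per client, restricted to fields_list."""
    # Count months from year 0 so a single randint covers the whole range.
    first_month = start_year * 12 + (start_month - 1)
    last_month = end_year * 12 + (end_month - 1)
    rows = []
    for client_id in entered_list:
        for _ in range(num_rows):
            year, month0 = divmod(random.randint(first_month, last_month), 12)
            record = {
                "client_idcode": client_id,
                "document_description": "dummy document",  # placeholder text
                # Days are capped at 28 so every month is valid (assumption).
                "updatetime": datetime(year, month0 + 1, random.randint(1, 28)),
            }
            # Keep only the requested columns, in the requested order.
            rows.append({name: record[name] for name in fields_list if name in record})
    return rows


rows = dummy_rows(2, ["P0000001", "P0000002"], 2020, 1, 2020, 12,
                  ["client_idcode", "updatetime"])
```

A list of dictionaries like this can be passed to pandas.DataFrame to obtain a table with one column per field.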
- pat2vec.util.get_dummy_data_cohort_searcher.generate_epr_documents_personal_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'client_firstname', 'client_lastname', 'client_dob', 'client_gendercode', 'client_racecode', 'client_deceaseddtm', 'updatetime'])[source]
Generates dummy personal data for the ‘epr_documents’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy personal data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_diagnostic_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]
Generates dummy data for the ‘diagnostic_orders’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy diagnostic order data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_drug_orders_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['order_guid', 'client_idcode', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'order_createdwhen', 'clientvisit_visitidcode', '_id', '_index', '_score', 'order_performeddtm'])[source]
Generates dummy data for the ‘drug_orders’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy drug order data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_MRC_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score', 'textualObs'])[source]
Generates dummy MRC text data for the ‘observations’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the document body.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy observation data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_Reports_text_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, use_GPT=False, fields_list=['basicobs_guid', 'client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_analysed', 'textualObs', 'updatetime', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]
Generates dummy report text data for the ‘basic_observations’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
use_GPT (bool) – If True, uses a text generation model for the document body.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy report data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_appointments_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'Popular', 'AppointmentType', 'AttendanceReference', 'ClinicCode', 'ClinicDesc', 'Consultant', 'DateModified', 'DNA', 'HospitalID', 'PatNHSNo', 'Specialty', '_id', '_index', '_score', 'AppointmentDateTime', 'Attended', 'CancDesc', 'CancRefNo', 'ConsultantCode', 'DateCreated', 'Ethnicity', 'Gender', 'NHSNoStatusCode', 'NotSpec', 'PatDateOfBirth', 'PatForename', 'PatPostCode', 'PatSurname', 'PiMsPatRefNo', 'Primarykeyfieldname', 'Primarykeyfieldvalue', 'SessionCode', 'SpecialtyCode'])[source]
Generates dummy data for the ‘pims_apps’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy appointment data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, search_term, use_GPT=False, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode', '_id', '_index', '_score'])[source]
Generates dummy data for the ‘observations’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
search_term (str) – The search term to use for the display name.
use_GPT (bool) – If True, uses a text generation model for the document body.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy observation data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'order_guid', 'order_name', 'order_summaryline', 'order_holdreasontext', 'order_entered', 'clientvisit_visitidcode', 'updatetime'])[source]
Generates dummy data for the ‘basic_observations’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy basic observation data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_basic_observations_textual_obs_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['client_idcode', 'basicobs_itemname_analysed', 'basicobs_value_numeric', 'basicobs_entered', 'clientvisit_serviceguid', '_id', '_index', '_score', 'basicobs_guid', 'clientvisit_serviceguid', 'updatetime', 'textualObs'])[source]
Generates dummy textual data for the ‘basic_observations’ index.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy textual observation data.
- pat2vec.util.get_dummy_data_cohort_searcher.extract_date_range(date_string)[source]
Extracts a date range from a string.
The expected format is “YYYY-MM-DD TO YYYY-MM-DD”.
- Parameters:
date_string (str) – The string containing the date range.
- Return type:
Optional[Tuple[int, int, int, int, int, int]]
- Returns:
A tuple of six integers (start_year, start_month, start_day, end_year, end_month, end_day), or None if the pattern is not found.
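Parsing the "YYYY-MM-DD TO YYYY-MM-DD" format can be sketched with a single regular expression. The pattern and the name extract_date_range_sketch are illustrative assumptions, as is the bracketed query fragment used in the example; the module's own pattern may be stricter or looser.

```python
import re
from typing import Optional, Tuple

# Six capture groups: start year/month/day, then end year/month/day.
_DATE_RANGE = re.compile(r"(\d{4})-(\d{2})-(\d{2}) TO (\d{4})-(\d{2})-(\d{2})")


def extract_date_range_sketch(
    date_string: str,
) -> Optional[Tuple[int, int, int, int, int, int]]:
    """Return the six date components as integers, or None if absent."""
    match = _DATE_RANGE.search(date_string)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# Hypothetical query fragment containing a date range.
parsed = extract_date_range_sketch("updatetime:[2020-01-15 TO 2021-06-30]")
```

The first, second, fourth, and fifth elements of the tuple line up with the global_start_year, global_start_month, global_end_year, and global_end_month parameters of the generator functions.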
- pat2vec.util.get_dummy_data_cohort_searcher.cohort_searcher_with_terms_and_search_dummy(index_name, fields_list, term_name, entered_list, search_string)[source]
Generates dummy data based on simulated Elasticsearch query parameters.
This function acts as a stand-in for a real CogStack/Elasticsearch query, routing requests to different dummy data generator functions based on the index_name and search_string.
- Parameters:
index_name (str) – The name of the target index (e.g., ‘epr_documents’).
fields_list (List[str]) – A list of fields to be returned in the DataFrame.
term_name (str) – The field name for the term-level query (e.g., ‘client_idcode’).
entered_list (List[str]) – The list of values for the term-level query.
search_string (str) – A string simulating a query string search, used for routing to the correct data generator.
- Return type:
DataFrame
- Returns:
A pandas DataFrame containing the generated dummy data.
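The routing idea can be sketched as a lookup table from index name to generator. Everything below is hypothetical: the stub generators, the routing table, and the choice to raise ValueError for an unknown index are assumptions, and the sketch routes on index_name only, whereas the real function also inspects search_string.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def _fake_epr_documents(ids: List[str]) -> List[Row]:
    # Hypothetical stand-in generator: one row per client ID.
    return [{"client_idcode": i, "body_analysed": "dummy text"} for i in ids]


def _fake_drug_orders(ids: List[str]) -> List[Row]:
    # Hypothetical stand-in generator: one row per client ID.
    return [{"client_idcode": i, "order_name": "dummy drug"} for i in ids]


# Assumed routing table keyed on index name.
_ROUTES: Dict[str, Callable[[List[str]], List[Row]]] = {
    "epr_documents": _fake_epr_documents,
    "drug_orders": _fake_drug_orders,
}


def dummy_searcher_sketch(index_name: str, fields_list: List[str],
                          entered_list: List[str]) -> List[Row]:
    """Dispatch to a stub generator, then keep only the requested fields."""
    generator = _ROUTES.get(index_name)
    if generator is None:
        raise ValueError(f"no dummy generator for index {index_name!r}")
    rows = generator(entered_list)
    return [{k: row[k] for k in fields_list if k in row} for row in rows]


result = dummy_searcher_sketch("drug_orders", ["client_idcode"], ["P1", "P2"])
```

A table-driven dispatcher keeps each generator independent, so supporting another index means adding one entry rather than another branch.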
- pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline(client_idcode)[source]
Generates a random patient timeline using a GPT-2 model.
Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries.
- Parameters:
client_idcode (str) – The client ID for the patient.
- Return type:
str
- Returns:
A string containing the patient’s dummy timeline.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_patient_timeline_faker(client_idcode)[source]
Generates a fake patient timeline using the Faker library.
Creates a short, semi-realistic clinical note timeline for a patient, including demographic information and a series of timestamped entries with fake sentences.
- Parameters:
client_idcode (str) – The client ID for the patient.
- Return type:
str
- Returns:
A string containing the patient’s dummy timeline.
- pat2vec.util.get_dummy_data_cohort_searcher.extract_search_term_obscatalogmasteritem_displayname(search_string)[source]
Extracts a search term from an ‘obscatalogmasteritem_displayname’ query.
This function uses a regular expression to find a term enclosed in parentheses following ‘obscatalogmasteritem_displayname:’. It cleans the term by removing quotes and stripping any trailing ‘AND’ or ‘OR’ clauses.
- Parameters:
search_string (str) – The input query string.
- Return type:
str
- Returns:
The extracted search term, or the original string if no match is found.
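The extraction steps described above (match the parenthesised term, drop quotes, strip a trailing AND/OR clause, fall back to the input) can be sketched as follows. The regular expressions and the name extract_term_sketch are assumptions; the module's actual patterns are not shown in this reference.

```python
import re

# Capture everything between the parentheses after the field name.
_TERM = re.compile(r"obscatalogmasteritem_displayname:\(([^)]*)\)")
# A trailing boolean clause: whitespace, AND/OR, and the rest of the term.
_TRAILING_CLAUSE = re.compile(r"\s+(?:AND|OR)\b.*$")


def extract_term_sketch(search_string: str) -> str:
    """Return the cleaned search term, or the input unchanged if no match."""
    match = _TERM.search(search_string)
    if match is None:
        return search_string
    term = match.group(1).replace('"', "").replace("'", "")
    return _TRAILING_CLAUSE.sub("", term).strip()


single = extract_term_sketch(
    'obscatalogmasteritem_displayname:("CORE_SpO2") AND updatetime:[2020-01-01 TO 2021-01-01]'
)
first_of_two = extract_term_sketch('obscatalogmasteritem_displayname:("BMI" OR "Weight")')
```

In this sketch a term with several alternatives reduces to its first alternative, which is one reading of "stripping any trailing AND or OR clauses".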
- pat2vec.util.get_dummy_data_cohort_searcher.run_generate_patient_timeline_and_append(n=10, output_path='test_files/dummy_timeline.csv')[source]
Generates and appends dummy patient timelines to a CSV file.
This function creates n dummy patient timelines and appends them to a specified CSV file. If the file doesn’t exist, it will be created.
- Parameters:
n (int) – The number of patient timelines to generate. Defaults to 10.
output_path (str) – The path to the output CSV file. Defaults to “test_files/dummy_timeline.csv”.
- Raises:
FileNotFoundError – If the output_path does not exist and cannot be created.
Exception – For any other unexpected errors during timeline generation or file operations.
- Return type:
None
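The append-or-create behaviour can be sketched with the standard csv module. The column names, the function name append_timelines, and the decision to create missing parent directories are assumptions for illustration; the real function generates the timeline text itself rather than receiving it.

```python
import csv
import os
import tempfile


def append_timelines(timelines, output_path):
    """Append rows to a CSV file, writing the header only when the file is new."""
    directory = os.path.dirname(output_path)
    if directory:
        os.makedirs(directory, exist_ok=True)  # assumed: create missing folders
    is_new = not os.path.exists(output_path)
    with open(output_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["client_idcode", "body_analysed"])  # assumed columns
        writer.writerows(timelines)


# Two appends to the same path: the second call must not repeat the header.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_files", "dummy_timeline.csv")
    append_timelines([("P0000001", "first note")], path)
    append_timelines([("P0000002", "second note")], path)
    with open(path, newline="", encoding="utf-8") as handle:
        written = list(csv.reader(handle))
```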
- pat2vec.util.get_dummy_data_cohort_searcher.get_patient_timeline_dummy(client_idcode, output_path='test_files/dummy_timeline.csv')[source]
Retrieves a random patient timeline from a pre-generated CSV file.
- Parameters:
client_idcode (str) – The client ID to search for (currently unused, as a random row is always selected).
output_path (str) – The path to the CSV file containing dummy timelines.
- Return type:
Optional[str]
- Returns:
The text of a random patient timeline, or None if the file is not found or is invalid.
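Reading back a random timeline, with None as the fallback for a missing or unusable file, might look like the sketch below. The column name body_analysed and the function name random_timeline are assumptions; the reference above does not name the CSV columns.

```python
import csv
import os
import random
import tempfile
from typing import Optional


def random_timeline(output_path: str, text_column: str = "body_analysed") -> Optional[str]:
    """Return the text of one randomly chosen row, or None if unavailable."""
    try:
        with open(output_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except FileNotFoundError:
        return None
    # Treat an empty file or a missing text column as invalid.
    if not rows or text_column not in rows[0]:
        return None
    return random.choice(rows)[text_column]


with tempfile.TemporaryDirectory() as tmp:
    missing = random_timeline(os.path.join(tmp, "absent.csv"))
    path = os.path.join(tmp, "dummy_timeline.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["client_idcode", "body_analysed"])
        writer.writerow(["P0000001", "only note"])
    picked = random_timeline(path)
```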
- pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid(prefix, length=7)[source]
Generates a UUID-like string with a given prefix.
- Parameters:
prefix (str) – The prefix for the UUID, must be ‘P’ or ‘V’.
length (int) – The length of the random part of the string. Defaults to 7.
- Return type:
str
- Returns:
The generated UUID-like string.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_uuid_list(n, prefix, length=7)[source]
Generates a list of n UUID-like strings.
- Parameters:
n (int) – The number of UUIDs to generate.
prefix (str) – The prefix for each UUID.
length (int) – The length of the random part of each UUID.
- Return type:
List[str]
- Returns:
A list of generated UUID-like strings.
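A prefixed identifier generator and its list variant can be sketched as below. The character set (uppercase letters and digits) and the ValueError for an invalid prefix are assumptions, since the reference only says the prefix must be 'P' or 'V'; the names uuid_like and uuid_like_list are hypothetical.

```python
import random
import string
from typing import List


def uuid_like(prefix: str, length: int = 7) -> str:
    """Return prefix followed by `length` random characters."""
    if prefix not in ("P", "V"):
        raise ValueError("prefix must be 'P' or 'V'")  # assumed error handling
    alphabet = string.ascii_uppercase + string.digits  # assumed character set
    return prefix + "".join(random.choices(alphabet, k=length))


def uuid_like_list(n: int, prefix: str, length: int = 7) -> List[str]:
    """Return n identifiers that share the same prefix and length."""
    return [uuid_like(prefix, length) for _ in range(n)]


patient_ids = uuid_like_list(3, "P")
visit_id = uuid_like("V", length=4)
```

A list produced this way has the shape expected by the entered_list parameter of the generator functions. Uniqueness is not guaranteed, since each identifier is drawn independently.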
- pat2vec.util.get_dummy_data_cohort_searcher.generate_hospital_site_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]
Generates dummy data for hospital site observations.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy hospital site data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_bmi_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]
Generates dummy data for BMI, Weight, and Height observations.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy BMI-related data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_core_o2_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]
Generates dummy data for CORE_SpO2 (oxygen saturation) observations.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy SpO2 data.
- pat2vec.util.get_dummy_data_cohort_searcher.generate_core_resus_data(num_rows, entered_list, global_start_year, global_start_month, global_end_year, global_end_month, fields_list=['observation_guid', 'client_idcode', 'obscatalogmasteritem_displayname', 'observation_valuetext_analysed', 'observationdocument_recordeddtm', 'clientvisit_visitidcode'])[source]
Generates dummy data for CORE_RESUS_STATUS observations.
- Parameters:
num_rows (int) – Number of rows to generate for each client.
entered_list (List[str]) – List of client IDs to generate data for.
global_start_year (int) – Start year for the random date range.
global_start_month (int) – Start month for the random date range.
global_end_year (int) – End year for the random date range.
global_end_month (int) – End month for the random date range.
fields_list (List[str]) – List of columns to include in the DataFrame.
- Return type:
DataFrame
- Returns:
A pandas DataFrame with generated dummy resuscitation status data.