pat2vec.util.config_pat2vec

Functions

`update_global_start_date`(self, start_date)	Updates the global start date if the provided start_date is later.
`validate_and_fix_global_dates`(config)	Ensures the global start date is before the global end date.

Classes

`config_class`([remote_dump, suffix, ...])	Initializes the configuration object for the pat2vec pipeline.

class pat2vec.util.config_pat2vec.config_class(remote_dump=False, suffix='', treatment_doc_filename='treatment_docs.csv', treatment_control_ratio_n=1, proj_name='new_project', current_path_dir='.', main_options=None, start_date=datetime.datetime(1995, 1, 1, 0, 0), years=0, months=0, days=1, batch_mode=True, store_annot=False, share_sftp=True, multi_process=False, strip_list=True, verbosity=3, random_seed_val=42, hostname=None, username=None, password=None, gpu_mem_threshold=4000, testing=False, dummy_medcat_model=False, use_controls=False, medcat=False, global_start_year=None, global_start_month=None, global_end_year=None, global_end_month=None, global_start_day=None, global_end_day=None, skip_additional_listdir=False, start_time=None, root_path=None, negate_biochem=False, patient_id_column_name='client_idcode', overwrite_stored_pat_docs=False, overwrite_stored_pat_observations=False, store_pat_batch_docs=True, store_pat_batch_observations=True, annot_filter_options=None, shuffle_pat_list=False, individual_patient_window=False, individual_patient_window_df=None, individual_patient_window_start_column_name=None, individual_patient_id_column_name=None, individual_patient_window_controls_method='full', dropna_doc_timestamps=True, time_window_interval_delta=relativedelta(days=+1), feature_engineering_arg_dict=None, split_clinical_notes=True, lookback=True, add_icd10=False, add_opc4s=False, all_epr_patient_list_path='/home/samorah/_data/gloabl_files/all_client_idcodes_epr_unique.csv', override_medcat_model_path=None, data_type_filter_dict=None, filter_split_notes=True, client_idcode_term_name='client_idcode.keyword', sanitize_pat_list=True, calculate_vectors=True, prefetch_pat_batches=False, sample_treatment_docs=0, test_data_path=None, credentials_path='../../credentials.py')[source]

Bases: object

Initializes the configuration object for the pat2vec pipeline.

Parameters:

remote_dump (bool)
suffix (str)
treatment_doc_filename (str)
treatment_control_ratio_n (int)
proj_name (str)
current_path_dir (str)
main_options (Dict[str, bool] | None)
start_date (datetime)
years (int)
months (int)
days (int)
batch_mode (bool)
store_annot (bool)
share_sftp (bool)
multi_process (bool)
strip_list (bool)
verbosity (int)
random_seed_val (int)
hostname (str | None)
username (str | None)
password (str | None)
gpu_mem_threshold (int)
testing (bool)
dummy_medcat_model (bool)
use_controls (bool)
medcat (bool)
global_start_year (int | str | None)
global_start_month (int | str | None)
global_end_year (int | str | None)
global_end_month (int | str | None)
global_start_day (int | str | None)
global_end_day (int | str | None)
skip_additional_listdir (bool)
start_time (datetime | None)
root_path (str | None)
negate_biochem (bool)
patient_id_column_name (str)
overwrite_stored_pat_docs (bool)
overwrite_stored_pat_observations (bool)
store_pat_batch_docs (bool)
store_pat_batch_observations (bool)
annot_filter_options (Dict[str, Any] | None)
shuffle_pat_list (bool)
individual_patient_window (bool)
individual_patient_window_df (DataFrame | None)
individual_patient_window_start_column_name (str | None)
individual_patient_id_column_name (str | None)
individual_patient_window_controls_method (str)
dropna_doc_timestamps (bool)
time_window_interval_delta (relativedelta)
feature_engineering_arg_dict (Dict[str, Any] | None)
split_clinical_notes (bool)
lookback (bool)
add_icd10 (bool)
add_opc4s (bool)
all_epr_patient_list_path (str)
override_medcat_model_path (str | None)
data_type_filter_dict (Dict[str, Any] | None)
filter_split_notes (bool)
client_idcode_term_name (str)
sanitize_pat_list (bool)
calculate_vectors (bool)
prefetch_pat_batches (bool)
sample_treatment_docs (int)
test_data_path (str | None)
credentials_path (str)

__init__(remote_dump=False, suffix='', treatment_doc_filename='treatment_docs.csv', treatment_control_ratio_n=1, proj_name='new_project', current_path_dir='.', main_options=None, start_date=datetime.datetime(1995, 1, 1, 0, 0), years=0, months=0, days=1, batch_mode=True, store_annot=False, share_sftp=True, multi_process=False, strip_list=True, verbosity=3, random_seed_val=42, hostname=None, username=None, password=None, gpu_mem_threshold=4000, testing=False, dummy_medcat_model=False, use_controls=False, medcat=False, global_start_year=None, global_start_month=None, global_end_year=None, global_end_month=None, global_start_day=None, global_end_day=None, skip_additional_listdir=False, start_time=None, root_path=None, negate_biochem=False, patient_id_column_name='client_idcode', overwrite_stored_pat_docs=False, overwrite_stored_pat_observations=False, store_pat_batch_docs=True, store_pat_batch_observations=True, annot_filter_options=None, shuffle_pat_list=False, individual_patient_window=False, individual_patient_window_df=None, individual_patient_window_start_column_name=None, individual_patient_id_column_name=None, individual_patient_window_controls_method='full', dropna_doc_timestamps=True, time_window_interval_delta=relativedelta(days=+1), feature_engineering_arg_dict=None, split_clinical_notes=True, lookback=True, add_icd10=False, add_opc4s=False, all_epr_patient_list_path='/home/samorah/_data/gloabl_files/all_client_idcodes_epr_unique.csv', override_medcat_model_path=None, data_type_filter_dict=None, filter_split_notes=True, client_idcode_term_name='client_idcode.keyword', sanitize_pat_list=True, calculate_vectors=True, prefetch_pat_batches=False, sample_treatment_docs=0, test_data_path=None, credentials_path='../../credentials.py')[source]

Initializes the configuration object for the pat2vec pipeline.

This class holds all configuration parameters for a pat2vec run, including file paths, time window settings, feature selection, and operational flags.

Parameters:

remote_dump (bool) – If True, data will be dumped to a remote server via SFTP.
suffix (str) – A suffix to append to output folder names.
treatment_doc_filename (str) – The filename for the input document containing the primary cohort list.
treatment_control_ratio_n (int) – The ratio of treatment to control patients.
proj_name (str) – The name of the current project, used for creating project-specific folders where patient data batches and vectors are stored.
current_path_dir (str) – The current working directory.
main_options (Optional[Dict[str, bool]]) – A dictionary of boolean flags to enable or disable specific feature extractions (e.g., ‘demo’, ‘bloods’, ‘annotations’). If None, a default dictionary is used.
start_date (datetime) – The anchor date for generating time windows. For global windows, this is the start. For individual windows, this is overridden per patient.
years (int) – The number of years in the time window duration.
months (int) – The number of months in the time window duration.
days (int) – The number of days in the time window duration.
batch_mode (bool) – Flag for batch processing mode. This is currently the only functioning mode.
store_annot (bool) – Flag to store annotations. Partially deprecated.
share_sftp (bool) – Flag for sharing SFTP connection. Partially deprecated.
multi_process (bool) – Flag for multi-process execution. Deprecated.
strip_list (bool) – If True, checks for already-completed patients before starting, so they are not processed again.
verbosity (int) – Verbosity level for logging (0-9).
random_seed_val (int) – Random seed for reproducibility.
hostname (Optional[str]) – Hostname for SFTP connection.
username (Optional[str]) – Username for SFTP connection.
password (Optional[str]) – Password for SFTP connection.
gpu_mem_threshold (int) – GPU memory threshold in MB for MedCAT.
testing (bool) – If True, enables testing mode, which may use dummy data generators.
dummy_medcat_model (bool) – If True and in testing mode, simulates a MedCAT model.
use_controls (bool) – If True, adds the desired ratio of controls at random from the global pool. This requires a master list of patients to be configured (see all_epr_patient_list_path).
medcat (bool) – Flag for MedCAT processing. If True, MedCAT will load into memory and be used for annotating.
global_start_year (Union[int, str, None]) – Global start year for the overall data extraction window.
global_start_month (Union[int, str, None]) – Global start month.
global_start_day (Union[int, str, None]) – Global start day.
global_end_year (Union[int, str, None]) – Global end year.
global_end_month (Union[int, str, None]) – Global end month.
global_end_day (Union[int, str, None]) – Global end day.
skip_additional_listdir (bool) – If True, skips some listdir calls for performance.
start_time (Optional[datetime]) – Start time for logging. Defaults to datetime.now().
root_path (Optional[str]) – The root directory for the project. If None, defaults to the current working directory.
negate_biochem (bool) – Flag for negating biochemistry features.
patient_id_column_name (str) – Column name for patient IDs in input files.
overwrite_stored_pat_docs (bool) – If True, overwrites existing stored patient documents.
overwrite_stored_pat_observations (bool) – If True, overwrites existing stored patient observations.
store_pat_batch_docs (bool) – If True, stores patient document batches.
store_pat_batch_observations (bool) – If True, stores patient observation batches.
annot_filter_options (Optional[Dict[str, Any]]) – Dictionary for filtering MedCAT annotations.
shuffle_pat_list (bool) – Flag for shuffling the patient list.
individual_patient_window (bool) – If True, uses patient-specific time windows defined in individual_patient_window_df.
individual_patient_window_df (Optional[DataFrame]) – DataFrame with patient IDs and their individual start dates. Required if individual_patient_window is True.
individual_patient_window_start_column_name (Optional[str]) – The column name for start dates in individual_patient_window_df.
individual_patient_id_column_name (Optional[str]) – The column name for patient IDs in individual_patient_window_df.
individual_patient_window_controls_method (str) – Method for handling control patients in IPW mode (‘full’ or ‘random’).
dropna_doc_timestamps (bool) – If True, drops documents with missing timestamps.
time_window_interval_delta (relativedelta) – The step/interval for each time slice vector.
feature_engineering_arg_dict (Optional[Dict[str, Any]]) – Dictionary of arguments for feature engineering.
split_clinical_notes (bool) – If True, clinical notes will be split by date and treated as individual documents with extracted dates. Requires a note splitter module.
lookback (bool) – If True, the time window is calculated backward from the start date. If False, it’s calculated forward.
add_icd10 (bool) – If True, appends ICD-10 codes to annotation batches.
add_opc4s (bool) – If True, appends OPC4S codes to annotation batches. Requires add_icd10 to be True.
all_epr_patient_list_path (str) – Path to a file containing all patient IDs, used for sampling controls.
override_medcat_model_path (Optional[str]) – Path to a MedCAT model pack to override the default.
data_type_filter_dict (Optional[Dict[str, Any]]) – Dictionary for data type filtering.
filter_split_notes (bool) – If True, the global time window filter is reapplied after clinical note splitting.
client_idcode_term_name (str) – The Elasticsearch field name for patient ID searches.
sanitize_pat_list (bool) – If True, sanitizes the patient list (e.g., to uppercase).
calculate_vectors (bool) – If True, calculates feature vectors. If False, only extracts batches.
prefetch_pat_batches (bool) – If True, fetches all raw data for all patients before processing. May use significant memory.
sample_treatment_docs (int) – Number of patients to sample from the initial cohort list. 0 means no sampling.
test_data_path (Optional[str]) – The path to the test data file, used when testing is True.
credentials_path (str) – Path to the credentials file.

Return type:

None
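
As a minimal sketch of how these parameters might be combined, the keyword arguments below describe a one-year lookback window in testing mode. Only parameter names from the signature above are used; the values (project name, dates) are illustrative assumptions, and the final call is left commented out because it needs pat2vec installed and a valid credentials file.

```python
from datetime import datetime

# Illustrative values only; every key is a documented config_class parameter.
config_kwargs = dict(
    proj_name="example_project",       # hypothetical project name
    start_date=datetime(2020, 1, 1),   # anchor date for the time window
    years=1, months=0, days=0,         # window duration: one year
    lookback=True,                     # window runs backward from start_date
    global_start_year=2019, global_start_month=1, global_start_day=1,
    global_end_year=2020, global_end_month=1, global_end_day=1,
    medcat=False,                      # do not load MedCAT into memory
    use_controls=False,                # no control sampling
    testing=True,                      # testing mode, may use dummy data
    verbosity=3,
)

# With pat2vec installed, the object would then be built as:
# from pat2vec.util.config_pat2vec import config_class
# config = config_class(**config_kwargs)
```

Keeping the arguments in a dictionary makes it easy to log or reuse the exact settings of a run; passing them directly as keyword arguments works equally well.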

prefetch_pat_batches: If True, fetches all raw data for all patients before processing. May use significant memory.

calculate_vectors: If True, calculates feature vectors. If False, only extracts batches.

skip_additional_listdir: If True, skips some listdir calls for performance.

filter_split_notes: If enabled (True), the global time window filter will be reapplied after clinical note splitting.

credentials_path: Path to the credentials file.

suffix: A suffix to append to output folder names.

treatment_doc_filename: The filename for the input document containing the primary cohort list.

treatment_control_ratio_n: The ratio of treatment to control patients.

pre_report_annotation_batch_path_report: Path to the report annotation batches directory.

pre_report_batch_path: Path to the report batches directory.

store_pat_batch_docs: If True, stores patient document batches.

store_pat_batch_observations: If True, stores patient observation batches.

proj_name: The name of the current project, used for creating project-specific folders.

negate_biochem: Flag for negating biochemistry features.

patient_id_column_name: Column name for patient IDs in input files.

add_icd10: If True, appends ICD-10 codes to annotation batches.

add_opc4s: If True, appends OPC4S codes to annotation batches. Requires add_icd10 to be True.

data_type_filter_dict: Dictionary for data type filtering.

batch_mode: Flag for batch processing mode.

remote_dump: If True, data will be dumped to a remote server via SFTP.

store_annot: Flag to store annotations. Partially deprecated.

share_sftp: Flag for sharing SFTP connection. Partially deprecated.

multi_process: Flag for multi-process execution. Deprecated.

strip_list: If True, checks for completed patients before starting to avoid redundancy.

verbosity: Verbosity level for logging (0-9).

random_seed_val: Random seed for reproducibility.

hostname: Hostname for SFTP connection.

username: Username for SFTP connection.

password: Password for SFTP connection.

gpu_mem_threshold: GPU memory threshold in MB for MedCAT.

testing: If True, enables testing mode, which may use dummy data generators.

use_controls: If True, adds the desired ratio of controls at random from the global pool.

skipped_counter: Counter for skipped items.

medcat: Flag for MedCAT processing. If True, MedCAT will be used for annotating.

overwrite_stored_pat_docs: If True, overwrites existing stored patient documents.

overwrite_stored_pat_observations: If True, overwrites existing stored patient observations.

annot_filter_options: Dictionary for filtering MedCAT annotations.

shuffle_pat_list: Flag for shuffling the patient list.

individual_patient_window: If True, uses patient-specific time windows.

individual_patient_window_start_column_name: The column name for start dates in individual_patient_window_df.

individual_patient_id_column_name: The column name for patient IDs in individual_patient_window_df.

individual_patient_window_controls_method: Method for handling control patients in IPW mode (‘full’ or ‘random’).

control_list_path: Path to the control list pickle file.

dropna_doc_timestamps: If True, drops documents with missing timestamps.

time_window_interval_delta: The step/interval for each time slice vector.

split_clinical_notes: If True, clinical notes will be split by date.

all_epr_patient_list_path: Path to a file containing all patient IDs, used for sampling controls.

lookback: If True, the time window is calculated backward from the start date.

override_medcat_model_path: Path to a MedCAT model pack to override the default.

start_time: Start time for logging.

drug_time_field: The time field to use for drug orders.

diagnostic_time_field: The time field to use for diagnostic orders.

appointments_time_field: The time field to use for appointments.

bloods_time_field: The time field to use for bloods.

client_idcode_term_name: The Elasticsearch field name for patient ID searches.

main_options: A dictionary of boolean flags to enable or disable specific feature extractions.

feature_engineering_arg_dict: Dictionary of arguments for feature engineering.

negated_presence_annotations: Flag for handling negated presence annotations.

pre_document_annotation_batch_path: Path to the document annotation batches directory.

pre_document_annotation_batch_path_mct: Path to the MCT document annotation batches directory.

pre_textual_obs_annotation_batch_path: Path to the textual observation annotation batches directory.

pre_textual_obs_document_batch_path: Path to the textual observation document batches directory.

pre_document_batch_path: Path to the document batches directory.

pre_document_batch_path_mct: Path to the MCT document batches directory.

pre_document_batch_path_reports: Path to the report document batches directory.

pre_bloods_batch_path: Path to the bloods batches directory.

pre_drugs_batch_path: Path to the drugs batches directory.

pre_diagnostics_batch_path: Path to the diagnostics batches directory.

pre_news_batch_path: Path to the NEWS batches directory.

pre_obs_batch_path: Path to the observations batches directory.

pre_bmi_batch_path: Path to the BMI batches directory.

pre_demo_batch_path: Path to the demographics batches directory.

pre_misc_batch_path: Path to the miscellaneous batches directory.

pre_appointments_batch_path: Path to the appointments batches directory.

pre_merged_input_batches_path: Path to the merged input batches directory.

output_folder: The name of the output folder.

PathsClass_instance: An instance of the PathsClass for managing directory paths.

start_date: The anchor date for generating time windows.

years: The number of years in the time window duration.

months: The number of months in the time window duration.

days: The number of days in the time window duration.

time_delta: The total time delta for the window.

slow_execution_threshold_low: Threshold for low slow execution warning.

slow_execution_threshold_high: Threshold for high slow execution warning.

slow_execution_threshold_extreme: Threshold for extreme slow execution warning.

sample_treatment_docs: Number of patients to sample from the initial cohort list. 0 means no sampling.

test_data_path: The path to the test data file, used when testing is True.

root_path: The root directory for the project.

pre_annotation_path: Path to the pre-annotation parts directory on the remote server.

pre_annotation_path_mrc: Path to the MRC pre-annotation parts directory on the remote server.

current_pat_line_path: Path to the patient line directory on the remote server.

current_pat_lines_path: Path to the patient lines parts directory.

sftp_client: SFTP client for remote file operations.

individual_patient_window_df: DataFrame with patient IDs and their individual start dates.

n_pat_lines: Number of patient lines, dynamic for IPW.

date_list: List of datetime objects (List[datetime]) used for time window generation.
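
The relationship between start_date, the window duration, lookback, and time_window_interval_delta can be illustrated with a standard-library sketch. This is not the pat2vec implementation: the helper name build_date_list is hypothetical, datetime.timedelta stands in for relativedelta (so only day-based durations are modelled), and excluding the end boundary is an assumption about how date_list is built.

```python
from datetime import datetime, timedelta
from typing import List


def build_date_list(start_date: datetime, window: timedelta,
                    step: timedelta, lookback: bool) -> List[datetime]:
    """Hypothetical helper: list the start of each time slice in the window."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    # With lookback the window ends at start_date; otherwise it begins there.
    first = start_date - window if lookback else start_date
    last = first + window
    dates = []
    current = first
    while current < last:  # end boundary excluded (assumption)
        dates.append(current)
        current += step
    return dates


anchor = datetime(2020, 1, 10)
# Three-day window, one-day step, looking forward from the anchor.
forward = build_date_list(anchor, timedelta(days=3), timedelta(days=1), lookback=False)
# Same window and step, looking backward from the anchor.
backward = build_date_list(anchor, timedelta(days=3), timedelta(days=1), lookback=True)
```

Here forward covers 10, 11 and 12 January 2020, while backward covers 7, 8 and 9 January 2020, which shows how lookback shifts the window relative to the anchor date.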

pat2vec.util.config_pat2vec.update_global_start_date(self, start_date)[source]

Updates the global start date if the provided start_date is later.

This logic only applies when looking forward (lookback=False).

Parameters:

self (TypeVar(T_config, bound= config_class)) – The configuration object instance.
start_date (datetime) – The new start date to compare against the global start date.

Return type:

TypeVar(T_config, bound= config_class)

Returns:

The configuration object instance.
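
The behaviour described above can be sketched without pat2vec as follows. The function update_start_if_later and the SimpleNamespace stand-in are hypothetical; the sketch assumes the three global start fields are all set, and that the updated values are stored as integers, neither of which is confirmed by this page.

```python
from datetime import datetime
from types import SimpleNamespace


def update_start_if_later(config, start_date: datetime):
    """Hypothetical stand-in: move the global start forward if start_date is later."""
    if config.lookback:
        # The documented logic only applies when looking forward.
        return config
    # The global_* fields may be int or str, so normalise with int().
    current = datetime(int(config.global_start_year),
                       int(config.global_start_month),
                       int(config.global_start_day))
    if start_date > current:
        config.global_start_year = start_date.year
        config.global_start_month = start_date.month
        config.global_start_day = start_date.day
    return config


forward_cfg = SimpleNamespace(lookback=False, global_start_year="2018",
                              global_start_month=1, global_start_day=1)
forward_cfg = update_start_if_later(forward_cfg, datetime(2019, 6, 15))

lookback_cfg = SimpleNamespace(lookback=True, global_start_year=2018,
                               global_start_month=1, global_start_day=1)
lookback_cfg = update_start_if_later(lookback_cfg, datetime(2019, 6, 15))
```

In this sketch forward_cfg ends up with a global start of 15 June 2019, while lookback_cfg is returned unchanged.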

pat2vec.util.config_pat2vec.validate_and_fix_global_dates(config)[source]

Ensures the global start date is before the global end date.

If the start date is after the end date, it swaps them to ensure compatibility with Elasticsearch range queries and warns the user.

Parameters:

config (TypeVar(T_config, bound= config_class)) – The configuration object instance.

Return type:

TypeVar(T_config, bound= config_class)

Returns:

The modified configuration object.
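
A rough illustration of the swap-and-warn behaviour is given below. It does not reproduce the library code: swap_if_reversed and the SimpleNamespace stand-in are hypothetical, the warning text is invented for the example, and the sketch assumes all six global date fields are set.

```python
import warnings
from datetime import datetime
from types import SimpleNamespace


def swap_if_reversed(config):
    """Hypothetical stand-in: swap global start and end dates if reversed."""
    start = datetime(int(config.global_start_year),
                     int(config.global_start_month),
                     int(config.global_start_day))
    end = datetime(int(config.global_end_year),
                   int(config.global_end_month),
                   int(config.global_end_day))
    if start > end:
        warnings.warn("Global start date is after global end date; swapping them.")
        # Write the earlier date to the start fields and the later to the end fields.
        config.global_start_year, config.global_end_year = end.year, start.year
        config.global_start_month, config.global_end_month = end.month, start.month
        config.global_start_day, config.global_end_day = end.day, start.day
    return config


cfg = SimpleNamespace(global_start_year=2021, global_start_month=3, global_start_day=1,
                      global_end_year=2019, global_end_month=7, global_end_day=31)

# Capture the warning so the example can show that exactly one is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cfg = swap_if_reversed(cfg)
```

After the call, cfg spans 31 July 2019 to 1 March 2021, and caught holds the single warning raised by the swap.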