pat2vec.util.config_pat2vec
Functions
update_global_start_date | Updates the global start date if the provided start_date is later.
validate_and_fix_global_dates | Ensures global start date is before the global end date.
Classes
config_class | Initializes the configuration object for the pat2vec pipeline.
- class pat2vec.util.config_pat2vec.config_class(remote_dump=False, suffix='', treatment_doc_filename='treatment_docs.csv', treatment_control_ratio_n=1, proj_name='new_project', current_path_dir='.', main_options=None, start_date=datetime.datetime(1995, 1, 1, 0, 0), years=0, months=0, days=1, batch_mode=True, store_annot=False, share_sftp=True, multi_process=False, strip_list=True, verbosity=3, random_seed_val=42, hostname=None, username=None, password=None, gpu_mem_threshold=4000, testing=False, dummy_medcat_model=False, use_controls=False, medcat=False, global_start_year=None, global_start_month=None, global_end_year=None, global_end_month=None, global_start_day=None, global_end_day=None, skip_additional_listdir=False, start_time=None, root_path=None, negate_biochem=False, patient_id_column_name='client_idcode', overwrite_stored_pat_docs=False, overwrite_stored_pat_observations=False, store_pat_batch_docs=True, store_pat_batch_observations=True, annot_filter_options=None, shuffle_pat_list=False, individual_patient_window=False, individual_patient_window_df=None, individual_patient_window_start_column_name=None, individual_patient_id_column_name=None, individual_patient_window_controls_method='full', dropna_doc_timestamps=True, time_window_interval_delta=relativedelta(days=+1), feature_engineering_arg_dict=None, split_clinical_notes=True, lookback=True, add_icd10=False, add_opc4s=False, all_epr_patient_list_path='/home/samorah/_data/gloabl_files/all_client_idcodes_epr_unique.csv', override_medcat_model_path=None, data_type_filter_dict=None, filter_split_notes=True, client_idcode_term_name='client_idcode.keyword', sanitize_pat_list=True, calculate_vectors=True, prefetch_pat_batches=False, sample_treatment_docs=0, test_data_path=None, credentials_path='../../credentials.py')
Bases: object
Initializes the configuration object for the pat2vec pipeline.
- Parameters:
remote_dump (bool)
suffix (str)
treatment_doc_filename (str)
treatment_control_ratio_n (int)
proj_name (str)
current_path_dir (str)
main_options (Dict[str, bool] | None)
start_date (datetime)
years (int)
months (int)
days (int)
batch_mode (bool)
store_annot (bool)
share_sftp (bool)
multi_process (bool)
strip_list (bool)
verbosity (int)
random_seed_val (int)
hostname (str | None)
username (str | None)
password (str | None)
gpu_mem_threshold (int)
testing (bool)
dummy_medcat_model (bool)
use_controls (bool)
medcat (bool)
global_start_year (int | str | None)
global_start_month (int | str | None)
global_end_year (int | str | None)
global_end_month (int | str | None)
global_start_day (int | str | None)
global_end_day (int | str | None)
skip_additional_listdir (bool)
start_time (datetime | None)
root_path (str | None)
negate_biochem (bool)
patient_id_column_name (str)
overwrite_stored_pat_docs (bool)
overwrite_stored_pat_observations (bool)
store_pat_batch_docs (bool)
store_pat_batch_observations (bool)
annot_filter_options (Dict[str, Any] | None)
shuffle_pat_list (bool)
individual_patient_window (bool)
individual_patient_window_df (DataFrame | None)
individual_patient_window_start_column_name (str | None)
individual_patient_id_column_name (str | None)
individual_patient_window_controls_method (str)
dropna_doc_timestamps (bool)
time_window_interval_delta (relativedelta)
feature_engineering_arg_dict (Dict[str, Any] | None)
split_clinical_notes (bool)
lookback (bool)
add_icd10 (bool)
add_opc4s (bool)
all_epr_patient_list_path (str)
override_medcat_model_path (str | None)
data_type_filter_dict (Dict[str, Any] | None)
filter_split_notes (bool)
client_idcode_term_name (str)
sanitize_pat_list (bool)
calculate_vectors (bool)
prefetch_pat_batches (bool)
sample_treatment_docs (int)
test_data_path (str | None)
credentials_path (str)
- __init__(remote_dump=False, suffix='', treatment_doc_filename='treatment_docs.csv', treatment_control_ratio_n=1, proj_name='new_project', current_path_dir='.', main_options=None, start_date=datetime.datetime(1995, 1, 1, 0, 0), years=0, months=0, days=1, batch_mode=True, store_annot=False, share_sftp=True, multi_process=False, strip_list=True, verbosity=3, random_seed_val=42, hostname=None, username=None, password=None, gpu_mem_threshold=4000, testing=False, dummy_medcat_model=False, use_controls=False, medcat=False, global_start_year=None, global_start_month=None, global_end_year=None, global_end_month=None, global_start_day=None, global_end_day=None, skip_additional_listdir=False, start_time=None, root_path=None, negate_biochem=False, patient_id_column_name='client_idcode', overwrite_stored_pat_docs=False, overwrite_stored_pat_observations=False, store_pat_batch_docs=True, store_pat_batch_observations=True, annot_filter_options=None, shuffle_pat_list=False, individual_patient_window=False, individual_patient_window_df=None, individual_patient_window_start_column_name=None, individual_patient_id_column_name=None, individual_patient_window_controls_method='full', dropna_doc_timestamps=True, time_window_interval_delta=relativedelta(days=+1), feature_engineering_arg_dict=None, split_clinical_notes=True, lookback=True, add_icd10=False, add_opc4s=False, all_epr_patient_list_path='/home/samorah/_data/gloabl_files/all_client_idcodes_epr_unique.csv', override_medcat_model_path=None, data_type_filter_dict=None, filter_split_notes=True, client_idcode_term_name='client_idcode.keyword', sanitize_pat_list=True, calculate_vectors=True, prefetch_pat_batches=False, sample_treatment_docs=0, test_data_path=None, credentials_path='../../credentials.py')
Initializes the configuration object for the pat2vec pipeline.
This class holds all configuration parameters for a pat2vec run, including file paths, time window settings, feature selection, and operational flags.
- Parameters:
remote_dump (bool) – If True, data will be dumped to a remote server via SFTP.
suffix (str) – A suffix to append to output folder names.
treatment_doc_filename (str) – The filename for the input document containing the primary cohort list.
treatment_control_ratio_n (int) – The ratio of treatment to control patients.
proj_name (str) – The name of the current project, used for creating project-specific folders where patient data batches and vectors are stored.
current_path_dir (str) – The current working directory.
main_options (Optional[Dict[str, bool]]) – A dictionary of boolean flags to enable or disable specific feature extractions (e.g., 'demo', 'bloods', 'annotations'). If None, a default dictionary is used.
start_date (datetime) – The anchor date for generating time windows. For global windows, this is the start. For individual windows, this is overridden per patient.
years (int) – The number of years in the time window duration.
months (int) – Number of months to add to the start_date.
days (int) – The number of days in the time window duration.
batch_mode (bool) – Flag for batch processing mode. This is currently the only functioning mode.
store_annot (bool) – Flag to store annotations. Partially deprecated.
share_sftp (bool) – Flag for sharing SFTP connection. Partially deprecated.
multi_process (bool) – Flag for multi-process execution. Deprecated.
strip_list (bool) – If True, this will check for completed patients before starting to avoid redundancy.
verbosity (int) – Verbosity level for logging (0-9).
random_seed_val (int) – Random seed for reproducibility.
hostname (Optional[str]) – Hostname for SFTP connection.
username (Optional[str]) – Username for SFTP connection.
password (Optional[str]) – Password for SFTP connection.
gpu_mem_threshold (int) – GPU memory threshold in MB for MedCAT.
testing (bool) – If True, enables testing mode, which may use dummy data generators.
dummy_medcat_model (bool) – If True and in testing mode, simulates a MedCAT model.
use_controls (bool) – If True, this will add the desired ratio of controls at random from the global pool, requiring configuration with a master list of patients.
medcat (bool) – Flag for MedCAT processing. If True, MedCAT will load into memory and be used for annotating.
global_start_year (Union[int, str, None]) – Global start year for the overall data extraction window.
global_start_month (Union[int, str, None]) – Global start month.
global_start_day (Union[int, str, None]) – Global start day.
global_end_year (Union[int, str, None]) – Global end year.
global_end_month (Union[int, str, None]) – Global end month.
global_end_day (Union[int, str, None]) – Global end day.
skip_additional_listdir (bool) – If True, skips some listdir calls for performance.
start_time (Optional[datetime]) – Start time for logging. Defaults to datetime.now().
root_path (Optional[str]) – The root directory for the project. If None, defaults to the current working directory.
negate_biochem (bool) – Flag for negating biochemistry features.
patient_id_column_name (str) – Column name for patient IDs in input files.
overwrite_stored_pat_docs (bool) – If True, overwrites existing stored patient documents.
overwrite_stored_pat_observations (bool) – If True, overwrites existing stored patient observations.
store_pat_batch_docs (bool) – If True, stores patient document batches.
store_pat_batch_observations (bool) – If True, stores patient observation batches.
annot_filter_options (Optional[Dict[str, Any]]) – Dictionary for filtering MedCAT annotations.
shuffle_pat_list (bool) – Flag for shuffling the patient list.
individual_patient_window (bool) – If True, uses patient-specific time windows defined in individual_patient_window_df.
individual_patient_window_df (Optional[DataFrame]) – DataFrame with patient IDs and their individual start dates. Required if individual_patient_window is True.
individual_patient_window_start_column_name (Optional[str]) – The column name for start dates in individual_patient_window_df.
individual_patient_id_column_name (Optional[str]) – The column name for patient IDs in individual_patient_window_df.
individual_patient_window_controls_method (str) – Method for handling control patients in IPW mode ('full' or 'random').
dropna_doc_timestamps (bool) – If True, drops documents with missing timestamps.
time_window_interval_delta (relativedelta) – The step/interval for each time slice vector.
feature_engineering_arg_dict (Optional[Dict[str, Any]]) – Dictionary of arguments for feature engineering.
split_clinical_notes (bool) – If True, clinical notes will be split by date and treated as individual documents with extracted dates. Requires a note splitter module.
lookback (bool) – If True, the time window is calculated backward from the start date. If False, it's calculated forward.
add_icd10 (bool) – If True, appends ICD-10 codes to annotation batches.
add_opc4s (bool) – Requires add_icd10 to be True. If True, appends OPC4S codes to annotation batches.
all_epr_patient_list_path (str) – Path to a file containing all patient IDs, used for sampling controls.
override_medcat_model_path (Optional[str]) – Path to a MedCAT model pack to override the default.
data_type_filter_dict (Optional[Dict[str, Any]]) – Dictionary for data type filtering.
filter_split_notes (bool) – If enabled (True), the global time window filter will be reapplied after clinical note splitting.
client_idcode_term_name (str) – The Elasticsearch field name for patient ID searches.
sanitize_pat_list (bool) – If True, sanitizes the patient list (e.g., to uppercase).
calculate_vectors (bool) – If True, calculates feature vectors. If False, only extracts batches.
prefetch_pat_batches (bool) – If True, fetches all raw data for all patients before processing. May use significant memory.
sample_treatment_docs (int) – Number of patients to sample from the initial cohort list. 0 means no sampling.
test_data_path (Optional[str]) – The path to the test data file, used when testing is True.
credentials_path (str) – Path to the credentials file.
- Return type:
None
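A few of the parameter descriptions above state dependencies between arguments (add_opc4s needs add_icd10, individual_patient_window needs individual_patient_window_df, and batch_mode is the only functioning mode). The sketch below collects those documented constraints into a pre-flight check on a plain dict of keyword arguments. The helper name check_config_kwargs and the example values are hypothetical and not part of pat2vec; whether config_class performs similar checks itself is not stated here.

```python
from datetime import datetime


def check_config_kwargs(kwargs):
    """Return a list of problems found in a dict of config_class keyword arguments.

    Hypothetical helper, not part of pat2vec: it only encodes constraints
    that the parameter descriptions state explicitly.
    """
    problems = []
    # add_opc4s is documented as requiring add_icd10.
    if kwargs.get("add_opc4s", False) and not kwargs.get("add_icd10", False):
        problems.append("add_opc4s requires add_icd10=True")
    # individual_patient_window_df is documented as required in IPW mode.
    if (
        kwargs.get("individual_patient_window", False)
        and kwargs.get("individual_patient_window_df") is None
    ):
        problems.append(
            "individual_patient_window requires individual_patient_window_df"
        )
    # batch_mode is documented as the only functioning mode.
    if not kwargs.get("batch_mode", True):
        problems.append("batch_mode=False is not a functioning mode")
    return problems


# Example values are illustrative only.
config_kwargs = {
    "proj_name": "new_project",
    "start_date": datetime(2020, 1, 1),
    "years": 1,
    "months": 0,
    "days": 0,
    "lookback": True,
    "main_options": {"demo": True, "bloods": True, "annotations": False},
    "add_icd10": False,
    "add_opc4s": True,  # deliberately inconsistent, to show the check firing
}

print(check_config_kwargs(config_kwargs))
```

If the returned list is empty, the same dict could be passed on as config_class(**config_kwargs) in an environment where pat2vec is installed.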
- prefetch_pat_batches
If True, fetches all raw data for all patients before processing. May use significant memory.
- calculate_vectors
If True, calculates feature vectors. If False, only extracts batches.
- skip_additional_listdir
If True, skips some listdir calls for performance.
- filter_split_notes
If enabled (True), the global time window filter will be reapplied after clinical note splitting.
- credentials_path
Path to the credentials file.
- suffix
A suffix to append to output folder names.
- treatment_doc_filename
The filename for the input document containing the primary cohort list.
- treatment_control_ratio_n
The ratio of treatment to control patients.
- pre_report_annotation_batch_path_report
Path to the report annotation batches directory.
- pre_report_batch_path
Path to the report batches directory.
- store_pat_batch_docs
If True, stores patient document batches.
- store_pat_batch_observations
If True, stores patient observation batches.
- proj_name
The name of the current project, used for creating project-specific folders.
- negate_biochem
Flag for negating biochemistry features.
- patient_id_column_name
Column name for patient IDs in input files.
- add_icd10
If True, appends ICD-10 codes to annotation batches.
- add_opc4s
If True, appends OPC4S codes to annotation batches. Requires add_icd10 to be True.
- data_type_filter_dict
Dictionary for data type filtering.
- batch_mode
Flag for batch processing mode.
- remote_dump
If True, data will be dumped to a remote server via SFTP.
- store_annot
Flag to store annotations. Partially deprecated.
- share_sftp
Flag for sharing SFTP connection. Partially deprecated.
- multi_process
Flag for multi-process execution. Deprecated.
- strip_list
If True, checks for completed patients before starting to avoid redundancy.
- verbosity
Verbosity level for logging (0-9).
- random_seed_val
Random seed for reproducibility.
- hostname
Hostname for SFTP connection.
- username
Username for SFTP connection.
- password
Password for SFTP connection.
- gpu_mem_threshold
GPU memory threshold in MB for MedCAT.
- testing
If True, enables testing mode, which may use dummy data generators.
- use_controls
If True, adds the desired ratio of controls at random from the global pool.
- skipped_counter
Counter for skipped items.
- medcat
Flag for MedCAT processing. If True, MedCAT will be used for annotating.
- overwrite_stored_pat_docs
If True, overwrites existing stored patient documents.
- overwrite_stored_pat_observations
If True, overwrites existing stored patient observations.
- annot_filter_options
Dictionary for filtering MedCAT annotations.
- shuffle_pat_list
Flag for shuffling the patient list.
- individual_patient_window
If True, uses patient-specific time windows.
- individual_patient_window_start_column_name
The column name for start dates in individual_patient_window_df.
- individual_patient_id_column_name
The column name for patient IDs in individual_patient_window_df.
- individual_patient_window_controls_method
Method for handling control patients in IPW mode ('full' or 'random').
- control_list_path
Path to the control list pickle file.
- dropna_doc_timestamps
If True, drops documents with missing timestamps.
- time_window_interval_delta
The step/interval for each time slice vector.
- split_clinical_notes
If True, clinical notes will be split by date.
- all_epr_patient_list_path
Path to a file containing all patient IDs, used for sampling controls.
- lookback
If True, the time window is calculated backward from the start date.
- override_medcat_model_path
Path to a MedCAT model pack to override the default.
- start_time
Start time for logging.
- drug_time_field
The time field to use for drug orders.
- diagnostic_time_field
The time field to use for diagnostic orders.
- appointments_time_field
The time field to use for appointments.
- bloods_time_field
The time field to use for bloods.
- client_idcode_term_name
The Elasticsearch field name for patient ID searches.
- main_options
A dictionary of boolean flags to enable or disable specific feature extractions.
- feature_engineering_arg_dict
Dictionary of arguments for feature engineering.
- negated_presence_annotations
Flag for handling negated presence annotations.
- pre_document_annotation_batch_path
Path to the document annotation batches directory.
- pre_document_annotation_batch_path_mct
Path to the MCT document annotation batches directory.
- pre_textual_obs_annotation_batch_path
Path to the textual observation annotation batches directory.
- pre_textual_obs_document_batch_path
Path to the textual observation document batches directory.
- pre_document_batch_path
Path to the document batches directory.
- pre_document_batch_path_mct
Path to the MCT document batches directory.
- pre_document_batch_path_reports
Path to the report document batches directory.
- pre_bloods_batch_path
Path to the bloods batches directory.
- pre_drugs_batch_path
Path to the drugs batches directory.
- pre_diagnostics_batch_path
Path to the diagnostics batches directory.
- pre_news_batch_path
Path to the NEWS batches directory.
- pre_obs_batch_path
Path to the observations batches directory.
- pre_bmi_batch_path
Path to the BMI batches directory.
- pre_demo_batch_path
Path to the demographics batches directory.
- pre_misc_batch_path
Path to the miscellaneous batches directory.
- pre_appointments_batch_path
Path to the appointments batches directory.
- pre_merged_input_batches_path
Path to the merged input batches directory.
- output_folder
The name of the output folder.
- PathsClass_instance
An instance of the PathsClass for managing directory paths.
- start_date
The anchor date for generating time windows.
- years
The number of years in the time window duration.
- months
The number of months in the time window duration.
- days
The number of days in the time window duration.
- time_delta
The total time delta for the window.
- slow_execution_threshold_low
Threshold for low slow execution warning.
- slow_execution_threshold_high
Threshold for high slow execution warning.
- slow_execution_threshold_extreme
Threshold for extreme slow execution warning.
- sample_treatment_docs
Number of patients to sample from the initial cohort list. 0 means no sampling.
- test_data_path
The path to the test data file, used when testing is True.
- root_path
The root directory for the project.
- pre_annotation_path
Path to the pre-annotation parts directory on the remote server.
- pre_annotation_path_mrc
Path to the MRC pre-annotation parts directory on the remote server.
- current_pat_line_path
Path to the patient line directory on the remote server.
- current_pat_lines_path
Path to the patient lines parts directory.
- sftp_client
SFTP client for remote file operations.
- individual_patient_window_df
DataFrame with patient IDs and their individual start dates.
- n_pat_lines
Number of patient lines, dynamic for IPW.
- date_list: List[datetime]
List of datetime objects for time window generation.
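The attributes start_date, lookback, time_delta, time_window_interval_delta and date_list together describe a window that is walked in fixed steps, backward or forward from an anchor date. The sketch below illustrates that idea with plain day counts from the standard library. It is an assumption about the general shape of the computation, not pat2vec's implementation: the function build_date_list and its parameters window_days and step_days are hypothetical, and the real configuration uses relativedelta values (years, months, days) rather than integers.

```python
from datetime import datetime, timedelta


def build_date_list(start_date, window_days, step_days=1, lookback=True):
    """Return the start of each time slice inside the window.

    Hypothetical helper: pat2vec's own date generation may differ, and it
    works with relativedelta objects rather than plain day counts.
    """
    if window_days < 0 or step_days <= 0:
        raise ValueError("window_days must be >= 0 and step_days must be > 0")
    # Walk backward from the anchor when lookback is True, forward otherwise.
    direction = -1 if lookback else 1
    n_slices = window_days // step_days
    return [
        start_date + direction * timedelta(days=i * step_days)
        for i in range(n_slices)
    ]


anchor = datetime(2020, 1, 10)
print(build_date_list(anchor, window_days=3, lookback=True))
print(build_date_list(anchor, window_days=3, lookback=False))
```

With lookback=True the slices run from the anchor toward earlier dates, and with lookback=False they run toward later dates, which mirrors the description of the lookback parameter.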
- pat2vec.util.config_pat2vec.update_global_start_date(self, start_date)
Updates the global start date if the provided start_date is later.
This logic only applies when looking forward (lookback=False).
- Parameters:
self (TypeVar(T_config, bound=config_class)) – The configuration object instance.
start_date (datetime) – The new start date to compare against the global start date.
- Return type:
TypeVar(T_config, bound=config_class)
- Returns:
The configuration object instance.
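The behaviour described above can be sketched as follows. This is a minimal illustration of the documented rule only (move the global start forward when start_date is later, and only when lookback is False); the function name update_global_start_date_sketch is hypothetical, a SimpleNamespace stands in for a real config_class instance, and storing the updated values as integers is an assumption.

```python
from datetime import datetime
from types import SimpleNamespace


def update_global_start_date_sketch(config, start_date):
    """Move the global start forward to start_date when start_date is later.

    A sketch of the documented behaviour only; the real function may differ.
    """
    if config.lookback:
        # Documented as applying only when looking forward (lookback=False).
        return config
    # The global_* fields are documented as int, str, or None, so cast to int.
    current = datetime(
        int(config.global_start_year),
        int(config.global_start_month),
        int(config.global_start_day),
    )
    if start_date > current:
        config.global_start_year = start_date.year
        config.global_start_month = start_date.month
        config.global_start_day = start_date.day
    return config


# Stand-in object carrying only the attributes the sketch reads.
cfg = SimpleNamespace(
    lookback=False,
    global_start_year="1995",
    global_start_month="1",
    global_start_day="1",
)
update_global_start_date_sketch(cfg, datetime(2020, 5, 17))
print(cfg.global_start_year, cfg.global_start_month, cfg.global_start_day)
```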
- pat2vec.util.config_pat2vec.validate_and_fix_global_dates(config)
Ensures global start date is before the global end date.
If the start date is after the end date, it swaps them to ensure compatibility with Elasticsearch range queries and warns the user.
- Parameters:
config (TypeVar(T_config, bound=config_class)) – The configuration object instance.
- Return type:
TypeVar(T_config, bound=config_class)
- Returns:
The modified configuration object.
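A minimal sketch of the swap-and-warn behaviour described above, assuming the global dates are held in the global_start_* and global_end_* year, month and day attributes listed for config_class. The function name validate_and_fix_global_dates_sketch and the helper _as_datetime are hypothetical, a SimpleNamespace stands in for a real configuration object, and the warning text is illustrative.

```python
import warnings
from datetime import datetime
from types import SimpleNamespace

_PARTS = ("year", "month", "day")


def _as_datetime(config, prefix):
    """Build a datetime from the <prefix>_year/_month/_day attributes."""
    return datetime(*(int(getattr(config, f"{prefix}_{part}")) for part in _PARTS))


def validate_and_fix_global_dates_sketch(config):
    """Swap the global start and end dates if they are in the wrong order.

    A sketch of the documented behaviour only; the real function may differ.
    """
    start = _as_datetime(config, "global_start")
    end = _as_datetime(config, "global_end")
    if start > end:
        warnings.warn("Global start date is after global end date; swapping them.")
        for part in _PARTS:
            start_name = f"global_start_{part}"
            end_name = f"global_end_{part}"
            start_val, end_val = getattr(config, start_name), getattr(config, end_name)
            setattr(config, start_name, end_val)
            setattr(config, end_name, start_val)
    return config


# Stand-in object with the start deliberately later than the end.
cfg = SimpleNamespace(
    global_start_year=2023, global_start_month=6, global_start_day=1,
    global_end_year=2020, global_end_month=1, global_end_day=1,
)
validate_and_fix_global_dates_sketch(cfg)
print(cfg.global_start_year, cfg.global_end_year)
```

Swapping rather than raising keeps the run going, at the cost of silently changing what the caller asked for, which is presumably why the documented function also warns the user.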