Comprehensive Configuration Guide for pat2vec
Configuration for pat2vec is managed at multiple levels: during installation, through files in your project directory, and at runtime within your analysis notebooks. This guide provides a detailed overview of all configuration options.
1. Installation-Time Configuration
The install_pat2vec.sh script prepares your environment. You can customize its behavior with the following command-line flags:
Flag | Alias | Description |
---|---|---|
 | | Configures pip to use a corporate proxy/mirror for installing Python packages. |
 | | Installs development dependencies like pytest and nbmake, which are required for running tests and contributing to the project. |
 | | Installs all optional dependencies required for every feature extractor. The default "lite" installation only includes core dependencies. |
 | | Performs a clean installation by removing any existing pat2vec_env virtual environment. |
 | | Skips cloning the snomed_methods helper repository. Use this if you already have it. |
Example Usage
To set up a full development environment behind a corporate proxy, you would run:
./install_pat2vec.sh --proxy --dev --all
2. Environment and File-Based Configuration
The installation script sets up a specific directory structure and creates several configuration files that you must edit. The pipeline expects this layout to be in the parent directory of your pat2vec clone.
Project Directory Structure
your_project_folder/
├── credentials.py         # <-- MUST EDIT: Your Elasticsearch credentials
├── medcat_models/
│   └── your_model.zip     # <-- MUST ADD: Your MedCAT model pack
├── snomed_methods/        # <-- Cloned automatically
└── pat2vec/               # <-- This repository
    └── ...
Note: The install_pat2vec.sh script also creates a notebooks/paths.py file, but it is now recommended to set the MedCAT model path directly in config_class using override_medcat_model_path for better clarity.
Key Configuration Files
credentials.py
Location: your_project_folder/credentials.py
Purpose: Stores your sensitive Elasticsearch credentials (host, username, password).
Setup: The install_pat2vec.sh script copies a template to this location. You must edit this file to add your actual credentials. The path to this file can be specified in config_class with the credentials_path argument.
Security: This file is critical and should never be committed to version control.
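As a rough illustration of pointing at a credentials file by explicit path, the sketch below imports a Python file from an arbitrary location using only the standard library. It is not how pat2vec itself reads the file; the helper load_credentials and the variable names hosts, username, and password are hypothetical and stand in for whatever the copied template actually defines.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_credentials(credentials_path):
    """Import a credentials file from an explicit path and return it as a module."""
    path = Path(credentials_path)
    if not path.is_file():
        raise FileNotFoundError(f"No credentials file at {path}")
    spec = importlib.util.spec_from_file_location("credentials", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway file. The variable names are hypothetical;
# check the template created by install_pat2vec.sh for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "credentials.py"
    demo_path.write_text(
        'hosts = ["https://localhost:9200"]\n'
        'username = "example_user"\n'
        'password = "example_password"\n'
    )
    creds = load_credentials(demo_path)
    loaded_username = creds.username
```

A check like this can be run once at the top of a notebook to fail early when the path is wrong, before any Elasticsearch call is attempted.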
medcat_models/ directory
Location: your_project_folder/medcat_models/
Purpose: Stores your pre-trained MedCAT model packs (.zip files) used for clinical text annotation.
Setup: The installation script creates this directory. You must manually place your model pack(s) inside it.
3. Runtime Configuration (config_class)
The config_class is the central Python object used within your Jupyter notebook (e.g., example_usage.ipynb) to control the behavior of a pipeline run. It is highly detailed, allowing for fine-grained control over the entire process.
Project and Path Configuration
These parameters define the project's file structure and I/O.
proj_name (str): The name of your project. This is used to create a root directory for all outputs (e.g., new_project/).
suffix (str): An optional suffix appended to output sub-folders, allowing you to distinguish between different runs within the same project (e.g., _run1).
treatment_doc_filename (str): The path to your input CSV file containing the initial patient cohort. This file must contain a column with patient identifiers.
patient_id_column_name (str): The name of the column in your cohort CSV that contains the unique patient identifiers (default: 'client_idcode').
root_path (str): The absolute path to the project's root output directory. If not set, it defaults to os.getcwd()/proj_name/.
override_medcat_model_path (str): The direct path to the MedCAT model pack (.zip) you want to use. This is the recommended way to specify the model.
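The path parameters above can be collected in a plain dictionary before they are handed to config_class. The sketch below is only an illustration: the file name cohort.csv is a hypothetical placeholder, and the root path is computed by hand to show the documented default rather than by calling pat2vec.

```python
import os

proj_name = "new_project"

# Documented default: root_path falls back to os.getcwd()/proj_name/ when unset.
default_root_path = os.path.join(os.getcwd(), proj_name)

path_kwargs = {
    "proj_name": proj_name,
    "suffix": "_run1",
    "treatment_doc_filename": "cohort.csv",  # hypothetical cohort file
    "patient_id_column_name": "client_idcode",
    "root_path": default_root_path,
    # Placeholder path taken from the directory layout shown earlier.
    "override_medcat_model_path": "your_project_folder/medcat_models/your_model.zip",
}

# In a notebook these would be unpacked into the config object, e.g.:
# config_obj = config_class(**path_kwargs)
```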
Execution and Operational Control
These flags control how the pipeline runs, its verbosity, and its behavior for testing and performance.
testing (bool): Set to True to run in testing mode, which uses dummy data generators for rapid debugging. Set to False for production runs.
strip_list (bool): If True (default), the pipeline checks for already processed patients and skips them to avoid redundant work.
verbosity (int): Sets the logging level (0-9). A higher number provides more detailed output. 3 is a good default.
random_seed_val (int): A seed for random operations to ensure reproducibility.
calculate_vectors (bool): If True (default), the pipeline generates the final feature vector CSVs. If False, it only pre-fetches and saves the raw data batches, which can be useful for debugging the data extraction step.
prefetch_pat_batches (bool): If True, all raw data for the entire cohort is fetched and stored in memory before processing begins. This can speed up processing but requires significant RAM. It is not compatible with individual_patient_window.
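Because prefetch_pat_batches and individual_patient_window cannot be combined, it can help to check a set of flags before building the config. The helper below, check_execution_flags, is a hypothetical pre-flight check written for this guide and is not part of pat2vec; it only encodes the two constraints stated above (the incompatibility and the 0-9 verbosity range).

```python
def check_execution_flags(flags):
    """Return a list of problems found in a dict of execution flags."""
    problems = []
    if flags.get("prefetch_pat_batches") and flags.get("individual_patient_window"):
        problems.append(
            "prefetch_pat_batches is not compatible with individual_patient_window"
        )
    verbosity = flags.get("verbosity", 3)
    if not 0 <= verbosity <= 9:
        problems.append("verbosity must be between 0 and 9")
    return problems


execution_flags = {
    "testing": True,             # dummy data, for rapid debugging
    "strip_list": True,          # skip patients that were already processed
    "verbosity": 3,
    "random_seed_val": 42,       # arbitrary example seed
    "calculate_vectors": True,
    "prefetch_pat_batches": False,
}
problems = check_execution_flags(execution_flags)
```

An empty list means the flags pass these two checks; anything else is a message worth reading before starting a long run.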
Temporal Window Configuration
This is one of the most critical parts of the configuration, defining how patient data is sliced over time.
Global Time Windows
Used when all patients are analyzed over the same fixed time period.
start_date (datetime): The anchor date for the time window calculation.
years, months, days (int): The total duration of the time window relative to start_date.
lookback (bool): Determines the direction of the window. If True (default), the window extends backward from start_date. If False, it extends forward.
time_window_interval_delta (relativedelta): The step size for each time slice. For example, relativedelta(months=1) creates one feature vector per patient per month.
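To make the interaction between start_date, the window duration, lookback, and the step size concrete, here is a minimal sketch of how such a window could be cut into slices. It is an illustration under stated assumptions, not pat2vec's own slicing code: the function window_slices is hypothetical, and datetime.timedelta is used as a stand-in for relativedelta so the example runs with the standard library alone.

```python
from datetime import datetime, timedelta


def window_slices(start_date, duration, step, lookback=True):
    """Split a time window anchored at start_date into consecutive slices."""
    if lookback:
        # Window extends backward from the anchor date.
        window_start, window_end = start_date - duration, start_date
    else:
        # Window extends forward from the anchor date.
        window_start, window_end = start_date, start_date + duration
    slices = []
    current = window_start
    while current < window_end:
        # The last slice is truncated so it never passes the window end.
        slices.append((current, min(current + step, window_end)))
        current += step
    return slices


slices = window_slices(
    start_date=datetime(2020, 1, 1),
    duration=timedelta(days=90),
    step=timedelta(days=30),
    lookback=True,
)
```

With these values the window covers the 90 days before 1 January 2020 and yields three 30-day slices, which corresponds to three feature vectors per patient.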
Individual Patient Windows (IPW)
Used for patient-specific time windows, typically anchored to a clinical event (e.g., diagnosis date).
individual_patient_window (bool): Set to True to enable IPW mode.
individual_patient_window_df (pd.DataFrame): A DataFrame containing patient IDs and their corresponding event dates.
individual_patient_id_column_name (str): The name of the patient ID column in the individual_patient_window_df.
individual_patient_window_start_column_name (str): The name of the column containing the anchor dates in the individual_patient_window_df.
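The four IPW parameters fit together as sketched below. The patient IDs, the diagnosis_date column name, and the dates are made-up illustration data, and converting the anchor column to datetimes is an assumption about what the pipeline expects rather than a documented requirement.

```python
import pandas as pd

# Hypothetical cohort: one anchor date (e.g. diagnosis date) per patient.
ipw_df = pd.DataFrame(
    {
        "client_idcode": ["P001", "P002", "P003"],
        "diagnosis_date": ["2019-03-14", "2020-07-01", "2021-11-30"],
    }
)
# Assumption: anchor dates are parsed to datetimes before use.
ipw_df["diagnosis_date"] = pd.to_datetime(ipw_df["diagnosis_date"])

ipw_kwargs = {
    "individual_patient_window": True,
    "individual_patient_window_df": ipw_df,
    "individual_patient_id_column_name": "client_idcode",
    "individual_patient_window_start_column_name": "diagnosis_date",
}

# Sanity check: both column names must refer to columns that exist in the frame.
missing_columns = [
    name
    for name in (
        ipw_kwargs["individual_patient_id_column_name"],
        ipw_kwargs["individual_patient_window_start_column_name"],
    )
    if name not in ipw_df.columns
]

# In a notebook: config_obj = config_class(**ipw_kwargs, ...)
```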
Global Data Boundaries
These parameters set the absolute earliest and latest dates for any data retrieval from Elasticsearch, acting as a hard filter.
global_start_year, global_start_month, global_start_day (int/str)
global_end_year, global_end_month, global_end_day (int/str)
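Since the boundary parts may be given as either int or str, the sketch below shows one way to turn them into dates and to see what "hard filter" means for a requested window. The helpers boundary_date and clip_to_boundaries are hypothetical and the example years are arbitrary; pat2vec applies its own logic internally.

```python
from datetime import datetime

boundary_kwargs = {
    "global_start_year": "1995",  # str and int are both accepted
    "global_start_month": "01",
    "global_start_day": "01",
    "global_end_year": 2023,
    "global_end_month": 12,
    "global_end_day": 31,
}


def boundary_date(params, prefix):
    """Build a datetime from the <prefix>_year/_month/_day entries."""
    return datetime(
        int(params[f"{prefix}_year"]),
        int(params[f"{prefix}_month"]),
        int(params[f"{prefix}_day"]),
    )


global_start = boundary_date(boundary_kwargs, "global_start")
global_end = boundary_date(boundary_kwargs, "global_end")


def clip_to_boundaries(window_start, window_end):
    """Restrict a requested window to the global boundaries."""
    return max(window_start, global_start), min(window_end, global_end)
```

For example, a window requested from 1990 to 2000 would be narrowed to start on 1 January 1995, because nothing before the global start date is retrieved.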
Feature Selection (main_options)
You can precisely control which features are extracted by creating a dictionary and passing it to the config_class. Set a feature to True to enable it or False to disable it.
# 1. Define your feature set
main_options_dict = {
    'demo': True,                           # Demographic information
    'bmi': True,                            # BMI information
    'bloods': True,                         # Blood-related information
    'drugs': True,                          # Drug-related information
    'diagnostics': True,                    # Diagnostic information
    'core_02': True,                        # core_02 information
    'bed': True,                            # Bed information
    'vte_status': True,                     # VTE status information
    'hosp_site': True,                      # Hospital site information
    'core_resus': True,                     # Core resuscitation information
    'news': True,                           # NEWS (National Early Warning Score)
    'smoking': True,                        # Smoking-related information
    'annotations': True,                    # EPR document annotations via MedCAT
    'annotations_mrc': True,                # MRC annotations via MedCAT
    'negated_presence_annotations': False,  # Negated presence annotations
    'appointments': False,                  # Appointments information
    'annotations_reports': False,           # Reports information
    'textual_obs': False,                   # Textual observations
}

# 2. Pass the dictionary to your config_class instance
config_obj = config_class(
    main_options=main_options_dict,
    # ... other configuration parameters
)
Cohort and Sampling
use_controls (bool): If True, generates a control group for a case-control study.
treatment_control_ratio_n (int): The ratio of control patients to treatment patients (e.g., 2 for a 2:1 ratio).
all_epr_patient_list_path (str): Path to a CSV file containing a master list of all possible patient IDs, used for sampling controls.
sample_treatment_docs (int): If set to a number greater than 0, a random sample of that size will be taken from the initial cohort. Useful for quick tests.
shuffle_pat_list (bool): If True, shuffles the final patient list before processing.
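The effect of treatment_control_ratio_n is easiest to see with numbers. The sketch below draws controls from a master list at a 2:1 ratio; sample_controls is a hypothetical helper and the patient IDs are invented, so treat it as an illustration of the ratio and of seeded sampling rather than as pat2vec's sampling routine.

```python
import random


def sample_controls(treatment_ids, all_ids, ratio, seed):
    """Draw ratio * len(treatment_ids) controls from IDs not in the treatment group."""
    treatment = set(treatment_ids)
    # Sort the candidates so the draw is reproducible for a given seed.
    candidates = sorted(pid for pid in set(all_ids) if pid not in treatment)
    n_controls = min(ratio * len(treatment), len(candidates))
    return random.Random(seed).sample(candidates, n_controls)


treatment_ids = ["P001", "P002"]
all_ids = [f"P{i:03d}" for i in range(1, 11)]  # stand-in for the master patient list

controls = sample_controls(treatment_ids, all_ids, ratio=2, seed=42)
```

Two treatment patients at a ratio of 2 give four controls, none of which appear in the treatment group, and the same seed always returns the same four.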
Advanced and Technical Parameters
split_clinical_notes (bool): If True, the pipeline attempts to parse dates within long clinical notes and split them into smaller, date-stamped documents for more accurate temporal analysis.
add_icd10 / add_opc4s (bool): If True, appends linked ICD-10 or OPCS-4 codes to the MedCAT annotation outputs.
annot_filter_options (dict): A dictionary to fine-tune MedCAT annotation filtering, allowing you to set thresholds for confidence, accuracy, and filter by concept types or meta-annotations (e.g., Presence: True).
data_type_filter_dict (dict): A dictionary to apply term-based filtering on raw data before feature extraction (e.g., only include specific blood tests).
gpu_mem_threshold (int): The minimum free GPU memory (in MB) required for MedCAT to be loaded onto a specific GPU.
remote_dump (bool): If True, saves output files to a remote server via SFTP. Requires hostname, username, and password to be set.