Comprehensive Configuration Guide for pat2vec

Configuration for pat2vec is managed at multiple levels: during installation, through files in your project directory, and at runtime within your analysis notebooks. This guide provides a detailed overview of all configuration options.

1. Installation-Time Configuration

The install_pat2vec.sh script prepares your environment. You can customize its behavior with the following command-line flags:

Flag        Alias    Description
----------  -------  -----------
--proxy     -p       Configures pip to use a corporate proxy/mirror for installing Python packages.
--dev       (none)   Installs development dependencies like pytest and nbmake, which are required for running tests and contributing to the project.
--all       -a       Installs all optional dependencies required for every feature extractor. The default “lite” installation only includes core dependencies.
--force     -f       Performs a clean installation by removing any existing pat2vec_env virtual environment.
--no-clone  (none)   Skips cloning the snomed_methods helper repository. Use this if you already have it.

Example Usage

To set up a full development environment behind a corporate proxy, you would run:

./install_pat2vec.sh --proxy --dev --all

2. Environment and File-Based Configuration

The installation script sets up a specific directory structure and creates several configuration files that you must edit. The pipeline expects this layout to be in the parent directory of your pat2vec clone.

Project Directory Structure

your_project_folder/
├── credentials.py              # <-- MUST EDIT: Your Elasticsearch credentials
├── medcat_models/
│   └── your_model.zip          # <-- MUST ADD: Your MedCAT model pack
β”œβ”€β”€ snomed_methods/             # <-- Cloned automatically
└── pat2vec/                    # <-- This repository
    └── ...

Note: The install_pat2vec.sh script also creates a notebooks/paths.py file, but it is now recommended to set the MedCAT model path directly in config_class using override_medcat_model_path for better clarity.

Key Configuration Files

credentials.py

  • Location: your_project_folder/credentials.py

  • Purpose: Stores your sensitive Elasticsearch credentials (host, username, password).

  • Setup: The install_pat2vec.sh script copies a template to this location. You must edit this file to add your actual credentials. The path to this file can be specified in config_class with the credentials_path argument.

  • Security: This file contains secrets and must never be committed to version control (for example, list it in .gitignore if your project folder is a Git repository).

medcat_models/ directory

  • Location: your_project_folder/medcat_models/

  • Purpose: Stores your pre-trained MedCAT model packs (.zip files) used for clinical text annotation.

  • Setup: The installation script creates this directory. You must manually place your model pack(s) inside it.
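Before starting a run, it can help to confirm that the layout described above is in place. The following is a minimal sketch, not part of pat2vec: check_project_layout is a hypothetical helper that only looks for the files and folders named in this section.

```python
from pathlib import Path


def check_project_layout(project_root):
    """Return a list of human-readable problems found under project_root.

    Hypothetical helper for illustration; it is not provided by pat2vec.
    """
    root = Path(project_root)
    problems = []

    # credentials.py is copied from a template and must be edited by hand.
    if not (root / "credentials.py").is_file():
        problems.append("credentials.py is missing")

    # At least one MedCAT model pack (.zip) must be placed in medcat_models/.
    models_dir = root / "medcat_models"
    if not models_dir.is_dir():
        problems.append("medcat_models/ directory is missing")
    elif not any(models_dir.glob("*.zip")):
        problems.append("medcat_models/ contains no .zip model pack")

    # snomed_methods/ and pat2vec/ are expected to sit side by side.
    for name in ("snomed_methods", "pat2vec"):
        if not (root / name).is_dir():
            problems.append(f"{name}/ directory is missing")

    return problems


# Example: check_project_layout("/path/to/your_project_folder")
```

An empty list means every expected item was found. The check does not verify that credentials.py has actually been edited or that the model pack is valid.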

3. Runtime Configuration (config_class)

The config_class is the central Python object used within your Jupyter notebook (e.g., example_usage.ipynb) to control the behavior of a pipeline run. It is highly detailed, allowing for fine-grained control over the entire process.

Project and Path Configuration

These parameters define the project’s file structure and I/O.

  • proj_name (str): The name of your project. This is used to create a root directory for all outputs (e.g., new_project/).

  • suffix (str): An optional suffix appended to output sub-folders, allowing you to distinguish between different runs within the same project (e.g., _run1).

  • treatment_doc_filename (str): The path to your input CSV file containing the initial patient cohort. This file must contain a column with patient identifiers.

  • patient_id_column_name (str): The name of the column in your cohort CSV that contains the unique patient identifiers (default: 'client_idcode').

  • root_path (str): The absolute path to the project’s root output directory. If not set, it defaults to os.getcwd()/proj_name/.

  • override_medcat_model_path (str): The direct path to the MedCAT model pack (.zip) you want to use. This is the recommended way to specify the model.
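As an illustration, the path parameters above could be collected into a dictionary of keyword arguments. This is a sketch under assumptions: the file names are placeholders, and default_root_path is a hypothetical helper that only mirrors the documented default for root_path.

```python
import os


def default_root_path(proj_name, root_path=None):
    """Mirror the documented default: os.getcwd()/proj_name/ when root_path is unset."""
    if root_path:
        return root_path
    return os.path.join(os.getcwd(), proj_name) + os.sep


# Keyword arguments using the parameter names documented above.
# All values are placeholders chosen for illustration.
path_kwargs = {
    "proj_name": "new_project",
    "suffix": "_run1",
    "treatment_doc_filename": "cohort.csv",       # hypothetical cohort file
    "patient_id_column_name": "client_idcode",    # documented default
    "override_medcat_model_path": os.path.join(
        "..", "medcat_models", "your_model.zip"   # placeholder model pack
    ),
}
path_kwargs["root_path"] = default_root_path(path_kwargs["proj_name"])

# In a notebook, these would be passed on to pat2vec's config_class:
# config_obj = config_class(**path_kwargs)
```

Setting root_path explicitly, as done here, makes the output location independent of the directory the notebook happens to be started from.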

Execution and Operational Control

These flags control how the pipeline runs, its verbosity, and its behavior for testing and performance.

  • testing (bool): Set to True to run in testing mode, which uses dummy data generators for rapid debugging. Set to False for production runs.

  • strip_list (bool): If True (default), the pipeline checks for already processed patients and skips them to avoid redundant work.

  • verbosity (int): Sets the logging level (0-9). A higher number provides more detailed output. 3 is a good default.

  • random_seed_val (int): A seed for random operations to ensure reproducibility.

  • calculate_vectors (bool): If True (default), the pipeline generates the final feature vector CSVs. If False, it only pre-fetches and saves the raw data batches, which can be useful for debugging the data extraction step.

  • prefetch_pat_batches (bool): If True, all raw data for the entire cohort is fetched and stored in memory before processing begins. This can speed up processing but requires significant RAM. It is not compatible with individual_patient_window.
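The flags above combine freely, with one documented exception: prefetch_pat_batches cannot be used together with individual_patient_window. A small guard like the following can catch that before a long run starts. It is a sketch only; check_execution_flags is a hypothetical helper, and the seed value is arbitrary.

```python
def check_execution_flags(kwargs):
    """Raise ValueError for the one incompatibility documented above.

    Hypothetical helper for illustration; it is not provided by pat2vec.
    """
    if kwargs.get("prefetch_pat_batches") and kwargs.get("individual_patient_window"):
        raise ValueError(
            "prefetch_pat_batches is not compatible with individual_patient_window"
        )
    return kwargs


# A quick debugging run: dummy data, moderate logging, fixed seed.
debug_kwargs = check_execution_flags({
    "testing": True,             # use dummy data generators
    "strip_list": True,          # skip patients that were already processed
    "verbosity": 3,
    "random_seed_val": 42,       # arbitrary seed for reproducibility
    "calculate_vectors": True,
    "prefetch_pat_batches": False,
})

# config_obj = config_class(**debug_kwargs)
```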

Temporal Window Configuration

This is one of the most critical parts of the configuration, defining how patient data is sliced over time.

Global Time Windows

Used when all patients are analyzed over the same fixed time period.

  • start_date (datetime): The anchor date for the time window calculation.

  • years, months, days (int): The total duration of the time window relative to start_date.

  • lookback (bool): Determines the direction of the window. If True (default), the window extends backward from start_date. If False, it extends forward.

  • time_window_interval_delta (relativedelta): The step size for each time slice. For example, relativedelta(months=1) creates one feature vector per patient per month.
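To make the interaction between start_date, the duration, lookback, and the step size concrete, the sketch below slices a window into consecutive intervals. It illustrates the concept only and is not pat2vec's own slicing code, whose ordering and boundary handling may differ. window_slices is a hypothetical function, and timedelta is used so the example needs nothing beyond the standard library.

```python
from datetime import datetime, timedelta


def window_slices(start_date, duration, step, lookback=True):
    """Return (slice_start, slice_end) pairs covering the whole window.

    Illustrative only: step must be a positive interval.
    """
    if lookback:
        # The window extends backward from the anchor date.
        lower, upper = start_date - duration, start_date
    else:
        # The window extends forward from the anchor date.
        lower, upper = start_date, start_date + duration

    slices = []
    cursor = lower
    while cursor < upper:
        slice_end = min(cursor + step, upper)  # clip the final slice
        slices.append((cursor, slice_end))
        cursor = slice_end
    return slices


slices = window_slices(
    start_date=datetime(2020, 1, 1),
    duration=timedelta(days=90),
    step=timedelta(days=30),
    lookback=True,
)
for begin, end in slices:
    print(begin.date(), "->", end.date())
```

Each slice corresponds to one feature vector per patient. Because relativedelta objects also support addition to and subtraction from datetime, the same function accepts relativedelta(months=1) as the step.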

Individual Patient Windows (IPW)

Used for patient-specific time windows, typically anchored to a clinical event (e.g., diagnosis date).

  • individual_patient_window (bool): Set to True to enable IPW mode.

  • individual_patient_window_df (pd.DataFrame): A DataFrame containing patient IDs and their corresponding event dates.

  • individual_patient_id_column_name (str): The name of the patient ID column in the individual_patient_window_df.

  • individual_patient_window_start_column_name (str): The name of the column containing the anchor dates in the individual_patient_window_df.
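A minimal sketch of an IPW setup follows. The patient IDs, dates, and the diagnosis_date column name are invented for illustration, and the sketch assumes one anchor date per patient; any column names work as long as they match the two column-name arguments.

```python
import pandas as pd

# Hypothetical event table: one row per patient with an anchor date.
ipw_df = pd.DataFrame({
    "client_idcode": ["P001", "P002", "P003"],
    "diagnosis_date": pd.to_datetime(["2019-03-14", "2020-11-02", "2021-06-30"]),
})

ipw_kwargs = {
    "individual_patient_window": True,
    "individual_patient_window_df": ipw_df,
    "individual_patient_id_column_name": "client_idcode",
    "individual_patient_window_start_column_name": "diagnosis_date",
    "prefetch_pat_batches": False,  # documented as incompatible with IPW mode
}

# Sanity checks before handing the DataFrame to the pipeline:
# no duplicated patients and no missing anchor dates.
assert ipw_df["client_idcode"].is_unique
assert ipw_df["diagnosis_date"].notna().all()

# config_obj = config_class(**ipw_kwargs)
```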

Global Data Boundaries

These parameters set the absolute earliest and latest dates for any data retrieval from Elasticsearch, acting as a hard filter.

  • global_start_year, global_start_month, global_start_day (int/str): The earliest date from which data is retrieved.

  • global_end_year, global_end_month, global_end_day (int/str): The latest date up to which data is retrieved.
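Because the boundaries are split across six fields that accept either integers or strings, it is easy to enter an invalid or inverted range. The sketch below shows one way to check them up front; boundary_dates is a hypothetical helper and the dates are example values.

```python
from datetime import date

# Example boundary values; strings such as "01" are also accepted below.
boundary_kwargs = {
    "global_start_year": 2015, "global_start_month": 1, "global_start_day": 1,
    "global_end_year": 2023, "global_end_month": 12, "global_end_day": 31,
}


def boundary_dates(kwargs):
    """Convert the six boundary fields (int or str) into two date objects.

    Hypothetical helper for illustration; it is not provided by pat2vec.
    """
    parts = ("year", "month", "day")
    start = date(*(int(kwargs[f"global_start_{p}"]) for p in parts))
    end = date(*(int(kwargs[f"global_end_{p}"]) for p in parts))
    if start > end:
        raise ValueError("global start date is after global end date")
    return start, end


start, end = boundary_dates(boundary_kwargs)

# config_obj = config_class(**boundary_kwargs)
```

Constructing date objects also rejects impossible values such as month 13 or 31 February, since date() raises ValueError for them.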

Feature Selection (main_options)

You can precisely control which features are extracted by creating a dictionary and passing it to the config_class. Set a feature to True to enable it or False to disable it.

# 1. Define your feature set
main_options_dict = {
    'demo': True,           # Demographic information
    'bmi': True,            # BMI information
    'bloods': True,         # Blood-related information
    'drugs': True,          # Drug-related information
    'diagnostics': True,    # Diagnostic information
    'core_02': True,        # core_02 information
    'bed': True,            # Bed information
    'vte_status': True,     # VTE status information
    'hosp_site': True,      # Hospital site information
    'core_resus': True,     # Core resuscitation information
    'news': True,           # NEWS (National Early Warning Score)
    'smoking': True,        # Smoking-related information
    'annotations': True,    # EPR document annotations via MedCAT
    'annotations_mrc': True,  # MRC annotations via MedCAT
    'negated_presence_annotations': False,  # Negated presence annotations
    'appointments': False,  # Appointments information
    'annotations_reports': False,  # Reports information
    'textual_obs': False,   # Textual observations
}

# 2. Pass the dictionary to your config_class instance
config_obj = config_class(
    main_options=main_options_dict,
    # ... other configuration parameters
)

Cohort and Sampling

  • use_controls (bool): If True, generates a control group for a case-control study.

  • treatment_control_ratio_n (int): The ratio of control patients to treatment patients (e.g., 2 for a 2:1 ratio).

  • all_epr_patient_list_path (str): Path to a CSV file containing a master list of all possible patient IDs, used for sampling controls.

  • sample_treatment_docs (int): If set to a number greater than 0, a random sample of that size will be taken from the initial cohort. Useful for quick tests.

  • shuffle_pat_list (bool): If True, shuffles the final patient list before processing.
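The sketch below puts the cohort parameters together for a small case-control test run. The CSV path is a placeholder, expected_control_count is a hypothetical helper, and the calculation assumes that controls are drawn relative to the sampled cohort size, which this guide does not state explicitly.

```python
def expected_control_count(n_treatment, treatment_control_ratio_n):
    """Number of controls implied by an n:1 control-to-treatment ratio."""
    return n_treatment * treatment_control_ratio_n


cohort_kwargs = {
    "use_controls": True,
    "treatment_control_ratio_n": 2,                    # 2 controls per treatment patient
    "all_epr_patient_list_path": "all_patients.csv",   # hypothetical master list
    "sample_treatment_docs": 50,                       # sample 50 patients for a quick test
    "shuffle_pat_list": True,
}

# Assumption: the ratio is applied to the sampled cohort of 50 patients.
n_controls = expected_control_count(
    cohort_kwargs["sample_treatment_docs"],
    cohort_kwargs["treatment_control_ratio_n"],
)
print(f"Expecting {n_controls} control patients")

# config_obj = config_class(**cohort_kwargs)
```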

Advanced and Technical Parameters

  • split_clinical_notes (bool): If True, the pipeline attempts to parse dates within long clinical notes and split them into smaller, date-stamped documents for more accurate temporal analysis.

  • add_icd10 / add_opc4s (bool): If True, appends linked ICD-10 or OPCS-4 codes to the MedCAT annotation outputs.

  • annot_filter_options (dict): A dictionary to fine-tune MedCAT annotation filtering, allowing you to set thresholds for confidence, accuracy, and filter by concept types or meta-annotations (e.g., Presence: True).

  • data_type_filter_dict (dict): A dictionary to apply term-based filtering on raw data before feature extraction (e.g., only include specific blood tests).

  • gpu_mem_threshold (int): The minimum free GPU memory (in MB) required for MedCAT to be loaded onto a specific GPU.

  • remote_dump (bool): If True, saves output files to a remote server via SFTP. Requires hostname, username, and password to be set.