# Comprehensive Configuration Guide for pat2vec

Configuration for pat2vec is managed at multiple levels: during installation, through files in your project directory, and at runtime within your analysis notebooks. This guide provides a detailed overview of all configuration options.

## 1. Installation-Time Configuration

The `install_pat2vec.sh` script prepares your environment. You can customize its behavior with the following command-line flags:

| Flag | Alias | Description |
|------|-------|-------------|
| `--proxy` | `-p` | Configures pip to use a corporate proxy/mirror for installing Python packages. |
| `--dev` | | Installs development dependencies like pytest and nbmake, which are required for running tests and contributing to the project. |
| `--all` | `-a` | Installs all optional dependencies required for every feature extractor. The default "lite" installation only includes core dependencies. |
| `--force` | `-f` | Performs a clean installation by removing any existing pat2vec_env virtual environment. |
| `--no-clone` | | Skips cloning the snomed_methods helper repository. Use this if you already have it. |

### Example Usage

To set up a full development environment behind a corporate proxy, you would run:

```bash
./install_pat2vec.sh --proxy --dev --all
```

## 2. Environment and File-Based Configuration

The installation script sets up a specific directory structure and creates several configuration files that you must edit. The pipeline expects this layout to be in the parent directory of your pat2vec clone.

### Project Directory Structure

```
your_project_folder/
├── credentials.py          # <-- MUST EDIT: Your Elasticsearch credentials
├── medcat_models/
│   └── your_model.zip      # <-- MUST ADD: Your MedCAT model pack
├── snomed_methods/         # <-- Cloned automatically
└── pat2vec/                # <-- This repository
    └── ...
```

> **Note:** The `install_pat2vec.sh` script also creates a `notebooks/paths.py` file, but it is now recommended to set the MedCAT model path directly in `config_class` using `override_medcat_model_path` for better clarity.

### Key Configuration Files

#### credentials.py

- **Location:** `your_project_folder/credentials.py`
- **Purpose:** Stores your sensitive Elasticsearch credentials (host, username, password).
- **Setup:** The `install_pat2vec.sh` script copies a template to this location. You must edit this file to add your actual credentials. The path to this file can be specified in `config_class` with the `credentials_path` argument.
- **Security:** This file is critical and should never be committed to version control.

#### medcat_models/ directory

- **Location:** `your_project_folder/medcat_models/`
- **Purpose:** Stores your pre-trained MedCAT model packs (.zip files) used for clinical text annotation.
- **Setup:** The installation script creates this directory. You must manually place your model pack(s) inside it.

## 3. Runtime Configuration (config_class)

The `config_class` is the central Python object used within your Jupyter notebook (e.g., `example_usage.ipynb`) to control the behavior of a pipeline run. It is highly detailed, allowing for fine-grained control over the entire process.

### Project and Path Configuration

These parameters define the project's file structure and I/O.

- **`proj_name` (str):** The name of your project. This is used to create a root directory for all outputs (e.g., `new_project/`).
- **`suffix` (str):** An optional suffix appended to output sub-folders, allowing you to distinguish between different runs within the same project (e.g., `_run1`).
- **`treatment_doc_filename` (str):** The path to your input CSV file containing the initial patient cohort. This file must contain a column with patient identifiers.
- **`patient_id_column_name` (str):** The name of the column in your cohort CSV that contains the unique patient identifiers (default: `'client_idcode'`).
- **`root_path` (str):** The absolute path to the project's root output directory. If not set, it defaults to `os.getcwd()/proj_name/`.
- **`override_medcat_model_path` (str):** The direct path to the MedCAT model pack (.zip) you want to use. This is the recommended way to specify the model.

### Execution and Operational Control

These flags control how the pipeline runs, its verbosity, and its behavior for testing and performance.

- **`testing` (bool):** Set to `True` to run in testing mode, which uses dummy data generators for rapid debugging. Set to `False` for production runs.
- **`strip_list` (bool):** If `True` (default), the pipeline checks for already processed patients and skips them to avoid redundant work.
- **`verbosity` (int):** Sets the logging level (0-9). A higher number provides more detailed output. 3 is a good default.
- **`random_seed_val` (int):** A seed for random operations to ensure reproducibility.
- **`calculate_vectors` (bool):** If `True` (default), the pipeline generates the final feature vector CSVs. If `False`, it only pre-fetches and saves the raw data batches, which can be useful for debugging the data extraction step.
- **`prefetch_pat_batches` (bool):** If `True`, all raw data for the entire cohort is fetched and stored in memory before processing begins. This can speed up processing but requires significant RAM. It is not compatible with `individual_patient_window`.

### Temporal Window Configuration

This is one of the most critical parts of the configuration, defining how patient data is sliced over time.

#### Global Time Windows

Used when all patients are analyzed over the same fixed time period.

- **`start_date` (datetime):** The anchor date for the time window calculation.
- **`years`, `months`, `days` (int):** The total duration of the time window relative to `start_date`.
- **`lookback` (bool):** Determines the direction of the window. If `True` (default), the window extends backward from `start_date`. If `False`, it extends forward.
- **`time_window_interval_delta` (relativedelta):** The step size for each time slice. For example, `relativedelta(months=1)` creates one feature vector per patient per month.

#### Individual Patient Windows (IPW)

Used for patient-specific time windows, typically anchored to a clinical event (e.g., diagnosis date).

- **`individual_patient_window` (bool):** Set to `True` to enable IPW mode.
- **`individual_patient_window_df` (pd.DataFrame):** A DataFrame containing patient IDs and their corresponding event dates.
- **`individual_patient_id_column_name` (str):** The name of the patient ID column in the `individual_patient_window_df`.
- **`individual_patient_window_start_column_name` (str):** The name of the column containing the anchor dates in the `individual_patient_window_df`.

#### Global Data Boundaries

These parameters set the absolute earliest and latest dates for any data retrieval from Elasticsearch, acting as a hard filter.

- **`global_start_year`, `global_start_month`, `global_start_day` (int/str)**
- **`global_end_year`, `global_end_month`, `global_end_day` (int/str)**

### Feature Selection (main_options)

You can precisely control which features are extracted by creating a dictionary and passing it to the `config_class`. Set a feature to `True` to enable it or `False` to disable it.

```python
# Assumes config_class has already been imported in the notebook.

# 1. Define your feature set
main_options_dict = {
    'demo': True,            # Demographic information
    'bmi': True,             # BMI information
    'bloods': True,          # Blood-related information
    'drugs': True,           # Drug-related information
    'diagnostics': True,     # Diagnostic information
    'core_02': True,         # core_02 information
    'bed': True,             # Bed information
    'vte_status': True,      # VTE status information
    'hosp_site': True,       # Hospital site information
    'core_resus': True,      # Core resuscitation information
    'news': True,            # NEWS (National Early Warning Score)
    'smoking': True,         # Smoking-related information
    'annotations': True,     # EPR document annotations via MedCAT
    'annotations_mrc': True, # MRC annotations via MedCAT
    'negated_presence_annotations': False,  # Negated presence annotations
    'appointments': False,   # Appointments information
    'annotations_reports': False,  # Reports information
    'textual_obs': False,    # Textual observations
}

# 2. Pass the dictionary to your config_class instance
config_obj = config_class(
    main_options=main_options_dict,
    # ... other configuration parameters
)
```

### Cohort and Sampling

- **`use_controls` (bool):** If `True`, generates a control group for a case-control study.
- **`treatment_control_ratio_n` (int):** The ratio of control patients to treatment patients (e.g., 2 for a 2:1 ratio).
- **`all_epr_patient_list_path` (str):** Path to a CSV file containing a master list of all possible patient IDs, used for sampling controls.
- **`sample_treatment_docs` (int):** If set to a number greater than 0, a random sample of that size will be taken from the initial cohort. Useful for quick tests.
- **`shuffle_pat_list` (bool):** If `True`, shuffles the final patient list before processing.

### Advanced and Technical Parameters

- **`split_clinical_notes` (bool):** If `True`, the pipeline attempts to parse dates within long clinical notes and split them into smaller, date-stamped documents for more accurate temporal analysis.
- **`add_icd10` / `add_opc4s` (bool):** If `True`, appends linked ICD-10 or OPCS-4 codes to the MedCAT annotation outputs.
- **`annot_filter_options` (dict):** A dictionary to fine-tune MedCAT annotation filtering, allowing you to set thresholds for confidence, accuracy, and filter by concept types or meta-annotations (e.g., `Presence: True`).
- **`data_type_filter_dict` (dict):** A dictionary to apply term-based filtering on raw data before feature extraction (e.g., only include specific blood tests).
- **`gpu_mem_threshold` (int):** The minimum free GPU memory (in MB) required for MedCAT to be loaded onto a specific GPU.
- **`remote_dump` (bool):** If `True`, saves output files to a remote server via SFTP. Requires hostname, username, and password to be set.
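
Several of the parameters in this guide constrain one another: `prefetch_pat_batches` is not compatible with `individual_patient_window`, IPW mode needs its DataFrame and column names, and `remote_dump` needs connection details. The sketch below shows one way to check a dictionary of keyword arguments against those stated constraints before building a `config_class` instance. The helper `check_config_kwargs` is hypothetical and not part of pat2vec, and the SFTP keyword names (`hostname`, `username`, `password`) are assumed from the wording above, so they may differ from the actual constructor arguments.

```python
from typing import Any


def check_config_kwargs(kwargs: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a dict of config_class keyword arguments.

    Hypothetical helper (not part of pat2vec). It only encodes constraints
    that are stated in this guide.
    """
    problems: list[str] = []
    ipw = bool(kwargs.get("individual_patient_window", False))

    # The guide states that prefetching is not compatible with IPW mode.
    if ipw and kwargs.get("prefetch_pat_batches", False):
        problems.append(
            "prefetch_pat_batches is not compatible with individual_patient_window"
        )

    # IPW mode needs the DataFrame and both column names.
    if ipw:
        for name in (
            "individual_patient_window_df",
            "individual_patient_id_column_name",
            "individual_patient_window_start_column_name",
        ):
            # 'is None' avoids evaluating the truthiness of a DataFrame.
            if kwargs.get(name) is None:
                problems.append(f"{name} must be set when individual_patient_window=True")

    # Remote dumping needs connection details (keyword names are assumed).
    if kwargs.get("remote_dump", False):
        for name in ("hostname", "username", "password"):
            if not kwargs.get(name):
                problems.append(f"{name} must be set when remote_dump=True")

    # The guide documents verbosity as a level from 0 to 9.
    verbosity = kwargs.get("verbosity")
    if verbosity is not None and not 0 <= verbosity <= 9:
        problems.append("verbosity should be between 0 and 9")

    return problems


# Example: IPW mode requested together with prefetching, and no IPW inputs given.
run_kwargs = {
    "proj_name": "new_project",
    "individual_patient_window": True,
    "prefetch_pat_batches": True,
    "verbosity": 3,
}
problems = check_config_kwargs(run_kwargs)
for problem in problems:
    print(problem)
```

If the returned list is empty, the same dictionary can be unpacked into the constructor with `config_class(**run_kwargs)`, assuming every key matches a real constructor argument.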