Frequently Asked Questions (FAQ)

This page answers common questions about setting up and using pat2vec.


General

What is pat2vec?

pat2vec is a Python-based tool designed to transform raw electronic health records (EHR) into structured, time-series feature vectors. This process makes the data suitable for machine learning tasks, particularly binary classification. It can aggregate data at the patient level or construct detailed longitudinal timelines.


Installation & Setup

I’m behind a corporate proxy. How do I install?

The install_pat2vec.sh script for Unix/Linux includes a --proxy flag specifically for this purpose. This flag tells pip to use your organization’s internal package mirror.

./install_pat2vec.sh --proxy

If you are using Windows or the basic install.sh script, you will need to configure pip to use your proxy manually. This is typically done by setting the http_proxy and https_proxy environment variables or by creating and configuring a pip.conf/pip.ini file.

Where do I get a MedCAT model and where do I put it?

You need a pre-trained MedCAT model pack (.zip file). These are typically pre-trained and then fine-tuned with exports from MedCAT trainer for your specific use case and data.

Once you have the model pack, place it in the medcat_models/ directory, which should be in the same parent folder as your pat2vec repository clone. The installation script creates this directory for you. See https://github.com/CogStack/MedCAT.

your_project_folder/
├── medcat_models/
│   └── your_model.zip  <-- Place it here
└── pat2vec/

Where do my Elasticsearch credentials go?

Your credentials should be placed in a file named credentials.py in the parent directory of your pat2vec clone. The install_pat2vec.sh script automatically copies a template for you. If you installed manually, you can copy pat2vec/pat2vec/config/credentials_template.py to the parent directory and edit it.

IMPORTANT: This file contains sensitive information and should never be committed to version control. The root .gitignore file of this project should already be configured to ignore credentials.py.

The structure should look like this:

your_project_folder/
├── credentials.py      <-- Edit this file
└── pat2vec/
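
As a quick sanity check, the layout described above can be verified with a few lines of Python. This is only a sketch: check_layout is a hypothetical helper, not part of pat2vec, and it assumes the folder names shown in this FAQ (pat2vec/, medcat_models/, credentials.py).

```python
from pathlib import Path


def check_layout(project_dir):
    """Return a list of problems found in the expected folder layout.

    Hypothetical helper for illustration; it is not part of pat2vec.
    """
    root = Path(project_dir)
    problems = []
    if not (root / "pat2vec").is_dir():
        problems.append("missing pat2vec/ clone")
    if not (root / "credentials.py").is_file():
        problems.append("missing credentials.py")
    models_dir = root / "medcat_models"
    if not models_dir.is_dir():
        problems.append("missing medcat_models/ directory")
    elif not any(models_dir.glob("*.zip")):
        problems.append("no .zip model pack in medcat_models/")
    return problems


# Example: pass the folder that contains your pat2vec clone.
# for problem in check_layout("your_project_folder"):
#     print("WARNING:", problem)
```

An empty list means all three items were found where this FAQ says they should be.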

What is the snomed_methods repository?

snomed_methods is a helper repository containing utility functions and methods related to SNOMED-CT and other clinical terminologies used in conjunction with this project. It is a dependency for certain feature extraction methods and is cloned automatically by the install_pat2vec.sh script.

The installation script failed. What should I do?

  1. Check Python Version: Ensure you are using Python 3.10 or higher.

  2. Check venv: Make sure the python3-venv package (or your OS equivalent) is installed.

  3. Run with --force: If you have a partially completed or corrupted installation, try running the script again with the --force flag. This will remove the existing pat2vec_env directory and perform a clean installation.

    ./install_pat2vec.sh --force
    
  4. Check Permissions: Ensure you have write permissions in the directory where you are running the script. The script needs to create directories and files one level above the pat2vec directory.

  5. Review Logs: Read the error messages in the terminal carefully. They often point to the exact package or command that failed.
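
Steps 1, 2 and 4 above can be checked before re-running the script. The following is a minimal sketch, not something shipped with pat2vec: preflight is a hypothetical helper, and it assumes that a missing ensurepip module is a reasonable signal that python3-venv (or its equivalent) is not installed.

```python
import importlib.util
import os
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in step 1


def preflight(parent_dir, version=None):
    """Return a list of likely causes for a failed installation.

    Hypothetical helper for illustration; it is not part of pat2vec.
    """
    version = tuple((version or sys.version_info)[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found; 3.10 or higher is required"
        )
    # "python -m venv" relies on the venv and ensurepip modules.
    for module in ("venv", "ensurepip"):
        if importlib.util.find_spec(module) is None:
            problems.append(
                f"module '{module}' is unavailable; "
                "install python3-venv or your OS equivalent"
            )
    # The installer writes one level above the pat2vec directory.
    if not os.access(parent_dir, os.W_OK):
        problems.append(f"no write permission in {parent_dir}")
    return problems


# Example, run from inside the pat2vec clone:
# for problem in preflight(os.path.dirname(os.getcwd())):
#     print("WARNING:", problem)
```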


Usage

What format does my input data need to be in?

Your primary input should be a CSV file. The only strict requirement is that this file contains a column named client_idcode, which holds the unique identifier of each patient in your cohort.

If you are performing time-series analysis, you will also need a column containing the reference date for each patient (e.g., a diagnosis date) to align the data correctly.
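
Before starting a run, it can help to confirm that the CSV meets these requirements. The sketch below uses only the standard library. validate_cohort_csv is a hypothetical helper, not part of pat2vec, and the date column name in the usage comment (diagnosis_date) is an assumption chosen for illustration.

```python
import csv


def validate_cohort_csv(path, id_column="client_idcode", date_column=None):
    """Check the required columns and return the number of distinct patient IDs.

    Hypothetical helper for illustration; it is not part of pat2vec.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        required = [id_column] + ([date_column] if date_column else [])
        missing = [name for name in required if name not in header]
        if missing:
            raise ValueError(f"missing column(s): {', '.join(missing)}")
        ids = [row[id_column] for row in reader]
    # An empty identifier cannot be matched to a patient.
    if any(not (value or "").strip() for value in ids):
        raise ValueError(f"empty value found in column '{id_column}'")
    return len(set(ids))


# Example (file and date column names are placeholders):
# n_patients = validate_cohort_csv("cohort.csv", date_column="diagnosis_date")
```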

How do I choose which features to extract?

Feature extraction is controlled via the main_options dictionary passed to the config_class. Each feature type can be enabled or disabled by setting its value to True or False. This modular design allows you to easily customize the feature set for your research needs.

Here is an example configuration snippet:

# 1. Define your feature set
main_options_dict = {
    'demo': True,           # Demographic information
    'bmi': True,            # BMI information
    'bloods': True,         # Blood-related information
    'drugs': True,          # Drug-related information
    'diagnostics': True,    # Diagnostic information
    'core_02': True,        # core_02 information
    'bed': True,            # Bed information
    'vte_status': True,     # VTE status information
    'hosp_site': True,      # Hospital site information
    'core_resus': True,     # Core resuscitation information
    'news': True,           # NEWS (National Early Warning Score)
    'smoking': True,        # Smoking-related information
    'annotations': True,    # EPR document annotations via MedCAT
    'annotations_mrc': True,  # MRC annotations via MedCAT
    'negated_presence_annotations': False,  # Negated presence annotations
    'appointments': False,  # Appointments information
    'annotations_reports': False,  # Reports information
    'textual_obs': False,   # Textual observations
}

# 2. Pass the dictionary to your config_class instance
config_obj = config_class(
    main_options=main_options_dict,
    # ... other configuration parameters
)
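
Because every flag is a plain boolean, it is easy to review the dictionary before passing it on. The following sketch is a hypothetical helper, not part of pat2vec. It rejects non-boolean values, which would otherwise be easy to miss (for example the string 'False', which Python treats as truthy), and returns the enabled and disabled feature names.

```python
def summarize_options(main_options):
    """Split a main_options dictionary into enabled and disabled feature names.

    Hypothetical helper for illustration; it is not part of pat2vec.
    """
    invalid = [name for name, value in main_options.items()
               if not isinstance(value, bool)]
    if invalid:
        raise TypeError(f"non-boolean value(s) for: {', '.join(invalid)}")
    enabled = [name for name, value in main_options.items() if value]
    disabled = [name for name, value in main_options.items() if not value]
    return enabled, disabled


# Example with a reduced dictionary:
# enabled, disabled = summarize_options({'demo': True, 'appointments': False})
# print("Enabled:", enabled)
# print("Disabled:", disabled)
```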