pat2vec.main_pat2vec

Classes

main([cogstack, use_filter, ...])

The main orchestrator for the pat2vec feature extraction pipeline.

class pat2vec.main_pat2vec.main(cogstack=True, use_filter=False, json_filter_path=None, random_seed_val=42, hostname=None, config_obj=None)

Bases: object

The main orchestrator for the pat2vec feature extraction pipeline.

This class manages the entire workflow of processing patient data to generate time-sliced feature vectors. It initializes the pipeline based on a configuration object, connects to data sources like CogStack, prepares a list of patients, and orchestrates the feature extraction process for each patient.

The typical workflow is as follows:

  1. An instance of this class is created with a config_obj that defines all pipeline parameters (e.g., time windows, enabled features, paths).

  2. It establishes a connection to the data source (e.g., Elasticsearch via CogStack).

  3. It retrieves or generates a list of patients to be processed.

  4. It can pre-fetch all necessary raw data batches for the entire patient cohort if prefetch_pat_batches is enabled in the configuration.

  5. For each patient, it iterates through the defined time windows.

  6. For each time slice, it calls the main_batch function, which in turn calls the individual feature extraction modules (e.g., for demographics, bloods, NLP annotations) to generate a feature vector.

  7. The resulting feature vector is saved to a file.

This class relies heavily on the config_obj for its behavior.
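The workflow above can be pictured as two nested loops: patients on the outside, time slices on the inside. The following is a minimal, self-contained sketch of that shape only, not the library's implementation. ToyConfig, make_date_list, toy_main_batch and run_toy_pipeline are hypothetical stand-ins, and the configuration field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ToyConfig:
    """Hypothetical stand-in for config_obj; field names are illustrative only."""
    start_date: datetime = datetime(2020, 1, 1)
    n_slices: int = 3
    slice_length: timedelta = timedelta(days=30)
    enabled_features: tuple = ("demographics", "bloods")


def make_date_list(cfg):
    """Return the start date of each time slice."""
    return [cfg.start_date + i * cfg.slice_length for i in range(cfg.n_slices)]


def toy_main_batch(patient_id, slice_start, cfg):
    """Stand-in for main_batch: build one feature vector for one slice."""
    vector = {
        "patient_id": patient_id,
        "slice_start": slice_start.date().isoformat(),
    }
    for name in cfg.enabled_features:
        vector[name] = 0.0  # placeholder; a real extractor would compute this
    return vector


def run_toy_pipeline(patient_ids, cfg):
    """Loop over patients, then over time slices: one vector per slice."""
    vectors = []
    for patient_id in patient_ids:
        for slice_start in make_date_list(cfg):
            vectors.append(toy_main_batch(patient_id, slice_start, cfg))
    return vectors


vectors = run_toy_pipeline(["P001", "P002"], ToyConfig())
print(len(vectors))  # one vector per (patient, slice) pair
```

In the real pipeline each vector is written to its own file rather than collected in a list; the list here only keeps the sketch easy to inspect.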

Parameters:
  • cogstack (bool)

  • use_filter (bool)

  • json_filter_path (str | None)

  • random_seed_val (int)

  • hostname (str | None)

  • config_obj (Any | None)

config_obj

The configuration object that controls the pipeline.

Type:

config_class

cs

An instance of the CogStack client for data retrieval.

Type:

CogStack

all_patient_list

The list of patient IDs to be processed.

Type:

list

cat

A MedCAT instance for clinical text annotation if required.

Type:

MedCAT

t

A progress bar for monitoring the process.

Type:

tqdm.trange

__init__(cogstack=True, use_filter=False, json_filter_path=None, random_seed_val=42, hostname=None, config_obj=None)

Initializes the main pat2vec pipeline orchestrator.

This constructor sets up the pipeline environment, including data source connections, patient lists, and NLP models, based on the provided configuration.

Parameters:
  • cogstack (bool) – If True, connects to a CogStack Elasticsearch instance. If False, a dummy searcher is used for testing.

  • use_filter (bool) – If True, applies a CUI filter to the MedCAT model.

  • json_filter_path (Optional[str]) – Path to a JSON file containing the CUI filter.

  • random_seed_val (int) – The random seed for reproducibility.

  • hostname (Optional[str]) – Deprecated. SFTP settings are now in the config object.

  • config_obj (Optional[Any]) – The main configuration object. If None, a default configuration is created.

pat_maker(i)

Orchestrates the entire feature extraction process for a single patient.

This method is the primary worker function for processing one patient from the cohort. It manages the patient’s specific time window, pre-fetches all necessary raw data, and then iterates through each time slice to generate feature vectors.

The key steps for each patient are:

  1. Check for Completion: Skips the patient if their feature vectors have already been generated, based on the stripped_list_start.

  2. Set Time Window: If individual_patient_window is enabled, it calculates and sets the specific start and end dates for this patient, overriding the global time window. Primary and control patients are handled differently.

  3. Pre-fetch Data Batches: It calls various get_pat_batch_* functions to retrieve all required data for the patient across their entire time window. This includes demographics, bloods, medications, clinical notes (EPR, MRC), reports, and other observations.

  4. Pre-generate Annotations: If text-based features are enabled (e.g., annotations, annotations_mrc), it processes the fetched clinical notes with MedCAT to generate all annotations for the patient upfront.

  5. Data Cleaning: Performs initial cleaning on the fetched batches, such as dropping records with missing timestamps.

  6. Iterate and Process Slices: It loops through each time slice defined in the patient’s date_list. For each slice, it calls main_batch, passing all the pre-fetched data. main_batch is responsible for filtering the data for that specific slice and generating the final feature vector CSV file.
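Steps 1, 5 and 6 follow a common pattern: skip finished work, clean the pre-fetched batch once, then filter that batch per time slice. The sketch below illustrates the pattern with plain Python data and is not the library's code. toy_pat_maker and its arguments are hypothetical, the record layout is invented for the example, and the half-open slice interval (start inclusive, end exclusive) is an assumption rather than documented behaviour.

```python
from datetime import datetime, timedelta


def toy_pat_maker(patient_id, completed_ids, raw_records, date_list, slice_length):
    """Toy per-patient flow; returns the number of records found in each slice."""
    # Step 1: skip patients whose feature vectors already exist.
    if patient_id in completed_ids:
        return None

    # Step 5: drop records with a missing timestamp before slicing.
    cleaned = [r for r in raw_records if r.get("timestamp") is not None]

    # Step 6: filter the pre-fetched batch down to each time slice.
    counts = {}
    for slice_start in date_list:
        slice_end = slice_start + slice_length
        in_slice = [r for r in cleaned if slice_start <= r["timestamp"] < slice_end]
        counts[slice_start.date().isoformat()] = len(in_slice)
    return counts


records = [
    {"timestamp": datetime(2020, 1, 5), "value": 4.2},
    {"timestamp": datetime(2020, 2, 10), "value": 3.9},
    {"timestamp": None, "value": 5.0},  # dropped during cleaning
]
date_list = [datetime(2020, 1, 1), datetime(2020, 1, 31)]

result = toy_pat_maker("P001", set(), records, date_list, timedelta(days=30))
skipped = toy_pat_maker("P001", {"P001"}, records, date_list, timedelta(days=30))
```

Fetching once per patient and filtering in memory trades memory for fewer round trips to the data source, which matches the side effects listed for this method.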

Parameters:

i (int) – The index of the patient within self.all_patient_list to be processed.

Return type:

None

Side Effects:
  • Creates output directories for the patient’s feature vectors if they do not exist.

  • Fetches potentially large amounts of data from the source (e.g., Elasticsearch) and holds it in memory for processing.

  • Calls main_batch which results in writing one CSV file per time slice for the patient.

  • Updates the tqdm progress bar to reflect the current status.

  • Can modify self.config_obj attributes (like date_list and global start/end dates) on-the-fly when individual_patient_window is enabled.

Returns:

This method orchestrates the processing pipeline and manages file I/O, but it does not return any value.
