pat2vec.pat2vec_pat_list.get_patient_treatment_list
Functions
analyze_client_codes | Analyzes and clusters client ID codes based on their structure. |
extract_treatment_id_list_from_docs | Retrieves a list of unique client IDs from a treatment document. |
generate_control_list | Generates and saves a list of control patients. |
get_all_patients_list | Extracts and prepares the final list of all patient IDs for the pipeline. |
sanitize_hospital_ids | Sanitizes a list of hospital IDs by converting them to uppercase. |
- pat2vec.pat2vec_pat_list.get_patient_treatment_list.extract_treatment_id_list_from_docs(config_obj)
Retrieves a list of unique client IDs from a treatment document.
This function reads a CSV or XLSX file specified in the configuration, identifies the column containing patient IDs (either explicitly or via auto-detection), and returns a list of unique IDs. It can also sample the document to a smaller size if configured.
- Parameters:
config_obj (Any) – A configuration object containing parameters like treatment_doc_filename, patient_id_column_name, and sample_treatment_docs.
- Return type:
List[str]
- Returns:
A list of unique client IDs from the treatment document.
- Raises:
ValueError – If the file format is not CSV or XLSX.
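The read, detect, and de-duplicate flow described above can be sketched with the standard library alone. This is a minimal illustration, not the pat2vec implementation: the name `unique_ids_from_csv`, its parameters, and the "first header containing `id`" detection rule are all assumptions, and XLSX handling is omitted because it needs a third-party reader.

```python
import csv
import io
import random
from typing import List, Optional, TextIO


def unique_ids_from_csv(
    handle: TextIO,
    id_column: Optional[str] = None,
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> List[str]:
    """Return unique IDs from a CSV stream, preserving first-seen order."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []

    if id_column is None:
        # Assumed auto-detection rule: first header that contains "id".
        candidates = [name for name in fieldnames if "id" in name.lower()]
        if not candidates:
            raise ValueError("no ID-like column found")
        id_column = candidates[0]
    elif id_column not in fieldnames:
        raise ValueError(f"column {id_column!r} not in file")

    rows = list(reader)
    if sample_size is not None and sample_size < len(rows):
        # Optional down-sampling of the document before IDs are collected.
        rows = random.Random(seed).sample(rows, sample_size)

    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(row[id_column] for row in rows if row[id_column]))


demo = io.StringIO("client_idcode,drug\nA123456,x\nB000001,y\nA123456,z\n")
print(unique_ids_from_csv(demo))  # ['A123456', 'B000001']
```

Passing `id_column` explicitly skips the detection step, which is the safer choice when a file has several ID-like headers.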
- pat2vec.pat2vec_pat_list.get_patient_treatment_list.generate_control_list(treatment_client_id_list, treatment_control_ratio_n, control_list_path='control_list.pkl', all_epr_patient_list_path='none', verbosity=0)
Generates and saves a list of control patients.
This function creates a control group by taking a master list of all patient IDs, removing the IDs from the provided treatment list, and then randomly sampling from the remaining pool. The size of the control group is determined by the size of the treatment group and the specified ratio. The resulting list of control IDs is saved to a pickle file.
- Parameters:
treatment_client_id_list (List[str]) – A list of client IDs for the treatment group.
treatment_control_ratio_n (int) – The desired ratio of control patients to treatment patients (e.g., 2 for a 2:1 ratio).
control_list_path (str) – The file path to save the generated control list pickle file.
all_epr_patient_list_path (str) – The file path to the CSV containing all possible patient IDs.
verbosity (int) – The level of verbosity for logging.
- Return type:
List[str]
- Returns:
A list of client IDs for the generated control group.
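The sampling logic described above (remove treated IDs from the master list, draw ratio × treatment-size controls, pickle the result) might look roughly like the following. It is a sketch under stated assumptions: `sample_controls` is a hypothetical name, the master list is passed in memory rather than read from a CSV path, and capping the draw at the pool size is a choice made here, not documented behaviour of `generate_control_list`.

```python
import os
import pickle
import random
import tempfile
from typing import Iterable, List


def sample_controls(
    treatment_ids: Iterable[str],
    all_ids: Iterable[str],
    ratio_n: int,
    out_path: str = "control_list.pkl",
    seed: int = 0,
) -> List[str]:
    """Draw ratio_n controls per treated patient and pickle the result."""
    treated = set(treatment_ids)
    # Sort the pool so the draw is reproducible for a given seed.
    pool = sorted(set(all_ids) - treated)
    # Assumption: cap at the pool size, since random.sample raises
    # ValueError when asked for more items than the population holds.
    k = min(ratio_n * len(treated), len(pool))
    controls = random.Random(seed).sample(pool, k)
    with open(out_path, "wb") as fh:
        pickle.dump(controls, fh)
    return controls


# Usage: 2 treated patients, 10 patients overall, 2:1 ratio -> 4 controls.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "control_list.pkl")
    everyone = [f"A{n:06d}" for n in range(1, 11)]
    controls = sample_controls(["A000001", "A000002"], everyone, ratio_n=2, out_path=path)
    with open(path, "rb") as fh:
        saved = pickle.load(fh)
```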
- pat2vec.pat2vec_pat_list.get_patient_treatment_list.sanitize_hospital_ids(hospital_ids, config_obj)
Sanitizes a list of hospital IDs by converting them to uppercase.
This function iterates through a list of hospital IDs, converts each to uppercase, and provides warnings if the IDs do not conform to the expected format (e.g., one letter followed by six digits).
- Parameters:
hospital_ids (List[str]) – A list of hospital IDs to be sanitized.
config_obj (Any) – A configuration object containing the verbosity and sanitize_pat_list flags.
- Return type:
List[str]
- Returns:
The sanitized list of hospital IDs.
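As an illustration of the uppercase-and-warn behaviour described above, here is a small standalone sketch. The name `sanitize_ids`, the `warn` flag (standing in for the config object's flags), and the exact regex are assumptions; only the "one letter followed by six digits" format comes from the description.

```python
import re
import warnings
from typing import List

# Assumed pattern for "one letter followed by six digits".
EXPECTED_ID = re.compile(r"[A-Z][0-9]{6}")


def sanitize_ids(hospital_ids: List[str], warn: bool = True) -> List[str]:
    """Uppercase every ID and warn about ones that miss the expected format."""
    cleaned = []
    for raw in hospital_ids:
        value = raw.upper()
        if warn and not EXPECTED_ID.fullmatch(value):
            warnings.warn(f"ID {value!r} is not one letter plus six digits")
        # Non-conforming IDs are kept; this sketch only reports them.
        cleaned.append(value)
    return cleaned


print(sanitize_ids(["a123456", "B000001"]))  # ['A123456', 'B000001']
```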
- pat2vec.pat2vec_pat_list.get_patient_treatment_list.get_all_patients_list(config_obj)
Extracts and prepares the final list of all patient IDs for the pipeline.
This function serves as the main entry point for generating the patient cohort. It orchestrates several steps:
1. Extracts the initial list of patient IDs from a treatment document, an individual patient window (IPW) DataFrame, or a test data file.
2. If use_controls is enabled in the config, generates a corresponding list of control patient IDs and appends them to the main list.
3. Sanitizes the final list of IDs (e.g., converts to uppercase).
4. Optionally samples the final list down to a smaller size if sample_treatment_docs is configured.
- Parameters:
config_obj (Any) – The main configuration object containing all necessary parameters.
- Return type:
List[str]
- Returns:
A list of all patient IDs to be processed by the pipeline.
- Raises:
ValueError – If required configuration parameters are missing (e.g., test_data_path in testing mode).
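The order of the steps listed above can be sketched end to end on in-memory lists. Everything here is hypothetical scaffolding: `CohortConfig` is a stand-in for the real configuration object, `sample_size` is an invented field, and `build_cohort` takes the ID lists directly instead of reading them from files.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class CohortConfig:
    """Hypothetical stand-in for the pipeline's configuration object."""
    use_controls: bool = False
    treatment_control_ratio_n: int = 1
    sample_size: Optional[int] = None  # invented field for this sketch
    seed: int = 0


def build_cohort(
    treatment_ids: Sequence[str], all_ids: Sequence[str], cfg: CohortConfig
) -> List[str]:
    # Step 1: start from the unique treatment IDs, in first-seen order.
    ids = list(dict.fromkeys(treatment_ids))

    # Step 2: optionally append controls drawn from the untreated pool.
    if cfg.use_controls:
        pool = sorted(set(all_ids) - set(ids))
        k = min(cfg.treatment_control_ratio_n * len(ids), len(pool))
        ids += random.Random(cfg.seed).sample(pool, k)

    # Step 3: sanitize (here, uppercase only).
    ids = [value.upper() for value in ids]

    # Step 4: optionally down-sample the final list.
    if cfg.sample_size is not None and cfg.sample_size < len(ids):
        ids = random.Random(cfg.seed).sample(ids, cfg.sample_size)
    return ids


cfg = CohortConfig(use_controls=True, treatment_control_ratio_n=2)
cohort = build_cohort(["a000001"], ["A000002", "A000003", "A000004"], cfg)
print(len(cohort))  # 3: one treated patient plus two controls
```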
- pat2vec.pat2vec_pat_list.get_patient_treatment_list.analyze_client_codes(client_idcode_list, min_val=3)
Analyzes and clusters client ID codes based on their structure.
This function separates a list of client IDs into valid and invalid groups based on a regex pattern (e.g., ‘A123456’). It then uses KMeans clustering on the valid codes to identify potential subgroups based on their prefix and the sum of their digits.
- Parameters:
client_idcode_list (List[str]) – A list of client ID codes to analyze.
min_val (int) – The minimum number of clusters to create. Defaults to 3.
- Return type:
Dict[str, Any]
- Returns:
A dictionary containing ‘valid_codes’, ‘invalid_codes’, and ‘clusters’.
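The validation split and the two features named above (prefix and digit sum) can be illustrated without any clustering library. This sketch covers only those two steps and omits the KMeans stage. The regex, the numeric encoding of the prefix letter, and the name `split_and_featurize` are assumptions.

```python
import re
from typing import Any, Dict, List

# Assumed pattern, inferred from the example code 'A123456'.
CODE_PATTERN = re.compile(r"[A-Z][0-9]{6}")


def split_and_featurize(codes: List[str]) -> Dict[str, Any]:
    """Split codes into valid/invalid and build (prefix, digit-sum) features."""
    valid: List[str] = []
    invalid: List[str] = []
    for code in codes:
        (valid if CODE_PATTERN.fullmatch(code) else invalid).append(code)

    # One feature row per valid code: prefix letter as 0..25, then digit sum.
    features = [
        (ord(code[0]) - ord("A"), sum(int(digit) for digit in code[1:]))
        for code in valid
    ]
    return {"valid_codes": valid, "invalid_codes": invalid, "features": features}


result = split_and_featurize(["A123456", "B000009", "bad", "C12345"])
print(result["features"])  # [(0, 21), (1, 9)]
```

The feature rows could then be handed to a clusterer such as scikit-learn's `KMeans`; how the real function chooses the number of clusters beyond `min_val` is not stated here.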