pat2vec.util.medcat_misc_methods

Functions

`create_ner_results_dataframe`(fps, fns, tps, ...)	Creates a Pandas DataFrame from NER evaluation dictionaries.
`extract_labels_from_medcat_annotation_export`(df, ...)	Extracts and validates labels from a MedCAT annotation export.
`manually_label_annotation_df`(df[, ...])	Interactively labels an annotation DataFrame.
`medcat_trainer_export_to_df`(file_path)	Converts a MedCATTrainer export JSON file to a pandas DataFrame.
`parse_medcat_trainer_project_json`(json_path)	Parses a MedCAT trainer project JSON into a structured DataFrame.
`plot_ner_results`(results_df)	Generates plots to visualize NER (Named Entity Recognition) evaluation results.
`recreate_json`(df[, output_file])	Converts an exported MedCAT trainer DataFrame back to a training JSON.

pat2vec.util.medcat_misc_methods.medcat_trainer_export_to_df(file_path)

Converts a MedCATTrainer export JSON file to a pandas DataFrame.

Parameters: file_path (str) – Path to the JSON file containing MedCATTrainer export data.
Return type: DataFrame
Returns: A DataFrame containing the extracted data, with each row representing a single annotation.
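
To make "each row representing a single annotation" concrete, here is a minimal sketch of the flattening step. It is not the library's implementation. The nested `projects` / `documents` / `annotations` layout, the field names, and the helper `flatten_trainer_export` are assumptions made for illustration.

```python
def flatten_trainer_export(export):
    """Return one flat record per annotation (export layout is assumed)."""
    rows = []
    for project in export.get("projects", []):
        for doc in project.get("documents", []):
            for ann in doc.get("annotations", []):
                rows.append({
                    "project_name": project.get("name"),
                    "document_id": doc.get("id"),
                    "text": doc.get("text"),
                    "cui": ann.get("cui"),
                    "start": ann.get("start"),
                    "end": ann.get("end"),
                    "value": ann.get("value"),
                })
    return rows


# Tiny hand-written export, used only for illustration.
example_export = {
    "projects": [{
        "name": "demo",
        "documents": [{
            "id": 1,
            "text": "Patient has asthma and eczema.",
            "annotations": [
                {"cui": "C001", "start": 12, "end": 18, "value": "asthma"},
                {"cui": "C002", "start": 23, "end": 29, "value": "eczema"},
            ],
        }],
    }]
}

rows = flatten_trainer_export(example_export)
```

A list of records like `rows` can be passed to `pandas.DataFrame(rows)` to obtain a table with one annotation per row.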

pat2vec.util.medcat_misc_methods.extract_labels_from_medcat_annotation_export(df, human_labels, window=300, output_file=None)

Extracts and validates labels from a MedCAT annotation export.

This function compares annotations from a MedCAT trainer export (df) with a set of human-labeled data (human_labels). It matches them based on the text content and source value, then validates the annotation based on its meta-annotations (Subject, Presence, Time). The result is stored in an ‘extracted_label’ column in the human_labels DataFrame.

Parameters:

df (DataFrame) – The trainer output in DataFrame form (from medcat_trainer_export_to_df).
human_labels (DataFrame) – The DataFrame containing human-labeled text samples.
window (int) – The window size for extracting text samples for comparison.
output_file (Optional[str]) – An optional file path to save the processed DataFrame.

Return type:

DataFrame

Returns:

The processed human_labels DataFrame with the ‘extracted_label’ column.
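
The two ideas in the description, a text window around an annotation and a check on the Subject, Presence and Time meta-annotations, can be sketched as follows. This is an illustration only. The accepted values in `ACCEPTED`, the way the window is applied, and both helper names are assumptions; the library defines the actual criteria.

```python
def context_window(text, start, end, window=300):
    """Return the annotated span plus up to `window` characters on each side."""
    return text[max(0, start - window): end + window]


# Hypothetical accepted values; not taken from the library.
ACCEPTED = {"Subject": "Patient", "Presence": "True", "Time": "Recent"}


def is_valid_annotation(meta_anns):
    """True only if every meta-annotation task has its accepted value."""
    return all(meta_anns.get(task) == value for task, value in ACCEPTED.items())


snippet = context_window("abcdefghij", 4, 6, window=2)
ok = is_valid_annotation({"Subject": "Patient", "Presence": "True", "Time": "Recent"})
rejected = is_valid_annotation({"Subject": "Patient", "Presence": "False", "Time": "Recent"})
```

Under these assumptions, an annotation would count as a positive label only when all three checks pass.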

pat2vec.util.medcat_misc_methods.recreate_json(df, output_file=None)

Converts an exported MedCAT trainer DataFrame back to a training JSON.

This function takes a DataFrame (as produced by medcat_trainer_export_to_df) and reconstructs the original JSON structure required for training a MedCAT model.

Parameters:

df (DataFrame) – DataFrame containing exported data from a MedCAT trainer project.
output_file (Optional[str]) – Optional file path to save the generated JSON.

Return type:

str

Returns:

A JSON string representing the MedCAT training data.
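
As a rough illustration of the reverse direction, the sketch below groups flat annotation records by document and serialises them as nested JSON. The record keys, the output layout, and the helper `rows_to_training_json` are assumptions. The structure MedCAT expects for training may contain more fields than shown here.

```python
import json


def rows_to_training_json(rows, project_name="example_project"):
    """Group flat annotation records by document and return a JSON string."""
    documents = {}
    for row in rows:
        doc = documents.setdefault(
            row["document_id"],
            {"id": row["document_id"], "text": row["text"], "annotations": []},
        )
        doc["annotations"].append({
            "cui": row["cui"],
            "start": row["start"],
            "end": row["end"],
            "value": row["value"],
        })
    payload = {"projects": [{"name": project_name,
                             "documents": list(documents.values())}]}
    return json.dumps(payload, indent=2)


# Hand-written records, used only for illustration.
flat_rows = [
    {"document_id": 1, "text": "Patient has asthma and eczema.",
     "cui": "C001", "start": 12, "end": 18, "value": "asthma"},
    {"document_id": 1, "text": "Patient has asthma and eczema.",
     "cui": "C002", "start": 23, "end": 29, "value": "eczema"},
]
training_json = rows_to_training_json(flat_rows)
```

Writing the returned string to a file corresponds to what the optional `output_file` argument is documented to do.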

pat2vec.util.medcat_misc_methods.manually_label_annotation_df(df, file_path='human_labels.csv', confirmatory=False, verbose=False, filter_codes_list=[])

Interactively labels an annotation DataFrame.

This function loops over an annotation DataFrame, displays annotations for unique client ID codes, and prompts the user for a label (1 for correct, 0 for incorrect). The process continues until all annotations for a client (matching filter_codes_list) are confirmed. The labels are saved to a CSV file.

Parameters:

df (DataFrame) – The DataFrame to annotate.
file_path (str) – The file path to store the human labels.
confirmatory (bool) – If True, skips clients who already have a confirmed correct annotation for the given filter codes.
verbose (bool) – If True, prints verbose output.
filter_codes_list (List[List[str]]) – A list of CUI code lists. A client is considered “done” when they have a correct annotation for each list of codes.

Return type:

None
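
The "done" rule described for `filter_codes_list` can be sketched as a small predicate: a client is finished once at least one code from every inner list has a correct annotation. The helper `client_is_done` is hypothetical and is not part of the library.

```python
def client_is_done(correct_cuis, filter_codes_list):
    """True if every inner code list has at least one correctly labelled CUI.

    Note: with an empty filter_codes_list this sketch returns True; the
    library's behaviour for its default empty list may differ.
    """
    return all(
        any(code in correct_cuis for code in codes)
        for codes in filter_codes_list
    )


# Two requirements: (C1 or C2) and C3. CUI values are made up.
filters = [["C1", "C2"], ["C3"]]
finished = client_is_done({"C2", "C3"}, filters)
unfinished = client_is_done({"C1"}, filters)
```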

pat2vec.util.medcat_misc_methods.parse_medcat_trainer_project_json(json_path)

Parses a MedCAT trainer project JSON into a structured DataFrame.

This function reads a JSON file exported from a MedCAT trainer project. It handles various formats (nested JSON, list of JSON strings) and normalizes the data into a flat DataFrame where each row represents a single annotation with its associated document and project metadata.

Parameters: json_path (str) – Path to the JSON file from a MedCAT trainer export.
Return type: DataFrame
Returns: A DataFrame containing parsed and structured data, including project and document details, annotations, and their meta-annotations.

Notes

Handles nested JSON structures and safely converts JSON strings.
Explodes ‘cuis’ and ‘documents’ columns to create detailed rows.
Extracts meta-annotation details into separate columns.
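
One way to accept both a nested JSON object and a list of JSON strings is to parse strings defensively and then normalise everything to a list of project dictionaries. The sketch below shows that idea. The helper names and the `projects` key are assumptions, and the library may handle these cases differently.

```python
import json


def safe_json_load(value):
    """Parse `value` if it is a valid JSON string; otherwise return it unchanged."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value
    return value


def normalise_export(raw):
    """Return a list of project dicts from either input shape (assumed layouts)."""
    raw = safe_json_load(raw)
    if isinstance(raw, dict):
        # Nested form: {"projects": [...]}; a bare project dict is wrapped.
        return raw.get("projects", [raw])
    if isinstance(raw, list):
        # List form: each item may itself be a JSON string.
        return [safe_json_load(item) for item in raw]
    return []


from_nested = normalise_export({"projects": [{"name": "p"}]})
from_strings = normalise_export(['{"name": "p"}'])
```

After this step, both inputs have the same shape and can be flattened by the same code path.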

pat2vec.util.medcat_misc_methods.create_ner_results_dataframe(fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, cat=None)

Creates a Pandas DataFrame from NER evaluation dictionaries.

Parameters:

fps (dict) – Dictionary of false positives with CUI as keys.
fns (dict) – Dictionary of false negatives with CUI as keys.
tps (dict) – Dictionary of true positives with CUI as keys.
cui_prec (dict) – Dictionary of CUI-based precision with CUI as keys.
cui_rec (dict) – Dictionary of CUI-based recall with CUI as keys.
cui_f1 (dict) – Dictionary of CUI-based F1-score with CUI as keys.
cui_counts (dict) – Dictionary of CUI counts with CUI as keys.
cat (optional) – A MedCAT object. If passed, the preferred name of each CUI is added.

Returns:

A DataFrame indexed by CUI, with columns fps, fns, tps, cui_prec, cui_rec, cui_f1, and cui_counts, plus the preferred name of each CUI when a MedCAT cat object is passed.

Return type:

pandas.DataFrame
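
The sketch below shows how several dictionaries keyed by CUI can be merged into a single table keyed by CUI. It uses plain dictionaries and a hypothetical helper, `combine_metrics`, so it is not the library's code. The CUI values and numbers are made up.

```python
def combine_metrics(**metric_dicts):
    """Merge {cui: value} dicts into {cui: {metric_name: value}}.

    A CUI missing from one dictionary gets None for that metric.
    """
    cuis = sorted(set().union(*(d.keys() for d in metric_dicts.values())))
    return {
        cui: {name: d.get(cui) for name, d in metric_dicts.items()}
        for cui in cuis
    }


table = combine_metrics(
    fps={"C1": 2},
    fns={"C1": 2, "C2": 1},
    tps={"C1": 8, "C2": 3},
    cui_prec={"C1": 0.8, "C2": 1.0},
    cui_rec={"C1": 0.8, "C2": 0.75},
    cui_f1={"C1": 0.8},
    cui_counts={"C1": 10, "C2": 4},
)
```

A nested dictionary of this shape can be turned into a DataFrame with CUI as the index via `pandas.DataFrame.from_dict(table, orient="index")`.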

pat2vec.util.medcat_misc_methods.plot_ner_results(results_df)

Generates plots to visualize NER (Named Entity Recognition) evaluation results.

This function creates a series of plots to help analyze the performance of an NER model, including F1-scores, precision-recall, error analysis, and the relationship between concept frequency and performance.

Parameters: results_df (DataFrame) – A DataFrame containing NER evaluation metrics, which must include ‘cui_name’, ‘cui_f1’, ‘cui_prec’, ‘cui_rec’, ‘fps’, ‘fns’, ‘tps’, and ‘cui_counts’.
Return type: None
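
Because the function requires a fixed set of columns, it can help to check for them before plotting. The helper below is a hypothetical pre-check, not part of the library. It works on any iterable of column names.

```python
REQUIRED_COLUMNS = {
    "cui_name", "cui_f1", "cui_prec", "cui_rec",
    "fps", "fns", "tps", "cui_counts",
}


def missing_plot_columns(columns):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(REQUIRED_COLUMNS - set(columns))


missing = missing_plot_columns(["cui_name", "cui_f1"])
```

With a DataFrame, the same check would be `missing_plot_columns(results_df.columns)`. An empty list means all required columns are present.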