pat2vec.util.medcat_misc_methods
Functions
create_ner_results_dataframe | Creates a pandas DataFrame from NER evaluation dictionaries.
extract_labels_from_medcat_annotation_export | Extracts and validates labels from a MedCAT annotation export.
manually_label_annotation_df | Interactively labels an annotation DataFrame.
medcat_trainer_export_to_df | Converts a MedCATTrainer export JSON file to a pandas DataFrame.
parse_medcat_trainer_project_json | Parses a MedCAT trainer project JSON into a structured DataFrame.
plot_ner_results | Generates plots to visualize NER (Named Entity Recognition) evaluation results.
recreate_json | Converts an exported MedCAT trainer DataFrame back to a training JSON.
- pat2vec.util.medcat_misc_methods.medcat_trainer_export_to_df(file_path)
Converts a MedCATTrainer export JSON file to a pandas DataFrame.
- Parameters:
file_path (str) – Path to the JSON file containing MedCATTrainer export data.
- Return type:
DataFrame
- Returns:
A DataFrame containing the extracted data, with each row representing a single annotation.
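The "one row per annotation" shape can be illustrated with a minimal standard-library sketch. This is not the pat2vec implementation: the key names (projects, documents, annotations, cui, value, start, end) are an assumption about the export layout, the sample data is hand-written, and flatten_export is a hypothetical helper.

```python
import json

# Hand-written sample in an ASSUMED MedCATTrainer-style layout (illustrative only).
EXPORT_JSON = """
{"projects": [{"name": "demo", "documents": [
  {"name": "doc1", "text": "Patient has asthma and diabetes.",
   "annotations": [
     {"cui": "C0004096", "value": "asthma", "start": 12, "end": 18},
     {"cui": "C0011849", "value": "diabetes", "start": 23, "end": 31}
   ]}
]}]}
"""


def flatten_export(export: dict) -> list:
    """Hypothetical helper: return one flat dict per annotation."""
    rows = []
    for project in export.get("projects", []):
        for document in project.get("documents", []):
            for annotation in document.get("annotations", []):
                rows.append({
                    "project_name": project.get("name"),
                    "document_name": document.get("name"),
                    "text": document.get("text"),
                    **annotation,  # cui, value, start, end
                })
    return rows


rows = flatten_export(json.loads(EXPORT_JSON))
for row in rows:
    # The character offsets index into the document text.
    print(row["document_name"], row["cui"], row["text"][row["start"]:row["end"]])
```

A list of flat dictionaries like rows can be passed directly to the pandas.DataFrame constructor to obtain a table with one row per annotation.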
- pat2vec.util.medcat_misc_methods.extract_labels_from_medcat_annotation_export(df, human_labels, window=300, output_file=None)
Extracts and validates labels from a MedCAT annotation export.
This function compares annotations from a MedCAT trainer export (df) with a set of human-labeled data (human_labels). It matches them based on the text content and source value, then validates the annotation based on its meta-annotations (Subject, Presence, Time). The result is stored in an ‘extracted_label’ column in the human_labels DataFrame.
- Parameters:
df (DataFrame) – The trainer output in DataFrame form (from medcat_trainer_export_to_df).
human_labels (DataFrame) – The DataFrame containing human-labeled text samples.
window (int) – The window size for extracting text samples for comparison.
output_file (Optional[str]) – An optional file path to save the processed DataFrame.
- Return type:
DataFrame
- Returns:
The processed human_labels DataFrame with the ‘extracted_label’ column.
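The two ideas in this description, a text window around an annotation and validation against meta-annotations, can be sketched as follows. Everything here is an assumption made for illustration: the accepted values in ACCEPTED, the way the window is applied, and the helper names context_window and is_valid are hypothetical and may differ from what the function actually does.

```python
# ASSUMED accepted values for each meta-annotation; the real rule may differ.
ACCEPTED = {"Subject": "Patient", "Presence": "True", "Time": "Recent"}


def context_window(text: str, start: int, end: int, window: int) -> str:
    """Hypothetical helper: the annotated span plus `window` characters each side."""
    return text[max(0, start - window):end + window]


def is_valid(meta_anns: dict) -> int:
    """Hypothetical helper: 1 if every meta-annotation has its accepted value, else 0."""
    return int(all(meta_anns.get(name) == value for name, value in ACCEPTED.items()))


sample = "Mother had asthma; patient reports recent wheeze."
print(context_window(sample, 11, 17, 5))
print(is_valid({"Subject": "Patient", "Presence": "True", "Time": "Recent"}))
print(is_valid({"Subject": "Other", "Presence": "True", "Time": "Recent"}))
```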
- pat2vec.util.medcat_misc_methods.recreate_json(df, output_file=None)
Converts an exported MedCAT trainer DataFrame back to a training JSON.
This function takes a DataFrame (as produced by medcat_trainer_export_to_df) and reconstructs the original JSON structure required for training a MedCAT model.
- Parameters:
df (DataFrame) – DataFrame containing exported data from a MedCAT trainer project.
output_file (Optional[str]) – Optional file path to save the generated JSON.
- Return type:
str
- Returns:
A JSON string representing the MedCAT training data.
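Going from flat rows back to nested JSON amounts to grouping rows by project and document. The sketch below shows that idea with the standard library only. The row keys and the nested layout are assumptions made for illustration, and rows_to_export is a hypothetical helper, not the pat2vec code.

```python
import json
from collections import defaultdict

# Hand-written flat rows, one per annotation (ASSUMED column names).
rows = [
    {"project_name": "demo", "document_name": "doc1", "text": "Patient has asthma.",
     "cui": "C0004096", "value": "asthma", "start": 12, "end": 18},
    {"project_name": "demo", "document_name": "doc2", "text": "No fever.",
     "cui": "C0015967", "value": "fever", "start": 3, "end": 8},
]


def rows_to_export(rows: list) -> dict:
    """Hypothetical helper: regroup flat rows into projects -> documents -> annotations."""
    projects = defaultdict(dict)  # project name -> {document name -> document dict}
    for row in rows:
        documents = projects[row["project_name"]]
        document = documents.setdefault(
            row["document_name"],
            {"name": row["document_name"], "text": row["text"], "annotations": []},
        )
        document["annotations"].append(
            {key: row[key] for key in ("cui", "value", "start", "end")}
        )
    return {
        "projects": [
            {"name": name, "documents": list(documents.values())}
            for name, documents in projects.items()
        ]
    }


export_json = json.dumps(rows_to_export(rows), indent=2)
print(export_json)
```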
- pat2vec.util.medcat_misc_methods.manually_label_annotation_df(df, file_path='human_labels.csv', confirmatory=False, verbose=False, filter_codes_list=[])
Interactively labels an annotation DataFrame.
This function loops over an annotation DataFrame, displays annotations for unique client ID codes, and prompts the user for a label (1 for correct, 0 for incorrect). The process continues until all annotations for a client (matching filter_codes_list) are confirmed. The labels are saved to a CSV file.
- Parameters:
df (DataFrame) – The DataFrame to annotate.
file_path (str) – The file path to store the human labels.
confirmatory (bool) – If True, skips clients who already have a confirmed correct annotation for the given filter codes.
verbose (bool) – If True, prints verbose output.
filter_codes_list (List[List[str]]) – A list of CUI code lists. A client is considered “done” when they have a correct annotation for each list of codes.
- Return type:
None
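The "done" rule for filter_codes_list can be read as: every inner list must be satisfied by at least one correctly labeled CUI. A minimal sketch of that reading follows. The helper client_is_done and the sample codes are hypothetical, and the behavior for an empty filter_codes_list (treated as done here) is an assumption that may not match the function.

```python
def client_is_done(confirmed_cuis: set, filter_codes_list: list) -> bool:
    """Hypothetical helper: True when each code list has at least one confirmed CUI.

    `confirmed_cuis` holds the CUIs a client has been labeled 1 (correct) for.
    Note: with an empty `filter_codes_list` this returns True (assumption).
    """
    return all(
        any(code in confirmed_cuis for code in codes)
        for codes in filter_codes_list
    )


# Two groups of codes: either asthma code satisfies the first group.
filter_codes_list = [["C0004096", "C0155877"], ["C0011849"]]

print(client_is_done({"C0155877"}, filter_codes_list))
print(client_is_done({"C0155877", "C0011849"}, filter_codes_list))
```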
- pat2vec.util.medcat_misc_methods.parse_medcat_trainer_project_json(json_path)
Parses a MedCAT trainer project JSON into a structured DataFrame.
This function reads a JSON file exported from a MedCAT trainer project. It handles various formats (nested JSON, list of JSON strings) and normalizes the data into a flat DataFrame where each row represents a single annotation with its associated document and project metadata.
- Parameters:
json_path (str) – Path to the JSON file from a MedCAT trainer export.
- Return type:
DataFrame
- Returns:
A DataFrame containing parsed and structured data, including project and document details, annotations, and their meta-annotations.
Notes
Handles nested JSON structures and safely converts JSON strings.
Explodes ‘cuis’ and ‘documents’ columns to create detailed rows.
Extracts meta-annotation details into separate columns.
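Handling "nested JSON" and "a list of JSON strings" with one code path usually means parsing a value only when it is a string and leaving it alone otherwise. The sketch below shows one way to do that with the standard library. The helper names safe_json_load and normalize are hypothetical, and this is an illustration of the idea rather than the function's actual logic.

```python
import json


def safe_json_load(value):
    """Hypothetical helper: parse `value` if it is a JSON string, else return it unchanged."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # not valid JSON, keep the raw string
    return value


def normalize(raw) -> list:
    """Hypothetical helper: return a list of parsed objects for either input format."""
    loaded = safe_json_load(raw)
    if isinstance(loaded, list):
        # A list may mix already-parsed dicts and JSON strings.
        return [safe_json_load(item) for item in loaded]
    return [loaded]


print(normalize('{"name": "project_a"}'))
print(normalize(['{"name": "project_a"}', {"name": "project_b"}]))
```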
- pat2vec.util.medcat_misc_methods.create_ner_results_dataframe(fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, cat=None)
Creates a pandas DataFrame from NER evaluation dictionaries.
- Parameters:
fps (dict) – Dictionary of false positives with CUI as keys.
fns (dict) – Dictionary of false negatives with CUI as keys.
tps (dict) – Dictionary of true positives with CUI as keys.
cui_prec (dict) – Dictionary of CUI-based precision with CUI as keys.
cui_rec (dict) – Dictionary of CUI-based recall with CUI as keys.
cui_f1 (dict) – Dictionary of CUI-based F1-score with CUI as keys.
cui_counts (dict) – Dictionary of CUI counts with CUI as keys.
cat (optional) – If a cat object is passed, the preferred name of each CUI is added.
- Returns:
DataFrame with CUI as index and columns for fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts and, if a cat object is passed, the preferred name.
- Return type:
pandas.DataFrame
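The input dictionaries all share CUIs as keys, so the result is essentially those dictionaries aligned column-wise. The standard-library sketch below shows that alignment with made-up counts. The precision, recall and F1 values are computed here only to keep the sample consistent, and the names prf and table are hypothetical.

```python
# Made-up per-CUI counts for illustration.
fps = {"C0004096": 1, "C0011849": 0}
fns = {"C0004096": 1, "C0011849": 2}
tps = {"C0004096": 3, "C0011849": 2}
cui_counts = {"C0004096": 4, "C0011849": 4}


def prf(tp: int, fp: int, fn: int) -> tuple:
    """Standard precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


cui_prec, cui_rec, cui_f1 = {}, {}, {}
for cui in tps:
    cui_prec[cui], cui_rec[cui], cui_f1[cui] = prf(tps[cui], fps[cui], fns[cui])

# Align the dictionaries: one entry per CUI, one field per metric.
metrics = {"fps": fps, "fns": fns, "tps": tps, "cui_prec": cui_prec,
           "cui_rec": cui_rec, "cui_f1": cui_f1, "cui_counts": cui_counts}
all_cuis = sorted(set().union(*metrics.values()))
table = {cui: {name: values.get(cui) for name, values in metrics.items()}
         for cui in all_cuis}

for cui, row in table.items():
    print(cui, row)
```

With pandas installed, pandas.DataFrame.from_dict(table, orient="index") turns a mapping like table into a DataFrame indexed by CUI.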
- pat2vec.util.medcat_misc_methods.plot_ner_results(results_df)
Generates plots to visualize NER (Named Entity Recognition) evaluation results.
This function creates a series of plots to help analyze the performance of an NER model, including F1-scores, precision-recall, error analysis, and the relationship between concept frequency and performance.
- Parameters:
results_df (DataFrame) – A DataFrame containing NER evaluation metrics, which must include ‘cui_name’, ‘cui_f1’, ‘cui_prec’, ‘cui_rec’, ‘fps’, ‘fns’, ‘tps’, and ‘cui_counts’.
- Return type:
None
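Because results_df must contain a fixed set of columns, it can help to check for them before plotting. The sketch below is one hedged way to do that. The helper missing_columns is hypothetical and not part of pat2vec; the column names are taken from the parameter description above.

```python
# Column names listed as required in the results_df description.
REQUIRED_COLUMNS = ("cui_name", "cui_f1", "cui_prec", "cui_rec",
                    "fps", "fns", "tps", "cui_counts")


def missing_columns(columns) -> list:
    """Hypothetical helper: required column names absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example with a plain list of column names.
print(missing_columns(["cui_f1", "cui_prec", "cui_rec", "fps", "fns", "tps"]))
```

For a DataFrame, the same check can be run on results_df.columns, and plotting skipped or an error raised when the returned list is not empty.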