pat2vec.util.medcat_misc_methods

Functions

create_ner_results_dataframe(fps, fns, tps, ...)

Creates a pandas DataFrame from NER evaluation dictionaries.

extract_labels_from_medcat_annotation_export(df, ...)

Extracts and validates labels from a MedCAT annotation export.

manually_label_annotation_df(df[, ...])

Interactively labels an annotation DataFrame.

medcat_trainer_export_to_df(file_path)

Converts a MedCATTrainer export JSON file to a pandas DataFrame.

parse_medcat_trainer_project_json(json_path)

Parses a MedCAT trainer project JSON into a structured DataFrame.

plot_ner_results(results_df)

Generates plots to visualize NER (Named Entity Recognition) evaluation results.

recreate_json(df[, output_file])

Converts an exported MedCAT trainer DataFrame back to a training JSON.

pat2vec.util.medcat_misc_methods.medcat_trainer_export_to_df(file_path)[source]

Converts a MedCATTrainer export JSON file to a pandas DataFrame.

Parameters:

file_path (str) – Path to the JSON file containing MedCATTrainer export data.

Return type:

DataFrame

Returns:

A DataFrame containing the extracted data, with each row representing a single annotation.
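As a rough illustration of the "one row per annotation" shape, the sketch below flattens a tiny export by hand using only the standard library. It is not the pat2vec implementation: the sample export, the helper name flatten_export, and the output column names are all assumptions made for illustration, and the nested projects/documents/annotations layout is assumed to match a typical MedCATTrainer export.

```python
import json

# Hypothetical, minimal export (projects -> documents -> annotations).
# The field names here are assumptions for illustration only.
export_json = json.dumps({
    "projects": [{
        "name": "demo_project",
        "documents": [{
            "name": "doc_1",
            "text": "Patient has diabetes.",
            "annotations": [{
                "cui": "73211009",
                "value": "diabetes",
                "start": 12,
                "end": 20,
                "meta_anns": {"Presence": {"value": "True"}},
            }],
        }],
    }]
})


def flatten_export(raw):
    """Return one flat dict per annotation, with project/document context."""
    rows = []
    for project in json.loads(raw).get("projects", []):
        for doc in project.get("documents", []):
            for ann in doc.get("annotations", []):
                row = {
                    "project_name": project.get("name"),
                    "document_name": doc.get("name"),
                    "text": doc.get("text"),
                    "cui": ann.get("cui"),
                    "value": ann.get("value"),
                    "start": ann.get("start"),
                    "end": ann.get("end"),
                }
                # Spread each meta-annotation task into its own column.
                for task, meta in ann.get("meta_anns", {}).items():
                    row[f"meta_{task}"] = meta.get("value")
                rows.append(row)
    return rows


rows = flatten_export(export_json)
```

A list of dicts like rows can be handed to pandas.DataFrame(rows) to obtain a comparable table; the columns produced by the real function may differ.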

pat2vec.util.medcat_misc_methods.extract_labels_from_medcat_annotation_export(df, human_labels, window=300, output_file=None)[source]

Extracts and validates labels from a MedCAT annotation export.

This function compares annotations from a MedCAT trainer export (df) with a set of human-labeled data (human_labels). It matches them based on the text content and source value, then validates the annotation based on its meta-annotations (Subject, Presence, Time). The result is stored in an ‘extracted_label’ column in the human_labels DataFrame.

Parameters:
  • df (DataFrame) – The trainer output in DataFrame form (from medcat_trainer_export_to_df).

  • human_labels (DataFrame) – The DataFrame containing human-labeled text samples.

  • window (int) – The window size for extracting text samples for comparison.

  • output_file (Optional[str]) – An optional file path to save the processed DataFrame.

Return type:

DataFrame

Returns:

The processed human_labels DataFrame with the ‘extracted_label’ column.
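The matching and validation logic described above might look something like the following sketch. Everything specific in it is an assumption: the accepted meta-annotation values in ACCEPTED_META, the dictionary keys text_sample and source_value, and the exact way the window is applied are hypothetical and may not match what pat2vec does.

```python
# Hypothetical acceptance rule: meta-annotation values treated as valid.
ACCEPTED_META = {"Subject": "Patient", "Presence": "True", "Time": "Recent"}


def text_window(text, start, end, window=300):
    """Return the annotated span plus up to `window` characters either side."""
    return text[max(0, start - window):end + window]


def extracted_label(annotation, human_row, window=300):
    """Return 1 if the annotation matches the human sample and is valid."""
    sample = text_window(
        annotation["text"], annotation["start"], annotation["end"], window
    )
    same_sample = sample == human_row["text_sample"]
    same_value = annotation["value"] == human_row["source_value"]
    meta_ok = all(
        annotation["meta"].get(task) == value
        for task, value in ACCEPTED_META.items()
    )
    return int(same_sample and same_value and meta_ok)


annotation = {
    "text": "Patient has diabetes.",
    "start": 12,
    "end": 20,
    "value": "diabetes",
    "meta": {"Subject": "Patient", "Presence": "True", "Time": "Recent"},
}
human_row = {"text_sample": " has diabetes.", "source_value": "diabetes"}

label = extracted_label(annotation, human_row, window=5)

# Same annotation, but negated: the meta-annotation check should reject it.
negated = dict(annotation, meta={**annotation["meta"], "Presence": "False"})
negated_label = extracted_label(negated, human_row, window=5)
```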

pat2vec.util.medcat_misc_methods.recreate_json(df, output_file=None)[source]

Converts an exported MedCAT trainer DataFrame back to a training JSON.

This function takes a DataFrame (as produced by medcat_trainer_export_to_df) and reconstructs the original JSON structure required for training a MedCAT model.

Parameters:
  • df (DataFrame) – DataFrame containing exported data from a MedCAT trainer project.

  • output_file (Optional[str]) – Optional file path to save the generated JSON.

Return type:

str

Returns:

A JSON string representing the MedCAT training data.
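The reverse direction, from flat rows back to nested JSON, can be sketched with a simple grouping step. This is only an illustration under assumed column names (project_name, document_name, and so on) and an assumed projects/documents/annotations layout; rows_to_training_json is a hypothetical helper, not the library function.

```python
import json

# Hypothetical flat rows, one per annotation.
rows = [
    {"project_name": "demo", "document_name": "doc_1",
     "text": "Patient has diabetes.", "cui": "73211009",
     "value": "diabetes", "start": 12, "end": 20},
    {"project_name": "demo", "document_name": "doc_1",
     "text": "Patient has diabetes.", "cui": "116154003",
     "value": "Patient", "start": 0, "end": 7},
]


def rows_to_training_json(rows):
    """Group flat annotation rows by project and document, then serialize."""
    projects = {}
    for row in rows:
        docs = projects.setdefault(row["project_name"], {})
        doc = docs.setdefault(
            row["document_name"],
            {"name": row["document_name"], "text": row["text"],
             "annotations": []},
        )
        doc["annotations"].append(
            {key: row[key] for key in ("cui", "value", "start", "end")}
        )
    nested = {
        "projects": [
            {"name": name, "documents": list(docs.values())}
            for name, docs in projects.items()
        ]
    }
    return json.dumps(nested, indent=2)


training_json = rows_to_training_json(rows)
parsed = json.loads(training_json)
```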

pat2vec.util.medcat_misc_methods.manually_label_annotation_df(df, file_path='human_labels.csv', confirmatory=False, verbose=False, filter_codes_list=[])[source]

Interactively labels an annotation DataFrame.

This function loops over an annotation DataFrame, displays annotations for unique client ID codes, and prompts the user for a label (1 for correct, 0 for incorrect). The process continues until all annotations for a client (matching filter_codes_list) are confirmed. The labels are saved to a CSV file.

Parameters:
  • df (DataFrame) – The DataFrame to annotate.

  • file_path (str) – The file path to store the human labels.

  • confirmatory (bool) – If True, skips clients who already have a confirmed correct annotation for the given filter codes.

  • verbose (bool) – If True, prints verbose output.

  • filter_codes_list (List[List[str]]) – A list of CUI code lists. A client is considered “done” when they have a correct annotation for each list of codes.

Return type:

None
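The "done" rule for filter_codes_list can be read as: for every inner list, at least one of its CUIs has been labeled correct for that client. A minimal sketch of that reading follows; client_is_done and its inputs are hypothetical and the real function may track state differently.

```python
def client_is_done(correct_cuis, filter_codes_list):
    """Return True if every code list has at least one correctly labeled CUI.

    correct_cuis: set of CUIs the reviewer marked correct (label 1)
    for a single client.
    """
    return all(
        any(code in correct_cuis for code in codes)
        for codes in filter_codes_list
    )


# Two hypothetical concept groups, each a list of interchangeable CUIs.
filter_codes_list = [["C001", "C002"], ["C010"]]

partly_labeled = client_is_done({"C002"}, filter_codes_list)
fully_labeled = client_is_done({"C002", "C010"}, filter_codes_list)
```

In this sketch an empty filter_codes_list makes every client trivially done, because all() over an empty sequence is True; the library may treat the default differently.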

pat2vec.util.medcat_misc_methods.parse_medcat_trainer_project_json(json_path)[source]

Parses a MedCAT trainer project JSON into a structured DataFrame.

This function reads a JSON file exported from a MedCAT trainer project. It handles various formats (nested JSON, list of JSON strings) and normalizes the data into a flat DataFrame where each row represents a single annotation with its associated document and project metadata.

Parameters:

json_path (str) – Path to the JSON file from a MedCAT trainer export.

Return type:

DataFrame

Returns:

A DataFrame containing parsed and structured data, including project and document details, annotations, and their meta-annotations.

Notes

  • Handles nested JSON structures and safely converts JSON strings.

  • Explodes ‘cuis’ and ‘documents’ columns to create detailed rows.

  • Extracts meta-annotation details into separate columns.
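One way to picture the "nested JSON or list of JSON strings" handling is a small normalization step that decodes strings only when they are valid JSON. The helpers safe_load and normalize_projects below are hypothetical names for illustration, not functions from pat2vec, and the "projects" key is an assumed layout.

```python
import json


def safe_load(value):
    """Decode `value` if it is a JSON string; otherwise return it unchanged."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value
    return value


def normalize_projects(payload):
    """Return a list of project dicts from any of the supported shapes."""
    payload = safe_load(payload)
    if isinstance(payload, dict):
        # Nested form: {"projects": [...]}; fall back to a single project.
        payload = payload.get("projects", [payload])
    return [safe_load(item) for item in payload]


nested = {"projects": [{"name": "a"}]}
listed = ['{"name": "a"}', '{"name": "b"}']

from_nested = normalize_projects(nested)
from_listed = normalize_projects(listed)
from_string = normalize_projects(json.dumps(nested))
```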

pat2vec.util.medcat_misc_methods.create_ner_results_dataframe(fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, cat=None)[source]

Creates a pandas DataFrame from NER evaluation dictionaries.

Parameters:
  • fps (dict) – Dictionary of false positives with CUI as keys.

  • fns (dict) – Dictionary of false negatives with CUI as keys.

  • tps (dict) – Dictionary of true positives with CUI as keys.

  • cui_prec (dict) – Dictionary of CUI-based precision with CUI as keys.

  • cui_rec (dict) – Dictionary of CUI-based recall with CUI as keys.

  • cui_f1 (dict) – Dictionary of CUI-based F1-score with CUI as keys.

  • cui_counts (dict) – Dictionary of CUI counts with CUI as keys.

  • cat (optional) – A MedCAT cat object. If passed, the preferred name for each CUI is added.

Returns:

DataFrame with CUI as index and columns for fps, fns, tps, cui_prec, cui_rec, cui_f1 and cui_counts, plus the preferred name for each CUI when a cat MedCAT object is passed.

Return type:

pandas.DataFrame
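The core idea, several dictionaries keyed by CUI merged into one table indexed by CUI, can be sketched without pandas. The helper combine_by_cui and the sample counts are hypothetical; the real function's handling of missing keys is not documented here, so the use of None below is an assumption.

```python
def combine_by_cui(**metrics):
    """Merge metric dicts keyed by CUI into {cui: {metric_name: value}}."""
    cuis = sorted(set().union(*[values.keys() for values in metrics.values()]))
    return {
        cui: {name: values.get(cui) for name, values in metrics.items()}
        for cui in cuis
    }


# Hypothetical evaluation counts for two concepts.
tps = {"C001": 8, "C002": 3}
fps = {"C001": 2}
fns = {"C002": 1}

table = combine_by_cui(tps=tps, fps=fps, fns=fns)
```

Passing a mapping like table to pandas.DataFrame.from_dict(table, orient="index") yields a DataFrame with one row per CUI and one column per metric.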

pat2vec.util.medcat_misc_methods.plot_ner_results(results_df)[source]

Generates plots to visualize NER (Named Entity Recognition) evaluation results.

This function creates a series of plots to help analyze the performance of an NER model, including F1-scores, precision-recall, error analysis, and the relationship between concept frequency and performance.

Parameters:

results_df (DataFrame) – A DataFrame containing NER evaluation metrics, which must include ‘cui_name’, ‘cui_f1’, ‘cui_prec’, ‘cui_rec’, ‘fps’, ‘fns’, ‘tps’, and ‘cui_counts’.

Return type:

None