pat2vec.util.post_processing_medcat

Functions

coerce_document_df_to_medcat_trainer_input(df)

Prepares a DataFrame for MedCAT trainer input format.

sample_by_terms(df, column, term_groups, ...)

Randomly samples rows from a DataFrame based on fuzzy matching of terms.

pat2vec.util.post_processing_medcat.sample_by_terms(df, column, term_groups, min_samples_per_term, total_sample_size, threshold=75)

Randomly samples rows from a DataFrame based on fuzzy matching of terms.

This function is useful for creating a balanced sample set for tasks like MedCAT model training, where the source data may have an uneven distribution of concepts. It performs stratified sampling based on term groups.

The sampling process is as follows:

  1. It ensures a minimum of min_samples_per_term samples for each term group.

  2. It then proportionally samples from the remaining available documents to reach the total_sample_size.
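The two steps above can be sketched in plain Python. This is only an illustration of the minimum-then-proportional idea, not the pat2vec implementation: the function name two_phase_sample, the groups mapping (group label to the row ids that matched it), and the seed argument are all hypothetical, and the sketch assumes each row belongs to exactly one group.

```python
import random


def two_phase_sample(groups, min_per_group, total, seed=0):
    """Illustrative minimum-then-proportional sampling (not pat2vec's code).

    groups maps a group label to the list of row ids matched for it.
    """
    rng = random.Random(seed)
    picked = {}
    leftovers = {}

    # Phase 1: take up to min_per_group rows from every group.
    for label, rows in groups.items():
        shuffled = rows[:]
        rng.shuffle(shuffled)
        picked[label] = shuffled[:min_per_group]
        leftovers[label] = shuffled[min_per_group:]

    # Phase 2: share the remaining budget in proportion to the number
    # of rows each group still has available.
    budget = max(total - sum(len(v) for v in picked.values()), 0)
    available = sum(len(v) for v in leftovers.values())
    if available:
        for label, rows in leftovers.items():
            share = min(len(rows), budget * len(rows) // available)
            picked[label].extend(rows[:share])
    return picked


groups = {"common": list(range(100)), "rare": list(range(100, 110))}
sample = two_phase_sample(groups, min_per_group=5, total=30)
print({label: len(rows) for label, rows in sample.items()})
```

Because the proportional shares are rounded down, the sketch can return slightly fewer rows than total; how the real function handles rounding is not stated here.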

Parameters:
  • df (DataFrame) – The DataFrame to sample from.

  • column (str) – The column in df to search for term matches.

  • term_groups (List[List[str]]) – A list of term groups, where each inner list contains synonymous or related terms.

  • min_samples_per_term (int) – The minimum number of samples to retrieve for each term group.

  • total_sample_size (int) – The desired total number of samples in the final DataFrame.

  • threshold (int) – The fuzzy matching score (0-100) required for a term to count as a match. Defaults to 75.

Return type:

DataFrame

Returns:

A new DataFrame containing the sampled rows. An additional ‘matched_term’ column is added for debugging, showing which specific term from a group matched the row.
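To make the roles of threshold and the matched_term column concrete, here is a minimal sketch using the standard library. It is an assumption-laden stand-in: the page does not say which fuzzy matcher pat2vec uses, so difflib.SequenceMatcher is swapped in, scores are computed per word, and the helper name best_match is hypothetical.

```python
from difflib import SequenceMatcher


def best_match(text, terms, threshold=75):
    """Return the best-scoring term that meets threshold, else None.

    Scores are SequenceMatcher ratios scaled to 0-100 and computed
    against each word of the text. Stand-in scorer, not pat2vec's.
    """
    best_term, best_score = None, 0.0
    words = text.lower().split()
    for term in terms:
        for word in words:
            score = 100 * SequenceMatcher(None, term.lower(), word).ratio()
            if score >= threshold and score > best_score:
                best_term, best_score = term, score
    return best_term


rows = [
    {"text": "patient has diabetis"},   # misspelling still matches
    {"text": "no complaints today"},    # nothing close enough
]
for row in rows:
    # Record which term matched, analogous to the matched_term column.
    row["matched_term"] = best_match(row["text"], ["diabetes", "T2DM"])
print(rows)
```

Raising threshold makes matching stricter, so misspellings like the one above stop matching; lowering it admits looser matches at the cost of more false positives.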

pat2vec.util.post_processing_medcat.coerce_document_df_to_medcat_trainer_input(df, text_column_value='body_analysed', name_value='_id')

Prepares a DataFrame for MedCAT trainer input format.

This function transforms a given DataFrame into the specific two-column format required by the MedCAT trainer: a ‘name’ column for unique document identifiers and a ‘text’ column for the document content.

It performs the following steps:

  1. Renames the specified name_value and text_column_value columns to ‘name’ and ‘text’, respectively.

  2. Ensures that all values in the ‘name’ column are unique. If duplicates are found, they are made unique by appending a suffix (e.g., _1, _2).

  3. Returns a new DataFrame containing only the ‘name’ and ‘text’ columns.
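The three steps can be sketched as follows. To keep the example dependency-free, rows are plain dicts rather than a DataFrame, and the helper name to_trainer_rows is hypothetical. The exact suffixing rule is also an assumption: here the first occurrence keeps its name and later duplicates receive _1, _2, and so on.

```python
from collections import Counter


def to_trainer_rows(rows, text_key="body_analysed", name_key="_id"):
    """Illustrative mapping of dict rows to unique 'name' / 'text' pairs."""
    seen = Counter()
    out = []
    for row in rows:
        # A missing key raises KeyError, as the documented function does
        # for missing columns.
        name, text = str(row[name_key]), row[text_key]
        count = seen[name]
        seen[name] += 1
        # Assumed rule: first occurrence unchanged, repeats get _1, _2, ...
        unique_name = name if count == 0 else f"{name}_{count}"
        # Only the two target fields are kept; other fields are dropped.
        out.append({"name": unique_name, "text": text})
    return out


docs = [
    {"_id": "doc1", "body_analysed": "first note", "ward": "A"},
    {"_id": "doc1", "body_analysed": "second note", "ward": "B"},
    {"_id": "doc2", "body_analysed": "third note", "ward": "A"},
]
print(to_trainer_rows(docs))
```

This simple rule does not guard against a generated name such as doc1_1 colliding with an identifier that already exists in the data; a production version would need to check for that.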

Parameters:
  • df (DataFrame) – The input DataFrame.

  • text_column_value (str) – The name of the column containing the document text. Defaults to 'body_analysed'.

  • name_value (str) – The name of the column to be used as the document identifier. Defaults to '_id'.

Return type:

DataFrame

Returns:

A new DataFrame with ‘name’ and ‘text’ columns, ready for MedCAT trainer.

Raises:

KeyError – If name_value or text_column_value is not found in the DataFrame’s columns.