pat2vec.util.anonymisation_deid_documents

Functions

anonymize_dataframe_quick(df, text_columns, ...)

A convenience function to quickly anonymize columns in a DataFrame.

anonymize_single_text(text, model_path[, redact])

A convenience function to quickly anonymize a single text string.

Classes

DeIdAnonymizer([model_path, log_level])

A class for anonymizing clinical text using MedCAT's DeIdModel.

class pat2vec.util.anonymisation_deid_documents.DeIdAnonymizer(model_path=None, log_level='INFO')

Bases: object

A class for anonymizing clinical text using MedCAT’s DeIdModel.

This class encapsulates the functionality for loading a de-identification model, anonymizing text data in various formats (single string, list of strings, pandas DataFrame columns), and providing utilities for inspection and reporting.

Parameters:
  • model_path (str | Path | None)

  • log_level (str)

model

The loaded MedCAT DeIdModel instance.

model_path

The path to the loaded model pack.

is_loaded

A boolean indicating if a model is successfully loaded.

pii_labels

A list of PII labels the loaded model is configured to redact.

anonymization_log

A list of dictionaries logging each operation.

logger

A configured logger instance for the class.

__init__(model_path=None, log_level='INFO')

Initializes the DeIdAnonymizer.

Parameters:
  • model_path (Union[str, Path, None]) – Optional path to the MedCAT DeIdModel pack. If provided, the model is loaded upon initialization.

  • log_level (str) – The logging level for the instance (e.g., “INFO”, “DEBUG”).

load_model(model_path)

Loads a pre-trained DeIdModel from a specified path.

Parameters:

model_path (Union[str, Path]) – The path to the model pack (directory or .zip file).

Return type:

bool

Returns:

True if the model was loaded successfully, False otherwise.
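
A minimal usage sketch of the loading behaviour described above, assuming pat2vec and MedCAT are installed and that you have a model pack on disk. The helper name load_anonymizer is hypothetical and not part of pat2vec; only the constructor and load_model signatures come from this page. The import is deferred into the function so the snippet can be defined even where the library is absent.

```python
from pathlib import Path
from typing import Union


def load_anonymizer(model_path: Union[str, Path]):
    """Hypothetical helper: build a DeIdAnonymizer and fail loudly if loading fails."""
    # Deferred import: calling this function requires pat2vec (and MedCAT).
    from pat2vec.util.anonymisation_deid_documents import DeIdAnonymizer

    anonymizer = DeIdAnonymizer(log_level="INFO")  # no model_path, so nothing is loaded yet
    # load_model is documented to return True on success and False otherwise.
    if not anonymizer.load_model(model_path):
        raise RuntimeError(f"Could not load de-identification model from {model_path}")
    return anonymizer
```

Checking the boolean return value, rather than assuming success, avoids calling the anonymization methods on an instance whose is_loaded attribute is False.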

anonymize_text(text, redact=True, verify=False)

Anonymizes a single text string.

Parameters:
  • text (str) – The input text to anonymize.

  • redact (bool) – If True, replaces PII with asterisks (‘***’). If False, replaces PII with type tags (e.g., ‘<PERSON>’).

  • verify (bool) – If True, returns a tuple containing the anonymized text and a dictionary with verification information.

Return type:

Union[str, Tuple[str, Dict[str, Any]]]

Returns:

If verify is False, returns the anonymized text string. If verify is True, returns a tuple of (anonymized_text, verification_info).
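
The two replacement modes selected by redact can be illustrated without a model. The sketch below is a stand-in, not the library's implementation: replace_spans, the hard-coded spans, and the fixed three-asterisk replacement are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end, label) character spans; in practice these would come from the model.
Span = Tuple[int, int, str]


def replace_spans(text: str, spans: List[Span], redact: bool = True) -> str:
    """Replace each span with '***' (redact=True) or a '<LABEL>' tag (redact=False)."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        replacement = "***" if redact else f"<{label}>"
        text = text[:start] + replacement + text[end:]
    return text


note = "Seen by Dr Jane Smith on 01/02/2020."
spans = [(11, 21, "PERSON"), (25, 35, "DATE")]  # illustrative spans, not model output

print(replace_spans(note, spans, redact=True))   # Seen by Dr *** on ***.
print(replace_spans(note, spans, redact=False))  # Seen by Dr <PERSON> on <DATE>.
```

Type tags keep the category of each removed item visible, which can help when reviewing output; asterisks discard that information as well.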

anonymize_texts(texts, redact=True, n_process=1, batch_size=100, verify_sample=False, sample_size=10)

Anonymizes a list of text strings, with parallel processing support.

Parameters:
  • texts (List[str]) – A list of input texts to anonymize.

  • redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

  • n_process (int) – The number of processes to use for parallel execution.

  • batch_size (int) – The number of texts to process in each batch.

  • verify_sample (bool) – If True, verifies a random sample of the results and returns a report.

  • sample_size (int) – The size of the random sample to verify if verify_sample is True.

Return type:

Union[List[str], Tuple[List[str], Dict]]

Returns:

If verify_sample is False, returns a list of anonymized texts. If verify_sample is True, returns a tuple of (anonymized_texts, verification_report).
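
To make the roles of batch_size and sample_size concrete, here is a sketch of how a list might be split into batches and how a verification sample might be chosen. Both helpers are hypothetical; the library's internal batching and sampling may differ.

```python
import random
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def pick_sample_indices(n_texts: int, sample_size: int = 10, seed: int = 0) -> List[int]:
    """Choose which results to verify; never more than there are texts."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return sorted(rng.sample(range(n_texts), min(sample_size, n_texts)))


texts = [f"note {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=100)]
print(batch_lengths)  # [100, 100, 50]
print(len(pick_sample_indices(len(texts), sample_size=10)))  # 10
```

Capping the sample at the number of texts keeps short inputs from raising an error when sample_size exceeds the list length.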

anonymize_dataframe(df, text_columns, redact=True, inplace=False, suffix='_anonymized', n_process=1, batch_size=100)

Anonymizes specified text columns in a pandas DataFrame.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • text_columns (List[str]) – A list of column names containing the text to be anonymized.

  • redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

  • inplace (bool) – If True, modifies the DataFrame in place by overwriting the original text columns. If False, returns a new DataFrame with anonymized columns added.

  • suffix (str) – The suffix to add to new anonymized column names. This is ignored if inplace is True.

  • n_process (int) – The number of processes for parallel execution.

  • batch_size (int) – The number of texts to process in each batch.

Return type:

DataFrame

Returns:

A DataFrame with the specified text columns anonymized.
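
The interaction between inplace and suffix can be sketched as a mapping from source columns to output columns. The helper target_columns is hypothetical, and the naming rule (original name followed by the suffix) is an assumption based on the parameter descriptions above.

```python
from typing import Dict, List


def target_columns(text_columns: List[str], inplace: bool = False,
                   suffix: str = "_anonymized") -> Dict[str, str]:
    """Map each source column to the column that would hold its anonymized text."""
    if inplace:
        # Originals are overwritten, so the suffix plays no role.
        return {col: col for col in text_columns}
    return {col: f"{col}{suffix}" for col in text_columns}


print(target_columns(["note", "summary"]))
# {'note': 'note_anonymized', 'summary': 'summary_anonymized'}
print(target_columns(["note", "summary"], inplace=True))
# {'note': 'note', 'summary': 'summary'}
```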

inspect_text(text)

Inspects text to find and log PII entities without anonymizing.

This method is useful for debugging and understanding what the loaded model is capable of detecting in a given piece of text.

Parameters:

text (str) – The text to inspect.

Return type:

List[Dict[str, Any]]

Returns:

A list of dictionaries, each representing a found PII entity.

get_structured_annotations(text)

Gets structured annotations for PII entities in a text.

Parameters:

text (str) – The input text to analyze.

Return type:

List[Dict[str, Any]]

Returns:

A list of dictionaries, where each dictionary contains details (text, label, start, end, confidence) for an identified PII entity.
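
A list in this shape is easy to post-process. The sketch below filters annotations by confidence and label; it assumes the dictionary keys are named exactly text, label, start, end and confidence, and the sample records are illustrative rather than real model output.

```python
from typing import Any, Dict, List, Optional, Set


def filter_annotations(
    annotations: List[Dict[str, Any]],
    min_confidence: float = 0.5,
    labels: Optional[Set[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep annotations at or above min_confidence, optionally restricted to labels."""
    return [
        ann for ann in annotations
        if ann["confidence"] >= min_confidence
        and (labels is None or ann["label"] in labels)
    ]


# Illustrative records, not real model output.
annotations = [
    {"text": "Jane Smith", "label": "PERSON", "start": 11, "end": 21, "confidence": 0.98},
    {"text": "01/02/2020", "label": "DATE", "start": 25, "end": 35, "confidence": 0.41},
]

confident = filter_annotations(annotations, min_confidence=0.5)
print([ann["label"] for ann in confident])  # ['PERSON']
```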

generate_report()

Generates a summary report of all operations performed.

Return type:

Dict[str, Any]

Returns:

A dictionary containing statistics about the anonymization operations, model details, and total texts processed.

save_log(filepath)

Saves the anonymization operation log to a JSON file.

Parameters:

filepath (Union[str, Path]) – The path where the log file will be saved.

Return type:

None
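
For orientation, writing a list of dictionaries to JSON with the standard library might look like the sketch below. The function write_log and the entry fields are hypothetical; the actual structure of anonymization_log entries is not specified on this page.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def write_log(entries: List[Dict[str, Any]], filepath: Union[str, Path]) -> None:
    """Write log entries to filepath as indented JSON, creating parent folders."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # default=str keeps values JSON cannot encode (e.g. datetimes) from raising.
        json.dump(entries, fh, indent=2, default=str)


# Hypothetical entry fields, for illustration only.
entries = [{"operation": "anonymize_text", "n_texts": 1, "redact": True}]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "logs" / "anonymization_log.json"
    write_log(entries, log_file)
    round_tripped = json.loads(log_file.read_text(encoding="utf-8"))

print(round_tripped == entries)  # True
```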

pat2vec.util.anonymisation_deid_documents.anonymize_single_text(text, model_path, redact=True)

A convenience function to quickly anonymize a single text string.

Parameters:
  • text (str) – The input text to anonymize.

  • model_path (Union[str, Path]) – The path to the DeIdModel pack.

  • redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

Return type:

str

Returns:

The anonymized text.

pat2vec.util.anonymisation_deid_documents.anonymize_dataframe_quick(df, text_columns, model_path, redact=True)

A convenience function to quickly anonymize columns in a DataFrame.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • text_columns (List[str]) – A list of column names to anonymize.

  • model_path (Union[str, Path]) – The path to the DeIdModel pack.

  • redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

Return type:

DataFrame

Returns:

A new DataFrame with anonymized text columns.