pat2vec.util.anonymisation_deid_documents

Functions

`anonymize_dataframe_quick`(df, text_columns, ...)	A convenience function to quickly anonymize columns in a DataFrame.
`anonymize_single_text`(text, model_path[, redact])	A convenience function to quickly anonymize a single text string.

Classes

`DeIdAnonymizer`([model_path, log_level])	A class for anonymizing clinical text using MedCAT's DeIdModel.

class pat2vec.util.anonymisation_deid_documents.DeIdAnonymizer(model_path=None, log_level='INFO')

Bases: object

A class for anonymizing clinical text using MedCAT’s DeIdModel.

This class encapsulates the functionality for loading a de-identification model, anonymizing text data in various formats (single string, list of strings, pandas DataFrame columns), and providing utilities for inspection and reporting.

Parameters:

model_path (str | Path | None)
log_level (str)

model: The loaded MedCAT DeIdModel instance.

model_path: The path to the loaded model pack.

is_loaded: A boolean indicating if a model is successfully loaded.

pii_labels: A list of PII labels the loaded model is configured to redact.

anonymization_log: A list of dictionaries logging each operation.

logger: A configured logger instance for the class.

__init__(model_path=None, log_level='INFO')

Initializes the DeIdAnonymizer.

Parameters:

model_path (Union[str, Path, None]) – Optional path to the MedCAT DeIdModel pack. If provided, the model is loaded upon initialization.
log_level (str) – The logging level for the instance (e.g., “INFO”, “DEBUG”).

load_model(model_path)

Loads a pre-trained DeIdModel from a specified path.

Parameters: model_path (Union[str, Path]) – The path to the model pack (directory or .zip file).
Return type: bool
Returns: True if the model was loaded successfully, False otherwise.
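
Because load_model reports failure through its return value, a caller has to check that value explicitly. The following is a minimal sketch of such a check. The helper name `ensure_loaded` and the model path in the comments are hypothetical; the sketch relies only on the documented `is_loaded` attribute and the boolean result of `load_model`.

```python
from pathlib import Path
from typing import Union


def ensure_loaded(anonymizer, model_path: Union[str, Path]) -> None:
    """Load a model pack into `anonymizer` unless one is already loaded.

    `anonymizer` is expected to be a DeIdAnonymizer, or any object exposing
    the documented `is_loaded` attribute and `load_model` method.
    """
    if anonymizer.is_loaded:
        return
    # load_model is documented to return True on success and False otherwise,
    # so a failed load is turned into an exception here instead of being ignored.
    if not anonymizer.load_model(model_path):
        raise RuntimeError(f"Could not load de-identification model from {model_path}")


# Typical use (requires pat2vec and a model pack; the path is a placeholder):
# from pat2vec.util.anonymisation_deid_documents import DeIdAnonymizer
# anonymizer = DeIdAnonymizer(log_level="DEBUG")
# ensure_loaded(anonymizer, "path/to/deid_model_pack.zip")
```

Raising on failure is a design choice of this sketch, not behaviour of the library; it keeps later calls from running against an anonymizer that has no model.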

anonymize_text(text, redact=True, verify=False)

Anonymizes a single text string.

Parameters:

text (str) – The input text to anonymize.
redact (bool) – If True, replaces PII with asterisks (‘***’). If False, replaces PII with type tags (e.g., ‘<PERSON>’).
verify (bool) – If True, returns a tuple containing the anonymized text and a dictionary with verification information.

Return type:

Union[str, Tuple[str, Dict[str, Any]]]

Returns:

If verify is False, returns the anonymized text string. If verify is True, returns a tuple of (anonymized_text, verification_info).
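
Since the return type depends on verify, calling code may want a single shape to work with. The sketch below wraps anonymize_text so that it always yields a pair, with None standing in for the verification information when verify is False. The wrapper name `anonymize_text_normalized` is hypothetical, and the contents of the verification dictionary are not specified here, so the sketch passes it through untouched.

```python
from typing import Any, Dict, Optional, Tuple


def anonymize_text_normalized(
    anonymizer, text: str, redact: bool = True, verify: bool = False
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Call anonymize_text and always return (anonymized_text, info_or_None)."""
    result = anonymizer.anonymize_text(text, redact=redact, verify=verify)
    if verify:
        # Documented shape with verify=True: (anonymized_text, verification_info).
        anonymized_text, verification_info = result
        return anonymized_text, verification_info
    # Documented shape with verify=False: the anonymized string alone.
    return result, None


# Typical use, given a DeIdAnonymizer with a loaded model:
# tagged, info = anonymize_text_normalized(anonymizer, note, redact=False, verify=True)
```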

anonymize_texts(texts, redact=True, n_process=1, batch_size=100, verify_sample=False, sample_size=10)

Anonymizes a list of text strings, with parallel processing support.

Parameters:

texts (List[str]) – A list of input texts to anonymize.
redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.
n_process (int) – The number of processes to use for parallel execution.
batch_size (int) – The number of texts to process in each batch.
verify_sample (bool) – If True, verifies a random sample of the results and returns a report.
sample_size (int) – The size of the random sample to verify if verify_sample is True.

Return type:

Union[List[str], Tuple[List[str], Dict]]

Returns:

If verify_sample is False, returns a list of anonymized texts. If verify_sample is True, returns a tuple of (anonymized_texts, verification_report).
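
A batch call with sample verification might be wrapped as in the sketch below. The helper name `anonymize_corpus` is hypothetical. Two choices in it are assumptions rather than documented behaviour: capping sample_size at the number of texts, and checking that one output comes back per input.

```python
from typing import Any, Dict, List, Tuple


def anonymize_corpus(
    anonymizer,
    texts: List[str],
    n_process: int = 1,
    batch_size: int = 100,
    sample_size: int = 10,
) -> Tuple[List[str], Dict[str, Any]]:
    """Anonymize `texts` and return (anonymized_texts, verification_report)."""
    if not texts:
        # Nothing to process; skip the model call entirely.
        return [], {}
    anonymized, report = anonymizer.anonymize_texts(
        texts,
        redact=True,
        n_process=n_process,
        batch_size=batch_size,
        verify_sample=True,  # documented to return a (texts, report) tuple
        sample_size=min(sample_size, len(texts)),  # assumption: avoid oversampling
    )
    # Assumption: the result is aligned one-to-one with the input list.
    if len(anonymized) != len(texts):
        raise ValueError("Expected one anonymized text per input text")
    return anonymized, report


# Typical use, given a DeIdAnonymizer with a loaded model:
# clean_notes, report = anonymize_corpus(anonymizer, notes, n_process=4)
```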

anonymize_dataframe(df, text_columns, redact=True, inplace=False, suffix='_anonymized', n_process=1, batch_size=100)

Anonymizes specified text columns in a pandas DataFrame.

Parameters:

df (DataFrame) – The input DataFrame.
text_columns (List[str]) – A list of column names containing the text to be anonymized.
redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.
inplace (bool) – If True, modifies the DataFrame in place by overwriting the original text columns. If False, returns a new DataFrame with anonymized columns added.
suffix (str) – The suffix to add to new anonymized column names. This is ignored if inplace is True.
n_process (int) – The number of processes for parallel execution.
batch_size (int) – The number of texts to process in each batch.

Return type:

DataFrame

Returns:

A DataFrame with the specified text columns anonymized.
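
The non-destructive mode (inplace=False) can be sketched as follows. The helper name `anonymize_notes_frame` and the column names in the comments are hypothetical. The up-front column check is an addition of this sketch, and the comment about the resulting column name assumes the suffix is appended to the original column name, which the description implies but does not spell out.

```python
from typing import List


def anonymize_notes_frame(anonymizer, df, text_columns: List[str],
                          suffix: str = "_anonymized"):
    """Return a DataFrame with anonymized copies of `text_columns` added."""
    # Fail early with a clear message if a requested column is absent.
    missing = [col for col in text_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    return anonymizer.anonymize_dataframe(
        df,
        text_columns,
        redact=True,
        inplace=False,  # keep the original text columns untouched
        suffix=suffix,  # only used because inplace is False
    )


# Typical use, given a DeIdAnonymizer with a loaded model and a pandas DataFrame:
# result = anonymize_notes_frame(anonymizer, notes_df, ["note_text"])
# The anonymized text is then presumably in result["note_text_anonymized"].
```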

inspect_text(text)

Inspects text to find and log PII entities without anonymizing.

This method is useful for debugging and for checking which entities the loaded model detects in a given piece of text.

Parameters: text (str) – The text to inspect.
Return type: List[Dict[str, Any]]
Returns: A list of dictionaries, each representing a found PII entity.

get_structured_annotations(text)

Gets structured annotations for PII entities in a text.

Parameters: text (str) – The input text to analyze.
Return type: List[Dict[str, Any]]
Returns: A list of dictionaries, where each dictionary contains details (text, label, start, end, confidence) for an identified PII entity.
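
The structured annotations lend themselves to simple filtering and counting. The sketch below assumes the dictionary keys are literally named `label` and `confidence`, matching the details listed above; the exact key names are not confirmed here. The helper names and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def high_confidence_entities(anonymizer, text: str,
                             min_confidence: float = 0.8) -> List[Dict[str, Any]]:
    """Return only the annotations whose confidence meets the threshold."""
    annotations = anonymizer.get_structured_annotations(text)
    # Assumption: each annotation dict has a numeric "confidence" entry.
    return [ann for ann in annotations if ann["confidence"] >= min_confidence]


def count_by_label(annotations: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count annotations per PII label, e.g. to see which types dominate."""
    # Assumption: each annotation dict has a "label" entry.
    return dict(Counter(ann["label"] for ann in annotations))


# Typical use, given a DeIdAnonymizer with a loaded model:
# entities = high_confidence_entities(anonymizer, note, min_confidence=0.9)
# print(count_by_label(entities))
```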

generate_report()

Generates a summary report of all operations performed.

Return type: Dict[str, Any]
Returns: A dictionary containing statistics about the anonymization operations, model details, and total texts processed.

save_log(filepath)

Saves the anonymization operation log to a JSON file.

Parameters: filepath (Union[str, Path]) – The path where the log file will be saved.
Return type: None
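
At the end of a run, the log and the summary report can be written side by side. The sketch below is one way to do that. The helper name `finish_run` and both file names are hypothetical. Writing the report to its own JSON file is an addition of this sketch, not something save_log does, and `default=str` is used because the report's value types are not specified.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def finish_run(anonymizer, out_dir: Union[str, Path]) -> Dict[str, Any]:
    """Save the operation log and the summary report into `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # save_log is documented to write the operation log as JSON.
    anonymizer.save_log(out_dir / "anonymization_log.json")

    # generate_report returns a plain dictionary; persist it separately.
    report = anonymizer.generate_report()
    report_path = out_dir / "anonymization_report.json"
    # default=str guards against values that json cannot serialize directly.
    report_path.write_text(json.dumps(report, indent=2, default=str), encoding="utf-8")
    return report


# Typical use, given a DeIdAnonymizer that has processed some texts:
# summary = finish_run(anonymizer, "deid_output")
```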

pat2vec.util.anonymisation_deid_documents.anonymize_single_text(text, model_path, redact=True)

A convenience function to quickly anonymize a single text string.

Parameters:

text (str) – The input text to anonymize.
model_path (Union[str, Path]) – The path to the DeIdModel pack.
redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

Return type:

str

Returns:

The anonymized text.

pat2vec.util.anonymisation_deid_documents.anonymize_dataframe_quick(df, text_columns, model_path, redact=True)

A convenience function to quickly anonymize columns in a DataFrame.

Parameters:

df (DataFrame) – The input DataFrame.
text_columns (List[str]) – A list of column names to anonymize.
model_path (Union[str, Path]) – The path to the DeIdModel pack.
redact (bool) – If True, replaces PII with asterisks. If False, uses type tags.

Return type:

DataFrame

Returns:

A new DataFrame with anonymized text columns.
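
Both convenience functions take model_path on every call. If that means the model pack is loaded each time (an assumption; the documentation does not say), repeated calls could be costly, and reusing one DeIdAnonymizer per model path may be preferable. The sketch below shows a generic way to cache such instances. The helper name `make_cached_loader` is hypothetical, and the factory is passed in so the sketch itself does not depend on pat2vec.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(factory: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap `factory` so each distinct model path is built only once."""

    @lru_cache(maxsize=None)
    def get_anonymizer(model_path: str) -> Any:
        # Called at most once per distinct model_path; later calls reuse
        # the object created on the first call.
        return factory(model_path)

    return get_anonymizer


# Typical use (requires pat2vec and a model pack; the path is a placeholder):
# from pat2vec.util.anonymisation_deid_documents import DeIdAnonymizer
# get_anonymizer = make_cached_loader(lambda path: DeIdAnonymizer(model_path=path))
# anonymizer = get_anonymizer("path/to/deid_model_pack.zip")
# clean = anonymizer.anonymize_text("some clinical note")
```

For one-off calls the convenience functions remain the simpler choice; the cache only pays off when many texts or DataFrames are processed with the same model.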