pat2vec.util.anonymisation_deid_documents
Functions
- anonymize_dataframe_quick: A convenience function to quickly anonymize columns in a DataFrame.
- anonymize_single_text: A convenience function to quickly anonymize a single text string.
Classes
- DeIdAnonymizer: A class for anonymizing clinical text using MedCAT's DeIdModel.
- class pat2vec.util.anonymisation_deid_documents.DeIdAnonymizer(model_path=None, log_level='INFO')
Bases: object
A class for anonymizing clinical text using MedCAT's DeIdModel.
This class encapsulates the functionality for loading a de-identification model, anonymizing text data in various formats (single string, list of strings, pandas DataFrame columns), and providing utilities for inspection and reporting.
- Parameters:
model_path (str | Path | None)
log_level (str)
- model
The loaded MedCAT DeIdModel instance.
- model_path
The path to the loaded model pack.
- is_loaded
A boolean indicating whether a model has been loaded successfully.
- pii_labels
A list of PII labels the loaded model is configured to redact.
- anonymization_log
A list of dictionaries logging each operation.
- logger
A configured logger instance for the class.
- __init__(model_path=None, log_level='INFO')
Initializes the DeIdAnonymizer.
- Parameters:
model_path (Union[str, Path, None]): Optional path to the MedCAT DeIdModel pack. If provided, the model is loaded upon initialization.
log_level (str): The logging level for the instance (e.g., 'INFO', 'DEBUG').
- load_model(model_path)
Loads a pre-trained DeIdModel from a specified path.
- Parameters:
model_path (Union[str, Path]): The path to the model pack (directory or .zip file).
- Return type:
bool
- Returns:
True if the model was loaded successfully, False otherwise.
- anonymize_text(text, redact=True, verify=False)
Anonymizes a single text string.
- Parameters:
text (str): The input text to anonymize.
redact (bool): If True, replaces PII with asterisks ('***'). If False, replaces PII with type tags (e.g., '<PERSON>').
verify (bool): If True, returns a tuple containing the anonymized text and a dictionary with verification information.
- Return type:
Union[str, Tuple[str, Dict[str, Any]]]
- Returns:
If verify is False, returns the anonymized text string. If verify is True, returns a tuple of (anonymized_text, verification_info).
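The difference between the two redact modes can be illustrated without a model. The sketch below is a stand-in, not the library's implementation: replace_entities is a hypothetical helper, the entity spans are hand-written rather than detected, and the replacement strings follow the description above ('***' and '<LABEL>'). The real model's output format may differ.

```python
from typing import Any, Dict, List


def replace_entities(text: str, entities: List[Dict[str, Any]],
                     redact: bool = True) -> str:
    """Replace each span with '***' (redact=True) or '<LABEL>' (redact=False)."""
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        replacement = "***" if redact else f"<{ent['label']}>"
        text = text[: ent["start"]] + replacement + text[ent["end"]:]
    return text


note = "Seen by Dr Smith on 2021-03-04."
# Hand-written spans for illustration; a real model would detect these.
spans = [
    {"start": 11, "end": 16, "label": "PERSON"},
    {"start": 20, "end": 30, "label": "DATE"},
]

redacted = replace_entities(note, spans, redact=True)
tagged = replace_entities(note, spans, redact=False)
print(redacted)
print(tagged)
```

Type tags keep the category of each removed item visible, which can help when reviewing output; asterisks reveal nothing about what was removed.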
- anonymize_texts(texts, redact=True, n_process=1, batch_size=100, verify_sample=False, sample_size=10)
Anonymizes a list of text strings, with parallel processing support.
- Parameters:
texts (List[str]): A list of input texts to anonymize.
redact (bool): If True, replaces PII with asterisks. If False, uses type tags.
n_process (int): The number of processes to use for parallel execution.
batch_size (int): The number of texts to process in each batch.
verify_sample (bool): If True, verifies a random sample of the results and returns a report.
sample_size (int): The size of the random sample to verify if verify_sample is True.
- Return type:
Union[List[str], Tuple[List[str], Dict]]
- Returns:
If verify_sample is False, returns a list of anonymized texts. If verify_sample is True, returns a tuple of (anonymized_texts, verification_report).
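Because the return shape depends on verify_sample, calling code that toggles the flag may find it convenient to normalise both shapes into one. The following is a small sketch under that assumption; split_result is a hypothetical helper, and the report dictionary in the example is made up, since the report's keys are not documented here.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

TextsResult = Union[List[str], Tuple[List[str], Dict[str, Any]]]


def split_result(result: TextsResult) -> Tuple[List[str], Optional[Dict[str, Any]]]:
    """Return (texts, report); report is None when no verification was requested."""
    if isinstance(result, tuple):
        texts, report = result
        return texts, report
    return result, None


# Shape returned when verify_sample is False: a plain list.
texts_only, no_report = split_result(["*** was admitted."])

# Shape returned when verify_sample is True: (texts, report).
# The report content below is a placeholder, not the library's format.
texts, report = split_result((["*** was admitted."], {"placeholder": True}))
```

The same pattern applies to anonymize_text, whose return type depends on verify in the same way.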
- anonymize_dataframe(df, text_columns, redact=True, inplace=False, suffix='_anonymized', n_process=1, batch_size=100)
Anonymizes specified text columns in a pandas DataFrame.
- Parameters:
df (DataFrame): The input DataFrame.
text_columns (List[str]): A list of column names containing the text to be anonymized.
redact (bool): If True, replaces PII with asterisks. If False, uses type tags.
inplace (bool): If True, modifies the DataFrame in place by overwriting the original text columns. If False, returns a new DataFrame with anonymized columns added.
suffix (str): The suffix to add to new anonymized column names. This is ignored if inplace is True.
n_process (int): The number of processes for parallel execution.
batch_size (int): The number of texts to process in each batch.
- Return type:
DataFrame
- Returns:
A DataFrame with the specified text columns anonymized.
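The interaction between inplace and suffix determines which columns hold the anonymized text. The sketch below works that out for a given call, assuming the suffix is appended directly to the original column name (the parameter description suggests this but does not state it). output_column_names is a hypothetical helper and does not touch a DataFrame.

```python
from typing import Dict, List


def output_column_names(text_columns: List[str], inplace: bool = False,
                        suffix: str = "_anonymized") -> Dict[str, str]:
    """Map each source column to the column that would hold its anonymized text."""
    if inplace:
        # Original columns are overwritten, so the suffix plays no role.
        return {col: col for col in text_columns}
    return {col: f"{col}{suffix}" for col in text_columns}


new_cols = output_column_names(["note", "summary"])
same_cols = output_column_names(["note", "summary"], inplace=True)
print(new_cols)
print(same_cols)
```

Leaving inplace at False keeps the original text alongside the anonymized version, which is convenient for spot checks but means the result still contains PII until the source columns are dropped.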
- inspect_text(text)
Inspects text to find and log PII entities without anonymizing.
This method is useful for debugging and for understanding what the loaded model can detect in a given piece of text.
- Parameters:
text (str): The text to inspect.
- Return type:
List[Dict[str, Any]]
- Returns:
A list of dictionaries, each representing a found PII entity.
- get_structured_annotations(text)
Gets structured annotations for PII entities in a text.
- Parameters:
text (str): The input text to analyze.
- Return type:
List[Dict[str, Any]]
- Returns:
A list of dictionaries, where each dictionary contains details (text, label, start, end, confidence) for an identified PII entity.
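A list of annotation dictionaries like this can be summarised with standard-library tools. The sketch below counts entities per label above a confidence threshold. It assumes the dictionary keys are spelled exactly 'text', 'label', 'start', 'end' and 'confidence', as the description suggests; summarise_annotations is a hypothetical helper and the sample annotations are hand-written.

```python
from collections import Counter
from typing import Any, Dict, List


def summarise_annotations(annotations: List[Dict[str, Any]],
                          min_confidence: float = 0.0) -> Counter:
    """Count entities per label, ignoring those below min_confidence."""
    return Counter(
        ann["label"] for ann in annotations if ann["confidence"] >= min_confidence
    )


# Hand-written sample in the assumed shape; not real model output.
sample = [
    {"text": "Smith", "label": "PERSON", "start": 11, "end": 16, "confidence": 0.98},
    {"text": "2021-03-04", "label": "DATE", "start": 20, "end": 30, "confidence": 0.91},
    {"text": "Jones", "label": "PERSON", "start": 40, "end": 45, "confidence": 0.42},
]

all_counts = summarise_annotations(sample)
confident_counts = summarise_annotations(sample, min_confidence=0.9)
print(all_counts)
print(confident_counts)
```

Comparing the counts with and without a threshold gives a quick view of how many detections are low-confidence.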
- pat2vec.util.anonymisation_deid_documents.anonymize_single_text(text, model_path, redact=True)
A convenience function to quickly anonymize a single text string.
- Parameters:
text (str): The input text to anonymize.
model_path (Union[str, Path]): The path to the DeIdModel pack.
redact (bool): If True, replaces PII with asterisks. If False, uses type tags.
- Return type:
str
- Returns:
The anonymized text.
- pat2vec.util.anonymisation_deid_documents.anonymize_dataframe_quick(df, text_columns, model_path, redact=True)
A convenience function to quickly anonymize columns in a DataFrame.
- Parameters:
df (DataFrame): The input DataFrame.
text_columns (List[str]): A list of column names to anonymize.
model_path (Union[str, Path]): The path to the DeIdModel pack.
redact (bool): If True, replaces PII with asterisks. If False, uses type tags.
- Return type:
DataFrame
- Returns:
A new DataFrame with anonymized text columns.