pat2vec.util.clinical_note_splitter

Functions

`find_date`(txt[, original_update_time_value, ...])	Finds and extracts date-stamped text chunks from a larger text body.
`split_and_append_chunks`(docs[, epr, mct, ...])	Filters, splits, and re-appends clinical notes within a DataFrame.
`split_clinical_notes`(clin_note[, verbosity_val])	Splits clinical notes from an EPR schema DataFrame into date-stamped chunks.
`split_clinical_notes_mct`(clin_note[, ...])	Splits clinical notes from an MCT schema DataFrame into date-stamped chunks.

pat2vec.util.clinical_note_splitter.find_date(txt, original_update_time_value=None, reg='Entered on -', window=20, verbosity=0)

Finds and extracts date-stamped text chunks from a larger text body.

This function scans a given text for a regular expression pattern that indicates a date entry (e.g., “Entered on -”). For each match, it attempts to parse a timestamp from the text that follows. It then splits the original text into chunks, with each chunk ending at a newly found timestamp.

Parameters:

txt (str) – The input text to search for date entries.
original_update_time_value (Optional[Timestamp]) – A fallback timestamp to use if a date cannot be parsed from a chunk.
reg (str) – The regular expression pattern to identify the start of a date entry.
window (int) – The character window size after the reg match to search for a timestamp.
verbosity (int) – The verbosity level for log messages.

Return type:

List[Dict[str, Any]]

Returns:

A list of dictionaries, where each dictionary represents a text chunk and contains the text, the parsed date, and metadata about the match.
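
The scan-and-split behaviour described above can be illustrated with a minimal, standalone sketch. It is not the pat2vec implementation: the function name find_date_sketch, the dictionary keys, and the timestamp layout (dd/mm/YYYY HH:MM) are assumptions made for illustration, and the sketch uses datetime rather than a pandas Timestamp.

```python
import re
from datetime import datetime
from typing import Any, Dict, List, Optional

# Assumed timestamp layout (hypothetical): "dd/mm/YYYY HH:MM".
TS_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}")
TS_FORMAT = "%d/%m/%Y %H:%M"


def find_date_sketch(
    txt: str,
    fallback: Optional[datetime] = None,
    reg: str = "Entered on -",
    window: int = 20,
) -> List[Dict[str, Any]]:
    """Split txt into chunks that each end at a date-entry marker."""
    chunks: List[Dict[str, Any]] = []
    start = 0  # where the current chunk begins
    for m in re.finditer(reg, txt):
        # Only look for a timestamp within `window` characters of the marker.
        snippet = txt[m.end() : m.end() + window]
        date = fallback
        end = m.end()
        ts = TS_PATTERN.search(snippet)
        if ts:
            try:
                date = datetime.strptime(ts.group(), TS_FORMAT)
                end = m.end() + ts.end()
            except ValueError:
                pass  # looks like a timestamp but is not a valid date
        chunks.append(
            {"text": txt[start:end].strip(), "date": date, "match_start": m.start()}
        )
        start = end
    # Text after the last marker is not returned by this sketch.
    return chunks


note = (
    "Patient stable. Entered on - 01/02/2020 10:30 "
    "Reviewed by team. Entered on - 03/02/2020 09:15"
)
for chunk in find_date_sketch(note):
    print(chunk["date"], "|", chunk["text"])
```

When no timestamp can be parsed within the window, the sketch assigns the fallback value to the chunk, mirroring the role the documentation gives to original_update_time_value.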

pat2vec.util.clinical_note_splitter.split_clinical_notes(clin_note, verbosity_val=0)

Splits clinical notes from an EPR schema DataFrame into date-stamped chunks.

This function iterates through a DataFrame of clinical notes (assuming an EPR-like schema with ‘body_analysed’ and ‘updatetime’ columns). It uses the find_date function to break down each note’s text into smaller documents based on embedded timestamps.

Parameters:

clin_note (DataFrame) – A DataFrame containing the clinical notes to be split.
verbosity_val (int) – The verbosity level passed to the find_date function.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

A tuple of two DataFrames: the processed notes, split into smaller chunks, and the original rows of notes that could not be split.
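
The two-part return value can be sketched without pandas by treating each DataFrame row as a dictionary. This is an illustration under stated assumptions, not the library's code: split_notes_sketch is a hypothetical name, the timestamp layout is assumed, "could not be split" is taken to mean "contains no date-entry marker", and overwriting ‘updatetime’ with the parsed value is a guess about the intended behaviour.

```python
import re
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # stand-in for one DataFrame row

# Assumed marker and timestamp layout (hypothetical).
MARKER = re.compile(r"Entered on -\s*(\d{2}/\d{2}/\d{4} \d{2}:\d{2})")


def split_notes_sketch(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (chunks from notes that were split, rows left unsplit)."""
    processed: List[Row] = []
    unsplit: List[Row] = []
    for row in rows:
        text = row["body_analysed"]
        matches = list(MARKER.finditer(text))
        if not matches:
            unsplit.append(row)  # no marker found: keep the original row
            continue
        start = 0
        for m in matches:
            chunk = dict(row)  # copy the other columns unchanged
            chunk["body_analysed"] = text[start : m.end()].strip()
            chunk["updatetime"] = m.group(1)  # assumption: chunk takes the parsed time
            processed.append(chunk)
            start = m.end()
    return processed, unsplit


rows = [
    {
        "id": 1,
        "body_analysed": "A. Entered on - 01/02/2020 10:30 B. Entered on - 02/02/2020 11:00",
        "updatetime": "2020-02-05",
    },
    {"id": 2, "body_analysed": "No markers here", "updatetime": "2020-02-06"},
]
processed, unsplit = split_notes_sketch(rows)
print(len(processed), "chunks;", len(unsplit), "unsplit row(s)")
```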

pat2vec.util.clinical_note_splitter.split_clinical_notes_mct(clin_note, verbosity_val=0)

Splits clinical notes from an MCT schema DataFrame into date-stamped chunks.

This function is similar to split_clinical_notes but is tailored for an MCT/observations schema (with ‘observation_valuetext_analysed’ and ‘observationdocument_recordeddtm’ columns). It breaks down each note’s text into smaller documents based on embedded timestamps.

Parameters:

clin_note (DataFrame) – A DataFrame containing the clinical notes to be split.
verbosity_val (int) – The verbosity level passed to the find_date function.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

A tuple of two DataFrames: the processed notes, split into smaller chunks, and the original rows of notes that could not be split.

pat2vec.util.clinical_note_splitter.split_and_append_chunks(docs, epr=True, mct=False, verbosity=0)

Filters, splits, and re-appends clinical notes within a DataFrame.

This function acts as a wrapper to orchestrate the clinical note splitting process. It identifies clinical notes within a larger document DataFrame, sends them to the appropriate splitting function (split_clinical_notes or split_clinical_notes_mct), and then concatenates the resulting smaller chunks back with the original non-clinical documents.

Parameters:

docs (DataFrame) – The input DataFrame containing various document types.
epr (bool) – If True, assumes an EPR schema for splitting.
mct (bool) – If True, assumes an MCT/observations schema for splitting.
verbosity (int) – The verbosity level for logging and splitting.

Return type:

DataFrame

Returns:

A new DataFrame containing the original non-clinical notes plus the newly created smaller chunks from the split clinical notes.
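
The filter, split, and re-append flow can be sketched as follows, again with dictionaries standing in for DataFrame rows. Everything named here is hypothetical: split_and_append_sketch, the ‘document_description’ column used to identify clinical notes, and the toy splitter are assumptions for illustration. The documentation does not state how clinical notes are identified or what happens to rows that could not be split, so the sketch simply returns the non-clinical rows plus the new chunks.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]  # stand-in for one DataFrame row
Splitter = Callable[[List[Row]], Tuple[List[Row], List[Row]]]


def split_and_append_sketch(
    docs: List[Row],
    splitter: Splitter,
    is_clinical_note: Callable[[Row], bool],
) -> List[Row]:
    """Split the clinical notes and append the chunks to the other documents."""
    clinical = [r for r in docs if is_clinical_note(r)]
    other = [r for r in docs if not is_clinical_note(r)]
    chunks, _unsplit = splitter(clinical)  # unsplit rows are ignored in this sketch
    return other + chunks


def toy_splitter(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Hypothetical splitter: breaks text on '|' instead of date markers."""
    chunks = []
    for row in rows:
        for part in row["body_analysed"].split("|"):
            chunks.append({**row, "body_analysed": part})
    return chunks, []


docs = [
    {"document_description": "clinical note", "body_analysed": "a|b"},
    {"document_description": "discharge letter", "body_analysed": "c"},
]
result = split_and_append_sketch(
    docs,
    toy_splitter,
    lambda r: r["document_description"] == "clinical note",  # assumed filter
)
print([r["body_analysed"] for r in result])
```

Passing the splitter as an argument mirrors the way the wrapper chooses between split_clinical_notes and split_clinical_notes_mct according to the epr and mct flags.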