pat2vec.util.clinical_note_splitter
Functions
| find_date | Finds and extracts date-stamped text chunks from a larger text body. |
| split_and_append_chunks | Filters, splits, and re-appends clinical notes within a DataFrame. |
| split_clinical_notes | Splits clinical notes from an EPR schema DataFrame into date-stamped chunks. |
| split_clinical_notes_mct | Splits clinical notes from an MCT schema DataFrame into date-stamped chunks. |
- pat2vec.util.clinical_note_splitter.find_date(txt, original_update_time_value=None, reg='Entered on -', window=20, verbosity=0)
Finds and extracts date-stamped text chunks from a larger text body.
This function scans the given text for a regular expression pattern that marks a date entry (e.g., “Entered on -”). For each match, it attempts to parse a timestamp from the text that follows, searching within window characters of the match. It then splits the original text into chunks, each ending at a newly found timestamp.
- Parameters:
txt (str) – The input text to search for date entries.
original_update_time_value (Optional[Timestamp]) – A fallback timestamp to use if a date cannot be parsed from a chunk.
reg (str) – The regular expression pattern to identify the start of a date entry.
window (int) – The character window size after the reg match to search for a timestamp.
verbosity (int) – The level of logging for messages.
- Return type:
List[Dict[str, Any]]
- Returns:
A list of dictionaries, where each dictionary represents a text chunk and contains the text, the parsed date, and metadata about the match.
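The marker-and-window splitting described for find_date can be illustrated with a minimal standard-library sketch. This is not pat2vec's implementation: the function name `sketch_find_date`, the dictionary keys, and the `DD/MM/YYYY HH:MM` timestamp layout are all assumptions made for illustration, and the real function may parse dates and report metadata differently.

```python
import re
from datetime import datetime
from typing import Any, Dict, List, Optional

# Assumed timestamp layout for this sketch only, e.g. "01/02/2020 10:30".
_TS_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}")
_TS_FORMAT = "%d/%m/%Y %H:%M"


def sketch_find_date(
    txt: str,
    fallback: Optional[datetime] = None,
    reg: str = "Entered on -",
    window: int = 20,
) -> List[Dict[str, Any]]:
    """Split txt into chunks that each end at a marker's timestamp."""
    chunks: List[Dict[str, Any]] = []
    start = 0  # index where the current chunk begins
    for m in re.finditer(reg, txt):
        # Search for a timestamp only within `window` characters after the marker.
        tail = txt[m.end(): m.end() + window]
        ts = _TS_PATTERN.search(tail)
        if ts is not None:
            date = datetime.strptime(ts.group(0), _TS_FORMAT)
            end = m.end() + ts.end()
        else:
            # No parsable timestamp: use the fallback and cut right after the marker.
            date = fallback
            end = m.end()
        chunks.append(
            {
                "text": txt[start:end].strip(),  # hypothetical key names
                "date": date,
                "match_start": m.start(),
            }
        )
        start = end
    return chunks


note = (
    "Patient stable. Entered on - 01/02/2020 10:30 "
    "Reviewed by team. Entered on - 03/02/2020 09:15"
)
chunks = sketch_find_date(note)
```

In this sketch, a marker with no timestamp inside the window still closes a chunk, and that chunk takes the fallback value as its date, mirroring the role the documentation gives to original_update_time_value.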
- pat2vec.util.clinical_note_splitter.split_clinical_notes(clin_note, verbosity_val=0)
Splits clinical notes from an EPR schema DataFrame into date-stamped chunks.
This function iterates through a DataFrame of clinical notes (assuming an EPR-like schema with ‘body_analysed’ and ‘updatetime’ columns). It uses the find_date function to break down each note’s text into smaller documents based on embedded timestamps.
- Parameters:
clin_note (DataFrame) – A DataFrame containing the clinical notes to be split.
verbosity_val (int) – The verbosity level passed to the find_date function.
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
A tuple containing two DataFrames: the processed notes, split into smaller chunks, and the original rows of notes that could not be split.
- pat2vec.util.clinical_note_splitter.split_clinical_notes_mct(clin_note, verbosity_val=0)
Splits clinical notes from an MCT schema DataFrame into date-stamped chunks.
This function is similar to split_clinical_notes but is tailored for an MCT/observations schema (with ‘observation_valuetext_analysed’ and ‘observationdocument_recordeddtm’ columns). It breaks down each note’s text into smaller documents based on embedded timestamps.
- Parameters:
clin_note (DataFrame) – A DataFrame containing the clinical notes to be split.
verbosity_val (int) – The verbosity level passed to the find_date function.
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
A tuple containing two DataFrames: the processed notes, split into smaller chunks, and the original rows of notes that could not be split.
- pat2vec.util.clinical_note_splitter.split_and_append_chunks(docs, epr=True, mct=False, verbosity=0)
Filters, splits, and re-appends clinical notes within a DataFrame.
This function acts as a wrapper to orchestrate the clinical note splitting process. It identifies clinical notes within a larger document DataFrame, sends them to the appropriate splitting function (split_clinical_notes or split_clinical_notes_mct), and then concatenates the resulting smaller chunks back with the original non-clinical documents.
- Parameters:
docs (DataFrame) – The input DataFrame containing various document types.
epr (bool) – If True, assumes an EPR schema for splitting.
mct (bool) – If True, assumes an MCT/observations schema for splitting.
verbosity (int) – The verbosity level for logging and splitting.
- Return type:
DataFrame
- Returns:
A new DataFrame containing the original non-clinical notes plus the newly created smaller chunks from the split clinical notes.
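The filter, split, and re-append flow described for split_and_append_chunks can be sketched without pandas by treating each row as a plain dictionary. Everything here is hypothetical: the name `sketch_split_and_append`, the `is_clinical` predicate, the `splitter` callable, and the `kind` and `body` keys are illustration-only assumptions. The documentation does not say how clinical notes are identified or how unsplittable notes are handled, so the sketch does not model either.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def sketch_split_and_append(
    docs: List[Row],
    is_clinical: Callable[[Row], bool],
    splitter: Callable[[Row], List[Row]],
) -> List[Row]:
    """Keep non-clinical rows and replace clinical rows with their chunks."""
    clinical = [row for row in docs if is_clinical(row)]
    other = [row for row in docs if not is_clinical(row)]
    chunks: List[Row] = []
    for row in clinical:
        # Each clinical row may expand into several smaller rows.
        chunks.extend(splitter(row))
    # Original non-clinical rows first, new chunks appended after them.
    return other + chunks


docs = [
    {"kind": "letter", "body": "Discharge letter."},
    {"kind": "clinical_note", "body": "Seen in clinic.|Follow-up booked."},
]
result = sketch_split_and_append(
    docs,
    is_clinical=lambda row: row["kind"] == "clinical_note",
    # Toy splitter: cut the body on "|" and copy the other fields to each chunk.
    splitter=lambda row: [{**row, "body": part} for part in row["body"].split("|")],
)
```

Passing the splitter as a callable loosely mirrors how the epr and mct flags select between split_clinical_notes and split_clinical_notes_mct, though the real wrapper operates on DataFrames rather than lists.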