pat2vec.util.methods_get

Functions

add_offset_column(dataframe, ...[, verbose])

Adds a new column with a time offset from a starting datetime column.

build_patient_dict(dataframe, ...)

Builds a dictionary mapping patient IDs to (start, end) datetime tuples.

convert_date(date_string)

Converts a date string in 'YYYY-MM-DD' format to a datetime object.

convert_timestamp_to_tuple(timestamp)

Converts a timestamp string to a (year, month) tuple.

create_folders(all_patient_list[, config_obj])

Creates folders for each patient in the specified paths.

create_folders_annot_csv_wrapper([config_obj])

Creates folders locally or remotely based on the configuration.

create_folders_for_pat(patient_id[, config_obj])

Creates folders for a single patient in the specified paths.

create_local_folders([config_obj])

Creates local project directories for storing intermediate files.

create_remote_folders([config_obj])

Creates remote project directories for storing intermediate files via SFTP.

dump_results(file_data, path[, config_obj])

Saves data to a file using pickle, either locally or remotely via SFTP.

enum_exact_target_date_vector(...)

Creates a one-hot encoded date vector for a specific target date.

enum_target_date_vector(target_date_range, ...)

Creates a one-hot encoded date vector for a target date.

exist_check(path[, config_obj])

Checks if a file or directory exists, either locally or remotely.

filter_stripped_list(stripped_list[, config_obj])

Filters a list of patients to exclude those already processed.

get_empty_date_vector(config_obj)

Creates an empty DataFrame with one-hot encoded date columns.

get_free_gpu()

Identifies and returns the GPU with the most available free memory.

list_dir_wrapper(path[, config_obj])

Lists the contents of a directory, either locally or remotely via SFTP.

read_csv_wrapper(path[, config_obj])

Reads CSV data from a file, handling both local and remote paths.

read_remote(path[, config_obj])

Reads a remote CSV file via SFTP and returns a pandas DataFrame.

sftp_exists(path, config_obj)

Checks if a file or directory exists on a remote SFTP server.

test_datetime_formats()

Tests add_offset_column with various datetime formats.

update_pbar(current_pat_client_id_code, ...)

Updates a tqdm progress bar with formatted information about the current processing state.

write_csv_wrapper(path[, csv_file_data, ...])

Writes CSV data to a file either locally or remotely.

write_remote(path, csv_file[, config_obj])

Writes a pandas DataFrame to a remote file via SFTP.

pat2vec.util.methods_get.list_dir_wrapper(path, config_obj=None)

Lists the contents of a directory, either locally or remotely via SFTP.

This function acts as a wrapper around os.listdir and sftp.listdir to provide a consistent interface for listing directory contents based on the remote_dump setting in the configuration object.

Parameters:
  • path (str) – The path to the directory to list.

  • config_obj (Optional[Any]) – The configuration object containing SFTP credentials and settings if remote_dump is True.

Return type:

List[str]

Returns:

A list of filenames in the specified directory.
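The local/remote dispatch described above can be sketched as follows. This is a minimal illustration, not the library's implementation: list_dir_sketch and the remote_listdir argument are hypothetical names, and the only detail taken from the text is that a remote_dump attribute on the configuration object selects the remote branch.

```python
import os
import tempfile
from types import SimpleNamespace
from typing import Any, Callable, List, Optional


def list_dir_sketch(
    path: str,
    config_obj: Optional[Any] = None,
    remote_listdir: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Return directory entries from local disk or from a remote lister."""
    # A missing config, or remote_dump=False, means "use the local filesystem".
    use_remote = bool(getattr(config_obj, "remote_dump", False))
    if use_remote:
        if remote_listdir is None:
            raise ValueError("remote_dump is set but no remote lister was supplied")
        # Hypothetical hook: in practice this would be an SFTP client's listdir.
        return list(remote_listdir(path))
    return os.listdir(path)


# Local branch: list a temporary directory containing one file.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "a.csv"), "w"):
        pass
    local_entries = list_dir_sketch(tmp_dir, SimpleNamespace(remote_dump=False))

# Remote branch: a lambda stands in for a real SFTP connection.
remote_entries = list_dir_sketch(
    "/remote/dir",
    SimpleNamespace(remote_dump=True),
    remote_listdir=lambda p: ["b.csv"],
)
```

Passing the remote lister in as a callable keeps the sketch runnable without a server; the real function presumably builds its SFTP connection from the credentials held on config_obj.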

pat2vec.util.methods_get.convert_timestamp_to_tuple(timestamp)

Converts a timestamp string to a (year, month) tuple.

Parameters:

timestamp (str) – The timestamp string to convert, expected in the format %Y-%m-%dT%H:%M:%S.%f%z.

Return type:

Tuple[int, int]

Returns:

A tuple containing the year and month as integers.
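As an illustration of the documented format string, the conversion can be reproduced with the standard library alone. The function name timestamp_to_year_month is hypothetical; only the format %Y-%m-%dT%H:%M:%S.%f%z and the (year, month) result come from the description above.

```python
from datetime import datetime
from typing import Tuple

# Format string quoted in the parameter description above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def timestamp_to_year_month(timestamp: str) -> Tuple[int, int]:
    """Parse a timestamp string and return its (year, month) pair."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.year, parsed.month


result = timestamp_to_year_month("2023-05-17T10:30:00.000000+0000")
```

Note that strptime raises ValueError when the fractional seconds or the UTC offset are missing, so inputs must match the format exactly.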

pat2vec.util.methods_get.enum_target_date_vector(target_date_range, current_pat_client_id_code, config_obj)

Creates a one-hot encoded date vector for a target date.

Parameters:
  • target_date_range (Tuple[int, int, int]) – A tuple of (year, month, day) for the target date.

  • current_pat_client_id_code (str) – The patient’s ID.

  • config_obj (Any) – The configuration object.

Return type:

DataFrame

Returns:

A single-row DataFrame with a one-hot encoded column for the target date.

pat2vec.util.methods_get.enum_exact_target_date_vector(target_date_range, current_pat_client_id_code, config_obj)

Creates a one-hot encoded date vector for a specific target date.

Parameters:
  • target_date_range (Tuple[int, int, int]) – A tuple of (year, month, day) for the target date.

  • current_pat_client_id_code (str) – The patient’s ID.

  • config_obj (Any) – The configuration object (currently unused).

Return type:

DataFrame

Returns:

A single-row DataFrame with a one-hot encoded column for the target date.

pat2vec.util.methods_get.dump_results(file_data, path, config_obj=None)

Saves data to a file using pickle, either locally or remotely via SFTP.

Parameters:
  • file_data (Any) – The Python object to be pickled.

  • path (str) – The destination file path.

  • config_obj (Optional[Any]) – The configuration object containing SFTP credentials and settings if remote_dump is True.

Return type:

None
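The local half of this behaviour amounts to a pickle round trip. The sketch below covers only that case; dump_local and load_local are hypothetical helper names, and the SFTP branch is omitted because the source does not describe how the remote connection is made.

```python
import os
import pickle
import tempfile
from typing import Any


def dump_local(file_data: Any, path: str) -> None:
    """Pickle an arbitrary Python object to a local file."""
    with open(path, "wb") as handle:
        pickle.dump(file_data, handle)


def load_local(path: str) -> Any:
    """Read a pickled object back from a local file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "results.pkl")
    dump_local({"patient": "P001", "n_docs": 3}, target)
    restored = load_local(target)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.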

pat2vec.util.methods_get.update_pbar(current_pat_client_id_code, start_time, stage_int, stage_str, t, config_obj, skipped_counter=None, **n_docs_to_annotate)

Updates a tqdm progress bar with formatted information about the current processing state.

This function dynamically sets the description and color of a tqdm progress bar to reflect the current patient, processing stage, and execution time. The color changes to indicate slow performance if the elapsed time exceeds predefined thresholds.

Parameters:
  • current_pat_client_id_code (str) – The identifier of the patient currently being processed.

  • start_time (datetime) – The start time of the current operation. Note: This parameter is currently overwritten by config_obj.start_time.

  • stage_int (int) – An integer representing the processing stage. Note: This parameter is currently unused.

  • stage_str (str) – A string describing the current processing stage (e.g., “demo”, “annotating”).

  • t (tqdm) – The tqdm progress bar instance to update.

  • config_obj (Any) – A configuration object containing settings like start_time, multi_process, and various slow_execution_threshold values.

  • skipped_counter (Union[int, Any, None]) – A counter for the number of skipped items. Can be a standard integer or a multiprocessing-safe value. Defaults to None.

  • **n_docs_to_annotate (Any) – Arbitrary keyword arguments that are displayed at the end of the progress bar description. Useful for showing counts like the number of documents to annotate.

Return type:

None
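The threshold-based colouring can be sketched without a progress bar at all. Everything named here is an assumption for illustration: the threshold values, the colour names, and the description layout are hypothetical, and the real thresholds are read from the slow_execution_threshold settings on config_obj.

```python
from datetime import datetime, timedelta
from typing import Any

# Hypothetical (seconds, colour) pairs in ascending order; not the library's values.
THRESHOLDS = [(10.0, "yellow"), (30.0, "red")]


def pick_colour(elapsed_seconds: float, default: str = "green") -> str:
    """Return the colour of the highest threshold that has been exceeded."""
    colour = default
    for limit, slow_colour in THRESHOLDS:
        if elapsed_seconds > limit:
            colour = slow_colour
    return colour


def describe(patient_id: str, stage_str: str, start_time: datetime,
             now: datetime, **extra: Any) -> str:
    """Build a description string; extra keyword arguments are appended at the end."""
    elapsed = (now - start_time).total_seconds()
    suffix = " ".join(f"{key}={value}" for key, value in extra.items())
    return f"{patient_id} | {stage_str} | {elapsed:.1f}s {suffix}".rstrip()


t0 = datetime(2024, 1, 1, 12, 0, 0)
text = describe("P001", "annotating", t0, t0 + timedelta(seconds=12), n_docs=4)
colour = pick_colour(12.0)
```

With a real tqdm bar, the two results would typically be applied through its set_description method and its colour attribute.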

pat2vec.util.methods_get.get_free_gpu()

Identifies and returns the GPU with the most available free memory.

This function executes the nvidia-smi command-line utility to query the GPU memory usage.

Return type:

Tuple[int, str]
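One way to select a GPU from nvidia-smi output is sketched below. The source does not say which query flags are used or what the two tuple elements mean, so the query shown and the (GPU index, free memory in MiB as a string) interpretation are assumptions; pick_freest_gpu and query_free_gpu are hypothetical names.

```python
import subprocess
from typing import Tuple

# Assumed query: one line per GPU holding its free memory in MiB, no header.
QUERY = ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,nounits,noheader"]


def pick_freest_gpu(smi_output: str) -> Tuple[int, str]:
    """Return (row index, free-memory string) for the row with the most free memory."""
    free_values = [line.strip() for line in smi_output.splitlines() if line.strip()]
    if not free_values:
        raise ValueError("no GPU rows found in nvidia-smi output")
    best_index = max(range(len(free_values)), key=lambda i: int(free_values[i]))
    return best_index, free_values[best_index]


def query_free_gpu() -> Tuple[int, str]:
    """Run nvidia-smi and pick the freest GPU; requires an NVIDIA driver."""
    output = subprocess.check_output(QUERY, text=True)
    return pick_freest_gpu(output)


# Parsing demonstrated on canned output, so no GPU is needed to run this.
sample_output = "1024\n8192\n4096\n"
best = pick_freest_gpu(sample_output)
```

Separating the parsing from the subprocess call keeps the selection logic testable on machines without nvidia-smi.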

pat2vec.util.methods_get.convert_date(date_string)[0:0]

Converts a date string in ‘YYYY-MM-DD’ format to a datetime object.

Parameters:

date_string (str) – The string to convert, which may include a time part (e.g., ‘YYYY-MM-DDTHH:MM:SS’).

Return type:

datetime

Returns:

A datetime object representing the date part of the string.
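A minimal sketch of the behaviour described: drop any time part after the 'T' separator, then parse the remaining 'YYYY-MM-DD' text. The name parse_date_part is hypothetical, and the library may handle edge cases differently.

```python
from datetime import datetime


def parse_date_part(date_string: str) -> datetime:
    """Parse the 'YYYY-MM-DD' prefix of a string, ignoring any 'T...' time part."""
    date_part = date_string.split("T")[0]
    return datetime.strptime(date_part, "%Y-%m-%d")


with_time = parse_date_part("2021-03-04T12:30:00")
date_only = parse_date_part("2021-03-04")
```

Both calls return a datetime at midnight on the given day, because the time part is discarded before parsing.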

pat2vec.util.methods_get.write_csv_wrapper(path, csv_file_data=None, config_obj=None)

Writes CSV data to a file either locally or remotely.

Parameters:
  • path (str) – The path to the destination CSV file.

  • csv_file_data (Optional[DataFrame]) – The DataFrame to write.

  • config_obj (Optional[Any]) – An object containing configuration settings, including ‘remote_dump’.

Return type:

None

pat2vec.util.methods_get.read_remote(path, config_obj=None)

Reads a remote CSV file via SFTP and returns a pandas DataFrame.

Parameters:
  • path (str) – The remote path of the CSV file to read.

  • config_obj (Optional[Any]) – An object containing configuration details.

Return type:

DataFrame

Returns:

The DataFrame containing the data read from the remote CSV file.

pat2vec.util.methods_get.read_csv_wrapper(path, config_obj=None)

Reads CSV data from a file, handling both local and remote paths.

This function is a wrapper that calls either pd.read_csv for local files or read_remote for SFTP paths, based on the remote_dump flag in the configuration.

Parameters:
  • path (str) – The path to the CSV file (local or remote).

  • config_obj (Optional[Any]) – An object containing configuration settings, including ‘remote_dump’.

Return type:

DataFrame

Returns:

The DataFrame containing the data read from the CSV file.

pat2vec.util.methods_get.create_local_folders(config_obj=None)

Creates local project directories for storing intermediate files.

Parameters:

config_obj (Optional[Any]) – The configuration object containing root_path and proj_name.

Return type:

None

pat2vec.util.methods_get.create_remote_folders(config_obj=None)

Creates remote project directories for storing intermediate files via SFTP.

Parameters:

config_obj (Optional[Any]) – An object containing configuration details like root_path, proj_name, and SFTP credentials.

Raises:

ValueError – If config_obj is not provided.

Return type:

None

pat2vec.util.methods_get.create_folders_annot_csv_wrapper(config_obj=None)

Creates folders locally or remotely based on the configuration.

This function is a wrapper that calls either create_local_folders or create_remote_folders based on the remote_dump flag in the config.

Parameters:

config_obj (Optional[Any]) – The configuration object.

Return type:

None

pat2vec.util.methods_get.get_empty_date_vector(config_obj)

Creates an empty DataFrame with one-hot encoded date columns.

The columns are generated based on the time window settings in the configuration object.

Parameters:

config_obj (Any) – The configuration object with time window settings.

Return type:

DataFrame

Returns:

A single-row DataFrame with columns for each date in the time window, initialized to 0.0.

pat2vec.util.methods_get.sftp_exists(path, config_obj)

Checks if a file or directory exists on a remote SFTP server.

Parameters:
  • path (str) – The remote path to check.

  • config_obj (Any) – The configuration object containing SFTP credentials and settings.

Return type:

bool

Returns:

True if the path exists, False otherwise.
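A common way to test for a remote path is to call the client's stat method and treat an I/O error as "missing". The sketch below assumes a client object exposing stat(path), as paramiko's SFTPClient does; the source does not state which SFTP library is used, and remote_path_exists and FakeSftp are hypothetical names.

```python
from typing import Any, Iterable


def remote_path_exists(sftp_client: Any, path: str) -> bool:
    """Return True if stat() succeeds on the remote path, False on an I/O error."""
    try:
        sftp_client.stat(path)
    except IOError:
        # Missing paths surface as IOError/OSError (FileNotFoundError is a subclass).
        return False
    return True


class FakeSftp:
    """Stand-in client for the demo: knows a fixed set of paths."""

    def __init__(self, known_paths: Iterable[str]) -> None:
        self._known = set(known_paths)

    def stat(self, path: str) -> object:
        if path not in self._known:
            raise FileNotFoundError(path)
        return object()


client = FakeSftp(["/data/project/current_pat_lines"])
found = remote_path_exists(client, "/data/project/current_pat_lines")
missing = remote_path_exists(client, "/data/project/nope")
```

Catching every IOError also hides permission problems, so a stricter version could inspect the error's errno before returning False.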

pat2vec.util.methods_get.exist_check(path, config_obj=None)

Checks if a file or directory exists, either locally or remotely.

This is a wrapper around os.path.exists and sftp_exists that checks the remote_dump flag in the configuration object.

Parameters:
  • path (str) – The path to check.

  • config_obj (Optional[Any]) – The configuration object.

Return type:

bool

Returns:

True if the path exists, False otherwise.

pat2vec.util.methods_get.filter_stripped_list(stripped_list, config_obj=None)

Filters a list of patients to exclude those already processed.

Checks if a patient’s output directory contains at least n_pat_lines files, indicating that processing for that patient is complete.

Parameters:
  • stripped_list (List[str]) – The initial list of patient IDs to process.

  • config_obj (Optional[Any]) – The configuration object containing paths and settings.

Return type:

Tuple[List, List]

Returns:

A tuple containing two lists: the filtered list of patients to be processed, and the original filtered list (for reference).
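The completion check can be illustrated with local directories only. This is a simplified sketch under stated assumptions: remaining_patients is a hypothetical name, it takes the output root and n_pat_lines directly instead of reading them from a configuration object, and it returns a single list, not the two-list tuple of the real function.

```python
import os
import tempfile
from typing import List


def remaining_patients(patient_ids: List[str], output_root: str,
                       n_pat_lines: int) -> List[str]:
    """Keep only patients whose output directory holds fewer than n_pat_lines files."""
    todo = []
    for patient_id in patient_ids:
        patient_dir = os.path.join(output_root, patient_id)
        # A missing directory counts as zero files, so the patient is kept.
        n_files = len(os.listdir(patient_dir)) if os.path.isdir(patient_dir) else 0
        if n_files < n_pat_lines:
            todo.append(patient_id)
    return todo


with tempfile.TemporaryDirectory() as root:
    # P001 is complete (3 files), P002 has an empty directory, P003 has none.
    os.makedirs(os.path.join(root, "P001"))
    for i in range(3):
        with open(os.path.join(root, "P001", f"{i}.csv"), "w"):
            pass
    os.makedirs(os.path.join(root, "P002"))
    todo = remaining_patients(["P001", "P002", "P003"], root, n_pat_lines=3)
```

Here only P001 is treated as finished, so P002 and P003 remain in the list to be processed.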

pat2vec.util.methods_get.create_folders(all_patient_list, config_obj=None)

Creates folders for each patient in the specified paths.

Parameters:
  • all_patient_list (List[str]) – List of patient IDs.

  • config_obj (Optional[Any]) – Configuration object containing paths and verbosity level.

Return type:

None

pat2vec.util.methods_get.create_folders_for_pat(patient_id, config_obj=None)

Creates folders for a single patient in the specified paths.

Parameters:
  • patient_id (str) – The patient’s ID.

  • config_obj (Optional[Any]) – Configuration object containing paths and verbosity level.

Return type:

None

pat2vec.util.methods_get.add_offset_column(dataframe, start_column_name, offset_column_name, time_offset, verbose=1)

Adds a new column with a time offset from a starting datetime column.

Handles multiple datetime formats flexibly.

Parameters:
  • dataframe (DataFrame) – The input DataFrame.

  • start_column_name (str) – The name of the column with the starting datetime.

  • offset_column_name (str) – The name for the new column to be created.

  • time_offset (Union[timedelta, Any]) – The time period offset to add to the start time.

  • verbose (int) – Verbosity level (0=silent, 1=basic, 2=detailed).

Return type:

DataFrame

Returns:

The modified DataFrame with the new offset column.
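A rough sketch of the idea using pandas, assuming string inputs that pandas can parse. add_offset_sketch is a hypothetical name; the real function's format handling and verbose output are not reproduced, and this version works on a copy where the original may modify its input.

```python
from datetime import timedelta

import pandas as pd


def add_offset_sketch(dataframe: pd.DataFrame, start_column_name: str,
                      offset_column_name: str, time_offset: timedelta) -> pd.DataFrame:
    """Return a copy with a new column equal to the start datetime plus an offset."""
    result = dataframe.copy()
    # Parse each value on its own so rows may use different string formats;
    # values that cannot be parsed become NaT instead of raising.
    parsed = result[start_column_name].apply(
        lambda value: pd.to_datetime(value, errors="coerce")
    )
    result[offset_column_name] = pd.to_datetime(parsed) + time_offset
    return result


frame = pd.DataFrame({"start": ["2020-01-01", "2020-02-15 08:30:00"]})
shifted = add_offset_sketch(frame, "start", "end", timedelta(days=7))
```

The first row uses a date-only string and the second a full datetime string; both are shifted forward by seven days in the new end column.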

pat2vec.util.methods_get.test_datetime_formats()

Tests add_offset_column with various datetime formats.

pat2vec.util.methods_get.build_patient_dict(dataframe, patient_id_column, start_column, end_column)

Builds a dictionary mapping patient IDs to (start, end) datetime tuples.

Parameters:
  • dataframe (DataFrame) – The input DataFrame.

  • patient_id_column (str) – The name of the column containing patient IDs.

  • start_column (str) – The name of the column containing start datetimes.

  • end_column (str) – The name of the column containing end datetimes.

Return type:

Dict[str, Tuple[datetime, datetime]]

Returns:

A dictionary where keys are patient IDs and values are (start, end) tuples.
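The mapping can be sketched with pandas as below. build_dict_sketch is a hypothetical name, and the conversion of IDs to str and of values to plain datetime objects is an assumption chosen to match the documented return type.

```python
from datetime import datetime
from typing import Dict, Tuple

import pandas as pd


def build_dict_sketch(dataframe: pd.DataFrame, patient_id_column: str,
                      start_column: str, end_column: str
                      ) -> Dict[str, Tuple[datetime, datetime]]:
    """Map each patient ID to a (start, end) pair of datetime objects."""
    starts = pd.to_datetime(dataframe[start_column])
    ends = pd.to_datetime(dataframe[end_column])
    return {
        str(patient_id): (start.to_pydatetime(), end.to_pydatetime())
        for patient_id, start, end in zip(dataframe[patient_id_column], starts, ends)
    }


cohort = pd.DataFrame(
    {
        "client_idcode": ["P001", "P002"],
        "start": ["2020-01-01", "2021-03-15"],
        "end": ["2020-06-01", "2021-09-15"],
    }
)
windows = build_dict_sketch(cohort, "client_idcode", "start", "end")
```

If a patient ID appears on more than one row, the last row wins in this sketch, because later dictionary entries overwrite earlier ones.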

pat2vec.util.methods_get.write_remote(path, csv_file, config_obj=None)

Writes a pandas DataFrame to a remote file via SFTP.

Parameters:
  • path (str) – The remote path where the file should be written.

  • csv_file (DataFrame) – The DataFrame to be written.

  • config_obj (Optional[Any]) – An object containing SFTP configuration details.

Raises:

ValueError – If config_obj is not provided.