pat2vec.util.methods_get
Functions
| add_offset_column | Adds a new column with a time offset from a starting datetime column. |
| build_patient_dict | Builds a dictionary mapping patient IDs to (start, end) datetime tuples. |
| convert_date | Converts a date string in 'YYYY-MM-DD' format to a datetime object. |
| convert_timestamp_to_tuple | Converts a timestamp string to a (year, month) tuple. |
| create_folders | Creates folders for each patient in the specified paths. |
| create_folders_annot_csv_wrapper | Creates folders locally or remotely based on the configuration. |
| create_folders_for_pat | Creates folders for a single patient in the specified paths. |
| create_local_folders | Creates local project directories for storing intermediate files. |
| create_remote_folders | Creates remote project directories for storing intermediate files via SFTP. |
| dump_results | Saves data to a file using pickle, either locally or remotely via SFTP. |
| enum_exact_target_date_vector | Creates a one-hot encoded date vector for a specific target date. |
| enum_target_date_vector | Creates a one-hot encoded date vector for a target date. |
| exist_check | Checks if a file or directory exists, either locally or remotely. |
| filter_stripped_list | Filters a list of patients to exclude those already processed. |
| get_empty_date_vector | Creates an empty DataFrame with one-hot encoded date columns. |
| get_free_gpu | Identifies and returns the GPU with the most available free memory. |
| list_dir_wrapper | Lists the contents of a directory, either locally or remotely via SFTP. |
| read_csv_wrapper | Reads CSV data from a file, handling both local and remote paths. |
| read_remote | Reads a remote CSV file via SFTP and returns a pandas DataFrame. |
| sftp_exists | Checks if a file or directory exists on a remote SFTP server. |
| test_datetime_formats | Test the function with various datetime formats |
| update_pbar | Updates a tqdm progress bar with formatted information about the current processing state. |
| write_csv_wrapper | Writes CSV data to a file either locally or remotely. |
| write_remote | Writes a pandas DataFrame to a remote file via SFTP. |
- pat2vec.util.methods_get.list_dir_wrapper(path, config_obj=None)
Lists the contents of a directory, either locally or remotely via SFTP.
This function acts as a wrapper around os.listdir and sftp.listdir to provide a consistent interface for listing directory contents based on the remote_dump setting in the configuration object.
- Parameters:
path (str) – The path to the directory to list.
config_obj (Optional[Any]) – The configuration object containing SFTP credentials and settings if remote_dump is True.
- Return type:
List[str]
- Returns:
A list of filenames in the specified directory.
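The local/remote dispatch described for list_dir_wrapper can be sketched roughly as below. This is an illustration under stated assumptions, not pat2vec's implementation: `list_dir_sketch` and its `remote_lister` argument are hypothetical names, and the config is a stand-in object that exposes only a `remote_dump` attribute.

```python
import os
import tempfile
from types import SimpleNamespace
from typing import Any, Callable, List, Optional


def list_dir_sketch(
    path: str,
    config_obj: Optional[Any] = None,
    remote_lister: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """List a directory locally, or through a remote lister when remote_dump is set."""
    # Assumption: the config exposes a boolean `remote_dump` attribute.
    if config_obj is not None and getattr(config_obj, "remote_dump", False):
        if remote_lister is None:
            raise ValueError("remote_dump is True but no remote lister was given")
        # Hypothetical hook: in practice this would be an SFTP client's listdir.
        return remote_lister(path)
    return os.listdir(path)


# Local usage against a temporary directory holding one file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.csv"), "w").close()
    local_listing = list_dir_sketch(tmp, SimpleNamespace(remote_dump=False))

# Remote usage with a fake lister standing in for an SFTP connection.
remote_listing = list_dir_sketch(
    "/remote/dir", SimpleNamespace(remote_dump=True), lambda p: ["b.csv"]
)
```

Passing the remote lister in as a callable keeps the sketch runnable without a server; the documented function instead takes the SFTP details from config_obj.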
- pat2vec.util.methods_get.convert_timestamp_to_tuple(timestamp)
Converts a timestamp string to a (year, month) tuple.
- Parameters:
timestamp (str) – The timestamp string to convert, expected in the format %Y-%m-%dT%H:%M:%S.%f%z.
- Return type:
Tuple[int, int]
- Returns:
A tuple containing the year and month as integers.
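As a quick illustration of the documented format string, the conversion can be sketched with the standard library alone. The helper name `timestamp_to_year_month` is hypothetical; only the format %Y-%m-%dT%H:%M:%S.%f%z comes from the entry above.

```python
from datetime import datetime
from typing import Tuple


def timestamp_to_year_month(timestamp: str) -> Tuple[int, int]:
    """Parse a timestamp in the documented format and keep only year and month."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    return parsed.year, parsed.month


year_month = timestamp_to_year_month("2023-05-17T08:30:00.000000+0000")
```

A string without fractional seconds or without a UTC offset does not match this format and raises ValueError in the sketch.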
- pat2vec.util.methods_get.enum_target_date_vector(target_date_range, current_pat_client_id_code, config_obj)
Creates a one-hot encoded date vector for a target date.
- Parameters:
target_date_range (Tuple[int, int, int]) – A tuple of (year, month, day) for the target date.
current_pat_client_id_code (str) – The patient’s ID.
config_obj (Any) – The configuration object.
- Return type:
DataFrame
- Returns:
A single-row DataFrame with a one-hot encoded column for the target date.
- pat2vec.util.methods_get.enum_exact_target_date_vector(target_date_range, current_pat_client_id_code, config_obj)
Creates a one-hot encoded date vector for a specific target date.
- Parameters:
target_date_range (Tuple[int, int, int]) – A tuple of (year, month, day) for the target date.
current_pat_client_id_code (str) – The patient’s ID.
config_obj (Any) – The configuration object (currently unused).
- Return type:
DataFrame
- Returns:
A single-row DataFrame with a one-hot encoded column for the target date.
- pat2vec.util.methods_get.dump_results(file_data, path, config_obj=None)
Saves data to a file using pickle, either locally or remotely via SFTP.
- Parameters:
file_data (Any) – The Python object to be pickled.
path (str) – The destination file path.
config_obj (Optional[Any]) – The configuration object containing SFTP credentials and settings if remote_dump is True.
- Return type:
None
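The local half of dump_results amounts to a pickle write. The sketch below shows only that branch and reads the file back to confirm the round trip; `dump_local` is a hypothetical name, and the remote branch (writing the same bytes through an SFTP file handle) is left out.

```python
import os
import pickle
import tempfile
from typing import Any


def dump_local(file_data: Any, path: str) -> None:
    """Pickle an object to a local path (local branch only, by assumption)."""
    with open(path, "wb") as handle:
        pickle.dump(file_data, handle)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "result.pickle")
    dump_local({"patient": "P001", "n_docs": 3}, target)
    # Read the file back to check that the object survives the round trip.
    with open(target, "rb") as handle:
        restored = pickle.load(handle)
```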
- pat2vec.util.methods_get.update_pbar(current_pat_client_id_code, start_time, stage_int, stage_str, t, config_obj, skipped_counter=None, **n_docs_to_annotate)
Updates a tqdm progress bar with formatted information about the current processing state.
This function dynamically sets the description and color of a tqdm progress bar to reflect the current patient, processing stage, and execution time. The color changes to indicate slow performance if the elapsed time exceeds predefined thresholds.
- Parameters:
current_pat_client_id_code (str) – The identifier of the patient currently being processed.
start_time (datetime) – The start time of the current operation. Note: This parameter is currently overwritten by config_obj.start_time.
stage_int (int) – An integer representing the processing stage. Note: This parameter is currently unused.
stage_str (str) – A string describing the current processing stage (e.g., “demo”, “annotating”).
t (tqdm) – The tqdm progress bar instance to update.
config_obj (Any) – A configuration object containing settings like start_time, multi_process, and various slow_execution_threshold values.
skipped_counter (Union[int, Any, None]) – A counter for the number of skipped items. Can be a standard integer or a multiprocessing-safe value. Defaults to None.
**n_docs_to_annotate (Any) – Arbitrary keyword arguments that are displayed at the end of the progress bar description. Useful for showing counts like the number of documents to annotate.
- Return type:
None
- pat2vec.util.methods_get.get_free_gpu()
Identifies and returns the GPU with the most available free memory.
This function executes the nvidia-smi command-line utility to query the current memory usage of all available NVIDIA GPUs. It parses the output to determine which GPU has the maximum amount of free memory and returns its index along with the amount of free memory.
This is particularly useful for automatically selecting a GPU for a compute-intensive task in a multi-GPU system.
- Return type:
Tuple[int, str]
- Returns:
A tuple where the first element is the integer index of the GPU with the most free memory, and the second element is a string representing the amount of free memory in MiB (e.g., “1024”).
- Raises:
FileNotFoundError – If the nvidia-smi command is not found in the system’s PATH.
subprocess.CalledProcessError – If the nvidia-smi command fails or returns a non-zero exit code.
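The parse-and-select step described for get_free_gpu can be sketched as follows. The exact nvidia-smi query pat2vec issues is not stated here, so the `--query-gpu=memory.free` invocation is an assumption, and both function names are hypothetical. Parsing is kept separate from the subprocess call so the example runs on a machine without a GPU.

```python
import subprocess
from typing import Tuple


def pick_freest_gpu(smi_output: str) -> Tuple[int, str]:
    """Return (GPU index, free MiB as a string) for the GPU with the most free memory."""
    # One line per GPU, each holding the free memory in MiB.
    free_values = [line.strip() for line in smi_output.splitlines() if line.strip()]
    best_index = max(range(len(free_values)), key=lambda i: int(free_values[i]))
    return best_index, free_values[best_index]


def query_free_memory() -> str:
    """Run nvidia-smi (assumed query); raises FileNotFoundError or CalledProcessError."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )


# Demonstrated on sample text rather than a live nvidia-smi call.
sample_output = "512\n10240\n2048\n"
best_gpu = pick_freest_gpu(sample_output)
```

On a GPU machine, `pick_freest_gpu(query_free_memory())` would combine the two steps.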
- pat2vec.util.methods_get.convert_date(date_string)
Converts a date string in ‘YYYY-MM-DD’ format to a datetime object.
- Parameters:
date_string (str) – The string to convert, which may include a time part (e.g., ‘YYYY-MM-DDTHH:MM:SS’).
- Return type:
datetime
- Returns:
A datetime object representing the date part of the string.
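One way to obtain the behaviour described for convert_date is to drop everything after the "T" separator before parsing. This is a sketch of that idea, not necessarily how the library does it; `parse_date_part` is a hypothetical name.

```python
from datetime import datetime


def parse_date_part(date_string: str) -> datetime:
    """Parse the 'YYYY-MM-DD' part of a string, ignoring any trailing time part."""
    # Assumption: the time part, if present, follows a literal "T".
    date_part = date_string.split("T")[0]
    return datetime.strptime(date_part, "%Y-%m-%d")


plain = parse_date_part("2021-03-04")
with_time = parse_date_part("2021-03-04T15:45:10")
```

Both calls yield the same datetime at midnight, since only the date part is kept.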
- pat2vec.util.methods_get.write_csv_wrapper(path, csv_file_data=None, config_obj=None)
Writes CSV data to a file either locally or remotely.
- Parameters:
path (str) – The path to the destination CSV file.
csv_file_data (Optional[DataFrame]) – The DataFrame to write.
config_obj (Optional[Any]) – An object containing configuration settings, including ‘remote_dump’.
- Return type:
None
- pat2vec.util.methods_get.read_remote(path, config_obj=None)
Reads a remote CSV file via SFTP and returns a pandas DataFrame.
- Parameters:
path (str) – The remote path of the CSV file to read.
config_obj (Optional[Any]) – An object containing configuration details.
- Return type:
DataFrame
- Returns:
The DataFrame containing the data read from the remote CSV file.
- pat2vec.util.methods_get.read_csv_wrapper(path, config_obj=None)
Reads CSV data from a file, handling both local and remote paths.
This function is a wrapper that calls either pd.read_csv for local files or read_remote for SFTP paths, based on the remote_dump flag in the configuration.
- Parameters:
path (str) – The path to the CSV file (local or remote).
config_obj (Optional[Any]) – An object containing configuration settings, including ‘remote_dump’.
- Return type:
DataFrame
- Returns:
The DataFrame containing the data read from the CSV file.
- pat2vec.util.methods_get.create_local_folders(config_obj=None)
Creates local project directories for storing intermediate files.
- Parameters:
config_obj (Optional[Any]) – The configuration object containing root_path and proj_name.
- Return type:
None
- pat2vec.util.methods_get.create_remote_folders(config_obj=None)
Creates remote project directories for storing intermediate files via SFTP.
- Parameters:
config_obj (Optional[Any]) – An object containing configuration details like root_path, proj_name, and SFTP credentials.
- Raises:
ValueError – If config_obj is not provided.
- Return type:
None
- pat2vec.util.methods_get.create_folders_annot_csv_wrapper(config_obj=None)
Creates folders locally or remotely based on the configuration.
This function is a wrapper that calls either create_local_folders or create_remote_folders based on the remote_dump flag in the config.
- Parameters:
config_obj (Optional[Any]) – The configuration object.
- Return type:
None
- pat2vec.util.methods_get.get_empty_date_vector(config_obj)
Creates an empty DataFrame with one-hot encoded date columns.
The columns are generated based on the time window settings in the configuration object.
- Parameters:
config_obj (Any) – The configuration object with time window settings.
- Return type:
DataFrame
- Returns:
A single-row DataFrame with columns for each date in the time window, initialized to 0.0.
- pat2vec.util.methods_get.sftp_exists(path, config_obj)
Checks if a file or directory exists on a remote SFTP server.
- Parameters:
path (str) – The remote path to check.
config_obj (Any) – The configuration object containing SFTP credentials and settings.
- Return type:
bool
- Returns:
True if the path exists, False otherwise.
- pat2vec.util.methods_get.exist_check(path, config_obj=None)
Checks if a file or directory exists, either locally or remotely.
This is a wrapper around os.path.exists and sftp_exists that checks the remote_dump flag in the configuration object.
- Parameters:
path (str) – The path to check.
config_obj (Optional[Any]) – The configuration object.
- Return type:
bool
- Returns:
True if the path exists, False otherwise.
- pat2vec.util.methods_get.filter_stripped_list(stripped_list, config_obj=None)
Filters a list of patients to exclude those already processed.
Checks if a patient’s output directory contains at least n_pat_lines files, indicating that processing for that patient is complete.
- Parameters:
stripped_list (List[str]) – The initial list of patient IDs to process.
config_obj (Optional[Any]) – The configuration object containing paths and settings.
- Returns:
A tuple containing two lists: the filtered list of patients to be processed, and the original filtered list (for reference).
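The completeness check described for filter_stripped_list (a patient counts as done once its output directory holds at least n_pat_lines files) can be sketched like this. The directory layout `output_root/<patient_id>` and the name `filter_unprocessed` are assumptions for illustration; the sketch also returns a single list and checks only local paths, whereas the documented function returns a tuple of two lists.

```python
import os
import tempfile
from typing import List


def filter_unprocessed(patient_ids: List[str], output_root: str, n_pat_lines: int) -> List[str]:
    """Keep only patients whose output directory has fewer than n_pat_lines files."""
    remaining = []
    for patient_id in patient_ids:
        patient_dir = os.path.join(output_root, patient_id)
        # A missing directory is treated as zero files, so the patient stays in the list.
        n_files = len(os.listdir(patient_dir)) if os.path.isdir(patient_dir) else 0
        if n_files < n_pat_lines:
            remaining.append(patient_id)
    return remaining


with tempfile.TemporaryDirectory() as root:
    # P001 already has two output files; P002 has no output directory yet.
    os.makedirs(os.path.join(root, "P001"))
    for name in ("0.csv", "1.csv"):
        open(os.path.join(root, "P001", name), "w").close()
    todo = filter_unprocessed(["P001", "P002"], root, n_pat_lines=2)
```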
- pat2vec.util.methods_get.create_folders(all_patient_list, config_obj=None)
Creates folders for each patient in the specified paths.
- Parameters:
all_patient_list (List[str]) – List of patient IDs.
config_obj (Optional[Any]) – Configuration object containing paths and verbosity level.
- Return type:
None
- pat2vec.util.methods_get.create_folders_for_pat(patient_id, config_obj=None)
Creates folders for a single patient in the specified paths.
- Parameters:
patient_id (str) – The patient’s ID.
config_obj (Optional[Any]) – Configuration object containing paths and verbosity level.
- Return type:
None
- pat2vec.util.methods_get.add_offset_column(dataframe, start_column_name, offset_column_name, time_offset, verbose=1)
Adds a new column with a time offset from a starting datetime column.
Handles multiple datetime formats flexibly.
- Parameters:
dataframe (DataFrame) – The input DataFrame.
start_column_name (str) – The name of the column with the starting datetime.
offset_column_name (str) – The name for the new column to be created.
time_offset (Union[timedelta, Any]) – The time period offset to add to the start time.
verbose (int) – Verbosity level (0=silent, 1=basic, 2=detailed).
- Return type:
DataFrame
- Returns:
The modified DataFrame with the new offset column.
- pat2vec.util.methods_get.test_datetime_formats()
Test the function with various datetime formats
- pat2vec.util.methods_get.build_patient_dict(dataframe, patient_id_column, start_column, end_column)
Builds a dictionary mapping patient IDs to (start, end) datetime tuples.
- Parameters:
dataframe (DataFrame) – The input DataFrame.
patient_id_column (str) – The name of the column containing patient IDs.
start_column (str) – The name of the column containing start datetimes.
end_column (str) – The name of the column containing end datetimes.
- Return type:
Dict[str, Tuple[datetime, datetime]]
- Returns:
A dictionary where keys are patient IDs and values are (start, end) tuples.
- pat2vec.util.methods_get.write_remote(path, csv_file, config_obj=None)
Writes a pandas DataFrame to a remote file via SFTP.
- Parameters:
path – The remote path where the file should be written.
csv_file – The DataFrame to be written.
config_obj – An object containing SFTP configuration details.
- Raises:
ValueError – If config_obj is not provided.