pat2vec.util.post_processing_process_csv_files

Functions

`process_csv_files`(input_path[, out_folder, ...])	Concatenates multiple CSV files from a directory into a single file.
`process_csv_files_multi`(input_path[, ...])	Concatenates multiple CSV files using multiprocessing.

pat2vec.util.post_processing_process_csv_files.process_csv_files(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False)

Concatenates multiple CSV files from a directory into a single file.

This function scans a directory for CSV files, determines the union of all column headers, and then reads each file and appends its rows to a single, large CSV file. Because the output header is the union of the input headers, files with differing columns can be combined. Files are processed in chunks of part_size files at a time.
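The two-pass approach described above (collect the header union, then append rows) can be sketched with the standard library alone. This is a minimal illustration, not the pat2vec implementation: the function name `concat_csvs` is hypothetical, and filling absent columns with an empty string is an assumption.

```python
import csv
from pathlib import Path


def concat_csvs(input_dir, output_file):
    """Hypothetical sketch: two-pass concatenation of CSVs with differing columns."""
    files = sorted(Path(input_dir).glob("*.csv"))

    # Pass 1: build the union of column names, keeping first-seen order.
    columns = []
    for path in files:
        with open(path, newline="") as fh:
            header = next(csv.reader(fh), [])
        for name in header:
            if name not in columns:
                columns.append(name)

    # Pass 2: stream every row into one file.
    # Assumption: columns absent from a file are written as "".
    with open(output_file, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for path in files:
            with open(path, newline="") as fh:
                writer.writerows(csv.DictReader(fh))
    return str(output_file)
```

Reading only the header row in the first pass keeps memory use low, since no file is ever loaded in full.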

Parameters:

input_path (str) – The path to the directory containing the CSV files.
out_folder (str) – The folder in which the output CSV file is written.
output_filename_suffix (str) – The suffix for the output CSV file name.
part_size (int) – The number of files to process in each chunk.
sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.
append_timestamp_column (bool) – If True, processes the final concatenated file to extract a datetime column from binary date columns.
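How sample_size and part_size interact can be illustrated with two small helpers. Both `select_files` and `chunked` are hypothetical names, and the seeded random sampling is an assumption made for reproducibility; the library's actual sampling strategy is not documented here.

```python
import random


def select_files(files, sample_size=None, seed=0):
    """Hypothetical: return all files, or a random subset of at most sample_size."""
    if sample_size is None or sample_size == "all":
        return list(files)
    k = min(int(sample_size), len(files))
    # Assumption: a fixed seed, so repeated runs pick the same subset.
    return random.Random(seed).sample(list(files), k)


def chunked(files, part_size=336):
    """Hypothetical: yield consecutive batches of at most part_size files."""
    for start in range(0, len(files), part_size):
        yield files[start:start + part_size]
```

With the default part_size of 336, a directory of 700 files would be handled in three batches of 336, 336, and 28 files.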

Return type:

str

Returns:

The path to the saved concatenated CSV file.

pat2vec.util.post_processing_process_csv_files.process_csv_files_multi(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False, n_proc=None)

Concatenates multiple CSV files using multiprocessing.

This function is a multiprocessing version of process_csv_files. It distributes the file processing across multiple CPU cores to speed up the concatenation of a large number of CSV files.
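One way such work might be spread across cores is shown below for the header-collection step only. This is a sketch under stated assumptions, not the library's code: `read_header`, `merge_headers`, and `header_union_parallel` are hypothetical names, and the real function may divide the work differently.

```python
import csv
from multiprocessing import Pool


def read_header(path):
    """Worker: return the header row of one CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh), [])


def merge_headers(headers):
    """Combine per-file headers into one ordered union of column names."""
    columns = []
    for header in headers:
        for name in header:
            if name not in columns:
                columns.append(name)
    return columns


def header_union_parallel(paths, n_proc=2):
    """Read headers in n_proc worker processes, then merge in the parent."""
    with Pool(processes=n_proc) as pool:
        headers = pool.map(read_header, paths)  # results keep input order
    return merge_headers(headers)
```

On platforms that start workers with the spawn method (Windows, macOS), a call to `header_union_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import.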

Parameters:

input_path (str) – The path to the directory containing the CSV files.
out_folder (str) – The folder name for the output CSV file.
output_filename_suffix (str) – The suffix for the output CSV file name.
part_size (int) – The number of files to process in each chunk per process.
sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.
append_timestamp_column (bool) – If True, processes the final file to extract a datetime column.
n_proc (Union[int, str, None]) – The number of processes to use. Can be an integer, 'all', or 'half'.
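The accepted n_proc values could be mapped to a concrete worker count roughly as follows. `resolve_n_proc` is a hypothetical helper. Treating None like 'all' and capping integers at the number of available cores are assumptions, since the default behavior is not stated above.

```python
import os


def resolve_n_proc(n_proc=None):
    """Hypothetical: map an n_proc setting to a worker count of at least 1."""
    cores = os.cpu_count() or 1
    if n_proc is None or n_proc == "all":
        # Assumption: None falls back to every available core.
        return cores
    if n_proc == "half":
        return max(1, cores // 2)
    # Assumption: integer requests are clamped to the range [1, cores].
    return max(1, min(int(n_proc), cores))
```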

Return type:

str