pat2vec.util.post_processing_process_csv_files
Functions
process_csv_files | Concatenates multiple CSV files from a directory into a single file.
process_csv_files_multi | Concatenates multiple CSV files using multiprocessing.
- pat2vec.util.post_processing_process_csv_files.process_csv_files(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False)
Concatenates multiple CSV files from a directory into a single file.
This function scans a directory for CSV files, determines a union of all column headers, and then reads each file to append its content into a single, large CSV file. It handles cases where CSVs have different columns and can process files in chunks.
- Parameters:
input_path (str) – The path to the directory containing the CSV files.
out_folder (str) – The folder name for the output CSV file.
output_filename_suffix (str) – The suffix for the output CSV file name.
part_size (int) – The number of files to process in each chunk.
sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.
append_timestamp_column (bool) – If True, processes the final concatenated file to extract a datetime column from binary date columns.
- Return type:
str
- Returns:
The path to the saved concatenated CSV file.
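The header-union approach described above can be illustrated with a minimal standard-library sketch. This is not pat2vec's implementation: `concat_with_union_header` is a hypothetical helper, and it assumes well-formed CSV files that each start with a header row. It makes two passes, first collecting the union of column names, then streaming rows into the output and leaving missing columns empty.

```python
import csv
from pathlib import Path


def concat_with_union_header(input_dir, out_path):
    """Hypothetical helper (not part of pat2vec): concatenate CSVs with differing columns."""
    # List the inputs up front so the output file is never picked up as an input.
    files = sorted(Path(input_dir).glob("*.csv"))

    # Pass 1: build the union of column names, keeping first-seen order.
    columns = []
    seen = set()
    for path in files:
        with path.open(newline="") as fh:
            header = next(csv.reader(fh), [])
        for name in header:
            if name not in seen:
                seen.add(name)
                columns.append(name)

    # Pass 2: stream every row into one file; columns absent from a file stay empty.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for path in files:
            with path.open(newline="") as fh:
                for row in csv.DictReader(fh):
                    writer.writerow(row)
    return str(out_path)
```

With pat2vec installed, the documented function would be called with the signature shown above, for example `process_csv_files("path/to/csvs", out_folder="outputs")`, where the input path is a placeholder.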
- pat2vec.util.post_processing_process_csv_files.process_csv_files_multi(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False, n_proc=None)
Concatenates multiple CSV files using multiprocessing.
This function is a multiprocessing version of process_csv_files. It distributes the file processing across multiple CPU cores to speed up the concatenation of a large number of CSV files.
- Parameters:
input_path (str) – The path to the directory containing the CSV files.
out_folder (str) – The folder name for the output CSV file.
output_filename_suffix (str) – The suffix for the output CSV file name.
part_size (int) – The number of files to process in each chunk per process.
sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.
append_timestamp_column (bool) – If True, processes the final file to extract a datetime column.
n_proc (Union[int, str, None]) – The number of processes to use. Can be an integer, 'all', or 'half'.
- Return type:
str
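One way to read the `n_proc` and `part_size` parameters is sketched below. It is an interpretation, not pat2vec's code: `resolve_n_proc` and `chunk_files` are hypothetical names, and treating `None` the same as 'all' is an assumption, since the documentation above does not say what `None` means.

```python
import os


def resolve_n_proc(n_proc=None):
    """Hypothetical helper: map an n_proc setting to a worker count."""
    total = os.cpu_count() or 1  # cpu_count() can return None on some platforms
    # Assumption: None behaves like 'all'; the documentation does not specify this.
    if n_proc is None or n_proc == "all":
        return total
    if n_proc == "half":
        return max(1, total // 2)
    if isinstance(n_proc, int) and n_proc > 0:
        return n_proc
    raise ValueError(f"Unsupported n_proc value: {n_proc!r}")


def chunk_files(files, part_size):
    """Hypothetical helper: split a file list into parts of at most part_size items."""
    return [files[i:i + part_size] for i in range(0, len(files), part_size)]
```

Each chunk could then be handed to a worker process, for example through `multiprocessing.Pool.map`, with the partial results concatenated at the end.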