pat2vec.util.post_processing_process_csv_files

Functions

process_csv_files(input_path[, out_folder, ...])

Concatenates multiple CSV files from a directory into a single file.

process_csv_files_multi(input_path[, ...])

Concatenates multiple CSV files using multiprocessing.

pat2vec.util.post_processing_process_csv_files.process_csv_files(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False)

Concatenates multiple CSV files from a directory into a single file.

This function scans a directory for CSV files, computes the union of all column headers, and then reads each file and appends its rows to a single output CSV file. Files with differing columns are supported, and the input files can be processed in chunks.
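The union-of-headers approach described above can be sketched with the standard library alone. This is a minimal illustration, not the pat2vec implementation: the helper name concat_csvs_with_union_header and its two-pass structure are assumptions made for the example.

```python
import csv
from pathlib import Path


def concat_csvs_with_union_header(input_dir, output_path):
    """Hypothetical helper: merge every *.csv in input_dir into output_path."""
    files = sorted(Path(input_dir).glob("*.csv"))

    # Pass 1: read only the header row of each file and build the union,
    # keeping columns in first-seen order.
    columns = []
    for path in files:
        with open(path, newline="") as fh:
            header = next(csv.reader(fh), [])
        for name in header:
            if name not in columns:
                columns.append(name)

    # Pass 2: stream rows into the output file. Columns that a given file
    # lacks are written as empty strings (restval).
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for path in files:
            with open(path, newline="") as fh:
                for row in csv.DictReader(fh):
                    writer.writerow(row)
    return str(output_path)
```

Reading the headers first means the output header is written once and never needs rewriting, at the cost of opening each file twice.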

Parameters:
  • input_path (str) – The path to the directory containing the CSV files.

  • out_folder (str) – The folder name for the output CSV file.

  • output_filename_suffix (str) – The suffix for the output CSV file name.

  • part_size (int) – The number of files to process in each chunk.

  • sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.

  • append_timestamp_column (bool) – If True, processes the final concatenated file to extract a datetime column from binary date columns.

Return type:

str

Returns:

The path to the saved concatenated CSV file.
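To make the interaction of sample_size and part_size concrete, the sketch below selects files and then splits them into parts. It is illustrative only: the helper names, the seed argument, and the use of random sampling without replacement are assumptions, since the documentation above does not state how the sample is drawn.

```python
import random


def select_files(files, sample_size=None, seed=None):
    """Hypothetical helper: return all files, or a random sample of them."""
    files = list(files)
    # 'all' and None both mean "use every file", as documented above.
    if sample_size is None or sample_size == "all":
        return files
    # Assumption: sample without replacement, capped at the number of files.
    rng = random.Random(seed)
    return rng.sample(files, min(int(sample_size), len(files)))


def split_into_parts(files, part_size=336):
    """Hypothetical helper: yield consecutive slices of at most part_size files."""
    for start in range(0, len(files), part_size):
        yield files[start:start + part_size]
```

With the default part_size of 336, a directory of 1000 files would be handled as three parts of 336, 336, and 328 files under this scheme.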

pat2vec.util.post_processing_process_csv_files.process_csv_files_multi(input_path, out_folder='outputs', output_filename_suffix='concatenated_output', part_size=336, sample_size=None, append_timestamp_column=False, n_proc=None)

Concatenates multiple CSV files using multiprocessing.

This function is a multiprocessing version of process_csv_files. It distributes the file processing across multiple CPU cores to speed up the concatenation of a large number of CSV files.
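One way to distribute this kind of work is to have worker processes read chunks of files while the parent process writes the output. The sketch below shows that pattern with multiprocessing.Pool. It is not the pat2vec implementation: read_chunk, concat_parallel, and the choice to collect rows in memory are assumptions made to keep the example short.

```python
import csv
import multiprocessing as mp


def read_chunk(paths):
    """Worker: load every row from a chunk of CSV files as dicts."""
    rows = []
    for path in paths:
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def concat_parallel(files, output_path, part_size=336, n_proc=2):
    """Hypothetical helper: read chunks in workers, write in the parent."""
    chunks = [files[i:i + part_size] for i in range(0, len(files), part_size)]
    if n_proc == 1:
        # Serial fallback: no worker processes are started.
        results = [read_chunk(chunk) for chunk in chunks]
    else:
        with mp.Pool(processes=n_proc) as pool:
            # pool.map returns results in the same order as the chunks.
            results = pool.map(read_chunk, chunks)

    # Build the union of column names in first-seen order.
    columns = []
    for rows in results:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Only the parent writes, so the output file needs no locking.
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for rows in results:
            writer.writerows(rows)
    return str(output_path)
```

On platforms where multiprocessing starts workers with spawn rather than fork, a call such as concat_parallel(files, "merged.csv", n_proc=4) should sit under an `if __name__ == "__main__":` guard. Holding every row in memory is acceptable for a sketch but would not suit very large inputs.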

Parameters:
  • input_path (str) – The path to the directory containing the CSV files.

  • out_folder (str) – The folder name for the output CSV file.

  • output_filename_suffix (str) – The suffix for the output CSV file name.

  • part_size (int) – The number of files to process in each chunk per process.

  • sample_size (Union[int, str, None]) – The number of files to sample. If 'all' or None, all files are used.

  • append_timestamp_column (bool) – If True, processes the final file to extract a datetime column.

  • n_proc (Union[int, str, None]) – The number of processes to use. Can be an integer, 'all', or 'half'.

Return type:

str
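The accepted forms of n_proc (an integer, 'all', or 'half') could be mapped to a worker count along the following lines. This is a sketch under stated assumptions: resolve_n_proc is a hypothetical name, and treating None as "all cores" and capping integers at the available core count are guesses, not documented behaviour.

```python
import os


def resolve_n_proc(n_proc=None):
    """Hypothetical helper: map an int, 'all', 'half', or None to a worker count."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    # Assumption: None behaves like 'all'.
    if n_proc is None or n_proc == "all":
        return available
    if n_proc == "half":
        return max(1, available // 2)
    if isinstance(n_proc, int) and n_proc > 0:
        # Assumption: never start more workers than there are cores.
        return min(n_proc, available)
    raise ValueError(f"Unsupported n_proc value: {n_proc!r}")
```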