ml_grid.pipeline.read_in

Classes

`read`	Initializes the read class and loads the data.
`read_sample`	Initializes the read_sample class and loads a data sample.

Module Contents

class ml_grid.pipeline.read_in.read(input_filename: str, use_polars: bool = False)

Initializes the read class and loads the data.

Parameters:

input_filename (str) – The path to the input CSV file.
use_polars (bool, optional) – If True, attempts to read the CSV using the Polars library and converts it to a pandas DataFrame. Falls back to pandas if Polars fails. Defaults to False.
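The Polars-then-pandas fallback described for use_polars can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name load_csv is hypothetical, and catching any exception from the Polars path is a guess at what "if Polars fails" covers.

```python
import pandas as pd


def load_csv(input_filename: str, use_polars: bool = False) -> pd.DataFrame:
    """Load a CSV into a pandas DataFrame, optionally trying Polars first.

    Hypothetical helper: it mirrors the documented parameters of `read`,
    but it is not the class's actual code.
    """
    if use_polars:
        try:
            import polars as pl  # optional dependency, imported lazily

            # Polars parses the file; the result is converted to pandas.
            return pl.read_csv(input_filename).to_pandas()
        except Exception as exc:  # assumption: any failure triggers the fallback
            print(f"Polars read failed ({exc!r}); falling back to pandas.")
    # Default path, also used as the fallback.
    return pd.read_csv(input_filename)
```

Either way the caller receives a pandas DataFrame, so downstream code does not need to know which reader was used.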

class ml_grid.pipeline.read_in.read_sample(input_filename: str, test_sample_n: int, column_sample_n: int)

Initializes the read_sample class and loads a data sample.

This class reads a random sample of rows and/or columns from a CSV file. It ensures that certain necessary_columns are always included if they exist in the source file.

Note

The cap on additional columns (max_additional_columns) appears to be derived from test_sample_n, the number of rows to sample, rather than from column_sample_n, the number of columns to sample. This may be unintended; the existing behavior has been preserved as is.

Parameters:

input_filename (str) – The path to the input CSV file.
test_sample_n (int) – The number of rows to randomly sample. If 0, all rows are read.
column_sample_n (int) – The number of columns to randomly sample, in addition to the necessary_columns.

Raises:

ValueError – If the ‘outcome_var_1’ column does not contain at least two unique classes after sampling.
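The documented sampling behavior can be sketched with the standard library alone. This is an illustration under stated assumptions, not the class's code: sample_csv, NECESSARY_COLUMNS, and the seed argument are hypothetical, the real contents of necessary_columns are not listed on this page, and the sketch caps extra columns by column_sample_n as the parameter description states, which per the note above may differ from what the class actually does.

```python
import csv
import random

# Hypothetical stand-in; the real necessary_columns list is not shown here.
NECESSARY_COLUMNS = ("outcome_var_1",)


def sample_csv(path, test_sample_n, column_sample_n,
               necessary_columns=NECESSARY_COLUMNS, seed=None):
    """Return (columns, rows) for a random row/column sample of a CSV file."""
    rng = random.Random(seed)
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        header = list(reader.fieldnames or [])
        rows = list(reader)

    # Necessary columns are always kept, but only if the file has them.
    keep = [c for c in necessary_columns if c in header]
    others = [c for c in header if c not in keep]
    # Add up to column_sample_n randomly chosen extra columns.
    keep += rng.sample(others, min(column_sample_n, len(others)))

    # test_sample_n == 0 means "read all rows".
    if 0 < test_sample_n < len(rows):
        rows = rng.sample(rows, test_sample_n)

    sampled = [{c: row[c] for c in keep} for row in rows]

    # Mirror the documented ValueError. Skipping the check when the column
    # is absent is an assumption of this sketch.
    if "outcome_var_1" in keep:
        classes = {row["outcome_var_1"] for row in sampled}
        if len(classes) < 2:
            raise ValueError(
                "outcome_var_1 needs at least two unique classes after sampling"
            )
    return keep, sampled
```

A small row sample can easily contain a single outcome class, so callers may want to catch the ValueError and retry with a larger test_sample_n.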

filename