ml_grid.pipeline.column_names
Functions
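filter_substring_list(string, substr) | Filters a list of strings based on a list of substrings.
get_pertubation_columns(all_df_columns, local_param_dict, drop_term_list) | Identifies and categorizes columns for perturbation and dropping.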
Module Contents
- ml_grid.pipeline.column_names.filter_substring_list(string: List[str], substr: List[str]) -> List[str]
Filters a list of strings based on a list of substrings.
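The description above does not say whether matching strings are kept or removed, or whether matching is case-sensitive. As a minimal sketch only, the stand-in below assumes that a string is kept when it contains at least one of the substrings and that matching is case-sensitive. The name filter_by_substrings_sketch and the sample column names are hypothetical and are not part of ml_grid.

```python
from typing import List


def filter_by_substrings_sketch(strings: List[str], substrings: List[str]) -> List[str]:
    """Hypothetical stand-in, not the ml_grid implementation.

    Assumption: keep every string that contains at least one of the
    given substrings, using case-sensitive matching.
    """
    return [s for s in strings if any(sub in s for sub in substrings)]


# Made-up column names, for illustration only.
columns = ["hb_mean", "hb_median", "age", "bmi_value"]
blood_like = filter_by_substrings_sketch(columns, ["_mean", "_median"])
print(blood_like)  # ['hb_mean', 'hb_median']
```

If the real function filters matches out rather than keeping them, the same comprehension with `not any(...)` would describe that behaviour instead.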
- ml_grid.pipeline.column_names.get_pertubation_columns(all_df_columns: List[str], local_param_dict: Dict[str, Any], drop_term_list: List[str]) -> Tuple[List[str], List[str]]
Identifies and categorizes columns for perturbation and dropping.
This function processes a list of all DataFrame columns, categorizing them into groups like blood tests, diagnostic orders, etc. It also identifies columns to be dropped based on specific keywords. The selection of columns for ‘perturbation’ is determined by flags within local_param_dict.
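The general shape of this behaviour can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the category-to-substring mapping CATEGORY_PATTERNS_SKETCH, the function name get_perturbation_columns_sketch, and the sample column names are all hypothetical. The only parts taken from the description are the case-insensitive drop-term match and the use of flags under the 'data' key of local_param_dict.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical category -> substring rules. The real module's rules are
# not shown in this page, so these are placeholders for illustration.
CATEGORY_PATTERNS_SKETCH: Dict[str, List[str]] = {
    "age": ["age"],
    "sex": ["male"],
    "bmi": ["bmi"],
    "bloods": ["_mean", "_median"],
}


def get_perturbation_columns_sketch(
    all_df_columns: List[str],
    local_param_dict: Dict[str, Any],
    drop_term_list: List[str],
) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in, not the ml_grid implementation."""
    # Drop list: any column whose name contains a drop term, ignoring case.
    lowered_terms = [term.lower() for term in drop_term_list]
    drop_list = [
        col for col in all_df_columns
        if any(term in col.lower() for term in lowered_terms)
    ]

    # Perturbation columns: only categories whose flag is truthy contribute.
    data_flags = local_param_dict.get("data", {})
    perturbation_columns: List[str] = []
    for category, patterns in CATEGORY_PATTERNS_SKETCH.items():
        if not data_flags.get(category):
            continue
        for col in all_df_columns:
            if col not in perturbation_columns and any(p in col for p in patterns):
                perturbation_columns.append(col)

    return perturbation_columns, drop_list


# Made-up inputs, for illustration only.
columns = ["age", "male", "bmi_value", "hb_mean", "client_idcode", "outcome_var_1"]
params = {
    "outcome_var_n": 1,
    "data": {"age": True, "sex": False, "bmi": True, "bloods": True},
}
selected, dropped = get_perturbation_columns_sketch(columns, params, ["IDCODE"])
print(selected)  # ['age', 'bmi_value', 'hb_mean']
print(dropped)   # ['client_idcode']
```

In this sketch the 'sex' flag is False, so the "male" column is left out of the perturbation list, and the upper-case drop term "IDCODE" still matches "client_idcode" because both sides are lower-cased before comparison.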
- Parameters:
all_df_columns (List[str]) – A list of all column names in the DataFrame.
local_param_dict (Dict[str, Any]) – A dictionary containing local parameters, including ‘outcome_var_n’ and a ‘data’ sub-dictionary that specifies which column categories to include for perturbation (e.g., ‘age’, ‘sex’, ‘bmi’, ‘bloods’).
drop_term_list (List[str]) – A list of strings. Any column whose name contains one of these strings (case-insensitive) is added to drop_list.
- Returns:
- A tuple containing two lists:
pertubation_columns: A list of column names selected for perturbation based on the local_param_dict settings.
drop_list: A list of column names identified to be dropped from the DataFrame.
- Return type: