pat2vec.pat2vec_search.search_helper_functions

Functions

bulk_str_extract(target_colname_regex_pairs, ...)

Applies multiple regex extract operations to a source column.

bulk_str_extract_round_robin(target_dict, ...)

Extracts text between a series of regex patterns in a round-robin fashion.

bulk_str_findall(target_colname_regex_pairs, ...)

Applies multiple regex findall operations to a source column.

date_cleaner(df, cols, date_format)

Formats specified datetime columns in a DataFrame to a given string format.

pylist2searchlist(list_name, output_name)

Converts a Python list into an Elasticsearch OR-separated search string.

stringlist2pylist(string_list, var_name)

Converts a newline-separated string into a Python list and assigns it to a global variable.

stringlist2searchlist(string_list, output_name)

Converts a newline-separated string into an Elasticsearch OR-separated search string.

without_keys(d, keys)

Returns a new dictionary excluding the specified keys.

pat2vec.pat2vec_search.search_helper_functions.stringlist2searchlist(string_list, output_name)[source]

Converts a newline-separated string into an Elasticsearch OR-separated search string.

The resulting string is saved to a text file. For example, a string containing “term1” and “term2” on separate lines becomes ‘"term1" OR "term2"’.

Parameters:
  • string_list (str) – A string where items are separated by newlines.

  • output_name (str) – The base name for the output text file (‘.txt’ will be appended).

Return type:

None
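The string conversion described above can be sketched as follows. This is a minimal illustration rather than the library's implementation: the helper name to_or_query is hypothetical, and ignoring blank lines and stripping surrounding whitespace are assumptions.

```python
def to_or_query(string_list: str) -> str:
    """Join newline-separated terms into a quoted, OR-separated query string."""
    # Assumption: blank lines are skipped and surrounding whitespace is stripped.
    terms = [line.strip() for line in string_list.splitlines() if line.strip()]
    return " OR ".join(f'"{term}"' for term in terms)


query = to_or_query("term1\nterm2")
print(query)  # "term1" OR "term2"
```

Quoting each term makes Elasticsearch treat multi-word entries as phrases rather than as separate tokens.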

pat2vec.pat2vec_search.search_helper_functions.pylist2searchlist(list_name, output_name)[source]

Converts a Python list into an Elasticsearch OR-separated search string.

The resulting string is saved to a text file. For example, a list [‘term1’, ‘term2’] becomes ‘"term1" OR "term2"’.

Parameters:
  • list_name (List[str]) – A list of strings to be joined.

  • output_name (str) – The base name for the output text file (‘.txt’ will be appended).

Return type:

None
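A rough sketch of the save-to-file behaviour, under stated assumptions: write_or_query is a hypothetical stand-in, it returns the written path for convenience (the documented function returns None), and UTF-8 encoding is assumed.

```python
import tempfile
from pathlib import Path
from typing import List


def write_or_query(terms: List[str], output_name: str) -> Path:
    """Write a quoted, OR-separated query to '<output_name>.txt'."""
    query = " OR ".join(f'"{term}"' for term in terms)
    path = Path(f"{output_name}.txt")  # '.txt' is appended to the base name
    path.write_text(query, encoding="utf-8")
    return path


# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = write_or_query(["term1", "term2"], str(Path(tmp_dir) / "my_search"))
    saved = out_path.read_text(encoding="utf-8")

print(saved)  # "term1" OR "term2"
```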

pat2vec.pat2vec_search.search_helper_functions.stringlist2pylist(string_list, var_name)[source]

Converts a newline-separated string into a Python list and assigns it to a global variable.

Note

This function uses globals() to create a variable in the global scope, which is generally not recommended.

Parameters:
  • string_list (str) – A string where items are separated by newlines.

  • var_name (str) – The name of the global variable to which the resulting list will be assigned.

Return type:

None
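To illustrate the note above, the sketch below contrasts the globals() approach with a variant that simply returns the list. Both helper names are hypothetical, and splitting on "\n" without further cleaning is an assumption about the behaviour.

```python
from typing import List


def assign_global_list(string_list: str, var_name: str) -> None:
    """Split on newlines and bind the result to a module-level name."""
    # globals() is the namespace of the module in which this function is defined.
    globals()[var_name] = string_list.split("\n")


def to_list(string_list: str) -> List[str]:
    """Alternative: return the list and let the caller choose the name."""
    return string_list.split("\n")


assign_global_list("aspirin\nibuprofen", "drug_terms")
print(globals()["drug_terms"])  # ['aspirin', 'ibuprofen']

drug_terms_explicit = to_list("aspirin\nibuprofen")
print(drug_terms_explicit)  # ['aspirin', 'ibuprofen']
```

The returning variant is easier to test and keeps linters and IDEs aware of the variable, which is why implicit global assignment is usually discouraged.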

pat2vec.pat2vec_search.search_helper_functions.date_cleaner(df, cols, date_format)[source]

Formats specified datetime columns in a DataFrame to a given string format.

This function modifies the DataFrame in-place.

Parameters:
  • df (DataFrame) – The DataFrame to modify.

  • cols (List[str]) – A list of column names to format.

  • date_format (str) – The target string format for the dates (e.g., ‘%Y-%m-%d’).

Return type:

None
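A minimal sketch of in-place date formatting with pandas, assuming the columns are parseable as datetimes. The helper name format_date_columns and the column name admitted are hypothetical, and the pd.to_datetime call is an added assumption so that string inputs are handled as well.

```python
import pandas as pd


def format_date_columns(df: pd.DataFrame, cols, date_format: str) -> None:
    """Overwrite each listed column with its formatted string representation."""
    for col in cols:
        # Parse first, then render with the requested strftime format.
        df[col] = pd.to_datetime(df[col]).dt.strftime(date_format)


df = pd.DataFrame(
    {"admitted": pd.to_datetime(["2021-03-04 10:15", "2022-12-31 23:59"])}
)
format_date_columns(df, ["admitted"], "%Y-%m-%d")
print(df["admitted"].tolist())  # ['2021-03-04', '2022-12-31']
```

After formatting, the columns hold strings rather than datetimes, so any later date arithmetic needs the values parsed again.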

pat2vec.pat2vec_search.search_helper_functions.bulk_str_findall(target_colname_regex_pairs, source_colname, df_name)[source]

Applies multiple regex findall operations to a source column.

For each key-value pair in target_colname_regex_pairs, this function finds all occurrences of the regex (value) in the source_colname and stores the joined results in a new column named after the key. This modifies the DataFrame in-place.

Parameters:
  • target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.

  • source_colname (str) – The name of the column to search within.

  • df_name (DataFrame) – The DataFrame to modify.

Return type:

None
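The findall-and-join pattern can be sketched with pandas string methods as follows. The helper name findall_into_columns and the sample data are hypothetical, and the ", " separator is an assumption, since the separator is not stated above.

```python
import pandas as pd


def findall_into_columns(pairs, source_colname, df) -> None:
    """For each (new column, regex) pair, store all matches joined as one string."""
    for new_col, pattern in pairs.items():
        # str.findall yields a list of matches per row; str.join flattens it.
        df[new_col] = df[source_colname].str.findall(pattern).str.join(", ")


df = pd.DataFrame({"note": ["BP 120/80, HR 72", "HR 65 then HR 70"]})
findall_into_columns({"hr": r"HR \d+", "bp": r"\d+/\d+"}, "note", df)
print(df["hr"].tolist())  # ['HR 72', 'HR 65, HR 70']
```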

pat2vec.pat2vec_search.search_helper_functions.without_keys(d, keys)[source]

Returns a new dictionary excluding the specified keys.

Parameters:
  • d (Dict[Any, Any]) – The original dictionary.

  • keys (Iterable[Any]) – An iterable of keys to exclude.

Return type:

Dict[Any, Any]

Returns:

A new dictionary without the specified keys.
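A dictionary comprehension is one straightforward way to get this behaviour. The sketch below is illustrative only; drop_keys is a hypothetical name, not the library function.

```python
from typing import Any, Dict, Iterable


def drop_keys(d: Dict[Any, Any], keys: Iterable[Any]) -> Dict[Any, Any]:
    """Return a copy of d without the given keys; d itself is left unchanged."""
    excluded = set(keys)  # set membership keeps each lookup O(1)
    return {k: v for k, v in d.items() if k not in excluded}


record = {"id": 1, "name": "A", "password": "secret"}
public = drop_keys(record, ["password", "missing_key"])
print(public)  # {'id': 1, 'name': 'A'}
```

Keys that are not present in the dictionary are simply ignored in this sketch.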

pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract(target_colname_regex_pairs, source_colname, df_name, expand)[source]

Applies multiple regex extract operations to a source column.

For each key-value pair in target_colname_regex_pairs, this function extracts the first match of the regex (value) from the source_colname and stores it in a new column named after the key. This modifies the DataFrame in-place.

Parameters:
  • target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.

  • source_colname (str) – The name of the column to search within.

  • df_name (DataFrame) – The DataFrame to modify.

  • expand (bool) – The expand parameter for pd.Series.str.extract.

Return type:

None
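A hedged sketch of the extract loop, assuming each pattern contains exactly one capture group (pd.Series.str.extract requires at least one). The helper name extract_into_columns and the sample data are hypothetical.

```python
import pandas as pd


def extract_into_columns(pairs, source_colname, df, expand=False) -> None:
    """For each (new column, regex) pair, store the first captured match."""
    for new_col, pattern in pairs.items():
        # With expand=False and a single group, str.extract returns a Series.
        df[new_col] = df[source_colname].str.extract(pattern, expand=expand)


df = pd.DataFrame({"note": ["Dose: 5mg daily", "no dose recorded"]})
extract_into_columns({"dose_mg": r"Dose: (\d+)mg"}, "note", df)
print(df["dose_mg"].iloc[0])  # 5
```

Rows without a match receive NaN, and extracted values are strings, so numeric columns need an explicit conversion afterwards.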

pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract_round_robin(target_dict, df_name, source_colname, expand)[source]

Extracts text between a series of regex patterns in a round-robin fashion.

For each pattern in the target_dict, this function attempts to extract the text that appears after that pattern but before any of the other patterns in the dictionary. This modifies the DataFrame in-place.

Parameters:
  • target_dict (Dict[str, str]) – A dictionary mapping new column names to regex patterns.

  • df_name (DataFrame) – The DataFrame to modify.

  • source_colname (str) – The name of the column to search within.

  • expand (bool) – The expand parameter for pd.Series.str.extract.

Return type:

None
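One way to read the round-robin behaviour is as a lazy capture followed by a lookahead for any of the other patterns. The sketch below makes several assumptions not confirmed above: the helper name extract_between is hypothetical, the patterns contain no capture groups of their own, extracted text is stripped of surrounding whitespace, the end of the text also terminates a section, and the expand parameter is omitted.

```python
import re

import pandas as pd


def extract_between(target_dict, df, source_colname) -> None:
    """Store the text after each pattern, up to the next other pattern or the end."""
    for new_col, pattern in target_dict.items():
        others = [p for name, p in target_dict.items() if name != new_col]
        # The capture stops at the first other pattern, or at end of text.
        stop = "|".join([f"(?:{p})" for p in others] + [r"\Z"])
        regex = f"(?:{pattern})(.*?)(?={stop})"
        extracted = df[source_colname].str.extract(
            regex, flags=re.DOTALL, expand=False
        )
        df[new_col] = extracted.str.strip()


df = pd.DataFrame(
    {
        "text": [
            "History: asthma Medication: salbutamol Plan: review",
            "Plan: discharge History: none",
        ]
    }
)
sections = {"history": r"History:", "medication": r"Medication:", "plan": r"Plan:"}
extract_between(sections, df, "text")
print(df.loc[0, ["history", "medication", "plan"]].tolist())
# ['asthma', 'salbutamol', 'review']
```

Because each section is delimited by whichever other pattern appears next, the sections may occur in any order, and a section absent from a row yields NaN.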