pat2vec.pat2vec_search.search_helper_functions

Functions

`bulk_str_extract`(target_colname_regex_pairs, ...)	Applies multiple regex extract operations to a source column.
`bulk_str_extract_round_robin`(target_dict, ...)	Extracts text between a series of regex patterns in a round-robin fashion.
`bulk_str_findall`(target_colname_regex_pairs, ...)	Applies multiple regex findall operations to a source column.
`date_cleaner`(df, cols, date_format)	Formats specified datetime columns in a DataFrame to a given string format.
`pylist2searchlist`(list_name, output_name)	Converts a Python list into an Elasticsearch OR-separated search string.
`stringlist2pylist`(string_list, var_name)	Converts a newline-separated string into a Python list and assigns it to a global variable.
`stringlist2searchlist`(string_list, output_name)	Converts a newline-separated string into an Elasticsearch OR-separated search string.
`without_keys`(d, keys)	Returns a new dictionary excluding the specified keys.

pat2vec.pat2vec_search.search_helper_functions.stringlist2searchlist(string_list, output_name)[source]

Converts a newline-separated string into an Elasticsearch OR-separated search string.

The resulting string is saved to a text file. For example, the string ‘term1\nterm2’ becomes ‘"term1" OR "term2"’.

Parameters:

string_list (str) – A string where items are separated by newlines.
output_name (str) – The base name for the output text file (‘.txt’ will be appended).

Return type:

None
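
The conversion described above can be sketched in a few lines. This is an illustration only, not the library's implementation: `newline_string_to_or_query` is a hypothetical name, and skipping blank lines and trimming whitespace are assumptions that the documentation does not confirm.

```python
def newline_string_to_or_query(string_list: str) -> str:
    """Hypothetical helper: quote each line and join the results with ' OR '."""
    # Assumption: blank lines are dropped and surrounding whitespace is trimmed.
    terms = [line.strip() for line in string_list.splitlines() if line.strip()]
    return " OR ".join(f'"{term}"' for term in terms)


query = newline_string_to_or_query("term1\nterm2")
print(query)
```

Quoting each term keeps multi-word entries together as phrases when the string is pasted into an Elasticsearch query.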

pat2vec.pat2vec_search.search_helper_functions.pylist2searchlist(list_name, output_name)[source]

Converts a Python list into an Elasticsearch OR-separated search string.

The resulting string is saved to a text file. For example, a list [‘term1’, ‘term2’] becomes ‘"term1" OR "term2"’.

Parameters:

list_name (List[str]) – A list of strings to be joined.
output_name (str) – The base name for the output text file (‘.txt’ will be appended).

Return type:

None
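
A minimal sketch of the save step, assuming the file is written as UTF-8 text and that the output path is simply `output_name` with ‘.txt’ appended. `write_or_query` is a hypothetical stand-in that returns the path for inspection, whereas the documented function returns None.

```python
import tempfile
from pathlib import Path
from typing import List


def write_or_query(terms: List[str], output_name: str) -> Path:
    """Hypothetical stand-in: join quoted terms with OR and save to '<output_name>.txt'."""
    query = " OR ".join(f'"{term}"' for term in terms)
    path = Path(f"{output_name}.txt")
    path.write_text(query, encoding="utf-8")
    return path


# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = write_or_query(["term1", "term2"], str(Path(tmp_dir) / "search_terms"))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```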

pat2vec.pat2vec_search.search_helper_functions.stringlist2pylist(string_list, var_name)[source]

Converts a newline-separated string into a Python list and assigns it to a global variable.

Note

This function uses globals() to create a variable in the global scope, which is generally not recommended.

Parameters:

string_list (str) – A string where items are separated by newlines.
var_name (str) – The name of the global variable to which the resulting list will be assigned.

Return type:

None
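
The `globals()` pattern mentioned in the note can be illustrated as follows, alongside the more conventional alternative of returning the list. `string_to_list` and `my_terms` are hypothetical names, and splitting on newline characters only is an assumption.

```python
from typing import List


def string_to_list(string_list: str) -> List[str]:
    """Hypothetical helper: split a newline-separated string into a list."""
    return string_list.split("\n")


# Pattern described in the note: bind the list to a name in the global namespace.
globals()["my_terms"] = string_to_list("term1\nterm2")

# Conventional alternative: keep the result in an ordinary variable.
terms = string_to_list("term1\nterm2")
```

One reason the pattern is discouraged: `globals()` called inside a function returns the namespace of the module that defines the function, so the new name may not be visible where the caller expects it.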

pat2vec.pat2vec_search.search_helper_functions.date_cleaner(df, cols, date_format)[source]

Formats specified datetime columns in a DataFrame to a given string format.

This function modifies the DataFrame in-place.

Parameters:

df (DataFrame) – The DataFrame to modify.
cols (List[str]) – A list of column names to format.
date_format (str) – The target string format for the dates (e.g., ‘%Y-%m-%d’).

Return type:

None
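
As a rough sketch of the in-place formatting described above, assuming the listed columns already have a datetime dtype. The DataFrame and column names are made up for illustration, and the use of `Series.dt.strftime` is an assumption about how the formatting could be done, not a statement about the library's code.

```python
import pandas as pd

# Illustrative data: one datetime column and one unrelated column.
df = pd.DataFrame(
    {
        "admission": pd.to_datetime(["2021-03-05 14:30:00", "2022-11-20 08:00:00"]),
        "note": ["a", "b"],
    }
)

cols = ["admission"]
date_format = "%Y-%m-%d"

# Overwrite each listed column with its string-formatted version.
for col in cols:
    df[col] = df[col].dt.strftime(date_format)
```

After this step the columns hold strings, not datetimes, so any later date arithmetic would need them parsed again.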

pat2vec.pat2vec_search.search_helper_functions.bulk_str_findall(target_colname_regex_pairs, source_colname, df_name)[source]

Applies multiple regex findall operations to a source column.

For each key-value pair in target_colname_regex_pairs, this function finds all occurrences of the regex (value) in the source_colname and stores the joined results in a new column named after the key. This modifies the DataFrame in-place.

Parameters:

target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
source_colname (str) – The name of the column to search within.
df_name (DataFrame) – The DataFrame to modify.

Return type:

None
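
The behaviour described above can be approximated with `Series.str.findall` followed by `Series.str.join`. This is a sketch under assumptions: the data, column names, and patterns are invented, and the ", " separator is a guess, since the documentation only says the results are joined.

```python
import pandas as pd

df = pd.DataFrame({"text": ["BP 120/80, HR 72", "HR 65 then HR 70"]})

# Maps new column names to regex patterns, as in target_colname_regex_pairs.
pairs = {"hr_values": r"HR \d+", "bp_values": r"BP \d+/\d+"}

for new_col, pattern in pairs.items():
    # findall yields a list of matches per row; join collapses it to one string.
    df[new_col] = df["text"].str.findall(pattern).str.join(", ")
```

Rows with no match end up with an empty string, because joining an empty list produces one.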

pat2vec.pat2vec_search.search_helper_functions.without_keys(d, keys)[source]

Returns a new dictionary excluding the specified keys.

Parameters:

d (Dict[Any, Any]) – The original dictionary.
keys (Iterable[Any]) – An iterable of keys to exclude.

Return type:

Dict[Any, Any]

Returns:

A new dictionary without the specified keys.
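
A dictionary comprehension is one straightforward way to get this behaviour. The sketch below uses the hypothetical name `drop_keys` and assumes the keys to exclude are hashable.

```python
from typing import Any, Dict, Iterable


def drop_keys(d: Dict[Any, Any], keys: Iterable[Any]) -> Dict[Any, Any]:
    """Hypothetical equivalent: copy d without the listed keys."""
    excluded = set(keys)  # set membership keeps each lookup cheap
    return {k: v for k, v in d.items() if k not in excluded}


original = {"host": "localhost", "port": 9200, "password": "secret"}
safe = drop_keys(original, ["password"])
```

The original dictionary is left unchanged, which matches the "returns a new dictionary" wording above.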

pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract(target_colname_regex_pairs, source_colname, df_name, expand)[source]

Applies multiple regex extract operations to a source column.

For each key-value pair in target_colname_regex_pairs, this function extracts the first match of the regex (value) from the source_colname and stores it in a new column named after the key. This modifies the DataFrame in-place.

Parameters:

target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
source_colname (str) – The name of the column to search within.
df_name (DataFrame) – The DataFrame to modify.
expand (bool) – The expand parameter for pd.Series.str.extract.

Return type:

None
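
The loop described above might look roughly like the following. The data and patterns are invented for illustration, each pattern is assumed to contain exactly one capture group (which `Series.str.extract` requires), and `expand=False` is chosen here so that each extraction returns a Series.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Age: 54, Weight: 70kg", "Age: 61"]})

# Maps new column names to regex patterns with a single capture group each.
pairs = {"age": r"Age: (\d+)", "weight": r"Weight: (\d+)kg"}

for new_col, pattern in pairs.items():
    # extract returns the first match of the capture group, or NaN if none.
    df[new_col] = df["text"].str.extract(pattern, expand=False)
```

Extracted values are strings, so numeric columns still need an explicit conversion afterwards.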

pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract_round_robin(target_dict, df_name, source_colname, expand)[source]

Extracts text between a series of regex patterns in a round-robin fashion.

For each pattern in the target_dict, this function attempts to extract the text that appears after that pattern but before any of the other patterns in the dictionary. This modifies the DataFrame in-place.

Parameters:

target_dict (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
df_name (DataFrame) – The DataFrame to modify.
source_colname (str) – The name of the column to search within.
expand (bool) – The expand parameter for pd.Series.str.extract.

Return type:

None
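
One way to picture the round-robin extraction is a regex that captures lazily after the current pattern and stops at a lookahead for any of the other patterns. The sketch below is an assumption about how such a regex could be built, not the library's implementation. The data is invented, and the patterns are assumed to contain no capture groups of their own.

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ["Diagnosis: asthma Medication: salbutamol Plan: review"]}
)

# Maps new column names to the marker patterns that introduce each section.
target_dict = {
    "diagnosis": r"Diagnosis:",
    "medication": r"Medication:",
    "plan": r"Plan:",
}

for new_col, pattern in target_dict.items():
    others = [p for name, p in target_dict.items() if name != new_col]
    # Stop at whichever other marker comes next, or at the end of the text.
    stop = "|".join(f"(?:{p})" for p in others)
    lookahead = rf"(?={stop}|$)" if others else "$"
    regex = rf"(?:{pattern})(.*?){lookahead}"
    df[new_col] = df["text"].str.extract(regex, expand=False).str.strip()
```

Without the `re.DOTALL` flag, `.` does not match newlines, so a section that spans several lines would be cut short in this sketch.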