pat2vec.pat2vec_search.search_helper_functions
Functions
bulk_str_extract | Applies multiple regex extract operations to a source column. |
bulk_str_extract_round_robin | Extracts text between a series of regex patterns in a round-robin fashion. |
bulk_str_findall | Applies multiple regex findall operations to a source column. |
date_cleaner | Formats specified datetime columns in a DataFrame to a given string format. |
pylist2searchlist | Converts a Python list into an Elasticsearch OR-separated search string. |
stringlist2pylist | Converts a newline-separated string into a Python list and assigns it to a global variable. |
stringlist2searchlist | Converts a newline-separated string into an Elasticsearch OR-separated search string. |
without_keys | Returns a new dictionary excluding the specified keys. |
- pat2vec.pat2vec_search.search_helper_functions.stringlist2searchlist(string_list, output_name)
Converts a newline-separated string into an Elasticsearch OR-separated search string.
The resulting string is saved to a text file. For example, a string containing term1 and term2 on separate lines becomes “term1” OR “term2”, with each term wrapped in double quotes.
- Parameters:
string_list (str) – A string where items are separated by newlines.
output_name (str) – The base name for the output text file (‘.txt’ will be appended).
- Return type:
None
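The conversion described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the helper name `newline_string_to_search_string` is hypothetical, and stripping whitespace and skipping blank lines are assumptions that the documentation does not confirm. The file-writing step is left out here.

```python
def newline_string_to_search_string(string_list: str) -> str:
    """Hypothetical stand-in: quote each line and join the results with OR."""
    # Assumption: surrounding whitespace is stripped and blank lines are ignored.
    terms = [line.strip() for line in string_list.splitlines() if line.strip()]
    return " OR ".join(f'"{term}"' for term in terms)


search_string = newline_string_to_search_string("term1\nterm2")
print(search_string)  # "term1" OR "term2"
```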
- pat2vec.pat2vec_search.search_helper_functions.pylist2searchlist(list_name, output_name)
Converts a Python list into an Elasticsearch OR-separated search string.
The resulting string is saved to a text file. For example, the list [‘term1’, ‘term2’] becomes “term1” OR “term2”, with each term wrapped in double quotes.
- Parameters:
list_name (List[str]) – A list of strings to be joined.
output_name (str) – The base name for the output text file (‘.txt’ will be appended).
- Return type:
None
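A rough sketch of the join-and-save behaviour described above follows. The helper `write_search_string` is hypothetical and only mirrors the documented behaviour; unlike the documented function, which returns None, it returns the output path so the example can read the file back. The UTF-8 encoding is also an assumption.

```python
import tempfile
from pathlib import Path
from typing import List


def write_search_string(terms: List[str], output_name: str) -> Path:
    """Hypothetical stand-in: join quoted terms with OR and save to <output_name>.txt."""
    search_string = " OR ".join(f'"{term}"' for term in terms)
    path = Path(f"{output_name}.txt")  # '.txt' is appended to the base name
    path.write_text(search_string, encoding="utf-8")
    return path  # returned only for illustration; the documented function returns None


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = write_search_string(["term1", "term2"], str(Path(tmp_dir) / "my_terms"))
    saved_text = out_path.read_text(encoding="utf-8")

print(saved_text)  # "term1" OR "term2"
```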
- pat2vec.pat2vec_search.search_helper_functions.stringlist2pylist(string_list, var_name)
Converts a newline-separated string into a Python list and assigns it to a global variable.
Note
This function uses globals() to create a variable in the global scope, which is generally not recommended.
- Parameters:
string_list (str) – A string where items are separated by newlines.
var_name (str) – The name of the global variable to which the resulting list will be assigned.
- Return type:
None
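To make the note about globals() concrete, here is a small sketch of the pattern next to the plainer alternative of returning the list. The helper name `string_to_global_list` and the variable names are hypothetical, and splitting with `splitlines()` is an assumption. In Python, globals() refers to the namespace of the module in which the function is defined, so a function written this way creates the variable in its own module.

```python
def string_to_global_list(string_list: str, var_name: str) -> None:
    """Hypothetical stand-in: split on newlines and bind the list to a global name."""
    # globals() is the namespace of the module where this function is defined.
    globals()[var_name] = string_list.splitlines()


string_to_global_list("aspirin\nwarfarin", "drug_terms")
print(globals()["drug_terms"])  # ['aspirin', 'warfarin']

# A more conventional alternative: keep the result in an ordinary variable.
drug_terms_local = "aspirin\nwarfarin".splitlines()
```

Returning the list and assigning it explicitly keeps the data flow visible to readers and to static analysis tools.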
- pat2vec.pat2vec_search.search_helper_functions.date_cleaner(df, cols, date_format)
Formats specified datetime columns in a DataFrame to a given string format.
This function modifies the DataFrame in-place.
- Parameters:
df (DataFrame) – The DataFrame to modify.
cols (List[str]) – A list of column names to format.
date_format (str) – The target string format for the dates (e.g., ‘%Y-%m-%d’).
- Return type:
None
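The in-place formatting described above might look like the following sketch. The helper `format_date_columns` is hypothetical, and parsing with `pd.to_datetime` before calling `dt.strftime` is an assumption about how such a function could work, not a statement about the library's code. The sample column names are made up.

```python
import pandas as pd


def format_date_columns(df: pd.DataFrame, cols: list, date_format: str) -> None:
    """Hypothetical stand-in: rewrite each listed column as formatted date strings."""
    for col in cols:
        # Assumption: values are parsed to datetimes first, then formatted as strings.
        df[col] = pd.to_datetime(df[col]).dt.strftime(date_format)


df = pd.DataFrame(
    {
        "admitted": ["2021-03-05 14:30:00", "2021-12-25 08:00:00"],
        "note": ["first visit", "follow-up"],
    }
)
format_date_columns(df, ["admitted"], "%Y-%m-%d")
print(df["admitted"].tolist())  # ['2021-03-05', '2021-12-25']
```

Because the DataFrame is changed in place, the original datetime values are no longer available afterwards; copy the frame first if they are still needed.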
- pat2vec.pat2vec_search.search_helper_functions.bulk_str_findall(target_colname_regex_pairs, source_colname, df_name)
Applies multiple regex findall operations to a source column.
For each key-value pair in target_colname_regex_pairs, this function finds all occurrences of the regex (value) in the source_colname and stores the joined results in a new column named after the key. This modifies the DataFrame in-place.
- Parameters:
target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
source_colname (str) – The name of the column to search within.
df_name (DataFrame) – The DataFrame to modify.
- Return type:
None
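As an illustration of the find-and-join behaviour described above, the sketch below uses pandas' `Series.str.findall` followed by `Series.str.join`. The helper `findall_into_columns` is hypothetical, and the `", "` separator is an assumption, since the documentation does not say how the matches are joined.

```python
import pandas as pd


def findall_into_columns(pairs: dict, source_colname: str, df: pd.DataFrame, sep: str = ", ") -> None:
    """Hypothetical stand-in: one new column per pattern, holding all joined matches."""
    for target_col, pattern in pairs.items():
        # findall yields a list of matches per row; join collapses it to one string.
        df[target_col] = df[source_colname].str.findall(pattern).str.join(sep)


df = pd.DataFrame({"text": ["BP 120/80, HR 72", "HR 88"]})
findall_into_columns({"numbers": r"\d+", "labels": r"[A-Z]{2}"}, "text", df)
print(df["numbers"].tolist())  # ['120, 80, 72', '88']
print(df["labels"].tolist())   # ['BP, HR', 'HR']
```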
- pat2vec.pat2vec_search.search_helper_functions.without_keys(d, keys)
Returns a new dictionary excluding the specified keys.
- Parameters:
d (Dict[Any, Any]) – The original dictionary.
keys (Iterable[Any]) – An iterable of keys to exclude.
- Return type:
Dict[Any, Any]
- Returns:
A new dictionary without the specified keys.
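A dictionary comprehension is one straightforward way to get the behaviour described above. The helper name `drop_keys` is hypothetical and the sketch is not taken from the library's source. Converting `keys` to a set is an assumption that the keys are hashable.

```python
from typing import Any, Dict, Iterable


def drop_keys(d: Dict[Any, Any], keys: Iterable[Any]) -> Dict[Any, Any]:
    """Hypothetical stand-in: copy d, leaving out any key listed in keys."""
    excluded = set(keys)  # set lookup keeps the filter fast for long key lists
    return {k: v for k, v in d.items() if k not in excluded}


config = {"index": "notes", "size": 100, "password": "secret"}
safe_config = drop_keys(config, ["password", "not_present"])
print(safe_config)  # {'index': 'notes', 'size': 100}
```

The original dictionary is left unchanged, and keys that are not present are simply ignored.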
- pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract(target_colname_regex_pairs, source_colname, df_name, expand)
Applies multiple regex extract operations to a source column.
For each key-value pair in target_colname_regex_pairs, this function extracts the first match of the regex (value) from the source_colname and stores it in a new column named after the key. This modifies the DataFrame in-place.
- Parameters:
target_colname_regex_pairs (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
source_colname (str) – The name of the column to search within.
df_name (DataFrame) – The DataFrame to modify.
expand (bool) – The expand parameter for pd.Series.str.extract.
- Return type:
None
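The first-match extraction described above can be sketched with `Series.str.extract`, which requires each pattern to contain a capture group. The helper `extract_into_columns` is hypothetical; the example passes `expand=False` so that each single-group pattern yields a Series, which is an illustrative choice and not a documented default.

```python
import pandas as pd


def extract_into_columns(pairs: dict, source_colname: str, df: pd.DataFrame, expand: bool = False) -> None:
    """Hypothetical stand-in: one new column per pattern, holding the first captured match."""
    for target_col, pattern in pairs.items():
        # Each pattern needs one capture group; rows with no match become NaN.
        df[target_col] = df[source_colname].str.extract(pattern, expand=expand)


df = pd.DataFrame({"text": ["Dose: 50 mg daily", "Dose: 10 mg nightly", "no dose recorded"]})
extract_into_columns({"dose_mg": r"(\d+) mg", "frequency": r"mg (\w+)"}, "text", df)
print(df[["dose_mg", "frequency"]])
```

Extracted values are strings, so a numeric column such as `dose_mg` would still need converting before arithmetic.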
- pat2vec.pat2vec_search.search_helper_functions.bulk_str_extract_round_robin(target_dict, df_name, source_colname, expand)
Extracts text between a series of regex patterns in a round-robin fashion.
For each pattern in the target_dict, this function attempts to extract the text that appears after that pattern but before any of the other patterns in the dictionary. This modifies the DataFrame in-place.
- Parameters:
target_dict (Dict[str, str]) – A dictionary mapping new column names to regex patterns.
df_name (DataFrame) – The DataFrame to modify.
source_colname (str) – The name of the column to search within.
expand (bool) – The expand parameter for pd.Series.str.extract.
- Return type:
None
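One way to read the round-robin description is that each pattern acts as a start marker and all the other patterns act as stop markers. The sketch below builds such a regex with a lazy capture group and a lookahead. It is an interpretation under stated assumptions, not the library's code: the helper `extract_between_patterns` is hypothetical, it always extracts with `expand=False`, it treats end-of-string as an additional stop marker, and it strips surrounding whitespace.

```python
import pandas as pd


def extract_between_patterns(target_dict: dict, df: pd.DataFrame, source_colname: str) -> None:
    """Hypothetical stand-in: capture the text after each pattern, up to any other pattern."""
    for target_col, pattern in target_dict.items():
        # Every other pattern, plus end-of-string, ends the capture (assumption).
        others = [p for col, p in target_dict.items() if col != target_col]
        stop = "|".join([f"(?:{p})" for p in others] + ["$"])
        regex = f"(?:{pattern})(.*?)(?={stop})"
        extracted = df[source_colname].str.extract(regex, expand=False)
        df[target_col] = extracted.str.strip()


df = pd.DataFrame(
    {
        "text": [
            "History: chest pain Medication: aspirin Plan: review in 2 weeks",
            "Medication: warfarin History: AF",
        ]
    }
)
sections = {"history": r"History:", "medication": r"Medication:", "plan": r"Plan:"}
extract_between_patterns(sections, df, "text")
print(df[["history", "medication", "plan"]])
```

The section order can differ between rows, as in the second row here, because each capture stops at whichever other marker comes next. In this sketch the `.` does not match newlines, so multi-line text would need the `re.DOTALL` flag passed to `str.extract`.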