pat2vec.pat2vec_search.cogstack_search_methods

Functions

`cohort_searcher_no_terms`(index_name, ...)	Searches an index using only a query string.
`cohort_searcher_no_terms_fuzzy`(index_name, ...)	Searches an index using different query string methods.
`cohort_searcher_with_terms_and_search`(...)	Searches a cohort using a term filter and a query string.
`cohort_searcher_with_terms_no_search`(...)	Searches a cohort using only a term-level filter.
`create_credentials_file`()	Creates a template credentials.py file.
`dataframe_generator`(list_of_dfs)	A generator that yields DataFrames from a list of DataFrames.
`initialize_cogstack_client`([config_obj])	Initializes the global CogStack client cs.
`iterative_multi_term_cohort_searcher_no_terms_fuzzy`(...)	Iteratively searches for EPR documents matching multiple search terms.
`iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct`(...)	Iteratively searches for MCT documents matching multiple search terms.
`iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs`(...)	Iteratively searches for textual observations matching multiple terms.
`list_chunker`(entered_list)	Splits a list into smaller chunks of up to 10,000 elements.
`set_index_safe_wrapper`(df)	Safely sets the DataFrame index to 'id', ignoring errors.

Classes

`CogStack`(hosts[, username, password, api, ...])	Client for querying CogStack Elasticsearch indices.

pat2vec.pat2vec_search.cogstack_search_methods.create_credentials_file()

Creates a template credentials.py file.

This function creates a credentials.py file three levels up from the current file’s directory. This file contains placeholder variables for Elasticsearch connection details (hosts, username, password, api_key). It is intended to be filled out by the user with their actual credentials.

Return type: None
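
As a rough illustration of the kind of file described above, the sketch below writes a placeholder credentials.py into a temporary directory. The helper name `write_credentials_template` and the placeholder values are hypothetical; only the four variable names (hosts, username, password, api_key) come from the description, and the real function chooses its own location and contents.

```python
import tempfile
from pathlib import Path

# Hypothetical template body: only the four variable names are taken from
# the description above; the placeholder values are invented.
TEMPLATE = '''hosts = ["https://your-elasticsearch-host:9200"]
username = "your_username"
password = "your_password"
api_key = "your_api_key"
'''


def write_credentials_template(target_dir: Path) -> Path:
    """Write a placeholder credentials.py into target_dir unless one exists."""
    path = target_dir / "credentials.py"
    if not path.exists():
        # Never overwrite a file the user may already have filled in.
        path.write_text(TEMPLATE, encoding="utf-8")
    return path


# Demonstrate against a throwaway directory rather than a real project tree.
with tempfile.TemporaryDirectory() as tmp:
    created = write_credentials_template(Path(tmp))
    template_text = created.read_text(encoding="utf-8")
```

Skipping the write when the file already exists is a design choice of this sketch, made so that a filled-in file is never clobbered.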

class pat2vec.pat2vec_search.cogstack_search_methods.CogStack(hosts, username=None, password=None, api=True, api_key=None)

Bases: object

Parameters:

hosts (List[str])
username (str | None)
password (str | None)
api (bool)
api_key (str | None)

__init__(hosts, username=None, password=None, api=True, api_key=None)

Initializes the CogStack client for Elasticsearch interaction.

Parameters:

hosts (List[str]) – A list of CogStack host URLs.
username (Optional[str]) – The username for basic authentication.
password (Optional[str]) – The password for basic authentication.
api (bool) – If True, use API key authentication. Defaults to True.
api_key (Optional[str]) – The API key for authentication.

get_docs_generator(index, query, es_gen_size=800, request_timeout=300)

Returns a generator that yields documents from an Elasticsearch search.

This method uses elasticsearch.helpers.scan to efficiently scroll through all results of a query.

Parameters:

index (List[str]) – A list of Elasticsearch indices to search.
query (Dict[str, Any]) – The Elasticsearch query dictionary.
es_gen_size (int) – The number of documents to retrieve per shard in each scroll.
request_timeout (int) – The timeout in seconds for the request.

Return type:

Generator[Dict[str, Any], None, None]

Returns:

A generator object that yields search hits.

cogstack2df(query, index, column_headers=None, es_gen_size=800, request_timeout=300)

Executes a search query and returns the results as a pandas DataFrame.

Parameters:

query (Dict[str, Any]) – The Elasticsearch query dictionary.
index (str) – The name of the index or a list of indices to search.
column_headers (Optional[List[str]]) – A specific list of columns for the DataFrame.
es_gen_size (int) – The number of documents per scroll request.
request_timeout (int) – The timeout in seconds for the request.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.
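
To make the conversion concrete, here is a minimal sketch of flattening search hits into rows. It assumes each hit carries its fields under `_source`, which is the standard Elasticsearch hit layout. The helper `hits_to_rows`, the sample hits, the index name `epr_documents`, and the choice to fill missing fields with `None` are illustrative assumptions, not taken from the library.

```python
from typing import Any, Dict, Iterable, List, Optional


def hits_to_rows(
    hits: Iterable[Dict[str, Any]],
    column_headers: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Flatten Elasticsearch-style hits into one dictionary per document."""
    rows = []
    for hit in hits:
        source = hit.get("_source", {})
        row = {"_index": hit.get("_index"), "_id": hit.get("_id")}
        if column_headers is None:
            # No column list given: keep every field found in the document.
            row.update(source)
        else:
            # Keep only the requested columns; absent fields become None.
            row.update({col: source.get(col) for col in column_headers})
        rows.append(row)
    return rows


# Invented sample hits standing in for the output of a scroll.
sample_hits = [
    {"_index": "epr_documents", "_id": "1",
     "_source": {"client_idcode": "P001", "body": "note a"}},
    {"_index": "epr_documents", "_id": "2",
     "_source": {"client_idcode": "P002"}},
]
rows = hits_to_rows(sample_hits, column_headers=["client_idcode", "body"])
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame` to obtain a tabular result.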

DataFrame(index)

Returns an Eland DataFrame for the specified index.

Eland provides a pandas-like API for data in Elasticsearch.

Parameters: index (str) – The name of the index or index pattern.
Return type: DataFrame
Returns: An Eland DataFrame object.

pat2vec.pat2vec_search.cogstack_search_methods.list_chunker(entered_list)

Splits a list into smaller chunks of up to 10,000 elements.

Parameters: entered_list (List[Any]) – The list to be split into chunks.
Return type: List[List[Any]]
Returns: A list of lists, where each sublist is a chunk of the original list.
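
The chunking behaviour described above can be sketched as follows. This is an illustrative reimplementation under the stated 10,000-element limit, not the library's own code; the name `chunk_list` and the `chunk_size` parameter are hypothetical.

```python
from typing import Any, List


def chunk_list(items: List[Any], chunk_size: int = 10_000) -> List[List[Any]]:
    """Split items into consecutive sublists of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# 25,000 identifiers split into two full chunks and one partial chunk.
chunks = chunk_list(list(range(25_000)))
chunk_lengths = [len(chunk) for chunk in chunks]
```

Splitting a long identifier list this way keeps each individual request small; each chunk can then be searched separately and the results concatenated.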

pat2vec.pat2vec_search.cogstack_search_methods.dataframe_generator(list_of_dfs)

A generator that yields DataFrames from a list of DataFrames.

Parameters: list_of_dfs (List[DataFrame])
Return type: Generator[DataFrame, None, None]

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_and_search(index_name, fields_list, term_name, entered_list, search_string)

Searches a cohort using a term filter and a query string.

Parameters:

index_name (str) – The name of the Elasticsearch index to search.
fields_list (List[str]) – The list of fields to return from each document.
term_name (str) – The name of the field to use for the term-level filter.
entered_list (List[str]) – The list of values to filter for in the term_name field.
search_string (str) – The query string to apply to the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.
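
For orientation, a term filter combined with a query string might be expressed in the Elasticsearch query DSL roughly as below. This is a sketch built from the parameter descriptions above: the helper `build_terms_and_search_query` is hypothetical, the field `body_analysed` in the example is invented, and the query the library actually sends may be structured differently.

```python
from typing import Any, Dict, List


def build_terms_and_search_query(
    fields_list: List[str],
    term_name: str,
    entered_list: List[str],
    search_string: str,
) -> Dict[str, Any]:
    """Combine a terms filter and a query_string clause in one bool query."""
    return {
        # Restrict the fields returned for each document.
        "_source": fields_list,
        "query": {
            "bool": {
                # Term-level filter: the field must match one of the values.
                "filter": [{"terms": {term_name: entered_list}}],
                # Query string applied on top of the filtered documents.
                "must": [{"query_string": {"query": search_string}}],
            }
        },
    }


example_query = build_terms_and_search_query(
    fields_list=["client_idcode", "body_analysed"],
    term_name="client_idcode",
    entered_list=["P001", "P002"],
    search_string="body_analysed:diabetes",
)
```

Placing the identifier list in `filter` rather than `must` means it narrows the result set without contributing to relevance scoring.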

pat2vec.pat2vec_search.cogstack_search_methods.set_index_safe_wrapper(df)

Safely sets the DataFrame index to ‘id’, ignoring errors.

Parameters: df (DataFrame)
Return type: DataFrame

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_no_search(index_name, fields_list, term_name, entered_list)

Searches a cohort using only a term-level filter.

Parameters:

index_name (str) – The name of the index to search.
fields_list (List[str]) – A list of fields to return.
term_name (str) – The field to filter on.
entered_list (List[str]) – The list of values to search for in the term_name field.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms(index_name, fields_list, search_string)

Searches an index using only a query string.

Parameters:

index_name (str) – The name of the Elasticsearch index to search.
fields_list (List[str]) – A list of fields to return.
search_string (str) – The query string to use for the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms_fuzzy(index_name, fields_list, search_string, method='fuzzy', fuzzy=2, slop=1)

Searches an index using different query string methods.

Parameters:

index_name (str) – The name of the Elasticsearch index.
fields_list (List[str]) – List of fields to retrieve.
search_string (str) – The search string to query.
method (str) – The search method (“fuzzy”, “exact”, or “phrase”).
fuzzy (int) – The fuzziness level for fuzzy matching.
slop (int) – The slop value for phrase searches (word proximity).

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
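
The three methods plausibly correspond to standard query string syntax: `term~N` for fuzzy matching, a quoted phrase for exact matching, and `"phrase"~N` for proximity matching. The sketch below illustrates that mapping. It is an assumption about how the `method`, `fuzzy`, and `slop` parameters are applied, and the helper `format_search_term` is hypothetical.

```python
def format_search_term(
    term: str, method: str = "fuzzy", fuzzy: int = 2, slop: int = 1
) -> str:
    """Render a term in query string syntax for the chosen method."""
    if method == "fuzzy":
        # Fuzzy operator: allow up to `fuzzy` character edits.
        return f"{term}~{fuzzy}"
    if method == "exact":
        # Quoted phrase: words must appear exactly, in order.
        return f'"{term}"'
    if method == "phrase":
        # Proximity operator: words may be up to `slop` positions apart.
        return f'"{term}"~{slop}'
    raise ValueError(f"Unknown method: {method!r}")


fuzzy_term = format_search_term("metformin")
exact_term = format_search_term("heart failure", method="exact")
phrase_term = format_search_term("heart failure", method="phrase", slop=3)
```

Raising on an unrecognised method is a choice made for this sketch; how the library handles unexpected values is not stated here.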

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, overwrite=True, debug=False, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1)

Iteratively searches for EPR documents matching multiple search terms.

Parameters:

terms_list (List[str]) – The list of search terms to search for.
treatment_doc_filename (str) – The name of the file to store the results in.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
overwrite (bool) – Whether to overwrite the existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the column containing the UUIDs.
additional_filters (Optional[List[str]]) – A list of additional filters to apply.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method to use (‘fuzzy’, ‘exact’, or ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy matching.
slop (int) – The slop value for phrase searches.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
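
Because the date range is passed as six separate strings, it may help to see how they could be combined into a single range clause. The sketch below validates the parts with `datetime.date` and renders a query string range. The helper `build_date_range_clause` and the timestamp field `updatetime` are hypothetical, and the library's own date handling may differ.

```python
from datetime import date


def build_date_range_clause(
    field: str,
    start_year: str, start_month: str, start_day: str,
    end_year: str, end_month: str, end_day: str,
) -> str:
    """Build an inclusive query string range clause from date components."""
    # date() raises ValueError for impossible dates such as month 13.
    start = date(int(start_year), int(start_month), int(start_day))
    end = date(int(end_year), int(end_month), int(end_day))
    if start > end:
        raise ValueError("start date must not be after end date")
    # Square brackets denote an inclusive range in query string syntax.
    return f"{field}:[{start.isoformat()} TO {end.isoformat()}]"


clause = build_date_range_clause(
    "updatetime", "2020", "1", "1", "2021", "12", "31"
)
```

Converting through `date` also zero-pads single-digit months and days, so "1" and "01" produce the same clause.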

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False)

Iteratively searches for MCT documents matching multiple search terms.

This function searches the ‘observations’ index for documents of type ‘AoMRC_ClinicalSummary_FT’ that contain the specified terms.

Parameters:

terms_list (List[str]) – A list of terms to search for.
treatment_doc_filename (str) – The filename to load or save the results.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
append (bool) – Whether to append results to an existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the UUID column.
additional_filters (Optional[List[str]]) – Additional filters to apply to the search.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method (‘fuzzy’, ‘exact’, ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy search.
slop (int) – The slop value for phrase search.
testing (bool) – Whether to use a dummy searcher for testing.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.initialize_cogstack_client(config_obj=None)

Initializes the global CogStack client cs.

This function sets up the connection to Elasticsearch. It can be configured to load credentials from a specific file path by passing a config object. If a client instance already exists, it will not re-initialize unless a config object with a new credentials path is provided.

The credential loading priority is:

1. credentials_path from the config_obj.
2. Default credentials.py in the project’s root.
3. If not found, it creates a template credentials.py and tries again.
4. Falls back to dummy credentials if all else fails.

Parameters: config_obj – A configuration object that may have a credentials_path attribute.
Returns: The initialized CogStack client instance.
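
The first two steps of the loading priority, plus the fall-through to the later steps, can be sketched as a small resolver. This is an illustration only: the helper `resolve_credentials_source` and its return labels are hypothetical, and falling through when a configured path does not exist is an assumption of this sketch rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Optional, Tuple


def resolve_credentials_source(
    config_obj: Optional[object], default_path: Path
) -> Tuple[str, Optional[Path]]:
    """Decide which credentials file to load, in priority order."""
    # 1. An explicit credentials_path on the config object wins.
    custom = getattr(config_obj, "credentials_path", None)
    if custom and Path(custom).is_file():
        return "config", Path(custom)
    # 2. Otherwise use the default credentials.py if it exists.
    if default_path.is_file():
        return "default", default_path
    # 3./4. Nothing found: the caller would create a template and retry,
    # then fall back to dummy credentials.
    return "fallback", None


with tempfile.TemporaryDirectory() as tmp:
    default = Path(tmp) / "credentials.py"
    custom = Path(tmp) / "my_credentials.py"
    custom.write_text("hosts = []\n", encoding="utf-8")
    cfg = SimpleNamespace(credentials_path=str(custom))

    from_config = resolve_credentials_source(cfg, default)[0]
    without_any = resolve_credentials_source(None, default)[0]
    default.write_text("hosts = []\n", encoding="utf-8")
    from_default = resolve_credentials_source(None, default)[0]
```

Using `getattr` with a default keeps the resolver tolerant of `config_obj=None` and of config objects that lack a credentials_path attribute.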

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', bloods_time_field='basicobs_entered', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False)

Iteratively searches for textual observations matching multiple terms.

This function searches the ‘basic_observations’ index for documents where the textualObs field contains the specified terms.

Parameters:

terms_list (List[str]) – A list of terms to search for.
treatment_doc_filename (str) – The filename to load or save the results.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
append (bool) – Whether to append results to an existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the UUID column.
bloods_time_field (str) – The timestamp field to use for date filtering.
additional_filters (Optional[List[str]]) – Additional filters to apply to the search.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method (‘fuzzy’, ‘exact’, ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy search.
slop (int) – The slop value for phrase search.
testing (bool) – Whether to use a dummy searcher for testing.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.