pat2vec.pat2vec_search.cogstack_search_methods

Functions

cohort_searcher_no_terms(index_name, ...)

Searches an index using only a query string.

cohort_searcher_no_terms_fuzzy(index_name, ...)

Searches an index using different query string methods.

cohort_searcher_with_terms_and_search(...)

Searches a cohort using a term filter and a query string.

cohort_searcher_with_terms_no_search(...)

Searches a cohort using only a term-level filter.

create_credentials_file()

Creates a template credentials.py file.

dataframe_generator(list_of_dfs)

A generator that yields DataFrames from a list of DataFrames.

initialize_cogstack_client([config_obj])

Initializes the global CogStack client cs.

iterative_multi_term_cohort_searcher_no_terms_fuzzy(...)

Iteratively searches for EPR documents matching multiple search terms.

iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(...)

Iteratively searches for MCT documents matching multiple search terms.

iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(...)

Iteratively searches for textual observations matching multiple terms.

list_chunker(entered_list)

Splits a list into smaller chunks of up to 10,000 elements.

set_index_safe_wrapper(df)

Safely sets the DataFrame index to 'id', ignoring errors.

Classes

CogStack(hosts[, username, password, api, ...])

pat2vec.pat2vec_search.cogstack_search_methods.create_credentials_file()

Creates a template credentials.py file.

This function creates a credentials.py file three levels up from the current file's directory. This file contains placeholder variables for Elasticsearch connection details (hosts, username, password, api_key). It is intended to be filled out by the user with their actual credentials.

Return type:

None

class pat2vec.pat2vec_search.cogstack_search_methods.CogStack(hosts, username=None, password=None, api=True, api_key=None)

Bases: object

Parameters:
  • hosts (List[str])

  • username (str | None)

  • password (str | None)

  • api (bool)

  • api_key (str | None)

__init__(hosts, username=None, password=None, api=True, api_key=None)

Initializes the CogStack client for Elasticsearch interaction.

Parameters:
  • hosts (List[str]) – A list of CogStack host URLs.

  • username (Optional[str]) – The username for basic authentication.

  • password (Optional[str]) – The password for basic authentication.

  • api (bool) – If True, use API key authentication. Defaults to True.

  • api_key (Optional[str]) – The API key for authentication.

get_docs_generator(index, query, es_gen_size=800, request_timeout=300)

Returns a generator that yields documents from an Elasticsearch search.

This method uses elasticsearch.helpers.scan to efficiently scroll through all results of a query.

Parameters:
  • index (List[str]) – A list of Elasticsearch indices to search.

  • query (Dict[str, Any]) – The Elasticsearch query dictionary.

  • es_gen_size (int) – The number of documents to retrieve per shard in each scroll.

  • request_timeout (int) – The timeout in seconds for the request.

Return type:

Generator[Dict[str, Any], None, None]

Returns:

A generator object that yields search hits.

cogstack2df(query, index, column_headers=None, es_gen_size=800, request_timeout=300)

Executes a search query and returns the results as a pandas DataFrame.

Parameters:
  • query (Dict[str, Any]) – The Elasticsearch query dictionary.

  • index (str) – The name of the index or a list of indices to search.

  • column_headers (Optional[List[str]]) – A specific list of columns for the DataFrame.

  • es_gen_size (int) – The number of documents per scroll request.

  • request_timeout (int) – The timeout in seconds for the request.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

DataFrame(index)

Returns an Eland DataFrame for the specified index.

Eland provides a pandas-like API for data in Elasticsearch.

Parameters:

index (str) – The name of the index or index pattern.

Return type:

DataFrame

Returns:

An Eland DataFrame object.

pat2vec.pat2vec_search.cogstack_search_methods.list_chunker(entered_list)

Splits a list into smaller chunks of up to 10,000 elements.

Parameters:

entered_list (List[Any]) – The list to be split into chunks.

Return type:

List[List[Any]]

Returns:

A list of lists, where each sublist is a chunk of the original list.
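
The chunking behaviour described above can be sketched as follows. This is a minimal illustration, not the library's own implementation: the name chunk_list and the chunk_size parameter are hypothetical, and only the 10,000-element limit comes from the description.

```python
from typing import Any, List

# Maximum chunk length, taken from the function description above.
CHUNK_SIZE = 10_000


def chunk_list(entered_list: List[Any], chunk_size: int = CHUNK_SIZE) -> List[List[Any]]:
    """Split entered_list into consecutive sublists of at most chunk_size items.

    Hypothetical stand-in for list_chunker; the real function may differ.
    """
    return [
        entered_list[start:start + chunk_size]
        for start in range(0, len(entered_list), chunk_size)
    ]


chunks = chunk_list(list(range(25_000)))
print([len(chunk) for chunk in chunks])  # [10000, 10000, 5000]
```

Chunking like this is a common way to keep a long list of identifiers below per-request limits on the number of values in a single filter.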

pat2vec.pat2vec_search.cogstack_search_methods.dataframe_generator(list_of_dfs)

A generator that yields DataFrames from a list of DataFrames.

Return type:

Generator[DataFrame, None, None]

Parameters:

list_of_dfs (List[DataFrame])

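pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_and_search(index_name, fields_list, term_name, entered_list, search_string)

Searches a cohort using a term filter and a query string.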

Parameters:
  • index_name (str) – The name of the Elasticsearch index to search.

  • fields_list (List[str]) – The list of fields to return from each document.

  • term_name (str) – The name of the field to use for the term-level filter.

  • entered_list (List[str]) – The list of values to filter for in the term_name field.

  • search_string (str) – The query string to apply to the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.
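
To make the combination of a term-level filter and a query string concrete, here is a sketch of an Elasticsearch request body that such a search could send. It assumes a bool query with the query string in must and the terms filter in filter; the source does not confirm this is how the function builds its query. The helper name build_terms_and_search_query and the field names in the usage lines are hypothetical.

```python
from typing import Any, Dict, List


def build_terms_and_search_query(
    fields_list: List[str],
    term_name: str,
    entered_list: List[str],
    search_string: str,
) -> Dict[str, Any]:
    """Build a request body that restricts a query_string search to given term values.

    Hypothetical helper: the layout is an assumption, not the library's code.
    """
    return {
        # Only return the requested fields from each hit.
        "_source": fields_list,
        "query": {
            "bool": {
                # Full-text part: scored query_string match.
                "must": [{"query_string": {"query": search_string}}],
                # Term-level part: exact match on any of the listed values.
                "filter": [{"terms": {term_name: entered_list}}],
            }
        },
    }


# Illustrative field names and values only.
query = build_terms_and_search_query(
    fields_list=["client_idcode", "body_analysed"],
    term_name="client_idcode.keyword",
    entered_list=["P0001", "P0002"],
    search_string="diabetes",
)
print(query["query"]["bool"]["filter"])
```

A dictionary of this shape matches the query argument documented for cogstack2df.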

pat2vec.pat2vec_search.cogstack_search_methods.set_index_safe_wrapper(df)

Safely sets the DataFrame index to 'id', ignoring errors.

Return type:

DataFrame

Parameters:

df (DataFrame)

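pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_no_search(index_name, fields_list, term_name, entered_list)

Searches a cohort using only a term-level filter.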

Parameters:
  • index_name (str) – The name of the index to search.

  • fields_list (List[str]) – A list of fields to return.

  • term_name (str) – The field to filter on.

  • entered_list (List[str]) – The list of values to search for in the term_name field.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms(index_name, fields_list, search_string)

Searches an index using only a query string.

Parameters:
  • index_name (str) – The name of the Elasticsearch index to search.

  • fields_list (List[str]) – A list of fields to return.

  • search_string (str) – The query string to use for the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms_fuzzy(index_name, fields_list, search_string, method='fuzzy', fuzzy=2, slop=1)

Searches an index using different query string methods.

Parameters:
  • index_name (str) – The name of the Elasticsearch index.

  • fields_list (List[str]) – List of fields to retrieve.

  • search_string (str) – The search string to query.

  • method (str) – The search method ("fuzzy", "exact", or "phrase").

  • fuzzy (int) – The fuzziness level for fuzzy matching.

  • slop (int) – The slop value for phrase searches (word proximity).

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
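
The three method values map naturally onto Elasticsearch query_string syntax, where term~N requests fuzzy matching within edit distance N and "a phrase"~N allows up to N positions of slop within a phrase. The sketch below shows one plausible mapping. It is an assumption about how the methods could be expressed, not the function's actual code, and build_query_string is a hypothetical name.

```python
def build_query_string(
    search_string: str,
    method: str = "fuzzy",
    fuzzy: int = 2,
    slop: int = 1,
) -> str:
    """Translate a search string into query_string syntax for the chosen method.

    Hypothetical helper: it does not escape reserved query_string characters.
    """
    if method == "fuzzy":
        # Apply the edit-distance operator to every whitespace-separated word.
        return " ".join(f"{word}~{fuzzy}" for word in search_string.split())
    if method == "exact":
        # Quoting forces the words to match as an exact phrase.
        return f'"{search_string}"'
    if method == "phrase":
        # A quoted phrase followed by ~N tolerates N positions of word movement.
        return f'"{search_string}"~{slop}'
    raise ValueError(f"Unknown method: {method!r}")


print(build_query_string("heart failure"))                    # heart~2 failure~2
print(build_query_string("heart failure", method="exact"))    # "heart failure"
print(build_query_string("heart failure", method="phrase"))   # "heart failure"~1
```

Elasticsearch accepts edit distances of 0, 1, or 2 for fuzzy matching, so the default of fuzzy=2 is the most permissive setting.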

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, overwrite=True, debug=False, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1)

Iteratively searches for EPR documents matching multiple search terms.

Parameters:
  • terms_list (List[str]) – The list of search terms to search for.

  • treatment_doc_filename (str) – The name of the file to store the results in.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • overwrite (bool) – Whether to overwrite the existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the column containing the UUIDs.

  • additional_filters (Optional[List[str]]) – A list of additional filters to apply.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method to use ('fuzzy', 'exact', or 'phrase').

  • fuzzy (int) – The fuzziness level for fuzzy matching.

  • slop (int) – The slop value for phrase searches.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False)

Iteratively searches for MCT documents matching multiple search terms.

This function searches the 'observations' index for documents of type 'AoMRC_ClinicalSummary_FT' that contain the specified terms.

Parameters:
  • terms_list (List[str]) – A list of terms to search for.

  • treatment_doc_filename (str) – The filename to load or save the results.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • append (bool) – Whether to append results to an existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the UUID column.

  • additional_filters (Optional[List[str]]) – Additional filters to apply to the search.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method ('fuzzy', 'exact', or 'phrase').

  • fuzzy (int) – The fuzziness level for fuzzy search.

  • slop (int) – The slop value for phrase search.

  • testing (bool) – Whether to use a dummy searcher for testing.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.initialize_cogstack_client(config_obj=None)

Initializes the global CogStack client cs.

This function sets up the connection to Elasticsearch. It can be configured to load credentials from a specific file path by passing a config object. If a client instance already exists, it will not re-initialize unless a config object with a new credentials path is provided.

The credential loading priority is:

1. credentials_path from the config_obj.
2. Default credentials.py in the project's root.
3. If not found, it creates a template credentials.py and tries again.
4. Falls back to dummy credentials if all else fails.

Parameters:

config_obj – A configuration object that may have a credentials_path attribute.

Returns:

The initialized CogStack client instance.
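
The credential loading priority described above can be illustrated with a small resolver. This is a sketch under stated assumptions, not the library's code: resolve_credentials_source, its return labels, and the template text are hypothetical, and it assumes that a configured path which does not exist falls through to the next step.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Placeholder variables named in the create_credentials_file description.
TEMPLATE = 'hosts = []\nusername = ""\npassword = ""\napi_key = ""\n'


def resolve_credentials_source(
    config_credentials_path: Optional[str],
    default_path: Path,
) -> Tuple[str, Optional[Path]]:
    """Return a label and path for the credentials source, in priority order."""
    # 1. An explicit path from the config object wins if the file exists.
    if config_credentials_path and Path(config_credentials_path).is_file():
        return "config", Path(config_credentials_path)
    # 2. Otherwise use the default credentials.py if present.
    if default_path.is_file():
        return "default", default_path
    # 3. Otherwise try to create a template at the default location.
    try:
        default_path.write_text(TEMPLATE)
    except OSError:
        # 4. If the template cannot be written, fall back to dummy credentials.
        return "dummy", None
    return "template", default_path


with tempfile.TemporaryDirectory() as tmp:
    default_path = Path(tmp) / "credentials.py"
    first = resolve_credentials_source(None, default_path)[0]
    second = resolve_credentials_source(None, default_path)[0]

print(first, second)  # template default
```

On the first call no file exists, so a template is written; the second call then finds that file at the default location.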

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', bloods_time_field='basicobs_entered', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False)

Iteratively searches for textual observations matching multiple terms.

This function searches the 'basic_observations' index for documents where the textualObs field contains the specified terms.

Parameters:
  • terms_list (List[str]) – A list of terms to search for.

  • treatment_doc_filename (str) – The filename to load or save the results.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • append (bool) – Whether to append results to an existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the UUID column.

  • bloods_time_field (str) – The timestamp field to use for date filtering.

  • additional_filters (Optional[List[str]]) – Additional filters to apply to the search.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method ('fuzzy', 'exact', or 'phrase').

  • fuzzy (int) – The fuzziness level for fuzzy search.

  • slop (int) – The slop value for phrase search.

  • testing (bool) – Whether to use a dummy searcher for testing.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
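
Because the date range is passed as six separate string parts, it can help to see how they might be combined into a single filter on the timestamp field. The sketch below builds an Elasticsearch range clause from those parts. The helper build_date_range and the choice of inclusive bounds are assumptions; only the parameter names and the default field 'basicobs_entered' come from the signature above.

```python
from datetime import date
from typing import Any, Dict


def build_date_range(
    start_year: str,
    start_month: str,
    start_day: str,
    end_year: str,
    end_month: str,
    end_day: str,
    time_field: str = "basicobs_entered",
) -> Dict[str, Any]:
    """Combine string date parts into an inclusive Elasticsearch range clause.

    Hypothetical helper; the library may format or bound the range differently.
    """
    # date() validates the parts, so an impossible day such as 31 February raises ValueError.
    start = date(int(start_year), int(start_month), int(start_day))
    end = date(int(end_year), int(end_month), int(end_day))
    if start > end:
        raise ValueError("start date must not be after end date")
    # isoformat() zero-pads month and day, giving YYYY-MM-DD strings.
    return {"range": {time_field: {"gte": start.isoformat(), "lte": end.isoformat()}}}


clause = build_date_range("2020", "1", "5", "2020", "12", "31")
print(clause)
# {'range': {'basicobs_entered': {'gte': '2020-01-05', 'lte': '2020-12-31'}}}
```

Validating the parts up front surfaces malformed dates before any search request is sent.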