pat2vec.pat2vec_search.cogstack_search_methods

Functions

check_patients_existence(patient_ids[, ...])

Checks which patient IDs exist in Elasticsearch using terms aggregation.

cohort_searcher_no_terms(index_name, ...)

Searches an index using only a query string.

cohort_searcher_no_terms_fuzzy(index_name, ...)

Searches an index using different query string methods.

cohort_searcher_with_terms_and_search(...)

Searches a cohort using a term filter and a query string.

cohort_searcher_with_terms_no_search(...)

Searches a cohort using only a term-level filter.

create_credentials_file()

Creates a template credentials.py file.

dataframe_generator(list_of_dfs)

A generator that yields DataFrames from a list of DataFrames.

get_all_fields_for_method(method_name[, cs])

Retrieves all available fields from the Elasticsearch index associated with a given get method.

initialize_cogstack_client([config_obj])

Initializes the global CogStack client cs.

iterative_multi_term_cohort_searcher_no_terms_fuzzy(...)

Iteratively searches for EPR documents matching multiple search terms.

iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(...)

Iteratively searches for MCT documents matching multiple search terms.

iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(...)

Iteratively searches for textual observations matching multiple terms.

list_chunker(entered_list)

Splits a list into smaller chunks of up to 10,000 elements.

set_index_safe_wrapper(df)

Safely sets the DataFrame index to 'id', ignoring errors.

Classes

CogStack(hosts[, username, password, api, ...])

pat2vec.pat2vec_search.cogstack_search_methods.create_credentials_file()[source]

Creates a template credentials.py file.

This function creates a credentials.py file three levels up from the current file’s directory. This file contains placeholder variables for Elasticsearch connection details (hosts, username, password, api_key). It is intended to be filled out by the user with their actual credentials.

Return type:

None
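
For orientation, a generated template might look roughly like the sketch below. The four variable names come from the description above; the placeholder values and the comments are assumptions made for illustration, not the actual contents of the generated file.

```python
# credentials.py (illustrative template; all values are assumed placeholders)
hosts = ["https://your-elasticsearch-host:9200"]  # one or more Elasticsearch URLs
username = "your_username"
password = "your_password"
api_key = ""  # placeholder for an API key, if your deployment uses one
```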

pat2vec.pat2vec_search.cogstack_search_methods.get_all_fields_for_method(method_name, cs=None)[source]

Retrieves all available fields from the Elasticsearch index associated with a given get method.

Parameters:
  • method_name (str) – The name of the get method.

  • cs (Optional[CogStack]) – An initialized CogStack client. If not provided, one will be initialized.

Return type:

List[str]

Returns:

A list of all fields in the index, or an empty list if not found.
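
How the function reads the index is not documented here. As a rough sketch of the general idea, the snippet below flattens the `properties` section of an Elasticsearch mapping into dotted field names. The helper `flatten_mapping_fields` and the sample mapping are hypothetical and are not part of pat2vec.

```python
from typing import Any, Dict, List


def flatten_mapping_fields(properties: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect dotted names of typed fields from a mapping 'properties' dict."""
    fields: List[str] = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        # Only entries with an explicit type are concrete, searchable fields.
        if "type" in spec:
            fields.append(path)
        # Object fields nest further definitions under 'properties'.
        fields.extend(flatten_mapping_fields(spec.get("properties", {}), f"{path}."))
        # Multi-fields (for example a 'keyword' sub-field) live under 'fields'.
        fields.extend(flatten_mapping_fields(spec.get("fields", {}), f"{path}."))
    return fields


# Made-up mapping fragment, used only to exercise the helper.
example_properties = {
    "client_idcode": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "document": {"properties": {"updatetime": {"type": "date"}}},
}

print(flatten_mapping_fields(example_properties))
# ['client_idcode', 'client_idcode.keyword', 'document.updatetime']
```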

pat2vec.pat2vec_search.cogstack_search_methods.list_chunker(entered_list)[source]

Splits a list into smaller chunks of up to 10,000 elements.

Parameters:

entered_list (List[Any]) – The list to be split into chunks.

Return type:

List[List[Any]]

Returns:

A list of lists, where each sublist is a chunk of the original list.
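
The chunking behaviour described above can be sketched in a few lines. `chunk_list` is a hypothetical stand-in written for this example, not the library's own implementation; only the 10,000-element limit comes from the description.

```python
from typing import Any, List


def chunk_list(items: List[Any], chunk_size: int = 10_000) -> List[List[Any]]:
    """Split items into consecutive sublists of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


chunks = chunk_list(list(range(25_000)))
print([len(chunk) for chunk in chunks])  # [10000, 10000, 5000]
```

Chunking of this kind keeps each term-level filter sent to Elasticsearch to a bounded size when a cohort contains many identifiers.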

pat2vec.pat2vec_search.cogstack_search_methods.dataframe_generator(list_of_dfs)[source]

A generator that yields DataFrames from a list of DataFrames.

Return type:

Generator[DataFrame, None, None]

Parameters:

list_of_dfs (List[DataFrame])

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_and_search(...)

Searches a cohort using a term filter and a query string.

Parameters:
  • index_name (str) – The name of the Elasticsearch index to search.

  • fields_list (List[str]) – The list of fields to return from each document.

  • term_name (str) – The name of the field to use for the term-level filter.

  • entered_list (List[str]) – The list of values to filter for in the term_name field.

  • search_string (str) – The query string to apply to the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.
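
To illustrate how a term filter and a query string can be combined in one request, the sketch below builds an Elasticsearch request body with a `bool` query. The helper name `build_terms_and_search_body` is hypothetical, and the body pat2vec actually sends may be structured differently.

```python
from typing import Any, Dict, List


def build_terms_and_search_body(
    fields_list: List[str],
    term_name: str,
    values: List[str],
    search_string: str,
) -> Dict[str, Any]:
    """Build a request body: term-level filter plus a query string."""
    return {
        "_source": fields_list,  # fields to return from each document
        "query": {
            "bool": {
                # Keep only documents whose term_name is one of the given values.
                "filter": [{"terms": {term_name: values}}],
                # Apply the free-text query string on top of the filter.
                "must": [{"query_string": {"query": search_string}}],
            }
        },
    }


body = build_terms_and_search_body(
    fields_list=["client_idcode"],
    term_name="client_idcode.keyword",
    values=["P001", "P002"],  # made-up identifiers
    search_string="aspirin",
)
print(body["query"]["bool"]["filter"])
# [{'terms': {'client_idcode.keyword': ['P001', 'P002']}}]
```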

pat2vec.pat2vec_search.cogstack_search_methods.set_index_safe_wrapper(df)[source]

Safely sets the DataFrame index to ‘id’, ignoring errors.

Return type:

DataFrame

Parameters:

df (DataFrame)

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_no_search(...)

Searches a cohort using only a term-level filter.

Parameters:
  • index_name (str) – The name of the index to search.

  • fields_list (List[str]) – A list of fields to return.

  • term_name (str) – The field to filter on.

  • entered_list (List[str]) – The list of values to search for in the term_name field.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms(index_name, fields_list, search_string)[source]

Searches an index using only a query string.

Parameters:
  • index_name (str) – The name of the Elasticsearch index to search.

  • fields_list (List[str]) – A list of fields to return.

  • search_string (str) – The query string to use for the search.

Return type:

DataFrame

Returns:

A pandas DataFrame containing the search results.

pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms_fuzzy(index_name, fields_list, search_string, method='fuzzy', fuzzy=2, slop=1)[source]

Searches an index using different query string methods.

Parameters:
  • index_name (str) – The name of the Elasticsearch index.

  • fields_list (List[str]) – List of fields to retrieve.

  • search_string (str) – The search string to query.

  • method (str) – The search method (“fuzzy”, “exact”, or “phrase”).

  • fuzzy (int) – The fuzziness level for fuzzy matching.

  • slop (int) – The slop value for phrase searches (word proximity).

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
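
The three methods correspond naturally to Lucene query string syntax: `word~N` for fuzzy matching, a quoted string for an exact phrase, and `"phrase"~N` for a proximity (slop) search. The sketch below shows one plausible mapping. `build_query_string` is a hypothetical helper, and the translation the library actually performs is an assumption here.

```python
def build_query_string(
    search_string: str, method: str = "fuzzy", fuzzy: int = 2, slop: int = 1
) -> str:
    """Translate a search string into Lucene query string syntax."""
    if method == "fuzzy":
        # Append ~N to each word so each tolerates up to N edits.
        return " ".join(f"{word}~{fuzzy}" for word in search_string.split())
    if method == "exact":
        # Quote the string so it is matched as an exact phrase.
        return f'"{search_string}"'
    if method == "phrase":
        # Quoted phrase with ~N allows the words to be up to N positions apart.
        return f'"{search_string}"~{slop}'
    raise ValueError(f"Unknown method: {method!r}")


print(build_query_string("chest pain"))                   # chest~2 pain~2
print(build_query_string("chest pain", method="exact"))   # "chest pain"
print(build_query_string("chest pain", method="phrase"))  # "chest pain"~1
```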

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, overwrite=True, debug=False, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1)[source]

Iteratively searches for EPR documents matching multiple search terms.

Parameters:
  • terms_list (List[str]) – The list of search terms to search for.

  • treatment_doc_filename (str) – The name of the file to store the results in.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • overwrite (bool) – Whether to overwrite the existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the column containing the UUIDs.

  • additional_filters (Optional[List[str]]) – A list of additional filters to apply.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method to use (‘fuzzy’, ‘exact’, or ‘phrase’).

  • fuzzy (int) – The fuzziness level for fuzzy matching.

  • slop (int) – The slop value for phrase searches.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
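
Because the date range is passed as six separate strings, it can help to see how they might combine into a single range clause. The sketch below validates the parts and emits a Lucene range expression. `build_date_range_clause` and the field name `updatetime` are hypothetical; the timestamp field and clause format used by the library are not stated in this reference.

```python
from datetime import date


def build_date_range_clause(
    field: str,
    start_year: str, start_month: str, start_day: str,
    end_year: str, end_month: str, end_day: str,
) -> str:
    """Combine year/month/day strings into an inclusive Lucene range clause."""
    # date() raises ValueError for impossible dates such as month 13.
    start = date(int(start_year), int(start_month), int(start_day))
    end = date(int(end_year), int(end_month), int(end_day))
    if start > end:
        raise ValueError("start date must not be after end date")
    # isoformat() zero-pads the month and day, e.g. 2020-01-05.
    return f"{field}:[{start.isoformat()} TO {end.isoformat()}]"


clause = build_date_range_clause("updatetime", "2020", "1", "5", "2021", "12", "31")
print(clause)  # updatetime:[2020-01-05 TO 2021-12-31]
```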

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False, testing_elastic=False)[source]

Iteratively searches for MCT documents matching multiple search terms.

This function searches the ‘observations’ index for documents of type ‘AoMRC_ClinicalSummary_FT’ that contain the specified terms.

Parameters:
  • terms_list (List[str]) – A list of terms to search for.

  • treatment_doc_filename (str) – The filename to load or save the results.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • append (bool) – Whether to append results to an existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the UUID column.

  • additional_filters (Optional[List[str]]) – Additional filters to apply to the search.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method (‘fuzzy’, ‘exact’, ‘phrase’).

  • fuzzy (int) – The fuzziness level for fuzzy search.

  • slop (int) – The slop value for phrase search.

  • testing (bool) – Whether to use a dummy searcher for testing.

  • testing_elastic (bool) – If True, uses the real searcher against the configured ES instance even if testing is True.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.
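
The interaction between testing and testing_elastic is easy to misread, so the sketch below spells out the rule implied by the parameter descriptions: the dummy searcher is used only when testing is True and testing_elastic is False. `use_dummy_searcher` is a hypothetical helper written for this example.

```python
def use_dummy_searcher(testing: bool = False, testing_elastic: bool = False) -> bool:
    """Return True when the dummy searcher should replace the real one."""
    # testing_elastic overrides testing and forces the real searcher.
    return testing and not testing_elastic


for testing in (False, True):
    for testing_elastic in (False, True):
        print(testing, testing_elastic, use_dummy_searcher(testing, testing_elastic))
```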

pat2vec.pat2vec_search.cogstack_search_methods.check_patients_existence(patient_ids, index_name='epr_documents', id_field='client_idcode.keyword', config_obj=None)[source]

Checks which patient IDs exist in Elasticsearch using terms aggregation. Supports checking multiple indices in a fallback manner.

Parameters:
  • patient_ids (List[str]) – List of patient IDs to check.

  • index_name (Union[str, List[Tuple[str, str]]]) – The Elasticsearch index to search against. Can be a string (single index) or a list of tuples [(index_name, id_field), …].

  • id_field (str) – The field in the index containing the patient ID.

  • config_obj (Optional[Any]) – Configuration object containing credentials path.

Return type:

List[str]

Returns:

A list of patient IDs that were found in the index.
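
A terms aggregation can answer "which of these IDs exist?" in a single request without returning any documents. The sketch below builds such a request body, reads the matching IDs from the aggregation buckets, and walks a list of (index_name, id_field) pairs. All helper names are hypothetical, `fake_search` stands in for a real Elasticsearch call, and the assumption that only still-missing IDs are retried against later indices is one reading of "fallback manner", not confirmed behaviour.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Targets = List[Tuple[str, str]]


def normalise_targets(index_name: Union[str, Targets], id_field: str) -> Targets:
    """Accept a single index name or a list of (index_name, id_field) pairs."""
    if isinstance(index_name, str):
        return [(index_name, id_field)]
    return list(index_name)


def build_existence_body(patient_ids: List[str], id_field: str) -> Dict[str, Any]:
    """Request no hits, only one aggregation bucket per matching ID."""
    return {
        "size": 0,
        "query": {"terms": {id_field: patient_ids}},
        "aggs": {
            "found_ids": {"terms": {"field": id_field, "size": max(len(patient_ids), 1)}}
        },
    }


def ids_from_response(response: Dict[str, Any]) -> List[str]:
    """Each bucket key is an ID that occurs in at least one document."""
    return [b["key"] for b in response["aggregations"]["found_ids"]["buckets"]]


def check_with_fallback(
    patient_ids: List[str],
    targets: Targets,
    search: Callable[[str, Dict[str, Any]], Dict[str, Any]],
) -> List[str]:
    """Try each index in turn, retrying only the IDs not yet found."""
    found: List[str] = []
    remaining = list(patient_ids)
    for index, field in targets:
        if not remaining:
            break
        hits = set(ids_from_response(search(index, build_existence_body(remaining, field))))
        found.extend(pid for pid in remaining if pid in hits)
        remaining = [pid for pid in remaining if pid not in hits]
    return found


# Made-up index contents and a fake search function for demonstration only.
FAKE_INDEX_CONTENTS = {"epr_documents": {"P001"}, "observations": {"P003"}}


def fake_search(index: str, body: Dict[str, Any]) -> Dict[str, Any]:
    requested = next(iter(body["query"]["terms"].values()))
    present = [pid for pid in requested if pid in FAKE_INDEX_CONTENTS[index]]
    buckets = [{"key": pid, "doc_count": 1} for pid in present]
    return {"aggregations": {"found_ids": {"buckets": buckets}}}


targets = normalise_targets(
    [("epr_documents", "client_idcode.keyword"), ("observations", "client_idcode.keyword")],
    id_field="client_idcode.keyword",
)
result = check_with_fallback(["P001", "P002", "P003"], targets, fake_search)
print(result)  # ['P001', 'P003']
```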

pat2vec.pat2vec_search.cogstack_search_methods.initialize_cogstack_client(config_obj=None)[source]

Initializes the global CogStack client cs.

This function sets up the connection to Elasticsearch. It can be configured to load credentials from a specific file path by passing a config object. If a client instance already exists, it will not re-initialize unless a config object with a new credentials path is provided.

The credential loading priority is:
  1. credentials_path from the config_obj.

  2. Default credentials.py in the project’s root.

  3. If not found, it creates a template credentials.py and tries again.

  4. Falls back to dummy credentials if all else fails.

Parameters:

config_obj – A configuration object that may have a credentials_path attribute.

Returns:

The initialized CogStack client instance.
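
The loading priority can be pictured as a short chain of checks. The sketch below is a simplified illustration, not the library's code: `resolve_credentials` is a hypothetical helper, the default path is an assumed value, and the template-creation step is left out for brevity.

```python
import os
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def resolve_credentials(
    config_obj: Optional[Any] = None, default_path: str = "credentials.py"
) -> Tuple[str, Optional[str]]:
    """Return (source, path) for the first usable credentials location."""
    # 1. An explicit credentials_path on the config object takes priority.
    custom_path = getattr(config_obj, "credentials_path", None)
    if custom_path and os.path.isfile(custom_path):
        return ("config", custom_path)
    # 2. Otherwise look for the default credentials file.
    if os.path.isfile(default_path):
        return ("default", default_path)
    # 3. (Template creation and retry omitted in this sketch.)
    # 4. Nothing usable was found, so fall back to dummy credentials.
    return ("dummy", None)


config = SimpleNamespace(credentials_path="/no/such/dir/credentials.py")
print(resolve_credentials(config, default_path="/no/such/dir/default_credentials.py"))
# ('dummy', None)
```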

pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', bloods_time_field='basicobs_entered', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False, testing_elastic=False)[source]

Iteratively searches for textual observations matching multiple terms.

This function searches the ‘basic_observations’ index for documents where the textualObs field contains the specified terms.

Parameters:
  • terms_list (List[str]) – A list of terms to search for.

  • treatment_doc_filename (str) – The filename to load or save the results.

  • start_year (str) – The start year of the date range.

  • start_month (str) – The start month of the date range.

  • start_day (str) – The start day of the date range.

  • end_year (str) – The end year of the date range.

  • end_month (str) – The end month of the date range.

  • end_day (str) – The end day of the date range.

  • append (bool) – Whether to append results to an existing file.

  • debug (bool) – Whether to print debug information.

  • uuid_column_name (str) – The name of the UUID column.

  • bloods_time_field (str) – The timestamp field to use for date filtering.

  • additional_filters (Optional[List[str]]) – Additional filters to apply to the search.

  • all_fields (bool) – Whether to retrieve all fields.

  • method (str) – The search method (‘fuzzy’, ‘exact’, ‘phrase’).

  • fuzzy (int) – The fuzziness level for fuzzy search.

  • slop (int) – The slop value for phrase search.

  • testing (bool) – Whether to use a dummy searcher for testing.

  • testing_elastic (bool) – If True, uses the real searcher against the configured ES instance even if testing is True.

Return type:

DataFrame

Returns:

A DataFrame containing the search results.