pat2vec.pat2vec_search.cogstack_search_methods
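Functions
check_patients_existence() | Checks which patient IDs exist in Elasticsearch using terms aggregation.
cohort_searcher_no_terms() | Searches an index using only a query string.
cohort_searcher_no_terms_fuzzy() | Searches an index using different query string methods.
cohort_searcher_with_terms_and_search() | Searches a cohort using a term filter and a query string.
cohort_searcher_with_terms_no_search() | Searches a cohort using only a term-level filter.
create_credentials_file() | Creates a template credentials.py file.
dataframe_generator() | A generator that yields DataFrames from a list of DataFrames.
get_all_fields_for_method() | Retrieves all available fields from the Elasticsearch index associated with a given get method.
initialize_cogstack_client() | Initializes the global CogStack client cs.
iterative_multi_term_cohort_searcher_no_terms_fuzzy() | Iteratively searches for EPR documents matching multiple search terms.
iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct() | Iteratively searches for MCT documents matching multiple search terms.
iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs() | Iteratively searches for textual observations matching multiple terms.
list_chunker() | Splits a list into smaller chunks of up to 10,000 elements.
set_index_safe_wrapper() | Safely sets the DataFrame index to 'id', ignoring errors.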
- pat2vec.pat2vec_search.cogstack_search_methods.create_credentials_file()[source]
Creates a template credentials.py file.
This function creates a credentials.py file three levels up from the current file’s directory. This file contains placeholder variables for Elasticsearch connection details (hosts, username, password, api_key). It is intended to be filled out by the user with their actual credentials.
- Return type:
None
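As a rough illustration of the behaviour described above, the sketch below computes a directory three levels above a module's own directory and writes a placeholder credentials.py there. The helper names (`template_dir_for`, `write_credentials_template`), the template text, and the choice not to overwrite an existing file are assumptions made for this example, not the library's implementation.

```python
from pathlib import Path, PurePosixPath
import tempfile

# Hypothetical template text. The documentation only states that the file
# holds placeholders for hosts, username, password and api_key.
TEMPLATE = '''hosts = ["https://your-elasticsearch-host:9200"]
username = "your_username"
password = "your_password"
api_key = ""
'''


def template_dir_for(module_file):
    """Return the directory three levels above the module's own directory."""
    # parents[0] is the module's directory, so parents[3] is three levels up.
    return PurePosixPath(module_file).parents[3]


def write_credentials_template(target_dir):
    """Write the placeholder file (assumption: never overwrite an existing one)."""
    path = Path(target_dir) / "credentials.py"
    if not path.exists():
        path.write_text(TEMPLATE, encoding="utf-8")
    return path


# Illustrative path only; it does not describe the real package layout.
print(template_dir_for("/repo/pkg/sub/mod/cogstack_search_methods.py"))

with tempfile.TemporaryDirectory() as tmp:
    created = write_credentials_template(tmp)
    print(created.name)
```

After the file has been generated, the placeholders are replaced by hand with real connection details.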
- pat2vec.pat2vec_search.cogstack_search_methods.get_all_fields_for_method(method_name, cs=None)[source]
Retrieves all available fields from the Elasticsearch index associated with a given get method.
- Parameters:
method_name (str) – The name of the get method.
cs (Optional[CogStack]) – An initialized CogStack client. If not provided, one will be initialized.
- Return type:
List[str]
- Returns:
A list of all fields in the index, or an empty list if not found.
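How the function maps a method name to an index is not documented here, but the field-listing step can be sketched independently. The example below flattens the `properties` section of an Elasticsearch mapping into dotted field names. The helper `flatten_mapping_fields` and the sample mapping are hypothetical and hand-written for illustration.

```python
def flatten_mapping_fields(properties, prefix=""):
    """Collect dotted field names from an Elasticsearch 'properties' mapping."""
    fields = []
    for name, spec in properties.items():
        full_name = f"{prefix}{name}"
        fields.append(full_name)
        # Object fields nest their children under a further 'properties' key.
        if "properties" in spec:
            fields.extend(flatten_mapping_fields(spec["properties"], full_name + "."))
        # Multi-fields such as 'client_idcode.keyword' are listed under 'fields'.
        for sub_name in spec.get("fields", {}):
            fields.append(f"{full_name}.{sub_name}")
    return fields


# Hand-written stand-in for a get-mapping response; not real index data.
mapping_response = {
    "epr_documents": {
        "mappings": {
            "properties": {
                "client_idcode": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                },
                "document": {"properties": {"title": {"type": "text"}}},
            }
        }
    }
}

index_mapping = mapping_response.get("epr_documents", {}).get("mappings", {})
all_fields = flatten_mapping_fields(index_mapping.get("properties", {}))
print(all_fields)
```

Using `.get(...)` with empty defaults means a missing index yields an empty list, which matches the documented "empty list if not found" behaviour.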
- pat2vec.pat2vec_search.cogstack_search_methods.list_chunker(entered_list)[source]
Splits a list into smaller chunks of up to 10,000 elements.
- Parameters:
entered_list (List[Any]) – The list to be split into chunks.
- Return type:
List[List[Any]]
- Returns:
A list of lists, where each sublist is a chunk of the original list.
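A minimal sketch of this kind of chunking is shown below. The name `chunk_list` and the configurable `size` argument are assumptions for illustration; only the 10,000-element limit comes from the description above.

```python
from typing import Any, List

CHUNK_SIZE = 10_000  # limit taken from the description above


def chunk_list(items: List[Any], size: int = CHUNK_SIZE) -> List[List[Any]]:
    """Split items into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


chunks = chunk_list(list(range(25_000)))
print([len(chunk) for chunk in chunks])  # [10000, 10000, 5000]
```

Chunking like this keeps each request bounded when a long list of patient IDs is passed to a term-level filter.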
- pat2vec.pat2vec_search.cogstack_search_methods.dataframe_generator(list_of_dfs)[source]
A generator that yields DataFrames from a list of DataFrames.
- Return type:
Generator[DataFrame, None, None]
- Parameters:
list_of_dfs (List[DataFrame])
- pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_and_search(index_name, fields_list, term_name, entered_list, search_string)[source]
Searches a cohort using a term filter and a query string.
- Parameters:
index_name (str) – The name of the Elasticsearch index to search.
fields_list (List[str]) – The list of fields to return from each document.
term_name (str) – The name of the field to use for the term-level filter.
entered_list (List[str]) – The list of values to filter for in the term_name field.
search_string (str) – The query string to apply to the search.
- Return type:
DataFrame
- Returns:
A pandas DataFrame containing the search results.
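To make the combination of a term filter and a query string concrete, the sketch below builds an Elasticsearch request body of the kind such a search might send. This is an assumption about the request shape, not the library's actual query. `build_cohort_query` and the `document_text` field are hypothetical names.

```python
import json


def build_cohort_query(fields_list, term_name, entered_list, search_string):
    """Build a request body that applies a terms filter AND a query string."""
    return {
        "_source": fields_list,  # fields to return from each document
        "query": {
            "bool": {
                # Term-level filter: keep documents whose term_name is in the list.
                "filter": [{"terms": {term_name: entered_list}}],
                # Full-text part: the Lucene-style query string.
                "must": [{"query_string": {"query": search_string}}],
            }
        },
    }


body = build_cohort_query(
    fields_list=["client_idcode", "document_text"],  # 'document_text' is made up
    term_name="client_idcode.keyword",
    entered_list=["P001", "P002"],
    search_string="aspirin",
)
print(json.dumps(body, indent=2))
```

Placing the ID list in `filter` rather than `must` means it restricts the result set without contributing to relevance scoring.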
- pat2vec.pat2vec_search.cogstack_search_methods.set_index_safe_wrapper(df)[source]
Safely sets the DataFrame index to ‘id’, ignoring errors.
- Return type:
DataFrame
- Parameters:
df (DataFrame)
- pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_with_terms_no_search(index_name, fields_list, term_name, entered_list)[source]
Searches a cohort using only a term-level filter.
- Parameters:
index_name (str) – The name of the index to search.
fields_list (List[str]) – A list of fields to return.
term_name (str) – The field to filter on.
entered_list (List[str]) – The list of values to search for in the term_name field.
- Return type:
DataFrame
- Returns:
A pandas DataFrame containing the search results.
- pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms(index_name, fields_list, search_string)[source]
Searches an index using only a query string.
- Parameters:
index_name (str) – The name of the Elasticsearch index to search.
fields_list (List[str]) – A list of fields to return.
search_string (str) – The query string to use for the search.
- Return type:
DataFrame
- Returns:
A pandas DataFrame containing the search results.
- pat2vec.pat2vec_search.cogstack_search_methods.cohort_searcher_no_terms_fuzzy(index_name, fields_list, search_string, method='fuzzy', fuzzy=2, slop=1)[source]
Searches an index using different query string methods.
- Parameters:
index_name (str) – The name of the Elasticsearch index.
fields_list (List[str]) – List of fields to retrieve.
search_string (str) – The search string to query.
method (str) – The search method (“fuzzy”, “exact”, or “phrase”).
fuzzy (int) – The fuzziness level for fuzzy matching.
slop (int) – The slop value for phrase searches (word proximity).
- Return type:
DataFrame
- Returns:
A DataFrame containing the search results.
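The three methods correspond naturally to Lucene query-string syntax: `term~N` for fuzzy matching, a quoted phrase for exact matching, and `"phrase"~N` for proximity (slop) matching. The sketch below shows one plausible mapping. `build_query_string` is a hypothetical helper, and how the library actually builds its query is not stated here.

```python
def build_query_string(search_string, method="fuzzy", fuzzy=2, slop=1):
    """Translate a search string into Lucene query-string syntax (illustrative)."""
    if method == "fuzzy":
        # Append ~N to every word so each may differ by up to N edits.
        return " ".join(f"{word}~{fuzzy}" for word in search_string.split())
    if method == "exact":
        # Quoting requires the words to appear together, in order.
        return f'"{search_string}"'
    if method == "phrase":
        # "..."~N lets the words sit up to N positions apart.
        return f'"{search_string}"~{slop}'
    raise ValueError(f"unknown method: {method!r}")


for chosen_method in ("fuzzy", "exact", "phrase"):
    print(chosen_method, "->", build_query_string("heart failure", method=chosen_method))
```

This sketch does not escape quotes or other reserved characters in the search string, which a real implementation would need to handle.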
- pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, overwrite=True, debug=False, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1)[source]
Iteratively searches for EPR documents matching multiple search terms.
- Parameters:
terms_list (List[str]) – The list of search terms to search for.
treatment_doc_filename (str) – The name of the file to store the results in.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
overwrite (bool) – Whether to overwrite the existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the column containing the UUIDs.
additional_filters (Optional[List[str]]) – A list of additional filters to apply.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method to use (‘fuzzy’, ‘exact’, or ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy matching.
slop (int) – The slop value for phrase searches.
- Return type:
DataFrame
- Returns:
A DataFrame containing the search results.
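The six date parameters are passed as strings. One way they could be combined into a single range clause, and then attached to each search term, is sketched below. The helpers `build_date_range_clause` and `build_term_queries` and the `updatetime` field name are assumptions for illustration only.

```python
from datetime import date


def build_date_range_clause(field, start_year, start_month, start_day,
                            end_year, end_month, end_day):
    """Build a Lucene range clause such as field:[2020-01-01 TO 2021-12-31]."""
    # Converting through datetime.date validates the parts and zero-pads them.
    start = date(int(start_year), int(start_month), int(start_day))
    end = date(int(end_year), int(end_month), int(end_day))
    if start > end:
        raise ValueError("start date is after end date")
    return f"{field}:[{start.isoformat()} TO {end.isoformat()}]"


def build_term_queries(terms_list, date_clause):
    """Pair every search term with the same date restriction."""
    return [f"({term}) AND {date_clause}" for term in terms_list]


# 'updatetime' is a made-up timestamp field used only for this example.
clause = build_date_range_clause("updatetime", "2020", "1", "1", "2021", "12", "31")
for query in build_term_queries(["aspirin", "warfarin"], clause):
    print(query)
```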
- pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_mct(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False, testing_elastic=False)[source]
Iteratively searches for MCT documents matching multiple search terms.
This function searches the ‘observations’ index for documents of type ‘AoMRC_ClinicalSummary_FT’ that contain the specified terms.
- Parameters:
terms_list (List[str]) – A list of terms to search for.
treatment_doc_filename (str) – The filename to load or save the results.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
append (bool) – Whether to append results to an existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the UUID column.
additional_filters (Optional[List[str]]) – Additional filters to apply to the search.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method (‘fuzzy’, ‘exact’, or ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy search.
slop (int) – The slop value for phrase search.
testing (bool) – Whether to use a dummy searcher for testing.
testing_elastic (bool) – If True, uses the real searcher against the configured ES instance even if testing is True.
- Return type:
DataFrame
- Returns:
A DataFrame containing the search results.
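The interaction between `testing` and `testing_elastic` can be read as a simple selection rule: the dummy searcher is used only when `testing` is True and `testing_elastic` is False. The sketch below expresses that reading. `select_searcher` and the two stand-in searchers are hypothetical.

```python
def select_searcher(real_searcher, dummy_searcher, testing=False, testing_elastic=False):
    """Pick the searcher implied by the two flags (one reading of the docs)."""
    # testing_elastic overrides testing: the real searcher is used whenever it is True.
    if testing and not testing_elastic:
        return dummy_searcher
    return real_searcher


def real_search(term):
    """Stand-in for a searcher that would query Elasticsearch."""
    return f"real results for {term}"


def dummy_search(term):
    """Stand-in for a searcher that returns synthetic data."""
    return f"dummy results for {term}"


for testing in (False, True):
    for testing_elastic in (False, True):
        searcher = select_searcher(real_search, dummy_search, testing, testing_elastic)
        print(testing, testing_elastic, "->", searcher("aspirin"))
```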
- pat2vec.pat2vec_search.cogstack_search_methods.check_patients_existence(patient_ids, index_name='epr_documents', id_field='client_idcode.keyword', config_obj=None)[source]
Checks which patient IDs exist in Elasticsearch using terms aggregation. Supports checking multiple indices in a fallback manner.
- Parameters:
patient_ids (List[str]) – List of patient IDs to check.
index_name (Union[str, List[Tuple[str, str]]]) – The Elasticsearch index to search against. Can be a string (single index) or a list of tuples [(index_name, id_field), …].
id_field (str) – The field in the index containing the patient ID.
config_obj (Optional[Any]) – Configuration object containing credentials path.
- Return type:
List[str]
- Returns:
A list of patient IDs that were found in the index.
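The sketch below illustrates the general technique: a terms query restricts the search to the requested IDs, a terms aggregation returns one bucket per ID that is actually present, and IDs still missing are retried against the next index. The helper names, the aggregation name `found_ids`, the in-memory `fake_search` function, and the reading of "fallback" as "retry only the missing IDs" are all assumptions for illustration.

```python
def build_existence_query(patient_ids, id_field):
    """Request body that returns no hits, only one bucket per matching ID."""
    return {
        "size": 0,
        "query": {"terms": {id_field: patient_ids}},
        "aggs": {
            "found_ids": {
                "terms": {"field": id_field, "size": max(len(patient_ids), 1)}
            }
        },
    }


def extract_found_ids(response):
    """Read the bucket keys out of an aggregation response."""
    buckets = response.get("aggregations", {}).get("found_ids", {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


def check_with_fallback(patient_ids, indices, run_search):
    """Try each (index_name, id_field) in turn for IDs not yet found."""
    remaining = list(patient_ids)
    found = []
    for index_name, id_field in indices:
        if not remaining:
            break
        response = run_search(index_name, build_existence_query(remaining, id_field))
        hits = set(extract_found_ids(response))
        found.extend(pid for pid in remaining if pid in hits)
        remaining = [pid for pid in remaining if pid not in hits]
    return found


# In-memory stand-in for Elasticsearch so the sketch runs without a cluster.
FAKE_INDEX_CONTENTS = {"epr_documents": {"P001"}, "observations": {"P002"}}


def fake_search(index_name, body):
    requested = next(iter(body["query"]["terms"].values()))
    present = [pid for pid in requested if pid in FAKE_INDEX_CONTENTS[index_name]]
    buckets = [{"key": pid, "doc_count": 1} for pid in present]
    return {"aggregations": {"found_ids": {"buckets": buckets}}}


found_ids = check_with_fallback(
    ["P001", "P002", "P003"],
    [("epr_documents", "client_idcode.keyword"), ("observations", "client_idcode.keyword")],
    fake_search,
)
print(found_ids)
```

Setting `"size": 0` avoids transferring documents, since only the aggregation buckets are needed to decide which IDs exist.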
- pat2vec.pat2vec_search.cogstack_search_methods.initialize_cogstack_client(config_obj=None)[source]
Initializes the global CogStack client cs.
This function sets up the connection to Elasticsearch. It can be configured to load credentials from a specific file path by passing a config object. If a client instance already exists, it will not re-initialize unless a config object with a new credentials path is provided.
The credential loading priority is:
1. credentials_path from the config_obj.
2. Default credentials.py in the project’s root.
3. If not found, it creates a template credentials.py and tries again.
4. Falls back to dummy credentials if all else fails.
- Parameters:
config_obj – A configuration object that may have a credentials_path attribute.
- Returns:
The initialized CogStack client instance.
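The credential loading priority described above can be sketched as a small resolution function. This is a simplified illustration under stated assumptions: `resolve_credentials`, the source labels it returns, and the template text are hypothetical, and the real function also constructs the client, which is omitted here.

```python
from pathlib import Path
from types import SimpleNamespace
import tempfile

PLACEHOLDER_TEMPLATE = "hosts = []\nusername = ''\npassword = ''\napi_key = ''\n"


def resolve_credentials(config_obj, project_root):
    """Return (source_label, path_or_None) following the documented order."""
    # 1. An explicit credentials_path on the config object wins.
    explicit = getattr(config_obj, "credentials_path", None)
    if explicit and Path(explicit).is_file():
        return "config", Path(explicit)

    # 2. Otherwise look for the default credentials.py in the project root.
    default = Path(project_root) / "credentials.py"
    if default.is_file():
        return "default", default

    # 3. If it is missing, create a template and use that.
    try:
        default.write_text(PLACEHOLDER_TEMPLATE, encoding="utf-8")
        return "template", default
    except OSError:
        # 4. Nothing could be read or written: fall back to dummy credentials.
        return "dummy", None


with tempfile.TemporaryDirectory() as root:
    print(resolve_credentials(None, root)[0])  # template is created
    print(resolve_credentials(None, root)[0])  # default file now exists
    config = SimpleNamespace(credentials_path=str(Path(root) / "credentials.py"))
    print(resolve_credentials(config, root)[0])  # explicit path is used
```

In this sketch a `credentials_path` that does not point to an existing file is silently skipped. Whether the library behaves the same way is not stated in the documentation.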
- pat2vec.pat2vec_search.cogstack_search_methods.iterative_multi_term_cohort_searcher_no_terms_fuzzy_textual_obs(terms_list, treatment_doc_filename, start_year, start_month, start_day, end_year, end_month, end_day, append=True, debug=True, uuid_column_name='client_idcode', bloods_time_field='basicobs_entered', additional_filters=None, all_fields=False, method='fuzzy', fuzzy=2, slop=1, testing=False, testing_elastic=False)[source]
Iteratively searches for textual observations matching multiple terms.
This function searches the ‘basic_observations’ index for documents where the textualObs field contains the specified terms.
- Parameters:
terms_list (List[str]) – A list of terms to search for.
treatment_doc_filename (str) – The filename to load or save the results.
start_year (str) – The start year of the date range.
start_month (str) – The start month of the date range.
start_day (str) – The start day of the date range.
end_year (str) – The end year of the date range.
end_month (str) – The end month of the date range.
end_day (str) – The end day of the date range.
append (bool) – Whether to append results to an existing file.
debug (bool) – Whether to print debug information.
uuid_column_name (str) – The name of the UUID column.
bloods_time_field (str) – The timestamp field to use for date filtering.
additional_filters (Optional[List[str]]) – Additional filters to apply to the search.
all_fields (bool) – Whether to retrieve all fields.
method (str) – The search method (‘fuzzy’, ‘exact’, or ‘phrase’).
fuzzy (int) – The fuzziness level for fuzzy search.
slop (int) – The slop value for phrase search.
testing (bool) – Whether to use a dummy searcher for testing.
testing_elastic (bool) – If True, uses the real searcher against the configured ES instance even if testing is True.
- Return type:
DataFrame
- Returns:
A DataFrame containing the search results.