pat2vec.util.elasticsearch_methods

Functions

get_guess_datetime_column(df[, threshold])

Finds the single column most likely to be a datetime column.

guess_datetime_columns(df[, threshold])

Guesses which columns in a DataFrame are datetime columns.

handle_inconsistent_dtypes(df)

Handles inconsistent data types in a DataFrame's columns.

ingest_data_to_elasticsearch(temp_df, index_name)

Ingests data from a DataFrame into Elasticsearch with error handling.

pat2vec.util.elasticsearch_methods.ingest_data_to_elasticsearch(temp_df, index_name, index_mapping=None, replace_index=False)[source]

Ingests data from a DataFrame into Elasticsearch with error handling.

Parameters:
  • temp_df (DataFrame) – The DataFrame containing the data to be ingested.

  • index_name (str) – Name of the Elasticsearch index.

  • index_mapping (Optional[Dict[str, Any]]) – Optional mapping for the index.

  • replace_index (bool) – Whether to replace the index if it exists.

Return type:

Dict[str, int]

Returns:

A summary containing the number of successful and failed operations.

Raises:

ConnectionError – If the Elasticsearch server is not reachable.

pat2vec.util.elasticsearch_methods.handle_inconsistent_dtypes(df)[source]

Handles inconsistent data types in a DataFrame’s columns.

Iterates through each column, determines the majority data type (datetime, string, int, or float), and casts the entire column to that type. This is useful for cleaning data before ingestion into systems with strict schemas like Elasticsearch.

Parameters:

df (DataFrame) – The DataFrame to process.

Return type:

DataFrame

Returns:

The DataFrame with columns cast to their majority data type.

pat2vec.util.elasticsearch_methods.guess_datetime_columns(df, threshold=0.5)[source]

Guesses which columns in a DataFrame are datetime columns.

It iterates through each column and attempts to parse its values as datetimes. If the percentage of parsable values exceeds a given threshold, the column is considered a datetime column.

Parameters:
  • df (DataFrame) – The DataFrame to analyze.

  • threshold (float) – The minimum percentage of values in a column that must be parsable as datetime for it to be considered a datetime column. Defaults to 0.5.

Return type:

List[str]

Returns:

A list of column names that are likely to be datetime columns.

pat2vec.util.elasticsearch_methods.get_guess_datetime_column(df, threshold=0.2)[source]

Finds the single column most likely to be a datetime column.

This function iterates through all columns and calculates the ratio of values that can be parsed as a datetime. It returns the name of the column with the highest ratio, provided that ratio is above the specified threshold.

Parameters:
  • df (DataFrame) – The DataFrame to analyze.

  • threshold (float) – The minimum percentage of values that must be parsable as datetime for a column to be considered. Defaults to 0.2.

Return type:

Optional[str]

Returns:

The name of the column most likely to contain datetimes, or None if no column meets the threshold.