pat2vec.util.ethnicity_abstractor
Classes
- class pat2vec.util.ethnicity_abstractor.EthnicityAbstractor
Bases: object
- static abstractEthnicity(dataFrame, outputNameString, ethnicityColumnString)
Abstracts ethnicity from free text to UK census categories.
This method processes a DataFrame column containing free-text ethnicity entries and maps them to standardized categories based on the UK census style guide. It uses keyword matching against predefined lists of ethnicities, nationalities, and countries.
The mapping logic relies on several assumptions and configurations:
- It uses exact (case-insensitive) keyword matching, not fuzzy matching.
- It can be configured to assume default ethnicities for certain nationalities (e.g., British -> White, Nigerian -> Black).
- Explicit racial terms (e.g., “White”, “Black”) in an entry take precedence over national or country terms.
Note
The keyword lists and mapping logic may contain ambiguities. Manual review of the output is recommended. The outputNameString parameter is currently unused within the function’s logic.
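The matching rules described above can be sketched in plain Python. This is an illustrative sketch only, not the pat2vec implementation: the keyword dictionaries, the map_entry helper, its assume_nationality_defaults flag, and the "Unknown" fallback label are all hypothetical names chosen for this example, and the real keyword lists are likely much larger.

```python
import re

# Hypothetical, deliberately tiny keyword lists (assumption: the real
# lists in pat2vec are larger and may use different category labels).
RACIAL_TERMS = {"white": "White", "black": "Black"}
NATIONALITY_DEFAULTS = {"british": "White", "nigerian": "Black"}


def map_entry(text, assume_nationality_defaults=True):
    """Map one free-text entry to a category by exact keyword matching."""
    # Lower-case the entry and split it into alphabetic tokens, so the
    # comparison is case-insensitive but otherwise exact (no fuzzy matching).
    tokens = re.findall(r"[a-z]+", str(text).lower())

    # Explicit racial terms are checked first, so they take precedence
    # over any nationality or country term in the same entry.
    for token in tokens:
        if token in RACIAL_TERMS:
            return RACIAL_TERMS[token]

    # Optionally fall back to a default ethnicity assumed for a nationality.
    if assume_nationality_defaults:
        for token in tokens:
            if token in NATIONALITY_DEFAULTS:
                return NATIONALITY_DEFAULTS[token]

    # Hypothetical fallback label for entries with no matching keyword.
    return "Unknown"


print(map_entry("Black British"))  # racial term wins over the nationality
print(map_entry("NIGERIAN"))       # nationality default, case-insensitive
print(map_entry("Britsh"))         # misspelling: exact matching finds nothing
```

In this sketch, "Black British" maps via the explicit term "Black" even though "British" alone would default to "White", and a misspelled entry falls through unmatched, which is one reason manual review of the output is worthwhile.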
- Parameters:
dataFrame (DataFrame) – The DataFrame containing the ethnicity data.
outputNameString (str) – A string to prefix an output filename (currently unused).
ethnicityColumnString (str) – The name of the column in dataFrame that contains the free-text ethnicity entries.
- Return type:
DataFrame
- Returns:
A new DataFrame with an added ‘census’ column containing the mapped ethnicity categories.