[Image generated using Midjourney, representing the Enigma machine]

How to Choose the Best Categorical Encoding Method

aka do not one-hot encode everything.

In this article, we will look at how to encode categories and avoid dimensionality issues.

Introduction

During interviews for Data Scientist roles, I ask candidates to complete an assignment. The aim is to create a Machine Learning model, often a classifier, since this will be an important part of their day-to-day job. As part of the exercise, they need to handle categorical features.

Most of the time, candidates choose one-hot encoding to handle these features.

What is one-hot encoding?

Many books and articles suggest this simple technique: it transforms a categorical variable into a set of binary variables, one per category, each indicating whether that category is present in the record.
Let’s give an example.

By using the following code

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'colour': ['black', 'yellow', 'red', 'white', 'purple']
})

we can create the table

ID  colour
1   black
2   yellow
3   red
4   white
5   purple
The initial dataset containing the id and colour variables

Imagine you have a simple dataset like the one above, with id and colour variables. You have to change colour into a one-hot encoded feature. That means that you want to transform the dataset into something like the table below.

ID  colour_black  colour_yellow  colour_red  colour_white  colour_purple
1   1             0              0           0             0
2   0             1              0           0             0
3   0             0              1           0             0
4   0             0              0           1             0
5   0             0              0           0             1
The dataset after one-hot encoding the colour variable

As you can see, we created five variables from the initial categorical one. Each variable represents a colour, a category of the original variable. Now, I understand why this method is so popular. If you think about it, it’s a single line of code, you just need to run

df = pd.get_dummies(df)

and it takes care of everything, doesn’t it? Yes, but that comes with a cost, as there is no such thing as a free lunch.
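
As a side note, pd.get_dummies has a couple of options worth knowing: columns= restricts which columns are encoded, and drop_first=True drops one redundant level per feature, which helps linear models avoid perfect collinearity. A quick sketch (df_encoded is just an illustrative name):

# Encode only 'colour' and drop its first level (here: black),
# since it is implied whenever all the other colour columns are 0
df_encoded = pd.get_dummies(df, columns=['colour'], drop_first=True)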

The cost of one-hot encoding everything

When you transform a categorical variable using one-hot encoding, the main cost is the curse of dimensionality. This term refers to the various phenomena that arise when dealing with data organised in high-dimensional spaces.

We can summarise the main adverse effects as:

  • Increased data sparsity. This is intuitive from the previous example: we had to create five variables to represent the original variable with five categories. Now imagine a dataset with millions of records and tens of categorical variables, each with 10 to 100 categories. This technique produces data matrices full of zeros, where each record has a 1 only in the columns matching its categorical values. This is a so-called sparse matrix, and sparsity makes clustering and classification tasks more challenging (a concrete sketch follows this list).
  • Computations become more resource-intensive, because computational complexity grows with the number of variables.
  • Adding more dimensions to a machine learning model increases the risk of overfitting. More dimensions give the model more degrees of freedom, so it may end up fitting the noise in the data instead of the actual signal.
  • When the dimensions increase, machine learning models generally perform worse. With more dimensions, the model needs more data to reach the same performance: as the number of dimensions grows, the volume of the data space grows quickly and the available data spreads out.
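
To make these effects concrete, here is a small sketch on made-up data; the 10,000 rows, 20 columns and 50 categories are arbitrary assumptions for illustration:

import numpy as np
import pandas as pd

# Hypothetical dataset: 10,000 rows, 20 categorical columns,
# each drawn from 50 distinct categories
rng = np.random.default_rng(42)
df_wide = pd.DataFrame({
    f'cat_{i}': rng.integers(0, 50, size=10_000).astype(str)
    for i in range(20)
})

encoded = pd.get_dummies(df_wide)
print(df_wide.shape)  # (10000, 20)
print(encoded.shape)  # roughly (10000, 1000): about 50x more columns
# Each row has exactly 20 ones out of ~1000 columns,
# so roughly 98% of the encoded matrix is zeros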

Encoding Alternatives

So, what alternatives do we have compared to one-hot encoding? Let’s explore some of them:

  • Label Encoding:
    • Description: We assign an integer to each category
    • Pros: Saves space as it uses a single column
    • Cons: Can introduce ordinal relationships where none exist
    • Example:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Label_Encoded'] = le.fit_transform(df['colour'])
ID  colour  Label_Encoded
1   black   0
2   yellow  4
3   red     2
4   white   3
5   purple  1
Dataset example of the LabelEncoder
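
A small caveat: scikit-learn’s documentation recommends LabelEncoder for target labels (y) rather than feature columns; for features, OrdinalEncoder is the intended counterpart. A minimal sketch producing the same codes (as floats):

from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects 2D input and can encode several columns at once
oe = OrdinalEncoder()
df[['Label_Encoded']] = oe.fit_transform(df[['colour']])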
  • Ordinal Encoding:
    • Description: Ordinal variables have a logical order, so we can use ordinal encoding instead of label encoding.
    • Pros: Retains the ordinal nature of the variable
    • Cons: Not suitable for nominal variables
    • Example: Let’s say the colours have a meaningful ranking that happens to match their alphabetical order.
df['Ordinal_Encoded'] = df['colour'].astype('category').cat.codes
ID  colour  Ordinal_Encoded
1   black   0
2   yellow  4
3   red     2
4   white   3
5   purple  1
Dataset example of the Ordinal Encoding
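
If the real ranking does not match the alphabetical order, you can make it explicit with an ordered categorical; a minimal sketch, where the ranking below is purely a made-up assumption:

# Hypothetical light-to-dark ranking, for illustration only
ranking = ['white', 'yellow', 'red', 'purple', 'black']

df['Ordinal_Encoded'] = pd.Categorical(
    df['colour'], categories=ranking, ordered=True
).codes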
  • Frequency (or Count) Encoding:
    • Description: Each category is replaced by the number of times it occurs in the dataset.
    • Pros: Keeps the shape of the dataset manageable.
    • Cons: Collisions, where different categories end up with the same frequency.
    • Example:
df['Frequency_Encoded'] = df['colour'].map(df['colour'].value_counts())
ID  colour  Frequency_Encoded
1   black   1
2   yellow  1
3   red     1
4   white   1
5   purple  1
Dataset example of the Frequency Encoding

Note: In this example, all colours appear only once, so the frequency is 1 for all.
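
A common variant maps each category to its relative frequency instead of the raw count, which keeps the values on a comparable scale across datasets of different sizes:

# Relative frequency: counts divided by the number of rows
df['Frequency_Encoded'] = df['colour'].map(
    df['colour'].value_counts(normalize=True)
)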

  • Target (Mean) Encoding:
    • Description: Each category is replaced by the average value of the target variable for that category.
    • Pros: Captures information about the relationship between the category and the target variable.
    • Cons: Risk of data leakage; not suitable for datasets with a small number of observations.
    • Example: the snippet below assumes the six-row dataset with a target column, introduced in the Leave-One-Out example further down.
# Replace each colour with the mean target value observed for that colour
df['Target_Encoded'] = df['colour'].map(df.groupby('colour')['target'].mean())
ID  colour  target  Target_Encoded
1   black   10      10.0
2   yellow  20      15.0
3   red     30      30.0
4   white   40      40.0
5   purple  50      50.0
6   yellow  10      15.0
Dataset example of the Target Encoding
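
One common way to reduce both leakage and the noise from rare categories is smoothing: blend each category’s mean with the global mean, weighted by how often the category appears. A minimal sketch, where the weight m is an arbitrary assumption:

# Smoothed target encoding: blend the category mean with the global mean
m = 2  # smoothing weight, a hypothetical choice
global_mean = df['target'].mean()
stats = df.groupby('colour')['target'].agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['Target_Smoothed'] = df['colour'].map(smoothed)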
  • Hashing:
    • Description: Use a hash function to assign a number to different categories. We call this the hashing trick.
    • Pros: Helpful for managing many categories; fixed width, reducing complexity.
    • Cons: Collisions where different categories get mapped to the same hash.
    • Example:
# Map each colour into one of 5 buckets
df['Hashed_Value'] = df['colour'].apply(lambda x: hash(x) % 5)
ID  colour  Hashed_Value
1   black   1
2   yellow  3
3   red     0
4   white   0
5   purple  4
Dataset example of Hashing
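
One caveat about the snippet above: since Python 3.3, the built-in hash() is salted per process for strings, so the buckets can change between runs. For reproducible features, use a deterministic hash; a minimal sketch with hashlib:

import hashlib

def stable_hash(value: str, n_buckets: int = 5) -> int:
    # md5 is deterministic across runs, unlike the built-in hash()
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

df['Hashed_Value'] = df['colour'].apply(stable_hash)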
  • Leave-One-Out Encoding:
    • Description: I like using this method to encode a category because it captures the connection to the target. To prevent data leakage, we compute the average target value for the category without including the target value of the current row.
    • Pros: Reduces the leakage risk compared to target encoding.
    • Cons: The computation is more intensive.
    • Example: In this example, let’s first add a target variable
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'colour': ['black', 'yellow', 'red', 'white', 'purple', 'yellow'],
    'target': [10, 20, 30, 40, 50, 10]
})
ID  colour  target
1   black   10
2   yellow  20
3   red     30
4   white   40
5   purple  50
6   yellow  10
The input dataset for the leave-one-out encoder
def loo_encode(row, df, column, target):
    # All rows that share the current row's category
    temp_df = df[df[column] == row[column]]
    # Exclude the current row itself
    temp_df = temp_df[temp_df.index != row.name]
    if temp_df.empty:
        # The category appears only once: fall back to the global mean
        return df[target].mean()
    else:
        return temp_df[target].mean()

df['LOO_Encoded'] = df.apply(loo_encode, args=(df, 'colour', 'target'), axis=1)
ID  colour  target  LOO_Encoded
1   black   10      26.666667
2   yellow  20      10.000000
3   red     30      26.666667
4   white   40      26.666667
5   purple  50      26.666667
6   yellow  10      20.000000
The output table
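
The row-by-row apply above is easy to read but slow on large datasets. An equivalent vectorized sketch computes (group sum - own target) / (group count - 1) and falls back to the global mean for single-occurrence categories:

# Vectorized leave-one-out encoding
grp = df.groupby('colour')['target']
sums = grp.transform('sum')
counts = grp.transform('count')

# For categories seen only once this gives 0/0 = NaN,
# which we replace with the global mean, as in loo_encode
loo = (sums - df['target']) / (counts - 1)
df['LOO_Encoded'] = loo.fillna(df['target'].mean())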

Food for Thought

When choosing an encoding method, consider the type of data (nominal or ordinal), the model type (tree-based or linear), and any potential issues (data leakage, unwanted relationships). During interviews, showing knowledge about these techniques can make a candidate stand out. It shows they understand feature engineering well.

