[Image generated using Midjourney, representing the Enigma machine]

How to Choose the Best Categorical Encoding Method

aka do not one-hot encode everything.

In this article, we will look at how to encode categories and avoid dimensionality issues.

Introduction

During interviews for Data Scientist roles, I ask candidates to complete an assignment. The aim is to create a Machine Learning model, often a classifier, since this will be an important part of their day-to-day job. As part of the exercise, they need to handle categorical features.

Most of the time, candidates choose one-hot encoding to handle these features.

What is one-hot encoding?

Many books and articles suggest this simple technique: it transforms a categorical variable into a set of binary variables, one per category, each indicating whether that category is present in the record.
Let’s give an example.

By using the following code

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'colour': ['black', 'yellow', 'red', 'white', 'purple']
})

we can create the table

ID  colour
1   black
2   yellow
3   red
4   white
5   purple
The initial dataset containing the id and colour variables

Imagine you have a simple dataset like the one above, with id and colour variables. You have to change colour into a one-hot encoded feature. That means that you want to transform the dataset into something like the table below.

ID  colour_black  colour_yellow  colour_red  colour_white  colour_purple
1   1             0              0           0             0
2   0             1              0           0             0
3   0             0              1           0             0
4   0             0              0           1             0
5   0             0              0           0             1
The dataset after one-hot encoding the colour variable

As you can see, we created five variables from the initial categorical one. Each variable represents a colour, a category of the original variable. Now, I understand why this method is so popular. If you think about it, it’s a single line of code, you just need to run

df = pd.get_dummies(df)

and it takes care of everything, doesn’t it? Yes, but that comes with a cost, as there is no such thing as a free lunch.
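
As a side note, pd.get_dummies has a couple of options worth knowing: columns= restricts which columns are encoded, and drop_first=True drops one redundant level per feature, which helps linear models avoid perfect collinearity. A quick sketch (df_encoded is just an illustrative name):

# Encode only 'colour' and drop its first level (here: black),
# since it is implied whenever all the other colour columns are 0
df_encoded = pd.get_dummies(df, columns=['colour'], drop_first=True)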

The cost of one-hot encoding everything

When you transform a categorical variable using one-hot encoding, the main cost is the curse of dimensionality. This term refers to the various phenomena that arise when dealing with data organised in high-dimensional spaces.

We can summarise the main adverse effects as:

  • Increased data sparsity. This is intuitive from the previous example: we had to create five variables to represent the original variable with five categories. Now imagine a dataset with millions of records and tens of categorical variables, each with 10 to 100 categories. This technique produces data matrices full of zeros, where each record has a 1 only in the columns matching its categorical values. This is a so-called sparse matrix, and sparsity makes clustering and classification tasks more challenging (a concrete sketch follows this list).
  • Computations become more resource-intensive, because computational complexity grows with the number of variables.
  • Adding more dimensions to a machine learning model increases the risk of overfitting. More dimensions give the model more degrees of freedom, so it may end up fitting the noise in the data instead of the actual signal.
  • When the dimensions increase, machine learning models generally perform worse. With more dimensions, the model needs more data to reach the same performance: as the number of dimensions grows, the volume of the data space grows quickly and the available data spreads out.
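
To make these effects concrete, here is a small sketch on made-up data; the 10,000 rows, 20 columns and 50 categories are arbitrary assumptions for illustration:

import numpy as np
import pandas as pd

# Hypothetical dataset: 10,000 rows, 20 categorical columns,
# each drawn from 50 distinct categories
rng = np.random.default_rng(42)
df_wide = pd.DataFrame({
    f'cat_{i}': rng.integers(0, 50, size=10_000).astype(str)
    for i in range(20)
})

encoded = pd.get_dummies(df_wide)
print(df_wide.shape)  # (10000, 20)
print(encoded.shape)  # roughly (10000, 1000): about 50x more columns
# Each row has exactly 20 ones out of ~1000 columns,
# so roughly 98% of the encoded matrix is zeros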

Encoding Alternatives

So, what alternatives do we have compared to one-hot encoding? Let’s explore some of them:

  • Label Encoding:
    • Description: We assign an integer to each category
    • Pros: Saves space as it uses a single column
    • Cons: Can introduce ordinal relationships where none exist
    • Example:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Label_Encoded'] = le.fit_transform(df['colour'])
ID  colour  Label_Encoded
1   black   0
2   yellow  4
3   red     2
4   white   3
5   purple  1
Dataset example of the LabelEncoder
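
A small caveat: scikit-learn’s documentation recommends LabelEncoder for target labels (y) rather than feature columns; for features, OrdinalEncoder is the intended counterpart. A minimal sketch producing the same codes (as floats):

from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects 2D input and can encode several columns at once
oe = OrdinalEncoder()
df[['Label_Encoded']] = oe.fit_transform(df[['colour']])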
  • Ordinal Encoding:
    • Description: Ordinal variables have a logical order, so we can use ordinal encoding instead of label encoding.
    • Pros: Retains the ordinal nature of the variable
    • Cons: Not suitable for nominal variables
    • Example: Let’s say the colours have a meaningful ranking that happens to match their alphabetical order.
df['Ordinal_Encoded'] = df['colour'].astype('category').cat.codes
ID  colour  Ordinal_Encoded
1   black   0
2   yellow  4
3   red     2
4   white   3
5   purple  1
Dataset example of the Ordinal Encoding
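
If the real ranking does not match the alphabetical order, you can make it explicit with an ordered categorical; a minimal sketch, where the ranking below is purely a made-up assumption:

# Hypothetical light-to-dark ranking, for illustration only
ranking = ['white', 'yellow', 'red', 'purple', 'black']

df['Ordinal_Encoded'] = pd.Categorical(
    df['colour'], categories=ranking, ordered=True
).codes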
  • Frequency (or Count) Encoding:
    • Description: Each category is replaced by the number of times it occurs in the dataset.
    • Pros: Keeps the shape of the dataset manageable.
    • Cons: Collisions, where different categories end up with the same frequency.
    • Example:
df['Frequency_Encoded'] = df['colour'].map(df['colour'].value_counts())
ID  colour  Frequency_Encoded
1   black   1
2   yellow  1
3   red     1
4   white   1
5   purple  1
Dataset example of the Frequency Encoding

Note: In this example, all colours appear only once, so the frequency is 1 for all.
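
A common variant maps each category to its relative frequency instead of the raw count, which keeps the values on a comparable scale across datasets of different sizes:

# Relative frequency: counts divided by the number of rows
df['Frequency_Encoded'] = df['colour'].map(
    df['colour'].value_counts(normalize=True)
)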

  • Target (Mean) Encoding:
    • Description: Each category is replaced by the average value of the target variable for that category.
    • Pros: Captures information about the relationship between the category and the target variable.
    • Cons: Risk of data leakage; not suitable for datasets with a small number of observations.
    • Example: the snippet below assumes the six-row dataset with a target column, introduced in the Leave-One-Out example further down.
# Replace each colour with the mean target value observed for that colour
df['Target_Encoded'] = df['colour'].map(df.groupby('colour')['target'].mean())
ID  colour  target  Target_Encoded
1   black   10      10.0
2   yellow  20      15.0
3   red     30      30.0
4   white   40      40.0
5   purple  50      50.0
6   yellow  10      15.0
Dataset example of the Target Encoding
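
One common way to reduce both leakage and the noise from rare categories is smoothing: blend each category’s mean with the global mean, weighted by how often the category appears. A minimal sketch, where the weight m is an arbitrary assumption:

# Smoothed target encoding: blend the category mean with the global mean
m = 2  # smoothing weight, a hypothetical choice
global_mean = df['target'].mean()
stats = df.groupby('colour')['target'].agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['Target_Smoothed'] = df['colour'].map(smoothed)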
  • Hashing:
    • Description: Use a hash function to assign a number to different categories. We call this the hashing trick.
    • Pros: Helpful for managing many categories; fixed width, reducing complexity.
    • Cons: Collisions where different categories get mapped to the same hash.
    • Example:
# Map each colour into one of 5 buckets
df['Hashed_Value'] = df['colour'].apply(lambda x: hash(x) % 5)
ID  colour  Hashed_Value
1   black   1
2   yellow  3
3   red     0
4   white   0
5   purple  4
Dataset example of Hashing
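
One caveat about the snippet above: since Python 3.3, the built-in hash() is salted per process for strings, so the buckets can change between runs. For reproducible features, use a deterministic hash; a minimal sketch with hashlib:

import hashlib

def stable_hash(value: str, n_buckets: int = 5) -> int:
    # md5 is deterministic across runs, unlike the built-in hash()
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

df['Hashed_Value'] = df['colour'].apply(stable_hash)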
  • Leave-One-Out Encoding:
    • Description: I like using this method to encode a category because it captures the connection to the target. To prevent data leakage, we compute the average target value for the category without including the target value of the current row.
    • Pros: Reduces the leakage risk compared to target encoding.
    • Cons: The computation is more intensive.
    • Example: In this example, let’s first add a target variable
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'colour': ['black', 'yellow', 'red', 'white', 'purple', 'yellow'],
    'target': [10, 20, 30, 40, 50, 10]
})
ID  colour  target
1   black   10
2   yellow  20
3   red     30
4   white   40
5   purple  50
6   yellow  10
The input dataset for the leave-one-out encoder
def loo_encode(row, df, column, target):
    # All rows that share the current row's category
    temp_df = df[df[column] == row[column]]
    # Exclude the current row itself
    temp_df = temp_df[temp_df.index != row.name]
    if temp_df.empty:
        # The category appears only once: fall back to the global mean
        return df[target].mean()
    else:
        return temp_df[target].mean()

df['LOO_Encoded'] = df.apply(loo_encode, args=(df, 'colour', 'target'), axis=1)
ID  colour  target  LOO_Encoded
1   black   10      26.666667
2   yellow  20      10.000000
3   red     30      26.666667
4   white   40      26.666667
5   purple  50      26.666667
6   yellow  10      20.000000
The output table
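
The row-by-row apply above is easy to read but slow on large datasets. An equivalent vectorized sketch computes (group sum - own target) / (group count - 1) and falls back to the global mean for single-occurrence categories:

# Vectorized leave-one-out encoding
grp = df.groupby('colour')['target']
sums = grp.transform('sum')
counts = grp.transform('count')

# For categories seen only once this gives 0/0 = NaN,
# which we replace with the global mean, as in loo_encode
loo = (sums - df['target']) / (counts - 1)
df['LOO_Encoded'] = loo.fillna(df['target'].mean())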

Food for Thought

When choosing an encoding method, consider the type of data (nominal or ordinal), the model type (tree-based or linear), and any potential issues (data leakage, unwanted relationships). During interviews, showing knowledge about these techniques can make a candidate stand out. It shows they understand feature engineering well.

