aka do not one-hot encode everything.
In this article, we will look at how to encode categories and avoid dimensionality issues.
Introduction
During interviews for Data Scientist roles, I ask candidates to complete an assignment: build a Machine Learning model, often a classifier, since this will be an important part of their day-to-day job. One of the steps of the exercise is handling categorical features.
Most of the time, candidates choose one-hot encoding for this step.
What is one-hot encoding?
Many books and articles suggest the same simple technique: transform a categorical variable into a set of binary variables, one per category, each indicating whether that category is present in the record.
Let’s give an example.
By using the following code
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'colour': ['black', 'yellow', 'red', 'white', 'purple']
})
we can create the table
ID | colour |
---|---|
1 | black |
2 | yellow |
3 | red |
4 | white |
5 | purple |
Imagine you have a simple dataset like the one above, with ID and colour variables, and you have to turn colour into a one-hot encoded feature. That means transforming the dataset into something like the table below.
ID | colour_black | colour_yellow | colour_red | colour_white | colour_purple |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 1 | 0 |
5 | 0 | 0 | 0 | 0 | 1 |
As you can see, we created 5 variables from the initial categorical one, each representing one colour, i.e. one category of the original variable. Now, I understand why this method is so popular. If you think about it, it’s a single line of code; you just need to run
df = pd.get_dummies(df)
and everything is taken care of, isn’t it? Yes, but it comes at a cost, as there is no such thing as a free lunch.
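As a quick aside before we look at that cost: in practice it is worth restricting the encoding to the categorical columns and, especially for linear models, dropping one level to avoid perfectly correlated dummy columns. A minimal sketch using the dataframe above:
# Encode only 'colour' and drop one level to avoid redundant columns
df_encoded = pd.get_dummies(df, columns=['colour'], drop_first=True)
print(df_encoded.columns.tolist())
# ['ID', 'colour_purple', 'colour_red', 'colour_white', 'colour_yellow']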
The cost of one-hot encoding everything
When you transform a categorical variable using one-hot encoding, the curse of dimensionality is the main cost. With this term, we refer to various phenomena that arise when dealing with data organised in high-dimensional spaces.
We can summarise the main adverse effects as:
- Increased data sparsity. This is intuitive from the previous example: we had to create 5 variables to represent the original variable with 5 categories. Now imagine a dataset with millions of records and tens of categorical variables, each with 10 to 100 categories. This technique produces data matrices that are mostly zeros, because each record has a 1 only in the variable matching its categorical value. The result is a so-called sparse matrix, and sparsity makes clustering and classification tasks more challenging (see the sketch after this list)
- Computations become more resource-intensive, because the computational complexity of most algorithms grows with the number of variables.
- Adding more dimensions to a machine learning model increases the risk of overfitting. More dimensions give the model more degrees of freedom, so it may end up fitting the noise in the data instead of the actual signal.
- As the number of dimensions increases, machine learning models generally perform worse: the volume of the data space grows exponentially, the available data spreads out, and the model needs more data to reach the same performance.
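To make the sparsity point concrete, here is a small illustrative sketch; the 1,000 records, 10 columns and 50 categories are arbitrary numbers chosen for the example:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 10 categorical columns, each drawn from 50 possible categories
wide = pd.DataFrame({
    f'cat_{i}': rng.integers(0, 50, size=1_000).astype(str)
    for i in range(10)
})
encoded = pd.get_dummies(wide)
print(encoded.shape)              # roughly (1000, 500): one column per category
print(encoded.to_numpy().mean())  # fraction of ones, about 10/500 = 0.02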
Encoding Alternatives
So, what alternatives to one-hot encoding do we have? Let’s explore some of them:
- Label Encoding:
- Description: We assign an integer to each category
- Pros: Saves space as it uses a single column
- Cons: Can introduce ordinal relationships where none exist
- Example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Label_Encoded'] = le.fit_transform(df['colour'])
ID | colour | Label_Encoded |
---|---|---|
1 | black | 0 |
2 | yellow | 4 |
3 | red | 2 |
4 | white | 3 |
5 | purple | 1 |
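A note on the API: scikit-learn’s documentation describes LabelEncoder as intended for target labels; for input features, OrdinalEncoder does the same job on 2-D input. A minimal sketch:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# OrdinalEncoder expects 2-D input, hence the double brackets;
# ravel() flattens the (n, 1) output back into a column
df['Label_Encoded'] = oe.fit_transform(df[['colour']]).ravel()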
- Ordinal Encoding:
- Description: Ordinal variables have a logical order, so we can use ordinal encoding instead of label encoding.
- Pros: Retains the ordinal nature of the variable
- Cons: Not suitable for nominal variables
- Example: let’s pretend the colours have a ranking, and that the ranking happens to be alphabetical.
df['Ordinal_Encoded'] = df['colour'].astype('category').cat.codes
ID | colour | Ordinal_Encoded |
---|---|---|
1 | black | 0 |
2 | yellow | 4 |
3 | red | 2 |
4 | white | 3 |
5 | purple | 1 |
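Colours rarely carry a genuine ranking, so here is a sketch with a hypothetical size column, where the order is meaningful and declared explicitly instead of being inferred alphabetically:
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
# Declare the logical order explicitly rather than relying on alphabetical order
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes['Ordinal_Encoded'] = sizes['size'].astype(size_type).cat.codes
# small -> 0, medium -> 1, large -> 2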
- Frequency (or Count) Encoding:
- Description: Each category is replaced by the number of times it occurs in the dataset.
- Pros: Keeps the shape of the dataset manageable.
- Cons: Collisions, where different categories have the same frequency and become indistinguishable.
- Example:
df['Frequency_Encoded'] = df['colour'].map(df['colour'].value_counts())
ID | colour | Frequency_Encoded |
---|---|---|
1 | black | 1 |
2 | yellow | 1 |
3 | red | 1 |
4 | white | 1 |
5 | purple | 1 |
Note: In this example, all colours appear only once, so the frequency is 1 for all.
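To see a less degenerate result, here is the same encoding applied to the six-row dataset introduced in the leave-one-out example below, where yellow appears twice:
df6 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'colour': ['black', 'yellow', 'red', 'white', 'purple', 'yellow']
})
df6['Frequency_Encoded'] = df6['colour'].map(df6['colour'].value_counts())
# yellow maps to 2, every other colour maps to 1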
- Target (Mean) Encoding:
- Description: Each category is replaced by the average value of the target variable for that category.
- Pros: Captures information about the relationship between the category and the target variable.
- Cons: Risk of data leakage; not suitable for datasets with a small number of observations.
- Example (using the six-row dataset with a target column, defined in the leave-one-out section below):
# Replace each colour with the mean target value for that colour
df['Target_Encoded'] = df['colour'].map(df.groupby('colour')['target'].mean())
ID | colour | target | Target_Encoded |
---|---|---|---|
1 | black | 10 | 10.0 |
2 | yellow | 20 | 15.0 |
3 | red | 30 | 30.0 |
4 | white | 40 | 40.0 |
5 | purple | 50 | 50.0 |
6 | yellow | 10 | 15.0 |
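The leakage risk can be reduced by computing the means out-of-fold, so that each row is encoded using statistics from the other folds only. Here is a minimal sketch of that idea (one of several schemes; smoothing towards the global mean is another common fix), again assuming the six-row dataframe with a target column:
from sklearn.model_selection import KFold

def target_encode_oof(df, column, target, n_splits=3):
    # Encode each row with category means computed on the other folds only
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(df):
        means = df.iloc[fit_idx].groupby(column)[target].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][column].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    return encoded.fillna(df[target].mean())

df['Target_Encoded_OOF'] = target_encode_oof(df, 'colour', 'target')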
- Hashing:
- Description: Use a hash function to map each category to a number. We call this the hashing trick.
- Pros: Helpful for managing many categories; the output has a fixed width, which keeps dimensionality under control.
- Cons: Collisions where different categories get mapped to the same hash.
- Example:
df['Hashed_Value'] = df['colour'].apply(lambda x: hash(x) % 5)
ID | colour | Hashed_Value |
---|---|---|
1 | black | 1 |
2 | yellow | 3 |
3 | red | 0 |
4 | white | 0 |
5 | purple | 4 |
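One caveat with the snippet above: Python’s built-in hash is randomised for strings between processes (the PYTHONHASHSEED mechanism), so the bucket assignments are not reproducible across runs. A stable alternative is to hash the bytes yourself, for instance with hashlib; a minimal sketch, where 5 buckets is an arbitrary choice:
import hashlib

def stable_hash(value, n_buckets=5):
    # md5 of the UTF-8 bytes yields the same bucket in every run and process
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

df['Hashed_Value'] = df['colour'].apply(stable_hash)
For high-cardinality features at scale, scikit-learn’s FeatureHasher implements the same trick.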
- Leave-One-Out Encoding:
- Description: We replace the category with the average target value of the other rows in the same category; leaving out the current row’s own target prevents it from leaking into its encoding. I like using this method because it captures the connection to the target.
- Pros: Reduces leakage risk compared to plain target encoding.
- Cons: The computation is more intensive.
- Example: let’s extend the dataset with a target variable
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'colour': ['black', 'yellow', 'red', 'white', 'purple', 'yellow'],
    'target': [10, 20, 30, 40, 50, 10]
})
ID | colour | target |
---|---|---|
1 | black | 10 |
2 | yellow | 20 |
3 | red | 30 |
4 | white | 40 |
5 | purple | 50 |
6 | yellow | 10 |
def loo_encode(row, df, column, target):
    # All rows sharing the current row's category...
    temp_df = df[df[column] == row[column]]
    # ...excluding the current row itself
    temp_df = temp_df[temp_df.index != row.name]
    if temp_df.empty:
        # Singleton category: fall back to the global target mean
        return df[target].mean()
    else:
        return temp_df[target].mean()

df['LOO_Encoded'] = df.apply(loo_encode, args=(df, 'colour', 'target'), axis=1)
ID | colour | target | LOO_Encoded |
---|---|---|---|
1 | black | 10 | 26.666667 |
2 | yellow | 20 | 10.000000 |
3 | red | 30 | 26.666667 |
4 | white | 40 | 26.666667 |
5 | purple | 50 | 26.666667 |
6 | yellow | 10 | 20.000000 |
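For larger datasets, the row-wise apply above becomes slow. The same values can be computed in a vectorised way using the identity (category sum − own target) / (category count − 1), falling back to the global mean for singleton categories; a minimal sketch:
grp = df.groupby('colour')['target']
sums = grp.transform('sum')
counts = grp.transform('count')
# For singleton categories this is 0/0 = NaN, filled with the global mean
loo = (sums - df['target']) / (counts - 1)
df['LOO_Encoded_fast'] = loo.fillna(df['target'].mean())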
Food for Thought
When choosing an encoding method, consider the type of data (nominal or ordinal), the type of model (tree-based or linear), and the potential pitfalls (data leakage, spurious ordinal relationships). During interviews, showing familiarity with these techniques can make a candidate stand out: it signals a solid understanding of feature engineering.