There are several ways to deal with categorical data, also known as label data, in data science:
One-hot encoding
Label encoding
Dummy encoding
Binning
Count Encoding
Frequency Encoding
Target Encoding
The appropriate technique will depend on the specific data and the goals of the analysis. Note that some algorithms, such as decision trees and random forests, can in principle handle categorical variables directly, but whether they do depends on the implementation. scikit-learn's tree-based estimators, for example, expect numeric input, so encoding is still needed there.
We will now go through each of these techniques with a small sample dataset and learn how to make our data trainable.
Let's Start
1. One-hot encoding
One-hot encoding is a technique used to convert categorical variables into numerical values by creating a binary column for each category. It is suited to nominal variables (categories with no natural order) that have a modest number of levels, since every level adds a column.
For example, let's say we have a dataset of hand bags with a column called "color" that contains the following values: "red", "green", and "blue".
color | price | units |
---|---|---|
red | 500 | 2 |
green | 800 | 3 |
blue | 300 | 1 |
red | 400 | 1 |
green | 600 | 1 |
One-hot encoding would create three new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. The resulting data might look like this:
price | units | color_blue | color_green | color_red |
---|---|---|---|---|
500 | 2 | 0 | 0 | 1 |
800 | 3 | 0 | 1 | 0 |
300 | 1 | 1 | 0 | 0 |
400 | 1 | 0 | 0 | 1 |
600 | 1 | 0 | 1 | 0 |
As you can see, the original "color" column has been replaced by three new binary columns, one for each unique category (listed alphabetically here, which is also how pandas orders them). Each row now has a value of 1 in exactly one of these new columns, indicating the presence of that category.
You may be wondering how to do this in Python. You can use the get_dummies() function from the pandas library to apply one-hot encoding to the "color" column of your dataframe. Here is an example of how to do it:
import pandas as pd
# Create example dataframe
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
'price': [500, 800, 300, 400, 600],
'units': [2, 3, 1, 1, 1]})
# Apply one-hot encoding to "color" column (dtype=int gives 0/1 instead of True/False in pandas 2.x)
df_encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(df_encoded)
Alternatively, you can use the OneHotEncoder class from the sklearn.preprocessing module to apply one-hot encoding.
from sklearn.preprocessing import OneHotEncoder
# Create example dataframe
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
'price': [500, 800, 300, 400, 600],
'units': [2, 3, 1, 1, 1]})
# Create an instance of the encoder (sparse_output replaced the older sparse argument in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the "color" column
color_encoded = encoder.fit_transform(df[['color']])
# Create new dataframe with the encoded values (get_feature_names_out replaced the removed get_feature_names)
df_encoded = pd.concat([df.drop(columns=['color']), pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out(['color']))], axis=1)
print(df_encoded)
The resulting dataframe has the same columns as the previous one (color_blue, color_green, color_red), but the encoded values are floats (1.0 and 0.0) rather than integers.
2. Label encoding
Label encoding is a technique used to convert categorical variables into numerical values by assigning a unique integer value to each category. It is most appropriate for ordinal variables, where the order of the categories matters, provided the integers are assigned in that order.
For example, let's say we have a dataset with a column called "size" that contains the following values: "small", "medium", "large". Label encoding would replace each category with an integer, such as: "small" = 0, "medium" = 1, "large" = 2. The resulting data might look like this:
size | encoded_size |
---|---|
small | 0 |
medium | 1 |
large | 2 |
small | 0 |
medium | 1 |
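As you can see, a new "encoded_size" column holds an integer for each row, and rows with the same category share the same integer.
You can use the LabelEncoder class from the sklearn.preprocessing module to apply label encoding to your data. Note that scikit-learn documents LabelEncoder as a tool for encoding target labels, and that it numbers the categories in sorted (alphabetical) order rather than in a meaningful one. Here is an example of how to do it: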
from sklearn.preprocessing import LabelEncoder
# Create example dataframe
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'medium'],
'price': [500, 800, 300, 400, 600],
'units': [2, 3, 1, 1, 1]})
# Create an instance of the encoder
encoder = LabelEncoder()
# Fit and transform the "size" column
df['encoded_size'] = encoder.fit_transform(df['size'])
print(df)
The resulting dataframe, df, will have a new column "encoded_size" holding the encoded values of the "size" column. Because LabelEncoder sorts the categories alphabetically, the mapping is "large" = 0, "medium" = 1, "small" = 2, so the output looks like this:
size | price | units | encoded_size |
---|---|---|---|
small | 500 | 2 | 2 |
medium | 800 | 3 | 1 |
large | 300 | 1 | 0 |
small | 400 | 1 | 2 |
medium | 600 | 1 | 1 |
It's important to note that label encoding imposes a numeric relationship on the categories. LabelEncoder assigns a unique number to each category, but it does not know the real order of the sizes, so here "large" ends up below "small". Even with the intended mapping of 0, 1 and 2, the numbers only express order: they do not mean that large is twice the size of medium. Models that treat the column as a quantity, such as linear models, may read more into these integers than is really there.
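If the order of the categories matters, one option is to state it explicitly. The following is a minimal sketch using scikit-learn's OrdinalEncoder with its categories parameter. The order small < medium < large is an assumption taken from the example above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'medium']})

# State the intended order explicitly instead of relying on alphabetical sorting
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# fit_transform expects 2D input and returns a 2D float array, so flatten and cast it
df['encoded_size'] = encoder.fit_transform(df[['size']]).ravel().astype(int)
print(df)
```

With the order given up front, "small", "medium" and "large" map to 0, 1 and 2, matching the intended ranking.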
3. Dummy Encoding
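Dummy encoding, also known as indicator encoding, converts categorical variables into numerical values by creating binary columns for the categories, much like one-hot encoding. In statistics the term usually refers to creating k-1 columns for k categories, with the omitted category acting as the baseline. The variant shown here creates a column for every category and keeps the original column alongside them. Because every level adds a column, it works best when the variable has relatively few levels.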
For example, let's say we have a dataset with a column called "color" that contains the following values: "red", "green", "blue". Dummy encoding would create three new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. The resulting data might look like this:
color | red | green | blue |
---|---|---|---|
red | 1 | 0 | 0 |
green | 0 | 1 | 0 |
blue | 0 | 0 | 1 |
red | 1 | 0 | 0 |
green | 0 | 1 | 0 |
As you can see, the original "color" column is still present in the table, but three new binary columns, one for each unique category, have been added. Each row now has a value of 1 in exactly one of these new columns, indicating the presence of that category.
You can use the pd.concat() function from the pandas library, together with get_dummies(), to apply dummy encoding to the "color" column of your dataframe. Here is an example of how to do it:
# Create example dataframe
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
'price': [500, 800, 300, 400, 600],
'units': [2, 3, 1, 1, 1]})
# Apply dummy encoding to "color" column (dtype=int gives 0/1 instead of True/False in pandas 2.x)
df_encoded = pd.concat([df, pd.get_dummies(df['color'], dtype=int)], axis=1)
print(df_encoded)
The resulting dataframe, df_encoded, will have three new binary columns, one for each unique category in the "color" column, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. pandas orders the new columns alphabetically, and the original "color" column is still present in the table. The resulting dataframe will look like this:
color | price | units | blue | green | red |
---|---|---|---|---|---|
red | 500 | 2 | 0 | 0 | 1 |
green | 800 | 3 | 0 | 1 | 0 |
blue | 300 | 1 | 1 | 0 | 0 |
red | 400 | 1 | 0 | 0 | 1 |
green | 600 | 1 | 0 | 1 | 0 |
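For comparison, the k-1 form of dummy encoding can be sketched with the drop_first parameter of get_dummies(). This is only an illustration on the same sample data. Which category becomes the baseline depends on sort order, so here it is "blue".

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                   'price': [500, 800, 300, 400, 600]})

# drop_first=True omits the first category in sorted order ("blue"),
# which becomes the baseline: a row with zeros in both columns means "blue"
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)
```

Dropping one column removes the redundancy between the indicators, which matters mostly for linear models that include an intercept.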
4. Binning
Binning is a technique used to group numerical values into bins or ranges. It is used to handle numerical variables with a large number of unique values, and it can be useful for creating categorical variables from numerical ones and for reducing the influence of outliers in the data.
For example, let's say we have a dataset with a column called "age" that contains the following values: 18, 20, 25, 30, 35, 40, 45. To apply binning, we divide the range of values into a predefined set of intervals, or bins. Here we use four bins that include their lower edge and exclude their upper edge: [18, 25), [25, 35), [35, 45), [45, 50). This groups the ages into four categories: "young", "middle-aged", "old", and "very old". The resulting data might look like this:
age | age_bin |
---|---|
18 | young |
20 | young |
25 | middle-aged |
30 | middle-aged |
35 | old |
40 | old |
45 | very old |
As you can see, the original "age" column is still present in the table, but a new column "age_bin" has been added, which contains the binned values for each age. The rows in the "age_bin" column now contain categorical values representing the age group.
You can use the cut() function from the pandas library to apply binning to the "age" column of your dataframe. Here is an example of how to do it:
# Create example dataframe
df = pd.DataFrame({'age': [18, 20, 25, 30, 35, 40, 45],
'price': [500, 800, 300, 400, 600, 700, 800],
'units': [2, 3, 1, 1, 1, 2, 3]})
# Apply binning to "age" column; right=False makes each bin include its left edge, e.g. [18, 25)
df['age_bin'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 50], right=False, labels=['young', 'middle-aged', 'old', 'very old'])
print(df)
The resulting dataframe, df, will have a new column "age_bin" holding the binned values of the "age" column. The resulting dataframe will look like this:
age | price | units | age_bin |
---|---|---|---|
18 | 500 | 2 | young |
20 | 800 | 3 | young |
25 | 300 | 1 | middle-aged |
30 | 400 | 1 | middle-aged |
35 | 600 | 1 | old |
40 | 700 | 2 | old |
45 | 800 | 3 | very old |
As before, the original "age" column is kept and "age_bin" holds the age group for each row. The new column is a pandas categorical whose categories keep the order of the labels, from "young" up to "very old".
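How cut() treats the bin edges decides where boundary ages such as 18, 25, 35 and 45 land. The sketch below compares the two settings of the right parameter on the same ages. The variable names are only illustrative.

```python
import pandas as pd

ages = pd.Series([18, 20, 25, 30, 35, 40, 45])
edges = [18, 25, 35, 45, 50]
labels = ['young', 'middle-aged', 'old', 'very old']

# Default (right=True): bins are (18, 25], (25, 35], ... so 18 falls outside every bin
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: bins are [18, 25), [25, 35), ... so each lower edge belongs to its bin
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

comparison = pd.DataFrame({'age': ages,
                           'right_closed': right_closed,
                           'left_closed': left_closed})
print(comparison)
```

With the default setting, age 18 becomes a missing value and 25 is still "young". With right=False, 18 is "young" and 25 moves to "middle-aged".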
5. Count Encoding
Count encoding is a technique used to convert categorical variables into numerical values by counting the number of occurrences of each category in the dataset. It is used to handle categorical variables with many levels.
For example, let's say we have a dataset with a column called "product" that contains the following values: "apple", "orange", "banana", "apple", "orange", "apple", "banana". Count encoding would replace each category with the number of times it appears in the dataset. The resulting data might look like this:
product | count_encoded |
---|---|
apple | 3 |
orange | 2 |
banana | 2 |
apple | 3 |
orange | 2 |
apple | 3 |
banana | 2 |
As you can see, the original "product" column is still present in the table, but a new column "count_encoded" has been added, which contains the count-encoded values for each product. The "count_encoded" column now contains integers giving the number of times each product appears in the dataset. Note that these values are not unique per category: "orange" and "banana" both appear twice, so they receive the same code and a model can no longer tell them apart.
You can use the value_counts() function from the pandas library to apply count encoding to the "product" column of your dataframe. Here is an example of how to do it:
# Create example dataframe
df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
'price': [500, 800, 300, 400, 600, 700, 800],
'units': [2, 3, 1, 1, 1, 2, 3]})
# Apply count encoding to "product" column
df['count_encoded'] = df['product'].map(df['product'].value_counts())
print(df)
The resulting dataframe, df, will have a new column "count_encoded" holding the count-encoded values of the "product" column. The resulting dataframe will look like this:
product | price | units | count_encoded |
---|---|---|---|
apple | 500 | 2 | 3 |
orange | 800 | 3 | 2 |
banana | 300 | 1 | 2 |
apple | 400 | 1 | 3 |
orange | 600 | 1 | 2 |
apple | 700 | 2 | 3 |
banana | 800 | 3 | 2 |
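In practice the counts are usually learned from the training data and then applied to new rows. Here is a minimal sketch of that, assuming a hypothetical `new` dataframe that contains a product ("kiwi") missing from the training data. Treating unseen products as a count of 0 is one possible choice, not the only one.

```python
import pandas as pd

train = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple',
                                  'orange', 'apple', 'banana']})
new = pd.DataFrame({'product': ['banana', 'kiwi', 'apple']})

# Learn the counts from the training data only
counts = train['product'].value_counts()

# Categories missing from the training data map to NaN; treat them as a count of 0
new['count_encoded'] = new['product'].map(counts).fillna(0).astype(int)
print(new)
```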
6. Frequency Encoding
Frequency encoding is a technique used to convert categorical variables into numerical values by representing each category as the proportion of rows in the dataset that belong to it. It is similar to count encoding, but it normalizes the count by dividing it by the total number of rows. It is used to handle categorical variables with many levels.
For example, let's say we have a dataset with a column called "product" that contains the following values: "apple", "orange", "banana", "apple", "orange", "apple", "banana". Frequency encoding would replace each category with the proportion of times it appears in the dataset. The resulting data might look like this:
product | frequency_encoded |
---|---|
apple | 0.429 |
orange | 0.286 |
banana | 0.286 |
apple | 0.429 |
orange | 0.286 |
apple | 0.429 |
banana | 0.286 |
As you can see, the original "product" column is still present in the table, but a new column "frequency_encoded" has been added, which contains the frequency encoded values for each product. The rows in the "frequency_encoded" column now contain decimal values between 0 and 1 representing the proportion of times each product appears in the dataset.
You can use the value_counts() function from the pandas library with normalize=True to apply frequency encoding to the "product" column of your dataframe. Here is an example of how to do it:
# Create example dataframe
df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
'price': [500, 800, 300, 400, 600, 700, 800],
'units': [2, 3, 1, 1, 1, 2, 3]})
# Apply frequency encoding to "product" column
df['frequency_encoded'] = df['product'].map(df['product'].value_counts(normalize=True))
print(df)
The resulting dataframe, df, will have a new column "frequency_encoded" holding the frequency-encoded values of the "product" column. The resulting dataframe will look like this:
product | price | units | frequency_encoded |
---|---|---|---|
apple | 500 | 2 | 0.428571 |
orange | 800 | 3 | 0.285714 |
banana | 300 | 1 | 0.285714 |
apple | 400 | 1 | 0.428571 |
orange | 600 | 1 | 0.285714 |
apple | 700 | 2 | 0.428571 |
banana | 800 | 3 | 0.285714 |
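To make the arithmetic explicit, the short sketch below computes the same proportions by hand (count divided by the 7 rows) and compares them with value_counts(normalize=True). The names `freq_manual` and `freq_builtin` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple',
                               'orange', 'apple', 'banana']})

# Count per category, then divide by the number of rows (7)
counts = df['product'].value_counts()
freq_manual = counts / len(df)

# The built-in shortcut should give the same proportions
freq_builtin = df['product'].value_counts(normalize=True)

# Rounded to three decimals for display only
df['frequency_encoded'] = df['product'].map(freq_manual).round(3)
print(df)
```

For "apple" this is 3 / 7, which rounds to 0.429, and for "orange" and "banana" it is 2 / 7, which rounds to 0.286.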
7. Target Encoding
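Target encoding is a technique used to convert categorical variables into numerical values by representing each category as the mean of the target variable for that category. It is often used when the categorical variable has a large number of levels, since it produces a single numeric column no matter how many categories there are. Because the encoding is computed from the target itself, it should be fitted on the training data only. Otherwise information about the target leaks into the features.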
For example, let's say we have a dataset with a column called "product" and a target variable called "sales" that contains the following values:
product | sales |
---|---|
apple | 100 |
orange | 200 |
banana | 50 |
apple | 150 |
orange | 300 |
apple | 50 |
banana | 20 |
Target encoding would replace each category in the "product" column with the mean of the "sales" column for that category. The resulting data might look like this:
product | sales | target_encoded |
---|---|---|
apple | 100 | 100.0 |
orange | 200 | 250.0 |
banana | 50 | 35.0 |
apple | 150 | 100.0 |
orange | 300 | 250.0 |
apple | 50 | 100.0 |
banana | 20 | 35.0 |
As you can see, the original "product" column is still present in the table, but a new column "target_encoded" has been added, which contains the target encoded values for each product. The rows in the "target_encoded" column now contain decimal values representing the mean of the "sales" column for each product.
You can use the groupby() function from the pandas library to apply target encoding to the "product" column of your dataframe. Here is an example of how to do it:
# Create example dataframe
df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
'sales': [100, 200, 50, 150, 300, 50, 20]})
# Apply target encoding to "product" column
df['target_encoded'] = df.groupby('product')['sales'].transform('mean')
print(df)
The resulting dataframe, df, will have a new column "target_encoded" holding the mean of the "sales" column for each product. For "apple" that is (100 + 150 + 50) / 3 = 100. The resulting dataframe will look like this:
product | sales | target_encoded |
---|---|---|
apple | 100 | 100.0 |
orange | 200 | 250.0 |
banana | 50 | 35.0 |
apple | 150 | 100.0 |
orange | 300 | 250.0 |
apple | 50 | 100.0 |
banana | 20 | 35.0 |
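A plain per-category mean can be noisy for categories with only a few rows. One common mitigation is smoothing, which pulls each category mean toward the overall mean in proportion to how little data the category has. The sketch below is a minimal version of that idea. The smoothing strength `m = 2` and the column name `target_smoothed` are assumptions made for this example, not values from the article.

```python
import pandas as pd

df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple',
                               'orange', 'apple', 'banana'],
                   'sales': [100, 200, 50, 150, 300, 50, 20]})

m = 2  # smoothing strength (an assumed value; tune it for your data)
global_mean = df['sales'].mean()

# Per-category mean and number of rows
stats = df.groupby('product')['sales'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['target_smoothed'] = df['product'].map(smoothed)
print(df)
```

Each smoothed value lies between the category's own mean and the global mean, and categories with more rows stay closer to their own mean.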