Hey there, data enthusiasts! 🎀
In the exciting world of data science and machine learning, one of the first and most crucial steps is turning raw data into a format that our models can understand and learn from. This process, called data preprocessing, involves several important steps:
- Data Cleaning: Removal of noise and inconsistent data. Say a feature has 80% null values: would you still keep it? What about 20% null values? Those can easily be filled with statistics such as the mean for numerical data or the mode for categorical data (see the sketch after this list).
- Data Integration: Combining multiple data sources for better predictions, e.g. combining a driver's medical record with race and season data to predict their position in an F1 race. The health data alone wouldn't help much, but using it as a weight on previous race positions can drastically increase its importance!
- Data Selection: Selecting important and useful data. Try feature engineering to get the best features for your model.
- Data Transformation: Data is transformed and consolidated for mining through encodings and feature engineering. I consider this the most important step before data mining, since without encoding, data mining is useless and unhelpful.
- Data Mining: Intelligent methods are applied to extract data patterns, i.e. the extraction of implicit, previously unknown, and potentially useful information from data. E.g. using the race year and a driver's date of birth to derive the driver's age, providing a new insight while removing two columns from the model.
- Pattern Evaluation: Identify the truly interesting patterns using various evaluation metrics.
- Knowledge Presentation: Create visualisations and statistics like charts, heatmaps, and much more. Understand your data and iterate over the steps above wherever needed.
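Here's a minimal sketch of the null-handling rules from the Data Cleaning step above, on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame illustrating the null-value rules above
df = pd.DataFrame({
    'mostly_null': [np.nan, np.nan, np.nan, np.nan, 1.0],  # ~80% nulls
    'some_null': [1.0, 2.0, np.nan, 4.0, 5.0],             # ~20% nulls
    'category': ['a', 'b', None, 'b', 'b'],
})

df = df.drop(columns=['mostly_null'])  # too sparse to be trustworthy
df['some_null'] = df['some_null'].fillna(df['some_null'].mean())  # numerical: mean
df['category'] = df['category'].fillna(df['category'].mode()[0])  # categorical: mode
print(df)
```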
Central to this preprocessing is the task of encoding. This blog delves into the various encoding methodologies, providing a comprehensive analysis of them.
Importance of Encoding
Encoding is a crucial step in the data preprocessing pipeline, especially when dealing with categorical data. Categorical variables, which represent data that can be divided into specific groups or categories, often need to be converted into a numerical format for machine learning algorithms to process them effectively. This conversion process is known as encoding. Machine learning models typically require numerical input because they are based on mathematical calculations that cannot interpret categorical data directly. By transforming categorical data into numerical values through various encoding techniques, we can ensure that our models can leverage all available information, leading to better performance and more accurate predictions. Encoding not only makes data suitable for analysis but also helps preserve the relationships and characteristics inherent in the original categorical variables.
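As a quick sanity check on why this matters, here's a toy sketch (on made-up data) showing that a scikit-learn model rejects raw string columns but fits happily on an encoded version:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'colour': ['red', 'blue', 'green'], 'price': [10, 12, 9]})
model = LinearRegression()

try:
    model.fit(df[['colour']], df['price'])  # raw strings: not allowed
except ValueError as err:
    print('Raw strings fail:', err)

# After a simple encoding, the same model fits without complaint
model.fit(pd.get_dummies(df[['colour']]), df['price'])
```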
Prerequisites
No sane person codes on paper; he who codes on paper has mastered the essence of coding, or the truth behind the universe itself. - ME🎀
Install the required Python libraries:

```
pip install scikit-learn pandas category_encoders
```
Different datasets require different encoding methods, so a different example dataset may be used for each method.
Types of Encoding
While there are hundreds of encoding methods, we will focus on the most important and widely used ones.
- Multi-Hot Encoding
- Label Encoding
- Ordinal Encoding
- Binary Encoding
- Target Encoding
- Frequency Encoding
Multi-Hot Encoding
This method converts categorical data into binary vectors: each sample's set of categories is mapped to a binary vector whose length equals the number of distinct categories. This method is usually used in classification models.
Example: Imagine you have a dataset of music tracks.
| Name | Artist | Genre |
|---|---|---|
| Fly Me to the Moon | The Macarons Project | ["slow", "acoustic", "pop"] |
| Mad at Disney | Salem ilese | ["dance", "pop"] |
Here, `genre` is the feature we need to encode, since passing an array of genre names straight to the model would be ineffective.
```python
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Creating the dataframe with a list of genres per song
df = pd.DataFrame({
    "name": ["Fly Me to the Moon", "Mad at Disney"],
    "artist": ["The Macarons Project", "Salem ilese"],
    "genre": [["slow", "acoustic", "pop"], ["dance", "pop"]]
})

# Using MultiLabelBinarizer to handle the list of genres
mlb = MultiLabelBinarizer()
x_encoded = mlb.fit_transform(df["genre"])

# Creating the encoded dataframe
encoded_df = pd.DataFrame(x_encoded, columns=mlb.classes_)

# Concatenating the original columns with the encoded genres
df_final = pd.concat([df.drop(columns=["genre"]), encoded_df], axis=1)
print(df_final)
```
| name | artist | acoustic | dance | pop | slow |
|---|---|---|---|---|---|
| Fly Me to the Moon | The Macarons Project | 1 | 0 | 1 | 1 |
| Mad at Disney | Salem ilese | 0 | 1 | 1 | 0 |
The data is encoded with the genres, where 1 means HOT (present) and 0 means COLD (absent). One-Hot Encoding takes a similar approach for single-valued categories, although Binary Encoding or Label Encoding is often the better choice in those cases.
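For comparison, here's a minimal one-hot sketch using `pandas.get_dummies` on a made-up single-genre column:

```python
import pandas as pd

# One-hot: each track has exactly one genre, so we get one column per
# category with a single 1 per row
songs = pd.DataFrame({'genre': ['pop', 'acoustic', 'pop', 'dance']})
print(pd.get_dummies(songs, columns=['genre'], dtype=int))
```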
Label Encoding
This method converts each categorical value into a numerical value.
It is similar to multi-hot encoding in a way; the key difference is that Label Encoding might inadvertently introduce ordinal relationships where none exist, which can mislead some algorithms. Multi-hot encoding avoids this by treating each category independently.
Example: A company sells shirts of different sizes and colours at a certain price.
| Colour | Size | Company | Price |
|---|---|---|---|
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |
We need to encode all three columns: `Colour`, `Size`, and `Company`. We will use Label Encoding, since the numerical signal it adds can help the model predict with better accuracy.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour', 'Size', and 'Company'
# (fit_transform refits the encoder for each column)
label_encoder = LabelEncoder()
df['Colour_encoded'] = label_encoder.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df['Company_encoded'] = label_encoder.fit_transform(df['Company'])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
```
| Price | Colour_encoded | Size_encoded | Company_encoded |
|---|---|---|---|
| 300 | 2 | 0 | 2 |
| 230 | 0 | 1 | 0 |
| 568 | 2 | 2 | 3 |
| 927 | 1 | 1 | 1 |
The numerical values here are assigned by sorting the categories (alphabetically or numerically) by default, but if we want to intentionally impose our own preference, we should look into Ordinal Encoding.
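You can verify the sorted assignment yourself through the encoder's `classes_` attribute; a quick sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['red', 'blue', 'red', 'green'])
print(le.classes_)                             # ['blue' 'green' 'red'] -- sorted order
print(le.transform(['blue', 'green', 'red']))  # [0 1 2]
```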
Ordinal Encoding
Similar to Label Encoding, with the only difference being that we ourselves provide a specific order of importance for the categories (unlike how the label encoder sorted all categories to number them).
Example: In the Label Encoding example, the companies should really follow our own preference order, since we know companies like Gucci or Zara sell T-shirts at expensive prices.
| Colour | Size | Company | Price |
|---|---|---|---|
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |
Let's use `["ACM", "Max", "Zara", "Gucci"]` as our order from cheapest to most expensive T-shirts.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour' and 'Size'
label_encoder_colour = LabelEncoder()
label_encoder_size = LabelEncoder()
df['Colour_encoded'] = label_encoder_colour.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder_size.fit_transform(df['Size'])

# Ordinal Encoding for 'Company' with the specified order
company_order = ["ACM", "Max", "Zara", "Gucci"]
ordinal_encoder = OrdinalEncoder(categories=[company_order])
df['Company_encoded'] = ordinal_encoder.fit_transform(df[['Company']])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
```
| Price | Colour_encoded | Size_encoded | Company_encoded |
|---|---|---|---|
| 300 | 2 | 0 | 1 |
| 230 | 0 | 1 | 0 |
| 568 | 2 | 2 | 2 |
| 927 | 1 | 1 | 3 |
This intentionally adds a bias to the model based on the company name.
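One practical note: by default, `OrdinalEncoder` raises an error on categories it never saw during fitting. A sketch of one common fallback, mapping unknowns (like the made-up "H&M" below) to -1:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map categories outside the fitted order to a sentinel value instead of
# raising an error at prediction time
enc = OrdinalEncoder(
    categories=[["ACM", "Max", "Zara", "Gucci"]],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
enc.fit([["ACM"], ["Max"], ["Zara"], ["Gucci"]])
print(enc.transform([["Gucci"], ["H&M"]]))  # [[ 3.] [-1.]]
```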
Binary Encoding
This method converts each categorical value into binary digits (0s and 1s) and stores each digit as a separate column. It is useful when you have many categories and want to reduce dimensionality compared to multi-hot encoding.
Each category is first given an ordinal code, and that code is then split into its binary digits across separate columns. This results in roughly log2(N) columns, while multi-hot encoding would produce N columns. A manual sketch of this mechanic follows below.
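To make the log2(N) claim concrete, here's a rough manual sketch of the mechanics on toy data (note that `category_encoders` assigns ordinal codes in order of appearance rather than alphabetically, so its exact columns can differ from this version):

```python
import numpy as np
import pandas as pd

colours = pd.Series(['Red', 'Green', 'Blue', 'Red'])
codes = colours.astype('category').cat.codes.to_numpy() + 1   # ordinal codes 1..N
n_bits = int(np.ceil(np.log2(codes.max() + 1)))               # columns needed
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1  # split into digits
print(pd.DataFrame(bits, columns=[f'Colour_{i}' for i in range(n_bits)]))
```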
Example: Encoding just the Colours into something suitable.
| Colour |
|---|
| Red |
| Green |
| Blue |
| Red |
```python
import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = pd.DataFrame({'Colour': ['Red', 'Green', 'Blue', 'Red']})

# Create a BinaryEncoder object
encoder = BinaryEncoder(cols=['Colour'])

# Encode the categorical feature
encoded_data = encoder.fit_transform(data)
print(encoded_data)
```
| Colour_0 | Colour_1 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 1 | 1 |
| 0 | 1 |
Most of the time, if there are only a few categories, we should prefer multi-hot encoding or label encoding instead.
Target Encoding
Also known as Mean Encoding or Likelihood Encoding, this method encodes categorical values by replacing each category with a statistic (usually the mean) of the target variable within that category.
It is highly recommended and very useful for handling high-cardinality categorical variables, and it captures the relationship between a categorical variable and the target more effectively than one-hot encoding.
Formula (the standard smoothed-mean form):

encoding(category) = (n × category mean + m × global mean) / (n + m)

where:
- n: no. of samples in the category.
- m: smoothing parameter.
- category mean / global mean: mean of the target within the category / over the whole dataset.
Example: In a house-price prediction model, encoding neighborhood names with the mean house price in each area provides more insight than plain label encoding.
| House Number | Price | Neighborhood | Size (sq meter) |
|---|---|---|---|
| 1 | 500000 | Downtown | 200 |
| 2 | 350000 | Suburb | 150 |
| 3 | 700000 | City Center | 300 |
| 4 | 450000 | Suburb | 180 |
| 5 | 600000 | Downtown | 250 |
```python
import pandas as pd

# Original dataset
data = {
    'House Number': [1, 2, 3, 4, 5],
    'Price': [500000, 350000, 700000, 450000, 600000],
    'Neighborhood': ['Downtown', 'Suburb', 'City Center', 'Suburb', 'Downtown'],
    'Size (sq meter)': [200, 150, 300, 180, 250]
}
df = pd.DataFrame(data)

# Calculate the mean price for each neighborhood (no smoothing here)
neighborhood_means = df.groupby('Neighborhood')['Price'].mean().to_dict()

# Map mean prices back to the original dataset
df['Neighborhood'] = df['Neighborhood'].map(neighborhood_means)

# Display the encoded dataset
print(df)
```
| House Number | Price | Neighborhood | Size (sq meter) |
|---|---|---|---|
| 1 | 500000 | 550000.0 | 200 |
| 2 | 350000 | 400000.0 | 150 |
| 3 | 700000 | 700000.0 | 300 |
| 4 | 450000 | 400000.0 | 180 |
| 5 | 600000 | 550000.0 | 250 |
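The code above uses plain per-category means (i.e. m = 0). Here's a sketch of the smoothed formula on the same toy data; `m` is a hyperparameter you would tune, and in practice these statistics should be computed on the training split only to avoid target leakage:

```python
import pandas as pd

df_raw = pd.DataFrame({
    'Price': [500000, 350000, 700000, 450000, 600000],
    'Neighborhood': ['Downtown', 'Suburb', 'City Center', 'Suburb', 'Downtown'],
})

m = 2                                 # smoothing strength
global_mean = df_raw['Price'].mean()  # 520000.0
stats = df_raw.groupby('Neighborhood')['Price'].agg(['mean', 'count'])

# Blend each neighborhood's mean with the global mean, weighted by n and m
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df_raw['Neighborhood'] = df_raw['Neighborhood'].map(smoothed)
print(df_raw)
```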
Frequency Encoding
This method replaces each categorical value with its frequency or count within the training dataset.
Formula:

frequency(category) = count(category) / total no. of samples
Example: Encoding cities based on the no. of times each city appears in the dataset.
| Transaction ID | Amount | City | Product Category |
|---|---|---|---|
| 1 | 100 | New York | Electronics |
| 2 | 200 | Los Angeles | Clothing |
| 3 | 150 | Chicago | Electronics |
| 4 | 300 | New York | Groceries |
| 5 | 250 | Chicago | Clothing |
```python
import pandas as pd

# Example dataset with customer transactions
data = {
    'Transaction ID': [1, 2, 3, 4, 5],
    'Amount': [100, 200, 150, 300, 250],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Product Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries', 'Clothing']
}
df = pd.DataFrame(data)

# A quick look at the transactions for one city
selected_city = 'New York'
filtered_data = df[df['City'] == selected_city]
print(f"Data for transactions in {selected_city}:")
print(filtered_data)

# Applying frequency encoding to 'City' (normalize=True gives proportions)
city_frequency = df['City'].value_counts(normalize=True)
df['City'] = df['City'].map(city_frequency)
print(df)
```
| Transaction ID | Amount | City | Product Category |
|---|---|---|---|
| 1 | 100 | 0.4 | Electronics |
| 2 | 200 | 0.2 | Clothing |
| 3 | 150 | 0.4 | Electronics |
| 4 | 300 | 0.4 | Groceries |
| 5 | 250 | 0.4 | Clothing |
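At prediction time you should reuse the frequencies learned from the training data; a small sketch (the city "Boston" is made up to show an unseen category):

```python
import pandas as pd

train = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                               'New York', 'Chicago']})
freq = train['City'].value_counts(normalize=True)  # learned on training data

# Unseen categories get NaN from .map(); fall back to 0 as one simple convention
test = pd.DataFrame({'City': ['Chicago', 'Boston']})
test['City_encoded'] = test['City'].map(freq).fillna(0)
print(test)  # Chicago -> 0.4, Boston -> 0.0
```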
Conclusion
With this, all the important and necessary encoding methods are covered! Choosing the right encoding method can significantly impact the performance of your machine learning models.