Hey reader👋 Hope you are doing well😊
We know that machine learning is all about training our models on a given dataset so that they generate accurate output for unseen but similar data. Some algorithms (Regression algorithms, for example) work on numerical data only, yet a dataset may contain numerical as well as categorical data. So how can we use such algorithms on these datasets? To use Regression algorithms on categorical data we first need to transform the categorical data into numerical data. But how can we do that?🤔
Don't worry, in this blog I am going to tell you how we can handle categorical data.
So let's get started🔥
Handling Categorical Data
Categorical data refers to variables whose values are categories rather than numbers. Example -> male, female, red, green, yes or no.
(To understand the types of data that we can encounter, please read this article[https://dev.to/ngneha09/day-2-of-machine-learning-582g])
There are different techniques to handle categorical data, most of them available in Python's sklearn library. Let's read about them-:
1. Label Encoder
The Label Encoder identifies the unique categories within a categorical variable and then assigns a unique integer to each category. There is no strict rule on how these numerical labels are assigned; one common method, and the one sklearn uses, is to assign labels based on the alphabetical order of the categories. It is best suited to ordinal categorical variables.
Implementation-:
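A minimal sketch of the approach (the color values here are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Sample categorical data
colors = ['red', 'green', 'blue', 'green', 'red']

le = LabelEncoder()                 # create an instance of the encoder
encoded = le.fit_transform(colors)  # learn the categories and encode them

print(le.classes_)  # ['blue' 'green' 'red'] -- labels follow alphabetical order
print(encoded)      # [2 1 0 1 2]
```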
So here you can see that we have imported the LabelEncoder from sklearn's preprocessing module, created its instance, and then transformed the categories into numerical labels using fit_transform.
Disadvantage-:
Due to this arbitrary assignment, the numerical labels may not reflect any meaningful relationship in the data, and a model may read an order into them that doesn't really exist.
2. One Hot Encoding
This technique creates a binary (0/1) feature for each category in the original variable.
So here you can see that in the first row we have the color red, so color_red is assigned 1 and the other columns are given 0.
Implementation-:
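A minimal sketch, again with an illustrative color column (on sklearn versions older than 1.2, use sparse=False instead of sparse_output=False):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue']})

# sparse_output=False returns a dense array instead of a sparse matrix
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['color']])

encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['color']))
print(encoded_df)
#    color_blue  color_green  color_red
# 0         0.0          0.0        1.0   <- first row is red
# 1         0.0          1.0        0.0
# 2         1.0          0.0        0.0
```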
Here we have imported OneHotEncoder, fit it on the data, and transformed the categories into binary columns.
Disadvantages-:
With high-cardinality categorical variables this can create a sparse matrix, a matrix where most of the elements are 0, and it also increases the dimensionality of the data. It is also not good for ordinal data, as it doesn't preserve order.
3. Binary Encoding
In this technique the unique categories are first assigned unique integers, which are then converted into binary code (bit representation), and each bit position becomes its own column. You can think of it as a middle ground between label encoding and One Hot Encoding.
Implementation-:
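Binary encoding isn't part of sklearn itself; a common choice is the third-party category_encoders package, so this sketch assumes it is installed (pip install category_encoders):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'yellow', 'pink']})

encoder = ce.BinaryEncoder(cols=['color'])
encoded_df = encoder.fit_transform(df)
print(encoded_df)
# 5 categories -> integers 1..5 -> 3 bits,
# so we get just three columns: color_0, color_1, color_2
```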
Now you can see that the number of extra columns is only the number of bits needed to represent the largest integer assigned to a category.
This technique is best for nominal data where we have a large number of categories.
Disadvantage-:
This technique is not good for ordinal data, as it does not preserve any order.
4. Ordinal Encoding
The critical aspect of Ordinal Encoding is to respect the inherent ordering of the categories. The integers should be assigned in such a way that the order of categories is preserved.
So, for a variable with categories like Poor < Good < Excellent, Poor is assigned the lowest integer, Good the next one, and so on. This way the ordering of the categories is preserved.
Implementation-:
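A minimal sketch with an illustrative quality column, first without specifying any order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'quality': ['Poor', 'Good', 'Excellent', 'Good']})

# No order given, so categories are sorted alphabetically:
# Excellent -> 0, Good -> 1, Poor -> 2
oe = OrdinalEncoder()
print(oe.fit_transform(df[['quality']]))
# [[2.] [1.] [0.] [1.]]
```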
Here the encoder takes a 2D array, and we can see that the encoded data is in alphabetical order. This is because we have not given any particular order to the encoder, so it encodes the data on the basis of alphabetical order.
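And a sketch of the same encoder with the order given explicitly (note that sklearn starts counting at 0):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'quality': ['Poor', 'Good', 'Excellent', 'Good']})

# Explicit order: Poor < Good < Excellent
oe = OrdinalEncoder(categories=[['Poor', 'Good', 'Excellent']])
print(oe.fit_transform(df[['quality']]))
# [[0.] [1.] [2.] [1.]] -- Poor -> 0, Good -> 1, Excellent -> 2
```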
Here we have created an OrdinalEncoder instance with the specified order of categories.
Disadvantages-:
This encoding is not suitable for nominal variables.
5. Frequency Encoding
This is used for nominal categorical variables with high cardinality. In this technique we calculate the frequency of each category, and the encoded value is the count of that category divided by the total number of observations.
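sklearn has no dedicated class for this, but it takes only a couple of lines of pandas; a minimal sketch with an illustrative city column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Delhi', 'Mumbai']})

# Relative frequency of each category: count / total number of rows
freq = df['city'].value_counts(normalize=True)
df['city_encoded'] = df['city'].map(freq)
print(df)
# Delhi -> 0.50, Mumbai -> 0.33..., Pune -> 0.16...
```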
Disadvantage-:
The major disadvantage of this technique is that multiple categories can have the same frequency, and as a result they will get the same encoding, making them indistinguishable.
6. Mean Encoding
In this technique each category in the feature variable is replaced with the mean value of the target variable for that category.
Example-: Suppose we are predicting the price of a car (target variable) and we have a categorical variable 'Color'. If the average price of red cars is $20,000, then 'Red' would be replaced by 20,000 in the encoded feature.
It is useful when dealing with high-cardinality categorical features.
Implementation-:
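A minimal sketch with pandas, reusing the car-price idea from above (the numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Red', 'Blue', 'Red'],
    'price': [22000, 15000, 20000, 17000, 18000],  # target variable
})

# Mean of the target variable for each category
means = df.groupby('color')['price'].mean()

# Replace each category with its corresponding mean
df['color_encoded'] = df['color'].map(means)
print(df)
# Red -> 20000.0, Blue -> 16000.0
```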
Here we have calculated the mean of the target variable for each category, mapped the original categories to their corresponding means, and replaced each category with the computed mean.
It has a high chance of capturing an existing relationship between the category and the target variable.
Disadvantages-:
Mean encoding can lead to overfitting, especially when categories have few observations. Regularization techniques, such as smoothing, can help mitigate this risk.
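One common smoothing scheme blends each category mean with the global mean, weighted by how many observations the category has (a sketch; the weight m is an assumed hyperparameter):

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Red', 'Blue', 'Red'],
    'price': [22000, 15000, 20000, 17000, 18000],
})

m = 10  # smoothing weight: higher m pulls rare categories toward the global mean
global_mean = df['price'].mean()
stats = df.groupby('color')['price'].agg(['mean', 'count'])

# Weighted blend of the category mean and the global mean
smooth = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['color_encoded'] = df['color'].map(smooth)
print(df)
```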
So this is how we handle categorical values. I hope you have understood it well. For more, don't forget to follow me.
Thank you❤