Alit Darma Putra

Laid-Back Approaches to Clean Data

Because a model's output depends heavily on the quality of its input data, the input data needs further handling before training. One way to improve data quality is to apply data cleaning techniques.

What is Data Cleaning?

Data cleaning is a technique that aims to improve data quality by identifying and eliminating errors and inconsistencies in data.

Here I will share some common data cleaning methods along with simple implementations of each.

There are four commonly used methods in data cleaning:

  1. Scaling Feature Values
  2. Handling Extreme Outliers
  3. Binning
  4. Scrubbing

Scaling Feature Values

Feature scaling is the process of normalizing the range of features in a dataset. In real cases, feature values often have very different ranges. If one feature has a much wider range than the others, it will dominate the calculations of the algorithm being used. Therefore, the ranges of all features should be normalized so that each feature contributes comparably. Several techniques can be used to perform feature scaling, including:

Absolute Maximum Scaling

Absolute Maximum Scaling is a scaling technique based on the absolute maximum value of each feature. The steps are:

  1. Determine the maximum absolute value of the feature in the dataset.
  2. Divide every value in the column by that maximum absolute value: x_scaled = x / max(|x|).
import matplotlib.pyplot as plt

def max_absolute_scaling(data):
    # Determines the maximum absolute value
    max_abs_value = max(map(abs, data))

    # Divide each value by the maximum absolute value
    scaled_data = []
    for x in data:
        scaled_data.append(x/max_abs_value)

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = max_absolute_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.5, -0.16666666666666666, 1.0, 0.3333333333333333, -0.6666666666666666]

(Plot of the original data and the scaled data.)

Min-Max Scaling

Min-Max Scaling is a scaling technique in which each value in the dataset is reduced by the minimum value and then divided by the range of the dataset (maximum minus minimum). After applying this technique, all feature values lie between 0 and 1. Like absolute maximum scaling, its weakness is that it is susceptible to outliers.
x_scaled = (x - min) / (max - min)

import matplotlib.pyplot as plt

def min_max_scaling(data):
    # Determine the maximum and minimum values
    max_value = max(data)
    min_value = min(data)

    # Reduces each value by the minimum value
    # then divided by the range of dataset values
    scaled_data = []
    for x in data:
        scaled_data.append(
        (x-min_value)/(max_value-min_value)
        )

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = min_max_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()


Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.7, 0.3, 1.0, 0.6, 0.0]

(Plot of the original data and the scaled data.)

Normalization

Normalization is a scaling technique similar to min-max scaling, except that each feature value is reduced by the mean of the dataset instead of the minimum. The result is then divided by the range of dataset values (maximum minus minimum).
x_scaled = (x - mean) / (max - min)

import matplotlib.pyplot as plt
from statistics import mean

def normalization(data):
    # Determine maximum, minimum, average values
    max_value = max(data)
    min_value = min(data)
    mean_value = mean(data)

    # Subtract each value by the average value
    # then divided by the range of dataset values
    scaled_data = []
    for x in data:
        scaled_data.append(
        (x-mean_value)/(max_value-min_value)
        )

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.18, -0.22000000000000003, 0.48, 0.08, -0.52]

(Plot of the original data and the scaled data.)

Standardization (Z-score Normalization)

Standardization is a scaling technique in which each feature value is reduced by the mean and divided by the standard deviation; the result is usually called the z-score. The scaled data is centered on the mean and has a standard deviation of 1. This technique is suitable when the feature is approximately normally distributed, such as salary or age.

z = (x - mean) / std

import matplotlib.pyplot as plt
from statistics import mean, stdev

def standardization(data):
    # Determine the average value, standard deviation
    mean_value = mean(data)
    stdev_value = stdev(data)

    # Subtract each value by the average value
    # then divided by the standard deviation
    scaled_data = []
    for x in data:
        scaled_data.append(
        (x-mean_value)/(stdev_value)
        )

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = standardization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.469476477861571, -0.5738045840530313, 1.2519372742975226, 0.20865621238292043, -1.3562653804889828]

(Plot of the original data and the scaled data.)

Robust Scaling

In the Robust Scaling technique, each data point is reduced by the median and then divided by the interquartile range (IQR). The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1).

x_scaled = (x - median) / IQR

import matplotlib.pyplot as plt
import pandas as pd

def robust_scaling(data):
    series = pd.Series(data)
    # Determine the median value, IQR
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    IQR = q3 - q1

    # Subtract each value by the median value
    # then divided by the IQR value
    scaled_data = []
    for x in data:
        scaled_data.append(
        (x-median)/IQR
        )

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = robust_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.25, -0.75, 1.0, 0.0, -1.5]

(Plot of the original data and the scaled data.)

Scaling to Vector Unit Length

Scaling to Vector Unit Length is a scaling technique that transforms the components of a feature vector so that the transformed vector has a length of 1. In this technique, each feature value is divided by the length (norm) of the vector.

x_scaled = x / ||x||

This technique can only be applied when ||X|| > 0.

import matplotlib.pyplot as plt
import numpy as np

def vector_normalization(data):
    vector = np.array(data)
    # Determine the length (norm) of the vector
    magnitude = np.linalg.norm(vector)

    # Normalize the vector to unit length
    scaled_data = vector/magnitude

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = vector_normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data.tolist())
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.3692744729379982, -0.12309149097933272, 0.7385489458759964, 0.24618298195866545, -0.4923659639173309]   

(Plot of the original data and the scaled data.)
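
As a side note, the manual functions above are mainly for illustration. Assuming scikit-learn is available in your environment, equivalent ready-made scalers can be used instead; a minimal sketch:

import numpy as np
# Assumes scikit-learn is installed; these classes expect a 2-D array
# of shape (n_samples, n_features)
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   StandardScaler, RobustScaler, Normalizer)

data = np.array([3, -1, 6, 2, -4], dtype=float).reshape(-1, 1)

print("MaxAbs:  ", MaxAbsScaler().fit_transform(data).ravel())
print("MinMax:  ", MinMaxScaler().fit_transform(data).ravel())
# StandardScaler uses the population standard deviation, so its output
# differs slightly from the statistics.stdev-based example above
print("Standard:", StandardScaler().fit_transform(data).ravel())
print("Robust:  ", RobustScaler().fit_transform(data).ravel())
# Normalizer works per row, so the whole list is treated as one vector here
print("UnitVec: ", Normalizer().fit_transform(data.reshape(1, -1)).ravel())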

Handling Extreme Outliers

Outliers are values that differ greatly from the majority of the data in a dataset. Outliers may represent natural variation in the population, but in most cases they are caused by errors in the data collection process, such as incorrect data entry, equipment failure, or other measurement errors. If outliers are not handled, they can affect the results of statistical analysis and the accuracy of the model being developed.

Outlier Detection

Outlier Detection with Sorting Methods

The sorting method is the simplest way to detect outliers. Quantitative data can be sorted from low to high and inspected manually for very low or very high values. In Python, sorting can be done with the sorted function.

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
sorted_data = sorted(data)
print("Data:", data)
print("Sorted Data:", sorted_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
Sorted Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 100] 

A value of 100 is very high compared to the other data, so 100 can be flagged as an outlier. However, this method is less reliable because outliers are not determined using statistical calculations.

Outlier Detection using the Histogram Method

Histograms can be used to visualize data and check whether a set of data contains outlier values. Data that lies far outside the main body of the distribution is treated as an outlier. The disadvantage of this method is the same as the previous one: outliers are determined only by visual observation of the data.

import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.hist(data)
plt.show()

Output:

(Histogram of the data.)

In the histogram above, most of the values are concentrated between roughly 0 and 40, while one value, 100, is separated from the rest of the data and can be categorized as an outlier.

Outlier Detection with Box-Plot

A box plot is a graphical summary of the sample distribution that shows the shape of the distribution (skewness), its central tendency, and its spread. Five statistical measures can be read from a box plot: the minimum, maximum, Q1, median, and Q3. Values outside the box and whiskers can be categorized as outliers.

import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.boxplot(data, vert=False)
plt.title("Detecting outliers using Boxplot")
plt.xlabel('Data')
plt.show()

Output:

(Box plot of the data.)
The value 100, which lies outside the whiskers, can be categorized as an outlier.
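
For reference, the five numbers summarized by the box plot can also be computed directly; a minimal sketch with pandas:

import pandas as pd

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
# Minimum, Q1, median, Q3, and maximum of the data
print(pd.Series(data).quantile([0, 0.25, 0.5, 0.75, 1]))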

Outlier Detection with Z-Score

The criterion for determining outliers with the z-score is that any data point whose absolute z-score is greater than 3 (i.e., more than three standard deviations from the mean) is an outlier. The steps of this technique are:

  1. For all data points, calculate the z-score value using the formula (Xi-mean)/std.
  2. Initialize the threshold value=3 and mark data points that have an absolute z-score value greater than the threshold as outliers.
import statistics as s

def detect_outliers_zscore(data):
    # Threshold initialization
    thres = 3

    # Determine the average value and standard deviation
    mean = s.mean(data)
    std = s.stdev(data)

    outliers = []
    # Determine the z-score value for each data
    for i in data:
        z_score = (i-mean)/std

        # Check whether the data is outliers
        if (abs(z_score) > thres):
            outliers.append(i)

    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_zscore(data) 
print("Outliers:", outliers)

Output:

Outliers: [100]

Outlier Detection with Inter Quartile Range (IQR)

Based on the interquartile range (IQR), a data point is considered an outlier if it lies more than 1.5 times the IQR above Q3 or below Q1. The steps for determining outliers with the IQR are:

  1. Sort the data in ascending order
  2. Calculate 1st and 3rd quartiles (Q1, Q3)
  3. Calculate the value of IQR=Q3-Q1
  4. Calculate the lower limit = (Q1 - 1.5*IQR) and the upper limit = (Q3 + 1.5*IQR)
  5. For each value in the dataset, check whether it lies below the lower limit or above the upper limit. If so, mark it as an outlier.
import pandas as pd

def detect_outliers_iqr(data):
    series = pd.Series(data)

    # Determine Q1, Q3, IQR values
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3-Q1

    # Determine lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = []
    # Check each value against the lower and upper bounds
    for i in data:
        # Check whether the data is outliers
        if (i < lower_bound or i > upper_bound):
            outliers.append(i)

    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_iqr(data) 
print("Outliers:", outliers)

Output:

Outliers: [100]

Outlier Handling

After detecting the outliers in the dataset, the next stage is to handle them. There are several ways to handle the outliers that have been detected.

Trimming

In this technique, the detected outliers are simply removed from the dataset. However, this is generally not considered best practice, because information is lost.

def trimming(data, outlier):
    # Remove every value that appears in the outlier list
    new_data = [x for x in data if x not in outlier]

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = trimming(data, outlier)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27]

Quantile Based Flooring and Capping

In this technique, outliers are capped: values above the 90th percentile are replaced with the 90th percentile value (capping), and values below the 10th percentile are replaced with the 10th percentile value (flooring).

import pandas as pd

def handle_quantile_outlier(data):
    series = pd.Series(data)

    # Determine the 10th and 90th percentiles
    P10 = series.quantile(0.1)
    P90 = series.quantile(0.9)
    new_data = []

    # Replace the data value with P10 for data < P10
    # and with P90 for data > P90
    for x in data:
        if x < P10:
            x = P10
        elif x > P90:
            x = P90
        new_data.append(x)

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_quantile_outlier(data)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [17.1, 17.1, 18, 19, 20, 21, 22, 23, 24, 26.9, 26, 26.9]

Mean/Median Imputation

The mean is strongly influenced by the presence of outliers, so it is recommended to replace outliers with the median value.

import statistics as s

def handle_median_outlier(data, outlier):
    # Determine median value
    median = s.median(data)

    # Replace every value that appears in the outlier list with the median
    new_data = [median if x in outlier else x for x in data]

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = handle_median_outlier(data, outlier)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 21.5, 26, 27]

Log Transformation

Log transformation is a common technique used to reduce the skew in a distribution and make it more symmetric. In this way, the occurrence of extreme values ​​can be reduced and the data becomes more normally distributed.

import math
import matplotlib.pyplot as plt

def handle_log_outlier(data):
    new_data = []
    # Transform each data value with log base 10
    for x in data:
        new_data.append(math.log(x, 10))

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_log_outlier(data)
print("Data:", data)
print("New Data:", new_data)
plt.plot(data, 'red', label='Data')
plt.plot(new_data, 'blue', label='New Data')
plt.legend()
plt.show()

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [1.2041199826559246, 1.2304489213782739, 1.2552725051033058, 1.2787536009528289, 1.301029995663981, 1.322219294733919, 1.3424226808222062, 1.3617278360175928, 1.380211241711606, 2.0, 1.414973347970818, 1.4313637641589871]

(Plot of the original data and the log-transformed data.)

Binning

Data binning is a method of separating or grouping continuous numerical values into discrete intervals called "bins". Binning can be used to simplify the data distribution and assist in statistical analysis and visualization. Several techniques are often used to group data, including:

Equal Width Binning

This technique groups data into bins of equal, predetermined width. Although this method is simple, it is not well suited to data with a skewed distribution.

import pandas as pd 

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}

df = pd.DataFrame(data)

num_bins = 4

# Calculate the bin width
bin_width = (df['age'].max() - df['age'].min()) / num_bins

# Create bin limits
bin_edges = [df['age'].min() + i * bin_width for i in range(num_bins + 1)]

df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True, right=True)

print(df)

Output:

    age        age_bins
0     16  (15.999, 18.75]
1     17  (15.999, 18.75]
2     18  (15.999, 18.75]
3     19    (18.75, 21.5]
4     20    (18.75, 21.5]
5     21    (18.75, 21.5]
6     22    (21.5, 24.25]
7     23    (21.5, 24.25]
8     24    (21.5, 24.25]
9     20    (18.75, 21.5]
10    26    (24.25, 27.0]
11    27    (24.25, 27.0]
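
As a side note, pandas can also derive the equal-width edges itself when an integer number of bins is passed to pd.cut; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]})
# Passing an integer lets pd.cut split the data range into 4 equal-width bins
df['age_bins'] = pd.cut(df['age'], bins=4)
print(df)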

Equal Frequency Binning

In this technique, data is grouped into bins so that each bin contains approximately the same number of data points. This is useful when it is important to maintain the same frequency across bins. It can also deal effectively with outliers and skewed data.

import pandas as pd 

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}

df = pd.DataFrame(data) 
df['age_bins'] = pd.qcut(df['age'], q=3)

print(df)

Output:

    age          age_bins
0     16  (15.999, 19.667]
1     17  (15.999, 19.667]
2     18  (15.999, 19.667]
3     19  (15.999, 19.667]
4     20  (19.667, 22.333]
5     21  (19.667, 22.333]
6     22  (19.667, 22.333]
7     23    (22.333, 27.0]
8     24    (22.333, 27.0]
9     20  (19.667, 22.333]
10    26    (22.333, 27.0]
11    27    (22.333, 27.0]

Quantile Binning

In this technique, data is grouped based on percentile values. The limits of a bin are based on certain percentile values ​​(e.g. 25th, 50th, and 75th percentiles).

import pandas as pd 
import numpy as np

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data) 

# Defines percentiles for bin boundaries
percentiles = [0, 25, 50, 75, 100] # In this case, quartiles are used

# Calculate the bin edges from the percentiles
bin_edges = np.percentile(df['age'], percentiles)

df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True)

print(df)

Output:

    age        age_bins
0     16  (15.999, 18.75]
1     17  (15.999, 18.75]
2     18  (15.999, 18.75]
3     19    (18.75, 20.5]
4     20    (18.75, 20.5]
5     21    (20.5, 23.25]
6     22    (20.5, 23.25]
7     23    (20.5, 23.25]
8     24    (23.25, 27.0]
9     20    (18.75, 20.5]
10    26    (23.25, 27.0]
11    27    (23.25, 27.0]

Scrubbing

Data scrubbing is the process of changing or deleting incomplete, incorrect, inaccurate, or duplicated data in a dataset. This process helps improve data consistency, accuracy, and reliability.

Deleting Duplicate Data

Deleting duplicate records is one way to perform data scrubbing. Duplicates often appear when the dataset is combined from several different sources.

import pandas as pd

data = {'age': [15, 17, 23, 22, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Check if data is duplicated
duplicate_data = df[df.duplicated()]

print("Duplicate Data:")
print(duplicate_data)

# Delete duplicate data
df = df.drop_duplicates()
print("New Data:")
print(df)

Output:

Data:
   age  height
0    15     155
1    17     162
2    23     165
3    22     170
4    17     162
Duplicate Data:
   age  height
4    17     162
New Data:
   age  height
0    15     155
1    17     162
2    23     165
3    22     170

Handling Missing Data

In real cases, a dataset usually contains a lot of missing data. The causes vary widely, from data corruption to device failure while recording measurements.

Deleting Missing Data

Missing data can be handled by deleting the rows or columns that contain NULL values.

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, np.nan, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Deletes rows of missing data
df.dropna(axis=0, inplace=True)

print("New Data:")
print(df)

Output:

Data:
   age  height
0  15.0   155.0
1  17.0   162.0
2  23.0     NaN
3   NaN   170.0
4  17.0   162.0
New Data:
   age  height
0  15.0   155.0
1  17.0   162.0
4  17.0   162.0
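
The example above drops rows; columns containing NULL values can be dropped in the same way by switching the axis. A minimal sketch, using a slightly different toy dataset so that only one column contains a missing value:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [15, 17, 23, np.nan, 17],
                   'height': [155, 162, 165, 170, 162]})

# Drop every column that contains at least one missing value (here: 'age')
print(df.dropna(axis=1))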

Pros:
• A model trained after removing all missing values can be more robust.
Cons:
• Loses a lot of information.
• Works poorly if the percentage of missing values is large compared to the size of the dataset.

Fill in Missing Data with Mean/Median/Mode Values

Missing values in numeric columns can be replaced with the mean, median, or mode of the other data in that column. Unlike the previous method, this technique prevents rows from being lost.

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

df_filled_mean = df.copy()
df_filled_median = df.copy()
df_filled_mode = df.copy()

# Determine mean/median/mode value
mean = df['age'].mean() 
median = df['age'].median()
mode = df['age'].mode().values[0]

# Fill in the data with the mean/median/mode value
df_filled_mean['age'] = df_filled_mean['age'].fillna(mean)
df_filled_median['age'] = df_filled_median['age'].fillna(median)
df_filled_mode['age'] = df_filled_mode['age'].fillna(mode)

print("New Data:")
print("Filled mean:")
print(df_filled_mean)
print("Filled median:")
print(df_filled_median)
print("Filled mode:")
print(df_filled_mode)

Output:

Data:
   age  height
0  15.0     155
1  17.0     162
2  23.0     165
3   NaN     170
4  17.0     162
New Data:
Filled mean:
   age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  18.0     170
4  17.0     162
Filled median:
   age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162
Filled mode:
   age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162

Pros:
• Prevents the data loss caused by deleting rows or columns.
• Works well with small datasets and is easy to implement.
Cons:
• Only works with numeric continuous variables.
• May cause data leakage.

Fill in Missing Data in Categorical Columns

When missing data is found in a categorical column (whether the categories are characters or numbers), it can be filled in with the most frequent category. If a large amount of data is missing, the missing values can instead be assigned to a new category (a sketch of this approach follows the example below).

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'fair', 'fair', 'very good', np.nan]}
df = pd.DataFrame(data)

print("Data:")
print(df)

most_category = df['impression'].mode().values[0]

# Fill the missing values with the most frequent category
df['impression'] = df['impression'].fillna(most_category)

print("New Data:")
print(df)

Output:

Data:
   age        impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17          NaN
New Data:
   age        impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17         fair
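
When a column has many missing values, assigning them to a new category (as mentioned above) avoids inflating the most frequent category; a minimal sketch using a hypothetical 'unknown' label:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [15, 17, 23, 20, 17],
                   'impression': ['good', 'fair', 'fair', 'very good', np.nan]})

# Put missing values into their own category instead of the most frequent one
df['impression'] = df['impression'].fillna('unknown')
print(df)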

Pros:
• Prevents the data loss caused by deleting rows or columns.
• Works well with small datasets and is easy to implement.
• Avoids losing information by adding a unique category for missing values.
Cons:
• Only works with categorical variables.
• Adding a new category introduces an extra feature during encoding, which may hurt model performance.

Data Type Conversion

Most machine learning models cannot be run directly on categorical data, so categorical data needs to be converted into numerical data. One technique that can be used is one-hot encoding, which represents each categorical value as a binary vector.

import numpy as np

# Categorical data to be converted
colors = ["red", "green", "yellow", "red", "blue"]

# Color list
total_colors = ["red", "green", "blue", "black", "yellow"]

# map each color to numeric
mapping = {}
for x in range(len(total_colors)):
  mapping[total_colors[x]] = x

one_hot_encode = []

# Convert the numeric value of each data
for c in colors:
  arr = list(np.zeros(len(total_colors), dtype = int))
  arr[mapping[c]] = 1
  one_hot_encode.append(arr)

print(one_hot_encode)

Output:

[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
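
For reference, a similar binary representation can be produced with pandas; a minimal sketch (note that, unlike the manual mapping above, only the colors actually present in the data become columns):

import pandas as pd

colors = ["red", "green", "yellow", "red", "blue"]

# Each distinct color becomes its own 0/1 indicator column
one_hot = pd.get_dummies(pd.Series(colors), dtype=int)
print(one_hot)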

Deleting Irrelevant Data

Data is said to be irrelevant when it is not related to the problem being studied. Irrelevant attributes can simply be dropped from the dataset:

import pandas as pd

data = {'age': [15, 17, 23],
        'email': ['ahmad@gmail.com', 'putra@yahoo.com', 'tegar@gmail.com']}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Remove irrelevant attributes
df = df.drop('email', axis=1)

print("New Data:")
print(df)

Output:
Data:
   age              email
0    15   ahmad@gmail.com
1    17   putra@yahoo.com
2    23   tegar@gmail.com
New Data:
   age
0    15
1    17
2    23

Avoiding Structural Errors

Structural errors include typos, inconsistent naming conventions, inconsistent capitalization, and so on. The following is an example of fixing inconsistent capitalization in a categorical feature:

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'Fair', 'fair', 'Very good', 'Good']}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Normalize capitalization to lowercase
df['impression'] = df['impression'].str.lower()

print("New Data:")
print(df)

Output:

Data:
   age    impression
0    15         good
1    17         Fair
2    23         fair
3    20    Very good
4    17         Good
New Data:
   age    impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17         good
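
Typos and inconsistent naming can be handled in a similar way, for example with a correction mapping and pandas replace; a minimal sketch with hypothetical misspellings:

import pandas as pd

df = pd.DataFrame({'impression': ['good', 'fiar', 'fair', 'vry good', 'good']})

# Hypothetical mapping from misspelled categories to their correct form
corrections = {'fiar': 'fair', 'vry good': 'very good'}
df['impression'] = df['impression'].replace(corrections)
print(df)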

Closing

In conclusion, employing effective data cleaning methods not only enhances the reliability of your analyses but also paves the way for informed decision-making, ensuring that your data-driven journey is built on a solid foundation of accuracy and integrity.
