Ertugrul

Posted on Mar 10

Comparison of Machine Learning Algorithms...

#python #knn #svm #decisiontree

Important: Click the link below for the Turkish version of this article.

Turkish: https://dev.to/ertugrulmutlu/makine-ogrenme-algoritmalarinin-karsilastirilmasi-4o0d

In this article we will compare the SVM, Decision Tree, and KNN algorithms.

The metrics we will compare:

Accuracy: The ratio of correct predictions to the total number of predictions.

Macro avg Precision Score: The unweighted average of the precision of each class. Precision is the ratio of true positive predictions to all positive predictions made for a class. This shows how reliable the model is when it predicts that class.

Macro avg Recall Score: The unweighted average of the recall of each class. Recall is the ratio of true positive predictions to the total number of actual samples of a class. This indicates how successfully a class was detected.

Macro avg F1 Score: The unweighted average of the F1 score of each class. The F1 score is the harmonic mean of precision and recall. This combines both into a single metric.

Weighted avg Precision Score: The average of the per-class precision, weighted by the number of samples of each class (its support). Classes with more samples contribute more to the score.

Weighted avg Recall Score: The average of the per-class recall, weighted by the number of samples of each class.

Weighted avg F1 Score: The average of the per-class F1 score, weighted by the number of samples of each class.
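To make the difference between the macro and weighted averages concrete, here is a minimal sketch that computes these metrics with scikit-learn. The labels are made up for illustration and are not taken from the glass dataset; the variable names are my own.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for illustration only (not from the glass dataset).
# Class 0 has 4 samples, classes 1 and 2 have 2 samples each.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# Share of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)

# average="macro": plain mean of the per-class scores.
macro_precision, macro_recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# average="weighted": per-class scores weighted by the class sample counts.
weighted_precision, weighted_recall, weighted_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print("accuracy:", accuracy)
print("macro:", macro_precision, macro_recall, macro_f1)
print("weighted:", weighted_precision, weighted_recall, weighted_f1)
```

Note that the weighted recall always equals the accuracy, because weighting each class's recall by its sample count simply counts all correct predictions.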

First, the definitions of the algorithms.

Instead of giving definitions, I found it more appropriate to give you a source that explains them more properly.

KNN (K-Nearest Neighbors):

Source:

Video: https://www.youtube.com/watch?v=v5CcxPiYSlA
Article: https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

DT (Decision Tree):

Source:

Video: https://www.youtube.com/watch?v=ZVR2Way4nwQ
Article: https://medium.com/@MrBam44/decision-trees-91f61a42c724

SVM (Support Vector Machine):

Source:

Video: https://www.youtube.com/watch?v=1NxnPkZM9bc
Article: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

Now we can get started.

First, let's take a look at the dataset I will use.

Dataset features:

Here, we will analyze our CSV using the Pandas library.

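import pandas as pd
csv = pd.read_csv("glass.csv")
print(csv.head())  # first five rows (head is a method, so it needs parentheses)
print(csv.shape)   # (number of rows, number of columns)

To explain the code here in order:

We import the Pandas library.
We read the CSV file with the Pandas library.
Finally, we call the "head" method to get an overview of the CSV file and print "shape" to see the number of rows and columns.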

The output of this code:

As you can see, it gave us an overview of the content of the CSV file. It also told us the number of rows and columns.

This CSV file has:

-214 rows
-10 columns

Now let's get the names of the columns:

import pandas as pd
csv = pd.read_csv("glass.csv")
print(csv.columns)

To explain the code here in order:

We import the Pandas library.
We read the CSV file with the Pandas library.
Finally, we read the "columns" attribute to get the names of the columns in the CSV file.

The output of this code:

As you can see, we got the names of the columns of the CSV file, along with the dtype of the column labels.

This CSV file contains the following columns:

-RI (Refractive index)
-Na (Sodium)
-Mg (Magnesium)
-Al (Aluminum)
-Si (Silicon)
-K (Potassium)
-Ca (Calcium)
-Ba (Barium)
-Fe (Iron)
-Type (Glass type)
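The inspection steps above can be tried without downloading the file. The sketch below builds a tiny in-memory CSV with the same column layout as glass.csv. The two rows are placeholder numbers I made up for illustration, not real rows from the dataset.

```python
import io

import pandas as pd

# Two placeholder rows with the same column layout as glass.csv.
# The numbers are made up for illustration, not copied from the dataset.
csv_text = """RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
1.52,13.0,4.0,1.0,72.0,0.1,8.0,0.0,0.0,1
1.51,14.0,0.0,2.0,73.0,0.0,9.0,1.0,0.0,7
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape          # (number of rows, number of columns)
column_names = list(df.columns)    # column labels as a plain Python list

print(df.head())                   # first rows of the table
print(n_rows, n_cols)
print(column_names)
```

With the real file, the same calls on pd.read_csv("glass.csv") should report the 214 rows and 10 columns mentioned above.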

In this dataset, each glass sample is labeled with its type, and the refractive index and chemical composition of the glass are the features used to identify that type.

Note: For more detailed information, please visit the Source site.

Source

The site where I downloaded the CSV file:
https://www.kaggle.com/datasets/uciml/glass

Now let's move on to our plan:

What We Know

-The data in the CSV file needs to be shaped for use in the algorithms.

-The algorithms need to be written using a library.

-The results need to be presented graphically.

Let's do the data preparation part.

Preparation of Data

First, let's list the libraries I will use:

Sklearn
Pandas
Numpy

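import numpy as np
import pandas as pd
import sklearn.model_selection
data = pd.read_csv("glass.csv", sep=",")  # the original reads self.url inside a class; here we use the same local file
columns = data.columns  # column names; the last one is the target ('Type')
X = np.array(data.drop([columns[len(columns)-1]], axis=1))  # all columns except the target
y = np.array(data[columns[len(columns)-1]])  # the target column
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2)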

To explain the code here in order:

We read our CSV file, using ',' as the separator. (We use the Pandas library for this operation.)
The 'X' data contains the features used to predict the target ('Type'). With this code, we drop the last column ('Type') from the data and convert the rest to an array using the 'Numpy' library.
The 'y' data is what we want to predict (i.e. 'Type'). We convert it to an array using the 'Numpy' library, just like the 'X' data.
Finally, we split this data into train and test sets. In the simplest terms, the train data is used to train the algorithms, and the test data is used to measure how accurate they are. (We set the test share to 20% with the test_size parameter, but you can change it if you wish.)

Note: With larger datasets or more complex algorithms you may need validation data, but we don't need it here because this is a small and simple application.
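To see what the 20% split does to the sizes, here is a small sketch on synthetic data with the same shape as the glass data (214 rows, 9 feature columns). The random values and the random_state argument are my own additions for illustration. Fixing random_state is optional, but it makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the glass data:
# 214 samples, 9 feature columns, labels drawn from 6 arbitrary class ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(214, 9))
y = rng.integers(0, 6, size=214)

# test_size=0.2 keeps 20% of the rows for testing.
# random_state fixes the shuffle so the same split is produced every time.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the identical split.
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)          # (171, 9) (43, 9)
print(np.array_equal(x_test, x_test2))      # True
```

Without random_state, each run shuffles differently, so the reported scores of the algorithms can change from run to run.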

Yes, our data is ready...

Integration of Algorithms:

Here we will integrate our algorithms with the Sklearn library.

-KNN

from sklearn.neighbors import KNeighborsClassifier 
KNN = KNeighborsClassifier(n_neighbors=9)
KNN.fit(x_train,y_train)

To explain the code here in order:

We import the KNeighborsClassifier class from sklearn.neighbors.
We create the KNN classifier. The n_neighbors parameter sets how many nearest neighbors are considered. (The best value may vary according to the project and dataset.)
We train the model with the .fit method, using the x_train and y_train data.
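The effect of n_neighbors can be sketched on a tiny one-dimensional toy set. The points below are made up for illustration: one sample of class 1 sits right next to the class 0 cluster, so the prediction for a nearby query depends on how many neighbors vote.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D points: three class-0 points, one class-1 point close to them,
# and three class-1 points far away.
X_toy = [[0.0], [1.0], [2.0], [2.6], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1, 1]

query = [[2.5]]

# With a single neighbor, only the closest point (2.6, class 1) decides.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
pred_k1 = int(knn_1.predict(query)[0])

# With three neighbors (2.6, 2.0, 1.0), class 0 wins the vote 2 to 1.
knn_3 = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
pred_k3 = int(knn_3.predict(query)[0])

print(pred_k1, pred_k3)  # 1 0
```

A small n_neighbors follows individual points closely, while a larger value smooths the decision, which is why the value is worth tuning per dataset.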

-SVM

from sklearn import svm
Svm = svm.SVC(kernel="linear")  # the kernel name must be passed as a string
Svm.fit(x_train, y_train)

To explain the code here in order:

We import the svm module from sklearn.
We create a Support Vector Classification (SVC) object from the svm module. (Briefly, this class performs classification using the SVM approach.) The kernel hyperparameter can be set to 'linear', 'poly', 'rbf', or 'sigmoid'.
With the .fit method, the model is trained with the x_train and y_train data.
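As a rough illustration of the kernel hyperparameter, the sketch below fits SVC with two of the kernels on a made-up, clearly separable toy set. The data and variable names are assumptions for the example, not part of the glass experiment.

```python
from sklearn import svm

# Made-up 2-D points forming two well-separated groups.
X_toy = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_toy = [0, 0, 0, 1, 1, 1]

# Fit one classifier per kernel; the kernel name is passed as a string.
models = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_toy, y_toy)
    models[kernel] = clf
    print(kernel, "training accuracy:", clf.score(X_toy, y_toy))

# Predict two new points, one near each group, with the linear kernel.
linear_pred = [int(p) for p in models["linear"].predict([[0.5, 0.5], [5.5, 5.5]])]
print(linear_pred)  # [0, 1]
```

On real data the kernels can give quite different results, so it is reasonable to try more than one and compare them on the test set.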

-Decision Tree

from sklearn.tree import DecisionTreeClassifier
Dt = DecisionTreeClassifier(random_state=9)
Dt.fit(x_train,y_train)

To explain the code here in order:

We import the DecisionTreeClassifier class from sklearn.tree.
We create the decision tree classifier. Fixing the random_state parameter makes the result reproducible: the same data gives the same tree on every run.
With the .fit method, the model is trained with the x_train and y_train data.
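What random_state buys us can be checked with a short sketch. The data here is synthetic and only serves the demonstration: two trees built with the same random_state on the same data should make identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for the demonstration: 60 samples, 4 features,
# with a label that depends on the first two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Two separate trees, same data, same random_state.
tree_a = DecisionTreeClassifier(random_state=9).fit(X_toy, y_toy)
tree_b = DecisionTreeClassifier(random_state=9).fit(X_toy, y_toy)

# Compare their predictions on new synthetic points.
X_new = rng.normal(size=(20, 4))
same = bool(np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new)))
print(same)  # True
```

The value 9 itself has no special meaning. Any fixed integer gives a repeatable tree.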

Now that we have integrated our algorithms, we can move on to visualization and comparison.

Visualization and Comparison:

First, let's list the libraries I will use:

matplotlib: In short, Matplotlib is a visualization library. It is simple to use and suitable for writing clean code.

All algorithms need to be trained to make comparisons. The code we will use after training:

dt_report = dt.predict_report(3, dt_x_train, dt_x_test, dt_y_train, dt_y_test)
svm_report = Svc.predict_report(3, svc_x_train, svc_x_test, svc_y_train, svc_y_test)
knn_report = Knear.predict_report(3, knn_x_train, knn_x_test, knn_y_train, knn_y_test)

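In short, the predict_report call gives us the values we want to print on the screen. Note that predict_report is not a scikit-learn method; it belongs to the wrapper classes used around the models in this project.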
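Since the wrapper code is not shown, here is a rough sketch of how a comparable report could be produced directly with scikit-learn's classification_report. The function report_sketch is a hypothetical helper of my own and not the predict_report method used above, and the data is synthetic so the example runs without the CSV file.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def report_sketch(model, x_train, x_test, y_train, y_test):
    """Hypothetical helper: train the model and return its summary metrics."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    report = classification_report(
        y_test, y_pred, output_dict=True, zero_division=0
    )
    # Keep only the rows compared in this article.
    return {
        "accuracy": report["accuracy"],
        "macro avg": report["macro avg"],
        "weighted avg": report["weighted avg"],
    }


# Synthetic two-class data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same model settings as in the article.
reports = {
    "KNN": report_sketch(KNeighborsClassifier(n_neighbors=9), x_train, x_test, y_train, y_test),
    "SVM": report_sketch(SVC(kernel="linear"), x_train, x_test, y_train, y_test),
    "DT": report_sketch(DecisionTreeClassifier(random_state=9), x_train, x_test, y_train, y_test),
}

for name, rep in reports.items():
    print(name, round(rep["accuracy"], 3), round(rep["macro avg"]["f1-score"], 3))
```

Each entry holds the accuracy plus the macro and weighted precision, recall, and F1 score, which are the values compared in this article.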

Sample output (taken from the internet):