Learning how the Machines Learn: An Overview of Statistical Bases

Overview

To understand the basics of machine learning, it's important to grasp the foundational concepts. This post discusses inferential vs predictive statistics and regression vs classification. It also reviews six foundational Machine Learning algorithms: Linear Regression, Logistic Regression, K Nearest Neighbors, Naive Bayes, Decision Trees, and Support Vector Machines. We'll also briefly cover the popular loss functions used with each of these algorithms.

Why Is This Important?

The real fun of machine learning comes from implementing neural networks and deep learning. Before we can walk there, we must crawl (sorry). These six algorithms represent the real basics of machine learning, from which more complex systems form. Once we master them, we can start using statistics to predict and generate content. Predict? Yes... that's correct. What, you thought statistics were just for inferences? Well, they can be, but let's discuss the difference.

Inferential vs Predictive

Inferential statistics focuses on the relationships between variables, establishing causal links between independent and dependent variables. Prediction, while not ignoring causality, focuses on the accuracy with which you can predict a certain outcome. To illustrate this difference, let's use climate.

Inferential Statistics - Example

There's consensus that the temperature of the earth is warming, but debate about exactly what's causing it. For the sake of discussion, let's assume we're experts in the domain. If we wanted to understand causation, we would apply inferential principles, gathering data such as tree cover, greenhouse gas emissions, etc. as our independent variables, and some global air temperature data as our dependent variable. We then run some analysis, perhaps a linear regression, and determine which variables have the greatest weight (effect) on that temperature metric. As long as we were cognizant of correlation risks, our results would indicate which variable has the strongest link to global temperatures.

Inferential Statistical Metrics

With inferential statistics, we might focus on p-values that could rule out a null hypothesis, perhaps also considering R-squared (for goodness of fit) on certain models. We won't get into details here, but a small enough p-value lets us statistically rule out the opposite case of what we're trying to prove (the null hypothesis), which is ultimately the goal in establishing causation.
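
As a quick illustration (with totally made-up numbers, not real climate data), here's how you might pull a p-value and R-squared from a simple linear fit using SciPy:

```python
# Minimal sketch: p-value and R-squared from a simple linear fit.
# The data below is purely illustrative, not real climate measurements.
from scipy import stats

emissions = [1.0, 2.1, 2.9, 4.2, 5.1, 6.0]           # independent variable
temperature = [14.1, 14.3, 14.4, 14.7, 14.9, 15.0]   # dependent variable

result = stats.linregress(emissions, temperature)
print(f"slope:     {result.slope:.3f}")
print(f"p-value:   {result.pvalue:.5f}")          # small enough -> reject the null hypothesis
print(f"R-squared: {result.rvalue ** 2:.3f}")     # how much variance the fit explains
```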

Predictive Statistics - Example

Now, returning to our climate dilemma, let's think about predictions. Can we try to predict what the weather will be tomorrow? Well, yes, in fact, we can. And meteorologists do it every day, multiple times a day. Do we care how they got to their conclusion? Maybe. But we really care how accurate they are. Perhaps that's why you hear "AccuWeather" as a brand name for forecasting technology.

Predictive Statistics Metrics

With predictive statistics, we focus on things like a confusion matrix, which considers false positives, false negatives, true positives, and true negatives. From here we dive right into Accuracy, which is a measure of correct predictions (the sum of true positives and true negatives) against all observations (the sum of observed positives and negatives). This leads us to measure how "far off" our predicted values are from our observed values. Error, in other words.
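
Here's a minimal sketch of those metrics, using a handful of made-up rain/no-rain predictions and plain NumPy:

```python
import numpy as np

# Hypothetical rain / no-rain forecasts vs. what actually happened (1 = rain).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all observations
print(tp, tn, fp, fn, accuracy)              # 3 3 1 1 0.75
```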

Classifier vs Regression

Now that we have reviewed some of the statistical foundations of Machine Learning, we can focus on predictive analytics. Let's do a quick reminder of some differences between regression and classification methods, and then we'll dig into some algorithms.

Regression

With a regression, the goal is to reduce all of the complexities of your data set to a simpler, underlying relationship. We know it won't be perfect, but hopefully it's close. We can think of it as trying to UNIFY the data.

Classification

With classification, we SEPARATE the data by making clear distinctions. We look at a big mass of info and start divvying it up.

Loss Function

Circling back to error, it's a good time to delve into the idea of loss functions. This is critical to understanding how these programs perform optimization. Error, or in many cases Mean Squared Error (MSE), is a popular term. When we use it as a loss function, we're constantly iterating our main algorithm to try and minimize the MSE. This is done by gradient descent: analyzing how rapidly our MSE changes with respect to each parameter and adjusting those parameters in the direction that reduces the error. This is a mouthful, but ultimately loss functions represent an inaccuracy in our model that we're trying to reduce.
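
To make that concrete, here's a rough sketch of gradient descent minimizing MSE for a simple line fit. The data and learning rate are made up for illustration:

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1, plus a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0    # parameters we are optimizing
lr = 0.01          # learning rate (step size)

for _ in range(5000):
    y_pred = w * x + b
    error = y_pred - y
    mse = np.mean(error ** 2)          # the loss we want to minimize
    grad_w = 2 * np.mean(error * x)    # d(MSE)/dw
    grad_b = 2 * np.mean(error)        # d(MSE)/db
    w -= lr * grad_w                   # step against the gradient
    b -= lr * grad_b

print(w, b, mse)   # w and b should land near 2 and 1
```

Libraries like scikit-learn run this kind of loop for you; the point here is just to see the loss shrinking as the parameters update.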

Algorithms

So, let's look at six algorithms below with help from a useful blog post (and subsequent diagram) called Daily Dose of Data Science.

Regression

(Chart from Daily Dose of Data Science: regression algorithms and their loss functions)

It's important to remember that some algorithms can be implemented as either regression or classification.

1. Linear Regression

Attempts to find a unifying expression for predicting a continuous (non-discrete) variable. MSE (or RMSE) is the loss function that drives optimization.
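
A minimal sketch, assuming you have scikit-learn and NumPy installed and using made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up continuous data: predict y from a single feature x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.0, 8.2, 9.9])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)   # same units as y, often easier to interpret
print(model.coef_, model.intercept_, mse, rmse)
```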

2. Logistic Regression

Attempts to find a unifying expression for binary (two-class) classification. Cross-Entropy Loss determines how far your predicted probabilities are from the actual classes.
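
Here's what cross-entropy (log loss) looks like computed by hand on a few made-up predictions:

```python
import numpy as np

# Binary cross-entropy (log loss) for a handful of predictions.
y_true = np.array([1, 0, 1, 1, 0])             # actual classes
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probability of class 1

eps = 1e-15                                    # avoid log(0)
y_prob = np.clip(y_prob, eps, 1 - eps)
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(loss)   # lower is better; confident wrong answers are punished hardest
```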

Classification

(Chart from Daily Dose of Data Science: classification algorithms and their loss functions)

3. Decision Tree

A Decision Tree creates a series, or path, of splits (into 2 groups each time) between values of a single variable. Ideal for binary classification, the algorithm creates a split, almost like a rule, that tries to group a certain range of values with certain outcomes. Information Gain details how successful that split is.
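
As a rough sketch, here's Information Gain computed by hand for one hypothetical split (the labels are made up, and this particular split happens to be perfect):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node, then the two children produced by a candidate split.
parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left   = np.array([1, 1, 1, 1])   # rows where the feature <= threshold
right  = np.array([0, 0, 0, 0])   # rows where the feature >  threshold

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(info_gain)   # 1.0 here -- a perfect split removes all uncertainty
```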

4. Support Vector Machines

Tries to subdivide data using a linear demarcation. Hinge loss penalizes points that fall on the wrong side of, or too close to, this "split"; the wider and cleaner the margin, the lower the loss. If this sounds like a vague explanation of hinge loss, well, it is. This article goes into better detail.
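
For a taste of it, here's hinge loss computed by hand on a few made-up points (labels are -1/+1, as SVMs expect):

```python
import numpy as np

# Hinge loss: labels are -1 / +1, and 'scores' are the raw SVM outputs
# (signed distance from the separating boundary).
y_true = np.array([1, 1, -1, -1])
scores = np.array([2.3, 0.4, -1.7, 0.2])   # the last point is on the wrong side

losses = np.maximum(0, 1 - y_true * scores)
print(losses)         # [0.   0.6  0.   1.2]
print(losses.mean())  # points safely beyond the margin contribute zero loss
```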

5. K Nearest Neighbors

This algorithm locates a certain data point in the desired feature space and analyzes the data points around it. It reports a vote of the most likely classification based on the K nearest points to the data point you are trying to predict. Essentially, this algorithm is "lazy" and there's no loss function. You give it an input you're looking to predict, and it reports a vote. There's no optimization effort.
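
A tiny hand-rolled sketch of that voting idea, using one made-up feature and NumPy:

```python
import numpy as np
from collections import Counter

# Tiny one-feature training set: value -> class label (made up).
X_train = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
y_train = np.array(['cold', 'cold', 'cold', 'hot', 'hot', 'hot'])

def knn_predict(x_new, k=3):
    """Vote among the k training points closest to x_new."""
    distances = np.abs(X_train - x_new)
    nearest = np.argsort(distances)[:k]      # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(1.8))   # 'cold'
print(knn_predict(6.2))   # 'hot'
```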

6. Naive Bayes

This algorithm follows from Bayes' theorem, which determines the probability that certain feature values lead to certain classification results. Unlike a Decision Tree, the order of the variables can vary; the "naive" part is the assumption that each variable's effect on the prediction can be treated independently of the others. There's not much to optimize per se; you just iterate through each variable to determine its effect on the classification.
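
A minimal sketch using scikit-learn's GaussianNB on made-up weather data (features and labels are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up weather features: [temperature, humidity] -> did it rain?
X = np.array([[30, 85], [27, 90], [22, 70], [18, 65], [25, 80], [20, 60]])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = rain, 0 = no rain

model = GaussianNB().fit(X, y)
print(model.predict([[26, 88]]))        # predicted class
print(model.predict_proba([[26, 88]]))  # per-class probabilities from Bayes' theorem
```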

Summary

This blog post provides a summary of foundational elements of Machine Learning. It discusses inferential vs predictive statistics, classification vs regression, and then jumps into popular algorithms. We reviewed loss functions, and now you may be ready to jump into neural networks and deep learning.

Sources

Avi Chawla, "An Algorithm-wise Summary of Loss Functions in Machine Learning: Loss functions of 16 ML algorithms in a single frame," Daily Dose of Data Science, Sept 30, 2023.
https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&publication_id=1119889&post_id=137547091&utm_campaign=email-post-title&isFreemail=true&r=2ce3uv&triedRedirect=true

Vagif Aliyev, "A Definitive Explanation to the Hinge Loss for Support Vector Machines," Towards Data Science, Nov 23, 2020.
https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1
