Machine Learning Log 3: a brief summary of supervised learning

#machinelearning #beginners #supervisedlearning

It's been a while since I wrote a Machine Learning Log entry, because I have been so busy...well, learning machine learning.

Supervised learning is a subset of machine learning. It starts with data that is already labeled and uses that data to make predictions about new, unlabeled instances of similar data.

Now please bear with me as I make up a little story about a problem that could be solved with supervised learning, because I am tired of talking about housing market predictions. :)

Let's say your great grandfather - a prominent violinist - recently passed away and left you his large collection of violins in his will. You, not being musical yourself, want to sell the violins, but don't know how much each one is worth. Your great grandfather kept track of his collection in a little notebook in which he listed the maker of each violin (when known), along with other information like where each instrument was crafted, the date it was made, whether it was made by a luthier (violin maker) or factory produced, the type of varnish that was used, and the type of wood the violin is made of. He also included the price, but that was a long time ago, so you reasonably assume that you could get more for the instruments in today's market.

In this example the details about each violin are the features, or independent variables, and the price that you want to predict is the label, or dependent variable (so-called because its value depends on the features). To make an accurate prediction that reflects today's violin market, you need some current information about prices.

You contact your friend who is a violin expert and dealer. He is very busy and can't personally help you appraise each instrument, but he sends you a spreadsheet of prices for some of the violins he has sold, along with sets of features similar to the ones that your great grandfather kept (maker, wood, varnish, place of origin, date, etc.). This spreadsheet is your dataset that you will use to make a prediction about the price of each of your violins.
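To make the vocabulary concrete, here is a minimal sketch of how a few rows of such a spreadsheet could be split into features and labels. Every row, column name, and value below is invented for illustration; the real spreadsheet would have many more rows and probably different columns.

```python
# Hypothetical rows standing in for the dealer's spreadsheet.
# All values are made up for illustration.
sales = [
    {"maker": "unknown", "origin": "Germany", "year": 1920,
     "luthier_made": False, "wood": "maple", "price": 1200},
    {"maker": "Maker A", "origin": "Italy", "year": 1890,
     "luthier_made": True, "wood": "spruce", "price": 9500},
    {"maker": "Maker B", "origin": "France", "year": 1935,
     "luthier_made": True, "wood": "maple", "price": 4800},
]

LABEL = "price"

# Features (X): everything we know about a violin except its price.
X = [{key: value for key, value in row.items() if key != LABEL} for row in sales]

# Labels (y): the price each violin actually sold for.
y = [row[LABEL] for row in sales]

# One of the inherited violins: it has the same features, but no label yet.
# Predicting that missing label is the whole point of the exercise.
my_violin = {"maker": "unknown", "origin": "Italy", "year": 1905,
             "luthier_made": True, "wood": "spruce"}
```

The dealer's rows are labeled data because each one comes with a price. The violins in your great grandfather's collection are the unlabeled data you want predictions for.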

In order to make these predictions, you will try to find a function that, given the set of features for a violin, approximates the real-world price that the violin sold for. Then you will feed the features of each of your own violins into this function and get a pretty good estimate of how much money each one is worth.

Except, you personally are not very good at math and are also far too lazy to go through hundreds of examples of violins yourself, so you use your computer to find the function for you. The program that does this is a supervised learning algorithm, and the function it finds is called a model.
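Here is a rough sketch of what "finding the function" could look like in the simplest possible case. It assumes a single numeric feature (the violin's age in years) and fits a straight line with ordinary least squares, which is only one of many supervised learning techniques. The ages, prices, and helper names (`fit_line`, `predict_price`) are all made up for this example, and the prices are deliberately perfectly linear so the result is easy to check by hand; real sales data would be noisier and would involve many more features.

```python
# Hypothetical training data: age of each sold violin (feature)
# and the price it sold for (label). All numbers are invented.
ages = [20, 50, 80, 110, 140]
prices = [1500, 3000, 4500, 6000, 7500]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": learn the function from the labeled examples.
slope, intercept = fit_line(ages, prices)


def predict_price(age):
    """The learned function: estimate a price from a violin's age."""
    return slope * age + intercept


# "Prediction": apply the learned function to violins without a price.
my_violin_ages = [35, 95]
estimates = [predict_price(age) for age in my_violin_ages]
print(estimates)  # [2250.0, 5250.0]
```

In practice you would more likely reach for a library than write the fitting step yourself, but the shape of the process is the same: learn from labeled examples, then predict labels for new ones.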

Terms

Dataset - a collection of data being used to make predictions.
Features - also called the independent or X variables. The information you are using to make a prediction.
Labels - also called the dependent or y variable. The thing you are trying to predict.
Supervised learning - a type of machine learning in which a model learns to make predictions on new data from a set of already labeled data.

I hope that this example helped you to understand a little bit about the kind of problem supervised learning can solve. I should add that the violin example is what's called a regression problem - one where you are trying to predict a continuous number (the price) from some features. Supervised learning can also be used in classification problems, where you are trying to predict which category something belongs to, but that's a story for another day.

On that note, I want to begin posting these more frequently, and will be aiming to write about a topic that interests me each Sunday. In my next Machine Learning Log I will provide some links to useful resources that I have enjoyed in my studies. So if you enjoyed this article, stay tuned for that next Sunday!

As usual, if you have anything to add, or any corrections to make, please share in the comments below! I am still learning, so I welcome constructive criticism. ;)