Data Science For Cats: Part 1

#datascience #beginners #machinelearning #ai

Understanding The Problem

Imagine you’re a cat who is obsessed with potato chips and has no idea what data science is. You have a hooman friend who has a lot of data but is too lazy to do anything with it. You love potato chips so much that one day you decide to launch your own brand of tuna flavoured potato chips. You’re not sure whether the hoomans would like your tuna flavoured potato chips, how you should set the price, or what the demand will be in the future. So you call your hooman friend for some advice, because he has a lot of data on chips, and data can do magic.

Your hooman friend agrees to give you the data and show you how to use it. Now that you have the data, you plan to identify your questions and find the answers in the data. First, you want to know whether the hoomans would like your tuna flavour. Your hooman friend explains that if you pick a random person from the hooman race who has eaten chips at least once in his life and ask him whether he likes them, there can be only two answers: yes or no. Similarly, if you ask hoomans which flavour they like best among sour cream, tomato and BBQ, the answer will definitely not be jalapenos. In these types of questions, the answer always comes from a fixed set of options. Your hooman now tells you that you have successfully figured out CLASSIFICATION problems.
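To make the idea concrete, here is a minimal sketch of a classifier in plain Python. It assumes a tiny, invented survey where each hooman is described by two made-up features (whether they like fish, and how many snacks they eat per week) and has answered "yes" or "no". The `classify` helper and the nearest-neighbour voting it uses are just one simple way to do this, chosen for illustration, not the method your hooman friend necessarily has in mind.

```python
from collections import Counter

# Hypothetical survey rows: (likes_fish, snacks_per_week) -> answer.
# These numbers are invented for illustration; they are not real data.
SURVEY = [
    ((1, 5), "yes"),
    ((1, 3), "yes"),
    ((0, 4), "no"),
    ((0, 1), "no"),
    ((1, 1), "yes"),
    ((0, 6), "no"),
]


def classify(features, rows=SURVEY, k=3):
    """Return the majority answer among the k most similar survey rows."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Sort the survey by similarity to the new hooman and keep the k closest.
    nearest = sorted(rows, key=lambda row: squared_distance(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# A new hooman who likes fish and eats 4 snacks a week.
prediction = classify((1, 4))
print(prediction)
```

Whatever features you feed in, the result can only ever be one of the labels that already appear in the survey. That closed set of possible answers is what makes this a classification problem.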

Now you start thinking about your other questions. How can you get a basic idea of the price? You check your data and see that a 16oz pack of Hay’s onion and sour cream flavoured chips costs $3.66, and an 8oz pack of Tingles tomato salsa flavoured chips costs $2. You notice that your data holds various information about the chips, such as packet size, flavours, ingredients and so on, and that the prices are not always $3.66 or $2. Depending on features like size or ingredients, the price varies within a range. For example, if the first 5 samples of chips are priced at $2.19, $4.10, $3.50, $2.20 and $2.50, there is no rule that the price of the 6th sample has to be one of these exact values. It can be $1.99, or $4.50, depending on how complex the flavour profile is and how big the pack is. You make a mental note that your hooman friend calls this a REGRESSION problem.
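As a rough sketch of what regression looks like in practice, the snippet below fits a straight line through pack size and price using ordinary least squares, written out by hand in plain Python. The first two rows are the Hay’s and Tingles packs mentioned above; the other rows, the `fit_line` helper, and the choice of pack size as the only feature are assumptions made purely for illustration.

```python
# (pack size in oz, price in dollars).
# The first two rows come from the examples in the text;
# the remaining rows are invented for illustration.
CHIPS = [(16, 3.66), (8, 2.00), (10, 2.50), (12, 2.90), (20, 4.40)]


def fit_line(points):
    """Least-squares fit of price = slope * size + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(CHIPS)

# Estimate a price for a 14oz pack, a size that is not in the data.
predicted_price = slope * 14 + intercept
print(f"Estimated price for a 14oz pack: ${predicted_price:.2f}")
```

Notice that the estimate does not have to match any price already in the data. The answer is a number on a continuous scale, which is the difference from classification.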

Hearing you meow enthusiastically, your hooman friend decides to explain a special type of regression to you. He calls it TIME SERIES regression. Here you try to predict future values of something using its past values, ordered by time. You suddenly realize that your third problem is a time series problem, where you’re trying to predict the demand for potato chips next month using the demand data of this month, the previous month and so on. In other words, next month’s sales can be predicted from the sales records of this month and the months before it. You haven’t understood all the details of this regression yet, but the hooman says he will explain it later.
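Until the hooman explains the details, here is one very simple sketch of the idea, assuming a short list of invented monthly demand figures. It pairs each month with the month that follows it, fits a straight line through those pairs, and applies that line to the latest month. The `forecast_next` helper and the numbers are hypothetical, and real time series methods are usually more careful than this.

```python
# Hypothetical monthly demand (packs sold), oldest month first.
# These numbers are invented for illustration.
DEMAND = [100, 110, 125, 130, 145, 150]


def forecast_next(series):
    """Fit next = a * current + b on consecutive months, then apply it to the last month."""
    xs, ys = series[:-1], series[1:]  # (this month, next month) pairs
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a * series[-1] + b


next_month = forecast_next(DEMAND)
print(f"Forecast for next month: {next_month:.0f} packs")
```

The only input here is the past of the same quantity, in time order. Shuffle the months and the forecast stops making sense, which is why time series problems get their own name.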

Now the hooman thinks you are ready to start some real work with all this data. He believes you have understood how to identify your questions and which approach to take for each of your problems.