Understanding Logistic Regression

Previously, we discussed linear regression, a method used to predict continuous outcomes. Today let's explore logistic regression, which is essential for binary classification problems in data science.

What is Logistic Regression?

Logistic regression is used to predict a binary outcome (such as Yes/No or True/False) based on one or more input variables. Unlike linear regression, which deals with continuous data, logistic regression estimates the probability that a given input belongs to a certain class.

Applications of Logistic Regression

Spam Detection: Email services apply logistics regression to classify emails as spam or not by understanding the other input variables.
Medical Predictions: Logistic regression can be used to determine the probability of medical conditions, such as predicting heart attacks based on variables like weight and exercise habits.
educational Outcomes: Application aggregators use logistic regression to predict the probability of a student being accepted to a particular university or degree course by analyzing scores

Logistic Regression Equation and Assumptions

Logistic Regression Equation

Logistic regression uses the logistic function (sigmoid function) to map predictions and their probabilities. The sigmoid function is defined as:

Graph of Sigmoid function

if the output of the sigmoid function is greater than 0.5, the model predicts the instance as the positive class (1)
if the output is less than 0.5, the model predicts the instance as the negative class (0)

Interpretation of sigmoid function

The sigmoid function's output can be interpreted as the probability of the instance belonging to the positive class. For example

If the output is 0.7, there is a 70% chance that the instance belongs to the positive class
If the output is 0.2, there is a 20% chance that the instance belongs to the positive class

Assumptions:

Binary Logistics regression requires the dependent variable to be binary That means the outcome variable must have tow possible outcomes, such as 'yes' or 'no'.
Independence of observation The observation should be independent of each other, in other words, the outcome of one instance should not affect the outcome of another.
Linearity of independent variables and log odds Although logistic regression does not require dependent and independent variables to be linearly dependent. it requires that the independent variables are linearly related to log odds.
Absence of multicollinearity The independent variables should not be too highly correlated with each other,
Large sample size" logistic requires large sample size generally you require at least 10 cases with the least frequency outcome for each dependent variables.

Types of Logistic Regression

Binary Logistic Regression: when the dependent variable has two outcomes, such as predicting whether a loan will be approved (yes/no)
Multinomial Logistic Regression: when the dependent variable has more than two discrete outcomes, such as predicting the type of transport a person will choose (car, bike, bus)
Ordinal Logistic Regression: Used when the dependent variable is ordinal, such as survey responses (agree, disagree, unsure)

Conclusion

Logistic regression is a powerful and flexible tool for binary outcome modeling. Its simplicity, interpretability, and effectiveness with linearly separable datasets make it a preferred choice for many binary classification tasks in machine learning and predictive analytics. Understanding its assumptions and best practices ensures the development of robust and reliable models