I’m going to create a program that can predict customer churn
Customer churn occurs when subscribers or customers stop doing business with a company within a given period of time. Retaining customers matters because it boosts the company’s revenue and helps the company build meaningful relationships with its customers.
In fact, customer retention is often more valuable than customer acquisition.
Python Program to Predict Customer Churn
Import Libraries
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Load the Dataset In Python
Now I’m going to load my dataset. Since I’m working in Google Colab, I’ll use Google’s colab library to do this.
Following are the commands to import data into Colaboratory:
from google.colab import files
uploaded = files.upload()
After running this cell, click “Choose Files” and upload your customer churn data file.
Load the Data Into A Data Frame
Here I’ve created a variable called df, which is short for data frame. I want to take a look at the data, specifically the first seven rows.
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(7)
And now we are able to take a look at the data frame. We can see all of our columns along the top, and at the very end is our target column, which is called “Churn”. The values in this column are Yes and No: No means the customer did not churn, and of course Yes means the customer did churn. Each row is a customer.
Show the Number Of Rows And Columns
So just type,
df.shape
(7043, 21)
And we can see that there are 7,043 rows, or customers, in this data set, and 21 columns, or data points, on each customer.
Show All of The Columns
df.columns.values
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'], dtype=object)
So, here we can see all of the column names, and immediately I can see some interesting columns like customerID, gender, PhoneService, InternetService, Contract, MonthlyCharges, tenure and, of course, Churn.
Check For Missing Data or NA Values
So in order to do this just type,
df.isna().sum()
After running the cell, I can see the column names on the left and the number of missing values for each column on the right. Every single value is zero, so this tells me that the data has no missing values.
Show Some Statistics
So, to show some statistics just type here,
df.describe()
And here we get some statistics on the data set. Immediately we can see that the maximum tenure was 72 months, which is about six years, the minimum tenure was zero months, and the mean tenure was about 32 months. The maximum monthly charge was about 118 dollars, the minimum monthly charge was about 18.25 dollars, and the mean was about 64.76 dollars per customer.
Get Customer Churn Count
My first question is: how many people are churning, and how many are not churning, i.e. being retained? So, in order to get that count I just have to type,
df['Churn'].value_counts()
Now we can see that 5,174 customers of this company did not churn and 1,869 customers did churn.
Visualize the Count of Customer Churn
So in order to do this, just type,
sns.countplot(x='Churn', data=df)
Now we can visually see those same counts as a bar chart, which makes it a little more obvious that there are more customers staying with the company than customers who left.
What Is The Percentage Of The Customers That Are Leaving?
num_retained = df[df.Churn == 'No'].shape[0]
num_churned = df[df.Churn == 'Yes'].shape[0]
#print the percentage of customers that stayed
print( num_retained / (num_retained + num_churned) * 100, '% of customers stayed with the company.')
#print the percentage of customers that left
print( num_churned / (num_retained + num_churned) * 100, '% of customers left the company.')
73.0 % of customers stayed with the company.
26.0 % of customers left with the company.
As we can see, about 73 percent of the customers stayed with the company and about 26 percent left. This is important because it tells me the data set is very unbalanced. If I picked a customer at random from this data set and had to guess whether or not they churned, you might think I’d have a 50/50 chance, but because of the imbalance I’d do better by simply guessing that every customer stayed: by the time I’d gone through the whole data set, I’d be right for about 73 percent of the customers. So, the model that I build must be better than 73 percent.
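If you’d rather get this baseline directly, a quick supplementary check (an addition, not part of the original walkthrough) is to look at the class shares:
# Share of each class; the larger share is the accuracy a
# "predict the majority class" baseline would achieve
print(df['Churn'].value_counts(normalize=True) * 100)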
Visualize the Churn Count For Both Male And Female
Now I want to visualize the churn count for both male and female. So just type,
sns.countplot(x='gender', hue='Churn', data=df)
As we can see, there doesn’t seem to be much difference here, so it doesn’t look like more males or more females left or stayed with the company; they seem about even based on this chart. So maybe we don’t want to look at gender to figure out why these customers are leaving the company. I don’t think gender has anything to do with it.
Visualize the Churn Count For Internet Service
Now we are going to visualize the churn count for internet service. So just type,
sns.countplot(x='InternetService', hue='Churn', data=df)
Now we are getting back something useful: we can immediately see that the largest group of customers that didn’t churn have DSL internet service, and the largest group of customers that did churn have fiber optic internet service.
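If you want the exact numbers behind this chart, a cross-tabulation is a quick supplementary check (an addition, not part of the original walkthrough):
# Count of retained vs churned customers per internet service type
pd.crosstab(df['InternetService'], df['Churn'])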
Let’s take a look at some of the other columns. For this, I’m going to create:
numerical_features = ['tenure', 'MonthlyCharges']
fig, ax = plt.subplots(1, 2, figsize=(28,8))
df[df.Churn == 'No'][numerical_features].hist(bins=20, color='blue', alpha=0.5, ax=ax)
df[df.Churn == 'Yes'][numerical_features].hist(bins=20, color='orange', alpha=0.5, ax=ax)
So, here I’m looking at two columns, tenure and MonthlyCharges, because they seemed interesting from the beginning, and this time I’ve used histograms instead of bar charts. We can see that most of the customers that are staying have monthly charges somewhere between 20 and about 30 dollars, which is very obvious from the big spike in the blue (retained) histogram. We can also see that the churn count is a lot higher for monthly charges somewhere between 70 and about 100 dollars. So, that seems fairly clear.
And then we’ll look at tenure: most of the customers that churned have a tenure somewhere between zero and about ten months, while the customers that didn’t churn generally have a higher tenure, with a large group somewhere between 65 and 72 months. So, the ones that left have a lower tenure and the ones that stayed have a higher tenure.
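To back this observation up numerically, a quick supplementary check (a sketch using the numerical_features list defined above, not part of the original walkthrough) is to compare the group averages:
# Average tenure and monthly charges for retained vs churned customers
df.groupby('Churn')[numerical_features].mean()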
Remove Unnecessary Columns
As I can see, the customerID column will be useless for the model, so let’s get rid of it from our data set. I’ve created a new variable as follows:
cleaned_df = df.drop('customerID', axis=1)
Look At the Numbers Of Rows And Columns In Data Set
We can look at the number of rows and columns in the data set by running the following command:
cleaned_df.shape
So, now I can see that there are still 7,043 rows, but now there are only 20 columns where before there were 21.
Convert All Of The Non-Numeric Columns To Numeric
Following are the commands to convert the non-numeric columns to numeric values:
# select only columns with object data types (categorical variables)
cat_cols = cleaned_df.select_dtypes(include=['object']).columns
# convert object data types to categorical codes
for col in cat_cols:
    cleaned_df[col] = cleaned_df[col].astype('category').cat.codes
# convert the remaining numeric columns to float
num_cols = cleaned_df.select_dtypes(include=['float', 'int']).columns
cleaned_df[num_cols] = cleaned_df[num_cols].astype('float')
Show the New Dataset Data Types
Now let’s take a look at the new data set and data types. So, just type;
cleaned_df.dtypes
So now we can see that all the data types for our dataset are numbers.
Show The First Five Rows Of The New Data Set
In this cell I want to show the first five rows of our new dataset. So, just type;
cleaned_df.head()
Now we can see that the sample of the data is all numbers, as expected.
Scale the Data
So, I’m going to create the feature data set, which will be called X, and it will contain all of the columns from the cleaned data frame except the Churn column, which is our target.
X = cleaned_df.drop('Churn', axis=1) #Feature data set
y = cleaned_df['Churn'] #target data set
X = StandardScaler().fit_transform(X)
Split the Data Into 80% Training And 20% Testing
Here, I’m going to split the data into eighty percent training and twenty percent testing, so let’s create variables as:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
Create the Model
Now let’s create a variable called model and set it equal to logistic regression as follows:
# create the model
model = LogisticRegression()
#train the model
model.fit(x_train, y_train)
After running these commands, the cell simply displays the fitted LogisticRegression model.
Create the Predictions on the Test Data
In order to make some predictions, let’s create a variable called predictions and set it equal to the model’s predictions on the test data, as follows:
predictions = model.predict(x_test)
#print the predictions
print(predictions)
After running the commands we get the results above. Obviously we can’t see all of the data from printing it, but if you want, you can go one by one and look at each prediction for the test data and compare it with the actual target test data.
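For example, one way to put the predictions and the actual labels side by side (a minimal sketch using the variables defined above) is:
# Compare the first ten predictions with the actual test labels
comparison = pd.DataFrame({'actual': y_test.values, 'predicted': predictions})
print(comparison.head(10))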
Check the Precision, Recall, F1-Score
Here I’m going to check the precision, recall and f1-score for our model;
print(classification_report(y_test, predictions))
Let’s run this and see how well the model did.
So, now we can see that our model has about 91% recall, which is really good, about 85% precision, which is not bad, and an f1-score of about 88%, which is pretty good because the maximum f1-score is one hundred percent. It also has an accuracy of about 82%, which is better than the 73% we’d get from just guessing that every customer stayed. So, this model is useful because it’s better than guessing.
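If you want the accuracy as a single number, a small supplementary check (an addition, not part of the original code) is:
# Compute the accuracy of the model on the test data directly
from sklearn.metrics import accuracy_score
print('Accuracy:', accuracy_score(y_test, predictions))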