Predicting Customer Churn with Naïve Bayes Classifier in Python
Machine Learning algorithms are everywhere in today's tech lingo. From XGBoost to bootstrapped random forests and singular value decomposition, the data science world is full of powerful yet sometimes incomprehensible techniques. These names alone have repelled thousands of curious minds wanting to break into data science; the terms often come across as intimidating and impossible to grasp. In this series of articles I will attempt to dispel these preconceived notions and dismantle the idea that data science is only for the mathematically or technically savvy. Behind the bombastic titles lies straightforward logic that, explained properly, can be grasped by even the most novice mind. This time I will be exploring Naïve Bayes, and I plan to write about other models in future articles.
Naïve Bayes
In today's article we will discuss Naïve Bayes and its predictive prowess. As a way to demonstrate the analytics process, I will explain step by step what actions took place to make our data suitable for this type of analysis, as well as the implications of the results. Naïve Bayes is a probabilistic machine learning algorithm primarily used for classification problems. The question the theorem answers is: given a set of features X = (x1, x2, x3, ..., xk), what is the probability of event Y occurring? The method is based on Bayes' theorem, which can be expressed in the form of:
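P(Y | X) = P(X | Y) × P(Y) / P(X)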
As noted, we can calculate the probability of Y given X by first estimating the probability of the inverse, that is, the probability of X occurring given Y (known as the likelihood), and multiplying it by the prior of Y (the percentage of cases in the dataset where Y occurred). We divide this by the evidence, or the percentage of cases where X occurs, and this gives us P(Y|X). Finally, it is important to emphasize the naivety of the model and where the name derives from: the model assumes feature independence, meaning no variable depends on any other. The final model is therefore a multiplication of the individual conditional probabilities, one per feature, as expressed below:
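P(Y | x1, x2, ..., xk) ∝ P(Y) × P(x1 | Y) × P(x2 | Y) × ... × P(xk | Y)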
Predicting Customer Churn using Bayes
Data Prep and Exploration
Now that we have explained the theoretical framework, let's deep dive into a practical use case. Customer churn has been a headache for companies in the sense that there is no clear way to predict it. Usually, any attempt from management to predict a monthly turnover rate is based on hunches and feeble calculations. Using Naïve Bayes we can build a robust model that provides predictions with greater confidence. I have downloaded telecom customer data from Kaggle, but before I jump into the analytics, I would like to import the Python libraries that will be used throughout the project:
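The original import cell isn't reproduced here, but a typical setup for this kind of analysis looks roughly like the following (the CSV file name is illustrative):

# Core data-wrangling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later for splitting, modeling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Load the telecom customer data downloaded from Kaggle
df = pd.read_csv("telecom_churn.csv")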
We then proceed to rapidly explore the data:
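A quick exploration can be done with pandas' built-in summaries (a sketch; the table referenced below comes from calls like these):

df.head()       # first few rows
df.info()       # column types and non-null counts
df.describe()   # summary statistics for the continuous features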
The table above gives us a quick glimpse of the type of data we are dealing with. The majority of the features are continuous, but some are categorical, such as ContractRenewal and DataPlan.
Let's now check for nulls, just to make sure there aren't any blank values in our dataset:
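A one-liner such as the following does the job (a sketch):

df.isnull().sum()   # number of missing values per column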
We would also like to check the distribution of our response variable, to make sure that it is not skewed one way or the other. We do that in Python by plotting both possible responses (0, 1):
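Something along these lines produces that plot, assuming the response column is named Churn:

# Count and plot how many customers fall into each class (0 = stayed, 1 = churned)
print(df['Churn'].value_counts())
sns.countplot(x='Churn', data=df)
plt.show()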
The data heavily underrepresents the people that have churned (class 1). Mathematically this poses a major issue: our model will predict people that have not churned with greater accuracy, since we have plenty of data to train on, but it will fail to capture the idiosyncrasies of the 1s (people that churn). There are two ways to solve this conundrum: we can either randomly pick rows from the 0s to match the number of 1s (undersampling), or match the 1s to the 0s by duplicating rows (oversampling). Both approaches have caveats, though. The former discards part of the data and results in a significant loss of information; the latter will overfit the model to the 1s and will almost certainly underperform on the validation set. I prefer to lose some information, since overfitting is heavily frowned upon in the data community. To better illustrate both options, take a closer look at the picture:
Now that we have chosen to undersample the majority class (people that have not churned), we can script it in Python:
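A minimal sketch of random undersampling, again assuming the response column is named Churn:

# Separate the minority (churn) and majority (no churn) classes
churned = df[df['Churn'] == 1]
not_churned = df[df['Churn'] == 0]

# Randomly sample the majority class down to the size of the minority class
not_churned_sample = not_churned.sample(n=len(churned), random_state=42)

# Recombine into a balanced dataframe
df_balanced = pd.concat([churned, not_churned_sample])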
Let's get a visual of that by plotting both classes:
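Re-using the same count plot on the balanced dataframe (a sketch):

sns.countplot(x='Churn', data=df_balanced)
plt.show()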
Success! We have now balanced the response variable and can now be confident that our algorithm will perform better.
Splitting the Data
In order to assess the accuracy of our model, we proceed to split the data into a training set and a test set. This method is widely used in the data science world to validate how robust a model is at predicting future observations. As the name suggests, we first train the model on "seen" data and then use the same model equation to predict on the test, or "unseen", data. Usually the split is 70/30, meaning we train the model on 70% of the data and test it on the remaining 30%. We can do the split very easily in Python:
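A sketch of the split, using the balanced dataframe from the previous step and a stratified 70/30 partition:

# Separate features and response
X = df_balanced.drop('Churn', axis=1)
y = df_balanced['Churn']

# 70/30 split, stratified so both sets keep the same churn proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(len(X_train), len(X_test))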
As shown above 675 observations are now part of our training set while 290 are part of our test set, both with equal distribution of churned and not churned customers.
Data Modeling
For the sake of model comparison, we will first run a logistic regression (the standard model when dealing with a categorical response variable) and later compare its results against those produced by the Naïve Bayes algorithm. In Python, fitting a logistic regression model looks something like this:
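A minimal sketch of the fit and the confusion matrix that follows (the exact script used in the original article may differ):

# Fit a logistic regression model on the training data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on the unseen test data
y_pred_lr = logreg.predict(X_test)

# Confusion matrix, normalized to percentages of the test population
cm = confusion_matrix(y_test, y_pred_lr)
print(cm / cm.sum())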
Model fitting and Confusion Matrix
Running the Python script above trains and fits the model and renders a confusion matrix based on the results. This confusion matrix compares the "true" values against what our logistic model predicted, and it reads as follows: each quadrant represents one of the possible scenarios the model could have predicted, compared against the actual data. Quadrant II (upper left) captures the instances where the logistic model predicted no churn and no churn indeed occurred in real life (37.4% of the test population); these are called True Negatives, because we correctly identified the negative values. Quadrant IV (lower right) summarizes the instances where a customer did churn and our model also predicted churn; these are known as True Positives, and they account for 38%. Conversely, Quadrant I (upper right) and Quadrant III (lower left) display the model's inaccuracies. Quadrant I shows that for 12.4% of the test population the model predicted churn when in reality the customer did not churn (False Positives), while Quadrant III shows that for 11.5% of the test population the model wrongly labeled customers who churned as non-churned (False Negatives). Summing Quadrants II and IV and comparing them against Quadrants I and III, we can conclude that our model is correct on roughly 76% of instances and wrong on roughly 24%. By now you may have realized why it is called a confusion matrix, given the various scenarios and interpretations the matrix generates, but I hope this explanation helps you grasp the overall concept.
To further assess the precision of the model, we have printed an ROC curve, where the Y axis shows the True Positive rate and the X axis the False Positive rate. The True Positive rate represents the proportion of actual positives that are correctly predicted as positive; conversely, the False Positive rate represents the proportion of actual negatives that are wrongly predicted as positive. The curve is traced by plotting these two rates at every possible decision threshold. An ROC curve is also accompanied by an AUC (Area Under the Curve); the closer this number is to 1, the better the model is at classifying. In this case our AUC is 0.71.
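The ROC curve and AUC can be produced along these lines (a sketch):

# Predicted probabilities for the positive class (churn = 1)
y_prob_lr = logreg.predict_proba(X_test)[:, 1]

# True/false positive rates at every decision threshold
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')   # reference line for a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()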
Now we are going to run the Naïve Bayes algorithm, to see how the performance of the model improves.
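Fitting the Naïve Bayes model mirrors the logistic regression step; a sketch using scikit-learn's Gaussian Naïve Bayes:

# Fit a Gaussian Naïve Bayes classifier on the same training data
nb = GaussianNB()
nb.fit(X_train, y_train)

# Confusion matrix as percentages of the test population
y_pred_nb = nb.predict(X_test)
cm_nb = confusion_matrix(y_test, y_pred_nb)
print(cm_nb / cm_nb.sum())

# ROC / AUC for the Naïve Bayes probabilities
y_prob_nb = nb.predict_proba(X_test)[:, 1]
fpr_nb, tpr_nb, _ = roc_curve(y_test, y_prob_nb)
print("AUC:", auc(fpr_nb, tpr_nb))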
As shown by its confusion matrix and ROC curve, Naïve Bayes has rendered more promising results. The True Positive and True Negative quadrants together account for 90% of the test data; in other words, our model correctly labeled 90% of the instances. Furthermore, the area under the ROC curve came out at 0.89, or 0.18 points better than the logistic regression (0.71). Based on these results we can conclude that the Naïve Bayes algorithm outperformed the logistic regression model and was roughly 20% more effective at predicting customer churn.
Further work
I have attempted to explain, in an approachable way, how the two algorithms compare to each other, but in future articles I will also explain the implications of using each model and dig deeper into variable relevance. For now we have only surfaced the end results of each model; to fully understand what is driving churn, the variables themselves need further scrutiny. I intend to explore that level of analysis in future articles.