Data Science Tutorial - Exploratory Data Analysis and Linear Regression

Introduction

This is my first post, and it comes from my passion for solving business problems through data science. A little about me first 😊

I built my first machine learning model back in 2012, when the concept of data science was still in its infancy at most organizations. The model used an insurance dataset and predicted the outcome of life insurance cover applications that could not be decisioned by the rules engine. These were the complex cases that required a senior underwriter to step in manually, review them and act on them.

I would therefore like to use this platform to share what I have learned with newcomers and anyone else interested in the field of data science and analytics.

Alright, let's jump straight into it. I assume at this stage you have installed a recent version of Python. My environment is set up with Python, Jupyter Notebook and Anaconda.

What are we solving for?

The problem we are solving today is an exploratory analysis of what affects customers' insurance premiums. The challenge was posted on Kaggle, and I encourage you to check out the platform, as there is a lot to learn and many data sets to be found there.

Let’s import our libraries.
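
The import cell is not reproduced in the text, so the following is only a guess at a typical set of imports for the steps in this post. The choice of pandas, NumPy, Matplotlib and scikit-learn is an assumption, not necessarily what the original notebook used.

```python
# Assumed imports for the steps in this post; the original cell is not shown.
import numpy as np                                     # numerical arrays
import pandas as pd                                    # data frames
import matplotlib.pyplot as plt                        # plotting
from sklearn.linear_model import LinearRegression      # the regression model
from sklearn.model_selection import train_test_split   # train/test split
```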

The next step is to load our file and take a closer look at what we are going to work with. This is a very basic data set, but it offers lots of scenarios to explore.
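
As a rough sketch of the loading step, the snippet below reads CSV data into a pandas data frame and prints the first rows. The rows are made up for illustration and are read from an in-memory string so the example runs on its own; the column names (age, sex, bmi, smoker, charges) and the helper name load_insurance are assumptions. With the real download you would pass the path of the CSV file instead.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are not taken from the real data set.
SAMPLE_CSV = """age,sex,bmi,smoker,charges
23,female,24.5,no,2500.0
45,male,31.2,yes,39000.0
36,male,28.0,no,6200.0
52,female,35.1,yes,46000.0
29,female,22.3,no,3400.0
61,male,27.4,no,13500.0
"""


def load_insurance(source):
    """Read the insurance data from a file path or a file-like object."""
    return pd.read_csv(source)


df = load_insurance(io.StringIO(SAMPLE_CSV))
print(df.head())    # first rows, to eyeball the columns
print(df.shape)     # (number of rows, number of columns)
```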

One scenario I would like to see is a plot of age against insurance charges, categorized by smoker status.
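
A plot like this could be drawn along the following lines. This is a minimal sketch using Matplotlib on made-up rows; the helper name scatter_by_smoker and the column names are assumptions, and the original chart may well have been produced with a different library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real data set.
df = pd.DataFrame({
    "age":     [23, 45, 36, 52, 29, 61],
    "smoker":  ["no", "yes", "no", "yes", "no", "no"],
    "charges": [2500.0, 39000.0, 6200.0, 46000.0, 3400.0, 13500.0],
})


def scatter_by_smoker(data, x, y="charges"):
    """Scatter x against y, drawing one colored series per smoker status."""
    fig, ax = plt.subplots()
    for status, group in data.groupby("smoker"):
        ax.scatter(group[x], group[y], label=status)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title="smoker")
    return fig, ax


fig, ax = scatter_by_smoker(df, "age")
```

In a notebook the figure is displayed automatically once the "Agg" line is removed. Passing "bmi" instead of "age" gives the same kind of plot for BMI, provided the frame has a bmi column.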

Immediately you can pick up that non-smokers pay less in insurance charges than smokers. That’s a given, right? Smoking poses health risks to the individual, and the premiums charged by the provider reflect this with higher costs for cover.

The next scenario was interesting to observe. I wanted to show the relationship between BMI (body mass index), insurance charges and smoker status. For non-smokers, a low or a high BMI makes little difference to the insurance premium. It is a different story for smokers: the higher a smoker's BMI, the more they incur in insurance charges. The blue dots clearly show the population in this category.
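
One way to put a number on this pattern is to compute the correlation between BMI and charges separately for smokers and non-smokers. The sketch below does that on made-up rows, so the printed values say nothing about the real data; the helper name bmi_charges_corr_by_smoker and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real data set.
df = pd.DataFrame({
    "bmi":     [22.0, 27.0, 31.0, 36.0, 21.0, 26.0, 32.0, 37.0],
    "smoker":  ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "charges": [18000.0, 24000.0, 38000.0, 45000.0,
                4000.0, 9000.0, 3500.0, 8000.0],
})


def bmi_charges_corr_by_smoker(data):
    """Return the Pearson correlation of bmi and charges for each smoker status."""
    return {
        status: group["bmi"].corr(group["charges"])
        for status, group in data.groupby("smoker")
    }


for status, value in bmi_charges_corr_by_smoker(df).items():
    print(f"smoker={status}: correlation between bmi and charges = {value:.2f}")
```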

How did we choose these features for our analysis? Well, let’s take a look at our correlation matrix and reveal some interesting facts.

Any feature compared against itself yields 1, so the diagonal should be ignored. We want to look at how each feature relates to the other features. In this data set, the highest value is 0.3, which is the correlation between age and charges. We saw earlier that the older you get, the more you pay in insurance. BMI and charges is also worth noting, at 0.11. Each cell in the graph is read as (x, y): the category along the bottom against the category along the side.
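
A correlation matrix like the one described can be computed with the pandas corr() method. The sketch below runs on made-up rows, restricts itself to the numeric columns, and then picks the strongest pair while ignoring the diagonal; the column names are assumptions and the numbers it prints will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real data set.
df = pd.DataFrame({
    "age":     [23, 45, 36, 52, 29, 61],
    "bmi":     [24.5, 31.2, 28.0, 35.1, 22.3, 27.4],
    "charges": [2500.0, 39000.0, 6200.0, 46000.0, 3400.0, 13500.0],
})

numeric_cols = ["age", "bmi", "charges"]
corr = df[numeric_cols].corr()   # Pearson correlation by default
print(corr.round(2))

# Find the strongest relationship, ignoring each feature against itself.
strength = corr.abs().to_numpy().copy()
np.fill_diagonal(strength, 0.0)
i, j = np.unravel_index(strength.argmax(), strength.shape)
strongest_pair = (corr.index[i], corr.columns[j])
print("strongest pair:", strongest_pair)
```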

We are missing something: sex and smoker status. Since these are string values, we would have to convert them to numeric values to get the full correlation matrix, which I will cover in a later post. For now, sex and smoker status are features we have chosen to examine independently. Choosing which features to use is referred to as feature selection.

Building the model

Now that we have explored our data, let us build the model. Since linear regression needs numeric inputs, we need to convert a couple of the text classifications to binary. This means that smoker status (yes or no) turns into (1 or 0) respectively. The unique() method lists the unique values in a column of the data frame. Those unique values are then used to update the smoker column with the binary values.

We also need to do the same for the customer's sex.
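
The two conversions could look roughly like this. The sketch uses unique() to inspect the labels and map() to replace them; yes/no becoming 1/0 follows the text above, while the choice of 0 for female and 1 for male is an arbitrary assumption made only for this example.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "sex":    ["female", "male", "male", "female"],
})

# Inspect the distinct labels before deciding on a mapping.
print(df["smoker"].unique())
print(df["sex"].unique())

# Replace each label with its binary code.
df["smoker"] = df["smoker"].map({"yes": 1, "no": 0})
df["sex"] = df["sex"].map({"female": 0, "male": 1})  # assumed coding

print(df)
```

Any label missing from the mapping would become NaN, so it is worth checking unique() again after the conversion.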

The next line of code splits our data set into a training set and a test set. We need to be able to test the performance of our model on data that it has not seen before. Remember, the “charges” column is our target, as this is what we are trying to predict.
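
A split of this kind is commonly done with scikit-learn's train_test_split. The sketch below assumes that function, a 20% test share and a fixed random_state, none of which are confirmed by the text, and runs on made-up rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded rows for illustration only.
df = pd.DataFrame({
    "age":     [23, 45, 36, 52, 29, 61, 40, 33, 57, 19],
    "bmi":     [24.5, 31.2, 28.0, 35.1, 22.3, 27.4, 30.0, 26.1, 33.3, 21.9],
    "smoker":  [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "charges": [2500.0, 39000.0, 6200.0, 46000.0, 3400.0,
                13500.0, 37000.0, 5100.0, 44000.0, 1800.0],
})

X = df.drop(columns="charges")   # features
y = df["charges"]                # target we want to predict

# Hold back 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```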

The fun begins. We start by fitting the model.
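
As a minimal sketch of the fitting step, assuming scikit-learn's LinearRegression is the model, the snippet below fits on synthetic, noise-free data whose true coefficients are known, so you can see that the fit recovers them. The data here is generated for the example and has nothing to do with the insurance file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features, with no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression()
model.fit(X, y)                  # learn one coefficient per feature plus an intercept

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("prediction for [1, 1]:", model.predict(np.array([[1.0, 1.0]])))
```

On the insurance data you would call fit on the training features and the training charges instead.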

I created two functions to help measure the performance of the model on the train and test sets. I will explain the reasoning behind this further on.
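
The functions themselves are not reproduced in the text, so here is one possible pair, written with NumPy only: one computes R squared and the other computes RMSE. The names r_squared and rmse are assumptions. Calling both on the training predictions and again on the test predictions gives the two sets of numbers to compare.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Tiny hand-checkable example.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(r_squared(y_true, y_pred))   # 0.9
print(rmse(y_true, y_pred))        # about 0.707
```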

The results for the train and test sets are very similar, which suggests that the model generalizes to data it has not seen and is reliable. The R squared is above 0.7, which means the model explains more than 70% of the variance in charges, and the RMSE (root mean squared error) is similar for the train and test sets.

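This is beautiful work. Because the train and test scores are so close, there is no sign of overfitting, and an R squared above 0.7 suggests the model is not badly underfitting either, which is exactly what we are looking for.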

I hope you enjoyed this and learnt something from it. Please keep an eye out for upcoming posts as we go deeper into the world of machine learning.

Thanks, and happy exploring.

More articles by Timothy Besigye
