Machine learning for all
Do you think data science is everyone’s cup of tea? I strongly believe it is. There is a data scientist inside each one of us who pops out whenever the moment calls for it. The data scientist inside my dad used to pop out to predict the next day I might skip school owing to a stomachache. His predictions were based on team India’s cricket schedule for the upcoming calendar year (most of the time, he got his predictions right!). We all must have made such predictions at some point. Now let’s see how the machine learning models behind such predictions are usually built. Most of the time invested in building a machine learning model is spent on readying the base data and identifying a suitable algorithm.
What does base data mean?
In machine learning terminology, base data is a combination of dependent and independent variables. Let me elaborate on this with an example. Suppose we want to come up with an ML model that can predict a sports team’s probability of winning a game.
So, what are the factors from which we can deduce this probability? The team’s win/loss record at the venue, its win/loss record against the opponent, and its current form might play a key role, right? These factors, which will determine the team’s fate in the upcoming match, are called the “independent variables” in machine learning terminology, and the result of the match is termed the “dependent variable”. Typically, we will have “n” independent variables and one dependent variable in a dataset/table. Since we now know what independent and dependent variables are, base data is simply a collection of these variables gathered from past experience.
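To make this concrete, here is a minimal sketch of what such base data could look like for the cricket example, using pandas. The column names and numbers are purely hypothetical, just to show the structure of independent variables plus one dependent variable.

```python
import pandas as pd

# Hypothetical base data: each row is one past match.
# The three left columns are independent variables,
# the rightmost column ("won") is the dependent variable.
base_data = pd.DataFrame({
    "venue_win_rate":    [0.60, 0.40, 0.75, 0.30],  # win rate at this venue
    "opponent_win_rate": [0.55, 0.20, 0.65, 0.45],  # win rate vs this opponent
    "recent_form":       [0.80, 0.30, 0.70, 0.50],  # share of recent games won
    "won":               [1,    0,    1,    0],     # 1 = win, 0 = loss
})

print(base_data)
```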
Modeling:
Once the base data is ready, you will establish the relationship between the independent and dependent variables. Simply put, the relationship we deduce here is the “model” in the machine learning world. Once this model is ready, we should be able to use it to deduce the probability that determines the result of the game.
In the base data, the dependent variable will be populated as either 1 or 0 (where 1 indicates a win and 0 a loss). But the final probabilities we get from our model will be on a scale of 0 to 1. So we must set a threshold beyond which the prediction can be read as a potential win (for example, if the probability is over 0.5, we can say the model is predicting a victory, and vice versa).
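Here is a rough sketch of that idea. I am using scikit-learn’s logistic regression as one common choice for this kind of win/loss outcome (not necessarily the algorithm you would end up picking), and the numbers are made up just to show the probability-plus-threshold step.

```python
from sklearn.linear_model import LogisticRegression

# Independent variables from past matches (hypothetical numbers):
# [venue win rate, win rate vs opponent, recent form]
X = [[0.60, 0.55, 0.80],
     [0.40, 0.20, 0.30],
     [0.75, 0.65, 0.70],
     [0.30, 0.45, 0.50]]
y = [1, 0, 1, 0]  # dependent variable: 1 = win, 0 = loss

# The "model" is the relationship the algorithm learns between X and y.
model = LogisticRegression()
model.fit(X, y)

# Probability of winning an upcoming match with these factor values.
win_probability = model.predict_proba([[0.65, 0.50, 0.60]])[0][1]

# Apply the 0.5 threshold to turn the probability into a prediction.
print(win_probability, "win" if win_probability > 0.5 else "loss")
```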
Modeling demystified:
If you felt a bit overwhelmed going through the above, do not worry. We are going to simplify things further.
I want to take you back to your school days, when you plotted x and y coordinates on a two-dimensional graph. After plotting all the coordinates, you must have ended up joining the plotted points with a line. This line is nothing but our machine learning model.
When you draw a perpendicular from a new x value up to this line, the corresponding y value you read off the graph is nothing but the prediction we have been looking for. I hope this helps!
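Here is that picture as a tiny sketch in code: fit a straight line through some (x, y) points and read off the predicted y for a new x. The points are made up purely for illustration.

```python
import numpy as np

# Hypothetical (x, y) points, like the ones you plotted in school.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line y = slope * x + intercept through the points.
# This line is the "model".
slope, intercept = np.polyfit(x, y, deg=1)

# Prediction: take a new x, go up to the line, and read off the y.
new_x = 6
predicted_y = slope * new_x + intercept
print(predicted_y)  # roughly 12 for these made-up points
```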
However, training a model is not as easy as it sounds. There are numerous algorithms that can help us come up with a suitable model; in the example above, I untangled a simple regression algorithm for your understanding. Fortunately, there are also numerous packages out there that let you leverage the power of these algorithms.
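Scikit-learn is one such Python package (my example, not the only option). The same line-fitting idea from above can be done through its LinearRegression class without writing the math yourself; again, the numbers are made up.

```python
from sklearn.linear_model import LinearRegression

# The same made-up points as before, shaped as rows of independent variables.
X = [[1], [2], [3], [4], [5]]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# The package handles the regression algorithm for us.
model = LinearRegression()
model.fit(X, y)

print(model.predict([[6]]))  # same kind of prediction, roughly 12
```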
Training and testing the model:
Now, without delving deeper into those algorithms, let's discuss the other steps involved in building a model.
Suppose our base data has 20 rows, where the columns on the left correspond to the independent variables and the column on the right is the dependent variable. You will split this data into training and testing subsets in unequal proportions, with the larger chunk going to training (the usual train/test split is 70:30).
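A quick sketch of that split using scikit-learn’s train_test_split, with randomly generated data standing in for the 20 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 hypothetical rows: 3 independent variables each, plus one 0/1 outcome.
rng = np.random.default_rng(42)
X = rng.random((20, 3))          # independent variables
y = rng.integers(0, 2, size=20)  # dependent variable (1 = win, 0 = loss)

# 70:30 split: 14 rows for training, 6 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 14 6
```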
Using the training dataset, you build the model with a suitable algorithm. Then, based on that model, you deduce predictions for the testing dataset. Remember, for the testing dataset we already know the actual outcomes (since both training and test datasets are carved out of the base data).
Now, the number of predictions we got right out of the total number of predictions gives us the accuracy of the model. If that accuracy looks satisfactory, we can expect similar accuracy from the model when it is deployed in the real world.
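Continuing the same sketch, the fit / predict / accuracy loop might look like this. The data is random and hypothetical, so the accuracy number it prints is meaningless; the point is only the sequence of steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical base data: 20 rows, 3 independent variables, one 0/1 outcome.
rng = np.random.default_rng(42)
X = rng.random((20, 3))
y = rng.integers(0, 2, size=20)

# 70:30 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Build the model on the training rows only.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict outcomes for the test rows, whose true outcomes we already know.
predictions = model.predict(X_test)

# Accuracy = right predictions / total predictions on the test set.
print(accuracy_score(y_test, predictions))
```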
Other metrics can be used to measure the quality of a model, but I want to restrict myself to accuracy for now. That’s all! We have covered the important steps involved in building a machine learning model.
Next steps:
Thank you so much for hanging on! If you found the above topic interesting, there are numerous resources on the internet where you can advance your skills. However, I do not want to bombard you with links, so I will share just one: my favorite online tutor, Jose Portilla, who can help anyone get started with Python and machine learning. Here is a link to his course on Udemy:
Here is the link to my GitHub account (It should have some of the models I have built as part of my learning):