Matrix Operations in Linear Regression
01 Overview
Introduction
This article aims to explain the Matrix operations involved in the different stages of a Linear Regression Model, whether it’s making model predictions or training the model. Having this additional perspective, alongside the Mathematical details, provides a deeper understanding of how Machine Learning Models work. In general, this knowledge equips you to implement the models from a First Principles approach, providing a comprehensive view of the nuances involved and paving the way for more advanced learning in the field.
This article is structured in two main parts: first, we will assume that we have a pre-built Linear Regression model. This will help us focus on the operations involved in making model predictions. In the later part of the article, we’ll shift our focus to the operations under “Model Training”, which is where the Model weights are finalized.
Topics covered in this Article:
Topics not covered in this article:
02 Input Data representation and Model Equation
To get a clear picture of how a Linear Regression model works behind the scenes, let’s begin by looking at how its underlying input data is organized. In Linear Regression, the input data is arranged into features (the variables used for predictions) and observations (individual data points). The target values, which the model tries to predict, are also included in the dataset. The table below gives a simple reference of how the input data is typically organized.
Here:
Example : A dataset with two features and three observations will look like this:
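As a concrete sketch, here is how such a dataset could be laid out with NumPy; the feature and target values below are purely illustrative, not from the article:

```python
import numpy as np

# Hypothetical dataset: 3 observations (rows) and 2 features (columns).
X = np.array([
    [1.0, 2.0],   # observation 1: feature x1 = 1.0, feature x2 = 2.0
    [3.0, 4.0],   # observation 2
    [5.0, 6.0],   # observation 3
])

# Target values: one per observation.
y = np.array([5.0, 11.0, 17.0])

print(X.shape)  # (3, 2): 3 observations x 2 features
print(y.shape)  # (3,): one target per observation
```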
Now, let’s look at the Linear Regression model equation for predicting the target (𝑌)
Key takeaway from this equation - Once we know the values of feature weights and the model intercept, we can plug in the feature values for any observation in the RHS (right hand side) of the above expression and compute the corresponding predicted target value.
But a more pertinent question here is, “how do we get the model predictions for multiple data points” – one by one, or is there a more efficient way? The answer to this question also provides insights into the common steps performed while training a Linear Regression model.
Let’s address this question in the next section.
03 Breakdown of Model equation as Matrix Operations
The previously shown model equation can be written as an Algebraic expression that uses three entities:
This is how these entities look.
Here the Feature and Weight Vectors are Two-Dimensional vectors or Matrices, whereas the Intercept is a scalar (a single number with no dimensions). The dimensions of the Feature vector shown indicate that it holds information for a single data point. As a result, the model prediction for this single data point should also be a single number (a scalar) or a 1x1 Matrix.
To get this 1x1 output, we’ll perform the below Matrix operations.
For the two vectors shown, if we simply take the product of the corresponding elements and add all of them up along with the intercept, we get the RHS of the Linear Regression equation:
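A minimal NumPy sketch of this single-point computation; the feature values, weights, and intercept below are assumed purely for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0])   # feature vector for a single data point (illustrative)
w = np.array([0.5, 1.5])   # assumed feature weights
b = 2.0                    # assumed intercept

# Element-wise product of the two vectors, summed, plus the intercept --
# exactly the RHS of the Linear Regression equation.
y_hat = np.sum(x * w) + b  # equivalently: np.dot(x, w) + b

print(y_hat)  # 0.5*1.0 + 1.5*2.0 + 2.0 = 5.5
```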
The above shown Matrix computations make sense for a single data point. Next, we’ll look at how predictions are made for Multiple data points. First let’s observe how the Feature and Weight matrices look for multiple data points (let’s say n data points):
The notation used in the above shown Feature Matrix is explained using the below visual:
Subscript of X – indicates which Feature we are talking about
Superscript of X – indicates the data point or the observation in the dataset
It’s evident that the representation of Feature Matrix has changed significantly but for both the Weight Vector and the Intercept it remains the same. This is because the Weight Vector is not dependent on the number of data observations, rather it is dependent on the number of features.
The predicted output for these multiple data points will be an nx1 Vector (one prediction for each of the n data points). Below are the Matrix operations to compute this prediction vector:
Here, first the Feature Matrix and the Weight Vector are multiplied to produce an nx1 vector (denoted in the graphic as the intermediate vector). Next, the intercept value is simply added to this vector. This addition is performed element-wise, where the scalar is added to each of the ‘n’ elements to produce the nx1 Predictions vector.
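The same two steps could look like this in NumPy for n = 3 data points (all values are illustrative; `@` performs the matrix-vector product, and NumPy broadcasting handles the element-wise addition of the scalar intercept):

```python
import numpy as np

# Feature matrix for n = 3 observations and 2 features (illustrative values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, 1.5])   # assumed weight vector (one weight per feature)
b = 2.0                    # assumed intercept

# Step 1: matrix-vector product gives the n x 1 intermediate vector.
# Step 2: the scalar intercept is broadcast, i.e. added element-wise.
y_hat = X @ w + b

print(y_hat)  # [5.5, 9.5, 13.5] -- one prediction per observation
```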
Did you notice that regardless of the number of data points (single or multiple), the operations are the same, i.e., the Feature matrix (X) multiplied by the Transposed Weight vector (w), with the result added to the Intercept value?
Let’s formalize this general expression so it captures these Matrix operations to get the Model predictions.
This expression is ‘general’ since:
Up until now, we have explored the Matrix operations involved to make predictions in a Linear Regression Model. With the Model Weights in hand, we can get the Model predictions for both single and multiple data points.
But as you might already know, the model weights aren’t available to start with (unlike the Features and Target Values), rather they need to be learned through model training. This makes it essential to understand the steps involved in training the Model and the associated Matrix Operations. While it might sound like a humongous task, don’t worry: we have already taken care of the foundational concepts of Model Training, and going ahead we just need to build upon them.
Before moving to the specific steps of model training, let’s first briefly look at the overall process.
04 Model training and Loss Computation
When we are building a Linear Regression model from scratch, all we have is the input data, i.e. the Features and the Target. The Model weights are not known at this point. But why do we need Model weights - to make the model predictions, as seen in the previous sections.
To begin the Model training process, we assume some ‘initial values’ of the Feature Weights and the Intercept. These ‘initial weight’ values are randomly chosen; thus, they are expected to be off from the “final optimal values” we aim to achieve (unless we get lucky with the initial random choices). As a result, the model’s initial predictions will also be off or poor copies of the actual Target values.
This quality of prediction (i.e., how close or far the model predictions are from the actual target values) is quantified or measured through a Loss function. It’s called a ‘function’, because different values of the model weights will produce different loss values – thus it is a function between the model weights and the loss.
In summary, the Model Training exercise is an activity of finding the model weights for which we observe the Minimum Loss. Why minimum? Because then the model predictions will be closest to the actual target values.
The most common loss function used in Linear Regression is the Mean Squared Error (MSE), calculated as:
This expression can be broken down into gradual steps for the purpose of understanding the underlying operations:
Now that we understand the computations under the Loss Function, let’s explore the corresponding Matrix operations:
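Assuming the standard MSE formula (the average of the squared errors between predictions and targets), these operations can be sketched as follows; the prediction and target values are illustrative placeholders:

```python
import numpy as np

y = np.array([5.0, 11.0, 17.0])      # actual target values (illustrative)
y_hat = np.array([5.5, 9.5, 13.5])   # model predictions (illustrative)

# Step 1: the Error vector -- element-wise difference of the two vectors.
error = y_hat - y

# Step 2: square each error, then average over the n observations.
mse = np.mean(error ** 2)

print(mse)  # (0.25 + 2.25 + 12.25) / 3, roughly 4.917
```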
Here’s a summary of the model training steps covered so far:
As discussed earlier, model predictions with the initial values of model weights will be off, which will also reflect in the Loss value we just learned to compute. Next, we will take steps to improve the model so that its predictions get closer to the Actual Target values thereby reducing the Loss value.
05 Matrix operations behind Gradient Descent and Weight update
Gradient Descent preview
Starting with the randomly chosen ‘initial values’, we update the Model weights by giving them a small nudge in the right direction (i.e., by either increasing or decreasing their values). If we instead update the weights randomly, it can potentially cause the Loss to go up instead of lowering it. Thus, to update the weights so that the update only lowers the loss, we take into account how these weights impact the Loss. This calls for computing the derivative of the Loss function with respect to these weights.
We know from Calculus that the derivative of a function w.r.t a variable, gives a directional sense of where the function increases. However, since our goal is to find the direction in which the Loss function decreases, we simply take the negative of the derivative.
The next set of Model weight values is computed from the previous set by adding to them a small number, which is computed using the Loss Derivative and a term called the ‘learning rate’. The expression for this update looks as follows:
Here we first compute the RHS of both the expressions and then update the weight values against the variable on the LHS. Learning rate (the symbol eta) in the expression helps us control how big or small of an update we want to make to the previous weight value.
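The update rule can be sketched as below; the gradient values and learning rate here are placeholder numbers, not computed from data:

```python
import numpy as np

eta = 0.1                        # learning rate (assumed value)
w = np.array([0.5, 1.5])         # current weight vector (illustrative)
b = 2.0                          # current intercept (illustrative)
dL_dw = np.array([1.0, -2.0])    # derivative of Loss w.r.t. w (placeholder)
dL_db = 0.5                      # derivative of Loss w.r.t. b (placeholder)

# Compute the RHS first, then assign the result back to the LHS variable.
# Subtracting eta * derivative moves the weights in the direction of
# decreasing Loss (the negative of the derivative).
w = w - eta * dL_dw
b = b - eta * dL_db

print(w, b)  # approximately [0.4, 1.7] and 1.95
```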
From a terminology point of view, we’ve just covered the core idea behind the Gradient Descent Algorithm for model training.
The most important (and often overlooked) aspect of these expressions is the shape of the derivatives. The derivative of Loss function with respect to the:
Next, we will aim to get to the final expressions of these weight updates by computing the derivative of the Loss function. We’ll also understand side-by-side the Matrix operations required to implement these weight updates.
Since we need to compute the derivative of the Loss function, let’s take a second look at its expression by plugging in the Model Weights.
Next, let’s compute the derivative of the Loss function with respect to the Weight Vector and the Intercept.
Derivative of the Loss function w.r.t the Weight vector
The matrix operations for the last shown expression will be performed as follows.
Below is a breakdown of what’s happening here:
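Assuming the usual MSE gradient for the weight vector, (2/n) * X^T (y_hat - y), the operations described here could be coded as follows (all data values are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0],        # feature matrix, n = 3 observations (illustrative)
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([5.0, 11.0, 17.0])  # actual targets (illustrative)
w = np.array([0.5, 1.5])         # current weights (illustrative)
b = 2.0                          # current intercept (illustrative)

n = X.shape[0]
error = X @ w + b - y            # Error vector: predictions minus targets

# Transpose the Feature Matrix, multiply by the Error vector,
# then scale by 2/n (the 2 comes from differentiating the squared term).
dL_dw = (2.0 / n) * (X.T @ error)

print(dL_dw.shape)  # (2,): one gradient entry per feature weight
```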
Derivative of the Loss function w.r.t the Intercept
The matrix operations required for computing the derivative w.r.t the Intercept are relatively simple and require just the computation of the Error vector. This is followed by taking the average of all Error values in the Vector to get a scalar number (or a matrix with dimension 1x1). Why average? Because we need to apply the Summation from the expression and then divide by the number of observations.
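A sketch of this intercept derivative; the error values are illustrative, and the factor of 2 (from differentiating the squared term in the MSE) is written out explicitly here, though it is sometimes folded into the learning rate:

```python
import numpy as np

# Illustrative Error vector (predictions minus targets) for n = 3 observations.
error = np.array([0.5, -1.5, -3.5])

# Average the errors (summation divided by n), scaled by 2 from the
# derivative of the squared term, to get a single scalar gradient.
dL_db = 2.0 * np.mean(error)

print(dL_db)  # 2 * (-4.5 / 3) = -3.0
```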
Since both the derivatives have been computed, we can write the final weight update expressions:
The above expressions may seem intimidating at first and may not fully explain the process on their own. Fortunately, we’ve already explored the details of how these operations work through Matrices, making them much easier to understand.
Here is a view of how these weight updates look in the Matrix form
Weight vector
Intercept
These weight updates happen iteratively, and we keep track of the Loss value under each iteration. A good point to stop these iterations is when the drop in Loss between consecutive iterations becomes negligible. This concludes the process of Model training, providing us with the Final Model weights that minimize the Loss and give us the most accurate predictions based on the data.
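Putting all the pieces together, a minimal ‘vanilla’ gradient-descent training loop with this stopping rule might look like the following; the function name, toy dataset, and hyperparameter values are all assumptions for illustration:

```python
import numpy as np

def train_linear_regression(X, y, eta=0.01, tol=1e-9, max_iters=50_000):
    """Vanilla (full-batch) gradient descent for Linear Regression.

    Stops when the drop in Loss between consecutive iterations
    falls below `tol`, or after `max_iters` iterations.
    """
    n, d = X.shape
    w = np.zeros(d)      # initial weights (zeros here; could be random)
    b = 0.0              # initial intercept
    prev_loss = np.inf
    for _ in range(max_iters):
        error = X @ w + b - y          # Error vector for all n observations
        loss = np.mean(error ** 2)     # MSE Loss
        if prev_loss - loss < tol:     # negligible improvement -> stop
            break
        prev_loss = loss
        w -= eta * (2.0 / n) * (X.T @ error)  # weight-vector update
        b -= eta * 2.0 * np.mean(error)       # intercept update
    return w, b

# Usage on a toy dataset where y = 2*x1 + 3*x2 + 1 exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0
w, b = train_linear_regression(X, y, eta=0.1)
print(np.round(w, 3), round(b, 3))  # close to [2. 3.] and 1.0
```

Note that, as the article points out, every iteration touches all ‘n’ rows of the Feature Matrix, which is exactly why mini-batch and stochastic variants exist for large datasets.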
The expressions covered in this section not only take us closer to the Mathematical and Operational details of Gradient Descent but also help us see that each weight update requires computations on all ‘n’ observations of the Feature Matrix. Also, to get to the Final Model Weights, we need to perform multiple such iterations. Thus, for large input datasets the Model training process could be quite slow. To overcome this challenge, different variations of the Gradient Descent Algorithm such as Mini Batch Gradient Descent and Stochastic Gradient Descent are used. We’ll not go into these details in this article.
06 Conclusion
In this article we explored the kind of Matrix Operations that are performed both in Training a Linear Regression Model and in using it to make predictions. The approach to Model training covered here is a basic or ‘vanilla’ version meant to convey the core idea behind the Gradient Descent Algorithm. The more advanced approaches to Model training involve tweaking the learning rate as training happens, incorporating Regularization terms to avoid overfitting, and using different, more advanced variants of the Gradient Descent Algorithm itself.
The covered topics and insights go well beyond Linear Regression and serve as a foundation for understanding more complex Machine Learning algorithms. With this knowledge, you can explore advanced topics like Mini-batch Gradient Descent, Stochastic Gradient Descent, and optimization techniques used in Neural Networks. This step-by-step approach also highlights the importance of starting with the basics, helping you build a solid understanding of how models and algorithms truly work.