ML | Stochastic Gradient Descent (SGD)
Last Updated: 03 Mar, 2025
Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning, particularly well suited to large datasets. It is a variant of the traditional gradient descent algorithm that offers advantages in efficiency and scalability, making it the go-to method for many deep learning tasks.
To understand SGD, it’s essential to first comprehend the concept of gradient descent.
What is Gradient Descent?
Gradient descent is an iterative optimization algorithm used to minimize a loss function, which represents how far the model’s predictions are from the actual values. The main goal is to adjust the parameters of a model (weights, biases, etc.) so that the error is minimized.
The update rule for the traditional gradient descent algorithm is:
[Tex]\theta = \theta - \eta \nabla_\theta J(\theta)[/Tex]
Where:
- [Tex]\theta[/Tex] represents the model parameters,
- [Tex]\eta[/Tex] is the learning rate (the step size), and
- [Tex]\nabla_\theta J(\theta)[/Tex] is the gradient of the loss function [Tex]J(\theta)[/Tex] with respect to the parameters.
In traditional gradient descent, the gradients are computed based on the entire dataset, which can be computationally expensive for large datasets.
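Before moving on, a minimal sketch of what one full-batch update looks like in code may help; the names X_bias, y, theta and eta below are illustrative assumptions rather than part of any library:
Python
import numpy as np

# One full-batch gradient descent step for linear regression with an MSE loss.
# X_bias: (m, n) features with a leading column of ones, y: (m, 1) targets,
# theta: (n, 1) parameters, eta: learning rate -- all names are illustrative.
def batch_gd_step(X_bias, y, theta, eta=0.1):
    m = len(X_bias)
    # Gradient of the MSE loss computed over the ENTIRE dataset
    grad = 2 / m * X_bias.T.dot(X_bias.dot(theta) - y)
    return theta - eta * grad
Because the gradient is averaged over all m samples, every update requires a full pass over the data, which is exactly the cost SGD avoids.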
Need for Stochastic Gradient Descent
For large datasets, computing the gradient using all data points can be slow and memory-intensive. This is where SGD comes into play. Instead of using the full dataset to compute the gradient at each step, SGD uses only one random data point (or a small batch of data points) at each iteration. This makes the computation much faster.
Path followed by batch gradient descent vs. path followed by SGD:
[Figure: Optimization path followed by batch Gradient Descent]
[Figure: Optimization path followed by SGD]
Working of Stochastic Gradient Descent
In Stochastic Gradient Descent, the gradient is calculated for each training example (or a small subset of training examples) rather than the entire dataset.
The update rule becomes:
[Tex]\theta = \theta - \eta \nabla_\theta J(\theta; x_i, y_i)[/Tex]
Where:
- [Tex]x_i[/Tex] and [Tex]y_i[/Tex] represent the features and target of the i-th training example.
- The gradient [Tex]\nabla_\theta J(\theta; x_i, y_i)[/Tex] is now calculated for a single data point or a small batch.
The key difference from traditional gradient descent is that, in SGD, the parameter updates are made based on a single data point, not the entire dataset. The random selection of data points introduces stochasticity, which can be both an advantage and a challenge.
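As a hand-worked illustration with made-up numbers: suppose a single parameter currently has the value [Tex]\theta = 2.0[/Tex], the learning rate is [Tex]\eta = 0.1[/Tex], and the gradient computed on one randomly picked sample is [Tex]0.5[/Tex]. One SGD step then gives:
[Tex]\theta \leftarrow 2.0 - 0.1 \times 0.5 = 1.95[/Tex]
The next step uses a different sample whose gradient may point in a different direction, which is exactly where the noise in SGD's updates comes from.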
Implementing Stochastic Gradient Descent from Scratch
Step 1: Data Generation
In this step, we generate synthetic data for the linear regression problem. The data consists of feature X and the target y, where the relationship is linear, i.e., y = 4 + 3 * X + noise.
- X is a random array of 100 samples with values between 0 and 2.
- y is the target, calculated from a linear equation with a little random noise added to make it more realistic.
Python
import numpy as np
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
For a linear regression with one feature, the model is described by the equation:
[Tex]y = \theta_0 + \theta_1 \cdot X[/Tex]
Where:
- [Tex]\theta_0[/Tex] is the intercept (the bias term),
- [Tex]\theta_1[/Tex] is the slope or coefficient associated with the input feature [Tex]X[/Tex].
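In vectorized form, the same model can be written with a column of ones prepended to X so that both parameters live in a single vector; this is the convention the implementation below follows. The numbers in this quick sketch are made up purely for illustration:
Python
import numpy as np

# The model y_hat = theta_0 + theta_1 * X written with a bias column of ones
theta = np.array([[4.0], [3.0]])                         # [theta_0, theta_1], illustrative values
X_demo = np.array([[0.5], [1.0], [1.5]])                 # a few made-up inputs
X_demo_bias = np.c_[np.ones((len(X_demo), 1)), X_demo]   # prepend the bias column
y_hat = X_demo_bias.dot(theta)                           # predictions: 4 + 3 * X_demo
print(y_hat.ravel())                                     # [5.5 7.  8.5]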
Step 2: Define the SGD Function
Here we define the core function for Stochastic Gradient Descent (SGD). The function takes the input data X and y. It initializes the model parameters, performs stochastic updates for a specified number of epochs, and records the cost at each step.
- theta is the parameter vector (intercept and slope), initialized randomly.
- X_bias is X augmented with a column of ones for the bias term (intercept).
In each epoch, the data is shuffled, and for each mini-batch (or single sample), the gradient is calculated, and the parameters are updated. The cost is calculated as the mean squared error, and the history of the cost is recorded to monitor convergence.
Python
def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)
    theta = np.random.randn(2, 1)

    # Add a bias term to X (X_0 = 1)
    X_bias = np.c_[np.ones((m, 1)), X]
    cost_history = []

    for epoch in range(epochs):
        # Shuffle the data at the beginning of each epoch
        indices = np.random.permutation(m)
        X_shuffled = X_bias[indices]
        y_shuffled = y[indices]

        for i in range(0, m, batch_size):
            # Select a mini-batch or a single sample
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute the gradient
            gradients = 2 / batch_size * X_batch.T.dot(X_batch.dot(theta) - y_batch)

            # Update the parameters (theta)
            theta -= learning_rate * gradients

        # Calculate and record the cost (Mean Squared Error) once per epoch
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")

    return theta, cost_history
Step 3: Train the Model Using SGD
In this step, we call the sgd() function to train the model. We specify the learning rate, number of epochs, and batch size for SGD.
Python
# Train the model using SGD
theta_final, cost_history = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1)
Output:
[Console output: the epoch number and cost printed every 100 epochs]
Step 4: Visualize the Cost Function
After training, we visualize how the cost function evolves over epochs. This helps us understand if the algorithm is converging properly.
Python
import matplotlib.pyplot as plt
# Plot the cost history
plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()
Output:
[Plot: cost (MSE) versus training epochs]
Step 5: Plot the Data and Regression Line
In this step, we visualize the data points and the fitted regression line after training. We plot the data points as blue dots and the predicted line (from the final theta) as a red line.
Python
# Plot the data and the regression line
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()
Output:
[Plot: data points in blue with the fitted SGD regression line in red]
Step 6: Print the Final Model Parameters
After training, we print the final parameters of the model, which include the slope and intercept. These values are the result of optimizing the model using SGD.
Python
print(f"Final parameters: {theta_final}")
Output:
Final parameters: [[4.35097872] [3.45754277]]
The final parameters returned by the model are:
[Tex]\theta_0 \approx 4.35, \quad \theta_1 \approx 3.46[/Tex]
The fitted linear regression model is therefore:
[Tex]y = 4.35 + 3.46 \cdot X[/Tex]
This means:
- When [Tex]X = 0[/Tex], [Tex]y \approx 4.35[/Tex] (the intercept or bias term).
- For each unit increase in [Tex]X[/Tex], [Tex]y[/Tex] increases by about 3.46 units (the slope or coefficient).
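As an optional sanity check that is not part of the original walkthrough, the SGD estimates can be compared with the closed-form least-squares solution, which for this synthetic data should land near the true generating values of 4 and 3:
Python
# Compare the SGD estimates with the closed-form least-squares solution
X_bias = np.c_[np.ones((X.shape[0], 1)), X]
theta_closed_form, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
print(f"Closed-form parameters: {theta_closed_form.ravel()}")
print(f"SGD parameters:         {theta_final.ravel()}")
The two estimates typically differ only slightly; the gap reflects the stochastic noise in the SGD updates.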
Advantages of Stochastic Gradient Descent
- Efficiency: Because it uses only one or a few data points to calculate the gradient, SGD can be much faster, especially for large datasets. Each step requires fewer computations, leading to quicker convergence.
- Memory Efficiency: Since it does not require storing the entire dataset in memory for each iteration, SGD can handle much larger datasets than traditional gradient descent.
- Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).
- Online Learning: SGD is well-suited for online learning, where the model is trained incrementally as new data arrives rather than on a static dataset (a minimal sketch follows below).
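To illustrate the online-learning point above, here is a minimal sketch using scikit-learn's SGDRegressor and its partial_fit method; the streaming loop and the synthetic mini-batches are illustrative assumptions:
Python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Incremental (online) training: the model is updated as small batches of data arrive
model = SGDRegressor(learning_rate='constant', eta0=0.01)
rng = np.random.default_rng(0)
for step in range(100):
    X_new = 2 * rng.random((10, 1))                      # incoming features (synthetic)
    y_new = 4 + 3 * X_new[:, 0] + rng.normal(size=10)    # incoming targets (synthetic)
    model.partial_fit(X_new, y_new)                      # update without retraining from scratch
print(model.intercept_, model.coef_)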
Challenges of Stochastic Gradient Descent
- Noisy Convergence: Since the gradient is estimated based on a single data point (or a small batch), the updates can be noisy, causing the cost function to fluctuate rather than steadily decrease. This makes convergence slower and more erratic than in batch gradient descent.
- Learning Rate Tuning: SGD is highly sensitive to the choice of learning rate. A learning rate that is too large may cause the algorithm to diverge, while one that is too small can slow down convergence. Adaptive methods like Adam and RMSprop address this by adjusting the learning rate dynamically during training (a simple decay schedule is sketched after this list).
- Long Training Times: While each individual update is fast, overall convergence can take longer because the steps are noisier than in batch gradient descent.
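As one simple, non-adaptive way to address the tuning issue above, a decaying learning-rate schedule can be used; the inverse-decay formula and its constants below are purely illustrative:
Python
# Illustrative inverse-decay learning-rate schedule (constants are made up)
def learning_rate_schedule(t, eta0=0.1, decay=0.01):
    # eta0: initial learning rate, decay: how quickly the rate shrinks with step t
    return eta0 / (1 + decay * t)

# The learning rate after 0, 100 and 1000 update steps
for t in (0, 100, 1000):
    print(t, learning_rate_schedule(t))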
Variants of Stochastic Gradient Descent
While traditional SGD is a powerful method, there are several improvements and variants designed to improve convergence and stability:
- Mini-batch SGD: Instead of using a single data point, mini-batch SGD uses a small batch of data points to calculate the gradient. This strikes a balance between the efficiency of SGD and the stability of batch gradient descent. It reduces the noise in the updates while maintaining the computational efficiency.
- Momentum: Momentum helps accelerate SGD by adding a fraction of the previous update to the current one. This allows the algorithm to keep moving in the same direction and can help overcome oscillations in the cost function (see the sketch after this list).
- Adaptive Methods (e.g., Adam, RMSprop): These methods dynamically adjust the learning rate for each parameter. Adam, for example, uses both the average of the gradients (first moment) and the average of the squared gradients (second moment) to compute an adaptive learning rate, improving convergence and stability.
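The following is a minimal sketch of the momentum idea described above, reusing the X and y generated earlier; the momentum coefficient of 0.9 and the other hyperparameters are common but illustrative choices:
Python
# Minimal sketch of SGD with momentum on the earlier linear-regression data
theta_m = np.random.randn(2, 1)           # fresh parameters for this sketch
velocity = np.zeros_like(theta_m)         # running "velocity" built from past updates
momentum, eta = 0.9, 0.01                 # illustrative hyperparameters
X_bias = np.c_[np.ones((len(X), 1)), X]

for epoch in range(50):
    for i in np.random.permutation(len(X_bias)):
        xi, yi = X_bias[i:i+1], y[i:i+1]
        grad = 2 * xi.T.dot(xi.dot(theta_m) - yi)       # single-sample gradient
        velocity = momentum * velocity - eta * grad     # keep moving in the same direction
        theta_m += velocity                             # parameter update along the velocity
print(theta_m.ravel())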
Applications of Stochastic Gradient Descent
SGD and its variants are widely used across various domains of machine learning:
- Deep Learning: In training deep neural networks, SGD is the default optimizer due to its efficiency with large datasets and its ability to work with large models. Deep learning frameworks like TensorFlow and PyTorch typically use variants like Adam or RMSprop, which are based on SGD.
- Natural Language Processing (NLP): Models like Word2Vec and transformers are trained using SGD variants to optimize large models on vast text corpora.
- Computer Vision: For tasks such as image classification, object detection, and segmentation, SGD has been fundamental in training convolutional neural networks (CNNs).
- Reinforcement Learning: SGD is also used to optimize the parameters of models used in reinforcement learning, such as deep Q-networks (DQNs) and policy gradient methods.