Open In App

ML | Stochastic Gradient Descent (SGD)

Last Updated : 03 Mar, 2025
Comments
Improve
Suggest changes
Like Article
Like
Report

Stochastic Gradient Descent (SGD) is an optimization algorithm in machine learning, particularly when dealing with large datasets. It is a variant of the traditional gradient descent algorithm but offers several advantages in terms of efficiency and scalability, making it the go-to method for many deep-learning tasks.

To understand SGD, it’s essential to first comprehend the concept of gradient descent.

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm used to minimize a loss function, which represents how far the model’s predictions are from the actual values. The main goal is to adjust the parameters of a model (weights, biases, etc.) so that the error is minimized.

The update rule for the traditional gradient descent algorithm is:

[Tex]\theta = \theta – \eta \nabla_\theta J(\theta)[/Tex]

In traditional gradient descent, the gradients are computed based on the entire dataset, which can be computationally expensive for large datasets.

Need for Stochastic Gradient Descent

For large datasets, computing the gradient using all data points can be slow and memory-intensive. This is where SGD comes into play. Instead of using the full dataset to compute the gradient at each step, SGD uses only one random data point (or a small batch of data points) at each iteration. This makes the computation much faster.

Path followed by batch gradient descent vs. path followed by SGD:

gdp

Optimization path followed by Gradient Descent

sgd-1

Optimization path followed SGD Optimization

Working of Stochastic Gradient Descent

In Stochastic Gradient Descent, the gradient is calculated for each training example (or a small subset of training examples) rather than the entire dataset.

The update rule becomes:

[Tex]\theta = \theta – \eta \nabla_\theta J(\theta; x_i, y_i)[/Tex]

Where:

  • [Tex]x_i[/Tex]​ and [Tex]y_i[/Tex]​ represent the features and target of the i-th training example.
  • The gradient [Tex]\nabla_\theta J(\theta; x_i, y_i)[/Tex] is now calculated for a single data point or a small batch.

The key difference from traditional gradient descent is that, in SGD, the parameter updates are made based on a single data point, not the entire dataset. The random selection of data points introduces stochasticity, which can be both an advantage and a challenge.

Implementing Stochastic Gradient Descent from Scratch

Step 1: Data Generation

In this step, we generate synthetic data for the linear regression problem. The data consists of feature X and the target y, where the relationship is linear, i.e., y = 4 + 3 * X + noise.

  • X is a random array of 100 samples between 0 and 2.
  • y is the target, calculated using a linear equation with a little random noise to make it more realistic.
Python
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  
y = 4 + 3 * X + np.random.randn(100, 1) 


For a linear regression with one feature, the model is described by the equation:

[Tex]y = \theta_0 + \theta_1 \cdot X[/Tex]

Where:

  • [Tex]\theta_0[/Tex]​ is the intercept (the bias term),
  • [Tex]\theta_1[/Tex] is the slope or coefficient associated with the input feature [Tex]X[/Tex].

Step 2: Define the SGD Function

Here we define the core function for Stochastic Gradient Descent (SGD). The function takes the input data X and y. It initializes the model parameters, performs stochastic updates for a specified number of epochs, and records the cost at each step.

  • theta is the parameter vector (intercept and slope) initialized randomly.
  • X_bias is the augmented X with a column of ones added for the bias term (intercept).

In each epoch, the data is shuffled, and for each mini-batch (or single sample), the gradient is calculated, and the parameters are updated. The cost is calculated as the mean squared error, and the history of the cost is recorded to monitor convergence.

Python
def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)  
    theta = np.random.randn(2, 1) 
    
    # Add a bias term to X (X_0 = 1)
    X_bias = np.c_[np.ones((m, 1)), X]

    cost_history = []  

    for epoch in range(epochs):
        # Shuffle the data at the beginning of each epoch
        indices = np.random.permutation(m)
        X_shuffled = X_bias[indices]
        y_shuffled = y[indices]

        for i in range(0, m, batch_size):
            # Select a mini-batch or a single sample
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute the gradient
            gradients = 2 / batch_size * X_batch.T.dot(X_batch.dot(theta) - y_batch)

            # Update the parameters (theta)
            theta -= learning_rate * gradients

        # Calculate and record the cost (Mean Squared Error)
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")

    return theta, cost_history

Step 3: Train the Model Using SGD

In this step, we call the sgd() function to train the model. We specify the learning rate, number of epochs, and batch size for SGD.

Python
# Train the model using SGD
theta_final, cost_history = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1)

Output:

training-output

Step 4: Visualize the Cost Function

After training, we visualize how the cost function evolves over epochs. This helps us understand if the algorithm is converging properly.

Python
import matplotlib.pyplot as plt

# Plot the cost history
plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()

Output:

file

Step 5: Plot the Data and Regression Line

In this step, we visualize the data points and the fitted regression line after training. We plot the data points as blue dots and the predicted line (from the final theta) as a red line.

Python
# Plot the data and the regression line
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()

Output:

Linear-regression-using-SGD-

Step 6: Print the Final Model Parameters

After training, we print the final parameters of the model, which include the slope and intercept. These values are the result of optimizing the model using SGD.

Python
print(f"Final parameters: {theta_final}")

Output:

Final parameters: [[4.35097872] [3.45754277]]

The final parameters returned by the model are:

[Tex]\theta_0 = 4.3, \quad \theta_1 = 3.4[/Tex]

Then the fitted linear regression model will be:

[Tex]y = 4.3 + 3.4 \cdot X[/Tex]

This means:

  • When X=0, y=4.3(the intercept or bias term).
  • For each unit increase in [Tex]X, y[/Tex] will increase by 3.4 units (the slope or coefficient).

Advantages of Stochastic Gradient Descent

  1. Efficiency: Because it uses only one or a few data points to calculate the gradient, SGD can be much faster, especially for large datasets. Each step requires fewer computations, leading to quicker convergence.
  2. Memory Efficiency: Since it does not require storing the entire dataset in memory for each iteration, SGD can handle much larger datasets than traditional gradient descent.
  3. Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).
  4. Online Learning: SGD is well-suited for online learning, where the model is trained incrementally as new data comes in, rather than on a static dataset.

Challenges of Stochastic Gradient Descent

  1. Noisy Convergence: Since the gradient is estimated based on a single data point (or a small batch), the updates can be noisy, causing the cost function to fluctuate rather than steadily decrease. This makes convergence slower and more erratic than in batch gradient descent.
  2. Learning Rate Tuning: SGD is highly sensitive to the choice of learning rate. A learning rate that is too large may cause the algorithm to diverge, while one that is too small can slow down convergence. Adaptive methods like Adam and RMSprop address this by adjusting the learning rate dynamically during training.
  3. Long Training Times: While each individual update is fast, the convergence might take a longer time overall since the steps are more erratic compared to batch gradient descent.

Variants of Stochastic Gradient Descent

While traditional SGD is a powerful method, there are several improvements and variants designed to improve convergence and stability:

  • Mini-batch SGD: Instead of using a single data point, mini-batch SGD uses a small batch of data points to calculate the gradient. This strikes a balance between the efficiency of SGD and the stability of batch gradient descent. It reduces the noise in the updates while maintaining the computational efficiency.
  • Momentum: Momentum helps accelerate SGD by adding a fraction of the previous update to the current one. This allows the algorithm to keep moving in the same direction and can help overcome oscillations in the cost function.
  • Adaptive Methods (e.g., Adam, RMSprop): These methods dynamically adjust the learning rate for each parameter. Adam, for example, uses both the average of the gradients (first moment) and the average of the squared gradients (second moment) to compute an adaptive learning rate, improving convergence and stability.

Applications of Stochastic Gradient Descent

SGD and its variants are widely used across various domains of machine learning:

  • Deep Learning: In training deep neural networks, SGD is the default optimizer due to its efficiency with large datasets and its ability to work with large models. Deep learning frameworks like TensorFlow and PyTorch typically use variants like Adam or RMSprop, which are based on SGD.
  • Natural Language Processing (NLP): Models like Word2Vec and transformers are trained using SGD variants to optimize large models on vast text corpora.
  • Computer Vision: For tasks such as image classification, object detection, and segmentation, SGD has been fundamental in training convolutional neural networks (CNNs).
  • Reinforcement Learning: SGD is also used to optimize the parameters of models used in reinforcement learning, such as deep Q-networks (DQNs) and policy gradient methods.




Next Article

Similar Reads

  翻译: