Mastering Kubernetes Scaling: HPA, VPA, KEDA, and Cluster Autoscaler

Scaling in Kubernetes refers to the process of adjusting the resources available to an application in response to changes in demand.

Scaling in Kubernetes can be done at the pod level or at the cluster level.

Pod Level

Pod scaling involves adjusting the number of replicas of a pod or deployment based on specific metrics, such as CPU or memory utilization. This is typically achieved using one of the following:

  1. Horizontal Pod Autoscaling (HPA)
  2. Vertical Pod Autoscaling (VPA)
  3. Kubernetes Event-driven Autoscaling (KEDA)

Cluster Level

Cluster scaling, on the other hand, involves adjusting the size of the Kubernetes cluster itself based on resource utilization or other metrics. This is typically achieved using cluster autoscaling.

While pod scaling is useful for ensuring that your applications are always running at the optimal number of replicas, cluster scaling is useful for ensuring that your entire Kubernetes environment is properly provisioned to handle the workload.


It's worth noting that both pod and cluster scaling can be used together to achieve optimal resource utilization and performance. For example, you can use HPA to scale individual pods or deployments based on specific metrics and use cluster autoscaling to adjust the size of the cluster by adding more nodes based on overall resource utilization.


Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that automatically scales the number of replicas in a Deployment, ReplicaSet, or StatefulSet based on CPU utilization, memory utilization, or custom metrics. HPA works by periodically querying the metrics API of the Kubernetes API server to obtain the current utilization of the specified metric. If the current utilization exceeds or falls below the defined target utilization, HPA adjusts the number of replicas up or down, respectively.
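Under the hood, the HPA controller derives the desired replica count from the ratio of the observed metric to the target; per the Kubernetes documentation, the calculation is roughly:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

The result is then clamped to the configured minReplicas and maxReplicas.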




Imagine you're running a web application that allows users to upload and share images. When many users are using your application, the server hosting your application might become overloaded and slow down, leading to a poor user experience.


To prevent this from happening, you can use Horizontal Pod Autoscaling (HPA) in Kubernetes to automatically adjust the number of pods running your application based on demand.


For example, you might configure HPA to scale up the number of pods running your application when CPU utilization exceeds a certain threshold. When the workload decreases, HPA can automatically scale down the number of pods running your application to save resources.

Here's an example of how you might configure HPA for your web application:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

In this example, we're creating an HPA that will scale the number of replicas of our web-app-deployment deployment based on CPU utilization. Specifically, we've set averageUtilization to 70, which means that the HPA will attempt to maintain an average CPU utilization of 70% across all of the replicas.


We've also set minReplicas to 1 and maxReplicas to 10, which means that the HPA will always ensure that there is at least 1 replica of the deployment running, but will not scale beyond 10 replicas.


With this configuration, Kubernetes will automatically adjust the number of replicas of our web-app-deployment based on CPU utilization. If CPU utilization exceeds 70%, Kubernetes will scale up the number of replicas to handle the increased workload. Conversely, if CPU utilization drops below 70%, Kubernetes will scale down the number of replicas to save resources.
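To make that concrete with the numbers above: if the deployment were running 3 replicas at an average of 95% CPU, the controller would aim for ceil(3 * 95 / 70) = ceil(4.07) = 5 replicas, which is still within the maxReplicas cap of 10. Assuming you save the manifest above as web-app-hpa.yaml (the file name is just a placeholder), you can apply it and watch the HPA react:

# create (or update) the HPA, then watch its current/target utilization and replica count
kubectl apply -f web-app-hpa.yaml
kubectl get hpa web-app-hpa --watch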


Overall, HPA is a powerful way to ensure that your applications are always running at optimal capacity, without requiring manual intervention. It's a great tool for ensuring that your users have a fast and responsive experience, even during periods of high demand.


Vertical Pod Autoscaling (VPA)

While HPA is a powerful tool for scaling the number of replicas of a deployment, it is not designed to adjust the CPU and memory requests and limits of individual pods.


HPA is based on the assumption that each pod has a fixed CPU and memory request and limit, which allows Kubernetes to adjust the number of replicas to handle the workload. However, in certain scenarios, the actual CPU and memory utilization of a pod may be significantly different from its requested values.


This can happen, for example, when a pod experiences a sudden spike in traffic that exceeds its CPU and memory limits, causing performance issues or even crashes. In this scenario, simply adding more replicas of the pod may not be sufficient to address the issue, as each replica would still have the same CPU and memory limits and requests.


This is where VPA comes in. VPA is designed to adjust the CPU and memory requests and limits of individual pods based on their actual usage, allowing Kubernetes to allocate resources more efficiently and prevent performance issues or crashes.


For example, you might configure VPA to increase the CPU and memory requests and limits of your application pods when their CPU and memory utilization exceeds a certain threshold. When the workload decreases, VPA can automatically decrease the CPU and memory requests and limits to save resources.

Here's an example of how you might configure VPA for your web application:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       web-app-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 128Mi
      maxAllowed:
        cpu: 500m
        memory: 512Mi        

In this example, we're creating a VPA that will automatically adjust the CPU and memory requests and limits of our web-app-deployment deployment based on the actual utilization of the pods.


We've set updateMode to "Auto" so that the VPA actually applies its recommendations, evicting and recreating pods with updated requests when needed. We've also set minAllowed and maxAllowed values for both CPU and memory for all containers in the pod, which means the VPA will keep the containers' CPU and memory requests and limits within these boundaries.


With this configuration, Kubernetes will automatically adjust the CPU and memory requests and limits of our web-app-deployment based on actual utilization, ensuring that the application always has enough resources to handle the workload.
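Once the VPA has observed the pods for a while, you can inspect the recommendations it has computed for each container (web-app-vpa matches the example above):

# the output includes a Recommendation section with lower bound, target, and upper bound values
kubectl describe vpa web-app-vpa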


Overall, VPA is a powerful tool for ensuring that your application pods have the right amount of CPU and memory resources to handle the workload. It's a great way to prevent performance issues and crashes during periods of high demand, without requiring manual intervention.


HPA + VPA (Is it possible?)

Yes, you can use both HPA and VPA on the same deployment in Kubernetes.

For example, you might use HPA to scale the number of replicas of your deployment based on demand, while also using VPA to adjust the CPU and memory requests and limits of individual pods based on their actual usage.


This would allow Kubernetes to allocate resources more efficiently and ensure that your deployment has the right amount of CPU and memory resources to handle the workload, without requiring manual intervention.
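A common way to keep the two controllers from competing over the same metric is to let the HPA react to CPU while restricting the VPA to memory. Here's a minimal sketch of that pattern, reusing the web-app-deployment example; it assumes your VPA version supports the controlledResources field, and the object name is a placeholder:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # VPA manages memory only; CPU scaling is left to the HPA

With this split, the HPA from the previous section continues to add or remove replicas based on CPU utilization, while the VPA right-sizes memory for each pod.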


Implementing both HPA and VPA in Kubernetes can present several challenges that need to be addressed to ensure that your deployment is running efficiently and effectively. Some of the main challenges include:


  1. Configuring the tools properly: To use both HPA and VPA effectively, you need to configure them properly and ensure that their configuration does not conflict with each other. This requires a good understanding of the tools and their configuration options.
  2. Fine-tuning the configuration: To achieve optimal results for your specific workload, you may need to fine-tune the configuration of each tool. For example, you may need to adjust the thresholds used by HPA to scale the number of replicas or the resource limits used by VPA to adjust the CPU and memory requests and limits of individual pods.
  3. Managing resource usage: Using both HPA and VPA can lead to complex resource management challenges, particularly if your deployment has a large number of pods or high resource requirements. You need to ensure that your nodes have enough resources to handle the workload, and that you're not wasting resources by over-provisioning.
  4. Monitoring and troubleshooting: When using both HPA and VPA, it's important to monitor your deployment carefully to ensure that it's running efficiently and effectively. You also need to be prepared to troubleshoot any issues that arise, such as performance problems or crashes.

Overall, using both HPA and VPA can provide a powerful and flexible solution for scaling and optimizing your Kubernetes workloads. However, it requires careful planning, configuration, and management to ensure that it's working effectively and efficiently.


KEDA - Aren't HPA and VPA enough already?

Kubernetes Event-driven Autoscaling (KEDA) is a tool that allows you to scale workloads in Kubernetes based on external events such as messages from message queues or events from event grids. While HPA and VPA are designed to scale based on resource utilization, KEDA is specifically designed to scale based on event-driven workloads.


There are several reasons why KEDA is necessary and why it cannot be handled by HPA or VPA:

  1. Different scaling triggers: KEDA allows you to scale workloads based on external events, such as the number of messages in a message queue or the number of events in an event grid. In contrast, HPA and VPA are designed to scale based on resource utilization, such as CPU and memory usage. Scaling based on external events requires a different approach than scaling based on resource utilization.
  2. Custom metrics support: KEDA allows you to scale based on custom metrics, such as the number of messages in a message queue or the number of events in an event grid. HPA and VPA are designed to scale based on standard metrics such as CPU and memory usage. Scaling based on custom metrics requires a more flexible and extensible approach than scaling based on standard metrics.
  3. Fine-grained control: KEDA allows you to control the scaling behavior of your workloads with fine-grained control, such as how many replicas to add or remove based on the number of messages in a message queue. HPA and VPA are designed to scale automatically based on pre-defined thresholds. Fine-grained control allows you to optimize the scaling behavior of your workloads based on your specific requirements.


Here is an example of a YAML file that creates a KEDA ScaledObject for a deployment that reads messages from an Azure Queue:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-scaledobject
spec:
  scaleTargetRef:
    name: my-deployment
  pollingInterval: 10
  cooldownPeriod: 30
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: azure-queue
    metadata:
      accountName: my-storage-account
      queueName: my-queue
      connectionFromEnv: AZURE_STORAGE_CONNECTION_STRING
    authenticationRef:
      name: azure-storage-auth        

In this example, the ScaledObject includes the name of the Kubernetes deployment (my-deployment), the minimum and maximum number of replicas (1 and 10, respectively), and a scaling trigger based on the number of messages in the Azure Queue (type: azure-queue). The metadata section includes the name of the Azure Storage account (my-storage-account), the name of the queue (my-queue), and the environment variable that holds the storage connection string (connectionFromEnv). The authenticationRef section references a KEDA TriggerAuthentication resource, which in turn points at the Kubernetes secret containing the Azure Storage connection string.
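For completeness, here's a minimal sketch of what the referenced azure-storage-auth TriggerAuthentication could look like; the secret name (my-storage-secret) and key (connection-string) are placeholders for your own secret:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-storage-auth
spec:
  secretTargetRef:
  - parameter: connection        # the azure-queue scaler's connection string parameter
    name: my-storage-secret      # Kubernetes secret holding the connection string
    key: connection-string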


KEDA uses the Kubernetes HPA (Horizontal Pod Autoscaler) internally to scale the number of replicas of a deployment. In fact, when you create a ScaledObject in KEDA, it automatically creates a Kubernetes HPA for the deployment specified in the scaleTargetRef field of the ScaledObject.
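You can verify this after deploying the ScaledObject: KEDA names the generated HPA after the ScaledObject (by default keda-hpa-<scaledobject-name>), so for the example above you should see an entry like keda-hpa-my-scaledobject when you run:

kubectl get hpa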


However, KEDA extends the functionality of the Kubernetes HPA by adding support for other scaling triggers, such as Azure Queue messages, Kafka messages, and custom metrics. This allows you to scale your deployment based on a variety of different conditions, beyond just CPU and memory utilization.

By deploying this ScaledObject in your Kubernetes cluster, KEDA will automatically scale your deployment based on the number of messages in the Azure Queue.


Cluster Autoscaler

The Cluster Autoscaler is a component that runs as a pod in the Kubernetes cluster, which watches for pods that can't be scheduled on nodes because of resource constraints. The Autoscaler then automatically increases the number of nodes in the cluster to accommodate the new pods. This way, it ensures that there is always enough capacity available to run the required number of pods.


The Cluster Autoscaler can scale up and down the cluster based on the demand for resources. When the demand for resources increases, the Autoscaler automatically adds more nodes to the cluster. On the other hand, when the demand for resources decreases, the Autoscaler removes the excess nodes from the cluster to save resources and reduce costs.


In Azure, AKS node pools are backed by Virtual Machine Scale Sets. By default, creating an AKS cluster provisions a system node pool; if you want to run specific workloads on dedicated nodes, you can add user node pools with different VM sizes and configurations.

Here's an example of using cluster autoscaler with a node pool in Azure AKS:

  1. First, make sure that your AKS cluster is already created.
  2. Enable the cluster autoscaler for the AKS cluster:

az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5        

3. You can also configure the autoscaler for a specific node pool by setting its minimum and maximum number of nodes. For example, to enable it on a node pool named mynodepool with a minimum of 1 and a maximum of 5 nodes:

az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name mynodepool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5

4. Monitor the cluster autoscaler to see how it adjusts the number of nodes based on demand. You can use the kubectl get nodes command to see the current number of nodes in the node pool.
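For example, a couple of standard kubectl commands for keeping an eye on things:

# list the nodes currently in the cluster
kubectl get nodes
# check for pods stuck in Pending, which is what triggers a scale-up
kubectl get pods --all-namespaces --field-selector=status.phase=Pending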


With these configurations, the cluster autoscaler will automatically adjust the number of nodes in the node pool based on the demand for resources from the workloads running on them. For example, if there are pods in a pending state and no nodes have the capacity to schedule them, the cluster autoscaler will add more nodes to the pool to handle the increased load. Similarly, if a node in the cluster has been idle for a certain period of time (which can be specified in the autoscaling configuration) and there is enough capacity in the cluster to run the workloads, the autoscaler will remove the node to save resources and reduce costs.
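If you want to tune how long a node must sit idle before it's removed, AKS exposes the autoscaler's settings through a cluster autoscaler profile. A sketch of adjusting the scale-down window (the 10m value is purely illustrative):

az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile scale-down-unneeded-time=10m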
