How Segment is optimizing cloud costs with Graviton instances
At Twilio Segment, we continuously look for opportunities to improve the efficiency of our cloud infrastructure (see, for example, the $10M engineering problem). Recently, we experimented with AWS Graviton instances to reduce the cost of operating our Kubernetes infrastructure.
This effort began as a hackathon project during Think Week, an internal week-long Twilio event where engineers are encouraged to work on any problem without worrying about day-to-day obligations. With contributions from multiple teams, we developed a proof of concept and subsequently rolled out our first production deployment, which is projected to cut the cost of operating our Flink clusters by 35%, and we are only getting started.
In this blog, we share our journey so far, starting with how our platforms were updated to be Graviton compatible, along with some of the challenges we faced along the way.
What are Graviton instances?
AWS Graviton instances are a family of EC2 instances powered by processors that AWS designs in-house on the arm64 CPU architecture. According to AWS, they can deliver up to 40% better price performance than comparable current-generation x86-based instances.
The arm64 CPU architecture is considered more power efficient because it is based on RISC (Reduced Instruction Set Computing), in contrast to the CISC (Complex Instruction Set Computing) design used by Intel processors today. AWS is able to leverage economies of scale and pass these efficiency gains on to customers as lower prices.
Getting started
To get started with Graviton instances, we needed to ensure that both our infrastructure and our applications are compatible with the arm64 CPU architecture, while continuing to support the amd64 (Intel) instances running today.
At a high level, this translated to the following requirements:
Support for building multi-arch docker images
Multi-architecture docker images are key to deploying the same image on different hardware, with the docker client automatically fetching the compatible layers based on the node's CPU architecture.
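A quick way to see this in practice is to inspect an image's manifest list and check which architectures it publishes (alpine is used here purely as an illustration):

docker buildx imagetools inspect alpine:3.19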
There are three main strategies you could use to build multi-arch images:
1. Emulation: Building multi-arch images is made easier with docker buildx, which uses CPU emulation (via QEMU) to build docker images for multiple CPU architectures. Recent versions of docker ship with buildx, so you can build arm64 images without investing in arm64-based CI infrastructure. The main drawback is that emulation can be resource-intensive and lead to longer build times.
2. Cross-compilation: With compiled languages like Go, you can work around slow emulated builds by leveraging the compiler's ability to target a given architecture, without the need for emulation. In a multi-stage build, the time-consuming stages (like code compilation) run natively on the build host, while only the stages close to runtime are built using emulation (see the sketch after this list). Docker has published an excellent article that delves into the topic.
3. Native: The last option is to have dedicated CI infrastructure that builds docker images natively on each CPU architecture. However, designing this workflow is not straightforward; this article goes into great detail on what it takes to implement it. Moreover, our CI infrastructure was in the middle of a major re-architecture, so this was not a viable option for us.
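Here is the cross-compilation sketch mentioned in option 2: a minimal multi-stage Dockerfile for a hypothetical Go service, where the build stage runs natively on the CI host (BUILDPLATFORM) and the Go compiler targets TARGETARCH, so only the thin runtime stage needs emulation. The base images, Go version, and paths are illustrative, not our actual setup.

cat > Dockerfile <<'EOF'
# Build stage runs on the native build platform and cross-compiles for the target
FROM --platform=$BUILDPLATFORM golang:1.21 AS build
ARG TARGETOS TARGETARCH
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app .

# Runtime stage is small, so emulating it is cheap
FROM alpine:3.19
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
EOF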
Considering these limitations, we decided to use docker buildx with emulation to cover the initial use cases, and to revisit the decision if a requirement emerged that could not be addressed with this approach.
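For reference, a typical emulated multi-arch build with buildx looks roughly like this (the builder name, repository, and tag are placeholders):

# One-time setup: create and select a builder that can target multiple platforms
docker buildx create --name multiarch --use

# Build amd64 and arm64 variants in one invocation and push a single multi-arch tag
docker buildx build --platform linux/amd64,linux/arm64 -t example.com/example-service:latest --push .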
Updating Infrastructure services to use multi-arch images
Making our Kubernetes platform Graviton ready involved the following steps:
1. Apply nodeSelector: We use ArgoCD and helm charts to deploy critical infra services (Prometheus, the Datadog agent, etc.) across all EKS clusters using GitOps. For services that do not yet use multi-arch images, we first deployed them with a nodeSelector (beta.kubernetes.io/arch: amd64) to ensure they don't accidentally get scheduled on Graviton instances; a sketch of this pin follows the list.
2. Publish multi-arch images: We maintain a central repository of Dockerfiles that we use to publish upstream docker images, golden base images, etc., into our private ECR repositories. We reviewed the images and updated our tooling to easily pull/publish multi-arch images by passing a flag that triggers docker buildx. Fortunately, the majority of the infra services were open-source tools that already have multi-arch images available, though some of them required a minor version upgrade. A couple did not, and we had to fork them to build custom images. One of them was Kuberhealthy, which we use to continuously verify the core functionality of our Kubernetes clusters. We've contributed our changes upstream (PR) so that you don't need to!
3. Deploy multi-arch images: As we published multi-arch images, we redeployed the services with updated nodeSelectors. We set up a test EKS cluster running Graviton instances to perform a one-time certification and build confidence, while relying on Kuberhealthy and infrastructure alerting to continuously monitor for issues.
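As a concrete sketch of the pin described in step 1 (the namespace and deployment name are illustrative, not our actual workloads):

# Pin a service that does not yet ship a multi-arch image to x86 nodes;
# kubernetes.io/arch is the non-deprecated equivalent of the beta label.
kubectl -n monitoring patch deployment example-exporter --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"beta.kubernetes.io/arch":"amd64"}}}}}'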
Challenges
The process of building multi-arch images didn't go as smoothly as we thought. docker buildx depends on QEMU's binfmt_misc handlers being registered on the build host, which can be done (persistently) with:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
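To confirm that emulation is available, you can check which platforms the active buildx builder reports (a generic check, not a Segment-specific step):

docker buildx inspect --bootstrap | grep -i platforms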
Results
With the build issues sorted, we set out to identify the first set of internal customers to start using Graviton instances.
Running Flink on Graviton
It made sense for us to start with data systems like Flink because they are primarily Java based. They also do not require frequent docker image builds, so slow build times wouldn't impact developer productivity.
We used to run fairly large Flink clusters on AWS EMR, a setup burdened by the additional cost of the EMR surcharge as well as other challenges, such as a poor user experience. Recently, a major re-platforming effort moved Flink to EKS using the Apache Flink Kubernetes Operator, which allowed us to deploy Flink on Graviton instances (r6gd.16xlarge) and avoid the EMR surcharge, ultimately resulting in a 35% reduction in cost.
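For illustration, here is a minimal sketch of how a FlinkDeployment can be pinned to Graviton nodes through its pod template, assuming the Flink Kubernetes Operator is installed; the names, versions, and resources are illustrative, not our production spec.

cat <<'EOF' | kubectl apply -f -
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-session-cluster
spec:
  image: flink:1.17
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "4"
  podTemplate:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # schedule JobManager and TaskManager pods on Graviton nodes
  jobManager:
    resource:
      cpu: 1
      memory: 2048m
  taskManager:
    resource:
      cpu: 4
      memory: 8192m
EOF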
We intend to run other data platform systems like Spark on Graviton in the future!
Running Segment applications on Graviton
We also ran a few experiments on high-throughput applications that had a performance testing pipeline in place. While the results are not representative of all applications at Segment, we felt they gave a sense of what we can expect if we migrate to Graviton.
This gave us confidence that there is an opportunity to reduce the cost of operating Segment's internal applications by 10-15%, but due to the build-time challenges discussed earlier, this effort is on hold until native arm64 CI infrastructure is generally available.
Conclusion
Our Kubernetes platform is now Graviton-ready, which enabled us to migrate our production Flink clusters and cut their operating cost by 35%.
In the near term, we expect more data platform systems to run on Graviton, while we explore ways to introduce native arm64 CI infrastructure to address the slow build times, which is crucial for wider adoption.
Credits to all the Twilions whose contributions made this possible: Abhinav Ittekot, Sudarshan C (Bengaluru Dev Center), Emmy Bryne (Dev platform), Prithvi Dammalapati (Data platform). Special thanks to Liza Zavetz (Content management) for proofreading this blog!