How Segment is optimizing cloud costs with Graviton instances

At Twilio Segment, we continuously look for opportunities to improve the efficiency of our cloud infrastructure (see, for example, the $10M engineering problem). Recently, we experimented with AWS Graviton instances to reduce the cost of operating our Kubernetes infrastructure.

This effort began as a hackathon project during Think Week, an internal week-long Twilio event where engineers are encouraged to work on any problem without worrying about day-to-day obligations. With contributions from multiple teams, we successfully developed a proof of concept and subsequently rolled out our first production deployment, which is projected to cut the cost of operating our Flink clusters by 35%. And we are only getting started.

In this blog, we share our journey so far, starting with how our platforms were updated to be Graviton compatible, and some of the challenges we faced along the way.

What are Graviton instances?

AWS Graviton instances are a family of EC2 instances powered by processors that AWS designs in-house, based on the arm64 CPU architecture. According to AWS, they can deliver up to 40% better price performance than comparable current-generation x86-based instances.

The arm64 architecture is considered more power efficient because it is based on RISC (Reduced Instruction Set Computing), in contrast to the CISC (Complex Instruction Set Computing) design used by Intel today. AWS leverages economies of scale and passes the improved efficiency on to customers in the form of lower prices.

Getting started

To get started with Graviton instances, we needed to ensure that both our infrastructure and our applications are compatible with the arm64 CPU architecture, while continuing to support the Intel (amd64) instances running today.

At a high level, this translated to the following requirements:

  1. Developers should be able to build docker images for their applications and deploy them on Graviton instances.
  2. Developers should be able to provision EKS node pools using Graviton instance types.
  3. Infrastructure services, particularly daemonsets, should work on Graviton instances. These services, primarily open-source tools, provide functionality like metrics and logging that is critical for operating in production.

Support for building multi-arch docker images

Multi-architecture docker images are key to deploying the same image on different hardware: the docker client automatically fetches the compatible layers based on the node's CPU architecture.
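
As a quick illustration (the image tag below is just an example), you can inspect a multi-arch image and see one manifest per platform:

# List the per-platform manifests behind a multi-arch tag. At pull time, the
# docker client selects the entry matching the node's CPU architecture.
docker buildx imagetools inspect golang:1.21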

There are three main strategies you could use to build multi-arch images:

1. Emulation: docker buildx makes building multi-arch images easier by using CPU emulation (via QEMU) to build docker images for multiple CPU architectures. Recent versions of docker ship with buildx, so you can build arm64 images without investing in arm64-based CI infrastructure. The main drawback is that emulation can be resource-intensive and lead to longer build times.

[Figure: Publish images for both amd64 and arm64 (via emulation) architectures.]

2. Cross-compilation: With compiled languages like Go, you can work around the slow build times of emulation by leveraging the compiler's ability to target a given architecture natively. The time-consuming stages (like code compilation) run natively, while only the stages close to runtime are built under emulation (see the sketch after this list). Docker has published an excellent article that delves into the topic.

[Figure: Similar to 1), but the "build" stage runs natively and go build cross-compiles to generate binaries for arm64.]

3. Native: The last option is dedicated CI infrastructure that builds docker images natively for each CPU architecture. However, designing this workflow is not straightforward; this article goes into great detail on what it takes to implement. Moreover, our CI infrastructure was in the middle of a major re-architecture, so this was not viable at the time.
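
To make the cross-compilation approach (option 2 above) concrete, here is a minimal sketch for a Go project; the package path and output names are hypothetical, and in a Dockerfile built with buildx the same idea is expressed through the automatic TARGETOS/TARGETARCH build arguments:

# Hypothetical package path: the compile step runs natively on the build host
# and produces binaries for each target architecture without any emulation.
GOOS=linux GOARCH=arm64 go build -o bin/app-arm64 ./cmd/app
GOOS=linux GOARCH=amd64 go build -o bin/app-amd64 ./cmd/app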

Considering these limitations, we decided to use docker buildx with emulation for the initial use cases, and to revisit the approach if requirements emerged that could not be addressed this way.
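
The workflow looks roughly like the sketch below (registry and image names are placeholders); it assumes the QEMU binfmt handlers are registered on the build host:

# One-time setup on an amd64 build host: register QEMU handlers and create a
# buildx builder (recent docker versions ship with buildx).
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
docker buildx create --name multiarch --use

# Build and push a single tag containing both amd64 and arm64 images.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag <account>.dkr.ecr.us-west-2.amazonaws.com/my-service:latest \
  --push .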

Updating Infrastructure services to use multi-arch images

Making our Kubernetes platform Graviton ready involved the following steps:

1. Apply nodeSelector: We use ArgoCD and helm charts to deploy critical infra services like Prometheus, the Datadog agent, etc. across all EKS clusters using GitOps. For services that do not yet use multi-arch images, we first deployed them with a nodeSelector (beta.kubernetes.io/arch: amd64) to ensure they don't accidentally get scheduled on Graviton instances (a sketch of the effect follows this list).

2. Publish multi-arch images: We maintain a central repository of Dockerfiles that we use to publish upstream docker images, golden base images, etc. into our private ECR repositories. We reviewed the images and updated our tooling so that multi-arch images can be pulled and published by passing a flag that triggers docker buildx. Fortunately, the majority of the infra services were open-source tools that already have multi-arch images available, though some required a minor version upgrade. A couple did not, and we had to fork them to build a custom image. One of them was Kuberhealthy, which we use to continuously verify the core functionalities of our Kubernetes clusters. We've contributed our changes upstream (PR) so that you don't need to!

3. Deploy multi-arch images: As we built multi-arch images, we redeployed the services by updating the nodeSelector. We set up a test EKS cluster using Graviton instances to perform a one-time certification and build confidence, while relying on Kuberhealthy and infrastructure alerting to continuously monitor for issues.
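
Since this is all managed with GitOps, the nodeSelector lives in the helm values synced by ArgoCD; the commands below are only a sketch of the effect, with hypothetical namespace and daemonset names:

# Pin a daemonset that only has amd64 images to Intel nodes.
kubectl -n monitoring patch daemonset example-agent --type merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"beta.kubernetes.io/arch":"amd64"}}}}}'

# Once a multi-arch image is published, drop the pin so the pods can also be
# scheduled on Graviton (arm64) nodes.
kubectl -n monitoring patch daemonset example-agent --type json -p \
  '[{"op":"remove","path":"/spec/template/spec/nodeSelector"}]'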

Challenges

The process of building multi-arch images didn't go as smoothly as we thought:

  1. Slow build times: Slower build times have been the main concern during our experiments. We observed a general slowdown of approximately 20-25% using buildx, which is acceptable. However, we encountered a specific case where build times dramatically increased from 3 minutes to 20 minutes. This was because of a dependency on ctlstore, which needs the go-sqlite3 library, which in turn depends on an external C library that compiles significantly slower under emulation.
  2. Binaries not being arm64 compatible: This was a head-scratcher. We had an application's pods crash with an exec error even after making the application image multi-arch. It turned out that the pod's command was overridden to execute a chamber binary, shipped in the base image we recommend for Go projects, that happened to be compiled for the amd64 architecture (a quick check for this is sketched after the workaround below).
  3. Crashing init containers: We forgot to account for the init containers injected into application deployments and a few daemonsets for pre-start configuration. These were crashing even though the primary container was arm64 compatible.
  4. Intermittent build failures: Debian-based builds were failing intermittently. Luckily, we came across this issue, and running the workaround below as a step before each build made the problems go away.

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes        
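
To catch the second class of problem before rollout, it helps to check what architecture the binaries baked into an image were actually compiled for. The image name and binary path below are placeholders:

# Copy a binary out of the arm64 variant of an image and inspect it on the host.
cid=$(docker create --platform linux/arm64 my-go-base:latest)
docker cp "$cid":/usr/local/bin/chamber ./chamber
docker rm "$cid"
file ./chamber   # expect something like "ELF 64-bit LSB executable, ARM aarch64"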

Results

With the build issues sorted, we set out to identify the first set of internal customers to start using Graviton instances.

Running Flink on Graviton

It made sense for us to start with data systems like Flink because they are primarily Java based. They also do not require frequent docker builds, so slow build times wouldn't impact productivity.

We used to run fairly large Flink clusters on AWS EMR, a setup burdened by the additional EMR surcharge and other challenges like a poor user experience. Recently, a major re-platforming effort moved Flink to EKS using the Apache Flink Kubernetes Operator, which allowed us to deploy Flink on Graviton instances (r6gd.16xlarge), avoid the EMR surcharge, and ultimately cut costs by 35%.
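
As a rough illustration of how a job ends up on Graviton nodes, below is a trimmed, hypothetical FlinkDeployment for the operator; names, versions, and resource sizes are placeholders, and real manifests carry much more configuration:

# Hypothetical session cluster pinned to arm64 (Graviton) nodes via the pod template.
kubectl apply -f - <<'EOF'
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-pipeline
spec:
  image: <account>.dkr.ecr.us-west-2.amazonaws.com/flink-job:latest   # multi-arch image
  flinkVersion: v1_16
  serviceAccount: flink
  podTemplate:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # schedule on Graviton (e.g. r6gd) nodes
  jobManager:
    resource: { memory: "4096m", cpu: 2 }
  taskManager:
    resource: { memory: "8192m", cpu: 4 }
EOF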

We intend to run other data platform systems like Spark on Graviton in the future!

Running Segment applications on Graviton

We also ran a few experiments on high-throughput applications that had a performance testing pipeline in place. While the results are not representative of all applications at Segment, we felt they gave a sense of what we can expect if we migrate to Graviton:

[Figure: performance test results for Segment applications on Graviton instances]

Key observations:

  1. We had a service perform 15% faster on Graviton, which was totally unexpected. That effectively meant a 30% reduction in cost.
  2. We saw a significant performance degradation in an HTTP-based application. It turned out that it uses segmentio/encoding, which provides custom implementations of standard Go libraries using SIMD instructions targeted at amd64 that have not been fully ported to arm64.
  3. At the time of testing, Graviton3 was not available for all instance families, at least for general-purpose workloads. We tested with Graviton2, which was slower than the latest generation of Intel instances, though the effective cost savings were still positive.

This gave us confidence that there is an opportunity to reduce the cost of operating Segment's internal applications by 10-15%, but due to the challenges discussed earlier, this effort is on hold until native arm64 CI infrastructure is generally available.

Conclusion

Our Kubernetes platform is now Graviton-ready, which enabled us to migrate our Flink clusters in production and cut their cost by 35%.

In the near term, we expect more data platform systems to run on Graviton while we explore ways to introduce native arm64 CI infrastructure to address slow build times, which is crucial for wider adoption.

Credits to all the Twilions whose contributions made this possible: Abhinav Ittekot, Sudarshan C (Bengaluru Dev Center), Emmy Bryne (Dev platform), Prithvi Dammalapati (Data platform). Special thanks to Liza Zavetz (Content management) for proofreading this blog!

 
