Netflix's Cloud Edge Architecture
Netflix streaming turned 14 years old in January 2021. Rewinding to 2007, streaming was called “Watch Now” and it only worked on Windows PCs (see this Hacking Netflix article about the initial launch). The architecture powering the streaming experience was very simple: a hardware load balancer, a monolithic webapp, a few web services, and a relational database, all running in a single data center. That worked well for the thousands of members using streaming at the time. Fast forward to 2021 and Netflix has 200M+ global member households, 1000+ supported devices, and 1000s of original titles. Powering this requires a very different architecture and a broader set of stunning colleagues.
Overall Architecture
Netflix’s overall architecture is a dichotomy:
- Cloud-based microservices for running the services that power the experience
- Hardware-based Open Connect infrastructure to deliver the bits as fast as possible
The cloud provides flexibility and scalability for the parts of the ecosystem that need that dynamism and ability to evolve rapidly. Open Connect is purpose-built to do video streaming data delivery at planetary scale. Both are necessary for Netflix’s global scale.
Cloud Edge Architecture
The “Edge” of Netflix’s cloud architecture is the front door into the microservices ecosystem. It’s where every request starts and ends, so resiliency is paramount. If any part of the edge is down, Netflix doesn’t work. There are four core layers to the edge architecture:
- Cloud Gateway
- Backends-for-Frontends (BFFs)
- Customer Journey APIs
- Domain Services & Data Stores
Every Netflix-enabled device interacts with the cloud edge architecture to create the Netflix consumer experience.
Cloud Gateway
The first cloud layer is the Cloud Gateway, also known as Zuul. Zuul is an open source solution that addresses many top-level gateway concerns; multiple teams contribute code to it, and its configurability lets the gateway serve many types of incoming traffic. Zuul’s responsibilities include:
- Authentication - determine which device and user are sending the request
- Routing - analyze the URL and other incoming parameters along with configuration to route the request to the appropriate origin services
- Insights - measure request timings, report success/failure metrics, and record tracing information for debugging
- Rate Limiting - limit some requests to avoid overloading the overall architecture
- Enrichment - add ancillary data like geo-IP information
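To make the filter model concrete, here is a minimal sketch of how a gateway might chain these concerns. This is illustrative TypeScript, not Zuul’s actual API; the filter names and request shape are hypothetical.

```typescript
// Hypothetical sketch of a gateway filter chain in the spirit of Zuul.
// The types and filter names are illustrative, not Netflix's actual API.

interface GatewayRequest {
  url: string;
  headers: Record<string, string>;
  context: Record<string, unknown>; // enrichment data accumulates here
}

interface GatewayFilter {
  // Filters run in order; each can reject, enrich, or route the request.
  apply(req: GatewayRequest): Promise<GatewayRequest>;
}

const authenticationFilter: GatewayFilter = {
  async apply(req) {
    // Determine which device and user sent the request (e.g. from a header/token).
    req.context.deviceType = req.headers["x-device-type"] ?? "unknown";
    return req;
  },
};

const routingFilter: GatewayFilter = {
  async apply(req) {
    // Map the URL (plus configuration) to an origin service behind the gateway.
    req.context.origin = req.url.startsWith("/api/playback")
      ? "playback-service"
      : "browse-service";
    return req;
  },
};

const insightsFilter: GatewayFilter = {
  async apply(req) {
    // Record timing/tracing metadata for metrics and debugging.
    req.context.receivedAt = Date.now();
    return req;
  },
};

// The gateway runs each filter in sequence before proxying the request.
async function handle(req: GatewayRequest, filters: GatewayFilter[]) {
  for (const f of filters) {
    req = await f.apply(req);
  }
  return req;
}
```

In practice the gateway proxies the request to the chosen origin after the inbound filters run, and response-side filters can record insights on the way back out.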
Backends-for-Frontends
The next layer is the BFF layer where device specific logic and data transforms live. This runs on a platform called NodeQuark: an opinionated set of Node.js libraries and frameworks offered as a managed service by the Node.js platform team. Device teams write code to be run in this managed layer, from calling the downstream APIs they need to build a given experience to transforming and packaging the resulting data into what’s needed to render that experience on the device. There are BFFs for the website, Android, iOS, TV, and a few other device platforms.
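As a rough illustration of the BFF pattern (not NodeQuark itself), a TV BFF endpoint might fan out to a few downstream APIs in parallel and reshape the result for that device; the service URLs and response fields below are hypothetical.

```typescript
// Hypothetical BFF handler: fetch data from downstream APIs and shape it
// for one device type. Service URLs and response fields are illustrative.

interface HomeRow { title: string; videoIds: number[] }

async function getJson<T>(url: string): Promise<T> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request to ${url} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

// A TV home-screen endpoint aggregates several domain APIs in parallel,
// then trims the payload to only what the TV UI needs to render.
async function tvHomeScreen(profileId: string) {
  const [rows, continueWatching] = await Promise.all([
    getJson<HomeRow[]>(`https://api.example.internal/browse/rows?profile=${profileId}`),
    getJson<HomeRow>(`https://api.example.internal/viewing/continue?profile=${profileId}`),
  ]);

  return {
    rows: [continueWatching, ...rows].map((row) => ({
      title: row.title,
      // TVs only need the first N items per row for the initial render.
      videoIds: row.videoIds.slice(0, 20),
    })),
  };
}
```

The key point is that this per-device shaping logic lives in the BFF, owned by the device team, rather than in the shared downstream services.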
Customer Journey APIs & Domain Services
Beyond the BFFs are a full set of APIs for the customer journey. These cover everything from signing up to Netflix, logging in, adding and updating profiles, finding something to watch by browsing or searching, playing content, and more. These APIs rely on underlying domain services and data provided by many teams across the Product Engineering organization at Netflix. For a deep dive into one of those teams, see this Product Edge Systems (PES) Overview.
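As a hedged sketch of how this layering might look, a single customer-journey operation (here, a hypothetical sign-up flow) orchestrates several domain services that are each owned by a different team; the interfaces and names are illustrative, not Netflix’s actual services.

```typescript
// Hypothetical customer-journey API: one "journey" operation orchestrates
// several domain services owned by different teams. Names are illustrative.

interface AccountService { createAccount(email: string): Promise<{ accountId: string }> }
interface ProfileService { createProfile(accountId: string, name: string): Promise<void> }
interface MessagingService { sendWelcomeEmail(accountId: string): Promise<void> }

// The journey layer owns the cross-service flow; each domain service owns
// its own data and business rules.
async function signUpJourney(
  email: string,
  profileName: string,
  services: { account: AccountService; profile: ProfileService; messaging: MessagingService },
) {
  const { accountId } = await services.account.createAccount(email);
  await services.profile.createProfile(accountId, profileName);
  await services.messaging.sendWelcomeEmail(accountId);
  return accountId;
}
```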
Architecting for Failure
Failures are to be expected in every subsystem of a complex ecosystem. In light of this reality, we’ve designed the architecture so that local-level failures don’t cause a macro-level outage. Ways we achieve this include running multi-region with fast failover, deploying multiple clusters for our critical services (sharded by functionality), and regularly practicing chaos engineering, for example injecting failures and verifying that fallbacks work for key business logic.
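One concrete pattern behind “working fallbacks” is wrapping a critical dependency call with a timeout and a degraded default. The sketch below is a generic illustration of that idea, not Netflix’s actual resilience tooling.

```typescript
// Generic fallback pattern: call the primary dependency with a timeout, and
// serve a degraded-but-working default if it fails. Values are illustrative.

async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => T,
  timeoutMs = 500,
): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("timed out")), timeoutMs),
  );
  try {
    // Race the real call against the timeout so slow dependencies
    // cannot stall the whole request path.
    return await Promise.race([primary(), timeout]);
  } catch {
    // On failure, degrade gracefully instead of surfacing an error.
    return fallback();
  }
}

// Example: if a personalized-rows service is down, fall back to a
// non-personalized default list so the home screen still renders.
// const rows = await withFallback(fetchPersonalizedRows, () => defaultRows);
```

Chaos experiments then verify that when the primary call is forced to fail, the fallback path still produces a usable (if less personalized) experience.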
Conclusion
This cloud edge architecture works well for Netflix’s global scale and 200M+ member base. Evolving from the early 2007 architecture to today’s required designing for the next order of scale, implementing and iterating, and learning from failures along the way. Scaling for the next 100M members is going to require more design, experimentation, and iteration. If you’re interested in being a part of Netflix’s cloud architecture evolution, we’re hiring!