Scaling to 200K TPS: Lessons from the trenches

Introduction: The Scale Challenge

When you're running a massive e-commerce platform, every second counts. A Black Friday event, a flash sale, or even an influencer-driven traffic surge can push the system to its limits. Our marketing team would leave no stone unturned to maximise sales, and they shouldn't! Like I always say, tech should be an enabler, and we need to know our NFRs (Non-Functional Requirements) as well as we know our functional requirements. In our case, as we came out of the meeting room, our NFRs were clear: we needed to scale our system to handle 200,000 transactions per second (TPS) while maintaining a seamless user experience.

This wasn’t just about adding more servers—it was about rethinking the architecture to handle distributed transactions, UI optimization, and intelligent request routing while ensuring the system remained highly available and fault-tolerant.

Here's the gist of how we tackled the problem and how we went about building it.

1. Understanding the Scaling Problem

Key Challenges We Faced

  1. Massive concurrent transactions – Users simultaneously placing orders, processing payments, and fetching product details.
  2. Data consistency in a distributed system – Ensuring correct inventory counts across multiple databases and regions.
  3. UI request explosion – The home page contained dozens of widgets, causing thousands of parallel API calls.
  4. Hot data problems – Popular products caused contention issues in databases and caches.
  5. Network & API rate limits – Third-party services (like payments and shipping) had rate limits that we had to work around. Surprisingly, there were banks, national-level banks, whose payment gateways were not equipped to handle the spike!

With these challenges in mind, we designed a scalable, resilient, and highly available system.


2. The Tech Stack Behind Scaling

Backend: Distributed Transaction Processing

Microservices: Decomposed the monolith into domain-driven microservices (Registration, Product page, Orders, Payments, Inventory, Shipping and Tracking). We skipped the browse and search pages as there was not enough time to fit them into the release. Our marketing and home pages would deep-link directly to the product page, and user behaviour made it clear that during an event, customers go straight to the particular product. (Shaping user behaviour is another story for another day.)

To manage the load, we moved to an event-driven architecture and implemented Apache Kafka for asynchronous, high-throughput event processing.
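To make that concrete, here's a minimal sketch of publishing an order event with the kafkajs client; the broker address, topic name, and event shape are placeholders for illustration, not our production setup.

```typescript
import { Kafka } from "kafkajs";

// Hypothetical broker address and topic, used only for illustration.
const kafka = new Kafka({ clientId: "checkout-service", brokers: ["kafka:9092"] });
const producer = kafka.producer();

export async function publishOrderPlaced(orderId: string, userId: string, amount: number) {
  await producer.connect();
  // Keying by orderId keeps all events for one order on the same partition,
  // so consumers see them in order.
  await producer.send({
    topic: "order-events",
    messages: [
      {
        key: orderId,
        value: JSON.stringify({ type: "ORDER_PLACED", orderId, userId, amount, at: Date.now() }),
      },
    ],
  });
}
```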

Databases: We had to choose databases for the various use cases and looked at several options:

  • PostgreSQL (ACID-compliant) for orders and payments.
  • Cassandra (Eventual consistency) for user activity tracking.
  • Redis & DynamoDB for caching product details, pricing, and session data.

Transaction Handling: We knew we needed a different approach for the distributed architecture we were going to roll out. After a few reads, pooling our experiences, and some POCs, we went with the SAGA pattern to manage distributed transactions across microservices and Two-Phase Commit (2PC) for financial transactions that required strict consistency.
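As a sketch of the SAGA idea (the step names and service stubs below are hypothetical, not our actual service interfaces): every step carries a compensating action, and a failure part-way through triggers the compensations in reverse order.

```typescript
// One saga step: a forward action plus the compensation that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order; on failure, runs the compensations of completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<boolean> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (err) {
      console.error(`Saga step "${step.name}" failed; compensating`, err);
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      return false;
    }
  }
  return true;
}

// Hypothetical service stubs standing in for real microservice clients.
const inventory = { reserve: async (id: string) => {}, release: async (id: string) => {} };
const payments = { charge: async (id: string) => {}, refund: async (id: string) => {} };
const shipping = { create: async (id: string) => {}, cancel: async (id: string) => {} };

export async function placeOrder(orderId: string): Promise<boolean> {
  return runSaga([
    { name: "reserve-inventory", action: () => inventory.reserve(orderId), compensate: () => inventory.release(orderId) },
    { name: "charge-payment", action: () => payments.charge(orderId), compensate: () => payments.refund(orderId) },
    { name: "create-shipment", action: () => shipping.create(orderId), compensate: () => shipping.cancel(orderId) },
  ]);
}
```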

Message Brokers: In an event-driven system, message brokers form the crux, and we chose Apache Kafka for event sourcing and RabbitMQ for low-latency transactional messaging.

Compute & Scaling: This was key to scaling. We used Kubernetes on AWS EKS for containerized deployment and auto-scaling, and serverless AWS Lambda for short-lived, event-driven processing.


Frontend: High-Performance UI Design

  • Frontend Framework: React with Next.js for server-side rendering (SSR).
  • Above-the-Fold Optimization: Prioritized critical content loading before below-the-fold widgets.
  • Request Consolidation: Used GraphQL to batch multiple API requests into a single network call.
  • Edge Caching: Used CloudFront to serve static assets and cached API responses at the edge.
  • WebSocket Streaming: For real-time inventory and pricing updates without polling (see the client-side sketch just after this list).
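Here's roughly what the WebSocket piece looks like on the client; the endpoint URL, subscription message, and DOM selectors are assumptions made for the sketch.

```typescript
// Minimal browser-side sketch; the endpoint and message format are hypothetical.
type InventoryUpdate = { sku: string; stock: number; price: number };

const socket = new WebSocket("wss://example.com/realtime/inventory");

socket.onopen = () => {
  // Subscribe only to the products currently rendered on the page.
  socket.send(JSON.stringify({ subscribe: ["IPHONE15PROMAX"] }));
};

socket.onmessage = (event: MessageEvent<string>) => {
  const update: InventoryUpdate = JSON.parse(event.data);
  // Update the widget in place; no repeated polling of the inventory API.
  const el = document.querySelector(`[data-sku="${update.sku}"] .stock`);
  if (el) {
    el.textContent = update.stock > 0 ? `${update.stock} left` : "Out of stock";
  }
};
```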


3. Tackling the Challenges One by One

A. Scaling to 200K TPS: The Transaction Processing Engine

Kafka-Based Event Processing – Instead of processing transactions synchronously, we queued them in Kafka, allowing consumers to process them at their own peak throughput. We could scale consumers horizontally and drain the backlog faster during sales.
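On the consumer side, scaling out mostly means running more instances in the same consumer group so Kafka spreads partitions across them. A minimal kafkajs sketch, with placeholder group and topic names:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "order-processor", brokers: ["kafka:9092"] });
// Every pod runs with the same groupId; Kafka assigns each one a share of the partitions,
// so adding pods during a sale raises total consumption throughput.
const consumer = kafka.consumer({ groupId: "order-processing" });

export async function start() {
  await consumer.connect();
  await consumer.subscribe({ topic: "order-events", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      const event = JSON.parse(message.value!.toString());
      // Process the transaction asynchronously, at the consumer's own pace.
      console.log(`partition ${partition}:`, event.type, event.orderId);
    },
  });
}
```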

Shard Database Writes – We horizontally scaled PostgreSQL with partitioning, ensuring high write throughput without conflicts. We also separated reads and writes to reduce load; for example, price rendering hit a reader node.
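A simplified sketch of the read/write split with node-postgres: writes go to the primary, while read-heavy paths such as price rendering go to a reader endpoint. Connection strings and table names are placeholders.

```typescript
import { Pool } from "pg";

// Separate pools for the primary (writes) and a read replica (reads).
// Connection strings are placeholders for illustration.
const writer = new Pool({ connectionString: "postgres://primary.internal/orders" });
const reader = new Pool({ connectionString: "postgres://replica.internal/orders" });

// Price rendering only reads, so it can be served entirely from the replica.
export async function getPrice(sku: string): Promise<number> {
  const res = await reader.query("SELECT price FROM products WHERE sku = $1", [sku]);
  return res.rows[0].price;
}

// Order creation must hit the primary.
export async function createOrder(orderId: string, sku: string, qty: number): Promise<void> {
  await writer.query("INSERT INTO orders (id, sku, qty) VALUES ($1, $2, $3)", [orderId, sku, qty]);
}
```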

Pre-Warmed Payment Processing – Payments were pre-authorized before checkout, reducing database contention.

Circuit Breakers – We implemented Hystrix-style circuit breakers to prevent cascading failures.
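The breaker itself is a small state machine. Here's a hand-rolled sketch to show the idea; the thresholds are illustrative, and in practice you'd reach for a library (Resilience4j, opossum, etc.) rather than writing your own.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,      // trips after this many consecutive failures
    private readonly resetAfterMs = 10_000 // wait this long before allowing a retry
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const open = this.failures >= this.maxFailures;
    if (open && Date.now() - this.openedAt < this.resetAfterMs) {
      // Fail fast instead of piling more load onto a struggling dependency.
      throw new Error("circuit open");
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap calls to a downstream dependency, e.g. a payment gateway client.
const paymentsBreaker = new CircuitBreaker();
// await paymentsBreaker.call(() => paymentGateway.authorize(order));
```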

[Image: Microservices architecture]

B. Distributed Data Consistency with Inventory Management

The Problem

  • Multiple warehouses updating inventory at different times caused race conditions and double-booked orders.

The Solution

Optimistic Concurrency Control (OCC) – Implemented versioning on inventory records, ensuring only the first valid request succeeded.
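In SQL terms, OCC boils down to a conditional UPDATE that only succeeds if the version you read is still current. A sketch with node-postgres, where the table and column names are assumptions:

```typescript
import { Pool } from "pg";

const db = new Pool({ connectionString: "postgres://primary.internal/inventory" });

// Returns true if this request won the race, false if another request updated first.
export async function reserveOne(sku: string, expectedVersion: number): Promise<boolean> {
  const res = await db.query(
    `UPDATE inventory
        SET stock = stock - 1, version = version + 1
      WHERE sku = $1 AND version = $2 AND stock > 0`,
    [sku, expectedVersion]
  );
  // rowCount === 0 means the version moved on (or stock ran out): retry or fail the order.
  return res.rowCount === 1;
}
```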

SAGA Pattern for Distributed Transactions – If an order failed after inventory was locked, we automatically rolled back the inventory update.

Eventual Consistency was not OK: customers would get promises on the product page and then face the disappointment of finding the product out of stock. So we needed Strong Consistency, which we implemented with Distributed SQL (e.g., CockroachDB, Google Spanner, YugabyteDB):

  • These databases provide ACID guarantees across distributed nodes, ensuring stock levels are always accurate in real-time.
  • They scale horizontally while maintaining strict consistency.

Also note that we kept customer experience as the higher-order bit and took payment only after the inventory reserve succeeded. We then had to handle adding inventory back for customer-cancelled orders and re-opening ordering, which was built into the flow from day one.

[Image: Flowchart]

C. Preventing UI Request Fanning

The Problem

  • The homepage had 20+ widgets, each triggering separate API calls on page load.
  • This resulted in thousands of simultaneous requests, overwhelming backend services.

The Solution

GraphQL Query Batching – Merged multiple API calls into a single query, reducing request volume.
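For a sense of what the consolidation looks like, here's a sketch of one batched query replacing three separate widget calls; the schema fields and endpoint are hypothetical.

```typescript
// One query fetches what previously took three separate REST calls.
// Field and endpoint names are hypothetical.
const HOME_PAGE_QUERY = `
  query HomePage($userId: ID!) {
    featuredProducts { sku name price imageUrl }
    activeDeals { sku discountPercent endsAt }
    cartSummary(userId: $userId) { itemCount total }
  }
`;

export async function loadHomePage(userId: string) {
  const res = await fetch("https://example.com/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: HOME_PAGE_QUERY, variables: { userId } }),
  });
  const { data } = await res.json();
  return data; // featuredProducts, activeDeals, cartSummary in one round trip
}
```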

Lazy Loading Below-the-Fold Content – Used Intersection Observer API to load non-critical widgets only when they appeared on screen.
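A minimal React sketch of the pattern: the widget (and its API call) only mounts once its container scrolls into view. The component name is illustrative.

```tsx
import { useEffect, useRef, useState, type ReactNode } from "react";

// Renders its children only after the wrapper enters the viewport.
export function LazyWidget({ children }: { children: ReactNode }) {
  const ref = useRef<HTMLDivElement>(null);
  const [visible, setVisible] = useState(false);

  useEffect(() => {
    const observer = new IntersectionObserver(([entry]) => {
      if (entry.isIntersecting) {
        setVisible(true);      // mount the real widget (and its API call) now
        observer.disconnect(); // one-shot: no need to keep observing
      }
    });
    if (ref.current) observer.observe(ref.current);
    return () => observer.disconnect();
  }, []);

  return <div ref={ref}>{visible ? children : null}</div>;
}
```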

Edge Caching – Cached static API responses in CloudFront, serving repeated requests from the CDN instead of hitting the backend.
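On the origin side, edge caching is largely a matter of sending cache headers the CDN can honour. A small Express sketch; the route, placeholder lookup, and TTLs are illustrative.

```typescript
import express from "express";

const app = express();

// Product metadata changes rarely during an event, so let CloudFront (and browsers)
// serve it from cache; s-maxage applies to shared caches like the CDN.
app.get("/api/products/:sku", (req, res) => {
  const product = { sku: req.params.sku, name: "Example product", price: 999 }; // placeholder lookup
  res.set("Cache-Control", "public, max-age=30, s-maxage=300, stale-while-revalidate=60");
  res.json(product);
});

app.listen(3000);
```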

[Image: Side-by-side]

D. Handling Hot Data Issues (Popular Product Contention)

The Problem

  • A single hot product (e.g., iPhone 15 Pro Max) received millions of concurrent cart additions, overloading the database.

The Solution

Redis-Based Inventory Locks – Used Redis with TTL-based distributed locks to prevent multiple orders from reserving the same item.
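A minimal sketch of the TTL-based lock with ioredis: SET with NX either acquires the lock or tells you another request holds it, and the EX expiry guarantees the lock can't leak if the holder dies. Key names and TTLs here are illustrative.

```typescript
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis(); // defaults to localhost:6379; replace with the real endpoint

// Acquire a short-lived lock on one SKU's stock record; the TTL makes the lock
// self-releasing if the holder crashes mid-reservation.
export async function withInventoryLock<T>(sku: string, fn: () => Promise<T>): Promise<T | null> {
  const key = `lock:inventory:${sku}`;
  const token = randomUUID();
  // NX: only succeeds if nobody else holds the lock; EX: expire after 5 seconds regardless.
  const acquired = await redis.set(key, token, "EX", 5, "NX");
  if (acquired !== "OK") return null; // someone else is reserving this SKU right now

  try {
    return await fn(); // e.g. decrement stock and write the reservation
  } finally {
    // Only delete the lock if we still own it (compare-and-delete via a tiny Lua script).
    await redis.eval(
      `if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end`,
      1,
      key,
      token
    );
  }
}
```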

Pre-Computed Availability – Cached "stock available" data in DynamoDB, reducing database reads.
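A sketch of reading that pre-computed flag with the AWS SDK's DynamoDB DocumentClient; the table and attribute names are assumptions.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Product pages read this flag instead of querying PostgreSQL for live stock counts.
export async function isAvailable(sku: string): Promise<boolean> {
  const out = await doc.send(new GetCommand({ TableName: "product-availability", Key: { sku } }));
  return out.Item?.inStock === true;
}
```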

Write Throttling – Implemented Rate-Limiting on Add-to-Cart API to prevent abuse.
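The throttle is essentially a per-user token bucket. A simplified in-memory sketch; a real deployment would back the counters with Redis so all instances share them, and the limits below are made up.

```typescript
interface Bucket { tokens: number; lastRefill: number; }

const buckets = new Map<string, Bucket>();
const CAPACITY = 5;          // max burst of add-to-cart calls
const REFILL_PER_SECOND = 1; // sustained rate afterwards

export function allowAddToCart(userId: string): boolean {
  const now = Date.now();
  const bucket = buckets.get(userId) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill proportionally to elapsed time, capped at capacity.
  const elapsedSeconds = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSeconds * REFILL_PER_SECOND);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(userId, bucket);
    return false; // caller rejects the request, e.g. with HTTP 429
  }
  bucket.tokens -= 1;
  buckets.set(userId, bucket);
  return true;
}
```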

[Image: Redis-based]

E. Overcoming API Rate Limits & Third-Party Dependencies

The Problem

  • Payment gateways and shipping providers had API rate limits, causing failures during peak traffic.

The Solution

API Gateway with Smart Throttling – We buffered API calls and scheduled retries for rate-limited requests.
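The "buffer and retry" behaviour comes down to exponential backoff when a provider signals rate limiting. A simplified sketch; the attempt counts, delays, and the assumption that the thrown error carries a status field are illustrative.

```typescript
// Retries a third-party call when it signals rate limiting, backing off exponentially.
export async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const rateLimited = err?.status === 429; // assumed error shape for the sketch
      if (!rateLimited || attempt === maxAttempts - 1) throw err;
      // Exponential backoff with jitter spreads retries out instead of re-spiking the provider.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}
```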

Multi-Vendor Payments – Used Stripe, PayPal, and Razorpay in parallel, switching dynamically in case of failures.
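The dynamic switching is conceptually a prioritised failover across providers. A sketch with hypothetical gateway adapters; the real integrations are each provider's own SDK.

```typescript
// Minimal shape every gateway adapter implements; the adapters themselves are hypothetical.
interface PaymentGateway {
  name: string;
  charge(orderId: string, amountInPaise: number): Promise<{ paymentId: string }>;
}

export async function chargeWithFailover(
  gateways: PaymentGateway[],
  orderId: string,
  amountInPaise: number
) {
  let lastError: unknown;
  for (const gateway of gateways) {
    try {
      const result = await gateway.charge(orderId, amountInPaise);
      return { gateway: gateway.name, ...result };
    } catch (err) {
      lastError = err;
      console.warn(`${gateway.name} failed for ${orderId}, trying next provider`);
    }
  }
  throw lastError; // every provider failed; surface to the retry/queueing layer
}

// Usage: chargeWithFailover([stripeAdapter, paypalAdapter, razorpayAdapter], "order-123", 129900)
```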

Webhooks for Asynchronous Processing – Instead of waiting for real-time payment confirmation, we queued transactions and used webhooks to update order status asynchronously.
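On the receiving end, the webhook handler just acknowledges quickly and hands the update to the event pipeline. A minimal Express sketch; the route, payload shape, and publish call are assumptions, and a real handler would verify the provider's signature first.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// The payment provider calls this when a charge settles; we ack immediately and
// let the order-status update flow through the event pipeline asynchronously.
app.post("/webhooks/payments", async (req, res) => {
  const { orderId, status } = req.body; // payload shape is provider-specific; simplified here
  // NOTE: verify the webhook signature before trusting the payload in production.
  await publishOrderStatusEvent(orderId, status); // e.g. produce to a Kafka topic
  res.status(200).send("ok"); // fast 2xx so the provider does not retry unnecessarily
});

// Placeholder for the actual event publish (hypothetical).
async function publishOrderStatusEvent(orderId: string, status: string): Promise<void> {
  console.log(`order ${orderId} -> ${status}`);
}

app.listen(3000);
```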

[Image: API Gateway]

4. Learnings from Scaling

🔥 Events > Transactions – Moving from synchronous processing to event-driven architecture was a game-changer.

🔥 GraphQL > REST – Reducing API requests by batching queries significantly improved frontend performance.

🔥 Pre-Warmed Databases Reduce Latency – Keeping read replicas pre-warmed with frequent queries avoided cold-start delays.

🔥 Resiliency Trumps Performance – Features like circuit breakers, rate-limiting, and load-shedding prevented system meltdowns.

🔥 AI-Driven Auto-Scaling Works – Using AWS Auto-Scaling with ML-based prediction models helped us scale up before traffic spikes.


Conclusion: Scaling is an Ongoing Journey

Scaling to 200K TPS wasn’t just about adding more servers—it required a shift in architecture, engineering mindset, and deep performance tuning. By adopting distributed transactions, event-driven processing, intelligent UI optimization, and multi-layered caching, we built a system that could handle extreme loads without breaking.


The load testing was a different beast altogether ... that's a story for another day! :)
