Scaling to 200K TPS: Lessons from the Trenches
Introduction: The Scale Challenge
When you’re running a massive e-commerce platform, every second counts. A Black Friday event, a flash sale, or even an influencer-driven traffic surge can push the system to its limits. And our marketing team would leave no stone unturned to maximise sales - and they shouldn't! Like I always say, tech should be an enabler, and we need to know our Non-Functional Requirements (NFRs) as well as we know our functional requirements. In our case, as we came out of the meeting room, the NFR was clear: we needed to scale our system to handle 200,000 transactions per second (TPS) while maintaining a seamless user experience.
This wasn’t just about adding more servers—it was about rethinking the architecture to handle distributed transactions, UI optimization, and intelligent request routing while ensuring the system remained highly available and fault-tolerant.
Here’s the gist of how we tackled the problem and went about building the solution.
1. Understanding the Scaling Problem
Key Challenges We Faced
The challenges broke down into five buckets, each of which gets its own deep dive in section 3: processing 200K TPS without synchronous bottlenecks, keeping inventory consistent across distributed microservices, preventing the UI from fanning requests to the backend, handling hot-data contention on a handful of popular products, and working around third-party API rate limits and failures.
With these challenges in mind, we designed a scalable, resilient, and highly available system.
2. The Tech Stack Behind Scaling
Backend: Distributed Transaction Processing
Microservices: Decomposed the monolith into domain-driven microservices (Registration, Product Page, Orders, Payments, Inventory, Shipping and Tracking). We skipped the browse and search pages as there was not enough time to fit them into the release. Our marketing and home pages would deep-link directly to the product page, and user behaviour was clear that during an event, customers go straight to the particular product. (Shaping user behaviour is another story for another day.)
Event-Driven Architecture: To manage the load, we moved to an event-driven architecture and implemented Apache Kafka for asynchronous, high-throughput event processing (see the sketch after this list).
Databases: We looked at different databases for the various use cases; the specific choices (PostgreSQL for core transactional data, Redis for caching and locks, DynamoDB for pre-computed availability, and a distributed SQL store for inventory) and the reasoning behind each come up in section 3.
Transaction Handling: We knew we needed a different approach for the distributed architecture we were about to roll out. After a few reads, pooled experience, and some POCs, we went with the SAGA Pattern to manage distributed transactions across microservices, and Two-Phase Commit (2PC) for financial transactions that required strict consistency.
Message Brokers: In an event-driven system, message brokers form the crux. We chose Apache Kafka for event sourcing and RabbitMQ for low-latency transactional messaging.
Compute & Scaling: This was key to scaling. We used Kubernetes on AWS EKS for containerized deployment and auto-scaling, and AWS Lambda (serverless) for short-lived, event-driven processing.
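To make the event-driven flow concrete, here is a minimal sketch in TypeScript using the kafkajs client of how an order event might be published and consumed. The topic name, broker addresses, and payload shape are illustrative assumptions, not our exact production code.

```typescript
// Minimal sketch of publishing and consuming an order event with kafkajs.
// Topic name, broker addresses, and payload fields are illustrative.
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "order-service",
  brokers: ["kafka-1:9092", "kafka-2:9092"],
});

const producer = kafka.producer();

export async function publishOrderCreated(orderId: string, items: { sku: string; qty: number }[]) {
  await producer.connect();
  // Keying by orderId keeps all events for one order in the same partition,
  // so consumers see them in order.
  await producer.send({
    topic: "order-events",
    messages: [
      { key: orderId, value: JSON.stringify({ type: "ORDER_CREATED", orderId, items, ts: Date.now() }) },
    ],
  });
}

// Consumers in the same consumer group scale horizontally: adding more
// instances spreads the topic's partitions across them.
const consumer = kafka.consumer({ groupId: "order-processors" });

export async function startConsumer() {
  await consumer.connect();
  await consumer.subscribe({ topic: "order-events", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value?.toString() ?? "{}");
      // ... process the order event (persist it, trigger downstream services)
    },
  });
}
```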
Frontend: High-Performance UI Design
On the frontend, the focus was on cutting the number and size of requests per page view: GraphQL query batching, lazy loading of below-the-fold content, and aggressive edge caching on the CDN (the details are in section 3.C).
3. Tackling the Challenges One by One
A. Scaling to 200K TPS: The Transaction Processing Engine
Kafka-Based Event Processing – Instead of processing transactions synchronously, we queued them in Kafka, letting consumers process them at their own throughput. We could scale the consumer groups horizontally and drain the queue faster during sales.
Sharded Database Writes – We horizontally scaled PostgreSQL with partitioning, ensuring high write throughput without conflicts. We also separated reads from writes to reduce load; for example, price rendering was served from a reader node.
Pre-Warmed Payment Processing – Payments were pre-authorized before checkout, reducing database contention.
Circuit Breakers – We implemented Hystrix-style circuit breakers to prevent cascading failures (see the sketch after this list).
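As a rough illustration of the circuit-breaker idea (using the opossum library rather than Hystrix itself), this sketch wraps a payment call. The thresholds and the callPaymentProvider function are assumptions for the example, not our actual service code.

```typescript
// Hystrix-style circuit breaker around a payment call, using opossum.
// Thresholds and callPaymentProvider are illustrative.
import CircuitBreaker from "opossum";

async function callPaymentProvider(orderId: string): Promise<string> {
  // ... real HTTP call to the payment service would go here
  return "AUTHORIZED";
}

const breaker = new CircuitBreaker(callPaymentProvider, {
  timeout: 2000,                 // fail fast if the provider takes > 2s
  errorThresholdPercentage: 50,  // open the circuit at 50% failures
  resetTimeout: 10000,           // try again (half-open) after 10s
});

// When the circuit is open, return a graceful fallback instead of
// letting failures cascade to the caller.
breaker.fallback(() => "PAYMENT_PENDING_RETRY");

export async function authorizePayment(orderId: string): Promise<string> {
  return breaker.fire(orderId);
}
```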
B. Distributed Data Consistency with Inventory Management
The Problem
During a sale, thousands of concurrent orders compete for the same stock across multiple services. Without coordination, two requests can both see an item as available and both reserve the last unit, which means overselling and cancellations later.
The Solution
Optimistic Concurrency Control (OCC) – Implemented versioning on inventory records, ensuring only the first valid request succeeded (a sketch follows this list).
SAGA Pattern for Distributed Transactions – If an order failed after inventory was locked, we automatically rolled back the inventory update.
Eventual consistency was not OK: customers would get a promise on the product page and then face the disappointment of the product being out of stock. So we needed strong consistency for inventory, which we implemented with Distributed SQL (e.g., CockroachDB, Google Spanner, YugabyteDB).
Also note that we kept customer experience as the higher-order bit and charged payment only after the inventory reservation succeeded. We then had to deal with adding inventory back for customer-cancelled orders and re-opening ordering - which was built into the flow from day one.
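Here is a minimal sketch of the OCC check, assuming a PostgreSQL inventory table with a version column and the node-postgres client; the table and column names are illustrative.

```typescript
// Optimistic concurrency control on an inventory row, using node-postgres.
// Table and column names are illustrative.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Attempt to reserve one unit of stock. The WHERE clause only matches if
// nobody else has bumped the version since we read it, so exactly one of
// two concurrent requests wins.
export async function reserveStock(sku: string, expectedVersion: number): Promise<boolean> {
  const result = await pool.query(
    `UPDATE inventory
        SET stock = stock - 1,
            version = version + 1
      WHERE sku = $1
        AND version = $2
        AND stock > 0`,
    [sku, expectedVersion]
  );
  // rowCount === 1 means our version check held; 0 means another request
  // got there first and the caller should re-read and retry, or fail.
  return result.rowCount === 1;
}
```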
C. Preventing UI Request Fanning
The Problem
Each product-page load fanned out into many separate API calls - product details, pricing, stock, reviews, widgets - so every page view multiplied into several backend requests at exactly the moment the backend could least afford it.
The Solution
GraphQL Query Batching – Merged multiple API calls into a single query, reducing request volume (see the first sketch after this list).
Lazy Loading Below-the-Fold Content – Used the Intersection Observer API to load non-critical widgets only when they appeared on screen (see the second sketch after this list).
Edge Caching – Cached static API responses in CloudFront, serving repeated requests from the CDN instead of hitting the backend.
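The first sketch shows the batching idea: one GraphQL query replaces several REST calls for a product page. The /graphql endpoint and the schema fields are assumptions for illustration, not our actual schema.

```typescript
// Collapse several REST calls (product, price, stock, reviews) into one
// GraphQL query. Endpoint and schema fields are illustrative.
const PRODUCT_PAGE_QUERY = `
  query ProductPage($sku: String!) {
    product(sku: $sku) {
      title
      description
      price { amount currency }
      stock { available }
      reviews(first: 5) { rating comment }
    }
  }
`;

export async function loadProductPage(sku: string) {
  const response = await fetch("/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: PRODUCT_PAGE_QUERY, variables: { sku } }),
  });
  // One round trip instead of four or five separate REST calls.
  return response.json();
}
```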
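The second sketch defers below-the-fold widgets with the standard IntersectionObserver API; the data-lazy-widget selector and the loadWidget helper are hypothetical.

```typescript
// Lazy load below-the-fold widgets with IntersectionObserver.
// The selector and loadWidget helper are illustrative.
function loadWidget(el: Element): void {
  // ... fetch the widget's data and render it into `el`
}

const observer = new IntersectionObserver(
  (entries, obs) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        loadWidget(entry.target);    // load only when it scrolls into view
        obs.unobserve(entry.target); // and only once
      }
    }
  },
  { rootMargin: "200px" } // start loading slightly before it becomes visible
);

document.querySelectorAll("[data-lazy-widget]").forEach((el) => observer.observe(el));
```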
D. Handling Hot Data Issues (Popular Product Contention)
The Problem
A small number of popular products attracted the bulk of the traffic, so reads and writes concentrated on the same few inventory records, with thousands of shoppers trying to reserve the same item at once.
The Solution
Redis-Based Inventory Locks – Used Redis with TTL-based distributed locks to prevent multiple orders from reserving the same item (see the first sketch after this list).
Pre-Computed Availability – Cached "stock available" data in DynamoDB, reducing database reads.
Write Throttling – Implemented rate limiting on the Add-to-Cart API to prevent abuse (see the second sketch after this list).
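The first sketch below shows a TTL-based lock using the ioredis client; the key format, TTL, and token handling are illustrative assumptions (for a multi-node Redis setup you would look at something like Redlock instead).

```typescript
// TTL-based inventory lock in Redis using ioredis.
// Key format, TTL, and token handling are illustrative.
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function lockInventory(sku: string, ttlMs = 3000): Promise<string | null> {
  const token = randomUUID();
  // NX = only set if the key does not exist, PX = expire after ttlMs,
  // so a crashed process can never hold the lock forever.
  const ok = await redis.set(`lock:inventory:${sku}`, token, "PX", ttlMs, "NX");
  return ok === "OK" ? token : null; // null = someone else holds the lock
}

export async function unlockInventory(sku: string, token: string): Promise<void> {
  // Release only if we still own the lock (compare-and-delete via Lua).
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    end
    return 0
  `;
  await redis.eval(script, 1, `lock:inventory:${sku}`, token);
}
```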
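The second sketch is a simple per-user, fixed-window rate limiter in Redis for the Add-to-Cart API; the limits and key format are assumptions, and a production setup might lean on the API gateway's built-in rate limiting instead.

```typescript
// Per-user fixed-window rate limit on Add-to-Cart, backed by Redis.
// Limits and key format are illustrative.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Allow at most `limit` add-to-cart calls per user per `windowSec` seconds.
export async function allowAddToCart(userId: string, limit = 10, windowSec = 60): Promise<boolean> {
  const key = `ratelimit:add-to-cart:${userId}`;
  const count = await redis.incr(key);
  if (count === 1) {
    // First request in this window: start the expiry clock.
    await redis.expire(key, windowSec);
  }
  return count <= limit;
}
```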
E. Overcoming API Rate Limits & Third-Party Dependencies
The Problem
We depended on third parties - payment gateways in particular - that enforce their own rate limits and have their own failure modes. At our scale, waiting synchronously on those APIs meant throttled requests and failed checkouts.
The Solution
API Gateway with Smart Throttling – We buffered API calls and scheduled retries for rate-limited requests.
Multi-Vendor Payments – Used Stripe, PayPal, and Razorpay in parallel, switching dynamically in case of failures.
Webhooks for Asynchronous Processing – Instead of waiting for real-time payment confirmation, we queued transactions and used webhooks to update order status asynchronously (see the sketch below).
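To show the asynchronous shape of this flow, here is a hedged sketch of a webhook handler in Express; the route path, payload fields, and updateOrderStatus helper are assumptions, and a real handler must also verify the provider's webhook signature.

```typescript
// Asynchronous payment flow: the order is queued as PENDING and a provider
// webhook later flips it to PAID or FAILED. Route, payload fields, and
// updateOrderStatus are illustrative; signature verification is omitted.
import express from "express";

const app = express();
app.use(express.json());

async function updateOrderStatus(orderId: string, status: "PAID" | "FAILED"): Promise<void> {
  // ... persist the new status and emit an order-updated event
}

app.post("/webhooks/payments", async (req, res) => {
  const { orderId, status } = req.body as { orderId: string; status: string };

  // Acknowledge quickly; providers retry on non-2xx responses, so keep the
  // handler idempotent (processing the same event twice must be harmless).
  await updateOrderStatus(orderId, status === "succeeded" ? "PAID" : "FAILED");
  res.status(200).send("ok");
});

app.listen(3000);
```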
4. Learnings from Scaling
🔥 Events > Transactions – Moving from synchronous processing to event-driven architecture was a game-changer.
🔥 GraphQL > REST – Reducing API requests by batching queries significantly improved frontend performance.
🔥 Pre-Warmed Databases Reduce Latency – Keeping read replicas pre-warmed with frequent queries avoided cold-start delays.
🔥 Resiliency Trumps Performance – Features like circuit breakers, rate-limiting, and load-shedding prevented system meltdowns.
🔥 AI-Driven Auto-Scaling Works – Using AWS Auto-Scaling with ML-based prediction models helped us scale up before traffic spikes.
Conclusion: Scaling is an Ongoing Journey
Scaling to 200K TPS wasn’t just about adding more servers—it required a shift in architecture, engineering mindset, and deep performance tuning. By adopting distributed transactions, event-driven processing, intelligent UI optimization, and multi-layered caching, we built a system that could handle extreme loads without breaking.
The load testing was a different beast altogether ... that's a story for another day! :)