Scaling Order Platforms: Database Architecture Insights from Grab, Foodpanda, and Industry Best Practices
In the fast-paced world of on-demand platforms like Grab and Foodpanda, the database is the backbone of the system. These platforms process millions of orders daily, and their database architecture must be designed to ensure high availability, low latency, and cost-effectiveness. In this blog, we’ll explore how companies like Grab and Foodpanda tackle these challenges, along with broader industry practices and personal insights to help you build a scalable and resilient database architecture.
Key Takeaways
Building a scalable and resilient database architecture requires careful planning and optimization. Here are the key lessons from Grab, Foodpanda, and broader industry practices:
Separate Databases for Different Workloads: Use an OLTP database for transactional queries and an OLAP database for analytical queries.
Leverage Managed Services: Offload infrastructure management to cloud providers to focus on core business logic.
Optimize for Spiky Traffic: Choose databases like DynamoDB that can handle traffic spikes seamlessly.
Plan for Failures: Use fallback mechanisms like SQS and DLQ to ensure high availability.
Smart Indexing and Caching: Use techniques like partial indexing and caching to improve query performance.
Sharding and Data Retention: Implement sharding and data retention policies to keep databases lean and scalable.
Understanding the Query Workload
Before diving into the architecture, it’s essential to understand the types of queries a system like Grab or Foodpanda handles:
The Challenge of Spiky Traffic
Platforms like Grab and Foodpanda face spiky traffic patterns. For instance, during lunch or dinner hours, the system experiences a massive surge in orders. Similarly, during promotional events or festivals, traffic can spike unpredictably. Handling these spikes without degrading performance is a key challenge.
Design Goals for a Scalable Database Architecture
To build a database architecture that can handle these challenges, the following design goals are critical:
The Two-Database Approach: OLTP and OLAP
One of the most effective strategies for handling transactional and analytical workloads is the two-database approach:
Keeping the Databases in Sync
To ensure data consistency between the OLTP and OLAP databases, companies use a message-driven architecture:
Choosing the Right Databases
OLTP Database: DynamoDB and Beyond
Grab chose DynamoDB as their OLTP database for several reasons:
Recommended by LinkedIn
Personal Insight: Optimizing DynamoDB Usage
One of the most brilliant optimizations I’ve seen is Grab’s use of partial indexing in DynamoDB:
Instead of creating a global secondary index (GSI) directly on the user_id field, they introduced a new attribute called user_id_order_GSI.
Example: If a user has 100 orders, but only 2 are ongoing, the GSI will only index those 2 orders, reducing the size of the index and improving query performance.
OLAP Database: MySQL and Alternatives
For analytical queries, Grab chose MySQL over a full-fledged data warehouse. Here’s why:
Foodpanda’s Approach: Foodpanda, on the other hand, uses Google BigQuery for analytical queries. BigQuery is a serverless data warehouse that excels at handling large-scale analytics. While it’s more expensive than MySQL, it provides better performance for complex queries and large datasets.
Handling Failures and Edge Cases
In a distributed system, failures are inevitable. Here’s how companies like Grab and Foodpanda handle edge cases:
1. Kafka Downtime
Kafka has an SLA of 99.95%, but Grab uses Amazon SQS as a fallback. If Kafka is down, messages are pushed to SQS instead.
Example: During a Kafka outage, Grab’s order service automatically switches to SQS, ensuring that no messages are lost.
2. SQS Downtime
If SQS fails, AWS’s Dead Letter Queue (DLQ) mechanism ensures that no messages are lost. Messages that cannot be processed are moved to the DLQ for later retry.
Example: If SQS is unavailable, messages are stored in the DLQ and retried once SQS is back online.
3. Out-of-Order Events
To handle out-of-order events, companies use upserts (update or insert) in the OLAP database. They also include a timestamp check to ensure that older updates don’t overwrite newer ones.
Example: If an "update order" event arrives before the "create order" event, the system performs an upsert to handle the discrepancy. Additionally, a timestamp check ensures that only the latest update is applied.
Broader Industry Practices
1. Caching for Performance
Both Grab and Foodpanda use caching to reduce the load on their databases. For example, they cache frequently accessed data like restaurant menus or user profiles in Redis or Memcached.
Example: When a user opens the Foodpanda app, the restaurant menu is served from the cache, reducing the number of database queries.
2. Sharding for Scalability
To handle large datasets, companies often use sharding. For example, Foodpanda shards its MySQL database by region, ensuring that queries are routed to the appropriate shard.
Example: Orders from Singapore are stored in one shard, while orders from Malaysia are stored in another. This reduces the load on each shard and improves query performance.
3. Data Retention Policies
To keep the OLTP database lean, companies implement data retention policies. For example, Grab deletes orders older than 3 months from DynamoDB but retains them in the OLAP database for analytics.
Example: Using DynamoDB’s Time-to-Live (TTL) feature, Grab automatically deletes old orders, reducing storage costs.