Scaling Order Platforms: Database Architecture Insights from Grab, Foodpanda, and Industry Best Practices

Scaling Order Platforms: Database Architecture Insights from Grab, Foodpanda, and Industry Best Practices

In the fast-paced world of on-demand platforms like Grab and Foodpanda, the database is the backbone of the system. These platforms process millions of orders daily, and their database architecture must be designed to ensure high availability, low latency, and cost-effectiveness. In this blog, we’ll explore how companies like Grab and Foodpanda tackle these challenges, along with broader industry practices and personal insights to help you build a scalable and resilient database architecture.


Key Takeaways

Building a scalable and resilient database architecture requires careful planning and optimization. Here are the key lessons from Grab, Foodpanda, and broader industry practices:

Separate Databases for Different Workloads: Use an OLTP database for transactional queries and an OLAP database for analytical queries.

Leverage Managed Services: Offload infrastructure management to cloud providers to focus on core business logic.

Optimize for Spiky Traffic: Choose databases like DynamoDB that can handle traffic spikes seamlessly.

Plan for Failures: Use fallback mechanisms like SQS and DLQ to ensure high availability.

Smart Indexing and Caching: Use techniques like partial indexing and caching to improve query performance.

Sharding and Data Retention: Implement sharding and data retention policies to keep databases lean and scalable.


Understanding the Query Workload

Before diving into the architecture, it’s essential to understand the types of queries a system like Grab or Foodpanda handles:

  1. Transactional Queries: These are critical for operations like creating, updating, or retrieving orders. For example, "get an order," "update an order status," or "cancel an order." These queries require strong consistency and low latency.
  2. Analytical Queries: These are used for generating insights, such as "show me all orders from a specific user" or "calculate the average order value for a restaurant." These queries can tolerate slightly higher latency and work with eventual consistency.

The Challenge of Spiky Traffic

Platforms like Grab and Foodpanda face spiky traffic patterns. For instance, during lunch or dinner hours, the system experiences a massive surge in orders. Similarly, during promotional events or festivals, traffic can spike unpredictably. Handling these spikes without degrading performance is a key challenge.


Design Goals for a Scalable Database Architecture

To build a database architecture that can handle these challenges, the following design goals are critical:

  1. Stability: The system must remain operational even during traffic spikes or partial failures.
  2. Cost-Effectiveness: At scale, inefficient queries or over-provisioned resources can lead to skyrocketing costs.
  3. Consistency: Transactional queries require strong consistency, while analytical queries can work with eventual consistency.
  4. Resilience: The system must handle failures gracefully, ensuring minimal downtime and data loss.


The Two-Database Approach: OLTP and OLAP

One of the most effective strategies for handling transactional and analytical workloads is the two-database approach:

  1. OLTP Database (Online Transactional Processing): This database handles real-time transactional queries. It is optimized for fast reads and writes and serves as the single source of truth for orders.
  2. OLAP Database (Online Analytical Processing): This database is designed for analytical queries. It stores historical data and is optimized for complex queries and aggregations.

Keeping the Databases in Sync

To ensure data consistency between the OLTP and OLAP databases, companies use a message-driven architecture:

  • When an order is created or updated in the OLTP database, an event is pushed to a messaging system like Kafka or Amazon SQS.
  • Consumers read these events and update the OLAP database asynchronously.
  • This approach ensures that the OLAP database eventually reflects the latest data while keeping the OLTP database lightweight and fast.


Choosing the Right Databases

OLTP Database: DynamoDB and Beyond

Grab chose DynamoDB as their OLTP database for several reasons:

  1. Managed Service: DynamoDB is fully managed by AWS, which means Grab doesn’t have to worry about scaling or availability.
  2. Strong Consistency: DynamoDB provides strong consistency for primary key reads, which is essential for transactional queries.
  3. Handling Traffic Spikes: DynamoDB excels at handling spiky traffic by dynamically redistributing hot partitions.

Personal Insight: Optimizing DynamoDB Usage

One of the most brilliant optimizations I’ve seen is Grab’s use of partial indexing in DynamoDB:

Instead of creating a global secondary index (GSI) directly on the user_id field, they introduced a new attribute called user_id_order_GSI.

  • When an order is in the "ongoing" state, user_id_order_GSI is set to the user_id. Once the order is completed, this field is set to null, effectively removing it from the GSI.
  • This ensures that the GSI only contains ongoing orders, making queries faster and more efficient.

Example: If a user has 100 orders, but only 2 are ongoing, the GSI will only index those 2 orders, reducing the size of the index and improving query performance.

OLAP Database: MySQL and Alternatives

For analytical queries, Grab chose MySQL over a full-fledged data warehouse. Here’s why:

  • Cost-Effectiveness: A data warehouse would be overkill for Grab’s use case, as they don’t need to process petabytes of data.
  • Near Real-Time Analytics: MySQL supports near real-time queries, which is sufficient for Grab’s analytical needs.

Foodpanda’s Approach: Foodpanda, on the other hand, uses Google BigQuery for analytical queries. BigQuery is a serverless data warehouse that excels at handling large-scale analytics. While it’s more expensive than MySQL, it provides better performance for complex queries and large datasets.


Handling Failures and Edge Cases

In a distributed system, failures are inevitable. Here’s how companies like Grab and Foodpanda handle edge cases:

1. Kafka Downtime

Kafka has an SLA of 99.95%, but Grab uses Amazon SQS as a fallback. If Kafka is down, messages are pushed to SQS instead.

Example: During a Kafka outage, Grab’s order service automatically switches to SQS, ensuring that no messages are lost.

2. SQS Downtime

If SQS fails, AWS’s Dead Letter Queue (DLQ) mechanism ensures that no messages are lost. Messages that cannot be processed are moved to the DLQ for later retry.

Example: If SQS is unavailable, messages are stored in the DLQ and retried once SQS is back online.

3. Out-of-Order Events

To handle out-of-order events, companies use upserts (update or insert) in the OLAP database. They also include a timestamp check to ensure that older updates don’t overwrite newer ones.

Example: If an "update order" event arrives before the "create order" event, the system performs an upsert to handle the discrepancy. Additionally, a timestamp check ensures that only the latest update is applied.


Broader Industry Practices

1. Caching for Performance

Both Grab and Foodpanda use caching to reduce the load on their databases. For example, they cache frequently accessed data like restaurant menus or user profiles in Redis or Memcached.

Example: When a user opens the Foodpanda app, the restaurant menu is served from the cache, reducing the number of database queries.

2. Sharding for Scalability

To handle large datasets, companies often use sharding. For example, Foodpanda shards its MySQL database by region, ensuring that queries are routed to the appropriate shard.

Example: Orders from Singapore are stored in one shard, while orders from Malaysia are stored in another. This reduces the load on each shard and improves query performance.

3. Data Retention Policies

To keep the OLTP database lean, companies implement data retention policies. For example, Grab deletes orders older than 3 months from DynamoDB but retains them in the OLAP database for analytics.

Example: Using DynamoDB’s Time-to-Live (TTL) feature, Grab automatically deletes old orders, reducing storage costs.


To view or add a comment, sign in

More articles by Abu Jafor Mohammad Saleh

Insights from the community

Others also viewed

Explore topics