When And How To Use MongoDB For Distributed Database Architecture?

When And How To Use MongoDB For Distributed Database Architecture?

Introduction

MongoDB is a powerful NoSQL database designed for high scalability, flexibility, and distributed deployments. As organizations increasingly require robust database architectures that support large-scale applications, MongoDB offers multiple strategies for distributing data, including replica sets for high availability and sharding for horizontal scaling. This paper explores the best practices for deploying MongoDB in a distributed architecture, when to use sharding, and how tools like mongosync and replica sets enhance availability and resilience.

Note on Code snippets: Code snippets are provided to explain the concept. If you are not a developer, just read it as plain english as mostly you will understand what it says. Or ignore the code snippets and just read the remaining document.

Why do we need distributed databases?


Article content


Distributed databases are essential for several reasons:

  1. Scalability: As organizations grow, their data needs increase. Distributed databases allow for horizontal scaling, meaning that additional servers can be added to handle increased loads without significant changes to the existing infrastructure.
  2. High Availability: By distributing data across multiple nodes, distributed databases can ensure that the system remains operational even if one or more nodes fail. This redundancy helps in maintaining uptime and reliability.
  3. Performance: Distributed databases can improve performance by spreading the workload across multiple servers. This can lead to faster query responses and better overall system performance, especially for read and write operations.
  4. Geographic Distribution: For organizations with a global presence, distributed databases can store data closer to where it is needed. This reduces latency and improves access speeds for users in different regions.
  5. Data Redundancy and Disaster Recovery: Distributed databases often replicate data across multiple locations, which provides a safeguard against data loss. In the event of a failure, data can be recovered from another node.
  6. Flexibility: They can handle various types of data and workloads, making them suitable for a wide range of applications, from transactional systems to big data analytics.
  7. Load Balancing: By distributing data and requests across multiple nodes, distributed databases can balance the load, preventing any single node from becoming a bottleneck.

In summary, distributed databases are crucial for modern applications that require scalability, reliability, and performance in handling large volumes of data across diverse environments.

How to migrate data from one database cluster to another?

What is a cluster in MongoDB?

In MongoDB, a cluster refers to a collection of servers that work together to provide a distributed database environment. Clusters can be configured in different ways to achieve various goals such as high availability, scalability, and performance. There are two primary types of clusters in MongoDB:

  1. Replica Set: For redundancy and high availability
  2. Sharded Cluster: For distributing data and load

In summary, a cluster in MongoDB is a way to organize multiple servers to work together.

How is data migrated between clusters?

The mongosync binary is the primary process used in Cluster-to-Cluster Sync. mongosync migrates data from a source cluster to a destination cluster and keeps the clusters in continuous sync until you finalize the sync. In addition to continuous data synchronization, mongosync can also perform a one time data migration between clusters.

Mongosync can synchronize data between MongoDB clusters in real-time using oplog-based replication. The oplog (operations log) in MongoDB is a collection that records all write operations (insert, update, delete) that modify the data in the database, enabling real-time data replication across nodes.

How to configure mongosync

--source "mongodb://operational-db"
--destination "mongodb://analytics-db"
--destination-namespace "analytics.*"
--mode continuous # sync in real time
--oplog # 1GB of mongo oplog can contain 1 million normal size records => so no case of oplogs filling up
--filter-oplog '{"op": {"$in": ["i", "u"]}}' #sync insert, update operations only. Exclude delete operations from syncing.

--no-reverse-sync #Changes made in the destination database do not propagate back to source db

# Transform data: Exclude confidential fields. For example, private data can be excluded
--transform '
function(doc) {  
   delete doc.user_name;
   delete doc.user_details;
   return doc;
}'        

What Is Distributed MongoDB?

A distributed MongoDB deployment enables data to be replicated and/or partitioned across multiple servers, improving scalability, reliability, and geographic distribution.

MongoDB provides two key distribution mechanisms:

  • Replica Sets: Ensure high availability and disaster recovery by maintaining copies of data on multiple nodes.
  • Sharding: Distributes data across multiple nodes to balance load and handle large datasets.

Each method serves distinct use cases and is often used in combination to optimize performance and reliability.

A node in MongoDB refers to an individual server instance participating in a replica set or sharded cluster, acting as either one of these, depending on its role in the distributed database system:

  1. A primary or secondary data bearing node
  2. Configuration server
  3. Arbiter: one who decides which of the node should be elected as primary at any point in time)

How is High Availability ensured? Ans: Using replica sets.

A replica set can consist of one or more client applications. It must include at least one primary data-bearing node and can have zero or more secondary data-bearing nodes. The application utilizes the MongoDB driver for reading and writing operations. By default, all read and write requests are directed to the primary data-bearing node. However, it can be configured to prefer secondary nodes for read operations, allowing for better load balancing. Any inserts, updates, or deletions performed on the primary database are automatically synchronized across all secondary nodes.


Article content


What is a Replica Set?

A replica set is a group of MongoDB servers that maintain identical copies of the same dataset. One node acts as the primary, while others serve as secondaries and sync data from the primary using an oplog (operations log).

When to use a replica set?

  1. to provide high availability via redundancy
  2. normally done in production environments as dev environments don’t need high availability
  3. often data bearing nodes of a replica set are in different data centres to provide disaster recovery

How Replica Set Failover Works?

  • If the primary node fails, an election is held among the secondaries.
  • A new primary is elected automatically.
  • Applications automatically reconnect to the new primary for write operations.

What is Mongodb driver?

The MongoDB driver is a software library that enables applications to communicate with a MongoDB database. Located on the application server side (rather than the database server side), the driver offers essential functions and methods for various operations, including establishing a connection to the database, executing queries, and managing data. It effectively manages the communication between the application and the MongoDB server, ensuring that requests are accurately formatted and responses are processed correctly.

Key features of MongoDB drivers include:

  • Automatic Failover: The driver can automatically handle failover scenarios in replica sets, redirecting requests to the new primary node if the current primary fails.
  • Connection Management: It manages connections to the database, allowing for efficient use of resources.
  • Support for CRUD Operations: The driver provides methods for creating, reading, updating, and deleting documents in the database.
  • Load Balancing: It can distribute read operations across multiple nodes in a replica set based on configured read preferences.

MongoDB drivers are available for various programming languages, including Python, Java, Node.js, C#, and many others, enabling developers to integrate MongoDB into their applications easily.

Role of MongoDB & Mongo Drivers in Load Balancing

  • MongoDB drivers efficiently manage failover and replica set selection without requiring any additional coding on the application side.
  • The application server will always connect to the current Primary for write operations.

In the event that the Primary fails:

  • One of the secondary data-bearing nodes is promoted to Primary (with the arbiter playing a role here).
  • The MongoDB driver seamlessly redirects requests to the new Primary following the failover.

Read preference settings can be customized to enhance load balancing.

How to configure a Replica Set?

MongoDB allows you to assign priority values to nodes in a replica set to control which node is more likely to become the primary during an election.

Each MongoDB node in the replica set should have a configuration file (mongod.conf) where you define the replica set settings.

Example mongod.conf for a node:

replication:
  replSetName: rs0
  enableMajorityReadConcern: true
  members:
    - _id: 0
      host: "node1:27017"
      priority: 2  # Highest priority (most likely to be primary)
    - _id: 1
      host: "node2:27017"
      priority: 1  # Default priority
    - _id: 2
      host: "node3:27017"
      priority: 0  # Will never become primary (hidden node)
        

To set up a replica set, follow these steps:

Start MongoDB instances on three different servers (ex: one primary and two secondaries):

mongod --replSet rs0 --port 27017 --dbpath /data/node1 --bind_ip 0.0.0.0
mongod --replSet rs0 --port 27018 --dbpath /data/node2 --bind_ip 0.0.0.0
mongod --replSet rs0 --port 27019 --dbpath /data/node3 --bind_ip 0.0.0.0        

Initiate the replica set from any node:

rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "node1:27017" },
        { _id: 1, host: "node2:27018" },
        { _id: 2, host: "node3:27019" }
    ]
});        

Verify the replica set status:

rs.status();        


Accessing a Replica Set from Client Applications

Client applications connect to a replica set using a MongoDB connection string that includes all replica set members:

mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0        

Example in Python:

from pymongo import MongoClient

# Connect to the replica set
client = MongoClient("mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0")

# Select the database and collection
db = client["mydatabase"]
collection = db["mycollection"]

# Insert a document
collection.insert_one({"name": "Alice", "age": 30})

# Fetch documents
for doc in collection.find():
    print(doc)        


Understanding Oplog (Operations Log)

The oplog (operations log) is a special capped collection (local.oplog.rs) used by MongoDB replica sets to record all write operations (insert, update, delete, command). Command could be operations like Create Index, Drop Collection, Rename Collection. Oplog enables replication by allowing secondary nodes to apply the same operations as the primary.

When is Oplog Used?

  • Replication: Secondaries read from the oplog to apply changes from the primary.
  • Point-in-Time Recovery: The oplog can be used to replay recent transactions after a failure.
  • Incremental Data Sync: Tools like mongosync use the oplog for real-time data synchronization. Mongosync is explained later in this paper.

Explanation of Each Key in an Oplog Entry

An example oplog entry looks like this:

{
  "ts": { "$timestamp": { "t": 1739986451, "i": 3 } },
  "op": "i",
  "ns": "log_database.logs",
  "o": { "_id": "abc123", "message": "New log entry" },
  "lsid": { "id": "session-xyz" },
  "txnNumber": 3
}        

Sharding: to distribute your data

Sharding is a technique used to distribute large amounts of data across multiple database nodes to enhance performance and scalability. However, sharding is not always necessary.


Start with a Replica Set → Sharding adds complexity; only use it when scaling demands it.

Use Indexes Effectively → Poor indexing can negate sharding benefits.


When Is Sharding Needed?

  1. Large Datasets (Big Data Applications): If your data exceeds the storage capacity of a single server (say 100s of gigabytes of data), sharding allows data to be distributed across multiple machines.
  2. High Read & Write Workloads: If a single MongoDB node is overloaded with queries and inserts, say millions of queries per day, sharding distributes the load, improving performance.
  3. High Throughput Applications: Applications that handle millions of requests per second (e.g., social media, gaming, streaming) require horizontal scaling with sharding.
  4. Low-Latency Global Applications: If users are spread across different geographical regions, for example on social media sites, sharding can route queries to the nearest replica for faster response times.
  5. Real-Time Analytics & IoT Data: When ingesting high-frequency time-series data (e.g., IoT sensor logs, event tracking), sharding prevents bottlenecks.
  6. Multi-Tenant Applications: If multiple tenants (clients) need isolated data storage, each tenant's data can be stored on separate shards.

When NOT to Use Sharding?

  1. Small Datasets (< 100 GB): A single replica set is easier to manage than a sharded cluster.
  2. Low Query Load: If queries are not causing CPU, RAM, or disk I/O issues, sharding is unnecessary.
  3. Write-Once, Read-Often Workloads: If your database is mostly read-heavy with minimal writes, replica sets (not sharding) are a better choice.
  4. Simple Applications: For apps with moderate growth, scaling up with better hardware (vertical scaling) is often easier than sharding.

What is a sharded cluster?

A MongoDB sharded cluster consists of the following components:

  1. Shard: Each shard contains a subset of the sharded data. Each shard must be deployed as a replica set.
  2. Query Router: The query routing is done by mongos, providing an interface between client applications and the sharded cluster.
  3. Config servers: Config servers store metadata and configuration settings for the cluster. Config servers must be deployed as a replica set (CSRS).
  4. MongoDB shards data at the collection level, distributing the records of a collection across the shards in the cluster.

The following graphic describes the interaction of components within a sharded cluster:


Article content


How To Connect to a Sharded Cluster?

You must connect to a mongos router to interact with any collection in the sharded cluster. This includes sharded and unsharded collections. Clients should never connect to a single shard in order to perform read or write operations.


Article content


You can connect to a mongos the same way you connect to a mongod using the mongosh or a MongoDB driver.

Benefits and Drawbacks of Sharding

Scalability

  • Benefit of Sharding: Can handle large data sets by distributing data across multiple servers.
  • Drawback of Sharding: Increased complexity in deployment and maintenance.

Performance

  • Benefit of Sharding: Improves read and write throughput by spreading the load.
  • Drawback of Sharding: Queries spanning multiple shards may introduce latency.

High Availability

  • Benefit of Sharding:Failure of one shard does not bring down the entire database.
  • Drawback of Sharding: If a critical shard goes down without replication, data loss may occur.

Geographic Distribution

  • Benefit of Sharding: Data can be partitioned by region for better locality.
  • Drawback of Sharding: Requires careful planning to prevent uneven data distribution.

Cost

  • Benefit of Sharding: Reduces hardware bottlenecks by utilizing multiple servers.
  • Drawback of Sharding: Increases infrastructure costs due to additional servers and networking.


How to Send Data from OLTP to OLAP MongoDB?

Use MongoSync to continuously replicate data from your OLTP (OnLine Transaction Processing) database to your OLAP (OnLine Analytical Processing) database in MongoDB. This is an approach for separating operational and analytical workloads while keeping the data synchronized. A smaller transactional database ensures faster database operations.


How MongoSync Helps in OLTP to OLAP Sync?

Real-time or Batch Syncing → Supports continuous or one-time data synchronization.

No Performance Impact on OLTP → Keeps analytical queries separate from transactional queries.

Filter & Transform Data → Can exclude unnecessary collections (e.g., logs, temporary data).

No Reverse Sync → Ensures analytics DB doesn't affect the transactional DB.


🛠 1. Example MongoSync Configuration

You can configure MongoSync to copy OLTP data into an OLAP database.

Example mongosync-config.yml

syncGroup:
  - name: "oltp-to-olap-sync"
    source:
      connection: "mongodb://oltp-db-host:27017"
    destination:
      connection: "mongodb://olap-db-host:27017"
    options:
      mode: "continuous"  # Keeps syncing data in real-time
      noReverseSync: true  # Prevents changes in OLAP DB from syncing back to OLTP DB
      ignoreDatabases:
        - "admin"
        - "local"
        - "config"
      includeDatabases:
        - "orders_db"
        - "customers_db"
      ignoreCollections:
        - "sessions"  # Exclude unnecessary collections
        - "temporary_logs"
      batchSize: 500  # Number of documents to sync per batch
      workerCount: 4  # Parallel workers to speed up sync
    logging:
      level: "info"
      output: "/var/log/mongosync.log"
    monitoring:
      enabled: true
      port: 9090
        

🛠 2. Running MongoSync

Run MongoSync with the configuration file:

mongosync --config mongosync-config.yml        

This will continuously sync data from oltp-db-host to olap-db-host.


📊 3. Optimizations for OLAP Performance

Once the data is in the OLAP database, you can:

  • Index analytical fields for fast queries.
  • Use aggregation pipelines to transform data.
  • Create materialized views for reporting.

Example: Indexing Analytical Data

db.orders.createIndex({ "order_date": 1 })  # Faster range queries 
db.orders.createIndex({ "customer_id": 1 })  # Optimize customer analytics         

Example: Pre-Aggregated Data for OLAP Queries

db.orders.aggregate([
  { "$group": { "_id": "$customer_id", "total_spent": { "$sum": "$amount" } } },
  { "$out": "customer_spending_summary" }
])        

Conclusion

MongoDB’s distributed architecture offers powerful capabilities for scaling applications while ensuring reliability. Replica sets provide high availability, while sharding enables horizontal scaling when you have to deal with big data or high throughput db operations. Change streams allow real-time data streaming, and mongosync facilitates data synchronization across databases with transformation capabilities. By carefully choosing the right strategy, organizations can build resilient and scalable data architectures using MongoDB.








To view or add a comment, sign in

More articles by Mohit Shukla - Software Engineering Leader

Insights from the community

Others also viewed

Explore topics