When And How To Use MongoDB For Distributed Database Architecture?
Introduction
MongoDB is a powerful NoSQL database designed for high scalability, flexibility, and distributed deployments. As organizations increasingly require robust database architectures that support large-scale applications, MongoDB offers multiple strategies for distributing data, including replica sets for high availability and sharding for horizontal scaling. This paper explores the best practices for deploying MongoDB in a distributed architecture, when to use sharding, and how tools like mongosync and replica sets enhance availability and resilience.
Note on code snippets: Code snippets are provided to explain the concepts. If you are not a developer, you can read them as plain English, since you will mostly understand what they say, or skip the code snippets entirely and read the rest of the document.
Why do we need distributed databases?
Distributed databases are essential for several reasons: they scale horizontally as data volumes and traffic grow, they keep the system available when individual servers fail, and they improve performance by spreading load across machines and placing data closer to users.
In summary, distributed databases are crucial for modern applications that require scalability, reliability, and performance in handling large volumes of data across diverse environments.
How to migrate data from one database cluster to another?
What is a cluster in MongoDB?
In MongoDB, a cluster refers to a collection of servers that work together to provide a distributed database environment. Clusters can be configured in different ways to achieve various goals such as high availability, scalability, and performance. There are two primary types of clusters in MongoDB: replica sets, which keep redundant copies of the same data for high availability, and sharded clusters, which partition data across multiple servers for horizontal scaling.
In summary, a cluster in MongoDB is a way to organize multiple servers to work together.
How is data migrated between clusters?
The mongosync binary is the primary process used in Cluster-to-Cluster Sync. mongosync migrates data from a source cluster to a destination cluster and keeps the clusters in continuous sync until you finalize the sync. In addition to continuous data synchronization, mongosync can also perform a one-time data migration between clusters.
Mongosync can synchronize data between MongoDB clusters in real-time using oplog-based replication. The oplog (operations log) in MongoDB is a collection that records all write operations (insert, update, delete) that modify the data in the database, enabling real-time data replication across nodes.
How to configure mongosync
mongosync
  --source "mongodb://operational-db"
  --destination "mongodb://analytics-db"
  --destination-namespace "analytics.*"
  --mode continuous            # sync in real time
  --oplog                      # 1 GB of oplog can hold roughly 1 million normal-size records, so there is no risk of the oplog filling up
  --filter-oplog '{"op": {"$in": ["i", "u"]}}'   # sync insert and update operations only; exclude delete operations from syncing
  --no-reverse-sync            # changes made in the destination database do not propagate back to the source database
  # Transform data: exclude confidential fields (for example, private data can be dropped)
  --transform '
    function(doc) {
      delete doc.user_name;
      delete doc.user_details;
      return doc;
    }'
What Is Distributed MongoDB?
A distributed MongoDB deployment enables data to be replicated and/or partitioned across multiple servers, improving scalability, reliability, and geographic distribution.
MongoDB provides two key distribution mechanisms: replication through replica sets, which maintain redundant copies of the data for high availability, and sharding, which partitions data across multiple servers for horizontal scaling.
Each method serves distinct use cases and is often used in combination to optimize performance and reliability.
A node in MongoDB refers to an individual server instance participating in a replica set or sharded cluster. Depending on its role in the distributed database system, a node may act as a primary, a secondary, or an arbiter within a replica set, or as a shard member, config server, or mongos router within a sharded cluster.
How is High Availability ensured? Answer: by using replica sets.
A replica set serves one or more client applications. It must include exactly one primary data-bearing node and can have zero or more secondary data-bearing nodes. The application uses the MongoDB driver for read and write operations. By default, all read and write requests are directed to the primary data-bearing node. However, the driver can be configured to prefer secondary nodes for read operations, allowing for better load balancing. Any inserts, updates, or deletions performed on the primary are automatically replicated to all secondary nodes.
What is a Replica Set?
A replica set is a group of MongoDB servers that maintain identical copies of the same dataset. One node acts as the primary, while others serve as secondaries and sync data from the primary using an oplog (operations log).
When to use a replica set?
How Does Replica Set Failover Work?
What is a MongoDB driver?
The MongoDB driver is a software library that enables applications to communicate with a MongoDB database. Located on the application server side (rather than the database server side), the driver offers essential functions and methods for various operations, including establishing a connection to the database, executing queries, and managing data. It effectively manages the communication between the application and the MongoDB server, ensuring that requests are accurately formatted and responses are processed correctly.
Key features of MongoDB drivers include:
MongoDB drivers are available for various programming languages, including Python, Java, Node.js, C#, and many others, enabling developers to integrate MongoDB into their applications easily.
Role of MongoDB & Mongo Drivers in Load Balancing
In the event that the primary fails, the remaining nodes hold an election to choose a new primary, and the MongoDB driver detects the topology change and automatically redirects operations to the newly elected primary.
Read preference settings can be customized to enhance load balancing.
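For illustration, here is a minimal PyMongo sketch of setting a read preference; the hosts and replica set name (node1, node2, node3, rs0) are the placeholders used later in this document, and the option shown routes reads to secondaries when they are available:
from pymongo import MongoClient
# Prefer secondaries for reads; writes still go to the primary (hosts are placeholders)
client = MongoClient(
    "mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0&readPreference=secondaryPreferred"
)
db = client["mydatabase"]
# This read is served by a secondary when one is available
print(db["mycollection"].count_documents({}))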
How to configure a Replica Set?
MongoDB allows you to assign priority values to nodes in a replica set to control which node is more likely to become the primary during an election.
Each MongoDB node in the replica set should have a configuration file (mongod.conf) in which you name the replica set it belongs to.
Example mongod.conf for a node:
replication:
  replSetName: rs0
  enableMajorityReadConcern: true
Member-level settings such as priorities are part of the replica set configuration document, which is applied with rs.initiate() or changed later with rs.reconfig():
{
  _id: "rs0",
  members: [
    { _id: 0, host: "node1:27017", priority: 2 },  // highest priority (most likely to become primary)
    { _id: 1, host: "node2:27018", priority: 1 },  // default priority
    { _id: 2, host: "node3:27019", priority: 0 }   // will never become primary
  ]
}
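If the replica set is already running, the same priorities can be applied through a reconfiguration. Below is a hedged PyMongo sketch (hostnames and priority values are illustrative, not prescriptive):
from pymongo import MongoClient
# Connect to the replica set; admin commands are routed to the current primary
client = MongoClient("mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0")
# Fetch the current replica set configuration document
config = client.admin.command("replSetGetConfig")["config"]
# Make the first member the preferred primary
config["members"][0]["priority"] = 2
# The raw replSetReconfig command requires the config version to be bumped manually
config["version"] += 1
client.admin.command("replSetReconfig", config)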
To set up a replica set, follow these steps:
Start MongoDB instances on three different servers (e.g., one primary and two secondaries):
mongod --replSet rs0 --port 27017 --dbpath /data/node1 --bind_ip 0.0.0.0
mongod --replSet rs0 --port 27018 --dbpath /data/node2 --bind_ip 0.0.0.0
mongod --replSet rs0 --port 27019 --dbpath /data/node3 --bind_ip 0.0.0.0
Initiate the replica set from any node:
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "node1:27017" },
{ _id: 1, host: "node2:27018" },
{ _id: 2, host: "node3:27019" }
]
});
Verify the replica set status:
rs.status();
Accessing a Replica Set from Client Applications
Client applications connect to a replica set using a MongoDB connection string that includes all replica set members:
mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0
Example in Python:
from pymongo import MongoClient
# Connect to the replica set
client = MongoClient("mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0")
# Select the database and collection
db = client["mydatabase"]
collection = db["mycollection"]
# Insert a document
collection.insert_one({"name": "Alice", "age": 30})
# Fetch documents
for doc in collection.find():
    print(doc)
Understanding Oplog (Operations Log)
The oplog (operations log) is a special capped collection (local.oplog.rs) that MongoDB replica sets use to record all write operations (inserts, updates, deletes, and commands). Commands include operations such as creating an index, dropping a collection, or renaming a collection. The oplog enables replication by allowing secondary nodes to apply the same operations as the primary.
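As a small illustration, the most recent oplog entry can be inspected with PyMongo; the connection string below reuses the placeholder hosts from the replica set example:
from pymongo import MongoClient
# Connect to a replica set member (hosts are placeholders)
client = MongoClient("mongodb://node1:27017,node2:27018,node3:27019/?replicaSet=rs0")
# The oplog is a capped collection in the local database
oplog = client["local"]["oplog.rs"]
# Print the most recent write operation recorded in the oplog
for entry in oplog.find().sort("$natural", -1).limit(1):
    print(entry)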
When is Oplog Used?
Explanation of Each Key in an Oplog Entry
An example oplog entry looks like this:
{
  "ts": { "$timestamp": { "t": 1739986451, "i": 3 } },   // ts: timestamp of the operation (seconds since epoch plus an ordinal within that second)
  "op": "i",                                              // op: operation type ("i" = insert, "u" = update, "d" = delete, "c" = command)
  "ns": "log_database.logs",                              // ns: namespace, i.e. the target database and collection
  "o": { "_id": "abc123", "message": "New log entry" },   // o: the document or change being applied
  "lsid": { "id": "session-xyz" },                        // lsid: logical session ID that issued the operation
  "txnNumber": 3                                          // txnNumber: transaction number within that session
}
Sharding: to distribute your data
Sharding is a technique used to distribute large amounts of data across multiple database nodes to enhance performance and scalability. However, sharding is not always necessary.
✅ Start with a Replica Set → Sharding adds complexity; only use it when scaling demands it.
✅ Use Indexes Effectively → Poor indexing can negate sharding benefits.
When Is Sharding Needed?
When NOT to Use Sharding?
What is a sharded cluster?
A MongoDB sharded cluster consists of the following components: shards, each of which holds a subset of the sharded data and is typically deployed as a replica set; mongos routers, which act as query routers between client applications and the cluster; and config servers, which store the cluster's metadata and configuration settings.
Within a sharded cluster, client applications connect to mongos routers, which use the metadata held on the config servers to route reads and writes to the appropriate shards.
How To Connect to a Sharded Cluster?
You must connect to a mongos router to interact with any collection in the sharded cluster. This includes sharded and unsharded collections. Clients should never connect to a single shard in order to perform read or write operations.
You can connect to a mongos the same way you connect to a mongod using the mongosh or a MongoDB driver.
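As a hedged sketch (the mongos-host name, the orders_db database, and the hashed customer_id shard key are assumptions for this example), connecting through a mongos with PyMongo and sharding a collection could look like this:
from pymongo import MongoClient
# Always connect to the mongos router, never to an individual shard
client = MongoClient("mongodb://mongos-host:27017")
# Enable sharding for a database, then shard a collection on a hashed key
client.admin.command("enableSharding", "orders_db")
client.admin.command("shardCollection", "orders_db.orders", key={"customer_id": "hashed"})
# Reads and writes go through mongos exactly as they would against a single mongod
orders = client["orders_db"]["orders"]
orders.insert_one({"customer_id": "c-1001", "amount": 42.5})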
Benefits and Drawbacks of Sharding
Sharding should be weighed along several dimensions: scalability, performance, high availability, geographic distribution, and cost.
How to Send Data from OLTP to OLAP MongoDB?
Use MongoSync to continuously replicate data from your OLTP (OnLine Transaction Processing) database to your OLAP (OnLine Analytical Processing) database in MongoDB. This is an approach for separating operational and analytical workloads while keeping the data synchronized. A smaller transactional database ensures faster database operations.
How MongoSync Helps in OLTP to OLAP Sync?
✅ Real-time or Batch Syncing → Supports continuous or one-time data synchronization.
✅ No Performance Impact on OLTP → Keeps analytical queries separate from transactional queries.
✅ Filter & Transform Data → Can exclude unnecessary collections (e.g., logs, temporary data).
✅ No Reverse Sync → Ensures analytics DB doesn't affect the transactional DB.
🛠 1. Example MongoSync Configuration
You can configure MongoSync to copy OLTP data into an OLAP database.
Example mongosync-config.yml
syncGroup:
  - name: "oltp-to-olap-sync"
    source:
      connection: "mongodb://oltp-db-host:27017"
    destination:
      connection: "mongodb://olap-db-host:27017"
    options:
      mode: "continuous"        # Keeps syncing data in real time
      noReverseSync: true       # Prevents changes in the OLAP DB from syncing back to the OLTP DB
      ignoreDatabases:
        - "admin"
        - "local"
        - "config"
      includeDatabases:
        - "orders_db"
        - "customers_db"
      ignoreCollections:
        - "sessions"            # Exclude unnecessary collections
        - "temporary_logs"
      batchSize: 500            # Number of documents to sync per batch
      workerCount: 4            # Parallel workers to speed up the sync
logging:
  level: "info"
  output: "/var/log/mongosync.log"
monitoring:
  enabled: true
  port: 9090
🛠 2. Running MongoSync
Run MongoSync with the configuration file:
mongosync --config mongosync-config.yml
✅ This will continuously sync data from oltp-db-host to olap-db-host.
📊 3. Optimizations for OLAP Performance
Once the data is in the OLAP database, you can:
Example: Indexing Analytical Data
db.orders.createIndex({ "order_date": 1 }) # Faster range queries
db.orders.createIndex({ "customer_id": 1 }) # Optimize customer analytics
Example: Pre-Aggregated Data for OLAP Queries
db.orders.aggregate([
{ "$group": { "_id": "$customer_id", "total_spent": { "$sum": "$amount" } } },
{ "$out": "customer_spending_summary" }
])
Conclusion
MongoDB’s distributed architecture offers powerful capabilities for scaling applications while ensuring reliability. Replica sets provide high availability, while sharding enables horizontal scaling when you have to deal with very large data volumes or high-throughput database operations. Change streams allow real-time data streaming, and mongosync facilitates data synchronization across databases with transformation capabilities. By carefully choosing the right strategy, organizations can build resilient and scalable data architectures using MongoDB.