MongoDB: From Documents to Distributed Clusters

Goal: Go beyond "it's a JSON database" and understand MongoDB's internal mechanics, scaling strategies, indexing, and aggregation.

1. Philosophy & Document Model

Why MongoDB? (Significance)
Relational databases normalize data across many tables (e.g., Users, Orders, Payments) and rely on JOINs. At scale, JOINs can be expensive. MongoDB's guiding principle is:

Data that is accessed together should be stored together.

Benefits:
- Aligns naturally with object-oriented models
- Reduces impedance mismatch
- Fewer JOINs → faster reads


BSON (Binary JSON)

MongoDB stores data as BSON (not plain JSON).

Why BSON?
- Rich data types: Date, Binary, Int32/Int64/Decimal128, ObjectId, etc.
- Faster traversal than text JSON
- Optimized for indexing and storage
The _id Field (ObjectId)
- 12 bytes: 4-byte timestamp, 5-byte random value, 3-byte counter
- The creation time can be extracted from _id
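As a sketch of why this works, the timestamp can be recovered from an ObjectId's hex string in plain JavaScript (the hex value below is hypothetical; in mongosh, the built-in `ObjectId("...").getTimestamp()` does the same thing):

```javascript
// The first 4 bytes (8 hex chars) of an ObjectId are a big-endian
// Unix timestamp in seconds.
function objectIdTimestamp(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Hypothetical ObjectId value, for illustration:
console.log(objectIdTimestamp("65a1000000000000aabbccdd").toISOString());
```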
JSON:

{
  "name": "Spring Boot",
  "completed": false,
  "videos": 160,
  "likes": 10400,
  "registrations": 4600,
  "instructors": ["Prakash"],
  "tech": ["Java", "Spring"],
  "level": "Advanced"
}

MQL (mongosh):
{
  name: "Spring Boot",
  completed: false,
  videos: 160,
  likes: 10400,
  registrations: 4600,
  instructors: ["Prakash"],
  tech: ["Java", "Spring"],
  level: "Advanced",
  date: new Date() // new Date() returns a Date object; bare Date() returns a string
}

2. Core Mechanics & CRUD
CRUD Operations

Insert
- insertOne(), insertMany() (atomic per document)

Update operators (key ones)
- $set → update specific fields
- $inc → atomic increments (safe for concurrency)
- $push → append to arrays
- $addToSet → append only if not already present
- $unset → remove fields
Atomicity example

// BAD: read-modify-write (race condition)
let user = db.users.findOne({ _id: 1 });
user.visits++;
db.users.save(user); // save() is deprecated; shown only to illustrate the race

// GOOD: atomic operator
db.users.updateOne(
  { _id: 1 },
  { $inc: { visits: 1 } }
);

Schema Design: Embed vs Reference
Embedding (default)
- Fast reads
- Single query fetches all related data
- Bound by the 16MB document size limit
- Ideal when the relationship is bounded and frequently read together

Referencing
- Better for unbounded growth (logs, events, analytics)
- Avoids document bloat
- Requires additional queries, or $lookup in an aggregation, when joining

Rule of thumb:
- Embed when data is mostly read together and bounded
- Reference when data grows without bound or is shared across many parents
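To make the trade-off concrete, here is a small sketch (hypothetical blog data, plain JavaScript objects) of the same one-to-many relationship modeled both ways:

```javascript
// Embedded: comments live inside the post document. One read fetches
// everything, but the array must stay bounded (16MB document cap).
const postEmbedded = {
  _id: "post1",
  title: "Intro to BSON",
  comments: [
    { author: "alice", text: "Great post!" },
    { author: "bob", text: "+1" },
  ],
};

// Referenced: comments are separate documents that point back at the post.
// Growth is unbounded, but reads need a second query or a $lookup.
const postRef = { _id: "post1", title: "Intro to BSON" };
const comments = [
  { _id: "c1", postId: "post1", author: "alice", text: "Great post!" },
  { _id: "c2", postId: "post1", author: "bob", text: "+1" },
];

// App-side "join" for the referenced model (conceptually what $lookup does):
const joined = {
  ...postRef,
  comments: comments.filter((c) => c.postId === postRef._id),
};
console.log(joined.comments.length); // 2
```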
3. Aggregation Framework

Concept
- Aggregation is a pipeline of stages, similar to Linux pipes:
  Input → Filter → Group → Transform → Output

Key stages
- $match → filter early (can use indexes)
- $group → aggregate
- $lookup → left outer join across collections
- $project → reshape fields
- $sort → ordering (ideally backed by an index)
Example: Total Revenue per Category

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
      _id: "$product_category",
      totalRevenue: { $sum: "$amount" }
  }},
  { $sort: { totalRevenue: -1 } }
]);

Note
- Aggregation runs inside MongoDB's C++ engine and is typically much faster than app-side processing.
- Place $match as early as possible to reduce the working set.
4. Indexing (Performance Core)

The Problem: Collection Scan (COLLSCAN)

Without indexes, MongoDB scans every document:
- Complexity: O(N)
- CPU spikes, disk I/O bottlenecks, poor latency at scale

Analogy: searching a shuffled phone book page by page.
The Solution: B-Tree Index (IXSCAN)

MongoDB uses B-tree indexes.

An index entry stores:
- The indexed field value
- A pointer to the document's location

Complexity:
- O(log N)
- 1M docs → ~20 steps; 1B docs → ~30 steps
- Scales with minimal performance cost
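The step counts above follow from the logarithm; a one-line sketch (a binary comparison tree is assumed here; real B-trees have a high branching factor, so actual depth is even smaller):

```javascript
// Approximate lookup steps in a balanced binary comparison tree: log2(N)
const steps = (n) => Math.ceil(Math.log2(n));

console.log(steps(1_000_000));     // 20
console.log(steps(1_000_000_000)); // 30
```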
Query Execution: COLLSCAN vs IXSCAN

COLLSCAN (Collection Scan) ✗
- Scans all documents
- Time complexity: O(N)
- Degrades with dataset size

Example (no index):

db.courses.find({ name: "kubernetes" }).explain("executionStats")
// Key fields (conceptually)
stage: COLLSCAN
totalDocsExamined: 43
nReturned: 1
totalKeysExamined: 0

Interpretation:
- MongoDB examined 43 docs to return 1 result → inefficient.

Problems:
- High CPU, heavy I/O, slow APIs, not production-ready at scale.
IXSCAN (Index Scan) ✓
- Uses the index for lookup
- Time complexity: O(log N)
- Scales well with large datasets

Create index:

db.courses.createIndex({ name: 1 })

Example (with index):

db.courses.find({ name: "kubernetes" }).explain("executionStats")
// Key fields (conceptually)
stage: FETCH
inputStage: IXSCAN
totalDocsExamined: 1
totalKeysExamined: 1
nReturned: 1

Interpretation:
- The index found the key "kubernetes"; FETCH retrieved the document.
Why does FETCH appear with IXSCAN?
- Indexes don't store full documents, only keys and pointers.
- IXSCAN → find the pointer; FETCH → load the document.

Covered Query (no FETCH)

A query is covered when:
- All requested fields are in the index
- MongoDB does not fetch the document

Example:

db.courses.createIndex({ name: 1 })
db.courses.find(
  { name: "kubernetes" },
  { _id: 0, name: 1 }
).explain("executionStats")
// stage: IXSCAN (no FETCH)

This is the fastest possible query for that access pattern.
Key comparison
- Uses index: COLLSCAN ✗ vs IXSCAN ✓
- Docs examined: all vs only matching
- Complexity: O(N) vs O(log N)
- Production readiness: COLLSCAN ✗ vs IXSCAN ✓

Golden performance rule
- Aim for totalDocsExamined == nReturned
- If they are not equal, check indexes and query shape.
Compound Indexes & ESR Rule

Field order matters.

ESR Rule:
- E → Equality: exact-match fields first
- S → Sort: fields used for sorting next
- R → Range: range predicates last

Correct ordering avoids expensive in-memory sorts and maximizes index utility.
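As a sketch, for a hypothetical orders query that filters on an exact status, sorts by orderDate, and ranges over amount, ESR ordering would look like:

```javascript
// E: status (equality), S: orderDate (sort), R: amount (range)
db.orders.createIndex({ status: 1, orderDate: -1, amount: 1 })

// This index lets the query below avoid an in-memory sort:
db.orders.find({ status: "completed", amount: { $gt: 100 } })
         .sort({ orderDate: -1 })
```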
Index Trade-Offs

Indexes improve reads but add write overhead.

Each write must:
- Write the document
- Update every relevant index

Too many indexes can cause:
- Slower writes
- Higher RAM usage
- Disk swapping if the index working set exceeds RAM

Verification with explain()

Use:

db.collection.find(query).explain("executionStats")

Check: totalDocsExamined, totalKeysExamined, nReturned
Goal: totalDocsExamined == nReturned for selective queries
5. Architecture: Replication & Sharding

Replication (High Availability)

Replica set roles:
- Primary → handles writes
- Secondaries → replicate data
- Automatic election on primary failure

Oplog:
- The primary writes operations to the oplog
- Secondaries tail the oplog to stay in sync

Read preferences:
- Reads can be routed to secondaries (with consistency caveats)
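The oplog mechanism can be sketched with a toy model (plain JavaScript, hypothetical op format; the real oplog has its own entry schema): a secondary replays the primary's ordered operation log to converge on the same state.

```javascript
// Toy oplog: insert ("i") and update ("u") entries in timestamp order.
const oplog = [
  { ts: 1, op: "i", doc: { _id: 1, visits: 0 } },
  { ts: 2, op: "u", id: 1, set: { visits: 1 } },
  { ts: 3, op: "u", id: 1, set: { visits: 2 } },
];

// A "secondary" tails the log and applies each entry in order.
const secondary = new Map();
for (const entry of oplog) {
  if (entry.op === "i") secondary.set(entry.doc._id, { ...entry.doc });
  else Object.assign(secondary.get(entry.id), entry.set);
}
console.log(secondary.get(1)); // { _id: 1, visits: 2 }
```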
Sharding (Horizontal Scaling)

Problem:
- A single server cannot handle massive datasets or throughput.

Solution:
- Split data across shards.

Components:
- mongos → query router
- Config servers → metadata
- Shards → data storage

Shard Key:
- A critical design decision
- A poor choice causes hot shards and bottlenecks
- Prefer keys with high cardinality and balanced distribution
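A toy model (plain JavaScript; the hash and chunk bounds are made up for illustration, not MongoDB's real routing) shows why a monotonically increasing shard key creates a hot shard under ranged sharding, while a hashed key spreads writes:

```javascript
const SHARDS = 4;

// Toy hash routing (NOT MongoDB's real hash function)
const hashShard = (n) => (n * 2654435761 % 2 ** 32) % SHARDS;

// Toy ranged routing with hypothetical chunk bounds: large monotonic
// ids all fall into the top chunk.
const rangeShard = (n) => Math.min(SHARDS - 1, Math.floor(n / 1000));

const counts = { ranged: [0, 0, 0, 0], hashed: [0, 0, 0, 0] };
for (let id = 100000; id < 100400; id++) { // 400 new, ever-increasing ids
  counts.ranged[rangeShard(id)]++;
  counts.hashed[hashShard(id)]++;
}
console.log(counts.ranged); // [0, 0, 0, 400] - every write hits one shard
console.log(counts.hashed); // [100, 100, 100, 100] - evenly spread
```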
6. Advanced & Modern Features

Multi-Document ACID Transactions
- Supported since MongoDB 4.0
- Snapshot isolation
- Commit/rollback

Use cases:
- Financial systems, inventory, multi-document consistency
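A minimal mongosh sketch of the transfer pattern, assuming a replica set and hypothetical "accounts" documents with a numeric balance field:

```javascript
const session = db.getMongo().startSession();
const accounts = session.getDatabase("bank").accounts;

session.startTransaction();
try {
  accounts.updateOne({ _id: "alice" }, { $inc: { balance: -100 } });
  accounts.updateOne({ _id: "bob" },   { $inc: { balance:  100 } });
  session.commitTransaction(); // both updates become visible atomically
} catch (e) {
  session.abortTransaction();  // neither update is applied
  throw e;
}
session.endSession();
```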
Time Series Collections

Optimized for:
- IoT sensors, logs, stock prices

Benefits:
- Automatic compression
- High write throughput
- Efficient storage layout
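A mongosh sketch of creating one (MongoDB 5.0+; the "sensorReadings" collection and its fields are hypothetical):

```javascript
db.createCollection("sensorReadings", {
  timeseries: {
    timeField: "ts",        // required: when the measurement was taken
    metaField: "sensorId",  // optional: groups readings from the same source
    granularity: "minutes"  // expected interval between measurements
  },
  expireAfterSeconds: 86400 // optional: auto-delete readings after 24h
})

db.sensorReadings.insertOne({ ts: new Date(), sensorId: "s-42", temp: 21.7 })
```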
Atlas Vector Search (GenAI)
- Stores vector embeddings
- Enables semantic similarity search
- Used in AI/LLM applications

7. Final Takeaways
- NoSQL = Not Only SQL
- Schema validation is optional but powerful
- Indexes are mandatory for performance
- MongoDB excels at large-scale, evolving, semi-structured data

Best fit:
- User profiles
- Product catalogs
- Content platforms
- IoT & real-time systems

SQL is still best for:
- Strong relational integrity
- Highly structured financial systems
How MongoDB works

MongoDB Queries
Getting Started
Connect MongoDB Shell
mongo  # connects to mongodb://127.0.0.1:27017 by default
mongo --host <host> --port <port> -u <user> -p <pwd>  # omit the password if you want a prompt
mongo "mongodb://192.168.1.1:27017"
mongo "mongodb+srv://cluster-name.abcde.mongodb.net/<dbname>" --username <username>  # MongoDB Atlas

Helpers
Show databases:
show dbs
db // prints the current database

Switch database:
use <database_name>

Show collections:
show collections

Run JavaScript file:
load("myScript.js")

CRUD
Create

db.coll.insertOne({name: "Max"})
db.coll.insertMany([{name: "Max"}, {name: "Alex"}]) // ordered bulk insert
db.coll.insertMany([{name: "Max"}, {name: "Alex"}], {ordered: false}) // unordered bulk insert
db.coll.insertOne({date: ISODate()})
db.coll.insertMany([{name: "Max"}], {"writeConcern": {"w": "majority", "wtimeout": 5000}})

Delete
db.coll.deleteOne({name: "Max"})
db.coll.deleteMany({$and: [{name: "Max"}, {justOne: true}]}) // delete all docs matching both conditions
db.coll.deleteMany({$or: [{name: "Max"}, {justOne: true}]})  // delete all docs matching either condition
db.coll.deleteMany({}) // WARNING! Deletes all the docs but not the collection itself and its index definitions
db.coll.deleteMany({name: "Max"}, {"writeConcern": {"w": "majority", "wtimeout": 5000}})
db.coll.findOneAndDelete({"name": "Max"})

Update
db.coll.updateMany({"_id": 1}, {$set: {"year": 2016}}) // $set updates only the listed fields; it does NOT replace the document
db.coll.updateOne({"_id": 1}, {$set: {"year": 2016, name: "Max"}})
db.coll.updateOne({"_id": 1}, {$unset: {"year": 1}})
db.coll.updateOne({"_id": 1}, {$rename: {"year": "date"}})
db.coll.updateOne({"_id": 1}, {$inc: {"year": 5}})
db.coll.updateOne({"_id": 1}, {$mul: {price: 2}})
db.coll.updateOne({"_id": 1}, {$min: {"imdb": 5}})
db.coll.updateOne({"_id": 1}, {$max: {"imdb": 8}})
db.coll.updateMany({"_id": {$lt: 10}}, {$set: {"lastModified": ISODate()}})

Array
db.coll.updateOne({"_id": 1}, {$push: {"array": 1}})
db.coll.updateOne({"_id": 1}, {$pull: {"array": 1}})
db.coll.updateOne({"_id": 1}, {$addToSet: {"array": 2}})
db.coll.updateOne({"_id": 1}, {$pop: {"array": 1}})  // removes the last element
db.coll.updateOne({"_id": 1}, {$pop: {"array": -1}}) // removes the first element
db.coll.updateOne({"_id": 1}, {$pullAll: {"array": [3, 4, 5]}})
db.coll.updateOne({"_id": 1}, {$push: {scores: {$each: [90, 92, 85]}}})
db.coll.updateOne({"_id": 1, "grades": 80}, {$set: {"grades.$": 82}})
db.coll.updateMany({}, {$inc: {"grades.$[]": 10}})
db.coll.updateMany({}, {$set: {"grades.$[element]": 100}}, {arrayFilters: [{"element": {$gte: 100}}]})

Update many
db.coll.updateMany({"year": 1999}, {$set: {"decade": "90's"}})

FindOneAndUpdate
db.coll.findOneAndUpdate({"name": "Max"}, {$inc: {"points": 5}}, {returnNewDocument: true})

Upsert
db.coll.updateOne({"_id": 1}, {$set: {item: "apple"}, $setOnInsert: {defaultQty: 100}}, {upsert: true})

Replace
db.coll.replaceOne({"name": "Max"}, {"firstname": "Maxime", "surname": "Beugnet"})

Write concern
db.coll.updateMany({}, {$set: {"x": 1}}, {"writeConcern": {"w": "majority", "wtimeout": 5000}})

Find
db.coll.findOne() // returns a single document
db.coll.find() // returns a cursor - shows 20 results - type "it" to display more
db.coll.find().pretty()
db.coll.find({name: "Max", age: 32}) // implicit logical "AND"
db.coll.find({date: ISODate("2020-09-25T13:57:17.180Z")})
db.coll.find({name: "Max", age: 32}).explain("executionStats") // or "queryPlanner" or "allPlansExecution"
db.coll.distinct("name")

Count
db.coll.estimatedDocumentCount() // estimate based on collection metadata
db.coll.countDocuments({age: 32}) // alias for an aggregation pipeline - accurate count

Comparison
db.coll.find({"year": {$gt: 1970}})
db.coll.find({"year": {$gte: 1970}})
db.coll.find({"year": {$lt: 1970}})
db.coll.find({"year": {$lte: 1970}})
db.coll.find({"year": {$ne: 1970}})
db.coll.find({"year": {$in: [1958, 1959]}})
db.coll.find({"year": {$nin: [1958, 1959]}})

Logical
db.coll.find({name:{$not: {$eq: "Max"}}})
db.coll.find({$or: [{"year" : 1958}, {"year" : 1959}]})
db.coll.find({$nor: [{price: 1.99}, {sale: true}]})
db.coll.find({
$and: [
{$or: [{qty: {$lt :10}}, {qty :{$gt: 50}}]},
{$or: [{sale: true}, {price: {$lt: 5 }}]}
]
})

Element
db.coll.find({name: {$exists: true}})
db.coll.find({"zipCode": {$type: 2}}) // legacy numeric alias for "string"
db.coll.find({"zipCode": {$type: "string"}})

Aggregation Pipeline
db.coll.aggregate([
{$match: {status: "A"}},
{$group: {_id: "$cust_id", total: {$sum: "$amount"}}},
{$sort: {total: -1}}
])

Text search with a "text" index
db.coll.find({$text: {$search: "cake"}}, {score: {$meta: "textScore"}}).sort({score: {$meta: "textScore"}})

Regex
db.coll.find({name: /^Max/})   // regex: starts with "Max"
db.coll.find({name: /^Max$/i}) // regex, case-insensitive

Array
db.coll.find({tags: {$all: ["Realm", "Charts"]}})
db.coll.find({field: {$size: 2}}) // cannot use an index - prefer storing the array size and updating it
db.coll.find({results: {$elemMatch: {product: "xyz", score: {$gte: 8}}}})

Projections
db.coll.find({"x": 1}, {"actors": 1})               // actors + _id
db.coll.find({"x": 1}, {"actors": 1, "_id": 0})     // actors only
db.coll.find({"x": 1}, {"actors": 0, "summary": 0}) // everything but "actors" and "summary"

Sort, skip, limit
db.coll.find({}).sort({"year": 1, "rating": -1}).skip(10).limit(3)

Read Concern
db.coll.find().readConcern("majority")

Databases and Collections {.cols-2}
Drop
db.coll.drop()     // removes the collection and its index definitions
db.dropDatabase()  // double-check that you are *NOT* on the PROD cluster... :-)

Create Collection
db.createCollection("contacts", {
  validator: {$jsonSchema: {
    bsonType: "object",
    required: ["phone"],
    properties: {
      phone: {
        bsonType: "string",
        description: "must be a string and is required"
      },
      email: {
        bsonType: "string",
        pattern: "@mongodb\\.com$", // double backslash so the regex sees a literal dot
        description: "must be a string and match the regular expression pattern"
      },
      status: {
        enum: ["Unknown", "Incomplete"],
        description: "can only be one of the enum values"
      }
    }
  }}
})

Other Collection Functions
db.coll.stats()
db.coll.storageSize()
db.coll.totalIndexSize()
db.coll.totalSize()
db.coll.validate({full: true})
db.coll.renameCollection("new_coll", true) // 2nd parameter drops the target collection if it exists

Indexes {.cols-2}
Basics

List
db.coll.getIndexes()
db.coll.getIndexKeys()

Drop Indexes
db.coll.dropIndex("name_1")

Hide/Unhide Indexes
db.coll.hideIndex("name_1")
db.coll.unhideIndex("name_1")

Create Indexes
// Index Types
db.coll.createIndex({"name": 1}) // single field index
db.coll.createIndex({"name": 1, "date": 1}) // compound index
db.coll.createIndex({foo: "text", bar: "text"}) // text index
db.coll.createIndex({"$**": "text"}) // wildcard text index
db.coll.createIndex({"userMetadata.$**": 1}) // wildcard index
db.coll.createIndex({"loc": "2d"}) // 2d index
db.coll.createIndex({"loc": "2dsphere"}) // 2dsphere index
db.coll.createIndex({"_id": "hashed"}) // hashed index
// Index Options
db.coll.createIndex({"lastModifiedDate": 1}, {expireAfterSeconds: 3600}) // TTL index
db.coll.createIndex({"name": 1}, {unique: true})
db.coll.createIndex({"name": 1}, {partialFilterExpression: {age: {$gt: 18}}}) // partial index
db.coll.createIndex({"name": 1}, {collation: {locale: 'en', strength: 1}}) // case insensitive index with strength = 1 or 2
db.coll.createIndex({"name": 1 }, {sparse: true})