MongoDB Schema Design

Design efficient and scalable document structures

📐 What is Schema Design?

MongoDB schema design means structuring your documents and collections around how the application reads and writes them, so that queries stay fast as the data grows. Unlike relational databases, MongoDB does not force related data into separate tables: you can embed it inside a document or reference it from another collection, depending on the use case.


// Embedded document example
{
  name: "John Doe",
  address: { street: "123 Main St", city: "NYC" }
}
                                    

Result:

Single document with nested address data

Key Design Patterns

📦 Embedding

Store related data together

{ user: { name: "Alice", email: "..." } }

🔗 Referencing

Link documents with IDs

{ userId: ObjectId("...") }

🎯 Denormalization

Duplicate data for speed

{ authorName: "John", authorId: "..." }

📊 Bucketing

Group time-series data

{ hour: ISODate("2024-01-01T10:00:00Z"), readings: [...] }

🔹 Embedding vs Referencing

Choose between embedding related data within a document and referencing it from a separate collection. Embedding lets a single query return everything, so reads are faster. Referencing keeps each fact in one place, which makes updates consistent and documents smaller, at the cost of a second query or a $lookup.

// EMBEDDING - One-to-Few relationship
// Good for: Data accessed together, infrequent updates
{
  _id: 1,
  name: "John Doe",
  addresses: [
    { street: "123 Main St", city: "NYC", type: "home" },
    { street: "456 Work Ave", city: "NYC", type: "work" }
  ]
}

// REFERENCING - One-to-Many relationship
// Good for: Large related data, frequent updates
// Users collection
{
  _id: 1,
  name: "John Doe"
}

// Orders collection
{
  _id: 101,
  userId: 1,
  items: [...],
  total: 299.99
}

// Query with reference
const user = db.users.findOne({ _id: 1 });
const orders = db.orders.find({ userId: user._id });

When to Use:

Embed: 1-to-few, data read together | Reference: 1-to-many, independent updates
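
To make the cost of referencing concrete, here is a minimal sketch of the two-step lookup from the query above, with plain arrays standing in for the users and orders collections. The helper name loadUserWithOrders and the sample orders are assumptions for illustration, not part of any MongoDB API.

```javascript
// In-memory stand-ins for the users and orders collections (illustrative data).
const users = [{ _id: 1, name: "John Doe" }];
const orders = [
  { _id: 101, userId: 1, total: 299.99 },
  { _id: 102, userId: 1, total: 49.5 },
  { _id: 103, userId: 2, total: 10 },
];

// Hypothetical helper: resolve a user, then the orders that reference it.
// Mirrors findOne() on users followed by find({ userId }) on orders.
function loadUserWithOrders(userId) {
  const user = users.find((u) => u._id === userId);
  if (!user) return null;
  const userOrders = orders.filter((o) => o.userId === user._id);
  return { ...user, orders: userOrders };
}

const result = loadUserWithOrders(1);
console.log(result.orders.map((o) => o._id)); // [ 101, 102 ]
```

With embedding, the second step disappears because the related data already lives inside the user document.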

🔹 One-to-One Relationships

Model one-to-one relationships by embedding the related document directly. This approach provides the best performance for data that's always accessed together and has a clear ownership relationship.

// Embed one-to-one data
// User with profile (always accessed together)
{
  _id: ObjectId("..."),
  username: "johndoe",
  email: "john@example.com",
  profile: {
    firstName: "John",
    lastName: "Doe",
    bio: "Software developer",
    avatar: "avatar.jpg",
    birthDate: ISODate("1990-01-15")
  },
  settings: {
    theme: "dark",
    notifications: true,
    language: "en"
  }
}

// Alternative: Reference for large or rarely accessed data
// (IDs such as "user123" are readable placeholders; a real ObjectId is a 24-character hex string)
// User document
{
  _id: ObjectId("user123"),
  username: "johndoe",
  profileId: ObjectId("profile456")
}

// Profile document (separate if very large)
{
  _id: ObjectId("profile456"),
  userId: ObjectId("user123"),
  detailedBio: "Very long biography...",
  portfolio: [/* large array */]
}

Best Practice:

Embed for small, frequently accessed data. Reference for large or rarely used data.

🔹 One-to-Many Relationships

Handle one-to-many relationships based on the "many" side's size. Embed for small arrays, reference for large collections. Consider the access patterns and update frequency when choosing.

// ONE-TO-FEW: Embed (< 100 items)
// Blog post with comments
{
  _id: ObjectId("post123"),
  title: "MongoDB Schema Design",
  content: "...",
  comments: [
    { user: "Alice", text: "Great post!", date: ISODate() },
    { user: "Bob", text: "Very helpful", date: ISODate() }
  ]
}

// ONE-TO-MANY: Reference (100-1000s items)
// Author with books
// Authors collection
{
  _id: ObjectId("author123"),
  name: "Jane Smith",
  bio: "Bestselling author"
}

// Books collection
{
  _id: ObjectId("book456"),
  title: "MongoDB Mastery",
  authorId: ObjectId("author123"),
  isbn: "978-1234567890"
}

// ONE-TO-SQUILLIONS: Reference with parent ref
// Product with reviews (millions)
// Products collection
{
  _id: ObjectId("prod123"),
  name: "Laptop",
  reviewCount: 15420
}

// Reviews collection (parent reference)
{
  _id: ObjectId("rev789"),
  productId: ObjectId("prod123"),
  rating: 5,
  text: "Excellent!"
}

Guidelines:

Few (< 100): Embed | Many (100-1000s): Reference | Millions: Reference with parent ID
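
The thresholds above can be written down as a small helper. This is only a sketch of the rule of thumb: the function name suggestModel is hypothetical, and the cut-off of one million between "Reference" and "Reference with parent ID" is an assumption, since the guideline does not give an exact boundary. Access patterns and update frequency should still drive the final choice.

```javascript
// Hypothetical helper encoding the rule-of-thumb thresholds.
// The 1,000,000 boundary is an assumed cut-off for "millions".
function suggestModel(expectedChildren) {
  if (expectedChildren < 100) return "embed";
  if (expectedChildren < 1000000) return "reference";
  return "reference with parent ID";
}

console.log(suggestModel(12));      // embed
console.log(suggestModel(2500));    // reference
console.log(suggestModel(5000000)); // reference with parent ID
```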

🔹 Many-to-Many Relationships

Model many-to-many relationships using arrays of references or a separate junction collection. Choose based on which side you query more frequently and the relationship's complexity.

// APPROACH 1: Array of References (Simple)
// Students collection
{
  _id: ObjectId("student123"),
  name: "Alice",
  courseIds: [
    ObjectId("course1"),
    ObjectId("course2"),
    ObjectId("course3")
  ]
}

// Courses collection
{
  _id: ObjectId("course1"),
  name: "MongoDB Basics",
  studentIds: [
    ObjectId("student123"),
    ObjectId("student456")
  ]
}

// APPROACH 2: Junction Collection (Complex)
// Students collection
{
  _id: ObjectId("student123"),
  name: "Alice"
}

// Courses collection
{
  _id: ObjectId("course1"),
  name: "MongoDB Basics"
}

// Enrollments collection (with metadata)
{
  _id: ObjectId("enroll789"),
  studentId: ObjectId("student123"),
  courseId: ObjectId("course1"),
  enrollDate: ISODate("2024-01-15"),
  grade: "A",
  status: "completed"
}

// Query: Find all courses for a student
db.enrollments.find({ studentId: ObjectId("student123") })

When to Use:

Simple: Array of IDs | Complex: Junction collection with metadata
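
The practical difference between the two approaches shows up on writes. The sketch below uses plain objects and arrays in place of collections, and string IDs in place of ObjectIds. All three function names are hypothetical. With two-way arrays, an enrollment touches two documents; with a junction collection, it adds one document that can also carry metadata.

```javascript
// Approach 1: two-way arrays of IDs. Both documents must change together.
function enrollWithArrays(student, course) {
  if (!student.courseIds.includes(course._id)) student.courseIds.push(course._id);
  if (!course.studentIds.includes(student._id)) course.studentIds.push(student._id);
}

// Approach 2: junction collection. One new document carries the link
// plus any metadata about it.
function enrollWithJunction(enrollments, student, course, enrollDate) {
  const exists = enrollments.some(
    (e) => e.studentId === student._id && e.courseId === course._id
  );
  if (!exists) {
    enrollments.push({ studentId: student._id, courseId: course._id, enrollDate });
  }
}

// In-memory equivalent of db.enrollments.find({ studentId })
function coursesForStudent(enrollments, studentId) {
  return enrollments.filter((e) => e.studentId === studentId).map((e) => e.courseId);
}

const alice = { _id: "student123", name: "Alice", courseIds: [] };
const basics = { _id: "course1", name: "MongoDB Basics", studentIds: [] };
const enrollments = [];

enrollWithArrays(alice, basics);
enrollWithJunction(enrollments, alice, basics, new Date("2024-01-15"));
enrollWithJunction(enrollments, alice, basics, new Date("2024-01-15")); // duplicate is ignored

console.log(alice.courseIds, basics.studentIds); // [ 'course1' ] [ 'student123' ]
console.log(coursesForStudent(enrollments, "student123")); // [ 'course1' ]
```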

🔹 Denormalization Pattern

Duplicate frequently accessed data to avoid joins and improve read performance. Accept some data redundancy in exchange for faster queries and simpler application code.

// WITHOUT Denormalization (Multiple queries)
// Users collection
{ _id: 1, name: "John Doe", email: "john@example.com" }

// Posts collection
{ _id: 101, userId: 1, title: "My Post", content: "..." }

// Displaying a post with its author name takes 2 queries (or a $lookup)

// WITH Denormalization (Single query)
// Posts collection with embedded author info
{
  _id: 101,
  title: "My Post",
  content: "...",
  author: {
    id: 1,
    name: "John Doe",  // Duplicated from users
    avatar: "avatar.jpg"  // Duplicated from users
  },
  createdAt: ISODate()
}

// Single query gets everything
db.posts.find({ _id: 101 })

// Trade-off: a name change must be propagated to every post that copies it
db.posts.updateMany(
  { "author.id": 1 },
  { $set: { "author.name": "John Smith" } }
)

Best For:

Read-heavy applications where data rarely changes
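
Because denormalized copies can drift from their source, it can help to check for stale copies from time to time. The following is a minimal in-memory sketch, assuming the users and posts shapes shown above. The function name findStaleAuthorCopies is hypothetical.

```javascript
// Hypothetical consistency check: return the _id of every post whose
// copied author.name no longer matches the name in the users collection.
function findStaleAuthorCopies(users, posts) {
  const nameById = new Map(users.map((u) => [u._id, u.name]));
  return posts
    .filter((p) => nameById.has(p.author.id) && nameById.get(p.author.id) !== p.author.name)
    .map((p) => p._id);
}

// The user was renamed, but only one post was updated.
const sampleUsers = [{ _id: 1, name: "John Smith" }];
const samplePosts = [
  { _id: 101, title: "My Post", author: { id: 1, name: "John Doe" } },
  { _id: 102, title: "Another Post", author: { id: 1, name: "John Smith" } },
];

console.log(findStaleAuthorCopies(sampleUsers, samplePosts)); // [ 101 ]
```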

🔹 Bucketing Pattern

Group time-series or sequential data into buckets to reduce document count and improve query performance. Ideal for IoT sensors, logs, and analytics data with high write volumes.

// WITHOUT Bucketing (One doc per reading)
// 1 million documents for 1 million readings
{ sensorId: "A1", temp: 22.5, time: ISODate("2024-01-01T10:00:00") }
{ sensorId: "A1", temp: 22.7, time: ISODate("2024-01-01T10:01:00") }
// ... millions more

// WITH Bucketing (Group by hour)
// At one reading per minute: 24 documents per day instead of 1440
{
  _id: ObjectId("..."),
  sensorId: "A1",
  date: ISODate("2024-01-01"),
  hour: 10,
  readings: [
    { minute: 0, temp: 22.5 },
    { minute: 1, temp: 22.7 },
    { minute: 2, temp: 22.6 },
    // ... 60 readings per hour
  ],
  avgTemp: 22.6,
  minTemp: 22.4,
  maxTemp: 22.9,
  count: 60
}

// Query hourly data efficiently
db.sensors.find({
  sensorId: "A1",
  date: ISODate("2024-01-01"),
  hour: { $gte: 10, $lte: 12 }
})

// Benefits:
// - Fewer documents (smaller indexes)
// - Pre-calculated aggregates
// - Efficient range queries

Use Cases:

IoT sensors, server logs, financial ticks, analytics events
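
The grouping step itself can be sketched in plain JavaScript. The function below takes flat readings like the ones in the "without bucketing" example and produces one bucket per sensor and hour, with the aggregates precomputed. The name bucketByHour is hypothetical, timestamps are treated as UTC, and the date is kept as a "YYYY-MM-DD" string rather than an ISODate to keep the sketch free of driver dependencies.

```javascript
// Group flat readings into hourly buckets with precomputed aggregates.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const { sensorId, temp, time } of readings) {
    const t = new Date(time);
    const date = t.toISOString().slice(0, 10); // UTC calendar day
    const hour = t.getUTCHours();
    const key = `${sensorId}|${date}|${hour}`;
    let bucket = buckets.get(key);
    if (!bucket) {
      bucket = { sensorId, date, hour, readings: [], minTemp: temp, maxTemp: temp, sum: 0, count: 0 };
      buckets.set(key, bucket);
    }
    bucket.readings.push({ minute: t.getUTCMinutes(), temp });
    bucket.minTemp = Math.min(bucket.minTemp, temp);
    bucket.maxTemp = Math.max(bucket.maxTemp, temp);
    bucket.sum += temp;
    bucket.count += 1;
  }
  // Replace the running sum with the average before returning.
  return [...buckets.values()].map(({ sum, ...rest }) => ({ ...rest, avgTemp: sum / rest.count }));
}

const hourly = bucketByHour([
  { sensorId: "A1", temp: 22.5, time: "2024-01-01T10:00:00Z" },
  { sensorId: "A1", temp: 22.7, time: "2024-01-01T10:01:00Z" },
  { sensorId: "A1", temp: 23.0, time: "2024-01-01T11:00:00Z" },
]);

console.log(hourly.map((b) => [b.hour, b.count])); // [ [ 10, 2 ], [ 11, 1 ] ]
```

In a live system the same bucket would normally be built incrementally as readings arrive, rather than by regrouping existing documents.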

🔹 Schema Design Best Practices

Follow these proven guidelines to create efficient, scalable MongoDB schemas. Consider your application's access patterns, query frequency, and data relationships when making design decisions.

✅ Design Principles:

  • Design for your queries: Structure data how you'll access it
  • Embed for atomicity: Keep related data together for atomic updates
  • Reference for flexibility: Separate data that grows unbounded
  • Denormalize for reads: Duplicate data to avoid joins
  • Consider document size: Keep under 16MB limit
  • Use arrays wisely: Limit array growth (< 1000 items)

📏 Document Size Guidelines:

  • Maximum: 16MB per document
  • Recommended: < 1MB for best performance
  • Arrays: < 1000 elements (use references for more)
  • Nesting: at most 100 levels (a hard MongoDB limit); stay far shallower in practice

// Good schema example
{
  _id: ObjectId("..."),
  // Frequently accessed together - embedded
  user: {
    name: "Alice",
    email: "alice@example.com"
  },
  // Small array - embedded
  tags: ["mongodb", "database", "nosql"],
  // Reference to large collection
  categoryId: ObjectId("..."),
  // Denormalized for display
  categoryName: "Databases",
  // Metadata
  createdAt: ISODate(),
  updatedAt: ISODate()
}
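
The size guidelines can be turned into a rough pre-insert check. This is a sketch under stated assumptions: checkDocument and its helpers are hypothetical names, and the size is estimated from the JSON form of the document, which is not the same as the BSON size MongoDB actually enforces.

```javascript
// Approximate size in bytes of the JSON form (NOT the real BSON size).
function approxBytes(doc) {
  return new TextEncoder().encode(JSON.stringify(doc)).length;
}

// Nesting depth, counting the top-level document as level 1.
function maxDepth(value) {
  if (value === null || typeof value !== "object") return 0;
  let deepest = 0;
  for (const child of Object.values(value)) deepest = Math.max(deepest, maxDepth(child));
  return 1 + deepest;
}

// Length of the longest array found anywhere in the document.
function longestArray(value) {
  if (value === null || typeof value !== "object") return 0;
  let longest = Array.isArray(value) ? value.length : 0;
  for (const child of Object.values(value)) longest = Math.max(longest, longestArray(child));
  return longest;
}

// Hypothetical check against the guidelines: ~1 MB, 1000 elements, 100 levels.
function checkDocument(doc) {
  const warnings = [];
  if (approxBytes(doc) > 1024 * 1024) warnings.push("larger than about 1 MB");
  if (longestArray(doc) >= 1000) warnings.push("array with 1000 or more elements");
  if (maxDepth(doc) > 100) warnings.push("nested more than 100 levels deep");
  return warnings;
}

const smallDoc = { user: { name: "Alice" }, tags: ["mongodb", "database", "nosql"] };
const bigArrayDoc = { tags: Array.from({ length: 1500 }, (_, i) => `tag${i}`) };

console.log(checkDocument(smallDoc));    // []
console.log(checkDocument(bigArrayDoc)); // [ 'array with 1000 or more elements' ]
```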
