Files
claude-skills-reference/engineering/database-designer/references/database_selection_decision_tree.md

16 KiB

Database Selection Decision Tree

Overview

Choosing the right database technology is crucial for application success. This guide provides a systematic approach to database selection based on specific requirements, data patterns, and operational constraints.

Decision Framework

Primary Questions

  1. What is your primary use case?

    • OLTP (Online Transaction Processing)
    • OLAP (Online Analytical Processing)
    • Real-time analytics
    • Content management
    • Search and discovery
    • Time-series data
    • Graph relationships
  2. What are your consistency requirements?

    • Strong consistency (ACID)
    • Eventual consistency
    • Causal consistency
    • Session consistency
  3. What are your scalability needs?

    • Vertical scaling sufficient
    • Horizontal scaling required
    • Global distribution needed
    • Multi-region requirements
  4. What is your data structure?

    • Structured (relational)
    • Semi-structured (JSON/XML)
    • Unstructured (documents, media)
    • Graph relationships
    • Time-series data
    • Key-value pairs

Decision Tree

START: What is your primary use case?
│
├── OLTP (Transactional Applications)
│   │
│   ├── Do you need strong ACID guarantees?
│   │   ├── YES → Do you need horizontal scaling?
│   │   │   ├── YES → Distributed SQL
│   │   │   │   ├── CockroachDB (Global, multi-region)
│   │   │   │   ├── TiDB (MySQL compatibility)
│   │   │   │   └── Spanner (Google Cloud)
│   │   │   └── NO → Traditional SQL
│   │   │       ├── PostgreSQL (Feature-rich, extensions)
│   │   │       ├── MySQL (Performance, ecosystem)
│   │   │       └── SQL Server (Microsoft stack)
│   │   └── NO → Are you primarily key-value access?
│   │       ├── YES → Key-Value Stores
│   │       │   ├── Redis (In-memory, caching)
│   │       │   ├── DynamoDB (AWS managed)
│   │       │   └── Cassandra (High availability)
│   │       └── NO → Document Stores
│   │           ├── MongoDB (General purpose)
│   │           ├── CouchDB (Sync, replication)
│   │           └── Amazon DocumentDB (MongoDB compatible)
│   │
├── OLAP (Analytics and Reporting)
│   │
│   ├── What is your data volume?
│   │   ├── Small to Medium (< 1TB) → Traditional SQL with optimization
│   │   │   ├── PostgreSQL with columnar extensions
│   │   │   ├── MySQL with analytics engine
│   │   │   └── SQL Server with columnstore
│   │   ├── Large (1TB - 100TB) → Data Warehouse Solutions
│   │   │   ├── Snowflake (Cloud-native)
│   │   │   ├── BigQuery (Google Cloud)
│   │   │   ├── Redshift (AWS)
│   │   │   └── Synapse (Azure)
│   │   └── Very Large (> 100TB) → Big Data Platforms
│   │       ├── Databricks (Unified analytics)
│   │       ├── Apache Spark on cloud
│   │       └── Hadoop ecosystem
│   │
├── Real-time Analytics
│   │
│   ├── Do you need sub-second query responses?
│   │   ├── YES → Stream Processing + OLAP
│   │   │   ├── ClickHouse (Fast analytics)
│   │   │   ├── Apache Druid (Real-time OLAP)
│   │   │   ├── Pinot (LinkedIn's real-time DB)
│   │   │   └── TimescaleDB (Time-series)
│   │   └── NO → Traditional OLAP solutions
│   │
├── Search and Discovery
│   │
│   ├── What type of search?
│   │   ├── Full-text search → Search Engines
│   │   │   ├── Elasticsearch (Full-featured)
│   │   │   ├── OpenSearch (AWS fork of ES)
│   │   │   └── Solr (Apache Lucene-based)
│   │   ├── Vector/similarity search → Vector Databases
│   │   │   ├── Pinecone (Managed vector DB)
│   │   │   ├── Weaviate (Open source)
│   │   │   ├── Chroma (Embeddings)
│   │   │   └── PostgreSQL with pgvector
│   │   └── Faceted search → Search + SQL combination
│   │
├── Graph Relationships
│   │
│   ├── Do you need complex graph traversals?
│   │   ├── YES → Graph Databases
│   │   │   ├── Neo4j (Property graph)
│   │   │   ├── Amazon Neptune (Multi-model)
│   │   │   ├── ArangoDB (Multi-model)
│   │   │   └── TigerGraph (Analytics focused)
│   │   └── NO → SQL with recursive queries
│   │       └── PostgreSQL with recursive CTEs
│   │
└── Time-series Data
    │
    ├── What is your write volume?
        ├── High (millions/sec) → Specialized Time-series
        │   ├── InfluxDB (Purpose-built)
        │   ├── TimescaleDB (PostgreSQL extension)
        │   ├── Apache Druid (Analytics focused)
        │   └── Prometheus (Monitoring)
        └── Medium → SQL with time-series optimization
            └── PostgreSQL with partitioning

Database Categories Deep Dive

Traditional SQL Databases

PostgreSQL

  • Best For: Complex queries, JSON data, extensions, geospatial
  • Strengths: Feature-rich, reliable, strong consistency, extensible
  • Use Cases: OLTP, mixed workloads, JSON documents, geospatial applications
  • Scaling: Vertical scaling, read replicas, partitioning
  • When to Choose: Need SQL features, complex queries, moderate scale

MySQL

  • Best For: Web applications, read-heavy workloads, simple schema
  • Strengths: Performance, replication, large ecosystem
  • Use Cases: Web apps, content management, e-commerce
  • Scaling: Read replicas, sharding, clustering (MySQL Cluster)
  • When to Choose: Simple schema, performance priority, large community

SQL Server

  • Best For: Microsoft ecosystem, enterprise features, business intelligence
  • Strengths: Integration, tooling, enterprise features
  • Use Cases: Enterprise applications, .NET applications, BI
  • Scaling: Always On availability groups, partitioning
  • When to Choose: Microsoft stack, enterprise requirements

Distributed SQL (NewSQL)

CockroachDB

  • Best For: Global applications, strong consistency, horizontal scaling
  • Strengths: ACID guarantees, automatic scaling, survival
  • Use Cases: Multi-region apps, financial services, global SaaS
  • Trade-offs: Complex setup, higher latency for global transactions
  • When to Choose: Need SQL + global scale + consistency

TiDB

  • Best For: MySQL compatibility with horizontal scaling
  • Strengths: MySQL protocol, HTAP (hybrid), cloud-native
  • Use Cases: MySQL migrations, hybrid workloads
  • When to Choose: Existing MySQL expertise, need scale

NoSQL Document Stores

MongoDB

  • Best For: Flexible schema, rapid development, document-centric data
  • Strengths: Developer experience, flexible schema, rich queries
  • Use Cases: Content management, catalogs, user profiles, IoT
  • Scaling: Automatic sharding, replica sets
  • When to Choose: Schema evolution, document structure, rapid development

CouchDB

  • Best For: Offline-first applications, multi-master replication
  • Strengths: HTTP API, replication, conflict resolution
  • Use Cases: Mobile apps, distributed systems, offline scenarios
  • When to Choose: Need offline capabilities, bi-directional sync

Key-Value Stores

Redis

  • Best For: Caching, sessions, real-time applications, pub/sub
  • Strengths: Performance, data structures, persistence options
  • Use Cases: Caching, leaderboards, real-time analytics, queues
  • Scaling: Clustering, sentinel for HA
  • When to Choose: High performance, simple data model, caching

DynamoDB

  • Best For: Serverless applications, predictable performance, AWS ecosystem
  • Strengths: Managed, auto-scaling, consistent performance
  • Use Cases: Web applications, gaming, IoT, mobile backends
  • Trade-offs: Vendor lock-in, limited querying
  • When to Choose: AWS ecosystem, serverless, managed solution

Column-Family Stores

Cassandra

  • Best For: Write-heavy workloads, high availability, linear scalability
  • Strengths: No single point of failure, tunable consistency
  • Use Cases: Time-series, IoT, messaging, activity feeds
  • Trade-offs: Complex operations, eventual consistency
  • When to Choose: High write volume, availability over consistency

HBase

  • Best For: Big data applications, Hadoop ecosystem
  • Strengths: Hadoop integration, consistent reads
  • Use Cases: Analytics on big data, time-series at scale
  • When to Choose: Hadoop ecosystem, very large datasets

Graph Databases

Neo4j

  • Best For: Complex relationships, graph algorithms, traversals
  • Strengths: Mature ecosystem, Cypher query language, algorithms
  • Use Cases: Social networks, recommendation engines, fraud detection
  • Trade-offs: Specialized use case, learning curve
  • When to Choose: Relationship-heavy data, graph algorithms

Time-Series Databases

InfluxDB

  • Best For: Time-series data, IoT, monitoring, analytics
  • Strengths: Purpose-built, efficient storage, query language
  • Use Cases: IoT sensors, monitoring, DevOps metrics
  • When to Choose: High-volume time-series data

TimescaleDB

  • Best For: Time-series with SQL familiarity
  • Strengths: PostgreSQL compatibility, SQL queries, ecosystem
  • Use Cases: Financial data, IoT with complex queries
  • When to Choose: Time-series + SQL requirements

Search Engines

Elasticsearch

  • Best For: Full-text search, log analysis, real-time search
  • Strengths: Powerful search, analytics, ecosystem (ELK stack)
  • Use Cases: Search applications, log analysis, monitoring
  • Trade-offs: Complex operations, resource intensive
  • When to Choose: Advanced search requirements, analytics

Data Warehouses

Snowflake

  • Best For: Cloud-native analytics, data sharing, varied workloads
  • Strengths: Separation of compute/storage, automatic scaling
  • Use Cases: Data warehousing, analytics, data science
  • When to Choose: Cloud-native, analytics-focused, multi-cloud

BigQuery

  • Best For: Serverless analytics, Google ecosystem, machine learning
  • Strengths: Serverless, petabyte scale, ML integration
  • Use Cases: Analytics, data science, reporting
  • When to Choose: Google Cloud, serverless analytics

Selection Criteria Matrix

Criterion SQL NewSQL Document Key-Value Column-Family Graph Time-Series
ACID Guarantees Strong Strong ⚠️ Eventual ⚠️ Eventual ⚠️ Tunable ⚠️ Varies ⚠️ Varies
Horizontal Scaling Limited Native Native Native Native ⚠️ Limited Native
Query Flexibility High High ⚠️ Moderate Low Low High ⚠️ Specialized
Schema Flexibility Rigid Rigid High High ⚠️ Moderate High ⚠️ Structured
Performance (Reads) ⚠️ Good ⚠️ Good Excellent Excellent Excellent ⚠️ Good Excellent
Performance (Writes) ⚠️ Good ⚠️ Good Excellent Excellent Excellent ⚠️ Good Excellent
Operational Complexity Low High ⚠️ Moderate Low High ⚠️ Moderate ⚠️ Moderate
Ecosystem Maturity Mature ⚠️ Growing Mature Mature Mature Mature ⚠️ Growing

Decision Checklist

Requirements Analysis

  • Data Volume: Current and projected data size
  • Transaction Volume: Reads per second, writes per second
  • Consistency Requirements: Strong vs eventual consistency needs
  • Query Patterns: Simple lookups vs complex analytics
  • Schema Evolution: How often does schema change?
  • Geographic Distribution: Single region vs global
  • Availability Requirements: Acceptable downtime
  • Team Expertise: Existing knowledge and learning curve
  • Budget Constraints: Licensing, infrastructure, operational costs
  • Compliance Requirements: Data residency, audit trails

Technical Evaluation

  • Performance Testing: Benchmark with realistic data and queries
  • Scalability Testing: Test scaling limits and patterns
  • Failure Scenarios: Test backup, recovery, and failure handling
  • Integration Testing: APIs, connectors, ecosystem tools
  • Migration Path: How to migrate from current system
  • Monitoring and Observability: Available tooling and metrics

Operational Considerations

  • Management Complexity: Setup, configuration, maintenance
  • Backup and Recovery: Built-in vs external tools
  • Security Features: Authentication, authorization, encryption
  • Upgrade Path: Version compatibility and upgrade process
  • Support Options: Community vs commercial support
  • Lock-in Risk: Portability and vendor independence

Common Decision Patterns

E-commerce Platform

Typical Choice: PostgreSQL or MySQL

  • Primary Data: Product catalog, orders, users (structured)
  • Query Patterns: OLTP with some analytics
  • Consistency: Strong consistency for financial data
  • Scale: Moderate with read replicas
  • Additional: Redis for caching, Elasticsearch for product search

IoT/Sensor Data Platform

Typical Choice: TimescaleDB or InfluxDB

  • Primary Data: Time-series sensor readings
  • Query Patterns: Time-based aggregations, trend analysis
  • Scale: High write volume, moderate read volume
  • Additional: Kafka for ingestion, PostgreSQL for metadata

Social Media Application

Typical Choice: Combination approach

  • User Profiles: MongoDB (flexible schema)
  • Relationships: Neo4j (graph relationships)
  • Activity Feeds: Cassandra (high write volume)
  • Search: Elasticsearch (content discovery)
  • Caching: Redis (sessions, real-time data)

Analytics Platform

Typical Choice: Snowflake or BigQuery

  • Primary Use: Complex analytical queries
  • Data Volume: Large (TB to PB scale)
  • Query Patterns: Ad-hoc analytics, reporting
  • Users: Data analysts, data scientists
  • Additional: Data lake (S3/GCS) for raw data storage

Global SaaS Application

Typical Choice: CockroachDB or DynamoDB

  • Requirements: Multi-region, strong consistency
  • Scale: Global user base
  • Compliance: Data residency requirements
  • Availability: High availability across regions

Migration Strategies

From Monolithic to Distributed

  1. Assessment: Identify scaling bottlenecks
  2. Data Partitioning: Plan how to split data
  3. Gradual Migration: Move non-critical data first
  4. Dual Writes: Run both systems temporarily
  5. Validation: Verify data consistency
  6. Cutover: Switch reads and writes gradually

Technology Stack Evolution

  1. Start Simple: Begin with PostgreSQL or MySQL
  2. Identify Bottlenecks: Monitor performance and scaling issues
  3. Selective Scaling: Move specific workloads to specialized databases
  4. Polyglot Persistence: Use multiple databases for different use cases
  5. Service Boundaries: Align database choice with service boundaries

Conclusion

Database selection should be driven by:

  1. Specific Use Case Requirements: Not all applications need the same database
  2. Data Characteristics: Structure, volume, and access patterns matter
  3. Non-functional Requirements: Consistency, availability, performance targets
  4. Team and Organizational Factors: Expertise, operational capacity, budget
  5. Evolution Path: How requirements and scale will change over time

The best database choice is often not a single technology, but a combination of databases that each excel at their specific use case within your application architecture.