Managed databases in the cloud
Running databases is hard. Backups, replication, failover, patching, performance tuning, and capacity planning consume engineering time that could go toward building features. Cloud providers offer managed database services that handle these operational tasks. This article covers what is available, when to use it, and what you give up. For a deeper look at database internals and data modeling, see databases overview.
Managed relational databases
Relational databases remain the default choice for most applications. Managed services remove the undifferentiated heavy lifting while keeping the SQL interface you already know.
Provider implementations
| Feature | AWS RDS | GCP Cloud SQL | Azure SQL Database |
|---|---|---|---|
| Engines | MySQL, PostgreSQL, MariaDB, Oracle, SQL Server | MySQL, PostgreSQL, SQL Server | SQL Server (native), MySQL, PostgreSQL (Flexible Server) |
| Max storage | 64 TB | 64 TB | 100 TB (Hyperscale) |
| Automated backups | Yes, up to 35 days | Yes, up to 365 days | Yes, up to 35 days |
| Multi-AZ | Yes | Yes (HA configuration) | Yes (zone redundant) |
| Read replicas | Up to 15 (Aurora), 5 (RDS) | Up to 10 | Yes (Hyperscale) |
Multi-AZ deployments
A multi-AZ deployment maintains a synchronous standby replica in a different availability zone. If the primary fails, the service automatically promotes the standby. Failover typically takes 60-120 seconds.
graph LR
App["Application"] --> Primary["Primary DB<br/>AZ-a"]
Primary -->|"Sync replication"| Standby["Standby DB<br/>AZ-b"]
Primary -->|"Async replication"| Read1["Read Replica<br/>AZ-a"]
Primary -->|"Async replication"| Read2["Read Replica<br/>AZ-c"]
style Primary fill:#3498db,color:#fff
style Standby fill:#e74c3c,color:#fff
style Read1 fill:#2ecc71,color:#fff
style Read2 fill:#2ecc71,color:#fff
The standby handles failover. Read replicas handle read-heavy workloads. They serve different purposes.
Read replicas
Read replicas use asynchronous replication to create read-only copies of your database. They reduce load on the primary for read-heavy workloads. Replication lag is typically under a second but can spike during heavy writes.
Use read replicas for:
- Reporting queries that scan large amounts of data.
- Geographically distributed reads by placing replicas in different regions.
- Offloading analytics workloads from the production primary.
Do not use read replicas for workloads that require strict read-after-write consistency. A write to the primary may not appear on the replica for milliseconds to seconds.
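One common way to work around replication lag is to pin a session's reads to the primary for a short window after it writes. The sketch below is a minimal illustration of that idea, not a feature of any provider's SDK. The `ReadRouter` class, the `pin_seconds` window, and the `"primary"` / `"replica"` labels are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Route reads to a replica unless the session wrote recently."""

    def __init__(
        self,
        pin_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # pin_seconds is an assumed upper bound on replication lag.
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        """Call after every successful write made by this session."""
        self._last_write[session_id] = self.clock()

    def target_for_read(self, session_id: str) -> str:
        """Return "primary" while the session's last write may not have replicated."""
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

The window should be longer than the lag you actually observe. If lag spikes beyond it, the session can still read stale data, so this reduces the problem rather than eliminating it.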
Aurora and Spanner
AWS Aurora and Google Cloud Spanner go beyond traditional managed databases.
Aurora is compatible with MySQL and PostgreSQL but uses a distributed storage layer. AWS claims up to 5x the throughput of standard MySQL. Storage scales automatically up to 128 TB, and a cluster supports up to 15 read replicas. Because the replicas share the storage layer with the primary, replica lag is typically well under 100 ms.
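Spanner is a globally distributed relational database that provides strong consistency across regions. It does not escape the CAP theorem: during a network partition it chooses consistency over availability. In practice, Google's private network and TrueTime (a clock API backed by atomic clocks and GPS receivers in its data centers) keep partitions rare enough that availability stays very high. Spanner is expensive, but few alternatives match it for workloads that need global consistency.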
Managed NoSQL databases
When your data model does not fit neatly into tables with relationships, NoSQL databases offer flexibility.
Key-value and document stores
| Feature | AWS DynamoDB | GCP Firestore | Azure Cosmos DB |
|---|---|---|---|
| Data model | Key-value, document | Document | Multi-model (document, key-value, graph, column) |
| Consistency | Eventually or strongly consistent | Strong (multi-region) | Five consistency levels |
| Scaling | Automatic | Automatic | Automatic |
| Pricing model | On-demand or provisioned capacity | Per read/write/delete | Request units (RU) |
| Global distribution | Global Tables | Multi-region | Turnkey global distribution |
DynamoDB
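DynamoDB is a fully managed key-value and document database. It is designed for single-digit millisecond latency on key-based reads and writes, regardless of table size. You define a partition key and an optional sort key. Every access pattern must be served by these keys or by a secondary index, so design the keys around the queries you need.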
The single-table design pattern stores multiple entity types in one table, using composite keys to model relationships. It is powerful but requires upfront investment in access pattern analysis.
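To make the composite-key idea concrete, here is a small in-memory stand-in for a single table. It is a sketch only: the `SingleTable` class is hypothetical and does not use the DynamoDB API, and the `CUSTOMER#` / `ORDER#` key prefixes are an illustrative naming convention, not something DynamoDB requires.

```python
from collections import defaultdict
from typing import Dict, List


class SingleTable:
    """In-memory model of one table keyed by (partition key, sort key)."""

    def __init__(self) -> None:
        # partition key -> {sort key -> item}
        self._partitions: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, pk: str, sk: str, item: dict) -> None:
        """Insert or overwrite the item stored under (pk, sk)."""
        self._partitions[pk][sk] = item

    def query(self, pk: str, sk_prefix: str = "") -> List[dict]:
        """Return items in one partition whose sort key starts with sk_prefix,
        ordered by sort key."""
        rows = self._partitions.get(pk, {})
        return [rows[sk] for sk in sorted(rows) if sk.startswith(sk_prefix)]


table = SingleTable()
# Two entity types (a profile and orders) share the customer's partition.
table.put("CUSTOMER#42", "PROFILE", {"name": "Ada"})
table.put("CUSTOMER#42", "ORDER#2024-02-01", {"total": 30})
table.put("CUSTOMER#42", "ORDER#2024-01-15", {"total": 12})

# One query fetches all orders for the customer, oldest first.
orders = table.query("CUSTOMER#42", "ORDER#")
```

Putting the date in the sort key is what makes "orders for a customer, by date" a single-partition query. An access pattern that the keys do not anticipate, such as "all orders over a given total", would need a scan or an extra index.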
Firestore
Firestore is a document database optimized for mobile and web applications. It provides real-time synchronization out of the box. Changes to documents are pushed to connected clients instantly. Offline support is built in: the client SDK caches data locally and synchronizes when connectivity returns.
Cosmos DB
Cosmos DB offers five consistency levels: strong, bounded staleness, session, consistent prefix, and eventual. You set a default level for the account and can relax it for individual requests. It supports multiple APIs: NoSQL (formerly the SQL API), MongoDB, Cassandra, Gremlin (graph), and Table. If you are on Azure and uncertain about your long-term data model, Cosmos DB is a flexible starting point.
Managed caching
Caching layers sit between your application and database, serving frequently accessed data from memory.
| Feature | AWS ElastiCache | GCP Memorystore | Azure Cache for Redis |
|---|---|---|---|
| Engines | Redis, Memcached | Redis, Memcached | Redis |
| Cluster mode | Yes | Yes (Redis) | Yes |
| Max memory | 635 GB (Redis) | 300 GB | 1.2 TB |
| Multi-AZ | Yes | Yes | Yes |
When to use managed caching
- Session storage: Store user sessions in Redis for fast access across application instances.
- Database query cache: Cache the results of expensive queries. Invalidate when the underlying data changes.
- Rate limiting: Use Redis counters to implement rate limiting with minimal latency.
- Leaderboards and counters: Redis sorted sets provide O(log N) ranking operations.
The alternative is running your own Redis cluster on VMs. Managed caching handles patching, failover, and backup. For most teams, the operational savings justify the managed service premium.
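The database query cache use case above usually follows the cache-aside pattern: check the cache, fall back to the database on a miss, and invalidate on writes. The sketch below shows the control flow with an in-process dictionary standing in for Redis. The `CacheAside` class, its `loader` callback, and the 60-second default TTL are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside with a TTL; a dict stands in for the managed cache."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader          # runs the expensive query on a miss
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]            # hit: still within its TTL
        value = self._loader(key)      # miss or expired: go to the database
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after writing to the underlying data."""
        self._entries.pop(key, None)
```

The TTL bounds how stale an entry can get if an invalidation is missed. With a real cache the structure is the same, but `get` and `invalidate` become network calls that can fail, so the loader path should also serve as the fallback when the cache is unreachable.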
Managed search
Full-text search is a specialized workload. Managed search services handle the complexities of indexing, relevance scoring, and cluster management.
| Feature | AWS OpenSearch | Elastic Cloud (any provider) | Azure AI Search |
|---|---|---|---|
| Engine | OpenSearch (Elasticsearch fork) | Elasticsearch | Proprietary |
| Managed by | AWS | Elastic | Microsoft |
| Vector search | Yes | Yes | Yes |
| Serverless option | Yes | Yes | No |
Use managed search when your application needs:
- Full-text search with relevance ranking.
- Faceted navigation (filter by category, price range, brand).
- Log analytics and observability.
- Vector similarity search for AI/ML applications.
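For a sense of what the indexing and relevance scoring involve, the following toy inverted index ranks documents by raw term frequency. It is a deliberately simplified sketch: the `TinyIndex` class is hypothetical, and production engines use more refined scoring (BM25 is a common default) plus tokenization, stemming, and sharding that are omitted here.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class TinyIndex:
    """Toy inverted index: term -> how often it appears in each document."""

    def __init__(self) -> None:
        self._postings: Dict[str, Counter] = defaultdict(Counter)

    def add(self, doc_id: str, text: str) -> None:
        """Index a document by splitting on whitespace and lowercasing."""
        for term in text.lower().split():
            self._postings[term][doc_id] += 1

    def search(self, query: str) -> List[str]:
        """Return matching doc ids, highest summed term frequency first.
        Ties are broken by doc id so the order is deterministic."""
        scores: Counter = Counter()
        for term in query.lower().split():
            for doc_id, freq in self._postings.get(term, {}).items():
                scores[doc_id] += freq
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked]


index = TinyIndex()
index.add("a", "managed redis cache")
index.add("b", "managed postgres database with managed backups")
index.add("c", "self hosted postgres")
results = index.search("managed postgres")
```

Document `b` ranks first because it matches both query terms and mentions "managed" twice. Keeping an index like this consistent, sharded, and fast at scale is the work a managed search service takes on.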
Choosing managed vs self-managed
The decision framework is straightforward.
graph TD
Q1{"Is the database a core<br/>competitive advantage?"} -->|Yes| Self["Self-manage"]
Q1 -->|No| Q2{"Does the managed service<br/>support your requirements?"}
Q2 -->|Yes| Q3{"Can you afford the premium?"}
Q3 -->|Yes| Managed["Use managed service"]
Q3 -->|No| Q4{"Do you have the ops expertise?"}
Q4 -->|Yes| Self
Q4 -->|No| Managed
Q2 -->|No| Self
style Managed fill:#2ecc71,color:#fff
style Self fill:#e74c3c,color:#fff
Most teams should default to managed. Self-manage only when you have a strong reason and the expertise to back it up.
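The flowchart above translates directly into a small function, which can be handy as a checklist in a design review. The function name and parameter names are made up for this sketch; the branches follow the diagram.

```python
def choose_deployment(
    core_advantage: bool,
    service_meets_requirements: bool,
    can_afford_premium: bool,
    has_ops_expertise: bool,
) -> str:
    """Walk the managed vs self-managed decision tree and return the outcome."""
    if core_advantage:
        return "self-managed"       # the database itself is the differentiator
    if not service_meets_requirements:
        return "self-managed"       # the managed offering cannot do the job
    if can_afford_premium:
        return "managed"
    # Cannot afford the premium: self-manage only with the expertise to do so.
    return "self-managed" if has_ops_expertise else "managed"
```

Note the last branch: a team that can neither afford the premium nor operate a database itself still lands on managed, because an unmaintained self-hosted database is the riskier outcome.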
Arguments for managed
- Time to market: Deploy a production-ready database in minutes.
- Operational reliability: Automated backups, patching, failover.
- Security: Encryption at rest, network isolation, IAM integration out of the box.
- Scaling: Add read replicas or increase capacity without downtime.
Arguments for self-managed
- Cost at scale: At very high data volumes, self-managed on reserved instances can be 40-60% cheaper.
- Full configuration control: Tune every parameter. Use custom extensions. Run any version.
- Compliance requirements: Some regulations require specific database configurations that managed services do not expose.
- Exotic workloads: Time-series databases, graph databases, or niche engines that providers do not offer as managed services.
Cost comparison
Database costs vary widely based on instance size, storage, I/O, and backup retention.
The managed premium is roughly 30-40% over the cost of equivalent self-managed infrastructure. It covers automated backups, monitoring, failover, and patching. Factor in the engineering time saved before deciding: if your team would otherwise spend 10+ hours per month on database operations, the managed service is likely cheaper in total.
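A quick break-even calculation makes the trade-off explicit. The figures used below (2,000 per month of infrastructure, a 35% premium, an engineering cost of 100 per hour) are hypothetical inputs chosen for illustration, not provider pricing, and the function names are invented for this sketch.

```python
from typing import Tuple


def monthly_totals(
    self_managed_infra: float,
    managed_premium: float,
    ops_hours: float,
    hourly_rate: float,
) -> Tuple[float, float]:
    """Return (managed_total, self_managed_total) for one month.

    managed_premium is a fraction, e.g. 0.35 for a 35% premium.
    """
    managed_total = self_managed_infra * (1 + managed_premium)
    self_managed_total = self_managed_infra + ops_hours * hourly_rate
    return managed_total, self_managed_total


def break_even_hours(
    self_managed_infra: float,
    managed_premium: float,
    hourly_rate: float,
) -> float:
    """Monthly ops hours at which the two options cost the same."""
    return self_managed_infra * managed_premium / hourly_rate


# Hypothetical inputs: 2000/month infrastructure, 35% premium, 100/hour engineer.
managed, self_managed = monthly_totals(2000, 0.35, ops_hours=10, hourly_rate=100)
hours = break_even_hours(2000, 0.35, hourly_rate=100)
```

With these inputs the break-even point is about 7 hours per month, so a team spending 10 hours on operations pays more by self-managing (about 3,000 versus 2,700). The break-even point scales with the infrastructure bill, which is why the calculation shifts toward self-managing at very large deployments.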
High availability patterns
Multi-AZ with automatic failover
Deploy the primary in one AZ and a standby in another. The managed service monitors the primary and promotes the standby if it fails. Your application connects through a DNS endpoint that automatically points to the current primary.
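During the failover window, connections to the endpoint fail until the standby is promoted and DNS is updated, so the application needs to retry rather than give up on the first error. The helper below is a minimal sketch of retrying with exponential backoff. The function name, its defaults, and the choice of `ConnectionError` as the retryable exception are assumptions; a real database driver raises its own exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff.

    With the defaults the waits are 1, 2, 4, 8, 16, 30, 30 seconds
    (91 seconds in total) before the final attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Each attempt should open a fresh connection so that the endpoint is resolved again. A pooled connection to the old primary will keep failing no matter how long the caller waits. Only retry operations that are safe to repeat, or wrap them in a transaction.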
Cross-region read replicas
For global applications, place read replicas in regions close to your users. Writes go to the primary region. Reads are served locally. This reduces latency for read-heavy workloads but introduces replication lag.
Active-active (multi-region writes)
DynamoDB Global Tables, Cosmos DB, and Spanner support writes in multiple regions simultaneously. This eliminates the single write region bottleneck but adds complexity around conflict resolution. DynamoDB uses last-writer-wins. Cosmos DB offers configurable conflict resolution. Spanner provides strong consistency globally.
Active-active is the most complex and expensive option. Use it only when your application genuinely needs low-latency writes from multiple continents.
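To see why conflict resolution matters, consider what last-writer-wins does when two regions update the same item at nearly the same time. The sketch below is a simplified model, not DynamoDB's internal implementation: the `Version` type is hypothetical, and breaking timestamp ties by region name is an assumption added so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float   # write time as recorded by the accepting region
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp; the other write is discarded.

    Ties are broken by region name so the result does not depend on
    the order in which a replica sees the two versions.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.region))


us = Version("shipped", 100.0, "us-east-1")
eu = Version("cancelled", 100.5, "eu-west-1")
winner = last_writer_wins(us, eu)
```

Here the `shipped` update disappears without any error being raised. If losing a concurrent write is unacceptable for a given item, route writes for that item to a single region or use a store that offers stronger guarantees.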
Backup and recovery
Every managed database service provides automated backups. Understand these settings:
- Backup window: When automated backups run. Schedule during low-traffic periods.
- Retention period: How long backups are kept. Set based on compliance requirements.
- Point-in-time recovery (PITR): Restore to any second within the retention window. This uses transaction logs, not just snapshots.
- Manual snapshots: Create before risky operations like schema migrations.
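Point-in-time recovery, as described in the list above, combines a snapshot with log replay. The sketch below models that with plain Python data. It is a conceptual illustration only: the `Snapshot` and `LogEntry` shapes and the `restore_to` function are invented for this example and do not reflect any provider's storage format.

```python
from typing import Dict, List, Tuple

Snapshot = Tuple[float, Dict[str, str]]  # (taken_at, full copy of the data)
LogEntry = Tuple[float, str, str]        # (committed_at, key, new_value)


def restore_to(
    target_time: float,
    snapshots: List[Snapshot],
    log: List[LogEntry],
) -> Dict[str, str]:
    """Rebuild the data as it was at target_time."""
    candidates = [s for s in snapshots if s[0] <= target_time]
    if not candidates:
        raise ValueError("no snapshot at or before the target time")
    # Start from the most recent snapshot that is not after the target.
    taken_at, data = max(candidates, key=lambda s: s[0])
    state = dict(data)
    # Replay only the changes committed after the snapshot, up to the target.
    for committed_at, key, value in sorted(log, key=lambda e: e[0]):
        if taken_at < committed_at <= target_time:
            state[key] = value
    return state


snapshots = [(0.0, {"a": "1"}), (100.0, {"a": "2", "b": "1"})]
log = [(50.0, "a", "2"), (60.0, "b", "1"), (120.0, "a", "3"), (130.0, "b", "oops")]
# Restore to just before the bad write at t=130.
state = restore_to(125.0, snapshots, log)
```

The restore keeps the legitimate change at t=120 while leaving out the bad write at t=130, which a snapshot-only restore could not do. It also shows why the retention period matters: once the snapshot or the log covering a given time has been deleted, that point can no longer be reached.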
Test your recovery process regularly. A backup you have never restored is not a backup. It is a hope.
What comes next
This article completes the Cloud Platforms and Services series. You now understand compute, networking, storage, and databases in the cloud. The next step is putting these pieces together: designing architectures that are secure, scalable, and cost-effective. Explore the broader DevOps series to learn about infrastructure as code, CI/CD pipelines, and monitoring that bring these cloud services to life.