Apr 13, 2026 · 16 min read · DevOps

Managed databases in the cloud

Running databases is hard. Backups, replication, failover, patching, performance tuning, and capacity planning consume engineering time that could go toward building features. Cloud providers offer managed database services that handle these operational tasks. This article covers what is available, when to use it, and what you give up. For a deeper look at database internals and data modeling, see databases overview.

Managed relational databases

Relational databases remain the default choice for most applications. Managed services remove the undifferentiated heavy lifting while keeping the SQL interface you already know.

Provider implementations

Feature	AWS RDS	GCP Cloud SQL	Azure SQL Database
Engines	MySQL, PostgreSQL, MariaDB, Oracle, SQL Server	MySQL, PostgreSQL, SQL Server	SQL Server (native); MySQL and PostgreSQL via the separate Azure Database Flexible Server services
Max storage	64 TB	64 TB	100 TB (Hyperscale)
Automated backups	Yes, up to 35 days	Yes, up to 365 retained backups	Yes, up to 35 days
Multi-AZ	Yes	Yes (HA configuration)	Yes (zone redundant)
Read replicas	Up to 15 (Aurora), 5 (RDS)	Up to 10	Yes (Hyperscale)

Multi-AZ deployments

A multi-AZ deployment maintains a synchronous standby replica in a different availability zone. If the primary fails, the service automatically promotes the standby. Failover typically takes 60-120 seconds.

graph LR
  App["Application"] --> Primary["Primary DB<br/>AZ-a"]
  Primary -->|"Sync replication"| Standby["Standby DB<br/>AZ-b"]
  Primary -->|"Async replication"| Read1["Read Replica<br/>AZ-a"]
  Primary -->|"Async replication"| Read2["Read Replica<br/>AZ-c"]
  style Primary fill:#3498db,color:#fff
  style Standby fill:#e74c3c,color:#fff
  style Read1 fill:#2ecc71,color:#fff
  style Read2 fill:#2ecc71,color:#fff

The standby handles failover. Read replicas handle read-heavy workloads. They serve different purposes.
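During the failover window the application sees connection errors, so clients need to retry rather than fail outright. The following is a minimal sketch of capped exponential backoff, not a provider API: `call_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for illustration. Real database drivers raise their own exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ConnectionError with capped exponential backoff.

    All names and defaults here are hypothetical; tune them to your driver.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for _ in range(max_attempts - 1):
        try:
            return operation()
        except ConnectionError:
            # Wait, then double the delay for the next attempt (capped).
            sleep(min(delay, max_delay))
            delay *= 2
    # Final attempt: let any error propagate to the caller.
    return operation()
```

With these defaults the waits are 1, 2, 4, 8, 16, 30, and 30 seconds, about 91 seconds in total, which falls within the 60-120 second failover window described above. The `sleep` parameter exists only so the function can be tested without real delays.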

Read replicas

Read replicas use asynchronous replication to create read-only copies of your database. They reduce load on the primary for read-heavy workloads. Replication lag is typically under a second but can spike during heavy writes.

Use read replicas for:

Reporting queries that scan large amounts of data.
Geographically distributed reads by placing replicas in different regions.
Offloading analytics workloads from the production primary.

Do not use read replicas for workloads that require strict read-after-write consistency. A write to the primary may not appear on the replica for milliseconds to seconds.
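One common way to live with replication lag is to send a session's reads to the primary for a short time after that session writes. The sketch below illustrates the idea under stated assumptions: `ReadRouter`, the session identifier, and the two-second pin window are hypothetical, and the window only helps if it exceeds the lag you actually observe.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Route a session's reads to the primary shortly after it writes.

    Hypothetical helper: it returns a target name and leaves the actual
    connection handling to the caller.
    """

    def __init__(
        self,
        pin_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        # Remember when this session last wrote to the primary.
        self._last_write[session_id] = self.clock()

    def target_for_read(self, session_id: str) -> str:
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"  # the replica may not have the write yet
        return "replica"
```

This is a heuristic, not a guarantee: if lag spikes beyond `pin_seconds`, a replica read can still return stale data. Workloads that cannot tolerate that should read from the primary.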

Aurora and Spanner

AWS Aurora and Google Cloud Spanner go beyond traditional managed databases.

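Aurora is compatible with MySQL and PostgreSQL but uses a distributed storage layer shared by all instances in the cluster. AWS claims up to 5x the throughput of standard MySQL. Aurora also offers automatic storage scaling up to 128 TB and up to 15 read replicas, with replication lag typically in the tens of milliseconds because replicas read from the same storage volume as the primary.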

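Spanner is a globally distributed relational database that provides strong consistency across regions. It does not escape the CAP theorem: during a network partition it chooses consistency over availability. What makes it practical is TrueTime, a clock service backed by atomic clocks and GPS receivers in Google's data centers, which bounds clock uncertainty tightly enough to order transactions globally. Combined with Google's private network, partitions are rare enough that availability stays very high in practice. Spanner is expensive, but few alternatives match it for workloads that need global consistency.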

Managed NoSQL databases

When your data model does not fit neatly into tables with relationships, NoSQL databases offer flexibility.

Key-value and document stores

Feature	AWS DynamoDB	GCP Firestore	Azure Cosmos DB
Data model	Key-value, document	Document	Multi-model (document, key-value, graph, column)
Consistency	Eventually or strongly consistent	Strong (multi-region)	Five consistency levels
Scaling	Automatic	Automatic	Automatic
Pricing model	On-demand or provisioned capacity	Per read/write/delete	Request units (RU)
Global distribution	Global Tables	Multi-region	Turnkey global distribution

DynamoDB

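DynamoDB is a fully managed key-value and document database. It is designed to deliver single-digit millisecond latency at any scale, provided load is spread evenly across partition keys. You define a partition key and an optional sort key. Efficient access patterns must be designed around these keys or around secondary indexes; anything else requires a full table scan.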

The single-table design pattern stores multiple entity types in one table, using composite keys to model relationships. It is powerful but requires upfront investment in access pattern analysis.
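To make the composite-key idea concrete, here is a small in-memory sketch. It does not call DynamoDB; `SingleTable` is a stand-in dictionary, and the `CUSTOMER#` and `ORDER#` prefixes are one hypothetical naming convention among many.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]
Key = Tuple[str, str]  # (partition key, sort key)


def customer_key(customer_id: str) -> Key:
    return (f"CUSTOMER#{customer_id}", "PROFILE")


def order_key(customer_id: str, order_date: str, order_id: str) -> Key:
    # ISO dates in the sort key make orders sort chronologically.
    return (f"CUSTOMER#{customer_id}", f"ORDER#{order_date}#{order_id}")


class SingleTable:
    """In-memory stand-in for one table holding several entity types."""

    def __init__(self) -> None:
        self._items: Dict[Key, Item] = {}

    def put(self, key: Key, attributes: Item) -> None:
        self._items[key] = dict(attributes)

    def query(self, pk: str, sk_prefix: str = "") -> List[Item]:
        """Items in one partition whose sort key starts with sk_prefix,
        returned in sort-key order."""
        matches = [
            (sk, item)
            for (p, sk), item in self._items.items()
            if p == pk and sk.startswith(sk_prefix)
        ]
        matches.sort(key=lambda pair: pair[0])
        return [item for _, item in matches]
```

Because a customer's profile and orders share a partition key, `query("CUSTOMER#42")` fetches both in one request, while `query("CUSTOMER#42", "ORDER#")` returns only the orders, oldest first. Each such query is an access pattern you have to plan for when choosing the key format.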

Firestore

Firestore is a document database optimized for mobile and web applications. It provides real-time synchronization out of the box: changes to documents are pushed to connected clients in near real time. Offline support is built in: the client SDK caches data locally and synchronizes when connectivity returns.

Cosmos DB

Cosmos DB offers five consistency levels: strong, bounded staleness, session, consistent prefix, and eventual. You set a default for the account and can relax it for individual read requests. It supports multiple APIs: SQL, MongoDB, Cassandra, Gremlin (graph), and Table. If you are on Azure and uncertain about your long-term data model, Cosmos DB is a flexible starting point.

Managed caching

Caching layers sit between your application and database, serving frequently accessed data from memory.

Feature	AWS ElastiCache	GCP Memorystore	Azure Cache for Redis
Engines	Redis, Memcached	Redis, Memcached	Redis
Cluster mode	Yes	Yes (Redis)	Yes
Max memory	635 GB (Redis)	300 GB	1.2 TB
Multi-AZ	Yes	Yes	Yes

When to use managed caching

Session storage: Store user sessions in Redis for fast access across application instances.
Database query cache: Cache the results of expensive queries. Invalidate when the underlying data changes.
Rate limiting: Use Redis counters to implement rate limiting with minimal latency.
Leaderboards and counters: Redis sorted sets provide O(log N) ranking operations.
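The database query cache case above is usually implemented as cache-aside: check the cache, fall back to the database on a miss, and delete the entry when the data changes. The sketch below uses a plain dictionary in place of Redis so that it runs on its own; `QueryCache` and its method names are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class QueryCache:
    """Cache-aside with a TTL. A dict stands in for Redis here."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, load: Callable[[], Any]) -> Any:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit
        value = load()  # miss or expired: run the expensive query
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        # Call this after writing to the rows the cached query depends on.
        self._entries.pop(key, None)
```

The TTL acts as a safety net: if an invalidation is missed, stale data lives for at most `ttl_seconds`.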

The alternative is running your own Redis cluster on VMs. Managed caching handles patching, failover, and backup. For most teams, the operational savings justify the managed service premium.

Managed search

Full-text search is a specialized workload. Managed search services handle the complexities of indexing, relevance scoring, and cluster management.

Feature	AWS OpenSearch	Elastic Cloud (any provider)	Azure AI Search
Engine	OpenSearch (Elasticsearch fork)	Elasticsearch	Proprietary
Managed by	AWS	Elastic	Microsoft
Vector search	Yes	Yes	Yes
Serverless option	Yes	Yes	No

Use managed search when your application needs:

Full-text search with relevance ranking.
Faceted navigation (filter by category, price range, brand).
Log analytics and observability.
Vector similarity search for AI/ML applications.

Choosing managed vs self-managed

The decision framework is straightforward.

graph TD
  Q1{"Is the database a core<br/>competitive advantage?"} -->|Yes| Self["Self-manage"]
  Q1 -->|No| Q2{"Does the managed service<br/>support your requirements?"}
  Q2 -->|Yes| Q3{"Can you afford the premium?"}
  Q3 -->|Yes| Managed["Use managed service"]
  Q3 -->|No| Q4{"Do you have the ops expertise?"}
  Q4 -->|Yes| Self
  Q4 -->|No| Managed
  Q2 -->|No| Self
  style Managed fill:#2ecc71,color:#fff
  style Self fill:#e74c3c,color:#fff

Most teams should default to managed. Self-manage only when you have a strong reason and the expertise to back it up.
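The flowchart above translates directly into a few conditionals. This is only a transcription of the diagram for illustration; the function and parameter names are made up here.

```python
def choose_hosting(
    core_advantage: bool,
    service_meets_requirements: bool,
    can_afford_premium: bool,
    has_ops_expertise: bool,
) -> str:
    """Walk the decision flowchart and return 'managed' or 'self-manage'."""
    if core_advantage:
        return "self-manage"
    if not service_meets_requirements:
        return "self-manage"
    if can_afford_premium:
        return "managed"
    # Cannot afford the premium: self-manage only with the expertise to do so.
    return "self-manage" if has_ops_expertise else "managed"
```

Note the last branch: a team that cannot afford the premium and lacks operations expertise still lands on managed, because running a database badly tends to cost more than the premium.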

Arguments for managed

Time to market: Deploy a production-ready database in minutes.
Operational reliability: Automated backups, patching, failover.
Security: Encryption at rest, network isolation, IAM integration out of the box.
Scaling: Add read replicas or increase capacity without downtime.

Arguments for self-managed

Cost at scale: At very high data volumes, self-managed on reserved instances can be 40-60% cheaper.
Full configuration control: Tune every parameter. Use custom extensions. Run any version.
Compliance requirements: Some regulations require specific database configurations that managed services do not expose.
Exotic workloads: Niche engines, including some time-series and graph databases, that your provider does not offer as a managed service.

Cost comparison

Database costs vary widely based on instance size, storage, I/O, and backup retention.

The managed premium is roughly 30-40% over the equivalent self-managed infrastructure. It covers automated backups, monitoring, failover, and patching. Factor in the engineering time saved before deciding: if your team would otherwise spend 10+ hours per month on database operations, the managed service is likely cheaper in total.
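The trade-off can be worked through with simple arithmetic. The sketch below assumes the premium applies as a percentage on top of the self-managed infrastructure bill and that operations time has a flat hourly cost. Both are simplifications, and every figure in the usage note is hypothetical.

```python
from typing import Tuple


def monthly_totals(
    infra_cost: float,
    premium_rate: float,
    ops_hours: float,
    hourly_cost: float,
) -> Tuple[float, float]:
    """Return (managed_total, self_managed_total) per month.

    infra_cost:   self-managed infrastructure bill
    premium_rate: managed premium as a fraction, e.g. 0.35 for 35%
    ops_hours:    engineering hours spent operating the database yourself
    hourly_cost:  loaded cost of one engineering hour
    """
    managed = infra_cost * (1 + premium_rate)
    self_managed = infra_cost + ops_hours * hourly_cost
    return managed, self_managed


def break_even_hours(
    infra_cost: float, premium_rate: float, hourly_cost: float
) -> float:
    """Ops hours per month above which the managed service is cheaper."""
    return infra_cost * premium_rate / hourly_cost
```

For example, with a hypothetical $2,000 monthly infrastructure bill, a 35% premium, and $100 per engineering hour, `break_even_hours(2000, 0.35, 100)` comes to about 7 hours. A team spending 10 hours a month on operations would pay roughly $3,000 self-managed against $2,700 managed.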

High availability patterns

Multi-AZ with automatic failover

Deploy the primary in one AZ and a standby in another. The managed service monitors the primary and promotes the standby if it fails. Your application connects through a DNS endpoint that automatically points to the current primary.

Cross-region read replicas

For global applications, place read replicas in regions close to your users. Writes go to the primary region. Reads are served locally. This reduces latency for read-heavy workloads but introduces replication lag.

Active-active (multi-region writes)

DynamoDB Global Tables, Cosmos DB, and Spanner support writes in multiple regions simultaneously. This eliminates the single write region bottleneck but adds complexity around conflict resolution. DynamoDB uses last-writer-wins. Cosmos DB offers configurable conflict resolution. Spanner provides strong consistency globally.

Active-active is the most complex and expensive option. Use it only when your application genuinely needs low-latency writes from multiple continents.
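The last-writer-wins rule mentioned above is simple to state and easy to underestimate. The sketch below shows the general technique, not DynamoDB's internal implementation. The `Version` type is hypothetical, and breaking timestamp ties by region name is an assumption added here to keep the result deterministic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # write time recorded by the region that accepted it
    region: str


def resolve_lww(versions: Iterable[Version]) -> Version:
    """Pick the version with the latest timestamp.

    Ties are broken by region name (an assumption for this sketch) so that
    every region resolves the conflict to the same winner.
    """
    return max(versions, key=lambda v: (v.timestamp, v.region))
```

If one region records an order as "shipped" at time 10.0 and another records it as "cancelled" at 10.5, the result is "cancelled" and the earlier write is silently discarded. That data loss is the price of simplicity, and it is why concurrent updates to the same item from different regions should be rare by design.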

Backup and recovery

Every managed database service provides automated backups. Understand these settings:

Backup window: When automated backups run. Schedule during low-traffic periods.
Retention period: How long backups are kept. Set based on compliance requirements.
Point-in-time recovery (PITR): Restore to any second within the retention window. This uses transaction logs, not just snapshots.
Manual snapshots: Create before risky operations like schema migrations.

Test your recovery process regularly. A backup you have never restored is not a backup. It is a hope.
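Point-in-time recovery works by combining the two backup types: start from the latest snapshot taken at or before the target time, then replay logged changes up to the target. The toy model below illustrates that mechanism with dictionaries. The `Snapshot` and `LogEntry` shapes are invented for this sketch and bear no relation to any engine's on-disk format.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Snapshot = Tuple[float, Dict[str, str]]  # (taken_at, full copy of the data)
LogEntry = Tuple[float, str, str]        # (committed_at, key, new value)


def restore_to(
    target: float, snapshots: List[Snapshot], log: List[LogEntry]
) -> Dict[str, str]:
    """Rebuild the data as of `target`.

    Assumes `snapshots` and `log` are both sorted by time.
    """
    times = [taken_at for taken_at, _ in snapshots]
    idx = bisect_right(times, target) - 1  # latest snapshot <= target
    if idx < 0:
        raise ValueError("target is older than the oldest snapshot")
    taken_at, data = snapshots[idx]
    state = dict(data)
    # Replay only the changes committed after the snapshot, up to the target.
    for committed_at, key, value in log:
        if taken_at < committed_at <= target:
            state[key] = value
    return state
```

This also shows why the retention window matters: once the snapshot or the log segment covering a given moment has expired, that moment can no longer be reconstructed.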

What comes next

This article completes the Cloud Platforms and Services series. You now understand compute, networking, storage, and databases in the cloud. The next step is putting these pieces together: designing architectures that are secure, scalable, and cost-effective. Explore the broader DevOps series to learn about infrastructure as code, CI/CD pipelines, and monitoring that bring these cloud services to life.