Multi-cloud and cloud-agnostic design
In this series (10 parts)
- Cloud fundamentals and the shared responsibility model
- Compute: VMs, containers, serverless
- Networking in the cloud
- Cloud storage services
- Managed databases in the cloud
- Cloud IAM and access control
- Serverless architecture patterns
- Cloud cost management
- Multi-cloud and cloud-agnostic design
- Cloud Well-Architected Framework
Most companies do not choose multi-cloud. It happens to them. An acquisition brings in a team running on Azure while the parent company runs on AWS. A data regulation requires European workloads on a provider with specific regional certifications. A CTO wants to avoid lock-in and mandates that everything must be portable. Each reason leads to a different strategy and different tradeoffs.
Why companies use multiple clouds
There are legitimate reasons and there are cargo-cult reasons. It helps to distinguish between them.
Acquisitions and mergers. The most common path to multi-cloud is organizational. Two companies merge. Each has years of investment in a different provider. Migration is expensive and risky. Both platforms persist.
Regulatory requirements. Some industries and governments mandate data residency in specific regions. Not every provider has data centers in every country. Running workloads on the provider that meets compliance requirements is simpler than fighting it.
Best-of-breed services. GCP offers strong data analytics and ML platforms. AWS has the broadest service catalog. Azure integrates deeply with Microsoft enterprise tooling. Some organizations pick services from each provider based on strength.
Negotiating leverage. Spreading spend across providers gives you leverage in pricing negotiations. A credible threat to move workloads keeps discount conversations productive.
Avoiding single-provider outages. A multi-cloud architecture can survive a total provider failure. In practice, this is extremely expensive to implement correctly and few organizations actually test failover.
The cost of multi-cloud
Multi-cloud sounds safe. In practice it multiplies operational complexity.
Every cloud provider has different APIs, different IAM models, different networking constructs, and different operational tooling. Your team needs expertise in all of them. Hiring becomes harder. On-call rotations cover more surface area. Every automation script, monitoring dashboard, and runbook needs variants for each provider.
Data transfer between clouds is expensive. Cross-provider egress charges of $0.08-0.12 per GB add up fast. A service on AWS calling a database on GCP generates costs on both sides. Latency also increases since cross-provider traffic goes over the public internet unless you set up dedicated interconnects.
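A back-of-the-envelope sketch shows how these charges compound. The per-GB rates below are illustrative assumptions, not quoted prices; check your providers' current egress pricing:

```python
# Rough estimate of cross-cloud traffic cost. Traffic crossing
# provider boundaries is typically billed as egress on each side.
# Rates are illustrative assumptions, not current list prices.

def monthly_cross_cloud_cost(gb_per_month: float,
                             egress_rate_a: float = 0.09,
                             egress_rate_b: float = 0.09) -> float:
    """Sum of egress charges on both sides of the link."""
    return gb_per_month * (egress_rate_a + egress_rate_b)

# A service on cloud A exchanging 5 TB/month with a database on cloud B:
cost = monthly_cross_cloud_cost(5 * 1024)
print(f"${cost:,.2f} per month")  # → $921.60 per month
```

At tens of terabytes a month, this line item alone can dwarf the compute bill for the services generating the traffic.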
Consider these tradeoffs honestly before committing to multi-cloud.
Abstraction layers
Abstraction layers reduce the cost of multi-cloud by providing a single interface across providers. Two dominate: Terraform for infrastructure and Kubernetes for workloads.
Terraform
Terraform uses a declarative language (HCL) to describe infrastructure. You define what you want and Terraform figures out how to create it. Providers are plugins that translate HCL resources into cloud API calls.
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Name = "web-server"
  }
}
Switching clouds means changing the provider and resource types. The workflow stays the same: terraform plan, terraform apply, terraform destroy. You cannot write one Terraform config that deploys to any cloud, but you can use the same language, state management, and processes everywhere.
Terraform modules help. A module that creates a “compute instance with load balancer” can have AWS, GCP, and Azure implementations behind a common interface. Callers pick the implementation at deploy time.
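A sketch of that pattern, where the module path and variable names are hypothetical rather than a published module:

```hcl
# Callers depend on a stable set of input variables; the
# provider-specific resources live inside each implementation
# directory behind the same interface.
module "web" {
  # Swap "aws" for "gcp" or "azure" at deploy time.
  source       = "./modules/compute-with-lb/aws"

  name         = "web-server"
  size         = "medium"   # mapped to t3.medium, e2-medium, etc. inside the module
  min_replicas = 2
}
```

The caller never references `aws_instance` or `google_compute_instance` directly, so the blast radius of a provider change is confined to the module implementations.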
Kubernetes
Kubernetes provides a portable workload runtime. A Deployment manifest that runs on EKS also runs on GKE and AKS. The container image is the same. The service mesh is the same. The monitoring stack is the same.
This portability has limits. Managed services like load balancers, storage classes, and ingress controllers differ between providers. You abstract these with Kubernetes interfaces (StorageClass, IngressClass) but the implementations are provider-specific. Moving a workload from EKS to GKE still requires changing annotations, IAM bindings, and networking configurations.
Kubernetes is the strongest multi-cloud abstraction available today for application workloads. It is not free. Operating Kubernetes well requires deep expertise.
Cloud-agnostic data formats
Lock-in is stickiest at the data layer. Moving compute is straightforward compared to moving petabytes of data.
Use open formats wherever possible:
Layer            Vendor-locked         Cloud-agnostic
------------------------------------------------------------
Object storage   S3 API (de facto)     S3-compatible API
Table format     DynamoDB format       Apache Parquet, Iceberg
Message format   SQS message schema    CloudEvents, Avro
Database         Aurora Serverless     PostgreSQL-compatible
ML model format  SageMaker model       ONNX
Container image  None (OCI standard)   OCI image spec
IaC definition   CloudFormation        Terraform HCL
Apache Parquet for columnar data, Apache Iceberg for table metadata, and CloudEvents for event schemas give you formats that any platform can read. Building around open formats means your data survives a provider switch even if you rebuild the infrastructure from scratch.
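A CloudEvents envelope, for instance, is plain JSON that any broker or function runtime can parse. A minimal sketch using only the standard library (the event `type` and `source` values are made up for illustration):

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal CloudEvents 1.0 envelope. The spec requires specversion,
# id, source, and type; "time", "datacontenttype", and "data" are
# optional attributes.
event = {
    "specversion": "1.0",
    "id": str(uuid.uuid4()),
    "source": "/orders/service",          # hypothetical producer URI
    "type": "com.example.order.created",  # hypothetical event type
    "time": datetime.now(timezone.utc).isoformat(),
    "datacontenttype": "application/json",
    "data": {"order_id": 1234, "total": 99.95},
}

# Serialized this way, the same payload can move through SQS,
# Pub/Sub, or Event Grid without a provider-specific schema.
payload = json.dumps(event)
print(payload)
```

Consumers validate against the CloudEvents attributes rather than a broker's message format, so swapping the transport does not touch the event contract.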
What lock-in actually looks like
Lock-in exists on a spectrum. Not all lock-in is equal.
Shallow lock-in: your deployment scripts reference AWS-specific APIs. Rewriting the scripts takes a week. This is inconvenient but manageable.
Medium lock-in: your application uses DynamoDB directly with features like DynamoDB Streams and Global Tables. Migration requires redesigning the data access layer. This takes months.
Deep lock-in: your entire event architecture runs on AWS EventBridge, Step Functions, and Lambda with tight IAM integration. Migration means rearchitecting the application. This takes a year.
The deeper the lock-in, the higher the switching cost. But deeper lock-in also means you are using more of the provider’s managed services, which reduces operational burden. There is a direct tradeoff between portability and operational simplicity.
Lock-in that is fine to accept
Not all lock-in is bad. Some provider-specific services deliver enough value that the switching cost is justified.
Managed databases. Running your own database cluster to avoid lock-in is almost always worse than using a managed service. The operational overhead of self-managing PostgreSQL or MySQL outweighs the lock-in risk for most teams. Choose a standard engine (PostgreSQL, MySQL) on a managed service and you get portability of the query language if not the management plane.
Identity and access management. IAM is inherently provider-specific. Abstracting it adds complexity without clear benefit. Use the native IAM system and accept that migration means rebuilding policies.
CDN and edge services. CDN configuration is provider-specific but the concepts transfer. Switching CDNs is a DNS change and a config rewrite, not an architecture change.
Container orchestration. If you are on Kubernetes, the managed service (EKS, GKE, AKS) adds convenience. The workloads themselves remain portable.
The decision framework is simple. If a service saves significant operational effort and uses standard interfaces or data formats, accept the lock-in. If it traps your data in a proprietary format with no export path, think twice.
Designing for portability
If multi-cloud or cloud-exit is a real requirement, design for it from the start. Retrofitting portability into an existing architecture is painful.
Separate cloud-specific code. Put provider-specific logic (SDK calls, IAM configuration, resource naming) behind interfaces. Your application code calls the interface. Adapters implement it for each provider.
Use standard protocols. gRPC instead of provider-specific RPC. AMQP or MQTT instead of proprietary message protocols. SQL instead of proprietary query languages.
Avoid proprietary orchestration. If you build on Step Functions, you are locked in. If you build on Temporal or Argo Workflows running on Kubernetes, you can move the orchestrator with the workloads.
Automate everything with Terraform. If your infrastructure is code, migration is a rewrite of that code rather than a manual reconstruction of hundreds of resources.
Test portability periodically. Deploy a non-critical workload on your secondary provider quarterly. If the deployment succeeds, you know your abstractions work. If it fails, you learn where hidden lock-in has crept in.
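The first principle, separating cloud-specific code behind interfaces, can be sketched with a provider-neutral interface and per-provider adapters. The class and method names here are illustrative, not a real SDK:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Provider-neutral interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3BlobStore(BlobStore):
    """AWS adapter. In real code this would wrap boto3 calls."""
    def __init__(self):
        self._objects = {}  # in-memory stand-in for the S3 client
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]

class GcsBlobStore(BlobStore):
    """GCP adapter. Would wrap google-cloud-storage in real code."""
    def __init__(self):
        self._objects = {}  # in-memory stand-in for the GCS client
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]

def archive_report(store: BlobStore, report: bytes) -> None:
    # Application logic never imports a cloud SDK directly, so
    # swapping providers is a one-line change at wiring time.
    store.put("reports/latest", report)

store = S3BlobStore()   # or GcsBlobStore()
archive_report(store, b"q3 numbers")
print(store.get("reports/latest"))
```

The interface is the contract your quarterly portability test exercises: if a workload runs against `GcsBlobStore` as cleanly as `S3BlobStore`, the abstraction is holding.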
The cloud exit strategy
Multi-cloud is one response to provider risk. A cloud exit strategy is another. An exit strategy does not mean you plan to leave. It means you know how you would leave if you had to.
Document the following for each critical service:
- What provider-specific features it depends on
- How much data it stores and the estimated transfer time
- Which open-source or self-hosted alternatives exist
- The estimated engineering effort to migrate
Review this document annually. Update it when you adopt new services. You may never execute the plan, but having one reduces the anxiety that drives premature multi-cloud decisions.
Networking across clouds
If you do run on multiple providers, networking is the hardest problem. By default, traffic between clouds traverses the public internet. This adds latency, costs egress fees, and exposes data to interception.
Dedicated interconnects solve the latency and security problems. AWS Direct Connect, GCP Cloud Interconnect, and Azure ExpressRoute provide private links between your cloud environments. These cost thousands per month but eliminate public internet traversal for inter-cloud traffic.
For smaller workloads, VPN tunnels between cloud VPCs provide encrypted connectivity over the internet. Latency is higher than dedicated interconnects but the cost is much lower.
Service mesh solutions like Istio or Consul can span multiple clouds and provide service discovery, mTLS, and traffic management across provider boundaries. This adds operational complexity but gives you a consistent networking model regardless of where the workload runs.
When to go multi-cloud vs single-cloud
For most organizations, single-cloud is the right default. The operational simplicity of one provider, one IAM system, one networking model, and one billing console outweighs the theoretical benefits of multi-cloud.
Go multi-cloud when:
- Regulations require it
- An acquisition makes it unavoidable
- A specific service on another provider is genuinely better and the workload justifies the complexity
- Your organization is large enough to afford dedicated platform teams for each provider
Stay single-cloud when:
- Your team is small
- Your workloads do not have regulatory constraints
- You want to move fast
- The cost of multi-cloud expertise exceeds the risk of lock-in
What comes next
Whether you run on one cloud or three, the architecture still needs to be sound. The next article covers the Cloud Well-Architected Framework, a structured approach to evaluating your architecture across five pillars: operational excellence, security, reliability, performance, and cost optimization.