Cloud Architecture Patterns
- Cloud architecture is the set of structural decisions that determine how cloud components - compute, storage, networking, services - are selected, configured, and connected to meet a system’s functional and non-functional requirements.
- Good architecture is never accidental. It emerges from deliberate tradeoffs between availability, cost, performance, security, and operational complexity.
The Well-Architected Framework
The AWS Well-Architected Framework (mirrored closely by Azure’s and GCP’s equivalents) defines six pillars that underpin sound cloud design. Think of them as the lenses through which every architectural decision should be evaluated.
| Pillar | Core Question |
|---|---|
| Operational Excellence | Can we run and monitor systems to deliver business value and continually improve? |
| Security | Are we protecting data, systems, and assets appropriately? |
| Reliability | Can the system recover from failures and meet demand? |
| Performance Efficiency | Are we using compute resources efficiently as demand changes? |
| Cost Optimization | Are we delivering business value at the lowest price point? |
| Sustainability | Are we minimizing the environmental impact of our workloads? |
In practice, pillars conflict. Reliability costs money. Security adds latency. The job of an architect is to make those tradeoffs explicit, not avoid them.
High Availability (HA)
- High Availability is the design goal of keeping a system operational for the highest possible percentage of time, typically expressed as “nines” (99.9%, 99.99%, 99.999%).
- HA is not the same as fault tolerance - HA systems can tolerate individual component failures without a full outage, but may still experience brief degradation.
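To make the “nines” concrete, the sketch below converts an availability target into the downtime it allows per year. It assumes a 365-day year and availability measured over the whole year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Under these assumptions the three targets work out to roughly 8.8 hours, 53 minutes, and 5 minutes of downtime per year, which is why each extra nine is so much harder to buy.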
Availability Zones and Regions
- Region: A geographic area containing multiple physically isolated data centers.
- Availability Zone (AZ): An independent data center (or cluster of data centers) within a region, with its own power, cooling, and networking.
- Deploying across multiple AZs protects against a single AZ failure. Deploying across multiple regions protects against a regional outage or disaster.
HA Patterns
- Active-Active: All instances are live and serving traffic simultaneously. Load is distributed across all nodes. If one fails, others absorb the traffic with no downtime.
- Active-Passive: A primary instance serves all traffic. A standby instance is kept in sync and only promoted if the primary fails. Simpler to operate, but the failover is never truly zero-downtime.
- Multi-Region Active-Active: The most complex and expensive option, but it provides the highest availability. Traffic is geo-routed to the nearest region, so a full regional failure degrades the system rather than stopping it.
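The difference between the first two patterns comes down to how a router picks a target. The following is a minimal sketch, not a real load balancer: `Node`, the health flag, and the modulo round-robin are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def route_active_active(nodes: list[Node], request_id: int) -> Node:
    """Spread requests across every healthy node (simple modulo round-robin)."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    return healthy[request_id % len(healthy)]


def route_active_passive(primary: Node, standby: Node) -> Node:
    """Send everything to the primary; use the standby only when it fails."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("primary and standby are both down")


a, b = Node("a"), Node("b")
a.healthy = False  # simulate a failure of node "a"
print(route_active_active([a, b], request_id=7).name)  # b
print(route_active_passive(primary=a, standby=b).name)  # b
```

In a real active-passive setup the promotion step (detecting the failure, redirecting traffic) takes time, which is where the non-zero failover window comes from.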
Disaster Recovery (DR)
- Disaster Recovery is the strategy for restoring service after a catastrophic event (data center loss, ransomware, accidental deletion).
- Two key metrics define your DR posture:
- RTO (Recovery Time Objective): How long the system can be down before business impact becomes unacceptable.
- RPO (Recovery Point Objective): How much data loss is acceptable - measured in time (e.g., “we can afford to lose up to 1 hour of data”).
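One way to reason about these two metrics is to compare them with what the current setup can actually deliver. The sketch below assumes a simple periodic-backup scheme, where the worst-case data loss equals the backup interval, and a restore time you have measured in a drill; the function and parameter names are hypothetical.

```python
from datetime import timedelta


def check_dr_posture(backup_interval: timedelta, measured_restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare what a periodic-backup setup delivers against the RPO/RTO targets."""
    # A failure just before the next backup loses everything since the last one.
    worst_case_data_loss = backup_interval
    return {
        "meets_rpo": worst_case_data_loss <= rpo,
        "meets_rto": measured_restore_time <= rto,
    }


result = check_dr_posture(
    backup_interval=timedelta(hours=1),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result)  # {'meets_rpo': True, 'meets_rto': False}
```

Here hourly backups satisfy a one-hour RPO, but a three-hour restore misses a two-hour RTO, so the recovery process needs work even though the backup schedule does not.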
DR Strategies (Cheapest to Most Expensive)
| Strategy | RTO | RPO | How |
|---|---|---|---|
| Backup & Restore | Hours | Hours | Periodic backups to cold storage. Restore from scratch on failure. |
| Pilot Light | 10–30 min | Minutes | Core infrastructure always running but idle. Scale up on failure. |
| Warm Standby | Minutes | Seconds | Scaled-down replica always running and receiving replicated data. |
| Multi-Site Active-Active | Near zero | Near zero | Full duplicate running in another region, serving live traffic. |
The tighter your RTO and RPO targets, the more expensive the architecture. Choose the strategy that matches the business’s tolerance for downtime and data loss, not the one that sounds most impressive.
Fault Tolerance vs. Resilience
- Fault Tolerant: The system continues operating without any degradation when a component fails (e.g., RAID arrays, redundant power supplies). Expensive to achieve at every layer.
- Resilient: The system detects failures, isolates them, and recovers - possibly with brief degradation. This is the practical standard for most cloud workloads.
- Graceful Degradation: A resilient pattern where a system intentionally serves a reduced feature set under stress (e.g., showing cached results when a live API is down) rather than failing completely.
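Graceful degradation with cached results can be sketched as below. This is a simplified illustration: the cache is a plain dict, `fetch_live` stands in for any live API call, and only connection and timeout errors are treated as reasons to degrade.

```python
def get_with_fallback(key, fetch_live, cache):
    """Return (value, is_stale). Serve the last known value if the live call fails."""
    try:
        value = fetch_live(key)
    except (ConnectionError, TimeoutError):
        if key in cache:
            return cache[key], True  # degraded: stale but still useful
        raise  # nothing cached, so there is no reduced mode to fall back to
    cache[key] = value  # remember the latest good answer
    return value, False
```

The `is_stale` flag lets the caller tell users that they are seeing older data instead of silently presenting it as fresh.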
Key Design Patterns
Stateless Architecture
- Servers do not retain any client session data between requests. All state is externalized to a shared store (database, cache like Redis, or object storage).
- Why it matters: Stateless instances can be scaled horizontally without sticky sessions, replaced without session loss, and load-balanced freely.
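A minimal sketch of externalized state follows. `SessionStore` is an in-memory stand-in for a shared store such as Redis, and `AppInstance` is a hypothetical application server; neither reflects a real client library.

```python
class SessionStore:
    """In-memory stand-in for a shared store (for example Redis or a database)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = dict(session)


class AppInstance:
    """Keeps no per-client state of its own; everything lives in the store."""

    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self._store = store

    def add_to_cart(self, session_id: str, item: str) -> list:
        session = self._store.get(session_id)
        session["cart"] = session.get("cart", []) + [item]
        self._store.put(session_id, session)
        return session["cart"]


store = SessionStore()
instance_a = AppInstance("a", store)
instance_b = AppInstance("b", store)
instance_a.add_to_cart("session-1", "book")
print(instance_b.add_to_cart("session-1", "pen"))  # ['book', 'pen']
```

Because the second request lands on a different instance and still sees the cart, either instance could be terminated or replaced without the client noticing.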
Circuit Breaker
- A pattern that monitors calls to a downstream service. If failures exceed a threshold, the circuit “opens” and subsequent calls fail immediately (fallback response) instead of piling up and cascading.
- Practical use: Protects against a slow or unresponsive dependency taking down the entire service. Libraries: resilience4j (Java), Polly (.NET), and Netflix Hystrix (Java, now in maintenance mode).
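The open/fail-fast behavior can be sketched in a few lines. This is an illustrative toy, not how any of the libraries above are implemented: the class and parameter names are made up, it counts consecutive failures, and it adds a common extension (a trial call after a reset timeout, often called “half-open”) that the description above does not cover.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and return a fallback response, so requests stop queuing behind a dependency that is known to be unhealthy.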
Retry with Exponential Backoff
- On transient failures, retry the request with increasing delays (1s, 2s, 4s, 8s…) so that retries do not keep hammering a service that is already struggling.
- Add jitter - randomize each delay slightly so that clients do not all retry at the exact same time (the thundering herd problem).
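The two bullets above combine into a short helper. This is a sketch under stated assumptions: the function name, the 25% jitter range, the delay cap, and the choice of which exceptions count as transient are all illustrative.

```python
import random
import time


def retry_with_backoff(func, attempts=5, base_delay=1.0, max_delay=30.0,
                       jitter=0.25, retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: 1s, 2s, 4s, 8s... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: stretch the delay by a random 0-25% so clients desynchronize.
            sleep(delay * (1 + random.uniform(0, jitter)))
```

Only errors listed in `retry_on` are retried; anything else (for example a validation error) propagates immediately, since retrying a non-transient failure just adds load.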
Bulkhead
- Isolate resources (threads, connections, memory) for different concerns, so a failure in one part doesn’t consume all shared resources and bring down unrelated parts.
- Named after the watertight compartments in ship hulls - one flooded compartment doesn’t sink the ship.
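One simple way to build a bulkhead is to cap the number of concurrent calls per dependency. The sketch below uses a semaphore per compartment; the class name and the reject-when-full behavior are assumptions chosen for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queuing, so callers are never tied up waiting.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` means a slow one can exhaust only its own slots; calls to other dependencies continue to find capacity.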
Strangler Fig
- A migration pattern for decomposing monoliths. Gradually replace parts of the old system with new services while the old system continues to run. Traffic is incrementally re-routed until the old system can be decommissioned.
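The incremental re-routing usually lives in a proxy or gateway in front of both systems. A minimal sketch of that routing decision, with hypothetical path prefixes and backend names:

```python
# Hypothetical: the routes that have already been moved to new services.
MIGRATED_PREFIXES = ("/orders", "/invoices")


def choose_backend(path: str, migrated_prefixes=MIGRATED_PREFIXES) -> str:
    """Send migrated routes to the new service and everything else to the monolith."""
    for prefix in migrated_prefixes:
        if path == prefix or path.startswith(prefix + "/"):
            return "new-service"
    return "legacy-monolith"


print(choose_backend("/orders/42"))   # new-service
print(choose_backend("/customers/7")) # legacy-monolith
```

Migration proceeds by adding prefixes to the list one at a time; once every route points at a new service, the monolith receives no traffic and can be retired.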
CQRS (Command Query Responsibility Segregation)
- Split the data model into separate read and write paths. The write side handles commands (mutations), the read side serves queries (often from a denormalized read model optimized for specific access patterns).
- When to use: High read traffic with complex queries; when read and write scale requirements differ significantly.
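A toy illustration of the split follows. All class and method names are hypothetical, amounts are integer cents, and the read model is updated synchronously for brevity; in many real systems that projection happens asynchronously, so reads may lag slightly behind writes.

```python
from collections import defaultdict


class OrderReadModel:
    """Read side: a denormalized view answering 'how much has this customer spent?'."""

    def __init__(self):
        self._total_by_customer = defaultdict(int)

    def project(self, customer: str, amount_cents: int) -> None:
        self._total_by_customer[customer] += amount_cents

    def total_spent(self, customer: str) -> int:
        return self._total_by_customer.get(customer, 0)


class OrderCommandHandler:
    """Write side: validates commands, stores orders, then updates the read model."""

    def __init__(self, read_model: OrderReadModel):
        self._orders = {}
        self._read_model = read_model

    def place_order(self, order_id: str, customer: str, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if order_id in self._orders:
            raise ValueError(f"duplicate order id: {order_id}")
        self._orders[order_id] = {"customer": customer, "amount_cents": amount_cents}
        self._read_model.project(customer, amount_cents)


reads = OrderReadModel()
writes = OrderCommandHandler(reads)
writes.place_order("o-1", "alice", 1500)
writes.place_order("o-2", "alice", 2500)
print(reads.total_spent("alice"))  # 4000
```

The query never scans the order records; it reads a precomputed total, which is the kind of access-pattern-specific optimization the read side exists for.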
Event Sourcing
- Instead of storing only current state, store every event that led to that state (e.g., `OrderPlaced`, `PaymentProcessed`, `OrderShipped`). Current state is derived by replaying events.
- Upside: Full audit trail, time-travel debugging, easy event replay for new consumers.
- Downside: Complexity. Event schema evolution is hard. Not appropriate for simple CRUD applications.
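Deriving state by replay can be sketched with the three example events above. The `Event` shape, the payload fields, and the status values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    type: str
    data: dict = field(default_factory=dict)


def apply(state: dict, event: Event) -> dict:
    """Return the new state after one event; never mutates the old state."""
    if event.type == "OrderPlaced":
        return {**state, "status": "placed", "items": event.data["items"]}
    if event.type == "PaymentProcessed":
        return {**state, "status": "paid", "paid_amount": event.data["amount"]}
    if event.type == "OrderShipped":
        return {**state, "status": "shipped"}
    raise ValueError(f"unknown event type: {event.type}")


def replay(events) -> dict:
    """Fold the event log into the current state."""
    state = {}
    for event in events:
        state = apply(state, event)
    return state


log = [
    Event("OrderPlaced", {"items": ["book"]}),
    Event("PaymentProcessed", {"amount": 1500}),
    Event("OrderShipped"),
]
print(replay(log)["status"])      # shipped
print(replay(log[:2])["status"])  # paid
```

Replaying only a prefix of the log reconstructs the state at an earlier point, which is the “time-travel debugging” benefit listed above.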
Cloud-Native Design Principles
- Design for failure - assume every component will fail. Build detection and recovery in from day one.
- Prefer managed services - using a managed database, queue, or cache means the provider handles patching, HA, and backups. Only self-host when you have a specific reason.
- Immutable infrastructure - never patch a running server. Build a new image, deploy it, and terminate the old one. Eliminates “drift” between environments.
- Loose coupling - services should communicate through well-defined interfaces (APIs, queues, events). Tight coupling means one service’s change breaks another.
- 12-Factor App - a widely adopted methodology for building cloud-native applications covering configuration, dependencies, logging, statelessness, and more. Worth reading in full at 12factor.net.
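As one small example of the configuration guidance in the 12-Factor methodology, settings can be read from environment variables rather than baked into the code or image. The variable names and the `Settings` class below are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str = "INFO"


def load_settings(env=os.environ) -> Settings:
    """Build settings from environment variables; fail fast if a required one is missing."""
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set") from None
    return Settings(database_url=database_url, log_level=env.get("LOG_LEVEL", "INFO"))


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"DATABASE_URL": "postgres://db.example.internal/app"})
print(settings.log_level)  # INFO
```

The same image can then run unchanged in every environment, which fits the immutable-infrastructure principle above: only the injected configuration differs.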