
Cloud Architecture Patterns

  • Cloud architecture is the set of structural decisions that determine how cloud components - compute, storage, networking, services - are selected, configured, and connected to meet a system’s functional and non-functional requirements.
  • Good architecture is never accidental. It emerges from deliberate tradeoffs between availability, cost, performance, security, and operational complexity.

The AWS Well-Architected Framework (mirrored closely by Azure’s and GCP’s equivalents) defines 6 pillars that underpin sound cloud design. Think of these as the lens through which every architectural decision should be evaluated.

| Pillar | Core Question |
| --- | --- |
| Operational Excellence | Can we run and monitor systems to deliver business value and continually improve? |
| Security | Are we protecting data, systems, and assets appropriately? |
| Reliability | Can the system recover from failures and meet demand? |
| Performance Efficiency | Are we using compute resources efficiently as demand changes? |
| Cost Optimization | Are we delivering business value at the lowest price point? |
| Sustainability | Are we minimizing the environmental impact of our workloads? |

In practice, pillars conflict. Reliability costs money. Security adds latency. The job of an architect is to make those tradeoffs explicit, not avoid them.

High Availability (HA)

  • High Availability is the design goal of keeping a system operational for the highest possible percentage of time, typically expressed as “nines” (99.9%, 99.99%, 99.999%).
  • HA is not the same as fault tolerance - HA systems can tolerate individual component failures without a full outage, but may still experience brief degradation.

Regions and Availability Zones

  • Region: A geographic area containing multiple physically isolated data centers.
  • Availability Zone (AZ): An independent data center (or cluster of data centers) within a region, with its own power, cooling, and networking.
  • Deploying across multiple AZs protects against a single AZ failure. Deploying across multiple regions protects against a regional outage or disaster.
  • Active-Active: All instances are live and serving traffic simultaneously. Load is distributed across all nodes. If one fails, others absorb the traffic with no downtime.
  • Active-Passive: A primary instance serves all traffic. A standby instance is kept in sync and only promoted if the primary fails. Simpler to operate, but the failover is never truly zero-downtime.
  • Multi-Region Active-Active: Most complex and expensive, but provides the highest availability. Traffic is geo-routed to the nearest region; a full regional failure only degrades, not stops, the system.

Disaster Recovery (DR)

  • Disaster Recovery is the strategy for restoring service after a catastrophic event (data center loss, ransomware, accidental deletion).
  • Two key metrics define your DR posture:
    • RTO (Recovery Time Objective): How long the system can be down before business impact becomes unacceptable.
    • RPO (Recovery Point Objective): How much data loss is acceptable - measured in time (e.g., “we can afford to lose up to 1 hour of data”).
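The active-passive topology described above can be sketched as a routing decision: route to the primary while it is healthy, promote the standby when it is not. The endpoint names and `is_healthy` probe here are illustrative assumptions; real deployments usually put this logic in a load balancer or DNS health check rather than the client.

```python
# Minimal active-passive failover sketch. Endpoint names are made up,
# and `is_healthy` stands in for a real probe (e.g., HTTP GET /healthz).

class Endpoint:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def is_healthy(self):
        # Stand-in for a real health check against the instance.
        return self.healthy


def pick_endpoint(primary, standby):
    """Route to the primary while healthy; promote the standby on failure."""
    if primary.is_healthy():
        return primary
    if standby.is_healthy():
        return standby
    raise RuntimeError("no healthy endpoint: full outage")
```

Note the asymmetry this makes visible: the standby does nothing until the primary's health check fails, which is exactly why active-passive failover is never truly zero-downtime.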

DR Strategies (Cheapest to Most Expensive)

| Strategy | RTO | RPO | How |
| --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | Periodic backups to cold storage. Restore from scratch on failure. |
| Pilot Light | 10–30 min | Minutes | Core infrastructure always running but idle. Scale up on failure. |
| Warm Standby | Minutes | Seconds | Scaled-down replica always running and receiving replicated data. |
| Multi-Site Active-Active | Near zero | Near zero | Full duplicate running in another region, serving live traffic. |

The tighter your RTO and RPO targets, the more expensive the architecture. Choose the strategy that matches business tolerance, not the one that sounds most impressive.

Fault Tolerance vs. Resilience

  • Fault Tolerant: The system continues operating without any degradation when a component fails (e.g., RAID arrays, redundant power supplies). Expensive to achieve at every layer.
  • Resilient: The system detects failures, isolates them, and recovers - possibly with brief degradation. This is the practical standard for most cloud workloads.
  • Graceful Degradation: A resilient pattern where a system intentionally serves a reduced feature set under stress (e.g., showing cached results when a live API is down) rather than failing completely.
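The graceful-degradation idea above can be sketched as a fallback around the live call. The `fetch_live` callable and cache layout are illustrative assumptions, not a specific API:

```python
# Graceful degradation sketch: serve cached (possibly stale) results
# when the live dependency fails, instead of failing the whole request.

def get_results(query, fetch_live, cache):
    try:
        results = fetch_live(query)
        cache[query] = results          # refresh the cache on success
        return {"results": results, "degraded": False}
    except Exception:
        if query in cache:              # stale but usable
            return {"results": cache[query], "degraded": True}
        return {"results": [], "degraded": True}  # reduced feature set
```

Returning a `degraded` flag lets the caller (or UI) signal reduced functionality to the user rather than silently serving stale data.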

Stateless Services

  • Servers do not retain any client session data between requests. All state is externalized to a shared store (database, cache like Redis, or object storage).
  • Why it matters: Stateless instances can be scaled horizontally without sticky sessions, replaced without session loss, and load-balanced freely.
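A minimal sketch of the externalized-state idea, with an in-memory dict standing in for Redis (the handler name and session layout are assumptions for illustration):

```python
# Stateless handler sketch: all session state lives in a shared store,
# so any instance can serve any request. A dict stands in for Redis here.

class SessionStore:
    """Stand-in for Redis: get/set a session dict by session id."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def set(self, session_id, session):
        self._data[session_id] = session


def handle_add_to_cart(store, session_id, item):
    session = store.get(session_id)             # load externalized state
    session.setdefault("cart", []).append(item)
    store.set(session_id, session)              # write it back to the store
    return session["cart"]
```

Because the handler keeps nothing in process memory between calls, two different instances sharing the same store behave identically, which is what makes sticky sessions unnecessary.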

Circuit Breaker

  • A pattern that monitors calls to a downstream service. If failures exceed a threshold, the circuit “opens” and subsequent calls fail immediately (fallback response) instead of piling up and cascading.
  • Practical use: Protects against a slow or unresponsive dependency taking down the entire service. Libraries: resilience4j (Java; the successor to the now-retired Netflix Hystrix) and Polly (.NET).
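The state machine behind the pattern can be sketched in a few lines. This is a simplified illustration (the thresholds and field names are assumptions), not a replacement for a hardened library:

```python
import time

# Minimal circuit breaker sketch: after `max_failures` consecutive
# failures the circuit opens and calls fail fast; after `reset_after`
# seconds one trial call is allowed through (half-open state).

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast")  # open: no call made
            self.opened_at = None                       # half-open: one trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()       # trip the circuit
            raise
        self.failures = 0                               # success closes it
        return result
```

The key property is that while the circuit is open, the downstream service receives no traffic at all, giving it room to recover instead of being hammered by queued-up retries.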

Retry with Exponential Backoff

  • On transient failures, retry the request with increasing delays (1s, 2s, 4s, 8s…) to avoid thundering herd situations where all retries hit the service simultaneously.
  • Add jitter - randomize the backoff slightly so not all clients retry at the exact same time.
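The backoff-plus-jitter idea above can be sketched as follows; this uses the "full jitter" variant (sleep a random fraction of the exponential delay), and the function names and cap are illustrative choices:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Jittered delay in seconds before retry number `attempt` (0-based)."""
    exp = min(cap, base * (2 ** attempt))  # 1, 2, 4, 8, ... capped at `cap`
    return random.uniform(0, exp)          # full jitter: spread clients out

def retry(fn, attempts=5, base=1.0):
    """Call `fn`, retrying on any exception with jittered backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            time.sleep(backoff_delay(attempt, base=base))
```

In production you would typically retry only on errors known to be transient (timeouts, 503s), never on client errors like a 400.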

Bulkhead

  • Isolate resources (threads, connections, memory) for different concerns, so a failure in one part doesn’t consume all shared resources and bring down unrelated parts.
  • Named after the watertight compartments in ship hulls - one flooded compartment doesn’t sink the ship.
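One common way to sketch a bulkhead is a per-dependency semaphore: each dependency gets a fixed number of concurrent-call slots, and excess calls are rejected rather than queued. The slot counts and names below are illustrative assumptions:

```python
import threading

# Bulkhead sketch: each downstream dependency gets its own small
# semaphore (its "compartment"), so a slow dependency can exhaust
# only its own slots, never the whole service's threads.

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            # Reject instead of queueing: queued work is how one slow
            # dependency starves everything else.
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# One bulkhead per dependency keeps their failures isolated:
payments_bulkhead = Bulkhead(max_concurrent=10)
search_bulkhead = Bulkhead(max_concurrent=5)
```

If the payments service hangs, at most 10 threads are stuck waiting on it; search traffic continues with its own 5 slots untouched.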

Strangler Fig

  • A migration pattern for decomposing monoliths. Gradually replace parts of the old system with new services while the old system continues to run. Traffic is incrementally re-routed until the old system can be decommissioned.
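The incremental re-routing can be sketched as a facade that sends migrated paths to the new service and everything else to the monolith. The path prefixes and backend names here are illustrative:

```python
# Strangler fig sketch: a routing facade in front of both systems.
# Paths are moved to the new service one at a time; when the set
# covers everything, the monolith can be decommissioned.

MIGRATED_PREFIXES = {"/orders", "/payments"}  # grows over the migration

def route(path):
    """Return which backend should serve this request path."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return "new-service"
    return "legacy-monolith"
```

In practice this facade is usually an API gateway or reverse proxy rule set rather than application code, but the decision logic is the same.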

CQRS (Command Query Responsibility Segregation)

  • Split the data model into separate read and write paths. The write side handles commands (mutations), the read side serves queries (often from a denormalized read model optimized for specific access patterns).
  • When to use: High read traffic with complex queries; when read and write scale requirements differ significantly.
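A minimal sketch of the split, with made-up model and function names: commands go through the write model, and the write path projects changes into a denormalized read model shaped for one specific query.

```python
# CQRS sketch: separate write and read models. In a real system the
# projection would often be asynchronous (eventually consistent),
# driven by events rather than updated inline as it is here.

class OrderWriteModel:
    def __init__(self):
        self.orders = {}              # normalized: order_id -> record

class OrderSummaryReadModel:
    def __init__(self):
        self.totals_by_customer = {}  # denormalized for one query shape

def place_order(write_model, read_model, order_id, customer, amount):
    """Command: mutate the write model, then project into the read model."""
    write_model.orders[order_id] = {"customer": customer, "amount": amount}
    totals = read_model.totals_by_customer
    totals[customer] = totals.get(customer, 0) + amount

def total_spent(read_model, customer):
    """Query: answered entirely from the read model, no joins needed."""
    return read_model.totals_by_customer.get(customer, 0)
```

Because queries never touch the write model, the two sides can be stored, indexed, and scaled independently, which is the whole point of the pattern.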

Event Sourcing

  • Instead of storing only current state, store every event that led to that state (e.g., OrderPlaced, PaymentProcessed, OrderShipped). Current state is derived by replaying events.
  • Upside: Full audit trail, time-travel debugging, easy event replay for new consumers.
  • Downside: Complexity. Event schema evolution is hard. Not appropriate for simple CRUD applications.
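The replay mechanism can be sketched as a fold over the event log, using the event names from the example above (the event payload shapes are illustrative assumptions):

```python
# Event sourcing sketch: the event log is the source of truth, and
# current state is derived by replaying events in order.

def apply(state, event):
    """Return the new state after one event (events never mutate state in place)."""
    kind = event["type"]
    if kind == "OrderPlaced":
        return {"status": "placed", "items": event["items"]}
    if kind == "PaymentProcessed":
        return {**state, "status": "paid"}
    if kind == "OrderShipped":
        return {**state, "status": "shipped"}
    return state              # unknown events are ignored, not fatal

def current_state(events):
    state = {}
    for event in events:      # replay the full history from the log
        state = apply(state, event)
    return state
```

Replaying from scratch gets slow as the log grows, so real systems periodically snapshot the derived state and replay only events after the snapshot.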

Cloud-Native Design Principles

  • Design for failure - assume every component will fail. Build detection and recovery in from day one.
  • Prefer managed services - using a managed database, queue, or cache means the provider handles patching, HA, and backups. Only self-host when you have a specific reason.
  • Immutable infrastructure - never patch a running server. Build a new image, deploy it, and terminate the old one. Eliminates “drift” between environments.
  • Loose coupling - services should communicate through well-defined interfaces (APIs, queues, events). Tight coupling means one service’s change breaks another.
  • 12-Factor App - a widely adopted methodology for building cloud-native applications covering configuration, dependencies, logging, statelessness, and more. Worth reading in full at 12factor.net.