
Scalability & Load Balancing

Scalability is a system’s ability to handle increased load without sacrificing performance or reliability. A scalable system can grow — more users, more data, more requests — without requiring a fundamental redesign.

Scalability is not something you add later. Systems that weren’t designed for scale routinely require expensive, disruptive rewrites once they hit their limits.


Vertical vs. Horizontal Scaling

There are two fundamental approaches to handling more load: scaling up a single machine, or scaling out across many machines.

Vertical Scaling (Scale Up)

Add more power to a single machine: more CPU, more RAM, faster disks.

Before: [App → 4 CPU / 16 GB RAM]
After: [App → 32 CPU / 128 GB RAM]

Advantages:

  • Simple — no application changes required
  • No distributed systems complexity

Hard limits:

  • Hardware ceiling: you eventually can’t buy a bigger machine
  • Single point of failure: if that one machine goes down, the entire service is unavailable
  • Diminishing returns: doubling RAM does not double throughput for most workloads

When to use: Database servers (especially relational) where distributing data is complex; early-stage systems before distribution is justified.

Horizontal Scaling (Scale Out)

Run more instances of the application across multiple machines.

Before: [App instance × 1]
After: [App instance × 10, behind a load balancer]

Advantages:

  • Effectively unlimited ceiling (add more nodes)
  • No single point of failure — one node dying doesn’t take down the service
  • Commodity hardware is cheaper per unit than high-end servers

Requirements:

  • Stateless application design: instances must not store local session data (see 12-Factor App, Factor 6); a sketch of externalized sessions follows this list
  • A load balancer to distribute traffic across instances
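To make the statelessness requirement concrete, here is a minimal sketch of externalized session storage, assuming the redis-py client and a reachable Redis instance; the key scheme, SESSION_TTL, and function names are illustrative, not a prescribed implementation.

```python
# Minimal sketch: session state lives in Redis, not in-process memory,
# so any instance behind the load balancer can serve any request.
# Assumes redis-py and Redis at localhost:6379; names are illustrative.
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL = 3600  # seconds; choose to match your auth requirements

def create_session(user_id: str) -> str:
    session_id = uuid.uuid4().hex
    # setex stores the value with an expiry, so stale sessions self-clean
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```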

Load Balancing

A load balancer sits in front of a pool of application instances and distributes incoming requests across them. It also performs health checks and stops sending traffic to unhealthy instances.

| Algorithm | How It Works | Best For |
| --- | --- | --- |
| Round Robin | Cycles through instances sequentially | Uniform, stateless requests |
| Least Connections | Routes to the instance with the fewest active connections | Long-lived connections (WebSockets, streaming) |
| Weighted Round Robin | Like Round Robin, but instances get traffic proportional to their weight | Mixed instance sizes |
| IP Hash / Sticky Sessions | Same client IP always routes to the same instance | Session-dependent apps (avoid if possible; use external session stores instead) |
| Least Response Time | Routes to the fastest-responding instance | Latency-sensitive APIs |
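For intuition, here is a toy sketch of two of these algorithms; real load balancers such as Nginx or ALB implement them natively, and the Instance class and instance names here are hypothetical.

```python
# Toy implementations of Round Robin and Least Connections, for
# intuition only; not a real load balancer.
import itertools
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    active_connections: int = 0

instances = [Instance("app-1"), Instance("app-2"), Instance("app-3")]

# Round Robin: hand out instances in a fixed, repeating cycle.
_cycle = itertools.cycle(instances)
def round_robin() -> Instance:
    return next(_cycle)

# Least Connections: pick the instance with the fewest in-flight requests.
def least_connections() -> Instance:
    return min(instances, key=lambda i: i.active_connections)
```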
Load balancers also differ by the network layer at which they operate:

| | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Operates on | TCP/UDP packets | HTTP/HTTPS content |
| Can route based on | IP address, port | URL path, headers, cookies, host |
| TLS termination | No | Yes |
| Performance | Faster (less processing) | Slightly more overhead |
| Example | AWS NLB | AWS ALB, Nginx, HAProxy |

In practice: Layer 7 (ALB / Nginx) is the default choice for HTTP services. Layer 4 (NLB) is used for extremely high-throughput, latency-sensitive protocols (gaming, financial trading, raw TCP).


Autoscaling

Autoscaling automatically adjusts the number of running instances in response to load, eliminating the need to pre-provision capacity for peak traffic. Common triggers:

  • Metric-based (reactive): Scale when CPU > 70%, memory > 80%, request queue depth > threshold, etc.
  • Schedule-based (proactive): If you know traffic spikes every weekday at 9am, pre-scale at 8:45am instead of waiting for CPU to climb.
  • Predictive: ML-based forecasting (available in AWS, Azure) that pre-scales based on historical patterns.

Scaling down is more dangerous than scaling up. Be conservative:

Scale-out: Aggressive (add capacity quickly when load rises)
Scale-in: Conservative (wait longer before removing capacity)

Common pattern: scale out when > 70% CPU for 1 minute. Scale in only when < 30% CPU for 10+ minutes. The asymmetry prevents thrashing where instances are repeatedly added and removed.
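As a sketch, that asymmetric policy could look like the following; the thresholds and windows mirror the example above, while the sample format and the scaling actions are placeholders for whatever your metrics pipeline provides.

```python
# Sketch of an asymmetric autoscaling decision: aggressive out,
# conservative in. Thresholds/windows mirror the example above.
SCALE_OUT = (0.70, 60)    # add capacity after 1 hot minute
SCALE_IN = (0.30, 600)    # remove only after 10 quiet minutes

def decide(samples: list[tuple[float, float]], now: float) -> str:
    """samples: (unix_timestamp, cpu_fraction) pairs, oldest first."""
    def window(seconds: float) -> list[float]:
        # Require samples to cover the whole window before acting on it.
        if not samples or now - samples[0][0] < seconds:
            return []
        return [cpu for ts, cpu in samples if now - ts <= seconds]

    cpus = window(SCALE_OUT[1])
    if cpus and all(cpu > SCALE_OUT[0] for cpu in cpus):
        return "scale_out"    # placeholder: launch an instance here

    cpus = window(SCALE_IN[1])
    if cpus and all(cpu < SCALE_IN[0] for cpu in cpus):
        return "scale_in"     # placeholder: drain an instance here
    return "hold"
```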


Database Scaling

Application servers are easy to scale horizontally. Databases are where scaling gets hard.

Read Replicas

Offload read traffic to replica databases. The primary handles writes; one or more read replicas serve read queries.

Writes → [Primary DB]
↓ replication
Reads → [Read Replica 1]
Reads → [Read Replica 2]

Replication lag: Replicas are slightly behind the primary. For use cases where reading stale data is unacceptable (e.g., “did my payment just succeed?”), route those specific reads to the primary.
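A minimal sketch of that routing decision, with placeholder DSNs and a stubbed connect() helper:

```python
# Sketch: writes and lag-sensitive reads go to the primary; other
# reads spread across replicas. DSNs and connect() are placeholders.
import random

PRIMARY_DSN = "postgresql://db-primary.internal/app"
REPLICA_DSNS = [
    "postgresql://db-replica-1.internal/app",
    "postgresql://db-replica-2.internal/app",
]

def connect(dsn: str):
    ...  # placeholder: open or reuse a pooled connection for this DSN

def conn_for(write: bool = False, must_be_fresh: bool = False):
    # "Did my payment succeed?" style reads set must_be_fresh=True to
    # avoid reading stale data from a lagging replica.
    if write or must_be_fresh:
        return connect(PRIMARY_DSN)
    return connect(random.choice(REPLICA_DSNS))
```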

Connection Pooling

Each database connection is expensive. Application servers should not open a raw connection per request. Use a connection pooler (PgBouncer for Postgres, RDS Proxy on AWS) to maintain a pool of pre-established connections and multiplex application requests through them.
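A sketch of application-side pooling using psycopg2's built-in pool (PgBouncer or RDS Proxy would additionally sit between the application fleet and the database); the DSN and query are placeholders.

```python
# Sketch: reuse pre-established connections instead of opening one
# per request. The DSN and the users table are placeholders.
from psycopg2 import pool

db_pool = pool.SimpleConnectionPool(
    minconn=2,
    maxconn=10,
    dsn="postgresql://app:secret@db.internal:5432/appdb",
)

def fetch_user(user_id: int):
    conn = db_pool.getconn()       # borrow a pre-established connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        db_pool.putconn(conn)      # return to the pool instead of closing
```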

Sharding

Distribute data across multiple independent database nodes by splitting rows based on a shard key (e.g., user_id % 4 routes users to one of 4 shards). A routing sketch follows the caveats below.

  • Cross-shard queries become expensive and complex: joins can no longer happen in the database and must be reassembled in application code
  • Choosing the wrong shard key leads to hot spots (one shard handles disproportionate load)
  • Sharding is a last resort for relational databases; explore read replicas, caching, and vertical scaling first
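A minimal sketch of the modulo routing from the example above, with placeholder shard DSNs; production systems often prefer consistent hashing so shards can be added without remapping every key.

```python
# Sketch: user_id % 4 routes every row for a given user to exactly
# one of four shards. Shard DSNs are placeholders.
SHARDS = [
    "postgresql://shard-0.internal/app",
    "postgresql://shard-1.internal/app",
    "postgresql://shard-2.internal/app",
    "postgresql://shard-3.internal/app",
]

def shard_for(user_id: int) -> str:
    return SHARDS[user_id % len(SHARDS)]
```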

Caching

Caching sits in front of the database and absorbs repeated reads, reducing both latency and database load.

  • A single Redis node can serve cached responses at roughly 100k RPS; every cache hit replaces a database query that would otherwise take 10–100 ms
  • See Caching Strategies for implementation patterns
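A minimal sketch of the cache-aside pattern with redis-py; query_db() is a stand-in for the real database read, and the 60-second TTL is an arbitrary choice.

```python
# Sketch of cache-aside: check Redis first, fall back to the database
# on a miss, then populate the cache for subsequent readers.
import json

import redis

cache = redis.Redis(decode_responses=True)

def query_db(product_id: int) -> dict:
    # placeholder for the expensive 10-100 ms database read
    return {"id": product_id, "name": "example"}

def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                 # hit: database never touched
        return json.loads(cached)
    row = query_db(product_id)             # miss: read through to the DB
    cache.setex(key, 60, json.dumps(row))  # expire after 60 seconds
    return row
```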

Scalability Checklist

  • Application is stateless — session data stored in Redis or a database, not in-process memory
  • Load balancer health checks configured with realistic thresholds
  • Autoscaling configured with asymmetric scale-out/scale-in policies
  • Read replicas in place for read-heavy workloads
  • Connection pooling enabled on all database connections
  • Key database queries are indexed — a missing index degrades performance long before traffic is the bottleneck
  • Cache layer in front of expensive or repeated reads
  • Load tested with realistic traffic before production launch