
Scalability & Load Balancing

Scalability is a system’s ability to handle increased load without sacrificing performance or reliability. A scalable system can grow — more users, more data, more requests — without requiring a fundamental redesign.

Scalability is not something you add later. Systems that weren’t designed for scale routinely require expensive, disruptive rewrites once they hit their limits.


Vertical vs. Horizontal Scaling

There are two fundamental approaches to handling more load: scaling up a single machine, or scaling out across many machines.

Vertical Scaling (Scale Up)

Add more power to a single machine: more CPU, more RAM, faster disks.

Before: [App → 4 CPU / 16 GB RAM]
After: [App → 32 CPU / 128 GB RAM]

Advantages:

  • Simple — no application changes required
  • No distributed systems complexity

Hard limits:

  • Hardware ceiling: you eventually can’t buy a bigger machine
  • Single point of failure: if that one machine goes down, the entire service is unavailable
  • Diminishing returns: doubling RAM does not double throughput for most workloads

When to use: Database servers (especially relational) where distributing data is complex; early-stage systems before distribution is justified.

Horizontal Scaling (Scale Out)

Run more instances of the application across multiple machines.

Before: [App instance × 1]
After: [App instance × 10, behind a load balancer]

Advantages:

  • Effectively unlimited ceiling (add more nodes)
  • No single point of failure — one node dying doesn’t take down the service
  • Commodity hardware is cheaper per unit than high-end servers

Requirements:

  • Stateless application design: instances must not store local session data (see 12-Factor App, Factor 6); a sketch of externalized sessions follows this list
  • A load balancer to distribute traffic across instances
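To make the statelessness requirement concrete, here is a minimal sketch of externalized session storage, assuming the redis-py client and a reachable Redis instance; the key scheme, SESSION_TTL, and function names are illustrative, not a prescribed implementation.

```python
# Minimal sketch: session state lives in Redis, not in-process memory,
# so any instance behind the load balancer can serve any request.
# Assumes redis-py and Redis at localhost:6379; names are illustrative.
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL = 3600  # seconds; choose to match your auth requirements

def create_session(user_id: str) -> str:
    session_id = uuid.uuid4().hex
    # setex stores the value with an expiry, so stale sessions self-clean
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```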

Load Balancing

A load balancer sits in front of a pool of application instances and distributes incoming requests across them. It also performs health checks and stops sending traffic to unhealthy instances.

| Algorithm | How It Works | Best For |
| --- | --- | --- |
| Round Robin | Cycles through instances sequentially | Uniform, stateless requests |
| Least Connections | Routes to the instance with the fewest active connections | Long-lived connections (WebSockets, streaming) |
| Weighted Round Robin | Like Round Robin, but instances get traffic proportional to their weight | Mixed instance sizes |
| IP Hash / Sticky Sessions | Same client IP always routes to the same instance | Session-dependent apps (avoid if possible; use external session stores instead) |
| Least Response Time | Routes to the fastest-responding instance | Latency-sensitive APIs |
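For intuition, here is a toy sketch of two of these algorithms; real load balancers such as Nginx or ALB implement them natively, and the Instance class and instance names here are hypothetical.

```python
# Toy implementations of Round Robin and Least Connections, for
# intuition only; not a real load balancer.
import itertools
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    active_connections: int = 0

instances = [Instance("app-1"), Instance("app-2"), Instance("app-3")]

# Round Robin: hand out instances in a fixed, repeating cycle.
_cycle = itertools.cycle(instances)
def round_robin() -> Instance:
    return next(_cycle)

# Least Connections: pick the instance with the fewest in-flight requests.
def least_connections() -> Instance:
    return min(instances, key=lambda i: i.active_connections)
```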
Load balancers also differ by the network layer at which they operate:

| | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Operates on | TCP/UDP packets | HTTP/HTTPS content |
| Can route based on | IP address, port | URL path, headers, cookies, host |
| TLS termination | No | Yes |
| Performance | Faster (less processing) | Slightly more overhead |
| Example | AWS NLB | AWS ALB, Nginx, HAProxy |

In practice: Layer 7 (ALB / Nginx) is the default choice for HTTP services. Layer 4 (NLB) is used for extremely high-throughput, latency-sensitive protocols (gaming, financial trading, raw TCP).


Autoscaling

Autoscaling automatically adjusts the number of running instances in response to load, eliminating the need to pre-provision capacity for peak traffic. Common triggers:

  • Metric-based (reactive): Scale when CPU > 70%, memory > 80%, request queue depth > threshold, etc.
  • Schedule-based (proactive): If you know traffic spikes every weekday at 9am, pre-scale at 8:45am instead of waiting for CPU to climb.
  • Predictive: ML-based forecasting (available in AWS, Azure) that pre-scales based on historical patterns.

Scaling down is more dangerous than scaling up. Be conservative:

Scale-out: Aggressive (add capacity quickly when load rises)
Scale-in: Conservative (wait longer before removing capacity)

Common pattern: scale out when > 70% CPU for 1 minute. Scale in only when < 30% CPU for 10+ minutes. The asymmetry prevents thrashing where instances are repeatedly added and removed.
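As a sketch, that asymmetric policy could look like the following; the thresholds and windows mirror the example above, while the sample format and the scaling actions are placeholders for whatever your metrics pipeline provides.

```python
# Sketch of an asymmetric autoscaling decision: aggressive out,
# conservative in. Thresholds/windows mirror the example above.
SCALE_OUT = (0.70, 60)    # add capacity after 1 hot minute
SCALE_IN = (0.30, 600)    # remove only after 10 quiet minutes

def decide(samples: list[tuple[float, float]], now: float) -> str:
    """samples: (unix_timestamp, cpu_fraction) pairs, oldest first."""
    def window(seconds: float) -> list[float]:
        # Require samples to cover the whole window before acting on it.
        if not samples or now - samples[0][0] < seconds:
            return []
        return [cpu for ts, cpu in samples if now - ts <= seconds]

    cpus = window(SCALE_OUT[1])
    if cpus and all(cpu > SCALE_OUT[0] for cpu in cpus):
        return "scale_out"    # placeholder: launch an instance here

    cpus = window(SCALE_IN[1])
    if cpus and all(cpu < SCALE_IN[0] for cpu in cpus):
        return "scale_in"     # placeholder: drain an instance here
    return "hold"
```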


Database Scaling

Application servers are easy to scale horizontally. Databases are where scaling gets hard.

Read Replicas

Offload read traffic to replica databases. The primary handles writes; one or more read replicas serve read queries.

Writes → [Primary DB]
↓ replication
Reads → [Read Replica 1]
Reads → [Read Replica 2]

Replication lag: Replicas are slightly behind the primary. For use cases where reading stale data is unacceptable (e.g., “did my payment just succeed?”), route those specific reads to the primary.
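A minimal sketch of that routing decision, with placeholder DSNs and a stubbed connect() helper:

```python
# Sketch: writes and lag-sensitive reads go to the primary; other
# reads spread across replicas. DSNs and connect() are placeholders.
import random

PRIMARY_DSN = "postgresql://db-primary.internal/app"
REPLICA_DSNS = [
    "postgresql://db-replica-1.internal/app",
    "postgresql://db-replica-2.internal/app",
]

def connect(dsn: str):
    ...  # placeholder: open or reuse a pooled connection for this DSN

def conn_for(write: bool = False, must_be_fresh: bool = False):
    # "Did my payment succeed?" style reads set must_be_fresh=True to
    # avoid reading stale data from a lagging replica.
    if write or must_be_fresh:
        return connect(PRIMARY_DSN)
    return connect(random.choice(REPLICA_DSNS))
```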

Connection Pooling

Each database connection is expensive. Application servers should not open a raw connection per request. Use a connection pooler (PgBouncer for Postgres, RDS Proxy on AWS) to maintain a pool of pre-established connections and multiplex application requests through them.
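A sketch of application-side pooling using psycopg2's built-in pool (PgBouncer or RDS Proxy would additionally sit between the application fleet and the database); the DSN and query are placeholders.

```python
# Sketch: reuse pre-established connections instead of opening one
# per request. The DSN and the users table are placeholders.
from psycopg2 import pool

db_pool = pool.SimpleConnectionPool(
    minconn=2,
    maxconn=10,
    dsn="postgresql://app:secret@db.internal:5432/appdb",
)

def fetch_user(user_id: int):
    conn = db_pool.getconn()       # borrow a pre-established connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        db_pool.putconn(conn)      # return to the pool instead of closing
```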

Sharding

Distribute data across multiple independent database nodes by splitting rows based on a shard key (e.g., user_id % 4 routes users to one of 4 shards). A routing sketch follows the caveats below.

  • Cross-shard queries become expensive and complex: joins can no longer happen in the database and must be reassembled in application code
  • Choosing the wrong shard key leads to hot spots (one shard handles disproportionate load)
  • Sharding is a last resort for relational databases; explore read replicas, caching, and vertical scaling first
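A minimal sketch of the modulo routing from the example above, with placeholder shard DSNs; production systems often prefer consistent hashing so shards can be added without remapping every key.

```python
# Sketch: user_id % 4 routes every row for a given user to exactly
# one of four shards. Shard DSNs are placeholders.
SHARDS = [
    "postgresql://shard-0.internal/app",
    "postgresql://shard-1.internal/app",
    "postgresql://shard-2.internal/app",
    "postgresql://shard-3.internal/app",
]

def shard_for(user_id: int) -> str:
    return SHARDS[user_id % len(SHARDS)]
```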

Caching

Caching sits in front of the database and absorbs repeated reads, reducing both latency and database load.

  • A single Redis node can serve cached responses at roughly 100k RPS; every cache hit replaces a database query that would otherwise take 10–100 ms
  • See Caching Strategies for implementation patterns
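A minimal sketch of the cache-aside pattern with redis-py; query_db() is a stand-in for the real database read, and the 60-second TTL is an arbitrary choice.

```python
# Sketch of cache-aside: check Redis first, fall back to the database
# on a miss, then populate the cache for subsequent readers.
import json

import redis

cache = redis.Redis(decode_responses=True)

def query_db(product_id: int) -> dict:
    # placeholder for the expensive 10-100 ms database read
    return {"id": product_id, "name": "example"}

def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                 # hit: database never touched
        return json.loads(cached)
    row = query_db(product_id)             # miss: read through to the DB
    cache.setex(key, 60, json.dumps(row))  # expire after 60 seconds
    return row
```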

Scalability Checklist

  • Application is stateless — session data stored in Redis or a database, not in-process memory
  • Load balancer health checks configured with realistic thresholds
  • Autoscaling configured with asymmetric scale-out/scale-in policies
  • Read replicas in place for read-heavy workloads
  • Connection pooling enabled on all database connections
  • Key database queries are indexed — a missing index degrades performance long before traffic is the bottleneck
  • Cache layer in front of expensive or repeated reads
  • Load tested with realistic traffic before production launch