Scalability & Load Balancing
Scalability is a system’s ability to handle increased load without sacrificing performance or reliability. A scalable system can grow — more users, more data, more requests — without requiring a fundamental redesign.
Scalability is not something you add later. Systems that weren’t designed for scale routinely require expensive, disruptive rewrites once they hit their limits.
Vertical vs. Horizontal Scaling
These are the two fundamental approaches to handling more load.
Vertical Scaling (Scale Up)
Add more power to a single machine: more CPU, more RAM, faster disks.
Before: [App → 4 CPU / 16 GB RAM]
After:  [App → 32 CPU / 128 GB RAM]

Advantages:
- Simple — no application changes required
- No distributed systems complexity
Hard limits:
- Hardware ceiling: you eventually can’t buy a bigger machine
- Single point of failure: if that one machine goes down, the entire service is unavailable
- Diminishing returns: doubling RAM does not double throughput for most workloads
When to use: Database servers (especially relational) where distributing data is complex; early-stage systems before distribution is justified.
Horizontal Scaling (Scale Out)
Run more instances of the application across multiple machines.
Before: [App instance × 1]
After:  [App instance × 10, behind a load balancer]

Advantages:
- Effectively unlimited ceiling (add more nodes)
- No single point of failure — one node dying doesn’t take down the service
- Commodity hardware is cheaper per unit than high-end servers
Requirements:
- Stateless application design: instances must not store local session data (see 12-Factor App, Factor 6)
- A load balancer to distribute traffic across instances
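Stateless design can be made concrete with a small sketch. Here a plain dict stands in for an external session store such as Redis (the `SessionStore` class and handler names are illustrative, not from any real framework); the point is that any instance can serve any request because no session data lives in process memory:

```python
import json

class SessionStore:
    """Stands in for Redis: set/get of JSON-serialized session data."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

store = SessionStore()

def handle_login(store, session_id, user):   # request served by instance A
    store.set(f"session:{session_id}", {"user": user})

def handle_profile(store, session_id):       # request served by instance B
    session = store.get(f"session:{session_id}")
    return session["user"] if session else None

handle_login(store, "abc123", "alice")
print(handle_profile(store, "abc123"))  # instance B still sees the session
```

Because both handlers read and write only the shared store, the load balancer is free to route each request to any instance.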
Load Balancing
A load balancer sits in front of a pool of application instances and distributes incoming requests across them. It also performs health checks and stops sending traffic to unhealthy instances.
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Cycles through instances sequentially | Uniform, stateless requests |
| Least Connections | Routes to the instance with the fewest active connections | Long-lived connections (WebSockets, streaming) |
| Weighted Round Robin | Like Round Robin, but instances get traffic proportional to their weight | Mixed instance sizes |
| IP Hash / Sticky Sessions | Same client IP always routes to the same instance | Session-dependent apps (avoid if possible — use external session stores instead) |
| Least Response Time | Routes to the fastest-responding instance | Latency-sensitive APIs |
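The first two algorithms in the table can be sketched in a few lines. This is a toy illustration (instance names and classes are invented), but it shows the core difference: Round Robin ignores load, while Least Connections tracks it:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Cycles through instances in order, ignoring their current load."""
    def __init__(self, instances):
        self._pool = cycle(instances)

    def pick(self):
        return next(self._pool)

class LeastConnectionsBalancer:
    """Routes each request to the instance with the fewest active connections."""
    def __init__(self, instances):
        self.active = {i: 0 for i in instances}

    def pick(self):
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance):
        self.active[instance] -= 1

rr = RoundRobinBalancer(["app1", "app2", "app3"])
print([rr.pick() for _ in range(4)])  # wraps around: app1, app2, app3, app1

lc = LeastConnectionsBalancer(["app1", "app2"])
lc.pick(); lc.pick()       # one active connection on each instance
lc.release("app1")         # app1 finishes its connection
print(lc.pick())           # app1 again: it now has the fewest connections
```

Least Connections earns its keep with long-lived connections (WebSockets, streaming), where Round Robin would keep piling new work onto instances still holding old connections.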
Layer 4 vs. Layer 7 Load Balancers
| | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Operates on | TCP/UDP packets | HTTP/HTTPS content |
| Can route based on | IP address, port | URL path, headers, cookies, host |
| TLS termination | Typically no | Yes |
| Performance | Faster (less processing) | Slightly more overhead |
| Example | AWS NLB | AWS ALB, Nginx, HAProxy |
In practice: Layer 7 (ALB / Nginx) is the default choice for HTTP services. Layer 4 (NLB) is used for extremely high-throughput, latency-sensitive protocols (gaming, financial trading, raw TCP).
Autoscaling
Autoscaling automatically adjusts the number of running instances in response to load, eliminating the need to pre-provision capacity for peak traffic.
Scaling Triggers
- Metric-based (reactive): Scale when CPU > 70%, memory > 80%, request queue depth > threshold, etc.
- Schedule-based (proactive): If you know traffic spikes every weekday at 9am, pre-scale at 8:45am instead of waiting for CPU to climb.
- Predictive: ML-based forecasting (available in AWS, Azure) that pre-scales based on historical patterns.
Scale-in Considerations
Scaling down is more dangerous than scaling up. Be conservative:
- Scale-out: aggressive (add capacity quickly when load rises)
- Scale-in: conservative (wait longer before removing capacity)

Common pattern: scale out when CPU > 70% for 1 minute; scale in only when CPU < 30% for 10+ minutes. The asymmetry prevents thrashing, where instances are repeatedly added and removed.
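The asymmetric policy above can be sketched as a simple decision function. The thresholds and windows come from the text; the 15-second sampling interval and the function itself are illustrative assumptions, not a real autoscaler API:

```python
# Assumed policy: scale out after 1 min above 70% CPU,
# scale in only after 10 min below 30% CPU.
SCALE_OUT_CPU, SCALE_OUT_SECS = 70, 60
SCALE_IN_CPU,  SCALE_IN_SECS  = 30, 600

def desired_change(cpu_samples, interval_secs=15):
    """cpu_samples: CPU% readings, most recent last, one per interval.
    Returns +1 (add instance), -1 (remove instance), or 0 (hold)."""
    out_n = SCALE_OUT_SECS // interval_secs   # samples needed to scale out
    in_n = SCALE_IN_SECS // interval_secs     # samples needed to scale in
    if len(cpu_samples) >= out_n and all(
            c > SCALE_OUT_CPU for c in cpu_samples[-out_n:]):
        return +1   # sustained high load: add capacity quickly
    if len(cpu_samples) >= in_n and all(
            c < SCALE_IN_CPU for c in cpu_samples[-in_n:]):
        return -1   # sustained low load: safe to remove capacity
    return 0

print(desired_change([85, 90, 88, 92]))  # 1: a full minute above 70%
print(desired_change([25, 20, 22, 28]))  # 0: low, but not low for 10 min yet
```

Note how a brief dip in load returns 0 rather than -1: the long scale-in window is what prevents thrashing.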
Database Scaling Strategies
Application servers are easy to scale horizontally. Databases are where scaling gets hard.
Read Replicas
Offload read traffic to replica databases. The primary handles writes; one or more read replicas serve read queries.
Writes → [Primary DB]
             ↓ replication
Reads  → [Read Replica 1]
Reads  → [Read Replica 2]

Replication lag: Replicas are slightly behind the primary. For use cases where reading stale data is unacceptable (e.g., “did my payment just succeed?”), route those specific reads to the primary.
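A minimal sketch of this routing logic, assuming a naive write-detection rule and a `needs_fresh` flag for lag-sensitive reads (the `Router` class and node names are invented for illustration):

```python
import itertools

class Router:
    """Writes and lag-sensitive reads go to the primary;
    ordinary reads round-robin across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, query, needs_fresh=False):
        # Naive write detection for the sketch; real routers are smarter.
        is_write = query.lstrip().upper().startswith(
            ("INSERT", "UPDATE", "DELETE"))
        if is_write or needs_fresh:
            return self.primary
        return next(self._replicas)

r = Router("primary", ["replica1", "replica2"])
print(r.route("SELECT * FROM posts"))                  # a replica
print(r.route("UPDATE payments SET status = 'paid'"))  # primary
print(r.route("SELECT status FROM payments",
              needs_fresh=True))                       # primary (avoid lag)
```

The `needs_fresh` escape hatch is the key design point: the application, not the router, knows which reads cannot tolerate replication lag.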
Connection Pooling
Each database connection is expensive. Application servers should not open a raw connection per request. Use a connection pooler (PgBouncer for Postgres, RDS Proxy on AWS) to maintain a pool of pre-established connections and multiplex application requests through them.
Sharding (Horizontal Partitioning)
Distribute data across multiple independent database nodes by splitting rows based on a shard key (e.g., user_id % 4 routes users to one of 4 shards).

Tradeoffs:
- Cross-shard queries become difficult: joins across shards don’t work without application-level aggregation
- Choosing the wrong shard key leads to hot spots (one shard handles disproportionate load)
- Sharding is a last resort for relational databases; explore read replicas, caching, and vertical scaling first
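The user_id % 4 scheme from above, sketched with dicts standing in for four independent database nodes (the helper names are invented for illustration):

```python
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # each dict stands in for one DB node

def shard_for(user_id):
    """Shard key: user_id % 4 routes each user to exactly one shard."""
    return user_id % NUM_SHARDS

def save_user(user_id, data):
    shards[shard_for(user_id)][user_id] = data

def load_user(user_id):
    return shards[shard_for(user_id)].get(user_id)

save_user(42, {"name": "alice"})  # 42 % 4 == 2 → shard 2
save_user(7, {"name": "bob"})     # 7 % 4 == 3 → shard 3
print(load_user(42))              # routed back to shard 2

# The tradeoff: a query like "all users named alice" now has to
# fan out to every shard, because no single node holds all the rows.
```

Note that a bare modulus also makes resharding painful: changing NUM_SHARDS remaps nearly every key, which is why production systems often use consistent hashing or a lookup table instead.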
Caching as a Scalability Tool
Caching sits in front of the database and absorbs repeated reads, reducing both latency and database load.
- A single Redis node can serve cached responses at ~100k RPS, with each cache hit replacing a database query that would take 10–100 ms
- See Caching Strategies for implementation patterns
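A minimal cache-aside sketch, with a dict standing in for Redis and a sleep standing in for a slow database query (the TTL value and function names are illustrative assumptions):

```python
import time

cache = {}       # stands in for Redis
TTL_SECS = 60    # illustrative expiry; tune per use case

def slow_db_query(key):
    time.sleep(0.01)             # stands in for a 10–100 ms database query
    return f"row-for-{key}"

def get(key):
    """Cache-aside: check the cache first; on a miss, read the DB and fill."""
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]          # hit: the database does no work
    value = slow_db_query(key)   # miss: pay the DB cost once
    cache[key] = (value, time.monotonic() + TTL_SECS)
    return value

get("user:1")          # first call hits the database and fills the cache
print(get("user:1"))   # repeat call is served from the cache
```

Every repeated read within the TTL window is one database query the primary never sees, which is exactly how a cache converts read-heavy load into headroom.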
Scalability Checklist
- Application is stateless — session data stored in Redis or a database, not in-process memory
- Load balancer health checks configured with realistic thresholds
- Autoscaling configured with asymmetric scale-out/scale-in policies
- Read replicas in place for read-heavy workloads
- Connection pooling enabled on all database connections
- Key database queries are indexed — a missing index degrades performance long before traffic is the bottleneck
- Cache layer in front of expensive or repeated reads
- Load tested with realistic traffic before production launch