Disaster Recovery

A Disaster Recovery Plan (DRP) is a collection of documented procedures for how to react to and recover from an emergency or disaster scenario. Its goal: minimize downtime and prevent significant data loss.

A DRP isn’t just a document - it’s a tested, rehearsed playbook.


Actions taken before a disaster to minimize impact:

  • Regular, automated backups stored both on-site and off-site
  • Redundant systems - no single point of failure for critical services
  • Redundant power supplies, network links, and hardware
  • Comprehensive monitoring and alerting
  • Clear, up-to-date operational documentation
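The backup item above can be sketched in a few lines. A minimal illustration (the paths, retention count, and `create_backup` name are hypothetical; a real setup would add off-site replication and integrity checks):

```python
import os
import tarfile
from datetime import datetime, timezone
from pathlib import Path

def create_backup(source_dir: str, backup_dir: str, keep: int = 7) -> Path:
    """Create a timestamped tar.gz of source_dir, keeping only the newest `keep` archives."""
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    # Rotation: timestamped names sort chronologically, so delete the oldest beyond `keep`
    for old in sorted(dest.glob("backup-*.tar.gz"))[:-keep]:
        old.unlink()
    return archive
```

In practice this would run from a scheduler (cron, systemd timer) and the resulting archive would be copied off-site as well.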

Systems that alert you when something goes wrong:

| Measure | What It Detects |
| --- | --- |
| Network monitoring | Service outages, latency spikes, packet loss |
| Server monitoring | High CPU, memory exhaustion, disk full |
| Environmental sensors | Temperature, humidity, water/flood |
| Smoke/fire alarms | Physical threat to data center |
| Log aggregation & alerting | Error patterns, security events |
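As one concrete example, the server-monitoring row (disk full) can be sketched as a simple threshold check; the `check_disk` name is illustrative, and the `print` stands in for a real pager or email integration:

```python
import shutil

def check_disk(path: str = "/", threshold: float = 0.90) -> bool:
    """Return True (and emit an alert) if disk usage at `path` exceeds `threshold`."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    if fraction_used > threshold:
        # In production, page the on-call engineer instead of printing
        print(f"ALERT: {path} is {fraction_used:.0%} full")
        return True
    return False
```

A real deployment would run checks like this on a schedule and feed them into an alerting system rather than calling them ad hoc.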

Steps taken after a disaster to restore operations:

  • Restore data from backups
  • Rebuild and reconfigure damaged systems
  • Failover to standby systems
  • Follow the documented recovery procedures step-by-step
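The first step above, restoring data from a backup, might look like this sketch (assuming tar.gz archives from a backup job; `restore_backup` is an illustrative name):

```python
import tarfile
from pathlib import Path

def restore_backup(archive: str, target_dir: str) -> Path:
    """Extract a backup archive into target_dir and return the target path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # On Python 3.12+, pass filter="data" to guard against path-traversal entries
        tar.extractall(target)
    return target
```

After extraction, a real recovery procedure would also verify the restored data (checksums, application-level smoke tests) before declaring the system healthy.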

Brainstorm scenarios and assess each one:

| Scenario | Likelihood | Impact | Priority |
| --- | --- | --- | --- |
| Server hardware failure | High | Medium | P1 |
| Data center power outage | Medium | High | P1 |
| Ransomware attack | Medium | Critical | P0 |
| Natural disaster (flood, fire) | Low | Critical | P1 |
| Accidental data deletion | High | Low-Medium | P2 |
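A scoring scheme like the one implied by this table can be made explicit. The numeric weights and thresholds below are illustrative assumptions, tuned so the rankings match the table; real risk assessments involve more judgment than a formula:

```python
# Hypothetical numeric weights for qualitative ratings
LIKELIHOOD = {"Low": 1, "Medium": 2, "High": 3}
IMPACT = {"Low": 1, "Low-Medium": 1.25, "Medium": 2, "High": 3, "Critical": 4}

def priority(likelihood: str, impact: str) -> str:
    """Rank a scenario P0 (highest) to P2 using a simple likelihood-times-impact score."""
    score = LIKELIHOOD[likelihood] * IMPACT[impact]
    if score >= 8:
        return "P0"
    if score >= 4:
        return "P1"
    return "P2"
```

For example, a Medium-likelihood, Critical-impact ransomware attack scores 2 × 4 = 8 and ranks P0.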

Not everything has the same priority. Rank systems by criticality:

  • Mission-critical: Authentication, billing, production databases - immediate recovery required
  • Important: Email, internal tools - hours of downtime acceptable
  • Non-critical: Internal wiki, staging environments - can wait
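These tiers map naturally to recovery time objectives (RTOs). A sketch, with made-up RTO values and system names; actual targets depend on your business requirements:

```python
from datetime import timedelta

# Illustrative tiers and RTOs - set real values with business stakeholders
TIERS = {
    "mission-critical": timedelta(minutes=15),
    "important": timedelta(hours=4),
    "non-critical": timedelta(days=1),
}

SYSTEMS = {
    "authentication": "mission-critical",
    "billing": "mission-critical",
    "email": "important",
    "internal-wiki": "non-critical",
}

def rto_for(system: str) -> timedelta:
    """Look up a system's recovery time objective via its criticality tier."""
    return TIERS[SYSTEMS[system]]
```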

For each critical system, document:

  1. How to restore from backup
  2. How to failover to a standby system
  3. Who is responsible for each step
  4. Expected recovery time
  5. Links to detailed runbooks

Test the plan regularly:

  • Run disaster recovery simulations at least annually
  • Involve all relevant teams - IT, security, management
  • Time the recovery to verify you meet your RTO
  • Document what worked, what didn’t, and what needs improvement
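Timing the recovery against your RTO, as suggested above, can be built into a drill harness. A minimal sketch (`timed_recovery` and the drill procedure are hypothetical):

```python
import time
from datetime import timedelta

def timed_recovery(recover, rto: timedelta) -> tuple[float, bool]:
    """Run a recovery procedure, measure elapsed wall-clock time, and check it against the RTO."""
    start = time.monotonic()
    recover()  # the actual drill: restore, failover, verification, etc.
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= rto.total_seconds()
```

The returned pair (elapsed seconds, met-RTO flag) gives the drill report a concrete number to record and compare across exercises.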

A single point of failure (SPOF) is any component whose failure would take down the entire service. Identify and eliminate them:

| SPOF Example | Mitigation |
| --- | --- |
| One database server | Primary + replica with automatic failover |
| One internet connection | Dual ISP with BGP failover |
| One power feed | UPS + generator + dual power supplies |
| One load balancer | Active-passive pair |
| One person who knows the system | Documentation + cross-training |
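The database row illustrates the general failover pattern: probe the primary first, then fall back to standbys. A sketch, where the health check is a caller-supplied function and the endpoint names are placeholders:

```python
from typing import Callable, Sequence

def first_healthy(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint, trying the primary first, then standbys."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("No healthy endpoint available - total outage")
```

Real failover systems (database proxies, load balancers) automate exactly this loop, with the added subtleties of health-check flapping and split-brain prevention.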

A post-mortem is a report written after an incident to document what happened, why it happened, what was done about it, and how to prevent it from happening again.

# Incident Post-Mortem: [Title]
## Summary
Brief description of the incident - what, when, how long, impact, resolution.
## Timeline (all times in UTC)
- 14:03 - Monitoring alert: API response time > 5s
- 14:07 - On-call engineer acknowledges alert
- 14:12 - Root cause identified: DB connection pool exhausted
- 14:15 - Fix applied: connection pool limit increased
- 14:20 - Services restored, monitoring confirms recovery
## Root Cause
Detailed, honest explanation of what went wrong.
## What Went Well
- Monitoring detected the issue within 3 minutes
- Failover to read replica kept partial service running
## What Went Wrong
- Connection pool limit hadn't been reviewed since initial deploy
- No alert existed for connection pool saturation specifically
## Action Items
- [ ] Add connection pool saturation alert (owner: @alice, due: March 20)
- [ ] Review all service limits as part of quarterly capacity planning
- [ ] Update runbook with new connection pool tuning steps

Best practices for writing post-mortems:

  1. Write it promptly - within 24-48 hours while details are fresh
  2. Be honest - document the real root cause, even if it’s embarrassing
  3. Highlight what went well - failovers that worked, monitoring that caught things early
  4. Actionable items - each action item has an owner and a deadline
  5. Share widely - other teams can learn from your incidents
  6. No-blame culture - focus on systems and processes, not individuals

Disaster recovery isn’t only about servers and data. It also covers:

  • Employee safety - evacuation plans, physical security
  • Communication - how do you notify staff, customers, and stakeholders?
  • Temporary work arrangements - can people work remotely if the office is inaccessible?
  • Cooperation with facilities/building management - fire suppression, access control