Changing Infrastructure
Changing running infrastructure is fundamentally different from changing application code - a refactored module can destroy and recreate critical resources, a misapplied plan can wipe production data, and a drifted state file can cause Terraform to undo emergency fixes. This page covers how to plan, apply, and recover from infrastructure changes safely.
Making Change Routine
Large organisations classify infrastructure changes by risk, disruption, and delivery method:
| Type | Characteristics | Example |
|---|---|---|
| Exceptional | Infrequent, approached differently every time, requires significant planning and expertise | Major OS upgrade |
| Routine | Frequent, follows the same standardised process, minimal active thought required | Minor OS patch |
Automating a delivery process is the primary mechanism for converting exceptional changes into routine operations. Automation lowers risk by providing:
- Automated testing in production-representative environments
- An automated build-progression system that records and enforces all testing and approval steps
- An automated deployment process that executes identically every time
Incremental Delivery
Why Incrementalism
Delivering large changes as a series of small increments is fundamentally easier and safer - each increment simplifies planning, implementation, testing, and debugging.
Each increment must:
- Be deployable to production
- Leave the system in a workable state
- Pass all automated tests
- Meet current operability requirements
Avoiding the Component-by-Component Trap
Building one component at a time (networking → storage → compute) means you can’t meaningfully test the system until everything is finished. Integration problems surface last, when rework is most expensive.
Walking Skeleton
Start with a bare-bones iteration that wires all layers together, even if each layer is minimal:
- Validates the build, test, configure, and deploy workflow end-to-end
- Discovers integration issues immediately
- Evolves into the production system naturally
Tracer bullet pipeline: the delivery pipeline built alongside the walking skeleton grows with the system - the same tools and processes used for early iterations become the production deployment pipeline.
Handling Incomplete Changes
| Technique | How it works | Trade-off |
|---|---|---|
| Feature branches | Develop in isolation; merge when complete | Delays feedback; risk accumulates until merge |
| Feature toggles | Deploy code everywhere but activate only in specific environments | No branches needed; requires toggle management |
| Dark launching | Deploy to production but keep out of the critical path; test with real data and integrations | Proves performance without risking user traffic |
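
A feature toggle for infrastructure can be as small as a per-environment lookup. The sketch below is a minimal illustration under stated assumptions, not a prescribed implementation: the toggle name `new_cache_tier`, the environment names, and the helpers `is_enabled` and `resources_for` are all hypothetical.

```python
# Hypothetical toggle table: which environments activate each incomplete change.
TOGGLES = {
    "new_cache_tier": {"dev", "staging"},  # code ships everywhere, active only here
}


def is_enabled(toggle: str, environment: str) -> bool:
    """Return True if the toggle is active in the given environment."""
    return environment in TOGGLES.get(toggle, set())


def resources_for(environment: str) -> list:
    """Build the resource list for an environment, honouring toggles."""
    resources = ["network", "app_server"]
    if is_enabled("new_cache_tier", environment):
        resources.append("cache_tier")
    return resources
```

The same code is deployed to every environment; only the toggle table decides where the incomplete change becomes active, which is the management overhead the table above refers to.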
Planning and Applying Changes
The Terraform Plan
The planning phase compares your configuration against real-world infrastructure and formulates a directed acyclic graph (DAG) to determine the exact order of operations.
| Plan type | Command | Purpose |
|---|---|---|
| Speculative | terraform plan | Test code changes safely - no intention to apply |
| Saved | terraform plan -out=deploy.tfplan | Binary file guaranteeing exact actions during apply |
| Destroy | terraform plan -destroy | Preview tearing down all managed infrastructure (in reverse dependency order) |
| Refresh-only | terraform plan -refresh-only | Detect drift; applying the result updates the state file without modifying infrastructure |
Replacing Resources
If a resource is corrupted or manually altered, force its recreation:

    terraform plan -replace='aws_instance.app_server'

This replaces the deprecated terraform taint command and lets you preview cascading effects before they hit your state.
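
One way to preview cascading effects is to render a saved plan as JSON with terraform show -json and list every resource the plan would replace or destroy. The sketch below assumes the documented `resource_changes` / `change.actions` layout of that JSON. The sample document is hand-written for illustration, not real Terraform output, and `aws_eip.app_ip` is a hypothetical address.

```python
import json


def destructive_changes(plan: dict) -> list:
    """Return (address, kind) for every resource the plan would delete or replace."""
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and "create" in actions:
            findings.append((rc["address"], "replace"))
        elif "delete" in actions:
            findings.append((rc["address"], "destroy"))
    return findings


# Hand-written sample in the shape of `terraform show -json` output.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.app_server", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_eip.app_ip", "change": {"actions": ["update"]}},
    {"address": "aws_instance.legacy", "change": {"actions": ["delete"]}}
  ]
}
""")

if __name__ == "__main__":
    for address, kind in destructive_changes(SAMPLE_PLAN):
        print(f"{kind}: {address}")
```

In a pipeline, a non-empty result could be used to require an extra approval step before apply; that policy is a design choice, not something Terraform enforces.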
Resource Targeting
Focus a plan on isolated resources with -target. This is a massive anti-pattern for regular use - it should only be used as an emergency debugging tool to fix corrupted state.
Execution Options
| Option | Detail |
|---|---|
| Parallelism | Default: 10 concurrent actions (-parallelism=n). True parallelism is limited by dependency ordering and API rate limits. Set to 1 for readable debug logs |
| State locking | Automatic during operations. Disable with -lock=false only for speculative plans where no changes will be applied |
Reviewing Plans Programmatically
Saved plan files are binary - use terraform show to read them:

    # Human-readable
    terraform show deploy.tfplan

    # JSON for CI/CD pipelines and automated review
    terraform show -json deploy.tfplan > plan.json

Manipulating State
State manipulation modifies the state file without altering real-world infrastructure - used for renaming resources, reorganising code, or removing infrastructure from Terraform’s control.
Always Back Up First

    # Download current state
    terraform state pull > backup.tfstate

    # Restore if needed (-force bypasses lineage/serial checks)
    terraform state push backup.tfstate
    terraform state push -force backup.tfstate

Code-Driven Changes (Recommended)
| Block | Purpose | Example |
|---|---|---|
| moved | Rename a resource or move it into a child module | moved { from = aws_instance.web to = module.app.aws_instance.web } |
| removed | Stop managing a resource without destroying it | removed { from = aws_instance.legacy lifecycle { destroy = false } } |
Code-driven changes are the safest - Terraform handles formatting, ensures correctness, and makes the change repeatable for anyone sharing the codebase.
CLI-Driven Changes
| Command | Purpose | Note |
|---|---|---|
| terraform state rm | Remove a resource from state (leaves real infrastructure running) | Must also remove from code, or Terraform will recreate it |
| terraform state replace-provider | Migrate resources between providers | Common when testing custom/dev provider versions |
Manual Editing (Last Resort)
Manually editing the terraform.tfstate JSON file should only be attempted to recover a corrupted state:
- Pull the state locally
- Edit the JSON carefully
- Run through a JSON validator
- Increment the serial field by 1 (so Terraform accepts it without -force)
- Push back to the backend
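
The validation and serial steps above can be sketched as a small helper. This is an illustration under assumptions rather than a supported tool: it relies only on the state file being JSON with a top-level integer `serial`, the name `bump_serial` is hypothetical, and the sample state is trimmed and hand-written.

```python
import json


def bump_serial(state_text: str) -> str:
    """Validate edited state JSON and return it with `serial` incremented by 1."""
    state = json.loads(state_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(state.get("serial"), int):
        raise ValueError("state has no integer 'serial' field")
    state["serial"] += 1
    return json.dumps(state, indent=2)


# Trimmed, hand-written example of a state document (real files hold more fields).
EDITED = '{"version": 4, "serial": 7, "lineage": "example-lineage", "resources": []}'

if __name__ == "__main__":
    print(bump_serial(EDITED))
```

Parsing before incrementing means a malformed edit fails locally instead of at push time.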
Refactoring Live Resources
There is a critical distinction between refactoring infrastructure code and refactoring live resources. Refactoring code in an IDE is safe. Applying that refactored code to running infrastructure can be catastrophic - splitting a network stack into two may destroy and recreate critical components.
The Danger of Destructive Interim Steps
When you split a large stack, Terraform may automatically:
- Destroy resources removed from the old stack
- Attempt to create them in the new stack
- Fail because dependent resources are still in use
Three Strategies for Safe Remapping
1. Manual State Remapping (“Infrastructure Surgery”)
Edit state files or use terraform state mv to move resource mappings without destroying infrastructure. This is highly risky, error-prone, and violates IaC principles. Use only in extreme situations with full backups.
2. Pipeline-Based Remapping
Define changes in code and deliver through automated pipelines:
| Tool | Mechanism | Limitation |
|---|---|---|
| moved block | Updates state mapping to a new identifier | Can only rename within a single stack |
| aliases (Pulumi) | Achieves the same result more naturally in code | - |
| Tfmigrate | Scripts cross-state migrations (moving resources between state files) | Requires additional pipeline orchestration |
Changes must be idempotent and work regardless of which previous version the environment is running.
3. Expand and Contract (Parallel Change)
The safest approach - works with any IaC tool, including those without editable state files:
| Phase | Action | Risk |
|---|---|---|
| 1 · Expand | Deploy the new resource alongside the old one - unused, hidden from workloads | Minimal |
| 2 · Integrate | Route traffic/usage from old to new; old resource remains available for rollback | Low - quickly reversible |
| 3 · Contract | Remove the old, unused resource | Low - new resource is proven in production |
Each phase is a standard, tested pipeline deployment. No state file edits required.
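
The ordering constraints between the three phases can be illustrated with a toy model. It is purely a sketch: the `Environment` class, the phase functions, and the resource names `subnet_old` and `subnet_new` are hypothetical and stand in for real pipeline deployments.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Environment:
    """Toy model: resources that exist, and the one workloads currently use."""
    resources: set = field(default_factory=set)
    in_use: Optional[str] = None


def expand(env: Environment, new: str) -> None:
    """Phase 1: add the new resource; nothing uses it yet."""
    env.resources.add(new)


def integrate(env: Environment, new: str) -> None:
    """Phase 2: switch usage to the new resource; the old one stays for rollback."""
    if new not in env.resources:
        raise ValueError(f"{new} has not been deployed yet")
    env.in_use = new


def contract(env: Environment, old: str) -> None:
    """Phase 3: remove the old resource, refusing if it is still in use."""
    if env.in_use == old:
        raise ValueError(f"{old} is still in use")
    env.resources.discard(old)


env = Environment(resources={"subnet_old"}, in_use="subnet_old")
expand(env, "subnet_new")
integrate(env, "subnet_new")
contract(env, "subnet_old")
```

Each function can be re-run without changing the outcome, and contracting before integrating is rejected rather than silently destroying a resource that is still serving workloads.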
Zero-Downtime Deployment Patterns
Teams that deploy to production more frequently typically achieve higher reliability - frequent deployments force optimisation of techniques, while infrequent deployments tend to be large, complex, and manual.
Blue-Green Deployments
Create a completely new infrastructure instance, switch traffic, then remove the old one:
- Requires a routing mechanism (load balancer) for the switchover
- Advanced setups allow workloads to drain - new work goes to the new instance while old tasks finish
- Originally used for entire data centres; now commonly applied at the individual stack level
Rolling Upgrades
Incrementally deploy new versions into an active pool:
- Add new nodes with the updated configuration one by one
- Remove old nodes progressively
- At any point, both versions coexist in the pool
Canary Releases
A cautious variant of rolling upgrades:
- Deploy to a small subset first (the “canary”)
- Let the canary take real traffic and prove stability
- Only proceed with the full rollout after monitoring confirms health
- If issues are detected, halt and roll back automatically
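
The canary steps above can be sketched as a small control loop. This is a simplified illustration, assuming deployment, rollback, and health checks are available as callables. The function `canary_rollout`, the node names, and the in-memory `versions` table are hypothetical stand-ins for a real orchestrator and monitoring system.

```python
from typing import Callable, List


def canary_rollout(nodes: List[str], canary_count: int,
                   deploy: Callable[[str], None],
                   rollback: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> bool:
    """Deploy to a canary subset, then to the rest only if the canaries are healthy."""
    canaries, rest = nodes[:canary_count], nodes[canary_count:]
    for node in canaries:
        deploy(node)
    if not all(healthy(node) for node in canaries):
        # Halt: undo the canaries and leave the remaining nodes untouched.
        for node in canaries:
            rollback(node)
        return False
    for node in rest:
        deploy(node)
    return True


# In-memory stand-in for a pool of nodes and the version each one runs.
versions = {name: "v1" for name in ("node-a", "node-b", "node-c", "node-d")}


def deploy(node: str) -> None:
    versions[node] = "v2"


def rollback(node: str) -> None:
    versions[node] = "v1"


succeeded = canary_rollout(list(versions), 1, deploy, rollback, healthy=lambda node: True)
```

A real health check would watch the canary under live traffic for a period of time; the single boolean call here only marks where that decision sits in the flow.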
Immutable Infrastructure and Phoenix Servers
| Concept | Description |
|---|---|
| Immutable infrastructure | Never patch existing resources - build a new instance from scratch and destroy the old one |
| Offline testing | The new instance can be fully tested before any live traffic is switched to it |
| Instant rollback | If something fails after the switch, route traffic back to the old instance |
| Phoenix servers | Rebuild instances frequently to prevent automation lag (drift between actual state and automated configuration) |
Managing Data During Changes
Data-hosting infrastructure requires special handling because destroying or replacing resources can cause data loss, service interruptions, and complex migrations.
Store and Load
Back up data before destruction, load onto the new resource after creation:
- Use native cloud features (automated snapshots)
- Keep orchestration simple - deployment scripts should not contain storage-specific details (table structures, etc.)
- Use event-based triggers (lifecycle hooks, Lambda functions) to decouple backup/restore from deployment scripts
Continuous Data Transfer
Store and load leaves a gap: data written between backup and switchover is lost. Techniques for closing that gap:
| Technique | How it works |
|---|---|
| Brief pause | Acceptable for resilient systems (message queues); messages back up then drain |
| Streaming transaction logs | Old instance (active) streams logs to new instance (passive); switchover requires only a brief, often unnoticeable pause |
| Active-active synchronisation | Required for rolling deployments where multiple versions coexist; data can be safely written to any active node |
Segregate Data Infrastructure
Define data-hosting resources in separate deployment stacks:
- Non-data updates (compute, networking) don’t trigger complex data-consistency steps
- Data deployments stay smaller, faster, and easier to roll back
Separate Software and Data Changes
If a software update introduces a backward-incompatible data format change:
- Write the new software version to be backward compatible with the old data format
- Deploy the new software via rolling upgrade
- Once all nodes run the new software, deploy the data format change as a separate step
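
The first step, software that tolerates both formats, might look like the sketch below. The record fields (`first_name`, `last_name`, `full_name`), the `WRITE_NEW_FORMAT` flag, and both function names are hypothetical; the point is only that reads accept either format while writes stay in the old format until the separate data-format step.

```python
WRITE_NEW_FORMAT = False  # flipped only in the separate data-format deployment


def read_customer(record: dict) -> dict:
    """Normalise a record from either format into the new in-memory shape."""
    if "full_name" in record:  # new format
        return {"full_name": record["full_name"]}
    # Old format: separate first/last name fields.
    return {"full_name": f'{record["first_name"]} {record["last_name"]}'}


def write_customer(full_name: str) -> dict:
    """Serialise in the old format until every node can read the new one."""
    if WRITE_NEW_FORMAT:
        return {"full_name": full_name}
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}
```

Writing the old format during the rolling upgrade matters because old and new nodes coexist in the pool, and the old nodes cannot read the new format.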
State Drift
State drift occurs when real-world infrastructure changes outside of Terraform - the state file no longer matches reality.
Terraform detects drift during refresh (at the start of every plan or via terraform plan -refresh-only). Its default behaviour is to plan changes that revert infrastructure back to match the code.
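
Drift found during refresh can also be read programmatically. The sketch below assumes the plan was saved and rendered with terraform show -json, and that the JSON carries a `resource_drift` array, which recent Terraform versions emit. The helper name, the sample document, and the address `aws_security_group.web` are hypothetical.

```python
import json


def drifted_addresses(plan: dict) -> list:
    """Return addresses Terraform reports as changed outside of Terraform."""
    return [rd["address"] for rd in plan.get("resource_drift", [])]


# Hand-written sample in the shape of a JSON-rendered plan (not real output).
SAMPLE_PLAN = json.loads("""
{
  "resource_drift": [
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}}
  ],
  "resource_changes": []
}
""")

if __name__ == "__main__":
    for address in drifted_addresses(SAMPLE_PLAN):
        print(f"drift detected: {address}")
```

A scheduled job could run this against a refresh-only plan and alert when the list is non-empty, so drift is reviewed before the next apply reverts it.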
Four Categories of Drift
1. Accidental Manual Changes
- Cause: Human error - wrong account, incorrect command
- Prevention: Restrict production access, enforce CI/CD pipelines, create policies around manual changes
- Fix: Standard terraform plan and apply - typically the easiest to resolve
2. Intentional Manual Changes
- Cause: Emergency fix applied outside Terraform (e.g., on-call response)
- Risk: The next Terraform run will undo the emergency fix
- Prevention: Build a culture and pipeline where even emergency changes go through Terraform
- Fix: Update Terraform code to reflect the manual change before running Terraform again
3. Conflicting Automated Changes
- Cause: External systems alter infrastructure as expected - new AMIs published, autoscaling changes instance counts, cloud provider performs minor version upgrades
Resolution options:
| Option | When to use |
|---|---|
| Apply | Let Terraform update resources (e.g., redeploy to the newest image) |
| Ignore | Use lifecycle { ignore_changes = [tags, desired_count] } for attributes managed externally |
| Sync | Apply a refresh-only plan (terraform apply -refresh-only) to update the state without code changes |
4. Terraform Errors
- Cause: Terraform crashes, the host fails, backend auth expires mid-run, or the state is saved in a corrupted form
- Risk: Terraform may have created resources but failed to record them - the next run creates duplicates
- Fix: Review logs to identify what was created; either terraform import the orphaned resources or manually delete them so Terraform can recreate cleanly. In extreme cases, restore from a backend backup
Continuous Disaster Recovery
Restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code. Use your routine automated deployment processes for disaster recovery - every routine update becomes a rehearsal, keeping the team comfortable with the tools so recovery is muscle memory rather than panic.