Changing Infrastructure

Changing running infrastructure is fundamentally different from changing application code - a refactored module can destroy and recreate critical resources, a misapplied plan can wipe production data, and a drifted state file can cause Terraform to undo emergency fixes. This page covers how to plan, apply, and recover from infrastructure changes safely.

Making Change Routine

Large organisations classify infrastructure changes by risk, disruption, and delivery method:

Type	Characteristics	Example
Exceptional	Infrequent, approached differently every time, requires significant planning and expertise	Major OS upgrade
Routine	Frequent, follows the same standardised process, minimal active thought required	Minor OS patch

Automating a delivery process is the primary mechanism for converting exceptional changes into routine operations. Automation lowers risk by providing:

Automated testing in production-representative environments
An automated build-progression system that records and enforces all testing and approval steps
An automated deployment process that executes identically every time

Incremental Delivery

Why Incrementalism

Delivering large changes as a series of small increments is fundamentally easier and safer - each increment simplifies planning, implementation, testing, and debugging.

Each increment must:

Be deployable to production
Leave the system in a workable state
Pass all automated tests
Meet current operability requirements

Avoiding the Component-by-Component Trap

Building one component at a time (networking → storage → compute) means you can’t meaningfully test the system until everything is finished. Integration problems surface last, when rework is most expensive.

Walking Skeleton

Start with a bare-bones iteration that wires all layers together, even if each layer is minimal:

Validates the build, test, configure, and deploy workflow end-to-end
Discovers integration issues immediately
Evolves into the production system naturally

Tracer bullet pipeline: the delivery pipeline built alongside the walking skeleton grows with the system - the same tools and processes used for early iterations become the production deployment pipeline.

Handling Incomplete Changes

Technique	How it works	Trade-off
Feature branches	Develop in isolation; merge when complete	Delays feedback; risk accumulates until merge
Feature toggles	Deploy code everywhere but activate only in specific environments	No branches needed; requires toggle management
Dark launching	Deploy to production but keep out of the critical path; test with real data and integrations	Proves performance without risking user traffic
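
As a minimal sketch of the feature-toggle row above: the new code path is deployed everywhere but only runs in environments where the toggle is on. The toggle table, the feature name `new_cache_tier`, and the backend names are hypothetical, chosen only for illustration.

```python
# Hypothetical toggle table: feature name -> environments where it is active.
TOGGLES = {
    "new_cache_tier": {"dev", "staging"},
}


def is_enabled(feature: str, environment: str) -> bool:
    """Return True if the feature is switched on for this environment."""
    return environment in TOGGLES.get(feature, set())


def cache_backend(environment: str) -> str:
    """Pick a backend; the new path ships everywhere but runs only where toggled."""
    if is_enabled("new_cache_tier", environment):
        return "cache-v2"  # hypothetical new backend
    return "cache-v1"      # hypothetical existing backend
```

Activating the change in production then means adding `"production"` to the toggle set, with no new deployment of the code path itself.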

Planning and Applying Changes

The Terraform Plan

The planning phase compares your configuration against the recorded state and the real-world infrastructure, then builds a dependency graph (a directed acyclic graph, or DAG) to determine the order of operations.

Plan type	Command	Purpose
Speculative	`terraform plan`	Test code changes safely - no intention to apply
Saved	`terraform plan -out=deploy.tfplan`	Binary file guaranteeing exact actions during apply
Destroy	`terraform plan -destroy`	Tear down all managed infrastructure (reverse dependency order)
Refresh-only	`terraform plan -refresh-only`	Detect drift without proposing infrastructure changes; `terraform apply -refresh-only` then writes the detected values to the state file

Replacing Resources

If a resource is corrupted or manually altered, force its recreation:

terraform plan -replace='aws_instance.app_server'

This replaces the deprecated terraform taint command. Because the replacement shows up in the plan, you can preview its cascading effects before anything is applied.

Resource Targeting

Focus a plan on specific resources with -target. Routine use is an anti-pattern, because a targeted apply leaves the rest of the configuration unreconciled. Reserve it for exceptional cases such as recovering from a mistake or repairing a broken state.

Execution Options

Option	Detail
Parallelism	Default: 10 concurrent actions (`-parallelism=n`). True parallelism is limited by dependency ordering and API rate limits. Set to 1 for readable debug logs
State locking	Automatic during operations. Disable with `-lock=false` only for speculative plans where no changes will be applied

Reviewing Plans Programmatically

Saved plan files are binary - use terraform show to read them:

# Human-readable
terraform show deploy.tfplan

# JSON for CI/CD pipelines and automated review
terraform show -json deploy.tfplan > plan.json
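
One possible automated review step is sketched below: it scans the JSON plan for resources that would be deleted or replaced. It relies on the `resource_changes`, `address`, and `change.actions` fields of Terraform's JSON plan output. The sample plan is hand-written, and the addresses `aws_s3_bucket.logs` and `aws_db_instance.main` are hypothetical.

```python
import json
from typing import List


def destructive_changes(plan: dict) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement is reported as a delete/create pair, so checking
        # for "delete" catches both outright deletions and replacements.
        if "delete" in actions:
            flagged.append(resource_change["address"])
    return flagged


def load_plan(path: str) -> dict:
    """Load the JSON written by `terraform show -json deploy.tfplan > plan.json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hand-written sample, trimmed to the fields the check reads.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_instance.app_server", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["no-op"]}},
    ]
}
```

In a pipeline, a non-empty result from `destructive_changes(load_plan("plan.json"))` could be used to require an extra approval before apply.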

Manipulating State

State manipulation modifies the state file without altering real-world infrastructure - used for renaming resources, reorganising code, or removing infrastructure from Terraform’s control.

Always Back Up First

# Download current state
terraform state pull > backup.tfstate

# Restore if needed
terraform state push backup.tfstate

# Only if the push above is rejected: -force bypasses the lineage/serial checks
terraform state push -force backup.tfstate

Code-Driven Changes (Recommended)

Block	Purpose	Example
`moved`	Rename a resource or move it into a child module	`moved { from = aws_instance.web to = module.app.aws_instance.web }`
`removed`	Stop managing a resource without destroying it	`removed { from = aws_instance.legacy lifecycle { destroy = false } }`

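Code-driven changes are the safest option: Terraform updates the state itself, validates the change during plan, and repeats it for everyone who shares the codebase. The table examples are compressed onto one line; in a real configuration, put each argument on its own line.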

CLI-Driven Changes

Command	Purpose	Note
`terraform state rm`	Remove a resource from state (leaves real infrastructure running)	Must also remove from code, or Terraform will recreate it
`terraform state replace-provider`	Migrate resources between providers	Common when testing custom/dev provider versions

Manual Editing (Last Resort)

Manually editing the terraform.tfstate JSON file should only be attempted to recover a corrupted state:

Pull the state locally
Edit the JSON carefully
Run through a JSON validator
Increment the serial field by 1 (so Terraform accepts it without -force)
Push back to the backend
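
The validate-and-increment steps above can be sketched as a small helper. This is an illustration rather than a supported tool: it assumes the state is the usual JSON document with a top-level integer `serial` field, and the function name `bump_serial` is made up for this example.

```python
import json


def bump_serial(state_text: str) -> str:
    """Validate edited state JSON and return it with `serial` incremented by 1."""
    # json.loads raises json.JSONDecodeError if the manual edit broke the JSON.
    state = json.loads(state_text)
    if not isinstance(state.get("serial"), int):
        raise ValueError("state has no integer 'serial' field")
    state["serial"] += 1
    # All other fields, including lineage, are written back unchanged.
    return json.dumps(state, indent=2) + "\n"
```

Write the returned text to a file and push that file back to the backend. Keep the original pulled state as a backup.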

Refactoring Live Resources

There is a critical distinction between refactoring infrastructure code and refactoring live resources. Refactoring code in an IDE is safe. Applying that refactored code to running infrastructure can be catastrophic - splitting a network stack into two may destroy and recreate critical components.

The Danger of Destructive Interim Steps

When you split a large stack, Terraform may automatically:

Destroy resources removed from the old stack
Attempt to create them in the new stack
Fail because dependent resources are still in use

Three Strategies for Safe Remapping

1. Manual State Remapping (“Infrastructure Surgery”)

Edit state files or use terraform state mv to move resource mappings without destroying infrastructure. This is highly risky, error-prone, and violates IaC principles. Use only in extreme situations with full backups.

2. Pipeline-Based Remapping

Define changes in code and deliver through automated pipelines:

Tool	Mechanism	Limitation
`moved` block	Updates state mapping to a new identifier	Can only rename within a single stack
`aliases` (Pulumi)	Achieves the same result more naturally in code	-
Tfmigrate	Scripts cross-state migrations (moving resources between state files)	Requires additional pipeline orchestration

Changes must be idempotent and work regardless of which previous version the environment is running.

3. Expand and Contract (Parallel Change)

The safest approach - works with any IaC tool, including those without editable state files:

Phase	Action	Risk
1 · Expand	Deploy the new resource alongside the old one - unused, hidden from workloads	Minimal
2 · Integrate	Route traffic/usage from old to new; old resource remains available for rollback	Low - quickly reversible
3 · Contract	Remove the old, unused resource	Low - new resource is proven in production

Each phase is a standard, tested pipeline deployment. No state file edits required.
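
A toy model of the three phases, to make the ordering constraints concrete. The `Environment` class and the resource names are hypothetical; in practice each function corresponds to a separate pipeline deployment.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy model: deployed resources plus the one workloads currently use."""
    resources: set = field(default_factory=set)
    active: str = ""


def expand(env: Environment, new: str) -> None:
    """Phase 1: deploy the new resource alongside the old one, unused."""
    env.resources.add(new)


def integrate(env: Environment, new: str) -> None:
    """Phase 2: switch usage to the new resource; the old one stays for rollback."""
    if new not in env.resources:
        raise ValueError("expand must run before integrate")
    env.active = new


def contract(env: Environment, old: str) -> None:
    """Phase 3: remove the old resource once nothing uses it."""
    if env.active == old:
        raise ValueError("old resource is still in use")
    env.resources.discard(old)
```

Rolling back during phase 2 is just `integrate(env, old)` again, which works because the old resource is not removed until phase 3.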

Zero-Downtime Deployment Patterns

Teams that deploy to production more frequently typically achieve higher reliability - frequent deployments force optimisation of techniques, while infrequent deployments tend to be large, complex, and manual.

Blue-Green Deployments

Create a completely new infrastructure instance, switch traffic, then remove the old one:

Requires a routing mechanism (load balancer) for the switchover
Advanced setups allow workloads to drain - new work goes to the new instance while old tasks finish
Originally used for entire data centres; now commonly applied at the individual stack level

Rolling Upgrades

Incrementally deploy new versions into an active pool:

Add new nodes with the updated configuration one by one
Remove old nodes progressively
At any point, both versions coexist in the pool
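
A small simulation of the steps above, showing that the pool holds mixed versions until the last step. The version labels and the function name are illustrative only. A real rollout would also wait for health checks between steps.

```python
from typing import List


def rolling_upgrade(pool: List[str], new_version: str) -> List[List[str]]:
    """Simulate a rolling upgrade and return the pool contents after each step."""
    current = list(pool)
    steps = [list(current)]
    old_count = sum(1 for version in pool if version != new_version)
    for _ in range(old_count):
        current.append(new_version)  # add one node with the new configuration
        oldest = next(v for v in current if v != new_version)
        current.remove(oldest)       # then remove one old node
        steps.append(list(current))
    return steps
```

For a pool of three `v1` nodes upgraded to `v2`, the intermediate steps contain both `v1` and `v2`, which is why the two versions must be able to run side by side.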

Canary Releases

A cautious variant of rolling upgrades:

Deploy to a small subset first (the “canary”)
Let the canary take real traffic and prove stability
Only proceed with the full rollout after monitoring confirms health
If issues are detected, halt and roll back automatically
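
The proceed-or-roll-back decision can be sketched as a comparison between the canary and the existing fleet. The metric (error rate) and the default tolerance are assumptions for illustration. Real rollouts usually combine several health signals.

```python
def canary_decision(canary_error_rate: float,
                    baseline_error_rate: float,
                    tolerance: float = 0.01) -> str:
    """Return "proceed" if the canary is no worse than baseline plus tolerance."""
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "proceed"
    return "rollback"
```

A pipeline would call this after the canary has taken real traffic for an agreed observation window, and halt the rollout on `"rollback"`.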

Immutable Infrastructure and Phoenix Servers

Concept	Description
Immutable infrastructure	Never patch existing resources - build a new instance from scratch and destroy the old one
Offline testing	The new instance can be fully tested before any live traffic is switched to it
Instant rollback	If something fails after the switch, route traffic back to the old instance
Phoenix servers	Rebuild instances frequently to prevent automation lag (drift between actual state and automated configuration)

Managing Data During Changes

Data-hosting infrastructure requires special handling because destroying or replacing resources can cause data loss, service interruptions, and complex migrations.

Store and Load

Back up data before destruction, load onto the new resource after creation:

Use native cloud features (automated snapshots)
Keep orchestration simple - deployment scripts should not contain storage-specific details (table structures, etc.)
Use event-based triggers (lifecycle hooks, Lambda functions) to decouple backup/restore from deployment scripts

Continuous Data Transfer

Store and load leaves a gap: any data written between the backup and the switchover is lost. Ways to close it:

Technique	How it works
Brief pause	Acceptable for resilient systems (message queues); messages back up then drain
Streaming transaction logs	Old instance (active) streams logs to new instance (passive); switchover requires only a brief, often unnoticeable pause
Active-active synchronisation	Required for rolling deployments where multiple versions coexist; data can be safely written to any active node

Segregate Data Infrastructure

Define data-hosting resources in separate deployment stacks:

Non-data updates (compute, networking) don’t trigger complex data-consistency steps
Data deployments stay smaller, faster, and easier to roll back

Separate Software and Data Changes

If a software update introduces a backward-incompatible data format change:

Write the new software version to be backward compatible with the old data format
Deploy the new software via rolling upgrade
Once all nodes run the new software, deploy the data format change as a separate step
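
The first step, software that accepts both formats, might look like the sketch below. Both record formats are invented for this example: an old format with a single `name` field and a new one with `first_name` and `last_name`.

```python
def read_customer(record: dict) -> dict:
    """Normalise a record from either the old or the new (hypothetical) format.

    Old format: {"name": "Ada Lovelace"}
    New format: {"first_name": "Ada", "last_name": "Lovelace"}
    """
    if "first_name" in record:
        # Already in the new format.
        return {"first_name": record["first_name"],
                "last_name": record["last_name"]}
    # Old format: split on the first space only.
    first, _, last = record["name"].partition(" ")
    return {"first_name": first, "last_name": last}
```

Because the reader handles both shapes, nodes running the new software can coexist with old-format data during the rolling upgrade, and the data migration can follow later as its own change.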

State Drift

State drift occurs when real-world infrastructure changes outside of Terraform - the state file no longer matches reality.

Terraform detects drift during refresh (at the start of every plan or via terraform plan -refresh-only). Its default behaviour is to plan changes that revert infrastructure back to match the code.
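
Drift can also be surfaced in a pipeline by reading a refresh-only plan as JSON. The sketch below assumes the JSON plan output contains a top-level `resource_drift` list whose entries carry an `address` field. Check this against the JSON output of the Terraform version in use. The sample data is hand-written.

```python
from typing import List


def drifted_addresses(plan: dict) -> List[str]:
    """Return the sorted addresses of resources reported as drifted."""
    # Assumption: drifted resources are listed under "resource_drift".
    return sorted(entry["address"] for entry in plan.get("resource_drift", []))


# Hand-written sample, trimmed to the fields the function reads.
SAMPLE_REFRESH_PLAN = {
    "resource_drift": [
        {"address": "aws_instance.app_server"},
    ]
}
```

A scheduled job could run this against each environment and raise an alert when the list is non-empty, so drift is noticed before the next apply reverts it.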

Four Categories of Drift

1. Accidental Manual Changes

Cause: Human error - wrong account, incorrect command
Prevention: Restrict production access, enforce CI/CD pipelines, create policies around manual changes
Fix: Standard terraform plan and apply - typically the easiest to resolve

2. Intentional Manual Changes

Cause: Emergency fix applied outside Terraform (e.g., on-call response)
Risk: The next Terraform run will undo the emergency fix
Prevention: Build a culture and pipeline where even emergency changes go through Terraform
Fix: Update Terraform code to reflect the manual change before running Terraform again

3. Conflicting Automated Changes

Cause: External systems alter infrastructure as expected - new AMIs published, autoscaling changes instance counts, cloud provider performs minor version upgrades

Resolution options:

Option	When to use
Apply	Let Terraform update resources (e.g., redeploy to the newest image)
Ignore	Use `lifecycle { ignore_changes = [tags, desired_count] }` for attributes managed externally
Sync	Run a refresh-only plan and apply it (`terraform apply -refresh-only`) to update the state without code changes

4. Terraform Errors

Cause: Terraform crashes, the host fails, backend authentication expires mid-run, or a state write is corrupted
Risk: Terraform may have created resources but failed to record them - the next run creates duplicates
Fix: Review logs to identify what was created; either terraform import the orphaned resources or manually delete them so Terraform can recreate cleanly. In extreme cases, restore from a backend backup

Continuous Disaster Recovery

Restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code. Use your routine automated deployment processes for disaster recovery - every routine update becomes a rehearsal, keeping the team comfortable with the tools so recovery is muscle memory rather than panic.