Changing Infrastructure

Changing running infrastructure is fundamentally different from changing application code - a refactored module can destroy and recreate critical resources, a misapplied plan can wipe production data, and a drifted state file can cause Terraform to undo emergency fixes. This page covers how to plan, apply, and recover from infrastructure changes safely.


Large organisations classify infrastructure changes by risk, disruption, and delivery method:

| Type | Characteristics | Example |
| --- | --- | --- |
| Exceptional | Infrequent, approached differently every time, requires significant planning and expertise | Major OS upgrade |
| Routine | Frequent, follows the same standardised process, minimal active thought required | Minor OS patch |

Automating a delivery process is the primary mechanism for converting exceptional changes into routine operations. Automation lowers risk by providing:

  • Automated testing in production-representative environments
  • An automated build-progression system that records and enforces all testing and approval steps
  • An automated deployment process that executes identically every time

Delivering large changes as a series of small increments is fundamentally easier and safer - each increment simplifies planning, implementation, testing, and debugging.

Each increment must:

  • Be deployable to production
  • Leave the system in a workable state
  • Pass all automated tests
  • Meet current operability requirements

Building one component at a time (networking → storage → compute) means you can’t meaningfully test the system until everything is finished. Integration problems surface last, when rework is most expensive.

Start instead with a walking skeleton: a bare-bones iteration that wires all layers together, even if each layer is minimal. It:

  • Validates the build, test, configure, and deploy workflow end-to-end
  • Discovers integration issues immediately
  • Evolves into the production system naturally

Tracer bullet pipeline: the delivery pipeline built alongside the walking skeleton grows with the system - the same tools and processes used for early iterations become the production deployment pipeline.

| Technique | How it works | Trade-off |
| --- | --- | --- |
| Feature branches | Develop in isolation; merge when complete | Delays feedback; risk accumulates until merge |
| Feature toggles | Deploy code everywhere but activate only in specific environments | No branches needed; requires toggle management |
| Dark launching | Deploy to production but keep out of the critical path; test with real data and integrations | Proves performance without risking user traffic |

During the planning phase, Terraform compares your configuration against real-world infrastructure and builds a dependency graph, a directed acyclic graph (DAG), to determine the exact order of operations.

| Plan type | Command | Purpose |
| --- | --- | --- |
| Speculative | `terraform plan` | Test code changes safely - no intention to apply |
| Saved | `terraform plan -out=deploy.tfplan` | Binary file guaranteeing exact actions during apply |
| Destroy | `terraform plan -destroy` | Preview tearing down all managed infrastructure (reverse dependency order) |
| Refresh-only | `terraform plan -refresh-only` | Detect drift; applying the plan updates the state file without modifying infrastructure |
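To illustrate how a dependency graph yields an order of operations, here is a minimal Python sketch using the standard library's `graphlib`. It is not Terraform's implementation (which is written in Go and also runs independent nodes in parallel); the dependency map and every resource name other than `aws_instance.app_server` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each resource lists the resources it depends on.
DEPENDS_ON = {
    "aws_vpc.main": [],
    "aws_subnet.app": ["aws_vpc.main"],
    "aws_security_group.app": ["aws_vpc.main"],
    "aws_instance.app_server": ["aws_subnet.app", "aws_security_group.app"],
}


def create_order(graph):
    """Return an order in which every resource comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def destroy_order(graph):
    """A destroy walks the same graph in reverse: dependents are removed first."""
    return list(reversed(create_order(graph)))
```

With this map, `create_order(DEPENDS_ON)` starts with the VPC and ends with the instance, and `destroy_order(DEPENDS_ON)` starts with the instance. The relative order of the subnet and the security group is not fixed, because neither depends on the other.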

If a resource is corrupted or manually altered, force its recreation:

```shell
terraform plan -replace='aws_instance.app_server'
```

This replaces the deprecated terraform taint command and lets you preview cascading effects before they hit your state.

Focus a plan on specific resources with -target. Routine use is an anti-pattern because everything outside the target is left unreconciled with the code - reserve it for emergencies such as debugging or recovering from corrupted state.

| Option | Detail |
| --- | --- |
| Parallelism | Default: 10 concurrent actions (`-parallelism=n`). True parallelism is limited by dependency ordering and API rate limits. Set to 1 for readable debug logs |
| State locking | Automatic during operations. Disable with `-lock=false` only for speculative plans where no changes will be applied |

Saved plan files are binary - use terraform show to read them:

```shell
# Human-readable
terraform show deploy.tfplan
# JSON for CI/CD pipelines and automated review
terraform show -json deploy.tfplan > plan.json
```
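As one possible form of automated review, the sketch below scans the JSON plan for changes that delete a resource, including replacements. It assumes the `resource_changes`, `address`, and `change.actions` fields of Terraform's JSON plan representation; the sample data is hand-written and hypothetical, and the function names are illustrative.

```python
import json


def load_plan(path):
    """Read a plan exported with `terraform show -json deploy.tfplan > plan.json`."""
    with open(path) as handle:
        return json.load(handle)


def destructive_changes(plan):
    """Return (address, actions) for every planned change that deletes a resource."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:  # covers plain deletes and replacements
            flagged.append((rc["address"], actions))
    return flagged


# Hand-written, trimmed sample in the shape of the JSON plan output (hypothetical data).
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_instance.app_server", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete"]}},
        {"address": "aws_security_group.app", "change": {"actions": ["update"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
}
```

A pipeline step could call `destructive_changes(load_plan("plan.json"))` and require manual approval whenever the result is non-empty.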

State manipulation modifies the state file without altering real-world infrastructure - used for renaming resources, reorganising code, or removing infrastructure from Terraform’s control.

```shell
# Download the current state as a backup
terraform state pull > backup.tfstate
# Restore it if needed
terraform state push backup.tfstate
# Last resort: -force bypasses the lineage and serial checks
terraform state push -force backup.tfstate
```

| Block | Purpose |
| --- | --- |
| `moved` | Rename a resource or move it into a child module |
| `removed` | Stop managing a resource without destroying it |

```hcl
# Rename a resource or move it into a child module
moved {
  from = aws_instance.web
  to   = module.app.aws_instance.web
}

# Stop managing a resource without destroying it
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```

Code-driven changes are the safest - Terraform handles formatting, ensures correctness, and makes the change repeatable for anyone sharing the codebase.

| Command | Purpose | Note |
| --- | --- | --- |
| `terraform state rm` | Remove a resource from state (leaves real infrastructure running) | Must also remove from code, or Terraform will recreate it |
| `terraform state replace-provider` | Rewrite the provider source address recorded for resources in state | Common when switching to a fork or testing custom/dev provider versions |

Manually editing the terraform.tfstate JSON file should only be attempted to recover a corrupted state:

  1. Pull the state locally
  2. Edit the JSON carefully
  3. Run through a JSON validator
  4. Increment the serial field by 1 (so Terraform accepts it without -force)
  5. Push back to the backend
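Steps 3 and 4 above can be scripted so they are not skipped under pressure. This is a minimal sketch, assuming the state file keeps `serial` as a top-level integer; the function name and sample state are hypothetical.

```python
import json


def bump_serial(state_text):
    """Validate edited state JSON and return it with `serial` incremented by 1."""
    state = json.loads(state_text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(state.get("serial"), int):
        raise ValueError("state has no integer 'serial' field")
    state["serial"] += 1
    return json.dumps(state, indent=2)


# Hypothetical, heavily trimmed state file content.
EDITED_STATE = '{"version": 4, "serial": 7, "lineage": "example-lineage", "resources": []}'
```

Here `bump_serial(EDITED_STATE)` returns the same document with `serial` set to 8 and `lineage` untouched, ready to be written to a file and pushed back.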

There is a critical distinction between refactoring infrastructure code and refactoring live resources. Refactoring code in an IDE is safe. Applying that refactored code to running infrastructure can be catastrophic - splitting a network stack into two may destroy and recreate critical components.

When you split a large stack, Terraform may automatically:

  1. Destroy resources removed from the old stack
  2. Attempt to create them in the new stack
  3. Fail because dependent resources are still in use

1. Manual State Remapping (“Infrastructure Surgery”)

Edit state files or use terraform state mv to move resource mappings without destroying infrastructure. This is highly risky, error-prone, and violates IaC principles. Use only in extreme situations with full backups.

2. Pipeline-Based Remapping

Define changes in code and deliver through automated pipelines:

| Tool | Mechanism | Limitation |
| --- | --- | --- |
| `moved` block | Updates state mapping to a new identifier | Can only rename within a single stack |
| `aliases` (Pulumi) | Achieves the same result more naturally in code | - |
| tfmigrate | Scripts cross-state migrations (moving resources between state files) | Requires additional pipeline orchestration |

Changes must be idempotent and work regardless of which previous version the environment is running.

3. Expand and Contract (Parallel Change)

The safest approach - works with any IaC tool, including those without editable state files:

| Phase | Action | Risk |
| --- | --- | --- |
| 1 · Expand | Deploy the new resource alongside the old one - unused, hidden from workloads | Minimal |
| 2 · Integrate | Route traffic/usage from old to new; old resource remains available for rollback | Low - quickly reversible |
| 3 · Contract | Remove the old, unused resource | Low - new resource is proven in production |

Each phase is a standard, tested pipeline deployment. No state file edits required.
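The three phases can be modelled as small, idempotent steps. The sketch below is a toy model rather than real deployment code: `Environment`, the resource names `old` and `new`, and the function names are all hypothetical. It shows why each phase is safe to re-run and why rollback is only possible before the contract phase.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy model: which resources exist and which one workloads use."""
    resources: set = field(default_factory=lambda: {"old"})
    live: str = "old"


def expand(env):
    """Phase 1: add the new resource alongside the old one; nothing uses it yet."""
    env.resources.add("new")


def integrate(env):
    """Phase 2: point workloads at the new resource; the old one stays for rollback."""
    if "new" not in env.resources:
        raise RuntimeError("expand must run before integrate")
    env.live = "new"


def rollback(env):
    """Undo phase 2 while the old resource still exists."""
    if "old" not in env.resources:
        raise RuntimeError("old resource already removed; cannot roll back")
    env.live = "old"


def contract(env):
    """Phase 3: remove the old resource once nothing uses it."""
    if env.live != "new":
        raise RuntimeError("workloads still use the old resource")
    env.resources.discard("old")
```

Running any phase twice leaves the environment unchanged the second time, which matches the requirement that each deployment works regardless of which previous version the environment is running.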


Teams that deploy to production more frequently typically achieve higher reliability - frequent deployments force optimisation of techniques, while infrequent deployments tend to be large, complex, and manual.

Blue-green deployment: create a completely new infrastructure instance, switch traffic to it, then remove the old one:

  • Requires a routing mechanism (load balancer) for the switchover
  • Advanced setups allow workloads to drain - new work goes to the new instance while old tasks finish
  • Originally used for entire data centres; now commonly applied at the individual stack level

Rolling upgrade: incrementally deploy new versions into an active pool:

  • Add new nodes with the updated configuration one by one
  • Remove old nodes progressively
  • At any point, both versions coexist in the pool
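The node-by-node replacement above can be sketched as a generator that yields the pool after each step. This is an illustrative model only; the pool is a plain list of version labels and the function name is hypothetical.

```python
def rolling_upgrade(pool, new_version):
    """Replace old nodes one at a time, yielding the pool after each step."""
    current = list(pool)
    old_nodes = [version for version in current if version != new_version]
    for old in old_nodes:
        current.append(new_version)  # add one node with the updated configuration
        current.remove(old)          # then retire one old node
        yield list(current)
```

For a pool of three `v1` nodes, `list(rolling_upgrade(["v1", "v1", "v1"], "v2"))` gives three snapshots. The pool keeps its size at every step, the first two snapshots contain both versions, and the last contains only `v2`.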

Canary deployment, a cautious variant of rolling upgrades:

  • Deploy to a small subset first (the “canary”)
  • Let the canary take real traffic and prove stability
  • Only proceed with the full rollout after monitoring confirms health
  • If issues are detected, halt and roll back automatically
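The canary gate described above amounts to a simple decision: upgrade a small subset, check its health, then either finish or restore the original versions. The sketch below assumes a caller-supplied `is_healthy` check standing in for real monitoring; node names and the function signature are hypothetical.

```python
def canary_rollout(nodes, new_version, is_healthy, canary_count=1):
    """Upgrade a few canary nodes first; finish only if every canary is healthy.

    `nodes` maps node name to version. Returns (versions, outcome).
    """
    original = dict(nodes)
    versions = dict(nodes)
    names = list(versions)
    canaries, rest = names[:canary_count], names[canary_count:]

    for name in canaries:
        versions[name] = new_version

    # In a real pipeline this check would run after the canary has served traffic.
    if not all(is_healthy(name) for name in canaries):
        return original, "rolled_back"

    for name in rest:
        versions[name] = new_version
    return versions, "completed"
```

A healthy canary leads to every node running the new version; an unhealthy one returns the original mapping unchanged.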

Immutable Infrastructure and Phoenix Servers

| Concept | Description |
| --- | --- |
| Immutable infrastructure | Never patch existing resources - build a new instance from scratch and destroy the old one |
| Offline testing | The new instance can be fully tested before any live traffic is switched to it |
| Instant rollback | If something fails after the switch, route traffic back to the old instance |
| Phoenix servers | Rebuild instances frequently to prevent automation lag (drift between actual state and automated configuration) |

Data-hosting infrastructure requires special handling because destroying or replacing resources can cause data loss, service interruptions, and complex migrations.

Back up data before destruction, load onto the new resource after creation:

  • Use native cloud features (automated snapshots)
  • Keep orchestration simple - deployment scripts should not contain storage-specific details (table structures, etc.)
  • Use event-based triggers (lifecycle hooks, Lambda functions) to decouple backup/restore from deployment scripts

The store-and-load gap: any data written between the backup and the switchover is lost unless one of these techniques closes the gap:

| Technique | How it works |
| --- | --- |
| Brief pause | Acceptable for resilient systems (message queues); messages back up then drain |
| Streaming transaction logs | Old instance (active) streams logs to new instance (passive); switchover requires only a brief, often unnoticeable pause |
| Active-active synchronisation | Required for rolling deployments where multiple versions coexist; data can be safely written to any active node |
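A toy model makes the gap concrete: restoring from the snapshot alone drops every write made after the backup, while replaying the transaction log recovers them. The write history, keys, and function name below are hypothetical.

```python
def restore(snapshot, log=(), after_seq=0):
    """Rebuild data from a snapshot, then replay log entries newer than `after_seq`."""
    data = dict(snapshot)
    for seq, key, value in log:
        if seq > after_seq:
            data[key] = value
    return data


# Hypothetical write history on the old instance: (sequence, key, value).
WRITES = [(1, "order-1", "paid"), (2, "order-2", "new"), (3, "order-1", "shipped")]

# Backup taken after write 1; writes 2 and 3 fall into the store-and-load gap.
SNAPSHOT = {"order-1": "paid"}
```

`restore(SNAPSHOT)` yields only `order-1` in state `paid`, whereas `restore(SNAPSHOT, WRITES, after_seq=1)` also contains `order-2` and the later `shipped` update.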

Define data-hosting resources in separate deployment stacks:

  • Non-data updates (compute, networking) don’t trigger complex data-consistency steps
  • Data deployments stay smaller, faster, and easier to roll back

If a software update introduces a backward-incompatible data format change:

  1. Write the new software version to be backward compatible with the old data format
  2. Deploy the new software via rolling upgrade
  3. Once all nodes run the new software, deploy the data format change as a separate step
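Step 1 usually means the new software reads both formats and keeps writing the old one until the format change is deployed. The sketch below assumes a hypothetical change in which a single `name` field is split into `first_name` and `last_name`; the field and function names are illustrative only.

```python
def read_user(record):
    """Accept both the old format ({'name': ...}) and the new split-name format."""
    if "first_name" in record:
        return record["first_name"], record["last_name"]
    first, _, last = record["name"].partition(" ")
    return first, last


def write_user(first, last, new_format=False):
    """Keep writing the old format until every node runs the new software."""
    if new_format:
        return {"first_name": first, "last_name": last}
    return {"name": f"{first} {last}"}
```

During the rolling upgrade `new_format` stays false, so old nodes can still read everything that is written. Switching it on is the separate step 3.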

State drift occurs when real-world infrastructure changes outside of Terraform - the state file no longer matches reality.

Terraform detects drift during refresh (at the start of every plan or via terraform plan -refresh-only). Its default behaviour is to plan changes that revert infrastructure back to match the code.
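Drift can also be reported by a script. The sketch below assumes a plan exported with `terraform show -json` and reads the `resource_drift` list from Terraform's JSON plan representation, comparing each entry's `before` and `after` values. The sample data is hand-written and hypothetical.

```python
def drifted_attributes(plan):
    """Map each drifted resource address to the top-level attributes that changed."""
    report = {}
    for rc in plan.get("resource_drift", []):
        change = rc.get("change", {})
        before = change.get("before") or {}
        after = change.get("after") or {}
        changed = sorted(
            key for key in before.keys() | after.keys()
            if before.get(key) != after.get(key)
        )
        report[rc["address"]] = changed
    return report


# Hand-written sample: someone resized the instance outside Terraform (hypothetical data).
SAMPLE_DRIFT = {
    "resource_drift": [
        {
            "address": "aws_instance.app_server",
            "change": {
                "actions": ["update"],
                "before": {"instance_type": "t3.micro", "tags": {"Name": "app"}},
                "after": {"instance_type": "t3.large", "tags": {"Name": "app"}},
            },
        }
    ]
}
```

For the sample, the report names `instance_type` as the only drifted attribute of `aws_instance.app_server`.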

Accidental manual change:

  • Cause: Human error - wrong account, incorrect command
  • Prevention: Restrict production access, enforce CI/CD pipelines, create policies around manual changes
  • Fix: Standard terraform plan and apply - typically the easiest to resolve

Intentional manual change:

  • Cause: Emergency fix applied outside Terraform (e.g., on-call response)
  • Risk: The next Terraform run will undo the emergency fix
  • Prevention: Build a culture and pipeline where even emergency changes go through Terraform
  • Fix: Update Terraform code to reflect the manual change before running Terraform again

Expected external change:

  • Cause: External systems alter infrastructure as expected - new AMIs published, autoscaling changes instance counts, cloud provider performs minor version upgrades

Resolution options:

| Option | When to use |
| --- | --- |
| Apply | Let Terraform update resources (e.g., redeploy to the newest image) |
| Ignore | Use `lifecycle { ignore_changes = [tags, desired_count] }` for attributes managed externally |
| Sync | Run a refresh-only plan to update the state without code changes |

Interrupted or failed run:

  • Cause: Terraform crashes, the host fails, backend auth expires mid-run, or a state write is corrupted
  • Risk: Terraform may have created resources but failed to record them - the next run creates duplicates
  • Fix: Review logs to identify what was created; either terraform import the orphaned resources or manually delete them so Terraform can recreate cleanly. In extreme cases, restore from a backend backup
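Finding the orphans is a set difference between what exists in the cloud and what the state records. The sketch below assumes the version 4 state layout, in which each entry under `resources` has `instances` whose `attributes` hold an `id`. The list of cloud IDs would come from logs or a provider inventory; here it is hypothetical sample data.

```python
def state_ids(state):
    """Collect the `id` attribute of every resource instance recorded in state."""
    return {
        instance["attributes"]["id"]
        for resource in state.get("resources", [])
        for instance in resource.get("instances", [])
        if "id" in instance.get("attributes", {})
    }


def orphaned(cloud_ids, state):
    """Return IDs that exist in the cloud but are missing from state."""
    return sorted(set(cloud_ids) - state_ids(state))


# Hypothetical, trimmed state pulled after an interrupted run.
SAMPLE_STATE = {
    "version": 4,
    "resources": [
        {
            "type": "aws_instance",
            "name": "app_server",
            "instances": [{"attributes": {"id": "i-0aaa"}}],
        }
    ],
}
```

Each ID returned by `orphaned` is a candidate for `terraform import` or manual deletion.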

Restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code. Use your routine automated deployment processes for disaster recovery - every routine update becomes a rehearsal, keeping the team comfortable with the tools so recovery is muscle memory rather than panic.