Refactoring IaC

Refactoring is the practice of restructuring code to make it easier to maintain and extend - without changing its external behaviour. In infrastructure, refactoring carries an extra dimension of risk: applying reorganised code to a live environment can destroy and recreate resources, causing real downtime. This page covers how to refactor Terraform code safely, manage breaking changes, and deliver large changes incrementally.


Neglecting refactoring in favour of feature work leads to technical debt - accumulated complexity that slows teams down and eventually causes bugs or outages. A robust automated test suite acts as the safety net, giving developers the confidence to refactor with far less risk of accidentally introducing breaking changes.

There is a crucial difference between refactoring infrastructure code and refactoring deployed infrastructure resources:

  • An IDE can safely rename a Go function - the compiler handles the rest
  • Splitting a single Terraform networking stack into two may temporarily delete and recreate critical network components, causing system downtime

| Type | Scope | Breaks consumers? | Tests should fail? |
| --- | --- | --- | --- |
| Internal | Changes the internal structure of a module without altering its inputs or outputs | ❌ No | ❌ No |
| External | Modifies existing input variables, outputs, or resource interfaces | ✅ Yes - inherently a breaking change | ✅ Yes - by design |

Internal refactoring is invisible to module consumers and should never cause automated tests to fail.
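
One way to make that guarantee checkable is a contract test that only looks at the module's public interface. The sketch below assumes Terraform 1.6 or later (the native `terraform test` framework) and a hypothetical module that exposes an `instance_name` input and a `name_tag` output; neither name, nor the file path, comes from this page. Because the run uses `command = plan`, nothing is created, although the providers used by the module still need to be configured.

```hcl
# tests/interface.tftest.hcl (hypothetical path and names)

# Inputs the test passes to the module under test.
variables {
  instance_name = "app"
}

# Asserts only on the public interface, so an internal refactor
# (moving resources between files, renaming locals) must not break it.
run "public_interface_is_unchanged" {
  command = plan

  assert {
    condition     = output.name_tag == "app"
    error_message = "The name_tag output changed: this refactor is not purely internal."
  }
}
```

If a test like this fails after a change that was meant to be internal, the change has leaked into the module's interface and should be treated as an external refactoring.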

Terraform determines resource ordering from its dependency graph, not from file layout. It loads every .tf file in a module directory as a single configuration, so you can freely move resources between files in the same module without Terraform detecting any change.

As main.tf grows, split resources into functional files:

modules/web-cluster/
├── main.tf # Core resources
├── networking.tf # VPC, subnets, security groups
├── logging.tf # CloudWatch, log groups
├── load-balancer.tf # ALB, target groups, listeners
├── variables.tf # All input variables
├── outputs.tf # All outputs
└── versions.tf # Required providers and versions

Renaming a resource in code makes Terraform plan to destroy the old resource and create a new one, because it sees the new address as a different resource. The moved block (available since Terraform 1.1) prevents this by telling Terraform to update the state mapping instead:

moved {
  from = aws_instance.web_server
  to   = aws_instance.app
}

resource "aws_instance" "app" {
  # ... unchanged configuration
}

Renaming an input variable is an external change that breaks consumers. You can do it safely with an expand-and-contract pattern that maintains backward compatibility during the transition:

Step 1 - Expand: Add the new variable alongside the old one:

variable "instance_name" { # NEW
  type        = string
  description = "Name tag for the instance."
  default     = null # optional during the transition, so callers that only set server_name keep working
}

variable "server_name" { # OLD - deprecated
  type        = string
  description = "DEPRECATED: Use 'instance_name' instead."
  default     = null
}

Step 2 - Bridge: Create a local that falls back gracefully:

locals {
  # Prefer the deprecated variable when a caller still sets it; otherwise use the new one.
  name = var.server_name != null ? var.server_name : var.instance_name
}

Consumers using the old variable continue to work. New consumers use the new variable.

Step 3 - Contract: In a future major version, remove var.server_name and the bridge logic.
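
While both variables are optional during the transition, a caller could set neither. A guard along the following lines can catch that at plan time. This is a minimal sketch rather than part of the pattern itself: it reuses `local.name` from Step 2, assumes Terraform 1.4 or later (for the built-in `terraform_data` resource), and `name_guard` is a hypothetical name.

```hcl
# Fails the plan when neither instance_name nor the deprecated
# server_name is set. "name_guard" is a hypothetical resource name.
resource "terraform_data" "name_guard" {
  lifecycle {
    precondition {
      condition     = local.name != null
      error_message = "Set instance_name (or, until the next major version, the deprecated server_name)."
    }
  }
}
```

The same precondition could instead sit in the lifecycle block of the resource that consumes `local.name`, which avoids adding an extra resource to the state. Remove the guard together with the bridge logic in the Contract step.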


Managing External Refactoring and Breaking Changes

External refactoring modifies a module’s public interface - inputs, outputs, or resource contracts. It is inherently a breaking change.

| Reason | Example |
| --- | --- |
| Security | Fixing insecure default settings |
| Usability | Standardising inconsistent variable names across modules |
| Upstream changes | Adapting to a new major provider version |
| New functionality | Replacing a workaround with a proper implementation |

Avoid releasing frequent, small breaking changes. Instead, maintain a wishlist of refactoring ideas - tracked via Jira tickets, GitHub Issues, or a markdown file in the repository. Release them all at once during a planned major version upgrade.

Strategic timing: a great moment to ship a major module release is when an upstream provider publishes its own major version. Users already expect breaking changes, so you can ship yours in the same upgrade.

  1. Branch - develop the new major version on a separate Git branch so the main branch stays stable
  2. Document - write an UPGRADE.md or CHANGELOG.md with clear, specific migration instructions (which variables were renamed, which outputs were removed, etc.)
  3. Tag - create a semantic version tag (v3.0.0) and publish

Users won’t upgrade immediately. When a critical bug fix is needed for an older major version:

# Create a branch from the old version tag
git checkout -b v2-maintenance v2.4.1
# Apply the fix and commit it, then tag and release
git tag v2.4.2
git push origin v2-maintenance --tags

This lets legacy users receive patches without being forced to upgrade.
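
On the consumer side, this works because each caller pins the module to a tag. The sketch below assumes the module is consumed from a Git repository; the repository URL and the module call names are placeholders, not values from this page.

```hcl
# A consumer that stays on the v2 line and picks up the patch release.
module "web_cluster_legacy" {
  source = "git::https://example.com/acme/terraform-modules.git//modules/web-cluster?ref=v2.4.2"

  # ... v2 inputs, unchanged
}

# A consumer that has followed UPGRADE.md and moved to the new major version.
module "web_cluster" {
  source = "git::https://example.com/acme/terraform-modules.git//modules/web-cluster?ref=v3.0.0"

  # ... v3 inputs
}
```

Pinning to an exact tag means neither consumer changes version until someone edits the `ref` deliberately.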


Large organisations classify infrastructure changes by risk, disruption, and delivery method:

| Classification | Characteristics | Example |
| --- | --- | --- |
| Exceptional | Infrequent, handled differently each time, requires significant planning | Major OS upgrade |
| Routine | Frequent, follows the same process every time, minimal active thought | Minor OS patch |

The goal of automation is to convert exceptional changes into routine operations. Automation also helps satisfy governance requirements by providing:

  • Automated testing in production-representative environments
  • An automated build progression system that enforces and records all steps
  • An automated deployment process that executes identically every time

Large or complex infrastructure changes should be delivered as a series of small increments - each one independently deployable, testable, and safe. Compared with a single large release, small increments bring these benefits:

  • Significantly easier to plan, implement, test, and debug
  • Failures affect a smaller blast radius
  • Feedback arrives sooner, enabling course correction

Building a system one component at a time (networking → storage → compute) means you can’t meaningfully test end-to-end until everything is complete. If components don’t integrate well, you discover it last.

Instead, start with a walking skeleton: a bare-bones iteration that wires together all layers, even if each layer is minimal. A walking skeleton lets you:

  • Validate the build, test, configure, and deploy workflow end-to-end
  • Discover integration issues immediately
  • Flesh out capabilities incrementally on a proven foundation

Build the delivery pipeline alongside the system. The initial “tracer bullet” pipeline is simple - it grows in sophistication with each increment.

An increment is a small change that is part of a larger intended change. Each increment must:

  • Be deployable to production
  • Leave the system in a workable state
  • Pass all automated tests
  • Meet current operability requirements

When a larger change requires multiple increments before it’s usable:

| Technique | How it works | Trade-off |
| --- | --- | --- |
| Feature branches | Develop in isolation; merge only when complete | Delays feedback; risk accumulates until the merge |
| Feature toggles | Deploy the code everywhere but activate only in specific environments | Avoids branches; requires toggle management |
| Dark launching | Deploy to production without routing live traffic | Tests with real data and integrations; doesn’t affect users |
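
In Terraform, a feature toggle is commonly expressed as a boolean variable that drives `count`. The following is a minimal sketch under assumed names: the variable `enable_audit_logs`, the log group, and its name are all hypothetical, and the AWS provider is assumed to be configured elsewhere.

```hcl
# Feature toggle: off by default, switched on per environment.
variable "enable_audit_logs" {
  type        = bool
  default     = false
  description = "Feature toggle for the (hypothetical) audit log group."
}

# The resource exists only where the toggle is on.
resource "aws_cloudwatch_log_group" "audit" {
  count = var.enable_audit_logs ? 1 : 0
  name  = "/web-cluster/audit"
}

# one() returns the single ARN when the toggle is on, and null when it is off.
output "audit_log_group_arn" {
  value = one(aws_cloudwatch_log_group.audit[*].arn)
}
```

An environment opts in by setting `enable_audit_logs = true` in its own variable file, so the same code is deployed everywhere while the feature is active only where intended. Once the feature is fully rolled out, removing the toggle is itself a small increment.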

When you refactor Terraform code (e.g., splitting a stack or renaming modules), the underlying resources must be remapped in state - otherwise Terraform will destroy and recreate them.

Use native moved blocks to remap resources within the same state:

moved {
  from = module.network.aws_subnet.public
  to   = module.public_network.aws_subnet.main
}

Pulumi uses aliases for the same purpose.
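
A moved block can also point at a whole module call, which covers the "renaming modules" case without listing every resource inside it. This sketch assumes the module call is renamed from `network` to `public_network` while its source stays the same; the source path is hypothetical.

```hcl
# Remap every resource under the old module address to the new one.
moved {
  from = module.network
  to   = module.public_network
}

module "public_network" {
  source = "./modules/network" # hypothetical path, unchanged by the rename

  # ... unchanged inputs
}
```

After the change has been applied in every environment that uses the configuration, the moved block can be deleted in a later increment.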

When breaking a large stack into multiple smaller stacks, resources need to move between state files, which moved blocks cannot do. Tools like tfmigrate can script these cross-state moves, though they require additional pipeline orchestration.
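
As one possible alternative to an external tool, recent Terraform versions can express a cross-state move declaratively: the source stack forgets the resource without destroying it, and the destination stack imports it. The sketch below assumes Terraform 1.7 or later (`removed` blocks; `import` blocks need 1.5) and uses a hypothetical S3 bucket named `acme-logs`. The two halves belong to different stacks and are shown together only for comparison.

```hcl
# --- Source stack: delete the resource block and add this instead ---
# Terraform drops the bucket from this state but leaves it running.
removed {
  from = aws_s3_bucket.logs

  lifecycle {
    destroy = false
  }
}

# --- Destination stack: adopt the existing bucket ---
import {
  to = aws_s3_bucket.logs
  id = "acme-logs" # hypothetical bucket name; S3 buckets are imported by name
}

resource "aws_s3_bucket" "logs" {
  bucket = "acme-logs"
}
```

Each half runs as an ordinary plan and apply in its own pipeline. Review both plans before applying: the source plan should show the bucket being forgotten rather than destroyed, and the destination plan should show an import with no replacement.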

Editing state files directly by removing a resource from one state and adding it to another is known as infrastructure surgery. It is:

  • Highly error-prone
  • Aptly nicknamed: one slip can leave live resources orphaned or queued for destruction
  • Only acceptable in extreme situations, with full backups of both the state files and the data

The Expand-and-Contract Pattern for Resources

For tools without state files, or as a safer alternative to state remapping, use the three-phase expand-and-contract pattern:

| Phase | Action | Risk |
| --- | --- | --- |
| 1 · Expand | Deploy the new resource alongside the old one | Minimal - the new resource isn’t used yet |
| 2 · Migrate | Route traffic/usage from old to new | Low - the old resource is still available for rollback |
| 3 · Contract | Remove the old resource | Low - the new resource is proven in production |

Each phase is a standard, testable pipeline deployment. No state file edits required.
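
The three phases can be sketched in Terraform with a single variable that is advanced one pipeline run at a time. Everything named here is an assumption for illustration: the `migration_phase` variable, the two SQS queues, and their names do not come from this page, and the AWS provider is assumed to be configured elsewhere.

```hcl
variable "migration_phase" {
  type        = string
  default     = "expand"
  description = "Current phase of the (hypothetical) queue migration."

  validation {
    condition     = contains(["expand", "migrate", "contract"], var.migration_phase)
    error_message = "migration_phase must be one of: expand, migrate, contract."
  }
}

# Old resource: kept through expand and migrate, removed in contract.
resource "aws_sqs_queue" "old" {
  count = var.migration_phase == "contract" ? 0 : 1
  name  = "orders-v1"
}

# New resource: deployed alongside the old one from the expand phase onwards.
resource "aws_sqs_queue" "new" {
  name = "orders-v2"
}

locals {
  # Consumers stay on the old queue during expand and switch to the new one from migrate.
  active_queue_url = var.migration_phase == "expand" ? one(aws_sqs_queue.old[*].url) : aws_sqs_queue.new.url
}

output "queue_url" {
  value = local.active_queue_url
}
```

Rolling back from migrate to expand is a one-line variable change while the old queue still exists. After contract, the variable and the old resource block can be deleted in a final clean-up increment.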


Many organisations treat disaster recovery as a rare, specialised activity. But restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code.

Use your routine automated deployment processes for disaster recovery. Every routine update becomes a rehearsal - the team stays comfortable with the tools and processes, so when a real crisis hits, recovery is muscle memory rather than panic.