Refactoring IaC

Refactoring is the practice of restructuring code to make it easier to maintain and extend - without changing its external behaviour. In infrastructure, refactoring carries an extra dimension of risk: applying reorganised code to a live environment can destroy and recreate resources, causing real downtime. This page covers how to refactor Terraform code safely, manage breaking changes, and deliver large changes incrementally.

Why Refactor

Neglecting refactoring in favour of feature work leads to technical debt - accumulated complexity that slows teams down and eventually causes bugs or outages. A robust automated test suite acts as the safety net: it gives developers the confidence to refactor, because an accidental breaking change shows up as a failing test rather than as a production incident.

The Refactoring Paradox in IaC

There is a crucial difference between refactoring infrastructure code and refactoring deployed infrastructure resources:

Renaming a Go function in an IDE is safe - the tooling updates the call sites, the compiler catches anything it missed, and nothing running changes until you deploy
Splitting a single Terraform networking stack into two may temporarily delete and recreate critical network components, causing system downtime

Internal vs. External Refactoring

Type	Scope	Breaks consumers?	Tests should fail?
Internal	Changes the internal structure of a module without altering its inputs or outputs	❌ No	❌ No
External	Modifies existing input variables, outputs, or resource interfaces	✅ Yes - inherently a breaking change	✅ Yes - by design

Internal Refactoring Strategies

Internal refactoring is invisible to module consumers and should never cause automated tests to fail.
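One way to hold that line, sketched here under assumptions, is a test that asserts only on the module's interface. The example uses the native terraform test framework (Terraform 1.6 and later); the file name, the instance_name input, and an instance_name output that echoes the resolved name are all hypothetical.

```hcl
# tests/interface.tftest.hcl (hypothetical file name)
# Asserts on outputs only, so moving or renaming resources inside the
# module should not change the result.

run "interface_is_stable" {
  # Plan only: no real infrastructure is created for this check.
  command = plan

  variables {
    instance_name = "demo" # hypothetical input variable
  }

  assert {
    # Assumes the module declares an output named "instance_name"
    # whose value is already known at plan time.
    condition     = output.instance_name == "demo"
    error_message = "The instance_name output changed; consumers would break."
  }
}
```

Running terraform test from the module directory before and after an internal refactoring should give the same result. If it does not, the change was not purely internal.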

Reorganising Code

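Terraform determines resource ordering from its dependency graph, not from file layout. It loads every .tf file in a module directory as a single configuration, so you can freely move resources between files in the same directory without Terraform detecting any change. Moving a resource into a different module directory is not the same thing: that changes its address.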

As main.tf grows, split resources into functional files:

modules/web-cluster/
├── main.tf          # Core resources
├── networking.tf    # VPC, subnets, security groups
├── logging.tf       # CloudWatch, log groups
├── load-balancer.tf # ALB, target groups, listeners
├── variables.tf     # All input variables
├── outputs.tf       # All outputs
└── versions.tf      # Required providers and versions

Renaming Resources with `moved`

Renaming a resource block changes its address, so Terraform plans to destroy the resource at the old address and create a new one at the new address. A moved block prevents this by telling Terraform to update the address recorded in state instead:

moved {
  from = aws_instance.web_server
  to   = aws_instance.app
}

resource "aws_instance" "app" {
  # ... unchanged configuration
}
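moved blocks are only understood by Terraform 1.1 and later, so a module that ships one may want to say so in versions.tf. A minimal sketch; the AWS provider constraint is an illustrative assumption, not a recommendation:

```hcl
terraform {
  # moved blocks were introduced in Terraform 1.1.
  required_version = ">= 1.1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0" # assumed; use whatever the module is already tested against
    }
  }
}
```

The moved block is best kept in place until every consumer has applied the rename. Removing it earlier would bring back the destroy-and-create plan for anyone whose state still holds the old address.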

Renaming Variables (Expand and Contract)

Renaming an input variable is strictly an external change, because it breaks consumers. It is covered here because an expand-and-contract pattern lets you do it safely, maintaining backward compatibility during the transition:

Step 1 - Expand: Add the new variable alongside the old one:

variable "instance_name" {                    # NEW
  type        = string
  description = "Name tag for the instance."
  default     = null                          # optional until server_name is removed
}

variable "server_name" {                      # OLD - deprecated
  type        = string
  description = "DEPRECATED: Use 'instance_name' instead."
  default     = null
}

Step 2 - Bridge: Create a local that prefers the new variable and falls back to the old one:

locals {
  # Prefer the new variable; fall back to the deprecated one.
  name = var.instance_name != null ? var.instance_name : var.server_name
}

Consumers using the old variable continue to work. New consumers use the new variable.

Step 3 - Contract: In a future major version, remove var.server_name and the bridge logic.
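During the window between Step 2 and Step 3, it can help to tell consumers that they are still on the deprecated input. One option, assuming Terraform 1.5 or later, is a check block: a failed assertion is reported as a warning and does not block the plan or apply. The block name is hypothetical.

```hcl
# Warns (without failing the run) whenever the deprecated variable is still set.
check "server_name_deprecated" {
  assert {
    condition     = var.server_name == null
    error_message = "var.server_name is deprecated and will be removed in the next major version; set var.instance_name instead."
  }
}
```

The check is removed together with var.server_name in the contract step.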

Managing External Refactoring and Breaking Changes

External refactoring modifies a module’s public interface - inputs, outputs, or resource contracts. It is inherently a breaking change.

Valid Reasons to Break Compatibility

Reason	Example
Security	Fixing insecure default settings
Usability	Standardising inconsistent variable names across modules
Upstream changes	Adapting to a new major provider version
New functionality	Replacing a workaround with a proper implementation

Batching Changes via a Wishlist

Avoid releasing frequent, small breaking changes. Instead, maintain a wishlist of refactoring ideas - tracked via Jira tickets, GitHub Issues, or a markdown file in the repository. Release them all at once during a planned major version upgrade.

Strategic timing: a great moment to ship a major module release is when an upstream provider publishes its own major version. Users already expect breaking changes, and you can ship yours alongside.

Executing the Major Release

Branch - develop the new major version on a separate Git branch so the main branch stays stable
Document - write an UPGRADE.md or CHANGELOG.md with clear, specific migration instructions (which variables were renamed, which outputs were removed, etc.)
Tag - create a semantic version tag (v3.0.0) and publish

Maintaining Legacy Versions

Users won’t upgrade immediately. When a critical bug fix is needed for an older major version:

# Create a maintenance branch from the old version tag
git checkout -b v2-maintenance v2.4.1

# Apply the fix and commit it, then tag and release
git commit -am "Fix critical bug"
git tag v2.4.2
git push origin v2-maintenance v2.4.2

This lets legacy users receive patches without being forced to upgrade.
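On the consumer side, this only works if module calls are pinned to a version. A sketch using a Git source with a ref; the repository URL, module path, and input are placeholders:

```hcl
module "web_cluster" {
  # Pinned to the patched v2 release. Moving to v3 is a deliberate edit,
  # made after reading the upgrade notes.
  source = "git::https://example.com/org/terraform-modules.git//modules/web-cluster?ref=v2.4.2"

  instance_name = "web" # hypothetical input
}
```

Picking up the patch is then a one-line change from ref=v2.4.1 to ref=v2.4.2, followed by terraform init -upgrade.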

Making Change Routine

Large organisations classify infrastructure changes by risk, disruption, and delivery method:

Classification	Characteristics	Example
Exceptional	Infrequent, handled differently each time, requires significant planning	Major OS upgrade
Routine	Frequent, follows the same process every time, minimal active thought	Minor OS patch

The goal of automation is to convert exceptional changes into routine operations. Automation also helps satisfy governance requirements by providing:

Automated testing in production-representative environments
An automated build progression system that enforces and records all steps
An automated deployment process that executes identically every time

Changing a System Incrementally

Large or complex infrastructure changes should be delivered as a series of small increments - each one independently deployable, testable, and safe.

Why Incrementalism Wins

Significantly easier to plan, implement, test, and debug
Failures affect a smaller blast radius
Feedback arrives sooner, enabling course correction

Avoiding the Component-by-Component Trap

Building a system one component at a time (networking → storage → compute) means you can’t meaningfully test end-to-end until everything is complete. If components don’t integrate well, you discover it last.

Walking Skeleton

Start with a bare-bones iteration that wires together all layers - even if each layer is minimal. A walking skeleton lets you:

Validate the build, test, configure, and deploy workflow end-to-end
Discover integration issues immediately
Flesh out capabilities incrementally on a proven foundation

Tracer Bullet Pipeline

Build the delivery pipeline alongside the system. The initial “tracer bullet” pipeline is simple - it grows in sophistication with each increment.

Defining an Increment

An increment is a small change that is part of a larger intended change. Each increment must:

Be deployable to production
Leave the system in a workable state
Pass all automated tests
Meet current operability requirements

Handling Incomplete Changes

When a larger change requires multiple increments before it’s usable:

Technique	How it works	Trade-off
Feature branches	Develop in isolation; merge only when complete	Delays feedback; risk accumulates until the merge
Feature toggles	Deploy the code everywhere but activate only in specific environments	Avoids branches; requires toggle management
Dark launching	Deploy to production without routing live traffic	Tests with real data and integrations; doesn’t affect users
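In Terraform, a feature toggle is often just a boolean variable driving count. A minimal sketch with hypothetical names: the new log group exists only where the toggle is switched on, so the code can merge to the main branch before the larger change is complete.

```hcl
variable "enable_structured_logging" {
  type        = bool
  description = "Toggle for the in-progress logging change (hypothetical)."
  default     = false # off everywhere unless an environment opts in
}

resource "aws_cloudwatch_log_group" "structured" {
  # count = 0 means the resource is not created at all.
  count = var.enable_structured_logging ? 1 : 0

  name              = "/web-cluster/structured"
  retention_in_days = 30
}
```

A test environment would set enable_structured_logging = true in its variable file while production keeps the default. Once the change is live everywhere, the toggle itself should be deleted so it does not linger as dead configuration.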

Safely Remapping Live Infrastructure

When you refactor Terraform code (e.g., splitting a stack or renaming modules), the underlying resources must be remapped in state - otherwise Terraform will destroy and recreate them.

Declarative Remapping (Preferred)

Use native moved blocks to remap resources within the same state:

moved {
  from = module.network.aws_subnet.public
  to   = module.public_network.aws_subnet.main
}

Pulumi uses aliases for the same purpose.
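In Terraform, a moved block can also cover a whole module call, which is handy when only the module's name changes and its contents stay the same. A sketch of that alternative, using the same module names; the source path is hypothetical:

```hcl
# Every resource under module.network is re-addressed in state to the
# same relative address under module.public_network.
moved {
  from = module.network
  to   = module.public_network
}

module "public_network" {
  source = "./modules/network" # hypothetical path
}
```

The plan for such a change should list the affected resources as moved, with nothing to add or destroy. Anything else in the plan is a signal to stop and investigate.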

Cross-State Remapping

When breaking a large stack into multiple smaller stacks, resources need to move between state files. A moved block cannot do this, because it only works within a single state. Tools like tfmigrate can script these cross-state moves, though they require additional pipeline orchestration.
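Recent Terraform versions can also express a cross-state move declaratively, without extra tooling: a removed block (Terraform 1.7 and later) makes the old stack forget a resource without destroying it, and an import block (Terraform 1.5 and later) makes the new stack adopt it. A sketch under stated assumptions; the resource name, the variables, and the subnet ID are hypothetical, and the two halves live in different root modules.

```hcl
# --- In the ORIGINAL stack: delete the resource block and add this ---
removed {
  from = aws_subnet.public

  lifecycle {
    destroy = false # drop it from state, leave the real subnet alone
  }
}

# --- In the NEW stack ---
variable "vpc_id" {
  type = string
}

variable "cidr_block" {
  type = string
}

import {
  to = aws_subnet.public
  id = "subnet-0123456789abcdef0" # placeholder ID of the existing subnet
}

resource "aws_subnet" "public" {
  vpc_id     = var.vpc_id     # must match the existing subnet
  cidr_block = var.cidr_block # must match the existing subnet
}
```

Applying the original stack first and the new stack second avoids having two states manage the same subnet. The trade-off is a short window in which the subnet is unmanaged, so the two applies are best run back to back, and the new stack's plan should show an import with no replacement.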

Manual State Surgery (Last Resort)

Editing state files directly, by removing a resource from one state and adding it to another, is known as infrastructure surgery. The nickname is earned. It is:

Highly error-prone - a single mistake can leave a live resource unmanaged or planned for destruction
Only acceptable in extreme situations with full data backups

The Expand-and-Contract Pattern for Resources

For tools without state files, or as a safer alternative to state remapping, use the three-phase expand-and-contract pattern:

Phase	Action	Risk
1 · Expand	Deploy the new resource alongside the old one	Minimal - the new resource isn’t used yet
2 · Migrate	Route traffic/usage from old to new	Low - the old resource is still available for rollback
3 · Contract	Remove the old resource	Low - the new resource is proven in production

Each phase is a standard, testable pipeline deployment. No state file edits required.
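As an illustration, the three phases might look like this when replacing a load balancer behind a DNS name. This is a sketch, not a prescribed layout: the resource names, the variables, and the web.example.com record are all hypothetical.

```hcl
variable "subnet_ids" {
  type = list(string)
}

variable "zone_id" {
  type = string
}

variable "use_new_lb" {
  type    = bool
  default = false # Phase 1: the new load balancer exists but takes no traffic
}

resource "aws_lb" "old" {
  name               = "web-old"
  load_balancer_type = "application"
  subnets            = var.subnet_ids
}

# Phase 1 - Expand: added alongside the old load balancer.
resource "aws_lb" "new" {
  name               = "web-new"
  load_balancer_type = "application"
  subnets            = var.subnet_ids
}

# Phase 2 - Migrate: set use_new_lb = true to repoint DNS.
# Setting it back to false is the rollback.
resource "aws_route53_record" "web" {
  zone_id = var.zone_id
  name    = "web.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [var.use_new_lb ? aws_lb.new.dns_name : aws_lb.old.dns_name]
}

# Phase 3 - Contract: in a later change, delete aws_lb.old and the
# use_new_lb variable, and point the record at aws_lb.new directly.
```

Each phase is an ordinary plan and apply, so the pipeline treats the migration like any other change.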

Continuous Disaster Recovery

Many organisations treat disaster recovery as a rare, specialised activity. But restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code.

Use your routine automated deployment processes for disaster recovery. Every routine update becomes a rehearsal - the team stays comfortable with the tools and processes, so when a real crisis hits, recovery is muscle memory rather than panic.