Refactoring IaC
Refactoring is the practice of restructuring code to make it easier to maintain and extend - without changing its external behaviour. In infrastructure, refactoring carries an extra dimension of risk: applying reorganised code to a live environment can destroy and recreate resources, causing real downtime. This page covers how to refactor Terraform code safely, manage breaking changes, and deliver large changes incrementally.
Why Refactor
Neglecting refactoring in favour of feature work leads to technical debt - accumulated complexity that slows teams down and eventually causes bugs or outages. A robust automated test suite acts as the safety net, giving developers the confidence to refactor without the risk of accidentally introducing breaking changes.
The Refactoring Paradox in IaC
There is a crucial difference between refactoring infrastructure code and refactoring deployed infrastructure resources:
- An IDE can safely rename a Go function - the compiler handles the rest
- Splitting a single Terraform networking stack into two may temporarily delete and recreate critical network components, causing system downtime
Internal vs. External Refactoring
| Type | Scope | Breaks consumers? | Tests should fail? |
|---|---|---|---|
| Internal | Changes the internal structure of a module without altering its inputs or outputs | ❌ No | ❌ No |
| External | Modifies existing input variables, outputs, or resource interfaces | ✅ Yes - inherently a breaking change | ✅ Yes - by design |
Internal Refactoring Strategies
Internal refactoring is invisible to module consumers and should never cause automated tests to fail.
Reorganising Code
Terraform determines resource ordering from its dependency graph, not from file layout. Within a module directory, you can freely move resources between .tf files without Terraform detecting any change.
As main.tf grows, split resources into functional files:
    modules/web-cluster/
    ├── main.tf           # Core resources
    ├── networking.tf     # VPC, subnets, security groups
    ├── logging.tf        # CloudWatch, log groups
    ├── load-balancer.tf  # ALB, target groups, listeners
    ├── variables.tf      # All input variables
    ├── outputs.tf        # All outputs
    └── versions.tf       # Required providers and versions

Renaming Resources with moved
Renaming a resource in code causes Terraform to destroy the old resource and create a new one. The moved block (available since Terraform 1.1) prevents this by telling Terraform to update the state mapping instead:
    moved {
      from = aws_instance.web_server
      to   = aws_instance.app
    }

    resource "aws_instance" "app" {
      # ... unchanged configuration
    }

Renaming Variables (Expand and Contract)
Renaming an input variable is an external change that breaks consumers. You can do it safely with an expand-and-contract pattern that maintains backward compatibility during the transition:
Step 1 - Expand: Add the new variable alongside the old one:
    variable "instance_name" { # NEW
      type        = string
      description = "Name tag for the instance."
      default     = null # Optional during the transition, so callers that only set server_name still work
    }

    variable "server_name" { # OLD - deprecated
      type        = string
      description = "DEPRECATED: Use 'instance_name' instead."
      default     = null
    }

Step 2 - Bridge: Create a local that falls back gracefully:

    locals {
      # Use the deprecated variable while callers still set it; otherwise use the new one.
      name = var.server_name != null ? var.server_name : var.instance_name
    }

Consumers using the old variable continue to work. New consumers use the new variable.
Step 3 - Contract: In a future major version, remove var.server_name and the bridge logic.
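From the consumer's side, the transition might look like the sketch below. The module path and both module labels are hypothetical, and the sketch assumes both variables are optional in the module while the transition is in progress.

```hcl
# Existing caller: keeps working unchanged while server_name is still accepted.
module "legacy_web" {
  source      = "../modules/web-cluster" # hypothetical path
  server_name = "web-01"
}

# Migrated caller: switches to the new variable ahead of the next major version.
module "migrated_web" {
  source        = "../modules/web-cluster" # hypothetical path
  instance_name = "web-02"
}
```

Once every caller resembles the second block, removing the old variable in the major release should no longer break anyone.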
Managing External Refactoring and Breaking Changes
External refactoring modifies a module’s public interface - inputs, outputs, or resource contracts. It is inherently a breaking change.
Valid Reasons to Break Compatibility
| Reason | Example |
|---|---|
| Security | Fixing insecure default settings |
| Usability | Standardising inconsistent variable names across modules |
| Upstream changes | Adapting to a new major provider version |
| New functionality | Replacing a workaround with a proper implementation |
Batching Changes via a Wishlist
Avoid releasing frequent, small breaking changes. Instead, maintain a wishlist of refactoring ideas - tracked via Jira tickets, GitHub Issues, or a markdown file in the repository. Release them all at once during a planned major version upgrade.
Strategic timing: a great moment to ship a major module release is when an upstream provider publishes its own major version. Users already expect breaking changes, and you can cascade yours alongside.
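One way to line a module release up with an upstream major version is to state the supported provider range explicitly. A minimal sketch of a versions.tf follows; the version numbers are illustrative assumptions, not recommendations.

```hcl
terraform {
  required_version = ">= 1.1" # assumed minimum; moved blocks need Terraform 1.1 or later

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0" # accept any 5.x release, reject the next major
    }
  }
}
```

Raising the lower bound of this range is itself a breaking change for consumers on the older provider, which is why it fits naturally into a major module release.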
Executing the Major Release
- Branch - develop the new major version on a separate Git branch so the main branch stays stable
- Document - write an UPGRADE.md or CHANGELOG.md with clear, specific migration instructions (which variables were renamed, which outputs were removed, etc.)
- Tag - create a semantic version tag (v3.0.0) and publish
Maintaining Legacy Versions
Users won’t upgrade immediately. When a critical bug fix is needed for an older major version:
    # Create a branch from the old version tag
    git checkout -b v2-maintenance v2.4.1

    # Apply and commit the fix, then tag and release
    git tag v2.4.2
    git push origin v2-maintenance --tags

This lets legacy users receive patches without being forced to upgrade.
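Consumers pick up these releases by pinning the tag they want. A sketch for a Git-hosted module, where the repository URL and module labels are placeholders:

```hcl
# Legacy consumer: stays on the v2 line and receives the patch release.
module "web_cluster_legacy" {
  source = "git::https://example.com/platform/terraform-modules.git//modules/web-cluster?ref=v2.4.2"
}

# Upgraded consumer: opts in to the new major version after following the migration notes.
module "web_cluster" {
  source = "git::https://example.com/platform/terraform-modules.git//modules/web-cluster?ref=v3.0.0"
}
```

Pinning to an exact tag means a consumer only moves between versions through a deliberate, reviewable change to the ref.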
Making Change Routine
Large organisations classify infrastructure changes by risk, disruption, and delivery method:
| Classification | Characteristics | Example |
|---|---|---|
| Exceptional | Infrequent, handled differently each time, requires significant planning | Major OS upgrade |
| Routine | Frequent, follows the same process every time, minimal active thought | Minor OS patch |
The goal of automation is to convert exceptional changes into routine operations. Automation directly satisfies governance requirements by providing:
- Automated testing in production-representative environments
- An automated build progression system that enforces and records all steps
- An automated deployment process that executes identically every time
Changing a System Incrementally
Large or complex infrastructure changes should be delivered as a series of small increments - each one independently deployable, testable, and safe.
Why Incrementalism Wins
- Significantly easier to plan, implement, test, and debug
- Failures affect a smaller blast radius
- Feedback arrives sooner, enabling course correction
Avoiding the Component-by-Component Trap
Building a system one component at a time (networking → storage → compute) means you can’t meaningfully test end-to-end until everything is complete. If components don’t integrate well, you discover it last.
Walking Skeleton
Start with a bare-bones iteration that wires together all layers - even if each layer is minimal. A walking skeleton lets you:
- Validate the build, test, configure, and deploy workflow end-to-end
- Discover integration issues immediately
- Flesh out capabilities incrementally on a proven foundation
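As a rough illustration, a walking skeleton in Terraform could be a root module that wires every layer together with the smallest useful configuration. All module paths, inputs, and outputs below are hypothetical.

```hcl
# Networking layer: a single network, nothing more.
module "network" {
  source     = "./modules/network" # hypothetical module
  cidr_block = "10.0.0.0/16"
}

# Storage layer: one bucket with default settings.
module "storage" {
  source = "./modules/storage" # hypothetical module
}

# Compute layer: one instance, already wired to the layers above.
module "compute" {
  source         = "./modules/compute" # hypothetical module
  subnet_id      = module.network.subnet_id
  bucket_name    = module.storage.bucket_name
  instance_count = 1
}

# A single end-to-end output that tests can probe from day one.
output "service_url" {
  value = module.compute.service_url
}
```

Each layer is deliberately thin; the point is that the references between modules are exercised by the pipeline from the first increment.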
Tracer Bullet Pipeline
Build the delivery pipeline alongside the system. The initial “tracer bullet” pipeline is simple - it grows in sophistication with each increment.
Defining an Increment
An increment is a small change that is part of a larger intended change. Each increment must:
- Be deployable to production
- Leave the system in a workable state
- Pass all automated tests
- Meet current operability requirements
Handling Incomplete Changes
When a larger change requires multiple increments before it’s usable:
| Technique | How it works | Trade-off |
|---|---|---|
| Feature branches | Develop in isolation; merge only when complete | Delays feedback; risk accumulates until the merge |
| Feature toggles | Deploy the code everywhere but activate only in specific environments | Avoids branches; requires toggle management |
| Dark launching | Deploy to production without routing live traffic | Tests with real data and integrations; doesn’t affect users |
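A feature toggle from the table above can be sketched in Terraform with a boolean variable and count. The variable, resource, and queue names are assumptions made for illustration.

```hcl
variable "enable_events_v2" {
  type        = bool
  description = "Feature toggle: create the new queue only where this is true."
  default     = false
}

# The code ships to every environment, but the resource exists only where the toggle is on.
resource "aws_sqs_queue" "events_v2" {
  count = var.enable_events_v2 ? 1 : 0
  name  = "events-v2"
}

output "events_v2_queue_url" {
  description = "URL of the new queue, or null while the toggle is off."
  value       = one(aws_sqs_queue.events_v2[*].url)
}
```

Setting the variable to true in a test environment's variable file activates the resource there while other environments stay unchanged. The toggle should be removed once the change is complete, which is the management cost the table refers to.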
Safely Remapping Live Infrastructure
When you refactor Terraform code (e.g., splitting a stack or renaming modules), the underlying resources must be remapped in state - otherwise Terraform will destroy and recreate them.
Declarative Remapping (Preferred)
Use native moved blocks to remap resources within the same state:
    moved {
      from = module.network.aws_subnet.public
      to   = module.public_network.aws_subnet.main
    }

Pulumi uses aliases for the same purpose.
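The same mechanism covers other common refactors within one state. Two further sketches, using hypothetical addresses:

```hcl
# Renaming a whole module call: every resource inside it follows the move.
moved {
  from = module.network
  to   = module.core_network
}

# Adding for_each to a resource that previously had a single instance:
# map the old address onto one of the new instance keys.
moved {
  from = aws_subnet.public
  to   = aws_subnet.public["a"]
}
```

Running terraform plan after adding a moved block is a cheap check: the plan should report the address change and propose no destroy or create actions for the affected resources.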
Cross-State Remapping
When breaking a large stack into multiple smaller stacks, resources need to move between state files. Tools like tfmigrate can script these cross-state moves, though they require additional pipeline orchestration.
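As an alternative sketch (not the tfmigrate workflow), newer Terraform versions can express a cross-state move declaratively: a removed block (Terraform 1.7 or later) drops a resource from the old state without destroying it, and an import block (Terraform 1.5 or later) adopts it into the new state. The two parts below belong to different stacks, and the resource address and VPC ID are placeholders.

```hcl
# --- Old stack: delete the resource block, then add this to stop managing the VPC ---
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false # forget the resource in state, leave the real VPC running
  }
}

# --- New stack (separate directory and state): adopt the existing VPC ---
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # should match the real VPC so the plan shows no changes
}
```

Applying the old stack first and the new stack second keeps the resource from being managed by two states at once, but it still takes two coordinated pipeline runs.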
Manual State Surgery (Last Resort)
Editing state files directly by removing a resource from one state and adding it to another is known as infrastructure surgery. It is:
- Highly error-prone
- Aptly nicknamed: one wrong edit can leave live resources orphaned or queued for destruction
- Only acceptable in extreme situations with full data backups
The Expand-and-Contract Pattern for Resources
For tools without state files, or as a safer alternative to state remapping, use the three-phase expand-and-contract pattern:
| Phase | Action | Risk |
|---|---|---|
| 1 · Expand | Deploy the new resource alongside the old one | Minimal - the new resource isn’t used yet |
| 2 · Migrate | Route traffic/usage from old to new | Low - the old resource is still available for rollback |
| 3 · Contract | Remove the old resource | Low - the new resource is proven in production |
Each phase is a standard, testable pipeline deployment. No state file edits required.
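The three phases might be sketched as follows for a storage bucket whose name consumers look up through a parameter. Bucket names, the parameter path, and the switch variable are all assumptions made for this example.

```hcl
variable "use_new_bucket" {
  type    = bool
  default = false # flip to true in phase 2
}

# The existing resource, untouched during phases 1 and 2.
resource "aws_s3_bucket" "assets_old" {
  bucket = "example-assets-old"
}

# Phase 1 - Expand: the new bucket is deployed alongside the old one, unused.
resource "aws_s3_bucket" "assets_new" {
  bucket = "example-assets-new"
}

# Phase 2 - Migrate: consumers read this parameter, so flipping the variable
# reroutes them, and flipping it back is the rollback.
resource "aws_ssm_parameter" "assets_bucket" {
  name  = "/app/assets-bucket"
  type  = "String"
  value = var.use_new_bucket ? aws_s3_bucket.assets_new.bucket : aws_s3_bucket.assets_old.bucket
}

# Phase 3 - Contract: in a later change, delete assets_old and use_new_bucket,
# and point the parameter directly at assets_new.
```

Copying existing data between the buckets is outside this sketch and would need its own step during the migrate phase.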
Continuous Disaster Recovery
Many organisations treat disaster recovery as a rare, specialised activity. But restoring failed infrastructure is virtually identical to deploying a new environment with infrastructure code.
Use your routine automated deployment processes for disaster recovery. Every routine update becomes a rehearsal - the team stays comfortable with the tools and processes, so when a real crisis hits, recovery is muscle memory rather than panic.
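One way to picture this is a root module that deploys the same module to a second region through a provider alias. The regions, alias, and module labels are illustrative, and the sketch assumes the module needs no required inputs.

```hcl
provider "aws" {
  region = "eu-west-1" # primary region (illustrative)
}

provider "aws" {
  alias  = "recovery"
  region = "eu-central-1" # recovery region (illustrative)
}

# Routine deployment, exercised on every change.
module "web_cluster" {
  source = "./modules/web-cluster"
}

# Recovery deployment: same code and same pipeline, different provider configuration.
module "web_cluster_recovery" {
  source = "./modules/web-cluster"

  providers = {
    aws = aws.recovery
  }
}
```

Because both module calls share one source, every routine release also proves that the recovery deployment still plans and applies cleanly.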