
IaC & CI/CD

Infrastructure as Code reaches its full potential only when changes flow through an automated pipeline - built, tested, and delivered with the same rigour as application software. This page covers the principles, workflows, and tooling that make that possible.


CD for infrastructure builds on a set of foundational rules:

| Principle | What it means in practice |
| --- | --- |
| Automate the full process | Pipelines orchestrate every step without human intervention; tests, monitoring, and developer setups are all codified |
| Use only the automated process | No manual fixes in staging or production - every fix goes through the pipeline from the start |
| Keep environments consistent | Use reusable stacks and effective workflows to prevent configuration drift between dev, test, and production |
| Deliver changes comprehensively | Apply code to all relevant infrastructure within a short timeframe; measure time to reach the last system, not just the first |
| Keep delivery cycles short | If automation is slow, developers bypass it; optimise continuously so engineers prefer the pipeline over manual changes |
| Keep all code production-ready | Validate incrementally rather than batching testing to the end of a sprint |
| Ensure code and deployed resources match | Use control loops (GitOps, Puppet, Chef) to continuously reconcile the codebase with live infrastructure |
| Minimise disruption | Track downtime metrics; deliver small, frequent, incremental changes instead of large, infrequent batches |
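
The "Ensure code and deployed resources match" principle can be illustrated with a minimal control-loop sketch. It is not how GitOps controllers, Puppet, or Chef are implemented; the resource names, the dictionary shape, and the `auto_correct` flag are all hypothetical and only show the idea of comparing desired state with actual state.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compare resources declared in code with resources found on the platform."""
    return {
        "create": sorted(n for n in desired if n not in actual),
        "update": sorted(n for n in desired if n in actual and desired[n] != actual[n]),
        "delete": sorted(n for n in actual if n not in desired),
    }


def reconcile(desired: dict, actual: dict, auto_correct: bool = False) -> tuple:
    """Run one pass of the loop and return (resulting_state, alerts)."""
    changes = diff_state(desired, actual)
    if not any(changes.values()):
        return actual, []  # nothing has drifted
    if auto_correct:
        # A real tool would call the platform API here; this sketch only
        # returns what the state would look like after applying the code.
        return {name: dict(attrs) for name, attrs in desired.items()}, []
    # Alert-only mode: report the drift and leave the live state untouched.
    alerts = [f"{action}: {name}" for action, names in changes.items() for name in names]
    return actual, alerts


# Hypothetical example data.
desired = {"web_sg": {"port": 443}, "db": {"size": "small"}}
actual = {"web_sg": {"port": 80}, "legacy_vm": {"size": "large"}}
state, alerts = reconcile(desired, actual)
```

With `auto_correct=False` the function only reports differences, which mirrors teams that prefer alerts over unreviewed automatic changes.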

The workflow processes small, incremental changes through a repeating cycle:

| Stage | What happens |
| --- | --- |
| Development | Developer edits code in a personal workspace (local emulator or cloud sandbox), runs initial tests, then pushes |
| Build | Automated system downloads dependencies and packages code into a versioned, deployable artifact (container image, tagged branch) |
| Test | Build is deployed to a series of test environments - automated checks + manual validations (exploratory testing, code reviews, UAT) |
| Release | After passing all tests, the build is deployed to production (called “release” because the code has already been deployed multiple times to test environments) |
| Run | Infrastructure actively hosts workloads; the versioned build can be reapplied to recover from failures or correct drift |

  • Small, frequent pushes - isolate issues immediately; if an increment breaks the build, the developer can pinpoint the exact change
  • Selective production deployment - not every commit reaches production; teams may batch several passing increments and release once or twice a day
  • Fix forward - if a change fails, the fix must be made in source code and pushed through the pipeline from the beginning, never patched directly in a downstream environment

| Step | Purpose | Output |
| --- | --- | --- |
| Assemble | Gather code and build-time dependencies (modules, plugins, providers) | A build - not specific to any target instance |
| Compile | Add deployment-specific configuration (variables, state config, auth tokens) | A desired-state model for a specific instance |
| Execute | Compare desired state against actual resources via the IaaS API and apply changes | Modified infrastructure |

Build-on-Deploy:

Many tools resolve dependencies during every deployment. If a dependency version changes between deploying to staging and production, the environments silently diverge - creating bugs that are extremely difficult to trace.

Build Once, Deploy Many:

Assemble exactly once. Deploy the same build to every environment, running only the Compile and Execute steps. Lock dependencies by bundling them or using a lock file.
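
A minimal sketch of the build-once idea, assuming a build can be represented as a dictionary of files plus locked dependency versions. The function names, file contents, and configuration keys are hypothetical; a real pipeline would package an artifact and call the deployment tool instead.

```python
import hashlib
import json


def assemble(source_files: dict[str, str], locked_deps: dict[str, str]) -> dict:
    """Assemble step: bundle code and pinned dependencies into one build, once."""
    payload = json.dumps({"files": source_files, "deps": locked_deps}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return {"digest": digest, "files": source_files, "deps": locked_deps}


def compile_for(build: dict, env_config: dict[str, str]) -> dict:
    """Compile step: pair the unchanged build with per-environment values."""
    return {
        "build_digest": build["digest"],
        "deps": build["deps"],
        "config": dict(env_config),
    }


# Assemble exactly once, then reuse the same build for every environment.
build = assemble({"main.tf": "# stack code"}, {"vpc-module": "1.2.4"})
staging = compile_for(build, {"instance_count": "1"})
production = compile_for(build, {"instance_count": "3"})
```

Both environments carry the same build digest and dependency versions; only the configuration differs, which is what a lock file or bundled dependencies are meant to guarantee.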

| Workflow | How it works | Feedback speed |
| --- | --- | --- |
| Pull Requests | Feature branches → human review → merge to main | Slower - waits for reviewers |
| Trunk-Based Development | Frequent, small commits directly to main; fast automated tests catch errors | Faster - release candidate published immediately on green tests |

| Method | Mechanism | Best for |
| --- | --- | --- |
| Code branches | Pull from SCM by commit ID, tag, or environment branch | Simple setups; environment branches work well with controllers that sync branch → environment |
| Stack packages | Bundle code into ZIP, TGZ, RPM, container image; store in Artifactory, Nexus, S3, or GitHub Releases | Teams needing strict artifact immutability |
| Libraries (wrapper stacks) | Publish core logic as a versioned Terraform module; each environment has a thin stack that consumes it | Leverages existing module registries for versioning and distribution |

Immutability rule: every build must be treated as immutable. Never edit stack code to customise it for an environment - use configuration parameters instead.

When infrastructure is split into independently deployable stacks, teams need a strategy to integrate and test them:

| Pattern | How it works | Best for |
| --- | --- | --- |
| Fan-in | Build and test each component separately, then deploy and test all related stacks together before production | Components owned by a single team |
| Federation | Each component is delivered and released independently; dependencies treated like APIs with contract testing | Components owned by different teams |
| Monorepo | All components and shared code in one repository; build tools (Bazel, Buck) limit builds to changed paths | Large codebases needing guaranteed consistency of shared code |
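
For the federation pattern, a contract test can be as small as checking that a provider stack still publishes the outputs a consumer relies on. The sketch below assumes outputs are described as name-to-type mappings; the output names and type strings are hypothetical.

```python
def check_contract(provider_outputs: dict[str, str], consumer_expects: dict[str, str]) -> list[str]:
    """Return a list of contract violations; an empty list means the contract holds."""
    problems = []
    for name, expected_type in consumer_expects.items():
        if name not in provider_outputs:
            problems.append(f"missing output: {name}")
        elif provider_outputs[name] != expected_type:
            problems.append(
                f"type mismatch for {name}: expected {expected_type}, got {provider_outputs[name]}"
            )
    return problems


# Hypothetical network stack (provider) and application stack (consumer).
network_outputs = {"vpc_id": "string", "subnet_ids": "list(string)", "region": "string"}
app_expects = {"vpc_id": "string", "subnet_ids": "list(string)", "nat_gateway_id": "string"}
violations = check_contract(network_outputs, app_expects)
```

Extra provider outputs such as `region` are deliberately ignored, so the provider can add outputs without breaking consumers.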

SCM platforms (GitHub, GitLab) provide the foundation: code storage, action runners for automated testing, issue trackers, and security scanners. Branches let developers build and test features in isolation; PRs provide a gate where automated tests run and team members review changes.

What Humans Should Review vs. What Should Be Automated

| Automated (CI system) | Human (code review) |
| --- | --- |
| Formatting, syntax, linting | Security implications (wrong subnet, missing backups) |
| Test execution | Adherence to team best practices and consistency |
| Dependency checks | Quality of comments and documentation |

Consolidate repetitive maintenance into a single command (e.g., make chores):

Generating documentation (terraform-docs):

terraform-docs markdown table --output-file README.md --output-mode inject .

Place <!-- BEGIN_TF_DOCS --> / <!-- END_TF_DOCS --> markers in the README. Use --output-check in CI to verify docs are current.

Standardising formatting (terraform fmt):

terraform fmt -recursive

Resolves formatting discrepancies project-wide. Rudimentary but effective for keeping code uniform.

Auto-fixing lint errors (tflint --fix):

tflint --fix
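
One way to consolidate these chores behind a single entry point is a small wrapper script. This is a sketch, assuming the three tools are installed and on the PATH; the `run_chores` function and its injectable `run` parameter are hypothetical and exist only so the script can be tested without the tools present.

```python
import subprocess
from typing import Callable, Sequence

# The same three commands shown above, in the order they should run.
CHORES: list[list[str]] = [
    ["terraform-docs", "markdown", "table", "--output-file", "README.md", "--output-mode", "inject", "."],
    ["terraform", "fmt", "-recursive"],
    ["tflint", "--fix"],
]


def run_chores(
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> bool:
    """Run every chore in order and stop at the first non-zero exit code."""
    for cmd in CHORES:
        if run(cmd) != 0:
            print("chore failed:", " ".join(cmd))
            return False
    return True
```

A `make chores` target could then call this script, so that developers and the CI system run the same commands.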

| Term | Meaning |
| --- | --- |
| Build project | Code used to build a discrete component (library, stack, application) |
| Codebase | One or more interrelated build projects |
| Repository | One or more build projects in a source control system; branches/tags/commits apply to all files |

| Strategy | Strengths | Weaknesses |
| --- | --- | --- |
| Monorepo | Simplifies integration; code is versioned and branched together | Project boundaries blur; tangled cross-folder imports |
| Microrepo | Clean separation; change triggers only its own pipeline | Impractical for build-time integration across repos |
| Hybrid | Group tightly integrated projects; separate loosely coupled ones | Requires deliberate design decisions |

Design forces: team ownership and access controls, reducing friction from conflicting changelogs, enforcing architectural boundaries.

Organising by technology (all databases in one file, all firewalls in another) emphasises implementation over use and forces developers to sift through unrelated workloads. Instead, organise by domain or workload:

infrastructure/
├── customer_service.infra # DB, networking, security for this service
├── search_service.infra
├── shared_network.infra # Categorised by domain, not dumped in "shared"
└── monitoring.infra

Keep support files alongside the primary source code to guarantee version alignment:

my-stack/
├── src/ # Core infrastructure code
├── tests/ # Offline and online test suites
├── environments/ # Per-instance configuration values
├── pipeline/ # Delivery configuration
├── build.sh # Build orchestration
└── deploy.sh # Deployment orchestration

Standardise tools, versions, and configuration across the team. Automate setup with containers, local VMs (Vagrant, Batect, Dojo), or server configuration tools (Ansible, Chef, Puppet). This accelerates onboarding and eliminates “works on my machine” debugging.

Emulators (LocalStack, Moto, Azurite) provide fast feedback by mocking cloud APIs locally. However, they don’t provision real resources and lack useful UIs - they’re best for running automated tests, not interactive exploration.

High-performing teams let every developer provision a personal cloud environment on demand and tear it down when finished. Deploy from a branch via hosted pipelines (not local workstations) so the team can clean up orphaned environments if someone goes on holiday.
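
Because pipeline-deployed environments are visible to the whole team, finding orphaned ones can be automated. The sketch below assumes the pipeline records a name, an owner, and a last-deployed timestamp for each personal environment; those field names and the seven-day threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_environments(environments: list[dict], now: datetime,
                            max_idle: timedelta = timedelta(days=7)) -> list[str]:
    """Return the names of environments that have not been deployed within max_idle."""
    return [env["name"] for env in environments if now - env["last_deployed"] > max_idle]


# Hypothetical inventory of personal environments.
now = datetime(2024, 6, 15, tzinfo=timezone.utc)
environments = [
    {"name": "sandbox-alice", "owner": "alice",
     "last_deployed": datetime(2024, 6, 14, tzinfo=timezone.utc)},
    {"name": "sandbox-bob", "owner": "bob",
     "last_deployed": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]
stale = find_stale_environments(environments, now)
```

The resulting list could feed a scheduled teardown job or a reminder to the owner, depending on how aggressive the team wants clean-up to be.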

Full environments may be too expensive or slow. Provision partial environments with only the dependencies you need, using test fixtures to replace heavy upstream stacks.


Every stage has three elements:

Content (Inputs → Outputs):

  • Inputs: source code, libraries, test files, configuration values, or a completed build
  • Scope: the stage proves its component works with its dependencies - it doesn’t validate the dependencies themselves
  • Outputs: distributable code/package, version numbers, tags, test reports, logs

Actions (Triggers → Promotion):

  • Automated stages run on every input change; manual stages wait for a human
  • Never mix automated and manual activities in the same stage
  • Use passive triggers - consumer pipelines auto-detect when a provider pipeline publishes a new build

Context (Progressive Realism):

| Stage | Environment | Dependencies |
| --- | --- | --- |
| Offline | Pipeline agent / emulator | Test fixtures and mocks |
| IaaS with mocks | Real cloud platform | Test fixtures replace real dependencies - fast, isolated |
| Production-like | Real cloud + real integrations | Full dependencies - catches only the issues that emerge in realistic conditions |

  • Place automated stages first - catch machine-detectable errors before humans invest time
  • Manual stages (exploratory testing, code review, UAT) come later
  • Automation doesn’t mean surrendering control over when things deploy - it eliminates the manual, error-prone execution of repetitive tasks

Wrap build, deployment, and testing logic in standalone scripts (Bash, Python, Make) rather than embedding it in the CI platform’s configuration:

| Activity | What the script manages |
| --- | --- |
| Building | Resolve dependencies, assemble files, generate code |
| Testing | Set up fixtures/emulators, execute tests, compile results |
| Deployment | Assemble config parameters, apply code to stacks, orchestrate multi-stack deployments |
| Delivery | Upload, download, and promote packages |

Best practices:

  • Keep scripts small and focused on a single activity - don’t build a monolith
  • Separate multi-stack orchestration from single-stack deployment details
  • Write automated tests for your scripts (e.g., Bats for shell scripts)
  • Use the same scripts locally and in CI for consistency

| Type | Role |
| --- | --- |
| Stream-aligned | 5–9 people focused on long-term design, build, and run of a service |
| Enabling | Experts who mentor and facilitate - don’t own components themselves |
| Platform | Provides non-differentiating infrastructure “as a service” |
| Complicated subsystem | Dedicated to a specific complex domain requiring deep expertise |

| Model | Structure | Trade-off |
| --- | --- | --- |
| Split ownership | Separate software and infrastructure teams | Handoffs cause delays and rework; fragmented workflow |
| Full-stack | One team owns both software and infrastructure | No handoffs; treats delivery as a single stream |
| Enablement | Software team owns infrastructure; enablement team mentors them | Interim step before scaling to dedicated service/component teams |

As organisations scale, infrastructure teams shift from instance management to service provision:

| Model | How it works |
| --- | --- |
| Shared infrastructure (multi-tenancy) | Multiple teams deploy onto shared infrastructure (e.g., a shared cluster); four self-service journeys: onboarding, configuring, troubleshooting, deploying |
| On-demand provisioning (single-tenancy) | Teams provision dedicated instances via API; automated policy checks enforce compliance |
| Deployable components | Teams publish versioned infrastructure components to a repository; consumers deploy via a portal without writing IaC |

DORA Metrics:

| Metric | What it measures |
| --- | --- |
| Delivery lead time | Time from commit to production |
| Deployment frequency | How often changes reach production |
| Change fail percentage | Percentage of changes that cause impairment or require rollback |
| Mean time to restore | Time to recover from an unplanned outage |

Additional IaC metrics: effort (expert time per change), toil (repetitive manual work), version spread (how many versions are deployed), utilisation (how often environments are actually used).
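
The four DORA metrics can be derived from ordinary delivery records. This sketch assumes each deployment record carries a commit time, a production deploy time, and a failure flag, and that outages are tracked as durations; the record layout and sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def dora_metrics(deployments: list[dict], outages: list[timedelta], period_days: int) -> dict:
    """Compute the four DORA metrics for one reporting period."""
    lead_hours = mean(
        (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deployments
    )
    failed = sum(1 for d in deployments if d["failed"])
    return {
        "lead_time_hours": lead_hours,
        "deployments_per_day": len(deployments) / period_days,
        "change_fail_percentage": 100 * failed / len(deployments),
        "mean_time_to_restore_minutes": mean(o.total_seconds() / 60 for o in outages),
    }


# Hypothetical records for a four-day period.
deployments = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 11), "failed": False},
    {"committed": datetime(2024, 1, 2, 9), "deployed": datetime(2024, 1, 2, 13), "failed": True},
    {"committed": datetime(2024, 1, 3, 9), "deployed": datetime(2024, 1, 3, 12), "failed": False},
    {"committed": datetime(2024, 1, 4, 9), "deployed": datetime(2024, 1, 4, 12), "failed": False},
]
outages = [timedelta(minutes=30), timedelta(minutes=90)]
metrics = dora_metrics(deployments, outages, period_days=4)
```

The sketch assumes at least one deployment and one outage in the period; a production version would need to handle empty inputs.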

Measure the total time for every activity - including queue time. Often the biggest bottleneck isn’t the automated step (e.g., 8-hour provisioning reduced to 10 minutes) but the waiting time (e.g., an 8-day approval queue). Optimise the wait, not just the automation.
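
The arithmetic behind this point can be worked through with the figures from the paragraph above. The sketch assumes, purely for illustration, that the approval queue and the provisioning step belong to the same delivery path; the stage names and dictionary layout are hypothetical.

```python
from datetime import timedelta


def lead_time(stages: list[dict]) -> timedelta:
    """Total elapsed time for a change: waiting plus working, across all stages."""
    return sum((s["wait"] + s["work"] for s in stages), timedelta())


approval = {"name": "approval", "wait": timedelta(days=8), "work": timedelta(0)}
before = [approval, {"name": "provisioning", "wait": timedelta(0), "work": timedelta(hours=8)}]
after = [approval, {"name": "provisioning", "wait": timedelta(0), "work": timedelta(minutes=10)}]

# Fraction of total lead time saved by automating provisioning alone.
saving = 1 - lead_time(after) / lead_time(before)
slowest_wait = max(after, key=lambda s: s["wait"])["name"]
```

Under these assumptions, cutting provisioning from 8 hours to 10 minutes shortens the overall lead time by only about 4%, because the 8-day queue still dominates.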


Terraform assumes published modules follow Semantic Versioning 2.0 (vMajor.Minor.Patch):

| Level | Meaning | Example |
| --- | --- | --- |
| Patch | Bug fix, no interface change | v1.2.3 → v1.2.4 |
| Minor | New feature, backward compatible | v1.2.4 → v1.3.0 |
| Major | Breaking change | v1.3.0 → v2.0.0 |

Use the pessimistic constraint operator (~>) to allow safe upgrades:

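module "vpc" {
  # Private registry sources take the form <host>/<namespace>/<name>/<provider>;
  # the trailing "aws" provider segment is an assumed example.
  source  = "registry.example.com/networking/vpc/aws"
  version = "~> 1.1" # allows >= 1.1.0 and < 2.0.0 (1.1.x, 1.2.x, ...); blocks 2.0.0
}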
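
To make the behaviour of `~>` concrete, here is a simplified re-implementation of the rule: only the rightmost component of the constraint may increase. It is a sketch for illustration, not Terraform's own code; it ignores pre-release suffixes and, as an assumption, rejects single-component constraints.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.2.3' or 'v1.2.3' into a tuple of integers."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` satisfies `~> constraint`."""
    base = parse(constraint)
    if len(base) < 2:
        raise ValueError("this sketch expects at least major.minor in the constraint")
    # Lower bound: the constraint itself, padded to three components.
    lower = base + (0,) * (3 - len(base))
    # Upper bound: bump the second-to-last component, drop the last.
    upper = base[:-2] + (base[-2] + 1,)
    upper = upper + (0,) * (3 - len(upper))
    return lower <= parse(version) < upper


allowed = satisfies_pessimistic("1.2.4", "1.1")
blocked = satisfies_pessimistic("2.0.0", "1.1")
```

Writing the constraint with three components narrows the range: `~> 1.1.0` accepts 1.1.5 but not 1.2.0.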

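Pulling modules directly from Git (using the ref argument in the source URL to pin a commit, tag, or branch) works for testing branches but doesn’t scale - the version argument only applies to modules installed from a registry, so Git sources can’t use Terraform’s version constraint logic.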

| Type | Details |
| --- | --- |
| Public (HashiCorp / OpenTofu) | Index pointing to public GitHub repos; automatically tracks semantic version tags |
| Private | For proprietary code; authenticate with terraform login; self-host with Terrareg or use a commercial CD platform’s built-in registry |
| Artifactory | Enterprise registry; requires explicit pushes via jf CLI; automate via GitHub Actions triggered on release tags; authenticate with OIDC |

| Method | Security | Recommendation |
| --- | --- | --- |
| OIDC | ✅ No static secrets; temporary credentials | Preferred - eliminates secret sprawl entirely |
| Secret managers | ✅ Centralised, RBAC-controlled | Use when OIDC isn’t available; authenticate to the manager itself via OIDC |
| Orchestrator settings | ⚠️ Write-only; scales poorly | Last resort - updating an expired key across hundreds of projects is painful |

  1. Register the Identity Provider URL (GitHub Actions, Spacelift, etc.) with the cloud vendor
  2. Map the IdP to a specific identity (AWS IAM role, Azure Service Principal)
  3. Enforce conditions - restrict the assumed role to specific repositories and workflows
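
As an illustration of the third step, the sketch below builds an AWS IAM trust policy for GitHub Actions that limits role assumption to one repository and branch. The account ID, repository, and branch are hypothetical placeholders, and the function name is invented for this example; narrowing further to individual workflows depends on which token claims the cloud vendor lets you match, which is not covered here.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Build a trust policy that only lets one repo and branch assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The token must be issued for AWS STS...
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # ...and come from this repository and branch only.
                    "StringLike": {
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}"
                    },
                },
            }
        ],
    }


# Hypothetical values for illustration.
policy = github_oidc_trust_policy("123456789012", "example-org/infrastructure", "main")
policy_json = json.dumps(policy, indent=2)
```

Without the `sub` condition, any repository able to obtain a token from the same identity provider could assume the role, which is why the condition step matters as much as the registration step.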

Fetch secrets dynamically with a Terraform data source. But beware: retrieved values may be exposed in the state file. Where possible, pass the secret’s identifier (e.g., an ARN) directly to the resource instead of pulling the plaintext value into Terraform.

| Requirement | Why it matters |
| --- | --- |
| Access and credentials | The system needs correct network access and cloud credentials |
| Time | Some resources (databases) take up to an hour to launch; deployment tools must handle long-running jobs without interruption |
| Consistency and queuing | Never run concurrent deployments to the same environment - use job queuing to enforce sequential execution |
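
The "consistency and queuing" requirement can be sketched with one lock per environment, so that two deployments to the same environment never overlap while different environments proceed in parallel. The `EnvironmentLocks` class is hypothetical and runs in a single process; real delivery systems usually rely on their own job queues or state locking, and this sketch guarantees mutual exclusion but not strict first-come ordering.

```python
import threading
from collections import defaultdict
from typing import Callable, TypeVar

T = TypeVar("T")


class EnvironmentLocks:
    """Serialise deployment jobs per environment."""

    def __init__(self) -> None:
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def run(self, environment: str, job: Callable[[], T]) -> T:
        with self._guard:
            lock = self._locks[environment]
        # Blocks until any running deployment to this environment finishes.
        with lock:
            return job()


locks = EnvironmentLocks()
result = locks.run("staging", lambda: "applied")
```

Jobs for different environments use different locks, so a long-running production deployment does not hold up a staging one.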

TACOS (Terraform Automation and Collaboration Software)


Platforms that bundle delivery, state management, and private module registries. They manage the state backend transparently and provide web UIs to review previous state versions.

| Feature | Details |
| --- | --- |
| Drift detection | Automatically detect when live infrastructure diverges from code; many teams enable alerts (e.g., Slack) without automatic correction to avoid unreviewed changes |
| Multi-IaC support | Some platforms support Helm, Pulumi, Ansible alongside Terraform - avoids maintaining separate deployment systems as you scale |
| Policy enforcement | Enforce rules at the deployment level (not module level, where users inject values via variables); most platforms standardise on OPA / Rego |
| Cost estimation | Built-in (HCP Terraform) or via Infracost; limited to major cloud providers; estimates only - cannot predict consumption-based spikes |

| Platform | Type | Key characteristics |
| --- | --- | --- |
| HCP Terraform | Managed TACOS | Deep CLI integration, built-in cost estimation; Terraform-only (no OpenTofu/Terragrunt); per-resource pricing can inflate costs |
| Env0 / Spacelift | Managed TACOS | OpenTofu sponsors; multi-framework (Terragrunt, Helm, Ansible, Pulumi); recommended for polished multi-tool experience |
| Scalr | Managed TACOS | OpenTofu sponsor; Terraform/OpenTofu-only; native CLI-driven workflows; excellent migration target from HCP Terraform |
| Digger / Terrateam | GitOps Plus | Deployment-focused; no state backend or registry; PR-comment-driven workflow tightly integrated with GitHub |
| Harness / Octopus Deploy | Enterprise CD | Broad platforms for mixed environments (IaC + legacy + hardware); no built-in registries or state management |
| Atlantis / Terrakube | Self-hosted OSS | Terrakube = traditional TACOS; Atlantis = PR-comment workflow; saves money but introduces administrative burden and security responsibility |