
Servers & Environments

Infrastructure as Code covers two tightly related concerns: how you build and manage the servers that run your workloads, and how you organise those workloads into environments. This page covers both - starting with the server lifecycle, then zooming out to environment design.


The first generation of IaC tools - Ansible, CFEngine, Chef, Puppet, Salt - was built specifically for server configuration. These tools organise code into modules (playbooks in Ansible, cookbooks in Chef) and apply them to servers through server roles.

Everything on a server falls into one of three categories:

| Category | What it contains | How IaC treats it |
|---|---|---|
| Software | Applications, libraries, static code | Installs and versions it; treats the contents as an opaque box |
| Configuration | Files controlling how software behaves | Manages content directly; varies across roles and environments |
| Data | Logs, database files, user-generated content | May backup or distribute, but treats contents as a black box |

The distinction between configuration and data is whether the automation tool actively manages the file’s contents. A system log is vital but treated as data - the automation doesn’t write into it.

Where server content comes from:

  • Base OS from a physical disk, ISO, or IaaS stock image
  • OS packages from vendor, third-party, or internal repositories
  • Language/framework packages (pip, gem, npm, Maven)
  • Nonstandard packages via custom installers
  • Separate material: firewall rules, user accounts, local overrides

A server role is the entry point for managing a server - it defines which configuration modules to apply and sets default parameter values for them. Roles are used when creating servers manually, defining them in a stack, or configuring auto-scaling groups.

Three common strategies for structuring roles:

| Strategy | Description | Example |
|---|---|---|
| Fine-grained | Multiple narrow roles composed together | ApplicationServer + MonitoredServer + PublicFacingServer |
| Higher-level | One comprehensive role per server type | ShoppingServiceApplicationServer |
| Base + inheritance | Universal base role extended by specialised roles | BaseServer → ContainerHostServer |

The base role approach is most common: it encodes organisation-wide requirements (monitoring agents, admin accounts, network hardening) that all servers must inherit.
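The base + inheritance strategy can be sketched as a small resolution step: parent roles contribute their modules and defaults first, and the specialised role adds to or overrides them. This is a minimal illustration, not how any particular tool implements roles; the `Role` class, `resolve` function, module names, and `log_level` parameter are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical server role: modules to apply plus default parameters."""
    name: str
    modules: tuple[str, ...] = ()
    defaults: dict[str, str] = field(default_factory=dict)
    parents: tuple["Role", ...] = ()


def resolve(role: Role) -> tuple[list[str], dict[str, str]]:
    """Return the ordered module list and merged defaults for a role.

    Parents are resolved first, so a specialised role can override
    a default that the base role sets.
    """
    modules: list[str] = []
    params: dict[str, str] = {}
    for parent in role.parents:
        parent_modules, parent_params = resolve(parent)
        for module in parent_modules:
            if module not in modules:
                modules.append(module)
        params.update(parent_params)
    for module in role.modules:
        if module not in modules:
            modules.append(module)
    params.update(role.defaults)
    return modules, params


# Illustrative module names based on the organisation-wide requirements above.
base_server = Role(
    "BaseServer",
    modules=("monitoring_agent", "admin_accounts", "network_hardening"),
    defaults={"log_level": "info"},
)
container_host = Role(
    "ContainerHostServer",
    modules=("container_runtime",),
    defaults={"log_level": "warn"},
    parents=(base_server,),
)
```

Calling `resolve(container_host)` yields the three base modules followed by `container_runtime`, with the specialised role's `log_level` taking precedence.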


Every server moves through four phases:

```mermaid
flowchart LR
    A["1 · Build image\n(optional)"] --> B["2 · Create instance"]
    B --> C["3 · Change instance"]
    C --> D["4 · Destroy instance"]
```

Stock images supplied by the platform (for example, public AMIs on AWS, Azure Marketplace images, or GCP public images) are a valid starting point, but teams often build custom images to:

  • Pre-install monitoring agents, admin accounts, and standard configs
  • Make server creation faster (no runtime package installs)
  • Harden security by stripping unnecessary software and accounts
  • Produce role-specific images for container nodes, CI agents, app servers

Image building methods:

| Method | Notes |
|---|---|
| Modify a stock image | Most common - boot stock image, configure, save as new image |
| Boot an OS installer | Maximum control; avoids relying on third-party stock images |
| Offline image building | Mount image as a disk volume, configure, unmount - faster but needs more tooling |
| Hot-clone a running server | ❌ Never recommended - inherits data pollution (old logs), produces inconsistent images |

Tooling: HashiCorp Packer is the most popular orchestration tool. For images built from a fresh OS, simple sequential shell scripts (e.g., 10-install-monitoring-agent.sh) are often more appropriate than full configuration management tools, which are better suited to managing variable starting conditions on live servers.
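One detail of the numbered-script convention is worth sketching: the scripts need to run in numeric order, which differs from plain alphabetical order once prefixes have different lengths. The helper below is a hypothetical illustration; apart from `10-install-monitoring-agent.sh`, the script names are invented for the example.

```python
import re


def ordered_scripts(names):
    """Sort provisioning scripts by their numeric prefix (e.g. '10-...')."""
    def sort_key(name):
        match = re.match(r"(\d+)-", name)
        if match is None:
            raise ValueError(f"script has no numeric prefix: {name}")
        # Compare prefixes as integers; fall back to the name for ties.
        return (int(match.group(1)), name)

    return sorted(names, key=sort_key)


# Hypothetical script names, apart from the monitoring-agent example above.
scripts = [
    "100-harden-network.sh",
    "20-create-admin-accounts.sh",
    "10-install-monitoring-agent.sh",
]
run_order = ordered_scripts(scripts)
```

A plain `sorted(scripts)` would place `100-harden-network.sh` before `20-create-admin-accounts.sh`, which is why the sketch compares the prefixes as integers.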

Server creation involves four core steps regardless of the tool: selecting a host, allocating resources (CPU, memory, storage, networking), installing the base image, and applying configuration.

Creation triggers:

| Trigger | How |
|---|---|
| Network provisioning | PXE boot → downloads image → reboots → applies config (Cobbler, Foreman, MAAS, RackN) |
| Infrastructure stack | Defined as a resource in Terraform/Pulumi; provisioned via IaaS API |
| Auto-scaling | Platform spawns instances in response to load metrics |
| Auto-recovery | Platform replaces instances that fail health checks |

Three strategies exist for managing updates to running servers:

The first strategy applies configuration code only when a specific change is needed.

Problem: Newer servers get the latest patches; older ones don’t. The result is a fleet of snowflake servers - each with a subtly different history, each a liability during incident response.

The second strategy applies configuration code repeatedly on a schedule, even if the code hasn’t changed. Any manual change made to a server (by an engineer or an attacker) is automatically reverted on the next cycle.

  • Push implementation: A central service connects to each server via SSH to apply updates - requires all servers to be registered and network-accessible
  • Pull implementation (more popular): An agent on the server runs on a cron schedule, checks a central repo for the latest code version, and pulls it down
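The reverting behaviour of scheduled application can be sketched as an idempotent converge step. This is a simplified model under stated assumptions: managed configuration files are represented as a dictionary of path to content, and `converge` is a hypothetical function, not the API of any configuration tool.

```python
from __future__ import annotations


def converge(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Bring managed files in `actual` in line with `desired`.

    Only paths named in `desired` are touched; everything else on the
    server (logs, data) is left alone. Returns a log of changes made.
    """
    changes = []
    for path, content in desired.items():
        if actual.get(path) != content:
            actual[path] = content
            changes.append(f"rewrote {path}")
    return changes


# Hypothetical managed file and a server that has drifted from it.
desired_state = {"/etc/app.conf": "port=8080\n"}
server_files = {"/etc/app.conf": "port=9999\n", "/var/log/app.log": "started\n"}

first_run = converge(desired_state, server_files)   # reverts the manual edit
second_run = converge(desired_state, server_files)  # nothing left to change
```

Running the step a second time produces no changes, which is what makes it safe to apply on every cycle.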

Changing by Replacement - Immutable Servers


The most reliable approach: never modify a running server. Instead, build a new image, provision new instances from it, validate them, redirect traffic, and destroy the old instances.

Replacement sequence:

  1. Create a new instance without putting it into service
  2. Run validation checks to confirm it is ready
  3. Switch services to the new instance
  4. Verify the new instance is handling workload correctly
  5. Destroy the old instance

The Immutable Server pattern makes replacement the only mechanism for change. Every update goes through a delivery pipeline before it reaches production - no exceptions.

Under the Immutable Server pattern, destruction happens last - only after the replacement is live and verified. Because the old instance keeps serving until that point, the transition does not have to take capacity out of service.
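The replacement sequence can be sketched as a function that takes each step as a callable. The function name and parameters are hypothetical, and the rollback behaviour on a failed check (discard the new instance and keep the old one) is an assumption added for the example rather than something the sequence above prescribes.

```python
from typing import Callable


def replace_instance(
    old: str,
    create: Callable[[], str],
    validate: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    verify: Callable[[str], bool],
    destroy: Callable[[str], None],
) -> str:
    """Replace `old` with a new instance and return the one left in service."""
    new = create()                # 1. create without putting it into service
    if not validate(new):         # 2. readiness checks
        destroy(new)              # assumption: discard a failed replacement
        return old
    switch_traffic(new)           # 3. switch services to the new instance
    if not verify(new):           # 4. confirm it handles the workload
        switch_traffic(old)       # assumption: route traffic back
        destroy(new)
        return old
    destroy(old)                  # 5. the old instance is destroyed last
    return new
```

Passing the steps in as callables keeps the ordering logic separate from any platform API, so the same sequence can be exercised with fakes in a test.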


Baking and frying describe when configuration is applied to a server.

| Aspect | Baking | Frying |
|---|---|---|
| When applied | Before the server is ever started | At instance creation time |
| Optimises for | Speed of boot and consistency | Variability and fast config changes |
| Best for | Auto-scaling, auto-recovery, container nodes | On-demand customised workloads |
| Drawback | Slow path to deploying config changes | Slower boot time (installs on the fly) |

```mermaid
flowchart LR
    A["Config change needed"] --> B{Strategy?}
    B -- Baking --> C["Build new image\n→ test pipeline\n→ replace instances"]
    B -- Frying --> D["Update config code\n→ provision new instances\nwith new config"]
```

In practice most teams combine both: bake large, slow-to-install dependencies into the base image; fry instance-specific parameters at creation time.

Base image (baked):
- JDK / application server
- Container cluster agent
- Monitoring agent
Creation-time scripts (fried):
- Environment name
- Application version
- Feature flags

This gives you the fast boot time of baked images without sacrificing flexibility for customisation.
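As a rough sketch of the fried half, the creation-time step only needs to hand a few small values to software that is already baked into the image. The function below renders those values as an environment file; the function name, variable names, and the idea of an environment file are assumptions made for illustration.

```python
from __future__ import annotations

import shlex


def render_instance_env(
    environment: str, app_version: str, feature_flags: dict[str, bool]
) -> str:
    """Render creation-time parameters as KEY=value lines.

    Values are shell-quoted so that a first-boot script could source
    the file safely.
    """
    entries = {"ENVIRONMENT": environment, "APP_VERSION": app_version}
    for name, enabled in sorted(feature_flags.items()):
        entries[f"FEATURE_{name.upper()}"] = "true" if enabled else "false"
    return "".join(f"{key}={shlex.quote(value)}\n" for key, value in entries.items())


# Hypothetical values for one instance.
instance_env = render_instance_env("staging", "1.4.2", {"new_checkout": True})
```

Nothing here installs software, so the work done at boot stays small while the values can still differ for every instance.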


When applying configuration to a new or existing server, two architectures exist for how the code gets there.

Pull Configuration (preferred for security)


The server configures itself from within using initialization scripts:

| Cloud | Mechanism |
|---|---|
| AWS | User data |
| Azure | Custom data |
| GCP | Startup scripts |

On AWS and Azure these scripts are normally handled by cloud-init, which is preinstalled on most Linux images; GCP startup scripts are run by the guest environment on the image. On first boot, the script passes a role name and environment parameters to a preinstalled configuration agent (Chef, Puppet, Ansible), which downloads and applies the relevant modules.

For ongoing updates, a background agent or cron job periodically pulls the latest code from a central repository.

Security advantage: The server never needs external inbound network access. In high-security environments, SSH doesn’t need to be running at all.

In the alternative, push configuration, a central service connects to the server over the network (typically SSH) and executes configuration commands.

Advantage: No configuration agent needs to be preinstalled on the server image.

Risks:

  • Grants a central service root access over the network - if that service is compromised, every registered server is compromised
  • Requires diligent tracking to ensure every server is registered; unregistered servers silently miss all updates
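The registration risk lends itself to a simple check: compare the instances the platform reports as running with the servers registered for push updates. The sketch below assumes both lists are already available as sets of server identifiers; how they are obtained, and the function names, are hypothetical.

```python
from __future__ import annotations


def unregistered_servers(running: set[str], registered: set[str]) -> set[str]:
    """Servers that exist but would silently miss every pushed update."""
    return running - registered


def stale_registrations(running: set[str], registered: set[str]) -> set[str]:
    """Registered entries that no longer correspond to a running server."""
    return registered - running


# Hypothetical identifiers for illustration.
running_now = {"web-1", "web-2", "batch-1"}
push_registry = {"web-1", "web-2", "web-0"}

missing = unregistered_servers(running_now, push_registry)
stale = stale_registrations(running_now, push_registry)
```

Here `batch-1` is running but unregistered, so it would never receive updates, while `web-0` is a leftover registry entry.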

An environment is a logical grouping of deployed infrastructure providing the resources, platform services, and controls needed to run a specific set of workloads. Multi-environment architecture falls into three categories:

```mermaid
flowchart TD
    A["Multi-environment needs"] --> B["Delivery environments\n(path to production)"]
    A --> C["Split environments\n(manageability & ownership)"]
    A --> D["Replica environments\n(scale, geography, user bases)"]
```

Complex systems combine all three - product groups may have their own delivery pipelines feeding into separate production replicas.


Changes to software, infrastructure, or configuration move through a series of delivery environments before reaching production - the path to production. Environments earlier in the flow are upstream; production is downstream.

| Concern | What it means |
|---|---|
| Segregation | Environments must not interfere with each other; upstream testing must never affect downstream data |
| Consistency | Differences across stages invalidate tests and complicate deployments - this is a primary driver for adopting IaC |
| Variation | Some differences are unavoidable: scaling capacity, access levels, resource IDs, naming conventions |
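One way to hold consistency and variation together is to define each environment from a single shared definition and allow overrides only for an explicit list of parameters. This is a minimal sketch under that assumption; `ALLOWED_VARIATION`, `environment_config`, and all keys and values are hypothetical.

```python
from __future__ import annotations

# Hypothetical parameters that may legitimately differ between environments.
ALLOWED_VARIATION = {"instance_count", "access_level", "name_prefix"}


def environment_config(shared: dict, overrides: dict) -> dict:
    """Build one environment's settings from the shared definition.

    Any override outside ALLOWED_VARIATION is rejected, so environments
    cannot drift apart in ways that would invalidate upstream testing.
    """
    unexpected = set(overrides) - ALLOWED_VARIATION
    if unexpected:
        raise ValueError(f"not allowed to vary per environment: {sorted(unexpected)}")
    return {**shared, **overrides}


SHARED = {
    "os_image": "app-server-image",
    "instance_count": 1,
    "access_level": "team",
    "name_prefix": "dev",
}
test_env = environment_config(SHARED, {"name_prefix": "test"})
prod_env = environment_config(
    SHARED,
    {"name_prefix": "prod", "instance_count": 6, "access_level": "restricted"},
)
```

Trying to override `os_image` for one environment raises a `ValueError`, which keeps that kind of difference out of the path to production.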

Separate delivery environments: Each distinct production workload gets its own dev/test pipeline. Required when production systems are fundamentally different from each other.

Fan-out delivery: A single shared dev/test pipeline validates changes, then deploys them simultaneously to multiple identical production environments (e.g., the same storefront deployed to multiple regions).


When systems grow too large for a single environment, they are split along three dimensions: workload coupling, team ownership, and governance.

Sharing an environment creates coupling - the more workloads share an environment, the more coordination is required for changes to shared infrastructure. Split along service boundaries to keep coupling low.

  • Shared-nothing systems (two distinct brands’ storefronts) can live in completely separate environments
  • Integrated systems (storefronts + shared data service) can still be split into cohesive individual environments as long as integration is loose enough to allow independent changes

Teams tend to own environments. A new team for a new service naturally leads to a new environment. A shared platform team naturally leads to shared environments.

This is Conway’s Law applied to infrastructure: the environment structure mirrors the org structure. Be deliberate about whether that is the outcome you want.

Separate environments make compliance easier to enforce and audit:

  • Blast radius: A compromised application environment cannot reach backend systems if they are in a separate environment
  • Log integrity: Security monitoring services in their own isolated environment cannot be tampered with by attackers who compromise the application tier
  • Pipeline safety: Delivery pipeline infrastructure in its own environment is protected from damage caused by the workloads it deploys

Replica environments run the same software as a canonical production environment but serve distinct user bases, geographic regions, or availability zones.

| Driver | Details |
|---|---|
| Availability | Traffic can be rerouted to a replica if one region fails; replicas provide independent redundancy units |
| Scalability | Add replicas to absorb traffic that a single environment cannot handle |
| Geographic latency | Replicas closer to users reduce round-trip times |
| Regulatory compliance | Regional replicas provide hard data residency boundaries, simplifying audit |
| Multiple user bases | White-label platforms can isolate each customer’s data in a dedicated replica |

| Approach | Infrastructure | Trade-off |
|---|---|---|
| Single-tenant replicas | Separate environment per customer/brand | Strong data isolation; high maintenance cost at scale |
| Multi-tenant | One shared environment, software separates tenants | Efficient resource use; requires sophisticated application-level isolation |

Environment Layers and IaaS Resource Groups


Environments can be implemented at three levels of abstraction:

| Layer | Shared resources | Isolation boundary |
|---|---|---|
| Physical | Data centre facilities only | Dedicated hardware per environment |
| Virtual | Physical hardware via IaaS | Dedicated virtual resources per environment |
| Configuration | Shared container cluster or serverless platform | Namespaces and config settings |

Configuration environments (namespaces on a shared cluster) are only viable for cloud-native containerised or serverless workloads. Even then, they carry risks:

  • Namespace-level separation often fails regulatory requirements that demand hard segregation
  • Conflicting workload profiles (low-latency vs. heavy analytics) compete for shared cluster resources
  • Upgrading the shared runtime impacts all hosted environments simultaneously, forcing coordinated change windows across every team
  • Cluster core service failure takes down all hosted environments - true availability requires independent clusters

Every cloud platform provides a base-level grouping primitive:

| Cloud | Resource group primitive |
|---|---|
| AWS | Account |
| Azure | Resource group |
| GCP | Project |

These primitives define the default boundary for access policies, billing, and resource naming. Your environment architecture must explicitly map logical environments to these cloud primitives.

Common (but problematic) approach: Multiple environments in one resource group.

This happens because creating new accounts/projects involves heavyweight approval processes. The result: shared access policies, shared resource naming, complex in-group segregation with tags - all of which are less reliable than simply using separate groups.

Recommended approaches:

| Model | Structure | Best for |
|---|---|---|
| One group per environment | dev account / test account / prod account | Most organisations |
| Multiple groups per environment | app account + management account + monitoring account = one production environment | Regulated industries requiring hard segregation between workloads, delivery pipelines, and observability |

The second model is particularly powerful for governance: the application account has no access to the management or monitoring accounts, so a compromised workload cannot disable its own monitoring.
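The governance property of the multi-account model can be expressed as a small access map and checked mechanically. The sketch below is illustrative only: the account names follow the table above, but the map itself is an assumption, including the choice to let the management and monitoring accounts reach into the application account.

```python
from __future__ import annotations

# Hypothetical cross-account access: each account maps to the accounts it may reach.
# The application account reaches nothing outside itself, as described above.
# Letting management and monitoring reach the app account is an assumption.
CROSS_ACCOUNT_ACCESS: dict[str, set[str]] = {
    "app": set(),
    "management": {"app"},
    "monitoring": {"app"},
}


def can_access(source: str, target: str) -> bool:
    """Return True if `source` may act on resources in `target`."""
    if source == target:
        return True
    return target in CROSS_ACCOUNT_ACCESS.get(source, set())


def violations(forbidden: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """List the forbidden (source, target) pairs that the map would allow."""
    return [pair for pair in forbidden if can_access(*pair)]


# The pairs a compromised workload must never be able to use.
must_not_allow = [("app", "management"), ("app", "monitoring")]
found = violations(must_not_allow)
```

An empty `found` list means a compromised workload in the application account has no path to the accounts that observe or deploy it.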


Application runtime platforms determine where and how workloads execute. Three compute models intersect with IaC:

Traditional clusters consist of identically configured servers running the same workloads. Modern IaC-managed clusters use event-driven scaling - the platform automatically adds or removes servers based on load metrics and health checks, with software delivered via pull deployments or baked images rather than push-scripting tools.

Application Clusters (Container Orchestration)


Application clusters are pools of servers where individual application instances are dynamically scheduled and replaced. Kubernetes dominates this space.

Two provisioning models:

| Model | Examples | Notes |
|---|---|---|
| Cluster as a Service | Amazon EKS, Azure AKS, Google GKE | Managed control plane; integrates naturally with IaC for networking/storage |
| Packaged distributions | Red Hat OpenShift, Rancher RKE, VMware Tanzu | Self-managed; consistent across cloud providers |

In the serverless model, code executes purely on demand, triggered by events (inbound messages, schedules, IaaS lifecycle events). The platform manages ports, memory, and process lifecycle.

| Aspect | Details |
|---|---|
| Advantages | Engineers focus on business logic; highly efficient for unpredictable workloads |
| Challenges | Cold-start latency; limited portability across providers (AWS Lambda ≠ Azure Functions) |
| IaC relevance | Serverless does not eliminate IaC - shared storage, networking, and event infrastructure still needs to be defined as code. Serverless shifts system concerns out of application code into the infrastructure layer |

When building container clusters or serverless platforms, four topological models balance governance, optimisation, ownership, and upgrade continuity:

| Topology | Description | Best for | Challenge |
|---|---|---|---|
| Multiple environments, one cluster | QA, Staging, and Production share a single cluster | Small orgs, fully containerised systems | Upgrade coordination grows complex at scale |
| One cluster per environment | Separate cluster for each environment | Governance simplicity; safe cluster upgrades | Ensuring consistency across clusters requires automated IaC delivery |
| Multiple clusters per environment | Each team gets a dedicated cluster within their environment | Strict ownership and workload optimisation | Massive management overhead; risk of resource waste from undersized clusters |
| Cross-environment clusters | Clusters divided by purpose (public services, internal services, data processing) rather than environment | Mixed workloads with distinct resource profiles | These are shared environments - integrated services span cluster boundaries |

Infrastructure design should start with the workloads, not the other way around. The workflow:

  1. Identify workloads - break the system into separately deployable services
  2. Map required capabilities - describe what each service needs functionally (networking, messaging, storage, compute) without specifying the technology
  3. Determine implementations - match abstract capabilities to specific services

| Capability | Implementation options |
|---|---|
| Networking | IaaS VPC, subnets, firewall rules |
| Container cluster | GKE, EKS, self-managed Kubernetes |
| Async messaging | Cloud Pub/Sub, SQS, RabbitMQ |
| AI/ML | SaaS API (external vendor) |
| Search | Self-hosted open-source tool |

This design flow prevents the common failure mode of designing infrastructure first and then trying to make the workloads fit.
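The three-step workflow can be sketched as data: workloads declare the capabilities they need, and a separate mapping chooses an implementation for each capability. The workload names and the specific choices below are hypothetical examples drawn from the options in the table; the function simply derives what infrastructure has to exist.

```python
from __future__ import annotations

# Step 1 and 2: hypothetical workloads and the abstract capabilities each needs.
WORKLOADS: dict[str, set[str]] = {
    "storefront": {"networking", "container cluster", "search"},
    "order-processor": {"networking", "container cluster", "async messaging"},
}

# Step 3: one chosen implementation per capability (illustrative choices).
IMPLEMENTATIONS: dict[str, str] = {
    "networking": "IaaS VPC, subnets, firewall rules",
    "container cluster": "GKE",
    "async messaging": "Cloud Pub/Sub",
    "search": "Self-hosted open-source tool",
}


def required_infrastructure(
    workloads: dict[str, set[str]], implementations: dict[str, str]
) -> dict[str, str]:
    """Derive the infrastructure to build from what the workloads need."""
    needed = set().union(*workloads.values())
    missing = needed - set(implementations)
    if missing:
        raise ValueError(f"no implementation chosen for: {sorted(missing)}")
    return {capability: implementations[capability] for capability in sorted(needed)}


infrastructure = required_infrastructure(WORKLOADS, IMPLEMENTATIONS)
```

Because the result is derived from the workloads, a capability that no workload asks for never appears in it, and a capability without a chosen implementation fails loudly instead of being forgotten.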


“Cloud native” has multiple overlapping definitions in the industry:

| Source | Definition |
|---|---|
| Practical definition | Software designed and implemented to leverage cloud platform capabilities, built to adapt to shifting needs in capacity, availability, and locality |
| Industry shorthand | Containerised workloads running on Kubernetes |
| CNCF | Systems deployed at scale in a programmatic and repeatable manner; characterised as loosely coupled, secure, resilient, manageable, and observable |

For teams building applications to run on cloud infrastructure, the Twelve-Factor App methodology provides a concrete set of design principles to ensure portability and operability across cloud environments.