Servers & Environments

Infrastructure as Code covers two tightly related concerns: how you build and manage the servers that run your workloads, and how you organise those workloads into environments. This page covers both - starting with the server lifecycle, then zooming out to environment design.

Servers as Code

The first generation of IaC tools - Ansible, CFEngine, Chef, Puppet, Salt - was built specifically for server configuration. These tools organise code into modules (playbooks in Ansible, cookbooks in Chef) and apply them to servers through server roles.

Server Composition

Everything on a server falls into one of three categories:

Category	What it contains	How IaC treats it
Software	Applications, libraries, static code	Installs and versions it; treats the contents as an opaque box
Configuration	Files controlling how software behaves	Manages content directly; varies across roles and environments
Data	Logs, database files, user-generated content	May back up or distribute, but treats contents as a black box

The distinction between configuration and data is whether the automation tool actively manages the file’s contents. A system log is vital but treated as data - the automation doesn’t write into it.

Where server content comes from:

Base OS from a physical disk, ISO, or IaaS stock image
OS packages from vendor, third-party, or internal repositories
Language/framework packages (pip, gem, npm, Maven)
Nonstandard packages via custom installers
Separate material: firewall rules, user accounts, local overrides

Server Roles

A server role is the entry point for managing a server - it defines which configuration modules to apply and sets default parameter values for them. Roles are used when creating servers manually, defining them in a stack, or configuring auto-scaling groups.

Three common strategies for structuring roles:

Strategy	Description	Example
Fine-grained	Multiple narrow roles composed together	`ApplicationServer` + `MonitoredServer` + `PublicFacingServer`
Higher-level	One comprehensive role per server type	`ShoppingServiceApplicationServer`
Base + inheritance	Universal base role extended by specialised roles	`BaseServer` → `ContainerHostServer`

The base role approach is most common: it encodes organisation-wide requirements (monitoring agents, admin accounts, network hardening) that all servers must inherit.
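
A minimal sketch of the base + inheritance strategy, assuming a hypothetical in-memory role model. The role names come from the table above; the module names, the `log_level` parameter, and the `resolve` helper are invented for illustration and do not correspond to any particular tool's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    modules: list[str]
    defaults: dict[str, str] = field(default_factory=dict)
    parent: Role | None = None


def resolve(role: Role) -> tuple[list[str], dict[str, str]]:
    """Walk up the inheritance chain: parent modules apply first, child defaults win."""
    if role.parent is None:
        modules, params = [], {}
    else:
        modules, params = resolve(role.parent)
    modules = modules + [m for m in role.modules if m not in modules]
    params = {**params, **role.defaults}
    return modules, params


# Hypothetical organisation-wide base role and one specialised role.
base = Role(
    "BaseServer",
    ["monitoring_agent", "admin_accounts", "network_hardening"],
    {"log_level": "info"},
)
container_host = Role(
    "ContainerHostServer",
    ["container_runtime"],
    {"log_level": "warn"},
    parent=base,
)

modules, params = resolve(container_host)
print(modules)
print(params)
```

The specialised role only lists what it adds or overrides; everything the organisation requires of all servers arrives through the base role.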

Server Lifecycle

Every server moves through four phases:

flowchart LR
    A["1 · Build image\n(optional)"] --> B["2 · Create instance"]
    B --> C["3 · Change instance"]
    C --> D["4 · Destroy instance"]

Phase 1 - Build a Server Image

Platform stock images (public AMIs on AWS, Azure Marketplace images, GCP public images) are a valid starting point, but teams often build custom images to:

Pre-install monitoring agents, admin accounts, and standard configs
Make server creation faster (no runtime package installs)
Harden security by stripping unnecessary software and accounts
Produce role-specific images for container nodes, CI agents, app servers

Image building methods:

Method	Notes
Modify a stock image	Most common - boot stock image, configure, save as new image
Boot an OS installer	Maximum control; avoids relying on third-party stock images
Offline image building	Mount image as a disk volume, configure, unmount - faster but needs more tooling
Hot-clone a running server	❌ Never recommended - inherits data pollution (old logs), produces inconsistent images

Tooling: HashiCorp Packer is a widely used tool for orchestrating image builds. For images built from a fresh OS, simple sequential shell scripts (e.g., 10-install-monitoring-agent.sh) are often more appropriate than full configuration management tools, which are better suited to managing variable starting conditions on live servers.
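
As a small illustration of the sequential-script convention, the sketch below orders build scripts by their numeric prefix. Only `10-install-monitoring-agent.sh` appears in the text; the other file names and the `build_order` helper are hypothetical. The code computes the order only and does not execute anything.

```python
import re


def build_order(script_names):
    """Sort scripts by numeric prefix, so 5-*.sh runs before 10-*.sh.

    A plain string sort would put "10-..." ahead of "5-...".
    """
    def prefix(name):
        match = re.match(r"(\d+)-", name)
        if match is None:
            raise ValueError(f"script has no numeric prefix: {name}")
        return int(match.group(1))

    return sorted(script_names, key=prefix)


scripts = [
    "20-create-admin-accounts.sh",     # hypothetical
    "10-install-monitoring-agent.sh",  # from the text
    "5-update-packages.sh",            # hypothetical
]
ordered = build_order(scripts)
print(ordered)
```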

Phase 2 - Create an Instance

Server creation involves four core steps regardless of the tool: selecting a host, allocating resources (CPU, memory, storage, networking), installing the base image, and applying configuration.

Creation triggers:

Trigger	How
Network provisioning	PXE boot → downloads image → reboots → applies config (Cobbler, Foreman, MAAS, RackN)
Infrastructure stack	Defined as a resource in Terraform/Pulumi; provisioned via IaaS API
Auto-scaling	Platform spawns instances in response to load metrics
Auto-recovery	Platform replaces instances that fail health checks

Phase 3 - Change an Instance

Three strategies exist for managing updates to running servers:

Push on Change (antipattern)

Configuration code is applied only when a specific change is needed.

Problem: Newer servers get the latest patches; older ones don’t. The result is a fleet of snowflake servers - each with a subtly different history, each a liability during incident response.

Continuous Configuration Synchronization

Configuration code is applied repeatedly on a schedule, even if the code hasn’t changed. Any manual change made to a server (by an engineer or an attacker) is automatically reverted on the next cycle.

Push implementation: A central service connects to each server via SSH to apply updates - requires all servers to be registered and network-accessible
Pull implementation (more popular): An agent on the server runs on a cron schedule, checks a central repo for the latest code version, and pulls it down
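
The revert-on-next-cycle behaviour can be sketched as an idempotent comparison of desired and actual state. This is a toy model, assuming settings are flat key/value pairs; the setting names and the `synchronise` function are hypothetical, and real tools manage files, packages, and services rather than a dictionary.

```python
def synchronise(desired: dict, actual: dict) -> list:
    """Bring `actual` in line with `desired`; return the keys that had drifted."""
    drifted = [key for key, value in desired.items() if actual.get(key) != value]
    for key in drifted:
        actual[key] = desired[key]
    return drifted


# Hypothetical managed settings defined in configuration code.
desired = {"sshd.PermitRootLogin": "no", "ntp.server": "time.internal.example"}
server = dict(desired)

# Someone (an engineer or an attacker) changes a managed setting by hand.
server["sshd.PermitRootLogin"] = "yes"

first_run = synchronise(desired, server)   # reverts the manual change
second_run = synchronise(desired, server)  # nothing left to do
print(first_run, second_run)
```

Running the same code on every cycle is safe because the second run finds no drift and changes nothing.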

Changing by Replacement - Immutable Servers

The most reliable approach: never modify a running server. Instead, build a new image, provision new instances from it, validate them, redirect traffic, and destroy the old instances.

Replacement sequence:

Create a new instance without putting it into service
Run validation checks to confirm it is ready
Switch services to the new instance
Verify the new instance is handling workload correctly
Destroy the old instance
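
The sequence above can be sketched as a function that takes each step as a callable. The rollback behaviour on a failed check (switching back and discarding the new instance) is an assumption added for the sketch, not something the steps prescribe, and all names are hypothetical.

```python
def replace_instance(old, create, validate, switch, verify, destroy):
    """Run the five replacement steps; the old instance survives any failed check."""
    new = create()                 # 1. create, not yet in service
    if not validate(new):          # 2. readiness checks
        destroy(new)
        return old
    switch(new)                    # 3. send traffic to the new instance
    if not verify(new):            # 4. confirm it handles the workload
        switch(old)                # assumed rollback: restore traffic first
        destroy(new)
        return old
    destroy(old)                   # 5. only now remove the old instance
    return new


# In-memory stand-ins for platform calls, to show the order of operations.
log = []
live = replace_instance(
    old="server-v1",
    create=lambda: "server-v2",
    validate=lambda server: True,
    switch=lambda server: log.append(f"traffic -> {server}"),
    verify=lambda server: True,
    destroy=lambda server: log.append(f"destroy {server}"),
)
print(live, log)
```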

The Immutable Server pattern makes replacement the only mechanism for change. Every update goes through a delivery pipeline before it reaches production - no exceptions.

Phase 4 - Destroy an Instance

Under the Immutable Server pattern, destruction happens last - only after the replacement is live and verified. Because the old instance keeps serving until then, the transition need not cause downtime, provided the traffic switch itself is seamless.

Baking Images vs. Frying Instances

Baking and frying describe when configuration is applied to a server.

	Baking	Frying
When applied	Before the server is ever started	At instance creation time
Optimises for	Speed of boot and consistency	Variability and fast config changes
Best for	Auto-scaling, auto-recovery, container nodes	On-demand customised workloads
Drawback	Slow path to deploying config changes	Slower boot time (installs on the fly)

flowchart LR
    A["Config change needed"] --> B{Strategy?}
    B -- Baking --> C["Build new image\n→ test pipeline\n→ replace instances"]
    B -- Frying --> D["Update config code\n→ provision new instances\nwith new config"]

The Hybrid Approach (recommended)

In practice most teams combine both: bake large, slow-to-install dependencies into the base image; fry instance-specific parameters at creation time.

Base image (baked):
  - JDK / application server
  - Container cluster agent
  - Monitoring agent

Creation-time scripts (fried):
  - Environment name
  - Application version
  - Feature flags

This gives you the fast boot time of baked images without sacrificing flexibility for customisation.
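
A sketch of the fried half of the hybrid approach: rendering a small creation-time script from instance-specific parameters, on the assumption that the heavy dependencies are already in the image. The variable names, the `render_user_data` helper, and the `/opt/app/start.sh` launcher path are hypothetical.

```python
import shlex


def render_user_data(params: dict) -> str:
    """Build a first-boot shell script that only sets instance-specific values."""
    lines = [
        "#!/bin/sh",
        "# Heavy installs (JDK, agents) are assumed to be baked into the image.",
    ]
    for key in sorted(params):
        # shlex.quote keeps values with spaces or shell metacharacters safe.
        lines.append(f"export {key}={shlex.quote(str(params[key]))}")
    lines.append("/opt/app/start.sh  # hypothetical launcher baked into the image")
    return "\n".join(lines) + "\n"


user_data = render_user_data(
    {
        "ENVIRONMENT": "staging",
        "APP_VERSION": "1.4.2",
        "FEATURE_FLAGS": "new-checkout,dark mode",
    }
)
print(user_data)
```

The resulting string would be passed to the platform's initialization mechanism when the instance is created.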

Pull vs. Push Configuration

When applying configuration to a new or existing server, two architectures exist for how the code gets there.

Pull Configuration (preferred for security)

The server configures itself from within using initialization scripts:

Cloud	Mechanism
AWS	User data
Azure	Custom data
GCP	Startup scripts

On most Linux images these scripts are run at first boot by cloud-init or an equivalent guest agent. The script passes a role name and environment parameters to a preinstalled configuration tool (for example the Chef or Puppet agent, or Ansible in pull mode via ansible-pull), which downloads and applies the relevant modules.

For ongoing updates, a background agent or cron job periodically pulls the latest code from a central repository.

Security advantage: The server never needs external inbound network access. In high-security environments, SSH doesn’t need to be running at all.

Push Configuration

A central service connects to the server over the network (typically SSH) and executes configuration commands.

Advantage: No configuration agent needs to be preinstalled on the server image.

Risks:

Grants a central service root access over the network - if that service is compromised, every registered server is compromised
Requires diligent tracking to ensure every server is registered; unregistered servers silently miss all updates
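
One way to catch the second risk is to compare the push service's inventory with the list of instances the platform reports. The sketch below assumes both lists are already available as plain names; how they are fetched is platform-specific and left out, and the `audit_inventory` helper is hypothetical.

```python
def audit_inventory(platform_instances, push_inventory):
    """Compare what the platform says exists with what the push service knows about."""
    platform, inventory = set(platform_instances), set(push_inventory)
    return {
        "unregistered": sorted(platform - inventory),  # running, but never updated
        "stale": sorted(inventory - platform),         # registered, but no longer exists
    }


report = audit_inventory(
    platform_instances=["web-1", "web-2", "batch-1"],
    push_inventory=["web-1", "web-2", "web-0"],
)
print(report)
```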

Multi-Environment Architecture

An environment is a logical grouping of deployed infrastructure providing the resources, platform services, and controls needed to run a specific set of workloads. Multi-environment architecture falls into three categories:

flowchart TD
    A["Multi-environment needs"] --> B["Delivery environments\n(path to production)"]
    A --> C["Split environments\n(manageability & ownership)"]
    A --> D["Replica environments\n(scale, geography, user bases)"]

Complex systems combine all three - product groups may have their own delivery pipelines feeding into separate production replicas.

Delivery Environments

Changes to software, infrastructure, or configuration move through a series of delivery environments before reaching production - the path to production. Environments earlier in the flow are upstream; production is downstream.

Three Tensions to Balance

Concern	What it means
Segregation	Environments must not interfere with each other; upstream testing must never affect downstream data
Consistency	Differences across stages invalidate tests and complicate deployments - this is a primary driver for adopting IaC
Variation	Some differences are unavoidable: scaling capacity, access levels, resource IDs, naming conventions
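
A sketch of balancing consistency against variation: every environment is built from one shared definition, and only an explicit set of keys may differ. The keys, values, and the `environment_config` helper are hypothetical.

```python
# One definition shared by every delivery environment (hypothetical values).
SHARED_DEFINITION = {
    "instance_type": "standard-medium",
    "tls_enabled": True,
    "min_instances": 2,
    "access_level": "developers",
}

# The only settings an environment is allowed to vary.
ALLOWED_VARIATION = {"min_instances", "access_level"}


def environment_config(name, overrides):
    """Merge per-environment overrides, rejecting anything that must stay consistent."""
    illegal = sorted(set(overrides) - ALLOWED_VARIATION)
    if illegal:
        raise ValueError(f"{name} may not override: {illegal}")
    return {**SHARED_DEFINITION, **overrides, "name_prefix": name}


test_env = environment_config("test", {"min_instances": 1})
prod_env = environment_config("prod", {"min_instances": 6, "access_level": "operators-only"})
print(test_env)
print(prod_env)
```

Rejecting unexpected overrides keeps accidental differences from creeping in between stages.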

Delivery Patterns

Separate delivery environments: Each distinct production workload gets its own dev/test pipeline. Required when production systems are fundamentally different from each other.

Fan-out delivery: A single shared dev/test pipeline validates changes, then deploys them simultaneously to multiple identical production environments (e.g., the same storefront deployed to multiple regions).
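
Fan-out delivery can be sketched as a shared sequence of stages followed by a release to every production target. The stage and region names, the `deliver` function, and the stand-in deploy callable are hypothetical; for simplicity the sketch releases to targets one after another rather than in parallel.

```python
def deliver(change, shared_stages, production_targets, deploy_and_test):
    """Promote a change through shared stages, then fan out to every production target."""
    for stage in shared_stages:
        if not deploy_and_test(stage, change):
            # A failure upstream stops the change before it reaches production.
            return {"stopped_at": stage, "released_to": []}
    released = [t for t in production_targets if deploy_and_test(t, change)]
    return {"stopped_at": None, "released_to": released}


visited = []


def fake_deploy(environment, change):
    """Stand-in for a real deployment step; records the order of environments."""
    visited.append(environment)
    return True


result = deliver(
    "build-42",
    shared_stages=["dev", "test"],
    production_targets=["prod-eu", "prod-us", "prod-apac"],
    deploy_and_test=fake_deploy,
)
print(result)
```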

Splitting Environments

When systems grow too large for a single environment, they are split along three dimensions:

1 · System Architecture Alignment

Sharing an environment creates coupling - the more workloads share an environment, the more coordination is required for changes to shared infrastructure. Split along service boundaries to keep coupling low.

Shared-nothing systems (two distinct brands’ storefronts) can live in completely separate environments
Integrated systems (storefronts + shared data service) can still be split into cohesive individual environments as long as integration is loose enough to allow independent changes

2 · Organisational Alignment

Teams tend to own environments. A new team for a new service naturally leads to a new environment. A shared platform team naturally leads to shared environments.

This is Conway’s Law applied to infrastructure: the environment structure mirrors the org structure. Be deliberate about whether that is the outcome you want.

3 · Governance Alignment

Separate environments make compliance easier to enforce and audit:

Blast radius: A compromised application environment cannot reach backend systems if they are in a separate environment
Log integrity: Security monitoring services in their own isolated environment cannot be tampered with by attackers who compromise the application tier
Pipeline safety: Delivery pipeline infrastructure in its own environment is protected from damage caused by the workloads it deploys

Replica Environments

Replica environments run the same software as a canonical production environment but serve distinct user bases, geographic regions, or availability zones.

Why Replicate

Driver	Details
Availability	Traffic can be rerouted to a replica if one region fails; replicas provide independent redundancy units
Scalability	Add replicas to absorb traffic that a single environment cannot handle
Geographic latency	Replicas closer to users reduce round-trip times
Regulatory compliance	Regional replicas provide hard data residency boundaries, simplifying audit
Multiple user bases	White-label platforms can isolate each customer’s data in a dedicated replica

Single-Tenant vs. Multi-Tenant

Approach	Infrastructure	Trade-off
Single-tenant replicas	Separate environment per customer/brand	Strong data isolation; high maintenance cost at scale
Multi-tenant	One shared environment, software separates tenants	Efficient resource use; requires sophisticated application-level isolation

Environment Layers and IaaS Resource Groups

Three Abstraction Layers

Environments can be implemented at three levels of abstraction:

Layer	Shared resources	Isolation boundary
Physical	Data centre facilities only	Dedicated hardware per environment
Virtual	Physical hardware via IaaS	Dedicated virtual resources per environment
Configuration	Shared container cluster or serverless platform	Namespaces and config settings

Configuration environments (namespaces on a shared cluster) are only viable for cloud-native containerised or serverless workloads. Even then, they carry risks:

Namespace-level separation often fails regulatory requirements that demand hard segregation
Conflicting workload profiles (low-latency vs. heavy analytics) compete for shared cluster resources
Upgrading the shared runtime impacts all hosted environments simultaneously, forcing coordinated change windows across every team
Cluster core service failure takes down all hosted environments - true availability requires independent clusters

IaaS Resource Groups

Every cloud platform provides a base-level grouping primitive:

Cloud	Resource group primitive
AWS	Account
Azure	Resource group (within a subscription)
GCP	Project

These primitives define the default boundary for access policies, billing, and resource naming. Your environment architecture must explicitly map logical environments to these cloud primitives.

Common (but problematic) approach: Multiple environments in one resource group.

This happens because creating new accounts/projects involves heavyweight approval processes. The result: shared access policies, shared resource naming, complex in-group segregation with tags - all of which are less reliable than simply using separate groups.

Recommended approaches:

Model	Structure	Best for
One group per environment	dev account / test account / prod account	Most organisations
Multiple groups per environment	app account + management account + monitoring account = one production environment	Regulated industries requiring hard segregation between workloads, delivery pipelines, and observability

The second model is particularly powerful for governance: the application account has no access to the management or monitoring accounts, so a compromised workload cannot disable its own monitoring.
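
The access rule in the second model can be sketched as a table of allowed directions between accounts. The account names and the `can_access` helper are hypothetical, and allowing the management and monitoring accounts to reach the application account is an assumption made for the sketch.

```python
# One production environment spread across three accounts (hypothetical names).
PRODUCTION = {
    "app": "prod-app-account",
    "management": "prod-management-account",
    "monitoring": "prod-monitoring-account",
}

# Allowed cross-account access as (source, target) pairs. Nothing starts from "app".
ALLOWED_ACCESS = {
    ("management", "app"),
    ("monitoring", "app"),
}


def can_access(source_purpose, target_purpose):
    """An account can always reach itself; anything else must be listed explicitly."""
    if source_purpose == target_purpose:
        return True
    return (source_purpose, target_purpose) in ALLOWED_ACCESS


reachable_from_app = [purpose for purpose in PRODUCTION if can_access("app", purpose)]
print(reachable_from_app)
```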

Application Runtime Platforms

Application runtime platforms determine where and how workloads execute. Three compute models intersect with IaC:

Server Clusters

Traditional clusters consist of identically configured servers running the same workloads. Modern IaC-managed clusters use event-driven scaling - the platform automatically adds or removes servers based on load metrics and health checks, with software delivered via pull deployments or baked images rather than push-scripting tools.
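
A minimal sketch of event-driven scaling, assuming a simple proportional rule: scale the instance count by the ratio of observed load to target load, then clamp it to configured bounds. The rule, the function name, and the numbers are illustrative assumptions; real platforms expose their own scaling policies.

```python
import math


def desired_instances(current, metric, target, minimum, maximum):
    """Proportional scaling: grow or shrink so the per-instance metric nears the target."""
    wanted = math.ceil(current * metric / target)
    return max(minimum, min(maximum, wanted))


# Hypothetical cluster: 4 servers, target of 60% average CPU, bounds of 2 to 10.
scale_out = desired_instances(current=4, metric=90, target=60, minimum=2, maximum=10)
scale_in = desired_instances(current=4, metric=15, target=60, minimum=2, maximum=10)
capped = desired_instances(current=4, metric=300, target=60, minimum=2, maximum=10)
print(scale_out, scale_in, capped)
```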

Application Clusters (Container Orchestration)

Application clusters are pools of servers where individual application instances are dynamically scheduled and replaced. Kubernetes dominates this space.

Two provisioning models:

Model	Examples	Notes
Cluster as a Service	Amazon EKS, Azure AKS, Google GKE	Managed control plane; integrates naturally with IaC for networking/storage
Packaged distributions	Red Hat OpenShift, Rancher RKE, VMware Tanzu	Self-managed; consistent across cloud providers

Serverless / FaaS

Code executes purely on demand, triggered by events (inbound messages, schedules, IaaS lifecycle events). The platform manages ports, memory, and process lifecycle.

	Details
Advantages	Engineers focus on business logic; highly efficient for unpredictable workloads
Challenges	Cold-start latency; limited portability across providers (AWS Lambda ≠ Azure Functions)
IaC relevance	Serverless does not eliminate IaC - shared storage, networking, and event infrastructure still needs to be defined as code. Serverless shifts system concerns out of application code into the infrastructure layer

Cluster Topologies

When building container clusters or serverless platforms, four topological models balance governance, optimisation, ownership, and upgrade continuity:

Topology	Description	Best for	Challenge
Multiple environments, one cluster	QA, Staging, and Production share a single cluster	Small orgs, fully containerised systems	Upgrade coordination grows complex at scale
One cluster per environment	Separate cluster for each environment	Governance simplicity; safe cluster upgrades	Ensuring consistency across clusters requires automated IaC delivery
Multiple clusters per environment	Each team gets a dedicated cluster within their environment	Strict ownership and workload optimisation	Massive management overhead; risk of resource waste from many small, underutilised clusters
Cross-environment clusters	Clusters divided by purpose (public services, internal services, data processing) rather than environment	Mixed workloads with distinct resource profiles	These are shared environments - integrated services span cluster boundaries

Application-Driven Infrastructure Design

Infrastructure design should start with the workloads, not the other way around. The workflow:

Identify workloads - break the system into separately deployable services
Map required capabilities - describe what each service needs functionally (networking, messaging, storage, compute) without specifying the technology
Determine implementations - match abstract capabilities to specific services

Capability	Implementation options
Networking	IaaS VPC, subnets, firewall rules
Container cluster	GKE, EKS, self-managed Kubernetes
Async messaging	Cloud Pub/Sub, SQS, RabbitMQ
AI/ML	SaaS API (external vendor)
Search	Self-hosted open-source tool

This design flow prevents the common failure mode of designing infrastructure first and then trying to make the workloads fit.
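
The three-step workflow can be sketched as two lookups: workloads to abstract capabilities, then capabilities to chosen implementations. The workload names and the `infrastructure_plan` helper are hypothetical; the capabilities and implementation options are taken from the table above, with one option picked arbitrarily where several are listed.

```python
# Steps 1 and 2: separately deployable workloads and the capabilities each needs.
WORKLOADS = {
    "storefront": ["networking", "container cluster", "search"],
    "order-processor": ["networking", "container cluster", "async messaging"],
}

# Step 3: one chosen implementation per capability.
IMPLEMENTATIONS = {
    "networking": "IaaS VPC, subnets, firewall rules",
    "container cluster": "GKE",
    "async messaging": "Cloud Pub/Sub",
    "search": "Self-hosted open-source tool",
}


def infrastructure_plan(workloads, implementations):
    """Derive the infrastructure to build from what the workloads actually need."""
    needed = sorted({cap for caps in workloads.values() for cap in caps})
    missing = [cap for cap in needed if cap not in implementations]
    if missing:
        raise KeyError(f"no implementation chosen for: {missing}")
    return {cap: implementations[cap] for cap in needed}


plan = infrastructure_plan(WORKLOADS, IMPLEMENTATIONS)
for capability, implementation in plan.items():
    print(f"{capability}: {implementation}")
```

Because the plan is derived from the workloads, nothing is provisioned that no service asked for, and a capability without a chosen implementation fails loudly.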

Cloud Native Software

“Cloud native” has multiple overlapping definitions in the industry:

Source	Definition
Practical definition	Software designed and implemented to leverage cloud platform capabilities, built to adapt to shifting needs in capacity, availability, and locality
Industry shorthand	Containerised workloads running on Kubernetes
CNCF	Systems deployed at scale in a programmatic and repeatable manner; characterised as loosely coupled, secure, resilient, manageable, and observable

For teams building applications to run on cloud infrastructure, the Twelve-Factor App methodology provides a concrete set of design principles to ensure portability and operability across cloud environments.