
The Mechanics of Containerization

A container image is an immutable, self-contained package that bundles everything your application needs to run. Think of it as a snapshot - a blueprint that describes the state of a filesystem and the configuration needed to launch a process from it.

A container image can include any combination of the following:

| Content Type | Description |
| --- | --- |
| System packages | OS-level utilities and libraries (e.g., glibc, openssl) |
| A runtime | The execution environment for your app (e.g., JVM, Node.js, Python interpreter) |
| Library dependencies | Application-level libraries your code imports |
| Source code | Your application’s source files (in interpreted languages) |
| Binaries | Pre-compiled executables |
| Static assets | HTML, CSS, images, config templates |
| Configuration | Container configuration (entrypoint, env vars, user, working dir) |

| Concept | Definition |
| --- | --- |
| Container Image | A static, read-only blueprint stored on disk. It doesn’t “run” - it just exists. |
| Container | A live, running instance created from a container image. It only exists while it has active processes. |
  • When you run a container image, the container runtime executes the program specified in the image’s entrypoint (e.g., starting the JVM for a Java app).
  • A container only exists at runtime. If the process exits or is killed, the container stops and ceases to exist.

Two important things happen automatically when a container is launched:

The contents of the container image are used to seed a private, isolated file system for the container. Every process inside the container sees this file system - and only this file system - as if it were the entire machine.

The container gets its own virtual network interface with a local IP address. Your application can bind to this interface and start listening on a port, enabling it to receive incoming network traffic.
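The file-system seeding step can be imitated in plain shell as a toy model: treat the image as a directory tree, and "starting a container" as seeding a fresh private root from it. Real runtimes use mount namespaces and union mounts rather than copying files, so this only illustrates the isolation property, not the mechanism:

```shell
set -eu

# Toy model: an "image" is a directory tree; starting a "container" seeds a
# fresh private root from it. (Real runtimes use mount namespaces and union
# mounts instead of copying - this only illustrates the seeding step.)
image=$(mktemp -d)                    # the read-only "image" contents
echo "v1 config" > "$image/app.conf"

start_container() {                   # seed a private rootfs from the image
  rootfs=$(mktemp -d)
  cp -r "$image/." "$rootfs/"
  echo "$rootfs"
}

c1=$(start_container)
c2=$(start_container)

echo "patched" > "$c1/app.conf"       # a write inside container 1...

cat "$c2/app.conf"                    # ...is invisible to container 2
cat "$image/app.conf"                 # ...and to the image itself
```

Both `cat` commands print `v1 config`: each container sees only its own private copy of the image contents.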

The configuration section of a container image tells the runtime how to turn the image into a running container. Key settings include:

Entrypoint

The entrypoint is the command executed when the container starts. For example:

  • A Java app → java -jar app.jar
  • A Python app → python main.py
  • A compiled binary → /usr/local/bin/myapp

Environment Variables

Environment variables pass runtime configuration into your application without baking it into the image. Common uses: database URLs, API keys, feature flags, log levels.

DATABASE_URL=postgres://user:pass@host/db
LOG_LEVEL=info
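For example, an application can read these variables with sensible defaults, so the same image runs unchanged in any environment. A minimal sketch (`DATABASE_URL` and `LOG_LEVEL` are illustrative names, not a standard):

```shell
set -eu

# Stand-in for the application inside the image: it reads its configuration
# from the environment, falling back to built-in defaults.
app=$(mktemp)
cat > "$app" <<'EOF'
#!/bin/sh
echo "db:  ${DATABASE_URL:-postgres://localhost/dev}"
echo "log: ${LOG_LEVEL:-info}"
EOF
chmod +x "$app"

"$app"                    # defaults - a "dev" run
LOG_LEVEL=debug "$app"    # overridden at launch; the "image" is unchanged
```

The second invocation prints `log: debug` without any change to the application itself - the same effect as passing `-e LOG_LEVEL=debug` to a container runtime.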

User

Specifies which OS user the container process runs as.

Working Directory

Sets the default directory from which the entrypoint command is executed (equivalent to cd /app before running your process).

When starting a container, you can override the image’s default values for:

  • The entrypoint command
  • Arguments passed to the entrypoint
  • Environment variables

This makes images reusable across different environments (dev, staging, production) without rebuilding.
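A toy sketch of that override behaviour, with `run_container` standing in for the runtime (a made-up helper, not a real API): the image supplies defaults, and anything the caller passes wins.

```shell
set -eu

# Image defaults - normally read from the image's config section
IMG_ENTRYPOINT="echo"
IMG_ARGS="hello from image"
IMG_WORKDIR="/tmp"

# Hypothetical launcher: usage: run_container [entrypoint [args...]]
run_container() {
  cd "$IMG_WORKDIR"               # apply the image's working directory
  if [ "$#" -gt 0 ]; then
    "$@"                          # caller-supplied override wins
  else
    "$IMG_ENTRYPOINT" $IMG_ARGS   # fall back to the image defaults
  fi
}

run_container                     # → hello from image
run_container echo "overridden"   # → overridden
```

This mirrors how `docker run image` uses the image's entrypoint while `docker run image othercmd` replaces it, without rebuilding anything.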

The Linux Kernel Features That Make Containers Possible


Containers are not virtual machines. They don’t emulate hardware or run a full guest OS. Instead, they rely on three foundational Linux kernel features to provide isolation, resource control, and efficient storage.


1. Namespaces - Isolation

A namespace wraps a global system resource so that processes within the namespace have their own isolated view of it. From inside a namespace, a process believes it has the entire resource to itself.

PID Namespace

  • Inside the container, your application’s main process is assigned PID 1, even if on the host machine it’s actually running as PID 12345.
  • The process cannot see, signal, or interact with processes running outside its PID namespace.
  • This is why running ps aux inside a container only shows the container’s own processes.

Network Namespace

  • Each container gets its own completely isolated network stack: its own IP address, routing table, firewall rules (iptables), and ports.
  • Two containers can each bind to port 8080 simultaneously without conflict, because they live in different network namespaces.
  • The container runtime (e.g., Docker) creates virtual ethernet pairs (veth) to connect the container’s network namespace to the host.

Mount Namespace

  • Gives the container its own isolated view of the file system hierarchy.
  • The container can have /etc, /var, /home, etc. that are completely separate from the host’s file system.
  • Changes to the file system inside the container (in the writable layer) are invisible to the host and other containers.
UTS Namespace

  • Allows the container to have its own hostname and domain name, independent of the host.
  • This is why a container can report its hostname as web-server-1 while the host machine is named prod-node-42.

IPC Namespace (Inter-Process Communication Isolation)

  • Isolates IPC resources such as System V message queues and POSIX shared memory.
  • Prevents processes in one container from interfering with IPC resources used by another.

User Namespace

  • Maps user IDs inside the container to different user IDs on the host.
  • A process running as UID 0 (root) inside the container can be mapped to an unprivileged user (e.g., UID 65534) on the host.
  • This is a critical security feature: even if a malicious process “escapes” the container, it runs as a non-privileged host user.
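These namespace memberships are visible on any Linux host, no container runtime required: each process’s namespaces appear as symlinks under /proc/&lt;pid&gt;/ns, and two processes are in the same namespace exactly when their links resolve to the same inode.

```shell
set -eu

# Each link names the namespace type and its inode number
readlink /proc/$$/ns/pid    # e.g. pid:[4026531836]
readlink /proc/$$/ns/net
readlink /proc/$$/ns/mnt

# A child process inherits its parent's namespaces by default:
a=$(readlink /proc/$$/ns/pid)
b=$(sh -c 'readlink /proc/$$/ns/pid')
[ "$a" = "$b" ] && echo "same pid namespace"
```

A container runtime creates new namespaces for the container’s process, so running the same commands inside a container shows different inode numbers than on the host.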

2. Control Groups (cgroups) - Resource Management



While namespaces provide isolation, cgroups provide resource governance. They allow the kernel to limit, account for, and isolate the resource usage of a group of processes.

CPU

  • Cap how much CPU time a container’s processes receive.
  • Example: Limit a container to 0.5 CPU cores, even on a 32-core machine.
  • Implemented via CPU shares, CPU quotas, and CPU periods.

Memory

  • Set a maximum amount of RAM a container can use.
  • Example: --memory=512m restricts the container to 512 MB of RAM.
  • If the container exceeds its memory limit, the kernel’s OOM (Out-Of-Memory) killer will terminate a process in the container.

Block I/O

  • Throttle the rate at which a container can read from or write to disk.
  • Prevents one container from saturating disk bandwidth.
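As a concrete sketch of the CPU case: under cgroups v2 the limit lives in a file named cpu.max containing “&lt;quota&gt; &lt;period&gt;” in microseconds, and quota ÷ period gives the usable cores. The snippet computes this from a sample value rather than reading a live cgroup:

```shell
set -eu

# cgroup v2 cpu.max format: "<quota> <period>" in microseconds.
# "50000 100000" is roughly what `docker run --cpus=0.5` sets:
# 50ms of CPU time per 100ms period.
cpu_max="50000 100000"

quota=${cpu_max% *}     # 50000
period=${cpu_max#* }    # 100000
cores=$(awk "BEGIN { print $quota / $period }")

echo "limit: $cores cores"   # → limit: 0.5 cores
```

On a real host you would read the value from /sys/fs/cgroup/&lt;group&gt;/cpu.max; the arithmetic is the same.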

Noisy Neighbor

Without cgroups, a single poorly-written or malicious container could consume 100% of CPU or RAM, starving every other application on the same host. cgroups enforce guaranteed resource isolation, making multi-tenant container hosting reliable.

cgroups v1 vs. v2

  • cgroups v1: Each resource controller (cpu, memory, blkio) is managed separately.
  • cgroups v2: A unified hierarchy where all controllers are managed together. More modern and consistent. Required by some newer container runtimes.

3. Union File Systems & Copy-on-Write (CoW) - Storage


Union File Systems

Container images are built in layers. This layered architecture is what makes images lightweight, fast to pull, and highly efficient in storage and memory usage.

A container image is composed of multiple stacked, read-only layers. Each layer represents a set of file system changes (additions, modifications, deletions) from one build step.

Example layer stack for a Java web app:

[ Layer 4 ] → App JAR file added (top, most specific)
[ Layer 3 ] → JDK installed
[ Layer 2 ] → apt-get update + curl
[ Layer 1 ] → Base Ubuntu 22.04 image (bottom, most general)

Each layer is content-addressed - identified by a cryptographic hash of its contents. This means:

  • If two images share the same base layer (e.g., Ubuntu 22.04), that layer is stored once on disk and shared in memory, even if 50 containers are running from different images.
  • Pulling a new image version is fast: only the changed layers need to be downloaded.
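A minimal sketch of content addressing, using sha256sum and a temporary directory as a stand-in for the local layer store:

```shell
set -eu

store=$(mktemp -d)        # stand-in for the local layer store

# Two "layers" with byte-identical content...
layer_a=$(mktemp); printf 'ubuntu-22.04-base' > "$layer_a"
layer_b=$(mktemp); printf 'ubuntu-22.04-base' > "$layer_b"

# ...hash to the same digest, so they are the same object
digest_a=$(sha256sum "$layer_a" | cut -d' ' -f1)
digest_b=$(sha256sum "$layer_b" | cut -d' ' -f1)

cp "$layer_a" "$store/$digest_a"   # storing both under their digest
cp "$layer_b" "$store/$digest_b"   # dedupes to a single file

[ "$digest_a" = "$digest_b" ] && echo "shared layer, stored once"
ls "$store" | wc -l                # → 1
```

Because the digest is the identity, a registry or local cache can skip any layer whose hash it already holds - which is exactly why pulls only transfer changed layers.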

Container Layer (Upper Layer - Read-Write)


When a container is started, a thin, ephemeral read-write layer is added on top of the read-only image layers. This is the only place where the running container can write new data.

[ Container Layer ] → Read-Write (ephemeral, destroyed when container stops)
─────────────────────────────────────────────────────────────────────────────
[ Layer 4 ] → Read-Only ─┐
[ Layer 3 ] → Read-Only │ These are the image layers - shared and immutable
[ Layer 2 ] → Read-Only │
[ Layer 1 ] → Read-Only ─┘

Copy-on-Write

The key question is: what happens if a running container needs to modify a file that exists in a read-only layer?

The answer is Copy-on-Write:

  1. When a container process writes to a file that exists only in a read-only lower layer, the file system detects this.
  2. The file is copied up into the writable container layer (the “copy-up” operation).
  3. The modification is applied to the copy in the upper layer.
  4. Subsequent reads of that file will see the modified version from the upper layer (it shadows the original below).
  5. When the container is deleted, the entire upper read-write layer is discarded - the original image layers remain perfectly intact.

This is why stopping and removing a container does not destroy the image, and why you can spin up 100 containers from the same image without 100 copies of the image on disk.
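The copy-up behaviour can be imitated in a few lines of shell. Here cow_read and cow_write are made-up helper names, and the real work happens inside the kernel’s union file system rather than in userspace like this:

```shell
set -eu

lower=$(mktemp -d)   # read-only image layer
upper=$(mktemp -d)   # ephemeral container layer
echo "original" > "$lower/config.txt"

cow_read() {   # the upper layer shadows the lower one
  if [ -f "$upper/$1" ]; then cat "$upper/$1"; else cat "$lower/$1"; fi
}

cow_write() {  # copy-up, then modify only the copy
  [ -f "$upper/$1" ] || cp "$lower/$1" "$upper/$1"
  echo "$2" > "$upper/$1"
}

cow_read config.txt          # → original (served from the lower layer)
cow_write config.txt "edited"
cow_read config.txt          # → edited (upper copy shadows the original)
cat "$lower/config.txt"      # → original (the image layer is untouched)

rm -rf "$upper"              # "deleting the container": edits vanish,
cat "$lower/config.txt"      # the image layer is still intact → original
```

Deleting the upper directory discards every modification while leaving the lower layer byte-for-byte identical - the same guarantee that lets many containers share one image.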

Union File System Implementations

The “stacking” of layers into a single unified view is handled by a Union File System. Common implementations include:

| Implementation | Notes |
| --- | --- |
| OverlayFS | The default for Docker on modern Linux. Uses kernel-native overlay mounts. Very fast. |
| AUFS | An older union file system, once the Docker default. Requires out-of-tree kernel patches; largely deprecated. |
| Btrfs | A copy-on-write file system at the block level. Offers snapshots. |
| ZFS | Provides strong integrity guarantees and efficient snapshots. Less common for containers. |
| Device Mapper | Block-level thin provisioning. Used in older RHEL/CentOS environments. |

Putting It All Together: The Full Container Lifecycle

1. Build → Dockerfile instructions create stacked read-only image layers;
   each layer is hashed and stored.
2. Push/Pull → Only missing layers are transferred over the network;
   shared layers are reused from the local cache.
3. Start → The runtime creates:
   • a new set of namespaces (pid, net, mnt, uts, ipc, user)
   • a new cgroup for the container’s processes
   • a writable CoW layer on top of the image layers
   • a virtual network interface with an IP address
4. Running → The process runs in isolation:
   it sees its own PID 1, its own /etc, its own IP;
   cgroups enforce CPU/memory limits;
   writes go to the ephemeral upper layer.
5. Stop/Delete → Running processes are terminated;
   the ephemeral read-write layer is discarded;
   image layers remain untouched;
   namespaces and cgroups are torn down.