Kubernetes Architecture

What is a Cluster?

A cluster is a group of nodes - physical servers, virtual machines, or cloud instances - that pool their CPU, memory, and storage into a single compute surface. Nodes can run on ARM or AMD64/x86-64 architectures, and a single cluster can mix hardware types, operating systems, and even cloud providers.

From the outside, none of this heterogeneity is visible. Developers interact with a single API endpoint; Kubernetes decides where and how workloads actually run across the available machines. The entire cluster presents itself as one unified deployment area.

Internally, the cluster is divided into two planes: the control plane, which implements the intelligence of the system, and the workload plane, which executes your applications. Components in both planes coordinate through the API server rather than calling each other directly.

The Two-Plane Model

Plane	Nodes	Runs	OS requirement
Control plane	1 (dev) · 3 or 5 (prod HA)	Cluster management services	Linux only
Workload plane	1 → thousands	Your application workloads	Linux or Windows

Control Plane Components

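The control plane is the cluster’s brain. In a highly available setup, each control plane node typically runs a copy of every control plane service, although only one scheduler and one controller manager instance is active at a time (the others stand by via leader election).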

API Server

The API server is the single entry point for all cluster communication - including communication between internal components. Everything goes through it.

Exposes a RESTful HTTPS API
Handles authentication, authorisation, and admission control on every request
Is stateless - it reads and writes all state to etcd
kubectl, CI pipelines, operators, and controllers all communicate via the API server
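
The request path listed above can be sketched as a small pipeline. This is a toy model under stated assumptions, not the API server's implementation: `Request`, `handle`, the token table, the permission set, and the status codes are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for real authenticators and authorisation rules.
TOKENS = {"token-alice": "alice", "token-ci": "ci-bot"}
ALLOWED = {("alice", "get"), ("alice", "create"), ("ci-bot", "get")}


@dataclass
class Request:
    token: str
    verb: str
    obj: dict = field(default_factory=dict)


def handle(req: Request, store: dict) -> tuple:
    """Run one request through authentication, authorisation, and admission."""
    # 1. Authentication: who is calling?
    user = TOKENS.get(req.token)
    if user is None:
        return 401, "authentication failed"
    # 2. Authorisation: may this user perform this verb?
    if (user, req.verb) not in ALLOWED:
        return 403, "not authorised"
    # 3. Admission control (writes only): mutate first, then validate.
    if req.verb == "create":
        req.obj.setdefault("namespace", "default")
        if "name" not in req.obj:
            return 400, "admission rejected: missing name"
        # Persist only after every check has passed; `store` stands in for etcd.
        store[(req.obj["namespace"], req.obj["name"])] = req.obj
        return 201, "created"
    return 200, "ok"
```

Because `handle` keeps nothing between calls and writes only to `store`, it mirrors the stateless property described above: any replica could serve any request.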

etcd

etcd is the cluster’s only persistent store - the source of truth for all desired and observed state.

Distributed key-value database built on the Raft consensus algorithm, which prevents split-brain and data corruption from concurrent writes
Only the API server communicates with etcd directly - no other component does
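Should run with an odd number of members (3 or 5): a write needs a majority (quorum), and an even-sized cluster raises the quorum without tolerating any more failures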
Large, high-churn clusters may run a dedicated etcd cluster for performance
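
The quorum arithmetic behind the odd-number recommendation is short enough to write out. The function names below are illustrative, not part of any etcd API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


for n in range(1, 7):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the fourth member adds cost without adding resilience.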

Scheduler

The scheduler watches the API server for Pods that have no node assignment and selects the best node for each one.

Node selection is a two-phase process:

Filtering - eliminates nodes that cannot satisfy the Pod’s requirements (insufficient CPU/memory, unavailable host ports, untolerated taints, unmet node affinity rules)
Scoring / ranking - ranks remaining nodes and picks the highest scorer

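If no node passes the filter, the Pod stays Pending. If cluster autoscaling is configured, a Pod that is unschedulable for lack of capacity can trigger provisioning of a new node, provided the autoscaler has a node group whose nodes would fit the Pod.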
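
The two phases can be sketched as follows. This is a simplified illustration, not the scheduler's real plugin framework: the `Node` and `Pod` fields and the "most capacity left over" scoring rule are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int        # free CPU in millicores
    free_mem_mi: int       # free memory in MiB
    tainted: bool = False


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerates_taint: bool = False


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering phase: can this node run the Pod at all?"""
    if node.tainted and not pod.tolerates_taint:
        return False
    return node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi


def score(pod: Pod, node: Node) -> float:
    """Scoring phase (hypothetical rule): prefer the largest share of free capacity left."""
    cpu_left = (node.free_cpu_m - pod.cpu_m) / max(node.free_cpu_m, 1)
    mem_left = (node.free_mem_mi - pod.mem_mi) / max(node.free_mem_mi, 1)
    return (cpu_left + mem_left) / 2


def schedule(pod: Pod, nodes: list) -> Optional[str]:
    """Return the chosen node name, or None if the Pod must stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name
```

A `None` result corresponds to the Pending case: filtering removed every node, so scoring never runs.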

Controller Manager

Controllers are background reconciliation loops - they watch for a difference between desired state and observed state, then take action to close the gap.

Each controller handles one resource type (Deployment, ReplicaSet, StatefulSet, Job, etc.)
The controller manager spawns and supervises all individual controllers
Controllers never act directly on infrastructure - they write objects to the API server and let downstream components react

flowchart LR
    A["Desired state (etcd)"] --> B["Controller observes gap"]
    B --> C["Controller writes corrective objects"]
    C --> D["Kubelet / runtime executes"]
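
A single pass of such a loop might look like the sketch below. It assumes a ReplicaSet-style controller that manages Pods named `<prefix>-<index>`; the function name, the naming scheme, and the choice of which surplus Pods to delete are all hypothetical.

```python
def reconcile(desired_replicas: int, observed_pods: list, prefix: str = "web") -> list:
    """Compare desired and observed state; return the corrective actions to write."""
    actions = []
    gap = desired_replicas - len(observed_pods)
    if gap > 0:
        # Too few Pods: request new Pod objects, skipping names already in use.
        existing = set(observed_pods)
        index = 0
        while gap > 0:
            name = f"{prefix}-{index}"
            if name not in existing:
                actions.append(("create", name))
                gap -= 1
            index += 1
    elif gap < 0:
        # Too many Pods: request deletion of the surplus (last names in sort order).
        for name in sorted(observed_pods)[gap:]:
            actions.append(("delete", name))
    return actions


print(reconcile(3, ["web-0"]))
print(reconcile(1, ["web-0", "web-1", "web-2"]))
```

The function only returns intended actions. In the model described above, those would be written to the API server as objects, and the kubelet and runtime would carry them out.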

Cloud Controller Manager

Present only on clusters that integrate with a cloud provider. It links Kubernetes to provider-specific APIs:

Provisioning cloud load balancers for LoadBalancer Services
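Configuring routes in the cloud network so that Pods on different nodes can reach each other (volume attachment is handled by CSI drivers, not by the cloud controller manager)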
Managing cloud-specific node lifecycle (e.g., removing a node record when a VM is terminated)

Worker Node Components

Worker nodes execute the actual workloads. Every worker node runs three core components.

Kubelet

The kubelet is the primary Kubernetes agent on each node.

Watches the API server for Pods assigned to its node
Instructs the container runtime to pull images and start/stop containers
Reports node and Pod health back to the API server continuously
Acts as the bridge between the Kubernetes control plane and the container runtime

Container Runtime

The runtime performs the low-level work: pulling images, creating namespaces and cgroups, starting and stopping containers. The kubelet communicates with it via the CRI (see below).

Runtime	Default on	Notes
containerd	Most distributions	Lightweight; extracted from Docker
CRI-O	Red Hat OpenShift	Purpose-built for Kubernetes
Docker Engine	Legacy clusters	Built-in dockershim removed in Kubernetes 1.24; Docker Engine remains usable through the external `cri-dockerd` adapter

kube-proxy

kube-proxy runs on every node and implements cluster networking rules for Services.

Maintains iptables or IPVS rules that route Service VIP traffic to healthy Pod endpoints
Load-balances Service traffic passing through its node across all healthy backend Pods, wherever in the cluster they run
Partially responsible for implementing the Kubernetes Service abstraction
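
Conceptually, the rule set is a lookup table from a Service's virtual IP and port to its ready endpoints. The sketch below models that table in plain Python. The `ServiceTable` class and its methods are hypothetical, and uniform random selection is an assumption that roughly approximates iptables mode.

```python
import random


class ServiceTable:
    """Toy model of the per-node Service routing rules."""

    def __init__(self, seed=None):
        self._rules = {}                    # (vip, port) -> list of ready backends
        self._rng = random.Random(seed)

    def sync(self, vip: str, port: int, endpoints: list) -> None:
        """Replace the rules for one Service, keeping only ready endpoints."""
        self._rules[(vip, port)] = [addr for addr, ready in endpoints if ready]

    def route(self, vip: str, port: int):
        """Pick a backend for a new connection, or None if none is available."""
        backends = self._rules.get((vip, port))
        if not backends:
            return None
        return self._rng.choice(backends)


table = ServiceTable(seed=0)
table.sync("10.96.0.10", 80, [("10.244.1.5:8080", True), ("10.244.2.7:8080", False)])
print(table.route("10.96.0.10", 80))
```

Calling `sync` again whenever endpoints change is what keeps traffic away from Pods that have failed their readiness checks.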

Add-on Components

Beyond the core node components, most production clusters run additional services that are not part of core Kubernetes but are essential for a functioning environment:

CoreDNS - cluster-internal DNS server that resolves Service names to ClusterIPs (covered in Service Discovery Deep Dive)
CNI plugin - implements Pod-to-Pod networking (Calico, Cilium, Flannel - see CNI section below)
Metrics Server - collects resource usage data from kubelets for HPA and kubectl top
Logging agents - DaemonSet-deployed collectors (Fluent Bit, Fluentd) that ship container logs to a central store

Add-ons typically run as Pods on worker nodes, though some (like CoreDNS) may also run on control plane nodes in smaller clusters.

The Deployment Sequence

When you submit a manifest, these components act in a precise sequence:

flowchart TD
    A["YAML manifest file + You (kubectl apply)"] --> B["API Server"]
    B --> C["etcd - stores desired state"]
    B --> D["Controller Manager - creates Pod objects"]
    D --> E["Scheduler - assigns Pods to nodes"]
    E --> F["Kubelet on assigned node"]
    F --> G["Container Runtime - pulls image, starts container"]
    G --> H["kube-proxy - configures Service routing"]

After initial deployment, the kubelet and controllers run in continuous watch loops - they detect drift and reconcile automatically, enabling self-healing without manual intervention.
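
The sequence and the self-healing behaviour can be imitated with a few functions sharing one in-memory store. This is a deliberately crude simulation: the dictionary layout, the function names, and the round-robin placement are assumptions made for illustration, and a real cluster reacts to watch events rather than calling components in order.

```python
def controller(store: dict) -> None:
    """Create Pod objects until every Deployment has its desired replica count."""
    for name, deployment in store["deployments"].items():
        for i in range(deployment["replicas"]):
            store["pods"].setdefault(f"{name}-{i}", {"node": None, "phase": "Pending"})


def scheduler(store: dict) -> None:
    """Bind unassigned Pods to nodes (round-robin stands in for filter and score)."""
    unbound = sorted(n for n, pod in store["pods"].items() if pod["node"] is None)
    for i, pod_name in enumerate(unbound):
        store["pods"][pod_name]["node"] = store["nodes"][i % len(store["nodes"])]


def kubelet(store: dict, node: str) -> None:
    """Start every Pending Pod that has been bound to this node."""
    for pod in store["pods"].values():
        if pod["node"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"


def reconcile_once(store: dict) -> None:
    """Run each component once, in the order shown in the flowchart."""
    controller(store)
    scheduler(store)
    for node in store["nodes"]:
        kubelet(store, node)


store = {"nodes": ["node-a", "node-b"], "deployments": {}, "pods": {}}
store["deployments"]["web"] = {"replicas": 3}   # stands in for "kubectl apply"
reconcile_once(store)
```

Deleting a Pod from `store["pods"]` and calling `reconcile_once` again recreates it, which is the drift-and-reconcile behaviour the paragraph above describes.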

Architectural Properties

Five design properties permeate every part of the architecture and explain why Kubernetes is built the way it is:

Property	What it means
Portability	Container images bundle all dependencies - code, runtime, config. The same manifest runs identically on bare metal, a private data centre, or any major cloud provider. Avoid cloud-provider-specific opt-in features if portability across providers matters.
Resilience	The entire system is a declarative state machine. Controllers are reconciliation loops: they continuously compare observed state to desired state and make corrections automatically - without human intervention.
Scalability	Designed for large-scale operation from the ground up. Pod counts can be adjusted automatically based on real-time resource consumption or other metrics. Upstream Kubernetes is tested to support clusters of up to 5,000 nodes and 150,000 Pods.
API-first	Every capability is exposed through a standard RESTful API. Clients (`kubectl`, CI pipelines, operators, custom tools) all talk to the same API server endpoints - making it straightforward to build new integrations without modifying the core.
Extensibility	Core Kubernetes is deliberately minimal. Purpose-built extension points (CRDs, admission webhooks, the CNI/CRI/CSI interfaces below) allow the community and vendors to add capabilities - monitoring, policy enforcement, custom runtimes - without forking the project.

Extension Interfaces: CNI · CRI · CSI

Kubernetes deliberately delegates three infrastructure concerns to pluggable interfaces, keeping the core lean and vendor-neutral.

CNI - Container Network Interface

Responsible for establishing network connectivity between Pods.

A CNI plugin must be installed on every node (typically rolled out as a DaemonSet right after the control plane is initialised) before Pod networking functions
Without a CNI plugin, Pods cannot communicate across nodes and the cluster is non-functional
The choice of CNI plugin determines what NetworkPolicy enforcement, encryption, and observability features are available

Plugin	Notes
Calico	BGP-based routing; mature NetworkPolicy support
Cilium	eBPF-based; high performance, L7 policy, built-in observability
Flannel	Simple overlay; minimal features; common in dev clusters
AWS VPC CNI	Native VPC IP assignment for EKS
Azure CNI	Native VNet IP assignment for AKS

CRI - Container Runtime Interface

An API that decouples Kubernetes from the container runtime layer, making runtimes swappable.

The kubelet communicates with any CRI-compliant runtime without modification
Enables specialised runtimes for security (Kata Containers, gVisor) or Wasm workloads
Several CRI-compatible runtimes are available today; containerd is the most widely deployed
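
The decoupling can be illustrated with a structural interface: the caller depends only on a small set of operations, so any object that provides them can be swapped in. The method names below are simplified and hypothetical. The real CRI is a gRPC API with a larger surface.

```python
from typing import Protocol


class ContainerRuntime(Protocol):
    """The only operations this toy kubelet needs from a runtime."""

    def pull_image(self, image: str) -> None: ...
    def start_container(self, name: str, image: str) -> str: ...
    def stop_container(self, container_id: str) -> None: ...


class FakeRuntime:
    """In-memory runtime; `label` distinguishes interchangeable implementations."""

    def __init__(self, label: str):
        self.label = label
        self.images = set()
        self.running = {}

    def pull_image(self, image: str) -> None:
        self.images.add(image)

    def start_container(self, name: str, image: str) -> str:
        if image not in self.images:
            raise RuntimeError(f"image {image} has not been pulled")
        container_id = f"{self.label}/{name}"
        self.running[container_id] = image
        return container_id

    def stop_container(self, container_id: str) -> None:
        self.running.pop(container_id, None)


def run_pod(runtime: ContainerRuntime, name: str, image: str) -> str:
    """Kubelet-side logic: identical regardless of which runtime is plugged in."""
    runtime.pull_image(image)
    return runtime.start_container(name, image)
```

`run_pod` never inspects the runtime's type, which is the property that lets a cluster change runtimes without changing the kubelet.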

CSI - Container Storage Interface

A standard that allows third-party storage systems to integrate with Kubernetes via a driver plugin, without modifying the Kubernetes core.

Replaced the old in-tree volume plugin system
Over 100 CSI drivers are available in the ecosystem

Category	Examples
Cloud-native	AWS EBS CSI, Azure Disk CSI, GCE Persistent Disk CSI
Enterprise storage	NetApp, Pure Storage, Portworx
Open-source / self-hosted	Rook/Ceph, Longhorn