Kubernetes Architecture
What is a Cluster?
A cluster is a group of nodes - physical servers, virtual machines, or cloud instances - that pool their CPU, memory, and storage into a single compute surface. Nodes can run on ARM or AMD64/x86-64 architectures, and a single cluster can mix hardware types, operating systems, and even cloud providers.
From the outside, none of this heterogeneity is visible. Developers interact with a single API endpoint; Kubernetes decides where and how workloads actually run across the available machines. The entire cluster presents itself as one unified deployment area.
Internally, the cluster is divided into two planes: the control plane, which implements the intelligence of the system, and the workload plane, which executes your applications. Components in both planes coordinate through the API server rather than calling one another directly.
The Two-Plane Model
| Plane | Nodes | Runs | OS requirement |
|---|---|---|---|
| Control plane | 1 (dev) · 3 or 5 (prod HA) | Cluster management services | Linux only |
| Workload plane | 1 → thousands | Your application workloads | Linux or Windows |
Control Plane Components
The control plane is the cluster’s brain. All control plane services are typically replicated across every control plane node for high availability.
API Server
The API server is the single entry point for all cluster communication - including communication between internal components. Everything goes through it.
- Exposes a RESTful HTTPS API
- Handles authentication, authorisation, and admission control on every request
- Is stateless - it reads and writes all state to etcd
- kubectl, CI pipelines, operators, and controllers all communicate via the API server
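The request path described in the list above (authentication, then authorisation, then admission control) can be sketched as a chain of checks. This is a toy model, not the API server's implementation: the `Request` type, the in-memory user and permission tables, and the admission rule are all hypothetical, and the status code used for an admission rejection is illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Request:
    user: Optional[str]          # None models a request without valid credentials
    verb: str                    # e.g. "get" or "create"
    resource: str                # e.g. "pods"
    body: dict = field(default_factory=dict)

# Hypothetical stand-ins for real authentication and RBAC configuration.
KNOWN_USERS = {"alice", "ci-bot"}
ALLOWED = {("alice", "create", "pods"), ("ci-bot", "get", "pods")}

def authenticate(req: Request) -> bool:
    return req.user in KNOWN_USERS

def authorise(req: Request) -> bool:
    return (req.user, req.verb, req.resource) in ALLOWED

def admit(req: Request) -> bool:
    # Toy admission rule: any write must carry an object name.
    return req.verb == "get" or "name" in req.body

def handle(req: Request) -> int:
    """Run the three stages in order and return an HTTP-style status code."""
    if not authenticate(req):
        return 401
    if not authorise(req):
        return 403
    if not admit(req):
        return 400  # illustrative; real admission rejections vary
    return 200      # only now would the object be persisted

print(handle(Request("alice", "create", "pods", {"name": "web"})))
print(handle(Request("ci-bot", "create", "pods", {"name": "web"})))
```

Each stage can reject the request before the next one runs, which is why a request that fails authorisation never reaches admission control.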
etcd
etcd is the cluster’s only persistent store - the source of truth for all desired and observed state.
- Distributed key-value database built on the Raft consensus algorithm, which prevents split-brain and data corruption from concurrent writes
- Only the API server communicates with etcd directly - no other component does
- Prefers an odd number of replicas (3 or 5) to maintain quorum
- Large, high-churn clusters may run a dedicated etcd cluster for performance
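The preference for an odd number of replicas follows from majority-quorum arithmetic. The sketch below uses hypothetical helper names and the standard majority rule to show that adding a fourth member to a three-member cluster does not increase the number of failures it can survive.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)

for n in (1, 2, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Three and four members both tolerate one failure, while five tolerate two, so even-sized clusters add cost without adding resilience.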
Scheduler
The scheduler watches the API server for Pods that have no node assignment and selects the best node for each one.
Node selection is a two-phase process:
- Filtering - eliminates nodes that cannot satisfy the Pod’s requirements (insufficient CPU/memory, unavailable host ports, untolerated taints, node affinity rules)
- Scoring / ranking - ranks remaining nodes and picks the highest scorer
If no node passes the filter, the Pod stays Pending. If a cluster autoscaler is configured, unschedulable Pending Pods can trigger node provisioning.
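The two phases can be sketched as a filter followed by a `max` over scores. This is a simplified model under stated assumptions: the `Node` and `Pod` types are hypothetical, only CPU, memory, and taints are checked, and the scoring rule (most CPU left after placement) is a stand-in for the scheduler's real scoring plugins.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_cpu_m: int                      # free CPU in millicores
    free_mem_mi: int                     # free memory in MiB
    taints: set = field(default_factory=set)

@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)

def feasible(pod: Pod, node: Node) -> bool:
    """Filtering phase: the node must fit the Pod and every taint must be tolerated."""
    return (node.free_cpu_m >= pod.cpu_m
            and node.free_mem_mi >= pod.mem_mi
            and node.taints <= pod.tolerations)

def score(pod: Pod, node: Node) -> int:
    """Scoring phase (toy rule): prefer the node with the most CPU left over."""
    return node.free_cpu_m - pod.cpu_m

def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None  # no feasible node: the Pod would stay Pending
    return max(candidates, key=lambda n: score(pod, n)).name

nodes = [
    Node("node-a", 500, 1024),
    Node("node-b", 2000, 4096),
    Node("node-c", 4000, 8192, taints={"gpu"}),
]
print(schedule(Pod("web", 1000, 512), nodes))
```

In this example `node-a` is filtered out for insufficient CPU and `node-c` for its untolerated taint, leaving `node-b` as the only candidate.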
Controller Manager
Controllers are background reconciliation loops - they watch for a difference between desired state and observed state, then take action to close the gap.
- Each controller handles one resource type (Deployment, ReplicaSet, StatefulSet, Job, etc.)
- The controller manager spawns and supervises all individual controllers
- Controllers never act directly on infrastructure - they write objects to the API server and let downstream components react
flowchart LR
A["Desired state (etcd)"] --> B["Controller observes gap"]
B --> C["Controller writes corrective objects"]
C --> D["Kubelet / runtime executes"]
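A reconciliation step can be sketched as a pure function from desired and observed state to a list of corrective actions. This is a minimal illustration, not a real controller: the function, the action tuples, and the placeholder Pod names are hypothetical, and a real controller would write objects to the API server instead of returning a list.

```python
from typing import List, Tuple

def reconcile(desired: int, observed: List[str]) -> List[Tuple[str, str]]:
    """Return the actions that move the observed Pods toward the desired count."""
    gap = desired - len(observed)
    if gap > 0:
        # Too few Pods: request the missing ones (placeholder names).
        return [("create", f"new-pod-{i}") for i in range(gap)]
    if gap < 0:
        # Too many Pods: remove the surplus beyond the desired count.
        return [("delete", name) for name in observed[desired:]]
    return []  # observed state already matches desired state

print(reconcile(3, ["web-a"]))
print(reconcile(1, ["web-a", "web-b", "web-c"]))
print(reconcile(2, ["web-a", "web-b"]))
```

Running the function repeatedly is safe: once the gap is closed it returns no actions, which is the property that makes the loop self-healing.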
Cloud Controller Manager
Present only on clusters running inside a public cloud. It integrates Kubernetes with provider-specific APIs:
- Provisioning cloud load balancers for LoadBalancer Services
- Attaching cloud storage volumes
- Managing cloud-specific node lifecycle (e.g., removing a node record when a VM is terminated)
Worker Node Components
Worker nodes execute the actual workloads. Every worker node runs three core components.
Kubelet
The kubelet is the primary Kubernetes agent on each node.
- Watches the API server for Pods assigned to its node
- Instructs the container runtime to pull images and start/stop containers
- Reports node and Pod health back to the API server continuously
- Acts as the bridge between the Kubernetes control plane and the container runtime
Container Runtime
The runtime performs the low-level work: pulling images, creating namespaces and cgroups, starting and stopping containers. The kubelet communicates with it via the CRI (see below).
| Runtime | Default on | Notes |
|---|---|---|
| containerd | Most distributions | Lightweight; extracted from Docker |
| CRI-O | Red Hat OpenShift | Purpose-built for Kubernetes |
| Docker Engine | Legacy clusters | Built-in dockershim removed in Kubernetes 1.24; still usable via the external cri-dockerd adapter |
kube-proxy
kube-proxy runs on every node and implements cluster networking rules for Services.
- Maintains iptables or IPVS rules that route Service VIP traffic to healthy Pod endpoints
- Distributes Service traffic that passes through the node across the backend Pods, which may run on any node
- Partially responsible for implementing the Kubernetes Service abstraction
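The routing behaviour in the list above can be modelled as a lookup from a Service VIP to its ready endpoints, followed by a random pick, which is how iptables mode selects a backend. The table contents and function name below are hypothetical; the real rules live in the kernel, not in a Python dictionary.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical Service table: "VIP:port" -> list of (Pod endpoint, is_ready).
SERVICES: Dict[str, List[Tuple[str, bool]]] = {
    "10.96.0.10:80": [
        ("10.244.1.5:8080", True),
        ("10.244.2.7:8080", True),
        ("10.244.3.9:8080", False),  # not ready, so never selected
    ],
}

def pick_backend(vip: str, rng: random.Random) -> Optional[str]:
    """Choose one ready endpoint for the VIP, or None if there is none."""
    ready = [addr for addr, ok in SERVICES.get(vip, []) if ok]
    return rng.choice(ready) if ready else None

rng = random.Random(0)  # seeded only to make the sketch repeatable
for _ in range(3):
    print(pick_backend("10.96.0.10:80", rng))
```

The endpoint that is not ready is excluded before selection, so traffic only reaches Pods that can serve it.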
Add-on Components
Beyond the core node components, production clusters run additional services that are not part of core Kubernetes but are essential for a functioning environment:
- CoreDNS - cluster-internal DNS server that resolves Service names to ClusterIPs (covered in Service Discovery Deep Dive)
- CNI plugin - implements Pod-to-Pod networking (Calico, Cilium, Flannel - see CNI section below)
- Metrics Server - collects resource usage data from kubelets for HPA and kubectl top
- Logging agents - DaemonSet-deployed collectors (Fluent Bit, Fluentd) that ship container logs to a central store
Add-ons typically run as Pods on worker nodes, though some (like CoreDNS) may also run on control plane nodes in smaller clusters.
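As a small illustration of the name resolution CoreDNS provides, the helper below builds the fully qualified DNS name of a Service. The function name is hypothetical, and `cluster.local` is assumed as the cluster domain because it is the common default; clusters can be configured with a different one.

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the DNS name that resolves to a Service's ClusterIP."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_fqdn("web", "shop"))
print(service_fqdn("kubernetes"))
```

Pods in the same namespace can usually use the short name (`web`), because the resolver search path supplies the remaining suffixes.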
The Deployment Sequence
When you submit a manifest, these components act in a precise sequence:
flowchart TD
A["YAML manifest file + You (kubectl apply)"] --> B["API Server"]
B --> C["etcd - stores desired state"]
B --> D["Controller Manager - creates Pod objects"]
D --> E["Scheduler - assigns Pods to nodes"]
E --> F["Kubelet on assigned node"]
F --> G["Container Runtime - pulls image, starts container"]
G --> H["kube-proxy - configures Service routing"]
After initial deployment, the kubelet and controllers run in continuous watch loops - they detect drift and reconcile automatically, enabling self-healing without manual intervention.
Architectural Properties
Five design properties permeate every part of the architecture and explain why Kubernetes is built the way it is:
| Property | What it means |
|---|---|
| Portability | Container images bundle all dependencies - code, runtime, config. The same manifest runs identically on bare metal, a private data centre, or any major cloud provider. Avoid cloud-provider-specific opt-in features if portability across providers matters. |
| Resilience | The entire system is a declarative state machine. Controllers are reconciliation loops: they continuously compare observed state to desired state and make corrections automatically - without human intervention. |
| Scalability | Designed for enterprise scale from the ground up. Pod counts can be adjusted automatically based on real-time resource consumption or historical load trends. Upstream Kubernetes documents support for clusters of up to 5,000 nodes. |
| API-first | Every capability is exposed through a standard RESTful API. Clients (kubectl, CI pipelines, operators, custom tools) all talk to the same API server endpoints - making it straightforward to build new integrations without modifying the core. |
| Extensibility | Core Kubernetes is deliberately minimal. Purpose-built extension points (CRDs, admission webhooks, the CNI/CRI/CSI interfaces below) allow the community and vendors to add capabilities - monitoring, policy enforcement, custom runtimes - without forking the project. |
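The Scalability row mentions adjusting Pod counts from real-time resource consumption. The Horizontal Pod Autoscaler documents its core calculation as the current replica count scaled by the ratio of the current metric to the target metric, rounded up. The sketch below applies only that formula; the function name is illustrative, and the tolerance band and stabilisation behaviour of the real controller are deliberately left out.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Core HPA ratio: ceil(current_replicas * current_metric / target_metric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 replicas averaging 90% CPU against a 60% target -> scale out.
print(desired_replicas(4, 90, 60))
# 3 replicas averaging 20% CPU against a 60% target -> scale in.
print(desired_replicas(3, 20, 60))
```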
Extension Interfaces: CNI · CRI · CSI
Kubernetes deliberately delegates three infrastructure concerns to pluggable interfaces, keeping the core lean and vendor-neutral.
CNI - Container Network Interface
Responsible for establishing network connectivity between Pods.
- A CNI plugin must be installed before Pod networking functions - it is usually deployed right after the control plane is initialised and then runs on every node
- Without a CNI plugin, Pods cannot communicate across nodes and the cluster is non-functional
- The choice of CNI plugin determines what NetworkPolicy enforcement, encryption, and observability features are available
| Plugin | Notes |
|---|---|
| Calico | BGP-based routing; mature NetworkPolicy support |
| Cilium | eBPF-based; high performance, L7 policy, built-in observability |
| Flannel | Simple overlay; minimal features; common in dev clusters |
| AWS VPC CNI | Native VPC IP assignment for EKS |
| Azure CNI | Native VNet IP assignment for AKS |
CRI - Container Runtime Interface
An API that decouples Kubernetes from the container runtime layer, making runtimes swappable.
- The kubelet communicates with any CRI-compliant runtime without modification
- Enables specialised runtimes for security (Kata Containers, gVisor) or Wasm workloads
- Multiple CRI-compliant runtime implementations exist today; containerd is the most widely deployed
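The decoupling described above can be illustrated with a structural interface: the caller depends only on the interface, so any conforming runtime can be swapped in. This is a sketch under stated assumptions. The real CRI is a gRPC API, the method names here are only loosely modelled on its image-pull, create-container, and start-container calls, the Pod sandbox step is omitted, and `FakeRuntime` is a hypothetical test double.

```python
from typing import List, Protocol

class ContainerRuntime(Protocol):
    """Hypothetical, simplified view of what a kubelet needs from a runtime."""
    def pull_image(self, image: str) -> None: ...
    def create_container(self, name: str, image: str) -> str: ...
    def start_container(self, container_id: str) -> None: ...

class FakeRuntime:
    """In-memory runtime that only records the calls it receives."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def pull_image(self, image: str) -> None:
        self.log.append(f"pull {image}")

    def create_container(self, name: str, image: str) -> str:
        self.log.append(f"create {name}")
        return f"{name}-id"

    def start_container(self, container_id: str) -> None:
        self.log.append(f"start {container_id}")

def start_pod_container(runtime: ContainerRuntime, name: str, image: str) -> str:
    """Kubelet-side logic written against the interface, not a concrete runtime."""
    runtime.pull_image(image)
    container_id = runtime.create_container(name, image)
    runtime.start_container(container_id)
    return container_id

rt = FakeRuntime()
print(start_pod_container(rt, "web", "nginx:1.27"))
print(rt.log)
```

Because `start_pod_container` never names a concrete runtime, replacing `FakeRuntime` with another conforming class requires no change to the calling code.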
CSI - Container Storage Interface
A standard that allows third-party storage systems to integrate with Kubernetes via a driver plugin, without modifying the Kubernetes core.
- Replaced the old in-tree volume plugin system
- Over 100 CSI drivers are available in the ecosystem
| Category | Examples |
|---|---|
| Cloud-native | AWS EBS CSI, Azure Disk CSI, GCE Persistent Disk CSI |
| Enterprise storage | NetApp, Pure Storage, Portworx |
| Open-source / self-hosted | Rook/Ceph, Longhorn |