Kubernetes Architecture

A cluster is a group of nodes - physical servers, virtual machines, or cloud instances - that pool their CPU, memory, and storage into a single compute surface. Nodes can run on ARM or AMD64/x86-64 architectures, and a single cluster can mix hardware types, operating systems, and even cloud providers.

From the outside, none of this heterogeneity is visible. Developers interact with a single API endpoint; Kubernetes decides where and how workloads actually run across the available machines. The entire cluster presents itself as one unified deployment area.

Internally, the cluster is divided into two planes: the control plane, which implements the intelligence of the system, and the workload plane, which executes your applications. Components in both planes coordinate through the API server rather than calling each other directly; the notable exception is node-local, where the kubelet drives the container runtime over the CRI.

| Plane | Nodes | Runs | OS requirement |
| --- | --- | --- | --- |
| Control plane | 1 (dev) · 3 or 5 (prod HA) | Cluster management services | Linux only |
| Workload plane | 1 → thousands | Your application workloads | Linux or Windows |

The control plane is the cluster’s brain. All control plane services are typically replicated across every control plane node for high availability.

The API server is the single entry point for all cluster communication - including communication between internal components. Everything goes through it.

  • Exposes a RESTful HTTPS API
  • Handles authentication, authorisation, and admission control on every request
  • Is stateless - it reads and writes all state to etcd
  • kubectl, CI pipelines, operators, and controllers all communicate via the API server
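
Because the API is RESTful, every resource maps to a predictable URL path. The sketch below builds such paths following the upstream convention (the core group under `/api/<version>`, named groups under `/apis/<group>/<version>`). The helper name `resource_path` is hypothetical and not part of any client library.

```python
from typing import Optional


def resource_path(group: str, version: str, resource: str,
                  namespace: Optional[str] = None,
                  name: Optional[str] = None) -> str:
    """Build the REST path for a Kubernetes resource (illustrative helper)."""
    # The core group (Pods, Services, Nodes, ...) has an empty group name and
    # lives under /api; every named group (apps, batch, ...) lives under /apis.
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace is not None:
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name is not None:
        parts.append(name)
    return "/".join(parts)


print(resource_path("", "v1", "pods", namespace="default"))
# /api/v1/namespaces/default/pods
print(resource_path("apps", "v1", "deployments", namespace="web", name="frontend"))
# /apis/apps/v1/namespaces/web/deployments/frontend
print(resource_path("", "v1", "nodes"))
# /api/v1/nodes
```

Any client, whether kubectl or a custom tool, ultimately issues HTTPS requests against paths of this shape.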

etcd is the cluster’s only persistent store - the source of truth for all desired and observed state.

  • Distributed key-value database built on the Raft consensus algorithm, which prevents split-brain and data corruption from concurrent writes
  • Only the API server communicates with etcd directly - no other component does
  • Prefers an odd number of replicas (3 or 5) to maintain quorum
  • Large, high-churn clusters may run a dedicated etcd cluster for performance
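
The preference for odd member counts follows from majority arithmetic: a write needs a majority of members, so an even-sized cluster tolerates no more failures than the odd size just below it. The function names here are illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Going from 3 to 4 members adds replication cost without tolerating an extra failure, which is why 3 or 5 are the usual choices.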

The scheduler watches the API server for Pods that have no node assignment and selects the best node for each one.

Node selection is a two-phase process:

  1. Filtering - eliminates nodes that cannot satisfy the Pod’s requirements (insufficient CPU/memory, host ports already in use, untolerated taints, unmatched node affinity rules)
  2. Scoring / ranking - ranks remaining nodes and picks the highest scorer

If no node passes the filter, the Pod stays Pending. If cluster autoscaling is configured, an unschedulable Pending Pod can trigger node provisioning, provided a new node would allow it to schedule.
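
The two phases can be sketched in a few lines. This is a toy model under stated assumptions, not the real scheduler (which runs a configurable set of filter and score plugins): the `Node` and `Pod` classes, the taint-as-string simplification, and the "most headroom" scoring heuristic are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # free CPU in millicores
    free_mem_mi: int                     # free memory in MiB
    taints: Set[str] = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerations: Set[str] = field(default_factory=set)


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering phase: can this node run the Pod at all?"""
    fits = pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi
    tolerated = node.taints <= pod.tolerations   # every taint must be tolerated
    return fits and tolerated


def score(pod: Pod, node: Node) -> float:
    """Scoring phase: favour the node left with the most headroom (toy heuristic)."""
    cpu_left = (node.free_cpu_m - pod.cpu_m) / max(node.free_cpu_m, 1)
    mem_left = (node.free_mem_mi - pod.mem_mi) / max(node.free_mem_mi, 1)
    return (cpu_left + mem_left) / 2


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None                      # the Pod would stay Pending
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("small", free_cpu_m=500, free_mem_mi=512),
    Node("big", free_cpu_m=4000, free_mem_mi=8192),
    Node("gpu", free_cpu_m=8000, free_mem_mi=16384, taints={"gpu"}),
]
print(schedule(Pod("web", 1000, 1024), nodes))                          # big
print(schedule(Pod("train", 1000, 1024, tolerations={"gpu"}), nodes))   # gpu
print(schedule(Pod("huge", 16000, 1024), nodes))                        # None
```

The `web` Pod never reaches the `gpu` node because the taint removes it during filtering, before any scoring happens.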

Controllers are background reconciliation loops - they watch for a difference between desired state and observed state, then take action to close the gap.

  • Each controller handles one resource type (Deployment, ReplicaSet, StatefulSet, Job, etc.)
  • The controller manager spawns and supervises all individual controllers
  • Controllers never act directly on infrastructure - they write objects to the API server and let downstream components react

```mermaid
flowchart LR
    A["Desired state (etcd)"] --> B["Controller observes gap"]
    B --> C["Controller writes corrective objects"]
    C --> D["Kubelet / runtime executes"]
```
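
A single pass of such a loop can be sketched as a pure function that compares desired and observed state and returns corrections. This is a minimal illustration, assuming a replica-count style resource. The names `reconcile` and `apply_actions` and the `pod-N` naming scheme are hypothetical, and a real controller reacts to watch events from the API server rather than iterating over a list.

```python
import itertools
from typing import Dict, List


def reconcile(desired: int, observed: List[str]) -> Dict[str, List[str]]:
    """One pass of a reconciliation loop: compare states, return corrections."""
    actions: Dict[str, List[str]] = {"create": [], "delete": []}
    gap = desired - len(observed)
    if gap > 0:
        # Too few Pods: request new ones using names that are not yet taken.
        unused = (f"pod-{i}" for i in itertools.count() if f"pod-{i}" not in observed)
        actions["create"] = list(itertools.islice(unused, gap))
    elif gap < 0:
        # Too many Pods: request deletion of the surplus.
        actions["delete"] = sorted(observed)[gap:]
    return actions


def apply_actions(observed: List[str], actions: Dict[str, List[str]]) -> List[str]:
    """Stand-in for the downstream components that act on the written objects."""
    kept = [p for p in observed if p not in actions["delete"]]
    return kept + actions["create"]


state = ["pod-0"]
for _ in range(3):                  # a real controller keeps looping indefinitely
    state = apply_actions(state, reconcile(3, state))
print(state)                        # ['pod-0', 'pod-1', 'pod-2']
```

Once the observed state matches the desired state, further passes return empty action lists, so the loop is safe to run repeatedly.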

The cloud controller manager is present only on clusters running inside a public cloud. It integrates Kubernetes with provider-specific APIs:

  • Provisioning cloud load balancers for LoadBalancer Services
  • Attaching cloud storage volumes
  • Managing cloud-specific node lifecycle (e.g., removing a node record when a VM is terminated)

Worker nodes execute the actual workloads. Every worker node runs three core components.

The kubelet is the primary Kubernetes agent on each node.

  • Watches the API server for Pods assigned to its node
  • Instructs the container runtime to pull images and start/stop containers
  • Reports node and Pod health back to the API server continuously
  • Acts as the bridge between the Kubernetes control plane and the container runtime

The container runtime performs the low-level work: pulling images, creating namespaces and cgroups, starting and stopping containers. The kubelet communicates with it via the CRI (see below).

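| Runtime | Default on | Notes |
| --- | --- | --- |
| containerd | Most distributions | Lightweight; extracted from Docker |
| CRI-O | Red Hat OpenShift | Purpose-built for Kubernetes |
| Docker Engine | Legacy clusters | Built-in dockershim support removed in Kubernetes 1.24; still usable through the external cri-dockerd adapter |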

kube-proxy runs on every node and implements cluster networking rules for Services.

  • Maintains iptables or IPVS rules that route Service VIP traffic to healthy Pod endpoints
  • Load-balances Service traffic at the node level, spreading connections across the Service’s healthy Pod endpoints, whichever node they run on
  • Partially responsible for implementing the Kubernetes Service abstraction
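
The mapping kube-proxy maintains can be pictured as a table from each Service VIP to its ready endpoints. The sketch below is a toy model only: real kube-proxy programs kernel rules (iptables or IPVS) rather than choosing backends in Python, and the `ServiceTable` class, its method names, and the round-robin policy are assumptions made for illustration.

```python
import itertools
from typing import Dict, Iterator, List, Tuple

Endpoint = Tuple[str, int]          # (IP address, port)


class ServiceTable:
    """Toy model of the VIP -> endpoints mapping that kube-proxy keeps in sync."""

    def __init__(self) -> None:
        self._rotation: Dict[Endpoint, Iterator[Endpoint]] = {}

    def sync(self, vip: Endpoint, ready_endpoints: List[Endpoint]) -> None:
        # Called whenever the set of ready endpoints for a Service changes.
        if ready_endpoints:
            self._rotation[vip] = itertools.cycle(list(ready_endpoints))
        else:
            self._rotation.pop(vip, None)

    def route(self, vip: Endpoint) -> Endpoint:
        # Pick the backend for a new connection to the VIP (round-robin here).
        if vip not in self._rotation:
            raise LookupError(f"no ready endpoints for {vip}")
        return next(self._rotation[vip])


table = ServiceTable()
vip = ("10.96.0.10", 80)
table.sync(vip, [("10.244.1.5", 8080), ("10.244.2.7", 8080)])
print([table.route(vip) for _ in range(3)])
# [('10.244.1.5', 8080), ('10.244.2.7', 8080), ('10.244.1.5', 8080)]

# One Pod becomes unhealthy: after the next sync it stops receiving traffic.
table.sync(vip, [("10.244.2.7", 8080)])
print(table.route(vip))             # ('10.244.2.7', 8080)
```

Clients keep using the stable VIP throughout; only the table behind it changes as Pods come and go.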

Beyond the core node components, every production cluster runs additional services that are not strictly part of the Kubernetes binary but are essential for a functioning environment:

  • CoreDNS - cluster-internal DNS server that resolves Service names to ClusterIPs (covered in Service Discovery Deep Dive)
  • CNI plugin - implements Pod-to-Pod networking (Calico, Cilium, Flannel - see CNI section below)
  • Metrics Server - collects resource usage data from kubelets for HPA and kubectl top
  • Logging agents - DaemonSet-deployed collectors (Fluent Bit, Fluentd) that ship container logs to a central store

Add-ons typically run as Pods on worker nodes, though some (like CoreDNS) may also run on control plane nodes in smaller clusters.


When you submit a manifest, these components act in a precise sequence:

```mermaid
flowchart TD
    A["YAML manifest file + You (kubectl apply)"] --> B["API Server"]
    B --> C["etcd - stores desired state"]
    B --> D["Controller Manager - creates Pod objects"]
    D --> E["Scheduler - assigns Pods to nodes"]
    E --> F["Kubelet on assigned node"]
    F --> G["Container Runtime - pulls image, starts container"]
    G --> H["kube-proxy - configures Service routing"]
```

After initial deployment, the kubelet and controllers run in continuous watch loops - they detect drift and reconcile automatically, enabling self-healing without manual intervention.


Five design properties permeate every part of the architecture and explain why Kubernetes is built the way it is:

| Property | What it means |
| --- | --- |
| Portability | Container images bundle all dependencies - code, runtime, config. The same manifest runs identically on bare metal, a private data centre, or any major cloud provider. Avoid cloud-provider-specific opt-in features if portability across providers matters. |
| Resilience | The entire system is a declarative state machine. Controllers are reconciliation loops: they continuously compare observed state to desired state and make corrections automatically - without human intervention. |
| Scalability | Designed for large clusters from the ground up. Pod counts can be adjusted automatically based on real-time resource consumption or historical load trends. Upstream Kubernetes is designed to support clusters of up to 5,000 nodes. |
| API-first | Every capability is exposed through a standard RESTful API. Clients (kubectl, CI pipelines, operators, custom tools) all talk to the same API server endpoints - making it straightforward to build new integrations without modifying the core. |
| Extensibility | Core Kubernetes is deliberately minimal. Purpose-built extension points (CRDs, admission webhooks, the CNI/CRI/CSI interfaces below) allow the community and vendors to add capabilities - monitoring, policy enforcement, custom runtimes - without forking the project. |

Kubernetes deliberately delegates three infrastructure concerns to pluggable interfaces, keeping the core lean and vendor-neutral.

The Container Network Interface (CNI) is responsible for establishing network connectivity between Pods.

  • A CNI plugin must be installed on every node (typically deployed as a DaemonSet once the control plane is up) before Pod networking functions
  • Without a CNI plugin, Pods cannot communicate across nodes and the cluster is non-functional
  • The choice of CNI plugin determines what NetworkPolicy enforcement, encryption, and observability features are available

| Plugin | Notes |
| --- | --- |
| Calico | BGP-based routing; mature NetworkPolicy support |
| Cilium | eBPF-based; high performance, L7 policy, built-in observability |
| Flannel | Simple overlay; minimal features; common in dev clusters |
| AWS VPC CNI | Native VPC IP assignment for EKS |
| Azure CNI | Native VNet IP assignment for AKS |

The Container Runtime Interface (CRI) is an API that decouples Kubernetes from the container runtime layer, making runtimes swappable.

  • The kubelet communicates with any CRI-compliant runtime without modification
  • Enables specialised runtimes for security (Kata Containers, gVisor) or Wasm workloads
  • ~10–15 production-ready runtime implementations exist today; containerd is the most widely deployed
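
The decoupling can be illustrated with an interface and two interchangeable implementations. This is a heavily simplified sketch: the real CRI is a gRPC API with many more operations, and the `ContainerRuntime` protocol, `FakeRuntime` class, and `run_pod` function below are hypothetical stand-ins, not the actual interface.

```python
from typing import Dict, List, Protocol


class ContainerRuntime(Protocol):
    """Hypothetical, much-reduced stand-in for the runtime contract."""

    def pull_image(self, image: str) -> None: ...
    def start_container(self, name: str, image: str) -> str: ...
    def stop_container(self, container_id: str) -> None: ...


class FakeRuntime:
    """In-memory runtime used only to show that implementations are swappable."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.images: List[str] = []
        self.running: Dict[str, str] = {}

    def pull_image(self, image: str) -> None:
        if image not in self.images:
            self.images.append(image)

    def start_container(self, name: str, image: str) -> str:
        container_id = f"{self.label}-{name}"
        self.running[container_id] = image
        return container_id

    def stop_container(self, container_id: str) -> None:
        self.running.pop(container_id, None)


def run_pod(runtime: ContainerRuntime, name: str, image: str) -> str:
    """Kubelet-like logic written only against the interface."""
    runtime.pull_image(image)
    return runtime.start_container(name, image)


for rt in (FakeRuntime("containerd"), FakeRuntime("crio")):
    print(run_pod(rt, "web", "nginx:1.27"))
# containerd-web
# crio-web
```

`run_pod` never names a concrete runtime, which is the property that lets the kubelet work unchanged with any compliant implementation.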

The Container Storage Interface (CSI) is a standard that allows third-party storage systems to integrate with Kubernetes via a driver plugin, without modifying the Kubernetes core.

  • Replaced the old in-tree volume plugin system
  • Over 100 CSI drivers are available in the ecosystem

| Category | Examples |
| --- | --- |
| Cloud-native | AWS EBS CSI, Azure Disk CSI, GCE Persistent Disk CSI |
| Enterprise storage | NetApp, Pure Storage, Portworx |
| Open-source / self-hosted | Rook/Ceph, Longhorn |