Docker Model Runner

Docker Model Runner (DMR) is a Docker toolchain component that executes AI models directly on the host machine - not inside containers.
It exposes OpenAI-compatible inference endpoints, making it a drop-in local replacement for cloud AI APIs.
Because it runs locally, DMR addresses privacy concerns, eliminates cloud costs, reduces latency, and gives you full control over which models run.

Why Models Run Outside Containers

Containers cannot easily access most AI acceleration hardware:

Only CUDA-capable NVIDIA GPUs are accessible from containers, and doing so requires the complex NVIDIA Container Toolkit installation.
NPUs, TPUs, and non-NVIDIA GPUs (including Apple Silicon) are entirely inaccessible from inside containers.

DMR bypasses this by running as a host process with direct hardware access - NVIDIA GPUs on Windows, Apple Silicon via Metal on Mac. CPU fallback is available but significantly slower.

DMR vs. Alternatives

Tool	Inference Engine	Hardware	Docker Integration
DMR	llama.cpp (pluggable)	NVIDIA + Apple Silicon	Native - `docker model` CLI, OCI registry
Ollama	llama.cpp	NVIDIA only (in containers)	Manual - separate tool
LM Studio	llama.cpp	NVIDIA + Apple Silicon	None - GUI only

Architecture

DMR is a host process, entirely separate from the Docker Engine.
On Mac, it runs outside the Docker Desktop VM - direct hardware access with no virtualization overhead.
The runtime layer is pluggable: DMR wraps one or more inference engines. Default is llama.cpp; the active runtime and hardware backend are visible in docker model status.
Models are dynamically loaded and unloaded based on demand.

API Endpoints

DMR exposes two endpoint families:

Family	Path prefix	Purpose
Native model management	`/models`	List, pull, inspect models
OpenAI-compatible inference	`/engines/v1/...`	Chat completions, embeddings

Accessible from three contexts:

Context	Address
Containerized apps on the same host	`http://model-runner.docker.internal/`
Local non-containerized apps	`localhost:12434` (host-side TCP must be enabled)
Remote apps on another host	`http://<dmr-host>:12434`

Model Storage

Models are stored as OCI artifacts (amodel type) in ~/.docker/models/blobs/sha256/.
This means your existing private OCI registries (ACR, ECR, GHCR, Harbor) can store and distribute AI models - no separate model registry required, no “registry sprawl”.

Setup

Requirements

Docker Desktop v4.41 or newer (bundles Compose v2.35)
Mac (Apple Silicon preferred) or Windows host
NVIDIA GPU (Windows) or Apple Silicon GPU for hardware-accelerated inference; CPU fallback available but slow

Enabling DMR

Open Docker Desktop Settings → Features in development
Check Enable Docker Model Runner
Check Enable host-side TCP support - maps DMR to port 12434 on the host’s network interface, making it reachable from both local and remote clients
Leave the port at 12434 and click Apply & restart

Verifying

docker model status

A successful output confirms the runner is active and reports the inference backend - e.g., llama.cpp using Apple Metal or NVIDIA CUDA.

Working with Models

Pulling Models

Docker maintains a catalog of verified models under the ai namespace on Docker Hub:

# Pull a model by name and quantization tag
docker model pull ai/gemma3:4B-Q4_K_M

# Pull a smaller model for quick testing
docker model pull ai/qwen3:0.6B-Q4_K_M

Listing and Inspecting

# List all locally downloaded models
docker model ls

# Inspect a local model (architecture, size, format)
docker model inspect ai/gemma3:4B-Q4_K_M

# Query model manifest directly from Docker Hub (no local copy needed)
docker manifest inspect ai/gemma3:4B-Q4_K_M

Local model files (GGUF weights, config, license) are stored in ~/.docker/models/blobs/sha256/.

Testing Models

CLI - quick tests, stateless:

# Start an interactive chat REPL
docker model run ai/gemma3:4B-Q4_K_M

# Exit
/bye

The CLI does not maintain conversational context - every prompt is treated independently.

Docker Desktop UI - context-aware:
Open the Models tab in Docker Desktop. Clicking a model opens a chat window that retains history across prompts, like a commercial chatbot.

API Usage

DMR’s OpenAI-compatible API allows any script or tool that speaks the OpenAI REST API to use your local models without code changes.

# List available local models
curl http://localhost:12434/engines/v1/models

# Chat completion
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [
      {"role": "system", "content": "Keep your responses to one sentence."},
      {"role": "user",   "content": "What is a Docker container?"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Parameter	Effect
`model`	Which local model to use
`messages`	Conversation turns - `system` sets behavior rules, `user` is the prompt
`temperature`	`0` = deterministic, `1` = most creative
`max_tokens`	Maximum response length in tokens

Compose Integration

DMR integrates with Docker Compose via the provider extension (requires Compose v2.35+), allowing you to declare it as a typed service dependency.

Standard Architecture

Frontend (UI) → Backend (API) → DMR (inference) → Model

Compose File

services:
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend

  backend:
    build: ./backend
    environment:
      - MODEL_HOST=http://model-runner.docker.internal/engines/v1
      - LLM_MODEL_NAME=${LLM_MODEL_NAME}
    depends_on:
      - dmr

  dmr:
    provider:
      type: model
      options:
        model: ${LLM_MODEL_NAME} # auto-pulled if not already local

LLM_MODEL_NAME=ai/gemma3:4B-Q4_K_M

docker compose up --build --detach

Compose starts services in dependency order: DMR → backend → frontend. DMR auto-downloads the specified model if it is not already in the local store.

Third-Party Apps: Open WebUI

Because DMR exposes standard OpenAI-compatible endpoints, any OpenAI-compatible frontend works against it without modification. Open WebUI is the most common choice - a self-hosted ChatGPT-like interface backed by your local models.

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3001:8080"
    environment:
      - OPENAI_API_BASE_URL=http://model-runner.docker.internal/engines/v1
      - OPENAI_API_KEY=na # DMR does not require authentication
    volumes:
      - open-webui-data:/app/backend/data # persist chats + settings

volumes:
  open-webui-data:

Access at http://localhost:3001 after startup
First-time setup creates a local admin account - all data stays on your machine
Customize behavior via Settings → General → System Prompt
(e.g., “Give simple answers. Limit responses to two sentences.”)
The model selector dropdown lets you switch between any models in your local store without restarting

Running Models in Containers (Discouraged)

The core problem is hardware access: containers can only reach CUDA-capable NVIDIA GPUs via the NVIDIA Container Toolkit. NPUs, TPUs, and all non-NVIDIA GPUs (including Apple Silicon) are invisible from inside containers - models fall back to CPU, which is significantly slower.

If you need a containerized model server (e.g., on a CI server without DMR), Ollama is the standard choice:

# CPU-only compose.yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama

volumes:
  ollama-data:

For NVIDIA GPU access, add the deploy.resources.reservations.devices block and ensure the NVIDIA Container Toolkit is installed on the host.