
Diagnosing Linux Issues

When a process or service is consuming too much CPU, memory, disk I/O, or network bandwidth, or when the system is simply misbehaving, Linux provides a rich set of observability tools to diagnose what's happening.

The general approach: work from symptoms to data. Start with high-level overview tools (top, vmstat), narrow down to the subsystem (iostat, ss), then drill into specific processes or kernel events (strace, perf).

Linux observability tools


CPU and memory

| Tool | Purpose |
| --- | --- |
| top | Dynamic real-time view of running processes plus CPU/memory usage |
| htop | Improved top: color, mouse support, per-core bars (install separately) |
| atop | Full-featured: CPU, memory, disk, network; logs history |
| ps aux | Snapshot of all processes with CPU/memory usage |
| pidstat | Per-process CPU, memory, I/O, and thread statistics |
| vmstat | System-wide: processes, memory, paging, block I/O, CPU |
| mpstat | Per-CPU statistics (helpful on multi-core systems) |
| sar | System Activity Reporter: collect, report, and save system data |
| free -h | Memory and swap usage summary |
| slabtop | Kernel slab allocator statistics |
| numastat | NUMA memory allocation statistics |
```sh
vmstat 1 10                      # 1-second intervals, 10 samples
mpstat -P ALL 1                  # per-CPU stats every second
sar -u 1 5                       # CPU utilization, 5 samples
free -h                          # human-readable memory summary
ps aux --sort=-%mem | head -10   # top 10 memory consumers
ps aux --sort=-%cpu | head -10   # top 10 CPU consumers
```
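As a quick example of turning these numbers into a check: the fraction of memory still available can be read straight from /proc/meminfo, the same source free -h summarizes. This is a minimal sketch; the 10% threshold is illustrative, not a kernel-defined limit.

```sh
#!/bin/sh
# Rough memory-pressure check from /proc/meminfo (the data behind free -h).
# The 10% threshold below is illustrative, not an official value.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$(( avail_kb * 100 / total_kb ))
echo "MemAvailable: ${pct}% of MemTotal"
if [ "$pct" -lt 10 ]; then
    echo "Low available memory; top consumers:"
    ps aux --sort=-%mem | head -5
fi
```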

Disk I/O

| Tool | Purpose |
| --- | --- |
| iostat | Block device I/O statistics and CPU utilization |
| iotop | Real-time I/O usage per process (like top for disk) |
| blktrace | Low-level block layer tracing |
| biolatency | Disk I/O latency histogram (BCC/eBPF) |
| biotop | Top disk I/O by process (BCC/eBPF) |
| swapon -s | Show active swap devices and usage |
| lsof | List open files (includes sockets and pipes) |
```sh
iostat -xz 1           # extended stats, skip idle devices, 1-second intervals
iostat -d sda 1 5      # disk stats for sda only
iotop -o               # show only processes actually doing I/O
lsof | grep deleted    # find deleted-but-still-open files consuming space
```
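The same deleted-but-still-open hunt can be done straight from /proc, which also works on minimal systems without lsof. This is a sketch; the output columns (bytes, fd symlink, original path) are my own choice:

```sh
#!/bin/sh
# Find deleted-but-still-open files and the space they hold, via /proc.
# Deleted targets show up as "/path/to/file (deleted)" in the fd symlink.
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        *" (deleted)")
            size=$(stat -Lc %s "$fd" 2>/dev/null) || continue
            printf '%s\t%s\t%s\n' "$size" "$fd" "$target"
            ;;
    esac
done | sort -rn | head   # largest space holders first
```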

Filesystem and profiling

| Tool | Purpose |
| --- | --- |
| df -h | Disk space usage by filesystem |
| du -sh /path | Disk usage of a directory |
| lsof | Files open by processes (fd leaks, locked files) |
| perf | CPU profiling, tracepoints, kernel profiling |
| ftrace | Linux kernel function tracer |
| bcc / bpftrace | eBPF-based dynamic tracing tools |
```sh
df -h                             # filesystem usage overview
du -sh /var/* | sort -rh | head   # find largest directories under /var
lsof +D /path                     # find all open files under a path
```
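Building on df -h, a small sketch that flags any filesystem above a chosen threshold; the 90% figure is arbitrary, and df -P is used because it guarantees one parseable line per filesystem:

```sh
#!/bin/sh
# Warn about filesystems over an illustrative 90% usage threshold.
df -P | awk 'NR > 1 {
    use = $5; sub("%", "", use)                # strip the % sign
    if (use + 0 > 90)
        printf "%s is %s full (%s)\n", $6, $5, $1
}'
```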

Network

| Tool | Purpose |
| --- | --- |
| ss -tulpn | Open sockets, listening ports, and owning processes |
| netstat -rn | Routing table (deprecated; use ip route) |
| ip -s link | Interface statistics (errors, drops) |
| tcpdump | Packet capture |
| tcpretrans | Trace TCP retransmits (BCC) |
| tcplife | Log TCP connection lifetimes (BCC) |
| nicstat | Network interface throughput and utilization |
| nstat | Kernel network counters (modern replacement for netstat -s) |
```sh
ss -tulpn                      # all listening sockets
ss -tp state established       # established TCP connections
tcpdump -i eth0 -n port 443    # capture HTTPS traffic
ip -s link show eth0           # interface error counters
```
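The counters that ip -s link prints are also exposed as plain files under /sys/class/net, which makes them easy to script. A minimal sketch that loops over whatever interfaces exist (names vary per system):

```sh
#!/bin/sh
# Print per-interface error/drop counters from sysfs (same data as ip -s link).
for dev in /sys/class/net/*; do
    name=${dev##*/}
    printf '%-12s rx_errors=%s tx_errors=%s rx_dropped=%s\n' \
        "$name" \
        "$(cat "$dev/statistics/rx_errors")" \
        "$(cat "$dev/statistics/tx_errors")" \
        "$(cat "$dev/statistics/rx_dropped")"
done
```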

Process tracing

| Tool | Purpose |
| --- | --- |
| strace | Trace system calls made by a process |
| ltrace | Trace library calls |
| opensnoop | Trace file opens system-wide (BCC) |
| execsnoop | Trace new process executions (BCC) |
| profile | CPU sampling profiler (BCC) |
```sh
strace -p PID                        # attach to a running process
strace -e trace=open,read ./binary   # filter to specific syscalls
strace -c ./binary                   # count syscalls and report a summary
```

System and hardware

| Tool | Purpose |
| --- | --- |
| dmesg | Kernel ring buffer: hardware errors, driver messages |
| journalctl | systemd journal: aggregated system logs |
| /proc | Kernel-exposed process and system info as virtual files |
| uptime | Load averages plus time since last boot |
| lscpu | CPU topology and features |
| lspci | PCI devices |
| lsusb | USB devices |
| lsblk | Block devices and mount points |
```sh
dmesg -T | tail -50                    # last 50 kernel messages with timestamps
dmesg -T | grep -i error               # kernel errors
journalctl -u nginx --since "1h ago"   # service logs from the last hour
journalctl -p err -n 50                # last 50 error-level log entries
cat /proc/meminfo                      # detailed memory breakdown
cat /proc/cpuinfo                      # per-CPU info
```
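When nothing beyond coreutils is installed, /proc alone gives a usable first snapshot. The field selection below is just one reasonable choice:

```sh
#!/bin/sh
# Minimal first-look snapshot built only on /proc and coreutils.
echo "== load averages =="
cat /proc/loadavg                     # 1/5/15-min load, runnable/total, last pid
echo "== memory =="
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo
echo "== top CPU consumers =="
ps aux --sort=-%cpu | head -4
```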

When asked to investigate a slow or misbehaving system, work through this order:

  1. Starting point - uptime (load averages) + dmesg -T | tail (recent kernel events)
  2. CPU/memory - top or htop → identify top consumers
  3. Disk I/O - iostat -xz 1 → look for 100% %util or high await
  4. Network - ss -tulpn + ip -s link → check for drops/errors
  5. Application - strace -p PID or perf → syscall/CPU profile
  6. Logs - journalctl -u servicename --since "10 min ago" → check service output
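The checklist above can be wrapped into a one-shot script. Since tools like iostat or journalctl may be absent on minimal systems (and dmesg may need root), each step below degrades gracefully instead of aborting; the run helper is my own naming:

```sh
#!/bin/sh
# One-shot triage sketch following the checklist above. "run" skips any
# tool that is not installed instead of failing the whole script.
run() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then "$@"; else echo "($1 not installed)"; fi
}

run uptime                              # 1. load averages
run dmesg -T 2>/dev/null | tail -20     # 1. recent kernel events (may need root)
run top -b -n 1 | head -15              # 2. CPU/memory snapshot
run iostat -xz 1 2                      # 3. disk I/O (second sample is the useful one)
run ss -s                               # 4. socket summary
run ip -s link                          # 4. interface error counters
```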