A container is a regular Linux process — but wrapped in several kernel isolation features that make it feel like its own machine. There is no virtualisation happening. If you run ps aux on the host, you’ll see the container’s processes right there alongside everything else.
The kernel provides four key primitives:
- Namespaces
- cgroups
- OverlayFS
- Seccomp
1. Namespaces — What the Process Can See
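Namespaces limit the process’s view of the system. Docker uses six, although the user namespace is only active when user-namespace remapping or rootless mode is configured: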
| Namespace | Isolates |
|---|---|
| pid | Process tree: the container sees only its own processes; its main process thinks it’s PID 1 |
| net | Network stack: its own interfaces, routing table, ports |
| mnt | Filesystem mounts: its own root filesystem (the image) |
| uts | Hostname and domain name |
| ipc | Shared memory and message queues |
| user | User/group IDs: can map container root to an unprivileged host user |
The process hasn’t moved — it’s still on the host kernel. It just has a filtered view of the world.
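One way to see this filtered view concretely is to compare namespace identifiers. The sketch below assumes a Linux host with /proc mounted; the helper names (`parse_ns_link`, `list_namespaces`, `shared_namespaces`) are illustrative and not part of Docker. Each entry under /proc/<pid>/ns is a symlink such as `pid:[4026531836]`, and two processes share a namespace when those inode numbers match.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace symlink target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self"):
    """Return {entry_name: inode} for a process, or {} if /proc is unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no such PID, or no permission
    for name in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


def shared_namespaces(pid_a, pid_b):
    """Names of namespaces that both processes are members of."""
    a, b = list_namespaces(pid_a), list_namespaces(pid_b)
    return sorted(name for name in a if a[name] == b.get(name))
```

Called on the host as `shared_namespaces("self", container_pid)`, where `container_pid` would come from something like `docker inspect --format '{{.State.Pid}}' <container>`, this would be expected to omit the namespaces Docker created for the container. Reading another process’s namespace links usually requires root.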
2. cgroups — What the Process Can Use
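Control groups (cgroups) limit and account for resource consumption: CPU, memory, block I/O, and process counts. (Network bandwidth is not limited by cgroups directly; they can only tag traffic for other tools to shape.) This is how docker run --memory 512m is enforced: if the container’s processes exceed the limit, the kernel OOM-kills a process in that cgroup, exactly as it would for any other cgroup.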
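To make the flag-to-kernel mapping concrete, here is a minimal sketch of how limits such as --memory 512m and --cpus 1.0 could translate into cgroup v2 file contents. It assumes cgroup v2, binary size suffixes (512m meaning 512 MiB), and a 100 ms CPU period; the function names are hypothetical and this is not Docker’s actual code.

```python
# Binary multipliers for the size suffixes (assumption: k/m/g mean KiB/MiB/GiB).
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory(size):
    """Convert a size such as "512m" to a number of bytes."""
    size = size.strip().lower()
    if size and size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)  # no suffix: already a byte count


def cgroup_v2_values(memory, cpus, period_us=100_000):
    """Return the strings that would be written to memory.max and cpu.max."""
    quota_us = int(cpus * period_us)  # CPU time allowed per period
    return {
        "memory.max": str(parse_memory(memory)),
        "cpu.max": f"{quota_us} {period_us}",
    }


limits = cgroup_v2_values("512m", 1.0)
# limits == {"memory.max": "536870912", "cpu.max": "100000 100000"}
```

The `cpu.max` value reads as "quota period": one full CPU is 100000 µs of run time in every 100000 µs window, and half a CPU would be `50000 100000`.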
    docker run --memory 512m --cpus 1.0 nginx

Internally, Docker writes to the cgroup hierarchy at /sys/fs/cgroup/. You can inspect a container’s limits:

    docker inspect <container> | grep -A5 Memory

3. OverlayFS — What the Process Can Read and Write
The container gets a merged filesystem view: the image layers as read-only lower dirs, plus a fresh writable upper layer on top. From the process’s perspective it looks like a normal root filesystem. Under the hood, the image layers are shipped as tarballs and unpacked into stacked directories (see "docker images under the hood").
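The lookup rules of that merged view can be illustrated with a toy model. This is only a sketch using in-memory dictionaries, not how the kernel implements OverlayFS; the class name `MergedView` and the `WHITEOUT` marker are invented for the example. Reads check the writable upper layer first and then the lower layers from top to bottom, writes always land in the upper layer, and deletions are recorded as a marker that hides the lower copy.

```python
WHITEOUT = object()  # marker: "this path was deleted in the upper layer"


class MergedView:
    def __init__(self, lower_layers):
        # lower_layers: list of {path: content} dicts, bottom layer first.
        # They are treated as read-only and never modified.
        self.lower = list(lower_layers)
        self.upper = {}  # the container's writable layer

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is WHITEOUT:
                raise FileNotFoundError(path)
            return value
        for layer in reversed(self.lower):  # topmost image layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes never touch the image layers.
        self.upper[path] = content

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT
```

For example, with a base layer containing `/bin/sh` and an app layer that replaces it, `MergedView([base, app]).read("/bin/sh")` returns the app layer’s version, and writing to any path leaves both image layers unchanged.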
4. Seccomp + Capabilities — What the Process Can Do
By default Docker drops most Linux capabilities (e.g. CAP_NET_ADMIN, CAP_SYS_MODULE) and applies a seccomp filter that blocks ~44 system calls. This limits what the process can ask the kernel to do, even if it escapes its namespace.
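Capabilities are stored by the kernel as a bitmask, which is visible as the CapEff field in /proc/<pid>/status. The sketch below decodes such a mask into names; the bit positions follow the kernel’s capability numbering, while the helper names are illustrative. The example mask `0xA80425FB` is the value commonly reported for a default Docker container, and should be treated as an assumption since it depends on the Docker version and the flags used.

```python
# Capability names indexed by bit number (kernel capability numbering).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(pid="self"):
    """Decode CapEff from /proc/<pid>/status; None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return decode_caps(int(line.split()[1], 16))
    except OSError:
        pass  # not Linux, or the process is not accessible
    return None


# Assumed example mask for a default container (see the note above).
default_caps = decode_caps(0xA80425FB)
```

With that mask, `default_caps` contains entries such as CAP_CHOWN and CAP_NET_BIND_SERVICE but neither CAP_NET_ADMIN nor CAP_SYS_MODULE, which matches the examples of dropped capabilities given above.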
You can inspect which capabilities were explicitly added to or dropped from a container’s default set:

    docker inspect <container> | grep -A20 CapAdd

Putting It Together
When you run docker run nginx, the sequence at the OS level is roughly:
- Docker asks the kernel to clone() a new process with new namespaces
- The process is placed in a cgroup with its resource limits
- Its root is switched (via pivot_root) to the image’s merged OverlayFS directory
- Capabilities and seccomp filters are applied
- PID 1 (nginx) starts. It thinks it’s alone on a machine; it’s actually just a process on the host
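The first step in that sequence comes down to passing a set of flag bits to clone(). As a rough sketch, the snippet below combines the flags for the namespaces listed earlier; the numeric values are the CLONE_NEW* constants from the kernel’s sched.h, while the mapping keys and function names are made up for illustration. Real runtimes perform this in Go and C with many more details, so this only shows the bitmask idea.

```python
# CLONE_NEW* flag values from <linux/sched.h>, keyed by namespace name.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def clone_flags(namespaces):
    """OR together the flags for the requested namespaces."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


def describe(flags):
    """Inverse of clone_flags: list the namespaces a flag word requests."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if flags & bit)


without_userns = clone_flags(["pid", "net", "mnt", "uts", "ipc"])
# hex(without_userns) == "0x6c020000"
```

Adding "user" to the list sets one more bit, giving 0x7c020000.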
Container Lifecycle
created → running → stopped → removed
             ↑         |
             └─restart─┘
| State | What it means |
|---|---|
| created | Writable layer and configuration prepared; no process started yet |
| running | PID 1 is alive |
| stopped | PID 1 exited or was stopped; the writable layer still exists on disk |
| removed | Writable layer and container metadata deleted |
A stopped container still has its writable layer on disk — you can docker start it again and the filesystem changes are still there. Only docker rm destroys them.
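The lifecycle can be modelled as a small state machine. The sketch below is a simplification for illustration: the action names (`start`, `stop`, `rm`) loosely mirror the Docker CLI, and the function names are hypothetical. It encodes the point that a stopped container can be started again, and that only removal is final.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("created", "start"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "start"): "running",   # the restart arrow in the diagram
    ("created", "rm"): "removed",
    ("stopped", "rm"): "removed",
}


def next_state(state, action):
    """Apply one action, rejecting transitions the lifecycle does not allow."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a {state} container") from None


def replay(actions, state="created"):
    """Run a sequence of actions starting from the given state."""
    for action in actions:
        state = next_state(state, action)
    return state
```

`replay(["start", "stop", "start"])` ends in `running`, while `next_state("running", "rm")` raises, reflecting that a running container has to be stopped (or force-removed) first.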
Signals and Graceful Shutdown
docker stop sends SIGTERM to PID 1, waits 10 seconds (configurable), then sends SIGKILL. This means your application’s PID 1 must handle SIGTERM to shut down gracefully.
If your Dockerfile uses shell form (CMD gunicorn ...) instead of exec form (CMD ["gunicorn", ...]), the process is wrapped in /bin/sh -c and /bin/sh becomes PID 1 — which typically doesn’t forward signals. Use exec form.
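For an application that runs as PID 1, handling SIGTERM can be as simple as recording the request and letting the main loop exit. The following is a minimal sketch in Python, assuming a single-threaded worker loop; the names `state`, `handle_sigterm`, `install_handler`, and `run_until_stopped` are illustrative, and a real server such as gunicorn ships its own signal handling.

```python
import signal
import time

# Shared flag flipped by the signal handler and polled by the main loop.
state = {"stop_requested": False}


def handle_sigterm(signum, frame):
    # Keep the handler tiny: record the request and let the main loop react.
    state["stop_requested"] = True


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_until_stopped(do_work, poll_seconds=0.2):
    """Call do_work() until SIGTERM has been seen; return the iteration count."""
    iterations = 0
    while not state["stop_requested"]:
        do_work()
        iterations += 1
        time.sleep(poll_seconds)
    return iterations
```

A program would call `install_handler()` once at startup and then `run_until_stopped(...)`, doing any cleanup after it returns. The loop only notices the request between iterations, so each unit of work needs to finish well within the 10-second grace period.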
One sentence: a container is a Linux process with a namespaced view of the system, a cgroup-enforced resource budget, an OverlayFS root filesystem, and a restricted set of kernel syscalls.