A container is a regular Linux process, wrapped in several kernel isolation features that make it feel like its own machine. There is no hypervisor and no guest kernel: the process runs directly on the host kernel. If you run ps aux on the host, you’ll see the container’s processes right there alongside everything else.

The kernel provides four key primitives:

  1. Namespaces
  2. cgroups
  3. OverlayFS
  4. Seccomp and capabilities

1. Namespaces — What the Process Can See

Namespaces limit the process’s view of the system. Docker can use six of them:

Namespace   Isolates
pid         Process tree: the container sees only its own processes; its main process sees itself as PID 1
net         Network stack: its own interfaces, routing table, ports
mnt         Filesystem mounts: its own root filesystem (the image)
uts         Hostname and domain name
ipc         Shared memory and message queues
user        User/group IDs: can map container root to an unprivileged host user (opt-in in Docker, e.g. via userns-remap or rootless mode)

The process hasn’t moved — it’s still on the host kernel. It just has a filtered view of the world.
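
To make the "filtered view" concrete, here is a minimal Python sketch that reads the namespace links the kernel exposes under /proc/<pid>/ns. Each link target looks like pid:[4026531836], and two processes are in the same namespace when the inode numbers match. The function names are hypothetical, and reading another process’s links may require root.

```python
import os
import re

# Matches the target of a /proc/<pid>/ns/<name> symlink, e.g. "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (kind, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def read_namespaces(pid: str = "self") -> dict[str, int]:
    """Return {link name: inode} for a process, read from /proc (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, entry))
        result[entry] = parse_ns_link(target)[1]
    return result


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> set[str]:
    """Namespaces two processes have in common (same inode means same namespace)."""
    return {name for name in a.keys() & b.keys() if a[name] == b[name]}
```

Comparing read_namespaces() for a host shell with read_namespaces(pid) for a container’s main process (using its host PID) should show different inodes for every namespace the container does not share with the host.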


2. cgroups — What the Process Can Use

Control groups (cgroups) limit and account for resource consumption: CPU, memory, block I/O and the number of processes. This is how docker run --memory 512m is enforced: when the container’s processes exceed the limit and the kernel cannot reclaim enough memory, the OOM killer kills a process inside that cgroup, exactly as it would for any other cgroup.

docker run --memory 512m --cpus 1.0 nginx

Internally, Docker writes to the cgroup hierarchy at /sys/fs/cgroup/. You can inspect a container’s limits:

docker inspect <container> | grep -A5 Memory
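
As a rough illustration of what ends up in the cgroup hierarchy, the Python sketch below converts a --memory style value to bytes and reads the limit back from a cgroup v2 directory. The file name memory.max is the cgroup v2 one (cgroup v1 uses memory.limit_in_bytes). The location of a container’s cgroup directory depends on the cgroup driver, so it is passed in as an argument rather than guessed. Function names are hypothetical, and fractional sizes such as 1.5g are not handled.

```python
from pathlib import Path

# Binary multipliers for the size suffixes accepted by flags such as --memory.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory_flag(value: str) -> int:
    """Convert a size such as '512m' into bytes."""
    value = value.strip().lower()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # a bare number is already in bytes


def read_memory_max(cgroup_dir: str):
    """Read memory.max from a cgroup v2 directory; None means no limit is set."""
    raw = (Path(cgroup_dir) / "memory.max").read_text().strip()
    return None if raw == "max" else int(raw)
```

For example, parse_memory_flag("512m") returns 536870912, which is the value you would expect to find in the container’s memory.max.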

3. OverlayFS — What the Process Can Read and Write

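The container gets a merged filesystem view: the image layers as read-only lower dirs, plus a fresh writable upper layer on top. From the process’s perspective it looks like a normal root filesystem. Modifying a file that lives in an image layer first copies it up into the writable layer, so the image itself never changes. Under the hood the layers are unpacked tarballs stacked as directories; see docker images under the hood.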
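
The lookup rules can be sketched with a toy model in Python. This is not how the kernel implements OverlayFS: layers are plain dictionaries, directories and metadata are ignored, and the class and marker names are made up for illustration. It only shows the precedence: the upper layer wins, lower layers are searched top-down, and deletions are recorded as whiteout markers in the upper layer.

```python
WHITEOUT = object()  # stands in for the whiteout marker OverlayFS uses for deleted files


class OverlayModel:
    """Toy model of an overlay mount: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        # lower_layers is ordered bottom to top, like image layers.
        self.lowers = list(lower_layers)
        self.upper = {}

    def read(self, path):
        # The upper layer wins; a whiteout there hides anything below it.
        if path in self.upper:
            content = self.upper[path]
            if content is WHITEOUT:
                raise FileNotFoundError(path)
            return content
        # Otherwise search the lower layers from the topmost down.
        for layer in reversed(self.lowers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes never touch the image layers; they land in the upper layer.
        self.upper[path] = content

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT
```

Because every change lands in the upper layer, several containers can share the same lower layers without seeing each other’s writes.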


4. Seccomp + Capabilities — What the Process Can Do

By default Docker drops most Linux capabilities (e.g. CAP_NET_ADMIN, CAP_SYS_MODULE) and applies a seccomp filter that blocks around 44 of the 300+ available system calls. This limits what the process can ask the kernel to do, even if it escapes its namespaces.

You can inspect which capabilities were explicitly added to or dropped from a container (the default set is not listed there):

docker inspect <container> | grep -A20 CapAdd
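
The capabilities a process actually holds are exposed by the kernel as the CapEff bitmask in /proc/<pid>/status. The Python sketch below decodes that mask. The bit numbers come from linux/capability.h, only a subset is listed (unlisted bits are silently ignored), and the function names are hypothetical.

```python
# Bit positions from <linux/capability.h>; only a subset is listed here.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
    27: "CAP_MKNOD",
    29: "CAP_AUDIT_WRITE",
    31: "CAP_SETFCAP",
}


def decode_caps(hex_mask: str) -> set[str]:
    """Translate a hexadecimal capability mask into the known capability names."""
    mask = int(hex_mask, 16)
    return {name for bit, name in CAP_BITS.items() if mask & (1 << bit)}


def effective_caps(status_text: str) -> set[str]:
    """Extract and decode the CapEff line from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    raise ValueError("no CapEff line found")
```

A mask commonly seen inside containers started with default settings is 00000000a80425fb (treat the exact value as an assumption, since it depends on the Docker version and flags). Decoding it yields neither CAP_NET_ADMIN nor CAP_SYS_MODULE, which matches the defaults described above.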

Putting It Together

When you run docker run nginx, the sequence at the OS level is roughly:

  1. Docker (through containerd and runc) asks the kernel to clone() a new process with new namespaces
  2. The process is placed in a cgroup with its resource limits
  3. Its root is switched (via pivot_root) to the image’s merged OverlayFS directory
  4. Capabilities and seccomp filters are applied
  5. PID 1 (nginx) starts — it thinks it’s alone on a machine; it’s actually just a process on the host

Container Lifecycle

created → running → stopped → removed
                ↑         |
                └─restart─┘

State     What it means
created   Writable layer and configuration prepared; no process started yet
running   PID 1 is alive
stopped   PID 1 exited or was stopped (docker ps shows this as Exited); the writable layer still exists
removed   Writable layer and container metadata deleted

A stopped container still has its writable layer on disk — you can docker start it again and the filesystem changes are still there. Only docker rm destroys them.
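
The lifecycle can be modelled as a small state machine. The Python sketch below is purely illustrative: the Container class, its command names, and the dictionary standing in for the writable layer are all hypothetical, and real Docker has more states (paused, restarting, and so on) than the simplified diagram.

```python
# Allowed transitions, keyed by (current state, command). State names follow the diagram.
TRANSITIONS = {
    ("created", "start"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "start"): "running",
    ("created", "rm"): "removed",
    ("stopped", "rm"): "removed",
}


class Container:
    """Toy container that tracks its state and a writable layer."""

    def __init__(self):
        self.state = "created"
        self.writable_layer = {}

    def apply(self, command):
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command} a {self.state} container")
        self.state = TRANSITIONS[key]
        if self.state == "removed":
            # Only removal discards the filesystem changes.
            self.writable_layer = None

    def write(self, path, content):
        if self.state != "running":
            raise ValueError("only a running container can write files")
        self.writable_layer[path] = content
```

Stopping and starting the toy container leaves writable_layer intact, while rm sets it to None, which mirrors the behaviour described above.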


Signals and Graceful Shutdown

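docker stop sends SIGTERM to PID 1, waits 10 seconds (configurable with -t), then sends SIGKILL. The kernel does not apply default signal actions to PID 1 of a PID namespace, so a SIGTERM without a registered handler is simply ignored and the container is killed when the timeout expires. This means your application’s PID 1 must handle SIGTERM to shut down gracefully.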

If your Dockerfile uses shell form (CMD gunicorn ...) instead of exec form (CMD ["gunicorn", ...]), the process is wrapped in /bin/sh -c and /bin/sh becomes PID 1 — which typically doesn’t forward signals. Use exec form.
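
A minimal sketch of a SIGTERM-aware main process in Python, assuming the application runs as PID 1 via exec form. The serve function and its loop are placeholders for real application work, and the names are hypothetical.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the request; the main loop does the actual cleanup."""
    shutdown_requested.set()


def install_handlers():
    # Without an explicit handler, a process running as PID 1 ignores SIGTERM.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(poll_interval: float = 0.5) -> str:
    """Run until SIGTERM arrives, then return after cleanup."""
    install_handlers()
    while not shutdown_requested.wait(poll_interval):
        pass  # one unit of real work would go here
    # Flush buffers and close connections here, well inside the stop timeout.
    return "clean shutdown"
```

The handler only sets a flag, so the cleanup runs in normal program flow rather than inside the signal handler, which keeps the handler short and safe.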


One sentence: a container is a Linux process with a namespaced view of the system, a cgroup-enforced resource budget, an OverlayFS root filesystem, and a restricted set of capabilities and kernel syscalls.


See Also