What is an Operating System?

An operating system is the layer of software that sits between your application and the physical hardware. Its job is to mediate access — to be the single authority that decides which programs can use which resources, when, and how much.

Applications never touch hardware directly. There is no way for your code to write to a disk, send a network packet, or allocate RAM without going through the OS. This is by design: direct hardware access would mean any program could corrupt memory, starve other processes, or crash the machine.

Key idea: The kernel is the authority. Applications make requests; the kernel decides the answers.


System Calls — The Interface Between Applications and the OS

The boundary between your application and the kernel is crossed via system calls (syscalls). These are the only legitimate way for a program to ask the OS to do something on its behalf.

When your code does something like:

  • Open a file → open()
  • Write to a socket → send()
  • Allocate memory → mmap()
  • Spawn a process → fork() / clone()

…your language runtime or standard library translates that into a syscall. The CPU switches from user mode (restricted) to kernel mode (privileged), the kernel does the work, and control returns to your program.

This mode switch is the hard boundary. Code in user mode physically cannot perform privileged operations — the CPU enforces it at the hardware level.
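You can see this boundary from Python: os.write() is a thin wrapper around the write() syscall, one of the few places where a Python call maps almost one-to-one onto a kernel entry point. A minimal sketch (Unix assumed):

```python
import os

# os.write() is a thin wrapper over the write() syscall. File descriptor 1
# is standard output. The CPU switches to kernel mode, the kernel copies
# the bytes out of our buffer, and control returns with the number written.
msg = b"written via the write() syscall\n"
n = os.write(1, msg)   # returns how many bytes the kernel accepted
```

Run this under strace and the write(1, ...) call should appear verbatim in the trace.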

Python through the Interpreter to the Kernel

Python code never calls the kernel directly. There are always at least two layers in between: the Python interpreter (CPython) and the C standard library (glibc). The journey looks like this:

your Python code
      ↓
CPython interpreter (Python built-ins, file objects, socket module)
      ↓
C standard library / glibc (fopen, malloc, pthread_create)
      ↓
syscall instruction (CPU switches to kernel mode)
      ↓
Linux kernel
      ↓
hardware

Each layer translates the request into something lower-level until it becomes a raw syscall number passed to the CPU.
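The middle layer is visible too: with ctypes you can call the C library's write() wrapper yourself, skipping CPython's I/O machinery. A sketch, assuming a Unix system where ctypes.CDLL(None) resolves symbols from the C library already loaded into the interpreter:

```python
import ctypes

# CDLL(None) hands back the C library linked into the running interpreter.
libc = ctypes.CDLL(None)

msg = b"hello from the C library\n"
# libc's write() wrapper issues the write syscall directly:
# write(fd=1, buf, count) -> number of bytes written.
written = libc.write(1, msg, len(msg))
```

This bypasses Python's file objects entirely, which is why output written this way is invisible to sys.stdout redirection done at the Python level.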

File I/O

with open("data.txt", "r") as f:
    contents = f.read()

What actually happens:

  1. open() → CPython calls the C library's open() wrapper → kernel openat() syscall — returns a file descriptor (an integer handle)
  2. f.read() → CPython calls the C library's read() wrapper → kernel read(fd, buffer, n) syscall — copies bytes from a kernel buffer into user space
  3. Exiting the with block → close(fd) syscall — releases the file descriptor
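The os module exposes the same fd-level calls almost unwrapped, so the sequence above can be driven by hand. A sketch using a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)  # openat() -> returns an fd
os.write(fd, b"hello")                               # write() syscall
os.close(fd)                                         # close() syscall

fd = os.open(path, os.O_RDONLY)                      # openat() again
contents = os.read(fd, 1024)                         # read(): kernel-to-user copy
os.close(fd)
```

open() in Python adds buffering and text decoding on top of exactly these calls.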

Network I/O

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.com", 80))
s.send(b"GET / HTTP/1.0\r\n\r\n")
data = s.recv(1024)

Syscall sequence:

  1. socket.socket() → kernel socket() — allocates a socket, returns a file descriptor
  2. s.connect() → kernel connect(fd, addr) — initiates TCP handshake
  3. s.send() → kernel sendto(fd, data) — passes bytes to the network stack
  4. s.recv() → kernel recvfrom(fd, buffer) — blocks until data arrives, copies into user space
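The server half of the same story uses bind(), listen(), and accept(). A self-contained loopback sketch (the echo server runs in a thread; binding to port 0 asks the kernel to pick a free port):

```python
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # bind() syscall; port 0 = kernel chooses
server.listen(1)                     # listen() syscall: mark socket passive
server.settimeout(5)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()        # accept() syscall: blocks for a client
    conn.send(conn.recv(1024))       # recvfrom() then sendto(): echo it back
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5)
client.connect(("127.0.0.1", port))  # connect() syscall: TCP handshake
client.send(b"ping")                 # sendto() syscall
reply = client.recv(1024)            # recvfrom() syscall
client.close()
t.join()
server.close()
```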

Process Creation

import subprocess
result = subprocess.run(["ls", "-la"], capture_output=True)

Syscall sequence:

  1. subprocess.run() → kernel fork() (or clone()) — duplicates the current process
  2. In the child: execve("/bin/ls", ["ls", "-la"], env) — kernel replaces child’s memory with the ls binary
  3. Parent calls wait4(child_pid) — blocks until child exits, then collects the exit code
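The os module exposes these calls nearly raw, so the same fork / execve / wait sequence can be driven by hand. A sketch assuming a Unix system and Python 3.9+ (for os.waitstatus_to_exitcode()):

```python
import os

pid = os.fork()                      # fork() syscall: duplicate this process
if pid == 0:
    # Child: execve() replaces this process image with the echo binary.
    os.execvp("echo", ["echo", "hello from the child"])
else:
    # Parent: wait4() under the hood; blocks until the child exits.
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)
```

subprocess.run() wraps exactly this dance, plus pipe setup for capture_output.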

Memory Allocation

data = [0] * 10_000_000   # allocate a large list

Python’s memory allocator (pymalloc) manages a private pool and avoids syscalls where possible. But when it needs more from the OS: mmap(NULL, size, PROT_READ|PROT_WRITE, ...) — the kernel maps new pages into the process’s virtual address space. Those pages aren’t backed by physical RAM until first written — the kernel uses demand paging, where a page fault on first access triggers physical allocation.
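The mmap module lets you make that syscall explicitly. Passing -1 as the file descriptor asks for an anonymous mapping, i.e. plain memory not backed by any file:

```python
import mmap

SIZE = 16 * 1024 * 1024      # 16 MiB

buf = mmap.mmap(-1, SIZE)    # mmap() syscall: reserve virtual pages
buf[0] = 1                   # first write page-faults the first page in
buf[SIZE - 1] = 2            # ...and the last page
first = buf[0]
buf.close()                  # munmap() syscall: release the mapping
```

Immediately after mmap() the process's virtual size grows by 16 MiB, but its resident set barely moves until pages are actually touched — that is demand paging in action.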

Observing it Yourself

strace intercepts every syscall a process makes and prints them. You can watch Python’s syscalls live:

# Trace all syscalls
strace python3 -c "open('test.txt', 'w').write('hello')"
 
# Trace only specific syscalls
strace -e trace=openat,read,write,close python3 script.py
 
# Count syscalls by type (useful for performance analysis)
strace -c python3 script.py

If something is slow or failing at the OS level, strace will show you exactly which syscall and why — one of the most useful debugging tools on Linux.


Common Syscalls Reference

Category      Syscall    What it does                                  Python trigger
------------  ---------  --------------------------------------------  ------------------------------
Files         openat     Open or create a file, returns fd             open()
              read       Read bytes from fd into buffer                f.read()
              write      Write bytes from buffer to fd                 f.write()
              close      Release a file descriptor                     f.close() / with block exit
              stat       Get file metadata (size, mtime, etc.)         os.stat(), Path.exists()
              unlink     Delete a file                                 os.remove()
              rename     Rename or move a file                         os.rename()
              mkdir      Create a directory                            os.mkdir()
              getdents   Read directory entries                        os.listdir()
Processes     fork       Duplicate the current process                 subprocess, os.fork()
              clone      Create process or thread with options         threading.Thread()
              execve     Replace process image with new program        subprocess.run()
              exit       Terminate the process                         end of script / sys.exit()
              wait4      Wait for a child process to finish            proc.wait()
              getpid     Return current process ID                     os.getpid()
              kill       Send a signal to a process                    os.kill()
Memory        mmap       Map memory (allocate, map file, shared mem)   large allocations, mmap module
              munmap     Unmap previously mapped memory                garbage collection
              brk        Adjust the process heap boundary              small allocations via glibc
Networking    socket     Create a network socket                       socket.socket()
              bind       Assign an address to a socket                 s.bind()
              listen     Mark socket as passive (server)               s.listen()
              accept     Accept an incoming connection                 s.accept()
              connect    Initiate a connection                         s.connect()
              sendto     Send data on a socket                         s.send()
              recvfrom   Receive data from a socket                    s.recv()
I/O control   epoll      Wait for one or more fds to be ready          asyncio, selectors
              ioctl      Device-specific control operations            low-level device access
              fcntl      File descriptor flags and locks               fcntl module
Filesystem    mount      Mount a filesystem                            os.system("mount ...")
              chdir      Change working directory                      os.chdir()
              chroot     Change root directory                         rarely directly from Python
Identity      getuid     Get real user ID                              os.getuid()
              setuid     Set user ID                                   os.setuid()

User Space vs Kernel Space

                   User space                              Kernel space
Who runs here      Your application, libraries, runtimes   The OS kernel
Memory access      Own process memory only                 All physical memory
Hardware access    None; must ask the kernel               Direct
A crash here       Kills the process                       Crashes the whole machine

Processes

A process is a running program. More precisely, it is the kernel’s unit of isolation — each process gets its own:

  • Virtual address space — it believes it has memory starting at address 0, with no awareness of other processes’ memory. The kernel maps this to real physical RAM behind the scenes.
  • File descriptor table — its own list of open files, sockets, and pipes
  • Credentials — the user/group it runs as, which determines what it’s allowed to do
  • PID — a unique process ID used to reference it

By default, all processes on a Linux system exist in a single global process tree, rooted at PID 1 (typically systemd). Every process has a parent. Processes can see other processes on the system (subject to permissions) and share one view of the filesystem, network, and hostname.
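You can walk that tree from Python by following the PPid field in /proc (Linux-specific; on a normal system the chain ends at PID 1):

```python
import os
import re

def parent_chain(pid: int) -> list[int]:
    """Follow PPid links in /proc/<pid>/status up to PID 1."""
    chain = [pid]
    while pid > 1:
        with open(f"/proc/{pid}/status") as f:
            pid = int(re.search(r"^PPid:\s*(\d+)", f.read(), re.M).group(1))
        if pid == 0:          # kernel threads report PPid 0; stop there
            break
        chain.append(pid)
    return chain

chain = parent_chain(os.getpid())
print(chain)                  # e.g. [12345, 4321, 1]
```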

This shared, global view is the default — and it’s exactly what namespaces break apart.


Namespaces — Partitioning the Global View

A namespace is a kernel feature that gives a group of processes their own private view of a particular system resource, without any emulation or interception in user space.

When a process inside a namespace makes a syscall asking about that resource, the kernel simply returns a different answer — one scoped to that namespace. The process has no way to know it is not seeing the full picture.

Definition: A namespace is a rule inside the kernel: “When these processes ask about this resource, give them a different answer.”

Nothing is virtualised. Nothing is emulated. The kernel just maintains separate instances of the resource and consults the process’s namespace when answering syscalls.
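Which namespaces a process belongs to is itself visible through /proc on Linux. Each entry under /proc/self/ns/ is a symlink naming a namespace type and an inode number; two processes in the same namespace show the same inode:

```python
import os

# List this process's namespace memberships. The inode number in each link
# identifies the namespace instance.
namespaces = {
    name: os.readlink(f"/proc/self/ns/{name}")
    for name in sorted(os.listdir("/proc/self/ns"))
}
for name, ident in namespaces.items():
    print(f"{name:12} {ident}")   # e.g. pid          pid:[4026531836]
```

Compare the output on the host and inside a container: the inode numbers differ for every namespace the container runtime unshared.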

The Six Core Namespaces


PID — Process Isolation

Normally there is one global process tree. Every process can see every other (with appropriate permissions).

With a PID namespace:

  • Processes inside only see themselves and their children
  • PID numbering restarts at 1 inside the namespace
  • The host still has a global view; processes inside do not

The process thinks it is alone. It is not — it is still a normal process on the host kernel, just with a restricted view.


NET — Network Isolation

Normally there is one network stack: one set of interfaces, one routing table, one port table. A process binding to port 80 claims it system-wide.

With a NET namespace:

  • Each namespace has its own network stack, interfaces, IP addresses, and port table
  • Port 80 in one namespace does not conflict with port 80 in another
  • Interfaces must be explicitly created and wired up to connect namespaces to each other or the host

Relevant syscalls (socket, bind, listen) are answered in the context of the process’s NET namespace.


MNT — Filesystem Isolation

Normally there is one mount table. Everyone sees the same / (root filesystem) and the same tree of mounted filesystems.

With a MNT namespace:

  • Each namespace has its own mount table
  • / can point to an entirely different directory or filesystem
  • The underlying files still exist on disk; only the mapping differs

This is how a container gets its own root filesystem. The image is mounted as the root in the container’s MNT namespace — the host filesystem is untouched and invisible.


UTS — Identity Isolation

Isolates the system’s hostname and domain name. Allows each namespace to have its own hostname without affecting the host or other namespaces. Small but important for services that use the hostname to identify themselves.
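From Python, the hostname comes back via socket.gethostname(), which the kernel answers from the calling process's UTS namespace; inside a container this returns the container's hostname, not the host's:

```python
import socket

# Answered from this process's UTS namespace via the uname()/gethostname()
# family of syscalls.
hostname = socket.gethostname()
print(hostname)
```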


IPC — Inter-Process Communication Isolation

Isolates shared memory segments, semaphores, and message queues. Processes can only communicate via IPC with others in the same IPC namespace. Prevents containers from accidentally (or maliciously) interfering with shared memory on the host.


USER — Privilege Isolation

The most powerful namespace. Isolates user and group IDs, and allows UID remapping:

  • A process can appear to be root (UID 0) inside the namespace
  • While running as an unprivileged user on the host
  • The kernel maps between the two UID spaces

This is what makes rootless containers possible: you get the experience of running as root inside the container without actually granting root privileges on the host.


Cgroups — Controlling Resource Usage

Namespaces control visibility. cgroups control usage.

A control group (cgroup) is a kernel mechanism that limits, accounts for, and optionally prioritises the resource consumption of a group of processes. Where a namespace answers “what can this process see?”, a cgroup answers “how much can this process use?”

Resources cgroups govern:

  • CPU — time quota per period (e.g. 50% of one core)
  • Memory — hard limit; processes exceeding it are OOM-killed by the kernel
  • Disk I/O — read/write throughput and IOPS limits
  • Network — bandwidth (via traffic control integration)
  • PIDs — maximum number of processes in the group

cgroups are exposed as a virtual filesystem at /sys/fs/cgroup/. Setting a limit is as simple as writing a value to a file in that tree. The kernel reads those values when scheduling and allocating resources.
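Because the interface is just files, inspecting it needs no special tooling. A sketch, assuming Linux; exact paths vary between cgroup v1 and v2 and across distros, so treat the filenames as illustrative:

```python
from pathlib import Path

# Which cgroup is this process in? On cgroup v2 this is a single line
# like '0::/user.slice/...'; on v1 there is one line per controller.
cgroup_file = Path("/proc/self/cgroup")
membership = cgroup_file.read_text().strip() if cgroup_file.exists() else ""
print(membership)

# Limits are plain text files; 'max' means unlimited. memory.max exists on
# cgroup v2 for non-root cgroups, so guard the read.
mem_max = Path("/sys/fs/cgroup/memory.max")
if mem_max.exists():
    print("memory.max:", mem_max.read_text().strip())
```

Writing a limit is the same operation in reverse: echo a number into the file (as root), and the kernel starts enforcing it on the next scheduling decision.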

Simple rule:

  • Namespaces → what a process can see
  • cgroups → how much a process can use

The Full Picture

Putting it together, these are the kernel primitives that define the application execution environment:

Primitive        Layer              Controls
Syscalls         Interface          How apps talk to the kernel
Virtual memory   Process            Each process’s isolated address space
Namespaces       Visibility         What a process can see (processes, network, filesystem, etc.)
cgroups          Resources          How much CPU, memory, I/O a process can consume
Capabilities     Privileges         Which privileged operations a process is allowed to perform
Seccomp          Syscall filtering  Which syscalls a process is allowed to make at all

Everything above userspace — runtimes, containers, language VMs, orchestrators — is built on top of these six things. Understanding them gives you a model for reasoning about isolation, performance, and security at any level of the stack.


See Also