What is an Operating System?

An operating system is the layer of software that sits between your application and the physical hardware. Its job is to mediate access — to be the single authority that decides which programs can use which resources, when, and how much.

Applications do not touch hardware directly. Ordinary user-space code cannot write to a disk, send a network packet, or obtain more RAM without going through the OS. This is by design: direct hardware access would mean any program could corrupt memory, starve other processes, or crash the machine.

Key idea: The kernel is the authority. Applications make requests; the kernel decides the answers.

System Calls — The Interface Between Applications and the OS

The boundary between your application and the kernel is crossed via system calls (syscalls). These are the only legitimate way for a program to ask the OS to do something on its behalf.

When your code does something like:

Open a file → open()
Write to a socket → send()
Allocate memory → mmap()
Spawn a process → fork() / clone()

…your language runtime or standard library translates that into a syscall. The CPU switches from user mode (restricted) to kernel mode (privileged), the kernel does the work, and control returns to your program.

This mode switch is the hard boundary. Code in user mode physically cannot perform privileged operations — the CPU enforces it at the hardware level.

Python through the Interpreter to the Kernel

Python code never calls the kernel directly. There are always at least two layers in between: the Python interpreter (CPython) and the C standard library (glibc). The journey looks like this:

your Python code
      ↓
CPython interpreter (Python built-ins, file objects, socket module)
      ↓
C standard library / glibc (open, read, malloc, pthread_create)
      ↓
syscall instruction (CPU switches to kernel mode)
      ↓
Linux kernel
      ↓
hardware

Each layer translates the request into something lower-level until it becomes a raw syscall: a syscall number and its arguments placed in CPU registers, followed by the instruction that traps into the kernel.
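To make the layering concrete, here is a minimal sketch that reaches the same kernel service by two routes: Python's os.getpid() and the C library's getpid() wrapper loaded through ctypes. It assumes a POSIX system where ctypes.CDLL(None) exposes the C library already loaded into the process; the variable names are illustrative only.

```python
import ctypes
import os

# Assumption: on POSIX, CDLL(None) gives access to symbols already loaded
# into this process, including the C library's getpid() wrapper.
libc = ctypes.CDLL(None)

pid_via_python = os.getpid()   # CPython -> libc getpid() -> kernel
pid_via_libc = libc.getpid()   # ctypes  -> libc getpid() -> kernel

print(pid_via_python, pid_via_libc)
```

Both routes end in the same syscall, so both should report the same process ID.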

File I/O

with open("data.txt", "r") as f:
    contents = f.read()

What actually happens:

open() → CPython's io layer calls the libc open() wrapper → kernel openat() syscall, which returns a file descriptor (an integer handle)
f.read() → CPython calls the libc read() wrapper → kernel read(fd, buffer, n) syscall, which copies bytes from the kernel's buffers into user space
Exiting the with block → close(fd) syscall, which releases the file descriptor
Note that CPython does its own buffering and does not go through C stdio (fopen/fread).
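The same three steps can be written with the thin wrappers in the os module, which map almost one-to-one onto the syscalls. This is a sketch rather than a recommendation for everyday code; the scratch file path is a stand-in so the example does not depend on data.txt existing.

```python
import os
import tempfile

# Hypothetical scratch file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("hello")

fd = os.open(path, os.O_RDONLY)    # open: returns an integer file descriptor
try:
    contents = os.read(fd, 4096)   # read: copies up to 4096 bytes into user space
finally:
    os.close(fd)                   # close: releases the descriptor

print(fd, contents)
```

Unlike the built-in open(), os.read() returns raw bytes and performs no buffering or decoding.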

Network I/O

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.com", 80))
s.send(b"GET / HTTP/1.0\r\n\r\n")
data = s.recv(1024)

Syscall sequence:

socket.socket() → kernel socket() — allocates a socket, returns a file descriptor
s.connect() → kernel connect(fd, addr) — initiates TCP handshake
s.send() → kernel sendto(fd, data) — passes bytes to the network stack
s.recv() → kernel recvfrom(fd, buffer) — blocks until data arrives, copies into user space
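A network-free way to see the send/recv half of this sequence is socket.socketpair(), which returns two already-connected sockets. Note the assumption: these are local (typically AF_UNIX) sockets rather than TCP, so there is no connect() or handshake, but the descriptor and copy semantics are the same.

```python
import socket

# socketpair() creates two connected sockets, so no network access is needed.
left, right = socket.socketpair()
try:
    left_fd, right_fd = left.fileno(), right.fileno()  # sockets are file descriptors
    left.send(b"ping")        # hands the bytes to the kernel
    data = right.recv(1024)   # blocks until the kernel has bytes to copy out
finally:
    left.close()
    right.close()

print(left_fd, right_fd, data)
```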

Process Creation

import subprocess
result = subprocess.run(["ls", "-la"], capture_output=True)

Syscall sequence:

subprocess.run() → kernel fork(), vfork(), or clone() (which one depends on the Python and libc versions), creating a child process
In the child: execve("/bin/ls", ["ls", "-la"], env) — kernel replaces child’s memory with the ls binary
Parent calls wait4(child_pid) — blocks until child exits, then collects the exit code
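The sequence above can be sketched by hand with the os module. This is an illustration of the fork/exec/wait pattern on a POSIX system, not how subprocess is implemented internally; the exit status 7 is an arbitrary value chosen for the demo, and os.waitstatus_to_exitcode() requires Python 3.9 or later.

```python
import os
import sys

pid = os.fork()   # create a child process
if pid == 0:
    # Child: replace this process image with a fresh Python interpreter
    # that exits with status 7.
    try:
        os.execv(sys.executable, [sys.executable, "-c", "raise SystemExit(7)"])
    finally:
        os._exit(127)   # only reached if execv itself fails
else:
    # Parent: block until the child exits, then decode its exit status.
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)
    print(exit_code)
```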

Memory Allocation

data = [0] * 10_000_000   # allocate a large list

Python’s small-object allocator (pymalloc) serves requests of 512 bytes or less from pools it manages itself, which avoids syscalls where possible. Larger requests, such as the pointer array behind this list, are passed to the C library's malloc, and when either allocator needs more from the OS it calls mmap(NULL, size, PROT_READ|PROT_WRITE, ...), which asks the kernel to map new pages into the process’s virtual address space. Those pages aren’t backed by physical RAM until they are first touched: the kernel uses demand paging, where a page fault on first access triggers physical allocation.
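The mmap module exposes the same mechanism directly. The sketch below requests an anonymous mapping and then writes one byte per page; under demand paging it is those first writes, not the mmap call, that would cause physical pages to be allocated. The 16-page size is an arbitrary choice for the demo.

```python
import mmap

PAGE = mmap.PAGESIZE
size = 16 * PAGE

# Anonymous mapping: fileno=-1 asks for memory not backed by any file.
region = mmap.mmap(-1, size)
try:
    # Touch one byte in each page of the mapping.
    for offset in range(0, size, PAGE):
        region[offset] = 1
    touched = sum(region[offset] for offset in range(0, size, PAGE))
finally:
    region.close()   # unmaps the region

print(size, touched)
```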

Observing it Yourself

strace intercepts every syscall a process makes and prints them. You can watch Python’s syscalls live:

# Trace all syscalls
strace python3 -c "open('test.txt', 'w').write('hello')"
 
# Trace only specific syscalls
strace -e trace=openat,read,write,close python3 script.py
 
# Count syscalls by type (useful for performance analysis)
strace -c python3 script.py

If something is slow or failing at the OS level, strace will show you exactly which syscall and why — one of the most useful debugging tools on Linux.

Common Syscalls Reference

Category	Syscall	What it does	Python trigger
Files	`openat`	Open or create a file, returns fd	`open()`
	`read`	Read bytes from fd into buffer	`f.read()`
	`write`	Write bytes from buffer to fd	`f.write()`
	`close`	Release a file descriptor	`f.close()` / `with` block exit
	`stat`	Get file metadata (size, mtime, etc.)	`os.stat()`, `Path.exists()`
	`unlink`	Delete a file	`os.remove()`
	`rename`	Rename or move a file	`os.rename()`
	`mkdir`	Create a directory	`os.mkdir()`
	`getdents64`	Read directory entries	`os.listdir()`
Processes	`fork`	Duplicate the current process	`subprocess`, `os.fork()`
	`clone`	Create process or thread with options	`threading.Thread(...).start()`
	`execve`	Replace process image with new program	`subprocess.run()`
	`exit_group`	Terminate the process (all of its threads)	end of script / `sys.exit()`
	`wait4`	Wait for a child process to finish	`proc.wait()`
	`getpid`	Return current process ID	`os.getpid()`
	`kill`	Send a signal to a process	`os.kill()`
Memory	`mmap`	Map memory (allocate, map file, shared mem)	large allocations, `mmap` module
	`munmap`	Unmap previously mapped memory	freeing large blocks, closing an `mmap` object
	`brk`	Adjust the process heap boundary	small allocations via glibc
Networking	`socket`	Create a network socket	`socket.socket()`
	`bind`	Assign an address to a socket	`s.bind()`
	`listen`	Mark socket as passive (server)	`s.listen()`
	`accept`	Accept an incoming connection	`s.accept()`
	`connect`	Initiate a connection	`s.connect()`
	`sendto`	Send data on a socket	`s.send()`
	`recvfrom`	Receive data from a socket	`s.recv()`
I/O control	`epoll_wait`	Wait for one or more fds to be ready (set up with `epoll_create1`, `epoll_ctl`)	`asyncio`, `selectors`
	`ioctl`	Device-specific control operations	low-level device access
	`fcntl`	File descriptor flags and locks	`fcntl` module
Filesystem	`mount`	Mount a filesystem	`os.system("mount ...")`
	`chdir`	Change working directory	`os.chdir()`
	`chroot`	Change root directory	rarely directly from Python
Identity	`getuid`	Get real user ID	`os.getuid()`
	`setuid`	Set user ID	`os.setuid()`

User Space Vs Kernel Space

	User space	Kernel space
Who runs here	Your application, libraries, runtimes	The OS kernel
Memory access	Own process memory only	All physical memory
Hardware access	None — must ask kernel	Direct
A crash here	Kills the process	Crashes the whole machine
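The "kills the process" half of the last row is easy to check from Python. In this sketch, assuming a POSIX system, a child interpreter calls os.abort(), which raises SIGABRT in that process only; the parent observes the death through the return code and keeps running.

```python
import signal
import subprocess
import sys

# The child aborts itself; only that process dies.
child = subprocess.run([sys.executable, "-c", "import os; os.abort()"])

# On POSIX, a negative returncode means "terminated by signal -returncode".
killed_by = signal.Signals(-child.returncode).name
print("child died from", killed_by, "but the parent is still running")
```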

Processes

A process is a running program. More precisely, it is the kernel’s unit of isolation — each process gets its own:

Virtual address space — it believes it has memory starting at address 0, with no awareness of other processes’ memory. The kernel maps this to real physical RAM behind the scenes.
File descriptor table — its own list of open files, sockets, and pipes
Credentials — the user/group it runs as, which determines what it’s allowed to do
PID — a unique process ID used to reference it
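Most of these per-process attributes can be inspected from inside the process. The following is a small sketch; the /proc/self/fd listing assumes a Linux-style /proc filesystem and is skipped when it is not present, and the dictionary keys are illustrative names.

```python
import os

info = {
    "pid": os.getpid(),          # this process's ID
    "parent_pid": os.getppid(),  # every process has a parent
    "uid": os.getuid(),          # credentials: real user ID
    "gid": os.getgid(),          # credentials: real group ID
}

# Assumption: Linux-style /proc, where each entry in /proc/self/fd
# is one open file descriptor of the current process.
fd_dir = "/proc/self/fd"
if os.path.isdir(fd_dir):
    info["open_fds"] = sorted(int(name) for name in os.listdir(fd_dir))

print(info)
```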

By default, all processes on a Linux system exist in a single global process tree, rooted at PID 1 (typically systemd). Every process has a parent. Processes can see other processes on the system (subject to permissions) and share one view of the filesystem, network, and hostname.

This shared, global view is the default — and it’s exactly what namespaces break apart.

Namespaces — Partitioning the Global View

A namespace is a kernel feature that gives a group of processes their own private view of a particular system resource, without any emulation or interception in user space.

When a process inside a namespace makes a syscall asking about that resource, the kernel simply returns a different answer — one scoped to that namespace. The process has no way to know it is not seeing the full picture.

Definition: A namespace is a rule inside the kernel: “When these processes ask about this resource, give them a different answer.”

Nothing is virtualised. Nothing is emulated. The kernel just maintains separate instances of the resource and consults the process’s namespace when answering syscalls.
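On Linux, the namespaces a process belongs to are visible as symbolic links under /proc/self/ns, each naming a namespace type and an identifier. The helper below (a hypothetical name, written for this sketch) reads them and returns an empty result on systems without that directory. Two processes that show the same identifier for a given type share that namespace.

```python
import os

def current_namespaces(ns_dir="/proc/self/ns"):
    """Return {namespace_type: identifier} for this process, or {} if unavailable."""
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue   # entry not readable as a link; skip it
    return result

namespaces = current_namespaces()
for name, ident in namespaces.items():
    print(f"{name:20} {ident}")
```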

The Six Core Namespaces

PID — Process Isolation

Normally there is one global process tree. Every process can see every other (with appropriate permissions).

With a PID namespace:

Processes inside see only the processes in the same PID namespace (and in any namespaces nested below it)
PID numbering restarts at 1 inside the namespace
The host still has a global view; processes inside do not

The process thinks it is alone. It is not — it is still a normal process on the host kernel, just with a restricted view.

NET — Network Isolation

Normally there is one network stack: one set of interfaces, one routing table, one port table. A process binding to port 80 claims it system-wide.

With a NET namespace:

Each namespace has its own network stack, interfaces, IP addresses, and port table
Port 80 in one namespace does not conflict with port 80 in another
Interfaces must be explicitly created and wired up to connect namespaces to each other or the host

Relevant syscalls (socket, bind, listen) are answered in the context of the process’s NET namespace.
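One quick way to see which network stack a process is looking at is to list the interfaces it can see. This sketch uses socket.if_nameindex(), available on Unix-like systems; run inside a fresh NET namespace it would typically show far fewer interfaces than on the host.

```python
import socket

# Each entry is an (interface_index, interface_name) pair as seen from
# the NET namespace this process belongs to.
interfaces = socket.if_nameindex()
for index, name in interfaces:
    print(index, name)
```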

MNT — Filesystem Isolation

Normally there is one mount table. Everyone sees the same / (root filesystem) and the same tree of mounted filesystems.

With a MNT namespace:

Each namespace has its own mount table
/ can point to an entirely different directory or filesystem
The underlying files still exist on disk; only the mapping differs

This is how a container gets its own root filesystem. The image is mounted as the root in the container’s MNT namespace — the host filesystem is untouched and invisible.

UTS — Identity Isolation

Isolates the system’s hostname and domain name. Allows each namespace to have its own hostname without affecting the host or other namespaces. Small but important for services that use the hostname to identify themselves.
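From Python, the hostname a process sees can be read in two ways, both answered by the kernel in the context of the process's UTS namespace. A minimal sketch for a POSIX system:

```python
import os
import socket

node = os.uname().nodename   # hostname as reported by uname
host = socket.gethostname()  # hostname as reported by the sockets API

print(node, host)
```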

IPC — Inter-Process Communication Isolation

Isolates shared memory segments, semaphores, and message queues. Processes can only communicate via IPC with others in the same IPC namespace. Prevents containers from accidentally (or maliciously) interfering with shared memory on the host.

USER — Privilege Isolation

The most powerful namespace. Isolates user and group IDs, and allows UID remapping:

A process can appear to be root (UID 0) inside the namespace while running as an unprivileged user on the host
The kernel maps between the two UID spaces

This is what makes rootless containers possible: you get the experience of running as root inside the container without actually granting root privileges on the host.
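The mapping is visible in /proc/<pid>/uid_map, where each line has three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below parses that format and translates an inside ID to the outside one. The function names and the sample mapping are hypothetical, chosen only to illustrate the arithmetic.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(part) for part in line.split())
            ranges.append((inside, outside, length))
    return ranges

def to_outside_id(inside_id, ranges):
    """Translate an ID seen inside the namespace to the ID outside it, or None."""
    for inside, outside, length in ranges:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None   # unmapped ID

# Hypothetical mapping: root inside is UID 1000 outside, and inside
# UIDs 1..65536 map to outside UIDs 100000..165535.
sample = "0 1000 1\n1 100000 65536\n"
ranges = parse_id_map(sample)
print(to_outside_id(0, ranges))    # 1000
print(to_outside_id(33, ranges))   # 100032
```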

Cgroups — Controlling Resource Usage

Namespaces control visibility. cgroups control usage.

A control group (cgroup) is a kernel mechanism that limits, accounts for, and optionally prioritises the resource consumption of a group of processes. Where a namespace answers “what can this process see?”, a cgroup answers “how much can this process use?”

Resources cgroups govern:

CPU — time quota per period (e.g. 50% of one core)
Memory — hard limit; processes exceeding it are OOM-killed by the kernel
Disk I/O — read/write throughput and IOPS limits
Network — bandwidth (via traffic control integration)
PIDs — maximum number of processes in the group

cgroups are exposed as a virtual filesystem at /sys/fs/cgroup/. Setting a limit is as simple as writing a value to a file in that tree. The kernel reads those values when scheduling and allocating resources.
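As an illustration of how simple those files are, the sketch below parses the two most common cgroup v2 limit formats: cpu.max holds a quota and a period (or the word max for unlimited), and memory.max holds a byte count (or max). The function names are hypothetical, and the sample strings stand in for file contents so the example runs without a cgroup hierarchy.

```python
def parse_cpu_max(text):
    """Return the CPU limit as a fraction of one core, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def parse_memory_max(text):
    """Return the memory limit in bytes, or None if unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)

print(parse_cpu_max("50000 100000"))     # 0.5, i.e. 50% of one core
print(parse_cpu_max("max 100000"))       # None
print(parse_memory_max("536870912\n"))   # 536870912 (512 MiB)
```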

Simple rule:

Namespaces → what a process can see

cgroups → how much a process can use

The Full Picture

Putting it together, these are the kernel primitives that define the application execution environment:

Primitive	Layer	Controls
Syscalls	Interface	How apps talk to the kernel
Virtual memory	Process	Each process’s isolated address space
Namespaces	Visibility	What a process can see (processes, network, filesystem, etc.)
cgroups	Resources	How much CPU, memory, I/O a process can consume
Capabilities	Privileges	Which privileged operations a process is allowed to perform
Seccomp	Syscall filtering	Which syscalls a process is allowed to make at all

Everything in user space (runtimes, containers, language VMs, orchestrators) is built on top of these six primitives. Understanding them gives you a model for reasoning about isolation, performance, and security at any level of the stack.

Notes

Operating system fundamentals - Linux