What is an Operating System?
An operating system is the layer of software that sits between your application and the physical hardware. Its job is to mediate access — to be the single authority that decides which programs can use which resources, when, and how much.
Applications never touch hardware directly. There is no way for your code to write to a disk, send a network packet, or allocate RAM without going through the OS. This is by design: direct hardware access would mean any program could corrupt memory, starve other processes, or crash the machine.
Key idea: The kernel is the authority. Applications make requests; the kernel decides the answers.
System Calls — The Interface Between Applications and the OS
The boundary between your application and the kernel is crossed via system calls (syscalls). These are the only legitimate way for a program to ask the OS to do something on its behalf.
When your code does something like:
- Open a file → open()
- Write to a socket → send()
- Allocate memory → mmap()
- Spawn a process → fork()/clone()
…your language runtime or standard library translates that into a syscall. The CPU switches from user mode (restricted) to kernel mode (privileged), the kernel does the work, and control returns to your program.
This mode switch is the hard boundary. Code in user mode physically cannot perform privileged operations — the CPU enforces it at the hardware level.
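As a minimal sketch of "applications make requests; the kernel decides the answers", the snippet below asks the kernel for a file descriptor through the thin os.open() wrapper and reports either the integer handle or the error code the kernel returned. The helper name ask_kernel_to_open and the missing path are illustrative assumptions, not part of any API.

```python
import errno
import os

def ask_kernel_to_open(path):
    """Request a read-only descriptor; return (fd, None) or (None, errno name)."""
    try:
        fd = os.open(path, os.O_RDONLY)  # thin wrapper over the open/openat syscall
    except OSError as exc:
        # The kernel refused the request; its answer comes back as an errno value.
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    return fd, None

# Hypothetical path that should not exist, so the request is refused.
missing_fd, missing_err = ask_kernel_to_open("/no/such/dir/file.txt")
print(missing_fd, missing_err)

# The current directory does exist, so the kernel hands back an integer handle.
ok_fd, ok_err = ask_kernel_to_open(os.curdir)
print(ok_fd, ok_err)
if ok_fd is not None:
    os.close(ok_fd)
```

Either way the program only sees the kernel's answer; it never touches the disk itself.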
Python through the Interpreter to the Kernel
Python code never calls the kernel directly. There are always at least two layers in between: the Python interpreter (CPython) and the C standard library (glibc). The journey looks like this:
your Python code
↓
CPython interpreter (Python built-ins, file objects, socket module)
↓
C standard library / glibc (open, read, malloc, pthread_create)
↓
syscall instruction (CPU switches to kernel mode)
↓
Linux kernel
↓
hardware
Each layer translates the request into something lower-level until it becomes a raw syscall number passed to the CPU.
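The layering can be made visible by asking for the same piece of information at three levels. This is a sketch under stated assumptions: it relies on ctypes being able to load the C library already linked into the interpreter, and the syscall numbers in SYS_GETPID are the Linux values for x86_64 and aarch64 only (the dictionary and function name are illustrative).

```python
import ctypes
import os
import platform
import sys

# Assumption: Linux syscall numbers for getpid on two common architectures.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

libc = ctypes.CDLL(None)  # the C library already loaded into this process

def getpid_three_ways():
    """Return the PID via CPython, via the libc wrapper, and via a syscall number."""
    via_python = os.getpid()             # CPython built-in
    via_libc = libc.getpid()             # C library wrapper function
    number = None
    if sys.platform == "linux":
        number = SYS_GETPID.get(platform.machine())
    # libc's generic syscall() takes the raw number and issues the syscall for us.
    via_number = libc.syscall(number) if number is not None else None
    return via_python, via_libc, via_number

print(getpid_three_ways())
```

All three values should agree, because each layer ends up making the same request to the kernel. On an architecture not listed in SYS_GETPID, the third value is None.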
File I/O
with open("data.txt", "r") as f:
    contents = f.read()
What actually happens:
- open() → CPython’s io module calls the glibc open() wrapper → kernel openat() syscall, which returns a file descriptor (an integer handle)
- f.read() → CPython calls the glibc read() wrapper → kernel read(fd, buffer, n) syscall, which copies bytes from a kernel buffer into user space
- Exiting the with block → close(fd) syscall, which releases the file descriptor
CPython does its own buffering on top of these calls; it does not go through stdio’s fopen()/fread().
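The same three steps can be written with the os module, whose functions map almost one-to-one onto the syscalls. This is a sketch rather than how you would normally read a file; the function name and the temporary file are illustrative.

```python
import os
import tempfile

def read_with_raw_descriptors(path, chunk_size=4096):
    """Read a whole file using os.open/os.read/os.close, one syscall per call."""
    fd = os.open(path, os.O_RDONLY)          # shows up as openat() under strace
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)  # read(fd, buffer, n)
            if not chunk:                    # empty bytes means end of file
                break
            chunks.append(chunk)
        return fd, b"".join(chunks)
    finally:
        os.close(fd)                         # close(fd)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("hello")
    fd_number, contents = read_with_raw_descriptors(path)

print(fd_number, contents)
```

The returned descriptor is just a small integer; by the time the caller sees it, the descriptor has already been closed.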
Network I/O
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.com", 80))
s.send(b"GET / HTTP/1.0\r\n\r\n")
data = s.recv(1024)
Syscall sequence:
- socket.socket() → kernel socket(): allocates a socket, returns a file descriptor
- s.connect() → kernel connect(fd, addr): initiates the TCP handshake
- s.send() → kernel sendto(fd, data): passes bytes to the network stack
- s.recv() → kernel recvfrom(fd, buffer): blocks until data arrives, then copies it into user space
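To watch the send and receive path without needing a network, the sketch below uses socket.socketpair(), which asks the kernel for two already-connected sockets. That is a substitution for the TCP connection above, chosen so the example runs offline; the function name is illustrative.

```python
import socket

def echo_over_socketpair(payload):
    """Send bytes through a connected socket pair and return what arrives."""
    left, right = socket.socketpair()          # two connected sockets from the kernel
    try:
        fds = (left.fileno(), right.fileno())  # plain integers, like file descriptors
        left.sendall(payload)                  # bytes handed to the kernel
        received = right.recv(1024)            # blocks until data arrives
        return fds, received
    finally:
        left.close()
        right.close()

fds, received = echo_over_socketpair(b"ping")
print(fds, received)
```

Sockets are file descriptors too, which is why fileno() returns the same kind of integer that open() produced in the file example.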
Process Creation
import subprocess
result = subprocess.run(["ls", "-la"], capture_output=True)
Syscall sequence:
- subprocess.run() → kernel fork() (or clone()): duplicates the current process
- In the child: execve("/bin/ls", ["ls", "-la"], env): the kernel replaces the child’s memory with the ls binary
- Parent calls wait4(child_pid): blocks until the child exits, then collects the exit code
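The fork, exec, wait sequence can also be spelled out by hand with the os module. This is a simplified sketch of the pattern, not what subprocess does internally (subprocess also handles pipes, signals, and error reporting); the function name run_child is illustrative.

```python
import os
import sys

def run_child(argv):
    """Spawn argv[0] with fork + execve and return its exit code."""
    pid = os.fork()                                # duplicate this process
    if pid == 0:
        # Child: replace this process image with the new program.
        try:
            os.execve(argv[0], argv, os.environ)   # does not return on success
        finally:
            os._exit(127)                          # reached only if exec failed
    _, status = os.waitpid(pid, 0)                 # parent blocks until the child exits
    return os.WEXITSTATUS(status)

# Run a second Python interpreter that exits with a known status.
exit_code = run_child([sys.executable, "-c", "raise SystemExit(7)"])
print(exit_code)
```

The child uses os._exit() rather than sys.exit() on failure so that it never runs the parent's cleanup handlers in the duplicated process.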
Memory Allocation
data = [0] * 10_000_000  # allocate a large list
Python’s memory allocator (pymalloc) manages a private pool and avoids syscalls where possible. But when it needs more from the OS, it calls mmap(NULL, size, PROT_READ|PROT_WRITE, ...) and the kernel maps new pages into the process’s virtual address space. Those pages aren’t backed by physical RAM until first written: the kernel uses demand paging, where a page fault on first access triggers physical allocation.
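Demand paging can be observed with the mmap module. The sketch below maps anonymous memory, then writes one byte to each page, sampling the process’s resident page count along the way. It assumes a Linux /proc filesystem for the measurement (the helper returns None elsewhere), and the function names are illustrative.

```python
import mmap

def resident_pages():
    """Resident set size of this process in pages (Linux /proc), or None."""
    try:
        with open("/proc/self/statm") as f:
            return int(f.read().split()[1])   # second field is resident pages
    except (OSError, IndexError, ValueError):
        return None

def map_then_touch(n_pages=2048):
    """Map anonymous memory, write one byte per page, return three RSS samples."""
    page = mmap.PAGESIZE
    before = resident_pages()
    region = mmap.mmap(-1, n_pages * page)    # anonymous mapping, no file behind it
    mapped = resident_pages()
    for offset in range(0, n_pages * page, page):
        region[offset] = 1                    # first write to each page faults it in
    touched = resident_pages()
    region.close()                            # unmaps the region
    return before, mapped, touched

print(map_then_touch())
```

On a typical Linux system the second sample stays close to the first, and the third grows by roughly n_pages, though the exact numbers depend on the allocator and kernel settings.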
Observing it Yourself
strace intercepts every syscall a process makes and prints them. You can watch Python’s syscalls live:
# Trace all syscalls
strace python3 -c "open('test.txt', 'w').write('hello')"
# Trace only specific syscalls
strace -e trace=openat,read,write,close python3 script.py
# Count syscalls by type (useful for performance analysis)
strace -c python3 script.py
If something is slow or failing at the OS level, strace will show you exactly which syscall and why, which makes it one of the most useful debugging tools on Linux.
Common Syscalls Reference
| Category | Syscall | What it does | Python trigger |
|---|---|---|---|
| Files | openat | Open or create a file, returns fd | open() |
| | read | Read bytes from fd into buffer | f.read() |
| | write | Write bytes from buffer to fd | f.write() |
| | close | Release a file descriptor | f.close() / with block exit |
| | stat | Get file metadata (size, mtime, etc.) | os.stat(), Path.exists() |
| | unlink | Delete a file | os.remove() |
| | rename | Rename or move a file | os.rename() |
| | mkdir | Create a directory | os.mkdir() |
| | getdents | Read directory entries | os.listdir() |
| Processes | fork | Duplicate the current process | subprocess, os.fork() |
| | clone | Create process or thread with options | threading.Thread().start() |
| | execve | Replace process image with new program | subprocess.run() |
| | exit | Terminate the process | end of script / sys.exit() |
| | wait4 | Wait for a child process to finish | proc.wait() |
| | getpid | Return current process ID | os.getpid() |
| | kill | Send a signal to a process | os.kill() |
| Memory | mmap | Map memory (allocate, map file, shared mem) | large allocations, mmap module |
| | munmap | Unmap previously mapped memory | garbage collection |
| | brk | Adjust the process heap boundary | small allocations via glibc |
| Networking | socket | Create a network socket | socket.socket() |
| | bind | Assign an address to a socket | s.bind() |
| | listen | Mark socket as passive (server) | s.listen() |
| | accept | Accept an incoming connection | s.accept() |
| | connect | Initiate a connection | s.connect() |
| | sendto | Send data on a socket | s.send() |
| | recvfrom | Receive data from a socket | s.recv() |
| I/O control | epoll_wait | Wait for one or more fds to be ready | asyncio, selectors |
| | ioctl | Device-specific control operations | low-level device access |
| | fcntl | File descriptor flags and locks | fcntl module |
| Filesystem | mount | Mount a filesystem | os.system("mount ...") |
| | chdir | Change working directory | os.chdir() |
| | chroot | Change root directory | rarely directly from Python |
| Identity | getuid | Get real user ID | os.getuid() |
| | setuid | Set user ID | os.setuid() |
User Space Vs Kernel Space
| | User space | Kernel space |
|---|---|---|
| Who runs here | Your application, libraries, runtimes | The OS kernel |
| Memory access | Own process memory only | All physical memory |
| Hardware access | None — must ask kernel | Direct |
| A crash here | Kills the process | Crashes the whole machine |
Processes
A process is a running program. More precisely, it is the kernel’s unit of isolation — each process gets its own:
- Virtual address space — it believes it has memory starting at address 0, with no awareness of other processes’ memory. The kernel maps this to real physical RAM behind the scenes.
- File descriptor table — its own list of open files, sockets, and pipes
- Credentials — the user/group it runs as, which determines what it’s allowed to do
- PID — a unique process ID used to reference it
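The attributes in this list can be inspected from inside a running process. The sketch below collects them for the current interpreter; the open-descriptor listing assumes a Linux /proc filesystem, and the function name is illustrative.

```python
import os

def describe_current_process():
    """Collect the per-process attributes listed above for this interpreter."""
    info = {
        "pid": os.getpid(),          # unique process ID
        "parent_pid": os.getppid(),  # ID of the process that created this one
        "uid": os.getuid(),          # credentials: real user ID
        "gid": os.getgid(),          # credentials: real group ID
    }
    try:
        # Linux-specific: one entry per open descriptor in this process's table.
        # The list includes the temporary descriptor used to read the directory.
        names = os.listdir("/proc/self/fd")
        info["open_fds"] = sorted(int(name) for name in names)
    except OSError:
        info["open_fds"] = None      # /proc is not available on this system
    return info

print(describe_current_process())
```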
By default, all processes on a Linux system exist in a single global process tree, rooted at PID 1 (typically systemd). Every process has a parent. Processes can see other processes on the system (subject to permissions) and share one view of the filesystem, network, and hostname.
This shared, global view is the default — and it’s exactly what namespaces break apart.
Namespaces — Partitioning the Global View
A namespace is a kernel feature that gives a group of processes their own private view of a particular system resource, without any emulation or interception in user space.
When a process inside a namespace makes a syscall asking about that resource, the kernel simply returns a different answer — one scoped to that namespace. The process has no way to know it is not seeing the full picture.
Definition: A namespace is a rule inside the kernel: “When these processes ask about this resource, give them a different answer.”
Nothing is virtualised. Nothing is emulated. The kernel just maintains separate instances of the resource and consults the process’s namespace when answering syscalls.
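On Linux, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns. The sketch below reads them for the current process; the function and constant names are illustrative, and the helper returns None for any entry it cannot read.

```python
import os

NAMESPACE_KINDS = ("pid", "net", "mnt", "uts", "ipc", "user")

def namespace_ids(pid="self"):
    """Map each namespace kind to its identifier string, or None if unreadable."""
    ids = {}
    for kind in NAMESPACE_KINDS:
        try:
            # Each link target names the namespace kind and an inode number.
            ids[kind] = os.readlink(f"/proc/{pid}/ns/{kind}")
        except OSError:
            ids[kind] = None
    return ids

for kind, ident in namespace_ids().items():
    print(kind, ident)
```

Two processes that print the same identifier for a kind share that namespace; a process in a container would typically print different identifiers from a process on the host.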
The Six Core Namespaces
PID — Process Isolation
Normally there is one global process tree. Every process can see every other (with appropriate permissions).
With a PID namespace:
- Processes inside see only the processes in the same namespace (and in any namespaces nested below it)
- PID numbering restarts at 1 inside the namespace
- The host still has a global view; processes inside do not
The process thinks it is alone. It is not — it is still a normal process on the host kernel, just with a restricted view.
NET — Network Isolation
Normally there is one network stack: one set of interfaces, one routing table, one port table. A process binding to port 80 claims it system-wide.
With a NET namespace:
- Each namespace has its own network stack, interfaces, IP addresses, and port table
- Port 80 in one namespace does not conflict with port 80 in another
- Interfaces must be explicitly created and wired up to connect namespaces to each other or the host
Relevant syscalls (socket, bind, listen) are answered in the context of the process’s NET namespace.
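The "one port table" behaviour within a single NET namespace is easy to reproduce. In this sketch, two TCP sockets in the same process try to bind the same loopback port, and the kernel rejects the second. The function name is illustrative, and the example assumes a loopback interface at 127.0.0.1.

```python
import errno
import socket

def second_bind_result():
    """Bind two TCP sockets to one loopback port; return the second bind's errno name."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))       # port 0: let the kernel pick a free port
        port = first.getsockname()[1]
        first.listen(1)
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return None                        # no conflict was reported
    finally:
        first.close()
        second.close()

print(second_bind_result())
```

Run in two different NET namespaces, the same two bind calls would both succeed, because each namespace keeps its own port table.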
MNT — Filesystem Isolation
Normally there is one mount table. Everyone sees the same / (root filesystem) and the same tree of mounted filesystems.
With a MNT namespace:
- Each namespace has its own mount table
- / can point to an entirely different directory or filesystem
- The underlying files still exist on disk; only the mapping differs
This is how a container gets its own root filesystem. The image is mounted as the root in the container’s MNT namespace — the host filesystem is untouched and invisible.
UTS — Identity Isolation
Isolates the system’s hostname and domain name. Allows each namespace to have its own hostname without affecting the host or other namespaces. Small but important for services that use the hostname to identify themselves.
IPC — Inter-Process Communication Isolation
Isolates shared memory segments, semaphores, and message queues. Processes can only communicate via IPC with others in the same IPC namespace. Prevents containers from accidentally (or maliciously) interfering with shared memory on the host.
USER — Privilege Isolation
The most powerful namespace. Isolates user and group IDs, and allows UID remapping:
- A process can appear to be root (UID 0) inside the namespace
- While running as an unprivileged user on the host
- The kernel maps between the two UID spaces
This is what makes rootless containers possible: you get the experience of running as root inside the container without actually granting root privileges on the host.
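The UID remapping can be tried directly, with several caveats. The sketch below assumes Python 3.12 or newer (for os.unshare), a Linux kernel that allows unprivileged user namespaces, and a single-threaded process; when any of those is missing it returns None instead of failing. The function name is illustrative.

```python
import os

def root_inside_user_namespace():
    """Return True if a forked child sees UID 0 after entering a new USER namespace.

    Returns None when the experiment cannot run on this system.
    """
    if not hasattr(os, "unshare") or not hasattr(os, "CLONE_NEWUSER"):
        return None                           # needs Python 3.12+ on Linux
    outer_uid = os.getuid()
    pid = os.fork()
    if pid == 0:
        code = 2                              # 2 means the namespace could not be set up
        try:
            os.unshare(os.CLONE_NEWUSER)      # child moves into a fresh USER namespace
            with open("/proc/self/uid_map", "w") as f:
                f.write(f"0 {outer_uid} 1")   # inside UID 0 maps to outer_uid outside
            code = 0 if os.getuid() == 0 else 1
        except OSError:
            pass                              # not permitted here; keep code 2
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    if not os.WIFEXITED(status):
        return None
    return {0: True, 1: False}.get(os.WEXITSTATUS(status))

print(root_inside_user_namespace())
```

The parent process is unaffected: only the forked child enters the new namespace, and from the host it is still running as the original unprivileged user.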
Cgroups — Controlling Resource Usage
Namespaces control visibility. cgroups control usage.
A control group (cgroup) is a kernel mechanism that limits, accounts for, and optionally prioritises the resource consumption of a group of processes. Where a namespace answers “what can this process see?”, a cgroup answers “how much can this process use?”
Resources cgroups govern:
- CPU — time quota per period (e.g. 50% of one core)
- Memory — hard limit; processes exceeding it are OOM-killed by the kernel
- Disk I/O — read/write throughput and IOPS limits
- Network — bandwidth (via traffic control integration)
- PIDs — maximum number of processes in the group
cgroups are exposed as a virtual filesystem at /sys/fs/cgroup/. Setting a limit is as simple as writing a value to a file in that tree. The kernel reads those values when scheduling and allocating resources.
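Because cgroups are just files, reading the limits that apply to the current process takes only a few lines. This sketch assumes the unified cgroup v2 layout mounted at /sys/fs/cgroup and only reads values (writing them normally requires elevated privileges). The function name and the choice of files are illustrative; missing files are reported as None.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"
LIMIT_FILES = ("memory.max", "cpu.max", "pids.max")

def read_cgroup_limits():
    """Return this process's cgroup path and a few limit files, assuming cgroup v2."""
    result = {"path": None}
    result.update({name: None for name in LIMIT_FILES})
    try:
        with open("/proc/self/cgroup") as f:
            for line in f:
                # cgroup v2 entries look like "0::/some/path"
                hierarchy, _, path = line.strip().split(":", 2)
                if hierarchy == "0":
                    result["path"] = path
    except (OSError, ValueError):
        return result
    if result["path"] is None:
        return result
    directory = os.path.join(CGROUP_ROOT, result["path"].lstrip("/"))
    for name in LIMIT_FILES:
        try:
            with open(os.path.join(directory, name)) as f:
                result[name] = f.read().strip()
        except OSError:
            pass                              # file absent, e.g. in the root cgroup
    return result

print(read_cgroup_limits())
```

A value of "max" means no limit is set; a cpu.max value such as "50000 100000" means 50,000 microseconds of CPU time per 100,000 microsecond period, which is the 50% of one core mentioned above.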
Simple rule:
- Namespaces → what a process can see
- cgroups → how much a process can use
The Full Picture
Putting it together, these are the kernel primitives that define the application execution environment:
| Primitive | Layer | Controls |
|---|---|---|
| Syscalls | Interface | How apps talk to the kernel |
| Virtual memory | Process | Each process’s isolated address space |
| Namespaces | Visibility | What a process can see (processes, network, filesystem, etc.) |
| cgroups | Resources | How much CPU, memory, I/O a process can consume |
| Capabilities | Privileges | Which privileged operations a process is allowed to perform |
| Seccomp | Syscall filtering | Which syscalls a process is allowed to make at all |
Everything above the kernel (runtimes, containers, language VMs, orchestrators) is built on top of these six things. Understanding them gives you a model for reasoning about isolation, performance, and security at any level of the stack.