Data Engineers Should Understand the Systems Beneath Their Tools
Modern data engineers live in a world of abstraction.
Most of the infrastructure we interact with today is designed to hide the complexity underneath it. Managed databases, serverless compute, container orchestration, and hosted data pipelines let teams move quickly without worrying about the details of the machines running the system.
This is largely a good thing. Cloud providers have become extremely good at running infrastructure at scale. A small team can deploy distributed systems in minutes, spin up warehouses with a few clicks, and process terabytes of data without ever thinking about disk partitions or network interfaces. For most companies, that tradeoff makes sense. Engineers can focus on solving business problems instead of maintaining servers.
But abstraction has a downside. The more layers that sit between you and the machine, the harder it becomes to understand what is actually happening when something goes wrong. Cloud tools can feel almost magical, right up until something breaks.
A Spark job might slow down because disk I/O is saturated. A container might keep restarting because it ran out of memory. At the operating system level, these are often straightforward resource problems that you should be able to identify.
That’s why it is important to learn Linux system administration as a data engineer.
Part of the reason I became interested in this is that I run Arch Linux on my main machine. Arch tends to expose more of the operating system than most distributions. Installing software, configuring services, managing packages, and troubleshooting issues all happen directly through the system rather than behind a polished interface. Over time that forces you to learn how the machine actually works.
I also spend a lot of time experimenting with self-hosted infrastructure. Recently, I started building a small analytics stack that runs entirely on my own hardware using open source tools. The idea is to run ingestion, orchestration, storage, and dashboards without relying entirely on managed cloud services. I wrote about that project in a separate post called DataSpec, which walks through an open source data infrastructure stack built with tools like Postgres, Airflow, Superset, and Docker.
After tinkering with this project on my Arch Linux machine, I realized I needed to understand the operating system these tools actually run on. Even managed platforms like Snowflake, BigQuery, and AWS Glue ultimately run on machines with kernels, file systems, networks, and processes underneath them. A serious data engineer should understand those layers.
That curiosity is what led me to read Unix and Linux System Administration Handbook.
The book starts off by describing the boot process of a machine. Firmware initializes the hardware, the bootloader loads the kernel, and the kernel starts the first user-space process, which brings the rest of the system online. On most modern Linux distributions that first process is systemd, which starts services in dependency order and manages them for the life of the system. Knowing this sequence gives you a starting point when a machine, or a service on it, fails to come up.
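You can see this layer directly from a shell. The sketch below guards the systemd calls, since containers and some distributions do not run systemd as PID 1; on those systems only the last line produces output.

```shell
# If systemd is present, ask it what it is running and how boot went.
if command -v systemctl >/dev/null 2>&1; then
    systemctl list-units --type=service --state=running 2>/dev/null || true
    systemd-analyze 2>/dev/null || true   # time spent in each boot stage
fi

# PID 1 is the first user-space process the kernel started
# (often "systemd", but e.g. a shell or init stub inside a container).
cat /proc/1/comm
```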
The book then goes into more detail about processes and how the operating system manages them. Every program running on a system exists as a process with its own memory space, file handles, and execution priority. The kernel is responsible for allocating CPU time and system resources among these processes. For data engineers, this is surprisingly relevant. Many of the workloads we run are resource-intensive. Large queries, batch pipelines, and distributed jobs can push CPU, memory, and disk I/O to their limits. When something slows down or fails unexpectedly, it is often because the underlying system is running out of one of those resources. Knowing how processes behave and how the operating system schedules them helps you interpret system metrics in a more meaningful way.
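When a pipeline slows down, these process-level details are inspectable with standard tools. A minimal sketch, using `ps` and the per-process files under /proc:

```shell
# Top memory consumers: a quick answer to "what is eating RAM?"
ps aux --sort=-%mem | head -n 5

# The kernel exposes per-process state under /proc. For this shell:
grep -E 'VmRSS|Threads' /proc/self/status   # resident memory and thread count
grep 'open files' /proc/self/limits         # file-handle limit for this process
```

The same /proc files exist for any PID, so the Spark executor or Postgres backend that is misbehaving can be inspected the same way.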
The same goes for file systems. Almost every data platform ultimately depends on storage performance. Data warehouses, data lakes, and analytics pipelines all rely on reading and writing files efficiently. Linux organizes storage through a filesystem hierarchy that maps file paths to physical storage. System directories such as /etc, /var, and /usr each serve specific roles in the operating system. Once you understand how file systems are mounted and how permissions work, a lot of system behavior starts to make more sense.
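A few commands surface most of what this paragraph describes: how full each filesystem is, what is mounted where, and who may read or write a given file.

```shell
# Free space per mounted filesystem; a full disk is a common pipeline killer.
df -h

# What is mounted where, and with which options (filesystem type, noatime, etc.).
head -n 5 /proc/mounts

# Permission bits: mode, owner, and group for a system file.
ls -l /etc/passwd
```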
Networking is another area where a little system knowledge goes a long way. Data platforms are distributed systems. Services talk to each other across networks, APIs move data between components, and storage systems replicate information across machines. Underneath all of that are the protocols that power the internet: IP, TCP, UDP, DNS, and routing. Linux administrators rely on tools like ping, traceroute, and tcpdump to diagnose connectivity problems and understand how packets move through the network stack. Even if you mostly work with managed cloud networking, having a mental model of how these pieces fit together helps when debugging issues.
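A rough sketch of that toolbox is below. The first two tools are guarded because minimal installs may not ship them, and the heavier ones are left as comments since they usually require root.

```shell
# Tools may be missing on minimal systems, so guard each one.
if command -v ping >/dev/null; then
    ping -c 2 127.0.0.1        # loopback answers if the local stack is up
fi
if command -v ss >/dev/null; then
    ss -tln | head -n 5        # which TCP ports are listening locally
fi

# Kernel-level view of open TCP sockets, no extra tools required.
head -n 3 /proc/net/tcp

# Heavier inspection (usually needs root):
#   traceroute 8.8.8.8         # hop-by-hop path to a host
#   tcpdump -i any port 5432   # capture traffic on the Postgres port
```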
Security is another area where the Unix model still shows up everywhere. Linux systems control access through users, groups, and permission bits. The root user has unrestricted authority, while tools like sudo allow specific users to run administrative commands.
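The model is concrete enough to demonstrate in a few lines. This sketch uses a hypothetical scratch file, /tmp/demo_file, to show how mode bits map to owner, group, and everyone else.

```shell
# Which user am I, and which groups am I in? Group membership drives access.
id

# Create a file and restrict it: owner read/write, group read, others nothing.
touch /tmp/demo_file
chmod 640 /tmp/demo_file
ls -l /tmp/demo_file    # mode shows as -rw-r-----
rm /tmp/demo_file
```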
The book also connects many of these traditional concepts to modern infrastructure. Containers, for example, rely heavily on kernel features like namespaces and control groups to isolate processes while sharing the host operating system. Technologies like Docker and Kubernetes, which manage and deploy applications in isolated environments, can feel very modern, but they are really built on top of capabilities that already existed in the Linux kernel.
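You can see those kernel features from any Linux shell, container or not. Every process already lives in a set of namespaces and a cgroup; containers just get their own.

```shell
# The namespaces this shell belongs to (mount, network, PID, and so on).
# A containerized process would show different namespace IDs than the host.
ls -l /proc/self/ns

# The control group this process is in; container CPU and memory limits
# are cgroup limits underneath.
cat /proc/self/cgroup
```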
One reason this knowledge matters for data engineers is that not every company runs entirely in the cloud. Some organizations avoid cloud platforms because of cost or concerns over vendor lock-in. In these environments, teams often run their own infrastructure. That means someone needs to install packages, configure services, manage disks, and monitor machines. Tasks like scheduling jobs with cron, rotating logs, configuring storage, and managing user accounts are still very much part of operating a data platform in those settings.
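As one small example, scheduling a job with cron comes down to a single line in a crontab. The script path below is hypothetical; the five leading fields are minute, hour, day of month, month, and day of week.

```shell
# A crontab entry that would run a nightly pipeline at 02:30,
# appending stdout and stderr to a log file.
echo '30 2 * * * /opt/pipelines/run_nightly.sh >> /var/log/nightly.log 2>&1'

# Show the current user's crontab, if cron is installed and one exists.
command -v crontab >/dev/null && crontab -l 2>/dev/null || true
```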
I highly recommend reading this book to round out your technical knowledge as a data professional.
Becoming too reliant on the abstractions your tools provide limits your flexibility when those abstractions inevitably leak.

