Top 10 Data Engineering Books

James Phoenix

Data engineers are the unsung heroes of the data world. 

Data scientists and analysts soak up the glitz and glamour whilst the engineers are down in the belly of the beast, welding pipes and ensuring that data pumped through the system is clean and high-quality. 

Here’s a quote that captures the utilitarian remit of the data engineer:

“Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity” – David Bianco from UrTheCast

It’s easy to get sidetracked in analytics and data science, but quality data engineering is totally indispensable.

Data engineers, never fear that your time is up – for it has only just begun! In fact, the Dice 2020 Tech Job Report found that demand for data engineers had climbed by some 50%, faster than any other job in the tech industry. 

With demand for data practitioners seemingly forever rising, there’s no better time to get involved with a career in data, and if you’re interested in the data engineering side of things then you’re in luck. 

To get the ball rolling, here are 10 of the best data engineering books. Analytics Vidhya’s Founder and CEO Mr Kunal Jain says he reads one book every week. Successful people read, and they read a lot: not just online, but from the humble book too. 

Check out this post on 15 of the best data science and analysis books for even more books on data!


Designing Data-Intensive Applications – Martin Kleppmann

With a colossal number of reviews on Amazon for a seemingly niche book, Designing Data-Intensive Applications provides a foundational overview of data engineering in a modern Big Data context. Many data tools, methods and processes are covered, detailing everything from collecting and storing data to cleaning and transforming data for use in a number of modern tools and platforms.

The book covers key topics such as data storage and warehousing, structures, distributed systems, batch and stream processing, encoding, replication, partitioning and much more. The goal is to break down terminology and buzzwords and provide an in-context view of data tools in action. It’s probably the bestselling book on data engineering and data science today.

Data Genres:

  • Data engineering
  • Data warehousing 
  • Big Data
  • Cleaning and transforming data 

Suitable For:

  • Anyone interested in the engineering or otherwise practical side of data. 

Spark: The Definitive Guide: Big Data Processing Made Simple – Bill Chambers, Matei Zaharia

Apache Spark is a Big Data and machine learning analytics engine. Spark provides APIs in Java, Scala, Python and R, and this book teaches a great deal about how to utilise Spark in a number of business or organisational contexts.

Though fairly niche in its focus, this is an excellent book to read for anyone who is interested in a data engineering or data science job that involves working with Spark. It covers everything from clustering to debugging, as well as Spark’s superb stream processing engine. It’s a high-level book for those who use Spark or intend to in the future. Published in 2018, it promises to remain relevant for a fair while yet. 

Data Genres:

  • Data engineering
  • Big Data 
  • Machine learning
  • Apache Spark 

Suitable For:

  • Anyone already using Apache Spark, or intending to use it in the future. 

Snowflake Cookbook – Hamid Mahmood Qureshi and Hammad Sharif

Snowflake, the all-in-one cloud-based data warehousing platform, is extremely popular. It has raised a colossal amount of money since its formation, mainly because demand for it has been so strong. It’s become a go-to for SMBs and enterprises looking to centralise and scale their data strategies. 

This book delves into Snowflake and is a superb introduction for anyone wishing to work with the tool. It takes readers through Snowflake’s advanced scalable virtual warehousing functions, processing SQL queries and statements and leveraging its internal and external integrations to perform near-limitless analysis functions. The book provides a solid foundation for working in Snowflake in a professional capacity. 

Data Genres:

  • Data engineering
  • Snowflake
  • Cloud data warehousing
  • Data pipelines

Suitable For:

  • Anyone looking for an intro to Snowflake, or anyone wishing to consolidate their knowledge of the data platform

A great book for anyone applying for a role that involves Snowflake, or wishing to add experience in Snowflake to their resume. 

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling – Ralph Kimball, Margy Ross

Dimensional data models were developed by Ralph Kimball and this book has become a staple read on the subject. The key components of dimensional modelling are facts, dimensions, and attributes; this modelling technique for data warehousing has become deeply ingrained into the fabric of modern data modelling culture. 

This book is found on the reading list of many data-related courses and curriculums. It covers a multitude of topics related to data modelling, warehousing, transforming, cleaning and ETL techniques. There are 12 case studies of differing complexity. Tons of real-world examples are given, covering everything from inventory management and procurement to accounting, CRM and big data analysis. 

Data Genres:

  • Big data engineering
  • Data modelling
  • ETL pipelines 
  • Business intelligence

Suitable For:

  • Anyone looking for a contextualised guide to data modelling and data warehousing 
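
For readers new to the facts, dimensions and attributes that The Data Warehouse Toolkit revolves around, here is a minimal sketch of a star schema using Python’s built-in sqlite3 module. It is not taken from the book; the table names, column names and sample rows are all hypothetical and only meant to illustrate the idea.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    iso_date TEXT,
    month    TEXT
);
-- The fact table holds measurements plus keys pointing at the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES
    (20210601, '2021-06-01', '2021-06'), (20210702, '2021-07-02', '2021-07');
INSERT INTO fact_sales VALUES
    (1, 20210601, 2, 20.0), (2, 20210601, 1, 15.0),
    (3, 20210702, 4, 40.0), (1, 20210702, 1, 10.0);
""")


def revenue_by_category(connection):
    """Join the fact table to a dimension and aggregate a measure."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    print(revenue_by_category(conn))
```

The point of the layout is that questions like “revenue by category” or “quantity by month” become a join from the fact table to one dimension followed by a GROUP BY.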


Data Engineering With Python – Paul Crickard

Python is the apex programming language for data, and of that there is little doubt! This book fills a great niche, then, as it teaches data engineering with Python, which is what a lot of people want or need to learn. Published in 2020, it’s thoroughly up to date and covers everything from ETL using Python to scheduling, automating and monitoring complex data pipelines. It provides guidelines on building data architecture, using real-world case studies to guide the reader. 

This is a real data engineering-oriented book that focuses on building solid foundations for data science and analysis. There’s tons of useful how-to information in there that promises to be of pretty much immediate use to anyone working in data engineering right now, or in the future. No previous data engineering knowledge is required, but there’s some pretty high-level stuff in there too. 
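
To give a flavour of what ETL in Python can look like, here is a minimal sketch using only the standard library. It is not taken from the book; the sample CSV, the `orders` table and the function names are hypothetical, and a real pipeline would read from files or APIs and add scheduling and monitoring on top.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1, Alice ,10.50
2,bob,
3,Carol,7.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert types and skip rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record, skipped in this sketch
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order and return what ended up in the table."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

Keeping extract, transform and load as separate functions makes each step easy to test on its own, which matters once a pipeline is scheduled and running unattended.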

Data Genres:

  • Data engineering 
  • ETL pipelines 
  • Python for data 
  • Cleaning and enriching data 

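Suitable For:

  • Anyone working in data engineering now or planning to in the future; no previous data engineering knowledge is required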

Data Pipelines Pocket Reference: Moving and Processing Data for Analytics – James Densmore

A super small book on data pipelines, this pocket reference is packed full of information on how to build and deploy successful data pipelines in real-world contexts. It covers some useful conundrums that data engineers have to solve, e.g. batch versus streaming data ingestion and build versus buy. It focuses on modern data tools and platforms, e.g. using data pipelines to extract, transform and load data into cloud-based platforms. 

This compact, well-designed book is full of excellent diagrams and contextualised examples. It covers key areas like ensuring data quality, testing pipelines prior to deployment and some other overlooked areas. A must-have for any established or budding data engineer. 

Data Genres:

  • Data pipelines
  • Data warehousing
  • Cleaning and transforming data
  • ETL and ELT 

Suitable For:

  • Anyone and everyone who works with data pipelines 

97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts – Tobias Macey 

Edited by well-respected data engineer Tobias Macey, this book was published in June 2021, which was very recent at the time this article was written.

It’s full of sub-topics, each of which acts as a sort of mini-lecture or seminar on data engineering:

  • The Importance of Data Lineage – Julien Le Dem 
  • Data Security for Data Engineers – Katharine Jarmul
  • The Two Types of Data Engineering and Data Engineers – Jesse Anderson 
  • Six Dimensions for Picking an Analytical Data Warehouse – Gleb Mezhanskiy 
  • The End of ETL as We Know It – Paul Singman 
  • Building a Career as a Data Engineer – Vijay Kiran 
  • Modern Metadata for the Modern Data Stack – Prukalpa Sankar 
  • Your Data Tests Failed! Now What? – Sam Bail

Think of it as a book of data wisdom and anecdotes. It’s very current and up-to-date, so ought really to be full to the brim of immediately useful and applicable concepts, ideas, strategies and processes. A very interesting looking book that has a wide professional remit. 

Data Genres:

  • Looks like it covers just about everything!

Suitable For:

  • Anyone looking for pearls of wisdom from some of the world’s top data practitioners

Python Data Cleaning Cookbook – Michael Walker

One of the first things that gets hammered home to you when you work in data is rubbish in = rubbish out. Cleaning and enriching data is so incredibly important and can really make or break a whole project, or worse, a career. That’s partly why Michael Walker wrote this book – to explore the myriad of data cleaning techniques available with Python.

Some of the specific data cleaning techniques it covers are removing duplicate data, handling missing values, monitoring particularly high volumes of data, validating errors, handling outliers and dealing with invalid dates. It also explores how to uncover unexpected values and classification errors using visualisation and exploratory data analysis. This is the perfect book for anyone that works with large volumes of messy or unclean data. 
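
As a rough illustration of three of those techniques (removing duplicates, handling missing values and dealing with invalid dates), here is a minimal sketch that uses only Python’s standard library. It is not taken from the book; the sample records, field names and the choice of filling missing ages with the median are assumptions made for the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical messy records: (id, signup_date, age)
RECORDS = [
    (1, "2021-03-01", 34),
    (1, "2021-03-01", 34),    # exact duplicate
    (2, "2021-02-30", 29),    # invalid date (30 February)
    (3, "2021-04-15", None),  # missing age
    (4, "2021-05-20", 31),
]


def parse_date(text):
    """Return a date, or None when the string is not a real calendar date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def clean(records):
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(records))

    # 2. Fill missing ages with the median of the known ages (an assumption;
    #    dropping the row or flagging it are equally valid choices).
    known_ages = [age for _, _, age in unique if age is not None]
    fallback = median(known_ages)

    cleaned = []
    for rec_id, date_text, age in unique:
        cleaned.append({
            "id": rec_id,
            "signup_date": parse_date(date_text),  # 3. Invalid dates become None.
            "age": age if age is not None else fallback,
        })
    return cleaned


if __name__ == "__main__":
    for row in clean(RECORDS):
        print(row)
```

Marking bad dates as None rather than silently dropping the row keeps the problem visible, so it can be counted and investigated later.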

Data Genres:

  • Data cleaning
  • Data wrangling 
  • Data engineering with Python 
  • Exploratory data analysis (EDA)

Suitable For:

  • Those looking to learn about data cleaning with Python 

Disrupting Data Governance: A Call to Action – Laura Madsen

Here’s something slightly different and possibly tangential to data engineering itself, but still very relevant to the field. Data governance, the act and method of managing data flow in and out of a business or organisation with regards to both key stakeholders and regulations, is a fast-changing topic. 

Organisations are now moving towards universal data literacy, that is, training vast numbers of employees in how to work with and manage data independently of monolithic IT departments. GDPR and other regulations have added a new layer of complexity to data governance. This book is an exploration of this complex topic, aimed probably at C-Level executives. It aims to educate on how to govern data in the modern era. 

Data Genres:

  • Data governance
  • Data democratisation 
  • Management of data 
  • Cybersecurity and privacy 

Suitable For:

  • C-Level data executives and other high-level managers

Data-Driven Science and Engineering – Steven L. Brunton and Nathan Kutz

With a focus on scientific computing and how data has revolutionised our technological approach to understanding everything from turbulence and the brain to the climate, environment, epidemiology, finance and robotics, this book is an all-encompassing guide to the cutting edge of data. 

The book covers many areas within both data science and engineering, such as data mining, dimensionality reduction, applied optimisation, machine learning and artificial intelligence, and is probably aimed at higher-level researchers and professionals. 

There are ample technical diagrams in this rather large book. It’s an excellent fusion of theory and practice that isn’t afraid to pay attention to the fringes of data science and data engineering. 

Data Genres:

  • Machine learning 
  • Applied optimisation 
  • Scientific computing
  • Data science and engineering 

Suitable For:

  • Anyone with a keen academic or intellectual interest in cutting-edge data and scientific computing 

Summary: Top 10 Data Engineering Books

There are some excellent data engineering books here, some of which inevitably cross over into the spheres of data science and analysis, amongst other disciplines.

Whilst we take the internet for granted, books will forever remain an invaluable source of information. 

The tactile nature of a book cannot be traded for another medium, but combining the topic of data with the format of a book really works – don’t knock it until you try it – grab some data engineering books and get stuck in!

FAQ


What is the best data engineering book?

It’s a tough call, an impossible call! Probably the bestselling data engineering book is Designing Data-Intensive Applications by Martin Kleppmann. It’s very well-written, thoroughly contextualised and contains many worked examples and case studies that break down both simpler and more complex data topics. 

How can I learn about data engineering? 

From coding bootcamps to Udemy courses and learning via YouTube tutorials, there are many ways to get stuck into data engineering. The internet is home to near-limitless resources on all things to do with data – data is one of the fastest-growing disciplines and mediums in the world today. Get on Google with ‘learn data engineering’ and you won’t go far wrong!

What is data engineering?

Simply put, data engineering focuses on data architecture and infrastructure rather than on synthesising high-level information from the data itself. It involves data pipelines, warehouses and databases. It lays the groundwork for data science and analysis, which is where data as a medium becomes more animated and perceptible. 

