The Data Engineer’s Roadmap

James Phoenix

There is a huge demand for data engineers right now, and it still doesn’t appear as if there are enough prospective data engineers to meet industry needs. As such, becoming a data engineer is an attractive proposition, not just for general employability in today’s world but for earning an excellent salary right from the get-go. 

If we look at two major employment studies below, we can see that data engineering is growing at a staggering rate. We’ve also taken an in-depth look at data engineering salaries here. It’s clear that skyrocketing demand is pushing wages up – there’s never been a better time to get into data engineering!

Source | When? | Year-on-year job growth
Dice Hottest Tech Jobs 2020 | 2019 | 50%
LinkedIn’s Emerging Jobs Report 2020 | 2019 | 33%

This article will outline the roadmap to becoming a fully-qualified, employable data engineer. 

Let’s go!


Data Engineering: More Than Just a Salary

Of course, money and employability aren’t the only reasons to become a data engineer. Data engineering is a cutting-edge contemporary job which pitches individuals into the heart of modern businesses and organisations.

While data engineers are employed by thousands of enterprises and SMBs around the world, they’re also required for all manner of non-profit and scientific jobs. So, whether you’re interested in the Fortune 500 or want to help build future generations of scientific models and applications, data engineering might be the job for you. 

Data engineering and data science (find our comparison here) are attractive careers that promise excellent prospects, earning potential and flexibility in terms of which industry you work in. As a result, it’s hardly surprising that many people are interested in learning what it takes to be a data engineer.


Level 0: Deciding to Become a Data Engineer

Compared with data scientists and analysts, data engineers are more infrastructure- and architecture-focused. The fundamental goal is to engineer data and related data systems in a way that makes data useful for real-world applications.

Data engineers free data from its constraints and make it useful for operational and analytical purposes. Only then can data be harnessed as an asset. This isn’t the same job remit as data science or data analytics, though there is obviously a good deal of crossover. 

Data engineering is collaborative with other areas of data science when it comes to machine learning, where data engineers, data scientists and ML specialists will combine forces to build and prepare data and train and optimise models. 

Some fundamental skills you’ll need to become a data engineer when you’re only just starting out include:

  • Statistics and maths skills 
  • Basic computer science; any knowledge of coding and programming languages is excellent before studying data engineering formally 
  • Understanding of modern computer science concepts, such as machine learning 
  • Problem-solving skills
  • And of course, a great work ethic!

And then there’s the small matter of degrees and courses. Do you need a degree in data engineering? Since many institutions offer widely credited degrees that are cut out for the modern industry, it might be a missed opportunity to skip studying the field formally. 

You don’t need to study a traditional 3-year BSc, though, and there are many excellent distance courses available on Coursera and DataCamp. 

It depends on how you want to orientate your career. While a degree/degree-level qualification is likely required for many data-related jobs at SMBs and enterprises, that’s not to say you can’t freelance your way to success without a degree. 

You can also consider coding bootcamps. Also, bear in mind that degrees in maths, statistics or other ‘hard sciences’ are great foundations for a career in data. 


Level 1: Base Knowledge

Firstly, it’s essential to build a strong knowledge of SQL and to get to grips with Python. SQL is all about interacting with and querying databases, which is where data engineers spend a good deal of their time. It also builds a bridge towards an enhanced knowledge of distributed systems, streaming and NoSQL. 
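To see how SQL and Python fit together, here is a minimal sketch that uses Python's built-in sqlite3 module. The orders table and its rows are made up purely for illustration; a real project would point at an actual database.

```python
import sqlite3

# In-memory database; the table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)

# Aggregate spend per customer, largest total first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

print(totals)
```

The same GROUP BY and ORDER BY ideas carry over to cloud warehouses, although each product has its own SQL dialect.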

Learning server basics is also essential, as is familiarising yourself with Git. Learning about APIs, and REST APIs in particular, is also useful here. At this stage, you should understand how the various core computing technologies communicate with each other.

Understanding the difference between structured and unstructured data is also essential, especially if you wish to go down the machine learning route. Preliminary or basic knowledge of data structures, data types and algorithms is also useful. 
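As a rough illustration of the difference, the sketch below reads a small structured CSV and then pulls the same kind of fields out of a free-text sentence. Both samples and the regular expression are invented for this example.

```python
import csv
import io
import re

# Structured: every record follows the same named columns.
structured = io.StringIO("user_id,country,plan\n1,UK,pro\n2,DE,free\n")
rows = list(csv.DictReader(structured))

# Unstructured: free text with no schema, so fields have to be extracted.
unstructured = "Ticket from user 2 in DE: 'please upgrade me to the pro plan'"
match = re.search(r"user (\d+) in ([A-Z]{2})", unstructured)
extracted = (
    {"user_id": match.group(1), "country": match.group(2)} if match else {}
)

print(rows[0]["plan"])
print(extracted)
```

With structured data the schema does the work for you; with unstructured data you have to decide what to extract and how, which is where much of the effort in machine learning preparation tends to go.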

Some useful resources:

  1. Python With Corey Schafer
  2. Data Structures And Algorithms
  3. FTP, SFTP, and TFTP 

And don’t forget to check out the best books on data engineering!


Level 2: Develop Python Skills

Can you get away with being a data engineer without Python? Not really!

Python is incredibly well-supported for data-centric coding, and there are some excellent libraries available that many data engineers and scientists use every day. Some of the best Python libraries for data engineering include:

While some of these libraries and frameworks are not exactly for beginners, it’s great to familiarise yourself with the breadth and diversity of Python. 

Python is quick, efficient and has a relatively gentle learning curve. It’s also essential for anyone who wants to get into AI and machine learning, and its automation frameworks are unparalleled. 

To learn about the pros and cons of Python, head here.


Level 3: Your First Project

Taking on a data engineering project will teach you how to apply your knowledge.

There are many data engineering projects for beginners out there, but building a REST API with Flask is probably one of the most useful. 
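To give a feel for what such a project involves, here is a minimal sketch of a read-only REST-style endpoint. Note that it uses a plain WSGI function from the standard library world rather than Flask, so it runs without installing anything; Flask's routing decorators wrap the same request and response idea. The DATASETS dictionary, the /datasets route and the call helper are all hypothetical.

```python
import json

# Hypothetical in-memory "dataset" served by the API.
DATASETS = {
    "1": {"id": "1", "name": "sales"},
    "2": {"id": "2", "name": "events"},
}


def app(environ, start_response):
    """Tiny WSGI app: GET /datasets lists all, GET /datasets/<id> returns one."""
    parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
    if environ.get("REQUEST_METHOD") != "GET":
        status, payload = "405 Method Not Allowed", {"error": "method not allowed"}
    elif parts == ["datasets"]:
        status, payload = "200 OK", list(DATASETS.values())
    elif len(parts) == 2 and parts[0] == "datasets" and parts[1] in DATASETS:
        status, payload = "200 OK", DATASETS[parts[1]]
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(
        status,
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call(path, method="GET"):
    """Invoke the app directly, without opening a network socket."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"REQUEST_METHOD": method, "PATH_INFO": path}, start_response))
    return captured["status"], json.loads(body)


print(call("/datasets/1"))
```

To try it in a browser, the app function can be passed to wsgiref.simple_server.make_server from the standard library. Rebuilding the same routes in Flask is a good next step.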

Here are two excellent beginner tutorials for data engineering: 

You can also check out the Python for SEO articles above for lots of practical and hands-on guides to data engineering. 


Level 4: Data Warehousing and Pipelines

By now, you’ll know a bit about ETL pipelines already. It’s worth familiarising yourself with many of the various products that make up the modern data stack, including ETL services, CDPs and cloud warehouses. 
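If ETL still feels abstract, the following sketch shows the three steps in miniature, with an in-memory SQLite database standing in for a cloud warehouse. The CSV content, column names and cleaning rules are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Made-up raw export; the second row has a missing amount.
RAW_CSV = "order_id,amount,currency\n1,10.50,gbp\n2,,gbp\n3,7.25,usd\n"


def extract(text):
    """Extract: parse the raw CSV into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, cast types, upper-case currency codes."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
        for r in rows
        if r["amount"]
    ]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

In an ELT setup the order changes: the raw rows are loaded first and the transformation runs inside the warehouse, usually as SQL.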

Some big names here include: 

Cloud Databases

  • BigQuery
  • AWS Redshift
  • Snowflake


ETL/ELT/Reverse ETL

  • Xplenty
  • Stitch
  • Airbyte
  • Fivetran

Customer Data Platforms and Infrastructure

  • Segment
  • mParticle
  • Rudderstack
  • Snowplow

Level 5: Data Modelling Project

Your second project should look to apply both coding and warehousing skills for a business or organisation-focused project.

  • Implement data scraping to scrape open data suitable for a dimensional model. Check out this post on data modelling.
  • Encrypt your data and store it with SFTP
  • Create your dimensional model 
  • Ingest data from SFTP, and load into a data warehouse 
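For the dimensional model step, a minimal star schema might look like the sketch below: one fact table of sales surrounded by product and date dimensions. SQLite stands in for the warehouse, and every table, column and row is invented for illustration. Encryption and SFTP transfer are left out, as they depend on tools outside the standard library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions describe the "who/what/when" of each measurement.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, year INTEGER);

    -- The fact table stores the measurements plus keys into each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        units INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
    INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024), (20240102, '2024-01-02', 2024);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 3, 30.0), (1, 20240102, 1, 10.0), (2, 20240102, 2, 8.0);
    """
)

# A typical analytical query: join facts to a dimension and aggregate.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
print(revenue_by_category)
```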

Encryption and secure data transit are essential for applying data engineering skills in a real-life situation. Creating your own business-grade project using open data will develop practical skills and give you something to talk about on your resume. To find data for data engineering and data science projects, head here.


Level 6: Develop Testing Skills

Data engineers need to apply testing skills as part of the CI/CD process. You’ll need an in-depth understanding of integration, functional and unit tests. Advertising your test-driven development (TDD) skills is a great way to improve your employability as a data engineer. 

Unit testing: Refers to testing individual modules without interactions or dependencies. The test is administered to a single unit to ensure it’s working properly.

Integration testing: This means testing modules together as a group. 

Functional testing: Tests software to ensure it’s meeting the specifications and requirements of the project.  
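The difference between the first two is easier to see in code. The sketch below uses Python's built-in unittest module: one test checks a single function on its own, and the other checks two functions working together. The normalise_email and dedupe_emails helpers are hypothetical examples rather than part of any library.

```python
import unittest


def normalise_email(raw):
    """Single unit: trim whitespace and lower-case an email address."""
    return raw.strip().lower()


def dedupe_emails(raw_emails):
    """Combines normalise_email with de-duplication, keeping first-seen order."""
    seen = []
    for raw in raw_emails:
        email = normalise_email(raw)
        if email not in seen:
            seen.append(email)
    return seen


class NormaliseEmailTest(unittest.TestCase):
    def test_unit_strips_and_lowercases(self):
        # Unit test: one function, no other moving parts.
        self.assertEqual(normalise_email("  Ada@Example.COM "), "ada@example.com")


class DedupeEmailsTest(unittest.TestCase):
    def test_integration_normalises_before_deduplicating(self):
        # Integration-style test: both functions must cooperate correctly.
        self.assertEqual(dedupe_emails(["a@x.com", " A@X.com"]), ["a@x.com"])
```

Saved in a file, these tests can be run with python -m unittest, which is also the sort of command a CI/CD pipeline would call on every commit.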


Level 7: Advanced Cloud and NoSQL

Gaining some advanced cloud skills is now essential. If you’re interested in AI and machine learning, it’s time to familiarise yourself with the machine learning services and technologies available in AWS and Google Cloud. 

This Google Cloud Coursera course is an excellent choice. 

It’s also worth learning when NoSQL is most effective, and understanding how it doesn’t support fixed schemas, normalisation and expressive queries in the same way as SQL.
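The trade-off can be sketched without installing a database. Below, a plain Python dictionary stands in for a document collection; this is an illustration of the idea only, not the API of any real NoSQL product, and the put helper and documents are made up.

```python
import json

# A plain dict stands in for a document collection; no real NoSQL client is used.
collection = {}


def put(doc_id, document):
    """Store a JSON round-tripped copy, roughly as a document store would persist it."""
    collection[doc_id] = json.loads(json.dumps(document))


# Documents in one collection need not share fields (no fixed schema), and
# related data is embedded rather than normalised into separate tables.
put("u1", {"name": "Ada", "orders": [{"sku": "A1", "qty": 2}]})
put("u2", {"name": "Lin", "newsletter": True})

# Lookups by key are simple, but ad hoc questions mean scanning documents
# and coping with fields that may be missing.
buyers = [doc["name"] for doc in collection.values() if doc.get("orders")]
print(buyers)
```

That flexibility suits fast-changing or nested data, while questions that join and aggregate across many entities are usually easier to express in SQL.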


Level 8: Big Data, Streaming and Distributed Systems

Streaming and distributed systems are much simpler to work with than they used to be. 

Learning about distributed event streaming and Big Data frameworks is essential for applying data engineering to enterprise-level projects. Learn about using customer data for eCommerce and other commercial purposes. 

Spark, Kafka and Hadoop are three essential components here, and learning about each and how they interact with each other is crucial. Learn about common issues when streaming real-time data from multiple sources. Using Apache Airflow to schedule data pipelines and workflows is also essential here.
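Schedulers such as Airflow describe a pipeline as a graph of tasks and their dependencies. The sketch below is not Airflow code; it uses the standard library's graphlib module (Python 3.9+) to show the underlying idea of working out a valid run order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_datasets": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_datasets"},
    "refresh_dashboard": {"load_warehouse"},
}

# static_order() yields the tasks so that every dependency comes before
# the task that needs it.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

If two tasks depended on each other, graphlib would raise a CycleError, which is why these pipelines have to be acyclic.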


Level 9: Learn Data Visualisation and UI/UX

While data visualisation, dashboarding and UI/UX aren’t strictly required for most data engineering jobs, it’s worth gaining as much adjacent knowledge in data analysis and data science as you can.

Not all businesses or organisations will be data-mature, so it’s ideal to spread your knowledge across a wide remit of data science, at least to some extent. However, this should never prevent you from specialising, and at some point in your career, you’ll probably want to funnel your efforts into something specific, whether that be Big Data or machine learning. 


Level 10: Advanced Machine Learning

There’s little doubt that machine learning is at the frontier of data-centric disciplines today. If you want to get into ML, you should already know the differences between supervised, unsupervised and reinforcement learning.

Ensure you understand some of the most widely-used algorithms:

  • Decision tree
  • Dimensionality reduction algorithms
  • Gradient boosting algorithm and AdaBoosting algorithm
  • K-means
  • KNN algorithm
  • Linear regression
  • Logistic regression
  • Naive Bayes algorithm
  • Random forest algorithm
  • SVM algorithm
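Writing one of these from scratch is a good way to understand it. As an example, here is a minimal sketch of simple linear regression using ordinary least squares in plain Python. The fit_line name and the data points are made up; in practice you would normally reach for a library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Invented points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```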

Try and create at least one supervised and one unsupervised machine learning project for your resume. Understand the role of training and testing data and data labelling and annotation. Look into AI ethics, ground truth and the challenges we need to overcome to build the next generations of AIs. 
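To make the role of training and testing data concrete, the sketch below holds back part of a labelled dataset, trains nothing more than a 1-nearest-neighbour lookup on the rest, and measures accuracy on the held-out rows. The dataset, the function names and the 25% test fraction are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def predict_1nn(train, x):
    """Label x with the label of the closest training point (1-nearest neighbour)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


# Invented labelled data: values below 50 are "low", the rest are "high".
data = [(v, "low" if v < 50 else "high") for v in range(0, 100, 5)]
train, test = train_test_split(data)

# Accuracy is measured only on rows the model has never seen.
accuracy = sum(predict_1nn(train, x) == label for x, label in test) / len(test)
print(len(train), len(test), accuracy)
```

Scoring a model on the same rows it was trained on would flatter it, which is why the held-out set matters.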


Level 10+: Develop With The Field

Data engineering is a highly contemporary and fast-developing field. It’s crucial to understand what’s going on in the industry, not only to foster your own knowledge and interest in data, but also to give you something to talk about with prospective clients, customers and employers. 

It’s also essential to engage with the contemporary debates surrounding AI and ML, especially with regard to ethics and compliance. 

Learn about the future of the modern data stack and equip yourself with knowledge of where the field is going in the near future.


Summary: Data Engineer Roadmap

This is by no means an exhaustive list of data engineering skills, and you couldn’t realistically learn all of it in less than 2 to 5 years! 

In reality, you’ll probably be tuning and flexing your skills as you respond to demand. Once you’ve found a job that suits you, it’s worth delving into the skills you’ll need for it as much as you can. 

However, don’t let that hold back your personal development – there are so many opportunities on the horizon for data engineers who wish to immerse themselves in cutting-edge skills and knowledge. 

