The Data Engineer’s Roadmap

James Phoenix

There is a huge demand for data engineers right now, and it still doesn’t appear as if there are enough prospective data engineers to meet industry needs. As such, becoming a data engineer is an attractive proposition, not just for general employability in today’s world but for earning an excellent salary right from the get-go.

If we look at two major employment studies below, we can see that data engineering is growing at a staggering rate. We’ve also taken an in-depth look at data engineering salaries here. It’s clear that skyrocketing demand is pushing wages up – there’s never been a better time to get into data engineering!

Source	Year	Year-on-year job growth
Dice Hottest Tech Jobs 2020	2019	50%
LinkedIn’s Emerging Jobs Report 2020	2019	33%

This article will outline the roadmap to becoming a fully-qualified, employable data engineer.

Let’s go!

Data Engineering: More Than Just a Salary

Of course, money and employability aren’t the only reasons to become a data engineer. Data engineering is a cutting-edge contemporary job which pitches individuals into the heart of modern businesses and organisations.

While data engineers are employed by thousands of enterprises and SMBs around the world, they’re also required for all manner of non-profit and scientific jobs. So, whether you’re interested in the Fortune 500 or want to help build future generations of scientific models and applications, data engineering might be the job for you.

Data engineering and data science (find our comparison here) are attractive careers that promise excellent prospects, earning potential and flexibility in the choice of industry. As a result, it’s hardly surprising that many people want to learn what it takes to be a data engineer.

Level 0: Deciding to Become a Data Engineer

Compared with data scientists and data analysts, data engineers are more focused on infrastructure and architecture. The fundamental goal is to engineer data and related data systems in a way that makes data useful for real-world applications.

Data engineers free data from its constraints and make it useful for operational and analytical purposes. Only then can data be harnessed as an asset. This isn’t the same job remit as data science or data analytics, though there is a good deal of crossover.

The crossover is greatest in machine learning, where data engineers, data scientists and ML specialists combine forces to build and prepare data and to train and optimise models.

Some fundamental skills you’ll need to become a data engineer when you’re only just starting out include:

Statistics and maths skills
Basic computer science; any knowledge of coding and programming languages is excellent before studying data engineering formally
Understanding of modern computer science concepts, such as machine learning
Problem-solving skills
And of course, a great work ethic!

And then there’s the small matter of degrees and courses. Do you need a degree in data engineering? Since many institutions offer widely credited degrees that are cut out for the modern industry, it might be a missed opportunity to skip studying the field formally.

You don’t need to study a traditional 3-year BSc, though, and there are many excellent distance courses available on Coursera and DataCamp.

It depends on how you want to orientate your career. While a degree/degree-level qualification is likely required for many data-related jobs at SMBs and enterprises, that’s not to say you can’t freelance your way to success without a degree.

You can also consider coding bootcamps. Also, bear in mind that degrees in maths, statistics or other ‘hard sciences’ are great foundations for a career in data.

Level 1: Base Knowledge

Firstly, it’s essential to build a strong knowledge of SQL, and to get to grips with Python. SQL is all about interacting with and querying databases, which is where data engineers will spend a good deal of time. SQL also builds a bridge towards an enhanced knowledge of distributed systems, streaming and NoSQL.
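To make the SQL-plus-Python pairing concrete, here is a minimal sketch that runs a SQL aggregation from Python using the built-in sqlite3 module. The orders table, its columns and the sample rows are invented purely for illustration.

```python
import sqlite3


def total_spend_per_customer(conn):
    """Return (customer, total) rows, largest total first."""
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


# In-memory database; the table and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 12.5)],
)

totals = total_spend_per_customer(conn)
print(totals)
```

The same GROUP BY pattern carries over to most SQL databases, although the connection code will differ from one to the next.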

Learning server basics is also essential, as is familiarising yourself with Git. Learning about APIs, and REST APIs in particular, is also useful here. At this stage, you should understand how the various core computing technologies communicate with each other.

Understanding the difference between structured and unstructured data is also essential, especially if you wish to go down the machine learning route. Preliminary or basic knowledge of data structures, data types and algorithms is also useful.

Some useful resources:

And don’t forget to check out the best books on data engineering!

Level 2: Develop Python Skills

Can you get away with being a data engineer without Python? Not really!

Python is incredibly well-supported for data-centric coding, and there are some excellent libraries available that many data engineers and scientists use every day. Some of the best Python libraries for data engineering include:

While some of these libraries and frameworks are not exactly for beginners, it’s great to familiarise yourself with the breadth and diversity of Python.

Python is quick, efficient and has a relatively gentle learning curve. It’s also essential for anyone who wants to get into AI and machine learning, and its automation frameworks are unparalleled.

To learn about the pros and cons of Python, head here.

Level 3: Your First Project

Taking on a data engineering project will teach you how to apply your knowledge.

There are many data engineering projects for beginners out there, but building a REST API with Flask is probably one of the most useful.
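As a rough idea of what such a project involves, the sketch below builds a tiny read-only JSON API. Note that it does not use Flask itself: to stay dependency-free it is written against WSGI, the standard Python web interface that Flask applications also implement. The /datasets routes and the sample record are hypothetical.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory "database" of datasets.
DATASETS = {"1": {"id": "1", "name": "sales_2020"}}


def app(environ, start_response):
    """Tiny JSON API: GET /datasets and GET /datasets/<id>."""
    path = environ.get("PATH_INFO", "").strip("/").split("/")
    if environ.get("REQUEST_METHOD") != "GET":
        status, body = "405 Method Not Allowed", {"error": "GET only"}
    elif path == ["datasets"]:
        status, body = "200 OK", list(DATASETS.values())
    elif len(path) == 2 and path[0] == "datasets" and path[1] in DATASETS:
        status, body = "200 OK", DATASETS[path[1]]
    else:
        status, body = "404 Not Found", {"error": "not found"}
    payload = json.dumps(body).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ]
    start_response(status, headers)
    return [payload]


def call(path, method="GET"):
    """Invoke the app directly, without starting a server."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    environ["REQUEST_METHOD"] = method
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


print(call("/datasets/1"))
```

To serve it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() should be enough for experimenting. In a Flask version, the routing branches would become decorated view functions.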

Here are two excellent beginner tutorials for data engineering:

You can also check out the Python for SEO articles above for lots of practical and hands-on guides to data engineering.

Level 4: Data Warehousing and Pipelines

By now, you’ll know a bit about ETL pipelines already. It’s worth familiarising yourself with many of the various products that make up the modern data stack, including ETL services, CDPs and cloud warehouses.
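As a refresher, an ETL pipeline can be sketched in a few lines of Python. This is a toy version under stated assumptions: the CSV text, the column names and the cleaning rules are made up, and an in-memory SQLite database stands in for a cloud warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,Alice ,20.00
2,bob,not_a_number
3,ALICE,12.50
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, cast types, drop rows with a bad amount."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        customer = row["customer"].strip().lower()
        clean.append((int(row["order_id"]), customer, amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Managed ETL services handle the same three stages for you, along with scheduling, retries and schema changes.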

Some big names here include:

Cloud Databases

BigQuery
AWS Redshift
Snowflake

ETL/ELT/Reverse ETL

Xplenty
Stitch
Airbyte
Fivetran

Customer Data Platforms and Infrastructure

Segment
mParticle
Rudderstack
Snowplow

Level 5: Data Modelling Project

Your second project should look to apply both coding and warehousing skills for a business or organisation-focused project.

Implement data scraping to scrape open data suitable for a dimensional model. Check out this post on data modelling.
Encrypt your data and store it with SFTP
Create your dimensional model
Ingest data from SFTP, and load into a data warehouse
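The dimensional-model step might look something like the star schema below: one fact table of measurements surrounded by dimension tables that describe them. The weather-station tables, columns and readings are hypothetical, and SQLite stands in for a real data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20200101
        year INTEGER,
        month INTEGER
    );
    CREATE TABLE dim_station (
        station_key INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT
    );
    CREATE TABLE fact_reading (
        date_key INTEGER REFERENCES dim_date(date_key),
        station_key INTEGER REFERENCES dim_station(station_key),
        temperature_c REAL
    );
""")

# Hypothetical sample rows for each table.
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20200101, 2020, 1), (20200201, 2020, 2)],
)
conn.executemany(
    "INSERT INTO dim_station VALUES (?, ?, ?)",
    [(1, "Kew", "London"), (2, "Ringway", "Manchester")],
)
conn.executemany(
    "INSERT INTO fact_reading VALUES (?, ?, ?)",
    [(20200101, 1, 6.0), (20200201, 1, 8.0), (20200101, 2, 4.0)],
)

# Analytical query: join the fact table to a dimension and aggregate.
avg_by_city = conn.execute("""
    SELECT s.city, AVG(f.temperature_c)
    FROM fact_reading AS f
    JOIN dim_station AS s ON s.station_key = f.station_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(avg_by_city)
```

Keeping descriptive attributes in the dimensions and numeric measures in the fact table is what keeps analytical queries like this one short.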

Encryption and secure data transit are essential for applying data engineering skills in a real-life situation. Creating your own business-grade project using open data will develop practical skills and give you something to talk about on your resume. To find data for data engineering and data science projects, head here.

Level 6: Develop Testing Skills

Data engineers need to apply testing skills as part of the CI/CD process. You’ll need an in-depth understanding of integration, functional and unit tests. Advertising your test-driven development (TDD) skills is a great way to improve your employability as a data engineer.

Unit testing: Refers to testing individual modules without interactions or dependencies. The test is administered to a single unit to ensure it’s working properly.

Integration testing: This means testing modules together as a group.

Functional testing: Tests software to ensure it’s meeting the specifications and requirements of the project.
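The difference between unit and integration tests is easier to see in code. Here is a minimal sketch using Python's built-in unittest module; the clean_amount and total functions are invented examples rather than part of any real pipeline. A functional test would go one step further and check a whole pipeline run against its written requirements.

```python
import unittest


def clean_amount(raw):
    """Parse a currency string such as ' $1,200.50 ' into a float."""
    return float(raw.strip().lstrip("$").replace(",", ""))


def total(raw_amounts):
    """Sum raw currency strings; relies on clean_amount for parsing."""
    return sum(clean_amount(raw) for raw in raw_amounts)


class UnitTests(unittest.TestCase):
    def test_clean_amount(self):
        # Unit test: one function, checked in isolation.
        self.assertEqual(clean_amount(" $1,200.50 "), 1200.5)


class IntegrationTests(unittest.TestCase):
    def test_total_uses_cleaning(self):
        # Integration test: total() and clean_amount() working together.
        self.assertEqual(total(["$10", " 5.5", "1,000"]), 1015.5)


def run_tests():
    """Run both test cases and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(UnitTests))
    suite.addTests(loader.loadTestsFromTestCase(IntegrationTests))
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

In a CI/CD setup, a command such as python -m unittest would typically run on every commit and block the deployment if a test fails.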

Level 7: Advanced Cloud and NoSQL

Gaining some advanced cloud skills is now essential. If you’re interested in AI and machine learning, it’s time to familiarise yourself with the machine learning services and technologies available in AWS and Google Cloud.

This Google Cloud Coursera course is an excellent choice.

It’s also worth learning when NoSQL is most effective, and understanding that it doesn’t support fixed schemas, normalisation and expressive queries in the same way as SQL.
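The schema difference can be illustrated with a small contrast. In this sketch, SQLite plays the relational side, while plain Python dictionaries stand in for a document store; no real NoSQL product is used, and the users data is hypothetical.

```python
import sqlite3

# Relational side: the schema is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
try:
    # 'twitter' is not a column, so the database rejects the write.
    conn.execute(
        "INSERT INTO users (id, name, twitter) VALUES (2, 'Bob', '@bob')"
    )
    sql_accepted = True
except sqlite3.OperationalError:
    sql_accepted = False

# Document side: each record carries its own shape.
documents = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob", "twitter": "@bob", "tags": ["beta"]},
]

# Queries must tolerate missing fields instead of relying on a schema.
with_twitter = [doc["name"] for doc in documents if "twitter" in doc]

print(sql_accepted, with_twitter)
```

The flexibility helps when records vary a lot or change shape often. The price is that the checks a schema would have enforced move into application code.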

Level 8: Big Data, Streaming and Distributed Systems

Streaming and distributed systems are simpler to work with than ever before.

Learning about distributed event streaming and Big Data frameworks is essential for applying data engineering to enterprise-level projects. Learn about using customer data for eCommerce and other commercial purposes.

Spark, Kafka and Hadoop are three essential components here, and learning about each and how they interact with each other is crucial. Learn about common issues when streaming real-time data from multiple sources. Using Apache Airflow to schedule data pipelines and workflows is also essential here.
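Airflow describes a pipeline as a directed acyclic graph (DAG) of tasks, where each task runs only after the tasks it depends on. The sketch below shows that idea only. It does not use Airflow's API; it relies on the standard library's graphlib module (Python 3.9+), and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task once, only after all of its upstream tasks."""
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        tasks[name](log)
    return log


# Each task here just records that it ran; real tasks would move data.
tasks = {
    "extract": lambda log: log.append("extract"),
    "transform": lambda log: log.append("transform"),
    "load": lambda log: log.append("load"),
    "report": lambda log: log.append("report"),
}

# Maps each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

run_log = run_pipeline(tasks, dependencies)
print(run_log)
```

A real scheduler adds what this sketch leaves out: time-based triggers, retries, parallel execution of independent tasks and monitoring.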

Level 9: Learn Data Visualisation and UI/UX

While data visualisation, dashboarding and UI/UX aren’t strictly required for most data engineering jobs, it’s worth gaining as much adjacent knowledge in data analysis and data science as you can.

Not all businesses or organisations will be data-mature, so it’s ideal to spread your knowledge across a wide remit of data science, at least to some extent. However, this should never prevent you from specialising, and at some point in your career, you’ll probably want to funnel your efforts into something specific, whether that be Big Data or machine learning.

Level 10: Advanced Machine Learning

There’s little doubt that machine learning is at the frontier of data-centric disciplines today. If you want to get into ML, you should already know the differences between supervised, unsupervised and reinforcement learning.

Ensure you understand some of the most widely-used algorithms:

Decision tree
Dimensionality reduction algorithms
Gradient boosting algorithm and AdaBoosting algorithm
K-means
KNN algorithm
Linear regression
Logistic regression
Naive Bayes algorithm
Random forest algorithm
SVM algorithm
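To give a flavour of one algorithm from the list, here is a minimal sketch of simple linear regression fitted by ordinary least squares in plain Python. The data points are toy values generated from y = 2x + 1, and the split into training and test sets is only there to show the supervised-learning workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy labelled data generated from y = 2x + 1 (hypothetical values).
train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]
test_x, test_y = [4, 5], [9, 11]

# Fit on the training data only, then evaluate on unseen test data.
slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
test_error = max(abs(p - y) for p, y in zip(predictions, test_y))

print(slope, intercept, test_error)
```

In practice you would reach for a library rather than hand-rolling the maths, but writing one algorithm from scratch is a good way to understand what the library is doing.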

Try to create at least one supervised and one unsupervised machine learning project for your resume. Understand the role of training and testing data, as well as data labelling and annotation. Look into AI ethics, ground truth and the challenges we need to overcome to build the next generations of AIs.

Level 10+: Develop With The Field

Data engineering is a highly contemporary and fast-developing field. It’s crucial to understand what’s going on in the industry, not only to foster your own knowledge and interest in data, but also to give you something to talk about with prospective clients, customers and employers.

It’s also essential to engage with the contemporary debates surrounding AI and ML, especially with regard to ethics and compliance.

Learn about the future of the modern data stack and equip yourself with knowledge of where the field is going in the near future.

Summary: Data Engineer Roadmap

This is by no means an exhaustive list of data engineering skills, and you couldn’t realistically learn all of it in less than 2 to 5 years!

In reality, you’ll probably be tuning and flexing your skills as you respond to demand. Once you’ve found a job that suits you, it’s worth delving into the skills you’ll need for it as much as you can.

However, don’t let that stall your personal development – there are so many opportunities on the horizon for data engineers who wish to immerse themselves in cutting-edge skills and knowledge.