There are many collaborators in the wonderful universe of data.
Amongst the various data practitioners and professionals are data engineers and data scientists.
Data science and data engineering do overlap considerably, but also have their own unique characteristics and specialisms.
There is an influential quote that elegantly captures their relationship:
“Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity.”
David Bianco, UrTheCast
Data engineers and data scientists each bring a diverse array of generalist and specialist skills that are essential to any data project, and the two skill sets overlap considerably.
Science and Engineering
To tackle the comparison between data science and data engineering, it’s sensible to first compare science and engineering.
Broadly speaking, science seeks to further a body of knowledge. Logic is applied to measurements and these develop into theories that are tested, eventually becoming knowledge, insight and understanding. Bianco is right in saying that data science animates the otherwise static entity of data, as it turns it into something that furthers our understanding.
Data science draws upon resources of data in order to conceptualise it in the context of the real world. The data scientist formulates and tests ideas and theories about the data with a view to applying them to a problem, whether in a business or non-commercial setting.
Engineering, on the other hand, applies knowledge to construct physical and digital systems.
These systems perform tasks or are built to meet certain specifications. They have a purpose and a use, and they work with tangible objects, in this case data.
Data engineers work with the architecture and infrastructure that extracts data from its sources, whether that’s inside or behind webpages, IoT sensors, the human body, or the atmosphere and environment. They set the groundwork for a data project as well as building scalable systems that can grow and develop in their own right.
Science and Engineering in Collaboration
Science and engineering are deeply collaborative disciplines. Any scientific project requires an element of engineering and any engineering project requires an element of science. The two are not mutually exclusive in any way.
The same concept applies to data. Data engineers and data scientists work in partnership for the benefit of the entire discipline, its projects and its users.
The Data Science Hierarchy of Needs
The domains of data science and engineering vary based on their remit and focus, but they also vary based on where they are situated in the ‘data science hierarchy of needs’.
Data projects generally have a timeline. They start with an objective, usually described as a problem. The purpose of the data project is to solve that problem with data.
The problem could be commercial (how many customers buy ice cream on a Saturday?) or non-commercial (how long will a hurricane take to make landfall?).
Once the problem is established, data engineering and data science generally take place at different periods on the project’s timeline and hierarchy of needs.
When we look at the data science hierarchy of needs, the tasks at the bottom of the pyramid are prerequisites for the tasks at the top. Michelangelo can’t paint the Sistine Chapel if the foundations were never laid in the first place!
Tier 1: Collect – Data Engineering
Data engineering takes place primarily at the bottom of this pyramid. Instrumentation, logging, sensors, external data and user-generated content are all thoroughly within the remit of the data engineer.
Tier 2: Move/Store – Data Engineering
Infrastructure, pipelines, ETL and data storage are still within the remit of the data engineer, who begins to move raw data and prepare it for use further up the pyramid.
Tier 3: Explore/Transform – Both Data Science and Data Engineering
Cleaning and anomaly detection are usually a collaborative effort, but again, much of this is done during data engineering. This step usually also involves loading data into dashboards for analysis.
Tier 4: Aggregate/Label – Data Science
At tier 4 of the pyramid, the data has been loaded into software for exploration. It is exposed, captured, and can now be assessed. At this stage, it passes into the hands of the data scientist. Initial analyses are made and hypotheses are developed.
Tier 5/6: Learn/Optimise – Data Science and ML Engineering
The creation of models may or may not be handled by a specialist data scientist or ML engineer. Hypotheses are tested and the results are refined. Data could be fed into machine learning (ML) algorithms that perform deep learning, or it could be used to build AI systems that are deployed into products and/or software platforms.
The Tools and Resources of the Data Engineer
The fundamental resources of any data-oriented discipline are data and the databases that hold it.
Data is encoded or embedded within various systems and objects. Just as iron needs to be extracted from iron ore, data needs to be extracted from the resources that contain it.
The types of data sources an engineer might work with are numerous, including everything from manually created spreadsheets to webpages, IoT sensors, CRM software and SQL & NoSQL databases.
Other times, the data engineer will have to find a way to extract new data, creating and manipulating data resources. This might mean auditing software systems to identify what data is already collected while also finding opportunities to collect new data. It might also involve feature engineering, where new data points are created from combinations of existing ones.
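Feature engineering can be sketched in a few lines of Python. The records and field names here (height and weight combined into a body-mass index) are hypothetical stand-ins for whatever columns a real dataset provides.

```python
# Feature engineering sketch: derive a new data point from existing ones.
# The fields height_m and weight_kg are hypothetical example columns.
records = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.65, "weight_kg": 60.0},
]

for r in records:
    # New feature: body-mass index, built from two existing columns.
    r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)
```

The same pattern generalises: any ratio, difference or aggregate of existing columns can become a new feature for analysis or modelling downstream.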
Data engineers will also need to maintain these resources and make sure they are serving their required functions.
Once a data engineer understands the resources that he/she has at their disposal, it’s time to implement various tools to get those resources moving to where they need to be and in an ideal format.
This is where a data pipeline comes into effect.
The data pipeline is a fundamental data tool attributed more or less specifically to data engineers. Its purpose is to move data from its sources, converting or transforming it into something that can be used downstream. ETL is a type of data pipeline that extracts data from its source, transforms it, and then loads it into a destination.
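The extract-transform-load pattern can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the in-memory CSV source, the transformation rule and the SQLite destination are all hypothetical stand-ins for real feeds and warehouses.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV stands in for a real feed).
RAW = "date,temp_c\n2024-01-01,21.5\n2024-01-02,bad\n2024-01-03,19.0\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

# Transform: coerce types and discard rows that cannot be parsed.
def transform(rows):
    clean = []
    for row in rows:
        try:
            clean.append((row["date"], float(row["temp_c"])))
        except ValueError:
            continue  # drop rows with unreadable values
    return clean

# Load: write the cleaned rows to a destination table.
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS readings (date TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Orchestration tools such as Apache Airflow or Luigi wrap the same extract, transform and load steps in scheduling, retries and monitoring.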
Since data extracted from its original source can contain errors, whether introduced by humans, machines or instruments (e.g. sensors), the data will have to be cleaned before it is analysed and implemented.
The data engineer and data scientist both play some role here, but the data engineer’s job is usually at least to transform the data into a readable format. This might involve hot-deck and cold-deck imputation, regression modelling, expectation maximisation, clustering and binning.
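Two of the simpler techniques mentioned above, imputation and binning, can be sketched as follows. The readings and the bin thresholds are hypothetical; mean imputation stands in here for the more elaborate hot-deck or regression-based methods.

```python
from statistics import mean

# Hypothetical sensor readings with one missing value (None).
readings = [12.0, None, 15.0, 9.0]

# Mean imputation: replace missing values with the mean of observed ones.
observed = [x for x in readings if x is not None]
fill = mean(observed)  # (12 + 15 + 9) / 3 = 12.0
imputed = [fill if x is None else x for x in readings]

# Binning: group continuous values into coarse categories (thresholds are illustrative).
def bin_reading(x):
    return "low" if x < 10 else "mid" if x < 14 else "high"

bins = [bin_reading(x) for x in imputed]
```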
In terms of programming knowledge, data engineers require extensive knowledge of databases and how they work, including MongoDB, Oracle, Microsoft SQL Server, PostgreSQL and MySQL. Python is the main programming language of pipelines, with several key tools such as Apache Airflow, Luigi and Bonobo.
The Tools and Resources of The Data Scientist
Visualisation and Exploration Tools
Bianco described data scientists as animating static entities as a painter does with paint. This is carried out via the use of data visualisation tools.
Visualisation tools include Seaborn, Matplotlib and Plotly in Python or ggplot2 in R. These tools enable data scientists to create and investigate data visualisations, which in turn leads to a better understanding of the data.
Data science is more client-facing, as data visualisations will often need to be relayed to other teams or individuals within a business or organisation.
The general idea is to present the important trends and characteristics of the data in order to test a hypothesis.
Features can be studied and analysed together as a whole or in isolation. The process works directly with the problem proposed at the start of the data project.
For example, a weather prediction system will look at how different variables interact to produce a given weather event. These can be analysed to discover how strong or weak each one is as a predictor of the event.
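One simple way to gauge how strong a variable is as a predictor is the Pearson correlation coefficient. The sketch below computes it from scratch on hypothetical observations (pressure drop versus rainfall); a real analysis would use a library such as NumPy or pandas.

```python
from math import sqrt

# Hypothetical daily observations: barometric pressure drop vs. rainfall.
pressure_drop = [1.0, 2.0, 3.0, 4.0, 5.0]
rainfall_mm = [2.1, 3.9, 6.2, 7.8, 10.0]

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(pressure_drop, rainfall_mm)  # close to 1: a strong linear predictor
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 suggest the variable carries little linear predictive signal.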
Once the data has been visualised and explored, machine learning modelling can begin to take place.
For example, a data project aiming to predict an extreme weather event will need to take into account the variables that combine to create that event, measuring their development over time. A predictive model can then forecast the progression of the sequence that precedes the event. This is more clearly related to science in its involvement of query, hypothesis and theory.
Data scientists working with models will also need to understand common modelling problems such as data leakage and overfitting.
The sample data used to build and train models will need to be heavily scrutinised and models will need to be tested and re-tested on different samples before they are exposed to new data in the real world.
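The standard guard against overfitting described above is a holdout split: fit the model on one portion of the data and evaluate it on a portion it has never seen. This sketch uses a hypothetical, exactly linear dataset and a least-squares fit written in plain Python; real projects would typically use scikit-learn or similar.

```python
import random

# Hypothetical labelled samples: (feature, label), generated from y = 2x + 1.
samples = [(x, 2 * x + 1) for x in range(100)]

# Shuffle, then hold out 20% as a test set the model never sees during fitting.
random.seed(42)
random.shuffle(samples)
split = int(len(samples) * 0.8)
train, test = samples[:split], samples[split:]

# Fit a least-squares line on the training set only.
xs = [x for x, _ in train]
ys = [y for _, y in train]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Evaluate on the held-out test set: a large error here, relative to the
# training error, is the classic symptom of overfitting.
mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
```

Cross-validation extends the same idea by rotating which portion of the data is held out, so every sample is tested on exactly once.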
What About Data Analysts?
Data analysts could be considered the third broad type of data practitioner. Data analysts tend to work very closely with clients and their strategies and will have an in-depth knowledge of the business or project itself. It’s a broad remit that covers elements of data science but less so engineering, and requires more mathematical or statistical skills and fewer programming skills.
Data analysis skills are required of both data engineers and data scientists, as there will be numerous opportunities to surface information about the data and make informed decisions about what to tune, explore or optimise.
Data engineers primarily work with data infrastructure and architecture closer to the bottom line, or data source, whilst data scientists primarily work with analysis and modelling closer to a scientific hypothesis.
Data engineers and data scientists work in collaboration. Of course, it may be the case that one individual performs the roles of both the data engineer and data scientist, and this is probably the most common setup on small to medium-sized data projects.
As projects grow, niche specialisms may become increasingly in demand, and in the case of AI and machine learning, a new set of specialisms opens up.