Data science is a burgeoning field with remarkably modern origins.
The realms of data science and data engineering feature many components that combine to form modern data infrastructure.
One incredibly important component is the data pipeline.
As the name suggests, a data pipeline’s primary function is to move data from one place to another.
The Plumbing of Data Pipelines
Data pipelines start at the point of ingestion, where data is collected or extracted. A pipeline can also serve purely as a transformational step, which is important when the data needs to be cleaned.
You’ll often hire a data engineering consulting firm or a data engineer to help build your data pipelines.
The ingestion points for a data pipeline could be:
- Manually input data in spreadsheets.
- Information collected by sensors in the real world (e.g. weather sensors).
- SaaS platforms.
- Relational databases (RDBMS).
- Social media or web analytics tools.
- Enterprise resource planning (ERP) systems.
- Customer relationship management systems (CRMs).
- Data collected via web scraping.
Many pipelines ingest data from various sources that may include any combination of the above.
How Data Is Ingested
Data ingestion methods include API calls, webhooks and web scraping, which often pull in batches of data at set intervals. The initial destinations for this raw data might be SQL tables, data warehouses, distributed file systems such as HDFS, or blob storage such as Amazon S3 or Google Cloud Storage.
The data could already be structured to some extent, e.g. tables of customer data with set fields; unstructured, e.g. free-text search criteria; or a mixture of both, e.g. data collected via web scraping.
The type of data and its format will greatly influence the architecture of the pipeline, particularly regarding how the data is parsed and cleaned.
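As a minimal sketch of that idea, the parser below dispatches on a format tag that would, in practice, come from the source's metadata. The function name and the two supported formats are illustrative, not drawn from any particular tool:

```python
import csv
import io
import json

def parse_payload(raw: str, fmt: str) -> list:
    """Parse a raw ingested payload into a list of record dicts.

    The format tag ('json' or 'csv' here) determines the parsing
    strategy, mirroring how a pipeline's architecture depends on
    the shape of the incoming data.
    """
    if fmt == "json":
        data = json.loads(raw)
        # Normalise a single object into a one-element batch.
        return data if isinstance(data, list) else [data]
    if fmt == "csv":
        # DictReader uses the header row as field names.
        return list(csv.DictReader(io.StringIO(raw)))
    raise ValueError(f"unsupported format: {fmt}")
```

A real pipeline would add more formats (Parquet, Avro, raw HTML from scraping) behind the same dispatch point.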
What Happens In A Data Pipeline?
Not every data pipeline performs processing. A pipeline’s purpose might simply be to transfer raw data from one database and load it into another.
Most of the time, though, the purpose of a data pipeline is also to perform some sort of processing or transformation on the data to enhance it.
Raw data ingested from a source is rarely clean, pure, or particularly usable; it needs to be altered in some way to be usable at its next destination. It is critical to clean data soon after ingestion, because once poor or dirty data has been loaded into databases for analysis, or used to train models, reverse engineering the damage is extremely time-consuming.
ETL (extract, transform, load) is a type of data pipeline architecture in which data is transformed in transit to its destination. This ensures that the data downstream is clean and usable: ready to be analysed, fed into models, deployed into products, etc.
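The three ETL stages can be sketched as three plain functions chained together. Everything here is illustrative: a list stands in for the source system and destination store, and the email-normalising transform is just one example of a cleaning step:

```python
def extract(source_rows):
    """Extract: pull raw records from the source (a list stands in
    for a database query or an API response here)."""
    return list(source_rows)

def transform(rows):
    """Transform in transit: drop incomplete rows and normalise fields
    so the data arriving downstream is clean and usable."""
    cleaned = []
    for row in rows:
        if row.get("email"):
            cleaned.append({**row, "email": row["email"].strip().lower()})
    return cleaned

def load(rows, destination):
    """Load: write the transformed rows into the destination store."""
    destination.extend(rows)
    return len(rows)

# Wiring the three stages together forms the pipeline.
warehouse = []
raw = [{"email": "  Ann@Example.COM "}, {"email": None}]
loaded = load(transform(extract(raw)), warehouse)
```

In production, each stage would typically be a separate task managed by an orchestrator, but the extract → transform → load shape is the same.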
Depending on the data’s format and content at the point of ingestion, it may need to be parsed and/or cleaned. Data wrangling and parsing are wide-ranging tasks that vary with the job at hand and how dirty the data is.
For example, very dirty data collected by hand, e.g. a large manually-input spreadsheet, may be riddled with issues such as missing values or incorrect formatting.
Such data needs to be cleaned and transformed before it is useful.
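A cleaning step for that kind of manually-entered spreadsheet row might look like the sketch below. The field names and rules (trim whitespace, normalise case, reject missing or malformed values) are assumptions chosen for illustration:

```python
def clean_spreadsheet_row(row: dict):
    """Clean one manually-entered row: trim whitespace, normalise
    casing, and reject rows with missing or badly formatted values."""
    name = (row.get("name") or "").strip()
    amount_raw = (row.get("amount") or "").replace(",", "").strip()
    if not name or not amount_raw:
        return None  # missing value: drop, or route to a review queue
    try:
        amount = float(amount_raw)
    except ValueError:
        return None  # incorrect formatting, e.g. "ten pounds"
    return {"name": name.title(), "amount": amount}
```

Rejected rows are returned as None here; a real pipeline would usually log them or divert them for manual review rather than discard them silently.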
ETL pipelines can also aggregate different types of data and combine it into a single stream or batch.
For example, an ETL pipeline can ingest credit card payment data alongside times, dates, usage patterns and locations. This disparate data is then transformed together, creating one schema that matches its destination, in this case, an anti-fraud protection mechanism.
Batch Processing vs Stream Processing
In a data pipeline, data might be processed via either batch processing or stream processing.
Batch processing, the staple option, is most useful when moving and transforming large amounts of data. It works at regular intervals, literally moving and transforming data in batches.
This works well when data is collected over a given period, e.g. a month, and analysed for that time period to answer questions such as “how many products sold this month?” or “who purchased what this month and why?”
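A batch job answering “how many products sold this month?” could be as simple as the sketch below, which aggregates a period’s worth of collected orders in one pass. The order fields are illustrative:

```python
from collections import Counter

def monthly_sales_report(orders, month: str) -> Counter:
    """Batch step: given a month as 'YYYY-MM', count units sold per
    product across every order collected in that period."""
    report = Counter()
    for order in orders:
        if order["date"].startswith(month):
            report[order["product"]] += order["quantity"]
    return report
```

A scheduler would run this once per period over the accumulated batch, rather than per record as it arrives.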
Stream processing works in near-real-time. Ingested data is transformed and moved to its destination as it arrives, at a rate dictated by some sort of incremental schedule that can range from hours down to microseconds (micro-batching at one end, true streaming at the other).
This is suitable when data needs to be analysed and monitored immediately in real-time, for example in the case of predicting an earthquake or tsunami from seismic activity received across a wide network of sensors.
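In Python terms, a stream-processing step is naturally modelled as a generator that handles each record the moment it arrives. The seismic threshold and field names below are invented for illustration:

```python
def detect_quakes(readings, threshold=5.0):
    """Stream step: process each sensor reading as it arrives and
    emit an alert immediately when magnitude crosses the threshold,
    rather than waiting for a batch window to close."""
    for reading in readings:
        if reading["magnitude"] >= threshold:
            yield {"sensor": reading["sensor"], "alert": "seismic event"}
```

Because the generator is lazy, `readings` could just as well be an unbounded feed from a message queue as a finite list.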
A Data Pipeline’s Destination
A pipeline’s destination may simply be another database, data warehouse or data lake. Data warehouses and databases store primarily structured data as transformed via an ETL pipeline.
Data lakes are instead more oriented towards raw or unprocessed data. Data lakes or databases may pair with an ELT pipeline (as opposed to ETL). An ELT pipeline does not transform data prior to loading, but instead leaves the transformation to the target platform.
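The ELT pattern can be demonstrated end-to-end with SQLite standing in for the target platform: raw strings are loaded untouched, and the transformation (filtering and type-casting) runs afterwards inside the destination's own SQL engine. Table and column names are made up for the example:

```python
import sqlite3

# ELT: load the raw rows first, untransformed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("ann", "10.5"), ("bob", "not-a-number"), ("ann", "4.5")],
)

# Transform inside the target platform, using its SQL engine:
# keep only rows containing digits, cast, and aggregate.
conn.execute(
    """CREATE TABLE spend AS
       SELECT user, SUM(CAST(amount AS REAL)) AS total
       FROM raw_events
       WHERE amount GLOB '*[0-9]*'
       GROUP BY user"""
)
total = conn.execute(
    "SELECT total FROM spend WHERE user = 'ann'"
).fetchone()[0]
```

The same division of labour is what cloud warehouses such as BigQuery or Snowflake are typically used for at scale: cheap bulk loading first, transformation as SQL in the warehouse afterwards.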
Pipelines may also feed directly into models and machine learning algorithms. Really, the potential destinations of data pipelines are nearly limitless, ranging from simple phone apps to remarkably powerful scientific modelling systems.
Data Pipeline Use Cases
Data Migration
Data pipelines are used to perform data migration tasks. These might involve moving data from on-premise databases, whether relational (e.g. Oracle, Microsoft SQL Server, PostgreSQL, MySQL) or non-relational (e.g. MongoDB), into the cloud. Cloud databases are scalable and flexible, and make it easier to build further data pipelines that use real-time streaming.
Data Warehousing and Analysis
Probably the most common destination for a data pipeline is a dashboard or suite of analytical tools. Raw data that is structured via ETL can be loaded into databases for analysis and visualisation. Data scientists can then create graphs, tables and other visualisations from the data. This data can then be used to inform strategies and guide the purpose of future data projects.
AI and Machine Learning Algorithms
ETL and ELT pipelines can move data into machine learning and AI models. Machine learning algorithms can learn from the data, performing advanced parsing and wrangling techniques. These ML models can then be deployed into various software. Machine learning algorithms fed by data pipelines can be used in marketing, finance, science, telecoms, etc.
IoT Systems
Data pipelines are frequently used in IoT systems that rely on networks of sensors for data collection. Data ingested from various sources across a network can be transformed into data that is ready for analysis. For example, an ETL pipeline may perform numerous calculations on huge quantities of delivery tracking information, vehicle locations, delay expectations, etc, to form a rough ETA.
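A heavily simplified version of such an ETA calculation might combine remaining distance, average speed and a known delay expectation. The formula and parameter names are invented for illustration, not a production routing model:

```python
def estimate_eta_hours(distance_km: float, avg_speed_kmh: float,
                       expected_delay_hours: float = 0.0) -> float:
    """Rough ETA: remaining distance at the vehicle's average speed,
    plus any delay expectation derived from tracking data."""
    return round(distance_km / avg_speed_kmh + expected_delay_hours, 2)
```

In a real IoT pipeline this calculation would run continuously as fresh sensor readings stream in, updating the ETA per vehicle.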
Common Challenges In Data Pipelines
Simply writing and scheduling your data pipelines is not sufficient. There could be underlying data quality issues; for example, your pipeline might silently be missing whole days of data without anyone noticing.
You must investigate the quality of your data, double-check that the data is flowing correctly, and use data pipeline orchestration tools such as Prefect, Luigi or Apache Airflow to track successful runs versus unsuccessful runs.
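The "missing days of data" check mentioned above can be sketched as a small stdlib-only quality gate; in practice an orchestrator would run something like this after each load and alert on a non-empty result:

```python
from datetime import date, timedelta

def find_missing_days(seen_dates, start: date, end: date):
    """Data-quality check: return every day in [start, end] for which
    no data was ingested, so gaps are surfaced instead of silently
    accumulating."""
    missing = []
    day = start
    while day <= end:
        if day not in seen_dates:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

`seen_dates` would typically be the result of a `SELECT DISTINCT date ...` against the destination table.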
Failing Data Pipelines Through Changes
Whenever you update any tables or data lakes that your data pipelines run against, there is a possibility that previously created pipelines will fail. If you must change table references, remember to update your existing data pipelines so they don’t break in production.
Changes to your data warehouse, NoSQL store or SQL database can also cause your data pipelines to stop working. Before updating your database’s schema or design, make sure your data pipelines will still run.
3rd Party Dependencies
If you’re using third-party API vendors, those APIs could stop working when you run out of credits or balance, causing your data pipeline to fail. Similarly, if you depend on many libraries, updating to their latest versions can break your code through library conflicts.
Make sure to have automatic billing set up, with alerts, whenever you depend on third-party API vendors. Additionally, include fallbacks or failsafes in your code so that errors are handled gracefully and you are made aware of the situation.
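One minimal shape for such a failsafe is a wrapper that logs the failure (so someone is made aware) and falls back to a secondary source, such as a cache, instead of crashing the pipeline. The function names and the stand-in vendor call are illustrative:

```python
import logging

def fetch_with_fallback(primary, fallback,
                        logger=logging.getLogger("pipeline")):
    """Call a third-party fetch function; on any failure, log a
    warning (which alerting can hook into) and use the fallback
    instead of letting the whole pipeline fail."""
    try:
        return primary()
    except Exception as exc:
        logger.warning("primary source failed, using fallback: %s", exc)
        return fallback()

def out_of_credits():
    """Stands in for a vendor API call that fails when the account
    balance is exhausted."""
    raise RuntimeError("402: balance exhausted")

cached = fetch_with_fallback(out_of_credits, lambda: {"source": "cache"})
```

A stale cached result plus an alert is usually far better than a silent multi-day gap in the data.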
In terms of the library conflicts, you can have a test database or test data warehouse where you can test out single pipelines in a staged environment before pushing the latest version into production.
The term data pipeline is essentially a generic and wide-ranging term or buzzword that refers to a number of processes relating to data transit and movement. Data pipelines can be very simple, working with small quantities of simple data, or absolutely colossal, working with data covering millions of customers.
Data pipelines frequently use ETL processing methods, either via batch or streaming. ETL involves transformation, the process of modifying, filtering, duplicating, parsing, converting, structuring or cleaning the data. ETL can involve virtually any type of data transformation you can think of, ranging from a simple currency conversion calculation to complex parsing of written text.
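Those transformation types are often composed as a chain of small steps applied to every record. The sketch below, with invented field names and an example currency-conversion and redaction step, shows that pattern:

```python
def convert_currency(row, rate):
    """Simple transform: derive a USD amount from a GBP amount using
    a supplied exchange rate (the rate would come from a reference
    source upstream)."""
    return {**row, "amount_usd": round(row["amount_gbp"] * rate, 2)}

def redact(row):
    """Another transform: drop a sensitive field before loading."""
    return {k: v for k, v in row.items() if k != "card_number"}

def run_transforms(rows, steps):
    """Apply each transformation step, in order, to every row."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows
```

Keeping each transformation as its own small function makes steps easy to reorder, test and reuse across pipelines.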
Data Pipeline FAQ
What does a Data Pipeline Do?
A data pipeline moves data between different locations. These locations can be anything from databases to sensors. Data pipelines can be very simple, e.g. uploading some files to a cloud drive (e.g. Google Drive), or very complex, e.g. aggregating and transforming information from vast networks of IoT sensors and SaaS platforms.
What’s the Difference Between ETL and a Data Pipeline?
Data pipelines may or may not be ETL (extract, transform, load) pipelines. Data pipeline is a generic term referring to any process that involves the movement of data from one location to another.
An ETL pipeline is a subset of a data pipeline and involves any number of processes that extract and transform data before loading it into a new location. Transformation may involve parsing, wrangling, filtering, aggregating and other processing methods that take place in or along the pipeline.
Are Data Pipelines Programmed in Python?
Most of the time, yes. There are hundreds of data pipeline and ETL tools available in Python to cover every aspect of data pipelines and ETL. Some notable examples are Apache Airflow, Luigi and Bonobo.