You care about the data. Actually, you really care about the insights from that data, or the ML models you train with it. And that forces you to care about the data.
You may even care about how you clean, curate, and transform that raw data into usable data. But the data pipeline itself has at most nuisance value for you: an unavoidable chore.
Considering this pyramid of what you really care about and what you are forced to care about, where do you place data pipeline orchestration tools? How much of your mind share does that choice occupy? My guess: very little, almost none; perhaps you are even totally indifferent.
Yet, it is one of the most critical decisions in building your data and ML infra. Let me lay out the choices so you can make informed decisions.
But first, let’s quickly recap data, ML, and MLOps pipelines:
- Data pipelines get data to the warehouse or lake.
- ML pipelines transform data before training or inference.
- MLOps pipelines automate ML workflows.
The boundaries between these three are fluid and overlapping. In the future, these may converge and become one.
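To make the recap concrete, here is a toy sketch of the core job every orchestrator discussed below performs: running tasks in dependency (DAG) order. It uses only the Python standard library, and the task names are illustrative; real tools add scheduling, retries, backfills, and monitoring on top.

```python
# Toy sketch of what a pipeline orchestrator does at its core:
# execute tasks in the order their dependency DAG dictates.
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (names are made up)
dag = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "report": {"clean"},
}

# static_order() yields a valid execution order for the DAG
order = list(TopologicalSorter(dag).static_order())
print(order)  # 'ingest' comes first; 'clean' before 'train' and 'report'
```

A real orchestrator does the same walk, but runs each task on distributed workers and persists state between runs.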
Let’s examine the choices open source and various cloud ecosystems offer.
Open Source Data Pipeline Orchestration Tools
The Apache ecosystem was, and continues to be, an important part of the data stack. No wonder two of the three tools listed here are Apache projects.
Apache Oozie has been around for quite a while for executing workflow DAGs. It integrates nicely with the Hadoop ecosystem. If your organization is already using it, moving away will be a big endeavor. It still has some life left and can carry you some distance.
But do not set up new projects or new infra with it. For that, look at the next tool below.
At the moment, nobody gets fired for choosing Apache Airflow. It is the default choice and a pretty good one too. No wonder both AWS and Google Cloud offer a managed Airflow.
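One reason for Airflow's popularity is that pipelines are plain Python files. As a flavor of what that looks like, here is a minimal DAG sketch (assuming Airflow 2.4+ with the TaskFlow API; the DAG name, tasks, and schedule are all illustrative, not prescriptive):

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+, TaskFlow API).
# Task names, payloads, and schedule are illustrative only.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_etl():
    @task
    def extract():
        # pretend we pulled rows from some source system
        return {"rows": 42}

    @task
    def transform(payload):
        return {"rows": payload["rows"], "cleaned": True}

    @task
    def load(payload):
        print(f"loading {payload['rows']} cleaned rows")

    # calling tasks like functions wires up the dependency DAG
    load(transform(extract()))


daily_etl()
```

Dropping a file like this into Airflow's DAGs folder is all it takes to register the pipeline with the scheduler.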
You can’t go wrong by choosing Apache Airflow, but take a look and keep an eye on the next tool in the list.
Flyte is that next tool: a newer, Kubernetes-native workflow orchestration platform, open sourced by Lyft and designed with ML and data workflows in mind. It is not as battle-tested as Airflow yet, but if your workloads lean heavily toward ML, it deserves a serious look.
Data Pipeline Orchestration Tools on AWS
Amazon is a customer-first company, and its offerings reflect that. AWS services are easy to start with, but the one you pick first may not remain suitable as your use case expands.
I suggest not blindly picking the easiest AWS tool for you, but pausing and thinking a little bit about your future needs. Data pipelines have an uncanny ability to quickly grow and become very complex. Then the unavoidable migration will carry a good amount of outage risk.
AWS Step Functions
AWS Step Functions is an apt tool for automating business-process workflows, but it can be used for building data pipelines too. If your data pipeline is simple and consists of a few steps, it is probably the easiest to start with. Still, I am very reluctant to advise you to do so: as your pipeline grows complex, you will likely outgrow it and face a risky migration.
AWS Data Pipeline
AWS Data Pipeline is a service to move data from AWS and on-premises data sources to AWS compute services, run transformations, and store the results in a data warehouse or a data lake. You can knit together an AWS Data Pipeline with S3, RDS, DynamoDB, and Redshift as data stores and EC2 and EMR as compute services. It is easy to use, yet powerful and versatile. But read about the next two options before deciding.
AWS Glue Workflow
AWS Glue Workflow is another tool on AWS for ETL workflows. It is unclear to me why Amazon has two tools with largely overlapping use cases. I am biased toward using AWS Data Pipeline by default, reaching for Glue Workflow only if the whole data infra is built on AWS Glue and Amazon Athena.
Amazon Managed Workflows for Apache Airflow (MWAA)
If you want to stick with Apache Airflow, MWAA may suit you the most. It is a secure, highly available, managed workflow orchestration service for Apache Airflow. It is a vendor-independent option, and you don’t need to master a new tool.
Data Pipeline Orchestration Tools on Google Cloud
Google is a technology-first company, and it offers only two clearly differentiable choices. Both are built on open-source technology.
Cloud Data Fusion
Cloud Data Fusion is a fully managed GUI tool to define ETL/ELT data pipelines. It is based on CDAP, an open-source framework for building data analytic applications. If your team is not tech-heavy and building mainly analytics applications, then Cloud Data Fusion will suffice.
Cloud Composer
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. If your workflows are inching towards data science and spanning across hybrid and multi-cloud environments, then Cloud Composer (which is Airflow under the hood) is a better choice.
Data Pipeline Orchestration Tools on Azure
Microsoft is a sales-first company. It is just so good at selling that, despite being a late entrant in cloud services, it has beaten Google’s arguably better cloud stack and given Amazon’s customer obsession and first-mover advantage a run for their money.
Azure Data Factory
Azure Data Factory is a data integration and transformation service to construct code-free ETL and ELT pipelines. It is often used to process data from diverse sources and deliver integrated data to Azure Synapse Analytics. If you are on Azure and doing mainly data analytics, then using it is a no-brainer.
Oozie on HDInsight
Azure HDInsight, Azure’s managed Hadoop service, supports Apache Oozie for workflow scheduling. As with Oozie elsewhere, it makes sense only for existing Hadoop workloads; do not start new projects with it.
DIY Apache Airflow
You can deploy Apache Airflow on Azure too, but there is no managed service.
There are some interesting combinations that simplify data pipelines for common use cases. These alternatives may not cover the most complex and general pipelines, but they might nonetheless be the best fit for yours.
One such combination consists of two parts:
- Light transformation while ingesting data into a data warehouse or lakehouse, using a tool like Airbyte, Fivetran, Apache NiFi, Meltano, or Stitch.
- A SQL pipeline using dbt to transform the data inside the warehouse or lakehouse.
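As a hedged sketch of this two-part pattern, here is a stdlib-only illustration with sqlite3 standing in for the warehouse and a hand-written SQL statement standing in for a dbt model; the table and column names are made up:

```python
# EL + T sketch: light transformation on ingest, then SQL in-warehouse.
# sqlite3 stands in for the warehouse; dbt would manage the SQL step.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")

# "Light transformation while ingesting": e.g. dropping malformed rows.
raw = [(1, 1000), (2, 2500), (3, None)]
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [r for r in raw if r[1] is not None],
)

# "SQL pipeline" step: a dbt model is essentially a SELECT like this one.
con.execute(
    "CREATE TABLE orders_usd AS "
    "SELECT id, amount_cents / 100.0 AS amount_usd FROM raw_orders"
)

total = con.execute("SELECT SUM(amount_usd) FROM orders_usd").fetchone()[0]
print(total)  # 35.0
```

The appeal of the pattern is that the heavy lifting happens as declarative SQL inside the warehouse, where it is cheap to version, test, and rerun.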
This myriad of choices can be confusing, even distressing. Here are some defaults that may help you pick a fairly safe option:
- If avoiding vendor lock-in and future-proofing are your key concerns, go with Airflow (or its managed version on your cloud vendor).
- If analytics is your main application, you can pick a simpler option from your cloud vendor (AWS Data Pipeline, Cloud Data Fusion, Azure Data Factory).
- If you consume data from diverse sources and have in-house SQL expertise, you can evaluate airbyte/fivetran + dbt.
- If you are mainly an ML shop, also check out Flyte (and not just Airflow).