Apache Airflow and Azure HDInsight Clusters: Scenarios for Hybrid Environments
AWS and Azure are today’s leading cloud providers, with a plethora of offerings for data, analytics, and ML workloads. In the recent past, cloud-native start-ups had the liberty to choose their cloud environments based on their product’s path forward, and in some cases it even made sense to build one’s own cloud platform with PCF or similar. Small, medium, and large enterprises, however — especially the ones I have worked with in the last 5 years — have found a hybrid path more manageable and cost-effective: cloud compute as the preferred choice, alongside existing historians on-prem. For their data and insight applications, both the on-prem data and a pipeline ingesting a variety of data are of high critical value to enterprises on the path to digital transformation.
This article concerns itself with the big data solution architecture of analytics workloads. Azure has a prebuilt service known as Azure Data Factory (ADF), Microsoft’s offering for ETL/ELT pipelines, typically paired with HDInsight (HDI) for compute. Effortless, easy to get started with, and pay-as-you-go, ADF abstracts the underlying infrastructure and resources and relieves the developer of the burden of infrastructure and security management. The ADF GUI provides configurable options for seamless integration with various data sources, such as on-prem stores, S3, and Google BigQuery. In some cases it is also possible to write custom code (we will address this separately).
Introduced in 2014 by Airbnb, Airflow is a stand-alone open-source ETL/ELT engine that has evolved into a platform ecosystem for the big data and analytics community on AWS and PCF. If you are on AWS, you can set up Airflow yourself, with the additional work of managing security infrastructure and the necessary hooks. If you are on Azure, ADF very much takes care of all the necessary security infrastructure and bootstrapping code. Agreed, ADF provides less customization compared to Airflow, which is more developer- and custom-code-friendly, but the hassle of running a separate team for infrastructure and CI/CD integration is the baggage that comes with Airflow. In such circumstances, using ADF together with Airflow is another option in hybrid cloud scenarios. It is also possible to build a custom Airflow operator to orchestrate ADF, but again, that is resource-intensive.
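To make the custom-operator idea concrete, here is a minimal sketch of what such an operator could look like. This is not a real Azure SDK integration: `AdfClient` and `create_pipeline_run` stand in for whatever Azure Data Factory client you would actually wire up (e.g. via the ADF REST API), and the pipeline, factory, and resource-group names are examples. The client is injected through the constructor so it can be stubbed out without Azure credentials.

```python
# Sketch of a custom Airflow operator that triggers an ADF pipeline run.
# NOTE: the ADF client here is a hypothetical stand-in, not a real Azure SDK
# call; in practice you would call the Data Factory REST API or an SDK.

try:
    from airflow.models import BaseOperator  # real Airflow base class
except ImportError:  # allow the sketch to run where Airflow is not installed
    class BaseOperator:
        def __init__(self, task_id=None, **kwargs):
            self.task_id = task_id


class AdfRunPipelineOperator(BaseOperator):
    """Trigger an ADF pipeline run and return its run id."""

    def __init__(self, pipeline_name, resource_group, factory_name,
                 adf_client, **kwargs):
        super().__init__(**kwargs)
        self.pipeline_name = pipeline_name
        self.resource_group = resource_group
        self.factory_name = factory_name
        self.adf_client = adf_client  # injected so it can be stubbed in tests

    def execute(self, context):
        # A real implementation would authenticate and call ADF here,
        # then poll the run status until completion.
        run = self.adf_client.create_pipeline_run(
            self.resource_group, self.factory_name, self.pipeline_name)
        return run["run_id"]
```

The dependency-injected client is what makes the operator testable: in a unit test you pass a fake client that returns a canned run id, while in a deployed DAG you pass a client built from your Azure service principal.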
Two approaches prominent in the recent past are Airflow on EMR (AWS) and ADF on HDI clusters (Azure, with ADF provisioning the pipeline design). This trend paved the way for a mature choice of engineering pipelines for ELT/ETL, including machine learning model injection. However, orchestration on ADF depends on the version of ADF; since its inception back in ~2014 as big data compute on Azure, HDInsight has evolved into a stable platform. In version 2, the current version of ADF, it is possible to write more custom components. ADF supports two types of activities: 1) data movement activities between supported source and sink data stores, and 2) data transformation activities using Azure compute services such as Azure HDI, Azure Batch, and Azure ML.
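The two activity types combine naturally in one pipeline: copy the data in, then transform it on an HDI cluster. The sketch below, written as a Python dict, loosely follows the shape of an ADF v2 pipeline definition; the activity, dataset, and script names are illustrative, and a real pipeline would carry more properties (linked services, policies, parameters).

```python
# Illustrative sketch of an ADF v2 pipeline combining both activity types.
# All names are examples; a real definition carries more configuration.
pipeline = {
    "name": "DailyIngestAndTransform",
    "properties": {
        "activities": [
            {   # 1) data movement: copy from a source store to a sink store
                "name": "CopyFromBlobToDataLake",
                "type": "Copy",
                "inputs": [{"referenceName": "RawBlobDataset"}],
                "outputs": [{"referenceName": "DataLakeDataset"}],
            },
            {   # 2) data transformation: run a Hive script on an HDI cluster
                "name": "HiveTransform",
                "type": "HDInsightHive",
                "dependsOn": [{"activity": "CopyFromBlobToDataLake",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {"scriptPath": "scripts/transform.hql"},
            },
        ]
    },
}

activity_types = [a["type"] for a in pipeline["properties"]["activities"]]
print(activity_types)  # ['Copy', 'HDInsightHive']
```

Note the `dependsOn` entry: ADF sequences the transformation after a successful copy, which is the same succeed-then-run semantics an Airflow DAG edge expresses.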
With Azure Data Factory, in order to move or transform data residing in a source that ADF does not support, one has to write a custom activity with one’s own movement logic and use it in the pipeline design. This custom activity can be configured to run on Azure Batch on a pool of existing virtual machines. Remember that it was not possible to use a data management gateway from a custom activity to access on-premises data sources; the data management gateway supports only the copy and stored procedure activities in Data Factory. This is where an orchestration of Airflow with HDI clusters makes sense…
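As a rough sketch of what such a custom activity looks like from the inside: when ADF dispatches a custom activity to Azure Batch, the task’s working directory receives an `activity.json` describing the activity, and your program reads its properties and performs the movement logic itself. The exact property layout below (`typeProperties.extendedProperties`) and the `move_data` body are illustrative assumptions, not a verbatim contract.

```python
# Sketch of an entry point for an ADF custom activity running on Azure Batch.
# The script reads the activity.json that ADF drops into the task's working
# directory, then runs one's own movement logic. Property names are
# illustrative assumptions.
import json
import os


def load_extended_properties(working_dir="."):
    """Read the extended properties ADF passes to the custom activity."""
    with open(os.path.join(working_dir, "activity.json")) as f:
        activity = json.load(f)
    return activity["typeProperties"]["extendedProperties"]


def move_data(source_uri, sink_uri):
    """Placeholder for custom movement logic (REST pull, SFTP fetch, etc.)."""
    return f"moved {source_uri} -> {sink_uri}"


if __name__ == "__main__" and os.path.exists("activity.json"):
    props = load_extended_properties()
    print(move_data(props["sourceUri"], props["sinkUri"]))
```

Since this code runs on Batch VMs rather than inside ADF itself, it has no gateway to on-prem sources, which is exactly the limitation the article points at and why an Airflow-driven orchestration can fill the gap.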
We will delve deeper into Airflow orchestration on HDI clusters in the next article… stay tuned!
Thanks for reading; your suggestions are always welcome!
By Kiran Balijepalli