As part of Apache Airflow 2.0, a key area of focus has been the Airflow Scheduler. The Airflow Scheduler reads the data pipelines represented as Directed Acyclic Graphs (DAGs), schedules the tasks they contain, monitors task execution, and then triggers the downstream tasks once their dependencies are met.

Historically, Airflow has had excellent support for task execution, ranging from a single machine, to Celery-based distributed execution on a dedicated set of nodes, to Kubernetes-based distributed execution on a scalable set of nodes. Though Airflow task execution has always been scalable, the Airflow Scheduler itself was (until now) a single point of failure and not horizontally scalable.

We at Astronomer saw this scalability as crucial to Airflow's continued growth, and therefore attacked the issue with three main areas of focus:

1. High Availability: Airflow should be able to continue running data pipelines without a hiccup, even when a node failure takes down a Scheduler. This has been a source of concern for many enterprises running Airflow in production, who have adopted mitigation strategies using "health checks" but are looking for a better alternative.

2. Scalability: Airflow's scheduling functionality should be horizontally scalable, able to handle running hundreds of thousands of tasks without being limited by the computing capability of a single node. We have heard that data teams want to stretch Airflow beyond its strength as an Extract, Transform, Load (ETL) tool for batch processing. One example is automated surge pricing, where the price is recalculated every few minutes, requiring data pipelines to run at that frequency. We have long felt that a horizontally scalable and highly available Scheduler was critical to moving the needle on Airflow's performance, delivering predictable latency to meet such new demands and cementing its place as the industry's leading data orchestration tool.

3. Performance: Measured by task latency, the Scheduler must schedule and start tasks far more quickly and efficiently.

The performance capability of Apache Airflow's Scheduler has been a pain point for advanced users in the open-source community. In fact, "Scheduler Performance" was listed as the most asked-for improvement in Airflow's 2019 Community Survey, which garnered over 300 individual responses. A solution that addresses all three problem areas was originally proposed by the Astronomer team as part of AIP-15 in February 2020. With the release of Airflow 2.0, we're delighted to officially announce Airflow's refactored Highly Available Scheduler, and to formally share our work with the open-source community.
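The scheduling behavior described above — a task becomes runnable only once every one of its upstream tasks has finished — can be illustrated with a toy model. This is a deliberately simplified sketch of the idea, not Airflow's actual Scheduler code or API; the task names and the `run_dag` helper are hypothetical:

```python
# Toy sketch of dependency-driven scheduling (NOT Airflow's real implementation):
# a task is triggered only once all of its upstream tasks have completed.

def run_dag(dependencies):
    """dependencies maps each task to the set of upstream tasks it waits on."""
    done = set()
    order = []
    pending = set(dependencies)
    while pending:
        # "Schedule" every pending task whose upstream dependencies are all met.
        runnable = {t for t in pending if dependencies[t] <= done}
        if not runnable:
            raise ValueError("cycle detected; a DAG must be acyclic")
        for task in sorted(runnable):  # deterministic order for the example
            order.append(task)         # stand-in for actually executing the task
            done.add(task)
        pending -= runnable
    return order

# A minimal hypothetical pipeline: extract -> transform -> load -> notify.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}
print(run_dag(dag))  # ['extract', 'transform', 'load', 'notify']
```

In real Airflow the execution of runnable tasks is handed off to an executor (local, Celery, or Kubernetes), which is why task execution has long scaled horizontally even while the scheduling loop itself ran on a single node.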