Top 12 Python apache-airflow Projects
-
couler
Unified Interface for Constructing and Managing Workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
-
astronomer-cosmos
Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
-
ethereum-etl-airflow
Airflow DAGs for exporting, loading, and parsing Ethereum blockchain data. See also: "How to get any Ethereum smart contract into BigQuery" (https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee)
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
-
airflowctl
A CLI tool to streamline getting started with Apache Airflow™ and managing multiple Airflow projects
-
covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
-
twitter_data-lakehouse_minio_drill_superset
Building a Data Lakehouse for Analyzing Elon Musk Tweets using MinIO, Apache Airflow, Apache Drill and Apache Superset
Level 1 of MLOps is when you've put each lifecycle stage and its interfaces into an automated pipeline. The pipeline could be a Python or bash script, or it could be a directed acyclic graph run by an orchestration framework like Airflow, Dagster, or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML, and DVC also feature pipeline capabilities.
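To make the "DAG run by some orchestrator" idea concrete, here is a minimal sketch of Level-1-style automation using only the Python standard library. The stage names (`ingest`, `validate`, etc.) are hypothetical placeholders, and real frameworks add persistence, scheduling, and observability on top of this core idea:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical lifecycle stages wired into a DAG; each value lists
# the stages that must complete before that stage can run.
PIPELINE = {
    "ingest": [],
    "validate": ["ingest"],
    "train": ["validate"],
    "evaluate": ["train"],
    "deploy": ["evaluate"],
}

def run_pipeline(dag, tasks):
    """Execute task callables in a dependency-respecting order."""
    order = list(TopologicalSorter(dag).static_order())
    for stage in order:
        tasks[stage]()
    return order

# Stub tasks standing in for real stage implementations.
log = []
tasks = {stage: (lambda s=stage: log.append(s)) for stage in PIPELINE}
print(run_pipeline(PIPELINE, tasks))
# → ['ingest', 'validate', 'train', 'evaluate', 'deploy']
```

An orchestrator like Airflow is, at its core, this same topological execution plus scheduling, retries, state tracking, and a UI.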
The author seems to be describing the kind of patterns you might build with https://argoproj.github.io/argo-workflows/. Or see, for example, https://github.com/couler-proj/couler, an SDK for describing tasks that can be submitted to different workflow engines on the backend.
It's a little confusing to me that the author seems to object to "pipelines" and then equate them with message queues. For me, at least, "pipeline," "workflow engine," and "scheduler" are all basically synonyms in this context. Those things may or may not be implemented with a message queue for persistence, but the persistence layer itself is usually below the level of abstraction that $current_problem is really concerned with. As the author says, eventually you have to track state/timestamps/logs, but you get that from the beginning if you start with a workflow engine.
I agree with the author that message queues should not be a knee-jerk response to most problems, because the level of effort for edge cases, observability, and monitoring is huge. (Maybe reach for a queue only if you may actually overwhelm whatever the "scheduler" can handle.) But don't build the scheduler from scratch either: use Argo Workflows, Kubeflow, or a more opinionated framework like Airflow, MLflow, Databricks, AWS Lambda, or Step Functions. All of these have configuration or APIs robust enough to express rate-limit/retry behavior, and almost any of them offers better observability out of the box than you can easily get from a queue. Most importantly, they provide idioms for handling failure that data-science folks and junior devs can work with: the right way to structure code is much clearer, and things like structuring messages/events, subclassing workers, and repeating/retrying tasks are much harder to mess up.
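The retry idiom mentioned above is something Airflow exposes declaratively (e.g. the `retries` and `retry_delay` task parameters). Below is a rough stdlib sketch of the same policy as a decorator, to show what the frameworks are handling for you; the `flaky_task` and counter are hypothetical, and real orchestrators add jitter, persistence, and logging around this:

```python
import functools
import time

def retry(times=3, backoff=0.01):
    """Retry a task with exponential backoff between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = backoff
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # out of attempts: surface the failure
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky_task():
    # Fails twice, then succeeds, simulating a transient error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky_task())  # succeeds on the third attempt
```

Getting this right from scratch, plus observability into which attempt failed and why, is exactly the edge-case effort the comment is warning about.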
Project mention: A look at airflowctl, a tool to help developers manage Apache Airflow projects | dev.to | 2023-08-14
NOTE! I found a small issue: when you run in background mode, it creates a file (.airflowctl/.background_process_ids) that contains the parent PID. The PID was always off, so I needed to edit it manually. I have created an issue here, so if this happens to you, follow that.
Project mention: First End-to-End Data Engineering Project: Formula 2 Data Pipeline for Automated Updates of a Kaggle Dataset | /r/dataengineering | 2023-07-06
GitHub Repository: here
Python apache-airflow related posts
-
Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions
-
Navigating Week Two: Insights and Experiences from My Tublian Internship Journey
-
Best ETL Tools And Why To Choose
-
Simplifying Data Transformation in Redshift: An Approach with DBT and Airflow
-
Share Your favorite python related software!
-
"You came to protest to get access to the source code of the voting machines. What is the source code?" "I don't know" 🤡
-
Unable to login into airflow webserver account
-
A note from our sponsor - SaaSHub
www.saashub.com | 3 Jun 2024
Index
What are some of the best open-source apache-airflow projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Airflow | 34,877 |
2 | elyra | 1,785 |
3 | airflow-maintenance-dags | 1,613 |
4 | couler | 891 |
5 | astronomer-cosmos | 476 |
6 | ethereum-etl-airflow | 387 |
7 | astro-sdk | 326 |
8 | airflow-chart | 267 |
9 | airflowctl | 171 |
10 | covid-19-data-engineering-pipeline | 22 |
11 | F2-Data-Pipeline | 8 |
12 | twitter_data-lakehouse_minio_drill_superset | 3 |