SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Python ETL Projects
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
ethereum-etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
-
mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
baby-names-analysis
Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.
-
bitcoin-etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
-
ethereum-etl-airflow
Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Level 1 of MLOps is when you've put each lifecycle stage and their intefaces in an automated pipeline. The pipeline could be a python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.
Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.
Level 1 of MLOps is when you've put each lifecycle stage and their intefaces in an automated pipeline. The pipeline could be a python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas) and it supports reading and writing partitions which was really helpful and a few other optimizations that made it a great tool
Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27Event is a log entity which EVM smart contracts can emit during transaction execution. Events are very good at signalling that an some action has taken place on-chain. Applications can subscribe and listen to events to trigger some off-chain logic or they can index, transform and store events in some off-chain storage (look at The Graph protocol or Ethereum ETL).
Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01Have you tried the Astro SDK? https://github.com/astronomer/astro-sdk
Project mention: Show HN: Open-source Rule-based PDF parser for RAG | news.ycombinator.com | 2024-01-23
Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12The Github Repo: https://github.com/recap-build/recap
Project mention: Show HN: Generate JSON mock data for testing/initial app development | news.ycombinator.com | 2023-10-03A friend of mine built a tool called Trex that you might find helpful, check it out here: https://github.com/automorphic-ai/trex
It's very consistent at generating templated data.
Python ETL related posts
-
Show HN: Open-source Rule-based PDF parser for RAG
-
Prism: the easiest way to create robust data workflows. Accessible via CLI
-
Show HN: Prism – a framework for creating robust data science workflows
-
Show HN: Prism – Data Orchestration in Python
-
Introducing Prism: A Novel, Open-Source Data Orchestration Software. Feedback needed!
-
Prism - a lightweight, yet powerful data orchestration platform in Python. Accessible via CLI
-
Intelligently transform unstructured to structured output (JSON, Regex, CFG)
-
A note from our sponsor - SaaSHub
www.saashub.com | 22 May 2024
Index
What are some of the best open-source ETL projects in Python? This list will help you:
Project | Stars | |
---|---|---|
1 | Airflow | 34,705 |
2 | airbyte | 14,296 |
3 | dagster | 10,382 |
4 | Mage | 7,171 |
5 | AWS Data Wrangler | 3,816 |
6 | ethereum-etl | 2,836 |
7 | mara-pipelines | 2,056 |
8 | pyspark-example-project | 1,370 |
9 | sqlmesh | 1,334 |
10 | pgsync | 1,071 |
11 | NeumAI | 788 |
12 | eland | 617 |
13 | baby-names-analysis | 563 |
14 | redun | 489 |
15 | versatile-data-kit | 411 |
16 | bitcoin-etl | 388 |
17 | ethereum-etl-airflow | 386 |
18 | astro-sdk | 323 |
19 | paperetl | 319 |
20 | recap | 305 |
21 | usaspending-api | 283 |
22 | trex | 238 |
23 | dbt-coves | 210 |
Sponsored