Python ETL

Open-source Python projects categorized as ETL

Top 23 Python ETL Projects

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Project mention: AI Strategy Guide: How to Scale AI Across Your Business | dev.to | 2024-05-11

    Level 1 of MLOps is when you've put each lifecycle stage and their interfaces in an automated pipeline. The pipeline could be a python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.
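
To make the "directed acyclic graph" idea in the mention above concrete, here is a minimal sketch that uses only the Python standard library. It is not how Airflow or dagster work internally; the step names (extract, transform, load), the `STEPS` and `DEPENDENCIES` tables, and the `run_pipeline` helper are all hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps: each one receives the outputs of its upstream steps.
STEPS = {
    "extract": lambda: [1, 2, 3],                     # pretend to read source rows
    "transform": lambda rows: [r * 2 for r in rows],  # double every value
    "load": lambda rows: sum(rows),                   # pretend to write; return a checksum
}

# Hypothetical dependency table: step name -> list of steps it depends on.
DEPENDENCIES = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}


def run_pipeline(steps, dependencies):
    """Run every step after all of its dependencies have finished.

    Returns the execution order and a dict of each step's result.
    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    order = []
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = steps[name](*upstream)
        order.append(name)
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(STEPS, DEPENDENCIES)
    print(order, results["load"])
```

An orchestration framework adds scheduling, retries, and monitoring on top of this ordering logic, which is why a plain script like this one is usually only a starting point.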

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13

    AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • Project mention: AI Strategy Guide: How to Scale AI Across Your Business | dev.to | 2024-05-11

    Level 1 of MLOps is when you've put each lifecycle stage and their interfaces in an automated pipeline. The pipeline could be a python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

  • Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22
  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

  • Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

    I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.

  • ethereum-etl

    Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ

  • Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27

    An event is a log entity which EVM smart contracts can emit during transaction execution. Events are very good at signalling that some action has taken place on-chain. Applications can subscribe and listen to events to trigger some off-chain logic, or they can index, transform and store events in some off-chain storage (look at The Graph protocol or Ethereum ETL).

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sqlmesh

    Efficient data transformation and modeling framework that is backwards compatible with dbt.

  • Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14

    There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.

  • pgsync

    Postgres to Elasticsearch/OpenSearch sync (by toluaina)

  • NeumAI

    Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

  • Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

    Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

  • eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  • baby-names-analysis

    Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.

  • redun

    Yet another redundant workflow engine

  • Project mention: Redun: Yet another redundant workflow engine | news.ycombinator.com | 2023-08-11
  • versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  • bitcoin-etl

    ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ

  • ethereum-etl-airflow

    Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee

  • Project mention: ethereum-etl-airflow: NEW Data - star count:358.0 | /r/algoprojects | 2023-07-10
  • astro-sdk

    Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.

  • Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01

    Have you tried the Astro SDK? https://github.com/astronomer/astro-sdk

  • paperetl

    📄 ⚙️ ETL processes for medical and scientific papers

  • Project mention: Show HN: Open-source Rule-based PDF parser for RAG | news.ycombinator.com | 2024-01-23
  • recap

    Work with your web service, database, and streaming schemas in a single format.

  • Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12

    The Github Repo: https://github.com/recap-build/recap

  • usaspending-api

    Server application to serve U.S. federal spending data via a RESTful API

  • trex

    Enforce structured output from LLMs 100% of the time (by automorphic-ai)

  • Project mention: Show HN: Generate JSON mock data for testing/initial app development | news.ycombinator.com | 2023-10-03

    A friend of mine built a tool called Trex that you might find helpful, check it out here: https://github.com/automorphic-ai/trex

    It's very consistent at generating templated data.

  • dbt-coves

    CLI tool for dbt users to simplify creation of staging models (yml and sql) files

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python ETL related posts

  • Show HN: Open-source Rule-based PDF parser for RAG

    9 projects | news.ycombinator.com | 23 Jan 2024
  • Prism: the easiest way to create robust data workflows. Accessible via CLI

    1 project | /r/coolgithubprojects | 21 Sep 2023
  • Show HN: Prism – a framework for creating robust data science workflows

    1 project | news.ycombinator.com | 1 Sep 2023
  • Show HN: Prism – Data Orchestration in Python

    1 project | news.ycombinator.com | 28 Jul 2023
  • Introducing Prism: A Novel, Open-Source Data Orchestration Software. Feedback needed!

    2 projects | /r/EntrepreneurRideAlong | 27 Jul 2023
  • Prism - a lightweight, yet powerful data orchestration platform in Python. Accessible via CLI

    1 project | /r/coolgithubprojects | 27 Jul 2023
  • Intelligently transform unstructured to structured output (JSON, Regex, CFG)

    1 project | /r/ETL | 18 Jul 2023

Index

What are some of the best open-source ETL projects in Python? This list will help you:

Rank  Project                  Stars
1     Airflow                  34,705
2     airbyte                  14,296
3     dagster                  10,382
4     Mage                     7,171
5     AWS Data Wrangler        3,816
6     ethereum-etl             2,836
7     mara-pipelines           2,056
8     pyspark-example-project  1,370
9     sqlmesh                  1,334
10    pgsync                   1,071
11    NeumAI                   788
12    eland                    617
13    baby-names-analysis      563
14    redun                    489
15    versatile-data-kit       411
16    bitcoin-etl              388
17    ethereum-etl-airflow     386
18    astro-sdk                323
19    paperetl                 319
20    recap                    305
21    usaspending-api          283
22    trex                     238
23    dbt-coves                210
