Top 23 Python data-engineering Projects

Airflow

170 34,705 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: AI Strategy Guide: How to Scale AI Across Your Business | dev.to | 2024-05-11

Level 1 of MLOps is when you've put each lifecycle stage and its interfaces in an automated pipeline. The pipeline could be a Python or Bash script, or it could be a directed acyclic graph run by an orchestration framework like Airflow, dagster, or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML, and dvc also feature pipeline capabilities.
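
To make the "directed acyclic graph" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how Airflow or dagster work internally; the stage names and the `run_pipeline` helper are hypothetical and stand in for real lifecycle stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical lifecycle stages; each maps to the set of stages it depends on.
dag = {
    "ingest": set(),
    "validate": {"ingest"},
    "train": {"validate"},
    "evaluate": {"train"},
    "report": {"evaluate"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag, tasks):
    """Run every task once, only after all of its upstream stages have run."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        order.append(name)
    return order


# Placeholder task bodies; a real pipeline would call ingestion or training code.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dag}
executed = run_pipeline(dag, tasks)
```

An orchestration framework adds scheduling, retries, and monitoring on top of this ordering step, which is the part a plain script usually lacks.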

Prefect

19 14,829 10.0 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

airbyte

140 14,296 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13

AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.
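
The "split, chunk" step mentioned above can be sketched in a few lines. This is an illustrative fixed-size chunker with overlap, not AgentCloud's or Airbyte's actual code; the `chunk_text` name and its parameters are assumptions, and the embedding step is left out because it depends on the model in use.

```python
def chunk_text(text, size, overlap=0):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last window is not just leftover overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", size=4, overlap=1)
```

Overlapping windows keep some shared context between neighbouring chunks, which can help when each chunk is embedded on its own.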

dagster

47 10,382 10.0 Python

An orchestration platform for the development, production, and observation of data assets.

Project mention: AI Strategy Guide: How to Scale AI Across Your Business | dev.to | 2024-05-11

Level 1 of MLOps is when you've put each lifecycle stage and its interfaces in an automated pipeline. The pipeline could be a Python or Bash script, or it could be a directed acyclic graph run by an orchestration framework like Airflow, dagster, or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML, and dvc also feature pipeline capabilities.

great_expectations

15 9,526 9.9 Python

Always know what to expect from your data.

Taipy

16 8,847 9.9 Python

Turns Data and AI algorithms into production-ready web applications in no time.

Project mention: Python Day 9: Building Interactive Web Apps without HTML/CSS and JavaScript | dev.to | 2024-04-26

Taipy is an open-source Python library that enables data scientists and developers to build robust end-to-end data pipelines.

Mage

77 7,171 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22

phidata

15 8,379 9.9 Python

Build AI Assistants with memory, knowledge and tools.

Project mention: Phidata: Add memory, knowledge and tools to LLMs | news.ycombinator.com | 2024-05-06

feast

8 5,293 9.5 Python

The Open Source Feature Store for Machine Learning

Project mention: What's Happening with Feast? | news.ycombinator.com | 2023-12-07

AWS Data Wrangler

9 3,816 9.4 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, and has a few other optimizations that made it a great tool.

ploomber

121 3,392 7.4 Python

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Project mention: Show HN: JupySQL – a SQL client for Jupyter (ipython-SQL successor) | news.ycombinator.com | 2023-12-06

- One-click sharing powered by Ploomber Cloud: https://ploomber.io
Documentation: https://jupysql.ploomber.io
Note that JupySQL is a fork of ipython-sql, which is no longer actively developed. Catherine, ipython-sql's creator, was kind enough to pass the project to us (check out ipython-sql's README).
We'd love to learn what you think and what features we can ship for JupySQL to be the best SQL client! Please let us know in the comments!

data-diff

20 2,865 9.4 Python

Compare tables within or across databases

Project mention: How to Check 2 SQL Tables Are the Same | news.ycombinator.com | 2023-07-26

If the issue happens a lot, there is also https://github.com/datafold/data-diff
It is a nice tool that works across databases as well.
I think it's based on a checksum method.
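
As a rough illustration of the general checksum idea mentioned in the comment, the sketch below hashes the rows of two tables in key order and compares the digests. It is a simplified assumption-laden example using SQLite, not data-diff's implementation; the `table_checksum` helper and the tables `a` and `b` are hypothetical.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key):
    """Hash all rows of `table`, read in a stable order given by `key`."""
    digest = hashlib.sha256()
    # Identifiers are interpolated into the SQL, so pass trusted names only.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (2, 'y'), (1, 'x');
""")

# Same rows inserted in a different order: the checksums match.
same_before = table_checksum(conn, "a", "id") == table_checksum(conn, "b", "id")

# After one value changes, the checksums differ.
conn.execute("UPDATE b SET name = 'z' WHERE id = 2")
same_after = table_checksum(conn, "a", "id") == table_checksum(conn, "b", "id")
```

A single whole-table digest only says whether the tables differ. Finding which rows differ would need per-range or per-row checksums, which this sketch does not attempt.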

soda-core

5 1,776 8.9 Python

⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

dlt

6 1,792 9.9 Python

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03

SEEKING FREELANCER | REMOTE | GERMANY
dltHub is looking for freelance help in the following repos:
- https://github.com/dlt-hub/dlt

meltano

9 1,617 9.8 Python

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

Project mention: meltano VS cloudquery - a user suggested alternative | libhunt.com/r/meltano | 2023-06-02

pyspark-example-project

1 1,370 0.0 Python

Implementing best practices for PySpark ETL jobs and applications.

mlrun

3 1,316 9.9 Python

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

Udacity-Data-Engineering-Projects

5 1,295 0.0 Python

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing, and Data Lake development.

Project mention: A question about data engineering? | /r/programiranje | 2023-06-30

pyjanitor

4 1,291 8.3 Python

Clean APIs for data cleaning. Python implementation of R package Janitor

bytewax

18 1,255 9.9 Python

Python Stream Processing

Project mention: Building a streaming SQL engine with Arrow and DataFusion | news.ycombinator.com | 2024-03-18

DataEngineeringProject

5 985 0.0 Python

Example end-to-end data engineering project.

NeumAI

2 788 8.7 Python

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

vectorflow

9 643 8.2 Python

VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice. (by dgarnitz)

Project mention: FLaNK Weekly 08 Jan 2024 | dev.to | 2024-01-08

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-engineering related posts

Phidata: Add memory, knowledge and tools to LLMs
1 project | news.ycombinator.com | 6 May 2024

Python Day 9: Building Interactive Web Apps without HTML/CSS and JavaScript
1 project | dev.to | 26 Apr 2024

Building a streaming SQL engine with Arrow and DataFusion
1 project | news.ycombinator.com | 18 Mar 2024

Prefect: A workflow orchestration tool for data pipelines
1 project | news.ycombinator.com | 13 Mar 2024

+10 Resources to Empower Women in Technology
1 project | dev.to | 6 Mar 2024

Show HN: Use function calling to build AI Assistants
1 project | news.ycombinator.com | 27 Feb 2024

Phidata: Build AI Assistants using function calling
1 project | news.ycombinator.com | 25 Feb 2024

Index

What are some of the best open-source data-engineering projects in Python? This list will help you:

#	Project	Stars
1	Airflow	34,705
2	Prefect	14,829
3	airbyte	14,296
4	dagster	10,382
5	great_expectations	9,526
6	Taipy	8,847
7	Mage	7,171
8	phidata	8,379
9	feast	5,293
10	AWS Data Wrangler	3,816
11	ploomber	3,392
12	data-diff	2,865
13	soda-core	1,776
14	dlt	1,792
15	meltano	1,617
16	pyspark-example-project	1,370
17	mlrun	1,316
18	Udacity-Data-Engineering-Projects	1,295
19	pyjanitor	1,291
20	bytewax	1,255
21	DataEngineeringProject	985
22	NeumAI	788
23	vectorflow	643
