Top 23 Python Data Projects

llama_index

75 31,628 10.0 Python

LlamaIndex is a data framework for your LLM applications

Project mention: LlamaIndex: A data framework for your LLM applications | news.ycombinator.com | 2024-04-07

Prefect

19 14,780 10.0 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

airbyte

140 14,217 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13

AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.

pandas-ai

14 11,140 9.8 Python

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

Project mention: PandasAI is great but is there a more general library? | news.ycombinator.com | 2023-08-23

chinese-xinhua

1 10,668 0.0 Python

Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.
akshare

0 8,479 9.8 Python

AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library. (by akfamily)
Mage

77 7,131 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22

knowledge-repo

2 5,441 4.1 Python

A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Mimesis

3 4,310 8.9 Python

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
CKAN

6 4,273 9.8 Python

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

Project mention: Open Source Flask-based web applications | dev.to | 2023-07-11

CKAN: The Open Source Data Portal Software

datasets

5 4,192 9.4 Python

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)
TextRecognitionDataGenerator

1 3,064 4.3 Python

A synthetic data generator for text recognition
pandas-datareader

3 2,828 6.3 Python

Extract data from a wide range of Internet sources into a pandas DataFrame.
PyPika

4 2,386 5.6 Python

PyPika is a Python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

Project mention: any recommendations for a good query builder library with good support? | /r/learnpython | 2023-07-11

I recently started using drizzle orm and I am now looking for something similar in python, my goal is to be as close to sql syntax as possible without just passing dml commands as strings, type safety would be cool as well, I saw this one pypika but it has a lot of open issues and no commits for a year, is there anything similar but more stable?

PyFunctional

4 2,341 5.1 Python

Python library for creating data pipelines with chain functional programming

Project mention: Python: Uncovering the Overlooked Core Functionalities | news.ycombinator.com | 2023-07-24

If you actually think this code is better there's a real library that does this: https://github.com/EntilZha/PyFunctional.
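As a rough illustration of what "chain functional programming" means for a data pipeline, here is a minimal sketch in plain Python. The `Chain` class, its methods, and the sample `orders` data are hypothetical stand-ins written for this example; they are not PyFunctional's actual API, which should be checked in the project's own documentation.

```python
from functools import reduce


class Chain:
    """Hypothetical wrapper (not PyFunctional) that chains steps left to right."""

    def __init__(self, items):
        self._items = items

    def map(self, fn):
        # Lazily apply fn to every element and keep chaining.
        return Chain(fn(x) for x in self._items)

    def filter(self, pred):
        # Lazily keep only the elements for which pred is true.
        return Chain(x for x in self._items if pred(x))

    def reduce(self, fn, initial):
        # Terminal step: fold all elements into a single value.
        return reduce(fn, self._items, initial)

    def to_list(self):
        # Terminal step: materialize the pipeline into a list.
        return list(self._items)


# Made-up sample data for the sketch.
orders = [
    {"item": "espresso", "qty": 2, "price": 3.0},
    {"item": "latte", "qty": 0, "price": 4.5},
    {"item": "mocha", "qty": 1, "price": 5.0},
]

# Read top to bottom: drop empty orders, compute line totals, sum them.
total = (
    Chain(orders)
    .filter(lambda o: o["qty"] > 0)
    .map(lambda o: o["qty"] * o["price"])
    .reduce(lambda acc, x: acc + x, 0.0)
)
print(total)
```

The appeal of this style is that each step reads in execution order, in contrast to nested `map(filter(...))` calls that read inside out.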

mito

18 2,223 10.0 Python

The mitosheet package, trymito.io, and other public Mito code.

Project mention: The Design Philosophy of Great Tables (Software Package) | news.ycombinator.com | 2024-04-04

2. The report you're sending out for display is _expected_ in an Excel format. The two main reasons for this are just organizational momentum, or that you want to let the receiver conduct additional ad-hoc analysis (Excel is best for this in almost every org).
The way we've sliced this problem space is by improving the interfaces that users can use to export formatting to Excel. You can see some of our (open-core) code here [2]. TL;DR: Mito gives you an interface in Jupyter that looks like a spreadsheet, where you can apply formatting like Excel (number formatting, conditional formatting, color formatting) - and then Mito automatically generates code that exports this formatting to an Excel file. This is one of our more compelling enterprise features, for decision makers that work with non-expert Python programmers - getting formatting into Excel is a big hassle.
[1] https://trymito.io
[2] https://github.com/mito-ds/mito/blob/dev/mitosheet/mitosheet...

sketch

20 2,195 4.4 Python

AI code-writing assistant that understands data content

Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.
1. sketch (in notebook, ai for pandas) https://github.com/approximatelabs/sketch
2. datadm (open source, "chat with data", with support for the open source LLMs) https://github.com/approximatelabs/datadm
3. Our main product: julyp. https://julyp.com/ (currently under very active rebrand and cleanup) -- but a "chat with data" style app, with a lot of specialized features. I'm also streaming me using it (and sometimes building it) every weekday on twitch to solve misc data problems (https://www.twitch.tv/bluecoconut)
For your next question, about the stack and deploy:

mara-pipelines

3 2,053 6.0 Python

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Colour

6 1,984 9.2 Python

Colour Science for Python

Project mention: Tailwind Color Palette Generator | news.ycombinator.com | 2024-02-02

Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/
How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).

glom

2 1,829 7.4 Python

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️

Project mention: Ask HN: How can I get better at writing production-level Python? | news.ycombinator.com | 2023-07-18
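To make the idea of declaratively restructuring nested data concrete, here is a small sketch using only the standard library. The helpers `get_path` and `restructure`, and the sample `record`, are hypothetical names invented for this illustration; they only mimic the general idea and are not glom's implementation or API.

```python
_MISSING = object()  # sentinel so that None can be used as a real default


def get_path(target, path, default=_MISSING):
    """Follow a dotted path such as 'user.emails.0' through dicts and lists."""
    current = target
    for key in path.split("."):
        try:
            if isinstance(current, (list, tuple)):
                current = current[int(key)]  # numeric segment indexes a sequence
            else:
                current = current[key]  # otherwise treat the segment as a dict key
        except (KeyError, IndexError, ValueError, TypeError):
            if default is _MISSING:
                raise LookupError(
                    f"could not resolve {key!r} in path {path!r}"
                ) from None
            return default
    return current


def restructure(target, spec):
    """Build a flat dict from a spec that maps output names to dotted paths."""
    return {name: get_path(target, path) for name, path in spec.items()}


# Made-up nested record for the sketch.
record = {"user": {"name": "Ada", "emails": ["ada@example.com", "al@example.com"]}}

flat = restructure(record, {"name": "user.name", "primary_email": "user.emails.0"})
print(flat)
```

Describing the output shape as data (the `spec` dict) rather than as a series of indexing statements keeps the access logic in one place and gives a single spot to report which path segment failed.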

diffgram

9 1,801 8.8 Python

The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
dlt

6 1,792 9.9 Python

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03

SEEKING FREELANCER | REMOTE | GERMANY
dltHub is looking for freelance help in the following repos:
- https://github.com/dlt-hub/dlt

meltano

9 1,608 9.8 Python

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

Project mention: meltano VS cloudquery - a user suggested alternative | libhunt.com/r/meltano | 2023-06-02

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data related posts

How to Build a Chat App with Your Postgres Data using Agent Cloud
3 projects | dev.to | 13 May 2024

Cold-(Brew) Outreach: Landing my first big client at a coffee shop
1 project | news.ycombinator.com | 30 Apr 2024

LlamaIndex: A data framework for your LLM applications
1 project | news.ycombinator.com | 7 Apr 2024

LlamaIndex is a data framework for your LLM applications
1 project | news.ycombinator.com | 28 Mar 2024

Ask HN: What have you built with LLMs?
43 projects | news.ycombinator.com | 5 Feb 2024

GitHub Innovation Graph
1 project | news.ycombinator.com | 5 Feb 2024

Show HN: Finagg – free and nearly unlimited financial data
1 project | news.ycombinator.com | 21 Jan 2024

Index

What are some of the best open-source Data projects in Python? This list will help you:

#	Project	Stars
1	llama_index	31,628
2	Prefect	14,780
3	airbyte	14,217
4	pandas-ai	11,140
5	chinese-xinhua	10,668
6	akshare	8,479
7	Mage	7,131
8	knowledge-repo	5,441
9	Mimesis	4,310
10	CKAN	4,273
11	datasets	4,192
12	TextRecognitionDataGenerator	3,064
13	pandas-datareader	2,828
14	PyPika	2,386
15	PyFunctional	2,341
16	mito	2,223
17	sketch	2,195
18	mara-pipelines	2,053
19	Colour	1,984
20	glom	1,829
21	diffgram	1,801
22	dlt	1,792
23	meltano	1,608
