Top 23 Data Open-Source Projects

TanStack Query

56 39,548 9.8 TypeScript

🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.

Project mention: Best Next.js Libraries and Tools in 2024 | dev.to | 2024-04-10

Metabase

67 36,417 10.0 Clojure

The simplest, fastest way to get business intelligence and analytics to everyone in your company

Project mention: HackTheBox - Writeup Analytics | dev.to | 2024-03-30

Remote Code Execution via H2

SheetJS js-xlsx

61 34,449 4.9 JavaScript

📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

Project mention: how to work with .xlsx files? | /r/node | 2023-06-28

ExcelJS and XLSX (SheetJS) are great libraries to work with XLSX files. The former I've found a bit easier to work with but less efficient in general.

llama_index

75 30,639 10.0 Python

LlamaIndex is a data framework for your LLM applications

Project mention: LlamaIndex: A data framework for your LLM applications | news.ycombinator.com | 2024-04-07

SWR

243 29,331 8.4 TypeScript

React Hooks for Data Fetching

Project mention: Best Next.js Libraries and Tools in 2024 | dev.to | 2024-04-10

Link: https://swr.vercel.app/

data

116 16,617 8.5 Jupyter Notebook

Data and code behind the articles and graphics at FiveThirtyEight

Project mention: [USMNT] It only took 20 caps for Jesus Ferreira to get double-digit goals. The fastest in #USMNT history. | /r/MLS | 2023-06-29

You of course already know this answer, but just to put it into more perspective. Here are the SPI ranking equivalents to what he did with these 11 goals in Scotland and Switzerland.

Presto

14 15,582 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

Prefect

19 14,512 9.9 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

airbyte

139 13,821 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

awesome-bigdata

3 12,773 1.5

A curated list of awesome big data frameworks, resources and other awesomeness.

Project mention: Good coding groups for black women? | news.ycombinator.com | 2024-01-13

faker

58 11,683 9.7 TypeScript

Generate massive amounts of fake data in the browser and node.js (by faker-js)

Project mention: Easily create mock data for unit tests 🧪 | dev.to | 2024-02-15

Instead of manually having to think of defaults for your interface properties, you could use Faker.
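
The idea in the comment above, generating defaults so that a test only spells out the fields it cares about, can be sketched without any dependency. This is a minimal illustration, not Faker's API: the `User` interface and the `randomString` and `makeUser` helpers are hypothetical names, and a small hand-rolled generator stands in for the library.

```typescript
// Hypothetical interface under test; not taken from any project listed here.
interface User {
  id: number;
  name: string;
  email: string;
  active: boolean;
}

// Tiny stand-in for a fake-data library: returns a random lowercase string.
function randomString(length: number): string {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  let out = "";
  for (let i = 0; i < length; i++) {
    out += letters.charAt(Math.floor(Math.random() * letters.length));
  }
  return out;
}

// Builds a User with generated defaults; callers override only the
// fields that matter for a given test.
function makeUser(overrides: Partial<User> = {}): User {
  const name = randomString(8);
  return {
    id: Math.floor(Math.random() * 1000000),
    name,
    email: `${name}@example.test`,
    active: true,
    ...overrides,
  };
}

// A test that only cares about the `active` flag:
const inactive = makeUser({ active: false });
console.log(inactive.active); // false
```

With a real fake-data library, the bodies of the default fields would presumably call its generators instead of `randomString`; the factory-with-overrides shape stays the same.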

chinese-xinhua

1 10,627 0.0 Python

Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

prql

106 9,414 9.9 Rust

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

Project mention: Prolog language for PostgreSQL proof of concept | news.ycombinator.com | 2024-03-30

semantic-source

23 8,858 9.1 Haskell

Parsing, analyzing, and comparing source code across many languages

Project mention: The Meaning of Monad in MonadTrans | news.ycombinator.com | 2023-08-13

One production example I know: GitHub code navigation is written in Haskell https://github.com/github/semantic

akshare

0 8,321 9.7 Python

AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

Bogus

27 8,208 8.3 C#

A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

Project mention: Bogus custom Dataset | dev.to | 2024-01-06

The Bogus NuGet package is a fake data generator that is helpful for populating database tables and for testing. If no database is used and Bogus populates a list of data each time an application runs, the data is random and never the same. Also, the random data generated by Bogus may not meet a developer's requirements.

machine-learning-roadmap

5 7,164 0.0

A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

Project mention: Best AI ML DL DS Roadmap | /r/deeplearning | 2023-12-07

Mrdbourke/machine-learning-roadmap on GitHub (https://github.com/mrdbourke/machine-learning-roadmap): This GitHub repository is more focused on machine learning. It's a good choice if you're looking for a more community-driven approach, as GitHub repositories often encourage contributions and updates from various experts.

Mage

76 6,953 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

Snowplow

21 6,731 8.7 Scala

The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

kestra

32 6,260 9.9 Java

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra

Tabulator

24 6,167 9.6 JavaScript

Interactive Tables and Data Grids for JavaScript

Project mention: Tabulator – JavaScript Tables and Data Grids | news.ycombinator.com | 2024-02-09

Parsr

7 5,640 4.6 JavaScript

Transforms PDF, Documents and Images into Enriched Structured Data

Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->structured text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and then uses a mixture of heuristics and machine learning models to reconstruct the document.
Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA

cloudquery

102 5,565 10.0 Go

The open source high performance ELT framework powered by Apache Arrow

Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-10.

Data related posts

Ask HN: Best way to mirror a Postgres database to parquet?
1 project | news.ycombinator.com | 10 Apr 2024
LlamaIndex: A data framework for your LLM applications
1 project | news.ycombinator.com | 7 Apr 2024
Build Responsive, Animated Charts in Minutes with EazyChart for React and Vue
1 project | news.ycombinator.com | 3 Apr 2024
How to use fly.io and Tigris to deploy a Next.js app
3 projects | dev.to | 2 Apr 2024
LlamaIndex is a data framework for your LLM applications
1 project | news.ycombinator.com | 28 Mar 2024
Malloy: A language for describing data relationships and transformations
1 project | news.ycombinator.com | 17 Mar 2024
Making a nice API of Amtrak's ugly API
2 projects | news.ycombinator.com | 16 Mar 2024

Index

What are some of the best open-source Data projects? This list will help you:

#	Project	Stars
1	TanStack Query	39,548
2	Metabase	36,417
3	SheetJS js-xlsx	34,449
4	llama_index	30,639
5	SWR	29,331
6	data	16,617
7	Presto	15,582
8	Prefect	14,512
9	airbyte	13,821
10	awesome-bigdata	12,773
11	faker	11,683
12	chinese-xinhua	10,627
13	prql	9,414
14	semantic-source	8,858
15	akshare	8,321
16	Bogus	8,208
17	machine-learning-roadmap	7,164
18	Mage	6,953
19	Snowplow	6,731
20	kestra	6,260
21	Tabulator	6,167
22	Parsr	5,640
23	cloudquery	5,565