Top 23 Big Data Open-Source Projects

awesome-scalability

8 54,694 6.2

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

Project mention: 10 resources to become a system design hero | dev.to | 2024-05-28

Modified from Zach system design repository. Added more links and topics to cover on both PS/DS & System Design Interviews. We will keep updating this posting from time to time. Some more awesome resource

Apache Spark

101 38,569 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11

ClickHouse

211 34,836 10.0 C++

ClickHouse® is a real-time analytics DBMS

Project mention: Universal Data Migration: Using Slingdata to Transfer Data Between Databases | dev.to | 2024-05-24

ClickHouse installed and running.

data-science-ipython-notebooks

1 26,551 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Apache Flink

10 23,317 9.9 Java

Apache Flink

Project mention: What is RocksDB (and its role in streaming)? | dev.to | 2024-05-13

You can find example of usage in org/apache/flink/contrib/streaming/state package (https://github.com/apache/flink/tree/9fe8d7bf870987bf43bad63078e2590a38e4faf6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state).

gun

247 17,843 7.2 JavaScript

An open source cybersecurity protocol for syncing decentralized graph data.

Project mention: gun: NEW Data - star count:17470.0 | /r/algoprojects | 2023-10-28

Presto

14 15,646 9.9 Java

The official home of the Presto distributed SQL query engine for big data

Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes fair bit of work to implement it for all the different backends.

QuestDB

311 13,561 9.7 Java

An open source time-series database for fast ingest and SQL queries

Project mention: How to Forecast Air Temperatures with AI + IoT Sensor Data | dev.to | 2024-03-24

If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.
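The sampling idea mentioned above can be sketched in plain Python. This is only an illustration of bucketed downsampling, not QuestDB's implementation (QuestDB exposes the feature through its SQL `SAMPLE BY` keyword); the function name `sample_by` and the sensor readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sample_by(readings, bucket):
    """Average irregularly spaced (timestamp, value) pairs into fixed-width buckets."""
    epoch = datetime(1970, 1, 1)
    grouped = defaultdict(list)
    for ts, value in readings:
        # Floor each timestamp to the start of the bucket it falls in.
        offset = (ts - epoch) // bucket
        grouped[epoch + offset * bucket].append(value)
    # One (bucket_start, mean) pair per non-empty bucket, in time order.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(grouped.items())]


# Hypothetical temperature readings arriving at uneven intervals.
readings = [
    (datetime(2024, 3, 24, 10, 0, 5), 20.0),
    (datetime(2024, 3, 24, 10, 0, 47), 22.0),
    (datetime(2024, 3, 24, 10, 2, 10), 21.0),
]
per_minute = sample_by(readings, timedelta(minutes=1))
```

Buckets with no readings are simply absent from the result; a forecasting pipeline would likely need to fill or interpolate those gaps before training.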

Cookbook

21 13,172 7.8

The Data Engineering Cookbook

Project mention: Tranzitie catre data engineering | /r/programare | 2023-07-12

Take a look at https://github.com/andkret/Cookbook. The author also has a YouTube channel: https://www.youtube.com/@andreaskayy

kafka-manager

13 11,690 0.0 Scala

CMAK is a tool for managing Apache Kafka clusters

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

NebulaGraph Database

8 10,263 7.9 C++

A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

Trino

45 9,676 10.0 Java

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL

Project mention: Trino & Iceberg Made Easy: A Ready-to-Use Playground | dev.to | 2024-05-19

By the way, I wanted to continue to use the previous experiment with Flink SQL and Iceberg, but I found out Trino doesn't support Iceberg's DynamoDB catalog. Therefore, I had to create a new one.

Cython

79 9,021 9.8 Python

The most widely used Python to C compiler

Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10

kafka-ui

47 8,739 8.0 Java

Open-Source Web UI for Apache Kafka Management

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

starrocks

12 8,061 10.0 Java

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

catboost

8 7,812 9.9 Python

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05

beam

30 7,582 10.0 Java

Apache Beam is a unified programming model for Batch and Streaming data processing.

Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24

The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

delta

69 6,980 9.9 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.
I think the website is here: https://delta.io

H2O

10 6,756 9.6 Jupyter Notebook

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

risingwave

27 6,407 10.0 Rust

SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.

Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30

How does this compare to RisingWave and Materialize?
https://github.com/risingwavelabs/risingwave

Zeppelin

8 6,288 8.6 Java

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

quickwit

65 6,490 9.8 Rust

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

Project mention: Tantivy – full-text search engine library inspired by Apache Lucene | news.ycombinator.com | 2024-05-27

https://github.com/quickwit-oss/quickwit
I had a surprisingly good experience with the combined power of Quickwit and ClickHouse for a multilingual search pet project. Finally, something usable for Chinese, Japanese, and Korean.
https://quickwit.io/docs/guides/add-full-text-search-to-your...
to_tsvector in PG never worked well for my use cases:
SELECT * FROM dump WHERE to_tsvector('english'::regconfig, hh_fullname) @@ to_tsquery('english'::regconfig, 'query');
I wish them success and will automatically upvote any post with Tantivy as a keyword.

arkime

13 6,154 9.6 JavaScript

Arkime is an open source, large scale, full packet capturing, indexing, and database system.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Big Data related posts

DataFusion Comet: Apache Spark Accelerator
4 projects | news.ycombinator.com | 31 May 2024

Array Expansion in Flink SQL
1 project | dev.to | 23 May 2024

Show HN: Interactive Graph by LLM (GPT-4o)
5 projects | news.ycombinator.com | 19 May 2024

Pg_lakehouse: Query Any Data Lake from Postgres
7 projects | news.ycombinator.com | 13 May 2024

Umbra: A Disk-Based System with In-Memory Performance [pdf]
3 projects | news.ycombinator.com | 2 May 2024

Top 10 Common Data Engineers and Scientists Pain Points in 2024
1 project | dev.to | 11 Apr 2024

Velox: Meta's Unified Execution Engine [pdf]
2 projects | news.ycombinator.com | 25 Mar 2024

Index

What are some of the best open-source Big Data projects? This list will help you:

#	Project	Stars
1	awesome-scalability	54,694
2	Apache Spark	38,569
3	ClickHouse	34,836
4	data-science-ipython-notebooks	26,551
5	Apache Flink	23,317
6	gun	17,843
7	Presto	15,646
8	QuestDB	13,561
9	Cookbook	13,172
10	kafka-manager	11,690
11	NebulaGraph Database	10,263
12	Trino	9,676
13	Cython	9,021
14	kafka-ui	8,739
15	starrocks	8,061
16	catboost	7,812
17	beam	7,582
18	delta	6,980
19	H2O	6,756
20	risingwave	6,407
21	Zeppelin	6,288
22	quickwit	6,490
23	arkime	6,154
