Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 23 Big Data Open-Source Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
NebulaGraph Database
A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)
-
starrocks
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
risingwave
SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
-
quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Modified from Zach system design repository. Added more links and topics to cover on both PS/DS & System Design Interviews. We will keep updating this posting from time to time. Some more awesome resource
Project mention: Universal Data Migration: Using Slingdata to Transfer Data Between Databases | dev.to | 2024-05-24ClickHouse installed and running.
You can find example of usage in org/apache/flink/contrib/streaming/state package (https://github.com/apache/flink/tree/9fe8d7bf870987bf43bad63078e2590a38e4faf6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state).
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes fair bit of work to implement it for all the different backends.
If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.
https://github.com/andkret/Cookbook arunca un ochi aici. Omul are si youtube channel https://www.youtube.com/@andreaskayy
By the way, I wanted to continue to use the previous experiment with Flink SQL and Iceberg, but I found out Trino doesn't support Iceberg's DynamoDB catalog. Therefore, I had to create a new one.
Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10
Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
Delta is pretty great, let's you do upserts into tables in DataBricks much easier than without it.
I think the website is here: https://delta.io
I would use H20 if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/
Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30How does this compare to RisingWave and Materialize?
https://github.com/risingwavelabs/risingwave
Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.
Project mention: Tantivy – full-text search engine library inspired by Apache Lucene | news.ycombinator.com | 2024-05-27https://github.com/quickwit-oss/quickwit
Had a surprisingly good experience with combined power of Quickwit and Clickhouse for multilingual search pet project. Finally something usable for Chinese, Japanese, Korean
https://quickwit.io/docs/guides/add-full-text-search-to-your...
to_tsvector in PG never worked well for my use cases
SELECT * FROM dump WHERE to_tsvector('english'::regconfig, hh_fullname) @@ to_tsquery('english'::regconfig, 'query');
Wish them to succeed. Will automatically upvote any post Tantivy as keyword
Big Data related posts
-
DataFusion Comet: Apache Spark Accelerator
-
Array Expansion in Flink SQL
-
Show HN: Interactive Graph by LLM (GPT-4o)
-
Pg_lakehouse: Query Any Data Lake from Postgres
-
Umbra: A Disk-Based System with In-Memory Performance [pdf]
-
Top 10 Common Data Engineers and Scientists Pain Points in 2024
-
Velox: Meta's Unified Execution Engine [pdf]
-
A note from our sponsor - InfluxDB
www.influxdata.com | 1 Jun 2024
Index
What are some of the best open-source Big Data projects? This list will help you:
Project | Stars | |
---|---|---|
1 | awesome-scalability | 54,694 |
2 | Apache Spark | 38,569 |
3 | ClickHouse | 34,836 |
4 | data-science-ipython-notebooks | 26,551 |
5 | Apache Flink | 23,317 |
6 | gun | 17,843 |
7 | Presto | 15,646 |
8 | QuestDB | 13,561 |
9 | Cookbook | 13,172 |
10 | kafka-manager | 11,690 |
11 | NebulaGraph Database | 10,263 |
12 | Trino | 9,676 |
13 | Cython | 9,021 |
14 | kafka-ui | 8,739 |
15 | starrocks | 8,061 |
16 | catboost | 7,812 |
17 | beam | 7,582 |
18 | delta | 6,980 |
19 | H2O | 6,756 |
20 | risingwave | 6,407 |
21 | Zeppelin | 6,288 |
22 | quickwit | 6,490 |
23 | arkime | 6,154 |
Sponsored