Big Data

Open-source projects categorized as Big Data

Top 23 Big Data Open-Source Projects

  • awesome-scalability

    The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

  • Project mention: 10 resources to become a system design hero | dev.to | 2024-05-28

    Modified from Zach's system design repository, with more links and topics added to cover both PS/DS and system design interviews. We will keep updating this post from time to time. Some more awesome resource

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11
  • ClickHouse

    ClickHouse® is a real-time analytics DBMS

  • Project mention: Universal Data Migration: Using Slingdata to Transfer Data Between Databases | dev.to | 2024-05-24

    ClickHouse installed and running.

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Apache Flink

  • Project mention: What is RocksDB (and its role in streaming)? | dev.to | 2024-05-13

    You can find an example of usage in the org/apache/flink/contrib/streaming/state package (https://github.com/apache/flink/tree/9fe8d7bf870987bf43bad63078e2590a38e4faf6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state).

  • gun

    An open source cybersecurity protocol for syncing decentralized graph data.

  • Project mention: gun: NEW Data - star count:17470.0 | /r/algoprojects | 2023-10-28
  • Presto

    The official home of the Presto distributed SQL query engine for big data

  • Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

    We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

  • QuestDB

    An open source time-series database for fast ingest and SQL queries

  • Project mention: How to Forecast Air Temperatures with AI + IoT Sensor Data | dev.to | 2024-03-24

    If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.

  • Cookbook

    The Data Engineering Cookbook

  • Project mention: Transition to data engineering | /r/programare | 2023-07-12

    Take a look at https://github.com/andkret/Cookbook. The author also has a YouTube channel: https://www.youtube.com/@andreaskayy

  • kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  • Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
  • NebulaGraph Database

    A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

  • Project mention: Trino & Iceberg Made Easy: A Ready-to-Use Playground | dev.to | 2024-05-19

    By the way, I wanted to continue to use the previous experiment with Flink SQL and Iceberg, but I found out Trino doesn't support Iceberg's DynamoDB catalog. Therefore, I had to create a new one.

  • Cython

    The most widely used Python to C compiler

  • Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10
  • kafka-ui

    Open-Source Web UI for Apache Kafka Management

  • Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
  • starrocks

    StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

  • Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

    tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb

    Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

  • catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

  • Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

  • Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24

    The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

    As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

    I think the website is here: https://delta.io

  • H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  • Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

    I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

  • risingwave

    SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.

  • Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30

    How does this compare to RisingWave and Materialize?

    https://github.com/risingwavelabs/risingwave

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

    Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

  • quickwit

    Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

  • Project mention: Tantivy – full-text search engine library inspired by Apache Lucene | news.ycombinator.com | 2024-05-27

    https://github.com/quickwit-oss/quickwit

    Had a surprisingly good experience with the combined power of Quickwit and ClickHouse for a multilingual search pet project. Finally something usable for Chinese, Japanese, and Korean.

    https://quickwit.io/docs/guides/add-full-text-search-to-your...

    to_tsvector in PG never worked well for my use cases

    SELECT * FROM dump WHERE to_tsvector('english'::regconfig, hh_fullname) @@ to_tsquery('english'::regconfig, 'query');

    Wish them success. Will automatically upvote any post with Tantivy as a keyword.

  • arkime

    Arkime is an open source, large scale, full packet capturing, indexing, and database system.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
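
The QuestDB project mention in the list above describes sampling data that lacks uniform time intervals between entries. QuestDB itself does this in SQL with its SAMPLE BY clause. The sketch below is not QuestDB code; it is only a plain-Python illustration of the idea, assuming the sample is an average per fixed-width bucket. The names sample_by and readings, and the temperature values, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Reference point used to align bucket boundaries (an assumption of this sketch).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sample_by(readings, interval):
    """Average irregular (timestamp, value) readings into fixed-width buckets.

    Timestamps must be timezone-aware. Buckets are aligned to the Unix epoch.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        # Floor each timestamp to the start of the bucket it falls into.
        bucket_index = (ts - EPOCH) // interval
        buckets[EPOCH + bucket_index * interval].append(value)
    # One averaged row per non-empty bucket, in time order.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical sensor readings with uneven gaps between entries.
readings = [
    (datetime(2024, 3, 24, 10, 2, tzinfo=timezone.utc), 20.0),
    (datetime(2024, 3, 24, 10, 41, tzinfo=timezone.utc), 22.0),
    (datetime(2024, 3, 24, 11, 15, tzinfo=timezone.utc), 25.0),
]

hourly = sample_by(readings, timedelta(hours=1))
for start, avg in hourly:
    print(start.isoformat(), avg)
```

Buckets with no readings are simply absent from the result; a real pipeline would need to decide whether to fill or interpolate those gaps before training a forecasting model.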

Big Data related posts

  • DataFusion Comet: Apache Spark Accelerator

    4 projects | news.ycombinator.com | 31 May 2024
  • Array Expansion in Flink SQL

    1 project | dev.to | 23 May 2024
  • Show HN: Interactive Graph by LLM (GPT-4o)

    5 projects | news.ycombinator.com | 19 May 2024
  • Pg_lakehouse: Query Any Data Lake from Postgres

    7 projects | news.ycombinator.com | 13 May 2024
  • Umbra: A Disk-Based System with In-Memory Performance [pdf]

    3 projects | news.ycombinator.com | 2 May 2024
  • Top 10 Common Data Engineers and Scientists Pain Points in 2024

    1 project | dev.to | 11 Apr 2024
  • Velox: Meta's Unified Execution Engine [pdf]

    2 projects | news.ycombinator.com | 25 Mar 2024

Index

What are some of the best open-source Big Data projects? This list will help you:

 #  Project                          Stars
 1  awesome-scalability             54,694
 2  Apache Spark                    38,569
 3  ClickHouse                      34,836
 4  data-science-ipython-notebooks  26,551
 5  Apache Flink                    23,317
 6  gun                             17,843
 7  Presto                          15,646
 8  QuestDB                         13,561
 9  Cookbook                        13,172
10  kafka-manager                   11,690
11  NebulaGraph Database            10,263
12  Trino                            9,676
13  Cython                           9,021
14  kafka-ui                         8,739
15  starrocks                        8,061
16  catboost                         7,812
17  beam                             7,582
18  delta                            6,980
19  H2O                              6,756
20  risingwave                       6,407
21  Zeppelin                         6,288
22  quickwit                         6,490
23  arkime                           6,154
