Scala Big Data

Open-source Scala projects categorized as Big Data

Top 23 Scala Big Data Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11
  • kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  • Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

    I think the website is here: https://delta.io

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • Scalding

    A Scala API for Cascading

  • Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

  • Jupyter Scala

    A Scala kernel for Jupyter

  • Reactive-kafka

    Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  • H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  • BIDMach

    CPU and GPU-accelerated Machine Learning Library

  • Gearpump

    Lightweight real-time big data streaming engine over Akka

  • Vegas

    The missing MatPlotLib for Scala + Spark (by vegas-viz)

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • delta-sharing

    An open protocol for secure data sharing

  • Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29
  • nussknacker

    Low-code tool for automating actions on real time data | Stream processing for the users.

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • Sparkta

    Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)

  • Scoobi

    A Scala productivity framework for Hadoop. (by NICTA)

  • qbeast-spark

    Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

  • Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  • Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  • Scoozie

    Scala DSL on top of Oozie XML

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Big Data related posts

  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023
  • The "Big Three's" Data Storage Offerings

    2 projects | /r/dataengineering | 15 Jun 2023
  • Medallion/lakehouse architecture data modelling

    1 project | /r/dataengineering | 3 Jun 2023
  • How to build a data pipeline using Delta Lake

    2 projects | dev.to | 19 May 2023
  • whenNotMatchedBySourceUpdate not existing? Trying to upsert parquet into Delta table

    1 project | /r/apachespark | 10 May 2023
  • Delta.io/deltalake self hosting

    2 projects | /r/bigdata | 26 Apr 2023
  • Delta.io/deltalake self hosting

    1 project | /r/DeltaLake | 25 Apr 2023

Index

What are some of the best open-source Big Data projects in Scala? The projects on this list, ranked by GitHub stars:

 #  Project            Stars
 1  Apache Spark      38,414
 2  kafka-manager     11,676
 3  delta              6,919
 4  SynapseML          4,970
 5  Scalding           3,471
 6  Scio               2,524
 7  Jupyter Scala      1,564
 8  Reactive-kafka     1,418
 9  adam                 967
10  H2O                  952
11  BIDMach              913
12  Gearpump             765
13  Vegas                729
14  spark-rapids         722
15  delta-sharing        676
16  nussknacker          611
17  metorikku            576
18  Sparkta              524
19  Scoobi               482
20  qbeast-spark         192
21  Clustering4Ever      128
22  Schemer              112
23  Scoozie               82
