Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11
  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

    I think the website is here: https://delta.io

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • spark-nlp

    State of the Art Natural Language Processing

  • Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
  • deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

  • Quill

    Compile-time Language Integrated Queries for Scala

  • Project mention: Dear Sir, You Have Built a Compiler (2022) | news.ycombinator.com | 2023-08-17

    https://github.com/zio/zio-quill

    This library does exactly what you prescribe. Pretty sure under the hood it's using macros with string templates

  • kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  • spark-cassandra-connector

    DataStax Connector for Apache Spark to Apache Cassandra (by datastax)

  • Jupyter Scala

    A Scala kernel for Jupyter

  • mleap

    MLeap: Deploy ML Pipelines to Production

  • LearningSparkV2

    This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  • H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  • tispark

    TiSpark is built for running Apache Spark on top of TiDB/TiKV

  • frameless

    Expressive types for Spark.

  • incubator-livy

    Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

  • spark-daria

    Essential Spark extensions and helper methods ✨😲

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • delta-sharing

    An open protocol for secure data sharing

  • Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29
  • sparkMeasure

    This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

  • spline

    Data Lineage Tracking And Visualization Solution (by AbsaOSS)

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • spark-solr

    Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.

  • Project mention: How to store 175 million rows and query them | /r/datasets | 2023-05-10
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Spark related posts

  • Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!

    1 project | /r/Python | 6 Sep 2023
  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023
  • Pandas was faster and less memory intensive then crealytics pyspark. How is it possible?

    2 projects | /r/dataengineering | 17 Jun 2023
  • The "Big Three's" Data Storage Offerings

    2 projects | /r/dataengineering | 15 Jun 2023
  • Medallion/lakehouse architecture data modelling

    1 project | /r/dataengineering | 3 Jun 2023
  • How to build a data pipeline using Delta Lake

    2 projects | dev.to | 19 May 2023
  • PySpark for NLP Workshop - Materials and Jupyter Notebooks

    2 projects | /r/dataengineering | 14 May 2023

Index

What are some of the best open-source Spark projects in Scala? This list will help you:

 #  Project                    Stars
 1  Apache Spark               38,414
 2  delta                       6,919
 3  SynapseML                   4,970
 4  spark-nlp                   3,695
 5  deequ                       3,134
 6  Quill                       2,136
 7  kyuubi                      1,941
 8  spark-cassandra-connector   1,930
 9  Jupyter Scala               1,564
10  mleap                       1,494
11  LearningSparkV2             1,095
12  adam                          967
13  H2O                           952
14  tispark                       878
15  frameless                     870
16  incubator-livy                856
17  spark-daria                   742
18  spark-rapids                  722
19  delta-sharing                 676
20  sparkMeasure                  642
21  spline                        582
22  metorikku                     576
23  spark-solr                    445
