Understanding Parquet, Iceberg and Data Lakehouses

iceberg-python

2 243 9.8 Python

Apache PyIceberg

You don't need a Spark deployment. The first reference implementations for reading and writing Iceberg tables were in Spark, but PyIceberg now provides read support in Python.
Write support should be merged very soon: https://github.com/apache/iceberg-python/pull/41
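
To make the "no Spark needed" point concrete, here is a minimal sketch of a read through PyIceberg. It assumes `pyiceberg` is installed with its PyArrow extra and that a catalog is already configured (for example in `~/.pyiceberg.yaml`). The function name `read_iceberg_table` and the catalog and table names in the usage line are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence


def read_iceberg_table(
    catalog_name: str,
    table_identifier: str,
    columns: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
):
    """Scan an Iceberg table into a pyarrow.Table without a Spark cluster."""
    # Imported lazily so this module still loads where pyiceberg is absent.
    from pyiceberg.catalog import load_catalog

    # Assumption: connection settings for `catalog_name` come from the
    # PyIceberg configuration file or environment, not from this code.
    catalog = load_catalog(catalog_name)
    table = catalog.load_table(table_identifier)

    # Project only the requested columns; ("*",) selects every column.
    scan = table.scan(
        selected_fields=tuple(columns) if columns else ("*",),
        limit=limit,
    )
    return scan.to_arrow()
```

A call might look like `read_iceberg_table("default", "examples.events", columns=["id"], limit=10)`, where both names are placeholders for whatever your catalog actually contains.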

delta

69 6,958 9.8 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…
Here’s the Apache Iceberg table format specification:
https://iceberg.apache.org/spec/
As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty.
This is nominally the Delta Lake equivalent:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md
I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.
Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).
My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

lance

10 3,328 9.8 Rust

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, with more integrations coming.

Parquet has been the lakehouse file format of choice for nearly half a decade, but we are starting to see other contenders optimized for lower latency, such as Lance: https://github.com/lancedb/lance
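
As a rough illustration of the Parquet-to-Lance conversion the project description mentions, the sketch below wraps the conversion and a random-access read in two small helpers. It assumes the `pylance` package (imported as `lance`) and `pyarrow` are installed. The helper names `parquet_to_lance` and `take_rows` are hypothetical, and no performance figure is implied by the code itself.

```python
from typing import Iterable


def parquet_to_lance(parquet_path: str, lance_path: str):
    """Rewrite a Parquet file or directory as a Lance dataset."""
    # Imported lazily so this module still loads where lance is absent.
    import lance
    import pyarrow.dataset as ds

    # Open the Parquet data lazily, then let Lance stream it to disk.
    parquet = ds.dataset(parquet_path, format="parquet")
    return lance.write_dataset(parquet, lance_path)


def take_rows(lance_path: str, row_indices: Iterable[int]):
    """Fetch specific rows by position, returned as a pyarrow.Table."""
    import lance

    dataset = lance.dataset(lance_path)
    return dataset.take(list(row_indices))
```

Positional lookups like `take_rows("/tmp/test.lance", [1, 100, 500])` are the access pattern where a format tuned for lower latency would be expected to matter; the path here is a placeholder.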

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.



This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Spark Machine Learning Apache Acid Computer Vision
Post date: 29 Dec 2023
