Parquet: More than just “Turbo CSV”

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ryu

    Converts floating point numbers to decimal strings (by ulfjack)

  • > There isn't really a CSV standard that defines the precise grammar of CSV.

    Did you read the link? It goes to a page literally titled "Parsing JSON is a Minefield".

    JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to differences between implementations.

    Google put significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?

    JSON.
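
    As a minimal sketch of why that matters, Python's json module gives the same guarantee Ryu targets: the shortest decimal string that parses back to the identical double.

        import json, struct

        x = 0.1 + 0.2         # a double whose exact value has a long decimal expansion
        s = json.dumps(x)     # serialization: double -> decimal string
        print(s)              # "0.30000000000000004", the shortest form that round-trips
        y = json.loads(s)     # deserialization: decimal string -> double

        # The round trip is exact: both values have the same 64-bit representation.
        assert struct.pack("<d", x) == struct.pack("<d", y)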

  • arrow-tools

    A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet

  • If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.
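
    If pyarrow is already installed, roughly the same conversion can also be sketched in a few lines of Python (file names here are placeholders):

        # pip install pyarrow
        import pyarrow.csv as pa_csv
        import pyarrow.parquet as pq

        table = pa_csv.read_csv("data.csv")   # column types are inferred from the CSV
        pq.write_table(table, "data.parquet", compression="zstd")   # typed, columnar, compressed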

  • parquet-format

    Apache Parquet

  • > Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.

    The Parquet datatypes documentation is pretty clear that there is a flag, isAdjustedToUTC, that defines whether the timestamp should be interpreted as having Instant semantics or Local semantics.

    https://github.com/apache/parquet-format/blob/master/Logical...

    There's still no option to include a TZ offset in the data (so that the same datum could be interpreted with both Local and Instant semantics), but not bad really.
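
    For what it's worth, a small pyarrow sketch of how the two semantics come out in practice (column names and values below are made up): a timezone-aware Arrow column is written with isAdjustedToUTC = true (Instant semantics), a naive one with isAdjustedToUTC = false (Local semantics).

        # pip install pyarrow
        import pyarrow as pa
        import pyarrow.parquet as pq

        table = pa.table({
            # tz-aware column -> Instant semantics (isAdjustedToUTC = true)
            "event_utc":   pa.array([1700000000000000], type=pa.timestamp("us", tz="UTC")),
            # naive column  -> Local semantics (isAdjustedToUTC = false)
            "event_local": pa.array([1700000000000000], type=pa.timestamp("us")),
        })
        pq.write_table(table, "events.parquet")

        # The distinction survives the round trip, but an original offset such as +05:30
        # cannot be stored: only "UTC-normalized" or "local" can be expressed.
        print(pq.read_table("events.parquet").schema)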

  • fast_float

    Fast and exact implementation of the C++ from_chars functions for number types: 4x to 10x faster than strtod, part of GCC 12 and WebKit/Safari

  • > Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double. It's not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
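
    A rough way to see why parse speed matters, in Python (numbers are machine-dependent; the point is only that a floats-heavy payload spends a large share of its parse time converting decimal strings back into doubles):

        import json, random, time

        # roughly 20 MB of JSON that is almost nothing but decimal floats
        doc = json.dumps([random.random() for _ in range(1_000_000)])

        start = time.perf_counter()
        json.loads(doc)   # much of this is string -> double conversion
        print(f"parsed {len(doc)} bytes in {time.perf_counter() - start:.3f}s")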

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

  • Decompression of arbitrary gzip files can be parallelized with pragzip: https://github.com/mxmlnkn/pragzip
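
    (pragzip has since been renamed rapidgzip.) A sketch following the usage shown in the project's README; the file name is a placeholder.

        # pip install rapidgzip
        import os
        import rapidgzip

        with rapidgzip.open("big.gz", parallelization=os.cpu_count()) as f:
            data = f.read()   # decompression work is spread across all available cores
        print(len(data))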

  • ClickHouse

    ClickHouse® is a real-time analytics DBMS

  • https://github.com/ClickHouse/ClickHouse/pull/45878

    Also, we still have optimizations for reading Parquet from S3 coming, so that might improve further.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Universal Data Migration: Using Slingdata to Transfer Data Between Databases

    2 projects | dev.to | 24 May 2024
  • Simplified API Creation and Management: ClickHouse to APISIX Integration Without Code

    3 projects | dev.to | 22 May 2024
  • We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions

    1 project | news.ycombinator.com | 2 Apr 2024
  • Erasure Coding versus Tail Latency

    1 project | news.ycombinator.com | 28 Mar 2024
  • Build time is a collective responsibility

    2 projects | news.ycombinator.com | 24 Mar 2024