Parquet: More than just “Turbo CSV”

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ryu

    Converts floating point numbers to decimal strings (by ulfjack)

  • > There isn't really a CSV standard that defines the precise grammar of CSV.

    Did you read the link? It goes to a page literally titled "Parsing JSON is a Minefield".

    JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to differences between implementations.

    Google put significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?

    JSON.
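
    As a minimal sketch of why that matters, Python's json module gives the same guarantee Ryu targets: the shortest decimal string that parses back to the identical double.

        import json, struct

        x = 0.1 + 0.2         # a double whose exact value has a long decimal expansion
        s = json.dumps(x)     # serialization: double -> decimal string
        print(s)              # "0.30000000000000004", the shortest form that round-trips
        y = json.loads(s)     # deserialization: decimal string -> double

        # The round trip is exact: both values have the same 64-bit representation.
        assert struct.pack("<d", x) == struct.pack("<d", y)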

  • arrow-tools

    A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet

  • If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.
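
    If pyarrow is already installed, roughly the same conversion can also be sketched in a few lines of Python (file names here are placeholders):

        # pip install pyarrow
        import pyarrow.csv as pa_csv
        import pyarrow.parquet as pq

        table = pa_csv.read_csv("data.csv")   # column types are inferred from the CSV
        pq.write_table(table, "data.parquet", compression="zstd")   # typed, columnar, compressed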

  • parquet-format

    Apache Parquet

  • > Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.

    The Parquet datatypes documentation is pretty clear that there is a flag, isAdjustedToUTC, that defines whether the timestamp should be interpreted as having Instant semantics or Local semantics.

    https://github.com/apache/parquet-format/blob/master/Logical...

    There's still no option to include a TZ offset in the data (so that the same datum could be interpreted with both Local and Instant semantics), but not bad really.
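
    For what it's worth, a small pyarrow sketch of how the two semantics come out in practice (column names and values below are made up): a timezone-aware Arrow column is written with isAdjustedToUTC = true (Instant semantics), a naive one with isAdjustedToUTC = false (Local semantics).

        # pip install pyarrow
        import pyarrow as pa
        import pyarrow.parquet as pq

        table = pa.table({
            # tz-aware column -> Instant semantics (isAdjustedToUTC = true)
            "event_utc":   pa.array([1700000000000000], type=pa.timestamp("us", tz="UTC")),
            # naive column  -> Local semantics (isAdjustedToUTC = false)
            "event_local": pa.array([1700000000000000], type=pa.timestamp("us")),
        })
        pq.write_table(table, "events.parquet")

        # The distinction survives the round trip, but an original offset such as +05:30
        # cannot be stored: only "UTC-normalized" or "local" can be expressed.
        print(pq.read_table("events.parquet").schema)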

  • fast_float

    Fast and exact implementation of the C++ from_chars functions for number types: 4x to 10x faster than strtod, part of GCC 12 and WebKit/Safari

  • > Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

    It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double. It's not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
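
    A rough way to see why parse speed matters, in Python (numbers are machine-dependent; the point is only that a floats-heavy payload spends a large share of its parse time converting decimal strings back into doubles):

        import json, random, time

        # roughly 20 MB of JSON that is almost nothing but decimal floats
        doc = json.dumps([random.random() for _ in range(1_000_000)])

        start = time.perf_counter()
        json.loads(doc)   # much of this is string -> double conversion
        print(f"parsed {len(doc)} bytes in {time.perf_counter() - start:.3f}s")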

  • rapidgzip

    Gzip Decompression and Random Access for Modern Multi-Core Machines

  • Decompression of arbitrary gzip files can be parallelized with pragzip: https://github.com/mxmlnkn/pragzip
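
    (pragzip has since been renamed rapidgzip.) A sketch following the usage shown in the project's README; the file name is a placeholder.

        # pip install rapidgzip
        import os
        import rapidgzip

        with rapidgzip.open("big.gz", parallelization=os.cpu_count()) as f:
            data = f.read()   # decompression work is spread across all available cores
        print(len(data))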

  • ClickHouse

    ClickHouse® is a real-time analytics DBMS

  • https://github.com/ClickHouse/ClickHouse/pull/45878

    Also, we still have optimizations for reading Parquet from S3 coming, so that might improve further.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Universal Data Migration: Using Slingdata to Transfer Data Between Databases

    2 projects | dev.to | 24 May 2024
  • Simplified API Creation and Management: ClickHouse to APISIX Integration Without Code

    3 projects | dev.to | 22 May 2024
  • We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions

    1 project | news.ycombinator.com | 2 Apr 2024
  • Erasure Coding versus Tail Latency

    1 project | news.ycombinator.com | 28 Mar 2024
  • Build time is a collective responsibility

    2 projects | news.ycombinator.com | 24 Mar 2024