> There isn't really a CSV standard that defines the precise grammar of CSV.
Did you read the link? It goes to a page literally titled "Parsing JSON is a Minefield".
JSON has a "precise" grammar intended to be human-readable. The end result is a mess, vulnerable to attacks that exploit differences between implementations.
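One concrete example of that divergence, sketched in Python: the stdlib parser happily accepts NaN/Infinity tokens that the JSON grammar doesn't define, so two parsers can disagree about the very same document.

    import json

    # Python's stdlib accepts these non-standard tokens by default;
    # stricter parsers (and the JSON RFC itself) reject them.
    print(json.loads("NaN"))       # nan
    print(json.loads("Infinity"))  # inf

    # Strictness has to be opted into via parse_constant:
    def reject(name):
        raise ValueError("non-standard JSON constant: " + name)

    try:
        json.loads("NaN", parse_constant=reject)
    except ValueError as e:
        print(e)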
Google put significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?
JSON.
If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.
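If you'd rather script the conversion, here's a minimal Python sketch using pyarrow (assuming `pip install pyarrow`; the file names are placeholders):

    import pyarrow.csv as csv
    import pyarrow.parquet as pq

    # Read the CSV; column types are inferred from the data.
    table = csv.read_csv("data.csv")

    # Write it out as Parquet (snappy-compressed by default).
    pq.write_table(table, "data.parquet")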
A date is confusing with a timezone (UTC or otherwise), and the docs make no such suggestion.
The Parquet data types documentation is pretty clear that there is an isAdjustedToUTC flag defining whether a timestamp should be interpreted with Instant semantics or Local semantics.
https://github.com/apache/parquet-format/blob/master/Logical...
Still no option to include a TZ offset in the data (so the same datum could be interpreted with both Local and Instant semantics), but not bad really.
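For what it's worth, pyarrow surfaces that flag through its type system: a timezone-aware Arrow timestamp is written with isAdjustedToUTC = true, a naive one with false. A minimal sketch (assuming pyarrow is installed):

    from datetime import datetime, timezone

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        # tz-aware type -> isAdjustedToUTC = true (Instant semantics)
        "instant": pa.array([datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)],
                            type=pa.timestamp("us", tz="UTC")),
        # tz-naive type -> isAdjustedToUTC = false (Local semantics)
        "local": pa.array([datetime(2023, 1, 1, 12, 0)],
                          type=pa.timestamp("us")),
    })
    pq.write_table(table, "timestamps.parquet")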
> Google put significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu
It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library (string -> double); it's not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
Decompression of arbitrary gzip files can be parallelized with pragzip: https://github.com/mxmlnkn/pragzip
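If I remember its README correctly, pragzip also ships Python bindings with a file-like interface; roughly (the package name and the parallelization argument are from memory, so treat this as a sketch):

    import os
    import pragzip  # pip install pragzip

    # Decompresses on all available cores; the returned object is
    # file-like, so it also supports fast seeking within the archive.
    with pragzip.open("example.gz", parallelization=os.cpu_count()) as f:
        data = f.read()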
https://github.com/ClickHouse/ClickHouse/pull/45878
Also, we still have optimizations for reading Parquet from S3 coming, so that might improve things further.