Building a faster hash table for high performance SQL joins

InfluxDB - Power Real-Time Data Analytics at Scale

Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

www.influxdata.com

featured

SaaSHub - Software Alternatives and Reviews

SaaSHub helps you find the best software and product alternatives

www.saashub.com

featured

hashtable-benchmarks

8 29 4.7 Java

An Evaluation of Linear Probing Hashtable Algorithms

Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks:
https://github.com/senderista/hashtable-benchmarks/blob/mast...
https://github.com/senderista/hashtable-benchmarks/wiki/64-b...

CBQN

4 301 9.6 C

a BQN implementation in C

Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed width allowing the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are then I'd expect dedicated table code for those sizes to be valuable). And set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea if the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH which seems good.
...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. And focused on small-ish (under a million entries) tables.
[0] https://mlochbaum.github.io/bencharray/pages/search.html
[1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...

InfluxDB

www.influxdata.com featured

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
QuestDB

311 13,852 9.7 Java

An open source time-series database for fast ingest and SQL queries

Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Normalizing Grafana charts with window functions

1 project | dev.to | 10 Jan 2024
How to increase Grafana refresh rate frequency

1 project | dev.to | 10 Jan 2024
Is all data time-series data?

1 project | dev.to | 22 Nov 2023
questdb: NEW Data - star count:12960.0

1 project | /r/algoprojects | 20 Nov 2023
questdb: NEW Data - star count:12960.0

1 project | /r/algoprojects | 19 Nov 2023

Building a faster hash table for high performance SQL joins

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Projects Language Database Bqn time-series
Post date: 20 Dec 2023

hashtable-benchmarks

CBQN

InfluxDB

QuestDB

Related posts

Normalizing Grafana charts with window functions

How to increase Grafana refresh rate frequency

Is all data time-series data?

questdb: NEW Data - star count:12960.0

questdb: NEW Data - star count:12960.0

Building a faster hash table for high performance SQL joins

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com Projects Language Database Bqn time-series Post date: 20 Dec 2023

hashtable-benchmarks

CBQN

InfluxDB

QuestDB

Related posts

Normalizing Grafana charts with window functions

How to increase Grafana refresh rate frequency

Is all data time-series data?

questdb: NEW Data - star count:12960.0

questdb: NEW Data - star count:12960.0

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Projects Language Database Bqn time-series
Post date: 20 Dec 2023