Building a faster hash table for high performance SQL joins

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • hashtable-benchmarks

    An Evaluation of Linear Probing Hashtable Algorithms

  • Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks:

    https://github.com/senderista/hashtable-benchmarks/blob/mast...

    https://github.com/senderista/hashtable-benchmarks/wiki/64-b...

  • CBQN

    a BQN implementation in C

  • Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed width allowing the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are then I'd expect dedicated table code for those sizes to be valuable). And set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea if the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH which seems good.

    ...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. And focused on small-ish (under a million entries) tables.

    [0] https://mlochbaum.github.io/bencharray/pages/search.html

    [1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • QuestDB

    An open source time-series database for fast ingest and SQL queries

  • Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Normalizing Grafana charts with window functions

    1 project | dev.to | 10 Jan 2024
  • How to increase Grafana refresh rate frequency

    1 project | dev.to | 10 Jan 2024
  • Is all data time-series data?

    1 project | dev.to | 22 Nov 2023
  • questdb: NEW Data - star count:12960.0

    1 project | /r/algoprojects | 20 Nov 2023
  • questdb: NEW Data - star count:12960.0

    1 project | /r/algoprojects | 19 Nov 2023