(EDIT!!! (08/27/2023): this post previously featured extremely incorrect data for the M1 GPU implementation, due to both a bug and poorly implemented tests. I have removed the data and cleaned up the discussion.)
(Edit: the title of this post was previously “Can we 10x Rust hashmap throughput?”)
tl;dr: Is it really 10x? Kinda, more or less. Mostly more. Especially with the help of disclaimers -- and GPUs!
Disclaimer: This post isn't really about Rust. Also, it won't come with pretty benchmark graphs, nor should readers take it as a strict, completely clean benchmark. I’ll be focusing more on “discussion”.
Hello, first blog post
Writing publicly is a bit scary -- whether or not it's code. I can already hear people talking about how "--" isn't/is a substitute for the correct dash to use. Or worse, how I don't know how to benchmark for shit (which, maybe, could be right). Or that what I'm writing about isn't new (which is 90% right).
But, as many useful sources have said, I should just write anyway, so here we are.
Another disclaimer: I usually spend my time with higher-level languages like Scala and Python. Don't get me wrong, it was fun making green threads and whatever in OS class back in the day, but I can't seem to ever want to go back to std::find(<insert lots of code here>). Sorry, was that spicy enough?
The issue with only using high-level languages is that I also really like the idea of tinkering with lower-level stuff. It's the best when we do questionable things to get something a bit more performant. I started playing with Rust for this purpose, but as I said, this post isn't really about Rust at all, aside from using its ahash::AHashMap as a sort of baseline.
Hashmaps
Digressions aside, let's talk about hashmaps. That ubiquitous data structure we all know and love from our leetcode interviews. (Is this the part where someone comments about terminology and that maybe I'd be better off talking about dictionaries?)
Considering how useful hashmaps/maps/dictionaries are, we programmers want them fast. You know what's better than a fast hashmap? A fast concurrent hashmap. And even better than that? A fast concurrent hashmap which gets even faster with an increasing number of threads. There’s a nice collection of benchmarks for concurrent hashmaps in Rust, listed here. The performance scaling is pretty good, but as expected, memory contention will stop us in our tracks at some point.
But it isn't all bad, because Rust's hashbrown[1] is exceedingly fast. Like, reading almost 200 Melem/s fast[2]. What's an Melem? A million elements. The numbers get a bit less fantastic when asking for elements which may or may not exist, with Rust coming in at around 90 Melem/s. Inserts are also far slower at 50 Melem/s, presumably because the hashmap is optimized for reads[3].
Anyway, so am I saying that I can read at 2000 Melem/s? Not quite -- but I can read at ~1000-1200 Melem/s, assuming I have to copy data to my GPU and back.
That's right, GPU-accelerated hashmaps!
To those new to the concept of using GPUs for things other than graphics, ML, or math: congrats, you've happened upon the awesomeness of general-purpose GPU computing (GPGPU)! Of course, this isn't the first time GPU hashmaps have been done and blogged about[4]. Even Nvidia wrote an article about it. But hold your horses, because I'll be adding a small twist (probably).
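For readers who haven't seen one of these before, the core design is surprisingly small: a flat array of key-value slots, open addressing with linear probing, and atomicCAS to claim a slot. The sketch below is not the code behind the numbers in this post -- it's a minimal CUDA sketch in the spirit of the Nvidia article, with some assumptions of mine baked in (MurmurHash3's finalizer as the hash, 0xFFFFFFFF as the "empty" key sentinel, and a power-of-two capacity).

```cpp
#include <cstdint>

constexpr uint32_t kEmpty = 0xFFFFFFFFu;  // reserved "no key here" sentinel

struct KeyValue {
    uint32_t key;
    uint32_t value;
};

// MurmurHash3's 32-bit finalizer; any decent mixer works here.
__device__ uint32_t hash_u32(uint32_t k) {
    k ^= k >> 16; k *= 0x85ebca6bu;
    k ^= k >> 13; k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k;
}

// One thread per key: claim a slot with atomicCAS, linear-probe on collision.
__global__ void insert_kernel(KeyValue* table, uint32_t capacity,
                              const uint32_t* keys, const uint32_t* values,
                              uint32_t n) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    uint32_t key  = keys[tid];
    uint32_t slot = hash_u32(key) & (capacity - 1);  // capacity is a power of two
    while (true) {
        uint32_t prev = atomicCAS(&table[slot].key, kEmpty, key);
        if (prev == kEmpty || prev == key) {
            // Concurrent inserts of the *same* key can race on the value;
            // fine for a benchmark-style sketch with (mostly) unique keys.
            table[slot].value = values[tid];
            return;
        }
        slot = (slot + 1) & (capacity - 1);
    }
}

// One thread per query: probe until we hit the key or an empty slot.
__global__ void lookup_kernel(const KeyValue* table, uint32_t capacity,
                              const uint32_t* keys, uint32_t* out, uint32_t n) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    uint32_t key  = keys[tid];
    uint32_t slot = hash_u32(key) & (capacity - 1);
    while (true) {
        uint32_t k = table[slot].key;
        if (k == key)    { out[tid] = table[slot].value; return; }
        if (k == kEmpty) { out[tid] = kEmpty; return; }  // miss
        slot = (slot + 1) & (capacity - 1);
    }
}
```

Every thread handles one key independently, so a whole batch of inserts or gets becomes a single kernel launch -- that's where the throughput comes from.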
But where are the numbers?
Okay, let's first look at the numbers, and then we can go over the gotchas.
Bench notes:
Using 32-bit keys and values, generated randomly between 1 and 2^32 - 1 (exclusive), i.e. 8-byte key-value pairs (a small sketch of this generation follows these notes).
Capacity (and load factor[5]) and the number of elements inserted/retrieved differ between implementations, mostly to highlight a more optimal scenario for each implementation, so we can see its throughput capabilities.
"Gets (~x%)" means using a batch of keys where around x% of the requested keys were contained in the map.
More bench details will be above the footnotes.[6]
Disclaimer: Please note that these numbers are meant to give an impression of performance, and their actual values should be taken with several grains of salt.
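For concreteness, here's roughly how I'd generate the test data described above -- a hypothetical helper, not the exact generator behind these numbers:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// 32-bit keys and values, drawn uniformly from [1, 2^32 - 2], roughly matching
// the "between 1 and 2^32 - 1 (exclusive)" note above. Avoiding 0 and 0xFFFFFFFF
// conveniently leaves them free as sentinel values on the GPU side.
void make_pairs(std::size_t n, std::vector<uint32_t>& keys,
                std::vector<uint32_t>& values, uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint32_t> dist(1u, 0xFFFFFFFEu);
    keys.resize(n);
    values.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        keys[i]   = dist(rng);
        values[i] = dist(rng);  // duplicates possible; fine for a throughput test
    }
}
```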
(Substack doesn’t support tables, so here’s an image + code block. Good luck to mobile readers!)
(Text version; all numbers in Melem/s)
| Processor    | Inserts | Gets (~0%) | Gets (~50%) | Gets (~100%) |
| ------------ | ------- | ---------- | ----------- | ------------ |
| M1 Max CPU   | 49.8    | 179.9      | 87.1        | 174.4        |
| 3090 (Linux) | 915.5   | 613.4      | 916.9       | 1163.2       |
GPUs?
(Note: In the first version of this post, I posted numbers for Metal on Apple’s M1 Max GPU. Unfortunately, the numbers were an egregious error! I’ve taken them out and edited this post accordingly.)
The comparison between CPU and GPU is quite good, but not quite 10x. The 3090 showed 6.6x read throughput at a 100% hit rate, though we did get 10x on the 50% miss workload. If I had to guess the reason(s), I'd say one major one is the lack of speculative execution on GPUs, which gives more consistent performance across workloads with different branch patterns, i.e. whether an element is missing or not.
For smaller batch sizes, it seems that the "transition" point between Rust's hashmap and our GPU hashmaps is around 2^15 elements per batch for reads[7]. As for value sizes larger than 4 bytes, while Rust's hashmap seems not to degrade, CUDA does (e.g. to only 173 Melem/s at 32 bytes per key-value pair).
But, let's talk about the most obvious gotcha: we're testing batch operations on the scale of a million+ key-value pairs per batch, with each pair being just 8 bytes (yet another gotcha, though not a strict requirement). I'm guessing most workloads using hashmaps don't map to this, but for those which can, it's clear that we can get at least an order of magnitude of extra throughput.
Servers?
Or, maybe we can get even more. The 3090 I'm using is handicapped by PCIe 4.0, since we have to copy data off of/onto main memory. Nvidia has already launched CPU+GPU hybrid "superchips" (GH200 Grace Hopper) for datacenters, and AMD is coming out with the MI300A. These chips also have way higher memory bandwidth than my 3090 (2x-4x). That's possibly something to think about the next time you need insane batch throughput, maybe for hydrating some in-memory data?
Speaking of server compute, I'm not sure how hashmaps fare in that environment, i.e. high core counts, big L3 caches, but lower clock speeds and possibly-shared cores, which would presumably cause cache issues. As for shared GPUs (if that's a thing at all?), I (mildly) doubt there would be much of an impact, due to the synchronization/compute model being inherently hyper-parallel (<insert handwavey hypotheses here>). Would be nice if some kind, knowledgeable reader let me know how all that works in the real world :)
Of course, the power draw and electricity cost (and hardware cost!) associated with this are quite high compared to using just one core. I should mention, though, that we're not saturating compute or memory bandwidth (nor, apparently, PCIe), or at least that's what a profiler told me. However, this kind of hashmap is threadsafe, so presumably it could be used somewhat efficiently as a concurrent hashmap shared across a multitude of CPU cores.
Even more GPU work?
But here's the other thing: all the numbers so far only use the GPU as a hashmap for CPU work. We see significantly better numbers for GPU-only performance, i.e. keeping data on the GPU without transferring back and forth from/to the CPU. With our 3090, on 2^27 elements and load factor = 0.8, we can get 1725.2 Melem/s (13.8 GBps) for inserts and 5174.0 Melem/s (41.3 GBps !!!) for reads.
However, I've also only shown numbers for 4-byte keys and values. What if we need larger, variable-length values like strings? Well, I also tested the (GPU-only) throughput of reading random ranges[8] of data from a big array, and the numbers were surprisingly good: 475.3 Melem/s (245.2 GBps). Something like that could be a subroutine for a hashmap which stores range indices into a backing array of values.
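The kernel I have in mind for that is tiny -- something like the sketch below, where each variable-length value is an (offset, length) pair into one big backing array and each thread block copies one range out. The names and layout are made up for illustration, not taken from the benchmarked code.

```cpp
#include <cstdint>

// One (offset, length) pair per variable-length value, indexing a backing array.
struct Range {
    uint32_t offset;  // start index in the backing array (in elements)
    uint32_t length;  // number of 4-byte elements
};

// One block per range; threads in the block copy it cooperatively so that
// global memory accesses stay mostly coalesced.
__global__ void gather_ranges(const uint32_t* backing, const Range* ranges,
                              const uint64_t* out_offsets, uint32_t* out,
                              uint32_t num_ranges) {
    uint32_t r = blockIdx.x;
    if (r >= num_ranges) return;
    Range range  = ranges[r];
    uint64_t dst = out_offsets[r];  // precomputed, e.g. a prefix sum of lengths
    for (uint32_t i = threadIdx.x; i < range.length; i += blockDim.x) {
        out[dst + i] = backing[range.offset + i];
    }
}
```

A string-ish hashmap could then store a Range as its value and call something like this to materialize the actual bytes.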
Fun fact: my 3090's PCIe 4.0 x16 connection (with a riser cable) was measured at only ~14 GBps by nvbandwidth, at least with memcpy coming from the host (i.e. CPU), despite the link having a theoretical max of 32 GBps. The things we do for a good-looking PC.
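If you want to sanity-check your own link without nvbandwidth, a crude version of the same measurement is just a big pinned-memory copy timed with CUDA events (my quick-and-dirty check, not nvbandwidth's methodology):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;  // 1 GiB
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);  // pinned host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // warm-up

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```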
Though I'm pretty new to this stuff, the idea of using a GPU to accelerate various things for shiggles is very exciting and somewhat hilarious. There's definitely a large landscape of unexplored and unexploited GPU acceleration. I do plan on putting my code in a GitHub repo sometime, after I get ready for the inevitable roasts and "doesn't work on my machine"s. That being said, there are significantly more advanced open source projects out there already, e.g. cuCollections and warpcore.
Thanks for stopping by this post, and if you know of an interesting use case for high-throughput maps, let me know!
Bench setup
General methodology:
Prepare hashmap and arrays of KV pairs (uniformly random), one for inserts and one for gets.
Measure time to insert/read all pairs.
Rust (M1 Max CPU)
Elements: 2^18
Capacity: Default
Notes:
Default load factor seems to be 0.875
~250 Melem/s “reads” if we use load factor <= 0.5 with ~0% hit rate
My 5600G (DDR4-3200 dual channel) on the Linux machine (with the 3090) was slightly slower, so I'm not including it.
3090:
Elements: 2^27
Capacity: 2^27 / 0.8 (load factor)[9]
Notes:
Using pinned host memory (cudaMallocHost) and volatile (see the sketch after these notes)
CPU is AMD 5600G
Performance remained fixed across a spectrum of power limits (~150W - 390W)
Should be considered as an unoptimized solution.
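For reference, the end-to-end measurement loop looks roughly like the sketch below. This is an approximation of the methodology above, not my actual harness; it reuses KeyValue, kEmpty, and insert_kernel from the sketch earlier in the post, and leaves out error checking and the volatile detail.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

float bench_inserts(const std::vector<uint32_t>& keys,
                    const std::vector<uint32_t>& values, uint32_t capacity) {
    uint32_t n = (uint32_t)keys.size();
    size_t bytes = n * sizeof(uint32_t);

    // Pinned host buffers so the PCIe copies run at full speed.
    uint32_t *h_keys, *h_vals;
    cudaMallocHost(&h_keys, bytes);
    cudaMallocHost(&h_vals, bytes);
    std::memcpy(h_keys, keys.data(), bytes);
    std::memcpy(h_vals, values.data(), bytes);

    uint32_t *d_keys, *d_vals;
    KeyValue* d_table;
    cudaMalloc(&d_keys, bytes);
    cudaMalloc(&d_vals, bytes);
    cudaMalloc(&d_table, capacity * sizeof(KeyValue));
    cudaMemset(d_table, 0xFF, capacity * sizeof(KeyValue));  // every slot -> kEmpty

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the H2D copies and the kernel together, so the transfer cost is
    // included in the Melem/s figure.
    cudaEventRecord(start);
    cudaMemcpyAsync(d_keys, h_keys, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_vals, h_vals, bytes, cudaMemcpyHostToDevice);
    insert_kernel<<<(n + 255) / 256, 256>>>(d_table, capacity, d_keys, d_vals, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    float melems = (n / 1e6f) / (ms / 1e3f);
    printf("inserts: %.1f Melem/s\n", melems);

    cudaFree(d_table); cudaFree(d_vals); cudaFree(d_keys);
    cudaFreeHost(h_vals); cudaFreeHost(h_keys);
    return melems;
}
```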
Footnotes
[1] I'm aware std has the same impl but with a secure hasher.
[2] With our benchmark parameters. It can be faster still, depending on the context.
[4] My major inspiration for investigating GPU hashmaps. The guy just goes and does it with some fantastically pedagogical code.
[5] "Load factor" means num_elements/capacity, i.e. how full the map is. Different maps perform differently across different load factors.
[6] Scroll up a bit for the bench setup details. Substack is hard to use for this stuff.
[7] Using load factor = 0.5; ~2^17 elements for writes.
[8] With 2^24 shuffled, non-overlapping ranges with random lengths uniformly distributed between 2 and 256.