We’re All CHAMPs On The Inside

Posted on March 31, 2025

HAMTs are cool; Haskell has a ubiquitous one in unordered-containers. But other languages have the even cooler CHAMPs. This post is about optimizing the filter-ish functions in haskell-champ, my colleague Marten’s attempt at a CHAMP-based drop-in replacement for unordered-containers.

Can I Eat It?

I’ll assume some familiarity with how HAMT works, but will summarize it briefly as a trie on hash codes, used to implement persistent hash maps and sets. Two asides before the code: contiguous is used everywhere in haskell-champ, and in this post’s snippets I’ll be eliding where functions come from, but you can probably safely assume they’re in that package somewhere. There’s also a Full node that is “full” and contains the maximum number of child nodes, which I’ve omitted for brevity. With that, a hash map based on HAMT may look like the following:
data HashMap k v
  = Empty
  -- ^ Used only to describe an empty hash map
  | Leaf Hash (k, v)
  -- ^ A singular key and corresponding value,
  -- along with the memoized hash of the key
  | Collision Hash (Array (k, v))
  -- ^ A hash collision node, where the hashes of all of
  -- its elements are indistinguishable.
  | BitmapIndexed Word32 (Array (HashMap k v))
  -- ^ The main juice: the bitmap encodes where in
  -- the array a key may be located, based on a segment
  -- of its hash.

The bitmap in BitmapIndexed encodes the hash segment relevant to the node. To give an example, let’s say we insert 5 and 13, with a fixed hash partition size of 3 bits. Using the identity hash function, which maps 5 to 0101 and 13 to 1101, both keys have the same first segment: hash & (2^3 - 1) == 101. We then check the next 3 bits, (hash >> 3) & (2^3 - 1), and find that they do differ: 1 for 13 and 0 for 5.

Structure of a hash map implemented using HAMT containing 5 and 13.
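
To make the bitmap bookkeeping concrete, here’s a small sketch of the arithmetic, keeping the toy 3-bit partition from the example. Names like segment, bitmapBit and arrayIndex are illustrative, not haskell-champ’s internals; real implementations typically use 5- or 6-bit partitions so the bitmap fills a whole word.

import Data.Bits (popCount, unsafeShiftL, unsafeShiftR, (.&.))
import Data.Word (Word32)

partitionBits :: Int
partitionBits = 3

-- The partition of the hash relevant to a node `shift` bits deep.
segment :: Int -> Word -> Word
segment shift h = (h `unsafeShiftR` shift) .&. ((1 `unsafeShiftL` partitionBits) - 1)

-- The bit this key occupies in the node's bitmap.
bitmapBit :: Int -> Word -> Word32
bitmapBit shift h = 1 `unsafeShiftL` fromIntegral (segment shift h)

-- The index of a present entry in the dense child array:
-- count the set bits below its bitmap bit.
arrayIndex :: Word32 -> Word32 -> Int
arrayIndex bitmap b = popCount (bitmap .&. (b - 1))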

As the name implies, CHAMP (Compressed Hash-Array Mapped Prefix-tree) uses a more compact representation, where the leaves are inlined where possible. To differentiate between inline entries and child nodes, two bitmaps are used instead of one. haskell-champ phrases this approximately as follows:
-- | Unpack the bitmap into a separate data type.
data MapNode k v
  = Compact Word64 (Array k) (Array v) (Array (MapNode k v))
  -- ^ This time we have a single bitmap, but split it in two.
  -- The first half shows which keys are present inline,
  -- while the second half shows which children should be
  -- checked for the key.
  | Collision (Array k) (Array v)

data HashMap k v
  = Empty
  | Singleton Hash k v
  -- ^ Reuse the word otherwise used for the bitmap
  -- to store the hash of the only key we have.
  | ManyMap Word (MapNode k v)
  -- ^ Also memoize the number of items in the hash map.

Same hash map, but now using CHAMP.
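
As a tiny sketch of that packed bitmap (which half is which is an assumption on my part; the real layout lives in haskell-champ):

import Data.Bits (unsafeShiftR)
import Data.Word (Word32, Word64)

-- The low half: which hash segments have their entry stored inline.
inlineBitmap :: Word64 -> Word32
inlineBitmap bm = fromIntegral bm

-- The high half: which hash segments have a child node to descend into.
childBitmap :: Word64 -> Word32
childBitmap bm = fromIntegral (bm `unsafeShiftR` 32)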

You may notice that the hash memoization isn’t present anymore. I’d encourage you to check out the paper, as the authors tested both the memoizing and non-memoizing forms. As a library, haskell-champ makes some additional nifty improvements, which I’d also encourage you to check out! What matters for this blog post is that sizes are memoized instead of hashes.

Gotta Start Somewhere

Before you optimize, you gotta do some groundwork, so you know what you’re measuring. We’ll be optimizing the filterWithKey function, with the approximate type signature and baseline implementation below, which goes via List. (I say approximate, because the real type signature is a bit more complicated, due to the way haskell-champ is written to support lifted/unlifted/boxed/unboxed datatypes.)
filterWithKey :: (k -> v -> Bool) -> HashMap k v -> HashMap k v
filterWithKey f = fromList . List.filter (uncurry f) . toList

We run a tasty-bench comparison of this baseline against unordered-containers’ variant, with the hash maps constructed out of [0..n) and even as the filter function. (I know, I know, microbenchmarks… There are definite biases here, and you can probably spot them easily: the hash map depth likely doesn’t get large enough to be significant, and even always halves the hash map. Still, it’s pretty nice to get an idea of how this approximately performs in at least this specific case. I’ll get back to this at the end.)
build n = force $ fromList [(x, x) | x <- [0 .. (n - 1)]]
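
For reference, a minimal tasty-bench harness along these lines might look as follows. Only the unordered-containers side is shown, since haskell-champ’s module name isn’t pinned down here, and n is an arbitrary size:

import Control.DeepSeq (force)
import qualified Data.HashMap.Strict as UC
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ env (pure (force (UC.fromList [(x, x) | x <- [0 .. n - 1]]))) $ \hm ->
      bench "unordered-containers/filterWithKey" $
        whnf (UC.filterWithKey (\k _ -> even k)) hm
  ]
 where
  n = 100000 :: Int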

Running the benchmarks gives an expected picture: the naive implementation is ~6x slower than unordered-containers and allocates ~22x more. Now the fun can start.

When Children Thrash Around

The first thing we have to do is plan how we’ll filter values out of the trie. As there are entries both inline and within children, we’ll have to enumerate both. Concretely, in order to maintain as compact a representation as possible:
  1. Go over the child nodes and run the filter across them.
    1. Remove any child nodes that become empty.
    2. When a child node only contains a single inline entry, bring it inline.
  2. Run the filter across the inline entries.
    1. If we have to inline any child nodes, do so on the output.
  3. Ensure the bitmap is kept in sync with the elements in each array.

We’ll go over these steps in turn. First, we have to go over the child nodes; this is the point where we make a recursive call. It is useful to realize that there are 3 distinct cases: the child node can be empty, contain a single entry (which we’ll then have to inline), or contain more than one element. In the last case, we’ll have to ensure that the child node maintains the correct structure, i.e. its bitmap is consistent with what it contains.
data RecursiveFilterOutput k v
  = EmptyChildNode
  -- ^ Everything in the child was filtered out.
  | InlineEntry k v
  -- ^ A single entry remains; inline it into the parent.
  | FilteredChild (MapNode k v)
  -- ^ More than one entry remains; keep the child node.

This looks a lot like how we defined our HashMap! With a bit of massaging, we can use HashMap itself and make a recursive call to the very filtering function we were defining, with no unrelated data types or conversions necessary. And because we use a unified representation for all cases, the equivalent of the InlineEntry case, Singleton, has a now-empty bitmap word it can actually use to store the hash of the key. Perfect, because we need that hash in order to insert the entry into the inline arrays. So we only calculate the hashes of keys we need to move up, and nothing else.

The compact structure of our nodes is useful for cache locality of the inline keys and values, but less so when we need to run a function over keys and values in tandem, now having to conditionally move elements into new arrays. One could phrase this as an incremental fold onto two new, empty arrays, akin to the following:
filterKeysValues f keys values =
  let check (ks, vs) ix k =
        let v = index values ix
         in if f k v
              then (insertAt ks (size ks) k, insertAt vs (size vs) v)
              else (ks, vs)
   in ifoldl' check (empty, empty) keys

This isn’t that great though. Inserting into the array dynamically resizes it every time it grows past what it was allocated for. Instead, we’d ideally allocate the right size immediately, copy the surviving references over, and avoid any resizes. We can do this by first executing f over all keys and values, capturing the result in a bitmask, and only afterwards copying from the original arrays into new ones using this bitmask.
bitmask f !keys !values =
  let createMask !acc !ix !k !v = if f k v then acc `setBit` ix else acc
      !mask = ifoldlZipWith' createMask zeroBits keys values
  in  mask

filterKeysValues f !keys !values =
  let !mask = bitmask f keys values
      !keys' = filterUsingMask keys mask
      !values' = filterUsingMask values mask
  in (mask, keys', values')
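
filterUsingMask isn’t shown above; one possible shape for it (a sketch, not haskell-champ’s actual code) allocates the destination exactly once, since popCount tells us the number of survivors up front:

{-# LANGUAGE BangPatterns #-}

import Data.Bits (popCount, testBit)
import Data.Primitive.Contiguous (Array, index, new, run, size, unsafeFreeze, write)

filterUsingMask :: Array a -> Word -> Array a
filterUsingMask arr mask = run $ do
  dst <- new (popCount mask)
  let go !src !di
        | src >= size arr = unsafeFreeze dst
        | testBit mask src = do
            -- This slot survived the filter: copy it and advance both indices.
            write dst di (index arr src)
            go (src + 1) (di + 1)
        | otherwise = go (src + 1) di
  go 0 0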

These new key and value arrays are smaller than the originals, having lost some of the entries they previously contained inline. The deleted entries also need to have their bits removed from the node’s bitmap. The naive approach would go over all the keys we kept (or deleted) and set (or unset, respectively) the corresponding bits in the bitmap; because we don’t memoize hashes, that means recomputing the hash of every key.
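
That naive rebuild would look something like the sketch below, reusing the illustrative segment arithmetic from earlier (again not haskell-champ’s code; the point is the redundant hash per key):

import Data.Bits (setBit, zeroBits)
import Data.Hashable (Hashable, hash)

naiveBitmap :: Hashable k => Int -> Array k -> Word32
naiveBitmap shift keys = foldl' step zeroBits keys
 where
  -- Rehash every surviving key just to find its bit again.
  step bm k = bm `setBit` fromIntegral (segment shift (fromIntegral (hash k)))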

One thing to note is that the relative indices in the bitmap denote the locations in the actual contiguous arrays. So now that we have a boolean mask, we can overlay it on the set bits in the bitmap and get the bitmap of the filtered items, all without hashing!
maskBitmap !booleanMask !bitmap =
  go booleanMask bitmap 0 0
 where
  -- No bits left in the bitmap: done.
  go !_ !0 !acc !_ = acc
  -- Mask exhausted: every remaining entry was filtered out.
  go !0 !_ !acc !_ = acc
  -- Bitmap bit unset: no entry here, so don't consume a mask bit.
  go !booleanMask !bitmap !acc !ix | bitmap .&. 1 == 0
    = go booleanMask (bitmap !>>. 1) acc (ix + 1)
  -- Entry present but filtered out: consume both bits, leave acc unset.
  go !booleanMask !bitmap !acc !ix | booleanMask .&. 1 == 0
    = go (booleanMask !>>. 1) (bitmap !>>. 1) acc (ix + 1)
  -- Entry present and kept: set its bit in the result.
  go !booleanMask !bitmap !acc !ix
    = go (booleanMask !>>. 1) (bitmap !>>. 1) (acc `setBit` ix) (ix + 1)
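
As a small worked example, traced by hand: four inline entries sit at bitmap positions 0, 2, 3 and 5, and the boolean mask says the third entry was dropped.

-- mask bit 2 being unset means the third entry, at bitmap
-- position 3, was filtered out:
--
-- >>> maskBitmap 0b1011 0b101101 :: Word
-- 37
--
-- 37 == 0b100101: exactly position 3 was cleared, no hashing involved.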

Wrapping these up leads to the following nicer picture.

Nothing But Lies

Benchmarking is always very tricky. At this point I had had my fill for the moment and stopped looking for improvements; I made the commit and pushed it. The unfortunate thing about microbenchmarks, though, is that minor deviations can give very different pictures. While the graphs above show that we’re pretty close, just changing the key type from Int to ShortText turns the -11% improvement we had into +72%, relative to unordered-containers.

Even better, the fact that the hash map in the benchmark is constructed out of fromList [(x, x) | x <- [0 .. (n - 1)]] is already problematic in combination with even as the filter. Changing the benchmark so that the keys and values are uniformly selected 64-bit integers shows a picture closer to the ShortText run.

So, while I ended up with a nice improvement relative to what I started with, it isn’t an improvement on the status quo, so my enthusiasm has calmed down a bit. Pretty fun exercise in the world’s best imperative programming language, though.