Wandering in a simulated wood: evaluating cache oblivious search trees
Part 1: An investigation of cache oblivious search trees
Recently, I had the chance to take my septennial summer vacation/mini-sabbatical. I decided to spend it on some backburner projects, among them, data structures! In my experience in the field, there tends to be a big divide between researchers and engineers, with researchers designing lofty new data structures while engineers dig up yet another nifty optimization on... B-Trees and Hashmaps. It's rare that you get a chance to use funny data structures in the wild, so for vacation I decided to brush off an old favorite: the Cache Oblivious B-Tree.
AI disclaimer:
No AI was consulted in the creation of this article; this is handcrafted artisanal nonsense.
Why we aren't using them yet, but maybe we should be?
The chance to find a new, general-purpose data structure is nifty, and one I always thought was underappreciated is the cache-oblivious B-Tree: a data structure that serves the same role as a B-Tree and, in theory, performs better in the places where B-Trees struggle. However, it's hampered by two things: 1. Sheer complexity: it's harder to understand than a B-Tree, which is not a simple data structure to start with. 2. Constant-factor performance: most notably, the paper "Optimal Memory Layout for Binary Search" did an empirical evaluation of various main-memory layouts for search data structures and found that the layout used in cache-oblivious B-Trees actually performs worse than the classic layouts at the scale tested. They do leave open the question of whether larger workloads can make up that ground.
I also think that, with the cloud and the diversification of secondary storage media, it's worth taking another look at cache-oblivious and similarly generalized structures. Historically, main memory page size was fixed at ≈4096 bytes, with disk blocks presumed to be that size or smaller. This is already changing with modern storage media (SSDs can have erasure blocks in the MBs, and SMR causes HDDs to behave similarly), while disaggregated cloud storage means that the behavior of storage devices could change out from under you, making it impossible to pre-select an optimal size. Wouldn't it be nice if we didn't have to make a choice?
How to slice a binary search tree into fractally caching chunks
Understanding this well was one of the main goals of this experiment. It's laid out in a few papers ("Cache Oblivious Search Trees via Binary Trees of Small Height", which this is mostly based on, and "Cache Oblivious Streaming B Trees"), but it can be a little hard to fit in your head because it is defined differently from other inductive search trees. It also relies on a couple of high-level concepts which I'll explain below; feel free to skip over those sections if they are basic to you.
Main memory and secondary memory. Optimizations == packing bits close together into pages.
Memory consists of ≈2 kinds (yes, I'm ignoring CPU cache; that counts as main memory for this purpose):
The combination of all these layers we'll call the memory hierarchy (whether the "memory hierarchy" refers to the entire thing or just the hierarchy within main memory depends on the subfield discussing it; I'm going with the whole thing for this article because it's closest to the cache-oblivious algorithm usage of "memory"). Transferring data between layers in the hierarchy is done in fixed-size segments (called "lines," "pages," or "blocks" depending on the specific kind of memory; I think some other phenomena, such as disk seek time, are morally equivalent to a single-element cache), the size of which depends on the implementation. I'll be calling the segments "pages" out of habit throughout the rest of this post.
A major factor in optimizations is fitting access patterns to the discreteness and inflexibility of these transfers; if you're going to transfer e.g. 4KiB of data when reading a page into memory anyway, it's pretty inefficient if you then only make use of 8 bytes of it.
Better memory-locality; perfect if you know the page size.
Historically the data structure for on-disk data, but recently making inroads into main memory as well.
A B-Tree is what you get when you take the concepts behind a binary search tree and optimize them for page size. If you know that you're bottlenecked on a layer of memory that does reads at 4KiB granularity, then you want each node of the tree to be 4KiB, maximizing the information gathered per read. This matters! If you have 8-byte (64-bit) elements and pointers, a binary search tree can require 8 times as many transfers as the B-Tree (a B-Tree takes log_B(N) transfers while a binary search tree takes log_2(N), and log_2(N)/log_256(N) = 8).
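As a quick back-of-the-envelope check of that math, here's a tiny sketch; the dataset size, node size, and entry size are arbitrary assumptions chosen so the branching factor comes out to 256:

```python
import math

N = 1_000_000_000                 # one billion elements (arbitrary)
NODE_BYTES = 4096                 # assumed page / B-Tree node size
ENTRY_BYTES = 16                  # 8-byte key + 8-byte pointer
B = NODE_BYTES // ENTRY_BYTES     # branching factor, 256 here

bst_transfers = math.log2(N)      # one transfer per binary-tree level, worst case
btree_transfers = math.log(N, B)  # one transfer per B-Tree level

print(f"binary search tree: ~{bst_transfers:.0f} transfers")
print(f"B-Tree (B={B}):     ~{btree_transfers:.1f} transfers")
print(f"ratio:              ~{bst_transfers / btree_transfers:.1f}x")  # log_2(256) = 8
```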
B-Trees work great if you know the limiting page size, but what if you don't?
Cache-oblivious data structures are a class of data structures designed to optimize memory transfers in a more general way (they were, unsurprisingly, originally useful for matrix multiplication): they have an asymptotically optimal number of transfers regardless of the size of the transfer or the number of layers in the memory hierarchy, and it so happens that there's a cache-oblivious equivalent to the B-Tree (ok, technically a family of data structures, which I'm going to talk about in generalities).
Fractal locality for efficient reads.
The van Emde Boas (vEB) layout is the basic building block for cache-oblivious B-Trees. At a high level, the vEB layout approximates a B-Tree search for every feasible memory transfer size (note the "every feasible"; that's an important caveat, and we'll come back to it).
To lay out a data set in vEB layout, first you build the (logical) complete binary search tree for the data set. You then divide this (logical) binary search tree in half by height, recursively lay out the top tree in vEB layout, and then lay out each of the bottom trees in vEB layout.
Cache oblivious search trees are contiguous regions of memory with a fractal-ish internal structure; at every scale the array consists of a (Cache Oblivious) B-Tree node followed by its children.
This is easiest understood with some pictures, so here's a complete binary search tree of height 5:
If you want to lay this out flat in memory there's a number of ways to do it. You may be familiar with the in-order layout:
or breadth-first layout:
And here's the vEB layout we're going to build:
To lay out a tree in vEB layout you start by dividing it in half by height:
And placing the tree in the top half
into the front of your memory buffer
in order to accomplish this we divide it in half, and lay out the top half in vEB order...
until our tree consists of a single node:
at this point we write the node to its place in the array
and lay out each of the trees from the bottom half
which we place in the remaining memory for that division
in vEB layout. Since the bottom trees are also single-element, placing them is once again as simple as writing them to their cells:
Having once again finished a top tree, we go back up the stack and place the bottom trees from the previous division, which, being single nodes, is again trivial
so we go back up the stack and start placing another set of bottom trees (Pictured here being laid out in parallel).
Since we're placing them in vEB layout, we once again divide the tree(s) by height
lay out the top tree(s)
and then the bottom trees (each node below is its own tree, I just ran out of colors)
Thus completing the tree.
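To recap the construction in code, here's a minimal sketch under a couple of assumptions: the tree is complete, nodes are addressed by their breadth-first index (the children of node i are 2i+1 and 2i+2), and when a height is odd the extra level goes to the top tree (which half gets it varies by presentation).

```python
def veb_order(height, root=0, out=None):
    """Append the BFS indices of a complete binary tree of `height`,
    rooted at BFS index `root`, in van Emde Boas order."""
    if out is None:
        out = []
    if height == 1:
        out.append(root)               # a single node is written directly
        return out
    top = (height + 1) // 2            # height of the top tree
    bottom = height - top              # height of each bottom tree
    veb_order(top, root, out)          # lay out the top tree first...
    # ...then each bottom tree, left to right; their roots are the nodes
    # sitting `top` levels below `root`.
    for k in range(2 ** top):
        veb_order(bottom, (root + 1) * 2 ** top - 1 + k, out)
    return out

order = veb_order(5)   # the 31 BFS indices of a height-5 tree in vEB order
# To materialize the array: cell i of the vEB array holds the element that
# would sit at BFS position order[i] in the logical tree.
```

A real implementation also has to handle trees that aren't complete and support inserts and rebalancing, which is where most of the extra complexity mentioned earlier lives.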
To see why this layout is useful, let's compare a search in the tree for the value in node #17 in both vEB layout and the standard breadth-first (BFS) layout on a system with an arbitrarily chosen fetch size of 3.
Both of them start out reading the same nodes from the tree: the root and its two children.
The second read is also similar: both of their reads of the grandchild labeled 20 also fetch a pair of irrelevant nodes.
But at the third read we start to see a difference as we need to visit the left child of 18.
For the vEB layout the buffer already has the data due to its locality, but in the BFS layout it just has a couple of irrelevant nodes from the third level, so we will require another read to finish the search.
This is the common pattern with the vEB layout: it packs nodes together so that fetching a node is very likely to fetch its children, regardless of what size fetch you happen to use. Because binary search tree lookups go from parent to child, that is more useful than fetching siblings or other nodes.
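To make that concrete, here's a small sketch that reproduces the walkthrough above: it repeats the veb_order sketch from the previous snippet so it runs on its own, assumes fetches are aligned blocks of 3 cells, and counts how many distinct blocks the search for 17 touches in each layout.

```python
def veb_order(height, root=0, out=None):
    # Same sketch as above: BFS indices of a complete tree in vEB order.
    if out is None:
        out = []
    if height == 1:
        out.append(root)
        return out
    top = (height + 1) // 2
    veb_order(top, root, out)
    for k in range(2 ** top):
        veb_order(height - top, (root + 1) * 2 ** top - 1 + k, out)
    return out

def blocks_touched(path, position_of, block_size=3):
    """Count the distinct aligned blocks read while visiting the BFS indices
    in `path`, given a map from BFS index to cell position in the array."""
    return len({position_of[node] // block_size for node in path})

HEIGHT = 5
bfs_positions = {i: i for i in range(2 ** HEIGHT - 1)}    # breadth-first layout
veb_positions = {node: cell for cell, node in enumerate(veb_order(HEIGHT))}

# Searching for 17 among the values 1..31 follows 16 -> 24 -> 20 -> 18 -> 17,
# which in BFS indices is:
path = [0, 2, 5, 11, 23]

print("BFS layout blocks read:", blocks_touched(path, bfs_positions))  # 4
print("vEB layout blocks read:", blocks_touched(path, veb_positions))  # 3
```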
This layout is asymptotically optimal but imperfect. The specific size of the tree determines whether it can be subdivided into optimal chunks, or whether there are inefficient chunks that don't fill the fetch size. For instance, our height-5 tree had a bunch of one-node chunks in the middle, making reads involving those nodes and their children less efficient.
The layout is most efficient when you fetch a complete subtree each time. That means that the smallest fetch size we can optimize for is 3.
Due to the way in which the vEB layout is constructed by repeatedly dividing trees in half by height, if you have a tree of a given height, the next tree that'll have subtrees of that size is the one with double the height.
So for fetches of size 3, you get a perfect sequence of fetches for trees of heights 2 (size 3), 4 (size 15), 8 (size 255), 16 (size 65535) etc. That imposes an awkward constraint, to say the least.
For any tree whose size isn't of the form 2^(2^n) − 1 you'll end up with some of your fetches reading incomplete trees and thus wasting some of your memory bandwidth. So too with other sizes: for fetches of 7 the tree must be of size 2^(3·2^n) − 1, for fetches of 15 it must be 2^(4·2^n) − 1, etc. In addition to that, the fetch sizes themselves are a bit awkward, being of the form 2^X − 1: as an industry we tend to manufacture things using powers of 2 or multiples of 10, meaning you're practically guaranteed to not be right-fit (which is one of the things cited by "Optimal Memory Layout for Binary Search" as leading to suboptimal performance in main memory).
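To make the pattern concrete, here's a tiny sketch (using the same halving convention as the layout snippets above) that prints the tree sizes that subdivide perfectly for a few fetch sizes:

```python
# A fetch that holds a complete subtree of height h covers 2**h - 1 nodes, and
# the vEB recursion only bottoms out in height-h pieces when the total tree
# height is h * 2**n, i.e. when the tree size is 2**(h * 2**n) - 1.
for h in (2, 3, 4):
    fetch = 2 ** h - 1
    sizes = [2 ** (h * 2 ** n) - 1 for n in range(4)]
    print(f"fetch size {fetch:>2}: perfect tree sizes {sizes}")
```

For fetch size 3 that prints 3, 15, 255, 65535, matching the sequence above.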
That said, if the fetch would otherwise be reading arbitrary nodes any locality would be a win, so the tree need not be perfect to be worthwhile.
Part 2: Simulation-Testing Cache Oblivious Trees
Now that we've gone over how Cache Oblivious Trees work we can start to examine how they perform in practice. Previous work in this area is sparse and somewhat dated; e.g. Memory Layouts for Binary Search - V2 found that constant factors dominate for small in-memory workloads. "Cache Oblivious Search Trees via Binary Trees of Small Height" found Cache Oblivious Trees competitive... against a B-Tree array... on a Pentium. I'm not sure how to generalize these results to, say, a database in the modern era.
It would be good to evaluate how much of a performance difference there would be in practice between a regular and a cache-oblivious B-Tree, and whether there are regimes in which the difference is significant enough to be worthwhile.
A digression into real engineering
If you want to determine the performance of a system, the ideal is to use a large, realistic benchmark that's similar in characteristics to the practical stuff you're going to use the system for. The gold standard is running the actual production workload on the versions of the system you want to evaluate and seeing how they perform in practice. Of course, this requires that you have a production workload, can duplicate it, and, more subtly, that the workload will remain self-similar enough that testing on the current iteration will be predictive of future needs. Absent this, you want the most realistic workload you can reasonably generate that will be predictive of the system or subsystem's performance on the workloads of interest (there's a whole long aside I could get into on how to do predictive benchmarking, which I'm working very hard to avoid getting rabbitholed in).
Benchmarking for fun and no profit
I want my benchmarks to run fast. Databasing workloads that require secondary memory tend to be large. Even small computers can have GiBs of main memory, and ones typically used for databasing can have 10s of GiB or more (to take a fairly arbitrary example, as of October 21st 2025, an AWS EC2 m8gd.4xlarge instance has 64GiB main memory and 950GB of SSD for around $8082 per year at on-demand prices). Running a realistic benchmark would take hours (at 50µs per element, and yeah, I know, it'd take on the order of 20 days to do a TiB run), if not days or weeks, and running anything like a real experiment would take months. Since I'm not running this on a production system and mostly want an intuition for how cache-oblivious trees would work on databases in practice, small and fast is the way to go.
There are a few ways to get small and fast, but most of them take us out of the interesting domain. Shrinking the dataset, for instance, would take us out of the regime where we expect to see interesting effects from cache size (the dataset would be too small and cacheable to get interesting timing differences, and I don't want to have to look at hardware cache counters to guess how things are doing). And if we're already going to make sketchy assumptions about how our benchmarks generalize, we might as well get the most precise information we can. So, my answer is simulation!
Use the data you have to get the data you want
The big advantage of simulation is that we can pick the kinds of data we gather to answer the question at hand. In particular, the common performance assumption for search trees is that performance is entirely contingent on the ratio of tree size (nodes) to size of main memory to size of the fetch (nodes : main_memory : fetch).
For any search tree, the top few levels stay permanently in cache, since they are relevant to any search you want to make (caveat: in practice it's a little complicated, since search trees are found in search forests, and skewed query patterns, e.g. a particularly hot branch, can change that considerably; this line of reasoning only applies to randomized workloads). If the tree is large, some nodes will always be out of cache. From a memory hierarchy perspective, the idealized search looks like hitting N tree levels in cache and then incurring M misses once it gets out of the precached levels. How many misses depends on the size of the nodes relative to the fetch size, and on the locality of the nodes; the greater the number of relevant nodes you can retrieve per fetch, the fewer misses you incur.
Simulating performance therefore looks like picking a desired B-Tree miss rate and fetch size, then comparing how the vEB layout performs on the same workload. Then you can compare the difference in number of fetches needed at a given level (usually the last level dominates) to get an indication of the performance difference between the data structures.
To implement this, I wrote up some reasonable implementations of the B-Tree and cache-oblivious data structures and ran them over a simulated LRU cache hierarchy. Real caches often use more complicated strategies, but those strategies would be defeated by randomized search, so LRU should be as good as the next thing. I inserted randomized data and ran randomized searches to check how performance varied over different sizes and hierarchies (one big caveat is that I used real pointers provided by the allocator rather than manufacturing my own; I did this because how much locality the B-Tree gets for a given allocation pattern can vary a lot, and I didn't want to overly penalize it or favor it, so this seemed like a good way to avoid biasing that choice).
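To sketch what I mean by a simulated LRU cache hierarchy (this is an illustration of the idea, not the code in the repo; the class and parameter names are made up), each level is just an LRU set of page numbers with hit/miss counters, and the hierarchy is consulted in order until some level hits:

```python
from collections import OrderedDict

class LruLevel:
    """One simulated cache level: `pages` slots of `page_size` bytes each."""
    def __init__(self, page_size, pages):
        self.page_size, self.pages = page_size, pages
        self.resident = OrderedDict()        # page number -> None, in LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        page = addr // self.page_size
        if page in self.resident:
            self.resident.move_to_end(page)  # refresh its LRU position
            self.hits += 1
            return True
        self.misses += 1
        self.resident[page] = None
        if len(self.resident) > self.pages:
            self.resident.popitem(last=False)  # evict the least recently used page
        return False

def access(hierarchy, addr):
    """Touch each level in turn, stopping at the first one that hits."""
    for level in hierarchy:
        if level.access(addr):
            return

# e.g. a 64B:1000, 4K:100 hierarchy in the notation used below
hierarchy = [LruLevel(64, 1000), LruLevel(4096, 100)]
```

The data structure implementations then just funnel every node access through something like access(), and the per-level hit/miss counts fall out at the end of the run.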
You can check this out yourself at https://github.com/JLockerman/Cache-Oblivious-Search-Tree-Sim. At a high level, you give it a list of <page size>:<number of pages> entries that describe the hierarchy (e.g. 64:1000,4K:100,100K:1) and it'll run a specified number of inserts and/or reads on the hierarchy and report the number of hits and misses at each level.
You don't really need data
You can even go one step further and leave out the data entirely. For a cache-oblivious B-Tree, the memory behavior is a deterministic function of the start address and the search path, so if you can generate those in a consistent way you can get cache traces without the actual data structure.
Regular B-Tree reads are a bit more complicated because you need to generate node addresses in a consistent manner in order for repeated reads to a given node to cache correctly, but you can do this pretty simply by generating child addresses as a function of their parent's address. This technique would theoretically allow one to simulate much larger datasets for any given runtime while still providing insight into the constant factors that we're trying to examine. In practice, I never got the kinks quite worked out enough to be confident in the data quality...
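A sketch of the kind of trick I mean (the hash and the node size here are arbitrary choices for illustration, not what the repo does): derive every child's address deterministically from its parent's, so that repeated traversals of the same logical node always touch the same simulated page without any real nodes being stored.

```python
import hashlib

NODE_BYTES = 64  # pretend each node occupies 64 bytes (arbitrary)

def child_addr(parent_addr, child_index):
    """Deterministically derive a child node's simulated address from its parent's.

    The same (parent, child) pair always maps to the same address, so a
    simulated cache sees consistent reuse across repeated searches.
    """
    h = hashlib.blake2b(f"{parent_addr}:{child_index}".encode(), digest_size=8)
    return (int.from_bytes(h.digest(), "little") // NODE_BYTES) * NODE_BYTES
```

A simulated search then just starts from a fixed root address, repeatedly calls child_addr along a random path, and feeds each address to the cache hierarchy.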
There's a very large parameterization space for these caches; see Cache Oblivious B-Tree Data 📈 for some example runs. I was hoping the differences between the data structures would be a bright line across some of those parameters, but in practice it ended up being fairly fiddly.
Both data structures show nice logarithmic curves for the number of reads and misses performed. This isn't a surprising result but does help validate that I implemented the vEB layout correctly.
Example graph of simulated disk reads (y-axis) versus number of elements (x-axis) in a fairly extreme case, for each data structure. Blue is vEB, red is B-Tree.
The B-Tree performs meaningfully better for cache page sizes that match its node size. Again, not surprising per se (it's what you expect to see based on asymptotic analysis and the prior work in main-memory performance), but nice to confirm. Eyeballing it, it seemed to me that a 4KiB node-size B-Tree would generally have 1 fewer miss per op in a level right-sized to it.
Due to the above, if we assume a right-sized B-Tree page size, the cache oblivious tree would perform strictly worse for entirely cacheable workloads.
The cache-oblivious tree tends to outperform the classic B-Tree on cold fetches, if the cold read size is large enough. In general, it seems that once an operation is going to miss multiple pages of data, the increased locality of the cache-oblivious tree becomes noticeable. This only generalizes to very large datasets or very small caches, so the data structure seems pretty niche for any kind of workload that can be cached in main memory, and with modern machine sizes, that is a lot of them.
There is still the possibility that a hybrid data structure would be the best of both worlds. One example I could see being interesting is a vEB-layout tree that bottoms out in ≈page-sized B-Tree nodes instead of individual elements; this should have better cache performance in main memory, and potentially defer expensive vEB rebalancing to larger data sizes where the increased locality starts to pay off.
That said, even in the cases where the vEB layout performs better, it's not obvious that the complexity is worth the result, because it doesn't usually seem to change a workload that incurs misses into one that doesn't; in the best cases I've seen, the layout reduces a ≈2-miss workload to a ≈single-miss one.
All of the above suggests that the data structure only pays off if you really care about the total number of disk accesses, either because you care about latency beyond the order of magnitude, you're charged per disk access, or you perform enough disk accesses that the aggregate increase would push your workload into a new performance class.
So in practice, a cache-oblivious tree may perform better than a classic B-Tree if you have sufficient cache misses, your secondary-memory accesses and dataset are large enough to get additional locality out of the cache-oblivious tree, and you're not already getting that locality from the happenstance of B-Tree node placement. That all suggests that the cache-oblivious tree will be more effective in cases of very large datasets or small caches, though in the latter case constant-factor differences may dominate.
The carcinization of the database world...
If there's no good way to do rebalancing except for rebuilds anyway, the tempting solution is to organize the cache-oblivious trees as a Log Structured Merge (LSM) Tree. This would give the benefits of immutability for the trees in the common case, allow you to defer the rebuilds to background work, and use a more CPU-friendly data structure for recent in-memory data. It'd need measurement to see if it had sufficient benefit at that point. (I thought I saw a paper suggesting the same idea but I can't seem to find it now. Maybe I'm misremembering COLA from "Cache Oblivious Streaming B-Trees"?)
I feel like simulation is underutilized in our field. With the exception of networking and some of the newer deterministic simulation testing stuff, it seems like our attempts to factor programs into separable subsets are still rudimentary, mainly happening at API boundaries. There are some attempts at simulated time (with the causal testing work being a particularly notable example, working as it does at a different level of access to everyone else), but very few attempts to simulate things like space usage or contention. Doing a mediocre analysis over a month can be more useful than a perfect one that takes years.