A hash table is a data structure for efficient lookups. It is referred to as unordered_map in C++, map in Go, dict in C/Redis, and HashMap in Rust, among others.
This report will provide a comprehensive summary of its implementation details
and best practices for its use.
The primary motivation for using this data structure is its efficiency in searching, also known as lookup (see TAOCP Vol. 3). For search-heavy problems, it requires fewer machine-level instructions than alternatives such as binary trees, while keeping space consumption reasonable; TAOCP Vol. 3 works through concrete instruction counts. Ultimately, the focus is on speed.
A notable example of this is the use of perfect hash tables for keyword lookups in compiler lexers (see Go's keyword map). If performance is a critical factor in your workload, it is worth optimizing the hash table extensively.
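As a minimal sketch of the keyword-lookup idea: a lexer can resolve identifiers against a fixed, precomputed table in constant time. The keyword set and function names below are illustrative; Go's own scanner uses a similar precomputed map (token.Lookup), and a compiler using perfect hashing avoids even the collision handling a general-purpose map needs.

```go
package main

import "fmt"

// keywords is a fixed table built once; lookups never modify it,
// which is exactly the setting where perfect hashing shines.
var keywords = map[string]bool{
	"func": true, "return": true, "if": true, "else": true, "for": true,
}

// lookupIdent reports whether an identifier is a reserved keyword.
func lookupIdent(ident string) bool {
	return keywords[ident]
}

func main() {
	fmt.Println(lookupIdent("func"))   // true: reserved keyword
	fmt.Println(lookupIdent("foobar")) // false: plain identifier
}
```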
Chaining

The solution to question 1 could involve chaining. In this approach, the hash value determines the slot position in the backing array, and that slot stores references to all items that hash to it, typically linked through a separate data structure such as a linked list. A key assumption is that hash values are uniformly distributed, so each bucket's list stays short and lookups remain efficient even in the worst case. Alternatively, a tree-based structure can replace the linked list to improve worst-case lookup performance (Java's HashMap converts long buckets to red-black trees), though this is less commonly adopted.
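A hedged sketch of chaining, with illustrative names and a simple FNV-1a hash (real implementations use stronger, often seeded, hash functions):

```go
package main

import "fmt"

// entry is one node in a bucket's linked list.
type entry struct {
	key  string
	val  int
	next *entry
}

// ChainedMap is a minimal chained hash table for illustration only.
type ChainedMap struct {
	buckets []*entry
}

func NewChainedMap(nbuckets int) *ChainedMap {
	return &ChainedMap{buckets: make([]*entry, nbuckets)}
}

// hash maps a key to a bucket index using FNV-1a.
func (m *ChainedMap) hash(key string) int {
	h := uint32(2166136261)
	for i := 0; i < len(key); i++ {
		h ^= uint32(key[i])
		h *= 16777619
	}
	return int(h % uint32(len(m.buckets)))
}

// Put inserts or updates; colliding keys share one bucket list.
func (m *ChainedMap) Put(key string, val int) {
	i := m.hash(key)
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == key {
			e.val = val
			return
		}
	}
	m.buckets[i] = &entry{key: key, val: val, next: m.buckets[i]}
}

// Get walks the bucket's list; with uniform hashing the list stays short.
func (m *ChainedMap) Get(key string) (int, bool) {
	for e := m.buckets[m.hash(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.val, true
		}
	}
	return 0, false
}

func main() {
	m := NewChainedMap(8)
	m.Put("a", 1)
	m.Put("b", 2)
	v, ok := m.Get("a")
	fmt.Println(v, ok) // 1 true
}
```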
A common solution to question 2 is rehashing (explained later), which is triggered once the load factor exceeds a chosen threshold (see Java's HashMap):
num of items in table
----------------------- = load factor > threshold
num of buckets in table
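The formula above translates directly into a resize check. The 0.75 threshold mirrors Java's HashMap default; other tables pick different values, and the function name here is illustrative:

```go
package main

import "fmt"

// needsRehash applies the load-factor test:
// items / buckets > threshold means the table should grow.
func needsRehash(numItems, numBuckets int, threshold float64) bool {
	loadFactor := float64(numItems) / float64(numBuckets)
	return loadFactor > threshold
}

func main() {
	fmt.Println(needsRehash(6, 16, 0.75))  // 0.375 <= 0.75 -> false
	fmt.Println(needsRehash(13, 16, 0.75)) // 0.8125 > 0.75 -> true
}
```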
Rehashing

Using a hash table with a large number of buckets is wasteful in space and unfriendly to the memory cache. On the other hand, as new items are inserted, the number of buckets may quickly become insufficient, and hash conflicts then degrade the bucket lists and thus lookup performance. Finding a practical mechanism to balance these trade-offs is a key design challenge.
To address this, when the load factor (as mentioned earlier) exceeds a predefined threshold, increasing the bucket space followed by rehashing is a common solution. There are two approaches to consider for implementation: rehash everything at once during the triggering operation, or rehash incrementally, migrating a few buckets on each subsequent operation (the approach Redis takes).
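A hedged sketch of the simpler, all-at-once strategy: allocate a larger bucket array and reinsert every key. The doubling factor, hash function, and names are illustrative choices, not any particular library's implementation.

```go
package main

import "fmt"

// grow doubles the bucket array and redistributes every key: the
// all-at-once rehashing strategy. Incremental schemes instead keep
// both arrays alive and migrate a few buckets per operation.
func grow(buckets [][]string, hashFn func(string, int) int) [][]string {
	newBuckets := make([][]string, 2*len(buckets))
	for _, b := range buckets {
		for _, key := range b {
			i := hashFn(key, len(newBuckets)) // slot depends on the NEW size
			newBuckets[i] = append(newBuckets[i], key)
		}
	}
	return newBuckets
}

func main() {
	hashFn := func(key string, n int) int {
		h := 0
		for i := 0; i < len(key); i++ {
			h = h*31 + int(key[i])
		}
		return h % n
	}
	buckets := make([][]string, 4)
	for _, k := range []string{"alpha", "beta", "gamma", "delta", "eps"} {
		i := hashFn(k, len(buckets))
		buckets[i] = append(buckets[i], k)
	}
	buckets = grow(buckets, hashFn)
	fmt.Println(len(buckets)) // 8
}
```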
Open Addressing

As an alternative to chaining, open addressing stores items directly in the bucket array instead of relying on an additional data structure such as a linked list.

An open-addressing table typically uses a single flat backing array. The hash value determines the initial slot for an item; when a conflict occurs, a predefined probe sequence (commonly linear probing) searches subsequent slots until an empty one is found.
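A hedged sketch of linear probing over a flat array: start at hash(key), then scan forward with wraparound until the key or an empty slot appears. Deletion needs tombstones and is omitted here; the names and table size are illustrative.

```go
package main

import "fmt"

const empty = ""

// put inserts a key with linear probing; a real table would rehash
// long before the array fills up.
func put(slots []string, key string, hash func(string) int) {
	n := len(slots)
	for i := 0; i < n; i++ {
		j := (hash(key) + i) % n // linear probe sequence
		if slots[j] == empty || slots[j] == key {
			slots[j] = key
			return
		}
	}
	panic("table full")
}

// contains follows the same probe sequence; an empty slot proves
// the key was never inserted, so the scan can stop early.
func contains(slots []string, key string, hash func(string) int) bool {
	n := len(slots)
	for i := 0; i < n; i++ {
		j := (hash(key) + i) % n
		if slots[j] == empty {
			return false
		}
		if slots[j] == key {
			return true
		}
	}
	return false
}

func main() {
	hash := func(key string) int { return int(key[0]) % 8 }
	slots := make([]string, 8)
	put(slots, "apple", hash)
	put(slots, "avocado", hash) // collides with "apple", probes onward
	fmt.Println(contains(slots, "avocado", hash)) // true
	fmt.Println(contains(slots, "banana", hash))  // false
}
```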
Swiss Table

Due to its implementation complexity, open addressing was not widely adopted in the past. However, Google's SwissTable (see blog) leverages this technique to achieve significant performance improvements, particularly in memory-caching efficiency. Its design also utilizes SIMD (Single Instruction, Multiple Data) instructions, available on modern CPUs, to further accelerate lookups and insertions. SwissTable became the default implementation of Rust's HashMap. For an in-depth understanding, I recommend the blog post The Swiss Army Knife of Hashmaps, which provides a comprehensive deep dive into its architecture, as well as the original talk on YouTube.
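To make the SIMD idea concrete: SwissTable keeps a parallel array of one-byte "control" values holding a few bits of each key's hash, and a probe compares a whole group of control bytes at once with SIMD. The sketch below scans a group with a plain loop to show what the SIMD comparison computes; the group size and names are illustrative, not Abseil's actual API.

```go
package main

import "fmt"

const groupSize = 16

// matchGroup returns the in-group indexes whose control byte equals h2,
// i.e. the only slots whose full keys are worth comparing. A SIMD
// implementation produces this set as a bitmask in one instruction.
func matchGroup(ctrl []byte, h2 byte) []int {
	var hits []int
	for i, c := range ctrl[:groupSize] {
		if c == h2 {
			hits = append(hits, i)
		}
	}
	return hits
}

func main() {
	ctrl := make([]byte, groupSize)
	ctrl[3], ctrl[9] = 0x5a, 0x5a // two candidate slots share the same h2
	fmt.Println(matchGroup(ctrl, 0x5a)) // [3 9]
}
```

The real table reserves special control-byte values for empty and deleted slots, so the same group scan also answers "can the probe stop here?".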
In Go 1.24, the language also adopted this design and introduced incremental rehashing, similar to Redis (as mentioned earlier). According to benchmarks, map operations are up to 60% faster compared to Go 1.23, marking a significant improvement in performance.
Funnel Hashing

The seminal paper Optimal Bounds for Open Addressing Without Reordering introduced a novel scheme utilizing geometrically decreasing-sized sub-arrays. This approach significantly improved expected probe complexity in both amortized and worst-case scenarios, all while avoiding the need to reorder elements over time.
High-Level Idea: the table is partitioned into levels whose sizes shrink geometrically; an insertion makes a bounded number of probe attempts in each level and falls through to the next level only when all of them fail, so most items settle in the early levels and the worst-case probe count stays small.
This reasoning proves Theorem 3 on Page 12 of the paper.
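A heavily simplified sketch of the geometric-levels structure, under stated assumptions: the per-level probe budget, the level sizes, and the hash function below are illustrative choices, not the paper's exact construction.

```go
package main

import "fmt"

const probesPerLevel = 4 // illustrative bound, not the paper's parameter

// levelSizes returns geometrically shrinking level sizes: n/2, n/4, ..., 1.
func levelSizes(n int) []int {
	var sizes []int
	for s := n / 2; s >= 1; s /= 2 {
		sizes = append(sizes, s)
	}
	return sizes
}

// insert tries a bounded number of probes per level, falling through to
// the next (smaller) level when all of them are occupied. It returns the
// (level, slot) where key landed, or ok=false if every level rejected it
// (a real table would resize long before that happens).
func insert(levels [][]string, key string, hash func(string, int, int) int) (int, int, bool) {
	for li, level := range levels {
		for p := 0; p < probesPerLevel && p < len(level); p++ {
			j := hash(key, li, p) % len(level)
			if level[j] == "" {
				level[j] = key
				return li, j, true
			}
		}
	}
	return 0, 0, false
}

func main() {
	var levels [][]string
	for _, s := range levelSizes(16) {
		levels = append(levels, make([]string, s))
	}
	hash := func(key string, level, probe int) int {
		h := level*31 + probe
		for i := 0; i < len(key); i++ {
			h = h*131 + int(key[i])
		}
		if h < 0 {
			h = -h
		}
		return h
	}
	li, _, ok := insert(levels, "example", hash)
	fmt.Println(li, ok) // an empty table accepts the key in level 0
}
```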