COMP 210: Data Structures
More About Hash Tables
The Big Picture
Hash Tables aim to provide O(1) time complexity (on average) for the three basic data structure operations: insertion, deletion, and searching. They do this by using the data item (or part of the data item) as a key that can be converted directly into the address, or index, where the data is stored.
The Core Concept: Hashing
A hash table consists of two primary components:
The Hash Function: The hash function is a mathematical function that takes an input key (e.g., a person’s name, a 16-digit product ID, or a string) and transforms it into a fixed, small, non-negative integer – the hash code or hash value – that can be used as an index into a table.
Hash(Key) → Index
The Array (The Table): This is the structure where the actual data (the value associated with the key) is stored. It could be a simple ArrayList with just the values, or it could be a map, i.e., a table containing key-value pairs. The hash codes calculated by the hash function are used as the indices into this array. The size of the table is “baked into” the hash function, since the resulting hash codes must be in the range 0 ≤ hashCode < sizeOfTable. A good hash function distributes the resulting hash codes (indices) as evenly as possible across the table size to minimize collisions.
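As a first sketch (the class and method names here are ours, not a library API), a hash function that sums character codes might look like this in Java:

```java
// A first sketch of a hash function: sum the character codes of the
// key, then reduce modulo the table size so the result is always a
// valid index in 0..TABLE_SIZE-1.
public class SimpleHash {
    static final int TABLE_SIZE = 10;    // M is "baked in" here

    static int hash(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);        // a char is its character code
        }
        return sum % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash("apple"));   // always some index in 0..9
    }
}
```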
Collisions
The key challenge in hash tables is that multiple keys may hash to the same index. This is called a collision. This becomes more likely as the number of data entries gets closer to the size of the table. (And becomes inevitable if the number of data entries exceeds the size of the table – see Separate Chaining below.)
Example:
Hash("apple") → 5
Hash("grape") → 5 (Collision!)
We cannot store both items at index 5 without overwriting one. Hash tables rely on Collision Resolution techniques to handle this.
Collision Resolution Techniques
There are two main approaches for resolving collisions:
Separate Chaining (Open Hashing)
Each entry in the hash table contains a reference to a secondary data structure, rather than a data item directly. Each slot in the main hash table array holds a pointer to a linked list (or occasionally, a small BST).
If two keys hash to index i, their data items are simply added to the linked list at array index i.
If the original table is big enough and the hash codes are distributed evenly enough, the length of any given linked list of collisions will be reasonably short, giving access performance that is still close to O(1).
Advantages: Simple to implement; never “runs out of room” unless memory is exhausted.
Disadvantages: Requires extra space for the pointers/lists; lookup becomes O(N) in the worst case (if all elements hash to the same spot).
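Here is a minimal sketch of separate chaining in Java (class and method names are ours; the hash function shown is just one possibility):

```java
import java.util.LinkedList;

// A minimal separate-chaining sketch: each slot holds a linked list
// of key-value entries, so collisions simply grow the list.
public class ChainedHashTable {
    private static class Entry {
        String key, value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        table = (LinkedList<Entry>[]) new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    // Sum of all character codes, reduced modulo the table size.
    private int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) sum += c;
        return sum % table.length;
    }

    // A collision just means the list at this index grows by one node.
    public void insert(String key, String value) {
        table[hash(key)].addFirst(new Entry(key, value));
    }
}
```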
Open Addressing (Closed Hashing)
When a collision occurs, the system looks for an available empty slot in the main array itself, following a specific probing strategy.
Linear Probing: The table searches for the next available slot sequentially: index i + 1, then i + 2, i + 3, etc., wrapping around if necessary.
Quadratic Probing: The table searches slots i + 1², i + 2², i + 3², etc.
Double Hashing: A second hash function is used to determine the size of the step to take when probing.
Advantages: Better memory utilization (no separate linked lists); faster iteration over the keys.
Disadvantages: More complex deletion (you can’t just remove a node, you must mark it as “deleted”); suffers from clustering (long runs of occupied slots), which degrades performance.
Load Factor and Rehashing
The performance of a hash table is critically dependent on its load factor α.
α = Number of Items (N) / Size of Array (M)
If the load factor gets too high (typically above 0.7 for open addressing or above 1.0 for separate chaining), the chance of collisions increases dramatically, pushing the average time complexity closer to O(N). In other words, once a hash table that uses open addressing gets to be about 70% full, performance starts to degrade quickly. On the other hand, if the hash table uses chaining, it starts to slow down as the size of the data approaches the size of the table.
When the load factor exceeds a threshold, the hash table performs rehashing:
A new array, typically double the size of the old one, is allocated.
Every item from the old array is re-inserted into the new, larger table. Because the hash function depends on the table size, each item may land at a different index.
The old, small array is discarded.
While rehashing is an O(N) operation, it happens so infrequently that the amortized time complexity for insertion remains O(1).
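A sketch of the rehashing step for a chained table (it assumes the fields of the chaining sketch above, plus a hypothetical numItems counter; the 1.0 threshold is the one mentioned for chaining):

```java
// Rehashing sketch: when the load factor crosses the threshold,
// allocate a larger array and re-insert every entry. hash() uses the
// NEW table length, so entries may land at different indices.
@SuppressWarnings("unchecked")
private void rehashIfNeeded() {
    if ((double) numItems / table.length <= 1.0) return;  // load factor α

    LinkedList<Entry>[] old = table;
    table = (LinkedList<Entry>[]) new LinkedList[old.length * 2];
    for (int i = 0; i < table.length; i++) table[i] = new LinkedList<>();

    for (LinkedList<Entry> chain : old) {
        for (Entry e : chain) {
            table[hash(e.key)].addFirst(e);   // re-hash with the new length
        }
    }
}
```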
Simple Hash Table Example Using Separate Chaining
Imagine we have a small phone book application where we store names (the Key) and phone numbers (the Value).
- Table Size (M): 10 slots (indices 0 through 9).
- Key: Name (String)
- Hash Function: A simplified function that takes the key, sums the ASCII values of the first two characters, and then applies the modulo operator with the table size as the modulus (the divisor).
Index = (ASCII(char1) + ASCII(char2)) mod M or, simplified, Index = Sum mod M.
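In Java, this toy hash function could be written as follows (a sketch; the method name is ours). Note that a Java char is its character code, so 'S' + 'n' is 83 + 110 = 193.

```java
// The example's hash function: sum the first two character codes,
// then take the remainder modulo the table size M = 10.
static int hash(String name) {
    int sum = name.charAt(0) + name.charAt(1);   // first two characters
    return sum % 10;                             // M = 10
}
// hash("Snoopy") == 3, hash("Charlie Brown") == 1, hash("Franklin") == 4
```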
We will implement the hash table using Separate Chaining, so each entry in the hash table will be a (possibly empty) linked list of entries that hash to that index.
Inserting Data
We want to insert the following data, calculating hash codes using the hash function above. We insert each data entry into the linked list at the index given by the hash function.
| Key (Classmate Name) | Sum of ASCII Values | Hash Code (Sum % 10) |
|---|---|---|
| “Snoopy” | 83 + 110 = 193 | 3 |
| “Charlie Brown” | 67 + 104 = 171 | 1 |
| “Franklin” | 70 + 114 = 184 | 4 |
| “Peppermint Patty” | 80 + 101 = 181 | 1 |
| “Lucy” | 76 + 117 = 193 | 3 |
By sheer bad luck, all of our keys hash to indices in the first half of our table, and we have two collisions (Snoopy / Lucy and Charlie Brown / Peppermint Patty) among only five data entries.
Addressing the Collisions
When we insert “Peppermint Patty”, we are inserting data into a linked list that already contains data for “Charlie Brown”. When we insert “Lucy”, we are inserting into a linked list that already contains “Snoopy”.
Final Hash Table:
| Index | Data |
|---|---|
| 0 | empty |
| 1 | [Key: “Peppermint Patty”] → [Key: “Charlie Brown”] |
| 2 | empty |
| 3 | [Key: “Lucy”] → [Key: “Snoopy”] |
| 4 | [Key: “Franklin”] |
| 5 .. 9 | empty |
Searching
Now, imagine we want to find the phone number for “Snoopy”:
- Hash the Key: Index = Hash(“Snoopy”) = 3.
- Go to Index 3: We land at the list starting with “Lucy”.
- Traverse the Chain: We check the first item. Is it “Snoopy”? No, it’s “Lucy.”
- Check Next: We move to the next item in the list. Is it “Snoopy?” Yes.
- Return Value: We return the associated phone number.
In this simple case, this search required one hash calculation (instant) and two comparisons (traversing the linked list). This demonstrates how collisions increase the complexity above the ideal O(1).
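Adding a search method to the chained-table sketch from earlier, the lookup just traced might look like this:

```java
// Search sketch for the chained table above: hash once, then walk the
// (ideally short) list at that index, comparing keys as we go.
public String find(String key) {
    for (Entry e : table[hash(key)]) {    // e.g., the list at index 3 for "Snoopy"
        if (e.key.equals(key)) {
            return e.value;               // found: return the phone number
        }
    }
    return null;                          // key is not in the table
}
```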
Note: Based on the number of collisions we have encountered so far, you might already be wondering whether
Index = (ASCII(char1) + ASCII(char2)) mod M
is a good hash function. It probably is not. If we were to add “Linus” to the list, for example, we would have a third data entry hashing to index 1. Adding Sally, on the other hand, would not result in a collision; “Sally” hashes to 0.
| Key (Classmate Name) | Sum of ASCII Values | Hash Code (Sum % 10) |
|---|---|---|
| Linus | 76 + 105 = 181 | 1 |
| Sally | 83 + 97 = 180 | 0 |
Linear Probing
In this method, the main array holds the data directly, and we search for the next empty slot within that array upon collision.
Let’s look at the same example that we used above, this time with linear probing.
- Table Size (M): 10 slots (indices 0 through 9).
- Hash Function: Same as before: Index = (ASCII(char1) + ASCII(char2)) mod M or, simplified, Index = Sum mod M.
Inserting Data
The data items hash to the exact same initial indices as in the previous example:
| Key (Classmate Name) | Sum of ASCII Values | Hash Code (Sum % 10) |
|---|---|---|
| Snoopy | 83 + 110 = 193 | 3 |
| Charlie Brown | 67 + 104 = 171 | 1 |
| Franklin | 70 + 114 = 184 | 4 |
| Peppermint Patty | 80 + 101 = 181 | 1 |
| Lucy | 76 + 117 = 193 | 3 |
The first three items are placed directly into their hashed indices:
| Index | Content |
|---|---|
| 0 | empty |
| 1 | Charlie Brown |
| 2 | empty |
| 3 | Snoopy |
| 4 | Franklin |
| 5 .. 9 | empty |
Inserting “Peppermint Patty” and “Lucy”, though, results in collisions.
Addressing the Collisions
Placing Peppermint Patty:
- Calculate Hash: “Peppermint Patty” hashes to index 1.
- Check Index 1: Slot 1 is occupied by “Charlie Brown” (Collision!).
- Probing (Linear): We check the next consecutive slot:
- Check Index 2: Empty.
- Insert: “Peppermint Patty” is placed at index 2.
Placing Lucy:
- Calculate Hash: “Lucy” hashes to index 3.
- Check Index 3: Slot 3 is occupied by “Snoopy” (Collision!).
- Probing (Linear): We check the next consecutive slot:
- Check Index 4: Occupied by “Franklin.”
- Check Index 5: Empty.
- Insert: “Lucy” is placed at index 5.
Final Hash Table:
| Index | Content |
|---|---|
| 0 | empty |
| 1 | Charlie Brown |
| 2 | Peppermint Patty |
| 3 | Snoopy |
| 4 | Franklin |
| 5 | Lucy |
| 6 .. 9 | empty |
Stop and Think:
What would be the result of adding “Linus” and “Sally” to this hash table? Remember that the hash code for “Linus” is 1, while the hash code for “Sally” is 0. How many comparisons would be required for each?
Searching
When searching for a data entry, we follow the same steps: compute the hash code and check that index; if that slot does not hold the entry we are searching for, compute a new index using the step, check the new index, and so on, stopping when we find the entry or reach an empty slot.
Primary Clustering
This time our bad luck (or poor hash function) has led to five data entries being clustered together – indices 1, 2, 3, 4, and 5 are all full and adjacent. This is called Primary Clustering. A long cluster makes subsequent insertions and searches into that cluster much slower, degrading the average time complexity away from O(1).
In this case, inserting “Lucy” required three comparisons and searching for “Lucy” would require the same, because “Lucy” was forced two steps away from the original hash location due to the cluster.
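A minimal linear-probing sketch in Java (class and method names are ours; deletion, which needs “deleted” markers as noted earlier, and the completely-full table are omitted):

```java
// Linear-probing sketch: the array holds entries directly, and null
// means "empty". We assume the table never becomes completely full.
public class ProbingHashTable {
    private static class Entry {
        String key, value;
        Entry(String k, String v) { key = k; value = v; }
    }

    private final Entry[] slots;

    public ProbingHashTable(int size) { slots = new Entry[size]; }

    // Same toy hash as the example: first two character codes mod M.
    private int hash(String key) {
        return (key.charAt(0) + key.charAt(1)) % slots.length;
    }

    public void insert(String key, String value) {
        int index = hash(key);
        while (slots[index] != null) {           // collision: probe forward
            index = (index + 1) % slots.length;  // wrap around if necessary
        }
        slots[index] = new Entry(key, value);
    }

    public String find(String key) {
        int index = hash(key);
        while (slots[index] != null) {           // an empty slot ends the search
            if (slots[index].key.equals(key)) return slots[index].value;
            index = (index + 1) % slots.length;
        }
        return null;                             // not in the table
    }
}
```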
Variations on Linear Probing
The simplest form of linear probing, which we used in our example above, uses a step size of 1. In other words, when we encountered a collision we tried the next spot, incrementing the index by 1. We could also have used a different step, such as 2 or 3. We have to be careful, though, in choosing the step. Consider if we had chosen a step of 5 in our example above. In that case, the data for “Peppermint Patty” would have been placed in index 6 and the data for “Lucy” would have been placed in index 8. This would have been a definite improvement. If the next name happened to also hash to 1, though, we would have checked 1, found a collision, checked 6, found a second collision, and then been back at 1 again ((6 + 5) mod 10 = 1). We would now be in a cycle, and would either throw an exception or rehash the table to a larger size. We would have a similar problem with a step of 2, although it would take more checks to arrive back where we started. The problem is that 2 and 5 are both divisors of 10. If our step were 3, on the other hand, or any other number that is relatively prime to the size of the table, we would check a different index on each step until we had visited the whole table.
For this reason, hash tables that use linear probing are generally created with prime sizes – a capacity of 101, for example, rather than 100. That way, any step size smaller than the capacity is relatively prime to it.
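A quick way to see this is to print the probe sequences directly (a hypothetical helper, not course code):

```java
// Why the step must be relatively prime to the table size: print the
// sequence of indices visited from a starting index with a given step.
public class ProbeSteps {
    static void printProbeSequence(int start, int step, int m) {
        int index = start;
        for (int i = 0; i < m; i++) {        // at most m distinct slots
            System.out.print(index + " ");
            index = (index + step) % m;
        }
        System.out.println();
    }

    public static void main(String[] args) {
        printProbeSequence(1, 5, 10);   // 1 6 1 6 1 6 1 6 1 6  (cycle of two slots)
        printProbeSequence(1, 3, 10);   // 1 4 7 0 3 6 9 2 5 8  (all ten slots)
    }
}
```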
Another variation on probing that provides a more dynamic way to define the step is quadratic probing, where the step is a function of the probe number: i + 1², i + 2², i + 3², etc.
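As a sketch, computing the i-th quadratic probe index might look like this (the method name is ours):

```java
// Quadratic probing sketch: the i-th probe jumps i^2 past the home
// index, so the offsets grow 1, 4, 9, 16, ... (wrapping via mod).
static int quadraticProbe(int homeIndex, int i, int tableSize) {
    return (homeIndex + i * i) % tableSize;
}
// From home index 3 in a table of size 10: probe 1 checks 4,
// probe 2 checks 7, probe 3 checks (3 + 9) % 10 = 2.
```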
Double Hashing
Double Hashing is the most advanced of the standard probing techniques. It is highly effective because it uses a secondary hash function to calculate a key-specific step size, effectively preventing primary clustering.
To make Double Hashing mathematically robust, the table size (M) must be a prime number. For this example, we will use a table size of 11 (indices 0-10).
Table Size (M): 11 slots (indices 0 through 10). M is a prime number.
Primary Hash Function (h1): Finds the initial index.
h1(k) = (ASCII(char1) + ASCII(char2)) mod M or, simplified, h1(k) = Sum mod M.
Secondary Hash Function (h2): Calculates the unique step size. We will base it on Sum mod R, where R is a prime number smaller than the table size, to ensure the probe sequence covers the whole table. In this case, we will use R = 7.
h2(k) = R − ((ASCII(char1) + ASCII(char2)) mod R) or, simplified, h2(k) = R − (Sum mod R).
Probing Formula: The index for the i-th probe is calculated as:
Indexᵢ = (h1(k) + i ⋅ h2(k)) mod M
Note that the secondary hash function h2 is similar in form to the primary hash function h1, but with a smaller prime modulus (R instead of M) and an extra subtraction. The subtraction changes the range of h2 from 0...(R − 1) to 1...R, guaranteeing that the step is never 0. A step of 0 would mean never stepping away from the collision, checking the same slot over and over again.
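Putting the two hash functions together in Java (a sketch using the formulas above; the class and method names are ours):

```java
// A sketch of the double-hashing formulas, with M = 11 and R = 7.
public class DoubleHashing {
    static final int M = 11;   // table size (prime)
    static final int R = 7;    // smaller prime used for the step

    static int sum(String key) {
        return key.charAt(0) + key.charAt(1);   // first two character codes
    }

    static int h1(String key) { return sum(key) % M; }

    // Subtracting from R keeps the step in 1..R, so it is never 0.
    static int h2(String key) { return R - (sum(key) % R); }

    // Index checked on the i-th probe (i = 0 is the initial index).
    static int probe(String key, int i) {
        return (h1(key) + i * h2(key)) % M;
    }

    public static void main(String[] args) {
        // "Lucy": sum 193, h1 = 6, h2 = 3, so probe 1 checks index 9.
        System.out.println(probe("Lucy", 1));   // prints 9
    }
}
```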
Inserting Data
First we calculate the initial indices and step sizes for our example data using the primary and secondary hash functions, with M = 11 and R = 7. Note that the change in table size (M) from 10 to 11 will change our initial hash codes and collisions. We will also include “Linus” and “Sally” in this example.
| Key (Classmate Name) | Sum | h1(k) = Sum mod 11 | Sum mod 7 | h2(k) = 7 − (Sum mod 7) |
|---|---|---|---|---|
| Snoopy | 193 | 6 | 4 | 3 |
| Charlie Brown | 171 | 6 | 3 | 4 |
| Franklin | 184 | 8 | 2 | 5 |
| Peppermint Patty | 181 | 5 | 6 | 1 |
| Lucy | 193 | 6 | 4 | 3 |
| Linus | 181 | 5 | 6 | 1 |
| Sally | 180 | 4 | 5 | 2 |
How do we use these formulas to place our data entries into the table, addressing collisions when needed?
- “Snoopy”: h1 = 6. Index 6 is empty. (Placed at 6)
- “Charlie Brown”: h1 = 6. Collision at 6 (Snoopy). h2 = 4 (Step size).
- Probe 1 (i = 1): (6 + 1 ⋅ 4) mod 11 = 10. Index 10 is empty. (Placed at 10)
- “Franklin”: h1 = 8. Index 8 is empty. (Placed at 8)
- “Peppermint Patty”: h1 = 5. Index 5 is empty. (Placed at 5)
- “Lucy”: h1 = 6. Collision at 6 (Snoopy). h2 = 3 (Step size).
- Probe 1 (i = 1): (6 + 1 ⋅ 3) mod 11 = 9. Index 9 is empty. (Placed at 9)
- “Linus”: h1 = 5. Collision at 5 (Peppermint Patty). h2 = 1 (Step size).
- Probe 1 (i = 1): (5 + 1 ⋅ 1) mod 11 = 6. Collision at 6 (Snoopy).
- Probe 2 (i = 2): (5 + 2 ⋅ 1) mod 11 = 7. Index 7 is empty. (Placed at 7)
- “Sally”: h1 = 4. Index 4 is empty. (Placed at 4)
Final Hash Table:
| Index | Content |
|---|---|
| 0 | empty |
| 1 | empty |
| 2 | empty |
| 3 | empty |
| 4 | Sally |
| 5 | Peppermint Patty |
| 6 | Snoopy |
| 7 | Linus |
| 8 | Franklin |
| 9 | Lucy |
| 10 | Charlie Brown |
The Advantage: Scattered Probing
The key takeaway is that even though Snoopy, Charlie Brown, and Lucy all initially hashed to 6, they took different paths to find an empty slot. Snoopy took the slot at index 6, Charlie Brown had a collision at 6 but then stepped 4 to index 10, and Lucy had a collision at 6 but stepped 3 to get to index 9.
Double hashing ensures that every key follows a distinct path through the hash table when resolving a collision.
In theory, this behavior can more effectively scatter the colliding elements throughout the table. Even when the data are clustered, as they are in this example, the differing secondary hashes provide more opportunity for stepping over and out of a cluster. Fewer collisions keep insertion, deletion, and lookup times closer to O(1).