# Concept

## Introduction
Hash maps, also known as hash tables, are a critical and frequently used data structure in computer science. A hash map implements the associative array abstract data type: a structure that maps keys to values. This key-value pairing lends itself to efficient querying and manipulation of data.
In a hash map, each key is unique and is used to pinpoint the location of its corresponding value in the structure. This is done by applying a hash function to the key, which translates it into a hash, a fixed-size result or index that represents the location of the corresponding value. The hash function is vital to the performance of the hash map. It is designed to distribute keys evenly across the array to minimize clustering and ensure an average time complexity of \(\mathcal{O}(1)\) for basic operations.
This direct mapping from key to index allows for rapid access, insertion, and removal of data, making hash maps highly efficient for dealing with large datasets. However, it’s essential to note that hash maps can encounter hash collisions, a situation where two different keys yield the same hash. This problem is typically addressed using various collision resolution techniques, such as chaining or open addressing.
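
To make these ideas concrete, here is a minimal sketch of a chaining-based hash map in Python. The class name `ToyHashMap` is our own illustration; production implementations such as CPython's `dict` are far more sophisticated and also resize dynamically.

```python
class ToyHashMap:
    """A minimal hash map using chaining for collision resolution."""

    def __init__(self, num_buckets: int = 8) -> None:
        # Each bucket holds a list of (key, value) pairs that hash to it.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        # The hash function maps a key to a fixed-size index.
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # new key (possibly a collision): chain it

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)


toy = ToyHashMap()
toy.put("apple", 1)
toy.put("banana", 2)
print(toy.get("apple"))  # 1
```

Here `_index` plays the role of the hash function, each inner list is a bucket, and appending to an occupied bucket is chaining in action.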
Hash maps form the foundation of many high-level data structures and are vital in solving complex problems due to their efficient storage and retrieval. For a deeper understanding of their intricacies and wide range of applications, see the hash table article on Wikipedia.
## Common Notations

| Term | Definition | Example |
|---|---|---|
| Hash Map (also called Hash Table) | A data structure implementing an associative array abstract data type. It maps keys to values using a hash function to compute an index into an array of buckets or slots. | A Python `dict`, e.g. `{"name": "Alice"}`. |
| Key | The identifier by which a value is associated and retrieved in a hash map. Each key is unique within the hash map. | Given `{"name": "Alice"}`, `"name"` is the key. |
| Value | The specific data element associated with a key in a hash map. | In the hash map `{"name": "Alice"}`, `"Alice"` is the value. |
| Hash Function | A function used to map a data element (the key) to a fixed address in the hash map. A good hash function produces a uniform distribution across the array for various keys and minimizes collisions. | Various techniques exist, such as the division method and the multiplication method. |
| Bucket | A slot in the hash map where key-value pairs are stored. The hash function determines the index of these buckets. | In a hash map backed by an array of size 10, bucket 5 stores the key-value pairs whose keys hash to index 5. |
| Collision | A situation when two different keys hash to the same index in a hash map. Various collision resolution techniques exist, such as chaining or open addressing. | If both `"ab"` and `"ba"` hash to index 3, a collision occurs at that bucket. |
| Load Factor | A measure that decides when to increase the hash map capacity to maintain efficiency. It is calculated by dividing the number of stored entries by the total number of buckets. | If a hash map has 100 buckets and 75 are occupied, the load factor is 0.75. |
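
As a quick illustration of how these terms fit together, Python's built-in `hash` can stand in for the hash function. Note that `hash` is randomized for strings across interpreter runs, so the exact bucket index will vary.

```python
key = "apple"
num_buckets = 8

# key -> hash -> bucket index
index = hash(key) % num_buckets
print(f"{key!r} lands in bucket {index}")
```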
## Advantages of Hash Maps

### Efficient Key-Value Access
Elements in a hash map are accessed through keys, which are hashed to compute an index in the array where the corresponding value is stored. If you know the key of the desired element, you can retrieve the value in constant time, denoted as \(\mathcal{O}(1)\).
This is possible because a hash function is used to map keys to their associated values directly. The operation is quick and is independent of the size of the hash map, which is why accessing an element in a hash map by its key is a constant time \(\left(\mathcal{O}(1)\right)\) operation.
```python
from typing import Dict

# `time_complexity_analyzer` is assumed to be defined earlier in the chapter.
@time_complexity_analyzer(data_type="dict", repeat=10, plot=True)
def dict_access(n: int, dict_: Dict[int, int]) -> None:
    _ = dict_[n - 1]
```

```python
n_sizes = range(1000000, 10000001, 1000000)  # 1M to 10M
_ = dict_access(n_sizes)
```
### Accommodate Dynamic Size
Unlike arrays, hash maps are dynamic, which means they can grow or shrink in size as required. This property gives hash maps a significant advantage when the number of entries is not known in advance or when entries need to be added or removed dynamically.
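
For example, a Python `dict` requires no capacity to be declared up front and grows or shrinks its contents as entries come and go:

```python
inventory = {}               # no capacity declared up front
inventory["apples"] = 10     # grows on insertion
inventory["pears"] = 5
print(len(inventory))        # 2

del inventory["apples"]      # entry count shrinks on deletion
print(len(inventory))        # 1
```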
### Key-Value Pairs
Hash maps store data as key-value pairs, which can be a more intuitive and flexible way to organize data. This is especially useful when each piece of data is associated with a unique identifier, like a name, an ID number, or a login handle.
## Disadvantages of Hash Maps

### Inefficient Space Usage
Hash maps can be space-inefficient. They typically reserve more space than necessary to account for the potential addition of new key-value pairs and to minimize the occurrence of hash collisions. This leads to an increased memory footprint, especially when the load factor (the ratio of the number of entries to the number of slots) is low.
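
A rough way to observe this overhead in CPython is `sys.getsizeof`, which reports a container's own allocation. The numbers in the comments below are illustrative assumptions; exact values vary by Python version and platform.

```python
import sys

n = 1000
keys = list(range(n))
d = {k: None for k in keys}

# The dict reserves spare slots to keep its load factor low, so its
# footprint exceeds that of a plain list holding the same keys.
print(sys.getsizeof(d))     # e.g. roughly 36 KB on a 64-bit CPython
print(sys.getsizeof(keys))  # e.g. roughly 8 KB
```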
### No Order of Elements

A classic hash map does not maintain any order of elements. If you need to retrieve or display key-value pairs in a specific order (such as ascending order of keys or values), you need to sort them, which takes additional time. (CPython's `dict` does preserve insertion order as of Python 3.7, but that is not the same as sorted order.)
### Complexity of Collision Resolution
In cases where multiple keys hash to the same index (a collision), the hash map needs to have a strategy to resolve the collision. Collision resolution strategies like chaining or open addressing can be complex to implement and understand.
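
As a hedged sketch of the open-addressing side, here is what insertion with linear probing might look like. The helper `probe_insert` is our own illustration; a real implementation would also need deletion markers and resizing.

```python
def probe_insert(table: list, key, value) -> None:
    """Insert into a fixed-size open-addressing table using linear probing."""
    size = len(table)
    start = hash(key) % size
    for step in range(size):
        slot = (start + step) % size        # walk forward past collisions
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return
    raise RuntimeError("table is full")


table = [None] * 8
probe_insert(table, "apple", 1)
probe_insert(table, "banana", 2)
print(table)
```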
### Hash Function Dependence
The performance of a hash map significantly depends on the quality of the hash function. A poor hash function that leads to many collisions can degrade the performance of key lookups, insertions, and deletions from constant time to linear time in the worst case.
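
One way to see this in Python is to define a key type whose `__hash__` is deliberately pathological, forcing every key into the same bucket. The class `BadKey` and helper `build_dict` are our own illustration.

```python
import timeit


class BadKey:
    """Every instance hashes to the same value, so all keys collide."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __hash__(self) -> int:
        return 1  # constant hash: the worst possible hash function

    def __eq__(self, other) -> bool:
        return isinstance(other, BadKey) and self.n == other.n


def build_dict(make_key, n: int = 2000) -> float:
    """Time how long it takes to insert n keys produced by make_key."""
    return timeit.timeit(lambda: {make_key(i): i for i in range(n)}, number=1)


print(build_dict(BadKey))  # collisions degrade each insert toward O(n)
print(build_dict(int))     # well-distributed hashes stay near O(1) per insert
```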
### No Direct Support for Sorting
Hash maps do not directly support sorting of their keys or values. If you need to keep your keys or values sorted, you’ll have to sort the hash map manually or use a data structure that maintains order, such as a tree-based map.
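
In Python the usual workaround is to sort on demand, at an \(\mathcal{O}(n \log n)\) cost per call:

```python
scores = {"carol": 75, "alice": 90, "bob": 82}

# Sort by key
for name, score in sorted(scores.items()):
    print(name, score)

# Sort by value, highest first
best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(best[0])  # ('alice', 90)
```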
## Complexity Analysis

### Time Complexity

Let's consider a Python dictionary named `hashmap` for illustrative purposes.

| Operation | Python Syntax | Best Case | Average Case | Worst Case | Description |
|---|---|---|---|---|---|
| Access | `hashmap[key]` | \(\mathcal{O}(1)\) | \(\mathcal{O}(1)\) | \(\mathcal{O}(n)\) | Accessing a value by its key in a dictionary is a constant time operation on average. However, due to potential hash collisions, the worst case can be linear. |
| Search | `key in hashmap` | \(\mathcal{O}(1)\) | \(\mathcal{O}(1)\) | \(\mathcal{O}(n)\) | Searching for a key in the dictionary is also a constant time operation on average. The worst case can be linear due to hash collisions. |
| Insert | `hashmap[key] = value` | \(\mathcal{O}(1)\) | \(\mathcal{O}(1)\) | \(\mathcal{O}(n)\) | Inserting a key-value pair in a dictionary is typically a constant time operation. However, in the worst case, when rehashing is necessary, the operation may take linear time. |
| Delete | `del hashmap[key]` | \(\mathcal{O}(1)\) | \(\mathcal{O}(1)\) | \(\mathcal{O}(n)\) | Deleting a key-value pair is a constant time operation on average. In the worst case, due to potential hash collisions, it can take linear time. |
```python
@time_complexity_analyzer(data_type="dict", repeat=10, plot=True)
def dict_set(n: int, dict_: Dict[int, int]) -> None:
    dict_[n] = n


@time_complexity_analyzer(data_type="dict", repeat=10, plot=True)
def dict_search(n: int, dict_: Dict[int, int]) -> None:
    _ = n in dict_


@time_complexity_analyzer(data_type="dict", repeat=10, plot=True)
def dict_access(n: int, dict_: Dict[int, int]) -> None:
    _ = dict_[n - 1]


@time_complexity_analyzer(data_type="dict", repeat=10, plot=True)
def dict_delete(n: int, dict_: Dict[int, int]) -> None:
    # Delete every key in a loop; see the note below for why a single
    # `del` does not benchmark correctly with this decorator.
    for i in range(n):
        if i in dict_:
            del dict_[i]
```

```python
n_sizes = range(1000000, 10000001, 1000000)  # 1M to 10M
for func in [dict_set, dict_search, dict_access, dict_delete]:
    _ = func(n_sizes)
```
A naive deletion benchmark that simply runs `del dict_[n]` does not work properly because of the way `time_complexity_analyzer` works: the decorator does not rebuild the dictionary between repeats, so once the first call deletes the key, subsequent calls cannot find it and have nothing to delete. A simple workaround, without modifying the decorator, is to delete every key in a loop, as `dict_delete` does above, and confirm that the measured time complexity is \(\mathcal{O}(n)\) instead of \(\mathcal{O}(1)\), which is consistent with each individual deletion being a constant-time operation.
### Space Complexity
Space complexity is a measure of the amount of memory an algorithm needs to run to completion. Just like time complexity, space complexity is typically expressed using Big-O notation.
Let's consider the space complexity of Python dictionaries:

| Operation | Python Syntax | Big-O Notation | Description |
|---|---|---|---|
| Initializing | `hashmap = {i: i for i in range(n)}` | \(\mathcal{O}(n)\) | Initializing a dictionary with \(n\) key-value pairs requires space proportional to \(n\). |
| Inserting | `hashmap[key] = value` | \(\mathcal{O}(n)\) | Inserting a key-value pair might require resizing the dictionary, which involves creating a new table and copying all key-value pairs from the old one. In the worst case, this requires \(\mathcal{O}(n)\) additional space, although the amortized cost per insertion is constant. |
In most cases, both time and space complexity are important considerations when writing code. You can often trade off between the two: you can make your code run faster by using more memory (a technique called the space-time tradeoff), or you can make your code use less memory by accepting that it will run slower. The right balance between time and space complexity depends on the specific requirements of your program.
## The Space-Time Tradeoff
In computer science, the term space-time tradeoff refers to a situation where the efficiency of an algorithm can be improved through consuming more memory, or, inversely, memory usage can be decreased at the cost of slower execution times. In essence, you can save time by using more space, or conserve space by investing more time.
Consider a scenario where you are given a sizable array of integers and tasked with determining whether a certain integer is present within this array. Using a standard search approach, on average, you would need \(\mathcal{O}(n)\)[1] time to locate the integer in the array. This could become inefficient if this operation has to be repeated numerous times.
In such a case, it would be more prudent to create a hash map of the array, then check for membership within this hash map. Although this would require additional \(\mathcal{O}(n)\) auxiliary space, the time complexity for each membership check would be reduced to \(\mathcal{O}(1)\).
This means that each subsequent membership check, by simply hashing the item and verifying its presence in the hash map, would be significantly faster compared to performing a linear search through the array. However, this expedited process comes at the expense of increased memory usage needed to store the hash map, thus exemplifying a classic space-time tradeoff.
### Example

Let's consider the following experiment to understand the space-time tradeoff better.
1# Let's say we have a list of one million items
2my_list = list(range(1, 1000001))
3
4# And we want to check if an item exists in the list
5item_to_check = 999999
6
7# This operation has a time complexity of O(n) because it might need to
8# go through every single item in the list in the worst case
9print(item_to_check in my_list) # returns True
10
11
12# Now let's say we store the items in a dictionary (which is a hash table in Python)
13my_dict = {item: True for item in my_list}
14
15# Checking if an item exists in the dictionary is generally an O(1) operation
16print(item_to_check in my_dict) # returns True
True
True
Of course, this assumes that we are only interested in checking for membership, since building the hash map removes duplicates (a `set` would serve equally well here).
Exercise: Can you construct a statistically reliable experiment to test the following questions?

1. What is the threshold size for a list beyond which it becomes more efficient to convert it into a hash map for membership checks?
2. Given a list of size \(N = 100000\), how many membership checks should be conducted before it becomes more efficient to convert the list into a hash map for these operations?
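
As a starting point, here is a minimal timing harness (a sketch, not a full statistical treatment: the helper `membership_cost` is our own name, and a rigorous experiment would add repeated trials and variance estimates):

```python
import timeit


def membership_cost(n: int, checks: int) -> tuple:
    """Compare repeated list membership checks against a one-off dict conversion."""
    data = list(range(n))
    probe = n - 1  # worst case for a linear scan of the list

    # Linear scan, repeated `checks` times.
    list_time = timeit.timeit(lambda: probe in data, number=checks)

    # Pay the O(n) conversion cost once, then do O(1) lookups.
    def dict_run() -> None:
        lookup = dict.fromkeys(data)
        for _ in range(checks):
            _ = probe in lookup

    dict_time = timeit.timeit(dict_run, number=1)
    return list_time, dict_time


for checks in (1, 10, 100):
    print(checks, membership_cost(100000, checks))
```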
## Common Clarifications and Corner Cases

| Tip | Clarify | Example |
|---|---|---|
| Clarify keys and values | Discuss the types of keys and values. Are they integers, strings, or objects? Can values be null? | Keys must be hashable in Python; values may be `None`, e.g. `{"user": None}`. |
| Understand hash collisions | Hash collisions are inevitable. Understand how your language handles them (chaining, open addressing, etc.). | Python's `dict` handles hash collisions internally using open addressing. |
| Beware of key absence | Accessing a key that doesn't exist will cause an error. | `hashmap["missing"]` raises a `KeyError`; use `hashmap.get("missing")` instead (see the snippet after this table). |
| Understand time complexity | Most operations are \(\mathcal{O}(1)\), but can degrade to \(\mathcal{O}(n)\) due to hash collisions. | Inserting \(n\) items is generally \(\mathcal{O}(n)\), but can become \(\mathcal{O}(n^2)\) in the worst case. |
| Always consider corner cases | How does your algorithm handle an empty hash map, a map with one key-value pair, or duplicate values? | Consider the empty map `{}`, a single pair `{1: "a"}`, and maps with duplicate values. |
| Hash map resizing | When a map's load factor exceeds a threshold, it is resized, temporarily increasing time complexity. | Python's `dict` resizes when it gets too full, causing a temporary \(\mathcal{O}(n)\) operation. |
| Use for item search | Hash maps are efficient for search operations; use them when you need to look up items by key. | Use a hash map to store numbers as keys for \(\mathcal{O}(1)\) lookups. |
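
As a follow-up to the key-absence tip above, here is a short sketch of the usual safe-access patterns in Python:

```python
from collections import defaultdict

hashmap = {"a": 1}

print(hashmap.get("b"))            # None instead of a KeyError
print(hashmap.get("b", 0))         # a custom default value
print(hashmap.setdefault("b", 0))  # inserts the default and returns it

counts = defaultdict(int)          # missing keys start at int() == 0
counts["x"] += 1
print(counts["x"])                 # 1
```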