Set Over List For Frequent Membership Tests#
from typing import List


def has_duplicates(lst: List[int]) -> bool:
    seen = []
    for item in lst:
        if item in seen:
            return True
        seen.append(item)
    return False
%time
print(has_duplicates([1, 2, 3, 4, 5])) # Output: False
print(has_duplicates([1, 2, 3, 2, 5])) # Output: True
print(has_duplicates(list(range(100000)) + [99999])) # Large list with duplicate
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 5.96 μs
False
True
True
Time Complexity: Inefficient Membership Testing#
Using a list (seen) for membership checks (item in seen) results in \(\mathcal{O}(N)\) time complexity for each check. Since that check runs once per element, the has_duplicates function has an overall time complexity of \(\mathcal{O}(N^2)\) in the worst case, making it inefficient for large lists.
Space Complexity#
How does the space complexity of using a set compare to using a list in this scenario?
Set:
Space complexity: \(\mathcal{O}(N)\)
Reason: A set stores each seen element in a hash table, so the extra space grows linearly with the number of distinct elements, plus a constant-factor overhead for the table itself.
List:
Space complexity: \(\mathcal{O}(N)\)
Reason: A list stores a reference to each seen element in insertion order, which also grows linearly.
In this scenario both auxiliary structures are \(\mathcal{O}(N)\); switching to a set improves the time complexity, not the asymptotic space.
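As a quick sanity check (a minimal sketch; note that sys.getsizeof measures only the container itself, not the integers it references), the container overhead of each structure can be compared directly:
import sys

n = 100_000
as_list = list(range(n))
as_set = set(range(n))

# Both grow linearly with n; the set's hash table costs a constant factor more.
print(f"list: {sys.getsizeof(as_list):,} bytes")
print(f"set:  {sys.getsizeof(as_set):,} bytes")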
Resolution#
A set should be used instead of a list for the seen collection because sets in Python provide average-case \(\mathcal{O}(1)\) time complexity for membership tests (in operations). This change reduces the overall time complexity of the function to \(\mathcal{O}(N)\), making it much more efficient for large datasets.
The refactored has_duplicates function now operates in \(\mathcal{O}(N)\) time, as each membership test and each addition to the set is \(\mathcal{O}(1)\) on average.
def has_duplicates(lst: List[int]) -> bool:
    seen = set()
    for item in lst:
        if item in seen:
            return True
        seen.add(item)
    return False
%time
print(has_duplicates([1, 2, 3, 4, 5])) # Output: False
print(has_duplicates([1, 2, 3, 2, 5])) # Output: True
print(has_duplicates(list(range(100000)) + [99999])) # Large list with duplicate
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 1.91 μs
False
True
True
The time difference is huge: the list version rescans seen on every membership check (\(\mathcal{O}(N)\) per check), while the set version does a constant-time hash lookup.
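To confirm the linear behavior of the refactored, set-based has_duplicates (a minimal sketch; the sizes are arbitrary and duplicate-free to force a full scan), timing it on doubling input sizes should show the runtime roughly doubling rather than quadrupling:
import timeit

for n in (100_000, 200_000, 400_000):
    data = list(range(n))  # no duplicates, so the whole list is scanned
    t = timeit.timeit(lambda: has_duplicates(data), number=5)
    print(f"n={n:>7}: {t:.4f}s")  # the time roughly doubles each time n doubles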
Are there any scenarios where using a list might be preferable over a set for tracking seen items?#
Order Preservation: Lists preserve the order of elements, whereas sets do not. If maintaining the insertion order is crucial for subsequent operations, a list might still be necessary.
Mutable Elements: Sets require their elements to be hashable (immutable). If the list contains unhashable items (like other lists or dictionaries), a set cannot be used directly; see the sketch after this list.
Memory Overhead: Sets carry a higher per-element memory overhead than lists because of the underlying hash table. However, this constant factor is typically negligible compared to the performance benefits for large datasets.
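For example, the hashability constraint above is easy to hit in practice: if the items are themselves lists, adding them to a set raises a TypeError, and a common workaround is to key on an immutable view of each item (a minimal sketch):
rows = [[1, 2], [3, 4], [1, 2]]

# Adding a list to a set fails because lists are unhashable.
try:
    seen = set()
    for row in rows:
        seen.add(row)
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# Workaround: track an immutable (hashable) version of each item instead.
seen = set()
for row in rows:
    key = tuple(row)
    if key in seen:
        print("duplicate:", row)
    seen.add(key)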
Find Duplicates#
Similar idea, but now list.count is also \(\mathcal{O}(N)\), so calling it once per element makes the naive version \(\mathcal{O}(N^2)\).
from typing import List, Set


def find_duplicates(lst: List[int]) -> List[int]:
    # Naive version: lst.count() scans the whole list for every element,
    # so the overall time complexity is O(N^2).
    duplicates = []
    for item in lst:
        if lst.count(item) > 1 and item not in duplicates:
            duplicates.append(item)
    return duplicates


def find_duplicates_efficient(lst: List[int]) -> Set[int]:
    # Set-based version: each membership test and insertion is O(1) on
    # average, so the overall time complexity is O(N).
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates
[1, 2]
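The [1, 2] above is the recorded output, but the call that produced it isn't shown. As a minimal sketch (the input list here is an assumption), both versions can be exercised like this:
sample = [1, 2, 3, 1, 2]  # hypothetical input chosen to contain two duplicated values

print(find_duplicates(sample))                    # [1, 2], computed in O(N^2)
print(sorted(find_duplicates_efficient(sample)))  # [1, 2], computed in O(N)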