Set Over List For Frequent Membership Tests#


from typing import List


def has_duplicates(lst: List[int]) -> bool:
    seen = []
    for item in lst:
        if item in seen:  # linear scan of the seen list: O(N) per check
            return True
        seen.append(item)
    return False
%time

print(has_duplicates([1, 2, 3, 4, 5]))  # Output: False
print(has_duplicates([1, 2, 3, 2, 5]))  # Output: True
print(has_duplicates(list(range(100000)) + [99999]))  # Large list with duplicate
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 5.96 μs
False
True
True

Time Complexity: Inefficient Membership Testing#

Using a list (seen) for membership checks (item in seen) results in \(\mathcal{O}(N)\) time complexity for each check.

The has_duplicates function has an overall time complexity of \(\mathcal{O}(N^2)\) in the worst case, making it inefficient for large lists.
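
To make the quadratic growth concrete, here is a minimal timing sketch (the input sizes and timeit setup are illustrative, not from the original; absolute numbers depend on the machine). A duplicate-free input is the worst case, because every element is compared against the entire seen list, so doubling the input size should roughly quadruple the runtime:

import timeit

for n in (1_000, 2_000, 4_000):
    data = list(range(n))  # duplicate-free input: forces the full O(N^2) scan
    elapsed = timeit.timeit(lambda: has_duplicates(data), number=5)
    print(f"n={n:>5}: {elapsed:.4f} s")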

Space Complexity#

How does the space complexity of using a set compare to using a list in this scenario?

  • Set:

    • Space complexity: \(\mathcal{O}(N)\)

    • Reason: A set requires additional space to store the hash table and its associated metadata.

  • List:

    • Space complexity: \(\mathcal{O}(N)\)

    • Reason: A list also stores references to all N elements; insertion order comes for free from its contiguous layout, so the asymptotic cost is the same.
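
Both containers are linear in N, so the difference lies in the constant factor. A rough sketch (illustrative, not from the original; exact byte counts vary by Python version and platform) comparing the container overhead alone, ignoring the integer objects both share:

import sys

items = list(range(100_000))
print(f"list: {sys.getsizeof(items):,} bytes")       # compact array of references
print(f"set:  {sys.getsizeof(set(items)):,} bytes")  # sparse hash table, larger constant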

Resolution#

  • A set should be used instead of a list for the seen collection because Python sets provide average-case \(\mathcal{O}(1)\) time complexity for membership tests (the in operator).

  • This change reduces the overall time complexity of the function to \(\mathcal{O}(N)\), making it far more efficient for large datasets.

  • The refactored has_duplicates function runs in \(\mathcal{O}(N)\) time, since each membership test and each addition to the set is \(\mathcal{O}(1)\) on average.

def has_duplicates(lst: List[int]) -> bool:
    seen = set()
    for item in lst:
        if item in seen:  # hash lookup: O(1) on average
            return True
        seen.add(item)
    return False
%time

print(has_duplicates([1, 2, 3, 4, 5]))  # Output: False
print(has_duplicates([1, 2, 3, 2, 5]))  # Output: True
print(has_duplicates(list(range(100000)) + [99999]))  # Large list with duplicate
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 1.91 μs
False
True
True

The time difference is dramatic for large inputs: the list-based version slows down quadratically as the input grows, while the set-based version scales linearly. The micro-benchmark below makes the per-lookup gap concrete.
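
Here is a minimal micro-benchmark sketch (illustrative values, not from the original notebook; results vary by machine) that times the raw in operation on a list versus a set, probing the worst-case element at the very end:

import timeit

setup = "data = list(range(100_000)); s = set(data)"
print(timeit.timeit("99_999 in data", setup=setup, number=1_000))  # linear scan
print(timeit.timeit("99_999 in s", setup=setup, number=1_000))     # hash lookup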

Are there any scenarios where using a list might be preferable over a set for tracking seen items?#

  • Order Preservation: Lists preserve the order of elements, whereas sets do not. If maintaining the insertion order is crucial for subsequent operations, a list might still be necessary.

  • Unhashable Elements: Sets require their elements to be hashable. If the list contains unhashable items (such as other lists or dictionaries), a set cannot be used directly; the sketch after this list illustrates this and the ordering caveat.

  • Memory Overhead: Sets may have a slightly higher memory overhead compared to lists due to their underlying hash table implementation. However, this is typically negligible compared to the performance benefits for large datasets.
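
A short sketch (hypothetical data, assumed for illustration) of the first two caveats: unhashable elements reject set construction outright, while dict.fromkeys offers an order-preserving alternative with average-case \(\mathcal{O}(1)\) lookups, since dicts preserve insertion order in Python 3.7+:

rows = [[1, 2], [3, 4], [1, 2]]  # inner lists are mutable, hence unhashable
try:
    set(rows)
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# Order-preserving de-duplication of hashable items:
print(list(dict.fromkeys([3, 1, 3, 2, 1])))  # [3, 1, 2]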

Find Duplicates#

Similar idea, but the naive version below is even worse: lst.count(item) is itself an \(\mathcal{O}(N)\) scan, and the item not in duplicates check adds another, so the loop is \(\mathcal{O}(N^2)\) overall.

from typing import List, Set


def find_duplicates(lst: List[int]) -> List[int]:
    duplicates = []
    for item in lst:
        # lst.count is an O(N) scan, and the membership test on the
        # duplicates list is another: the loop is O(N^2) overall.
        if lst.count(item) > 1 and item not in duplicates:
            duplicates.append(item)
    return duplicates


def find_duplicates_efficient(lst: List[int]) -> Set[int]:
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:  # O(1) average-case hash lookup
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates
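
A quick check with hypothetical sample data (not from the original) showing that both versions find the same duplicates; note that the efficient variant returns a set rather than a list:

data = [1, 2, 1, 2, 3]
print(find_duplicates(data))            # [1, 2]
print(find_duplicates_efficient(data))  # {1, 2}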