Set Over List For Frequent Membership Tests#


from typing import List


def has_duplicates(lst: List[int]) -> bool:
    seen = []
    for item in lst:
        if item in seen:  # linear scan of the seen list: O(N) per check
            return True
        seen.append(item)
    return False
%time

print(has_duplicates([1, 2, 3, 4, 5]))  # Output: False
print(has_duplicates([1, 2, 3, 2, 5]))  # Output: True
print(has_duplicates(list(range(100000)) + [99999]))  # Large list with duplicate
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 5.96 μs
False
True
True

Time Complexity: Inefficient Membership Testing#

Using a list (seen) for membership checks (item in seen) results in \(\mathcal{O}(N)\) time complexity for each check.

The has_duplicates function has an overall time complexity of \(\mathcal{O}(N^2)\) in the worst case, making it inefficient for large lists.
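
To make the quadratic growth concrete, here is a minimal timing sketch (the input sizes and timeit setup are illustrative, not from the original; absolute numbers depend on the machine). A duplicate-free input is the worst case, because every element is compared against the entire seen list, so doubling the input size should roughly quadruple the runtime:

import timeit

for n in (1_000, 2_000, 4_000):
    data = list(range(n))  # duplicate-free input: forces the full O(N^2) scan
    elapsed = timeit.timeit(lambda: has_duplicates(data), number=5)
    print(f"n={n:>5}: {elapsed:.4f} s")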

Space Complexity#

How does the space complexity of using a set compare to using a list in this scenario?

  • Set:

    • Space complexity: \(\mathcal{O}(N)\)

    • Reason: A set requires additional space to store the hash table and its associated metadata.

  • List:

    • Space complexity: \(\mathcal{O}(N)\)

    • Reason: A list also stores references to all N elements; insertion order comes for free from its contiguous layout, so the asymptotic cost is the same.
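
Both containers are linear in N, so the difference lies in the constant factor. A rough sketch (illustrative, not from the original; exact byte counts vary by Python version and platform) comparing the container overhead alone, ignoring the integer objects both share:

import sys

items = list(range(100_000))
print(f"list: {sys.getsizeof(items):,} bytes")       # compact array of references
print(f"set:  {sys.getsizeof(set(items)):,} bytes")  # sparse hash table, larger constant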

Resolution#

  • A set should be used instead of a list for the seen collection because Python sets provide average-case \(\mathcal{O}(1)\) time complexity for membership tests (the in operator).

  • This change reduces the overall time complexity of the function to \(\mathcal{O}(N)\), making it far more efficient for large datasets.

  • The refactored has_duplicates function runs in \(\mathcal{O}(N)\) time, since each membership test and each addition to the set is \(\mathcal{O}(1)\) on average.

def has_duplicates(lst: List[int]) -> bool:
    seen = set()
    for item in lst:
        if item in seen:  # hash lookup: O(1) on average
            return True
        seen.add(item)
    return False
%time

print(has_duplicates([1, 2, 3, 4, 5]))  # Output: False
print(has_duplicates([1, 2, 3, 2, 5]))  # Output: True
print(has_duplicates(list(range(100000)) + [99999]))  # Large list with duplicate
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 1.91 μs
False
True
True

The time difference is dramatic for large inputs: the list-based version slows down quadratically as the input grows, while the set-based version scales linearly. The micro-benchmark below makes the per-lookup gap concrete.
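
Here is a minimal micro-benchmark sketch (illustrative values, not from the original notebook; results vary by machine) that times the raw in operation on a list versus a set, probing the worst-case element at the very end:

import timeit

setup = "data = list(range(100_000)); s = set(data)"
print(timeit.timeit("99_999 in data", setup=setup, number=1_000))  # linear scan
print(timeit.timeit("99_999 in s", setup=setup, number=1_000))     # hash lookup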

Are there any scenarios where using a list might be preferable over a set for tracking seen items?#

  • Order Preservation: Lists preserve the order of elements, whereas sets do not. If maintaining the insertion order is crucial for subsequent operations, a list might still be necessary.

  • Unhashable Elements: Sets require their elements to be hashable. If the list contains unhashable items (such as other lists or dictionaries), a set cannot be used directly; the sketch after this list illustrates this and the ordering caveat.

  • Memory Overhead: Sets may have a slightly higher memory overhead compared to lists due to their underlying hash table implementation. However, this is typically negligible compared to the performance benefits for large datasets.
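
A short sketch (hypothetical data, assumed for illustration) of the first two caveats: unhashable elements reject set construction outright, while dict.fromkeys offers an order-preserving alternative with average-case \(\mathcal{O}(1)\) lookups, since dicts preserve insertion order in Python 3.7+:

rows = [[1, 2], [3, 4], [1, 2]]  # inner lists are mutable, hence unhashable
try:
    set(rows)
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

# Order-preserving de-duplication of hashable items:
print(list(dict.fromkeys([3, 1, 3, 2, 1])))  # [3, 1, 2]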

Find Duplicates#

Similar idea, but the naive version below is even worse: lst.count(item) is itself an \(\mathcal{O}(N)\) scan, and the item not in duplicates check adds another, so the loop is \(\mathcal{O}(N^2)\) overall.

from typing import List, Set


def find_duplicates(lst: List[int]) -> List[int]:
    duplicates = []
    for item in lst:
        # lst.count is an O(N) scan, and the membership test on the
        # duplicates list is another: the loop is O(N^2) overall.
        if lst.count(item) > 1 and item not in duplicates:
            duplicates.append(item)
    return duplicates


def find_duplicates_efficient(lst: List[int]) -> Set[int]:
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:  # O(1) average-case hash lookup
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates
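
A quick check with hypothetical sample data (not from the original) showing that both versions find the same duplicates; note that the efficient variant returns a set rather than a list:

data = [1, 2, 1, 2, 3]
print(find_duplicates(data))            # [1, 2]
print(find_duplicates_efficient(data))  # {1, 2}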