Generators Over Lists For Memory Efficiency

Generators Over Lists For Memory Efficiency#

Twitter Handle LinkedIn Profile GitHub Profile Tag

from typing import Generator, Iterable, List


def process_large_dataset_inefficient(data: Iterable[int]) -> int:
    processed: List[int] = [x * 2 for x in data if x > 0]
    return sum(processed)
%time

large_data = range(10**8)  # 10 million items
result = process_large_dataset_inefficient(large_data)
print(result)
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 3.1 µs
9999999900000000
  • [Generator] Use generator instead of list to save memory.

  • [Eager Evaluation] List is eager evaluation, means that it will evaluate the entire list before returning. This implies the entire data structure (list) is computed and stored in memory all at once.

  • [Lazy Evaluation] Generator is lazy evaluation, it will evaluate the item on the fly.

The process_large_dataset_inefficient function is designed to process a large dataset by performing the following operations:

  1. List Comprehension: It creates a new list, processed, containing elements from data that are greater than 0, each multiplied by 2.

  2. Summation: It then computes the sum of all elements in the processed list.

While this approach is straightforward and works well for smaller datasets, it becomes inefficient and potentially problematic when dealing with very large datasets due to the following reasons:

  • High Memory Consumption: The list comprehension [x * 2 for x in data if x > 0] generates an entire list in memory. For large datasets, this can consume a significant amount of memory, leading to increased memory usage or even memory exhaustion.

  • Unnecessary Intermediate Storage: Storing all processed elements before summing them is unnecessary when only the cumulative sum is required. This intermediate storage adds overhead without providing any tangible benefits.

  • Lack of Lazy Evaluation: The current implementation does not leverage Python’s ability to handle data lazily, which can process elements on-the-fly without holding the entire dataset in memory

squared_gen: Generator[int, None, None] = (x**2 for x in range(10))
print(squared_gen)
print(type(squared_gen))
print(isinstance(squared_gen, Generator))
<generator object <genexpr> at 0x110569740>
<class 'generator'>
True
def process_large_dataset_efficient(data: Iterable[int]) -> int:
    processed: Generator[int, None, None] = (x * 2 for x in data if x > 0)
    return sum(processed)
  • Generator Expression: Replaced the list comprehension with a generator expression: (x * 2 for x in data if x > 0). This change ensures that elements are processed one at a time, reducing memory footprint.

  • Elimination of Intermediate List: Removed the processed list, thereby avoiding the storage of all processed elements in memory.

  • Documentation: Added a docstring to explain the purpose and behavior of the function, enhancing code readability and maintainability.

%time

result = process_large_dataset_efficient(large_data)
print(result)
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 5.25 µs
9999999900000000

Time Complexity#

  • Question: What is the time complexity of the original function compared to the refactored version?

  • Answer: Both functions have \(\mathcal{O}(N)\) time complexity, where N is the number of elements in data. This is because each function iterates through the entire dataset once to process and sum the elements.

Space Complexity#

  • Question: What is the space complexity of the original function compared to the refactored version?

  • Answer: The original function has \(\mathcal{O}(N)\) space complexity due to the creation of the processed list, where N is the number of elements in data that satisfy the condition x > 0. The refactored version using a generator expression has \(\mathcal{O}(1)\) space complexity, as it processes one element at a time without storing the entire list.

Benchmark#

import timeit
from memory_profiler import memory_usage


def benchmark() -> None:
    data = range(10**8)

    def run_inefficient() -> int:
        return process_large_dataset_inefficient(data)

    def run_efficient() -> int:
        return process_large_dataset_efficient(data)

    mem_inefficient = max(memory_usage(run_inefficient))
    time_inefficient = timeit.timeit(run_inefficient, number=1)

    mem_efficient = max(memory_usage(run_efficient))
    time_efficient = timeit.timeit(run_efficient, number=1)

    print(
        f"Original Function: Time = {time_inefficient:.2f}s, Max Memory = {mem_inefficient:.2f}MB"
    )
    print(
        f"Refactored Function: Time = {time_efficient:.2f}s, Max Memory = {mem_efficient:.2f}MB"
    )


benchmark()
Original Function: Time = 6.71s, Max Memory = 2118.64MB
Refactored Function: Time = 4.04s, Max Memory = 26.97MB

More accurate profiling, run in python script instead.

import cProfile
import io
import pstats
from typing import Iterable

from memory_profiler import profile


def process_large_dataset_inefficient(data: Iterable[int]) -> int:
    processed = [x * 2 for x in data if x > 0]
    return sum(processed)


def process_large_dataset_efficient(data: Iterable[int]) -> int:
    processed = (x * 2 for x in data if x > 0)
    return sum(processed)


@profile
def run_inefficient(data_size: int) -> int:
    data = range(data_size)
    return process_large_dataset_inefficient(data)


@profile
def run_efficient(data_size: int) -> int:
    data = range(data_size)
    return process_large_dataset_efficient(data)


def profile_function(func, data_size: int) -> None:
    pr = cProfile.Profile()
    pr.enable()
    result = func(data_size)
    pr.disable()

    s = io.StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
    ps.print_stats()

    print(f"Result: {result}")
    print(s.getvalue())


if __name__ == "__main__":
    data_size = 10**8  # Adjust as needed

    print("Profiling inefficient function:")
    profile_function(run_inefficient, data_size)

    print("\nProfiling efficient function:")
    profile_function(run_efficient, data_size)
Profiling inefficient function:
ERROR: Could not find file /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py
Result: 9999999900000000
         77 function calls in 26.042 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   26.042   26.042 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1185(wrapper)
        1    0.000    0.000   26.041   26.041 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:759(f)
        1    0.563    0.563   26.040   26.040 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:19(run_inefficient)
        1    0.001    0.001   25.477   25.477 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:9(process_large_dataset_inefficient)
        1   24.020   24.020   24.020   24.020 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:10(<listcomp>)
        1    1.457    1.457    1.457    1.457 {built-in method builtins.sum}
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:713(__call__)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:728(add_function)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:645(add)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:114(__enter__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/genericpath.py:16(exists)
        1    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:748(wrap_function)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/asyncio/coroutines.py:164(iscoroutinefunction)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:190(iscoroutinefunction)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:626(write)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:261(helper)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:740(_count_ctxmgr)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:775(enable_by_count)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:842(enable)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:86(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:172(_has_code_flag)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:702(__init__)
        2    0.000    0.000    0.000    0.000 {built-in method sys.settrace}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1201(choose_backend)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:123(__exit__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:782(disable_by_count)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:853(show_results)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:849(disable)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:689(items)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:521(_is_master_process)
        6    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1215(<genexpr>)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:640(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:81(ismethod)
        3    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        5    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/functools.py:420(_unwrap_partial)
        4    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:159(isfunction)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:548(_schedule_flush)
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        2    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 {method 'insert' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {built-in method sys.gettrace}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




Profiling efficient function:
ERROR: Could not find file /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py
Result: 9999999900000000
         100000076 function calls in 63.801 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   63.801   63.801 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1185(wrapper)
        1    0.000    0.000   63.801   63.801 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:759(f)
        1    0.000    0.000   63.801   63.801 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:25(run_efficient)
        1    0.000    0.000   63.801   63.801 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:14(process_large_dataset_efficient)
        1   20.677   20.677   63.801   63.801 {built-in method builtins.sum}
100000000   43.124    0.000   43.124    0.000 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:15(<genexpr>)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:713(__call__)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:728(add_function)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:645(add)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/genericpath.py:16(exists)
        1    0.001    0.001    0.001    0.001 {built-in method posix.stat}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:748(wrap_function)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/asyncio/coroutines.py:164(iscoroutinefunction)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:190(iscoroutinefunction)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:702(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:172(_has_code_flag)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:123(__exit__)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:626(write)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:740(_count_ctxmgr)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:261(helper)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:81(ismethod)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1201(choose_backend)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:782(disable_by_count)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:86(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:114(__enter__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:775(enable_by_count)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:853(show_results)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:849(disable)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:842(enable)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:521(_is_master_process)
        6    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1215(<genexpr>)
        2    0.000    0.000    0.000    0.000 {built-in method sys.settrace}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:689(items)
        5    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:640(__init__)
        3    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:159(isfunction)
        4    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:548(_schedule_flush)
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/functools.py:420(_unwrap_partial)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.RLock' objects}
        2    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        1    0.000    0.000    0.000    0.000 {method 'insert' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method sys.gettrace}

References And Further Readings#

  • Item 30 Of Effective Python