Generators Over Lists For Memory Efficiency#
from typing import Generator, Iterable, List
def process_large_dataset_inefficient(data: Iterable[int]) -> int:
processed: List[int] = [x * 2 for x in data if x > 0]
return sum(processed)
%time
large_data = range(10**8) # 10 million items
result = process_large_dataset_inefficient(large_data)
print(result)
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 3.1 µs
9999999900000000
[Generator] Use generator instead of list to save memory.
[Eager Evaluation] List is eager evaluation, means that it will evaluate the entire list before returning. This implies the entire data structure (list) is computed and stored in memory all at once.
[Lazy Evaluation] Generator is lazy evaluation, it will evaluate the item on the fly.
The process_large_dataset_inefficient
function is designed to process a large
dataset by performing the following operations:
List Comprehension: It creates a new list,
processed
, containing elements fromdata
that are greater than 0, each multiplied by 2.Summation: It then computes the sum of all elements in the
processed
list.
While this approach is straightforward and works well for smaller datasets, it becomes inefficient and potentially problematic when dealing with very large datasets due to the following reasons:
High Memory Consumption: The list comprehension
[x * 2 for x in data if x > 0]
generates an entire list in memory. For large datasets, this can consume a significant amount of memory, leading to increased memory usage or even memory exhaustion.Unnecessary Intermediate Storage: Storing all processed elements before summing them is unnecessary when only the cumulative sum is required. This intermediate storage adds overhead without providing any tangible benefits.
Lack of Lazy Evaluation: The current implementation does not leverage Python’s ability to handle data lazily, which can process elements on-the-fly without holding the entire dataset in memory
squared_gen: Generator[int, None, None] = (x**2 for x in range(10))
print(squared_gen)
print(type(squared_gen))
print(isinstance(squared_gen, Generator))
<generator object <genexpr> at 0x110569740>
<class 'generator'>
True
def process_large_dataset_efficient(data: Iterable[int]) -> int:
processed: Generator[int, None, None] = (x * 2 for x in data if x > 0)
return sum(processed)
Generator Expression: Replaced the list comprehension with a generator expression:
(x * 2 for x in data if x > 0)
. This change ensures that elements are processed one at a time, reducing memory footprint.Elimination of Intermediate List: Removed the
processed
list, thereby avoiding the storage of all processed elements in memory.Documentation: Added a docstring to explain the purpose and behavior of the function, enhancing code readability and maintainability.
%time
result = process_large_dataset_efficient(large_data)
print(result)
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 5.25 µs
9999999900000000
Time Complexity#
Question: What is the time complexity of the original function compared to the refactored version?
Answer: Both functions have \(\mathcal{O}(N)\) time complexity, where
N
is the number of elements indata
. This is because each function iterates through the entire dataset once to process and sum the elements.
Space Complexity#
Question: What is the space complexity of the original function compared to the refactored version?
Answer: The original function has \(\mathcal{O}(N)\) space complexity due to the creation of the
processed
list, whereN
is the number of elements indata
that satisfy the conditionx > 0
. The refactored version using a generator expression has \(\mathcal{O}(1)\) space complexity, as it processes one element at a time without storing the entire list.
Benchmark#
import timeit
from memory_profiler import memory_usage
def benchmark() -> None:
data = range(10**8)
def run_inefficient() -> int:
return process_large_dataset_inefficient(data)
def run_efficient() -> int:
return process_large_dataset_efficient(data)
mem_inefficient = max(memory_usage(run_inefficient))
time_inefficient = timeit.timeit(run_inefficient, number=1)
mem_efficient = max(memory_usage(run_efficient))
time_efficient = timeit.timeit(run_efficient, number=1)
print(
f"Original Function: Time = {time_inefficient:.2f}s, Max Memory = {mem_inefficient:.2f}MB"
)
print(
f"Refactored Function: Time = {time_efficient:.2f}s, Max Memory = {mem_efficient:.2f}MB"
)
benchmark()
Original Function: Time = 6.71s, Max Memory = 2118.64MB
Refactored Function: Time = 4.04s, Max Memory = 26.97MB
More accurate profiling, run in python script instead.
import cProfile
import io
import pstats
from typing import Iterable
from memory_profiler import profile
def process_large_dataset_inefficient(data: Iterable[int]) -> int:
processed = [x * 2 for x in data if x > 0]
return sum(processed)
def process_large_dataset_efficient(data: Iterable[int]) -> int:
processed = (x * 2 for x in data if x > 0)
return sum(processed)
@profile
def run_inefficient(data_size: int) -> int:
data = range(data_size)
return process_large_dataset_inefficient(data)
@profile
def run_efficient(data_size: int) -> int:
data = range(data_size)
return process_large_dataset_efficient(data)
def profile_function(func, data_size: int) -> None:
pr = cProfile.Profile()
pr.enable()
result = func(data_size)
pr.disable()
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats()
print(f"Result: {result}")
print(s.getvalue())
if __name__ == "__main__":
data_size = 10**8 # Adjust as needed
print("Profiling inefficient function:")
profile_function(run_inefficient, data_size)
print("\nProfiling efficient function:")
profile_function(run_efficient, data_size)
Profiling inefficient function:
ERROR: Could not find file /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py
Result: 9999999900000000
77 function calls in 26.042 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 26.042 26.042 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1185(wrapper)
1 0.000 0.000 26.041 26.041 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:759(f)
1 0.563 0.563 26.040 26.040 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:19(run_inefficient)
1 0.001 0.001 25.477 25.477 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:9(process_large_dataset_inefficient)
1 24.020 24.020 24.020 24.020 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:10(<listcomp>)
1 1.457 1.457 1.457 1.457 {built-in method builtins.sum}
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:713(__call__)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:728(add_function)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:645(add)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:114(__enter__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/genericpath.py:16(exists)
1 0.000 0.000 0.000 0.000 {built-in method posix.stat}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:748(wrap_function)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/asyncio/coroutines.py:164(iscoroutinefunction)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:190(iscoroutinefunction)
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:626(write)
2 0.000 0.000 0.000 0.000 {built-in method builtins.next}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:261(helper)
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:740(_count_ctxmgr)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:775(enable_by_count)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:842(enable)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:86(__init__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:172(_has_code_flag)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:702(__init__)
2 0.000 0.000 0.000 0.000 {built-in method sys.settrace}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1201(choose_backend)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:123(__exit__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:782(disable_by_count)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:853(show_results)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:849(disable)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:689(items)
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:521(_is_master_process)
6 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1215(<genexpr>)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:640(__init__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:81(ismethod)
3 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
5 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/functools.py:420(_unwrap_partial)
4 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:159(isfunction)
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:548(_schedule_flush)
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
2 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
2 0.000 0.000 0.000 0.000 {method '__exit__' of '_thread.RLock' objects}
1 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
2 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method sys.gettrace}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Profiling efficient function:
ERROR: Could not find file /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py
Result: 9999999900000000
100000076 function calls in 63.801 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 63.801 63.801 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1185(wrapper)
1 0.000 0.000 63.801 63.801 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:759(f)
1 0.000 0.000 63.801 63.801 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:25(run_efficient)
1 0.000 0.000 63.801 63.801 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:14(process_large_dataset_efficient)
1 20.677 20.677 63.801 63.801 {built-in method builtins.sum}
100000000 43.124 0.000 43.124 0.000 /var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_41702/1029353620.py:15(<genexpr>)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:713(__call__)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:728(add_function)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:645(add)
1 0.000 0.000 0.001 0.001 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/genericpath.py:16(exists)
1 0.001 0.001 0.001 0.001 {built-in method posix.stat}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:748(wrap_function)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/asyncio/coroutines.py:164(iscoroutinefunction)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:190(iscoroutinefunction)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:702(__init__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:172(_has_code_flag)
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:123(__exit__)
2 0.000 0.000 0.000 0.000 {built-in method builtins.next}
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:626(write)
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:740(_count_ctxmgr)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:261(helper)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:81(ismethod)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1201(choose_backend)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:782(disable_by_count)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:86(__init__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/contextlib.py:114(__enter__)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:775(enable_by_count)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:853(show_results)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:849(disable)
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:842(enable)
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:521(_is_master_process)
6 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:1215(<genexpr>)
2 0.000 0.000 0.000 0.000 {built-in method sys.settrace}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:689(items)
5 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/memory_profiler.py:640(__init__)
3 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/inspect.py:159(isfunction)
4 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/site-packages/ipykernel/iostream.py:548(_schedule_flush)
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
1 0.000 0.000 0.000 0.000 /opt/homebrew/Caskroom/miniconda/base/envs/omniverse/lib/python3.9/functools.py:420(_unwrap_partial)
2 0.000 0.000 0.000 0.000 {method '__exit__' of '_thread.RLock' objects}
2 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
1 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {built-in method sys.gettrace}
References And Further Readings#
Item 30 Of Effective Python