A Rudimentary Introduction to Generator and Yield in Python#
Generator functions, introduced in Python Enhancement Proposal (PEP) 255, are a special kind of function that returns a lazy iterator: an object that can be iterated over, much like a list. The key difference is that, unlike a list, a lazy iterator does not hold its contents in memory; it produces them on the fly as it is iterated over.
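To see this behaviour in a minimal sketch (the countdown function below is an illustrative toy, not part of the reference example): nothing in the body runs until the generator is actually iterated, and each value is produced only when requested.

def countdown(n: int):
    print("starting")  # runs only when iteration begins, not when countdown() is called
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)  # no output yet: the body has not executed
print(next(gen))    # prints "starting", then 3
print(next(gen))    # 2
print(list(gen))    # [1], the remaining values produced on demand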
Reading Large Files, Generator vs Iterator#
Let’s consider the following example from How to Use Generators and yield in Python - Real Python[1]. Suppose we have a large text file that we want to read and iterate over, say, to obtain the total number of rows.
Reading a Big File into a List#
We can use the following code to read the file first into memory and then iterate over it:
import inspect
from typing import Iterator, List

from rich.pretty import pprint  # pretty-printer used for the output shown below

def file_reader_using_iterator(file_path: str) -> List[str]:
    file = open(file_path, "r", encoding="utf-8")
    print(f"Is file a generator? {inspect.isgenerator(file)}")
    print(f"Is file an iterator? {isinstance(file, Iterator)}")
    result = file.read().split("\n")
    return result

text = file_reader_using_iterator("./assets/sample.txt")
print(f"Is text a generator? {inspect.isgenerator(text)}")
print(f"Is text an iterator? {isinstance(text, Iterator)}")
pprint(text)

row_count = 0
for row in text:
    row_count += 1

print(f"Row count: {row_count}")
Is file a generator? False
Is file an iterator? True
Is text a generator? False
Is text an iterator? False
[
│   'hydra-core==1.3.2',
│   'matplotlib>=3.8.0',
│   'numpy>=1.26.0',
│   'openai>=1.1.1',
│   'pandas>=2.1.1',
│   'portalocker>=2.8.2',
│   'pydantic==2.5.2',
│   'rich>=13.6.0',
│   'seaborn>=0.13.0',
│   'tiktoken==0.5.2',
│   'torch>=2.1.0',
│   'torchinfo>=1.8.0',
│   'torchmetrics>=1.3.0',
│   'torchtext',
│   'torchvision>=0.16.0',
│   ''
]
Row count: 16
In file_reader_using_iterator, we read the entire file into memory, split it on newlines, and return the result as a list of strings (a list is an iterable, not an iterator, which is why the isinstance check above prints False). We then iterate over the list to count the number of rows; the count is 16 rather than 15 because split("\n") leaves a trailing empty string after the final newline. This approach is straightforward and easy to understand, but it has a major drawback: it reads the entire file into memory. That is fine for small files, but for large files it can cause memory issues, for example when the file itself is larger than your system's available memory.
When you read a big file into a list, you’re loading the entire content of the
file into memory at once. This is because a list in Python is an in-memory data
structure, and when you create a list containing every line of a file, each of
those lines is stored in memory. This can be highly inefficient for large files,
as it requires enough memory to hold the entire file content at once, which can
lead to MemoryError
if the file size exceeds the available memory.
Using a Generator Function#
To overcome this issue, we can use a generator function,
file_reader_using_generator
, which reads the file line by line, yielding each
line as it goes. This approach is memory-efficient because it only needs to hold
one line in memory at a time, not the entire file.
from typing import Generator

def file_reader_using_generator(file_path: str) -> Generator[str, None, None]:
    file = open(file_path, "r", encoding="utf-8")
    for row in file:
        yield row.rstrip("\n")

text_gen = file_reader_using_generator("./assets/sample.txt")
print(f"Is text_gen a generator? {inspect.isgenerator(text_gen)}")
print(f"Is text_gen an iterator? {isinstance(text_gen, Iterator)}")

row_count = 0
for row in text_gen:
    row_count += 1

print(f"Row count: {row_count}")
Is text_gen a generator? True
Is text_gen an iterator? True
Row count: 15
In file_reader_using_generator
, we open the file and iterate over it line by
line. For each line, we yield the line, which means we produce a value that can
be iterated over, but we do not terminate the function. Instead, we pause it
until the next value is requested. This allows us to read large files
efficiently, even if they are larger than the available memory.
How does this work? At a high level, when we call file_reader_using_generator, it returns a generator object. This object is an iterator, so we can iterate over it using a for loop. When we do this, the function body executes until the first yield statement, at which point the yielded value is returned to the caller and the function is paused. When the next value is requested, the function resumes from where it was paused and runs until the next yield statement is encountered. This process continues until the function terminates.
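To make this pause-and-resume behaviour visible, here is a small sketch (the trace_lines generator is a hypothetical toy, separate from the file-reading example):

def trace_lines():
    print("before first yield")
    yield "a"
    print("between yields")
    yield "b"
    print("after last yield")

gen = trace_lines()
print(next(gen))  # prints "before first yield", then yields "a"; the function is now paused
print(next(gen))  # resumes, prints "between yields", then yields "b"
# A further next(gen) would print "after last yield" and raise StopIteration.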
(All Generators are Iterators)
Note that all generators are iterators, but not all iterators are generators.
To recap, the main reason why the generator function does not hold the entire file in memory is because it yields each line one by one, rather than returning a list of all lines at once.
Lazy Evaluation: Generators are lazily evaluated. This means that they generate values on the fly as needed, rather than computing them all at once and storing them.
Single Item in Memory: At any point in time, only the current row being yielded by the generator is held in memory. Once the consumer of the generator moves to the next item, the previous item can be garbage collected if no longer referenced, keeping the memory footprint low.
Stateful Iteration: The generator function maintains its state between each yield. It knows where it left off (which line it yielded last) and resumes from that point the next time the next value is requested. This statefulness is managed without keeping the entire dataset in memory.
Next Method#
The __next__
method is a fundamental part of the iterator protocol in Python.
It’s used to get the next value in an iteration.
When you use a for
loop, or the next()
function, Python internally calls the
__next__
method of the iterator object. This method should return the next
value for the iterable. When there are no more items to return, it should raise
StopIteration
.
In the context of generators, each call to the generator’s __next__
method
resumes the generator function from where it left off and runs until the next
yield
statement, at which point it returns the yielded value and pauses
execution.
Without the __next__
method, we wouldn’t be able to use Python’s built-in
iteration mechanisms with our custom iterator or generator objects.
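For contrast, here is a rough sketch of a hand-written iterator that implements the protocol explicitly (the CountUpToIterator class is illustrative only); a generator function writes this boilerplate for you:

class CountUpToIterator:
    def __init__(self, max: int) -> None:
        self.max = max
        self.count = 0

    def __iter__(self) -> "CountUpToIterator":
        return self

    def __next__(self) -> int:
        if self.count >= self.max:
            raise StopIteration  # signals that the iteration is finished
        self.count += 1
        return self.count

for value in CountUpToIterator(3):
    print(value)  # 1, 2, 3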
So let’s see how the __next__
method works with our generator function.
try:
    first_row = text_gen.__next__()
    print(f"First row: {first_row}")
except StopIteration:
    print("StopIteration: No more rows")
StopIteration: No more rows
Oh what happened? We try to get the first row from the generator using the
__next__
method, but it raises a StopIteration
exception. This is because
the generator has already been exhausted by the for
loop earlier when we
counted the number of rows. Unlike a list, a generator can only be iterated over
once. Once it’s been exhausted, it can’t be iterated over again and will raise a
StopIteration
exception if you try to do so.
Let’s create a new generator and see how the __next__
method works.
text_gen = file_reader_using_generator("./assets/sample.txt")
first_row = text_gen.__next__()
print(f"First row: {first_row}")

second_row = text_gen.__next__()
print(f"Second row: {second_row}")
First row: hydra-core==1.3.2
Second row: matplotlib>=3.8.0
Generator Expression#
Similar to list comprehensions, you can create a generator with a generator expression (also called a generator comprehension), which lets you build a generator without defining a function.
text_gen_comprehension = (row for row in open("./assets/sample.txt", "r", encoding="utf-8"))
print(f"Is text_gen_comprehension a generator? {inspect.isgenerator(text_gen_comprehension)}")
print(f"Is text_gen_comprehension an iterator? {isinstance(text_gen_comprehension, Iterator)}")
Is text_gen_comprehension a generator? True
Is text_gen_comprehension an iterator? True
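One convenient consequence (a small aside, not from the reference) is that a generator expression passed as the only argument of a function call needs no extra parentheses, so it can be consumed directly by functions such as sum():

# The call's own parentheses double as the generator expression's parentheses.
total_chars = sum(len(row) for row in open("./assets/sample.txt", "r", encoding="utf-8"))
print(total_chars)  # total number of characters across all rows, newlines included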
How Does a Generator Work?#
Generator functions are nearly indistinguishable from standard functions in
appearance and behavior, with one key distinction. They utilize the yield
keyword in place of return
. Consider the generator function that yields the
next integer indefinitely:
def infinite_sequence() -> Generator[int, None, None]:
num = 0
while True:
yield num
num += 1
This function might look familiar, but it’s the yield
statement that sets it
apart. yield
serves to return a value to the caller without exiting the
function.
What’s truly unique here is how the function’s state is preserved. Upon each
subsequent call to next()
on the generator object (whether done directly or
through a loop), the function picks up right where it left off, incrementing and
yielding num
once more[1].
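Driving it by hand (a quick sketch) shows the state being preserved between calls:

gen = infinite_sequence()
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2
# The sequence never ends; the caller decides when to stop asking for values.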
Profiling Generator Performance#
Let’s compare the performance of the generator function and the list comprehension.
Memory Efficiency#
Let’s create a list of squared numbers using a list comprehension and a generator expression, and compare their memory usage.
import sys

N = 100000

nums_squared_list_comprehension = [num ** 2 for num in range(N)]
print(f"Size of nums_squared_list_comprehension: {sys.getsizeof(nums_squared_list_comprehension)} bytes")

nums_squared_generator = (num ** 2 for num in range(N))
print(f"Size of nums_squared_generator: {sys.getsizeof(nums_squared_generator)} bytes")
Size of nums_squared_list_comprehension: 800984 bytes
Size of nums_squared_generator: 112 bytes
The size of nums_squared_list_comprehension is 800984 bytes, while the size of nums_squared_generator is only 112 bytes.
The list comprehension (nums_squared_list_comprehension
) creates a list of all
squared numbers at once. This means it needs to allocate enough memory to hold
all these numbers. This can be quite large for big sequences, as shown by the
sys.getsizeof(nums_squared_list_comprehension)
call.
On the other hand, the generator expression (nums_squared_generator
) doesn’t
compute all the squared numbers at once. Instead, it computes them one at a
time, on-the-fly, as you iterate over the generator. This means it doesn’t need
to allocate memory for the whole sequence, only for the current number. This is
why sys.getsizeof(nums_squared_generator)
returns a much smaller number.
This demonstrates the main advantage of generators when it comes to memory efficiency: they allow you to work with large sequences of data without needing to load the entire sequence into memory. This can be a significant advantage when working with large data sets, where loading the entire data set into memory might not be feasible.
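As a quick sanity check (a sketch reusing the list from the previous cell and a fresh generator expression, since a generator can only be consumed once), the generator yields exactly the same values as the list when consumed:

print(sum(num ** 2 for num in range(N)) == sum(nums_squared_list_comprehension))  # True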
Time Efficiency#
We know that creating a very large list in memory can take time and potentially hang the system. However, if the list comfortably fits in system memory, a list comprehension can be faster to evaluate than the equivalent generator expression[1].
import cProfile

cProfile.run("sum([num ** 2 for num in range(N)])")
cProfile.run("sum(num ** 2 for num in range(N))")
5 function calls in 0.037 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.036 0.036 0.036 0.036 <string>:1(<listcomp>)
1 0.001 0.001 0.037 0.037 <string>:1(<module>)
1 0.000 0.000 0.037 0.037 {built-in method builtins.exec}
1 0.001 0.001 0.001 0.001 {built-in method builtins.sum}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100005 function calls in 0.044 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
100001 0.037 0.000 0.037 0.000 <string>:1(<genexpr>)
1 0.000 0.000 0.044 0.044 <string>:1(<module>)
1 0.000 0.000 0.044 0.044 {built-in method builtins.exec}
1 0.007 0.007 0.044 0.044 {built-in method builtins.sum}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
This shows that, for a small list, the list comprehension is faster than the generator expression.
Now, why are there 100005 function calls for N=100000?
100001 calls come from the generator expression <string>:1(<genexpr>): it is advanced once for each of the 100000 numbers in range(N), plus one extra call that raises the StopIteration exception when the generator is exhausted. The remaining 4 calls come from <string>:1(<module>), {built-in method builtins.exec}, {built-in method builtins.sum}, and {method 'disable' of '_lsprof.Profiler' objects}, each of which is called once.
So, in total, there are 100001 (from the generator expression) + 4 (from the other functions) = 100005 function calls.
Yield#
The yield statement in Python is what defines a generator function, so let’s take a closer look.
The yield
Statement#
A generator function is a function that, when called, returns a
generator iterator. This is achieved by including at least one yield
statement in the function definition. Unlike a return
statement, which
terminates a function entirely and sends a value back to the caller, yield
pauses the function, saving its state for continuation when next required.
When a generator function calls yield
, the function execution is paused,
and a value is sent to the caller. However, the function’s local variables and
execution state are saved internally. The next time the generator is advanced
(using the next()
function or a for loop, for example), execution resumes from
exactly where it was left off, immediately after the yield
statement.
More concretely (and verbosely), upon encountering a yield
statement, the
function’s current state is preserved or “frozen”. This means that local
variables, the instruction pointer, and even the state of the evaluation stack
are saved. Consequently, when next() is called again, the function resumes
precisely from where it left off, as if yield
were a pause in execution rather
than an interruption. This mechanism allows generator functions to produce a
sequence of values over time, providing an efficient way to work with data
streams or large datasets without requiring all data to be loaded into memory
simultaneously[2].
Moreover, when a generator function is called, the actual arguments are bound to function-local formal argument names in the usual way, but no code in the body of the function is executed. Instead a generator-iterator object is returned; this conforms to the iterator protocol, so in particular can be used in for-loops in a natural way. Note that when the intent is clear from context, the unqualified name “generator” may be used to refer either to a generator-function or a generator-iterator[2].
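A small sketch of this last point (illustrative only): calling the generator function binds its arguments but runs none of the body until the first next().

def noisy_gen(x: int):
    print("body started")  # not printed at call time
    yield x

g = noisy_gen(42)           # arguments are bound, but no body code runs yet
print("generator created")  # printed first
print(next(g))              # now "body started" is printed, then 42 is yielded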
An Example#
Let’s look at a function count_up_to that simply yields the count up to a maximum number.
def count_up_to(max: int) -> Generator[int, None, None]:
    count = 1
    while count <= max:
        yield count
        count += 1

gen = count_up_to(3)
for number in gen:
    print(number)
1
2
3
First, let’s add type hints to the function. The return type of a generator
function can be hinted using Generator[YieldType, SendType, ReturnType]
from
the typing
module. Since this generator yields integers and does not
explicitly return a value, we’ll use None
for both the SendType
and
ReturnType
.
Secondly, as we can see, it prints 1 to 3 incrementally. When yield count is executed, the generator pauses and returns control to the caller. When the caller resumes the generator, execution continues from where it left off: count += 1 runs, incrementing the counter, and then the loop condition count <= max is checked before yielding again.
So the key point is that the generator resumes and continues execution from where it previously yielded each time next() or send() is called.
Adding a Return Statement#
The yield
statement allows the generator to produce a series of values, while
the return
statement can be used to terminate the generator and, optionally,
to provide a value that is accessible through the StopIteration
exception
raised when the generator is exhausted.
Let’s modify the function to return a message when the count exceeds the maximum:
from typing import Literal

def count_up_to(max: int) -> Generator[int, None, Literal["Completed!"]]:
    count = 1
    while count <= max:
        yield count
        count += 1
    return "Completed!"

gen = count_up_to(3)
try:
    while True:
        print(next(gen))
except StopIteration as err:
    completion_status = err.value
    print(completion_status)  # Output: Completed!
1
2
3
Completed!
Adding send#
The send()
method of a generator is used to send a value back into the
generator function. The value sent in is returned by the yield
expression.
This can be used to modify the internal state of the generator. Let’s adapt the
function to use send()
to optionally reset the count
.
def count_up_to(max: int) -> Generator[int, int, Literal["Completed!"]]:
    count = 1
    while count <= max:
        received = yield count
        print(f"count: {count}, received: {received}")
        if received is not None:
            count = received
        else:
            count += 1
    return "Completed!"

gen = count_up_to(10)
print(gen.__next__())  # 1
print(gen.send(5))     # 5
for number in gen:
    print(number)      # Continues from 6 to 10
1
count: 1, received: 5
5
count: 5, received: None
6
count: 6, received: None
7
count: 7, received: None
8
count: 8, received: None
9
count: 9, received: None
10
count: 10, received: None
This example illustrates basic usage, including how to use send()
to alter the
internal state of the generator. After initializing the generator:
Printing the first value with __next__() gives 1.
Then send(5) is called: the paused received = yield count expression evaluates to 5, so received is 5 and count is subsequently set to 5 as well.
The generator resumes and hits the yield count statement again, yielding the current value of count, which is now 5, so 5 is printed next.
The generator then continues, yielding values from 6 to 10 as it iterates through the remaining loop cycles, with each value being printed in the for loop.
(Yield is an expression and not a statement)
How did the received
become None
after send
is done?
The key to understanding the behavior of your count_up_to
generator,
especially in relation to how received
can be None
, lies in how the
generator is advanced and interacts with the .send()
method versus the
.__next__()
method (or its equivalent, next(gen)
).
When you first call gen.__next__()
or next(gen)
, the generator starts
executing up to the first yield
statement, yielding the value of count
(which is 1
). At this point, since you’re not using .send()
to advance the
generator but .__next__()
instead, the value received by the yield
expression is None
. This is the default behavior when the generator is
advanced without explicitly sending a value. The generator then proceeds to the
if received is not None:
check. Since received
is None
, the condition
fails, and execution moves to the else:
clause, incrementing count
.
However, when you call gen.send(5)
, you’re explicitly sending a value (5
)
into the generator, which resumes execution right after the yield
statement,
with received
now being 5
. This means the if received is not None:
condition succeeds, and the code inside that block executes, setting count
to
5
.
To clarify, here’s a step-by-step breakdown:
Initial Call with .__next__(): the generator yields 1, and received is implicitly None because no value was sent into the generator. The else: clause is executed, incrementing count.
Call with .send(5): received is set to 5, so the if received is not None: condition is true and count is set to 5.
Subsequent Iteration in the For Loop: the for loop implicitly calls .__next__() on each iteration, not .send(), so no new value is sent into the generator. Therefore, received is None again for each iteration within the loop, and the generator simply increments count until it exceeds max.
This mechanism allows the generator to either accept new values from the outside
via .send(value)
or continue its own internal logic, incrementing count
,
when advanced with .__next__()
or next(gen)
, where no external value is
provided, and thus received
is None
.
What we see here is a coroutine: a generator function into which you can pass data[1].
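As a further illustrative sketch of the coroutine pattern (the running_average function below is a classic example, not part of count_up_to), a coroutine can consume values sent to it and yield a running result:

from typing import Generator

def running_average() -> Generator[float, float, None]:
    total = 0.0
    count = 0
    average = 0.0
    while True:
        value = yield average  # receive a number from send(), emit the average so far
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the coroutine by advancing it to the first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0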
Adding throw and close#
Let’s extend our example to demonstrate how to use the .throw()
and .close()
methods with our generator function.
We’ll continue with the modified count_up_to
function that allows for
resetting the count via the send()
method.
def count_up_to(max: int) -> Generator[int | str, int, Literal["Completed!"]]:
    count = 1
    while count <= max:
        try:
            received = yield count
            if received is not None:
                count = received
            else:
                count += 1
        except ValueError as err:
            print(f"Exception caught inside generator: {err}")
            count = max + 1  # Force the loop to end.
            yield "Exception processed"
    return "Completed!"
The .throw()
method is used to throw exceptions from the calling scope into
the generator. When a generator encounters an exception thrown into it, it can
either handle the exception or let it propagate, terminating the generator.
gen = count_up_to(5)

print(next(gen))  # Starts the generator, prints 1

# Injecting an exception into the generator
try:
    gen.throw(ValueError, "Something went wrong")
except StopIteration as err:
    print("Generator returned:", err.value)
1
Exception caught inside generator: Something went wrong
In this example, after starting the generator and advancing it to yield 1, we throw a ValueError into the generator using .throw(). A generator function can catch such an exception and yield a response, or allow it to propagate, leading to the generator’s termination. Our function does catch ValueError inside its loop: it prints a message, forces the loop to end, and yields "Exception processed" back as the return value of .throw(), which is why no StopIteration is raised here. Had the exception not been caught, the ValueError would have propagated and the generator would have terminated.
The .close() method is used to stop a generator. When .close() is called while the generator is paused at a yield expression, a GeneratorExit exception is raised inside the generator function at that point. This can be used to perform any cleanup actions before the generator stops.
gen = count_up_to(10)

print(next(gen))  # Output: 1
print(next(gen))  # Output: 2

# Close the generator
gen.close()

# Trying to advance the generator after closing it will raise StopIteration
try:
    print(next(gen))
except StopIteration:
    print("Generator has been closed.")
1
2
Generator has been closed.
In this scenario, we start the generator, yield a couple of values, and then
close it using .close()
. Any attempt to advance the generator after closing it
results in a StopIteration
exception, indicating that the generator is
exhausted.
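To make the cleanup opportunity concrete, here is a hedged sketch (the stream_lines generator is hypothetical, not part of count_up_to) that uses try/finally so a resource is released whether the generator is exhausted or closed:

from typing import Generator

def stream_lines(file_path: str) -> Generator[str, None, None]:
    file = open(file_path, "r", encoding="utf-8")
    try:
        for line in file:
            yield line.rstrip("\n")
    finally:
        # Runs on normal exhaustion, on .close() (GeneratorExit), or at garbage collection.
        print("closing file")
        file.close()

gen = stream_lines("./assets/sample.txt")
print(next(gen))  # first line of the file
gen.close()       # triggers the finally block, printing "closing file"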
DataLoaders, Streaming and Lazy Loading#
Deep learning models, particularly those trained on large datasets, benefit
significantly from efficient data loading mechanisms. PyTorch, a popular deep
learning framework, provides a powerful abstraction for this purpose through its
DataLoader
class, which under the hood can be understood as leveraging
Python’s generator functionality for streaming data.
Generators and Streaming Data#
Generators in Python are a way to iterate over data without loading the entire dataset into memory. This is especially useful in deep learning, where datasets can be enormous. A generator-based DataLoader:
Lazily Loads Data: It loads data as needed, rather than all at once. This means that at any point, only a portion of the dataset is in memory, making it possible to work with datasets larger than the available system memory.
Supports Parallel Data Processing: PyTorch’s DataLoader can prefetch batches of data using multiple worker processes. This is akin to a generator yielding batches of data in parallel, improving efficiency by overlapping data loading with model training computations.
Enables Real-time Data Augmentation: Data augmentation (e.g., random transformations of images) can be applied on-the-fly as each batch is loaded. This dynamic generation of training samples from a base dataset keeps memory use low and variation high.
Here’s a simplified conceptual example of how a data loader might be implemented using a generator pattern in PyTorch:
from typing import Any, Sized, TypeVar

import torch
from torch.utils.data import DataLoader, Dataset

T_co = TypeVar('T_co', covariant=True)

class MyDataset(Dataset[T_co]):
    def __init__(self, data: Sized) -> None:
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        sample = self.data[index]
        return sample

data = torch.randn(8, 3)  # 8 samples, 3 features each
dataset = MyDataset(data)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
dataloader = iter(dataloader)

try:
    while True:
        _ = dataloader.__next__()
except StopIteration:
    print("StopIteration: No more data.")
StopIteration: No more data.
A Naive Implementation of DataLoader#
def simple_data_loader(
    dataset: Dataset[T_co], batch_size: int = 1
) -> Generator[List[T_co], None, None]:
    batch = []
    for idx in range(len(dataset)):
        batch.append(dataset[idx])
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield any remaining data as the last batch
    if batch:
        yield batch


data = list(range(100))  # Simulated dataset of 100 integers
dataset = MyDataset(data)

# Create and use the data loader
batch_size = 10
dataloader = simple_data_loader(dataset, batch_size=batch_size)

try:
    while True:
        print(dataloader.__next__())
except StopIteration:
    print("StopIteration: No more data.")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
StopIteration: No more data.
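Finally, as a hedged sketch of the streaming idea (not from the reference), PyTorch's torch.utils.data.IterableDataset lets you back a dataset with a generator, so samples are streamed from disk rather than indexed from an in-memory list:

from typing import Iterator

from torch.utils.data import DataLoader, IterableDataset


class StreamingTextDataset(IterableDataset):
    """Streams one line at a time from a text file via a generator."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def __iter__(self) -> Iterator[str]:
        with open(self.file_path, "r", encoding="utf-8") as file:
            for line in file:
                yield line.rstrip("\n")


# Batches are assembled lazily; the whole file is never held in memory at once.
loader = DataLoader(StreamingTextDataset("./assets/sample.txt"), batch_size=4)
for batch in loader:
    print(batch)  # a list of up to 4 strings per batch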