A Rudimentary Introduction to Generator and Yield in Python#


Generator functions, introduced in Python Enhancement Proposal (PEP) 255, are a special kind of function that returns a lazy iterator: an object that can be iterated over, much like a list. The key difference, however, is that unlike a list, a lazy iterator does not hold its contents in memory. Instead, it generates its contents on the fly, as it is iterated over.

Reading Large Files, Generator vs Iterator#

Let’s consider the following example from How to Use Generators and yield in Python - Real Python[1]. Suppose we have a large text file that we want to read and iterate over, say, to obtain the total number of rows.

Reading a Big File into a List#

We can use the following code to read the file first into memory and then iterate over it:

import inspect
from collections.abc import Iterator
from pprint import pprint
from typing import List


def file_reader_using_iterator(file_path: str) -> List[str]:
    file = open(file_path, "r", encoding="utf-8")
    print(f"Is file a generator? {inspect.isgenerator(file)}")
    print(f"Is file an iterator? {isinstance(file, Iterator)}")
    result = file.read().split("\n")
    return result


text = file_reader_using_iterator("./assets/sample.txt")
print(f"Is text a generator? {inspect.isgenerator(text)}")
print(f"Is text an iterator? {isinstance(text, Iterator)}")
pprint(text)

row_count = 0
for row in text:
    row_count += 1

print(f"Row count: {row_count}")
Is file a generator? False
Is file an iterator? True
Is text a generator? False
Is text an iterator? False
[
│   'hydra-core==1.3.2',
│   'matplotlib>=3.8.0',
│   'numpy>=1.26.0',
│   'openai>=1.1.1',
│   'pandas>=2.1.1',
│   'portalocker>=2.8.2',
│   'pydantic==2.5.2',
│   'rich>=13.6.0',
│   'seaborn>=0.13.0',
│   'tiktoken==0.5.2',
│   'torch>=2.1.0',
│   'torchinfo>=1.8.0',
│   'torchmetrics>=1.3.0',
│   'torchtext',
│   'torchvision>=0.16.0',
│   ''
]
Row count: 16

In file_reader_using_iterator, we read the entire file into memory, split it on newlines, and return it as a list of strings (note that a list is an iterable, not an iterator, which is why the isinstance check above returns False). Then we iterate over the list to count the number of rows. This approach is straightforward and easy to understand, but it has a major drawback: it reads the entire file into memory. This is not a problem for small files, but for large files it can cause memory issues, with the file itself possibly being larger than your system’s available memory.

When you read a big file into a list, you’re loading the entire content of the file into memory at once. This is because a list in Python is an in-memory data structure, and when you create a list containing every line of a file, each of those lines is stored in memory. This can be highly inefficient for large files, as it requires enough memory to hold the entire file content at once, which can lead to MemoryError if the file size exceeds the available memory.

Using a Generator Function#

To overcome this issue, we can use a generator function, file_reader_using_generator, which reads the file line by line, yielding each line as it goes. This approach is memory-efficient because it only needs to hold one line in memory at a time, not the entire file.

from typing import Generator


def file_reader_using_generator(file_path: str) -> Generator[str, None, None]:
    file = open(file_path, "r", encoding="utf-8")
    for row in file:
        yield row.rstrip("\n")


text_gen = file_reader_using_generator("./assets/sample.txt")
print(f"Is text_gen a generator? {inspect.isgenerator(text_gen)}")
print(f"Is text_gen an iterator? {isinstance(text_gen, Iterator)}")

row_count = 0
for row in text_gen:
    row_count += 1

print(f"Row count: {row_count}")
Is text_gen a generator? True
Is text_gen an iterator? True
Row count: 15

In file_reader_using_generator, we open the file and iterate over it line by line. For each line, we yield it: a value is handed back to the caller, but the function does not terminate. Instead, it is paused until the next value is requested. This allows us to read large files efficiently, even if they are larger than the available memory.

How does this work? At a high level, when we call file_reader_using_generator, it returns a generator object. This object is an iterator, so we can iterate over it using a for loop. When we do this, the function executes until the first yield statement, at which point it is paused and the yielded value is returned to the caller. When the next value is requested, the function resumes from where it was paused and runs until the next yield statement is encountered. This process continues until the function terminates.

Remark 7 (All Generators are Iterators)

Note that all generators are iterators, but not all iterators are generators.

To recap, the main reason why the generator function does not hold the entire file in memory is because it yields each line one by one, rather than returning a list of all lines at once.

  1. Lazy Evaluation: Generators are lazily evaluated. This means that they generate values on the fly as needed, rather than computing them all at once and storing them.

  2. Single Item in Memory: At any point in time, only the current row being yielded by the generator is held in memory. Once the consumer of the generator moves to the next item, the previous item can be garbage collected if no longer referenced, keeping the memory footprint low.

  3. Stateful Iteration: The generator function maintains its state between each yield. It knows where it left off (which line it yielded last) and resumes from that point the next time the next value is requested. This statefulness is managed without keeping the entire dataset in memory.
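The following minimal sketch illustrates all three points (a hypothetical noisy_counter generator, not from the original article, reusing the Generator import from earlier); its print statements expose exactly when the body runs and where it pauses:

def noisy_counter() -> Generator[int, None, None]:
    print("Starting")  # Runs only when the first value is requested.
    for i in range(3):
        print(f"About to yield {i}")
        yield i  # Execution pauses here; the local variable i is preserved.
        print(f"Resumed after yielding {i}")

gen = noisy_counter()  # No output yet: creating the generator runs no body code.
print(next(gen))  # Prints "Starting", "About to yield 0", then 0
print(next(gen))  # Prints "Resumed after yielding 0", "About to yield 1", then 1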

Next Method#

The __next__ method is a fundamental part of the iterator protocol in Python. It’s used to get the next value in an iteration.

When you use a for loop, or the next() function, Python internally calls the __next__ method of the iterator object. This method should return the next value for the iterable. When there are no more items to return, it should raise StopIteration.

In the context of generators, each call to the generator’s __next__ method resumes the generator function from where it left off and runs until the next yield statement, at which point it returns the yielded value and pauses execution.

Without the __next__ method, we wouldn’t be able to use Python’s built-in iteration mechanisms with our custom iterator or generator objects.
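To make the protocol concrete, here is a minimal hand-written iterator (a hypothetical Countdown class, for illustration only) that implements __iter__ and __next__ directly, without yield:

class Countdown:
    """A hand-rolled iterator that counts down from start to 1."""

    def __init__(self, start: int) -> None:
        self.current = start

    def __iter__(self) -> "Countdown":
        return self  # An iterator returns itself from __iter__.

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration  # Signal that the iteration is exhausted.
        value = self.current
        self.current -= 1
        return value

for n in Countdown(3):
    print(n)  # 3, 2, 1

Note that this class does by hand what a generator function gives us for free: the generator’s saved state between yields replaces the explicit self.current bookkeeping.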

So let’s see how the __next__ method works with our generator function.

try:
    first_row = text_gen.__next__()
    print(f"First row: {first_row}")
except StopIteration:
    print("StopIteration: No more rows")
StopIteration: No more rows

Oh what happened? We try to get the first row from the generator using the __next__ method, but it raises a StopIteration exception. This is because the generator has already been exhausted by the for loop earlier when we counted the number of rows. Unlike a list, a generator can only be iterated over once. Once it’s been exhausted, it can’t be iterated over again and will raise a StopIteration exception if you try to do so.

Let’s create a new generator and see how the __next__ method works.

text_gen = file_reader_using_generator("./assets/sample.txt")
first_row = text_gen.__next__()
print(f"First row: {first_row}")

second_row = text_gen.__next__()
print(f"Second row: {second_row}")
First row: hydra-core==1.3.2
Second row: matplotlib>=3.8.0

Generator Expression#

Similar to list comprehensions, you can also create a generator using a generator expression (also called a generator comprehension), which lets you create a generator without defining a function.

text_gen_comprehension = (row for row in open("./assets/sample.txt", "r", encoding="utf-8"))
print(f"Is text_gen_comprehension a generator? {inspect.isgenerator(text_gen_comprehension)}")
print(f"Is text_gen_comprehension an iterator? {isinstance(text_gen_comprehension, Iterator)}")
Is text_gen_comprehension a generator? True
Is text_gen_comprehension an iterator? True
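As a quick aside, a generator expression also lets us redo the earlier row count without ever materializing the file contents; a minimal sketch, assuming the same ./assets/sample.txt as before:

with open("./assets/sample.txt", "r", encoding="utf-8") as file:
    row_count = sum(1 for _ in file)  # Consumes the file lazily, one line at a time.

print(f"Row count: {row_count}")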

How Does a Generator Work?#

Generator functions are nearly indistinguishable from standard functions in appearance and behavior, with one key distinction: they use the yield keyword in place of return. Consider a generator function that yields the next integer indefinitely:

def infinite_sequence() -> Generator[int, None, None]:
    num = 0
    while True:
        yield num
        num += 1

This function might look familiar, but it’s the yield statement that sets it apart. yield serves to return a value to the caller without exiting the function.

What’s truly unique here is how the function’s state is preserved. Upon each subsequent call to next() on the generator object (whether done directly or through a loop), the function picks up right where it left off, incrementing and yielding num once more[1].
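Because the sequence never ends, we have to pull values explicitly or cap the iteration ourselves; for example, with next() or itertools.islice:

import itertools

gen = infinite_sequence()
print(next(gen))  # 0
print(next(gen))  # 1

# Take a bounded slice of the infinite stream instead of looping forever.
print(list(itertools.islice(infinite_sequence(), 5)))  # [0, 1, 2, 3, 4]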

Profiling Generator Performance#

Let’s compare the performance of a generator expression and a list comprehension.

Memory Efficiency#

Let’s create a list of squared numbers using a list comprehension and a generator expression, and compare their memory usage.

import sys

N = 100000

nums_squared_list_comprehension = [num ** 2 for num in range(N)]
print(f"Size of nums_squared_list_comprehension: {sys.getsizeof(nums_squared_list_comprehension)} bytes")

nums_squared_generator = (num ** 2 for num in range(N))
print(f"Size of nums_squared_generator: {sys.getsizeof(nums_squared_generator)} bytes")
Size of nums_squared_list_comprehension: 800984 bytes
Size of nums_squared_generator: 112 bytes
  • The size of nums_squared_list_comprehension is 800984 bytes.

  • The size of nums_squared_generator is 112 bytes.

The list comprehension (nums_squared_list_comprehension) creates a list of all squared numbers at once. This means it needs to allocate enough memory to hold all these numbers. This can be quite large for big sequences, as shown by the sys.getsizeof(nums_squared_list_comprehension) call.

On the other hand, the generator expression (nums_squared_generator) doesn’t compute all the squared numbers at once. Instead, it computes them one at a time, on-the-fly, as you iterate over the generator. This means it doesn’t need to allocate memory for the whole sequence, only for the current number. This is why sys.getsizeof(nums_squared_generator) returns a much smaller number.

This demonstrates the main advantage of generators when it comes to memory efficiency: they allow you to work with large sequences of data without needing to load the entire sequence into memory. This can be a significant advantage when working with large data sets, where loading the entire data set into memory might not be feasible.

Time Efficiency#

We know that creating a very large list in memory can take time and can even hang the system. However, if the list is much smaller than the system memory, it can be much faster to evaluate than the equivalent generator expression[1].

import cProfile

cProfile.run("sum([num ** 2 for num in range(N)])")
cProfile.run("sum(num ** 2 for num in range(N))")
         5 function calls in 0.037 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.036    0.036    0.036    0.036 <string>:1(<listcomp>)
        1    0.001    0.001    0.037    0.037 <string>:1(<module>)
        1    0.000    0.000    0.037    0.037 {built-in method builtins.exec}
        1    0.001    0.001    0.001    0.001 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


         100005 function calls in 0.044 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100001    0.037    0.000    0.037    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    0.044    0.044 <string>:1(<module>)
        1    0.000    0.000    0.044    0.044 {built-in method builtins.exec}
        1    0.008    0.008    0.044    0.044 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This shows that, for a list small enough to fit in memory, the list comprehension is faster than the generator expression.

Now, why are there 100005 function calls for N=100000?

  • 100001 calls are from the generator expression <string>:1(<genexpr>). The generator is resumed once for each of the 100000 numbers in range(N), hence 100000 calls. The extra 1 call is the final resumption that raises the StopIteration exception when the generator is exhausted.

  • The remaining 4 calls are from the other functions: <string>:1(<module>), {built-in method builtins.exec}, {built-in method builtins.sum}, and {method 'disable' of '_lsprof.Profiler' objects}. Each of these is called once, hence 4 calls.

So, in total, there are 100001 (from the generator expression) + 4 (from the other functions) = 100005 function calls.

Yield#

The yield statement is arguably what defines a generator function, so let’s take a closer look.

The yield Statement#

A generator function is a function that, when called, returns a generator iterator. This is achieved by including at least one yield statement in the function definition. Unlike a return statement, which terminates a function entirely and sends a value back to the caller, yield pauses the function, saving its state for continuation when next required.

When a generator function calls yield, the function execution is paused, and a value is sent to the caller. However, the function’s local variables and execution state are saved internally. The next time the generator is advanced (using the next() function or a for loop, for example), execution resumes from exactly where it was left off, immediately after the yield statement.

More concretely (and verbosely), upon encountering a yield statement, the function’s current state is preserved or “frozen”. This means that local variables, the instruction pointer, and even the state of the evaluation stack are saved. Consequently, when next() is called again, the function resumes precisely from where it left off, as if yield were a pause in execution rather than an interruption. This mechanism allows generator functions to produce a sequence of values over time, providing an efficient way to work with data streams or large datasets without requiring all data to be loaded into memory simultaneously[2].

Moreover, when a generator function is called, the actual arguments are bound to function-local formal argument names in the usual way, but no code in the body of the function is executed. Instead a generator-iterator object is returned; this conforms to the iterator protocol, so in particular can be used in for-loops in a natural way. Note that when the intent is clear from context, the unqualified name “generator” may be used to refer either to a generator-function or a generator-iterator[2].
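We can verify the last point directly: calling a generator function executes none of its body, while the first next() call does. A minimal sketch (a hypothetical greeter generator, for illustration, reusing the Generator import from earlier):

def greeter() -> Generator[str, None, None]:
    print("Body starts executing")  # Not printed at call time.
    yield "hello"

g = greeter()  # Nothing is printed: only a generator-iterator is created.
print(next(g))  # Prints "Body starts executing", then "hello"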

An Example#

Let’s look at a function count_up_to that simply yields the count up to a maximum number.

def count_up_to(max: int) -> Generator[int, None, None]:
    count = 1
    while count <= max:
        yield count
        count += 1


gen = count_up_to(3)
for number in gen:
    print(number)
1
2
3

First, note the type hints on the function. The return type of a generator function can be hinted as Generator[YieldType, SendType, ReturnType] from the typing module. Since this generator yields integers and does not receive or explicitly return a value, we use None for both the SendType and ReturnType.

Second, as we can see, it prints 1 to 3 incrementally.

  • When yield count is executed, the generator pauses and returns control to the caller,

  • The caller then resumes the generator, which continues execution from where it left off,

  • So count += 1 is executed, incrementing the counter,

  • Then the loop condition count <= max is checked before yielding again.

So the key point is that the generator resumes and continues execution from where it previously yielded each time next() or send() is called.

Adding a Return Statement#

The yield statement allows the generator to produce a series of values, while the return statement can be used to terminate the generator and, optionally, to provide a value that is accessible through the StopIteration exception raised when the generator is exhausted.

Let’s modify the function to return a message when the count exceeds the maximum:

from typing import Literal


def count_up_to(max: int) -> Generator[int, None, Literal["Completed!"]]:
    count = 1
    while count <= max:
        yield count
        count += 1
    return "Completed!"


gen = count_up_to(3)
try:
    while True:
        print(next(gen))
except StopIteration as err:
    completion_status = err.value
    print(completion_status)  # Output: Completed!
1
2
3
Completed!

Adding send#

The send() method of a generator is used to send a value back into the generator function. The value sent in is returned by the yield expression. This can be used to modify the internal state of the generator. Let’s adapt the function to use send() to optionally reset the count.

def count_up_to(max: int) -> Generator[int, int, Literal["Completed!"]]:
    count = 1
    while count <= max:
        received = yield count
        print(f"count: {count}, received: {received}")
        if received is not None:
            count = received
        else:
            count += 1
    return "Completed!"


gen = count_up_to(10)
print(gen.__next__())  # 1
print(gen.send(5))  # 5
for number in gen:
    print(number)  # Continues from 6 to 10
1
count: 1, received: 5
5
count: 5, received: None
6
count: 6, received: None
7
count: 7, received: None
8
count: 8, received: None
9
count: 9, received: None
10
count: 10, received: None

This example illustrates basic usage, including how to use send() to alter the internal state of the generator. After initializing the generator:

  • printing the first value with __next__(), which gives 1,

  • then send(5) is called; the paused received = yield count expression evaluates to 5, so received = 5 and count is subsequently set to 5 as well,

  • the generator resumes and hits the yield count statement again, yielding the current value of count, which is now 5, so print(gen.send(5)) prints 5,

  • the generator then continues, yielding values from 6 to 10 as it iterates through the remaining loop cycles, with each value being printed in the for loop.

Remark 8 (Yield is an expression and not a statement)

How did the received become None after send is done?

The key to understanding the behavior of the count_up_to generator, especially how received can be None, lies in how the generator is advanced: with the .send() method versus the .__next__() method (or its equivalent, next(gen)).

When you first call gen.__next__() or next(gen), the generator starts executing up to the first yield statement, yielding the value of count (which is 1). At this point, since you’re not using .send() to advance the generator but .__next__() instead, the value received by the yield expression is None. This is the default behavior when the generator is advanced without explicitly sending a value. The generator then proceeds to the if received is not None: check. Since received is None, the condition fails, and execution moves to the else: clause, incrementing count.

However, when you call gen.send(5), you’re explicitly sending a value (5) into the generator, which resumes execution right after the yield statement, with received now being 5. This means the if received is not None: condition succeeds, and the code inside that block executes, setting count to 5.

To clarify, here’s a step-by-step breakdown:

  1. Initial Call with .__next__():

    • The generator yields 1, and received is implicitly None because no value was sent into the generator. The else: clause is executed, incrementing count.

  2. Call with .send(5):

    • received is set to 5, so the if received is not None: condition is true and count is set to 5.

  3. Subsequent Iteration in the For Loop:

    • The for loop implicitly calls .__next__() on each iteration, not .send(), so no new value is sent into the generator. Therefore, received is None again for each iteration within the loop, and the generator simply increments count until it exceeds max.

This mechanism allows the generator to either accept new values from the outside via .send(value) or continue its own internal logic, incrementing count, when advanced with .__next__() or next(gen), where no external value is provided, and thus received is None.

What we see here is a coroutine, a generator function into which you can pass data[1].
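A classic illustration of such a coroutine is a running average (a hypothetical running_average generator, not from the original article), where each send() both feeds in a new number and receives the updated average:

def running_average() -> Generator[float, float, None]:
    total = 0.0
    count = 0
    average = 0.0
    while True:
        value = yield average  # Receive a number from send(), hand back the average.
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)  # Prime the coroutine: advance it to the first yield.
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0

Note the priming next(avg): a coroutine must be advanced to its first yield before send() can deliver a value, otherwise Python raises a TypeError.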

Adding throw and close#

Let’s extend our example to demonstrate how to use the .throw() and .close() methods with our generator function.

We’ll continue with the modified count_up_to function that allows for resetting the count via the send() method.

from typing import Union


def count_up_to(max: int) -> Generator[Union[int, str], int, Literal["Completed!"]]:
    count = 1
    while count <= max:
        try:
            received = yield count
            if received is not None:
                count = received
            else:
                count += 1
        except ValueError as err:
            print(f"Exception caught inside generator: {err}")
            count = max + 1  # Force the loop to end.
            yield "Exception processed"
    return "Completed!"

The .throw() method is used to throw exceptions from the calling scope into the generator. When a generator encounters an exception thrown into it, it can either handle the exception or let it propagate, terminating the generator.

gen = count_up_to(5)

print(next(gen))  # Starts the generator, prints 1

# Inject an exception into the generator. Note that .throw() returns the next
# value yielded in response ("Exception processed"), which we do not print here.
try:
    gen.throw(ValueError, "Something went wrong")
except StopIteration as err:
    print("Generator returned:", err.value)
1
Exception caught inside generator: Something went wrong

In this example, after starting the generator and advancing it to yield 1, we throw a ValueError into the generator using .throw(). A generator can either handle such an exception or let it propagate, which terminates the generator and surfaces the exception in the calling scope. Our function explicitly catches ValueError, so instead of terminating it prints a message, forces the loop to end, and yields "Exception processed" back to the caller (this yielded value is what the .throw() call returns).

The .close() method is used to stop a generator. Calling .close() raises a GeneratorExit exception at the yield expression where the generator function is paused. This can be used to perform any cleanup actions before the generator stops.

gen = count_up_to(10)

print(next(gen))  # Output: 1
print(next(gen))  # Output: 2

# Close the generator
gen.close()

# Trying to advance the generator after closing it will raise StopIteration
try:
    print(next(gen))
except StopIteration:
    print("Generator has been closed.")
1
2
Generator has been closed.

In this scenario, we start the generator, yield a couple of values, and then close it using .close(). Any attempt to advance the generator after closing it results in a StopIteration exception, indicating that the generator is exhausted.
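The cleanup hook mentioned above can be made explicit with try/finally. Here is a minimal sketch (a hypothetical line_streamer, reusing ./assets/sample.txt, not part of the count_up_to example) in which .close() triggers GeneratorExit at the paused yield and the finally block releases the file handle:

def line_streamer(file_path: str) -> Generator[str, None, None]:
    file = open(file_path, "r", encoding="utf-8")
    try:
        for line in file:
            yield line.rstrip("\n")
    finally:
        # Runs on normal exhaustion, on .close() (GeneratorExit), or on an error.
        file.close()
        print("File closed.")

stream = line_streamer("./assets/sample.txt")
print(next(stream))  # hydra-core==1.3.2
stream.close()  # Raises GeneratorExit at the paused yield; the finally block runs.

This pattern also fixes a subtle leak in file_reader_using_generator above, which never closes its file handle.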

DataLoaders, Streaming and Lazy Loading#

Deep learning models, particularly those trained on large datasets, benefit significantly from efficient data loading mechanisms. PyTorch, a popular deep learning framework, provides a powerful abstraction for this purpose through its DataLoader class, which under the hood can be understood as leveraging Python’s generator functionality for streaming data.

Generators and Streaming Data#

Generators in Python are a way to iterate over data without loading the entire dataset into memory. This is especially useful in deep learning, where datasets can be enormous. A generator-based DataLoader:

  • Lazily Loads Data: It loads data as needed, rather than all at once. This means that at any point, only a portion of the dataset is in memory, making it possible to work with datasets larger than the available system memory.

  • Supports Parallel Data Processing: PyTorch’s DataLoader can prefetch batches of data using multiple worker processes. This is akin to a generator yielding batches of data in parallel, improving efficiency by overlapping data loading with model training computations.

  • Enables Real-time Data Augmentation: Data augmentation (e.g., random transformations of images) can be applied on-the-fly as each batch is loaded. This dynamic generation of training samples from a base dataset keeps memory use low and variation high.

Here’s a simplified conceptual example of how a data loader might be implemented using a generator pattern in PyTorch:

from typing import Any, Sized, TypeVar

import torch
from torch.utils.data import DataLoader, Dataset

T_co = TypeVar("T_co", covariant=True)


class MyDataset(Dataset[T_co]):
    def __init__(self, data: Sized) -> None:
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        sample = self.data[index]
        return sample


data = torch.randn(8, 3)  # 8 samples, 3 features each
dataset = MyDataset(data)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
dataloader = iter(dataloader)

try:
    while True:
        _ = dataloader.__next__()
except StopIteration:
    print("StopIteration: No more data.")
StopIteration: No more data.

A Naive Implementation of DataLoader#

def simple_data_loader(
    dataset: Dataset[T_co], batch_size: int = 1
) -> Generator[List[T_co], None, None]:
    batch = []
    for idx in range(len(dataset)):
        batch.append(dataset[idx])
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield any remaining data as the last batch
    if batch:
        yield batch


data = list(range(100))  # Simulated dataset of 100 integers
dataset = MyDataset(data)

# Create and use the data loader
batch_size = 10
dataloader = simple_data_loader(dataset, batch_size=batch_size)

try:
    while True:
        print(dataloader.__next__())
except StopIteration:
    print("StopIteration: No more data.")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
StopIteration: No more data.
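Tying this back to the earlier file-reading examples, the same batching pattern works for streaming a large text file, never holding more than one batch in memory. A minimal sketch (a hypothetical batched_line_reader, assuming the typing imports from earlier and the same ./assets/sample.txt):

def batched_line_reader(
    file_path: str, batch_size: int = 4
) -> Generator[List[str], None, None]:
    batch: List[str] = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # Flush the final partial batch.
        yield batch

for batch in batched_line_reader("./assets/sample.txt", batch_size=4):
    print(batch)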

References and Further Readings#

  • [1] How to Use Generators and yield in Python - Real Python

  • [2] PEP 255 - Simple Generators