Generators can help you write reusable and scalable Python code, but complicated processing often requires using the same data multiple times. The issue with generators is that they exhaust themselves, meaning they can produce their results only once. Calling next() on an exhausted generator raises a StopIteration exception.
Even more troubling is that for loops, list(), tuple(), set(), and many other functions in Python expect a StopIteration exception to be raised when a passed-in generator is exhausted, so they catch it and treat it as the normal end of the data. As a result, you might find yourself in a situation where your code returns a wrong result instead of raising an error:
>>> gen = (x for x in range(5))
>>> sum(gen) * sum(gen) # instead of an exception, we get 0:
0
>>> list(gen) # an empty list here instead of an exception:
[]
We could, of course, use regular functions that return lists instead of generators, but they won't be up to the challenge of processing enormous amounts of data, because they store the entire output in memory.
Fortunately, Python has a couple of tricks up its sleeve. To keep things simple, I'll start with the easiest way to solve our problem.
SOLUTION I: COPYING THE RESULTS
Suppose you're doing research for a supermarket chain and you have raw .txt data with the check sums from one of its supermarkets for a certain period of time. It looks like this:
122
78
161
64
# ...
You first need to calculate the average check before getting down to more serious analysis. If you'd like to practice yourself, you can download the file I used here.
Now, let's write a very simple program that will read each line from the file, convert it to a float, and then calculate the average check sum by dividing the total sum of money that customers spent in the supermarket by the number of checks (i.e., customers):
from typing import Iterator

def average_check(gen: Iterator[float]) -> float:
    """Calculate the average check per supermarket."""
    data = list(gen)  # exhaust the generator once and keep a copy
    return sum(data) / len(data)  # 1

def read_checks(path: str) -> Iterator[float]:
    """Yield one check sum per line of the file."""
    with open(path) as file:
        for line in file:
            yield float(line)

if __name__ == '__main__':
    avg = average_check(read_checks('checks.txt'))
    print(f'The average check is: {avg}')

# The result:
# The average check is: 100.543487
# 1: As you can see, we need the data from the file twice (once for sum() and once for len()), which is why we exhaust the iterator on purpose and keep a copy of its results in a list. That means everything is stored in memory, which is exactly what we were trying to avoid in the first place. On bigger datasets this code will still blow up memory, but for smaller data the solution is perfectly acceptable.
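To get a feel for the memory cost of copying, the sketch below compares the size of a list with the size of an equivalent generator object. It assumes CPython, where sys.getsizeof() reports the shallow size of an object. For the list that covers only the array of references, so the real footprint is even larger. The value of n is arbitrary.

```python
import sys

n = 100_000  # arbitrary sample size for the comparison

as_list = [float(i) for i in range(n)]  # every value is held in memory
as_gen = (float(i) for i in range(n))   # values are produced on demand

list_size = sys.getsizeof(as_list)  # grows with n
gen_size = sys.getsizeof(as_gen)    # small and independent of n

print(f'list: {list_size} bytes, generator: {gen_size} bytes')
```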
SOLUTION II: USING LAMBDA
We can arrive at a more elegant solution to our problem:
from typing import Callable, Iterator

def average_check(gen: Callable[[], Iterator[float]]) -> float:
    """Calculate the average check per supermarket."""
    spent: float = sum(gen())  # spent in total
    checks: int = sum(1 for _ in gen())  # count the number of purchases
    avg: float = spent / checks  # an average check
    return avg

def read_checks(path: str) -> Iterator[float]:
    """Yield one check sum per line of the file."""
    with open(path) as file:
        for line in file:
            yield float(line)

if __name__ == '__main__':
    avg = average_check(lambda: read_checks('checks.txt'))  # 1
    print(f'The average check is: {avg}')

# The result:
# The average check is: 100.543487
# 1: The lambda keyword creates an anonymous function, so on this line average_check receives a function that returns a fresh generator each time it is called. We pass in the function without calling it! It is only called inside average_check (twice): once to compute the total sum of purchases and once to count the checks.
Notice how I used a generator expression, checks: int = sum(1 for _ in gen()), to count the checks instead of writing a for loop. Generator expressions behave like generator functions; only the syntax differs.
This code won't cause a memory error even if you need to process a huge number of checks for the whole supermarket chain, although it does read the file twice. While this solution requires less typing than the next one, it's still not the most elegant. Also, remember that the lambda adds an extra function call, which slows your code down a bit.
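If the lambda feels awkward, functools.partial from the standard library can build the same zero-argument factory. The sketch below is one possible variation, not part of the original solution. The temporary file and its three sample values are made up for the demonstration.

```python
import os
import tempfile
from functools import partial
from typing import Callable, Iterator


def read_checks(path: str) -> Iterator[float]:
    """Yield one check sum per line of the file."""
    with open(path) as file:
        for line in file:
            yield float(line)


def average_check(gen: Callable[[], Iterator[float]]) -> float:
    """Call the factory twice to get two independent passes over the data."""
    spent = sum(gen())
    checks = sum(1 for _ in gen())
    return spent / checks


# Hypothetical sample data written to a temporary file for the demo.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('100\n50\n150\n')
    sample_path = tmp.name

try:
    # partial() binds the path now; the file is opened only when
    # average_check calls the resulting object.
    make_reader = partial(read_checks, sample_path)
    avg = average_check(make_reader)
finally:
    os.remove(sample_path)

print(f'The average check is: {avg}')
```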
SOLUTION III: USING A CONTAINER CLASS
Python's iterator protocol and object-oriented programming offer a better solution.
As you may know, an iterable is an object that we can iterate over. In Python, an iterable must provide the __iter__() method, which creates (returns) an iterator object. The iterator, in turn, provides the __next__() method, which returns the next item and raises a StopIteration exception once the iterator is exhausted. Together, these two methods form the iterator protocol, and it's how Python's for loops and other expressions traverse iterables.
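The protocol can be spelled out by hand. The following sketch shows roughly what a for loop does behind the scenes, using the built-in iter() and next() functions, which call __iter__() and __next__() respectively. The sample list is arbitrary.

```python
numbers = [10, 20, 30]

iterator = iter(numbers)  # calls numbers.__iter__()
collected = []

while True:
    try:
        item = next(iterator)  # calls iterator.__next__()
    except StopIteration:
        break  # a for loop ends here silently
    collected.append(item)

print(collected)
```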
The easiest way to implement the iterator protocol in a special container class is to define the __iter__() method as a generator, like this:
from typing import Iterable, Iterator

def average_check(gen: Iterable[float]) -> float:
    """Calculate the average check per supermarket."""
    spent: float = sum(gen)  # spent in total
    checks: int = sum(1 for _ in gen)  # count the number of purchases
    avg: float = spent / checks  # an average check
    return avg

class ReadChecks:
    """Convert each line to a float value, and yield it."""
    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[float]:
        with open(self.path) as file:
            for line in file:
                yield float(line)

if __name__ == '__main__':
    it = ReadChecks('checks.txt')
    avg_check = average_check(it)
    print(f'The average check is: {avg_check}')
Our ReadChecks container works just fine when passed to the average_check function, without any lambda keywords or other modifications. Implementing a container class requires additional lines of code, but it provides a cleaner interface and skips the extra function call that the lambda introduces. Internally, each pass over the gen argument calls ReadChecks.__iter__(), which creates a separate generator.
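The claim that each pass gets its own generator can be checked directly. The sketch below uses LineChecks, a hypothetical in-memory stand-in for ReadChecks that wraps a list of text lines instead of a file, so it runs without any data on disk.

```python
from typing import Iterator, List


class LineChecks:
    """Hypothetical stand-in for ReadChecks: iterates over text lines in memory."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def __iter__(self) -> Iterator[float]:
        for line in self.lines:
            yield float(line)


checks = LineChecks(['100', '50', '150'])

gen_a = iter(checks)  # each iter() call runs __iter__() again
gen_b = iter(checks)

distinct = gen_a is not gen_b  # two separate generator objects
first_a = next(gen_a)          # both start from the first item,
first_b = next(gen_b)          # independently of each other

total = sum(checks)                # one full pass
count = sum(1 for _ in checks)     # another full pass, not exhausted

print(distinct, first_a, first_b, total, count)
```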
By the way, this code will happily accept any iterable as an argument. If you replace the final part with this code:
if __name__ == '__main__':
    checks = [100, 100, 100, 50, 50, 50]
    avg_check = average_check(checks)
    print(f'The average check is: {avg_check}')
You'll get:
The average check is: 75.0
The only problem is that if you pass in an iterator, it will be exhausted after sum(gen), which consumes it entirely, so you may get an error or a wrong result. Try running this:
checks = [100, 100, 100, 50, 50, 50]
avg_check = iter(checks)
print(f'The average check is: {average_check(avg_check)}')
You'll get a ZeroDivisionError, because sum(1 for _ in gen) will return 0 (remember, sum() prevents a StopIteration exception from propagating).
So, we need some error handling. To do that, we can rely on the following feature. When Python iterates over an object, it calls the built-in iter() function internally, and the iterator protocol states that if iter() gets an iterator as an argument, it returns that very same iterator:
>>> num = [1, 2, 3]
>>> it = iter(num)
>>> iter(num) is iter(num) # creates two different iterators
False
>>> iter(it) is iter(it) # iter(it) returns the same it iterator
True
In the first example, we get two different iterators from the same sequence. In the second one, we pass in an iterator as an argument, and the iter() function returns the same object both times.
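The standard library offers another way to draw the same line: the abstract base classes in collections.abc. The sketch below is an alternative check, not the one this post uses. For well-behaved objects both approaches should agree. The sample values are arbitrary.

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]
it = iter(numbers)
gen = (x for x in numbers)

# A list is an iterable container, not an iterator.
list_is_iterator = isinstance(numbers, Iterator)

# List iterators and generators are iterators (and therefore iterables too).
it_is_iterator = isinstance(it, Iterator)
gen_is_iterator = isinstance(gen, Iterator)

all_iterable = all(isinstance(obj, Iterable) for obj in (numbers, it, gen))

print(list_is_iterator, it_is_iterator, gen_is_iterator, all_iterable)
```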
So, let's finalize our code:
from typing import Iterable, Iterator

def average_check(gen: Iterable[float]) -> float:
    """Calculate the average check; reject one-shot iterators."""
    if iter(gen) is iter(gen):  # only an iterator returns itself from iter()
        raise TypeError('Cannot pass an iterator as an argument!')
    spent: float = sum(gen)  # spent in total
    checks: int = sum(1 for _ in gen)  # number of purchases
    avg: float = spent / checks  # an average check
    return avg

class ReadChecks:
    """Convert each line to a float value, and yield it."""
    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[float]:
        with open(self.path) as file:
            for line in file:
                yield float(line)

if __name__ == '__main__':
    it = ReadChecks('checks.txt')
    avg_check = average_check(it)
    print(f'The average check is: {avg_check}')

    checks = [100, 100, 100, 50, 50, 50]
    avg_check = average_check(checks)
    print(f'The average check is: {avg_check}')

    avg_check = iter(checks)
    print(f'The average check is: {average_check(avg_check)}')
The first two chunks will execute just fine, but in the third case our code will raise a TypeError exception to prevent returning a wrong result:
The average check is: 100.543487
The average check is: 75.0
# error feedback
TypeError: Cannot pass an iterator as an argument!
SPEED PERFORMANCE
One more thing to keep in mind: data structures like lists are usually more efficient for smaller inputs. Generators pay off when you need scalability and want to avoid running out of memory on large datasets, or when your code talks to a network and, instead of waiting for the whole input, can start processing and yielding results as soon as they arrive. Read more about it in my post about using generator expressions over large list comprehensions.
Equipped with this, you can calculate the average check for the whole world without your program crashing, if you so wish :) And do many other interesting things as well.
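The "start processing as soon as data arrives" behaviour can be sketched with a small generator pipeline. Here slow_source is a hypothetical stand-in for a network stream that never ends, and the produced list exists only to record how many values were actually requested from it.

```python
from itertools import islice
from typing import Iterable, Iterator, List

produced: List[int] = []  # records what the source was asked to generate


def slow_source() -> Iterator[int]:
    """Hypothetical stand-in for a network stream: an endless supply of values."""
    n = 0
    while True:
        produced.append(n)
        yield n
        n += 1


def doubled(values: Iterable[int]) -> Iterator[int]:
    """Process each value as soon as it arrives."""
    for value in values:
        yield value * 2


# Take only the first three results; the endless source is never fully read.
first_three = list(islice(doubled(slow_source()), 3))

print(first_three, len(produced))
```

A list-based version of slow_source could not work at all here, because it would have to finish producing values before returning anything.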