One of the first data structures discussed in Sedgewick's Algorithms is the Bag. A Bag is a data structure, which holds an unordered collection. We say unordered because it is in contrast to ordered collections like conventional arrays. Because we are not concerned with order, we can store the elements in any way we want.
This article will discuss three different implementations of Bag, each using a different internal data structure. By using different internal data structures, a clearer picture is created on the value of these interfaces, and can also help identify time complexity trade-offs.
The Bag Interface
The image below describes a Bag interface in Java:
It describes five operations:
- The constructor
Bag()
takes zero arguments - The method
add(item)
inserts an item into the Bag. - The method
isEmpty()
tells us if the Bag is empty. - The method
size()
tells us the size of the Bag. - The interface
Iterable<Item>
in Java allows the use of thefor .. in ..
loop.
Implementing Iterable
in Python
Unlike Java, Python does not have a formal concept of a typed interface. However, it does have facilities to allow an object to be iterable, much like Java, using the __iter__
and _next__
. The following is an example of how these methods work in Python:
class DoubleYourNumber:
def __init__(self, start=0, stop=10):
self.start = start
self.stop = stop
def __iter__(self):
self.current = self.start
return self
def __next__(self):
if self.current <= self.stop:
value = self.current * 2
self.current += 1
return value
else:
raise StopIteration
The above is an example of a class that supports the Python Iterator interface. Below is an example of how this class could be used to iterate:
>>> for value in DoubleYourNumber(stop=5):
... print(value)
...
0
2
4
6
8
10
You can get the same effect by using iter
and next
directly on the iterator object:
>>> doubler = DoubleYourNumber(stop=5)
>>> iter_doubler = iter(doubler)
>>> next(iter_doubler)
0
>>> next(iter_doubler)
2
>>> next(iter_doubler)
4
>>> next(iter_doubler)
6
>>> next(iter_doubler)
8
>>> next(iter_doubler)
10
>>> next(iter_doubler)
Traceback (most recent call last):
...
StopIteration
In a way, you can think of iterators as following this while
loop pattern:
iter_object = iter(object)
while True:
try:
value = next(iter_object)
# Do something with `value`
except StopIteration:
break # Leave the infinite while loop
We'll be using this pattern in our implementations of Bag.
Unit Testing a Bag
Since we don't have a formal interface, we can instead rely on unit testing to confirm the implementation is useful for our needs. Below is a set of tests that we can use to verify our implementation.
import unittest
from bag_implementation import Bag
class BagUnitTest(unittest.TestCase):
def test_new_bag_is_empty(self):
"""
A Bag should initially be empty.
"""
bag = Bag()
self.assertEqual(bag.is_empty(), True)
def test_add_increases_size_by_one(self):
"""
Calling bag.add(...) should increase it's size by one.
"""
bag = Bag()
bag.add(1)
self.assertEqual(bag.size(), 1)
def test_bag_can_be_iterated_over(self):
"""
Bag uses the iterable Python interface.
"""
bag = Bag()
for i in [1,2,3]:
bag.add(i)
sum = 0
for i in bag:
sum += i
self.assertEqual(sum, 6)
Creating a Bag Using the Built-In list
The first implementation of Bag is mostly a wrapper around list
. Since list
is an ordered collection, if we drop the ordered constraint list
can be used as an unordered collection. Below is a possible implementation using list
:
class BagWithList(object):
def __init__(self):
self.collection = list()
def add(self, item):
self.collection.append(item)
def size(self):
return len(self.collection)
def is_empty(self):
return len(self.collection) == 0
def __iter__(self):
return iter(self.collection)
The official Python wiki has a page on the time complexity of some of the built-in data structures. In particular, the list
operations has the following big-O in the average case:
Version | Operation | Big-O |
---|---|---|
list |
Append | O(1) |
list |
Get Length | O(1) |
list |
Iteration | O(n) |
Also this data structure uses O(n) space.
Using a Linked List
By default Python does not come with a version of a linked list as a native data structure. Causing us to write our own below, with a custom iterator.
class LinkedListNode(object):
def __init__(self, value):
self.value = value
self.next_node = None
def add(self, value):
if self.next_node == None:
self.next_node = LinkedListNode(value)
else:
self.next_node.add(value)
def size(self):
if self.next_node == None:
return 1
else:
return 1 + self.next_node.size()
def __iter__(self):
return LinkedListIter(self)
class LinkedListIter(object):
def __init__(self, current_node):
self.current_node = current_node
def __next__(self):
if self.current_node == None:
raise StopIteration
else:
value = self.current_node.value
self.current_node = self.current_node.next_node
return value
We could use this implementation of a Linked List in our implementation of a Bag.
class BagWithLinkedList(object):
def __init__(self):
self.linked_list = None
def add(self, value):
if self.linked_list == None:
self.linked_list = LinkedListNode(value)
else:
self.linked_list.add(value)
def size(self):
if self.linked_list == None:
return 0
else:
return self.linked_list.size()
def is_empty(self):
return self.linked_list == None
def __iter__(self):
if self.linked_list == None:
return iter([])
else:
return iter(self.linked_list)
The time complexity of these methods are not as attractive as the previous. There are other use cases for linked lists that show how to use it effectively. Despite this, it still uses O(n) space complexity, as the list
version did.
Version | Operation | Big-O |
---|---|---|
Linked List | Append | O(n) |
Linked List | Get Length | O(n) |
Linked List | Iteration | O(n) |
Using defaultdict
from collections
as a Multiset
A set is a commonly known mathematical object. A set is only concerned with membership, or whether an element belongs to collection or not. Like bags, sets are unordered. However, unlike bags, they only allow an element to appear once. If we relax this constraint, and keep track of the number of times an element appears, then we have a multiset.
Multisets appear in C++ as a built-in data structure. However, Python does not have anything named as a multiset. Instead, there is collections.Counter
and collections.defaultdict
. The former keeps track of both counts and the insertion order. We only need to keep track of counts, so we will use defaultdict
with a default value of 0
by using defaultdict(int)
.
from collections import defaultdict
class Multiset(object):
def __init__(self):
self.counts = defaultdict(int)
self.total = 0
def add(self, item):
self.counts[item] += 1
self.total += 1
def size(self):
return self.total
class MultisetIterator(object):
def __init__(self, multiset):
self.multiset = multiset
self.keys = list(multiset.counts.keys()) # Choose a fixed ordering
self.key_index = 0 # Current `keys` index
self.key_repeat = 0 # Number of times we've repeated the current key
def current_key(self):
return self.keys[self.key_index]
def current_count(self):
return self.multiset.counts[self.current_key()]
def __next__(self):
key = self.current_key()
count = self.current_count()
if self.key_repeat == count:
if self.key_index + 1 < len(self.keys):
self.key_index += 1
self.key_repeat = 0
else:
raise StopIteration
self.key_repeat += 1
return self.current_key()
Next, we use our Multiset
in the implementation of Bag
.
class BagWithMultiset(object):
def __init__(self):
self.multiset = Multiset()
def add(self, item):
self.multiset.add(item)
def size(self):
return self.multiset.size()
def is_empty(self):
return self.multiset.size() == 0
def __iter__(self):
return MultisetIterator(self.multiset)
This version of Bag might be the most performant, given the time and space complexity.
Version | Operation | Big-O Average Case | Big-O Worst Case |
---|---|---|---|
Multiset | Append | O(1) | O(n) |
Multiset | Get Length | O(1) | O(1) |
Multiset | Iteration | O(n) | O(n) |
These results are almost on par with the list
implementation. Perhaps the worst case for appending, is worth revisiting. However, the space complexity is not O(n), but in fact O(k) where k is the number of unique elements in the collection. If there are many repeated values, this might be more advantageous than the two other versions of Bag.
Conclusion
We covered the basics of Python iterators, and considered three different versions of Bag
: one using list
, one using our hand-rolled linked list implementation, and finally a version of a multiset using Python's defaultdict
. This data structure is not very common, and does not have many advantages to ordered arrays. Nevertheless, because of its simplicity, we can use the Bag data structure as an introduction to algorithms.
With that.. happy algorithming, friends!
Top comments (0)