Written by Emaan Hariri and Sayan Paul
Background
In many contexts, keeping all of one’s data in memory is impractical. Let’s say we were implementing a filter function that keeps all the elements in a list that meet a certain predicate. Note that here, and for the rest of this article, we’ll be using Python 3 with type hints.
```python
from typing import Any, Callable, List

def filter(lst: List[Any], predicate: Callable) -> List[Any]:
    filtered = []
    for elem in lst:
        if predicate(elem):
            filtered.append(elem)
    return filtered
```
At first glance, this seems perfectly reasonable. One could make some modest performance improvements by using a list comprehension, or by writing specialized versions of the function for different predicates, thereby reducing the overall number of function calls. Ignoring those details, this implementation is asymptotically optimal.
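For reference, the list-comprehension variant is a one-line body with identical behavior and asymptotics:

```python
def filter(lst: List[Any], predicate: Callable) -> List[Any]:
    return [elem for elem in lst if predicate(elem)]
```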
However, the main issue here is implicit in the arguments of the function itself (at least based on the type hints). Since we are passing in an entire list, we need to have it entirely in memory before we can call the function. And inside the function itself, we’re constructing another list! If we have on the order of millions or even billions of items in this list, then our system would need to have a massive amount of memory to be able to use this function.
A streaming implementation of this function would work with (as you might guess) two streams: one for input and one for output. Let’s assume that we’ve defined InputStream and OutputStream classes in our file, and that we can iterate over the InputStream and put items into the OutputStream.
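As a minimal sketch of what these classes might look like (this particular design is our assumption, not a fixed API; in practice the source and sink would wrap files, sockets, or database cursors):

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")

class InputStream(Generic[T]):
    """Yields elements one at a time from an underlying data source."""

    def __init__(self, source: Iterable[T]):
        self._source = source

    def __iter__(self) -> Iterator[T]:
        # Only the current element needs to live in memory.
        yield from iter(self._source)

class OutputStream(Generic[T]):
    """Hands each element off to a sink as soon as it arrives."""

    def __init__(self, sink: Callable[[T], None]):
        self._sink = sink

    def put(self, elem: T) -> None:
        self._sink(elem)
```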
```python
def filter(inpt: InputStream[Any], filtered: OutputStream[Any], predicate: Callable):
    for elem in inpt:
        if predicate(elem):
            filtered.put(elem)
```
This looks almost identical to our previous implementation; however, the majority of the work is done in the InputStream and OutputStream classes. The InputStream class reads from some sort of data source, and therefore needs to keep track of where it is in that source and how to fetch the next element on request. Similarly, the OutputStream keeps track of how and where the data should be written out for storage, so that each element is properly persisted at some point after put is called.
The crucial point here is that we haven’t sacrificed any runtime since our implementation is still linear in the size of the list, but our memory usage has gone from linear to constant in the size of the list. That being said, we’re lucky that this algorithm could be easily translated to a streaming algorithm. Specifically, the processing of one element of the array has no effect on the others in a filter operation, so we did not have to keep track of any kind of state. The core difficulty in streaming algorithms is managing state (and, consequently, memory usage) while also being correct or, at the very least, sufficiently accurate.
Before we enter our discussion of specific types of streaming algorithms, we note that a streaming algorithm generally has three parts, all operating in sublinear space (usually logarithmic or constant in the size of the stream):
- Initialization
- Process incoming element
- Output (if queried)
We will explore this further in the following case studies.
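As a sketch, this three-part structure maps naturally onto a class with a constructor, a per-element method, and a query method (the names here are our own convention, not a standard interface):

```python
from typing import Any

class StreamingAlgorithm:
    def __init__(self):
        # Initialization: allocate a small, fixed amount of state.
        pass

    def process(self, elem: Any) -> None:
        # Process incoming element: update the state, never buffering the stream.
        pass

    def output(self) -> Any:
        # Output (if queried): answer using only the stored state.
        pass
```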
Popular Case Study: Boyer-Moore Majority Vote
We start by analyzing one of the more popular and extremely applicable streaming algorithms in widespread use: the Boyer-Moore majority vote algorithm. Named after its creators Robert S. Boyer and J Strother Moore (whose given first name is simply the letter “J”), the majority vote algorithm, as it is often called, came to fruition in the early 1980s, a time when limited data storage made space-efficient algorithms a necessity, a need satisfied very naturally by streaming algorithms.
The majority vote algorithm, as its name suggests, detects and returns the element in a sequence of items (which in many cases, including ours, might be integers) that occurs a majority of the time. Note that the majority element must occur in over half of the sequence, while the plurality element is simply the one that appears most often (i.e. the mode). Another important caveat is that the algorithm is only guaranteed to return the majority element if a majority actually exists. If there isn’t one, the output is just some element of the sequence. The algorithm operates as follows (we present it as a typical Python function):
```python
def majority_vote(s: InputStream) -> Any:
    majority = None
    count = 0
    for elem in s:
        if count == 0:
            majority = elem
            count = 1
        elif majority == elem:
            count += 1
        else:
            count -= 1
    return majority
```
We now formulate this in a more streaming-friendly format:
```python
class MajorityVote:
    def __init__(self):
        self.majority = None
        self.count = 0

    def process(self, elem: Any):
        if self.count == 0:
            self.majority = elem
            self.count = 1
        elif self.majority == elem:
            self.count += 1
        else:
            self.count -= 1

    def output(self):
        return self.majority
```
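For example, here is the class run on a stream where 3 holds a majority, and then on one where no majority exists:

```python
mv = MajorityVote()
for elem in [3, 1, 3, 2, 3, 3, 1, 3]:  # 3 appears in 5 of 8 positions
    mv.process(elem)
print(mv.output())  # 3

mv = MajorityVote()
for elem in [1, 2, 3]:  # no element is a majority
    mv.process(elem)
print(mv.output())  # 3, which is *not* a majority
```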
Note that in this context, we don’t actually know whether the element we found is the majority. That is to say, the majority vote algorithm (as implemented above) only guarantees a correct answer if the sequence has a majority element. Naturally, one way to check is to take a second pass over the sequence and count how often the candidate appears; if it isn’t the majority, then the sequence contains no majority at all.
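If the data source can be replayed (an assumption; many streams cannot be), this check is itself a constant-space second pass:

```python
def verify_majority(s: InputStream, candidate: Any) -> bool:
    # Count the candidate's occurrences and the stream length in O(1) space.
    count = 0
    length = 0
    for elem in s:
        length += 1
        if elem == candidate:
            count += 1
    return count * 2 > length
```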
Here we hit a roadblock, which we will discuss further in the following sections, that highlights a fundamental difference between streaming and non-streaming algorithms. We just said we could check the sequence to see whether the candidate is the majority, but remember, we can’t check the sequence: it has already been streamed past us! In fact, it is impossible to determine in a single pass whether a majority exists using sublinear space, and sublinear space is exactly what we strive for when designing streaming algorithms.
Tradeoffs with Streaming: Heavy Hitters
What if we want to extend the above algorithm: rather than finding the majority element of the array, what if we want to find the most frequent elements (i.e. the heavy hitters)? Formally, given n, the size of the stream, and some value ɸ (0 < ɸ < 1), we define the heavy hitters to be all of the elements in the stream that appear at least ɸn times.
It turns out that this is impossible to do within appropriately low memory constraints, so we’ll consider a relaxation of this problem where:
- We’ll definitely return all of the elements that appear at least ɸn times.
- In addition to this, we’ll return some elements that don’t meet the first criterion.
The Misra-Gries summary solves this relaxed problem and is a simple modification of the Boyer-Moore majority vote from above. Rather than keeping a single count for the majority candidate, we keep a count for each of the possible heavy hitters.
```python
from math import ceil
from typing import Any, Dict

def misra_gries(s: InputStream, phi: float) -> Dict[Any, int]:
    most_frequent = {}
    max_num_heavy_hitters = ceil(1 / phi)
    for item in s:
        if item in most_frequent:
            most_frequent[item] += 1
        elif len(most_frequent) < max_num_heavy_hitters - 1:
            most_frequent[item] = 1
        else:
            # Decrement every counter; evict any counter that reaches zero.
            # (Iterate over a copy of the keys so we can delete while looping.)
            for h in list(most_frequent):
                most_frequent[h] -= 1
                if most_frequent[h] == 0:
                    del most_frequent[h]
    return most_frequent
```
The first thing we have to convince ourselves of is that we can have at most 1/ɸ heavy hitters. This is because a heavy hitter is defined as appearing at least ɸn times, where n is the size of the stream. If we have m heavy hitters, then we have at least ɸmn elements in our stream. Suppose m > 1/ɸ. That would mean that we have more than n elements in our stream, which is a contradiction. So, there are at most 1/ɸ heavy hitters.
Once we’ve convinced ourselves of the above, the rest of the algorithm is essentially the same as the majority vote algorithm. We keep a dictionary of counts for up to max_num_heavy_hitters candidate elements, and we boot elements from this dictionary once their count hits 0. Note that if s is big enough, i.e. contains more than max_num_heavy_hitters distinct elements, some of the elements we return may not actually be heavy hitters, so we immediately see the tradeoff inherent in this algorithm. Incidentally, we could make a short modification to have it return candidates for the k most frequent elements, where we just substitute every instance of max_num_heavy_hitters with k.
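To see the tradeoff concretely, here is a small run (reusing the hypothetical InputStream sketch from earlier):

```python
# n = 5 and phi = 1/3, so a heavy hitter must appear at least 5/3 times,
# i.e. at least twice. Only 1 qualifies, but 2 survives as a false positive.
result = misra_gries(InputStream([1, 1, 1, 1, 2]), phi=1 / 3)
print(result)  # {1: 4, 2: 1}
```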
This is all pretty technically dense, but the core idea we want to demonstrate is that the limitations inherent in the streaming paradigm mean we occasionally have to concede some accuracy so that our algorithm uses an appropriate amount of memory.
Conclusion
We have seen in the examples above how streaming algorithms operate, and how they may not behave as expected (i.e. we rely on relaxed or probabilistic guarantees instead of exact deterministic ones). Streaming algorithms find practical application in numerous process-intensive contexts. We see them in a plethora of fields and applications including networking, I/O, databases, machine learning, and more. In a modern context where data seems infinite, the call for space-efficient algorithms has been answered by the ever-developing domain of streaming algorithms.