Some evenings I like to sit at home, make myself a tea, and start working on random problems that pop into my head or that I find through various contexts. I force myself NOT to look on stackoverflow or wikipedia for a copy-paste solution because I enjoy bashing my head against something that I don’t know how to solve.

A few days ago I started thinking about the following problem.

Let’s say we have a list of intervals, each interval corresponding to a ID. Something like this:

ID 0 : [start0, end0], [start1, end1], [start2, end2]

ID 1 : [start3, end3], [start4, end4]

Given a query made up of one or more IDs, how can we return the intervals that contain ALL the IDs specified in the query.

We need to return only the overlapping parts of the intervals.

Now this can be separated into two problems:

1) Finding the intervals for an ID. This was the easy part, just created a dictionary with a ID key and a value being a list of intervals.

In python this would look like this: { 1 : [0, 1], [1, 2], etc.] }

2) Intersecting lists of intervals and extracting the overlapping ones.

The second part was at the time a fascinating problem to me, since I could not figure out a non-stupid way of doing this intersection.

First let’s observe that

Now I just needed to solve the problem of intersecting two lists of intervals.

The obvious way to solve this would be to do a pass on both lists and check all of the intervals against each other.

But there has to be a better way!

The first rule of optimization is: Don’t do unnecessary work and things will go faster. But how could I eliminate work ?

First I needed to quantify work, I just used incremented steps variable so I could somehow quantify work by just counting the number of intervals touched. That is not shown here since I simplified the code for this article in order to make it more understandable. It just involves incrementing a *steps* variable.

Did I really need to check all of a against all of b? I thought, what if I sort the intervals? Can that help?

Well if both interval lists are sorted then if a[i].start > b[last].end then a[i+1]…a[n]. would be beyond b[last].end anyway and we can cut that work.

This is because b[last].end represents the largest possible number included in an interval in b.

The awesome hand-made picture bellow would help visualize it.

As we can see after we reach a3, a3.start will be bigger than b3.end so no matter how many elements are in a right now, there is no need to check them against the b array. This started starting to feel fun.

Here is the solution that first sorts the list of intervals based on their start:

def intersect_lists2(a, b): answer = [] for a_item in a: for b_item in b: first = a_item second = b_item if first.start > b[-1].end: #b[-1] = last element from b break if first.start > second.start: first, second = second, first if second.start < first.end: answer.append(interval(second.start, min(first.end, second.end))) return answer

Now that we have sorted intervals, binary trees came into my mind, but how would binary trees for intervals work?

I started doodling some trees with intervals as nodes on paper.

At first I tried using two binary trees, one for the start of the interval and one for the end of the intervals, but I could not see how I could exploit that.

After some experimentation I started to focus more on a binary tree with the start of an interval used as a key and see if I could make it work with that.

As a reminder the rules for a binary tree are the following:

For each value in a node N:

– The left subtree contains value less than N

– The right subtree contains values greater than N.

How would the tree look for the following data?

1: [23, 42], [100, 146], [50, 60], [19, 200]

2: [90, 150], [90, 110]

Well it would look like this:

This looks interesting already. But how would it work?

Let’s take the first element of 2, [90, 150], and intersect it against our tree.

At each step I intersect the interval with the current node. But I had nothing that stopped me checking a whole tree. An intersection could be in any of the two children of a node. It is no better than the solution, and it adds the cost of creating the tree in the first place.

From the x.start > b[last].end check mentioned before, I got the idea of saving in each node node the maximum value that can be reached on the subtree that has that node as a root.

The tree would look like this:

Using that information, I had enough information to skip whole sections of the tree using the rules below:

1. Intersect with the current node

2. If there is a left subtree and it’s maximum is bigger than our start, intersect with it.

3. If there is a right subtree and it’s start is less than our end, intersect with it.

This is the python code that does the things described above:

def interval_tree_intersect(node, x, intersection): if not node: return if node.max < x.start: return if node.interval.start > x.end: return # intersection here tmp = intersect_intervals(node.interval, x) if tmp: intersection.append(tmp) # move on to the child nodes interval_tree_intersect(node.left, x, intersection) interval_tree_intersect(node.right, x, intersection)

Now that I felt that I reached a good enough solution, I lifted my googling ban, and it turns out that this structure already existed and was described a long time ago.

According to Wikipedia this is an Augmented tree. Another explanation can be found here: https://en.wikipedia.org/wiki/Interval_tree.Scroll down to the Augmented tree section.

This was a really nice problem and I hope that people would try this method more often instead of just google searching for a solution :).

It took longer to solve, but it also has a gratifying feeling to it.