I had an interesting day at work today. I thought my code had broken… but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it’s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it’s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.
I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?
Let’s make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals collection contains only negative integers, so none of the removals will ever match. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world’s most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:
import java.util.*;

public class Test {
    public static void main(String[] args) {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);

        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();

        // Source: non-negative integers only
        for (int i = 0; i < sourceSize; i++) {
            source.add(i);
        }
        // Removals: negative integers only, so nothing will actually match
        for (int i = 1; i <= removalsSize; i++) {
            removals.add(-i);
        }

        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:
Time taken: 1ms
Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items¹ and 300,000 items to remove?
Time taken: 38ms
Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:
Time taken: 178131ms
Excuse me? Nearly three minutes? Yikes! Surely removing items from a 300,000-element set ought to be easier than removing the same number from the million-element set we managed in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:
This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.
Now that sounds reasonable on the surface of it – iterate through the smaller collection, check for presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn’t mean it’s going to be fast. In our case, the collections are the same size – but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N), while the cost of iterating is the same for either collection. Basically, by choosing to iterate over the HashSet and check for presence in the ArrayList, we’ve got an O(M * N) solution overall instead of an O(M) one (where M is the size of the removals collection and N the size of the source set). Ouch. The removeAll method is making an "optimization" based on assumptions which just aren’t valid in this case.
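To make the trap concrete, here’s roughly what that documented strategy looks like in code – a sketch of the behaviour AbstractSet’s documentation describes, not a copy of any particular JDK version’s source:

// Inside a class like AbstractSet<T> – a sketch, not the exact JDK code.
public boolean removeAll(Collection<?> c) {
    boolean modified = false;
    if (size() > c.size()) {
        // Removals collection is smaller: iterate it and call this set's
        // remove, which is O(1) per element for a HashSet.
        for (Iterator<?> i = c.iterator(); i.hasNext(); ) {
            modified |= remove(i.next());
        }
    } else {
        // This set is smaller (or the same size): iterate it and call
        // c.contains – which is an O(N) linear scan when c is an ArrayList.
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}

With a 300,000-element set and a 300,000-element list, we fall into the second branch: 300,000 iterations, each scanning up to 300,000 list elements.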
Then fix it, dear Henry, dear Henry, dear Henry
There are two simple ways of fixing the problem. The first is to simply change the concrete type of the removals collection: making it a HashSet<Integer> gets us back down to the 34ms range. We don’t even need to change the declared type of removals – only the object we construct.
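Assuming the declarations from the test above, the change is a single line:

// Declared type stays Collection<Integer>; only the implementation changes.
Collection<Integer> removals = new HashSet<Integer>();

Now the contains() probes inside removeAll hit a hash set and are O(1) whichever branch the implementation picks.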
The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that’s easy to do:
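// Iterate over removals ourselves and use the HashSet's O(1) remove directly.
for (Integer value : removals) {
    source.remove(value);
}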
In fact, on my machine that performs slightly better than removeAll – it doesn’t need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I’ve tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)
However, both of these approaches require comments in the source code to explain why we’re not using the most obvious code (a list and removeAll). I can’t complain about the documentation here – it says exactly what it will do. It’s just not obvious that you need to worry about it, until you run into such a problem.
So what should the implementation do? Arguably, it really needs to know what’s cheap in each of the collections it’s dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to the clean abstractions we like to assume when working with frameworks like the Java collections API… but maybe in this case it would be a good idea.
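As a thought experiment, here’s one hedged sketch of what such a performance-aware strategy might look like – a hypothetical helper, not anything the JDK actually does, using "is it a Set?" as a crude stand-in for "is contains() cheap?":

import java.util.*;

public class SmartRemoval {
    // Hypothetical: only iterate over the set and probe the other collection
    // when that collection's contains() is plausibly O(1).
    static <T> boolean smarterRemoveAll(Set<T> set, Collection<?> removals) {
        boolean modified = false;
        if (removals instanceof Set && set.size() <= removals.size()) {
            // The set is smaller and removals has (probably) fast contains():
            // iterating over the set is now a genuine optimization.
            for (Iterator<T> i = set.iterator(); i.hasNext(); ) {
                if (removals.contains(i.next())) {
                    i.remove();
                    modified = true;
                }
            }
        } else {
            // Default: iterate removals and rely on the set's remove, which
            // is O(1) per element for a HashSet – O(M) overall, never O(M * N).
            for (Object o : removals) {
                modified |= set.remove(o);
            }
        }
        return modified;
    }
}

There’s even a precedent for this kind of probing in the collections API already: the RandomAccess marker interface lets algorithms in Collections pick a strategy based on whether indexed access is cheap.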
¹ Please perform a Dr Evil impression while reading this. I’m watching you through your webcam, and I’ll be disappointed if I don’t see you put your little finger to your mouth.