There’s a hole in my abstraction, dear Liza, dear Liza

I had an interesting day at work today. I thought my code had broken… but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it’s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it’s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

Let’s make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
        {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++)
        {
            removals.add(-i);
        }
        
        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end – start) + "ms");
    }
}

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:UsersJonTest>java Test 100 100
Time taken: 1ms

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items1 and 300,000 items to remove?

c:UsersJonTest>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:UsersJonTest>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.

Now that sounds reasonable on the surface of it – iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn’t mean it’s going to be fast. In our case, the collections are the same size – but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)… whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we’ve got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren’t valid in this case.

Then fix it, dear Henry, dear Henry, dear Henry

There are two simple ways of fixing the problem. The first is to simply change the type of the collection we’re removing from. Simply changing ArrayList<Integer> to HashSet<Integer> gets us back down to the 34ms range. We don’t even need to change the declared type of removals.

The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that’s easy to do:

for (Integer value : removals)
{
    source.remove(value);
}

In fact, on my machine that performs slightly better than removeAll – it doesn’t need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I’ve tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)

However, both of these approaches require comments in the source code to explain why we’re not using the most obvious code (a list and removeAll). I can’t complain about the documentation here – it says exactly what it will do. It’s just not obvious that you need to worry about it, until you run into such a problem.

So what should the implementation do? Arguably, it really needs to know what’s cheap in each of the collections it’s dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to clean abstraction we like to consider with frameworks like Java collections… but maybe in this case it would be a good idea.


1 Please perform Dr Evil impression while reading this. I’m watching you through your webcam, and I’ll be disappointed if I don’t see you put your little finger to your mouth.

19 thoughts on “There’s a hole in my abstraction, dear Liza, dear Liza”

  1. Sometimes it can be worth breaking an abstraction, if only internally.

    For example, Guava’s com.google.common.collect.ImmutableSet has an (internal) isHashCodeFast() method that is used to work out whether it’s worth using a set’s (presumably precomputed) hashCode as a quick way to determine whether two sets are different.

    Like

  2. Surprising; java.util uses a RandomAccess marker interface to enable collection processors to choose how to interact with arbitrary collections; surprised they didn’t just throw in a marker interface to help them decide how to determine whether members are randomly accessible by value, also.

    Reminds me of the situation with LINQ’s Count() method which hardcodes an optimal implementation for ICollection (but forgot ICollection in the first release, iirc), and otherwise falls back to counting on its fingers.

    There’s a design tension in creating collection frameworks between wanting to implement common algorithms -outside- the basic storage types, and the problems caused when an inappropriate algorithm for the underlying store is applied.

    I’d suggest neither Java.Util nor the BCL have got that balance spot on.

    Like

  3. Maybe it is just me, but this is (one of many) and example that something wrong is with Java and C# standard collections.

    ArrayList

    I don’t do Java, but I simply cannot translate it. Is it an array? Is it a list?

    Is it a rocket science to name a list List, an array Array, a set Set, and so on? Each time I write List in my C# code I feel my mind is raped ;-)

    Like

  4. @macias: It’s an implementation of the List interface, backed by an array. Just like a HashSet is a Set backed by a hash table.

    I’m not quite sure what you’re objecting to about List in C#… care to elaborate? Were you expecting it to be a *linked* list?

    Like

  5. Jon, I am surprised you are surprised :-) List in C# is an array really, not a list. List in comp.theory has well established meaning, and no one, even MS, should not use this term for quite opposite meaning. List does not behave like list so the name is wrong.

    And yes, I am expecting that List would be a list, in terms of C# LinkedList, because it is, well, a list (I mean: name).

    Some designer could of course make even worse names, but basic principle is: say what you mean, do what you say. This holds true for user interface, data processing, method names of your private class and for standard library too.

    ArrayList is the worst — you cannot have at the same something that behaves like array and like a list, you can say dequeue from C++ does that, but it is neither this nor that too.

    Like

  6. @macias: I’ll have to dig out my copy of Rivest et al, but the Wikipedia entry for “list” (http://en.wikipedia.org/wiki/List_(computing)) certainly gives a certain amount of ambiguity around the topic – allowing that the term is used for various different data structures.

    Perhaps the definition you happened to learn from CompSci isn’t the *only* one used in CompSci?

    Like

  7. Jon, I dug up my copy :-) Cormen: list is one of the dynamic structures, that uses links. Wirth: one of the simplest dynamic structure is a list, the simplest one is single-linked. Knuth is way too elaborate to quote him, but he also talks about list as a list :-)

    Should I check more? :-))

    C# naming is the first time I saw somebody (designers) how calls an array, a list. Of course C# does not have to use “theoretical” comp.sci. conventions, but:
    a) it would be useful
    b) even if not, what the name ArrayList says? BlackWhite, DryWet?

    Like

  8. @macias: Firstly, let’s note that this isn’t anything to do with C# – it’s .NET. C# doesn’t know or care about lists.

    Both your points are explained by the simple need for an abstraction of “a collection we can iterate over, ask the size of, and access by index.” That sounds like a list in the real world to me… and given that a linked list is most clearly described by *calling* it a linked list, I see no significant problem with appropriating the word.

    At that point, it’s obvious what an ArrayList is – it’s an implementation of List (Java) or IList (.NET) which is backed by an array. The name says the implementation and the abstraction.

    I don’t know if any other platforms used “list” this way before Java, but that’s where I first came across it – and I think it’s reasonable for .NET to have followed suit, personally.

    I’m rather glad we haven’t chosen the compsci names for everything, to be honest – “fold” and “cons” for example? Ick.

    Like

  9. This is once case where .NET got it right. The ExceptWith method of the HashSet class is:

    public void ExceptWith(IEnumerable other)
    {
    if (other == null)
    {
    throw new ArgumentNullException(“other”);
    }
    if (this.m_count != 0)
    {
    if (other == this)
    {
    this.Clear();
    }
    else
    {
    foreach (T local in other)
    {
    this.Remove(local);
    }
    }
    }
    }

    Like

  10. Nice article Jon.

    Interestingly, I’ve seen this same problem before in a third-party implementation of the C++ STL, and other collection libraries. In all instances where I’ve seen this, the library is trying to provide an additional level of optimization by performing some heuristic probing of the collections involved. Unfortunately, the heuristics employed are often naïve and a “bad idea” in many cases.

    In the case of Java, the heuristic makes a number of poor assumptions:
    1. The cost of retrieving the count of items in the collection of items to remove is inexpensive.
    2. The cost of locating an instance of a value in either collection is roughly the same.
    3. The type system provides sufficient information to make reasonable choices about how to optimize the performance of the operation.
    All of these assumptions can be invalidated in different scenarios.

    It’s entirely possible to pass a Collection implementation to removeAll() that computes its size using a query or expensive operation. In fact, computing the size of a collection may often be as (or more) expensive than just iterating over it. Assuming that retrieving the size of a collection is cheap is dangerous.

    Assuming that the cost of performing a lookup is the same in either collection is a bad idea. While it’s true for pairs of collection of the same nature (List/List, Hashset/Hashset, etc.) it’s generally better to perform fewer lookup operations, when the collections are different (as in your example) this assumption can fail badly.

    The author of removeAll() assumes that the callers wants the operate to perform as efficiently as possible. Generally, of course, this is true. Unfortunately, there’s not nearly enough information expressed in the type system to adequately decide which operations are cheap rather than expensive. Computing the size, iterating items, and performing lookups have costs that may vary depending on the specific runtime characteristics of the types involved. Naïve optimizations can sometimes do more harm than good. If the type system had interfaces like ConstantTimeLookup and EfficientlyCountable as interfaces, then it could perhaps it would make some sense use this information in structuring overloads (or perhaps runtime checks) that select different algorithms for different cases. I am not necessarily advocating for such interfaces (I actually think they can be more harmful than beneficial) – I’m just pointing out that encoding performance characteristics in your type system would be necessary to allow for effective algorithm selection.

    Now the fact that the library documentation tells us the algorithm that the removeAll() operation will perform may seem like a helpful tidbit – but unfortunately, it exposes an implementation detail of the operation that (in theory) could change in the future or between implementations of the JVM. Developers who use this information to make design decisions could well run into problems in the future or on different platforms.

    Writing flexible, well-performing, and type-safe libraries for collection is not an easy job. For the most part Java and .NET both do a decent job, but to date I have been more impressed with .NET’s attention to detail and sensitivity to variation. In fact, many of the latest capabilities like LINQ-to-objects are beneficiaries of good design choices in the core library interfaces. While I would love to someday see LINQ-like features in Java, I think there are too many obstacles in the type system, libraries, and JVM (no lambdas, for instance) for an elegant implementation to emerge any time soon.

    As an aside, I’ve often thought that it would make for an interesting project to allow classes, interfaces and their methods/properties to be decorated with annotations that describe their (expected) performance characteristics. This metadata could then be used by compilers and other tools to warn developers when the composition of certain methods could result in unexpected performance problems. That would be an incredibly powerful feature …

    Liked by 1 person

  11. Jon,

    > Both your points are explained by the simple need for
    > an abstraction of “a collection we can iterate over,
    > ask the size of, and access by index.” That sounds
    > like a list in the real world to me…

    Sounds like a lot of things — array, set, list. And I think you meant implementation, because abstraction of an array (for example) is a collection we can iterate…

    Of course, it may be a matter of personal taste, but I am really surprised that calling an array a list is OK for you :-) Nothing negative here, just surprise, that’s all.

    C++ names are more consistent — list for a list, vector for an array (the weakest name), set for a set, and so on.

    > I’m rather glad we haven’t chosen the compsci names
    > for everything, to be honest – “fold” and “cons”
    > for example? Ick.

    :-)) Lisp is not comp.sci (the same way, as Casio calculator is not math).

    Like

  12. @macias: No, I definitely meant abstraction. That’s the whole point – if I want to say, “So long as I can iterate over the collection, access it by index, and get the size, then my algorithm will work” then I want to work with that abstraction rather than relying on one particular implementation. You can give me a plain .NET array (fixed size), a List (variable size, backed by an array), a Collection etc.

    As for whether that sounds like other things in the real world: how often do you hear the word “array” used outside a computing context? And a “set” is a different abstraction again, as it doesn’t have an implied ordering (which a list in Java/.NET terminology does, by index).

    Like

  13. The problem of optimizing this method is certainly an interesting one. Having some background in C++ It seems that the std library designers solved this sort of thing by providing partial specializations choose the correct optimizations.

    Like

  14. The performance aspect is not the only problem. Consider what will happen if Collection A is a TreeSet using String.CASE_INSENSITIVE_ORDER as Comparator and Collection B is a HashSet. Then, A.removeAll(B) or B.removeAll(A) can have surprising semantic differences regarding the result, depending on the sizes of A and B. That’s even worse than an unexpected performance. You have to consider the mixing of A and B semantically wrong, but unfortunately, it may do the desired thing a million times, in all tests, etc and then, suddenly fail in one environment.

    Like

Leave a comment