Anti-pattern: parallel collections

(Note that I’m not talking about "processing collections in parallel, which is definitely not an anti-pattern…)

I figured it was worth starting to blog about anti-patterns I see frequently on Stack Overflow. I realize that some or all of these patterns may be collected elsewhere, but it never hurts to express such things yourself… it’s a good way of internalizing information, aside from anything else. I don’t guarantee that my style of presenting these will stay consistent, but I’ll do what I can…

The anti-patterns themselves are likely to be somewhat language-agnostic, or at the very least common between Java and C#. I’m likely to post code samples in C#, but I don’t expect it to be much of a hindrance to anyone coming from a background in a similar language.

Context

You have related pieces of data about each of several items, and want to keep this data in memory. For example, you’re writing a game and have multiple players, each with a name, score and health.

Anti-pattern

Each kind of data is stored (all the names, all the scores, all the health values) in a separate collection. Typically I see this with arrays. Then each time you need to access related values, you need to make sure you’re using the same index for each collection.

So the code might look like this:

class Game 

    string[] names;
    int[] scores;
    int[] health;

    public Game() 
    { 
        // Initialize all the values appropriately
    } 

    public void PrintScores() 
    { 
        for (int i = 0; i < names.Length; i++) 
        { 
            Console.WriteLine("{0}: {1}", names[i], scores[i]); 
        } 
    } 
}

Preferred approach

The code above fails to represent an entity which seems pretty obvious when you look at the description of the data: a player. Whenever you find yourself describing pieces of data which are closely related, you should make sure you have some kind of representation of that in your code. (In some cases an anonymous type is okay, but often you’ll want a separate named class.)

Once you’ve got that type, you can use a single collection, which makes the code much cleaner to work with.

class Player
{
    public string Name { get; private set; }
    public int Score { get; set; }
    public int Health { get; set; }

    public Player(string name, int health)
    {
        Name = name;
        Health = health;
    }

    // Consider other player-specific operations
}

class Game
{
    List<Player> players;

    public Game()
    {
        // Initialize players appropriately
    }

    public void PrintScores()
    {
        foreach (var player in players)
        {
            Console.WriteLine("{0}: {1}", player.Name, player.Score);
        }
    }
}

Note how we can now use a foreach loop to iterate over our players, because we don’t care need to use the same index for two different collections.

Once you perform this sort of refactoring, you may well find that there are other operations within the Game class which would be better off in the Player class. For example, if you also had a Level property, and increasing that would automatically increase a player’s health and score, then it makes much more sense for that "level up" operation to be in Player than in Game. Without the Player concept, you’d have nowhere else to put the code, but once you’ve identified the relationship between the values, it becomes much simpler to work with.

It’s also much easier to modify a single collection than multiple ones. For example, if we wanted to add or remove a player, we now just need to make a change to a single collection, instead of making sure we perform the same operation to each "single value" collection in the original code. This may sound like a small deal, but it’s easy to make a mistake and miss out on one of the collections somewhere. Likewise if you need to add another related value – like the "level" value described above – it’s much easier to add that in one place than adding another collection and then making sure you do the right thing in every piece of code which changes any of the other collections.

Summary

Any time you find yourself with multiple collections sharing the same keys (whether those are simple list indexes or dictionary keys), think about whether you could have a single collection of a type which composes the values stored in each of the original collections. As well as making it easier to handle the collection data, you may find the new type allows you to encapsulate other operations more sensibly.

Update: what about performance?

As some readers have noted, this transformation can have an impact on performance. Originally, all the scores were kept close together in memory, all the health etc. If you perform a bulk operation on the scores (finding the average score, for example) that locality of reference can have a significant impact. In some cases that may be enough justification to use the parallel collections instead… but this should be a conscious decision, having weighed up the pros and cons and measured the performance impact. Even then, I’d be tempted to encapsulate that PlayerCollection in a separate type, allowing it to implement IEnumerable<Player> where useful. (If you wanted the Player to be mutable, you’d need it to be aware of PlayerCollection itself.)

In almost all these anti-patterns, there will be cases where they’re the lesser of two evils – but novice developers need to be aware of them as anti-patterns to start with. As ever with performance trade-offs, I believe in first deciding on concrete performance goals, then implementing the code in the simplest possible way that meets the non-performance goals, measuring against the performance goals, and tweaking if necessary to achieve them, relying heavily on measurement.