Overuse of regular expressions

I’ve been having a discussion on the C# newsgroup about the best way of searching for multiple strings in a single string.

The question posed was how to find out whether any of the strings “something1”, “something2” and “something3” were contained within another string. The suggested response was to use regular expressions. Personally, I think this is using a sledgehammer to crack a nut – and a dangerous sledgehammer at that.

Now, I want to make it perfectly clear from the start that I have nothing against regular expressions – I think they’re very powerful, and can save a lot of very complex code when they’re used in the right place. I just don’t think this is the right place.

The simplest way (to my mind) of checking for the presence of multiple strings is to use String.IndexOf multiple times. Very simple to follow and easy to modify. It seems a lot less risky to me than checking the regular expression “something1|something2|something3”. (Yes, in this case you could actually use “something[123]” but that’s not actually much easier to read, and I doubt that these are the real strings the original poster had in mind.)

The regular expression way will certainly work (and efficiency hasn’t been brought up as a significant issue – I don’t even know which way is faster), but I believe it’s harder to understand and maintain. Why? Because you have to know more information to understand it. Not only do you have to have that extra knowledge, but you have to apply it too. The vertical bars in that string have a special meaning – you have to spot that to start with. Then you have to do the splitting up mentally to see what’s actually being searched for (rather than seeing the separate items being searched for as completely separate strings).
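To make the comparison concrete, here is a minimal sketch of both approaches side by side. The class and method names are made up for illustration, and passing StringComparison.Ordinal to IndexOf is my assumption (the newsgroup question didn't say which kind of comparison was wanted).

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchDemo
{
    // Repeated IndexOf: each search term is an ordinary string literal,
    // with no characters that carry special meaning.
    public static bool ContainsAnyWithIndexOf(string text)
    {
        return text.IndexOf("something1", StringComparison.Ordinal) >= 0
            || text.IndexOf("something2", StringComparison.Ordinal) >= 0
            || text.IndexOf("something3", StringComparison.Ordinal) >= 0;
    }

    // Regular expression: the reader has to know that '|' means alternation
    // and mentally split the pattern into its three search terms.
    public static bool ContainsAnyWithRegex(string text)
    {
        return Regex.IsMatch(text, "something1|something2|something3");
    }

    public static void Main()
    {
        string sample = "prefix something2 suffix";
        Console.WriteLine(ContainsAnyWithIndexOf(sample)); // True
        Console.WriteLine(ContainsAnyWithRegex(sample));   // True
    }
}
```

Both methods give the same answer for these particular strings; the difference is in how much the reader needs to know to see that.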

Reading that example isn’t too bad – it’s harder than just repeated IndexOf (and thus already not as good a solution in my view), but it’s not awful. However, think of maintenance. Suppose someone wanted to change it to look for “some+thing” – they’d have to know that “+” is a special character too, and how to escape it. Of course, they’d need to know how to escape “\” in C# when using IndexOf, but when using regular expressions you’d need to escape it from the point of view of both C# and the regular expressions.
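Here is a small sketch of that escaping problem. The sample path and the method names are invented purely for illustration; Regex.IsMatch and Regex.Escape are the standard System.Text.RegularExpressions calls.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Hypothetical sample text containing both a '+' and backslashes.
    public const string Sample = @"C:\temp\some+thing.txt";

    // IndexOf: '+' is just a character, so only C# escaping applies.
    public static bool IndexOfFindsPlus(string text)
    {
        return text.IndexOf("some+thing", StringComparison.Ordinal) >= 0;
    }

    // Regex, unescaped: "e+" means "one or more 'e'", so this pattern matches
    // "something" or "someething" but not the literal text "some+thing".
    public static bool NaiveRegexFindsPlus(string text)
    {
        return Regex.IsMatch(text, "some+thing");
    }

    // Regex, escaped: the '+' is escaped for the regex with a backslash,
    // and that backslash is escaped again for the C# string literal.
    public static bool EscapedRegexFindsPlus(string text)
    {
        return Regex.IsMatch(text, "some\\+thing");
    }

    // A literal backslash needs one level of escaping with IndexOf ("\\temp")...
    public static bool IndexOfFindsBackslash(string text)
    {
        return text.IndexOf("\\temp", StringComparison.Ordinal) >= 0;
    }

    // ...and two levels with a regex ("\\\\temp" in a regular string literal).
    public static bool RegexFindsBackslash(string text)
    {
        return Regex.IsMatch(text, "\\\\temp");
    }

    public static void Main()
    {
        Console.WriteLine(IndexOfFindsPlus(Sample));      // True
        Console.WriteLine(NaiveRegexFindsPlus(Sample));   // False
        Console.WriteLine(EscapedRegexFindsPlus(Sample)); // True
        Console.WriteLine(IndexOfFindsBackslash(Sample)); // True
        Console.WriteLine(RegexFindsBackslash(Sample));   // True
    }
}
```

Regex.Escape("some+thing") would produce the escaped pattern for you, and a verbatim string literal (@"some\+thing") removes the C# layer of escaping, but either way it is one more thing the maintainer has to know about.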

My contention is that there’s no point in putting this extra burden on the maintenance engineer (whether it happens to be the original author of the code or not) when there’s a simpler solution which doesn’t involve any special meanings for anything in the string beyond what’s mandated by using C# in the first place.

Unfortunately, this argument seems to be hard to accept…

This is far from the first time I’ve seen this urge to use regular expressions for no good reason, unfortunately. If it had been, I probably wouldn’t be blogging about it. (For a really big discussion about it, see this thread in comp.lang.java.programmer from 2002. Over 1600 posts, and that’s not including those from one of the major posters in the thread!)

A couple of years back there seemed to be an urge in the .NET community to use reflection to solve every problem under the sun in a similar kind of fashion to the urge to use regular expressions here. Like regular expressions, reflection can be very powerful when used carefully to solve problems which would otherwise have very complex solutions, if indeed any solutions at all – but it only needs to be used occasionally.

It would be really nice if all developers considered the people who might have to read their code (and what they might have to do to maintain it) before committing anything to source control. It’s not that I think people can’t understand regular expressions – I just think it’s extra work which can be avoided in many situations (like finding one string inside another, or replacing one substring with another). There’s no benefit to introducing the complexity, so why do it?

6 thoughts on “Overuse of regular expressions”

  1. Hi Jon, nice to see you blogging.

    I think regular expressions must be slower than IndexOf because of all the compilation work that gets done when we use regexes. Recently I had to write a little code to remove all the tags of the form <somethinghere/> from a given string. I wrote a pretty quick loop using IndexOf, copied the filtered bytes into a byte[] and returned a string. I also thought of writing it with a regex, but I had no time to think it through and I’m not so good at regexes.

  2. Regular expressions in .NET can be faster or slower than IndexOf – it depends on whether you ask them to be compiled or not, whether the string is found and where, and how long the strings you’re searching for and searching in are.

    For me, performance is rarely a significant issue for this kind of thing though – the readability of the code is *far* more important.

    (I haven’t tried benchmarking the regular expressions provided by Sun’s JRE against the .NET ones. I think I may have done a while ago, but chances are those results aren’t valid any more. It would be mildly interesting to know the results, but it would rarely affect my use of them.)

    Jon

  3. Regular expression speed: depends wildly on the regex library. The traditional approach of compiling the regex into a deterministic state machine makes it very fast indeed per mile of string searched: one table lookup per character no matter how complex the search expression. On the other hand, the compilation phase is rather slow (in fact potentially exponential, although not usually that bad in practice). But my understanding is that modern regex libraries don’t generally use this approach, because it’s inflexible when it comes to things like sub-match retrieval, back-references and other advanced features. Instead they’d probably use a non-deterministic state machine or even some sort of backtracking, so in a simple case like the one you cite IndexOf might perfectly well be faster.

    I’m inclined to agree that regexes are overused. I get most annoyed when I see them appearing willy-nilly in user interfaces: when a perfectly ordinary "Search" box turns out to be expecting a regex and didn’t bother saying so. It’s particularly bad because regexes don’t all have the same syntax – some consider parentheses to be special characters which you have to escape if you want them to be literal, and others consider them to be literals which you have to escape if you want them to be special.

  4. I know that everyone on our team has RegEx Buddy installed and will therefore be able to cope with analysing each other’s RegExs. What if you wanted to extend the original match to also match “something4”, “something5” and “something6”? Would you just extend it to 6 IndexOf calls, or would there be some cut-off point at which you felt the RegEx solution was better?

  5. Beyond about 3, I’d just encapsulate it in a method which took a collection of strings to check for the presence of, and called Contains once for each of them.

    That would be entirely readable in itself, reusable, and simpler than a regular expression.

    It would be slower in some cases, but I’d only address that when I knew it would produce a *significant* difference in speed.

    Jon
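The helper described in the last comment could look something like the sketch below. The names StringSearch and ContainsAny are hypothetical (nothing in the discussion names the method), and the null checks are my own addition.

```csharp
using System;
using System.Collections.Generic;

public static class StringSearch
{
    // Hypothetical helper: returns true if 'text' contains at least one of
    // the candidate strings, calling Contains once for each candidate.
    public static bool ContainsAny(string text, IEnumerable<string> candidates)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (candidates == null) throw new ArgumentNullException(nameof(candidates));

        foreach (string candidate in candidates)
        {
            if (text.Contains(candidate))
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        string[] terms =
        {
            "something1", "something2", "something3",
            "something4", "something5", "something6"
        };

        Console.WriteLine(ContainsAny("xx something5 yy", terms)); // True
        Console.WriteLine(ContainsAny("nothing here", terms));     // False
    }
}
```

Adding a seventh search term means adding one more string to the collection, with no pattern syntax to get wrong.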
