I’ve been having a discussion on the C# newsgroup about the best way of searching for multiple strings in a single string.
The question posed was how to find out whether any of the strings “something1”, “something2” and “something3” were contained within another string. The suggested response was to use regular expressions. Personally, I think this is using a sledgehammer to crack a nut – and a dangerous sledgehammer at that.
Now, I want to make it perfectly clear from the start that I have nothing against regular expressions – I think they’re very powerful, and can save a lot of very complex code when they’re used in the right place. I just don’t think this is the right place.
The simplest way (to my mind) of checking for the presence of multiple strings is to use String.IndexOf multiple times. It's very simple to follow and easy to modify. It seems a lot less risky to me than checking the regular expression “something1|something2|something3”. (Yes, in this case you could actually use “something[123]” but that’s not actually much easier to read, and I doubt that these are the real strings the original poster had in mind.)
The regular expression way will certainly work (and efficiency hasn’t been brought up as a significant issue – I don’t even know which way is faster), but I believe it’s harder to understand and maintain. Why? Because you have to know more information to understand it. Not only do you have to have that extra knowledge, but you have to apply it too. The vertical bars in that string have a special meaning – you have to spot that to start with. Then you have to do the splitting up mentally to see what’s actually being searched for (rather than seeing the separate items being searched for as completely separate strings).
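To put the two approaches side by side, here is a minimal sketch. The class and method names (MultiSearch, ContainsAnyIndexOf, ContainsAnyRegex) are made up for illustration, and I've assumed an ordinal (culture-insensitive) comparison is what's wanted, which the original question didn't specify.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, purely for illustration.
public static class MultiSearch
{
    // Plain approach: one IndexOf call per string being searched for.
    // StringComparison.Ordinal is an assumption; it avoids culture-sensitive matching.
    public static bool ContainsAnyIndexOf(string text)
    {
        return text.IndexOf("something1", StringComparison.Ordinal) != -1
            || text.IndexOf("something2", StringComparison.Ordinal) != -1
            || text.IndexOf("something3", StringComparison.Ordinal) != -1;
    }

    // Regex approach: the vertical bars mean "or", so the reader has to
    // split the pattern up mentally to see the three separate strings.
    public static bool ContainsAnyRegex(string text)
    {
        return Regex.IsMatch(text, "something1|something2|something3");
    }
}
```

Both methods give the same answer for these particular strings; the difference is in how much the reader has to know to see that.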
Reading that example isn’t too bad – it’s harder than just repeated IndexOf (and thus already not as good a solution in my view), but it’s not awful. However, think of maintenance. Suppose someone wanted to change it to look for “some+thing” – they’d have to know that “+” is a special character too, and how to escape it. Of course, they’d need to know how to escape “\” in C# when using IndexOf, but when using regular expressions you’d need to escape it from the point of view of both C# and the regular expressions.
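The maintenance trap is easy to demonstrate. This is only a sketch, with invented names (EscapingDemo and its methods), and the backslash search is my own choice of example for the double-escaping point. An unescaped “+” in a regular expression means “one or more of the preceding character”, so the naive pattern quietly matches the wrong text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, purely for illustration.
public static class EscapingDemo
{
    // IndexOf treats "+" as an ordinary character: nothing to escape.
    public static bool PlainFind(string text)
    {
        return text.IndexOf("some+thing", StringComparison.Ordinal) != -1;
    }

    // Naive regex: "e+" means one or more 'e', so this matches
    // "something" and "someeething" but NOT the literal "some+thing".
    public static bool NaiveRegexFind(string text)
    {
        return Regex.IsMatch(text, "some+thing");
    }

    // Correct regex: the "+" is escaped for the regex engine, and the
    // backslash doing that escaping is itself escaped for C#.
    public static bool EscapedRegexFind(string text)
    {
        return Regex.IsMatch(text, "some\\+thing");
    }

    // Searching for a single backslash: one level of escaping with IndexOf...
    public static bool PlainFindBackslash(string text)
    {
        return text.IndexOf("\\", StringComparison.Ordinal) != -1;
    }

    // ...but two levels with a regex (four backslashes in the source code).
    public static bool RegexFindBackslash(string text)
    {
        return Regex.IsMatch(text, "\\\\");
    }
}
```

Verbatim string literals (@"some\+thing") or Regex.Escape reduce the noise, but they are still extra things the maintainer has to know about.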
My contention is that there’s no point in putting this extra burden on the maintenance engineer (whether it happens to be the original author of the code or not) when there’s a simpler solution which doesn’t involve any special meanings for anything in the string beyond what’s mandated by using C# in the first place.
Unfortunately, this argument seems to be hard to accept…
This is far from the first time I’ve seen this urge to use regular expressions for no good reason, unfortunately. If it had been, I probably wouldn’t be blogging about it. (For a really big discussion about it, see this thread in comp.lang.java.programmer from 2002. Over 1600 posts, and that’s not including those from one of the major posters in the thread!)
A couple of years back there seemed to be an urge in the .NET community to use reflection to solve every problem under the sun in a similar kind of fashion to the urge to use regular expressions here. Like regular expressions, reflection can be very powerful when used carefully to solve problems which would otherwise have very complex solutions, if indeed any solutions at all – but it only needs to be used occasionally.
It would be really nice if all developers considered the people who might have to read their code (and what they might have to do to maintain it) before committing anything to source control. It’s not that I think people can’t understand regular expressions – I just think it’s extra work which can be avoided in many situations (like finding one string inside another, or replacing one substring with another). There’s no benefit to introducing the complexity, so why do it?