Just how spiky is your traffic?

No, this isn’t the post about dynamic languages, I promise. That will come soon. This is just a quick interlude. This afternoon, while answering a question on Stack Overflow1 about the difference between using an array and a Dictionary<string, string> (where each string was actually the string representation of an integer), I posted the usual spiel about preferring readable code to micro-optimisation. The response in a comment – about the performance aspect – was:

Well that’s not so easily said for a .com where performance on a site that receives about 1 million hits a month relies on every little ounce of efficiency gains you can give it.

A million hits a month, eh? That sounds quite impressive, until you actually break it down. Let’s take a month of 30 days – that has 30 * 24 * 60 * 60 = 2,592,000 seconds2. In other words, a million hits a month is less than one hit every two seconds. Not so impressive. At Google we tend to measure traffic in QPS (queries per second, even if they’re not really queries – the search terminology becomes pervasive) so this is around 0.39 QPS. Astonished that someone would make such a claim in favour of micro-optimisation at that traffic level, I tweeted about it. Several of the replies were along the lines of "yeah, but traffic’s not evenly distributed." That’s entirely true. Let’s see how high we can make the traffic without going absurd though.
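
In code form, that back-of-the-envelope sum is nothing more than this (a trivial sketch, just restating the figures above):

    // A million hits spread evenly over a 30-day month.
    using System;

    class EvenlySpreadTraffic
    {
        static void Main()
        {
            const double hitsPerMonth = 1000000;
            const double secondsPerMonth = 30 * 24 * 60 * 60; // 2,592,000

            // Roughly 0.39 QPS, i.e. less than one hit every two seconds.
            Console.WriteLine($"{hitsPerMonth / secondsPerMonth:F2} QPS");
        }
    }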

Let’s suppose this is a site which is only relevant on weekdays – that cuts us down to 20 days in the month. Now let’s suppose it’s only relevant for one hour per day – it’s something people look at when they get to work, and most of the users are in one time zone. That’s a pretty massive way of spiking. We’ve gone down from 30 full days of traffic to 20 hours – or 20 * 60 * 60 = 72000 seconds, giving 14 QPS. Heck, let’s say the peak of the spike is double that – a whopping 28 QPS.
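
The same sketch, adjusted for the deliberately spiky scenario:

    // All the traffic lands in one hour on each of 20 weekdays,
    // and the peak is assumed to be double the average for that hour.
    using System;

    class SpikyTraffic
    {
        static void Main()
        {
            const double hitsPerMonth = 1000000;
            const double busySeconds = 20 * 60 * 60; // 72,000

            double averageQps = hitsPerMonth / busySeconds; // ~14
            Console.WriteLine($"Average over the busy hours: {averageQps:F0} QPS");
            Console.WriteLine($"Assumed peak: {averageQps * 2:F0} QPS"); // ~28
        }
    }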

Four points about this:

  • 28 QPS is still not a huge amount of traffic.
  • If you’re really interested in handling peak traffic of ~28 QPS without latency becoming huge, it’s worth quoting that figure rather than "a million hits a month" because the latter is somewhat irrelevant, and causes us to make wild (and probably wildly inaccurate) guesses about your load distribution.
  • If you’re going to bring the phrase "a .com" into the picture, attempting to make it sound particularly important, you really shouldn’t be thinking about hosting your web site on one server – so the QPS gets diluted again.
  • Even at 28 QPS, the sort of difference that would be made here is tiny. A quick microbenchmark (with all the associated caveats) showed that on my laptop (hardly a server-class machine) I could build the dictionary and index into it three times, 2.8 million times over, in about 5 seconds. If every request needed to do that 100 times, then doing it for 28 requests per second on my laptop would still only cost 0.5% of each second – not a really significant benefit, despite the hugely exaggerated estimates of how often we needed to do that. (A rough sketch of that benchmark is below.)

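For the curious, here is a minimal sketch of that kind of microbenchmark. The dictionary contents here are an assumption (the original question used string representations of integers), and it comes with all the usual microbenchmarking caveats:

    // Build a small Dictionary<string, string> and index into it three times,
    // repeated 2.8 million times. Timings will obviously vary by machine.
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class DictionaryMicrobenchmark
    {
        static void Main()
        {
            const int iterations = 2800000;
            string last = null;

            var stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                var map = new Dictionary<string, string>
                {
                    { "1", "10" },
                    { "2", "20" },
                    { "3", "30" }
                };
                last = map["1"];
                last = map["2"];
                last = map["3"];
            }
            stopwatch.Stop();

            // Print the last lookup so the work can't be optimised away entirely.
            Console.WriteLine(last);
            Console.WriteLine($"{iterations:N0} iterations in {stopwatch.Elapsed.TotalSeconds:F1}s");
        }
    }
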
There are various other ways in which it’s not a great piece of code, but the charge against premature optimisation still stands. You don’t need to get every little ounce of efficiency out of your code. Chances are, if you start guessing at where you can gain efficiency, you’re going to be wrong. Measure, measure, measure – profile, profile, profile. Once you’ve done all of that and proved that a change reducing clarity has a significant benefit, go for it – but until then, write the most readable code you can. Likewise, work out your performance goals in a meaningful fashion before you worry too much – and hits per month isn’t a meaningful figure.

Performance is important – too important to be guessed about instead of measured.


1 I’m not linking to it because the Streisand effect would render this question more important than it really is. I’m sure you can find it if you really want to, but that’s not the point of the post.

2 Anyone who wants to nitpick and talk about months which are a bit longer or shorter than that due to daylight saving time changes (despite still being 30 days) can implement that logic for me in Noda Time.

13 thoughts on “Just how spiky is your traffic?”

  1. Speaking of Rico Mariani, I’m reminded of the infamous death-match on the true cost of .NET exceptions [1]. Working on a 40–50M page/month job board, we frequently observe CPU spikes as exceptions/second climbs – the actual numbers are less important, but the cost is definitely noticeable at these volumes.

    [1] –
    http://blogs.msdn.com/ricom/archive/2006/09/14/754661.aspx
    http://blogs.msdn.com/ricom/archive/2006/09/25/771142.aspx
    http://stackoverflow.com/questions/161942/how-slow-are-net-exceptions
    http://pobox.com/~skeet/csharp/exceptions.html
    http://www.yoda.arachsys.com/csharp/exceptions2.html

  2. @Nariman: Is that cost due to the exceptions, or the other way round though? Or could they come from the same cause? (Obviously it’s not going to help matters if CPU pressure is causing errors, and those errors then make the spike even worse.)

    I’d be interested to hear more about the actual numbers involved if you’re able to share them. I wouldn’t expect to see significant CPU load added by exceptions unless you had a metric shedload of them.
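
    (If anyone wants to put a rough number on that, a minimal sketch along these lines measures raw throw/catch throughput. The loop count and exception type are arbitrary, and all the usual microbenchmark caveats apply.)

        // Measure how many exceptions per second can be thrown and caught
        // in a tight loop on one core. Purely illustrative.
        using System;
        using System.Diagnostics;

        class ExceptionThroughput
        {
            static void Main()
            {
                const int count = 100000;
                var stopwatch = Stopwatch.StartNew();

                for (int i = 0; i < count; i++)
                {
                    try
                    {
                        throw new InvalidOperationException("boom");
                    }
                    catch (InvalidOperationException)
                    {
                        // Swallow: we only care about the throw/catch overhead here.
                    }
                }

                stopwatch.Stop();
                double perSecond = count / stopwatch.Elapsed.TotalSeconds;
                Console.WriteLine($"{count:N0} exceptions in {stopwatch.Elapsed.TotalSeconds:F2}s ({perSecond:N0}/s)");
            }
        }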

  3. Our application handles about a million queries a night (it services devices during off-peak hours). We don’t micro-optimize. We’ve always coded first, tested (sometimes trial by fire), and then optimized – and even then, never micro-optimized. Code readability and maintainability are key in a system that handles a lot of traffic. When something stops working, it’s never because you used a dictionary instead of an array, and you always need to be able to troubleshoot the problem quickly.

    I hear about and have been party to crazy “optimization” schemes that people dream up. My favorite one, and I’ve seen it/heard about it twice now, is adding a layer of WCF indirection between the presentation/business/domain layers of an application. (The best part: both of these systems were expected to see no more than 100 users an hour in worst-case peak usage scenarios.)

  4. I’m about to toddle off to a vendor to tweak their software to be faster.
    We’re talking 20K+ ‘QPS’ though, so it’s a little bit different.

    Pretty sure, based on profiling, that needless allocations on the hot path are the key to it.

  5. @ShuggyCoUk: Yes, that does sound like a bit of a different story :)

    (How many boxes does that 20K QPS go through though? As a colleague of mine has on a T-shirt: “If brute force isn’t working for you, you’re not using enough of it” :)

  6. @Jon – does a query (as in QPS) count as just a single page request, while a hit includes all page resources?

    1m hits a month might include images and JavaScript files, in which case .NET is being used even less than once every two seconds.

    10,000 page views a month is a big site in my world :[

  7. @Chris S: It depends on the service, really :) Of course the scale of things also depends on what you’re *doing* in each request, which is basically a more general way of saying the same thing: not all requests are equal.

  8. Hi Jon, what do you think of ~6 mln page-views/day? That is, based on your calculations, something around 69.5 QPS, I think. Oh, and everything running on 2 servers, 1 db-server and 1 web-server… and yeah, please leave aside that 1 server per job is stupid, I know, because it would not be fault tolerant – but that’s a different story…

  9. @engioi: At that point I’d be looking at improving performance by scaling out instead of micro-optimising. Beyond that, I’m not sure what you’re asking…

  10. @skeet: yeah, scaling horizontally, for sure! I’ve been asking for more servers since the dawn of time, but you know how it is to get some bucks out of customers ;) … Anyway, mine wasn’t really a “technical question” – more a request for an opinion on our own peak performance. 6 mln page-views/day on ASP.NET on a single web server seems pretty good to me, but I haven’t had the opportunity to compare notes with many others who have reached peaks like that on ASP.NET, so I was simply looking for an opinion. Anyway, thanks!

  11. @engioi: It really depends on what it’s doing. A “current time” web service should be able to handle that easily; something doing a lot more complicated work could easily struggle with 70 QPS.

  12. @skeet: well, it’s a thematic portal in 25 different editions (languages), serving typical news, media and the like, plus live updates of events in semi-realtime (around an update every couple of seconds), with each event localized in each of the 25 languages; on top of that it serves customized feeds, a REST API for some internal data (for external partners) and a bunch of other stuff. Anyway, I understand that it may be difficult to figure out – thanks nonetheless ;) !
