Buffering vs streaming benchmark: first results

My poor laptop’s had a busy weekend. It’s run 72 tests, rebooting between each test. Most of these tests have kept both the CPU and disk busy for a lot of the time. I expect to update this blog post with more numbers – and possibly more strategies – as time goes on, but I wanted to get the numbers out as quickly as possible. Graphs should be coming when I’ve got a decent network to experiment with Google graphing.

Source code and setup

While I don’t really expect anyone else to sacrifice their computer for the best part of 24 hours, it doesn’t take very much effort to get the tests running. This zip file contains the source code for three programs, a batch file to generate the test data, and a command file.

The CreateFiles executable creates input files based on its command line arguments. (This is run multiple times by the batch file.) EncryptFiles “encrypts” (I use the term loosely) all the files in a given directory, writing the results to another directory. The class design of EncryptFiles is a little bit funny, but it’s really tailored for these tests. It should be easy to write other strategies in the same way too.

ExecuteAndReboot is used to automate the testing. Basically it reads a command file (from a hard-coded directory; simplicity trumps flexibility here). If the file isn’t empty, it waits five minutes (so that after a reboot the computer has time to settle down after loading all its services) and runs the command specified in the first line of the file, appending the output to a results file. When the command has finished executing, ExecuteAndReboot rewrites the command file, removing the first line, and schedules a reboot. The idea is to drop ExecuteAndReboot into a Startup folder, set Windows to log in automatically, reboot once and then just let it go. If you want to use the computer, just wait until it’s in the “waiting five minutes” bit at the start of the cycle and kill ExecuteAndReboot. Next time you reboot it will start testing again.
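
Just to make the cycle concrete, here’s roughly what that loop looks like in C#. This is a from-memory sketch rather than the real source – the file locations and the exact shutdown invocation are my guesses:

using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

class ExecuteAndReboot
{
    // Hard-coded locations, as described above (the actual paths are guesses)
    const string CommandFile = @"c:\benchmark\commands.txt";
    const string ResultsFile = @"c:\benchmark\results.txt";

    static void Main()
    {
        string[] commands = File.ReadAllLines(CommandFile);
        if (commands.Length == 0)
        {
            return; // All done - stop the reboot cycle
        }

        // Give the machine time to settle down after the reboot
        Thread.Sleep(TimeSpan.FromMinutes(5));

        // Run the first command, appending its output to the results file
        Process process = Process.Start("cmd.exe",
            "/c " + commands[0] + " >> \"" + ResultsFile + "\"");
        process.WaitForExit();

        // Rewrite the command file without the line we've just executed...
        string[] remaining = new string[commands.Length - 1];
        Array.Copy(commands, 1, remaining, 0, remaining.Length);
        File.WriteAllLines(CommandFile, remaining);

        // ... and schedule the next reboot
        Process.Start("shutdown", "/r /t 0");
    }
}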

That’s the general principle – if anyone does want to run the tests on their machine and needs a bit of help setting it up, just let me know.

Test environment and workload

I expect the details of the execution environment to affect the outcome significantly. For the sake of the record then, my laptop specs are:

Dell Inspiron 1720 (unlikely to be relevant, but…)
Intel Core 2 Duo T7250 2.0GHz
Vista Home Premium 32 bit
3 GB main memory, 32K L1 cache, 2048K L2 cache
2 disks, both 120GB (Fujitsu MHY2120BH, 5400RPM, 8MB cache)

Even though there are two drives in the system, the tests only ever read from and wrote to a single disk.

I created four sets of input data; I’ve presented the results for each one separately below, and included the description of the files along with the results. Basically in each case there’s a set of files where each file is the same size, consisting of a fixed number of fixed-length lines.

The “encryption” that is performed in each case (on the encoded version of the line) is just to XOR each byte with a value. This is repeated a specified number of times, and the XOR value on each pass is the current index, i.e. it first XORs with 0, then 1, then 2 etc. I’ve used work levels of 0 (the loop never runs), 100 and 1000. These were picked somewhat arbitrarily, but they give interesting results. At 0 the process is clearly IO-bound. At 1000 it’s clearly CPU-bound. At 100 it’s not pegging the CPU, but there’s significant activity. I may try 200 as well at some point.
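
In code, the work per pass is essentially the following – a minimal sketch, where the method name is mine rather than anything from the real EncryptFiles source:

// XORs each byte of the data with the pass index, repeated workLevel times.
// A work level of 0 means the outer loop never runs at all.
static void Encrypt(byte[] data, int workLevel)
{
    for (int pass = 0; pass < workLevel; pass++)
    {
        for (int i = 0; i < data.Length; i++)
        {
            data[i] ^= (byte) pass;
        }
    }
}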

For each scenario I tried using 1, 2, 3 and 4 threads. The two strategies used are “streaming” (read a line, encrypt it, write the line, repeat) and “buffering” (read a whole file, encrypt it all, write it all out). After a thread has started, it needs no shared data at all – it’s given the complete list of files to encrypt at the very start.
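
Per thread, the two strategies look roughly like this, reusing the Encrypt sketch from above (again, the names and details are mine rather than the real source – the point is the shape of the IO, not the “cipher”):

using System.IO;
using System.Text;

// Streaming: read a line, encrypt it, write it out, repeat
static void EncryptStreaming(string inputFile, string outputFile, int workLevel)
{
    using (StreamReader reader = new StreamReader(inputFile))
    using (StreamWriter writer = new StreamWriter(outputFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // "Encrypt" the encoded version of the line, as described above
            byte[] data = Encoding.UTF8.GetBytes(line);
            Encrypt(data, workLevel);
            writer.WriteLine(Encoding.UTF8.GetString(data));
        }
    }
}

// Buffering: read a whole file, encrypt it all, write it all out
static void EncryptBuffering(string inputFile, string outputFile, int workLevel)
{
    byte[] data = File.ReadAllBytes(inputFile);
    Encrypt(data, workLevel);
    File.WriteAllBytes(outputFile, data);
}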

Results

For every table, the number of threads is shown in the leftmost column, and the work level in the topmost row. The cells are of the form “x/y” where “x” is the time taken by the streaming strategy and “y” the time taken by the buffering strategy. All times are in seconds. In each column, the optimal (lowest) time is marked with an asterisk.

Small tests

There are 10,000 test files. Each is 100 lines long, with 100 characters per line.

Threads | Work 0   | Work 100 | Work 1000
1       | 47*/90   | 94/79*   | 163/158
2       | 95/112   | 113/157  | 129*/146
3       | 179/128  | 130/144  | 152/160
4       | 144/139  | 163/193  | 156/204

Medium tests (short lines)

There are 100 test files. Each is 500,000 lines long, with 20 characters per line.

Threads | Work 0   | Work 100 | Work 1000
1       | 138*/157 | 219*/267 | 1125/1205
2       | 196/238  | 259/263  | 604/691
3       | 292/250  | 312/236  | 624/676
4       | 302/291  | 321/281  | 602*/666

Medium tests (long lines)

There are 100 test files. Each is 100 lines long, with 100,000 characters per line.

Threads | Work 0   | Work 100 | Work 1000
1       | 203*/213 | 249*/286 | 1027/1107
2       | 283/264  | 311/296  | 602/696
3       | 319/284  | 314/278  | 608/618
4       | 360/300  | 336/297  | 607/598*

Big tests

There are 20 files. Each file is 100,000 lines long, with 1000 characters per line.

Threads | Work 0   | Work 100 | Work 1000
1       | 190*/246 | 349*/441 | 2047/2165
2       | 365/375  | 392/473  | 1097*/1387
3       | 456/379  | 484/399  | 1161/1296
4       | 517/442  | 521/431  | 1113/1214

Conclusions

I’m loath to draw too many conclusions so far. After all, we’ve only got data for a single test environment. It would be interesting to see what would happen on a quad-core machine. However, I think a few things can be said with reasonable confidence:

  • Sometimes the streaming strategy wins, sometimes the buffering strategy wins, and the difference can be quite significant either way. On my machine the streaming solution wins more often than not, but it’s certainly not a clean sweep.
  • In an ideal scenario, the bottlenecked resource should be kept busy for as much of the time as possible. For example, in a high-CPU task it makes little sense to have idle cores when you’ve already loaded data for them to work on. Likewise, in a disk-bound task we should always be fetching (or writing) data even if all our cores are momentarily busy. Personally I think this is the job of the operating system’s prefetching system. We could do it ourselves, but it’s nicer if we don’t have to.
  • If we’re disk-bound, it makes sense to read (or write) one file at a time, to avoid seeking. This is obvious from the results – for work levels of 0 and 100, a single-threaded solution is consistently best (and streaming usually works better than buffering). Based on this, I suspect a cleaner solution would be to have a single thread reading largish chunks (but not the whole file) and feeding the other threads – or to use async IO; there’s a sketch of the single-reader idea just after this list. However, this all gets very complicated. One nice aspect of both of the current solutions is that they’re really simple to implement safely. Reading line-by-line in an async manner is a pain. I’d quite like to write an async “execute for every line” helper at some point, but not just yet.
  • If we’re CPU-bound, the streaming solution works best with just 2 threads (matching the 2 cores), whereas the buffering solution keeps improving with more – its sweet spot in these results is usually 4 threads. Beyond the sweet spot, extra threads just make the IO and context switching less efficient.
  • The streaming strategy always uses less application memory than the buffering strategy. When running the buffering strategy against the “big” data set with 4 threads, the memory varied between 800MB and 1.8GB (as measured by the “Total bytes in all heaps” performance counter), vs around 165KB with the streaming strategy. (Obviously the process itself took more than 165KB, but that was the memory used in “normal” managed heap memory.) Using more memory to improve performance is a valid technique, but I’d say it’s only worth it when the improvement is very significant. Additionally, the streaming technique instantly scales to huge data files. While they’re not in the current requirements, it doesn’t hurt to get that ability for free.
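
To illustrate the “single thread reading largish chunks and feeding the other threads” idea from the disk-bound bullet, here’s a sketch using BlockingCollection from .NET 4, which postdates these tests (on .NET 3.5 you’d hand-roll the queue with Monitor). The chunk size, work level and file name are all arbitrary, and a real version would also have to keep the output in order:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class SingleReaderPipeline
{
    static void Main()
    {
        // Bounded, so the reader can't race ahead and buffer the whole file
        BlockingCollection<byte[]> chunks = new BlockingCollection<byte[]>(8);

        // Worker threads: "encrypt" whatever the reader hands over
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() =>
            {
                foreach (byte[] chunk in chunks.GetConsumingEnumerable())
                {
                    for (int pass = 0; pass < 100; pass++) // work level 100
                    {
                        for (int j = 0; j < chunk.Length; j++)
                        {
                            chunk[j] ^= (byte) pass;
                        }
                    }
                    // ... write the chunk out, preserving order somehow ...
                }
            });
            workers[i].Start();
        }

        // A single reader pulls largish chunks sequentially, avoiding seeks
        using (FileStream input = File.OpenRead("input.dat")) // hypothetical file
        {
            byte[] buffer = new byte[64 * 1024];
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                byte[] chunk = new byte[bytesRead];
                Array.Copy(buffer, chunk, bytesRead);
                chunks.Add(chunk); // Blocks if the workers have fallen behind
            }
        }
        chunks.CompleteAdding();
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
    }
}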

Next steps…

I’ve included quite a few “I’m going to try to […]” points in this post. Just for reference, my plans are:

  • Turn off my virus checker! If this makes a significant difference, I’ll rerun all the current tests
  • Try to visualise the results with Google graphs – although I think it may get cluttered
  • Rerun big/buffering/1000 test – odd results
  • Rerun tests with a work level of 200 and keep an eye on CPU usage – I’d like to spot where we tip between CPU and IO being the bottleneck
  • Try to optimize the Stream/StreamReader being used – we should be able to skip the FileStream buffer completely, and specifying FileOptions.SequentialScan should give Windows more hints about how we’re intending to read the data. I’ll try a few different buffer sizes, probably 4K, 8K and 64K, just to see what happens; there’s a sketch of this below.
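
For reference, this is the kind of plumbing that last bullet is talking about – just a sketch, with nothing measured or validated yet:

using System.IO;
using System.Text;

class TunedReader
{
    static StreamReader Open(string path, int bufferSize)
    {
        // FileOptions.SequentialScan tells Windows we'll read the file from
        // start to finish, which should help its read-ahead
        Stream stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize,
                                       FileOptions.SequentialScan);
        // Give StreamReader the same buffer size, so the FileStream buffer
        // is effectively taken out of the picture
        return new StreamReader(stream, Encoding.UTF8, true, bufferSize);
    }
}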

This is a really interesting little example problem, because it sounds so simple (and it is simple in terms of the implementation) but it has a lot of environment variables. Of course it would be nicer if we didn’t have to reboot the machine between test runs in order to get a fair result, but never mind…

11 thoughts on “Buffering vs streaming benchmark: first results”

  1. Hi Jon,

    I found this research very interesting indeed.
    Currently, I’m working on a Core i7 with 4 cores and HT (Hyper-Threading), so it can run 8 threads (8 logical cores: 4 physical cores x 2 with HT). However, I’m using it as a database server, so it won’t give meaningful results – I can’t use it for benchmarks these days… I will do that later.
    Nevertheless, I have a Core 2 Quad Q6600 (highly overclocked to a 1600 MHz FSB) with a RAID 0 HD subsystem (2 x 7,200 RPM x 320 GB). It transfers about 200 MBps with sequential reads. Nice to add info to the experience.
    I can prepare this machine with a clean Win XP 32-bit install, without antivirus running, and run the tests next week. We will have interesting results. Let me know your thoughts.

    Cheers,

    Gastón

  2. Gastón, that would be great.

    I still haven’t checked out the difference that anti-virus makes – although it undoubtedly depends on the exact AV system and what settings it’s running under.

    I *have* done most of the “work level 200” tests, and they’re coming out with buffering on top. I’ll update the post when I’ve finished the tests.

    I’ve also tried optimising the Stream/StreamReader, but that hasn’t improved the results as far as I’ve seen so far.

    Jon

  3. Hey Jon –

    Great article. After reading it, I remembered what bugged me most about the book that sparked your investigation. The author made the assumption that you actually *could* buffer the entire file in memory. In the “real world”, I’ve frequently had to write software that reads in very large files – up to 2 GB in some cases. Reading those into memory just isn’t feasible on many machines.

    Any published book that makes the assumption that reading a file into memory is always “a good thing” has – in my opinion – some fundamental problems at its core.

  4. @Matt: To be fair, this assumption was made very explicit. I would have liked to see more discussion of the pros and cons of buffered vs streaming algorithms, but I think it’s fair to say that there are plenty of times when we *do* know that all the data will fit into memory – and that can often make a huge difference in the best algorithm to pick.

  5. Hi Jon,

    Some I/O experience I gained while at my previous employer (we dealt with massive files, GBs to TBs so we tended to always go with streaming):

    fragmentation affects things a LOT. just something to keep in mind while benchmarking and creating/destroying lots of files. you become a bit more i/o bound as time goes on. as an example, one test set i ran could vary from ~30-50 MB/s on a fairly fragged system. defrag and suddenly i am back up to 130-200+ MB/s (raid 5, multiple disks)

    buffer your writes. how much? really depends on each system. pulling from memory (which could be faulty) try caching write data in chunks of 16-512mb

    when using StreamReader, try using the constructor that allows explicit buffersize: public StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int bufferSize)
    personally, I have found that 128mb works well on desktop boxes

  6. Wouldn’t it be better to do a lot of light tests with different parameters than a lot of heavy tests with the same parameters?

    If you do a lot of tests you can plot the results, allowing you to deduce relations between the parameters and the results from the graph. It’s hard to predict results from a table of a few heavy tests.

  7. @Joren: Assuming that by “light” you mean “run very quickly”, the problem is that you then get more environmental factors kicking in, and less reproducible results. More tests is always better though, I agree.

  8. Yes, that’s what I meant by ‘light’. And as you say, a single test would be less accurate because of environmental factors. But I think that this might not be a problem. Light test results may not be significantly accurate by themselves, but if you plot a lot of them the general trend might still be apparent.

    You could emulate a heavy test by doing a lot of light tests with the same parameters, and averaging their results. But if you do the same amount of light tests but with subtly different parameters, doesn’t that give more information?
