Benchmarking IO: buffering vs streaming

I mentioned in my recent book review that I was concerned about a recommendation to load all of the data from an input file before processing any of it. This seems to me to be a bad idea in an age where Windows prefetch will anticipate what data you need next, etc – allowing you to process efficiently in a streaming fashion.

However, without any benchmarks I’m just guessing. I’d like to set up a benchmark to test this – it’s an interesting problem which I suspect has lots of nuances. This isn’t about trying to prove the book wrong – it’s about investigating a problem which sounds relatively simple, but could well not be. I wouldn’t be at all surprised to see that in some cases the streaming solution is faster, and in other cases the buffered solution is faster.

The Task

The situation presented is like this:

  • We have a bunch of input files, either locally or on the network (I’m probably just going to test locally for now)
  • Each file is less than 100MB
  • We need to encrypt each line of text in each input file, writing it to a corresponding output file

The method suggested in the book (sketched in code below the list) is for each thread to:

  1. Load a file into a List<string>
  2. Encrypt every line (replacing it in the list)
  3. Save to a new file
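
A rough, untested sketch of that approach – EncryptLine here is a hypothetical helper for the per-line transformation, sketched under “Benchmarking” below:

    // Buffered: slurp the whole file, transform in memory, write in one go.
    // (Namespaces: System.Collections.Generic, System.IO.)
    List<string> lines = new List<string>(File.ReadAllLines(inputPath));
    for (int i = 0; i < lines.Count; i++)
    {
        lines[i] = EncryptLine(lines[i]);
    }
    File.WriteAllLines(outputPath, lines.ToArray());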

My alternative option (again sketched below) is:

  1. Open a TextReader and a TextWriter for the input/output
  2. Repeatedly read a line, encrypt, write the encrypted line until we’ve exhausted the input file
  3. Close both the reader and the writer
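
And the streaming equivalent, as another untested sketch with the same hypothetical EncryptLine:

    // Streaming: hold only one line in memory at a time.
    using (TextReader reader = File.OpenText(inputPath))
    using (TextWriter writer = File.CreateText(outputPath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            writer.WriteLine(EncryptLine(line));
        }
    }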

These are the two implementations I want to test. I strongly suspect that the optimal solution would involve async IO, but doing an async version of ReadLine is a real pain for various reasons. I’m going to keep it simple – using plain threading, no TPL etc.

I haven’t written any code yet. This is where you come in – not to write the code for me, but to make sure I test in a useful way.

Environmental variations

My plan of attack is to first write a small program to generate the input files. These will just be random text files, and the program will have a few command line parameters:

  • Directory to put files under (one per test variation, basically)
  • Number of files to create
  • Number of lines per file
  • Number of characters per line

I’ll probably test a high and a low number for each of the last three parameters, possibly omitting a few variations for practical reasons.
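
The core of the generator might look something like this – an untested sketch, with the alphabet and names being placeholders I’ve picked purely for illustration:

    // Writes one random ASCII text file; called once per output file.
    static void WriteRandomFile(string path, int lineCount, int lineLength, Random rng)
    {
        const string Alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 ";
        using (TextWriter writer = File.CreateText(path))
        {
            char[] line = new char[lineLength];
            for (int i = 0; i < lineCount; i++)
            {
                for (int j = 0; j < lineLength; j++)
                {
                    line[j] = Alphabet[rng.Next(Alphabet.Length)];
                }
                writer.WriteLine(line);
            }
        }
    }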

In an ideal world I’d test on several different computers, locally and networked, but that just isn’t practical. In particular I’d be interested to see how much difference an SSD (low seek time) makes to this test. I’ll be using my normal laptop, which is a dual core Core Duo with two normal laptop disks. I may well try using different drives for reading and writing to see how much difference that makes.

Benchmarking

The benchmark program will also have a few command line parameters:

  • Directory to read files from
  • Directory to write files to
  • Number of threads to use (in some cases I suspect that more threads than cores will be useful, to avoid cores idling while data is read for a blocking thread)
  • Strategy to use (buffered or streaming)
  • Encryption work level

The first three parameters here are pretty self-explanatory, but the encryption work level isn’t. Basically I want to be able to vary the difficulty of the task, which will affect whether it ends up being CPU-bound or IO-bound (I expect). So, for a particular line I will:

  • Convert to binary (using Encoding.ASCII – I’ll generate just ASCII files)
  • Encrypt the binary data
  • Encrypt the encrypted binary data
  • Encrypt the encrypted encrypted […] etc until we’ve hit the number given by the encryption work level
  • Base64 encode the result – this will be the output line

So with an encryption work level of 1 I’ll just encrypt once. With a work level of 2 I’ll encrypt twice, etc. This is purely for the sake of giving the computer something to do. I’ll use AES unless anyone has a better suggestion. (Another option would be to just use an XOR or something else incredibly simple.) The key/IV will be fixed for all tests, just in case that has a bearing on anything.
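
In sketch form, the per-line transformation might look like this, assuming one AES transform created per thread from the fixed key/IV and reused for every round (something to verify via CanReuseTransform before trusting the results):

    // Hypothetical helper (System.Security.Cryptography, System.Text):
    // encrypt 'workLevel' times, then Base64 the final ciphertext.
    static string EncryptLine(string line, int workLevel, ICryptoTransform encryptor)
    {
        byte[] data = Encoding.ASCII.GetBytes(line);
        for (int i = 0; i < workLevel; i++)
        {
            // Each round re-encrypts the previous round's output; PKCS#7
            // padding makes the data grow slightly on every pass.
            data = encryptor.TransformFinalBlock(data, 0, data.Length);
        }
        return Convert.ToBase64String(data);
    }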

The benchmarking program is going to be as simple as I can possibly make it:

  • Start a stopwatch
  • Read the names of all the files in the directory
  • Create a list of files for each thread to encrypt
  • Create and start the threads
  • Use Thread.Join on all the threads
  • Stop the stopwatch and report the time taken

No rendezvous required at all, which certainly simplifies things. By creating the work lists before starting the threads, I don’t need to worry about memory model issues. It should all just be fine.
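
In untested sketch form, with ProcessFiles standing in for whichever strategy has been selected:

    // Partition up front: each thread gets its own list, created before the
    // thread starts, so there's no locking and no memory model subtlety.
    Stopwatch stopwatch = Stopwatch.StartNew();
    string[] files = Directory.GetFiles(inputDirectory);
    List<string>[] work = new List<string>[threadCount];
    for (int i = 0; i < work.Length; i++)
    {
        work[i] = new List<string>();
    }
    for (int i = 0; i < files.Length; i++)
    {
        work[i % threadCount].Add(files[i]);
    }

    Thread[] threads = new Thread[threadCount];
    for (int i = 0; i < threads.Length; i++)
    {
        List<string> filesForThread = work[i];
        threads[i] = new Thread(delegate() { ProcessFiles(filesForThread); });
        threads[i].Start();
    }
    foreach (Thread thread in threads)
    {
        thread.Join();
    }
    stopwatch.Stop();
    Console.WriteLine("Total time: {0}", stopwatch.Elapsed);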

In the absence of a better way of emptying all the file read caches (at the Windows and disk levels) I plan to reboot my computer between test runs (which makes it pretty expensive in terms of time spent – hence omitting some variations). I wasn’t planning on shutting services etc down: I really hope that Vista won’t do anything silly like trying to index the disk while I’ve got a heavy load going. Obviously I won’t run any other applications at the same time.

If anyone has any suggested changes, I’d be very glad to hear them. Have I missed anything? Should I run a test where the file sizes vary? Is there a better way of flushing all caches than rebooting?

I don’t know exactly when I’m going to find time to do all of this, but I’ll get there eventually :)

24 thoughts on “Benchmarking IO: buffering vs streaming”

  1. Wouldn’t Windows begin to swap when you load a bunch of large input files?
    There’s presumably a limit to the RAM one process can use – less than 4 GB on a 32-bit OS?

    On the other hand, TextReader.ReadLine() probably involves a good deal more work than List[i], so I guess you’ll get faster results with the List for smaller files.

    Regards.

  2. @Petar: The lines will be read using ReadLine either way, it’s just a matter of whether they’re all read to start with (for a particular file) or streamed.

    If the files are only 100MB at most, that’ll be ~200MB of strings per file (quite possibly a lot more if it’s a lot of *very* short lines, admittedly – TextReader.ReadLine would default to a 16 character buffer, so three bytes of input (x\r\n) could end up taking around 40 bytes in memory). Even with 4 threads going, that’s only 800MB.

    I may try the pathological “single character per line” case just to see how nasty the memory situation gets. There are ways round it, of course, such as loading the whole file into memory but parsing it as text in a streaming fashion.
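
    (Something like this, for instance, keeping the single big read but avoiding holding every line at once:)

        // Middle ground: one contiguous byte[] per file, but still
        // line-at-a-time parsing and encryption.
        byte[] raw = File.ReadAllBytes(path);
        using (TextReader reader = new StreamReader(new MemoryStream(raw), Encoding.ASCII))
        {
            // ReadLine as normal; each line's string is short-lived.
        }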

  3. @configurator: Yes, I very much would.

    I may well install the Windows 7 beta on the Samsung NC10 I plan to buy soon… but I suspect that using an SSD will change the performance characteristics out of all recognition anyway.

    I may need to wait until Windows 7 is released, upgrade my laptop, and *then* rerun all the tests. Or if you’re volunteering your machine as another test bed…

    Jon

  4. @Jon: Thanks for inviting me by e-mail to post my experiences in this blog entry. I really appreciate that.
    There are many hardware and operating system issues that make it difficult to say “this is the best practice” for this situation. As in many issues related to computer science and programming, there is no silver bullet. I don’t believe in silver bullets. I think that we must learn a bit more each day. There’s no other alternative.
    The benchmark you are proposing here will show very different results depending on the hardware. If you run asynchronous I/O on a 5,400 RPM hard drive (most notebooks use 5,400 RPM drives), it will run really slowly. Why? Because async I/O adds significant latency on low-RPM hard drives with small buffers (hard drive buffers range from 1 MB, worst case, to 128 MB on high-end drives).
    Therefore, if you run this application on a notebook, the results will be completely different from running the same application on a server.
    Given enough memory, I believe that in many situations the method suggested in the book will work much better than the other. Why? Because we are taking full advantage of modern hard drives’ large buffers.
    For example, in my workstation I have 4 10,000 RPM hard drives in a RAID 1 configuration. Async I/O is still painful with this configuration, because the latencies really hurt. There are many benchmarks about this. Give me some days and I’ll add some very interesting links about it.
    You can use PC WIZARD (http://www.cpuid.com/pcwizard.php), a free tool which includes a hard drive benchmark. Look at how they show the results for the benchmark. When you take advantage of the hard drive buffers, you can write 100 MB in just a few seconds on many modern hard drives. When you perform asynchronous I/O, you’ll need more time, because the latencies add overhead.
    You can take a look at these sequential read & write pattern benchmarks for modern 1 TB hard drives: http://www.xbitlabs.com/articles/storage/display/1tb-14hdd-roundup_9.html#sect1
    And then the random read & write patterns: http://www.xbitlabs.com/articles/storage/display/1tb-14hdd-roundup_11.html
    However, the OS does a lot of work here too, because it can transform asynchronous I/O into buffered writes. FreeBSD does a great job, and Windows does a great job in its 64-bit versions.
    Furthermore, going back and forth to I/O flushes modern multicore CPUs’ powerful L2 & L3 caches… That’s the main reason why, in many situations, the example in the book will work fine.
    However, there is no *silver bullet*. We must benchmark and benchmark.
    I’ll tell you about another nice experience, from some years ago. Back in the year 2000, I had the opportunity to develop a similar application in Java, on a Sun Enterprise server (32 700 MHz RISC processors). Yes, 32 physical CPUs!!! :)
    I prepared an asynchronous I/O version; I had 32 CPUs to myself. It should have worked fine…… It was horrible. Why? Because the 16 hard drives were going crazy, moving from one side to the other. Each line was processed so quickly that there was an incredible queue waiting to be written, but using a random pattern.
    I spent a week reading and reading and researching the issue. I had 8 GB of RAM available. I wondered whether buffering could be the best way to do it. The hard drives performed very fast reading big files. I rewrote the application.
    The async I/O version needed 1 day to finish.
    The buffered I/O version finished in 1 hour.
    Why? Memory was and is faster than I/O.
    SSDs are still horrible with random reads and writes. There are many benchmarks on the XBitlabs website.
    I believe that the results will be very different on different hardware. However, if I have to choose, and I have enough memory, I load, process and then write.
    Recently, I optimized a video conversion program (one customer contracted me to help the development team with that). It was developed using one

  5. Fortunately Gastón cc’d me with his whole comment, so I can continue his truncated comment with the rest of it:

    —————————

    Recently, I optimized a video conversion program (one customer contracted me to help the development team with that). It was developed using one thread and async I/O.
    We changed the code to use buffered I/O according to the available memory, plus multithreading. In this case, it was also optimized to use SSE3 and SSE4 (a topic not covered in my book). The video conversion runs 20 times faster on a Core i7. We spent 2 weeks benchmarking before finding the best approach. Unluckily, it is a closed-source implementation. However, give me some days and I’ll post some code with some results. :)
    That’s just my point of view. I don’t recommend a best practice. I’m just saying what I know so far. Tomorrow, I can go on learning and I’ll know more about this. Everything changes in software engineering.
    Cheers,
    Gastón

  6. Thanks for the comments, Gastón. Interesting perspective on async IO – I’d really expect modern operating systems to be able to use both the in-memory buffer and the hard disk buffer in a near-optimal way.

    I understand the comment about *writing* in a buffered fashion vs streaming (although I’d hope the OS would be buffering again), but I still don’t really see why it would be faster to wait until we’ve got all the data before starting to process it at all. The important thing is presumably what the operating system tells the disk to read – and so long as it can keep telling the disk to read data as quickly as it can, storing the results in an OS cache until the process itself requests it, I’d have thought it would make sense for the CPU to be busy processing the data instead of idling.

    However, this is why I want to run the benchmark of course :)

    (The point I mentioned in another comment about memory is interesting, by the way. I hadn’t really thought about the potential difference between size of data on disk and size in memory, and how it’s often impossible to tell what the memory requirements will be until you’ve already read the data. If my back-of-the-envelope calculations are correct, a single 100MB file could end up taking over 1.7GB in memory as lines of text. Even that’s assuming a non-empty string and CRLF as the line terminator. However, this would clearly be a pathological case.)
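
    (The envelope, roughly: a 100MB file of “x” + CRLF is ~35 million lines; if each string drags along ReadLine’s 16-char buffer, that’s getting on for 50 bytes apiece on a 32-bit CLR, plus a 4-byte list reference – around 1.75GB even before List<string>’s array-doubling growth adds its transient spike.)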

    Jon, regretting he has less interesting but more important things to do right now :(

  7. I think the real point of this is getting lost in all of the details here:

    The code that is most clear, easiest to write, easiest to maintain, and will work across the broadest range of situations is the best answer. The performance of this code, in the absence of specific requirements, is largely irrelevant.

    The code that goes:
    1 – Read the file
    2 – Encrypt the file
    3 – Write the file

    … is a great winner in some of those categories, but is trivially easy to break with large files or a resource-constrained machine. To me, this makes that code a failure as a general solution.


    Chris Mullins

  8. Gastón, I don’t understand your comment here:

    “I prepared an asynchronous I/O version; I had 32 CPUs to myself. It should have worked fine…… It was horrible. Why? Because the 16 hard drives were going crazy, moving from one side to the other. Each line was processed so quickly that there was an incredible queue waiting to be written, but using a random pattern.”

    My understanding of how an OS and the I/O subsystem deal with this is that the combination of scatter-gather descriptors and command queuing is supposed to take all of these “random” seeks from all over the drive and figure out the most efficient way to perform the reads. There is no “in order” requirement at the I/O level, nor has there been for a very long time. When I wrote these drivers for Intel (a long time ago now), this was cutting-edge stuff. Since then, it’s all moved mainstream.

    The more advanced drives (i.e. everything but IDE) have had these features for a long time. If you’re not seeing this behavior then there’s likely a driver issue.


    Chris Mullins

  9. @Chris,

    In my previous post, I added some hyperlinks about performance for modern 1 TB hard drives.
    Did you run the PC WIZARD (www.cpuid.com/pcwizard.php) benchmarks? You’ll find nice information there.
    Hard drives are still the slowest component, and they still rotate.

    Gastón

  10. You might want to consider using a stream-based cipher instead of AES. AES forces every implementation to use a particular block size, which may bias things (and the behaviour on line ends is a bit of a pain).

    If you want to increase the CPU load, just pump the data through multiple encoders.

  11. @Shuggy: Good plan. I think I’ll just use something I can perform directly on the raw bytes, so I don’t need to mess around with the crypto APIs at all. It’s only to induce load, after all.
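
    Something along these lines, perhaps – not remotely real encryption, just repeatable per-byte work:

        // Burn CPU in proportion to the work level; the 'key' is arbitrary.
        static void XorInPlace(byte[] data, int workLevel)
        {
            for (int round = 0; round < workLevel; round++)
            {
                byte key = (byte)(round + 1);
                for (int i = 0; i < data.Length; i++)
                {
                    data[i] ^= key;
                }
            }
        }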

  12. I’ve now got a plan to automatically make my machine reboot, wait a few minutes to quieten down, run the next test, reboot etc. If I can find the time to write the code, the tests should be able to run overnight.

    I’ll need to look into programmatic rebooting though…
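
    (The lazy option is presumably to re-launch the harness from a startup entry and shell out to the stock shutdown utility – an untested sketch:)

        // Ask Vista's shutdown.exe for a restart in 60 seconds (System.Diagnostics).
        Process.Start("shutdown", "/r /t 60");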

    Jon

  13. I would have thought that an asynchronous I/O solution – whether through provided asynchronous I/O facilities, or manually with a separate thread – would be the one solution that could offer the best mix of performance without requiring excessive amounts of memory. Adjusting the amount of data read each time should even make it possible to balance I/O vs CPU such that the data for each chunk becomes available just as the processing thread finishes with the previous chunk of data. Such a solution should even be able to adapt to varying systems – if the processing thread would wait for I/O, increase the I/O chunk size, spending memory to gain speed; if I/O finishes way before the processing thread needs new data, decrease the I/O chunk size and save memory. And of course, if a memory allocation fails, obviously reduce the chunk size and try again until things fit.

    Best wishes,

    // Christian

  14. Miscellaneous points to consider for this problem:

    * If you know that you’re reading once from front to back, you should use a BufferedStream (http://msdn.microsoft.com/en-us/library/system.io.bufferedstream.aspx , or similar) to avoid calling into the OS too often.

    * Vista seems to have improved the read-ahead algorithm significantly (http://en.wikipedia.org/wiki/Technical_features_new_to_Windows_Vista#Memory_management , last point).

    * On POSIX compliant systems you should use posix_fadvise (http://www.opengroup.org/onlinepubs/000095399/functions/posix_fadvise.html) or posix_madvise when opening the input files with POSIX_FADV_SEQUENTIAL to advise the implementation to improve performance by reading file contents before they’re needed.

    * Many small reads from a file/stream cause overhead by having to call the underlying stream (and ultimately the kernel) often and allocate and copy results every time. Both effects can be mitigated by mmap()ing (http://www.opengroup.org/onlinepubs/000095399/functions/mmap.html) the file into the process’ address space, where it can be read without having to allocate any local buffers. Using (POSIX_FADV_SEQUENTIAL | POSIX_FADV_NOREUSE), the implementation of mmap can confidently read-ahead and free caches as soon as the program has read the data. In the best case, this means that the encryption routine can directly read the (physical) memory that was written by the DMA of the disk controller before the line was actually needed.

    * The theoretical limit to performance is one seek to the start of the file when opening it, plus the maximum of a streaming read of the whole file (triggered by OS heuristics or an explicit fadvise()) and the CPU time needed to process the contents. Additional delays can occur when writing out results blocks the processing thread, interferes with the read-ahead streaming, or is not finished at the end of processing. All three can be reduced by using a separate write-out thread which takes care of flushing results to disk in big enough batches.

    * Optimising for throughput (number of files/amount of data processed) or latency (time from starting processing of a file until (all) results are available) will affect many decisions.

  15. @David:

    – Is there any point in using BufferedStream when StreamReader already contains a buffer?

    – Interesting point about Vista; I believe Windows 7 is meant to be even better on this front.

    – No idea how we’d specify posix_fadvise from C# :)

    – Memory mapping sounds like it would be very handy… I wonder when we’ll get it in .NET :)

    I’m most of the way through running 60+ tests right now. Even on my laptop, I’ve got an interesting mix of results – sometimes buffering is faster, sometimes it’s slower. Will blog again when I’ve got the full set.

    Jon

  16. Ooh, just discovered the FileOptions enumeration and SequentialScan.

    I suspect that by tweaking this and possibly the buffer size of both FileStream and StreamReader, we may be able to optimise further. Will try that when I’ve got a complete set of results for the simplest code. In particular, I suspect that it may be worth trying to avoid the FileStream bothering with a buffer, given that it’s just going to be copied into the StreamReader’s buffer.
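
    Something like this, perhaps – unverified, but the 1-byte FileStream buffer should effectively disable its buffering, since reads bigger than the buffer go straight into the caller’s array:

        // Hint sequential access to the OS; let StreamReader own the only buffer.
        Stream stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 1, FileOptions.SequentialScan);
        using (TextReader reader = new StreamReader(stream, Encoding.ASCII, false, 8192))
        {
            // Read lines as before; disposing the reader disposes the stream.
        }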

  17. Re BufferedStream: Indeed, I didn’t see that StreamReader has a buffer too. But who knows which implementation performs better?

    Re posix vs. .NET: I was rather hoping that the .NET implementation would use such constructs to optimise itself. FileOptions.SequentialScan sounds very much like it could do exactly that.

    Re buffer size: While you surely have seen the note “When reading from a Stream, it is more efficient to use a buffer that is the same size as the internal buffer of the stream.” on the StreamReader MSDN page, you should also consider that reading from the file system makes most sense in multiples of the underlying block size.

  18. @David: I hadn’t seen that note, but yes – the underlying block size seems to be the way forward. I can’t see why I’d want the FileStream to use a buffer at all: why write into that buffer, then copy the data into the StreamReader’s buffer, instead of reading directly into the StreamReader’s buffer?

    First results post is half complete now, btw.

  19. Jon,

    Here is a list of 11 things that you should consider when measuring perf. The list comes from a blog post that I suggest you read: http://geekswithblogs.net/akraus1/archive/2008/12/16/127989.aspx

    1. DateTime.Now has a resolution of ~16ms. If you measure anything faster, you must use Stopwatch, which has a resolution of about 100ns.
    2. With Windows Vista you need to set your power options to maximum performance to prevent the OS from changing the CPU clock frequency at random times.
    3. Warm and cold start times are way different – a factor of 2-6 is quite normal. By cold startup I mean a freshly booted system which has never run your test case. Cold startup time mainly goes on disk IO, which you can watch very nicely with XPerf.
    4. Be sure that you measure the right thing. I cannot emphasize this enough. Did you really measure the right thing?
    5. Know your influencers. You need a good understanding of how much e.g. the number of iterations, input data size, concurrency and other applications can affect the outcome. More than once I have executed a test in the evening and again the next morning, and the results differed by a factor of two for no apparent reason.
    6. Debug and release builds still differ in the managed world, although the difference is much smaller than it was with C++.
    7. First-time initialization effects should be measured separately. When you initialize a class and call some method 10 million times, it makes little sense to mix the constructor call time with the pure function call time.
    8. Measure for long enough to get reliable results. The mean value is a powerful ally for flattening sporadic effects.
    9. If you get strange disk IO reports, check whether a virus scanner is interfering.
    10. Other processes might be running as well; these can influence your test if they consume significant resources.
    11. Stupid, but it happens: turn off tracing before you measure.

    Regards,

    Anthony

  20. @Anthony: Thanks for the list.

    1) I’m actually using Stopwatch already, but only reporting accuracy to the nearest second. The times I’m measuring usually range between about 3 and 35 minutes, so I don’t need better than that.

    2) Yup, I’m using max performance.

    3) I’m using cold startup, hence the reboot program.

    4) I’m measuring the time from starting the threads to them all terminating. I’m *not* forcing all write buffers to be emptied, mostly because that’s relatively tricky :)

    5) I’ve been trying quite a few factors. Unfortunately there are three I can’t easily change on my box: CPU, OS, disks.

    6) Release all the way :)

    7) Initialization effects would be lost in the noise here.

    8) I’ve chosen workloads to give useful times.

    9) I haven’t tried turning off my virus scanner. Interesting point. I might rerun all the tests with it turned off. It’s definitely not doing a full system scan at the same time or anything like that.

    10) I haven’t tried to reduce the number of other services running, but I’m not running any other main applications.

    11) No tracing involved.

    It looks like I’m doing okay by that list – the only thing I can think of that I’m immediately worried about is the virus checker…

    Jon

  21. Jon,

    Good to hear that you have almost everything covered.

    I am not sure if you looked at the referenced blog links in Alois’s post, but he links to some musings of Vance Morrison (an architect on the .NET runtime team, specializing in performance issues). It’s a good source of performance information.

    Vance has a microbenchmarking tool called “MeasureIt” that you may also want to grab, from this blog post: http://blogs.msdn.com/vancem/archive/2009/02/06/measureit-update-tool-for-doing-microbenchmarks.aspx

    Out of the box, MeasureIt runs a set of tests that focus on .NET language constructs – which really don’t touch the disk – and reports a nice set of stats. Tests include MethodCalls, FieldAccess, Looping, ObjectOps, ArrayAccess, DelegateCalls, Generics, Iteration, Regex, etc.

    In addition to those test results you also get some stats about the machine. For example I receive the following table:

    Number of Processors 1
    Processor Name Intel(R) Core(TM)2 CPU T7400 @ 2.16GHz
    Processor Mhz 2167
    Memory MBytes 2045
    L1 Cache KBytes 32
    L2 Cache KBytes 4096
    Operating System Microsoft® Windows Vista™ Business
    Operating System Version 6.0.6001
    Stopwatch resolution (nsec) 69.841
    CompileType JIT
    CodeSharing AppDomainSpecific
    CodeOptimization Optimized

    I think you might be able to use this tool in one of two ways to further isolate the performance of disks as the primary cause across two machines:

    1) Use it before the testing to find two machines that have roughly the same performance profile (Stopwatch resolution, execution profile) and differ only in that one has an SSD, thus isolating just the drives; or

    2) Extend it to use your final test code. This would let you see that, during your actual test runs, the performance differences were caused by the differences in the drives of the two machines.

    Regards,

    Anthony

  22. @Anthony:

    I haven’t had time to read the post properly yet. Will do so tonight. The machine stats are probably more useful than anything else, to be honest.

    I’m not going to go to the trouble of getting identical machines with different drives, etc – this is more for casual interest than anything else :)

    Jon

  23. Re buffering: that depends on how it is implemented internally. The reference for BufferedStream explicitly says “If you always read and write for sizes greater than the internal buffer size, then BufferedStream might not even allocate the internal buffer.” I’ve not found similar claims for other streams.

    #include
