Storage and I/O: reads vs. writes

There is a fundamental difference between a read operation and a write operation. Storage can lie about completing a write operation, but it can never lie about completing a read operation. Therefore read and writes have different characteristics. This is what I've learned.

About writes

So what does this mean? Well, if you write data to disk, the I/O subsystem only has to acknowledge that it has written the data to the actual medium. Basically, the application says "please write this data to disk" and the I/O subsystem answers "done, feel free to give me another block of data!".

But the application cannot be sure that the I/O subsystem actually wrote that data to disk. More likely, the application can be sure the I/O subsystem lied.

Compared to RAM, non-volatile storage like hard-drives are slow. Orders of magnitudes slower. And the worst-case scenario, which is often also the real-life scenario, is that both read and write patterns are random as perceived from the storage I/O subsystem.

So you have this mechanical device with rotating platters and a moving arm, governed by Newtons rules of physics, trying to compete with CPUs and memory that are so small that they are affected by quantum mechanical effects. No way that device is going to be able to keep up with that.

So the I/O subsystem cheats. Hard drives are relatively great at reading and writing blocks of data sequentially, it's the random access patterns that wreaks havoc on performance. So the trick is to lie to the application and collect a bunch of writes in a cache, in memory.

So, meanwhile, the I/O subsystem looks at the data to be written to disk, and reorders the write operations, so that it becomes as 'serialised' as possible. It tries to take into account all the latencies involved in moving the arm, timing that with the rotation of the platter and that kind of stuff.

A 7200 RPM hard drive can do only 75 IOPS with random access patterns, but that is a worst-case of worst-case scenario's. Real-life usage scenario's often allow for some optimalisation.

I used FIO to perform some random-IO performance benchmarks on different hard drive types and RAID configurations. It turns out that read performance was conform the 75 IOPS, but writes where in the thousands of IOPS, not a realistic figure. The operating system (Linux) employed heavy caching of the writes, lying to FIO about the actual IOPS being written to disk.

Thousands of IOPS sounds great, but you can only lie until your write cache is full. There comes a time when you have to actually deliver and write this data to disk. This is where you see large drops in performance, to almost zero IOPS.

Most of the time, this behaviour is overall beneficial to application performance, as long as the application usage patterns are often short bursts of data, that need to be written to disk. With more steady streams of data being written to disk in a random order, this might influence application responsiveness. The application might become periodically unresponsive as data is flushed from the cache to disk.

This write-caching behaviour is often desired, because by reordering and optimising the order of the write requests, the actual overall obtained random I/O write performance is often significantly higher than could be achieved by the disk subsystem itself.

If the disk subsystem is not just a single disk, but a RAID array, comprised of multiple drives, write-caching is often even more important to keep performance acceptable, especially for RAID arrays with parity, such as RAID 5 and RAID 6.

Write-back caching may help increase performance significantly, but it may come at a cost. As the I/O subsystem lies about data being written to disk, that data may get lost if the system crashes or loses power. There is a risk of data loss or data corruption. Only use write-back caching on equipment that is supported by battery backup units and a UPS. Due to the risks associated with write-back caching, there might be use cases where it might be advised not to enable it to retain data consistency.

About reads

The I/O subsystem can't lie to the application about reads. If the application asks the I/O subsystem "can I have the contents of file X", the I/O subsystem can't just say "well, yes, sure". It actually has to deliver that data. So any arbitrary write can be easily cached and written to disk in a more optimised way, reads may be harder. There is no easy way out, the I/O subsystem must deliver.

Where any arbitrary write can be cached, only a limited number of reads can be cached. Cache memory is relatively small compared to the storage of the disk subsystem. The I/O subsystem must be smart about which data needs to be cached.

More complex storage solutions keep track of 'hot spots' and keep that data cached. As a side note, such caching constructions can now also be found in consumer grade equipment: Apple's fusion drive uses the SSD as a cache and stores the data that is less frequently accessed on the HDD.

But in the end, regarding reads, chances are higher that data must be retrieved that is not stored in cache (cache miss) and thus the drives must do actual work. Fortunately, that work is not as 'expensive' as writes for RAID 5 or RAID 6 arrays.

Furthermore, reads can also be 'grouped' and serialised (increased queue depth) at the cost of latency to optimise them (setup a more sequential read access pattern for the disk subsystem) and achieve better performance. But again, at the cost of latency, thus responsiveness. That may or may not be a problem depending of the type of application.

Some remarks

If possible, it's better to try and avoid having to access the storage subsystem in the first place, if possible. Try and trow RAM memory at the problem. Buy systems with sufficient RAM memory, so that the entire database fits in RAM memory. A few years ago this was unthinkable, but 128 GB of RAM memory can be had for less than two thousand dollars.

If RAM isn't an option (dataset is too large) still try and put in as much RAM as possible. Also, try and see if server grade Solid State Drives (SSDs) are an option (always RAID 1 at least for redundancy!), although their cost may be an obstacle.

The gateway of last resort is the old trusted hard drive. If random I/O is really an issue, take a look at 15000 RPM or at least 10000 RPM SAS drives and a good RAID controller with loads of cache memory. In general, more drives or more 'spindles' equals more I/O performance.

You might encounter a situation where you want to add drives to increase I/O performance, not for the storage. More important: you may choose not to use that extra storage as it may decrease performance. Because if you put more data on a disk, the head must cover larger areas of the disk platter, increasing latency.

There are usecases where drives are intentionally under-partitioned to (artificially) increase the performance of the drives.

Louwrentius