Why RAID 1, 5 and 10 Will Kill You Some Day

Sat 02 August 2008 Category: RAID

How important is the availability of an information system to you and your company? What are the costs of, let's say, a couple of hours of downtime and maybe the loss of all the work since the last backup?

Depending on the information system, the impact can be quite grave, I presume. So what are the biggest risks regarding the availability of your systems? Human error is probably number one. Number two might be the hardware.

One of the most unreliable components of the hardware on which your precious information systems run is the good old hard drive. There are not two but three certainties in life: death, taxes and that sooner or later hard drives will fail.

So back in the eighties (some patent was already awarded back in 1978) some smart people invented RAID. Using RAID, your information system can tolerate a disk failure and still continue to operate.

There are many different flavors of RAID, the so-called RAID levels. One of the most popular RAID levels is RAID 1: two disks acting as one. If one fails, the other takes over. For performance, you can stripe several of these pairs together to get a RAID 10 array. However, 50% of your storage space is wasted, because for every n of storage you need (n / c) * 2 disks, where c represents the capacity of a single drive.

RAID level 5 is a more efficient solution. Using this RAID level, only the capacity of one disk is lost in order to provide redundancy. So for every n of storage, you need (n / c) + 1 disks. It is easy to see that for larger arrays with more disks, RAID 5 is much more efficient. The downside of RAID 5 is mainly (write) performance, compared to RAID 10. However, if the performance is sufficient for your workload, that is often not an issue. Hence the popularity of RAID 5.
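To make the two formulas above concrete, here is a minimal sketch in Python. The function name disks_needed and the example sizes are my own, chosen for illustration only; the sketch assumes all drives have the same capacity c and rounds up to whole drives.

```python
import math


def disks_needed(n_tb, c_tb, level):
    """Drives required for n_tb of usable storage on drives of c_tb each.

    Implements the two formulas from the text:
      RAID 10: (n / c) * 2
      RAID 5:  (n / c) + 1
    """
    data_disks = math.ceil(n_tb / c_tb)  # round up to whole drives
    if level == "raid10":
        return data_disks * 2
    if level == "raid5":
        return data_disks + 1
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical example: 1 TB drives, growing usable capacity.
    for usable_tb in (2, 4, 10):
        r10 = disks_needed(usable_tb, 1, "raid10")
        r5 = disks_needed(usable_tb, 1, "raid5")
        print(f"{usable_tb} TB usable: RAID 10 needs {r10} disks, RAID 5 needs {r5}")
```

For 10 TB of usable space on 1 TB drives this gives 20 disks for RAID 10 against 11 for RAID 5, which is the efficiency gap described above.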

This story is all about risk vs. costs. And there is a risk in using RAID 1 and 5 that cannot be neglected and should be pointed out. If a drive fails, redundancy is lost. From that moment until the faulty drive has been replaced and the array rebuilt, you run the risk of losing the entire RAID array and all its data if another disk fails.

How big is that risk? Well, that is the weakest point of this article. I honestly don't know. There is some anecdotal "evidence" that it occurs occasionally. And it is not that surprising: rebuilding an array puts extra stress on all the disks involved, which might be fatal for a second drive.

Today, RAID arrays of 10+ disks are not a rarity. With that many drives, it wouldn't be surprising if a second drive failed during recovery. The reasoning is simple: in a 10-disk array the chance that some disk fails is roughly twice that of a 5-disk array.
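The "twice" figure can be checked with a small sketch. It rests on assumptions that are mine, not measurements: every drive fails independently, and each has the same probability p of failing within some fixed window (say, the time a rebuild takes). The value p = 0.01 is purely hypothetical. Real drives from the same batch, under rebuild stress, are probably not independent, so treat this as a lower bound on how worried you should be.

```python
def p_any_failure(num_disks, p_single):
    """Probability that at least one of num_disks drives fails,
    assuming independent failures with probability p_single each."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1.0 - (1.0 - p_single) ** num_disks


if __name__ == "__main__":
    p = 0.01  # hypothetical per-drive failure probability in the window
    p5 = p_any_failure(5, p)
    p10 = p_any_failure(10, p)
    print(f"5 disks:  {p5:.4f}")
    print(f"10 disks: {p10:.4f}")
    print(f"ratio:    {p10 / p5:.2f}")
```

For small p the ratio comes out just under 2, so "twice" is a fair approximation; it only drifts away from 2 when the per-drive failure probability gets large.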

The most common solution is to fall back to RAID 10. RAID 10 consists of mirrored disk pairs striped together into one big virtual disk. RAID 10 can tolerate the loss of up to 50% of its drives, as long as no more than one member of each pair fails. The caveat is obvious: if a disk fails and the other drive of that pair fails during recovery, the whole array is lost. However, compared with RAID 5, the risk is reduced. In degraded (non-redundant) mode, any further drive failure will destroy a RAID 5 array. RAID 10 can tolerate additional drive failures as long as they do not hit the surviving drive of a pair that has already lost its partner.

So, although the risk that a second drive failure destroys your array is greatly reduced using RAID 10 (compared to RAID 5), there is still a risk that the array is lost if the 'wrong' drive fails.
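How much smaller is the RAID 10 risk? A rough sketch, assuming that one drive has already failed and that the second failure is equally likely to hit any of the surviving drives (an assumption of mine, since I have no real data): in RAID 5 every surviving drive is the 'wrong' one, while in RAID 10 only the mirror partner of the failed drive is. The function names are made up for this illustration.

```python
from fractions import Fraction


def fatal_fraction_raid5(total_disks):
    """Fraction of possible second failures that destroy a degraded RAID 5.

    Every surviving drive is needed, so any second failure is fatal.
    """
    if total_disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return Fraction(1)


def fatal_fraction_raid10(total_disks):
    """Fraction of possible second failures that destroy a degraded RAID 10.

    Only the mirror partner of the failed drive is fatal: 1 out of the
    total_disks - 1 survivors.
    """
    if total_disks < 4 or total_disks % 2:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    return Fraction(1, total_disks - 1)


if __name__ == "__main__":
    for disks in (4, 10, 20):
        r5 = fatal_fraction_raid5(disks)
        r10 = fatal_fraction_raid10(disks)
        print(f"{disks} disks: RAID 5 fatal in {r5} of cases, RAID 10 in {r10}")
```

Under these assumptions, a second failure in a 10-disk RAID 10 is fatal in 1 out of 9 cases, against 9 out of 9 for RAID 5. Reduced, but not zero.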

So the solution should be that redundancy is not lost when a single drive fails. RAID 6 provides that solution. RAID 6 is by nature very similar to RAID 5. However, an additional drive is sacrificed for additional redundancy. So for every n of storage, you need (n / c) + 2 disks. If you need 10 TB of storage using 1 TB disks, you need 12 disks. If a disk fails, the array is still redundant. Even a second drive can fail and the array will still continue to operate. I think that the chance that a third drive would fail is so low that it is an acceptable risk.
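The RAID 6 formula and the 10 TB example can be checked with a few lines of Python. The names raid6_disks and usable_fraction are hypothetical helpers of my own; the sketch assumes equally sized drives and rounds up to whole drives.

```python
import math


def raid6_disks(n_tb, c_tb):
    """Drives needed for n_tb usable storage on c_tb drives: (n / c) + 2."""
    return math.ceil(n_tb / c_tb) + 2


def usable_fraction(total_disks, redundancy_disks):
    """Share of raw capacity left for data after giving up redundancy_disks."""
    if total_disks <= redundancy_disks:
        raise ValueError("array needs more disks than it spends on redundancy")
    return (total_disks - redundancy_disks) / total_disks


if __name__ == "__main__":
    disks = raid6_disks(10, 1)  # the example from the text: 10 TB on 1 TB drives
    print(f"RAID 6 for 10 TB on 1 TB drives: {disks} disks")
    # The cost of the second redundancy drive shrinks as the array grows.
    for total in (4, 6, 12, 24):
        share = usable_fraction(total, 2)
        print(f"{total} disks: {share:.0%} of raw capacity usable")
```

The extra drive hurts most in small arrays: with 4 disks RAID 6 leaves only half the raw capacity, the same as RAID 10, while with 12 disks more than 80% remains usable.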

For smaller arrays, the risk of a double drive failure might not be high enough to justify RAID 6, but with larger arrays (more drives) RAID 6 might become a necessity.

So there you have it. With current costs of hard drives and the wide support for RAID 6, it is an option that should be taken into account when designing the hardware platform for an information system. 

Afterthought: this article is mainly about considering RAID 6 instead of RAID 5. RAID 5 or 6 may often not be a solution if performance in terms of I/O (input/output) is an issue. Please note that when running in degraded mode (after a drive failure) the performance penalty on RAID 5 and RAID 6 will be severe (possibly as much as 80%). RAID 10 will suffer far less in that regard.
