Articles in the Storage category

  1. RAID 5 vs. RAID 6 or Do You Care About Your Data?

    Fri 13 August 2010

Storage is cheap. Lots of storage with 10+ hard drives is still cheap. But running 10 drives increases the risk of a drive failure tenfold, so RAID 5 is often used to keep your data up and running if a single disk fails.

But disks are so cheap and storage arrays are getting so vast that RAID 5 does not cut it anymore. With larger arrays, the risk of a second drive failure while the array is in a degraded state (a drive has already failed and the array is rebuilding or waiting for a replacement) is serious.

RAID 6 uses two parity disks, so you lose two disks of capacity, but the rewards in terms of availability are very large, especially for larger arrays.

I found a blog post that shows the results of a large simulation of the reliability of various RAID setups. One picture from this post is important and is shown below. It shows the risk of the entire RAID array failing within 3 years.

    image

From this picture, the difference between RAID 5 and RAID 6 regarding reliability (availability) is astounding. There is a strong relation between the size of the array (the number of drives) and the risk that more than one drive fails, destroying the array. Notice the strong contrast with RAID 6.

    Even with a small RAID 5 array of 6 disks, there is already a 1 : 10 chance that the array will fail within 3 years. Even with 60+ drives, a RAID 6 array never comes close to a risk like that.

Creating larger RAID 5 arrays, beyond 8 to 10 disks, means there is a 1 : 8 to 1 : 5 chance that you will have to recreate the array and restore the contents from backup (which you have, of course).
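To get a feel for why the risk grows with array size, here is a rough back-of-the-envelope sketch in Python. It is not the simulation from the linked post; the 3% annual failure rate and the one-week degraded window are assumptions picked for illustration, and the model ignores unrecoverable read errors and correlated failures, so the absolute numbers understate the real risk. It only shows the trend: the RAID 5 risk grows quickly with the number of drives, while RAID 6 stays far lower.

```python
# Toy model: chance of losing the whole array within 3 years.
# Assumptions (illustrative, not measured): independent drive failures,
# a 3% annual failure rate per drive, a 7-day degraded window per failure,
# and no unrecoverable read errors during rebuild.
AFR = 0.03
WINDOW_YEARS = 7 / 365
YEARS = 3

def p_fail_within(years, drives):
    """Chance that at least one of `drives` drives fails within `years`."""
    return 1 - (1 - AFR) ** (drives * years)

def p_array_loss(drives, parity_disks):
    """RAID 5 (parity_disks=1) dies if 1 extra drive fails while degraded;
    RAID 6 (parity_disks=2) only if 2 extra drives fail in that window."""
    p_first = p_fail_within(YEARS, drives)
    p_one_more = p_fail_within(WINDOW_YEARS, drives - 1)
    if parity_disks == 1:
        return p_first * p_one_more
    p_two_more = p_one_more * p_fail_within(WINDOW_YEARS, drives - 2)
    return p_first * p_two_more

for n in (6, 10, 20, 60):
    print(f"{n:2d} drives   RAID 5: {p_array_loss(n, 1):8.4%}   "
          f"RAID 6: {p_array_loss(n, 2):8.4%}")
```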

I have a 20-disk RAID 6 array running at home. Even with 20 disks, the risk that the entire array fails due to the failure of more than 2 disks is very small. It is more likely that I lose my data due to the failure of a RAID controller, motherboard or PSU than due to dying drives.

There are more graphs that are worth viewing, so take a look at this excellent blog post.

    Tagged as : Uncategorized
  2. Lustre and the Risk of Serious Data Loss

    Sat 03 July 2010

Personally, I have a weakness for big-ass storage. Say 'petabyte' and I'm interested. So I was thinking about how you would set up a large, scalable storage infrastructure. How should such a thing work?

Very simple: you should be able to just add hosts with some bad-ass huge RAID arrays attached to them. Maybe not even that huge, say 8 TB RAID 6 arrays, or maybe bigger. You use these systems as building blocks to create a single, very large storage space. And then there is one additional requirement: as the number of these building blocks increases, you must be able to lose one or two of them without losing data or availability. Like RAID 5 or 6, but across server systems instead of hard drives.

The hard part is connecting all this separate storage into one virtual environment. A solution to this problem is Lustre.

Lustre is a clustered network file system. What does that mean? You can use Lustre to create a scalable storage platform: a single file system that can grow to multiple petabytes. Lustre is deployed in production at large-scale sites, including some of the fastest and largest computer clusters. Lustre is thus something to take seriously.

Lustre stores all metadata about files on a separate Metadata Server (MDS). All actual file data is stored on Object Storage Targets (OSTs). These are just machines with one or more big RAID arrays (or simple disks) attached to them. The OSTs are not directly accessible by clients, but are reached through an Object Storage Server (OSS). The data stored within a file can be striped over multiple OSTs for performance reasons: a sort of network RAID 0.
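As a toy illustration of that division of labour (this is not Lustre code or its actual on-disk layout, just a model of the idea), the MDS only has to know which OSTs hold a file's stripes, while the data itself lives on the OSTs:

```python
# Toy model of the MDS/OST split described above (an illustration of the
# idea only, not how Lustre is implemented). A stripe count of 2 and 4
# OSTs are assumed example values.
NUM_OSTS = 4
STRIPE_COUNT = 2

# "MDS": metadata only - for each file, the OSTs that hold its stripes.
mds = {
    f"file{i}": [(i + k) % NUM_OSTS for k in range(STRIPE_COUNT)]
    for i in range(8)
}

# "OSTs": the actual data, keyed by (file name, stripe index).
osts = {ost: [] for ost in range(NUM_OSTS)}
for name, stripe_osts in mds.items():
    for k, ost in enumerate(stripe_osts):
        osts[ost].append((name, k))

# Losing one OST (and its attached RAID array) makes every file that has
# a stripe on it unreadable - the risk discussed further below.
failed = 2
print("Files affected by losing OST", failed, ":",
      [name for name, stripes in mds.items() if failed in stripes])
```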

Lustre not only allows scaling up to petabytes of storage, it also allows parallel file transfer performance in excess of 100 GB/s. How do you like them apples? That is just wicked sick.

    Just take a look at this diagram about how Lustre operates:

    lustre schema

I'm not going into the details of Lustre. I want to discuss a shortcoming that may pose a serious risk of data loss: if you lose a single OST with its attached storage, you lose all data stored on that OST.

Lustre cannot cope with the loss of a single OST! Even if you buy fully redundant hardware, with double RAID controllers, ECC memory, double PSUs, etc., even then, if the motherboard gets fried, you will lose data. Surely not everything, but let's say 'just' 8 TB maybe?

I guess the risk is assumed to be low, given the wide-scale deployment of Lustre, deployed by people who actually use it and have way more experience and knowledge about this stuff than I do. So maybe I'm pointing out risks that are just very small. But I have seen server systems fail as badly as described. I don't think the risk, especially at this scale, is that small.

    I am certainly not the first to point out this risk.

The solution for Lustre to become truly awesome is to implement some kind of network-based RAID 6 striping, so you could lose one or even two OSTs without any impact on availability, except maybe on performance. But it doesn't do this (yet).

This implies that you either have to make your OSTs super-reliable, which would be very expensive (and does not scale), or have some very high-capacity backup solution from which you could restore the data. But then you would have downtime.

So my question to you is: is there a scalable file system like Lustre that actually is capable of withstanding the failure of a single storage building block? If you have something to point out, please do.

    BTW: please note that the loss of an OSS can be overcome because another OSS can take over the OSTs of a failed OSS.

    Tagged as : lustre ost failure data loss
  3. 'Linux RAID Level and Chunk Size: The Benchmarks'

    Sun 23 May 2010

    Introduction

When configuring a Linux RAID array, a chunk size needs to be chosen. But what is the chunk size?

When you write data to a RAID array that implements striping (level 0, 5, 6, 10 and so on), the block of data sent to the array is broken down into pieces, each piece written to a single drive in the array. This is how striping improves performance: the data is written to the drives in parallel.

The chunk size determines how large such a piece will be for a single drive. For example: if you choose a chunk size of 64 KB, a 256 KB file will use four chunks. Assuming that you have set up a 4-drive RAID 0 array, the four chunks are each written to a separate drive, which is exactly what we want.

This also makes clear that choosing the wrong chunk size may hurt performance. If the chunk size were 256 KB, the file would be written to a single drive, so the RAID striping wouldn't provide any benefit, unless many such files were written to the array, in which case the different drives would handle different files.
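To make the chunk-to-drive mapping concrete, here is a small Python sketch using the example numbers from above (64 KB chunks, a 256 KB file, four drives); parity rotation in RAID 5/6 is ignored, this only shows the striping.

```python
# Which drive gets which part of a file in a striped array.
# Example numbers from the text; parity layout (RAID 5/6) is ignored.
chunk_kb = 64        # chosen chunk size
file_kb = 256        # example file size
drives = 4           # 4-drive RAID 0 array

chunks = (file_kb + chunk_kb - 1) // chunk_kb
for i in range(chunks):
    start, end = i * chunk_kb, (i + 1) * chunk_kb
    print(f"chunk {i}: KB {start}-{end} -> drive {i % drives}")

# With chunk_kb = 256 the whole file fits in one chunk on one drive:
# the case described above where striping gives no benefit for this file.
```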

In this article, I will provide some benchmarks that focus on sequential read and write performance. These benchmarks won't be of much use if the array must sustain a random I/O workload and needs high random IOPS.

    Test setup

    All benchmarks are performed with a consumer grade system consisting of these parts:

    Processor: AMD Athlon X2 BE-2300, running at 1.9 GHz.

    RAM: 2 GB

    Disks: SAMSUNG HD501LJ (500GB, 7200 RPM)

    SATA controller: Highpoint RocketRaid 2320 (non-raid mode)

    Tests are performed with an array of 4 and an array of 6 drives.

    • All drives are attached to the Highpoint controller. The controller is not used for RAID, only to supply sufficient SATA ports. Linux software RAID with mdadm is used.

• A single drive provides a read speed of 85 MB/s and a write speed of 88 MB/s.

    • The RAID levels 0, 5, 6 and 10 are tested.

• Chunk sizes from 4 KB up to 1024 KB are tested.

    • XFS is used as the test file system.

    • Data is read from/written to a 10 GB file.

• The theoretical maximum throughput of a 4-drive array is 340 MB/s (4 × 85 MB/s). A 6-drive array should be able to sustain 510 MB/s.

    About the data:

• All tests have been performed by a Bash shell script that collected all the data; there was no human intervention when acquiring the data (a simplified sketch of the approach is shown after this list).

    • All values are based on the average of five runs. After each run, the RAID array is destroyed, re-created and formatted.

    • For every RAID level + chunk size, five tests are performed and averaged.

    • Data transfer speed is measured using the 'dd' utility with the option bs=1M.
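For reference, the measurement loop could look something like the sketch below. The original was a Bash script; this is only a simplified Python illustration of the same idea, and the device names, mount point, chunk-size list and exact command invocations are assumptions, not the author's actual script. Timing, parsing of the dd output, averaging and waiting for the initial array sync are left out.

```python
# Simplified sketch of one benchmark iteration: create an md array with a
# given RAID level and chunk size, put XFS on it, write and read a ~10 GB
# test file with dd, then tear the array down. Needs root; device names,
# mount point and chunk sizes are assumptions.
import subprocess

DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]  # assumed
MOUNT = "/mnt/bench"                                         # assumed
RUNS = 5

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def one_run(level, chunk_kb):
    run(["mdadm", "--create", "/dev/md0", "--run", f"--level={level}",
         f"--chunk={chunk_kb}", f"--raid-devices={len(DEVICES)}", *DEVICES])
    run(["mkfs.xfs", "-f", "/dev/md0"])
    run(["mount", "/dev/md0", MOUNT])
    # Sequential write and read of a ~10 GB file in 1 MB blocks; dd reports
    # the transfer rate on stderr (parsing it is omitted here).
    run(["dd", "if=/dev/zero", f"of={MOUNT}/testfile", "bs=1M", "count=10240"])
    run(["dd", f"if={MOUNT}/testfile", "of=/dev/null", "bs=1M"])
    run(["umount", MOUNT])
    run(["mdadm", "--stop", "/dev/md0"])
    run(["mdadm", "--zero-superblock", *DEVICES])

for level in (0, 5, 6, 10):
    for chunk_kb in (4, 16, 64, 256, 1024):
        for _ in range(RUNS):
            one_run(level, chunk_kb)
```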

    Test results

    Results of the tests performed with four drives:

    image

    Test results with six drives:

    image

    Analysis and conclusion

    Based on the test results, several observations can be made. The first one is that RAID levels with parity, such as RAID 5 and 6, seem to favor a smaller chunk size of 64 KB.

    The RAID levels that only perform striping, such as RAID 0 and 10, prefer a larger chunk size, with an optimum of 256 KB or even 512 KB.

It is also noteworthy that the performance of RAID 5 and RAID 6 doesn't differ that much.

Furthermore, the theoretical transfer rates that should be achievable based on the performance of a single drive are not met. The exact cause is unknown to me, but overhead and the relatively weak CPU may play a part in this, and the XFS file system may also play a role. Overall, software RAID does not seem to scale well on this system. Since my big storage monster (as seen on the left) performs way better, I suspect that it is a hardware issue, perhaps because the M2A-VM consumer-grade motherboard can't go any faster.
