Louwrentius

Tracking Down a Faulty Storage Array Controller With ZFS

Thu 15 December 2016
One day, I lost two virtual machines on our DR environment after a storage vMotion.

Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.

I could easily restore the affected virtual machines from backup and once that was done, continued my investigation.

I needed a way to quickly verify whether a virtual machine's virtual hard drive was corrupted after a storage vMotion, so I could understand what the pattern was.

First, I created a virtual machine based on Linux and installed ZFS. Then, I attached a second disk of about 50 gigabytes and formatted this drive with ZFS. Once I filled the drive using 'dd' to about 40 gigabytes I was ready to test.

ZFS was chosen for testing purposes because it stores hashes of all blocks of data. This makes it very simple to quickly detect any data corruption. If the hash doesn't match the hash generated from the data, you just detected corruption.
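
A minimal sketch of the idea, not of how ZFS implements it: the block size, the use of SHA-256 and the function names are assumptions for illustration only. Hashes are recorded when the data is written, and any block whose hash no longer matches is flagged as corrupt.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not ZFS's actual record size


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, stored: list[str],
                        block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks whose current hash differs from the stored one."""
    current = checksum_blocks(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


original = bytes(range(256)) * 64          # 16 KiB of sample data, four blocks
stored_hashes = checksum_blocks(original)  # hashes recorded at write time

damaged = bytearray(original)
damaged[5000] ^= 0xFF                      # flip the bits of one byte in block 1
corrupt = find_corrupt_blocks(bytes(damaged), stored_hashes)
print(corrupt)  # [1]
```

A file system without stored hashes has nothing to compare against, so the damaged byte would simply be returned to the application as if it were valid data.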

Most other file systems don't store hashes or check for data corruption; they simply trust the storage layer. It may take a while before you find out that data is corrupted.

I performed a storage vMotion of this secondary disk towards different datastores and then ran a 'zpool scrub' to track down any corruption. This worked better than expected: the scrub command would hang if the drive was corrupted by the storage vMotion. The test virtual machine then required a reboot and a reformat of the secondary hard drive with ZFS, as the previous file system, including its data, had been corrupted.
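
The check itself could be scripted roughly as follows. This is a sketch under assumptions, not the procedure I actually used: the function names and the timeout are made up, and the two strings matched in the `zpool status` output ("scrub in progress" and "errors: No known data errors") are assumptions about the output format of your ZFS version.

```python
import subprocess


def classify_scrub(status_text: str) -> str:
    """Classify `zpool status` output as 'running', 'clean' or 'corrupt'."""
    if "scrub in progress" in status_text:        # assumed status wording
        return "running"
    if "errors: No known data errors" in status_text:
        return "clean"
    return "corrupt"


def scrub_and_check(pool: str, timeout_s: int = 600) -> str:
    """Start a scrub and report the pool state; a hang or failure counts as corrupt."""
    try:
        subprocess.run(["zpool", "scrub", pool], check=True, timeout=timeout_s)
        status = subprocess.run(
            ["zpool", "status", pool],
            check=True, capture_output=True, text=True, timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return "corrupt"
    return classify_scrub(status.stdout)
```

A caller would repeat the status check until the result is no longer 'running'. Be aware that a process stuck in uninterruptible I/O may not respond to the timeout either, in which case only a reboot helps.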

After performing storage vMotions of the drive in different directions, from different datastores to other datastores, a pattern slowly emerged.
1. Storage vMotion corruption happened independent of the VMware ESXi host used.
2. A storage vMotion never caused any issues when the disk was residing on our production storage array.
3. The corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.
Now it got really 'interesting'. The thing is that our DR storage array has two separate storage controllers running in active-active mode. However, each LUN is always owned by one particular controller. Although the other controller can take over from the controller that 'owns' the LUNs in case of a failure, the owner processes the I/O when everything is fine. Particular LUNs are thus handled by a particular controller.

So first I made a table listing each controller and the LUNs it owned, like this:
```
Controller      a               b
Owned LUNs      LUN001          LUN002
                LUN003          LUN004
                LUN005          LUN006
```
Then I started to perform storage vMotions of the ZFS disk from one LUN to the other. After performing several tests, the pattern became quite obvious.
```
            LUN001  ->  LUN002  =   BAD
            LUN001  ->  LUN004  =   BAD
            LUN004  ->  LUN003  =   BAD
            LUN003  ->  LUN005  =   GOOD
            LUN005  ->  LUN001  =   GOOD
```
I continued to test some additional permutations but it became clear that only LUNs owned by controller b caused problems.
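
The reasoning can be written down as a small elimination exercise. The ownership table and test results below are copied from the listings above; the function name and the rule (a suspect must be involved in every bad move and in no good move) are my own framing of it.

```python
# Controller ownership per LUN, from the table above.
OWNER = {
    "LUN001": "a", "LUN003": "a", "LUN005": "a",
    "LUN002": "b", "LUN004": "b", "LUN006": "b",
}

# (source LUN, destination LUN, outcome) for each storage vMotion test.
RESULTS = [
    ("LUN001", "LUN002", "BAD"),
    ("LUN001", "LUN004", "BAD"),
    ("LUN004", "LUN003", "BAD"),
    ("LUN003", "LUN005", "GOOD"),
    ("LUN005", "LUN001", "GOOD"),
]


def suspects(results, owner):
    """Return controllers involved in every BAD move and in no GOOD move."""
    candidates = set(owner.values())
    for src, dst, outcome in results:
        involved = {owner[src], owner[dst]}
        if outcome == "BAD":
            candidates &= involved   # a suspect must take part in each bad move
        else:
            candidates -= involved   # taking part in a good move clears a controller
    return sorted(candidates)


print(suspects(RESULTS, OWNER))  # ['b']
```

Note that the bad moves alone don't settle it, since each of them touches both controllers. It is the good moves, which only involve controller a, that clear a and leave b.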

With the evidence in hand, I managed to convince our vendor support to replace storage controller b and that indeed resolved the problem. Data corruption due to a Storage vMotion never occurred after the controller was replaced.

There is no need to name/shame the vendor in this regard. The thing is that all equipment can fail and what can happen will happen. What really counts is: are you prepared?
Tagged as : ZFS

RAID 5 Is Perfectly Fine for Home Usage

Thu 08 September 2016

RAID 5 gets a lot of flak these days. You either run RAID 1, RAID 10 or you use RAID 6, but if you run RAID 5 you're told that you are a crazy person.

Using RAID 5 is portrayed as an unreasonable risk to the availability of your data. It is suggested that it is likely that you will lose your RAID array at some point.

That's an unfair representation of the actual risk that surrounds RAID 5. As I see it, the scare about RAID 5 is totally blown out of proportion.

I would argue that for small RAID arrays with a maximum of five to six drives, it's totally reasonable to use RAID 5 for your home NAS.

As far as I can tell, the campaign against RAID 5 mainly started with this article from ZDNet.

As you know, RAID 5 can tolerate a single drive failure. If a second drive dies before the first drive has been replaced and rebuilt, you lose all contents of the array.

In the article the author argues that because drives become bigger but not more reliable, the risk of losing a second drive during a rebuild is so high that running RAID 5 is becoming risky.

You don't need a second drive failure for you to lose your data. A bad sector, also known as an Unrecoverable Read Error (URE), can also cause problems during a rebuild. Depending on the RAID implementation, you may lose some files or the entire array.

The author calculates that with modern high-capacity drives, the chance of hitting such a bad sector or URE is so high that a failure during a rebuild is almost unavoidable.

Most drives have a URE specification of one bit error per 10^14 bits read, which is 12.5 TB of data. That number is treated as an absolute, as if it is what drives actually experience in daily use, but that's not true.

It's a worst-case number. You will see at most one read error per 10^14 bits read, but in practice drives are way more reliable.
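
To make the numbers concrete, here is a small sketch of the arithmetic behind the scary claim. It deliberately makes the assumption I'm arguing against: that the spec is the typical error rate and that bit errors are independent. The function names are made up for this example.

```python
import math

URE_BITS = 10**14  # spec: at most one unrecoverable read error per 10^14 bits read


def bits_to_tb(bits: int) -> float:
    """Convert a number of bits to decimal terabytes."""
    return bits / 8 / 10**12


def p_at_least_one_ure(tb_read: float, ure_bits: int = URE_BITS) -> float:
    """Chance of one or more UREs while reading tb_read terabytes, IF the
    worst-case spec were the typical rate and bit errors were independent."""
    bits = tb_read * 10**12 * 8
    return -math.expm1(bits * math.log1p(-1 / ure_bits))


print(bits_to_tb(URE_BITS))                # 12.5
print(round(p_at_least_one_ure(12.5), 2))  # 0.63
```

Under that assumption, reading 12.5 TB during a rebuild would fail roughly two times out of three, which is where the alarming conclusions come from.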

I run ZFS on my 71 TB NAS and I scrub from time to time.

If that worst-case number were 'real', I would have caught some data errors by now. However, in line with my personal experience, ZFS hasn't corrected a single byte since the system came online a few years ago.

And I've performed so many scrubs that my system has read over a petabyte of data. No silent data corruption, no regular bad sectors.
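
As a rough sanity check, this is what the worst-case spec would predict for a petabyte of reads if it were the typical rate. The Poisson approximation for the chance of seeing zero errors is my own simplification, and the petabyte figure is rounded down from "over a petabyte".

```python
import math


def expected_ures(bytes_read: float, ure_bits: float = 1e14) -> float:
    """Expected number of UREs if one error occurred per ure_bits bits read."""
    return bytes_read * 8 / ure_bits


expected = expected_ures(1e15)    # one petabyte of reads
p_zero_errors = math.exp(-expected)  # Poisson chance of seeing no errors at all

print(expected)                   # 80.0
print(p_zero_errors < 1e-30)      # True
```

Around 80 expected errors versus zero observed is hard to reconcile with the idea that drives actually fail at the specified rate.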

It seems to me that those risks aren't nearly as high as they are made out to be.

I would argue that choosing RAID-5/Z in the right circumstances is reasonable. RAID-6 is clearly safer than RAID-5 as you can survive the loss of two drives instead of a single drive, but that doesn't mean that RAID-5 is unsafe.

If you are going to run a RAID 5 array, make sure you run a scrub or patrol read or whatever your RAID solution calls it. A scrub is nothing more than an attempt to read all data from disk.

Scrubbing allows detection of bad sectors in advance, so you can replace drives before they cause real problems (like failing during a rebuild).

If you keep the number of drives in a RAID-5 array low, maybe at most 5 or 6, I think for home users, who need to find a balance between cost and capacity, RAID-5 is an acceptable option.

And remember: if you care about your data, you need a backup anyway.

This topic was also discussed on reddit.

Tagged as : RAID

ZFS: Resilver Performance of Various RAID Schemas

Sun 31 January 2016
When building your own DIY home NAS, it is important that you simulate and test drive failures before you put your important data on it. It makes sense to know what to do in case a drive needs to be replaced. I also recommend putting a substantial amount of data on your NAS and see how long a resilver takes just so you know what to expect.

There are many reports of people building their own (ZFS-based) NAS who found out after a drive failure that resilvering would take days. If your chosen redundancy level for the VDEV does not protect against a second drive failure in the same VDEV (Mirror, RAID-Z), things may get scary, especially because the remaining drives are quite busy rebuilding data and the extra load may increase the risk of a second failure.

The chosen RAID level for your VDEV has an impact on the resilver performance. You may choose to accept lower resilver performance in exchange for additional redundancy (RAID-Z2, RAID-Z3).

I did wonder though how much those resilver times would differ between the various RAID levels. This is why I decided to run some tests to get some numbers.

Test hardware

I've used some test equipment running Debian Jessie + ZFS on Linux. The hardware is rather old and the CPU may have an impact on the results.
```
CPU : Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz
RAM : 8 GB
HBA : HighPoint RocketRaid 2340 (each drive in a jbod)
Disk: Samsung Spinpoint F1 - 1 TB - 7200 RPM ( 12 x )
```
Test method

I've created a script that runs all tests automatically. This is how the script works:
1. Create pool + vdev(s).
2. Write data to the pool (XX% of pool capacity).
3. Replace an arbitrary drive with another one.
4. Wait for the resilver to complete.
5. Log the resilver duration to a CSV file.
For each test, I fill the pool up to 25% with data before I measure resilver performance.
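
The steps above can be sketched as follows. This is not the actual test script, just an illustration of the loop: the function names, the `fill` callback and the CSV layout are hypothetical, and matching "resilver in progress" in the `zpool status` output is an assumption about the output format of your ZFS version.

```python
import csv
import subprocess
import time


def zpool(*args: str) -> str:
    """Run a zpool subcommand and return its standard output."""
    result = subprocess.run(["zpool", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def wait_for_resilver(pool: str, run=zpool, poll_s: float = 10.0) -> float:
    """Poll `zpool status` until the resilver is done; return elapsed seconds."""
    start = time.monotonic()
    while "resilver in progress" in run("status", pool):  # assumed status wording
        time.sleep(poll_s)
    return time.monotonic() - start


def benchmark(pool, vdev_spec, old_disk, new_disk, fill, csv_path, run=zpool):
    """One test run: create, fill, replace, wait, log, clean up."""
    run("create", "-f", pool, *vdev_spec)        # 1. create pool + vdev(s)
    fill(pool)                                   # 2. caller writes the test data
    run("replace", pool, old_disk, new_disk)     # 3. swap one drive for another
    seconds = wait_for_resilver(pool, run)       # 4. wait for the resilver
    with open(csv_path, "a", newline="") as fh:  # 5. append duration to the CSV
        csv.writer(fh).writerow([" ".join(vdev_spec), round(seconds)])
    run("destroy", pool)                         # leave a clean slate for the next test
    return seconds
```

A call might look like `benchmark("tank", ["raidz", "sdb", "sdc", "sdd"], "sdb", "sde", fill_to_25_percent, "results.csv")`, where the pool name, device names and `fill_to_25_percent` are placeholders. The `run` parameter exists so the loop can be exercised with a stub instead of touching real disks.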

Caveats

The problem with filling the pool to only 25% is that drives are fast at the start, but their performance deteriorates significantly as they fill up. This means that you cannot linearly extrapolate these results to 50% or 75% pool usage; the real numbers are likely worse than such an extrapolation suggests.

I should run the test again with 50% usage to see if we can demonstrate this effect.

Beware that this test method is probably only suitable for DIY home NAS builds. Production file systems used within businesses may be way more fragmented and I've been told that this could slow down resilver times dramatically.

Test result (lower is better)

The results can only be used to demonstrate the relative resilver performance differences of the various RAID levels and disk counts per VDEV.

You should not expect the same performance results for your own NAS as the hardware probably differs significantly from my test setup.

Observations

I think the following observations can be made:
1. Mirrors resilver the fastest even if the number of drives involved is increased.
2. RAID-Z resilver performance is on-par with using mirrors when using 5 disks or less.
3. RAID-Zx resilver performance deteriorates as the number of drives in a VDEV increases.
I find it interesting that with a smaller number of drives in a RAID-Z VDEV, rebuild performance is roughly on par with a mirror setup. If long rebuild times would scare you away from using RAID-Z, maybe they should not. There may be other reasons why you might shy away from RAID-Z, but this doesn't seem to be one of them.

RAID-Z2 is often very popular amongst home NAS builders, as it offers a very nice balance between capacity and redundancy. Wider RAID-Z2 VDEVs are more space efficient, but it is also clear that resilver operations take longer. Because RAID-Z2 can tolerate the loss of two drives, I think longer resilver times seem like a reasonable tradeoff.
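
The space-efficiency side of that tradeoff is simple arithmetic. The sketch below is an idealised calculation that only counts parity drives; it ignores padding, metadata and other allocation overhead, so real pools will show somewhat less usable space. The function name is made up for this example.

```python
def usable_fraction(disks: int, parity: int) -> float:
    """Idealised share of raw capacity left for data in one RAID-Z VDEV."""
    if disks <= parity:
        raise ValueError("a VDEV needs more disks than parity drives")
    return (disks - parity) / disks


# RAID-Z2 (two parity drives) at various VDEV widths.
for width in (4, 6, 8, 10, 12):
    print(width, f"{usable_fraction(width, 2):.0%}")
# 4 50%
# 6 67%
# 8 75%
# 10 80%
# 12 83%
```

Going from 6 to 12 drives per VDEV gains about 16 percentage points of capacity, which you then weigh against the longer resilver times.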

It is clear that as you put more disks in a single RAID-Zx VDEV, rebuild times increase. This can be used as an argument to keep the number of drives per VDEV 'reasonable' or to switch to RAID-Z3.

25% vs 50% pool usage

To me, there's nothing special to see here. The resilver times are on average slightly worse than double the 25% resilver durations. Disk performance deteriorates as disks fill up (inner tracks are shorter and slower), so sequential performance drops. That is how I would explain why the results are slightly worse than perfect linear scaling.

Final words

I hope this benchmark is of interest to anyone and more importantly, you can run your own by using the aforementioned script. If you ever want to run your own benchmarks, expect the script to run for days. Leave a comment if you have questions or remarks about these test results or the way testing is done.
Tagged as : ZFS
