1. Tracking Down a Faulty Storage Array Controller With ZFS

    Thu 15 December 2016

    One day, I lost two virtual machines on our DR environment after a storage vMotion.

    Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.

    I could easily restore the affected virtual machines from backup, and once that was done, I continued my investigation.

    To understand the pattern, I needed a way to quickly verify whether a virtual machine's virtual hard drive had been corrupted by a storage vMotion.

    First, I created a virtual machine based on Linux and installed ZFS. Then, I attached a second disk of about 50 gigabytes and formatted this drive with ZFS. Once I had filled the drive to about 40 gigabytes using 'dd', I was ready to test.
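
    For reference, this is roughly what that setup boils down to; a minimal sketch, assuming the second virtual disk shows up as /dev/sdb and ZFS on Linux is already installed (the pool name is just an example):

        # Create a single-disk pool on the 50 GB test drive.
        zpool create testpool /dev/sdb

        # Fill the pool with about 40 GB of data so that nearly every block
        # carries a checksum that a later scrub can verify.
        dd if=/dev/urandom of=/testpool/testfile bs=1M count=40000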

    ZFS was chosen for testing purposes because it stores checksums of all blocks of data. This makes it very simple to quickly detect any data corruption. If the stored checksum doesn't match the checksum computed from the data, you have just detected corruption.

    Other file systems don't store checksums and don't check for data corruption; they simply trust the storage layer. It may take a while before you find out that your data is corrupted.

    I performed a storage vMotion of this secondary disk towards different datastores and then ran a 'zpool scrub' to track down any corruption. This worked better than expected: the scrub command would hang if the drive had been corrupted by the storage vMotion. The test virtual machine then required a reboot and a reformat of the secondary hard drive with ZFS, as the previous file system, including its data, had been corrupted.
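
    The check after each storage vMotion comes down to a scrub followed by a status check; a minimal sketch, using the same hypothetical pool name as above:

        # Verify every block in the pool against its stored checksum.
        zpool scrub testpool

        # Once the scrub has finished, inspect the error counters and the
        # list of damaged files, if any.
        zpool status -v testpool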

    After performing storage vMotions of the drive in different directions, from various datastores to others, a pattern slowly emerged.

    1. Storage vMotion corruption happened independently of the VMware ESXi host used.

    2. A storage vMotion never caused any issues when the disk was residing on our production storage array.

    3. The corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.

    Now it got really 'interesting'. The thing is that our DR storage array has two separate storage controllers running in active-active mode. However, each LUN is always owned by a particular controller. Although the other controller can take over from the controller that 'owns' a LUN in case of a failure, the owner processes the I/O when everything is fine. Particular LUNs are thus handled by a particular controller.

    So first I made a table listing the controllers and the LUNs each had ownership over, like this:

                Controller a    Controller b
                ------------    ------------
                LUN001          LUN002
                LUN003          LUN004
                LUN005          LUN006
    

    Then I started to perform Storage vMotions of the ZFS disk from one LUN to another. After performing several tests, the pattern became quite obvious.

                LUN001  ->  LUN002  =   BAD
                LUN001  ->  LUN004  =   BAD
                LUN004  ->  LUN003  =   BAD
                LUN003  ->  LUN005  =   GOOD
                LUN005  ->  LUN001  =   GOOD
    

    I continued to test some additional permutations but it became clear that only LUNs owned by controller b caused problems.

    With the evidence in hand, I managed to convince our vendor support to replace storage controller b and that indeed resolved the problem. Data corruption due to a Storage vMotion never occurred after the controller was replaced.

    There is no need to name/shame the vendor in this regard. The thing is that all equipment can fail and what can happen will happen. What really counts is: are you prepared?

    Tagged as : ZFS
  2. Should I Use ZFS for My Home NAS?

    Mon 09 May 2016

    When building your own home NAS, you may be advised to use ZFS for the file system. I would only recommend using ZFS if you understand it well and you accept its limitations. It must make sense for your particular situation and your skill level.

    I would like to make the case that for a lot of people, ZFS offers little benefit given their circumstances, and for those people I think it is totally reasonable to select a different platform.

    I wrote this article because I sense that people not well-versed in Linux or FreeBSD may sometimes feel pressured [1] into building a NAS that they can't handle when problems arise. In that case, opting for ZFS could cause more trouble than it would solve.

    Why ZFS?

    The main reason why people advise ZFS is the fact that ZFS offers better protection against data corruption as compared to other file systems. It has extra defences built in that protect your data in a manner that other free file systems cannot [2].

    The fact that ZFS is better at protecting your data against corruption isn't that important for most home NAS builders, because the risks ZFS protects against are very small. I would argue that at the small scale home users operate at, it would be reasonable to just accept the risks ZFS protects against.

    I must say that software like FreeNAS does make it very simple to set up a ZFS-based NAS. If you are able to drop to a console and hold your own when problems arise, and if you accept the limitations of ZFS, it may be a reasonable option.

    In this article, I'm not arguing against ZFS itself; it's an amazing file system with interesting features, but it may not be the best option for your particular situation. I just want to argue that it is reasonable not to use ZFS for your home NAS build. Only use it if it fits your needs.

    Silent data corruption

    From the perspective of protecting the integrity of your data or preventing data corruption, our computers do an excellent job, most of the time. This is because almost every component in any computer has been built with resiliency in mind.

    They deploy checksums and parity when data is stored or transmitted, as a means to ensure data integrity [3].

    For instance, hard drives store extra redundant information alongside your data in order to verify data integrity. They can also use this redundant data to recover from data corruption, although only to some extent. If a hard drive can't read some portion of the disk anymore, it will report an error. So this is not silent data corruption; this is just a 'bad sector' or an Unrecoverable Read Error (URE).

    Silent data corruption occurs when data corruption is not detected by the hard drive. The hard drive will thus return corrupt data and will not sound any alarm. This can cause corruption of your files. The chance of this happening at home is extremely rare.

    It is way more likely that a hard drive just fails completely or develops regular 'bad sectors'. Those incidents can be handled by any kind of RAID solution; you don't need ZFS to handle these events.

    I've been looking around for a real study on the prevalence of silent data corruption. I found a study from 2008. Honestly, I'm not sure what to make of it, because I'm not sure how the risks portrayed in this study translate to a real-life risk for home users.

    The study talks about 'checksum errors' and 'identity discrepancies'. Checksum errors are quite prevalent, but they are handled by the storage / RAID subsystem and are of no real concern as far as I can tell. It seems that the problem lies with 'identity discrepancies'. Such errors would cause silent data corruption.

    The 'identity discrepancies' are events where - for example - a sector ends up at the wrong spot on the drive, so the sector itself is ok, but the file is still corrupt. That would be a true example of silent data corruption. And ZFS would protect against this risk.

    Of the 1.53 million drives, only 365 drives witnessed an 'identity discrepancy' error. I have difficulty determining how many of those 365 drives are SATA drives. Since a total of 358,000 SATA drives were used in this study, even if all 365 disks were SATA drives (which is not true), the risk would be 0.102%, or one in 980 drives, over the period of 17 months. Remember that this is a worst-case scenario.
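
    As a quick sanity check on these numbers, a back-of-the-envelope calculation, assuming the worst case that all 365 affected drives were among the 358,000 SATA drives:

        # Worst-case probability of an 'identity discrepancy' per SATA drive
        # over the 17-month study period.
        awk 'BEGIN { printf "%.3f%% (one in ~%d drives)\n",
                     365 / 358000 * 100, 358000 / 365 }'
        # -> 0.102% (one in ~980 drives)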

    To me, this risk seems rather small. If I haven't messed up the statistics, you would need to have a thousand hard drives running for 17 months for a single instance of silent data corruption to show up. So unless you're operating at that scale, I would say that silent data corruption is indeed not a risk a DIY home user should worry about.

    ZFS and Unrecoverable Read Errors (UREs)

    If you are building your own NAS, you may want to deploy RAID to protect against the impact of one or two drives failing.

    For instance, using RAID 5 (RAIDZ) allows your NAS to survive a single drive failure. RAID 6 (RAIDZ2) would survive two drive failures.
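
    For readers unfamiliar with the ZFS equivalents, this is roughly what creating such VDEVs might look like; a minimal sketch with placeholder pool and device names:

        # Single-parity RAIDZ (the RAID 5 analogue): survives one drive failure.
        zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde

        # Alternatively, double-parity RAIDZ2 (the RAID 6 analogue): survives two.
        # zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg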

    Let's take the example of RAID 5. Say your NAS can survive a single drive failure and one drive fails. At some point you replace the failed drive and the RAID array starts the rebuild process. During this rebuild process, the array is not protected against a further drive failure, so no additional drives should fail.

    If a second drive encounters a bad sector, or what people today call an Unrecoverable Read Error, during this rebuild process, most RAID solutions will give up on the entire array. A single 'bad sector' may have the same impact as a second drive failure.

    This is where ZFS shines as ZFS is a file system and RAID solution in one.

    Instead of failing the whole drive, ZFS is capable of keeping the affected drive online and only marking the affected files as 'bad'. So clearly this is a benefit over other RAID solutions, which are not file(system)-aware and just have to give up.

    Although this scenario is often cited by people recommending ZFS, I would argue that the chance that ZFS will save you from this risk is rather small, because the risk itself is very small.

    Just to be clear, an URE is the opposite of silent data corruption. It is the hard drive reporting a read error back to the RAID solution or operating system. So this is a separate topic.

    I'm aware of the (in)famous article declaring RAID 5 dead. In this article, the URE specification of hard drives is used to prove that it's quite likely to encounter a second drive failure due to an URE during an array rebuild.

    It seems though that the one URE in 10^14 bits (an error every 12.5 TB of data read) is a worst-case specification. In real life, drives are way more reliable than this specification. So in practice, the risks aren't as high as this article portrays them to be.
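
    The 12.5 TB figure follows directly from the specification: 10^14 bits divided by 8 bits per byte is 1.25 x 10^13 bytes, or 12.5 TB of data read per expected URE.

        # One URE per 10^14 bits read (worst-case spec), expressed in TB read:
        awk 'BEGIN { printf "%.1f TB\n", 10^14 / 8 / 1e12 }'   # -> 12.5 TB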

    The hidden cost of ZFS

    For home users, it would be very convenient to just add extra hard drives over time as the capacity demand increases. With ZFS, this is not possible in a cost-efficient manner. I've written an article about this topic where this drawback is discussed in more detail.

    Many other RAID solutions support online capacity expansion, where you can just add drives as you see fit. For instance, Linux MDADM supports 'growing' an array with one or more drives at once.
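
    With MDADM, growing an array is a matter of adding the new disk and reshaping; a minimal sketch with placeholder device names, taking a RAID 5 array from four to five drives:

        # Add the new disk to the array as a spare.
        mdadm --add /dev/md0 /dev/sde

        # Reshape the array so the spare becomes an active member.
        mdadm --grow /dev/md0 --raid-devices=5

        # After the reshape finishes, the file system still has to be enlarged,
        # for example with resize2fs for ext4.
        resize2fs /dev/md0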

    Especially for home users, it's important to take this limitation of ZFS into account.

    Conclusion

    For home usage, building your own ZFS-based NAS is a cool project that can be educational and fun, but there is no real necessity to use ZFS over any other solution.

    In many cases, I doubt that building your own NAS gives you any significant edge over a pre-built Synology, QNAP or Netgear NAS unless you have specific needs that these products don't cover.

    These products may be a bit more expensive in an absolute sense as compared to a DIY NAS build, but if you factor in your own free time as a cost, they are probably hard to beat.

    That said, if you regard building your own NAS as a fun hobby project and you are willing to spend some time on it, it's perfectly fine to run with ZFS, provided you accept the 'hidden cost'.

    However, I think it's perfectly reasonable to build your NAS based on Windows, with:

    • Storage Spaces
    • Hardware RAID
    • SnapRAID
    • FlexRAID

    Or you may want to use Linux with:

    • MDADM kernel software RAID
    • Hardware RAID
    • unRAID
    • SnapRAID
    • FlexRAID

    The vendor of SnapRAID provides a comparison of the various products. Please note that it's a vendor comparing its own product against competitors. I have never used SnapRAID, unRAID or FlexRAID.

    Would you still use ZFS?

    I've built a 71 TiB NAS based on 24 drives using ZFS on Linux.

    In my case, I would probably keep on using ZFS. Here are my reasons:

    1. It's a hobby project I'm happy to spend some time on
    2. I'm totally OK with Linux and know my way around
    3. I buy all my storage upfront so the 'hidden cost' is no issue for me
    4. In my case, I would probably now use RAIDZ3 with the large_blocks feature enabled to regain some space. Triple parity is unique to ZFS as far as I know, and with my 24-drive setup, I think that would add a bit of extra safety. (A sketch of such a pool follows this list.)
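
    A minimal sketch of what that pool layout might look like; device names are placeholders, and the large_blocks feature requires a reasonably recent version of ZFS on Linux:

        # 24 drives in a single triple-parity VDEV with large blocks enabled.
        zpool create -o feature@large_blocks=enabled tank raidz3 /dev/sd[b-y]

        # Larger records reduce the relative parity and metadata overhead.
        zfs set recordsize=1M tank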

    So in my particular situation, ZFS offers little to no drawbacks and I see the extra data integrity protection as a nice bonus.


    [1] People may create an impression that not using ZFS is incredibly dangerous and that you're foolish if you don't use ZFS for your home NAS. I strongly disagree with that idea.

    [2] BTRFS's implementation of RAID 5/6 was not considered production-ready at the time this article was written.

    [3] One notable exception is the lack of ECC memory in most laptop/desktop computers.

    Tagged as : ZFS
  3. ZFS: Resilver Performance of Various RAID Schemas

    Sun 31 January 2016

    When building your own DIY home NAS, it is important that you simulate and test drive failures before you put your important data on it. It makes sense to know what to do in case a drive needs to be replaced. I also recommend putting a substantial amount of data on your NAS and see how long a resilver takes just so you know what to expect.

    There are many reports of people building their own (ZFS-based) NAS who found out after a drive failure that resilvering would take days. If the chosen redundancy level of the VDEV does not protect against a second drive failure in the same VDEV (mirror, RAID-Z), things may get scary, especially because the drives are quite busy rebuilding data and the extra load on the remaining drives may increase the risk of a second failure.

    The chosen RAID level for your VDEV has an impact on resilver performance. You may choose to accept lower resilver performance in exchange for additional redundancy (RAID-Z2, RAID-Z3).

    I did wonder though how much those resilver times would differ between the various RAID levels. This is why I decided to run some tests to get some numbers.

    Test hardware

    I've used some test equipment running Debian Jessie + ZFS on Linux. The hardware is rather old and the CPU may have an impact on the results.

    CPU : Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz
    RAM : 8 GB
    HBA : HighPoint RocketRaid 2340 (each drive in a jbod)
    Disk: Samsung Spinpoint F1 - 1 TB - 7200 RPM ( 12 x )
    

    Test method

    I've created a script that runs all tests automatically. This is how the script works (a rough sketch of the loop is shown after the list):

    1. Create the pool + VDEV(s).
    2. Write data to the pool (XX% of pool capacity).
    3. Replace an arbitrary drive with another one.
    4. Wait for the resilver to complete.
    5. Log the resilver duration to a CSV file.
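
    This is not the original script, but a rough sketch of what such a test loop could look like; the pool layout, fill size and device names are placeholders:

        #!/bin/bash
        POOL=testpool
        SPARE=/dev/sdm                        # replacement drive, kept outside the pool

        # 1. Create the pool and VDEV (here: an 11-disk RAID-Z2 as an example).
        zpool create -f "$POOL" raidz2 /dev/sd[b-l]

        # 2. Write test data (adjust the size to reach the desired fill level).
        dd if=/dev/urandom of=/$POOL/fill bs=1M count=250000

        # 3. Replace an arbitrary drive with the spare; this triggers a resilver.
        zpool replace "$POOL" /dev/sdb "$SPARE"

        # 4. Wait until the resilver has completed.
        START=$(date +%s)
        while zpool status "$POOL" | grep -q "resilver in progress"; do
            sleep 60
        done

        # 5. Log the RAID level, disk count and resilver duration to a CSV file.
        echo "raidz2,11,$(( $(date +%s) - START ))" >> results.csv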

    For each test, I fill the pool up to 25% with data before I measure resilver performance.

    Caveats

    The problem with the pool being filled to only 25% is that drives are fast at the start, but their performance deteriorates significantly as they fill up. This means that you cannot simply extrapolate these results to calculate resilver times for 50% or 75% pool usage; the real numbers are likely worse than that.

    I ran the test again with 50% usage to see if this effect can be demonstrated; those results are discussed below.

    Beware that this test method is probably only suitable for DIY home NAS builds. Production file systems used within businesses may be way more fragmented and I've been told that this could slow down resilver times dramatically.

    Test result (lower is better)

    [Graph: resilver duration for the various RAID levels and disk counts per VDEV]

    The results can only be used to demonstrate the relative resilver performance differences of the various RAID levels and disk counts per VDEV.

    You should not expect the same performance results for your own NAS as the hardware probably differs significantly from my test setup.

    Observations

    I think the following observations can be made:

    1. Mirrors resilver the fastest, even if the number of drives involved is increased.
    2. RAID-Z resilver performance is on par with that of mirrors when using five disks or fewer.
    3. RAID-Zx resilver performance deteriorates as the number of drives in a VDEV increases.

    I find it interesting that with a smaller number of drives in a RAID-Z VDEV, rebuild performance is roughly on par with a mirror setup. If long rebuild times scare you away from using RAID-Z, maybe they should not. There may be other reasons to shy away from RAID-Z, but this doesn't seem to be one of them.

    RAID-Z2 is very popular amongst home NAS builders, as it offers a nice balance between capacity and redundancy. Wider RAID-Z2 VDEVs are more space efficient, but it is also clear that their resilver operations take longer. Because RAID-Z2 can tolerate the loss of two drives, I think longer resilver times are a reasonable tradeoff.

    It is clear that as you put more disks in a single RAID-Zx VDEV, rebuild times increase. This can be used as an argument to keep the number of drives per VDEV 'reasonable' or to switch to RAID-Z3.

    25% vs 50% pool usage

    To me, there's nothing special to see here. The resilver times are on average slightly worse than double the 25% resilver durations. As disks fill up, reads and writes increasingly hit the shorter, slower inner tracks, so sequential performance drops. That would explain why the results are slightly worse than perfect linear scaling.

    Final words

    I hope this benchmark is of interest to others and, more importantly, that you can run your own using the aforementioned script. If you do, expect the script to run for days. Leave a comment if you have questions or remarks about these test results or the way the testing was done.

    Tagged as : ZFS
