1. Should I Use ZFS for My Home NAS?

    May 09, 2016

    When building your own home NAS, you may be advised to use ZFS for the file system. I would only recommend using ZFS if you understand it well and you accept its limitations. It must make sense for your particular situation and your skill level.

    I would like to make the case that for a lot of people, ZFS offers little benefit given their circumstances and for those people, I think it is totally reasonable to select a different platform.

    I wrote this article because I sense that people not well-versed in Linux or FreeBSD may sometimes feel pressured1 into building a NAS they can't handle when problems arise. In that case, opting for ZFS could cause more trouble than it would solve.

    Why ZFS?

    The main reason people advise ZFS is that it offers better protection against data corruption than other file systems. It has extra defences built in that protect your data in a manner that other free file systems cannot2.

    The fact that ZFS is better at protecting your data against corruption isn't that important for most home NAS builders, because the risks ZFS protects against are very small. I would argue that at the small scale home users operate at, it is reasonable to simply accept those risks.

    I must say that software like FreeNAS does make it very simple to set up a ZFS-based NAS. If you are able to drop to a console and hold your own when problems arise, and if you accept the limitations of ZFS, it may be a reasonable option.

    In this article, I'm not arguing against ZFS itself; it's an amazing file system with interesting features, but it may not be the best option for your particular situation. I just want to argue that it is reasonable not to use ZFS for your home NAS build. Only use it if it fits your needs.

    Silent data corruption

    From the perspective of protecting the integrity of your data or preventing data corruption, our computers do an excellent job, most of the time. This is because almost every component in any computer has been built with resiliency in mind.

    They deploy checksums and parity when data is stored or transmitted, as a means to assure data integrity3.

    For instance, hard drives store extra redundant information alongside your data in order to verify data integrity. They can also use this redundant data to recover from data corruption, although only to some extent. If a hard drive can't read some portion of the disk anymore, it will report an error. So this is not silent data corruption, this is just a 'bad sector' or an Unrecoverable Read Error (URE).

    Silent data corruption occurs when data corruption is not detected by the hard drive. The hard drive will thus return corrupt data and will not sound any alarm. This can corrupt your files. The chance of this happening at home is extremely small.

    It is way more likely that a hard drive just fails completely or develops regular 'bad sectors'. Those incidents can be handled by any kind of RAID solution; you don't need ZFS to handle these events.

    I've been looking around for a real study on the prevalence of silent data corruption and found a study from 2008. Honestly, I'm not sure what to make of it, because I'm not sure how the risks portrayed in this study translate to a real-life risk for home users.

    The study talks about 'checksum errors' and 'identity discrepancies'. Checksum errors are quite prevalent, but they are handled by the storage / RAID subsystem and are of no real concern as far as I can tell. It seems that the problem lies with 'identity discrepancies'. Such errors would cause silent data corruption.

    The 'identity discrepancies' are events where - for example - a sector ends up at the wrong spot on the drive, so the sector itself is ok, but the file is still corrupt. That would be a true example of silent data corruption. And ZFS would protect against this risk.

    Of the 1.53 million drives, only 365 drives witnessed an 'identity discrepancy' error. I have difficulty determining how many of those 365 drives are SATA drives. Since a total of 358,000 SATA drives were used in this study, even if all 365 disks were SATA drives (which is not true), the risk would be 365 / 358,000 = 0.102%, or one in 980 drives over a period of 17 months. Remember that this is a worst-case scenario.

    To me, this risk seems rather small. If I'm not messing up the statistics, you would need roughly a thousand hard drives running for 17 months for a single instance of silent data corruption to show up. So unless you're operating at that scale, I would say that silent data corruption is indeed not a risk a DIY home user should worry about.

    ZFS and Unrecoverable Read Errors (UREs)

    If you are building your own NAS, you may want to deploy RAID to protect against the impact of one or two drives failing.

    For instance, using RAID 5 (RAIDZ) allows your NAS to survive a single drive failure. RAID 6 (RAIDZ2) would survive two drive failures.

    Let's take the example of RAID 5. Your NAS can survive a single drive failure, and one drive fails. At some point you replace the failed drive and the RAID array starts the rebuild process. During this rebuild the array is no longer protected, so no additional drive should fail.

    If a second drive encounters a bad sector, or what people today call an Unrecoverable Read Error, during this rebuild, most RAID solutions will give up on the entire array. A single 'bad sector' can thus have the same impact as a second drive failure.

    This is where ZFS shines, as ZFS is a file system and RAID solution in one.

    Instead of failing the whole drive, ZFS is capable of keeping the affected drive online and only marking the affected files as 'bad'. So clearly this is a benefit over other RAID solutions, which are not file(system)-aware and just have to give up.

    Although this scenario is often cited by people recommending ZFS, I would argue that the chance that ZFS will save you from this risk is rather small, because the risk itself is very small.

    Just to be clear, a URE is the opposite of silent data corruption: it is the hard drive reporting a read error back to the RAID solution or operating system. So this is a separate topic.

    I'm aware of the (in)famous article declaring RAID 5 dead. In this article, the URE specification of hard drives is used to prove that it's quite likely to encounter a second drive failure due to an URE during an array rebuild.

    It seems though that the one URE per 10^14 bits (an error every 12.5 TB of data read, since 10^14 bits equals 12.5 terabytes) is a worst-case specification. In real life, drives are way more reliable than this specification suggests. So in practice, the risks aren't as high as this article portrays them to be.

    The hidden cost of ZFS

    For home users, it would be very convenient to just add extra hard drives over time as the capacity demand increases. With ZFS, this is not possible in a cost-efficient manner. I've written an article about this topic where this drawback is discussed in more detail.

    Many other RAID solutions support online capacity expansion, where you can just add drives as you see fit. For instance, Linux MDADM supports 'growing' an array with one or more drives at once.
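    As a quick illustration, growing an MDADM array looks roughly like the sketch below. The device names, array name and the ext4 file system are just assumptions for the example, not a recipe.

    # add a new disk to an existing array and grow it by one member
    mdadm --add /dev/md0 /dev/sde
    mdadm --grow /dev/md0 --raid-devices=5
    # once the reshape has finished, grow the file system on top
    resize2fs /dev/md0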

    Especially for home users, it's important to take this limitation of ZFS into account.

    Conclusion

    For home usage, building your own ZFS-based NAS is a cool project that can be educational and fun, but there is no real necessity to use ZFS over any other solution.

    In many cases, I doubt that building your own NAS gives you any significant edge over a pre-built Synology, QNAP or Netgear NAS, unless you have specific needs that these products don't cover.

    These products may be a bit more expensive in an absolute sense as compared to a DIY NAS build, but if you factor in your own free time as a cost, they are probably hard to beat.

    That said, if you regard building your own NAS as a fun hobby project and you are willing to spend some time on it, it's perfectly fine to run with ZFS, provided you accept the 'hidden cost'.

    However, I think it's perfectly reasonable to build your NAS based on Windows, with:

    • Storage Spaces
    • Hardware RAID
    • SnapRAID
    • FlexRAID

    Or you may want to use Linux with:

    • MDADM kernel software RAID
    • Hardware RAID
    • unRAID
    • SnapRAID
    • FlexRAID

    The SnapRAID vendor provides a comparison of the various products. Please note that it's a vendor comparing its own product against competitors. I have never used SnapRAID, unRAID or FlexRAID.

    Would you still use ZFS?

    I've built a 71 TiB NAS based on 24 drives using ZFS on Linux.

    In my case, I would probably keep on using ZFS. Here are my reasons:

    1. It's a hobby project I'm happy to spend some time on
    2. I'm totally OK with Linux and know my way around
    3. I buy all my storage upfront so the 'hidden cost' is no issue for me
    4. In my case, I would probably now use RAIDZ3 with the large_blocks feature enabled to regain some space. Triple parity is unique to ZFS as far as I know, and with my 24-drive setup I think it would add a bit of extra safety.

    So in my particular situation, ZFS offers little to no drawbacks and I see the extra data integrity protection as a nice bonus.


    1. People may create an impression that not using ZFS is incredibly dangerous and that you're foolish if you don't use ZFS for your home NAS. I strongly disagree with that idea. 

    2. BTRFS's implementation of RAID 5/6 is not considered production-ready at the time this article was written. 

    3. One notable exception is the lack of ECC memory in most laptop/desktop computers. 

    Tagged as : ZFS
  2. ZFS: Resilver Performance of Various RAID Schemas

    January 31, 2016

    When building your own DIY home NAS, it is important that you simulate and test drive failures before you put your important data on it. It makes sense to know what to do in case a drive needs to be replaced. I also recommend putting a substantial amount of data on your NAS and seeing how long a resilver takes, just so you know what to expect.

    There are many reports of people building their own (ZFS-based) NAS who found out after a drive failure that resilvering would take days. If your chosen redundancy level for the VDEV does not protect against a second drive failure in the same VDEV (mirror, RAID-Z), things may get scary, especially because the drives are quite busy rebuilding data and the extra load on the remaining drives may increase the risk of a second failure.

    The chosen RAID level for your VDEV has an impact on resilver performance. You may choose to accept lower resilver performance in exchange for additional redundancy (RAID-Z2, RAID-Z3).

    I did wonder though how much those resilver times would differ between the various RAID levels. This is why I decided to run some tests to get some numbers.

    Test hardware

    I've used some test equipment running Debian Jessie + ZFS on Linux. The hardware is rather old and the CPU may have an impact on the results.

    CPU : Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz
    RAM : 8 GB
    HBA : HighPoint RocketRaid 2340 (each drive in a jbod)
    Disk: Samsung Spinpoint F1 - 1 TB - 7200 RPM ( 12 x )
    

    Test method

    I've created a script that runs all tests automatically. This is how the script works:

    1. Create pool + vdev(s).
    2. Write data on pool ( XX % of pool capacity)
    3. Replace arbitrary drive with another one.
    4. Wait for resilver to complete.
    5. Log resilver duration to a CSV file.

    For each test, I fill the pool up to 25% with data before I measure resilver performance.
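    For reference, the core of the test loop looks roughly like the sketch below. This is a simplified illustration, not the actual script; the pool layout, device names and fill size are placeholders.

    #!/bin/bash
    # simplified resilver benchmark loop (placeholder devices and sizes)
    DISKS="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf"
    SPARE="/dev/sdg"
    FILL_MB=250000   # placeholder: size this to roughly 25% of your pool's capacity

    zpool create -f testpool raidz $DISKS
    dd if=/dev/zero of=/testpool/filler.bin bs=1M count=$FILL_MB

    # replace an arbitrary member and time the resilver
    START=$(date +%s)
    zpool replace -f testpool /dev/sdf "$SPARE"
    while zpool status testpool | grep -q "resilver in progress"; do
        sleep 60
    done
    # log: RAID level, number of drives, resilver duration in seconds
    echo "raidz,5,$(( $(date +%s) - START ))" >> resilver-results.csv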

    Caveats

    The problem with the pool only being filled to 25% is that drives are fast at the start, but their performance deteriorates significantly as they fill up. This means that you cannot simply extrapolate the results to calculate resilver times for 50% or 75% pool usage; the real numbers are likely worse than that.

    I should run the test again with 50% usage to see if we can demonstrate this effect.

    Beware that this test method is probably only suitable for DIY home NAS builds. Production file systems used within businesses may be way more fragmented and I've been told that this could slow down resilver times dramatically.

    Test result (lower is better)

    [Graph: resilver duration per RAID level and number of drives per VDEV]

    The results can only be used to demonstrate the relative resilver performance differences of the various RAID levels and disk counts per VDEV.

    You should not expect the same performance results for your own NAS as the hardware probably differs significantly from my test setup.

    Observations

    I think the following observations can be made:

    1. Mirrors resilver the fastest even if the number of drives involved is increased.
    2. RAID-Z resilver performance is on par with using mirrors when using five disks or fewer.
    3. RAID-Zx resilver performance deteriorates as the number of drives in a VDEV increases.

    I find it interesting that with a smaller number of drives in a RAID-Z VDEV, rebuild performance is roughly on par with a mirror setup. If long rebuild times scare you away from using RAID-Z, maybe they should not. There may be other reasons to shy away from RAID-Z, but this doesn't seem to be one of them.

    RAID-Z2 is very popular amongst home NAS builders, as it offers a nice balance between capacity and redundancy. Wider RAID-Z2 VDEVs are more space efficient, but it is also clear that their resilver operations take longer. Because RAID-Z2 can tolerate the loss of two drives, longer resilver times seem like a reasonable tradeoff to me.

    It is clear that as you put more disks in a single RAID-Zx VDEV, rebuild times increase. This can be used as an argument to keep the number of drives per VDEV 'reasonable' or to switch to RAID-Z3.

    25% vs 50% pool usage

    To me, there's nothing special to see here. The resilver times are on average slightly worse than double the 25% resilver durations. Disk performance deteriorates as drives fill up (the inner tracks are shorter and thus slower), so sequential throughput drops. That is how I would explain why the results are slightly worse than perfect linear scaling.

    Final words

    I hope this benchmark is of interest to you and, more importantly, that you can run your own by using the aforementioned script. If you ever want to run your own benchmarks, expect the script to run for days. Leave a comment if you have questions or remarks about these test results or the way testing was done.

    Tagged as : ZFS
  3. The 'Hidden' Cost of Using ZFS for Your Home NAS

    January 02, 2016

    Many home NAS builders consider using ZFS for their file system. But there is a caveat with ZFS that people should be aware of.

    Although ZFS is free software, implementing ZFS is not free. The key issue is that expanding capacity with ZFS is more expensive compared to legacy RAID solutions.

    With ZFS, you either have to buy all storage you expect to need upfront, or you will be wasting a few hard drives on redundancy you don't need.

    This fact is often overlooked, but it's very important when you are planning your build.

    Other software RAID solutions like Linux MDADM let you grow an existing RAID array one disk at a time. This is also true for many hardware-based RAID solutions. This is ideal for home users because you can expand as you need.

    ZFS does not allow this!

    To understand why using ZFS may cost you extra money, we will dig a little bit into ZFS itself.

    Quick recap of ZFS

    [Diagram: ZFS architecture - pool, VDEVs and physical drives]

    The diagram above illustrates the architecture of ZFS. There are a few things you should take away from it.

    The main takeaway of this picture is that your ZFS pool and thus your file system is based on one or more VDEVs. And those VDEVs contain the actual hard drives.

    Fault-tolerance or redundancy is addressed within a VDEV. A VDEV is either a RAID-1 (mirror), RAID-5 (RAIDZ) or RAID-6 (RAIDZ2). It can even use triple parity (RAID-Z3), but I doubt many of you will ever need that.

    So it's important to understand that a ZFS pool itself is not fault-tolerant. If you lose a single VDEV within a pool, you lose the whole pool, and with it all your data.

    You can't add hard drives to a VDEV

    Now it's very important to understand that you cannot add hard drives to a VDEV.

    This is the key limitation of ZFS as seen from the perspective of home NAS builders.

    To expand the storage capacity of your pool, you need to add extra VDEVs. And because each VDEV needs to take care of its own redundancy, you also need to buy extra drives for parity.
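    To make this concrete, expanding a pool means something like the command below. This is a hedged example in which the pool name and device names are placeholders; note that the new VDEV brings its own parity drives with it.

    # add a second RAID-Z2 VDEV of four drives to an existing pool
    zpool add tank raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj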

    I will quickly add that there is a way out: replace every hard drive in the VDEV, one by one, with a higher capacity hard drive. You will have to 'rebuild' or 'resilver' the VDEV after each replacement, but it will work, although it's a bit cumbersome and quite expensive.

    So back to the topic at hand: what does this limitation mean in real life? I'll give an example.

    Let's say you plan on building a small NAS with a capacity of four drives. Please don't create a three-drive RAID-Z thinking you can just add the fourth drive when you need to, because that's not possible.

    In this example, you would be better off buying the fourth drive upfront and creating a four-drive RAID-Z. This is a case where you are forced to buy space you don't need yet, because expanding later on is otherwise not possible.

    You could have expanded your pool with another VDEV consisting of a minimum of three drives (if you run RAID-Z), but the chassis only has room for one extra drive, so that doesn't work.

    Planning your ZFS Build with the VDEV limitation in mind

    Many home NAS builders use RAID-6 (RAID-Z2) for their builds, because of the extra redundancy. This makes sense because a double drive failure is not something unheard of, especially during rebuilds where all drives are being taxed quite heavily for many hours.

    I personally would recommend running RAID-Z2 over RAID-Z1 if you go over five to six drives, and spending the extra money on the additional hard drive it requires. Actually, with RAID-Z2 or RAID 6, I think it's perfectly reasonable to run a single VDEV at home with up to 12 drives1.

    With RAID-Z2 however, the 'ZFS tax' is even more clearly visible. By having to add an additional VDEV, you will also lose two drives due to parity overhead.

    [Diagram: pool expansion with an extra RAID-Z2 VDEV; yellow drives mark the parity overhead]

    Please note that the 'yellow' drives mark the parity/redundancy overhead. It does not mark where parity data lives (it's striped across all drives).

    Let's illustrate the above picture with an example. Your NAS chassis can hold a maximum of twelve drives. You start out with six drives in a RAID-Z2. At some point you want to expand. The cheapest option is to expand with another RAID-Z2 consisting of four drives (minimum size of a RAID-Z2 VDEV).

    At a cost of $150 per hard drive3, expanding the capacity of your pool will cost you $600 instead of $150 (a single drive), and $300 of that $600 (50%) is wasted on redundancy you don't really need.

    Furthermore, you can no longer expand your pool, so the remaining two drive slots are 'wasted'2. You end up with a maximum of ten drives.

    In this example, to make use of the drive capacity of your NAS chassis, you should expand with another six hard drives. That would cost you $900 and $300 of that $900 (33%) is wasted on redundancy. This is illustrated above.

    Storage-wise it's more efficient to expand with six drives instead of four. But it will cost you another $300 to expand, paying for storage you may not immediately need.

    But neither option is that efficient, because you end up using four drives for parity where two would - in my view - be sufficient.

    So, if you want to get the most capacity out of that chassis, and the most space per dollar, your only option is to buy all twelve drives upfront and create a single RAID-Z2 consisting of twelve drives.

    [Diagram: a single twelve-drive RAID-Z2 VDEV]

    Buying all drives upfront is expensive and you may only benefit from that extra space years down the road.

    Summary

    So I hope this example clearly illustrates the issue at hand. With ZFS, you either need to buy all storage upfront or you will lose hard drives to redundancy you don't need, reducing the maximum storage capacity of your NAS.

    You have to decide what your needs are. ZFS is an awesome file system that offers you way better data integrity protection than other file system + RAID solution combinations.

    But implementing ZFS has a certain 'cost'. You must decide if ZFS is worth it for you.

    Addressing some feedback

    I found out that my article was discussed on an episode of the BSDNOW vodcast.

    This article also got some attention on Hacker News.

    To me, some of the feedback is not 'wrong' but feels rather disingenuous or not relevant for the intended audience of this article. I have provided the links so you can make up your own mind.

    This article has a particular user group in mind so you really should think about how much their needs align with yours.

    You are steering people away from ZFS

    No, I'm not, and this is not my intention. I run ZFS myself on two servers. I do feel that sometimes the downsides of ZFS are swept under the rug, and we should be very open and clear about them towards people seeking advice.

    Use mirrors not RAID-Z(2/3)!

    Doesn't make much sense to me for home NAS builders.

    Using mirrors is wasting space

    Advising people to use mirrors instead of RAID-Z(2/3) I do find a little bit disingenuous, because you are throwing away 50% of your disk capacity. With RAIDZ you 'lose' 33% for three drives and 25% for four drives. If we look at RAIDZ2, we would 'lose' 33% for six drives, 25% for eight drives and only 20% for ten drives.

    In the end, you are wasting multiple drives' worth of storage capacity, depending on the number of drives in your pool.

    Adding mirrors with larger drives

    As time goes by, larger disks become cheaper. So it could make sense to expand your pool with mirrors based on bigger drives than the original drives you started out on. The size of your pool would increase. However, it's still only 50% space efficient.

    Random I/O performance is better

    Using mirrors is essentially running RAID 10. Yes, you can expand your pool two drives at a time, and you gain better random I/O performance. However, the large majority of home NAS builders don't care about random I/O performance. You just care whether you can saturate gigabit Ethernet and have one big pool of storage. In that case, you don't need the random IOPS.

    If you run some VMs from your storage that require high storage performance, it's an entirely different matter. But I expect that most DIY NAS builders just want some storage to put a ton of data on and nothing more.

    RAIDZ2 is more reliable than using mirrors

    The redundancy of RAIDZ2 beats using mirrors. If, during a rebuild, the surviving member of a mirror fails (the one disk in the pool that is taxed the most during the rebuild), you lose your pool. With RAIDZ2, any second drive can fail and you are still OK.

    There is only one 'upside' regarding mirrors that is discussed in the next section.

    Mirror rebuild times are better

    The only upside of using mirrors is that when a disk has failed and the new disk is being 'resilvered', those rebuilds are reported to be faster than with RAID-Z(2/3). I think this is no different from legacy RAID; the main difference with ZFS is that ZFS only rebuilds actual data, not the entire disk.

    ZFS rebuilds are faster

    This is indeed a benefit of ZFS. The question is how relevant it is for you.

    ZFS only rebuilds data. Legacy RAID just rebuilds every 'bit' on a drive. The latter takes longer than the former. So with legacy RAID, rebuild times depend on the size of a single drive, not on the number of drives in the array, no matter how much data you have stored on your array.

    My old 18 TB server was based on a single twenty-drive RAID 6 using MDADM. It took 5 hours to rebuild a 1 TB drive. If it had used 4 TB drives, it would have taken 20 hours, if I'm allowed to extrapolate. With ZFS - if the pool had been only 50% full - those rebuild times would have been roughly half of that.

    Personally, with RAID 6 or RAIDZ2, rebuild times aren't that big of a deal, as you can lose a second drive and still be safe.

    Just replace existing drives with bigger ones!

    I did briefly touch on this option in the article above, but I will address it again. The problem with this approach is twofold. First, you can't expand storage capacity as you need it; you need to replace all existing drives with larger ones.

    Second, the procedure itself is a bit cumbersome and time intensive. You need to replace each drive one by one, and every time you need to 'resilver' your VDEV. Only when all drives have been replaced will you be able to grow the size of your pool.

    If you are OK with this approach - and people have used it - it is a way to work around the 'ZFS-tax'.
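    For completeness, the drive-by-drive route looks roughly like this. It is a sketch; the pool name and drive identifiers are placeholders, and each 'zpool replace' triggers a full resilver before you can move on to the next drive.

    zpool set autoexpand=on tank
    # repeat for every drive in the VDEV, waiting for each resilver to finish
    zpool replace tank <old-drive-id> <new-drive-id>
    zpool status tank
    # once all drives have been replaced, the extra capacity becomes available
    # (with autoexpand=on; otherwise run 'zpool online -e' for each new drive)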

    Not using ZFS is putting your data at great risk!

    The BSDNOW podcast seems to agree with me that if you want true data safety, this 'ZFS-tax' is just the price you have to pay: either you go with mirrors or you accept the extra parity redundancy.

    It is not my goal to steer you away from ZFS. The above is true: ZFS offers something no other (stable) file system currently offers to home NAS builders, but at a cost.

    The thing is that I find it perfectly reasonable for home NAS users to just buy a Synology, QNAP or some ready-made NAS from another quality brand. That's what the majority of people do and I think it's a reasonable option. I don't think you are taking crazy risks if you would do so.

    If you do build your own home NAS, it's reasonable to accept the 'risk' of using Windows with storage spaces or hardware RAID. Or using Linux with MDADM or hardware RAID. I would say: ZFS is clearly technically the better option, but those 'legacy' options are not so bad that you are taking unreasonable risks with your data.

    So while ZFS is the better option, it's up to you and your particular needs and circumstances to decide whether using ZFS is worth it for you.


    1. For my own 71 TB storage NAS I decided at that time to run with an eighteen-disk VDEV plus a six-disk VDEV. Not standard, but I decided that I accept the risk.  

    2. Expanding with a VDEV consisting of a mirrored pair is technically possible but it breaks the RAID-Z2 redundancy. It doesn't make much sense to me. 

    3. Just an example for illustration purposes.  

    Tagged as : ZFS
  4. ZFS Performance on HP Proliant Microserver Gen8 G1610T

    August 14, 2015

    I think the HP Proliant Microserver Gen8 is a very interesting little box if you want to build your own ZFS-based NAS. The benchmarks I've performed seem to confirm this.

    The Microserver Gen8 has nice features such as:

    • iLO (KVM over IP with dedicated network interface)
    • support for ECC memory
    • 2 x Gigabit network ports
    • Free PCIe slot (half-height)
    • Small footprint
    • Fairly silent
    • good build quality

    The Microserver Gen8 can be a better solution than the offerings of - for example - Synology or QNAP because you can create a more reliable system based on ECC-memory and ZFS.

    [Photo: HP Proliant Microserver Gen8]

    Please note that the G1610T version of the Microserver Gen8 does not ship with a DVD/CD drive as depicted in the image above.

    The Gen8 can be found fairly cheap on the European market at around 240 euros including taxes, and if you put in an extra 8 GB of memory on top of the 2 GB already installed, you have a total of 10 GB, which is more than enough to support ZFS.

    The Gen8 has room for 4 x 3.5" hard drives, so with today's large disk sizes you can pack quite a bit of storage inside this compact machine.


    Net storage capacity:

    This table gives you a quick overview of the net storage capacity you would get depending on the chosen drive size and redundancy.

    Drive size | RAID-Z | RAID-Z2 or Mirror
    3 TB       |   9 TB |              6 TB
    4 TB       |  12 TB |              8 TB
    6 TB       |  18 TB |             12 TB
    8 TB       |  24 TB |             16 TB

    Boot device

    If you want to use all four drive slots for storage, you need to boot this machine from either the fifth internal SATA port, the internal USB 2.0 port or the microSD card slot.

    The fifth SATA port is not bootable if you disable the on-board RAID controller and run in pure AHCI mode. This mode is probably the best mode for ZFS as there seems to be no RAID controller firmware active between the disks and ZFS. However, only the four 3.5" drive bays are bootable.

    The fifth SATA port is bootable if you configure SATA to operate in Legacy mode. This is not recommended as you lose the benefits of AHCI such as hot-swap of disks and there are probably also performance penalties.

    The fifth SATA port is also bootable if you enable the on-board RAID controller, but do not configure any RAID arrays with the drives you plan to use with ZFS (Thanks Mikko Rytilahti). You do need to put the boot drive in a RAID volume in order to be able to boot from the fifth SATA port.

    The unconfigured drives will just be passed as AHCI devices to the OS and thus can be used in your ZFS array. The big question here is what happens if you encounter read errors or other drive problems that ZFS could handle, but would be a reason for the RAID controller to kick a drive off the SATA bus. I have no information on that.

    I myself used an old 2.5" hard drive with a SATA-to-USB converter which I stuck in the case (use double-sided tape or velcro to mount it to the PSU). Booting from USB stick is also an option, although a regular 2.5" hard drive or SSD is probably more reliable (flash wear) and faster.

    Boot performance

    The Microserver Gen8 takes about 1 minute and 50 seconds just to pass the BIOS boot process and start booting the operating system (you will hear a beep).

    Test method and equipment

    I'm running Debian Jessie with the latest stable ZFS-on-Linux 0.6.4. Please note that reportedly FreeNAS also runs perfectly fine on this box.

    I had to run my tests with the disks I had available:

    root@debian:~# show disk -sm
    -----------------------------------
    | Dev | Model              | GB   |   
    -----------------------------------
    | sda | SAMSUNG HD103UJ    | 1000 |   
    | sdb | ST2000DM001-1CH164 | 2000 |   
    | sdc | ST2000DM001-1ER164 | 2000 |   
    | sdd | SAMSUNG HM250HI    | 250  |   
    | sde | ST2000DM001-1ER164 | 2000 |   
    -----------------------------------
    

    The 250 GB is a portable disk connected to the internal USB port. It is used as the OS boot device. The other disks, 1 x 1 TB and 3 x 2 TB are put together in a single RAIDZ pool, which results in 3 TB of storage.
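    For reference, the pool was created along these lines. This is a reconstruction based on the zpool status output further down, so treat it as a sketch rather than the exact command I typed.

    zpool create testpool raidz \
        /dev/disk/by-id/wwn-0x50000f0008064806 \
        /dev/disk/by-id/wwn-0x5000c5006518af8f \
        /dev/disk/by-id/wwn-0x5000c5007cebaf42 \
        /dev/disk/by-id/wwn-0x5000c5007ceba5a5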

    Tests with 4-disk RAIDZ VDEV

    root@debian:~# zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    testpool  48.8G  2.54T  48.8G  /testpool
    root@debian:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
    config:
    
        NAME                        STATE     READ WRITE CKSUM
        testpool                    ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50000f0008064806  ONLINE       0     0     0
            wwn-0x5000c5006518af8f  ONLINE       0     0     0
            wwn-0x5000c5007cebaf42  ONLINE       0     0     0
            wwn-0x5000c5007ceba5a5  ONLINE       0     0     0
    
    errors: No known data errors
    

    Because a NAS will mostly face data transfers that are sequential in nature, I've done some tests with 'dd' to measure sequential performance.

    Read performance:

    root@debian:~# dd if=/testpool/test.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 162.429 s, 323 MB/s

    Write performance:

    root@debian:~# dd if=/dev/zero of=/testpool/test.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 169.572 s, 309 MB/s

    Test with 3-disk RAIDZ VDEV

    After the previous test I wondered what would happen if I would exclude the older 1 TB disk and create a pool with just the 3 x 2 TB drives. This is the result:

    Read performance:

    root@debian:~# dd if=/testpool/test.bin of=/dev/null bs=1M conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 149.509 s, 351 MB/s

    Write performance:

    root@debian:~# dd if=/dev/zero of=/testpool/test.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 144.832 s, 362 MB/s

    The performance is clearly better even though there's one disk fewer in the VDEV. I would have liked to test what kind of performance would be achieved with four 2 TB drives, but I only have three.

    The result does show that the pool is more than capable of sustaining gigabit network transfer speeds.

    This is confirmed when performing the actual network file transfers. In the example below, I simulate a copy of a 50 GB test file from the Gen8 towards a test system using NFS. Tests are performed using the 3-disk pool.
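    The NFS setup itself is straightforward. The sketch below shows what it roughly looks like; the export options, network range and hostname are assumptions, not my exact configuration.

    # on the Gen8: export the pool's file system (entry in /etc/exports), then reload
    /testpool 192.168.0.0/24(rw,no_subtree_check)
    exportfs -ra

    # on the client: mount the export
    mount -t nfs gen8:/testpool /mnt/server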

    NFS read performance:

    root@nano:~# dd if=/mnt/server/test2.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 443.085 s, 118 MB/s
    

    NFS write performance:

    root@nano:~# dd if=/dev/zero of=/mnt/server/test2.bin bs=1M count=50000 conv=sync 
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 453.233 s, 116 MB/s
    

    I think these results are excellent. Tests with the 'cp' command give the same results.

    I've also done some tests with the SMB/CIFS protocol. I've used a second Linux box as a CIFS client to connect to the Gen8.

    CIFS read performance:

    root@nano:~# dd if=/mnt/test/test.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 527.778 s, 99.3 MB/s
    

    CIFS write performance:

    root@nano:~# dd if=/dev/zero of=/mnt/test/test3.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 448.677 s, 117 MB/s
    

    Hot-swap support

    Although it's even printed on the hard drive caddies that hot-swap is not supported, it does seem to work perfectly fine if you run the SATA controller in AHCI mode.

    Fifth SATA port for SSD SLOG/L2ARC?

    If you buy a converter cable that converts a floppy power connector to a SATA power connector, you could install an SSD. This SSD can then be used as a dedicated SLOG device and/or L2ARC cache if you have a need for this.
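    Adding such an SSD to the pool is a one-liner per role. A hedged example, with placeholder partition names:

    # dedicated SLOG (ZIL) device
    zpool add testpool log /dev/disk/by-id/ata-SomeSSD-part1
    # L2ARC read cache
    zpool add testpool cache /dev/disk/by-id/ata-SomeSSD-part2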

    RAIDZ, is that OK?

    If you want maximum storage capacity with redundancy, RAIDZ is the only option. RAID-Z2 (RAID 6) or two mirrored VDEVs are more reliable, but would reduce the available storage space by a third.

    The main risk of RAIDZ is a double drive failure: with larger drive sizes, a resilver of a VDEV takes quite some time. It could take more than a day before the pool is resilvered, and during that time you run without redundancy.

    With the low number of drives in the VDEV the risk of a second drive failure may be low enough to be acceptable. That's up to you.

    Noise levels

    In the past, there have been reports about the Gen8 making tons of noise because the rear chassis fan spins at a high RPM if the RAID card is set to AHCI mode.

    I myself have not encountered this problem. The machine is almost silent.

    Power consumption

    With the drives spinning: 50-55 watts. With the drives in standby: 30-35 watts.

    Conclusion

    I think my benchmarks show that the Microserver Gen8 could be an interesting platform if you want to create your own ZFS-based NAS.

    Please note that since the Gen9 server platform has already been out for some time, HP may release a Gen9 version of the Microserver in the near future. However, as of August 2015, there is no information on this yet and it is not clear whether a successor is going to be released.

    Tagged as : ZFS microserver
  5. The Sorry State of CoW File Systems

    March 01, 2015

    I'd like to argue that ZFS and BTRFS are both incomplete file systems with their own drawbacks, and that it may still be a long time before we have something truly great.

    ZFS and BTRFS are both heroic feats of engineering, created by people who are probably ten times more capable and smarter than I am. There is no question about my appreciation for these file systems and what they accomplish.

    Still, as an end-user, I would like to see some features that are often either missing or not complete. Make no mistake, I believe that both ZFS and BTRFS are probably the best file systems we have today. But they can be much better.

    I want to start with a quick overview of why both ZFS and BTRFS are such great file systems and why you should take an interest in them.

    Then I'd like to discuss their individual drawbacks and explain my argument.

    Why ZFS and BTRFS are so great

    Both ZFS and BTRFS are great for two reasons:

    1. They focus on preserving data integrity
    2. They simplify storage management

    Data integrity

    ZFS and BTRFS implement two important techniques that help preserve data.

    1. Data is checksummed and its checksum is verified to guard against bit rot due to broken hard drives or flaky storage controllers. If redundancy is available (RAID), errors can even be corrected.

    2. Copy-on-Write (CoW): existing data is never overwritten, so a calamity like a sudden power loss cannot leave existing data in an inconsistent state.

    Simplified storage management

    In the old days, we had MDADM or hardware RAID for redundancy, LVM for logical volume management, and on top of that the file system of choice (EXT3/4, XFS, ReiserFS, etc.).

    The main problem with this approach is that the layers are not aware of each other, which makes things very inefficient and more difficult to administer. Each layer needs its own attention.

    For example, if you simply want to expand storage capacity, you need to add drives to your RAID array and expand it. Then, you have to alert the LVM layer of the extra storage and as a last step, grow the file system.

    Both ZFS and BTRFS make capacity expansion a simple one line command that addresses all three steps above.
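    To illustrate the difference, the sketch below compares the legacy layered approach with the ZFS one-liner (the BTRFS equivalent is shown further down in this article). All device, volume group and pool names are made up for the example.

    # legacy layered stack: grow the RAID array, then LVM, then the file system
    mdadm --add /dev/md0 /dev/sdf
    mdadm --grow /dev/md0 --raid-devices=6
    pvresize /dev/md0
    lvextend -l +100%FREE /dev/vg0/data
    resize2fs /dev/vg0/data

    # ZFS: add a VDEV to the pool, done
    zpool add tank raidz2 /dev/sdf /dev/sdg /dev/sdh /dev/sdi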

    Why are ZFS and BTRFS capable of doing this? Because they incorporate RAID, LVM and the file system in one single integrated solution; each 'layer' is aware of the others. Because of this integration, rebuilds after a drive failure are often faster than with 'legacy RAID' solutions, as only the actual data needs to be rebuilt, not the entire drive.

    And I'm not even talking about the joy of snapshots here.

    The inflexibility of ZFS

    The storage building block of ZFS is a VDEV. A VDEV is either a single disk (not so interesting) or some RAID scheme, such as mirroring, single parity (RAIDZ), dual parity (RAIDZ2) or even triple parity (RAIDZ3).

    To me, a big downside of ZFS is the fact that you cannot expand a VDEV by adding drives. OK, there is one way to expand a VDEV, but it is quite convoluted: you have to replace all of the existing drives, one by one, with bigger ones, rebuilding the VDEV each time you replace a drive. Then, when all drives are of the higher capacity, you can expand your VDEV. This is quite impractical and time-consuming, if you ask me.

    ZFS expects you just to add extra VDEVS. So if you start with a single 6-drive RAIDZ2 (RAID6), you are expected to add another 6-drive RAIDZ2 if you want to expand capacity.

    What I would want to do is just add one or two more drives and grow the VDEV, as has been possible with many hardware RAID solutions and with "mdadm --grow" for ages.

    Why do I prefer this over adding VDEVs? Because it's quite evident that this is way more economical. If I could just expand my RAIDZ2 from 6 drives to 12 drives, I would only sacrifice two drives for parity. If I instead have two RAIDZ2 VDEVs, I sacrifice four drives (16% vs 33% capacity loss).

    I can imagine that in the enterprise world, this is just not that big of a deal, a bunch of drives are a rounding error on the total budget and availability and performance are more important. Still, I'd like to have this option.

    Either you are forced to buy and implement upfront the storage you expect to need in the future, or you must add it later on, wasting drives on parity you would otherwise not have needed.

    Maybe my wish for a zpool grow option is more geared towards hobbyist or home usage of ZFS, and ZFS was always focused on enterprise needs, not the needs of hobbyists. So I'm aware of the context here.

    I'm not done with ZFS, however, because there is another big inflexibility: if you don't put the 'right' number of drives in a VDEV, you may lose a significant portion of your storage, which is a side effect of how ZFS works.

    The following ZFS pool configurations are optimal for modern 4K sector hard drives (the pattern is a power-of-two number of data drives plus the parity drives):
    RAID-Z: 3, 5, 9, 17, 33 drives
    RAID-Z2: 4, 6, 10, 18, 34 drives
    RAID-Z3: 5, 7, 11, 19, 35 drives
    

    I've seen first-hand with my 71 TiB NAS that if you don't use the optimal number of drives in a VDEV, you may lose whole drives' worth of net storage capacity. In that regard, my 24-drive chassis is very suboptimal.

    The sad state of RAID on BTRFS

    As far as I'm aware, BTRFS has none of the downsides of ZFS described in the previous section. It has plenty of its own, though. First of all: BTRFS is still not stable; especially the RAID 5/6 part is unstable.

    The RAID 5 and RAID 6 implementations are so new that the ink they were written with is still wet (February 8th, 2015). Not something you want to trust your important data to, I suppose.

    I did set up a test environment to play a bit with this new Linux kernel (3.19.0) and BTRFS to see how it works, and although it is not production-ready yet, I really like what I see.

    With BTRFS you can just add drives to, or remove drives from, a RAID 6 array as you see fit. Add two? Remove three? Whatever; the only thing you have to wait for is BTRFS rebalancing the data over the new or remaining drives.

    This is friggin' awesome.

    If you want to remove a drive, just wait for BTRFS to copy the data from that drive to the other remaining drives and you can remove it. You want to expand storage? Just add the drives to your storage pool and have BTRFS rebalance the data (which may take a while, but it works).
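    In practice that boils down to commands like these (a sketch with placeholder device names and mount point):

    # grow: add two drives and spread the existing data over them
    btrfs device add /dev/sdg /dev/sdh /mnt/data
    btrfs balance start /mnt/data

    # shrink: migrate the data off a drive and release it
    btrfs device remove /dev/sdc /mnt/data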

    But I'm still a bit sad, because BTRFS does not support anything beyond RAID 6: no multiple RAID 6 (RAID 60) arrays or triple parity, which ZFS has supported for ages. With my 24-drive file server, putting 24 drives in a single RAID 6 starts to feel like asking for trouble. Triple parity or RAID 60 would probably be more reasonable. But no luck with BTRFS.

    However, what really frustrates me is this article by Ronny Egner. The author of SnapRAID, Andrea Mazzoleni, has written a functional patch for BTRFS that implements not only triple-parity RAID, but even up to six parity disks for a volume.

    The maddening thing is that the BTRFS maintainers are not planning to include this patch in the BTRFS code base. Please read Ronny's blog. The people working on BTRFS are working for enterprises who want enterprise features. They don't care about triple parity or features like that because they have access to something presumably better: distributed file systems, which may do away with the need for larger disk arrays and thus triple parity.

    BTRFS has been in development for a very long time, and only recently has RAID 5/6 support been introduced. The risk of the write hole, something ZFS addressed ages ago, is still an open issue. Considering all of this, BTRFS is still a very long way from being the file system of choice for larger storage arrays.

    BTRFS seems to be way more flexible in terms of expanding or shrinking storage, but its slow pace of development makes it unusable for anything serious for at least the next year, I guess.

    Conclusion

    BTRFS addresses all the inflexibilities of ZFS, but its immaturity and lack of more advanced RAID schemes make it unusable for larger storage solutions. This is sad, because by design it seems to be the better, way more flexible option compared to ZFS.

    I do understand the view of the BTRFS developers. With enterprise data sets, at scale, it's better to handle storage and redundancy with distributed file systems than at the level of a single system. But that kind of environment is out of reach for many.

    So at the moment, compared to BTRFS, ZFS is still the better option for people who want to setup large, reliable storage arrays.

    Tagged as : ZFS BTRFS
