1. The Sorry State of CoW File Systems

    March 01, 2015

    I'd like to argue that both ZFS and BTRFS are incomplete file systems with their own drawbacks, and that it may still be a long way off before we have something truly great.

    ZFS and BTRFS are both heroic feats of engineering, created by people who are probably ten times more capable and smarter than I am. There is no question about my appreciation for these file systems and what they accomplish.

    Still, as an end user, I would like to see some features that are currently either missing or incomplete. Make no mistake, I believe that both ZFS and BTRFS are probably the best file systems we have today. But they can be much better.

    I want to start with a quick overview of why both ZFS and BTRFS are such great file systems and why you should take some interest in them.

    Then I'd like to discuss their individual drawbacks and explain my argument.

    Why ZFS and BTRFS are so great

    Both ZFS and BTRFS are great for two reasons:

    1. They focus on preserving data integrity
    2. They simplify storage management

    Data integrity

    ZFS and BTRFS implement two important techniques that help preserve data.

    1. Data is checksummed and the checksum is verified to guard against bit rot caused by broken hard drives or flaky storage controllers. If redundancy is available (RAID), errors can even be corrected (see the scrub example after this list).

    2. Copy-on-write (CoW): existing data is never overwritten, so a calamity like sudden power loss cannot leave existing data in an inconsistent state.
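
    As an illustration of the first point, this is roughly how you trigger and inspect such a check on ZFS ('tank' is just a placeholder pool name):

    zpool scrub tank        # read every block in the pool and verify its checksum
    zpool status -v tank    # shows any checksum errors found (and repaired, if redundancy allows)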

    Simplified storage management

    In the old days, we had MDADM or hardware RAID for redundancy, LVM for logical volume management, and on top of that the file system of choice (EXT3/4, XFS, ReiserFS, etc.).

    The main problem with this approach is that the layers are not aware of each other, which makes things inefficient and more difficult to administer. Each layer needs its own attention.

    For example, if you simply want to expand storage capacity, you need to add drives to your RAID array and expand it. Then, you have to alert the LVM layer of the extra storage and, as a last step, grow the file system.

    Both ZFS and BTRFS make capacity expansion a simple one line command that addresses all three steps above.
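
    For ZFS, for instance, growing a pool comes down to a single command; a sketch, where 'tank' and the disk names are placeholders (the BTRFS equivalent is shown further below):

    zpool add tank raidz2 disk1 disk2 disk3 disk4 disk5 disk6    # add a new RAIDZ2 VDEV to the pool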

    Why are ZFS and BTRFS capable of doing this? Because they incorporate RAID, LVM and the file system into one single integrated solution. Each 'layer' is aware of the others; they are tightly integrated. Because of this integration, rebuilds after a drive failure are often faster than with 'legacy RAID' solutions, because only the actual data needs to be rebuilt, not the entire drive.

    And I'm not even talking about the joy of snapshots here.

    The inflexibility of ZFS

    The storage building block of ZFS is a VDEV. A VDEV is either a single disk (not so interesting) or some RAID scheme, such as mirroring, single parity (RAIDZ), dual parity (RAIDZ2) or even triple parity (RAIDZ3).

    To me, a big downside to ZFS is the fact that you cannot expand a VDEV. The only way you can expand a VDEV is quite convoluted: you have to replace all of the existing drives, one by one, with bigger ones, and let the VDEV rebuild each time you replace a drive. Then, when all drives are of the higher capacity, you can expand your VDEV. This is quite impractical and time-consuming, if you ask me.
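
    For completeness, that replace-and-grow procedure looks roughly like this ('tank' and the disk names are placeholders):

    zpool set autoexpand=on tank                # allow the pool to grow once all drives are bigger
    zpool replace tank old-disk-1 new-disk-1    # wait for the resilver to finish (check zpool status)
    # ...then repeat the replace step for every remaining drive in the VDEV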

    ZFS expects you to just add extra VDEVs. So if you start with a single 6-drive RAIDZ2 (RAID6), you are expected to add another 6-drive RAIDZ2 if you want to expand capacity.

    What I would want is to just add one or two more drives and grow the VDEV, as has been possible for ages with many hardware RAID solutions and with 'mdadm --grow'.
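
    With MDADM, growing an existing array by a single drive is roughly this (a sketch; /dev/md0 and /dev/sdX are placeholders, and depending on the mdadm version and RAID level a --backup-file may be required):

    mdadm --add /dev/md0 /dev/sdX              # add the new drive as a spare
    mdadm --grow /dev/md0 --raid-devices=7     # reshape from 6 to 7 member drives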

    Why do I prefer this over adding VDEVs? Because it's quite evident that this is way more economical. If I can just expand my RAIDZ2 from 6 drives to 12 drives, I only sacrifice two drives for parity. If I instead use two 6-drive RAIDZ2 VDEVs, I sacrifice four drives (16% vs. 33% capacity loss).

    I can imagine that in the enterprise world this is just not that big of a deal: a bunch of drives is a rounding error on the total budget, and availability and performance are more important. Still, I'd like to have this option.

    Either you are forced to buy and implement upfront the storage you expect to need in the future, or you must add it later on, wasting more drives on parity than you otherwise would have.

    Maybe my wish for a 'zpool grow' option is more geared towards hobbyist or home usage of ZFS, and ZFS was always focused on enterprise needs, not the needs of hobbyists. So I'm aware of the context here.

    I'm not done with ZFS, however, because there is another great inflexibility in the way it works. If you don't put the 'right' number of drives in a VDEV, you may lose a significant portion of storage capacity as a side effect of how ZFS works.

    The following ZFS pool configurations are optimal for modern 4K sector hard drives:
    RAID-Z: 3, 5, 9, 17, 33 drives
    RAID-Z2: 4, 6, 10, 18, 34 drives
    RAID-Z3: 5, 7, 11, 19, 35 drives
    

    I've seen first-hand with my 71 TiB NAS that if you don't use the optimal number of drives in a VDEV, you may lose a whole drive's worth of net storage capacity. In that regard, my 24-drive chassis is very suboptimal.

    The sad state of RAID on BTRFS

    As far as I'm aware, BTRFS has none of the downsides of ZFS described in the previous section. It has plenty of its own, though. First of all: BTRFS is still not stable, and especially the RAID 5/6 part is unstable.

    The RAID 5 and RAID 6 implementations are so new that the ink they were written with is still wet (February 8th, 2015). Not something you want to trust your important data to, I suppose.

    I did set up a test environment to play a bit with this new Linux kernel (3.19.0) and BTRFS to see how it works, and although it is not production-ready yet, I really like what I see.

    With BTRFS you can just add drives to or remove drives from a RAID6 array as you see fit. Add two? Remove three? Whatever; the only thing you have to wait for is BTRFS rebalancing the data over the new or remaining drives.

    This is friggin' awesome.

    If you want to remove a drive, just wait for BTRFS to copy the data from that drive to the other remaining drives and you can remove it. You want to expand storage? Just add the drives to your storage pool and have BTRFS rebalance the data (which may take a while, but it works).
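
    As a sketch, with /mnt/storage as a placeholder mount point, growing and shrinking comes down to:

    btrfs device add /dev/sdX /dev/sdY /mnt/storage    # add two drives to the array
    btrfs balance start /mnt/storage                   # spread the existing data over all drives
    btrfs device delete /dev/sdZ /mnt/storage          # data is migrated off the drive before it is released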

    But I'm still a bit sad, because BTRFS does not support anything beyond RAID6. No multiple RAID6 (RAID60) arrays or triple parity, as ZFS has supported for ages. As with my 24-drive file server, putting 24 drives in a single RAID6 starts to feel like I'm asking for trouble. Triple parity or RAID 60 would probably be more reasonable. But no luck with BTRFS.

    However, what really frustrates me is this article by Ronny Egner. The author of SnapRAID, Andrea Mazzoleni, has written a functional patch for BTRFS that implements not only triple-parity RAID, but even up to six parity disks for a volume.

    The maddening thing is that the BTRFS maintainers are not planning to include this patch in the BTRFS code base. Please read Ronny's blog. The people working on BTRFS work for enterprises that want enterprise features. They don't care about triple parity or features like that, because they have access to something presumably better: distributed file systems, which may do away with the need for larger disk arrays and thus for triple parity.

    BTRFS has been in development for a very long time, and only recently has RAID 5/6 support been introduced. The risk of the write hole, something addressed by ZFS ages ago, is still an open issue. Considering all of this, BTRFS is still a very long way off from being the file system of choice for larger storage arrays.

    BTRFS seems to be way more flexible in terms of storage expansion or shrinking, but its slow pace of development makes it, I guess, unusable for anything serious for at least the next year.

    Conclusion

    BTRFS addresses all the inflexibilities of ZFS, but its immaturity and lack of more advanced RAID schemes make it unusable for larger storage solutions. This is sad, because by design it seems to be the better, way more flexible option compared to ZFS.

    I do understand the view of the BTRFS developers. With enterprise data sets, at that scale, it's better to handle storage and redundancy with distributed file systems than at the scale of a single system. But this kind of environment is out of reach for many.

    So at the moment, compared to BTRFS, ZFS is still the better option for people who want to set up large, reliable storage arrays.

    Tagged as : ZFS BTRFS
  2. Why I Do Use ZFS as a File System for My NAS

    January 29, 2015

    In February 2011, I posted an article about my motivations for not using ZFS as the file system for my 18 TB NAS.

    You have to understand that at the time, I believed the arguments in that article were relevant, but much has changed since then, and I believe that article is no longer relevant.

    My stance on ZFS is in the context of a home NAS build.

    I really recommend giving ZFS a serious consideration if you are building your own NAS. It's probably the best file system you can use if you care about data integrity.

    ZFS may only be available for non-Windows operating systems, but there are quite a few easy-to-use NAS distributions available that turn your hardware into a full-featured home NAS box which can be managed through your web browser. A few examples are FreeNAS and NAS4Free.

    I also want to add this: I don't think it's wrong or particularly risky if you - as a home NAS builder - decide not to use ZFS and select a 'legacy' solution if that better suits your needs. I think that proponents of ZFS sometimes overstate the risks that ZFS mitigates, maybe to promote ZFS. I do think those risks are relevant, but it all depends on your circumstances. So you decide.

    May 2016: I have also written a separate article on how I feel about using ZFS for DIY home NAS builds.

    Ars Technica has an article about FreeNAS vs. NAS4Free.

    If you are quite familiar with FreeBSD or Linux, I do recommend this ZFS how-to article from Ars Technica. It offers a very nice introduction to ZFS and explains terms like 'pool' and 'vdev'.

    If you are planning on using ZFS for your own home NAS, I would recommend reading the following articles:

    My historical reasons for not using ZFS at the time

    When I started with my 18 TB NAS in 2009, there was no such thing as ZFS for Linux. ZFS was only available in a stable version for OpenSolaris. We all know what happened to OpenSolaris (it's gone).

    So you might ask: "Why not use ZFS on FreeBSD then?". Good question, but it was bad timing:

    The FreeBSD implementation of ZFS became only stable [sic] in January 2010, 6 months after I build my NAS (summer 2009). So FreeBSD was not an option at that time.
    

    One of the other objections against ZFS is the fact that you cannot expand your storage by adding single drives and growing the array as your data set grows.

    A ZFS pool consists of one or more VDEVs. A VDEV is a traditional RAID array. You expand storage capacity by expanding the ZFS pool, not the VDEVs. You cannot expand a VDEV itself; you can only add VDEVs to a pool.

    So ZFS either forces you to invest in storage you don't need upfront, or it forces you to invest later on, because you may waste quite a few extra drives on parity. For example, if you start with a 6-drive RAID6 (RAIDZ2) configuration, you will probably expand with another 6 drives. The pool then has 4 parity drives out of 12 total drives (33% loss). Investing upfront in 10 drives instead of 6 would have been more efficient, because you only lose 2 drives out of 10 to parity (20% loss).

    So at the time, I found it reasonable to stick with what I knew: Linux & MDADM.

    But my new 71 TiB NAS is based on ZFS.

    I wrote an article about my worry that ZFS might die with FreeBSD as its sole backing, but fortunately, I've been proven very, very wrong.

    ZFS is now supported on FreeBSD and Linux. Despite some licensing issues that prevent ZFS from being integrated into the Linux kernel itself, it can still be used as a regular kernel module and it works perfectly.

    There is even an open-source ZFS consortium that brings together all the developers for the different operating systems supporting ZFS.

    ZFS is here to stay for a very long time.

    Tagged as : ZFS
  3. Please Use ZFS With ECC Memory

    August 27, 2014

    In this blog post I argue why it's strongly recommended to use ZFS with ECC memory when building a NAS. I would argue that if you do not use ECC memory, it's reasonable to forgo ZFS altogether and use any (legacy) file system that suits your needs.

    Why ZFS?

    Many people consider using ZFS when they are planning to build their own NAS. This is for good reason: ZFS is an excellent choice for a NAS file system. There are many reasons why ZFS is such a fine choice, but the most important one is probably 'data integrity'. Data integrity was one of the primary design goals of ZFS.

    ZFS assures that any corrupt data served by the underlying storage system is detected and - if possible - corrected by using checksums and parity. This is why ZFS is so interesting for NAS builders: it's OK to use inexpensive (consumer) hard drives and solid state drives and not worry about data integrity.

    I will not go into the details, but for completeness I will also state that ZFS can make the difference between losing an entire RAID array or just a few files, because of the way it handles read errors as compared to 'legacy' hardware/software RAID solutions.

    Understanding ECC memory

    ECC memory, or Error-Correcting Code memory, contains extra parity data so the integrity of the data in memory can be verified and even corrected. ECC memory can correct single-bit errors and detect multiple-bit errors per word1.

    What's most interesting is how a system with ECC memory reacts to bit errors that cannot be corrected, because it's exactly this response to uncorrectable bit errors that makes all the difference in the world.

    If multiple bits are corrupted within a single word, the CPU will detect the errors, but will not be able to correct them. When the CPU notices that there are uncorrectable bit errors in memory, it will generate a machine check exception (MCE) that will be handled by the operating system. In most cases, this will result in a halt2 of the system.

    This behaviour will lead to a system crash, but it prevents data corruption. It prevents the bad bits from being processed by the operating system and/or applications where it may wreak havoc.
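
    As an aside, on Linux systems with ECC memory you can monitor both corrected and uncorrected error counts through the EDAC subsystem, assuming the EDAC driver for your memory controller is loaded; a minimal sketch:

    grep . /sys/devices/system/edac/mc/mc*/ce_count    # corrected errors per memory controller
    grep . /sys/devices/system/edac/mc/mc*/ue_count    # uncorrected errors per memory controller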

    ECC memory is standard on all server hardware sold by all major vendors like HP, Dell, IBM, Supermicro and so on. This is for good reason, because memory errors are the norm, not the exception.

    The question is really why not all computers, including desktop and laptops, use ECC memory instead of non-ECC memory. The most important reason seems to be 'cost'.

    It is more expensive to use ECC memory than non-ECC memory. This is not only because ECC memory itself is more expensive. ECC memory requires a motherboard with support for ECC memory, and these motherboards tend to be more expensive as well.

    Non-ECC memory is reliable enough that you won't have an issue most of the time. And when it does go wrong, you just blame Microsoft or Apple3. For desktops, the impact of a memory failure is less of an issue than on servers. But remember, your NAS is your own (home) server. There is some evidence that memory errors are abundant4 on desktop systems.

    The price difference is small enough not to be relevant for businesses, but for the price-conscious consumer, it is a factor. A system based on ECC memory may cost in the range of $150 - $200 more than a system based on non-ECC memory.

    It's up to you if you want to spend this extra money. Why you are advised to do so will be discussed in the next paragraphs.

    Why ECC memory is important to ZFS

    ZFS blindly trusts the contents of memory. Please note that ZFS has no mechanisms to cope with bad memory; it is similar to every other file system in this regard. Here is a nice paper about ZFS and how it handles corrupt memory (it doesn't!).

    In the best case, bad memory corrupts file data and causes a few garbled files. In the worst case, bad memory mangles in-memory ZFS file system (meta) data structures, which may lead to corruption and thus loss of the entire zpool.

    It is important to put this into perspective. There is only a practical reason why ECC memory is more important for ZFS than for other file systems. Conceptually, ZFS does not require ECC memory any more than any other file system does.

    Or let Matthew Ahrens, the co-founder of the ZFS project, phrase it:

    There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.
    

    Now this is the important part. File systems such as NTFS, EXT4, etc. have (data recovery) tools that may allow you to rescue your files when things go bad due to bad memory. ZFS does not have such tools; if the pool is corrupt, all data must be considered lost, as there is no option for recovery.

    So the impact of bad memory can be more devastating on a system with ZFS than on a system with NTFS, EXT4, XFS, etcetera. ZFS may force you to restore your data from backups sooner. Oh, by the way, you do make backups, right?

    I do have a personal concern5. I have nothing to substantiate this, but my thinking is that since ZFS is a way more advanced and complex file system, it may be more susceptible to the adverse effects of bad memory than legacy file systems are.

    ZFS, ECC memory and data integrity

    The main reason for using ZFS over legacy file systems is the ability to assure data integrity. But ZFS is only one piece of the data integrity puzzle. The other part of the puzzle is ECC memory.

    ZFS covers the risk of your storage subsystem serving corrupt data. ECC memory covers the risk of corrupt memory. If you leave any of these parts out, you are compromising data integrity.

    If you care about data integrity, you need to use ZFS in combination with ECC memory. If you don't care that much about data integrity, it doesn't really matter if you use either ZFS or ECC memory.

    Please remember that ZFS was developed to assure data integrity in a corporate IT environment, where data integrity is top priority and ECC memory in servers is the norm: a foundation on which ZFS has been built. ZFS is not some magic pixie dust that protects your data under all circumstances. If its requirements are not met, data integrity is not assured.

    ZFS may be free, but data integrity and availability aren't. We spend money on extra hard drives so we can run RAID(Z) and lose one or more hard drives without losing our data. And we have to spend money on ECC memory, to assure bad memory doesn't have a similar impact.

    This is a bit of an appeal to authority and not to data or reason, but I think it's still relevant. FreeNAS is a vendor of a NAS solution that uses ZFS as its foundation.

    They have this to say about ECC memory:

    However if a non-ECC memory module goes haywire, it can cause irreparable damage to your ZFS pool that can cause complete loss of the storage.
    ...
    If it’s imperative that your ZFS based system must always be available, ECC RAM is a requirement. If it’s only some level of annoying (slightly, moderately…) that you need to restore your ZFS system from backups, non-ECC RAM will fit the bill.
    

    Hopefully your backups won't contain corrupt data. If you make backups of all data in the first place.

    Many home NAS builders won't be able to afford to back up all the data on their NAS, only the most critical data. For example, if you store a large collection of video files, you may accept the risk that you may have to redownload everything. If you can't accept that risk, ECC memory is a must. If you are OK with such a scenario, non-ECC memory is OK and you can save a few bucks. It all depends on your needs.

    The risks faced in a business environment don't magically disappear when you apply the same technology at home. The main difference between a business setting and your home is the scale of operation, nothing else. The risks are still relevant and real.

    Things break, it's that simple. And although the smaller scale at which you operate at home reduces the chance of being affected, your NAS is probably not placed in a temperature- and humidity-controlled server room. As the temperature rises, so does the risk of memory errors6. And remember, memory may develop spontaneous and temporary defects (random bit flips). If your system is powered on 24/7, there is a higher chance that such a thing will happen.

    Conclusion

    Personally, I think that even for a home NAS, it's best to use ECC memory regardless of whether you use ZFS. It makes for a more stable hardware platform. If money is a real constraint, it's better to take a look at AMD's offerings than to skip ECC memory. If you select AMD hardware, it's important to make sure that both the CPU and the motherboard support ECC and that it is reported to be working.

    Still, if you decide to use non-ECC memory with ZFS: as long as you are aware of the risks outlined in this blog post and you're OK with that, fine. It's your data and you must decide for yourself what kind of protection and associated cost is reasonable for you.

    When people seek advice on their NAS builds, ECC memory should always be recommended. I think that nobody should create the impression that, purely from a technical and data integrity standpoint, it's 'safe' for home use not to use ECC RAM. People must understand that they are taking a risk. There is a significant chance that they will never experience problems, but there is no guarantee. Do they accept the consequences if it does go wrong?

    If data integrity is not that important - because the data itself is not critical - I find it perfectly reasonable that people may decide not to use ECC memory and save a few hundred dollars. In that case, it would also be perfectly reasonable not to use ZFS, which opens up other file system and RAID options that may better suit their particular needs.

    Questions and answers

    Q: When I bought my non-ECC memory, I ran memtest86+ and no errors were found, even after a burn-in test. So I think I'm safe.

    A: No. A memory test with memtest86+ is just a snapshot in time. At the time you ran the test, you had the assurance that the memory was fine. It could have gone bad right now, while you are reading these words, and could be corrupting your data as we speak. So running memtest86+ frequently doesn't really buy you much.

    Q: Did you see that article by Brian Moses?

    A: Yes, and I disagree with his views, but I really appreciate the fact that he emphasises that you should be aware of the risks involved and decide for yourself what suits your situation. A few points that are not OK in my opinion:

    Every bad stick of RAM I’ve experienced came to me that way from the factory and could be found via some burn-in testing.
    

    I've seen some consumer equipment in my lifetime that suddenly developed memory errors after years of perfect operation. This argument from personal anecdote should not be used as a basis for decision making. Remember: memory errors are the norm, not the exception. Even at home. Things break, it's that simple. And having equipment running 24/7 doesn't help.

    Furthermore, Brian seems to think that you can mitigate the risk of non-ECC memory by spending money on other things, such as off-site backups. Brian himself links to an article that rebuts his position on this. Just for completeness: how valuable is a backup of corrupted data? How do you know which data was corrupted? ZFS won't save you here.

    Q: Should I use ZFS on my laptop or desktop?

    A: Running ZFS on your desktop or laptop is an entirely different use case compared to a NAS. I see no problems with this; I don't think this discussion applies to desktop/laptop usage. Especially because you are probably creating regular backups of your data to your NAS or a cloud service, right? If there are any memory errors, you will notice soon enough.

    Updates

    • Updated on August 11, 2015 to reflect that ZFS was not designed with ECC in mind. In this regard, it doesn't differ from other file systems.

    • Updated on April 3rd, 2015 - rewrote large parts of the whole article, to make it a better read.

    • Updated on January 18th, 2015 - rephrased some sentences. Changed the paragraph 'Inform people and give them a choice' to argue when it would be reasonable not to use ECC memory. Furthermore, I state more explicitly that ZFS itself has no mechanisms to cope with bad RAM.

    • Updated on February 21st, 2015 - I substantially rewrote this article to give a better perspective on the ZFS + ECC 'debate'.


    1. On x64 processors, the size of a word is 64 bits

    2. Windows will generate a "blue screen of death" and Linux will generate a "kernel panic".  

    3. It is very likely that the computer you're using (laptop/desktop) encountered a memory issue this year, but there is no way you can tell. Consumer hardware doesn't have any mechanisms to detect and report memory errors.  

    4. Microsoft has performed a study on one million crash reports they received over a period of 8 months on roughly a million systems in 2008. The result is a 1 in 1700 failure rate for single-bit memory errors in kernel code pages (a tiny subset of total memory).

      A consequence of confining our analysis to kernel code pages is that we will miss DRAM failures in the vast majority of memory. On a typical machine kernel code pages occupy roughly 30 MB of memory, which is 1.5% of the memory on the average system in our study. [...] since we are capturing DRAM errors in only 1.5% of the address space, it is possible that DRAM error rates across all of DRAM may be far higher than what we have observed. 

    5. I did not come up with this argument myself. 

    6. The absolutely fascinating concept of bitsquatting proved that hotter datacenters showed more bit flips

    Tagged as : ZFS ECC
  4. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    July 31, 2014

    Update 2014-8-23: I was testing with ashift for my new NAS. The ashift=9 write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. Also I noticed that resilvering was very slow. This is why I decided to abandon my 24 drive RAIDZ3 configuration.

    I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue any further.

    My final setup will be an 18-drive RAIDZ2 VDEV + a 6-drive RAIDZ2 VDEV, which will give me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is an excellent 1.9 GB/s. I've written 40+ TiB to the array, and after those 40 TiB, write performance was about 1.7 GB/s, so still very good and what I would expect as drives fill up.

    So actually, based on these results, I have learned not to deviate from the ZFS best practices too much. Use ashift=12 and put drives in VDEVS that adhere to the 2^n+parity rule.

    The uneven VDEVs (18 disk vs. 6 disks) are not according to best practice but ZFS is smart: it distributes data across the VDEVs based on their size. So they fill up equally.


    Choosing between ashift=9 and ashift=12 for 4K sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    Note: NEVER use /dev/sd? drive names for a real array; this is just for testing. Always use /dev/disk/by-id/ names.

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
    

    This is quite impressive. With these speeds, you can saturate 10Gbe ethernet. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage  1.66M  68.4T   435K  /storage
    

    Only 68.4 TiB of storage? That's not good. There should be 24 drives minus 3 for parity, which is 21 x 3.6 TiB ≈ 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    So we lose about a third of our write performance, but the read performance is not affected, probably thanks to read-ahead caching, but I'm not sure.

    With ashift=9, we do lose some write performance, but we can still saturate 10Gbe.

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   271K  73.9T  89.8K  /storage
    

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
  5. Creating a Basic ZFS File System on Linux

    February 01, 2014

    Here are some notes on creating a basic ZFS file system on Linux, using ZFS on Linux.

    I'm documenting the scenario where I just want to create a file system that can tolerate at least a single drive failure and can be shared over NFS.

    Identify the drives you want to use for the ZFS pool

    The ZFS on Linux project advises not to use plain /dev/sdx (/dev/sda, etc.) devices but to use /dev/disk/by-id/ or /dev/disk/by-path/ device names.

    Device names for storage devices are not fixed, so /dev/sdx devices may not always point to the same disk. I've been bitten by this when first experimenting with ZFS: I did not follow this advice and could not access my zpool after a reboot, because I had removed a drive from the system.

    So you should pick the appropriate device from the /dev/disk/by-[id|path] folder. However, it's often difficult to determine which device in those folders corresponds to an actual disk drive.
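
    You can get a partial overview with standard tools, for example:

    ls -l /dev/disk/by-id/ | grep -v part      # persistent names and the /dev/sdx device each one points to
    ls -l /dev/disk/by-path/ | grep -v part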

    So I wrote a simple tool called showdisks which helps you identify which identifiers you need to use to create your ZFS pool.


    You can install showdisks yourself by cloning the project:

    git clone https://github.com/louwrentius/showtools.git
    

    And then just use showdisks like this:

    ./showdisks -sp    # -s shows the size, -p shows the by-path name
    

    For this example, I'd like to use all the 500 GB disk drives for a six-drive RAIDZ1 vdev. Based on the information from showdisks, this is the command to create the vdev:

    zpool create tank raidz1 pci-0000:03:00.0-scsi-0:0:21:0 pci-0000:03:00.0-scsi-0:0:19:0 pci-0000:02:00.0-scsi-0:0:9:0 pci-0000:02:00.0-scsi-0:0:11:0 pci-0000:03:00.0-scsi-0:0:22:0 pci-0000:03:00.0-scsi-0:0:18:0
    

    The 'tank' name can be anything you want, it's just a name for the pool.

    Please note that with newer, bigger disk drives, you should test if the ashift=12 option gives you better performance.

    zpool create -o ashift=12 tank raidz1 <devices>
    

    I used this option on 2 TB disk drives and both the write and the read performance improved twofold.

    How to setup a RAID10 style pool

    This is how to create the ZFS equivalent of a RAID10 setup:

    zpool create tank mirror <device 1> <device 2> mirror <device 3> <device 4> mirror <device 5> <device 6>
    

    How many drives should I use in a vdev

    I've learned to use a 'power of two' (2, 4, 8, 16) number of data drives in a vdev, plus the appropriate number of drives for the parity: RAIDZ1 = 1 disk, RAIDZ2 = 2 disks, etc.

    So the optimal number of drives for a RAIDZ1 would be 3, 5, 9 or 17. For RAIDZ2 it would be 4, 6, 10 or 18, and so on. Clearly, with six drives in a RAIDZ1 configuration in the example above, I'm violating this rule of thumb.

    How to disable the ZIL or disable sync writes

    You can expect bad throughput performance if you use the ZIL / honour synchronous writes. For safety reasons, ZFS does honour sync writes by default; it's an important feature of ZFS to guarantee data integrity. For storage of virtual machines or databases, you should not turn off the ZIL, but use an SSD for the SLOG to get performance to acceptable levels.
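
    Adding an SSD as a dedicated SLOG device is a one-liner (a sketch; use the /dev/disk/by-path or by-id name of your SSD):

    zpool add <pool name> log <ssd device>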

    For a simple (home) NAS box, the ZIL is not so important and can quite safely be disabled, as long as you have your server on a UPS and have it cleanly shut down when the UPS battery runs out.

    This is how you turn off the ZIL / support for synchronous writes:

    zfs set sync=disabled <pool name>
    

    Disabling sync writes is especially important if you use NFS which issues sync writes by default.

    Example:

    zfs set sync=disabled tank
    

    How to add an L2ARC cache device

    Use showdisks to lookup the actual /dev/disk/by-path identifier and add it like this:

    zpool add tank cache <device>
    

    Example:

    zpool add tank cache pci-0000:00:1f.2-scsi-2:0:0:0
    

    This is the result (on another zpool called 'server'):

    root@server:~# zpool status
      pool: server
     state: ONLINE
      scan: none requested
    config:
    
        NAME                               STATE     READ WRITE CKSUM
        server                             ONLINE       0     0     0
          raidz1-0                         ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:0:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:1:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:2:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:3:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:4:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:5:0  ONLINE       0     0     0
        cache
          pci-0000:00:1f.2-scsi-2:0:0:0    ONLINE       0     0     0
    

    How to monitor performance / I/O statistics

    One time sample:

    zpool iostat
    

    A sample every 2 seconds:

        zpool iostat 2
    

    More detailed information every 5 seconds:

        zpool iostat -v 5
    

    Example output:

                                          capacity     operations    bandwidth
    pool                               alloc   free   read  write   read  write
    ---------------------------------  -----  -----  -----  -----  -----  -----
    server                             3.54T  7.33T      4    577   470K  68.1M
      raidz1                           3.54T  7.33T      4    577   470K  68.1M
        pci-0000:03:04.0-scsi-0:0:0:0      -      -      1    143  92.7K  14.2M
        pci-0000:03:04.0-scsi-0:0:1:0      -      -      1    142  91.1K  14.2M
        pci-0000:03:04.0-scsi-0:0:2:0      -      -      1    143  92.8K  14.2M
        pci-0000:03:04.0-scsi-0:0:3:0      -      -      1    142  91.0K  14.2M
        pci-0000:03:04.0-scsi-0:0:4:0      -      -      1    143  92.5K  14.2M
        pci-0000:03:04.0-scsi-0:0:5:0      -      -      1    142  90.8K  14.2M
    cache                                  -      -      -      -      -      -
      pci-0000:00:1f.2-scsi-2:0:0:0    55.9G     8M      0     70    349  8.69M
    ---------------------------------  -----  -----  -----  -----  -----  -----
    

    How to start / stop a scrub

    Start:

    zpool scrub <pool>
    

    Stop:

    zpool scrub -s <pool>
    

    Mount ZFS file systems on boot

    Edit /etc/default/zfs and set this parameter:

    ZFS_MOUNT='yes'
    

    How to enable sharing a file system over NFS:

    zfs set sharenfs=on <poolname>
    

    How to create a zvol for usage with iSCSI

    zfs create -V 500G <poolname>/volume-name
    

    How to force ZFS to import the pool using disk/by-path

    Edit /etc/default/zfs and add

    ZPOOL_IMPORT_PATH=/dev/disk/by-path/
    

    Links to important ZFS information sources:

    Tons of information on using ZFS on Linux by Aaron Toponce:

    https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

    Understanding the ZIL (ZFS Intent Log)

    http://nex7.blogspot.nl/2013/04/zfs-intent-log.html

    Information about 4K sector alignment problems

    http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

    Important read about using the proper number of drives in a vdev

    http://forums.freenas.org/threads/getting-the-most-out-of-zfs-pools.16/

    Tagged as : ZFS
