Articles in the ZFS category

  1. Please Use ZFS With ECC Memory

    August 27, 2014

    In this blog post I argue why it's strongly recommended to use ZFS with ECC memory when building a NAS. I would argue that if you do not use ECC memory, it's reasonable to forgo ZFS altogether and use any (legacy) file system that suits your needs.

    Why ZFS?

    Many people consider using ZFS when they are planning to build their own NAS. This is for good reason: ZFS is an excellent choice for a NAS file system. There are many reasons why ZFS is such a fine choice, but the most important one is probably 'data integrity'. Data integrity was one of the primary design goals of ZFS.

    ZFS assures that any corrupt data served by the underlying storage system is either detected or - if possible - corrected by using checksums and parity. This is why ZFS is so interesting for NAS builders: it's OK to use inexpensive (consumer) hard drives and solid state drives and not worry about data integrity.

    I will not go into the details, but for completeness I will also state that ZFS can make the difference between losing an entire RAID array or just a few files, because of the way it handles read errors as compared to 'legacy' hardware/software RAID solutions.

    Understanding ECC memory

    ECC memory, or Error Correcting Code memory, contains extra parity data so that the integrity of the data in memory can be verified and even corrected. ECC memory can correct single bit errors and detect multiple bit errors per word1.

    What's most interesting is how a system with ECC memory reacts to bit errors that cannot be corrected, because it's exactly that response to uncorrectable bit errors that makes all the difference in the world.

    If multiple bits are corrupted within a single word, the CPU will detect the errors but will not be able to correct them. When the CPU notices that there are uncorrectable bit errors in memory, it will generate a machine check exception (MCE) that will be handled by the operating system. In most cases, this will result in a halt2 of the system.

    This behaviour will lead to a system crash, but it prevents data corruption. It prevents the bad bits from being processed by the operating system and/or applications where it may wreak havoc.
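    On Linux, you can get a rough sense of whether (corrected) memory errors are being reported by looking at the EDAC counters in sysfs. This is a minimal sketch; it assumes the EDAC driver for your memory controller is loaded and that your platform exposes these counters at all:

    grep . /sys/devices/system/edac/mc/mc*/ce_count    # corrected errors per memory controller
    grep . /sys/devices/system/edac/mc/mc*/ue_count    # uncorrected errors per memory controller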

    ECC memory is standard on all server hardware sold by all major vendors like HP, Dell, IBM, Supermicro and so on. This is for good reason, because memory errors are the norm, not the exception.

    The question is really why not all computers, including desktops and laptops, use ECC memory instead of non-ECC memory. The most important reason seems to be 'cost'.

    It is more expensive to use ECC memory than non-ECC memory. This is not only because ECC memory itself is more expensive. ECC memory requires a motherboard with support for ECC memory, and these motherboards tend to be more expensive as well.

    Non-ECC memory is reliable enough that you won't have an issue most of the time. And when it does go wrong, you just blame Microsoft or Apple3. For desktops, the impact of a memory failure is less of an issue than on servers. But remember, your NAS is your own (home) server. There is some evidence that memory errors are abundant4 on desktop systems.

    The price difference is small enough not to be relevant for businesses, but for the price-conscious consumer, it is a factor. A system based on ECC memory may cost in the range of $150 - $200 more than a system based on non-ECC memory.

    It's up to you if you want to spend this extra money. Why you are advised to do so will be discussed in the next paragraphs.

    Why ECC memory is important to ZFS

    ZFS trusts the contents of memory blindly. Please note that ZFS has no mechanisms to cope with bad memory. It is similar to every other file system in this regard. Here is a nice paper about ZFS and how it handles corrupt memory (it doesn't!).

    In the best case, bad memory corrupts file data and causes a few garbled files. In the worst case, bad memory mangles in-memory ZFS file system (meta) data structures, which may lead to corruption and thus loss of the entire zpool.

    It is important to put this into perspective. The reason ECC memory is more important for ZFS than for other file systems is purely practical. Conceptually, ZFS does not require ECC memory any more than any other file system does.

    Or let Matthew Ahrens, co-founder of the ZFS project, phrase it:

    There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than 
    any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.
    

    Now this is the important part. File systems such as NTFS, EXT4, etc. have (data recovery) tools that may allow you to rescue your files when things go bad due to bad memory. ZFS does not have such tools: if the pool is corrupt, all data must be considered lost; there is no option for recovery.

    So the impact of bad memory can be more devastating on a system with ZFS than on a system with NTFS, EXT4, XFS, etcetera. ZFS may force you to restore your data from backups sooner. Oh, by the way, you do make backups, right?

    I do have a personal concern5. I have nothing to substantiate this, but my thinking is that since ZFS is a way more advanced and complex file system, it may be more susceptible to the adverse effects of bad memory, compared to legacy file systems.

    ZFS, ECC memory and data integrity

    The main reason for using ZFS over legacy file systems is the ability to assure data integrity. But ZFS is only one piece of the data integrity puzzle. The other part of the puzzle is ECC memory.

    ZFS covers the risk of your storage subsystem serving corrupt data. ECC memory covers the risk of corrupt memory. If you leave any of these parts out, you are compromising data integrity.

    If you care about data integrity, you need to use ZFS in combination with ECC memory. If you don't care that much about data integrity, it doesn't really matter if you use either ZFS or ECC memory.

    Please remember that ZFS was developed to assure data integrity in a corporate IT environment, where data integrity is top priority and ECC memory in servers is the norm: a foundation on which ZFS was built. ZFS is not some magic pixie dust that protects your data under all circumstances. If its requirements are not met, data integrity is not assured.

    ZFS may be free, but data integrity and availability aren't. We spend money on extra hard drives so we can run RAID(Z) and lose one or more hard drives without losing our data. And we have to spend money on ECC memory, to assure bad memory doesn't have a similar impact.

    This is a bit of an appeal to authority and not to data or reason, but I think it's still relevant. FreeNAS is a NAS solution that uses ZFS as its foundation.

    They have this to say about ECC memory:

    However if a non-ECC memory module goes haywire, it can cause irreparable damage to your ZFS pool that can cause complete loss of the storage.
    ...
    If it’s imperative that your ZFS based system must always be available, ECC RAM is a requirement. If it’s only some level of annoying (slightly, moderately…) that you need to restore 
    your ZFS system from backups, non-ECC RAM will fit the bill.
    

    Hopefully your backups won't contain corrupt data. That is, if you make backups of all data in the first place.

    Many home NAS builders won't be able to afford to back up all the data on their NAS, only the most critical data. For example, if you store a large collection of video files, you may accept the risk that you may have to redownload everything. If you can't accept that risk, ECC memory is a must. If you are OK with such a scenario, non-ECC memory is OK and you can save a few bucks. It all depends on your needs.

    The risks faced in a business environment don't magically disappear when you apply the same technology at home. The main difference between a business setting and your home is the scale of operation, nothing else. The risks are still relevant and real.

    Things break, it's that simple. And although the smaller scale at which you operate at home makes you less likely to be affected, your NAS is probably not placed in a temperature- and humidity-controlled server room. As the temperature rises, so does the risk of memory errors6. And remember, memory may develop spontaneous and temporary defects (random bit flips). If your system is powered on 24/7, there is a higher chance that such a thing will happen.

    Conclusion

    Personally, I think that even for a home NAS, it's best to use ECC memory regardless of whether you use ZFS. It makes for a more stable hardware platform. If money is a real constraint, it's better to take a look at AMD's offerings than to skip ECC memory. If you select AMD hardware, it's important to make sure that both the CPU and the motherboard support ECC and that it is reported to be working.

    Still, if you decide to use non-ECC memory with ZFS: as long as you are aware of the risks outlined in this blog post and you're OK with that, fine. It's your data and you must decide for yourself what kind of protection and associated cost is reasonable for you.

    When people seek advice on their NAS builds, ECC memory should always be recommended. I think that nobody should create the impression that it's 'safe' for home use not to use ECC RAM, purely seen from a technical and data integrity standpoint. People must understand that they are taking a risk. There is a significant chance that they will never experience problems, but there is no guarantee. Do they accept the consequences if it does go wrong?

    If data integrity is not that important - because the data itself is not critical - I find it perfectly reasonable that people may decide not to use ECC memory and save a few hundred dollars. In that case, it would be perfectly reasonable not to use ZFS either, which also opens up other file system and RAID options that may better suit their particular needs.

    Questions and answers

    Q: When I bought my non-ECC memory, I ran memtest86+ and no errors were found, even after a burn-in test. So I think I'm safe.

    A: No. A memory test with memtest86+ is just a snapshot in time. At the time you ran the test, you had some assurance that the memory was fine. It could have gone bad right now, while you are reading these words, and could be corrupting your data as we speak. So running memtest86+ frequently doesn't really buy you much.

    Q: Did you see that article by Brian Moses?

    A: Yes, and I disagree with his views, but I really appreciate the fact that he emphasises that you should really be aware of the risks involved and decide for yourself what suits your situation. A few points that are not OK in my opinion:

    Every bad stick of RAM I’ve experienced came to me that way from the factory and could be found via some burn-in testing.
    

    I've seen consumer equipment in my lifetime that suddenly developed memory errors after years of perfect operation. This argument from personal anecdote should not be used as a basis for decision making. Remember: memory errors are the norm, not the exception. Even at home. Things break, it's that simple. And having equipment running 24/7 doesn't help.

    Furthermore, Brian seems to think that you can mitigate the risk of non-ECC memory by spending money on other stuff, such as off-site backups. Brian himself links to an article that rebuts his position on this. Just for completeness: how valuable is a backup of corrupted data? How do you know which data was corrupted? ZFS won't save you here.

    Q: Should I use ZFS on my laptop or desktop?

    A: Running ZFS on your desktop or laptop is an entirely different use case compared to a NAS. I see no problem with this; I don't think this discussion applies to desktop/laptop usage. Especially because you are probably creating regular backups of your data to your NAS or a cloud service, right? If there are any memory errors, you will notice soon enough.

    Updates

    • Updated on August 11, 2015 to reflect that ZFS was not designed with ECC in mind. In this regard, it doesn't differ from other file systems.

    • Updated on April 3rd, 2015 - rewrote large parts of the whole article, to make it a better read.

    • Updated on January 18th, 2015 - rephrased some sentences. Changed the paragraph 'Inform people and give them a choice' to argue when it would be reasonable not to use ECC memory. Furthermore, I state more explicitly that ZFS itself has no mechanisms to cope with bad RAM.

    • Updated on February 21st, 2015 - I substantially rewrote this article to give a better perspective on the ZFS + ECC 'debate'.


    1. On x64 processors, the size of a word is 64 bits

    2. Windows will generate a "blue screen of death" and Linux will generate a "kernel panic".  

    3. It is very likely that the computer you're using (laptop/desktop) encountered a memory issue this year, but there is no way you can tell. Consumer hardware doesn't have any mechanisms to detect and report memory errors.  

    4. Microsoft has performed a study on one million crash reports they received over a period of 8 months on roughly a million systems in 2008. The result is a 1 in 1700 failure rate for single-bit memory errors in kernel code pages (a tiny subset of total memory).

      A consequence of confining our analysis to kernel code pages is that we will miss DRAM failures in the vast majority of memory. On a typical machine kernel code pages occupy roughly 30 MB of memory, which is 1.5% of the memory on the average system in our study. [...] since we are capturing DRAM errors in only 1.5% of the address space, it is possible that DRAM error rates across all of DRAM may be far higher than what we have observed. 

    5. I did not come up with this argument myself. 

    6. The absolutely fascinating concept of bitsquatting proved that hotter datacenters showed more bit flips.

    Tagged as : ZFS ECC
  2. Creating a Basic ZFS File System on Linux

    February 01, 2014

    Here are some notes on creating a basic ZFS file system on Linux, using ZFS on Linux.

    I'm documenting the scenario where I just want to create a file system that can tolerate at least a single drive failure and can be shared over NFS.

    Identify the drives you want to use for the ZFS pool

    The ZFS on Linux project advises not to use plain /dev/sdx (/dev/sda, etc.) devices but to use /dev/disk/by-id/ or /dev/disk/by-path/ device names.

    Device names for storage devices are not fixed, so /dev/sdx devices may not always point to the same disk. I've been bitten by this when first experimenting with ZFS: I did not follow this advice and could not access my zpool after a reboot, because I had removed a drive from the system.

    So you should pick the appropriate device from the /dev/disk/by-[id|path] folder. However, it's often difficult to determine which device in those folders corresponds to an actual disk drive.
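    You can inspect the mapping by hand; a plain directory listing shows which persistent name is a symlink to which /dev/sdx device, although with many drives this quickly becomes hard to read:

    ls -l /dev/disk/by-path/
    ls -l /dev/disk/by-id/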

    So I wrote a simple tool called showdisks which helps you identify which identifiers you need to use to create your ZFS pool.

    [screenshot: showdisks output listing the /dev/disk/by-path names of each drive]

    You can install showdisks yourself by cloning the project:

    git clone https://github.com/louwrentius/showtools.git
    

    And then just use showdisks like

    ./showdisks -sp     (where -s shows the size and -p shows the by-path name)
    

    For this example, I'd like to use all the 500 GB disk drives for a six-drive RAIDZ1 vdev. Based on the information from showdisks, this is the command to create the vdev:

    zpool create tank raidz1 pci-0000:03:00.0-scsi-0:0:21:0 pci-0000:03:00.0-scsi-0:0:19:0 pci-0000:02:00.0-scsi-0:0:9:0 pci-0000:02:00.0-scsi-0:0:11:0 pci-0000:03:00.0-scsi-0:0:22:0 pci-0000:03:00.0-scsi-0:0:18:0
    

    The 'tank' name can be anything you want, it's just a name for the pool.

    Please note that with newer bigger disk drives, you should test if the ashift=12 option gives you better performance.

    zpool create -o ashift=12 tank raidz1 <devices>
    

    I used this option on 2 TB disk drives and performance, especially read performance, improved twofold.
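    To verify which ashift value a pool ended up with, you can ask zdb to dump the pool configuration (a quick check, assuming the pool is called 'tank'):

    zdb -C tank | grep ashift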

    How to set up a RAID10-style pool

    This is how to create the ZFS equivalent of a RAID10 setup:

    zpool create tank mirror <device 1> <device 2> mirror <device 3> <device 4> mirror <device 5> <device 6>
    

    How many drives should I use in a vdev

    I've learned to use a 'power of two' (2,4,8,16) of drives for a vdev, plus the appropriate number of drives for the parity. RAIDZ1 = 1 disk, RAIDZ2 = 2 disks, etc.

    So the optimal number of drives for RAIDZ1 would be 3,5,9,17. RAIDZ2 would be 4,6,10,18 and so on. Clearly in the example above with six drives in a RAIDZ1 configuration, I'm violating this rule of thumb.

    How to disable the ZIL or disable sync writes

    You can expect poor throughput performance if you use the ZIL / honour synchronous writes. For safety reasons, ZFS does honour sync writes by default; it's an important feature of ZFS to guarantee data integrity. For storage of virtual machines or databases, you should not turn off the ZIL, but use an SSD for the SLOG to get performance to acceptable levels.

    For a simple (home) NAS box, the ZIL is not so important and can quite safely be disabled, as long as you have your server on a UPS and have it shut down cleanly before the UPS battery runs out.

    This is how you turn off the ZIL / support for synchronous writes:

    zfs set sync=disabled <pool name>
    

    Disabling sync writes is especially important if you use NFS which issues sync writes by default.

    Example:

    zfs set sync=disabled tank
    
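    You can check the current setting, and revert to the default behaviour, like this:

    zfs get sync tank
    zfs set sync=standard tank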

    How to add an L2ARC cache device

    Use showdisks to lookup the actual /dev/disk/by-path identifier and add it like this:

    zpool add tank cache <device>
    

    Example:

    zpool add tank cache pci-0000:00:1f.2-scsi-2:0:0:0
    

    This is the result (on another zpool called 'server'):

    root@server:~# zpool status
      pool: server
     state: ONLINE
      scan: none requested
    config:
    
        NAME                               STATE     READ WRITE CKSUM
        server                             ONLINE       0     0     0
          raidz1-0                         ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:0:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:1:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:2:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:3:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:4:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:5:0  ONLINE       0     0     0
        cache
          pci-0000:00:1f.2-scsi-2:0:0:0    ONLINE       0     0     0
    

    How to monitor performance / I/O statistics

    One time sample:

    zpool iostat
    

    A sample every 2 seconds:

        zpool iostat 2
    

    More detailed information every 5 seconds:

        zpool iostat -v 5
    

    Example output:

                                          capacity     operations    bandwidth
    pool                               alloc   free   read  write   read  write
    ---------------------------------  -----  -----  -----  -----  -----  -----
    server                             3.54T  7.33T      4    577   470K  68.1M
      raidz1                           3.54T  7.33T      4    577   470K  68.1M
        pci-0000:03:04.0-scsi-0:0:0:0      -      -      1    143  92.7K  14.2M
        pci-0000:03:04.0-scsi-0:0:1:0      -      -      1    142  91.1K  14.2M
        pci-0000:03:04.0-scsi-0:0:2:0      -      -      1    143  92.8K  14.2M
        pci-0000:03:04.0-scsi-0:0:3:0      -      -      1    142  91.0K  14.2M
        pci-0000:03:04.0-scsi-0:0:4:0      -      -      1    143  92.5K  14.2M
        pci-0000:03:04.0-scsi-0:0:5:0      -      -      1    142  90.8K  14.2M
    cache                                  -      -      -      -      -      -
      pci-0000:00:1f.2-scsi-2:0:0:0    55.9G     8M      0     70    349  8.69M
    ---------------------------------  -----  -----  -----  -----  -----  -----
    

    How to start / stop a scrub

    Start:

    zpool scrub <pool>
    

    Stop:

    zpool scrub -s <pool>
    

    Mount ZFS file systems on boot

    Edit /etc/default/zfs and set this parameter:

    ZFS_MOUNT='yes'
    

    How to enable sharing a file system over NFS:

    zfs set sharenfs=on <poolname>
    
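    Instead of 'on', which exports the file system read/write to everyone, you can also pass NFS export options through the sharenfs property. A sketch with a hypothetical dataset name and subnet:

    zfs set sharenfs="rw=@192.168.1.0/24" tank/data
    zfs get sharenfs tank/data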

    How to create a zvol for usage with iSCSI

    zfs create -V 500G <poolname>/volume-name
    
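    The resulting block device appears under /dev/zvol/ and can then be handed to your iSCSI target software. For example, with a hypothetical volume name:

    zfs create -V 500G tank/iscsi-vol01
    ls -l /dev/zvol/tank/iscsi-vol01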

    How to force ZFS to import the pool using disk/by-path

    Edit /etc/default/zfs and add

    ZPOOL_IMPORT_PATH=/dev/disk/by-path/
    
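    If a pool was already imported using plain /dev/sdx names, you can also export it and re-import it once by hand; the by-path names should then be recorded in the pool configuration:

    zpool export tank
    zpool import -d /dev/disk/by-path tank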

    Links to important ZFS information sources:

    Tons of information on using ZFS on Linux by Aaron Toponce:

    https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

    Understanding the ZIL (ZFS Intent Log)

    http://nex7.blogspot.nl/2013/04/zfs-intent-log.html

    Information about 4K sector alignment problems

    http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

    Important read about using the proper number of drives in a vdev

    http://forums.freenas.org/threads/getting-the-most-out-of-zfs-pools.16/

    Tagged as : ZFS
  3. Things You Should Consider When Building a ZFS NAS

    December 29, 2013

    ZFS is a modern file system designed by Sun Microsystems, targeted at enterprise environments. Many features of ZFS also appeal to home NAS builders and with good reason. But not all features are relevant or necessary for home use.

    I believe that most home users building their own NAS are just looking for a way to create a large centralised storage pool. As long as the solution can saturate gigabit ethernet, performance is not much of an issue. Workloads are typically single-client and sequential in nature.

    If this description rings true for your environment, there are some options regarding ZFS that are often very popular but not very relevant to you. Furthermore, there are also some facts that you should take into account when preparing for your own NAS build.

    Expanding your pool may not be as simple as you think

    Added August 2015

    If you are familiar with regular hardware or software RAID, you might expect to use on-line capacity expansion, where you just grow a RAID array with extra drives as you see fit. Many hardware RAID cards support this feature and Linux software RAID (MDADM) supports this too. This is very economical: just add an extra drive and you gain some space. Very flexible and simple.

    But ZFS does not support this approach. ZFS requires you to create a new 'RAID array' and add it to the pool. So you will lose extra drives to redundancy.

    To be more precise: you cannot expand VDEVs, you can only add VDEVs to a pool. And each VDEV requires its own redundancy.

    So if you start with a single 6-disk RAIDZ2, you may end up with two 6-disk RAIDZ2 VDEVs. This means you use 4 out of 12 drives for redundancy. If you had started out with a 10-disk RAIDZ2, you would only lose 2 drives to redundancy. Example:

    A: 2 x 6-disk RAIDZ2 consisting of 4 TB drives = 12 disks - 4 for redundancy = 8 x 4 TB = 32 TB net capacity.

    B: 1 x 10-disk RAIDZ2 consisting of 4 TB drives = 10 disks - 2 for redundancy = 8 x 4 TB = 32 TB net capacity.

    Option A costs you two extra drives at $150 each = $300, and also requires two extra SATA ports and chassis slots.

    Option B will cost you 4 extra drives upfront, space you may not need immediately.

    Also take into account that there is such a thing as a 'recommended number of disks in a VDEV' depending on the redundancy used. This is discussed further down.

    Tip: there is a 'trick' to expand a VDEV. If you replace every drive with a larger one, you can then resize the VDEV and the pool. So replacing 2 TB drives with 4 TB drives would double your capacity without adding an extra VDEV.

    This approach requires a full VDEV rebuild after each drive replacement. So you may understand that this takes quite some time, during which you are running with no (RAIDZ) or less (RAIDZ2) redundancy. But it does work.

    If you have an additional spare SAS/SATA port and power, you can keep the redundancy and do an 'on-line' replace of the drive. This way, you don't lose or reduce redundancy during a rebuild. This is relatively ideal if you also have room for an additional drive in the chassis.

    This can be quite some work if you do have an available SATA port, but no additional drive slots. You will have to open the chassis, find a temporary spot for the new drive and then after the rebuild, move the new drive into the slot of the old one.
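    A minimal sketch of this drive-replacement approach, assuming a pool named 'tank' and the new drive attached to a spare port:

    zpool set autoexpand=on tank                    # let the pool grow once all drives are replaced
    zpool replace tank <old-device> <new-device>    # resilver onto the new drive
    zpool status tank                               # wait until the resilver has finished

    Repeat this for every drive in the VDEV; with autoexpand enabled, the extra capacity becomes available after the last replacement.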

    There is a general recommendation not to mix different VDEV sizes in a pool, but for home usage, this is not an issue. So you could - for example - expand a pool based on a 6-drive VDEV with an additional 4-drive VDEV RAIDZ2.

    Remember: lose one VDEV and you lose your entire pool. So I would not recommend mixing RAIDZ and RAIDZ2 VDEVs in a pool.

    You don't need a SLOG for the ZIL

    Quick recap: the ZIL or ZFS Intent Log is - as I understand it - only relevant for synchronous writes. If data integrity is important to an application, like a database server or a virtual machine, writes are performed synchronously. The application wants to make sure that the data is actually stored on the physical storage media and it waits for a confirmation from ZFS that it has done so. Only then will it continue.

    Asynchronous writes, on the contrary, never hit the ZIL. They are just cached in RAM and written to the VDEV in one sequential swoop when the next transaction group commit is performed (currently by default every 5 seconds). In the meantime, the application gets a confirmation from ZFS that the data is stored (a white lie) and just continues where it left off. ZFS caches the write in memory and actually writes the data to the storage VDEV when it sees fit (FIFO).

    As you may understand, asynchronous writes are way faster because they can be cached and ZFS can reorder the I/O to make it more sequential and prevent random I/O from hitting the VDEV. This is what I understood from this source.

    So if you encounter synchronous writes, they must be committed to the ZIL (thus VDEV) and this causes random I/O patterns on the VDEV, degrading performance significantly.

    The cool thing about ZFS is that it does provide the option to store the ZIL on a dedicated device called the SLOG. This doesn't do anything for performance by itself, but the secret ingredient is using a solid state drive as the SLOG, ideally in a mirror to ensure data integrity and to maintain performance in the case of a SLOG device failure.
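    For reference, adding a mirrored SLOG to an existing pool looks roughly like this (device names are placeholders):

    zpool add tank log mirror <ssd-1> <ssd-2>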

    For business critical environments, a separate SLOG device based on SSDs is a no-brainer. But for home use? If you don't have a SLOG, you still have a ZIL, it's only not as fast. That's not a real problem for single-client sequential throughput.

    For home usage, you may even consider how much you care about data integrity. That sounds strange, but the ZIL is used to recover from the event of a sudden power-loss. If your NAS is attached to a UPS, this is not much of a risk, you can perform a controlled shutdown before the batteries run out of power. The remaining risk is human error or some other catastrophic event within your NAS.

    So all data at rest, already stored on your NAS, is never at risk. It's only data that is in the process of being committed to storage that may get scrambled. But again: this is a home situation. Maybe restart your file transfer and you are done. You still have a copy of the data on the source device. This is entirely different from a setup with databases or virtual machines.

    Data integrity of data at rest is vitally important. The ZIL only protects data in transit. It has nothing to do with the data already committed to the VDEV.

    I see so many NAS builders being talked into buying specific SSDs to be used for the ZIL, whereas they probably won't benefit from them at all. It's just too bad.

    You don't need L2ARC cache

    ZFS relies heavily on caching of data to deliver decent performance, especially read performance. RAM provides the fastest cache and that is where the first level of caching lives: the ARC (Adaptive Replacement Cache). ZFS is smart and learns which data is often requested and keeps it in the ARC.

    But the size of the ARC is limited by the amount of RAM available. This is why you can add a second cache tier, based on SSDs. SSDs are not as fast as RAM, but still way faster than spinning disks. And they are cheaper than RAM if you look at their capacity.

    For additional more detailed information, go to this site

    L2ARC is important when you have multiple users or VMs accessing the same data sets. In this case, L2ARC based on SSDs will improve performance significantly. But if we just take a look at the average home NAS build, I'm not sure how the L2ARC adds any benefit. ZFS has no problem with single-client sequential file transfers so there is no benefit in implementing a L2ARC.

    Update 2015-02-08: There is even a downside to having an L2ARC cache. All the metadata regarding data stored in the L2ARC is kept in memory, eating away at your ARC, so your ARC becomes less effective (source).
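    On ZFS on Linux you can see both numbers for yourself: the current ARC size and the amount of ARC memory consumed by L2ARC headers are exposed in the arcstats kstat (field names as found on ZFS on Linux):

    awk '$1 == "size" || $1 == "l2_hdr_size"' /proc/spl/kstat/zfs/arcstats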

    You don't need deduplication and compression

    For the home NAS, most data you store on it (music, videos, etc.) is already highly compressed and additional compression only wastes performance. It is a cool feature, but not so much for home use. If you are planning to store other types of data, compression actually may be of interest (documents, backups of VMs, etc.). It is suggested by many (and in the comments) that with LZ4 compression, you don't lose performance (except for some CPU cycles) and with compressible data, you even gain performance, so you could just enable it and forget about it.
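    Enabling LZ4 and checking what it actually buys you is straightforward (assuming a pool named 'tank'):

    zfs set compression=lz4 tank
    zfs get compression,compressratio tank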

    Whereas compression does not do much harm, deduplication is more relevant in business environments where users are sloppy and store multiple copies of the same data in different locations. I'm quite sure you don't want to sacrifice RAM and performance for ZFS to keep track of duplicates you probably don't have.

    You don't need an ocean of RAM

    The absolute minimum RAM for a viable ZFS setup is 4 GB but there is not a lot of headroom for ZFS here. ZFS is quite memory hungry because it uses RAM as a buffer so it can perform operations like checksums and reorder all I/O to be sequential.

    If you don't have sufficient buffer memory, performance will suffer. 8 GB is probably sufficient for most arrays. If your array is faster, more memory may be required to actually benefit from this performance. For maximum performance, you should have enough memory to hold 5 seconds worth of maximum write throughput ( 5 x 400MB/s = 2GB ) and leave sufficient headroom for other ZFS RAM requirements. In the example, 4 GB RAM could be sufficient.

    For most home users, saturating gigabit is already sufficient so you might be safe with 8 GB of RAM in most cases. More RAM may not provide much more benefit, but it will increase power consumption.
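    If you want to put a hard cap on how much RAM the ARC may use on ZFS on Linux, you can set the zfs_arc_max module parameter, for example in /etc/modprobe.d/zfs.conf (the value is in bytes; 4 GiB is shown here as an arbitrary example, and the change takes effect after reloading the module or rebooting):

    options zfs zfs_arc_max=4294967296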

    There is an often cited rule that you need 1 GB of RAM for every TB of storage, but this is not true for home NAS solutions. This is only relevant for high-performance multi-user or multi-VM environments.

    Additional information about RAM requirements can be found here

    You do need ECC RAM if you care about data integrity

    The money saved on a SLOG or L2ARC cache can be better spent on ECC memory.

    ZFS does not rely on the quality of individual disks. It uses checksums (and parity or mirror copies for repair) to verify that disks don't lie about the data stored on them (data corruption).

    But ZFS can't verify the contents of RAM, so here ZFS relies on the reliability of the hardware. And there is a reason why we use RAID or redundant power supplies in our server equipment: hardware fails. RAM fails too. This is the reason why every server product by well-known vendors like HP, Dell, IBM and Supermicro only supports ECC memory. RAM errors occur more frequently than you may think.

    ECC (Error Checking and Correcting) RAM corrects and detects RAM errors. This is the only way you can be fairly sure that ZFS is not fed with corrupted data. Keep in mind: with bad RAM, it is likely that corrupted data will be written to disk without ZFS ever being aware of it (garbage in - garbage out).

    Please note that the quality of your RAM memory will not directly affect any data that is at rest and already stored on your disks. Existing data will only be corrupted with bad RAM if it is modified or moved around. ZFS will probably detect checksum errors, but it will be too late by then...

    To me, it's simple. If you care enough about your data that you want to use ZFS, you should also be willing to pay for ECC memory. You are giving yourself a false sense of security if you do not use ECC memory. ZFS was never designed for consumer hardware; it was intended to be used on server hardware with ECC memory, because it was designed with data integrity as the topmost priority.

    There are entry-level servers that do support ECC memory and can be had fairly cheap with 4 hard drive bays, like the HP ProLiant MicroServer Gen8.

    I wrote an article about a reasonably priced CPU+RAM+MB combo that does support ECC memory starting at $360.

    If you feel lucky, go for good-quality non-ECC memory. But do understand that you are taking a risk here.

    Understanding random I/O performance

    Added August 2015

    With ZFS, the rule of thumb is this: regardless of the number of drives in a RAIDZ(2/3) VDEV, you always get roughly the random I/O performance of a single drive in the VDEV1.

    Now I want to make the case here that if you are building your own home NAS, you shouldn't care about random I/O performance too much.

    If you want better random I/O performance of your pool, the way to get it is to:

    1. add more VDEVS to your pool
    2. add more RAM/L2ARC for caching
    3. use disks with higher RPM or SSDs combined with option 1.

    Regarding point 1:

    So if you want the best random I/O performance, you should just use a ton of two-drive mirror VDEVs in the pool, so you essentially create a large RAID 10. This is not very space-efficient, so probably not so relevant in the context of a home NAS.
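    Creating such a pool is a single zpool create with a series of mirror groups. A sketch with plain sdX names to match the example below (in practice, use /dev/disk/by-id or by-path names):

    zpool create testpool \
      mirror sdc sdd mirror sde sdf mirror sdg sdh mirror sdi sdj mirror sdk sdl \
      mirror sdm sdn mirror sdo sdp mirror sdq sdr mirror sds sdt mirror sdu sdv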

    Example similar to RAID 10:

    root@bunny:~# zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    testpool  59.5K  8.92T    19K  /testpool
    
    root@bunny:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
    config:
    
        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
          mirror-5  ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdn     ONLINE       0     0     0
          mirror-6  ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
          mirror-7  ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
          mirror-8  ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
          mirror-9  ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0
    

    Another option, if you need better storage efficiency, is to use multiple RAIDZ or RAIDZ2 VDEVS in the pool. In a way, you're then creating the equivalent of a RAID50 or RAID60.

    Example similar to RAID 50:

    root@bunny:~# zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    testpool  77.5K  14.3T  27.2K  /testpool
    
    root@bunny:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
    config:
    
    NAME        STATE     READ WRITE CKSUM
    testpool    ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
      raidz1-1  ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
      raidz1-2  ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdq     ONLINE       0     0     0
      raidz1-3  ONLINE       0     0     0
        sdr     ONLINE       0     0     0
        sds     ONLINE       0     0     0
        sdt     ONLINE       0     0     0
        sdu     ONLINE       0     0     0
        sdv     ONLINE       0     0     0
    

    You only need to deploy these kinds of pool/VDEV configurations if you have a valid reason to need the random I/O performance they provide. Creating fewer but larger VDEVs is often more space-efficient and will still saturate gigabit when transferring large files.

    It's ok to use multiple VDEVs of different drive sizes

    This is only true in the context of a home NAS.

    Let's take an example. You have an existing pool consisting of a single RAIDZ VDEV with 4 x 2 TB drives and your pool is filling up.

    It's then perfectly fine in the context of a home NAS to add a second VDEV consisting of a 5 x 4 TB RAIDZ.

    ZFS will take care of how data is distributed across the VDEVs.

    It is NOT recommended to mix different RAIDZ schemas, so VDEV 1 = RAIDZ and VDEV 2 = RAIDZ2. Remember that losing a single VDEV = losing the whole pool. It doesn't make sense to mix redundancy levels.

    VDEVs should consist of the optimal number of drives

    Added August 2015: If you use the large_blocks feature and use 1MB records, you don't need to adhere to the rule of always putting a certain number of drives in a VDEV to prevent significant loss of storage capacity.

    This enables you to create an 8-drive RAIDZ2 where normally you would have to create either a RAIDZ2 VDEV that consists of 6 drives or 10 drives.

    For home use, expanding storage by adding VDEVs is often suboptimal because you may spend more disks on redundancy than required, as explained earlier. The support of large_blocks allows you to buy the number of disks upfront that suits current and future needs.

    In my own personal case, with my 19" chassis filled with 24 drives, I would enable the large_blocks feature and create a single 24-drive RAID-Z3 VDEV to give me optimal space and still very good redundancy.

    The large_blocks feature is supported on ZFS on Linux since version 0.6.5 (September 2015).
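    A sketch of what enabling this looks like for a pool named 'tank' (the feature flag only needs to be enabled explicitly on pools created before the feature was available; the larger records are then selected per dataset via the recordsize property):

    zpool set feature@large_blocks=enabled tank
    zfs set recordsize=1M tank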

    Thanks to user "SirMaster" on Reddit for introducing this feature to me.


    Original advice:

    Depending on the type of 'RAID' you may choose for the VDEV(s) in your ZFS pool, you might want to make sure you only put in the right number of disks in the VDEV.

    This is important: if you don't use the right number of drives, performance will suffer, but more importantly, you will lose storage space, which can add up to over 10% of the available capacity. That's quite a waste.

    This is a straight copy&paste from sub.mesa's post

    The following ZFS pool configurations are optimal for modern 4K sector harddrives:
    RAID-Z: 3, 5, 9, 17, 33 drives
    RAID-Z2: 4, 6, 10, 18, 34 drives
    RAID-Z3: 5, 7, 11, 19, 35 drives
    

    Sub.mesa also explains the details on why this is true. And here is another example.

    The gist is that you must use a power of two for your data disks and then add the number of parity disks required for your RAIDZ level on top of that. So 4 data disks + 1 parity disk (RAIDZ) is a total of 5 disks. Or 16 data disks + 2 parity disks (RAIDZ2) is 18 disks in the VDEV.

    Take this into account when deciding on your pool configuration. Also, RAIDZ2 is absolutely recommended with more than 6-8 disks. The risk of losing a second drive during a 'rebuild' (resilvering) is just too high with current high-density drives.

    You don't need to limit the number of data disks in a VDEV

    For home use, creating larger VDEVs is not an issue, even an 18 disk VDEV is probably fine, but don't expect any significant random I/O performance. It is always recommended to use multiple smaller VDEVs to increase random I/O performance (at the cost of capacity lost to parity) as ZFS does stripe I/O-requests across VDEVs. If you are building a home NAS, random I/O is probably not very relevant.

    You don't need to run ZFS at home

    ZFS is cool technology and it's perfectly fine to run ZFS at home. However, the world doesn't end if you don't.


    1. https://blogs.oracle.com/roch/entry/when_to_and_not_to 

    Tagged as : ZFS Storage
