Articles in the ZFS category

  1. Things You Should Consider When Building a ZFS NAS

    December 29, 2013

    ZFS is a modern file system designed by Sun Microsystems, targeted at enterprise environments. Many features of ZFS also appeal to home NAS builders and with good reason. But not all features are relevant or necessary for home use.

    I believe that most home users building their own NAS are just looking for a way to create a large, centralised storage pool. As long as the solution can saturate gigabit ethernet, performance is not much of an issue. Workloads are typically single-client and sequential in nature.

    If this description rings true for your environment, there are some options regarding ZFS that are often very popular but not very relevant to you. Furthermore, there are also some facts that you should take into account when preparing for your own NAS build.

    Expanding your pool may not be as simple as you think

    Added August 2015

    If you are familiar with regular hardware or software RAID, you might expect to use on-line capacity expansion, where you just grow a RAID array with extra drives as you see fit. Many hardware RAID cards support this feature and Linux software RAID (MDADM) supports this too. This is very economical: just add an extra drive and you gain some space. Very flexible and simple.

    But ZFS does not support this approach. ZFS requires you to create a new 'RAID array' and add it to the pool. So you will lose extra drives to redundancy.

    To be more precise: you cannot expand VDEVs. You can only add VDEVs to a pool. And each VDEV requires its own redundancy.

    So if you start with a single 6-disk RAIDZ2, you may end up with two 6-disk RAIDZ2 VDEVs. This means you use 4 out of 12 drives for redundancy. If you had started out with a 10-disk RAIDZ2, you would only lose 2 drives to redundancy. Example:

    A: 2 x 6-disk RAIDZ2 consisting of 4 TB drives = 12 disks - 4 redundancy = 8 x 4 TB = 32 TB net capacity.

    B: 1 x 10-disk RAIDZ2 consisting of 4 TB drives = 10 disks - 2 redundancy = 8 x 4 TB = 32 TB net capacity.

    Option A loses you two drives at $150 = $300 and also requires 2 extra SATA ports and chassis slots.

    Option B will cost you 4 extra drives upfront, space you may not need immediately.
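    The capacity arithmetic above can be sketched in a few lines of shell (the drive size and disk counts are the example's assumptions):

    ```shell
    # Net capacity = (total disks - parity disks) * drive size, per the example.
    DRIVE_TB=4

    # Option A: two 6-disk RAIDZ2 VDEVs -> 12 disks, 4 used for redundancy
    a_net=$(( (12 - 4) * DRIVE_TB ))

    # Option B: one 10-disk RAIDZ2 VDEV -> 10 disks, 2 used for redundancy
    b_net=$(( (10 - 2) * DRIVE_TB ))

    echo "Option A: ${a_net} TB net, 4 drives lost to redundancy"
    echo "Option B: ${b_net} TB net, 2 drives lost to redundancy"
    ```

    Same net capacity, but option B spends two fewer drives on parity.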

    Also take into account that there is such a thing as a 'recommended number of disks in a VDEV' depending on the redundancy used. This is discussed further down.

    Tip: there is a 'trick' to expand a VDEV. If you replace every drive with a larger one, you can then resize the VDEV and the pool. So replacing 2 TB drives with 4 TB drives would double your capacity without adding an extra VDEV.

    This approach requires a full VDEV rebuild after each drive replacement, so it takes quite some time, during which you are running with no (RAIDZ) or reduced (RAIDZ2) redundancy. But it does work.

    If you have an additional spare SAS/SATA port and power, you can keep the redundancy and do an 'on-line' replace of the drive. This way, you don't lose or reduce redundancy during a rebuild. This is ideal if you also have room for an additional drive in the chassis.

    It can be quite some work if you do have an available SATA port but no additional drive slots. You will have to open the chassis, find a temporary spot for the new drive and then, after the rebuild, move the new drive into the slot of the old one.
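    The replace-every-drive procedure boils down to a few zpool commands. A hedged sketch; 'tank' and the device names are placeholders for your own pool and disks:

    ```shell
    # Sketch: grow a VDEV by replacing each drive with a larger one.
    # Let the pool grow automatically once all drives in a VDEV are replaced:
    zpool set autoexpand=on tank

    # With a spare port, replace 'on-line' so redundancy is never reduced;
    # repeat this (and wait for the resilver) for every drive in the VDEV:
    zpool replace tank ata-OLD_2TB_DRIVE ata-NEW_4TB_DRIVE
    zpool status tank   # wait until resilvering finishes before the next drive

    # If autoexpand was off, expand the replaced device explicitly afterwards:
    zpool online -e tank ata-NEW_4TB_DRIVE
    ```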

    There is a general recommendation not to mix different VDEV sizes in a pool, but for home usage this is not an issue. So you could - for example - expand a pool based on a 6-drive VDEV with an additional 4-drive RAIDZ2 VDEV.

    Remember: lose one VDEV and you lose your entire pool. So I would not recommend mixing RAIDZ and RAIDZ2 VDEVs in a pool.
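    For completeness, adding a second VDEV of the same redundancy level might look like this (pool and disk names are placeholders; in practice you would use persistent names from /dev/disk/):

    ```shell
    # Sketch: expand a pool by adding a second RAIDZ2 VDEV.
    # Dry run first: -n shows the resulting layout without changing anything.
    zpool add -n tank raidz2 sdi sdj sdk sdl sdm sdn

    # If the layout looks right, run it for real.
    zpool add tank raidz2 sdi sdj sdk sdl sdm sdn
    ```

    Note that `zpool add` is one-way: a data VDEV cannot simply be taken out again, so the dry run is worth the extra step.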

    You don't need a SLOG for the ZIL

    Quick recap: the ZIL or ZFS Intent Log is - as I understand it - only relevant for synchronous writes. If data integrity is important to an application, like a database server or a virtual machine, writes are performed synchronously. The application wants to make sure that the data is actually stored on the physical storage media, and it waits for confirmation from ZFS that this has happened. Only then will it continue.

    Asynchronous writes, by contrast, never hit the ZIL. They are just cached in RAM and written to the VDEV in one sequential swoop when the next transaction group commit is performed (currently every 5 seconds by default). In the meantime, the application gets a confirmation from ZFS that the data is stored (a white lie) and just continues where it left off. ZFS simply caches the write in memory and actually writes the data to the storage VDEV when it sees fit.

    As you may understand, asynchronous writes are way faster because they can be cached and ZFS can reorder the I/O to make it more sequential and prevent random I/O from hitting the VDEV. This is what I understood from this source.

    So if you encounter synchronous writes, they must be committed to the ZIL (thus the VDEV), and this causes random I/O patterns on the VDEV, degrading performance significantly.

    The cool thing about ZFS is that it does provide the option to store the ZIL on a dedicated device called the SLOG. This doesn't do anything for performance by itself, but the secret ingredient is using a solid state drive as the SLOG, ideally in a mirror to ensure data integrity and to maintain performance in case of a SLOG device failure.
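    If you do decide you want one, adding such a mirrored SLOG is a single command. A sketch with placeholder pool and device names:

    ```shell
    # Sketch: attach a mirrored SSD pair as SLOG to an existing pool.
    zpool add tank log mirror ata-SSD_ONE ata-SSD_TWO

    # A log VDEV can be removed again if it turns out you don't need it
    # (the 'mirror-1' name is a placeholder; check 'zpool status' for yours):
    zpool remove tank mirror-1
    ```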

    For business critical environments, a separate SLOG device based on SSDs is a no-brainer. But for home use? If you don't have a SLOG, you still have a ZIL, it's only not as fast. That's not a real problem for single-client sequential throughput.

    For home usage, you may even consider how much you care about data integrity. That sounds strange, but the ZIL is used to recover from a sudden power loss. If your NAS is attached to a UPS, this is not much of a risk: you can perform a controlled shutdown before the batteries run out of power. The remaining risk is human error or some other catastrophic event within your NAS.

    So all data at rest, already stored on your NAS, is never at risk. It's only data that is in the process of being committed to storage that may get scrambled. But again: this is a home situation. You probably just restart your file transfer and you are done; you still have a copy of the data on the source device. This is entirely different from a setup with databases or virtual machines.

    Data integrity of data at rest is vitally important. The ZIL only protects data in transit. It has nothing to do with the data already committed to the VDEV.

    I see so many NAS builders being talked into buying specific SSDs to be used for the ZIL when they probably won't benefit from them at all. That's just too bad.

    You don't need L2ARC cache

    ZFS relies heavily on caching to deliver decent performance, especially read performance. RAM provides the fastest cache, and that is where the first level of caching lives: the ARC (Adaptive Replacement Cache). ZFS is smart and learns which data is often requested and keeps it in the ARC.

    But the size of the ARC is limited by the amount of RAM available. This is why you can add a second cache tier based on SSDs. SSDs are not as fast as RAM, but still way faster than spinning disks. And per gigabyte, they are much cheaper than RAM.

    For additional, more detailed information, see this site

    L2ARC is important when you have multiple users or VMs accessing the same data sets. In that case, an L2ARC based on SSDs will improve performance significantly. But if we just take a look at the average home NAS build, I'm not sure how the L2ARC adds any benefit. ZFS has no problem with single-client sequential file transfers, so there is no benefit in implementing an L2ARC.

    Update 2015-02-08: there is even a downside to having an L2ARC cache. All the metadata for data stored in the L2ARC is kept in memory, eating away at your ARC, so your ARC becomes less effective (source).
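    For reference, an L2ARC device is added (and removed) like this; 'tank' and the device name are placeholders:

    ```shell
    # Sketch: add an SSD as L2ARC cache device to an existing pool.
    zpool add tank cache ata-SSD_CACHE

    # Cache devices only hold copies of pool data, so they can be
    # removed again at any time without any data loss:
    zpool remove tank ata-SSD_CACHE
    ```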

    You don't need deduplication and compression

    For the home NAS, most data you store on it (music, videos, etc.) is already highly compressed, and additional compression only wastes performance. It is a cool feature, but not so much for home use. If you are planning to store other types of data (documents, backups of VMs, etc.), compression actually may be of interest. It is suggested by many (including in the comments) that with LZ4 compression you don't lose performance (except for some CPU cycles), and with compressible data you even gain performance, so you could just enable it and forget about it.
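    Enabling LZ4 in the 'set and forget' fashion described above is a one-liner; the dataset name is a placeholder:

    ```shell
    # Sketch: enable LZ4 compression on a dataset.
    zfs set compression=lz4 tank/documents

    # Only data written after this point is compressed; check later
    # how much it actually saved:
    zfs get compression,compressratio tank/documents
    ```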

    While compression does not do much harm, deduplication is mostly relevant in business environments where users are sloppy and store multiple copies of the same data in different locations. I'm quite sure you don't want to sacrifice RAM and performance for ZFS to keep track of duplicates you probably don't have.

    You don't need an ocean of RAM

    The absolute minimum RAM for a viable ZFS setup is 4 GB, but that leaves little headroom for ZFS. ZFS is quite memory hungry because it uses RAM as a buffer so it can perform operations like checksumming and reorder I/O to be sequential.

    If you don't have sufficient buffer memory, performance will suffer. 8 GB is probably sufficient for most arrays. If your array is faster, more memory may be required to actually benefit from this performance. For maximum performance, you should have enough memory to hold 5 seconds worth of maximum write throughput ( 5 x 400MB/s = 2GB ) and leave sufficient headroom for other ZFS RAM requirements. In the example, 4 GB RAM could be sufficient.
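    The 5-second rule of thumb above is simple enough to sketch; the 400 MB/s array throughput is the example's assumption:

    ```shell
    # Write-buffer rule of thumb: RAM for ~5 seconds of maximum write throughput.
    THROUGHPUT_MB_S=400   # assumed maximum array write speed
    TXG_SECONDS=5         # default transaction group commit interval

    buffer_mb=$(( THROUGHPUT_MB_S * TXG_SECONDS ))
    echo "Write buffer: ${buffer_mb} MB"
    ```

    On top of this ~2 GB buffer you still need headroom for the ARC and the rest of the system, hence the 4 GB estimate in the example.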

    For most home users, saturating gigabit is already sufficient so you might be safe with 8 GB of RAM in most cases. More RAM may not provide much more benefit, but it will increase power consumption.

    There is an often cited rule that you need 1 GB of RAM for every TB of storage, but this is not true for home NAS solutions. This is only relevant for high-performance multi-user or multi-VM environments.

    Additional information about RAM requirements can be found here

    You do need ECC RAM if you care about data integrity

    The money saved on a SLOG or L2ARC cache is better spent on ECC RAM.

    ZFS does not rely on the quality of individual disks. It uses checksums to verify that disks don't lie about the data stored on them (data corruption).

    But ZFS can't verify the contents of RAM, so here ZFS relies on the reliability of the hardware. And there is a reason why we use RAID or redundant power supplies in our server equipment: hardware fails. RAM fails too. This is why server products by well-known vendors like HP, Dell, IBM and Supermicro support only ECC memory. RAM errors occur more frequently than you may think.

    ECC (Error-Correcting Code) RAM detects and corrects RAM errors. This is the only way you can be fairly sure that ZFS is not fed corrupted data. Keep in mind: with bad RAM, it is likely that corrupted data will be written to disk without ZFS ever being aware of it (garbage in, garbage out).

    Please note that the quality of your RAM memory will not directly affect any data that is at rest and already stored on your disks. Existing data will only be corrupted with bad RAM if it is modified or moved around. ZFS will probably detect checksum errors, but it will be too late by then...

    To me, it's simple. If you care enough about your data to use ZFS, you should also be willing to pay for ECC memory. You are giving yourself a false sense of security if you do not use it. ZFS was never designed for consumer hardware; it was intended for server hardware with ECC memory, because it was designed with data integrity as the highest priority.

    There are entry-level servers that do support ECC memory and can be had fairly cheap with 4 hard drive bays, like the HP ProLiant MicroServer Gen8.

    I wrote an article about a reasonably priced CPU+RAM+MB combo that does support ECC memory starting at $360.

    If you feel lucky, go for good-quality non-ECC memory. But do understand that you are taking a risk here.

    Understanding random I/O performance

    Added August 2015

    With ZFS, the rule of thumb is this: regardless of the number of drives in a RAIDZ(2/3) VDEV, you always get roughly the random I/O performance of a single drive in the VDEV.

    Now I want to make the case here that if you are building your own home NAS, you shouldn't care about random I/O performance too much.

    If you want better random I/O performance of your pool, the way to get it is to:

    1. add more VDEVS to your pool
    2. add more RAM/L2ARC for caching
    3. use disks with higher RPM or SSDs combined with option 1.

    Regarding point 1:

    So if you want the best random I/O performance, you should just use a ton of mirror VDEVs, essentially creating a large RAID 10. This is not very space-efficient, so probably not so relevant in the context of a home NAS.

    Example similar to RAID 10:

    root@bunny:~# zfs list
    testpool  59.5K  8.92T    19K  /testpool
    root@bunny:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
          mirror-5  ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdn     ONLINE       0     0     0
          mirror-6  ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
          mirror-7  ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
          mirror-8  ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
          mirror-9  ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0

    Another option, if you need better storage efficiency, is to use multiple RAIDZ or RAIDZ2 VDEVs in the pool. In a way, you're then creating the equivalent of a RAID50 or RAID60.

    Example similar to RAID 50:

    root@bunny:~# zfs list
    testpool  77.5K  14.3T  27.2K  /testpool
    root@bunny:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
    testpool    ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
      raidz1-1  ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
      raidz1-2  ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdq     ONLINE       0     0     0
      raidz1-3  ONLINE       0     0     0
        sdr     ONLINE       0     0     0
        sds     ONLINE       0     0     0
        sdt     ONLINE       0     0     0
        sdu     ONLINE       0     0     0
        sdv     ONLINE       0     0     0

    You only need to deploy these kinds of pool/VDEV configurations if you have a valid reason to need the random I/O performance they provide. Creating fewer but larger VDEVs is often more space-efficient and will still saturate gigabit when transferring large files.

    It's ok to use multiple VDEVs of different drive sizes

    This is only true in the context of a home NAS.

    Let's take an example. You have an existing pool consisting of a single RAIDZ VDEV with 4 x 2 TB drives and your pool is filling up.

    It's then perfectly fine in the context of a home NAS to add a second VDEV consisting of a 5 x 4 TB RAIDZ.

    ZFS will take care of how data is distributed across the VDEVs.

    It is NOT recommended to mix different RAIDZ schemas, e.g. VDEV 1 = RAIDZ and VDEV 2 = RAIDZ2. Remember that losing a single VDEV means losing the whole pool. It doesn't make sense to mix redundancy levels.

    VDEVs should consist of the optimal number of drives

    Added August 2015: If you use the large_blocks feature and use 1MB records, you don't need to adhere to the rule of always putting a certain number of drives in a VDEV to prevent significant loss of storage capacity.

    This enables you to create an 8-drive RAIDZ2 where normally you would have to create either a RAIDZ2 VDEV that consists of 6 drives or 10 drives.

    For home use, expanding storage by adding VDEVs is often suboptimal because you may spend more disks on redundancy than required, as explained earlier. The support of large_blocks allows you to buy the number of disks upfront that suits current and future needs.

    In my own personal case, with my 19" chassis filled with 24 drives, I would enable the large_blocks feature and create a single 24-drive RAID-Z3 VDEV to give me optimal space and still very good redundancy.

    The large_blocks feature is supported on ZFS on Linux since version 0.6.5 (September 2015).

    Thanks to user "SirMaster" on Reddit for introducing this feature to me.

    Original advice:

    Depending on the type of 'RAID' you choose for the VDEV(s) in your ZFS pool, you may want to make sure you put the right number of disks in each VDEV.

    This is important: if you don't use the right number, performance will suffer, but more importantly, you will lose storage space, which can add up to over 10% of the available capacity. That's quite a waste.

    This is a straight copy&paste from sub.mesa's post

    The following ZFS pool configurations are optimal for modern 4K sector harddrives:
    RAID-Z: 3, 5, 9, 17, 33 drives
    RAID-Z2: 4, 6, 10, 18, 34 drives
    RAID-Z3: 5, 7, 11, 19, 35 drives

    Sub.mesa also explains the details on why this is true. And here is another example.

    The gist is that you must use a power of two for your data disks and then add the number of parity disks required for your RAIDZ level on top of that. So 4 data disks + 1 parity disk (RAIDZ) is a total of 5 disks. Or 16 data disks + 2 parity disks (RAIDZ2) is 18 disks in the VDEV.
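    The rule can be expressed as a tiny helper; a sketch, where the first argument is a power-of-two number of data disks and the second is the parity count (1, 2 or 3 for RAIDZ, RAIDZ2, RAIDZ3):

    ```shell
    # Optimal VDEV width: power-of-two data disks plus the parity disks.
    optimal_width() {   # usage: optimal_width <data_disks> <parity_disks>
        echo $(( $1 + $2 ))
    }

    optimal_width 4 1    # RAIDZ:  4 data + 1 parity = 5 disks
    optimal_width 16 2   # RAIDZ2: 16 data + 2 parity = 18 disks
    optimal_width 8 3    # RAIDZ3: 8 data + 3 parity = 11 disks
    ```

    Running through the powers of two reproduces sub.mesa's table above.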

    Take this into account when deciding on your pool configuration. Also, RAIDZ2 is absolutely recommended with more than 6-8 disks. The risk of losing a second drive during a 'rebuild' (resilver) is just too high with current high-density drives.

    You don't need to limit the number of data disks in a VDEV

    For home use, creating larger VDEVs is not an issue; even an 18-disk VDEV is probably fine, but don't expect any significant random I/O performance. It is always recommended to use multiple smaller VDEVs to increase random I/O performance (at the cost of capacity lost to parity), as ZFS stripes I/O requests across VDEVs. But if you are building a home NAS, random I/O is probably not very relevant.

    You don't need to run ZFS at home

    ZFS is cool technology and it's perfectly fine to run ZFS at home. However, the world doesn't end if you don't.


    Tagged as : ZFS Storage
  2. ZFS on Linux: Monitor Cache Hit Ratio

    December 23, 2013

    I'm performing some FIO random read 4K I/O benchmarks on a ZFS file system. Since I didn't trust the numbers I got, I wanted to know how many of the IOPS were due to cache hits rather than disk hits.

    This is why I wrote a small shell script called archhitratio.

    Sample output:

    IOPs: 133 | ARC cache hit ratio: 48.00 % | Hitrate: 64 / Missrate: 69
    IOPs: 131 | ARC cache hit ratio: 48.00 % | Hitrate: 63 / Missrate: 68
    IOPs: 136 | ARC cache hit ratio: 49.00 % | Hitrate: 67 / Missrate: 69
    IOPs: 128 | ARC cache hit ratio: 46.00 % | Hitrate: 59 / Missrate: 69
    IOPs: 127 | ARC cache hit ratio: 46.00 % | Hitrate: 59 / Missrate: 68
    IOPs: 135 | ARC cache hit ratio: 48.00 % | Hitrate: 65 / Missrate: 70
    IOPs: 127 | ARC cache hit ratio: 45.00 % | Hitrate: 58 / Missrate: 69
    IOPs: 125 | ARC cache hit ratio: 44.00 % | Hitrate: 56 / Missrate: 69
    IOPs: 128 | ARC cache hit ratio: 46.00 % | Hitrate: 60 / Missrate: 68

    In this example, I'm performing a random read test on a 16GB data set. This host has 16 GB RAM and 6 GB of this dataset was already in memory from previous FIO runs. This is why we see a ~45% hit ratio.

    This is a more interesting result:

    IOPs: 1404 | ARC cache hit ratio: 90.0 % | Hitrate: 1331 / Missrate: 73
    IOPs: 1425 | ARC cache hit ratio: 90.0 % | Hitrate: 1350 / Missrate: 75
    IOPs: 1395 | ARC cache hit ratio: 90.0 % | Hitrate: 1323 / Missrate: 72
    IOPs: 1740 | ARC cache hit ratio: 90.0 % | Hitrate: 1664 / Missrate: 76
    IOPs: 1351 | ARC cache hit ratio: 90.0 % | Hitrate: 1277 / Missrate: 74
    IOPs: 1613 | ARC cache hit ratio: 90.0 % | Hitrate: 1536 / Missrate: 77
    IOPs: 1920 | ARC cache hit ratio: 90.0 % | Hitrate: 1845 / Missrate: 75
    IOPs: 1431 | ARC cache hit ratio: 90.0 % | Hitrate: 1354 / Missrate: 77
    IOPs: 1675 | ARC cache hit ratio: 90.0 % | Hitrate: 1598 / Missrate: 77
    IOPs: 1560 | ARC cache hit ratio: 90.0 % | Hitrate: 1484 / Missrate: 76
    IOPs: 1574 | ARC cache hit ratio: 90.0 % | Hitrate: 1500 / Missrate: 74
    IOPs: 2017 | ARC cache hit ratio: 90.0 % | Hitrate: 1946 / Missrate: 71
    IOPs: 1696 | ARC cache hit ratio: 90.0 % | Hitrate: 1623 / Missrate: 73
    IOPs: 1776 | ARC cache hit ratio: 90.0 % | Hitrate: 1702 / Missrate: 74
    IOPs: 1671 | ARC cache hit ratio: 90.0 % | Hitrate: 1597 / Missrate: 74
    IOPs: 1729 | ARC cache hit ratio: 90.0 % | Hitrate: 1656 / Missrate: 73
    IOPs: 1902 | ARC cache hit ratio: 90.0 % | Hitrate: 1828 / Missrate: 74
    IOPs: 2029 | ARC cache hit ratio: 90.0 % | Hitrate: 1956 / Missrate: 73
    IOPs: 2228 | ARC cache hit ratio: 90.0 % | Hitrate: 2161 / Missrate: 67
    IOPs: 2289 | ARC cache hit ratio: 90.0 % | Hitrate: 2216 / Missrate: 73
    IOPs: 2385 | ARC cache hit ratio: 90.0 % | Hitrate: 2277 / Missrate: 108
    IOPs: 2595 | ARC cache hit ratio: 90.0 % | Hitrate: 2524 / Missrate: 71
    IOPs: 2940 | ARC cache hit ratio: 90.0 % | Hitrate: 2872 / Missrate: 68
    IOPs: 2984 | ARC cache hit ratio: 90.0 % | Hitrate: 2872 / Missrate: 112
    IOPs: 2622 | ARC cache hit ratio: 90.0 % | Hitrate: 2385 / Missrate: 237
    IOPs: 1518 | ARC cache hit ratio: 90.0 % | Hitrate: 1461 / Missrate: 57
    IOPs: 3221 | ARC cache hit ratio: 90.0 % | Hitrate: 3150 / Missrate: 71
    IOPs: 3745 | ARC cache hit ratio: 90.0 % | Hitrate: 3674 / Missrate: 71
    IOPs: 3363 | ARC cache hit ratio: 90.0 % | Hitrate: 3292 / Missrate: 71
    IOPs: 3931 | ARC cache hit ratio: 90.0 % | Hitrate: 3856 / Missrate: 75
    IOPs: 3765 | ARC cache hit ratio: 90.0 % | Hitrate: 3689 / Missrate: 76
    IOPs: 4845 | ARC cache hit ratio: 90.0 % | Hitrate: 4772 / Missrate: 73
    IOPs: 4422 | ARC cache hit ratio: 90.0 % | Hitrate: 4350 / Missrate: 72
    IOPs: 5602 | ARC cache hit ratio: 90.0 % | Hitrate: 5531 / Missrate: 71
    IOPs: 5351 | ARC cache hit ratio: 90.0 % | Hitrate: 5279 / Missrate: 72
    IOPs: 6075 | ARC cache hit ratio: 90.0 % | Hitrate: 6004 / Missrate: 71
    IOPs: 6586 | ARC cache hit ratio: 90.0 % | Hitrate: 6515 / Missrate: 71
    IOPs: 7974 | ARC cache hit ratio: 90.0 % | Hitrate: 7907 / Missrate: 67
    IOPs: 4434 | ARC cache hit ratio: 90.0 % | Hitrate: 4180 / Missrate: 254
    IOPs: 9793 | ARC cache hit ratio: 90.0 % | Hitrate: 9721 / Missrate: 72
    IOPs: 9395 | ARC cache hit ratio: 90.0 % | Hitrate: 9300 / Missrate: 95
    IOPs: 6171 | ARC cache hit ratio: 90.0 % | Hitrate: 6089 / Missrate: 82
    IOPs: 9209 | ARC cache hit ratio: 90.0 % | Hitrate: 9142 / Missrate: 67
    IOPs: 14883 | ARC cache hit ratio: 90.0 % | Hitrate: 14817 / Missrate: 66
    IOPs: 11304 | ARC cache hit ratio: 90.0 % | Hitrate: 11152 / Missrate: 152
    IOPs: 228 | ARC cache hit ratio: 30.0 % | Hitrate: 71 / Missrate: 157
    IOPs: 8321 | ARC cache hit ratio: 90.0 % | Hitrate: 8072 / Missrate: 249
    IOPs: 15550 | ARC cache hit ratio: 90.0 % | Hitrate: 15450 / Missrate: 100
    IOPs: 11819 | ARC cache hit ratio: 90.0 % | Hitrate: 11683 / Missrate: 136
    IOPs: 28630 | ARC cache hit ratio: 90.0 % | Hitrate: 28367 / Missrate: 263
    IOPs: 40484 | ARC cache hit ratio: 90.0 % | Hitrate: 40409 / Missrate: 75
    IOPs: 104501 | ARC cache hit ratio: 90.0 % | Hitrate: 103982 / Missrate: 519
    IOPs: 164483 | ARC cache hit ratio: 90.0 % | Hitrate: 163997 / Missrate: 486
    IOPs: 229729 | ARC cache hit ratio: 90.0 % | Hitrate: 228956 / Missrate: 773
    IOPs: 236479 | ARC cache hit ratio: 90.0 % | Hitrate: 235886 / Missrate: 593
    IOPs: 249232 | ARC cache hit ratio: 90.0 % | Hitrate: 248836 / Missrate: 396
    IOPs: 259156 | ARC cache hit ratio: 90.0 % | Hitrate: 258968 / Missrate: 188
    IOPs: 276099 | ARC cache hit ratio: 90.0 % | Hitrate: 275857 / Missrate: 242
    IOPs: 249382 | ARC cache hit ratio: 90.0 % | Hitrate: 249287 / Missrate: 95

    What does this result mean? The RAM size is 16 GB and the test data size is only 6 GB. If you just keep performing random I/O, eventually all data will be in RAM. I believe that here you witness the moment when all data is in RAM and the already high IOPS go through the roof (250K IOPS). However, I cannot explain the increase in the miss rate.
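    For reference, the hit ratio reported on each line is derived from the cumulative 'hits' and 'misses' counters that ZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats; a script like this diffs successive snapshots of those counters. A minimal sketch of the calculation, run here against a fabricated snapshot so it works on any machine:

    ```shell
    # Sketch of the hit-ratio arithmetic behind such a script. A real run
    # would read /proc/spl/kstat/zfs/arcstats twice per interval and diff
    # the counters; the snapshot below is fabricated for illustration.
    cat > /tmp/arcstats.demo <<'EOF'
    name                            type data
    hits                            4    6400
    misses                          4    1600
    EOF

    hits=$(awk '$1 == "hits"   { print $3 }' /tmp/arcstats.demo)
    misses=$(awk '$1 == "misses" { print $3 }' /tmp/arcstats.demo)

    ratio=$(( 100 * hits / (hits + misses) ))
    echo "IOPs: $(( hits + misses )) | ARC cache hit ratio: ${ratio} % | Hitrate: ${hits} / Missrate: ${misses}"
    ```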

    Tagged as : ZFS
  3. Experiences Running ZFS on Ubuntu Linux 12.04

    October 18, 2012

    I really like ZFS because with current data set sizes, I do believe that data corruption may start becoming an issue. The thing is that the license under which ZFS is released does not permit it to be used in the Linux kernel. That's quite unfortunate, but there is hope. There is a project called 'ZFS on Linux' which provides ZFS support through a kernel module, circumventing any license issues.

    But as ZFS is a true next generation file system and the only one in its class stable enough for production use, I decided to give it a try.

    I used my existing download server running Ubuntu 12.04 LTS. I followed these steps:

    1. move all data to my big storage nas;
    2. destroy the existing MDADM RAID arrays;
    3. recreate a new storage array through ZFS;
    4. move all data back to the new storage array.

    Installation of ZFS is straightforward and well documented by the ZFS on Linux project. The main thing is how you set up your storage. My download server has six 500 GB disks and four 2 TB disks, a total of ten drives. So I decided to create a single zpool (logical volume) consisting of two vdevs (arrays): one vdev of the six 500 GB drives and a second vdev of the four 2 TB drives.
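    A layout like this is created with a single command listing both vdevs; a sketch with placeholder device paths (in practice, persistent names under /dev/disk/ as discussed below):

    ```shell
    # Sketch: one pool built from two RAIDZ vdevs in a single command.
    zpool create zpool \
        raidz1 disk1 disk2 disk3 disk4 \
        raidz1 disk5 disk6 disk7 disk8 disk9 disk10
    ```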

    root@server:~# zpool status
      pool: zpool
     state: ONLINE
     scan: scrub repaired 0 in 1h12m with 0 errors on Fri Sep  7 
        NAME                               STATE   READ WRITE CKSUM
        zpool                              ONLINE     0     0     0
          raidz1-0                         ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:1:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:2:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:3:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:4:0  ONLINE     0     0     0
          raidz1-1                         ONLINE     0     0     0
            pci-0000:00:1f.2-scsi-2:0:0:0  ONLINE     0     0     0
            pci-0000:00:1f.2-scsi-3:0:0:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:0:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:5:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:6:0  ONLINE     0     0     0
            pci-0000:03:04.0-scsi-0:0:7:0  ONLINE     0     0     0

    So the zpool consists of two vdevs, each consisting of its physical drives.

    Everything went smoothly so far. I did have one issue, though. I decided to remove a separate disk drive from the system that was no longer needed. As I had initially set up the arrays based on device names (/dev/sda, /dev/sdb), the array broke when device names changed due to the missing drive.

    So I repaired that by issuing these commands:

    zpool export zpool
    zpool import zpool -d /dev/disk/by-path/

    It's important to carefully read the FAQ of ZFS on Linux and understand that you should not use regular device names like /dev/sda for your ZFS array. It is recommended to use /dev/disk/by-path/ or /dev/disk/zpool/ precisely to prevent the issue I had with the disappeared drive.

    As discussed in my blog entry on why I decided not to use ZFS for my big 18 TB storage NAS, ZFS does not support 'growing' of an array as Linux software RAID does.

    As the zpool consists of different hard disk types, performance tests are not consistent. I've seen 450 MB/s read speeds on the zpool, which is more than sufficient for me.

    ZFS on Linux works, is fast enough and is easy to set up. If I were setting up my big storage NAS today, I would probably choose ZFS on Linux. I would accept that I could not just expand the array with extra drives the way MDADM permits you to grow an array.

    In a way, ZFS on Linux combines the best of both worlds: one of the best modern file systems with a modern and well-supported Linux distribution. Only the ZFS module itself may be the weak point, as it's fairly new on Linux and not optimised yet.

    Or we might just have to wait until Btrfs is mature enough for production use.

  4. Why I Do Not Use ZFS as a File System for My NAS

    February 28, 2011

    Many people have asked me why I do not use ZFS for my NAS storage box. This is a good question, and I have multiple reasons why I do not use ZFS and probably never will.

    A lot has changed since this article was first published. I do now recommend using ZFS. I've also based my new 71 TiB NAS on ZFS.

    The demise of Solaris

    ZFS was invented by Sun for the Solaris operating system. When I was building my NAS, the only full-featured and production-ready version of ZFS was implemented in Sun Solaris, and the only usable version of Solaris was OpenSolaris. I dismissed OpenSolaris because of the lack of hardware support and the small user base. This small user base is very important to me: more users means more testing and more support.

    The FreeBSD implementation of ZFS only became stable in January 2010, six months after I built my NAS (summer 2009). So FreeBSD was not an option at the time.

    I am glad that I didn't go for OpenSolaris, as Sun's new owner Oracle killed this operating system in August 2010. Although ZFS is open source software, I think it is effectively closed source already. The only open source version was through OpenSolaris, and that software has now been killed. Oracle will close the source of ZFS simply by not publishing the code of new features and updates; only their proprietary, closed source Solaris platform will receive updates. I must say that I have no proof of this, but Oracle seems to have no interest in open source software and almost seems hostile towards it.

    FreeBSD and ZFS

    So I built my NAS when ZFS was basically not an option yet. But with FreeBSD as of today, you can build a NAS based on ZFS, right? Sure, you can do that. I had no choice back then, but you do. To be honest, though, I still would not use ZFS. As of March 1st, 2011, I would still go with Linux software RAID and XFS.

    The reasons are maybe not that great; I just provide them for you. It's up to you to decide.

    I sincerely respect the FreeBSD community and platform, but it is not for me. It may be that I just have much more experience with Debian Linux and don't like changing platforms. I find the Debian installation process much more user friendly, and I see year-over-year improvement in Debian; I see none in the FreeBSD 8.2 release. Furthermore, I'm thrilled with the really big APT repository. Finally, I cannot foresee my future requirements, but I'm sure those requirements have a higher chance of being supported on Linux than on BSD.

    Furthermore, although FreeBSD has a community, it is relatively small. Resources on Debian and Ubuntu are abundant. I consider Linux the safer bet, also in terms of hardware support. My NAS must be simple to build and rock stable. I don't want a full-time job just getting my NAS to work and keeping it running.

    If you are experienced with FreeBSD, by all means, build a ZFS setup if you want. If you have to learn either BSD or Linux, I consider knowledge of Linux more valuable in the long run.

    ZFS is a hype

    This is the part where people may strongly disagree with me. I admire ZFS, but I consider it total overkill for home use. I have seen many people talk about ZFS the way Apple users talk about Apple products. It is a hype. Don't get me wrong: as a long-time Mac user, I'm also mocking myself here. I get the impression that ZFS is regarded as the second coming of Jesus Christ. It solves problems that I didn't even know existed. The only thing it can't do is beat Chuck Norris. But it does vacuum your house if you ask it to.

    As a side note, one of the things I do not like about ZFS is the terminology. It is just RAID 0, RAID 1, RAID 5 or RAID 6, but no, the ZFS people had to use different, cooler-sounding terms like RAID-Z. It is basically the same thing.
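    To illustrate how directly the terms map onto each other, here is a sketch of creating a double-parity array both ways. The device names and pool name are hypothetical; don't run this against disks that hold data.

    ```shell
    # Linux software RAID: a 6-disk double-parity array is called RAID 6.
    mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]

    # ZFS (FreeBSD device names): the same layout is called RAID-Z2.
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5
    ```

    Both commands give you an array of six disks that survives the loss of any two; only the vocabulary differs.
    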

    Okay, now back to the point: nobody at home needs ZFS. You may argue that nobody needs 18 TB of storage space at home either, but that's another story. Running ZFS means using FreeBSD or an out-of-the-box NAS solution based on FreeBSD; there aren't any other relevant options.

    Now, let's take a look at the requirements of most NAS builders. They want as much storage as possible at the lowest possible price. That's about it. Many people want to add disk drives as their demand for storage capacity grows. So they buy a solution with room for, say, 10 drives, start out with 4, and add disks when they need them.

    Linux allows you to 'grow' or 'expand' an array, just like most hardware RAID solutions. As far as I know, this feature is still not available in ZFS. Maybe it is not relevant in the enterprise world, but it is for most people who actually have to think about how they spend their money.
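    Growing an mdadm array is a three-step affair: add the disk, reshape the array, then grow the file system. A sketch, assuming a RAID 6 at /dev/md0 with six disks, a hypothetical new disk /dev/sdh and an XFS file system mounted at /storage (always back up before reshaping):

    ```shell
    # Add the new disk to the array as a spare.
    mdadm --add /dev/md0 /dev/sdh

    # Reshape the array to use the extra disk (6 -> 7 members).
    mdadm --grow /dev/md0 --raid-devices=7

    # Once the reshape has finished, grow the XFS file system
    # to fill the enlarged device (XFS is grown while mounted).
    xfs_growfs /storage
    ```

    The reshape runs in the background and can take many hours, but the array stays online throughout.
    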

    Furthermore, I don't understand why I can run any RAID array with decent performance on maybe 512 MB of RAM, while ZFS would simply crash with so little memory installed. You seem to need at least 2 GB to keep your system from crashing, and more is recommended if you want it to survive high load. I really can't wrap my mind around this. Honestly, I think this is insane.
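    For what it's worth, the memory hunger can at least be reined in: the ZFS ARC (its in-memory cache) can be capped on FreeBSD via loader tunables. A sketch of /boot/loader.conf entries, with illustrative values only; the right numbers depend on your system:

    ```shell
    # /boot/loader.conf (FreeBSD) -- cap the ZFS ARC so it cannot
    # consume nearly all RAM. Values below are illustrative.
    vfs.zfs.arc_max="1024M"
    vfs.zfs.arc_min="512M"
    ```

    This limits how much memory ZFS grabs for caching, though it does not change the fact that ZFS simply wants far more RAM than mdadm does.
    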

    ZFS does great things. Management is easy. Many features, like snapshots, are cool. But most features are just not required for a home setup. ZFS seems to solve a lot of 'scares' that I had only heard about since ZFS came along, like the RAID 5/6 write hole. Where others just hook up a UPS in the first place (if you don't use a UPS on your NAS, you might as well see if you are lucky running RAID 0), ZFS provides a solution that prevents data loss when power fails. The most interesting feature to me, though, is that ZFS checksums all data and detects corruption. It sounds useful, but how high are the chances that you actually need it?

    If ZFS were available under Linux as a native option instead of through FUSE, I would probably consider using it, provided I knew in advance that I would never want to expand or grow my array. But I am pessimistic about this scenario. It is not in Oracle's interest to change the license on ZFS to allow Linux to incorporate support for it in the kernel.

    To build my 20-disk RAID array, I had to puzzle with my drives to keep all my data while migrating to the new system. Some of the 20 disks came from my old NAS, so I had to repeatedly grow the array and add disks, which I could not have done with ZFS.

    Why I chose to build this setup

    The array is just a single 20-disk RAID 6 volume created with a single MDADM command. The second command I issued to make the array operational formatted this new 'virtual' disk with XFS, which takes just seconds. A UPS protects the system against power failure, and I have been happy with it for 1.5 years now. Never had any problems. Never had a disk failure. A single RAID 6 array is simple and fast. XFS is old but reliable. My whole setup is just this: extremely simple. I just love simple.
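    That simplicity can be sketched in full; this is roughly what the setup described above boils down to, with hypothetical device names (/dev/sdb through /dev/sdu) and mount point:

    ```shell
    # One command creates the 20-disk RAID 6 array...
    mdadm --create /dev/md0 --level=6 --raid-devices=20 /dev/sd[b-u]

    # ...and one command formats it with XFS, which takes seconds.
    mkfs.xfs /dev/md0

    # Mount it and you have one large storage volume.
    mount /dev/md0 /storage
    ```

    Two commands and a mount: that really is the entire storage stack of this NAS.
    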

    My array does not use LVM, so I cannot create snapshots or anything like that. But I don't need it. I just want so much storage that I don't have to think about it. And I think most people just want a storage share with lots of space. In that case, you don't need LVM; just an array with a file system on top of it. If you can grow the array and the file system, you're set for the future. Speaking of the future: note that on Linux, XFS is the only file system capable of addressing more than 16 TB of data. EXT4 is still limited to 16 TB.

    For the future, my hopes are that BTRFS will become a modern viable alternative to ZFS.
