Things You Should Consider When Building a ZFS NAS

December 29, 2013 Category: ZFS

ZFS is a modern file system designed by Sun Microsystems, targeted at enterprise environments. Many features of ZFS also appeal to home NAS builders and with good reason. But not all features are relevant or necessary for home use.

I believe that most home users building their own NAS are just looking for a way to create a large centralised storage pool. As long as the solution can saturate gigabit ethernet, performance is not much of an issue. Workloads are typically single-client and sequential in nature.

If this description rings true for your environment, there are some options regarding ZFS that are often very popular but not very relevant to you.

You don't need a SLOG for the ZIL

Quick recap: the ZIL, or ZFS Intent Log, is (as I understand it) only relevant for synchronous writes. If data integrity is important to an application, like a database server or a virtual machine, writes are performed synchronously. The application wants to make sure that the data is actually stored on the physical storage media and it waits for a confirmation from ZFS that it has done so. Only then will it continue.

Asynchronous writes, on the contrary, never hit the ZIL. They are just cached in RAM and written to the VDEV in one sequential swoop when the next transaction group commit is performed (currently every 5 seconds by default). In the meantime, the application gets a confirmation from ZFS that the data is stored (a white lie) and just continues where it left off. ZFS keeps the write in memory and only writes the data to the storage VDEV when it feels like it (FIFO).
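To make the difference concrete, here is a minimal sketch of what the two kinds of writes look like from the application's side. It is only an illustration: the function names and file names are made up, and the code says nothing about how ZFS handles the request internally. The point is that a plain write returns as soon as the operating system has accepted the data, while a write followed by fsync waits until the operating system reports the data as stable. On ZFS, it is that second kind of request that involves the ZIL.

```python
import os
import tempfile


def write_async(path: str, data: bytes) -> None:
    """Buffered write: returns once the OS has accepted the data."""
    with open(path, "wb") as f:
        f.write(data)
        # No fsync here: the data may still be sitting in RAM when we return.


def write_sync(path: str, data: bytes) -> None:
    """Synchronous write: returns only after the OS reports the data as stable."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's own buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit the data to stable storage


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names: a file copy versus a database-style write.
        write_async(os.path.join(tmp, "movie.bin"), b"x" * 1024)
        write_sync(os.path.join(tmp, "database.bin"), b"x" * 1024)
```

A typical file copy to a home NAS behaves like the first function; databases and virtual machines behave like the second.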

As you may understand, asynchronous writes are way faster because they can be cached and ZFS can reorder the I/O to make it more sequential and prevent random I/O from hitting the VDEV. This is what I understood from this source.

So if you encounter synchronous writes, they must be committed to the ZIL (thus VDEV) and this causes random I/O patterns on the VDEV, degrading performance significantly.

The cool thing about ZFS is that it does provide the option to store the ZIL on a dedicated device called the SLOG. This doesn't do anything for performance by itself, but the secret ingredient is using a solid state drive as the SLOG, ideally in a mirror to ensure data integrity and to maintain performance in the case of a SLOG device failure.

For business-critical environments, a separate SLOG device based on SSDs is a no-brainer. But for home use? If you don't have a SLOG, you still have a ZIL; it's just not as fast. That's not a real problem for single-client sequential throughput.

For home usage, you may even consider how much you care about data integrity. That sounds strange, but the ZIL is used to recover from a sudden power loss. If your NAS is attached to a UPS, this is not much of a risk: you can perform a controlled shutdown before the batteries run out of power. The remaining risk is human error or some other catastrophic event within your NAS.

So data at rest, already stored on your NAS, is never at risk. It's only data that is in the process of being committed to storage that may get scrambled. But again: this is a home situation. Maybe restart your file transfer and you are done. You still have a copy of the data on the source device. This is entirely different from a setup with databases or virtual machines.

Data integrity of data at rest is vitally important. The ZIL only protects data in transit. It has nothing to do with the data already committed to the VDEV.

I see so many NAS builders being talked into buying specific SSDs for the ZIL when they probably won't benefit from them at all. It's just too bad.

You don't need L2ARC cache

ZFS relies heavily on caching of data to deliver decent performance, especially read performance. RAM provides the fastest cache and that is where the first level of caching lives, the ARC (Adaptive Replacement Cache). ZFS is smart and learns which data is often requested and keeps it in the ARC.

But the size of the ARC is limited by the amount of RAM available. This is why you can add a second cache tier, based on SSDs. SSDs are not as fast as RAM, but still way faster than spinning disks. And they are cheaper than RAM per gigabyte of capacity.

For more detailed information, go to this site.

L2ARC is important when you have multiple users or VMs accessing the same data sets. In this case, L2ARC based on SSDs will improve performance significantly. But if we just take a look at the average home NAS build, I'm not sure how the L2ARC adds any benefit. ZFS has no problem with single-client sequential file transfers so there is no benefit in implementing an L2ARC.

Update 2015-02-08: There is even a downside to having an L2ARC cache. All the metadata regarding data stored in the L2ARC cache is kept in memory, eating away at your ARC. As a result, your ARC becomes less effective (source).

You don't need deduplication and compression

For the home NAS, most data you store on it (music, videos, etc.) is already highly compressed and additional compression only wastes performance. It is a cool feature, but not so much for home use. If you are planning to store other types of data (documents, backups of VMs, etc.), compression may actually be of interest. It is suggested by many (and in the comments) that with LZ4 compression, you don't lose performance (except for some CPU cycles) and with compressible data, you even gain performance, so you could just enable it and forget about it.
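The claim that already-compressed data gains nothing from a second round of compression is easy to check for yourself. The sketch below is only an illustration under a few assumptions: it uses zlib from the Python standard library as a stand-in for LZ4 (which is not in the standard library), random bytes as a stand-in for music and video files, and repetitive text as a stand-in for documents. The absolute numbers will differ from what ZFS achieves, but the contrast between the two kinds of data should be similar.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Random bytes stand in for already-compressed media (music, video).
media_like = os.urandom(1_000_000)

# Repetitive text stands in for documents or logs.
text_like = b"The quick brown fox jumps over the lazy dog.\n" * 20_000

print(f"media-like data: {compression_ratio(media_like):.2f}")
print(f"text-like data:  {compression_ratio(text_like):.2f}")
```

The media-like data should come out at roughly its original size, while the text-like data shrinks to a small fraction of it.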

Whereas compression may not do much harm, deduplication is a different story. It is more relevant in business environments where users are sloppy and store multiple copies of the same data at different locations. I'm quite sure you don't want to sacrifice RAM and performance for ZFS to keep track of duplicates you probably don't have.

You don't need an ocean of RAM

The absolute minimum RAM for a viable ZFS setup is 4 GB but there is not a lot of headroom for ZFS here. ZFS is quite memory hungry because it uses RAM as a buffer so it can perform operations like checksums and reorder all I/O to be sequential.

If you don't have sufficient buffer memory, performance will suffer. 8 GB is probably sufficient for most arrays. If your array is faster, more memory may be required to actually benefit from this performance. For maximum performance, you should have enough memory to hold 5 seconds' worth of maximum write throughput (5 x 400 MB/s = 2 GB) and leave sufficient headroom for other ZFS RAM requirements. In this example, 4 GB of RAM could be sufficient.
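The arithmetic above can be written down as a small helper. This is just a sketch of the rule of thumb from this article, not an official sizing formula; the function name is made up, and the 5-second default is the transaction group interval mentioned earlier.

```python
def write_buffer_gb(throughput_mb_s: float, txg_seconds: float = 5.0) -> float:
    """RAM needed to hold one transaction group of writes, in GB (1 GB = 1000 MB)."""
    return throughput_mb_s * txg_seconds / 1000


print(write_buffer_gb(400))  # the 400 MB/s array from the example above
print(write_buffer_gb(125))  # gigabit ethernet at its theoretical 125 MB/s
```

For a NAS that only has to keep up with gigabit ethernet, the write buffer itself is well under 1 GB. The rest of the RAM is headroom for the ARC and the operating system.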

For most home users, saturating gigabit is already sufficient so you might be safe with 8 GB of RAM in most cases. More RAM may not provide much more benefit, but it will increase power consumption.

There is an often cited rule that you need 1 GB of RAM for every TB of storage, but this is not true for home NAS solutions. This is only relevant for high-performance multi-user or multi-VM environments.

Additional information about RAM requirements can be found here

You do need ECC RAM if you care about data integrity

The money saved on a SLOG or L2ARC cache can be better spent on ECC RAM.

ZFS does not rely on the quality of individual disks. It uses checksums to verify that disks don't lie about the data stored on them (data corruption), and it uses redundancy (mirrors or parity) to repair the data when they do.

But ZFS can't verify the contents of RAM, so here ZFS relies on the reliability of the hardware. And there is a reason why we use RAID or redundant power supplies in our server equipment: hardware fails. RAM fails too. This is the reason why every server product by well-known vendors like HP, Dell, IBM and Supermicro only supports ECC memory. RAM errors occur more frequently than you may think.

ECC (Error-Correcting Code) RAM detects and corrects RAM errors. This is the only way you can be fairly sure that ZFS is not fed with corrupted data. Keep in mind: with bad RAM, it is likely that corrupted data will be written to disk without ZFS ever being aware of it (garbage in, garbage out).

Please note that the quality of your RAM will not directly affect any data that is at rest and already stored on your disks. Existing data will only be corrupted by bad RAM if it is modified or moved around. ZFS will probably detect checksum errors, but it will be too late by then...

To me, it's simple. If you care enough about your data that you want to use ZFS, you should also be willing to pay for ECC memory. You are giving yourself a false sense of security if you do not use ECC memory. ZFS was never designed for consumer hardware. It was destined to be used on server hardware using ECC memory, because it was designed with data integrity as the topmost priority.

There are entry-level servers that do support ECC memory and can be had fairly cheap with 4 hard drive bays, like the HP ProLiant MicroServer N54L.

I wrote an article about a reasonably priced CPU+RAM+MB combo that does support ECC memory starting at $360.

If you feel lucky, go for good-quality non-ECC memory. But do understand that you are taking a risk here.

You don't need to limit the number of data disks in a vdev

For home use, creating larger vdevs is not an issue. Even an 18-disk vdev is probably fine, but don't expect any significant random I/O. It is always recommended to use multiple smaller VDEVs to increase random I/O performance (at the cost of capacity lost to parity), as ZFS stripes I/O requests across VDEVs. If you are building a home NAS, random I/O is probably not very relevant.

Depending on the type of 'RAID' you choose for the VDEV(s) in your ZFS pool, you might want to make sure you put the right number of disks in each VDEV.

This is important. If you don't use the right number, performance will suffer, but more importantly, you will lose storage space, which can add up to over 10% of the available capacity. That's quite a waste.

This is a straight copy and paste from sub.mesa's post:

The following ZFS pool configurations are optimal for modern 4K sector harddrives:
RAID-Z: 3, 5, 9, 17, 33 drives
RAID-Z2: 4, 6, 10, 18, 34 drives
RAID-Z3: 5, 7, 11, 19, 35 drives

Sub.mesa also explains the details on why this is true. And here is another example.

The gist is that you must use a power of two for your data disks and then add the number of parity disks required for your RAIDZ level on top of that. So 4 data disks + 1 parity disk (RAIDZ) is a total of 5 disks. Or 16 data disks + 2 parity disks (RAIDZ2) is 18 disks in the VDEV.
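The rule is simple enough to turn into a few lines of code. The sketch below is my own illustration of the rule as stated above (a power of two for the data disks, plus the parity disks for the RAIDZ level); the function name and the cut-off of 35 disks are just choices made to reproduce the list quoted from sub.mesa.

```python
from typing import List

# Number of parity disks per RAIDZ level.
PARITY_DISKS = {"RAID-Z": 1, "RAID-Z2": 2, "RAID-Z3": 3}


def optimal_vdev_sizes(level: str, max_disks: int = 35) -> List[int]:
    """Return VDEV sizes of the form (power-of-two data disks) + parity disks."""
    parity = PARITY_DISKS[level]
    sizes = []
    data_disks = 2
    while data_disks + parity <= max_disks:
        sizes.append(data_disks + parity)
        data_disks *= 2  # next power of two
    return sizes


for level in PARITY_DISKS:
    print(level, optimal_vdev_sizes(level))
```

Running this prints the same three rows as the list above, which is a quick sanity check that the rule and the list agree.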

Take this into account when deciding on your pool configuration. Also, RAIDZ2 is absolutely recommended with more than 6-8 disks. The risk of losing a second drive during 'rebuild' (resilvering) is just too high with current high-density drives.