ZFS RAIDZ Expansion Is Awesome but Has a Small Caveat

Tue 22 June 2021 Category: ZFS

Introduction

One of my most popular blog articles is this article about the "Hidden Cost of using ZFS for your home NAS". To summarise the key argument of this article:

Expanding ZFS-based storge can be relatively expensive / inefficient.

For example, if you run a ZFS pool based on a single 3-disk RAIDZ vdev (RAID5 equivalent2), the only way to expand a pool is to add another 3-disk RAIDZ vdev1.

You can't just add a single disk to the existing 3-disk RAIDZ vdev to create a 4-disk RAIDZ vdev because vdevs can't be expanded.

The impact of this limitation is that you have to buy all storage upfront even if you don't need the space for years to come.

Otherwise, by expanding with additional vdevs you lose capacity to parity you may not really want/need, which also limits the maximum usable capacity of your NAS.

RAIDZ vdev expansion

Fortunately, this limitation of ZFS is being addressed!

ZFS founder Matthew Ahrens created a pull request around June 11, 2021 detailing a new ZFS feature that would allow for RAIDZ vdev expansion.

Finally, ZFS users will be able to expand their storage by adding just one single drive at a time. This feature will make it possible to expand storage as-you-go, which is especially of interest to budget conscious home users3.

Jim Salter has written a good article about this on Ars Technica.

There is still a caveat

Existing data will be redistributed or rebalanced over all drives, including the freshly added drive. However, the data that was already stored on the vdev will not be restriped after the vdev is expanded. This means that this data is stored with the older, less efficient parity-to-data ratio.

I think Matthew Ahrends explains it best in his own words:

After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.

So, if you add a new drive to a RAIDZ vdev, you'll notice that after expansion, you will have less capacity available than you would theoretically expect.

However, it is even more important to understand that this effect accumulates. This is especially relevant for home users.

I think that the whole concept of starting with a small number of disks and expand-as-you-go is very desirable and typical for home users. But this also means that every time a disk is added to the vdev, existing data is still stored with the old data-to-parity rate.

Imagine that we have a 10-drive chassis and we start out with a 4-drive RAIDZ2.

If we keep adding drives5 conform this example, until the chassis is full at 10 drives, about 1.35 drives worth of capacity is 'lost' to parity overhead/efficiency loss4.

That is quite a lot of overhead or loss of capacity, I think.

How is this overhead calculated? If we would just buy 10 drives and create a 10-drive RAIDZ2 vdev, data-to-parity overhead is 20% meaning that 20% of the total raw capacity of the vdev is used for storing parity. This is the most efficient scenario in this case.

When we start out with the four-drive RAIDZ2 vdev, the data-to-parity overhead is 50%. That's a 30% overhead difference compared to the 'ideal' 10-drive setup.

As we keep adding drives, the relative overhead of the parity keeps dropping so we end up with 'multiple data sets' with different data-to-parity ratios, that are less efficient than the end-stage of 10 drives.

I created a google sheet to roughly estimate this overhead for each stage, but my math was totally off. Fortunately, Yorick rewrote the sheet, which can be found here. Thanks Yorick! Further more, Truenas user DayBlur shared additional insights on the calculations if you are interested in that.

The google sheet allows you to play with various variables to estimate how much capacity is lost for a given scenario. Please note that any losses that may arise because a number of drives is used that requires data to be padded - as discussed in the Ars Technica article - are not part of the calculation.

It is a bit unfortunate that especially in the scenario of the home user who want to start small and expand-as-you go that this overhead manifests itself so much. But there is good news!

Lost capacity can be recovered!

The overhead or 'lost capacity' can be recovered by rewriting existing data after the vdev has been expanded, because the data will then be written with the more efficient parity-to-data ratio of the larger vdev.

Rewriting all data may take quite some time and you may opt to postpone this step until the vdev has been expanded a couple of times so the parity-to-data ratio is now 'good enough' that significant storage gains can be had by rewriting the data.

Because capacity lost to overhead can be fully recovered, I think that this caveat is relatively minor, especially compared to the old situation where we had to expand a pool with entire vdevs and there was no way to recover any overhead.

There is currently no build-in mechanism to trigger this data rewrite as part of the native ZFS tools. This will be a manual process until somebody may create a script that automates this process. According to Matthew Ahrens, restriping the data as part of the vdev expansion process would be an effort of similar scale as the RAIDZ expansion itself.

Evaluation

I think it cannot be stated enough how awesome the RAIDZ vdev expansion feature is, especially for home users who want to start small and grow their storage over time.

Although the expansion process can accumulate quite a bit of overhead, that overhead can be recovered by rewriting existing data, which is probably not a problem for most people.

Despite all the awesome features and capabilities of ZFS, I think quite a few home users went with other storage solutions because of the relatively high expansion cost/overhead. Now that this barrier will be overcome, I think that ZFS will be more accessible to the home user DIY NAS crowd.

Release timeline

According to the Ars Technica article by Jim Salter, this feature will probably become available in August 2022, so we need to have some patience. Even so, you might want to already decide to build your new DIY NAS based on ZFS: by the time you may need to expand your storage, the feature may be available!

Update on some - in my opinion - bad advice

The podcast 2.5 admins (which I enjoy listening to) discussed the topic of RAIDZ expansion in episode 45.

There are two remarks made that I want to address, because I disagree with them.

Don't rewrite the data?

As in his Ars Technica article, Jim Salter keeps advocating not to bother rewriting the data after a vdev expansion, but I personally disagree with this advice. I hope I have demonstrated that if you keep adding drives, the parity overhead is significant enough for most home users to make it worthwhile to rewrite the data after a few drives have been added.

Just use mirrors!

I also disagree with the advice of using mirrors, especially for home users6. I personally think it is bad advice, because home users have other needs and desires as enterprise environments.

If 'just use mirrors' is still the advice, why did Matthew Ahrends build the whole RAIDZ vdev expansion feature in the first place? I think the RAIDZ vdev expansion is really beneficial for home users.

Maybe Jim and I have very different ideas about what a home user would want or need in a DIY NAS storage solution. I think that home users want this:

As much storage as possible for as little money as possible with acceptable redundancy.

In addition, I think that home users in general work with larger files (multiple megabytes at least). And if they sometimes work with smaller files, they accept some performance loss due to the lower random I/O performance of single RAIDZ vdevs7.

Frankly, to me it feels like the 'just use mirrors' advice is used to 'downplay' a significant limitation of ZFS8. Jim is a prolific writer on Ars Technica and has a large audience so his advice matters. So that's why I think it's sad that he sticks with 'just use mirrors' while that's clearly not in the best interest of most home users.

However, that's just my opinion, you decide for yourself what's best.


  1. The other method is to replace all existing drives one by one with larger ones. Only after you have replaced all drives will you be able to gain extra capacity so this method has a similar downside as just expanding with extra vdevs: you must buy multiple drives at once. In addition, I think this method is rather time consuming and cumbersome although people do use it to expand capacity. And to be fair: you can indeed add 4+ disk vdevs, vdevs with a higher RAIDZ level or mirrors but none of that makes sense in this context. 

  2. Just to illustrate the level of redundancy in terms of how many disks can be lost and still be operational. 

  3. I personally think that it's even great for small and medium business owners. Only larger businesses want to keep adding relatively large vdevs consisting of multiple drives because if they keep expanding with just one drive at a time, they may have to expand capacity very frequently which may not be practical. 

  4. If you would only upgrade once the pool is almost full - not recommended! - that overhead grows to 1.69 drives. 

  5. So you go from four to five drives. Then from five to six drives, and so on. 

  6. If random I/O performance is important, it is probably wise to go for SSD based storage anyway. 

  7. resolved by by ZFS vdev expansion obviously, when it lands in production. 

Comments