Articles in the ZFS category

  1. ZFS RAIDZ Expansion Is Awesome but Has a Small Caveat

    Tue 22 June 2021

    Introduction


    Update April 2023: It has been fairly quiet since this feature was announced. The GitHub PR is rather stale and people are wondering what the status and plans are. Meanwhile, FreeBSD announced in February 2023 that it expects to integrate RAIDZ expansion by Q3.


    One of my most popular blog articles is the one about the "Hidden Cost of using ZFS for your home NAS". To summarise its key argument:

    Expanding ZFS-based storage can be relatively expensive / inefficient.

    For example, if you run a ZFS pool based on a single 3-disk RAIDZ vdev (RAID5 equivalent2), the only way to expand a pool is to add another 3-disk RAIDZ vdev1.

    You can't just add a single disk to the existing 3-disk RAIDZ vdev to create a 4-disk RAIDZ vdev because vdevs can't be expanded.

    The impact of this limitation is that you have to buy all storage upfront even if you don't need the space for years to come.

    Otherwise, by expanding with additional vdevs you lose capacity to parity you may not really want/need, which also limits the maximum usable capacity of your NAS.

    RAIDZ vdev expansion

    Fortunately, this limitation of ZFS is being addressed!

    ZFS co-founder Matthew Ahrens created a pull request around June 11, 2021 detailing a new ZFS feature that would allow for RAIDZ vdev expansion.

    Finally, ZFS users will be able to expand their storage by adding just a single drive at a time. This feature will make it possible to expand storage as you go, which is especially of interest to budget-conscious home users3.

    Jim Salter has written a good article about this on Ars Technica.

    There is still a caveat

    During expansion, existing data is redistributed or rebalanced over all drives, including the freshly added drive. However, that data is not rewritten with the new stripe width, which means it remains stored with the older, less efficient parity-to-data ratio.

    I think Matthew Ahrens explains it best in his own words:

    After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.
    

    So, if you add a new drive to a RAIDZ vdev, you'll notice that after expansion, you will have less capacity available than you would theoretically expect.
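
    As an illustration, after an expansion you could compare what the pool layer and the filesystem layer report; 'tank' is just a placeholder pool name here:

    # Raw pool capacity (includes parity):
    zpool list -o name,size,alloc,free tank
    # Usable capacity as the filesystem layer reports it:
    zfs list -o name,used,avail tank

    The difference between the usable space reported here and what you would expect from the new number of data disks is the effect described in the quote above.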

    However, it is even more important to understand that this effect accumulates. This is especially relevant for home users.

    I think that the whole concept of starting with a small number of disks and expanding as you go is very attractive and typical for home users. But it also means that every time a disk is added to the vdev, the existing data remains stored with the old data-to-parity ratio.

    Imagine that we have a 10-drive chassis and we start out with a 4-drive RAIDZ2.

    If we keep adding drives5 as in this example, until the chassis is full at 10 drives, about 1.35 drives' worth of capacity is 'lost' to parity overhead/efficiency loss4.

    That is quite a lot of overhead or loss of capacity, I think.

    How is this overhead calculated? If we were to just buy 10 drives and create a 10-drive RAIDZ2 vdev, the parity overhead would be 20%, meaning that 20% of the total raw capacity of the vdev is used for storing parity. This is the most efficient scenario in this case.

    When we start out with the four-drive RAIDZ2 vdev, the parity overhead is 50%. That's 30 percentage points more overhead than the 'ideal' 10-drive setup.

    As we keep adding drives, the relative parity overhead for newly written data keeps dropping, so we end up with 'multiple data sets' with different data-to-parity ratios, all of them less efficient than the final 10-drive configuration.
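
    For those who prefer code over a spreadsheet, the sketch below reproduces this kind of estimate. It is a rough model, not necessarily the exact math from the sheet: it assumes RAIDZ2 in a 10-bay chassis, starting at four drives and adding one drive each time the pool reaches 80% of its raw capacity, with old blocks keeping their old data-to-parity ratio.

    #!/bin/sh
    # Rough estimate of capacity 'lost' to old parity ratios when growing a
    # RAIDZ2 vdev one disk at a time. All numbers are in units of one drive.
    # Assumptions (tweak the -v variables): start at 4 drives, end at 10,
    # double parity, expand whenever the pool is 80% full.
    awk -v start=4 -v end=10 -v parity=2 -v fill=0.8 'BEGIN {
        used_raw = 0; user_data = 0
        for (w = start; w <= end; w++) {
            target  = fill * w                       # raw space in use before the next expansion
            new_raw = target - used_raw              # raw space filled while the vdev is w wide
            user_data += new_raw * (w - parity) / w  # that data keeps the w-wide parity ratio
            used_raw = target
        }
        ideal = fill * end * (end - parity) / end    # same raw usage, all written at final width
        printf "user data stored:     %.2f drives\n", user_data
        printf "ideal (%d-wide only):  %.2f drives\n", end, ideal
        printf "lost to parity ratio: %.2f drives\n", ideal - user_data
    }'

    With these assumptions the script reports roughly 1.35 drives lost; set fill to 1.0 (only expand when the pool is completely full) and the loss grows to about 1.69 drives, the figure mentioned in the footnotes.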

    I created a Google Sheet to roughly estimate this overhead for each stage, but my math was totally off. Fortunately, Yorick rewrote the sheet, which can be found here. Thanks Yorick! Furthermore, TrueNAS user DayBlur shared additional insights on the calculations, if you are interested in that.

    The Google Sheet allows you to play with various variables to estimate how much capacity is lost for a given scenario. Please note that any losses caused by padding - which can occur with certain drive counts, as discussed in the Ars Technica article - are not part of the calculation.

    It is a bit unfortunate that this overhead manifests itself most in exactly the scenario of the home user who wants to start small and expand as they go. But there is good news!

    Lost capacity can be recovered!

    The overhead or 'lost capacity' can be recovered by rewriting existing data after the vdev has been expanded, because the rewritten data will be stored with the more efficient data-to-parity ratio of the larger vdev.

    Rewriting all data may take quite some time, so you may opt to postpone this step until the vdev has been expanded a couple of times, when the data-to-parity ratio has improved enough that rewriting yields significant storage gains.

    Because capacity lost to overhead can be fully recovered, I think that this caveat is relatively minor, especially compared to the old situation where we had to expand a pool with entire vdevs and there was no way to recover any overhead.

    There is currently no built-in mechanism in the native ZFS tools to trigger this data rewrite. It will be a manual process until somebody creates a script that automates it. According to Matthew Ahrens, restriping the data as part of the vdev expansion process would be an effort on a similar scale as the RAIDZ expansion feature itself.
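
    Until such a script exists, a manual rewrite can be as crude as copying every file on top of itself, as in the sketch below. The dataset path is a placeholder, and this naive approach assumes that no snapshots are holding on to the old blocks, that there is enough free space for the largest file, and that nothing is writing to these files while it runs (it also does not handle unusual file names).

    #!/bin/sh
    # Naive 'rebalance' sketch: rewrite every file so its blocks are stored
    # with the current (wider) data-to-parity ratio. /tank/data is a placeholder.
    find /tank/data -type f | while read -r file; do
        cp -p "$file" "$file.rebalance.tmp" && mv "$file.rebalance.tmp" "$file"
    done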

    Evaluation

    I think it cannot be stated enough how awesome the RAIDZ vdev expansion feature is, especially for home users who want to start small and grow their storage over time.

    Although the expansion process can accumulate quite a bit of overhead, that overhead can be recovered by rewriting existing data, which is probably not a problem for most people.

    Despite all the awesome features and capabilities of ZFS, I think quite a few home users went with other storage solutions because of the relatively high expansion cost/overhead. Now that this barrier will be overcome, I think that ZFS will be more accessible to the home user DIY NAS crowd.

    Release timeline

    According to the Ars Technica article by Jim Salter, this feature will probably become available in August 2022, so we need to have some patience. Even so, you might want to already decide to build your new DIY NAS based on ZFS: by the time you may need to expand your storage, the feature may be available!

    Update on some - in my opinion - bad advice

    The podcast 2.5 admins (which I enjoy listening to) discussed the topic of RAIDZ expansion in episode 45.

    There are two remarks made that I want to address, because I disagree with them.

    Don't rewrite the data?

    As in his Ars Technica article, Jim Salter keeps advocating that you shouldn't bother rewriting the data after a vdev expansion, but I personally disagree with this advice. I hope I have demonstrated that if you keep adding drives, the parity overhead is significant enough for most home users to make it worthwhile to rewrite the data after a few drives have been added.

    Just use mirrors!

    I also disagree with the advice of using mirrors, especially for home users6. I personally think it is bad advice, because home users have different needs and desires than enterprise environments.

    If 'just use mirrors' is still the advice, why did Matthew Ahrens build the whole RAIDZ vdev expansion feature in the first place? I think RAIDZ vdev expansion is really beneficial for home users.

    Maybe Jim and I have very different ideas about what a home user would want or need in a DIY NAS storage solution. I think that home users want this:

    As much storage as possible for as little money as possible with acceptable redundancy.

    In addition, I think that home users in general work with larger files (multiple megabytes at least). And if they sometimes work with smaller files, they accept some performance loss due to the lower random I/O performance of single RAIDZ vdevs7.

    Frankly, to me it feels like the 'just use mirrors' advice is used to 'downplay' a significant limitation of ZFS8. Jim is a prolific writer on Ars Technica and has a large audience, so his advice matters. That's why I think it's a pity that he sticks with 'just use mirrors' while that's clearly not in the best interest of most home users.

    However, that's just my opinion, you decide for yourself what's best.


    1. The other method is to replace all existing drives one by one with larger ones. Only after you have replaced all drives do you gain extra capacity, so this method has a similar downside to expanding with extra vdevs: you must buy multiple drives at once. In addition, I think this method is rather time-consuming and cumbersome, although people do use it to expand capacity. And to be fair: you can indeed add 4+ disk vdevs, vdevs with a higher RAIDZ level, or mirrors, but none of that makes sense in this context.

    2. Just to illustrate the level of redundancy in terms of how many disks can be lost and still be operational. 

    3. I personally think that it's even great for small and medium business owners. Only larger businesses want to keep adding relatively large vdevs consisting of multiple drives, because if they keep expanding with just one drive at a time, they may have to expand capacity very frequently, which may not be practical.

    4. If you only upgrade once the pool is almost full - not recommended! - the overhead grows to 1.69 drives.

    5. So you go from four to five drives. Then from five to six drives, and so on. 

    6. If random I/O performance is important, it is probably wise to go for SSD based storage anyway. 

    7. Resolved by ZFS vdev expansion, obviously, when it lands in production.

    Tagged as : Storage
  2. Why I Do Use ZFS as a File System for My NAS

    Thu 29 January 2015

    In February 2011, I posted an article about my motivations for not using ZFS as a file system for my 18 TB NAS.

    You have to understand that at the time, I believed the arguments in that article were relevant, but much has changed since then, and I believe that article is no longer relevant.

    My stance on ZFS is in the context of a home NAS build.

    I really recommend giving ZFS a serious consideration if you are building your own NAS. It's probably the best file system you can use if you care about data integrity.

    ZFS may only be available for non-Windows operating systems, but there are quite a few easy-to-use NAS distros available that turn your hardware into a full-featured home NAS box that can be managed through your web browser. A few examples:

    I also want to add this: I don't think it's wrong or particularly risky if you - as a home NAS builder - decide not to use ZFS and select a 'legacy' solution if that better suits your needs. I think that proponents of ZFS often overstate the risks ZFS mitigates, perhaps to promote ZFS. I do think those risks are relevant, but it all depends on your circumstances. So you decide.

    May 2016: I have also written a separate article on how I feel about using ZFS for DIY home NAS builds.

    Ars Technica article about FreeNAS vs NAS4Free.

    If you are quite familiar with FreeBSD or Linux, I do recommend this ZFS how-to article from Ars Technica. It offers a very nice introduction to ZFS and explains terms like 'pool' and 'vdev'.

    If you are planning on using ZFS for your own home NAS, I would recommend reading the following articles:

    My historical reasons for not using ZFS at the time

    When I started with my 18 TB NAS in 2009, there was no such thing as ZFS for Linux. ZFS was only available in a stable version for OpenSolaris. We all know what happened to OpenSolaris (it's gone).

    So you might ask: "Why not use ZFS on FreeBSD then?". Good question, but it was bad timing:

    The FreeBSD implementation of ZFS became only stable [sic] in January 2010, 6 months after I build my NAS (summer 2009). So FreeBSD was not an option at that time.
    

    One of the other objections against ZFS is the fact that you cannot expand your storage by adding single drives and growing the array as your data set grows.

    A ZFS pool consists of one or more VDEVs. A VDEV is essentially a traditional RAID array. You expand storage capacity by expanding the ZFS pool, not the VDEVs: you cannot expand a VDEV itself, you can only add VDEVs to a pool.
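
    In command form, that looks something like this (the pool name and device names are just examples):

    # Create a pool backed by a single 6-disk RAIDZ2 vdev:
    zpool create tank raidz2 sda sdb sdc sdd sde sdf
    # The pool can only be grown by adding another whole vdev:
    zpool add tank raidz2 sdg sdh sdi sdj sdk sdl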

    So ZFS either forces you to invest upfront in storage you don't need yet, or it forces you to invest more later on, because you may waste quite a few extra drives on parity. For example, if you start with a 6-drive RAID6-equivalent (RAIDZ2) configuration, you will probably expand with another 6 drives. The pool then has 4 parity drives out of 12 drives in total (33% loss). Investing upfront in 10 drives instead of 6 would have been more efficient, because you only lose 2 drives out of 10 to parity (20% loss).

    So at the time, I found it reasonable to stick with what I knew: Linux & MDADM.

    But my new 71 TiB NAS is based on ZFS.

    I wrote an article about my worry that ZFS might die with FreeBSD as its sole backing, but fortunately, I've been proven very, very wrong.

    ZFS is now supported on FreeBSD and Linux. Despite some licensing issues that prevent ZFS from being integrated into the Linux kernel itself, it can still be used as a regular kernel module and it works perfectly.

    There is even an open-source ZFS consortium that brings together all the developers for the different operating systems supporting ZFS.

    ZFS is here to stay for a very long time.

    Tagged as : ZFS
  3. The ZFS Event Daemon on Linux

    Fri 29 August 2014

    If something goes wrong with my zpool, I'd like to be notified by email. On Linux with MDADM, the MDADM monitoring daemon took care of that.

    With the release of ZoL 0.6.3, a brand new 'ZFS Event Daemon' or ZED was introduced.

    I could not find much information about it, so consider this article my notes on this new service.

    If you want to receive alerts, there is only one requirement: you must set up an MTA on your machine, which is outside the scope of this article.

    When you install ZoL, the ZED daemon is installed automatically and will start on boot.

    The configuration file for ZED can be found at /etc/zfs/zed.d/zed.rc. Just uncomment the "ZED_EMAIL=" line and fill in your email address. Don't forget to restart the service.
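
    For example (the address is obviously a placeholder, and the service name may differ per distribution):

    # /etc/zfs/zed.d/zed.rc (excerpt)
    ZED_EMAIL="you@example.com"

    # Restart ZED so the change takes effect (the service may also be called 'zfs-zed'):
    service zed restart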

    ZED seems to hook into the zpool event log that is kept in the kernel and monitors these events in real-time.

    You can see those events yourself:

    root@debian:/etc/zfs/zed.d# zpool events
    TIME                           CLASS
    Aug 29 2014 16:53:01.872269662 resource.fs.zfs.statechange
    Aug 29 2014 16:53:01.873291940 resource.fs.zfs.statechange
    Aug 29 2014 16:53:01.962528911 ereport.fs.zfs.config.sync
    Aug 29 2014 16:58:40.662619739 ereport.fs.zfs.scrub.start
    Aug 29 2014 16:58:40.670865689 ereport.fs.zfs.checksum
    Aug 29 2014 16:58:40.671888655 ereport.fs.zfs.checksum
    Aug 29 2014 16:58:40.671905612 ereport.fs.zfs.checksum
    ...
    

    You can see that a scrub was started and that incorrect checksums were discovered. A few seconds later I received an email:

    The first email:

    A ZFS checksum error has been detected:
    
      eid: 5
     host: debian
     time: 2014-08-29 16:58:40+0200
     pool: storage
     vdev: disk:/dev/sdc1
    

    And soon thereafter:

    A ZFS pool has finished scrubbing:
    
      eid: 908
     host: debian
     time: 2014-08-29 16:58:51+0200
     pool: storage
    state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
      see: http://zfsonlinux.org/msg/ZFS-8000-9P
     scan: scrub repaired 100M in 0h0m with 0 errors on Fri Aug 29 16:58:51 2014
    config:
    
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0   903
    
    errors: No known data errors
    

    Awesome!

    The ZED daemon executes scripts based on the event class, so it can do more than just send emails: you can customise different actions for different event classes. The event class can be seen in the zpool events output.
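
    The scripts that ZED runs - its 'zedlets' - live in the same directory as zed.rc, so you can see which event classes already have a handler and add your own:

    # List the installed zedlets; their file names start with the event
    # (sub)class they respond to:
    ls -l /etc/zfs/zed.d/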

    One of the more interesting features is automatic replacement of a defective drive with a hot spare, so that full fault tolerance is restored as soon as possible.
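
    For reference, attaching a hot spare to a pool looks like this (the device name is a placeholder):

    # Add sdd as a hot spare to the 'storage' pool:
    zpool add storage spare sdd
    # The spare shows up in its own section of the pool layout:
    zpool status storage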

    I've not been able to get this to work. The ZED scripts would not automatically replace a failed/faulted drive.

    There seem to be some known issues; the fixes are in a pending pull request.

    Just to make sure I would get alerted, I simulated the ZED configuration of my production environment in a VM.

    I simulated a drive failure with dd as described earlier, but the result was that I received one email for every checksum error. With thousands of checksum errors, I had to clear 1,000+ emails from my inbox.
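
    A dd-based test along these lines can be used to provoke checksum errors; it is destructive, should only ever be run in a throw-away VM, and the device name and offsets are placeholders:

    # Overwrite a chunk in the middle of one mirror member (skipping the start
    # of the disk to avoid the ZFS labels):
    dd if=/dev/urandom of=/dev/sdc bs=1M count=100 seek=1000 oflag=direct
    # A scrub will then detect (and, on a mirror, repair) the damaged blocks:
    zpool scrub storage
    zpool status -v storage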

    It seems that this option, which is commented out by default, was not enabled:

    ZED_EMAIL_INTERVAL_SECS="3600"
    

    This option implements a cool-down period in which an event is reported once and then suppressed until the interval expires.

    It would be best if this option were enabled by default.

    The ZED authors acknowledge that ZED is a bit rough around the edges, but it sends out alerts consistently and that's what I was looking for, so I'm happy.

    Tagged as : ZFS event daemon
