The Sorry State of CoW File Systems

March 01, 2015 Category: Storage

I'd like to argue that ZFS and BTRFS are both incomplete file systems with their own drawbacks, and that we may still be a long way off from having something truly great.

ZFS and BTRFS are both heroic feats of engineering, created by people who are probably ten times more capable and smarter than I am. There is no question about my appreciation for these file systems and what they accomplish.

Still, as an end-user, I would like to see some features that are often either missing or not complete. Make no mistake, I believe that both ZFS and BTRFS are probably the best file systems we have today. But they can be much better.

I want to start with a brief overview of why both ZFS and BTRFS are such great file systems and why you should take some interest in them.

Then I'd like to discuss their individual drawbacks and explain my argument.

Why ZFS and BTRFS are so great

Both ZFS and BTRFS are great for two reasons:

  1. They focus on preserving data integrity
  2. They simplify storage management

Data integrity

ZFS and BTRFS implement two important techniques that help preserve data.

  1. Data is checksummed and its checksum is verified to guard against bit rot due to broken hard drives or flaky storage controllers. If redundancy is available (RAID), errors can even be corrected.

  2. Copy-on-Write (CoW): existing data is never overwritten in place, so a calamity like sudden power loss cannot leave existing data in an inconsistent state.
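
The checksum-and-repair idea from the first point can be sketched in a few lines of Python. This is only an illustration, not how either file system is implemented: SHA-256 is used purely as a stand-in checksum, and the names checksum and read_with_repair are made up for this example.

```python
import hashlib
from typing import List


def checksum(block: bytes) -> str:
    # Stand-in checksum for illustration only (an assumption, not the on-disk format).
    return hashlib.sha256(block).hexdigest()


def read_with_repair(copies: List[bytearray], expected: str) -> bytes:
    """Return a copy that matches the stored checksum and repair the ones that don't."""
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        # No redundancy left: the error can be detected but not corrected.
        raise IOError("all copies fail checksum verification")
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # overwrite the rotten copy with the verified one
    return good


# A two-way mirror where one copy suffers a single flipped bit.
mirror = [bytearray(b"important data"), bytearray(b"important data")]
stored = checksum(b"important data")
mirror[0][0] ^= 0x01
data = read_with_repair(mirror, stored)
```

Without the second copy, the same check would still detect the corruption, but there would be nothing to repair it from.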

Simplified storage management

In the old days, we had MDADM or hardware RAID for redundancy and LVM for logical volume management, and then on top of that the file system of choice (EXT3/4, XFS, REISERFS, etc.).

The main problem with this approach is that the layers are not aware of each other, which makes things very inefficient and more difficult to administer. Each layer needs its own attention.

For example, if you simply want to expand storage capacity, you need to add drives to your RAID array and expand it. Then, you have to alert the LVM layer of the extra storage and as a last step, grow the file system.

Both ZFS and BTRFS make capacity expansion a simple one-line command that addresses all three steps above.

Why are ZFS and BTRFS capable of doing this? Because they incorporate RAID, LVM and the file system in one single integrated solution. Each 'layer' is aware of the others; they are tightly integrated. Because of this integration, rebuilds after a drive failure are often faster than with 'legacy RAID' solutions: only the actual data needs to be rebuilt, not the entire drive.

And I'm not even talking about the joy of snapshots here.

The inflexibility of ZFS

The storage building block of ZFS is a VDEV. A VDEV is either a single disk (not so interesting) or some RAID scheme, such as mirroring, single-parity (RAIDZ), dual-parity (RAIDZ2) and even triple-parity (RAIDZ3).

To me, a big downside of ZFS is that you cannot simply expand a VDEV. The only way to grow one is quite convoluted: you have to replace all of the existing drives, one by one, with bigger ones and rebuild the VDEV each time you replace a drive. Only when all drives have the higher capacity can you expand your VDEV. This is quite impractical and time-consuming, if you ask me.

ZFS expects you just to add extra VDEVS. So if you start with a single 6-drive RAIDZ2 (RAID6), you are expected to add another 6-drive RAIDZ2 if you want to expand capacity.

What I would want to do is just add one or two more drives and grow the VDEV, as has been possible for ages with many hardware RAID solutions and with "mdadm --grow".

Why do I prefer this over adding VDEVS? Because it's quite evident that this is way more economical. If I can just expand my RAIDZ2 from 6 drives to 12 drives, I would only sacrifice two drives for parity. If I instead end up with two 6-drive VDEVS, each of them RAIDZ2, I sacrifice four drives (roughly 17% vs 33% capacity loss).
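
The arithmetic behind those two percentages is simple enough to check. A small sketch, assuming all drives are the same size (the function name parity_fraction is just for this example):

```python
def parity_fraction(vdevs: int, drives_per_vdev: int, parity_per_vdev: int) -> float:
    """Fraction of raw capacity spent on parity, assuming equally sized drives."""
    total_drives = vdevs * drives_per_vdev
    parity_drives = vdevs * parity_per_vdev
    return parity_drives / total_drives


one_wide_vdev = parity_fraction(vdevs=1, drives_per_vdev=12, parity_per_vdev=2)
two_small_vdevs = parity_fraction(vdevs=2, drives_per_vdev=6, parity_per_vdev=2)

print(f"{one_wide_vdev:.1%} vs {two_small_vdevs:.1%}")  # prints "16.7% vs 33.3%"
```

With the same 12 drives, the two-VDEV layout gives up twice as many drives to parity as the single wide RAIDZ2 would.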

I can imagine that in the enterprise world this is just not that big of a deal: a bunch of drives is a rounding error on the total budget, and availability and performance are more important. Still, I'd like to have this option.

Either you are forced to buy and install up front the storage you expect to need in the future, or you must add it later as extra VDEVS, spending drives on parity that you would otherwise not have needed.

Maybe my wish for a zpool grow option is more geared to hobbyist or home usage of ZFS, and ZFS has always been focussed on enterprise needs, not the needs of hobbyists. So I'm aware of the context here.

I'm not done with ZFS, however, because there is another great inflexibility. If you don't put the 'right' number of drives in a VDEV, you may lose a significant portion of your storage capacity, which is a side-effect of how ZFS works.

The following ZFS pool configurations (number of drives per VDEV) are optimal for modern 4K sector hard drives:
RAID-Z: 3, 5, 9, 17, 33 drives
RAID-Z2: 4, 6, 10, 18, 34 drives
RAID-Z3: 5, 7, 11, 19, 35 drives
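
The numbers in this list appear to follow a simple pattern: a power-of-two number of data drives plus the parity drives. A minimal sketch that reproduces the list under that assumption (the function name and the cut-off at 2^5 data drives are choices made for this example):

```python
def optimal_drive_counts(parity: int, max_exponent: int = 5) -> list:
    """Drive counts of the form 2**n data drives plus the given number of parity drives."""
    return [2 ** n + parity for n in range(1, max_exponent + 1)]


for name, parity in (("RAID-Z", 1), ("RAID-Z2", 2), ("RAID-Z3", 3)):
    print(name, optimal_drive_counts(parity))
```

By this pattern a 24-drive RAID-Z2 or RAID-Z3 VDEV does not fit: it would leave 22 or 21 data drives, neither of which is a power of two.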

I've seen first-hand with my 71 TiB NAS that if you don't use the optimal number of drives in a VDEV, you may lose whole drives' worth of net storage capacity. In that regard, my 24-drive chassis is very suboptimal.

The sad state of RAID on BTRFS

As far as I'm aware, BTRFS has none of the downsides of ZFS described in the previous section. It has plenty of its own, though. First of all, BTRFS is still not stable, and the RAID 5/6 part is especially unstable.

The RAID 5 and RAID 6 implementations are so new that the ink they were written with is still wet (February 8th 2015). Not something you want to trust your important data to, I suppose.

I did set up a test environment to play a bit with this new Linux kernel (3.19.0) and BTRFS to see how it works, and although it is not production-ready yet, I really like what I see.

With BTRFS you can just add drives to or remove drives from a RAID6 array as you see fit. Add two? Remove three? Whatever; the only thing you have to wait for is BTRFS rebalancing the data over either the new or the remaining drives.

This is friggin' awesome.

If you want to remove a drive, just wait for BTRFS to copy the data from that drive to the other remaining drives and you can remove it. You want to expand storage? Just add the drives to your storage pool and have BTRFS rebalance the data (which may take a while, but it works).

But I'm still a bit sad, because BTRFS does not support anything beyond RAID6. No multiple RAID6 arrays (RAID60) or triple-parity, which ZFS has supported for ages. With my 24-drive file server, putting 24 drives in a single RAID6 starts to feel like I'm asking for trouble. Triple-parity or RAID 60 would probably be more reasonable. But no luck with BTRFS.

However, what really frustrates me is this article by Ronny Egner. The author of SnapRAID, Andrea Mazzoleni, has written a functional patch for BTRFS that implements not only triple-parity RAID, but even up to six parity disks for a volume.

The maddening thing is that the BTRFS maintainers are not planning to include this patch in the BTRFS code base. Please read Ronny's blog. The people working on BTRFS are working for enterprises who want enterprise features. They don't care about triple-parity or features like that because they have access to something presumably better: distributed file systems, which may do away with the need for larger disk arrays and thus triple-parity.

BTRFS has been in development for a very long time, and only recently has RAID 5/6 support been introduced. The risk of the write hole, something ZFS addressed ages ago, is still an open issue. Considering all of this, BTRFS is still a very long way off from being the file system of choice for larger storage arrays.

BTRFS seems to be way more flexible in terms of storage expansion or shrinking, but its slow pace of development makes it still unusable for anything serious for at least the next year, I guess.

Conclusion

BTRFS addresses all the inflexibilities of ZFS, but its immaturity and lack of more advanced RAID schemes make it unusable for larger storage solutions. This is so sad, because by design it seems to be the better, way more flexible option compared to ZFS.

I do understand the view of the BTRFS developers. With enterprise data sets, at scale, it's better to have distributed file systems handle storage and redundancy than to do so at the scale of a single system. But this kind of environment is not within reach for many.

So at the moment, compared to BTRFS, ZFS is still the better option for people who want to set up large, reliable storage arrays.
