1. Please Use ZFS With ECC Memory

    Wed 27 August 2014

    In this blog post I argue why it's strongly recommended to use ZFS with ECC memory when building a NAS. I would argue that if you do not use ECC memory, it's reasonable to forgo ZFS altogether and use any (legacy) file system that suits your needs.

    Why ZFS?

    Many people consider using ZFS when they are planning to build their own NAS. This is for good reason: ZFS is an excellent choice for a NAS file system. There are many reasons why ZFS is such a fine choice, but the most important one is probably 'data integrity'. Data integrity was one of the primary design goals of ZFS.

    ZFS assures that any corrupt data served by the underlying storage system is detected and, if possible, corrected by using checksums and parity. This is why ZFS is so interesting for NAS builders: it's OK to use inexpensive (consumer) hard drives and solid state drives and not worry about data integrity.
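    To make this concrete, here is roughly how that detection surfaces in practice: a scrub forces ZFS to read every block and verify it against its checksum, and any problems show up in the per-device CKSUM column of the status output. The pool name 'tank' is just an example:

        # read and verify every block in the pool against its checksum
        zpool scrub tank

        # show scrub progress and per-device READ/WRITE/CKSUM error counters
        zpool status -v tank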

    I will not go into the details, but for completeness I will also state that ZFS can make the difference between losing an entire RAID array or just a few files, because of the way it handles read errors as compared to 'legacy' hardware/software RAID solutions.

    Understanding ECC memory

    ECC memory, or Error Correcting Code memory, contains extra parity data so that the integrity of the data in memory can be verified and even corrected. ECC memory can correct single-bit errors and detect multiple-bit errors per word[1].

    What's most interesting is how a system with ECC memory reacts to bit errors that cannot be corrected, because it's precisely this response to uncorrectable bit errors that makes all the difference in the world.

    If multiple bits are corrupted within a single word, the CPU will detect the errors but will not be able to correct them. When the CPU notices that there are uncorrectable bit errors in memory, it will generate a machine check exception (MCE) that will be handled by the operating system. In most cases, this will result in a halt[2] of the system.

    This behaviour will lead to a system crash, but it prevents data corruption: the bad bits are never processed by the operating system and/or applications, where they could wreak havoc.

    ECC memory is standard on all server hardware sold by major vendors like HP, Dell, IBM, Supermicro and so on. This is for good reason: memory errors are the norm, not the exception.
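    On a Linux server with ECC memory you can actually watch this happen. A minimal sketch, assuming the EDAC driver for your memory controller is loaded (paths may differ per platform):

        # corrected (ce) and uncorrectable (ue) error counts per memory controller
        grep . /sys/devices/system/edac/mc/mc*/ce_count
        grep . /sys/devices/system/edac/mc/mc*/ue_count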

    The question is really why not all computers, including desktops and laptops, use ECC memory instead of non-ECC memory. The most important reason seems to be 'cost'.

    It is more expensive to use ECC memory than non-ECC memory. This is not only because ECC memory itself is more expensive. ECC memory requires a motherboard with support for ECC memory, and these motherboards tend to be more expensive as well.

    Non-ECC memory is reliable enough that you won't have an issue most of the time. And when it does go wrong, you just blame Microsoft or Apple[3]. For desktops, the impact of a memory failure is less of an issue than on servers. But remember: your NAS is your own (home) server. There is some evidence that memory errors are abundant[4] on desktop systems.

    The price difference is small enough not to be relevant for businesses, but for the price-conscious consumer, it is a factor. A system based on ECC memory may cost in the range of $150 - $200 more than a system based on non-ECC memory.

    It's up to you if you want to spend this extra money. Why you are advised to do so will be discussed in the next paragraphs.

    Why ECC memory is important to ZFS

    ZFS blindly trusts the contents of memory. Please note that ZFS has no mechanisms to cope with bad memory; it is similar to every other file system in this regard. Here is a nice paper about ZFS and how it handles corrupt memory (it doesn't!).

    In the best case, bad memory corrupts file data and causes a few garbled files. In the worst case, bad memory mangles in-memory ZFS file system (meta) data structures, which may lead to corruption and thus loss of the entire zpool.

    It is important to put this into perspective. There is only a practical reason why ECC memory is more important for ZFS than for other file systems. Conceptually, ZFS does not require ECC memory any more than any other file system does.

    Or let Matthew Ahrens, co-founder of the ZFS project, phrase it:

    There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.
    

    Now this is the important part. File systems such as NTFS, EXT4, etc. have (data recovery) tools that may allow you to rescue your files when things go bad due to bad memory. ZFS has no such tools: if the pool is corrupt, all data must be considered lost; there is no option for recovery.

    So the impact of bad memory can be more devastating on a system with ZFS than on a system with NTFS, EXT4, XFS, etcetera. ZFS may force you to restore your data from backups sooner. Oh, by the way, you do make backups, right?

    I do have a personal concern[5]. I have nothing to substantiate this, but my thinking is that since ZFS is a far more advanced and complex file system, it may be more susceptible to the adverse effects of bad memory than legacy file systems are.

    ZFS, ECC memory and data integrity

    The main reason for using ZFS over legacy file systems is the ability to assure data integrity. But ZFS is only one piece of the data integrity puzzle. The other part of the puzzle is ECC memory.

    ZFS covers the risk of your storage subsystem serving corrupt data. ECC memory covers the risk of corrupt memory. If you leave any of these parts out, you are compromising data integrity.

    If you care about data integrity, you need to use ZFS in combination with ECC memory. If you don't care that much about data integrity, it doesn't really matter if you use either ZFS or ECC memory.

    Please remember that ZFS was developed to assure data integrity in a corporate IT environment, where data integrity is the top priority and ECC memory in servers is the norm: a foundation on which ZFS has been built. ZFS is not some magic pixie dust that protects your data under all circumstances. If its requirements are not met, data integrity is not assured.

    ZFS may be free, but data integrity and availability aren't. We spend money on extra hard drives so we can run RAID(Z) and lose one or more hard drives without losing our data. And we have to spend money on ECC memory to assure that bad memory doesn't have a similar impact.

    This is a bit of an appeal to authority rather than to data or reason, but I think it's still relevant. FreeNAS is a vendor of a NAS solution that uses ZFS as its foundation.

    They have this to say about ECC memory:

    However if a non-ECC memory module goes haywire, it can cause irreparable damage to your ZFS pool that can cause complete loss of the storage.
    ...
    If it’s imperative that your ZFS based system must always be available, ECC RAM is a requirement. If it’s only some level of annoying (slightly, moderately…) that you need to restore your ZFS system from backups, non-ECC RAM will fit the bill.
    

    Hopefully your backups won't contain corrupt data. That is, if you make backups of all your data in the first place.

    Many home NAS builders won't be able to afford to back up all data on their NAS, only the most critical data. For example, if you store a large collection of video files, you may accept the risk that you may have to redownload everything. If you can't accept that risk, ECC memory is a must. If you are OK with such a scenario, non-ECC memory is OK and you can save a few bucks. It all depends on your needs.

    The risks faced in a business environment don't magically disappear when you apply the same technology at home. The main difference between a business setting and your home is the scale of operation, nothing else. The risks are still relevant and real.

    Things break, it's that simple. And although the smaller scale at which you operate at home reduces the odds of being affected, your NAS is probably not placed in a temperature- and humidity-controlled server room. As the temperature rises, so does the risk of memory errors[6]. And remember, memory may develop spontaneous and temporary defects (random bit flips). If your system is powered on 24/7, there is a higher chance that such a thing will happen.

    Conclusion

    Personally, I think that even for a home NAS, it's best to use ECC memory regardless of whether you use ZFS. It makes for a more stable hardware platform. If money is a real constraint, it's better to take a look at AMD's offerings than to skip ECC memory. If you select AMD hardware, make sure that both the CPU and the motherboard support ECC and that it is reported to be working.
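    A quick, hedged way to check whether ECC is actually active (the firmware has to report it honestly, so treat the output as a hint rather than proof):

        # prints e.g. 'Error Correction Type: Multi-bit ECC' if the memory
        # array runs with ECC, or 'None' if it does not (run as root)
        dmidecode -t memory | grep -i 'Error Correction Type'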

    Still, if you decide to use non-ECC memory with ZFS: as long as you are aware of the risks outlined in this blog post and you're OK with that, fine. It's your data and you must decide for yourself what kind of protection and associated cost is reasonable for you.

    When people seek advice on their NAS builds, ECC memory should always be recommended. I think nobody should create the impression that it's 'safe' for home use not to use ECC RAM, purely from a technical and data integrity standpoint. People must understand that they are taking a risk. There is a significant chance that they will never experience problems, but there is no guarantee. Do they accept the consequences if it does go wrong?

    If data integrity is not that important - because the data itself is not critical - I find it perfectly reasonable that people may decide not to use ECC memory and save a few hundred dollars. In that case, it would also be perfectly reasonable not to use ZFS, which may also open up other file system and RAID options that better suit their particular needs.

    Questions and answers

    Q: When I bought my non-ECC memory, I ran memtest86+ and no errors were found, even after a burn-in test. So I think I'm safe.

    A: No. A memory test with memtest86+ is just a snapshot in time. At the moment you ran the test, you had some assurance that the memory was fine. It could have gone bad right now, while you are reading these words, and could be corrupting your data as we speak. So running memtest86+ frequently doesn't really buy you much.

    Q: Did you see that article by Brian Moses?

    A: Yes, and I disagree with his views, but I really appreciate the fact that he emphasises that you should be aware of the risks involved and decide for yourself what suits your situation. A few points that are not OK in my opinion:

    Every bad stick of RAM I’ve experienced came to me that way from the factory and could be found via some burn-in testing.
    

    I've seen consumer equipment in my lifetime that suddenly developed memory errors after years of perfect operation. This argument from personal anecdote should not be used as a basis for decision making. Remember: memory errors are the norm, not the exception. Even at home. Things break, it's that simple. And having equipment running 24/7 doesn't help.

    Furthermore, Brian seems to think that you can mitigate the risk of non-ECC memory by spending money on other things, such as off-site backups. Brian himself links to an article that rebuts his position on this. Just for completeness: how valuable is a backup of corrupted data? How do you know which data was corrupted? ZFS won't save you here.

    Q: Should I use ZFS on my laptop or desktop?

    A: Running ZFS on your desktop or laptop is an entirely different use case compared to a NAS. I see no problem with this; I don't think this discussion applies to desktop/laptop usage. Especially because you are probably creating regular backups of your data to your NAS or a cloud service, right? If there are any memory errors, you will notice soon enough.

    Updates

    • Updated on August 11, 2015 to reflect that ZFS was not designed with ECC in mind. In this regard, it doesn't differ from other file systems.

    • Updated on April 3rd, 2015 - rewrote large parts of the whole article, to make it a better read.

    • Updated on January 18th, 2015 - rephrased some sentences. Changed the paragraph 'Inform people and give them a choice' to argue when it would be reasonable not to use ECC memory. Furthermore, I state more explicitly that ZFS itself has no mechanisms to cope with bad RAM.

    • Updated on February 21st, 2015 - I substantially rewrote this article to give a better perspective on the ZFS + ECC 'debate'.


    1. On x64 processors, the size of a word is 64 bits

    2. Windows will generate a "blue screen of death" and Linux will generate a "kernel panic". 

    3. It is very likely that the computer you're using (laptop/desktop) encountered a memory issue this year, but there is no way you can tell. Consumer hardware doesn't have any mechanisms to detect and report memory errors. 

    4. Microsoft performed a study of one million crash reports, collected over a period of 8 months from roughly a million systems in 2008. The result was a 1-in-1700 failure rate for single-bit memory errors in kernel code pages (a tiny subset of total memory).

      A consequence of confining our analysis to kernel code pages is that we will miss DRAM failures in the vast majority of memory. On a typical machine kernel code pages occupy roughly 30 MB of memory, which is 1.5% of the memory on the average system in our study. [...] since we are capturing DRAM errors in only 1.5% of the address space, it is possible that DRAM error rates across all of DRAM may be far higher than what we have observed. 

    5. I did not come up with this argument myself. 

    6. The absolutely fascinating concept of bitsquatting proved that hotter datacenters showed more bitflips.

    Tagged as : ZFS ECC
  2. 71 TiB DIY NAS Based on ZFS on Linux

    Sat 02 August 2014

    This is my new 71 TiB DIY NAS. This server is the successor to my six-year-old, twenty-drive 18 TB NAS (17 TiB). With a storage capacity four times higher than the original and incredible read (2.5 GB/s) and write (1.9 GB/s) performance, it's a worthy successor.

    [Image: the ZFS NAS]

    Purpose

    The purpose of this machine is to store backups and media, primarily video.

    The specs

    Part                  Description
    Case                  Ri-vier RV-4324-01A
    Processor             Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
    RAM                   16 GB ECC
    Motherboard           Supermicro X9SCM-F
    LAN                   Intel Gigabit
    Storage connectivity  InfiniBand MHGA28-XTC (2023: Mellanox ConnectX-3 Pro 10Gbit Ethernet (X312B-XCCT))
    PSU                   Seasonic Platinum 860
    Controller            3 x IBM M1015
    Disk                  24 x HGST HDS724040ALE640 4 TB (7200 RPM)
    SSD                   2 x Crucial M500 120 GB in RAID 1 for boot drives
    Arrays                Boot: 2 x 120 GB RAID 1; storage: 18-disk RAIDZ2 + 6-disk RAIDZ2
    Gross storage         86 TiB (96 TB)
    Net storage           71 TiB (78 TB)
    OS                    Linux Debian Wheezy (2023: Ubuntu 22.04)
    Filesystem            ZFS
    Rebuild time          Depends on amount of data (rate is 4 TB/hour)
    UPS                   APC Back-UPS RS 1200 LCD using Apcupsd (later: none)
    Power usage           About 200 Watt idle

    [Images: front of the chassis]

    CPU

    The Intel Xeon E3-1230 V2 is not the latest generation but one of the cheapest Xeons you can buy and it supports ECC memory. It's a quad-core processor with hyper-threading.

    Here you can see how it performs compared to other processors.

    Memory

    The system has 16 GB ECC RAM. Memory is relatively cheap these days but I don't have any reason to upgrade to 32 GB. I think that 8 GB would have been fine with this system.

    Motherboard

    The server is built around the Supermicro X9SCM-F motherboard.

    This is a server-grade motherboard and comes with typical features you might expect, like ECC memory support and out-of-band management (IPMI).

    [Image: top view of the Supermicro board]

    This motherboard has four PCIe slots (2 x 8x and 2 x 4x), all with 8x physical connectors. My build requires four PCIe 4x+ slots and there aren't (m)any other server boards at this price point that offer four PCIe slots with 8x-sized connectors.

    The chassis

    The chassis has six rows of four drive bays that are kept cool by three 120mm fans in a fan wall behind the drive bays. At the rear of the case, there are two 'powerful' 80mm fans that remove the heat from the case, together with the PSU.

    The chassis has six SAS backplanes that connect four drives each. The backplanes have dual molex power connectors, so you can put redundant power supplies into the chassis. Redundant power supplies are more expensive and due to their size, often have smaller, thus noisier fans. As this is a home build, I opted for just a single regular PSU.

    When facing the front, there is a place at the left side of the chassis to mount a single 3.5 inch or two 2.5 inch drives next to each other as boot drives. I've mounted two SSDs (RAID1).

    This particular chassis version has support for SGPIO, which should help identify which drive has failed. The IBM M1015 cards I use do support SGPIO. Through the LSI MegaRAID CLI I have verified that SGPIO works, as you can use this tool as a drive locator. I'm not entirely sure how well SGPIO works with ZFS.
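    For illustration, locating a drive through MegaCLI looks roughly like this. The binary path differs per distribution and the enclosure/slot numbers below are made up; yours come from the controller's PDList output:

        # start blinking the locate LED of the drive in enclosure 252, slot 4
        /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -Start -PhysDrv [252:4] -a0

        # stop blinking it again
        /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -Stop -PhysDrv [252:4] -a0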

    Power supply

    I was using a Corsair 860i before, but it was unstable and died on me.

    The Seasonic Platinum 860 may seem like overkill for this system. However, I'm not using staggered spinup for the 24 drives, so the drives all spin up at once, which results in a peak power usage of 600+ watts.

    The PSU has a silent mode that causes the fan to spin only when the load reaches a certain threshold. Since the PSU fan also helps remove warm air from the chassis, I've disabled this feature, so the fan is spinning at all times.

    Drive management

    I've written a tool called lsidrivemap that displays each drive in an ASCII table that reflects the physical layout of the chassis.

    The data is based on the output of the LSI 'megacli' tool for my IBM 1015 controllers.

    root@nano:~# lsidrivemap disk
    
    | sdr | sds | sdt | sdq |
    | sdu | sdv | sdx | sdw |
    | sdi | sdl | sdp | sdm |
    | sdj | sdk | sdn | sdo |
    | sdb | sdc | sde | sdf |
    | sda | sdd | sdh | sdg |
    

    This layout is 'hardcoded' for my chassis but the Python script can be easily tailored for your own server, if you're interested.

    It can also show the temperature of the disk drives in the same table:

    root@nano:~# lsidrivemap temp
    
    | 36 | 39 | 40 | 38 |
    | 36 | 36 | 37 | 36 |
    | 35 | 38 | 36 | 36 |
    | 35 | 37 | 36 | 35 |
    | 35 | 36 | 36 | 35 |
    | 34 | 35 | 36 | 35 |
    

    These temperatures show that the top drives run a bit hotter than the other drives. An unverified explanation could be that the three 120mm fans are not in the center of the fan wall. They are skewed to the bottom of the wall, so they may favor the lower drive bays.
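    For reference, the raw data both tables are built from can be pulled straight out of megacli. A rough sketch; the binary path and exact field labels may differ per firmware version:

        # enclosure, slot, model/serial and temperature for every attached drive
        /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | \
            egrep 'Enclosure Device ID|Slot Number|Inquiry Data|Drive Temperature'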

    Filesystem (ZFS)

    I'm using ZFS as the file system for the storage array. At this moment, there is no other file system that has the same features and stability as ZFS. BTRFS is not even finished.

    The number one design goal of ZFS was assuring data integrity. ZFS checksums all data and if you use RAIDZ or a mirror, it can even repair data. Even if it can't repair a file, it can at least tell you which files are corrupt.

    ZFS is not primarily focused on performance, but to get the best performance possible, it makes heavy use of RAM to cache both reads and writes. This is why ECC memory is so important.
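    On ZFS on Linux you can see this cache (the ARC) at work. A small sketch, assuming the usual kstat and module parameter paths:

        # current ARC size and its configured ceiling, in bytes
        awk '/^size|^c_max/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

        # ARC ceiling as a module parameter (0 means the default, roughly half of RAM)
        cat /sys/module/zfs/parameters/zfs_arc_max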

    ZFS also implements RAID, so there is no need to use MDADM. My previous file server ran a single RAID 6 of 20 x 1 TB drives. With this new system I've created a single pool with two RAIDZ2 VDEVs, as sketched below.
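    Creating such a pool is a single command. The sketch below uses placeholder device names (and bash brace expansion) purely for illustration; a real build should use the stable /dev/disk/by-id/ names of the actual drives:

        # one pool, two RAIDZ2 VDEVs (18 + 6 drives), 4K sectors - placeholder names
        zpool create -o ashift=12 storage \
            raidz2 /dev/disk/by-id/ata-EXAMPLE-DRIVE-{01..18} \
            raidz2 /dev/disk/by-id/ata-EXAMPLE-DRIVE-{19..24}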

    Capacity

    Vendors still advertise the capacity of their hard drives in TB whereas the operating system works with TiB. So the 4 TB drives I use are in fact 3.64 TiB.
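    The conversion is nothing more than dividing by 2^40:

        # 4 TB (decimal) expressed in TiB (binary)
        echo 'scale=4; 4 * 10^12 / 2^40' | bc
        3.6379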

    The total raw storage capacity of the system is about 86 TiB.

    My zpool uses the 'appropriate' number of disks (2^n + parity) in each VDEV. So I have one 18-disk RAIDZ2 VDEV (2^4 + 2) and one 6-disk RAIDZ2 VDEV (2^2 + 2), for a total of 24 drives.

    Different VDEV sizes in a single pool are often not recommended, but ZFS is very smart and cool: it load-balances the data across the VDEVs based on the size of the VDEV. I could verify this with zpool iostat -v 5 and witness this in real-time. The small VDEV got just a fraction of the data compared to the large VDEV.

    This choice leaves me with less capacity (71 TiB vs. 74 TiB for RAIDZ3) and also has a bit more risk to it, with the eighteen-disk RAIDZ2 VDEV. Regarding this latter risk, I've been running a twenty-disk MDADM RAID6 for the last 6 years and haven't seen any issues. That does not tell everything, but I'm comfortable with this risk.

    Originally I was planning on using RAIDZ3, and by using ashift=9 (512-byte sectors) I would recuperate most of the space lost to the non-optimal number of drives in the VDEV. So why did I change my mind? Because the performance of my ashift=9 pool on these 4K drives deteriorated so much that a resilver of a failed drive would take ages.


    Storage controllers

    The IBM M1015 HBAs are reasonably priced, and buying three of them is often cheaper than buying just one HBA with a SAS expander. However, it may be cheaper to search for an HP SAS expander and use it with just one M1015, which also saves a PCIe slot.

    I have not flashed the controllers to 'IT mode', as most people do. They worked out-of-the-box as HBAs and although it may take a little bit longer to boot the system, I decided not to go through the hassle.

    The main risk here is how the controller handles a drive if a sector is not properly read. It may disable the drive entirely, which is not necessary for ZFS and often not preferred.

    Storage performance

    With twenty-four drives in a chassis, it's interesting to see what kind of performance you can get from the system.

    Let's start with a twenty-four drive RAID 0. The drives I use have a sustained read/write speed of 160 MB/s so it should be possible to reach 3840 MB/s or 3.8 GB/s. That would be amazing.

    This is the performance of a RAID 0 (MDADM) of all twenty-four drives.

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000
    1048576000000 bytes (1.0 TB) copied, 397.325 s, 2.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    1048576000000 bytes (1.0 TB) copied, 276.869 s, 3.8 GB/s
    

    Dead on, you would say, but if you divide 1 TB by 276 seconds, it's more like 3.6 GB/s. I would say that's still quite close.

    This machine will be used as a file server and a bit of redundancy would be nice. So what happens if we run the same benchmark on a RAID6 of all drives?

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=100000
    104857600000 bytes (105 GB) copied, 66.3935 s, 1.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 38.256 s, 2.7 GB/s
    

    I'm quite pleased with these results, especially for a RAID6. However, RAID6 with twenty-four drives feels a bit risky. So since there is no support for a three-parity disk RAID in MDADM/Linux, I use ZFS.

    Initially - as I mentioned earlier - I was willing to sacrifice performance and use ashift=9 on these 4K-sector drives, because I gained about 5 TiB of storage in exchange.

    This is the performance of twenty-four drives in a RAIDZ3 VDEV with ashift=9.

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    Compared to the other results, write performance is way down, although not too bad.

    This is the write performance of the 18 disk RAIDZ2 + 6 disk RAIDZ2 zpool (ashift=12):

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000 
    1048576000000 bytes (1.0 TB) copied, 543.072 s, 1.9 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M 
    1048576000000 bytes (1.0 TB) copied, 400.539 s, 2.6 GB/s
    

    As you may notice, the write performance is better than that of either the ashift=9 or the ashift=12 RAIDZ3 VDEV.

    In the end I chose the 18-disk RAIDZ2 + 6-disk RAIDZ2 setup because of the better performance and to adhere to ZFS best practices.


    I have not benchmarked random I/O performance as it is not relevant for this system. And with ZFS, the random I/O performance of a VDEV is that of a single drive.

    Boot drives

    I'm using two Crucial M500 120GB SSD drives. They are configured in a RAID1 (MDADM) and I've installed Debian Wheezy on top of them.

    At first, I was planning on using part of the capacity for caching purposes in combination with ZFS. However, there's no real need to do so. In hindsight I could also have used two very cheap 2.5" hard drives (similar to my older NAS), which would have cost less than a single M500.

    Update 2014-09-01: I actually reinstalled Debian and kept about 50% free space on both M500s and put this space in a partition. These partitions have been added to the ZFS pool as L2ARC cache. I did this because I could, but I do wonder if I'm just wearing out my SSDs faster.

    Update 2015-10-04: I saw no reason why I would wear out my SSDs as a L2ARC so I removed them from my pool. There is absolutely no benefit in my case.
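    For reference, adding and removing an L2ARC device is a one-liner in both directions; the partition name below is just an example:

        # add an SSD partition as L2ARC cache device
        zpool add storage cache /dev/disk/by-id/ata-EXAMPLE-SSD-part2

        # cache devices can be removed from the pool at any time
        zpool remove storage /dev/disk/by-id/ata-EXAMPLE-SSD-part2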

    Networking (updated 2017-03-25)

    Current: I have installed a Mellanox MHGA28-XTC InfiniBand card. I'm running IP over InfiniBand (IPoIB), so the InfiniBand card is effectively just a faster network card. I have a point-to-point connection with another server; I do not have an InfiniBand switch.

    I get about 6.5 Gbit/s from this card, which is not even near the theoretical performance limit. However, this translates into a constant 750 MB/s file transfer speed over NFS, which is amazing.
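    Raw throughput figures like these are easy to reproduce with a tool such as iperf3 between the two hosts; the address below is an example for the point-to-point IPoIB link:

        # on the receiving server
        iperf3 -s

        # on the sending server: four parallel streams to the IPoIB address
        iperf3 -c 10.0.1.2 -P 4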

    Using Linux bonding and the quad-port Ethernet adapter, I only got 400 MB/s and transfer speeds were fluctuating a lot.

    Original: Maybe I will invest in 10Gbit ethernet or InfiniBand hardware in the future, but for now I settled on a quad-port gigabit adapter. With Linux bonding, I can still get 450+ MB/s data transfers, which is sufficient for my needs.

    The quad-port card is in addition to the two on-board gigabit network cards. I use one of the on-board ports for client access. The four ports on the quad-port card are all in different VLANs and not accessible for client devices.

    The storage will be accessible over NFS and SMB. Clients will access storage over one of the on-board Gigabit LAN interfaces.

    Keeping things cool and quiet

    It's important to keep the drive temperature at acceptable levels, and with 24 drives packed together, there is an increased risk of overheating.

    The chassis is well-equipped to keep the drives cool with three 120mm fans and two strong 80mm fans, all supporting PWM (pulse-width modulation).

    The problem is that, by default, the BIOS runs the fans at too low a speed to keep the drives at a reasonable temperature. I'd like to keep the hottest drive at about forty degrees Celsius, but I also want to keep the noise at reasonable levels.

    I wrote a Python script called storagefancontrol that automatically adjusts the fan speed based on the temperature of the hottest drive.
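    The core of the idea, finding the hottest drive, can be sketched in a few lines of shell. This is not the actual storagefancontrol script, just an illustration; drives behind a RAID controller may need smartctl's -d megaraid option instead, and writing the resulting PWM value goes to a motherboard-specific hwmon path:

        # print the highest SMART Temperature_Celsius raw value across all sd? drives
        for d in /dev/sd?; do
            smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}'
        done | sort -n | tail -1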

    UPS

    I'm running an HP N40L micro server as my firewall/router. My APC Back-UPS RS 1200 LCD (720 Watt) is connected via USB to this machine. I'm using apcupsd to monitor the UPS and shut down servers if the battery runs low.

    All servers, including my new build, run apcupsd in network mode and talk to the N40L to learn if power is still OK.
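    On the clients this boils down to pointing apcupsd at the N40L's network information server and querying it; the hostname and directives below are illustrative, not my exact configuration:

        # relevant lines in /etc/apcupsd/apcupsd.conf on a client:
        #   UPSTYPE net
        #   DEVICE n40l.example.lan:3551

        # query the UPS status over the network
        apcaccess status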

    Keeping power consumption reasonable

    So these are the power usage numbers.

     96 Watt with disks in spin down.
    176 Watt with disks spinning but idle.
    253 Watt with disks writing.
    

    Edit 2015-10-04: I do have an unresolved issue where the drives keep spinning up even with all services on the box killed, including Cron. So it's configured so that the drives are always spinning. /end edit

    But the most important stat is that it's using 0 Watt if powered off. The system will be turned on only when necessary through wake-on-lan. It will be powered off most of the time, like when I'm at work or sleeping.
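    Waking the machine is a single command from any other host on the LAN, provided wake-on-lan is enabled in the BIOS and on the NIC; the interface name and MAC address below are placeholders:

        # keep the NIC listening for magic packets ('g' = wake on magic packet)
        ethtool -s eth0 wol g

        # send the magic packet to the NAS from another machine
        wakeonlan 00:11:22:33:44:55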

    Cost

    The system has cost me about €6000. All costs below are in Euro and include taxes (21%).

    Description      Product                                        Price  Amount  Total
    Chassis          Ri-vier 4U 24bay storage chassis RV-4324-01A     554       1    554
    CPU              Intel Xeon E3-1230V2                             197       1    197
    Mobo             SuperMicro X9SCM-F                               157       1    157
    RAM              Kingston DDR3 ECC KVR1333D3E9SK2/16G             152       1    152
    PSU              AX860i 80Plus Platinum                           175       1    175
    Network card     NC364T PCI Express Quad Port Gigabit             145       1    145
    HBA controller   IBM SERVERAID M1015                              118       3    354
    SSDs             Crucial M500 120GB                                62       2    124
    Fan              Zalman FB123 Casefan Bracket + 92mm Fan            7       1      7
    Hard drive       Hitachi 3.5" 4TB 7200RPM (0S03356)               166      24   3984
    SAS cables                                                         25       6    150
    Fan cables                                                          6       1      6
    SATA-to-Molex                                                     3.5       1    3.5
    Molex splitter                                                      3       1      3
    Total                                                                          6012

    Closing words

    If you have any questions or remarks about what could have been done differently feel free to leave a comment, I appreciate it.

  3. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    Thu 31 July 2014

    Update 2014-8-23: I was testing with ashift for my new NAS. The ashift=9 write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. Also I noticed that resilvering was very slow. This is why I decided to abandon my 24 drive RAIDZ3 configuration.

    I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue.

    My final setup will be a RAIDZ2 18 drive VDEV + RAIDZ2 6 drive VDEV which will give me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. I've written about 40+ TiB to the array and after those 40 TiB, write performance was about 1.7 GB/s, so still very good and what I would expect as drives fill up.

    So, based on these results, I have learned not to deviate from the ZFS best practices too much. Use ashift=12 and put drives in VDEVs that adhere to the 2^n + parity rule.

    The uneven VDEVs (18 disk vs. 6 disks) are not according to best practice but ZFS is smart: it distributes data across the VDEVs based on their size. So they fill up equally.


    Choosing between ashift=9 and ashift=12 for 4K sector drives is not always a clear cut case. You have to choose between raw performance or storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    Note: NEVER use /dev/sd? drive names for an array, this is just for testing, always use /dev/disk/by-id/ names.
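    For illustration, the by-id variant would look roughly like this; the id strings are placeholders for the real model/serial based names you find under /dev/disk/by-id/:

        # list the stable names (filter out the partition entries)
        ls -l /dev/disk/by-id/ | grep -v part

        # same pool, but built from stable names (placeholders shown here)
        zpool create -o ashift=12 storage raidz3 /dev/disk/by-id/ata-EXAMPLE-DRIVE-{01..24}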

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
    

    This is quite impressive. With these speeds, you can saturate 10Gbe ethernet. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage  1.66M  68.4T   435K  /storage
    

    Only 68.4 TiB of storage? That's not good. There should be more: 24 drives minus 3 for parity leaves 21 drives, and 21 x 3.6 TiB = 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    So we lose about a third of our write performance, while the read performance is not affected, probably thanks to read-ahead caching, but I'm not sure.

    With ashift=9, we do lose some write performance, but we can still saturate 10Gbe.

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   271K  73.9T  89.8K  /storage
    

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
