Some people say that it's OK (acceptable risk) to run a ZFS NAS without ECC memory.
I'd like to make the case that this is very bad advice and that these people are doing other people a disservice.
Running ZFS without ECC memory gives you a false sense of security, and it can lead to serious data corruption or even the loss of the whole zpool. You may lose all your data. Here is a nice paper about ZFS and how it handles corrupt memory (it doesn't!).
ZFS was designed to be run on hardware with ECC memory and it trusts memory blindly. ZFS addresses data integrity for disks. ECC memory addresses the integrity of data in memory. Each tool has its own purpose, so use the right tool for the job.
ZFS combined with bad RAM may be a significantly bigger threat to the data on your NAS than using EXT4/XFS/UFS. Not only because the file system may become corrupt and impossible to import, but also because there are no file system recovery tools available for ZFS. With the older file systems, you at least stand some chance of saving some of your data.
ZFS amplifies the impact of bad memory
Aaron Toponce explains the danger of bad non-ECC memory with some examples. ZFS tries to repair data if it thinks it is corrupt. But since ZFS trusts RAM, it cannot distinguish between bad RAM and bad disk data and will start to 'repair' good data. This will cause further corruption and will further damage data on disk. Imagine what will happen if you perform regular scrubs of your data.
Personally, I think that even for a home NAS, it's best to use ECC memory regardless of whether you use ZFS. It makes for more stable hardware. If money is a real constraint, it's better to take a look at AMD's offerings than to skip ECC memory for a bit more performance.
ZFS is just one part of the data integrity/availability puzzle
From a technical perspective, it is always a bad choice to buy non-ECC memory for your DIY NAS. But you may have non-technical reasons not to buy ECC memory, like 'monetary' reasons.
ECC memory is a bit more expensive, but the question is: what is your goal?
If you care about your data and would lose sleep over the risk of silent data corruption, you need to go all the way to be safe. ZFS covers the risk of drives spewing corrupt data, extra drives cover the risk of drive failures and ECC memory covers the risk of bad memory.
ZFS itself is free. But data integrity and availability are not. We know that hardware can fail, in particular hard drives. So we buy some extra drives and sacrifice capacity in exchange for reliability. We pay real money to gain some safety. Why not with memory? Why is it suddenly not necessary to do for memory exactly what ZFS does for hard disk drives?
The ECC vs. non-ECC debate is about whether the likelihood and the impact of a RAM bitflip warrant the extra cost of ECC memory for home usage. But before we look at the numbers, let's just think about this for a moment.
The only argument against ECC memory is that the likelihood that memory corruption occurs is low.
But there is no data on this for home environments. It's just anecdotes and hearsay. The trouble is that non-ECC machines never tell you to your face that you just encountered a memory bit error. The machine just crashes or reboots, some app crashes, or some file is suddenly lost. How do you know you've never experienced bad memory?
The chance is low, but if it goes wrong, the impact could be very high.
I like this argument from Andrew Galloway, who has an even stronger opinion in this debate:
Would you press a button with a 100$ reward if there's a one in ten thousand
chance that you will get zapped by a lightning strike and die instead of
getting that 100$?
Is the small risk of losing all your data worth the reward of 100$?
Vendors like HP or Dell do not ship a single server or workstation with non-ECC RAM. Even the cheapest tower model servers for small businesses contain ECC memory. Please let that sink in for a moment.
On the FreeNAS forum, they've seen multiple people lose their data because of memory corruption rendering their zpool unusable. For a nice and very opinionated read, check this topic.
Non-ECC hardware will not warn you
How long will it take for you to notice that your NAS has memory problems? By the very nature of non-ECC memory and related hardware (motherboard), there is no way to tell if memory has gone bad. By the time you notice, it may be too late. Just think what will happen if a scrub starts.
ECC motherboards log memory events to the BIOS and those events can often be read through IPMI from within the operating system.
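As an illustration of what "reading those events from within the operating system" can look like, here is a minimal sketch. Note that it does not use IPMI: it assumes a Linux machine with ECC memory and a loaded EDAC driver, which exposes per-memory-controller error counters under /sys/devices/system/edac/mc. On other operating systems (FreeBSD-based NAS distributions, for example) the mechanism is different. The function name and the returned dictionary layout are made up for this example.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected error counts per memory controller.

    Assumption: Linux with an EDAC driver loaded. If the directory does
    not exist (no ECC, no driver, other OS), an empty dict is returned.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text().strip())
            uncorrected = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[controller.name] = {
            "corrected": corrected,
            "uncorrected": uncorrected,
        }
    return counts


counts = read_edac_counts()
if not counts:
    print("No EDAC counters found (no ECC memory, no driver, or not Linux).")
for name, values in counts.items():
    print(f"{name}: {values['corrected']} corrected, "
          f"{values['uncorrected']} uncorrected errors")
```

Running something like this from a periodic job and alerting on a non-zero count is one way to get the early warning that non-ECC hardware cannot give you.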
The Google study
Now let's take a look at some data. I'm using the Google study that some of you may already be familiar with.
Our first observation is that memory errors are not rare events.
About a third of all machines in the fleet experience at least one memory
error per year [...]
One in three machines faces at least one memory error per year. But a machine contains multiple memory modules.
Around 20% of DIMMs in Platform A and B are affected by correctable errors
per year, compared to less than 4% of DIMMs in Platform C and D.
So let's assume that your hardware is of a better design, like Platforms C and D. In that case, each memory module has a four percent chance per year of seeing a correctable error. Remember that your NAS has at least two memory modules.
So the chance of seeing no errors per module per year is 96%. With two modules, that is 0.96 x 0.96 = 92% chance that everything will be fine that year. Or you could say: 8% chance that some failure will occur. With four memory modules, the risk is 15% per year that you will face at least one memory error.
A memory error may not immediately lead to total loss of your pool, but still, I find this number quite high.
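The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the same back-of-the-envelope calculation: it assumes that every module has the same 4% yearly error rate and that modules fail independently of each other, which the Google study does not guarantee. The function name is made up for this example.

```python
def chance_of_any_error(per_module_rate, modules):
    """Chance that at least one of `modules` DIMMs sees an error in a year.

    Assumption: every module has the same yearly error rate and errors
    in different modules are independent events.
    """
    chance_all_fine = (1.0 - per_module_rate) ** modules
    return 1.0 - chance_all_fine


for modules in (1, 2, 4):
    risk = chance_of_any_error(0.04, modules)
    print(f"{modules} module(s): {risk:.1%} chance of at least one error per year")
```

For two modules this gives roughly 7.8% and for four modules roughly 15.1%, which matches the rounded 8% and 15% figures.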
There are more interesting observations in this paper.
Memory errors can be classified into soft errors, which randomly corrupt
bits, but do not leave any physical damage; and hard errors, which corrupt
bits in a repeatable manner because of a physical defect (e.g. “stuck bits”).
Conclusion 7: Error rates are unlikely to be dominated by soft errors.
We observe that CE rates are highly correlated with system utilization,
even when isolating utilization effects from the effects of temperature.
So if I understand this correctly, memory errors are mostly hard errors that show up more often under high CPU and RAM usage, and cosmic radiation (the classic cause of soft errors) does not seem to be the cause that often.
Please note that Google did not measure hard or soft errors directly as they can't distinguish between them.
Brian Moses blogged about his reasons why he did not choose ECC memory for his NAS box. Although most of his arguments are not very strong in my opinion, he pointed out something interesting.
Google found that there is a strong correlation between memory errors and the CPU/RAM usage of the machine.
We observe clear trends of increasing correctable error rates with
increasing CPU utilization and allocated memory. Averaging across all
platforms, it seems that correctable error rates grow roughly
logarithmically as a function of utilization levels (based on the roughly
linear increase of error rates in the graphs, which have log scales on the
A major difference between the Google server and your home NAS will be that your home NAS won't see much usage of both memory and CPU in general, so if the relation is logarithmic in nature, the risk of seeing memory errors in a low-utilisation environment should be reduced. But what kind of number can we put on that? 1% per memory module per year? Or 0.1%?
Are you the person who is going to find out?
This information may be used as an indication that for a home environment, memory problems are less likely compared to high-usage systems in a data center, but are you going to bet your data on that assumption?
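To get a feel for what the guessed 1% and 0.1% per-module rates would mean, here is a rough sketch. The rates are pure guesses, not measurements, and the five-year lifetime and four-module configuration are assumptions added for this example. It also assumes that modules and years are independent of each other.

```python
def lifetime_risk(per_module_yearly_rate, modules, years):
    """Chance of at least one memory error over the whole period.

    Assumption: a fixed yearly rate per module, with modules and years
    treated as independent trials.
    """
    trials = modules * years
    return 1.0 - (1.0 - per_module_yearly_rate) ** trials


# Hypothetical per-module yearly rates: 4% is the Platform C/D figure,
# 1% and 0.1% are only the guesses discussed above.
for rate in (0.04, 0.01, 0.001):
    one_year = lifetime_risk(rate, modules=4, years=1)
    five_years = lifetime_risk(rate, modules=4, years=5)
    print(f"rate {rate:.1%}: {one_year:.2%} per year, "
          f"{five_years:.2%} over five years")
```

Even with an optimistic guess the risk does not drop to zero, and it adds up over the years that a NAS is typically kept running.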
Most people run their NAS 24/7. Often, it has other tasks besides storing files and this may cause a load on the system. Furthermore, ZFS tends to use as much memory as possible for caching purposes, increasing the risk of hitting bad memory. And ZFS users will need to perform regular scrubs of their pool, which cause a lot of disk, CPU and RAM activity.
Inform people and give them a choice
When people seek advice on their NAS builds, ECC memory should always be recommended and I think that nobody should create the impression that it's OK for home use not to use ECC RAM for technical reasons.
Even if it were true that home builds are less susceptible to memory errors, it would not be fair to create the impression that the likelihood of bad memory is so small that we can just ignore the impact and save a few bucks.
People are free to choose not to go for ECC memory for monetary reasons, but that does not justify the choice from a technical perspective and they should be aware that they are taking a risk.