Scrub your NAS hard drives regularly if you care about your data

Introduction

Lots of people run a NAS at home. Maybe it's a COTS device from one of the well-known vendors¹, or it's a custom-built (DIY²) solution based on hardware you bought and assembled yourself.

Buying or building a NAS is one thing, but operating it in a way that ensures you won't lose data is something else.

Obviously, the best way to protect against data loss is to make regular backups. Ideally, even if the NAS went up in flames, you would still have your data.

Since backup storage costs money, people make tradeoffs. They may decide to back up only a small portion of the really important data and take their chances with the rest.

Well, that is their right. But still, it would be nice if we could reduce the risk of data loss to a minimum.

The risk: bad sectors

The problem is that hard drives may develop bad sectors over time. Bad sectors are tiny portions of the drive that have become unreadable³. However small a sector may be, any data stored in it is now lost, and this can cause data corruption (one or more corrupt files).

This is the thing: those bad sectors may never be discovered until it is too late!

With today's 14+ TB hard drives, it's easy to store vast amounts of data. Most of that data is probably not frequently accessed, especially at home.

One or more of your hard drives may be developing bad sectors and you wouldn't even know it. How would you?

Your data might be at risk right at this moment while you are reading this article.

A well-known disaster scenario in which people tend to lose data is a double hard drive failure in an array that can tolerate only one drive failure (RAID 1 (mirror) or RAID 5, and in some scenarios RAID 10).

In this scenario, a hard drive in their RAID array has failed and a second drive (one of the remaining good drives) has developed bad sectors. That means effectively a second drive has failed although the drive may still seem operational. Due to the bad sectors, data required to rebuild the array is lost because there is no longer any redundancy⁴.

If you run a (variant of) RAID 5, you can only lose a single disk, so if a second disk fails, you lose all data⁵.
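To make the rebuild problem concrete, here is a minimal Python sketch of single-parity (RAID 5 style) reconstruction. It is a toy model and not how any real RAID implementation is written: the three "drives" are just byte strings, and the names `xor_blocks` and `rebuild_missing` are made up for this illustration.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe: Sequence[Optional[bytes]]) -> bytes:
    """Rebuild the single missing block (None) of a stripe.

    The stripe holds the data blocks plus one parity block. With exactly
    one block missing, XOR-ing the survivors yields the lost block. With
    two or more missing, there is not enough information left.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks unreadable")
    return xor_blocks([block for block in stripe if block is not None])


# Two data blocks and their parity, spread over three toy "drives".
data_a = b"holiday pictures"
data_b = b"tax return 2019!"
parity = xor_blocks([data_a, data_b])

# Drive A has failed: its block can be recomputed from B and the parity.
recovered = rebuild_missing([None, data_b, parity])

# Drive A has failed AND drive B has a bad sector in the same stripe.
try:
    rebuild_missing([None, None, parity])
    second_failure_recoverable = True
except ValueError:
    second_failure_recoverable = False
```

In the first case `recovered` equals the original `data_a`. In the second case the rebuild has nothing to work with for that stripe, which is exactly the situation a bad sector on a surviving drive creates.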

The mitigation: periodic scrubbing / checking of your disks

The only way to find out if a disk has developed bad sectors is to just read them all. Yes: all the sectors.
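Conceptually, such a check is nothing more than a sequential read of the whole device. The Python sketch below reads a device (or any file) chunk by chunk and records the offsets where the operating system reports a read error. Treat it as an illustration only: `find_unreadable_chunks` is a hypothetical helper, `/dev/sdX` is a placeholder, reading a raw device normally requires root, and reads may be served from the page cache. A real scrub in a RAID or ZFS setup also verifies parity or checksums, which this sketch does not attempt.

```python
import os
from typing import List

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time


def find_unreadable_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Read `path` from start to end; return offsets of chunks that fail."""
    bad_offsets: List[int] = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file and, on
        # Linux, of a block device as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                os.pread(fd, min(chunk_size, size - offset), offset)
            except OSError:
                # The drive could not deliver this chunk: note it, move on.
                bad_offsets.append(offset)
            offset += chunk_size
    finally:
        os.close(fd)
    return bad_offsets
```

Usage would look like `find_unreadable_chunks("/dev/sdX")`, with an empty list meaning every chunk could be read.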

Checking your hard drives for bad sectors (or other issues) is called 'data scrubbing'. If you bought a NAS from QNAP, Synology or another vendor, there is a menu which allows you to control how often and when you want to perform a data scrub.

RAID solutions are perfectly capable of handling bad sectors. For a RAID array, it's just equivalent to a failed drive and an affected drive will be kicked out of the RAID array if bad sectors start causing read errors. The big issue we want to prevent is that multiple drives start to develop bad sectors at the same time, because that is the equivalent of multiple simultaneous drive failures, which many RAID arrays can't recover from.

For home users I would recommend checking all hard drives once a month. I would recommend configuring the data scrub to run at night (often the default) because a scrub may impact performance in a way that can be noticeable and even inconvenient.

Your vendor may have already configured a default schedule for data scrubs, so you may have been protected all along. If you take a look, at least you know.

People who have built a DIY NAS have to set up and configure periodic scrubs themselves or they won't happen at all. That's not entirely true, though: I've noticed that on Ubuntu, all Linux software RAID (mdadm) arrays are checked once a month at night. So if you use Linux software RAID you may already be scrubbing.
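For Linux software RAID, the md driver exposes a small sysfs interface per array: writing `check` to `/sys/block/mdX/md/sync_action` starts a read-only check, and `mismatch_cnt` in the same directory holds the mismatch count from the last run. The Python sketch below wraps those two files. The function names are hypothetical, `md0` is a placeholder array name, and writing to sysfs requires root. The `sysfs_root` parameter exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path

SYSFS_BLOCK = Path("/sys/block")


def start_md_check(array: str, sysfs_root: Path = SYSFS_BLOCK) -> bool:
    """Ask the md driver to check `array` (e.g. "md0").

    Returns False without doing anything if the array is busy
    (for example already checking, resyncing or recovering).
    """
    action_file = sysfs_root / array / "md" / "sync_action"
    if action_file.read_text().strip() != "idle":
        return False
    action_file.write_text("check\n")
    return True


def md_mismatch_count(array: str, sysfs_root: Path = SYSFS_BLOCK) -> int:
    """Return the mismatch count reported for the last check."""
    return int((sysfs_root / array / "md" / "mismatch_cnt").read_text())
```

A monthly cron job or systemd timer could call `start_md_check("md0")` at night, which matches the schedule recommended for home users.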

A drive that develops bad sectors should be replaced as soon as possible. It should no longer be trusted. The goal of scrubbing is to identify these drives as soon as possible. You don't want to end up in a position where multiple drives have started developing bad sectors. You can only prevent that risk by scanning for bad sectors periodically and replacing bad drives.

You should not be afraid of having to spend a ton of money replacing drives all the time. Bad sectors are not that common. But they are common enough that you should check for them. There is a reason why NAS vendors offer the option to run data scrubs and recommend them⁶.

You probably forgot to configure email alerting

If a disk in your NAS failed, how would you know? If a scrub discovered bad sectors, would you ever notice⁷?

The answer may be: only when it's too late. Maybe a drive already failed and you haven't even noticed yet!
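On a DIY NAS that uses Linux software RAID, one way to answer that question is to look at `/proc/mdstat`, where a degraded array shows an underscore in its member map (for example `[U_]`). The Python sketch below parses that text and lists degraded arrays. It applies only to Linux md arrays, `degraded_arrays` is a hypothetical helper, and the sample text is made up to resemble typical output.

```python
import re
from typing import Dict, Optional

# An array section starts with a line such as "md0 : active raid1 ...".
ARRAY_LINE = re.compile(r"^(md\w+) : ")
# The status line ends with e.g. "[2/1] [U_]"; "_" marks a missing member.
STATUS = re.compile(r"\[\d+/\d+\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> Dict[str, str]:
    """Map each degraded array name to its member map, e.g. {"md1": "UU_"}."""
    degraded: Dict[str, str] = {}
    current: Optional[str] = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
        status = STATUS.search(line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


# Illustrative sample only, not captured from a real system.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1] sde1[3](F)
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

sample_result = degraded_arrays(SAMPLE)
```

On a live system the input would come from reading `/proc/mdstat`, and a non-empty result is a reason to send an alert.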

When you've finished reading this article, it may be the right moment to take some time to check the status of your NAS and configure email alerting (or any other alerting mechanism that works for you). Make your NAS send out a test message just to confirm it actually works!
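For a DIY NAS without a vendor-provided alerting menu, a test message can be sent with a few lines of Python from the standard library. This is a sketch under assumptions: the addresses, the host name `nas01`, the SMTP server `smtp.example.com` and port 587 with STARTTLS are all placeholders, and `build_alert` and `send_alert` are hypothetical helper names.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender: str, recipient: str, hostname: str, body: str) -> EmailMessage:
    """Compose a plain-text alert message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"[NAS alert] {hostname}"
    message.set_content(body)
    return message


def send_alert(message: EmailMessage, smtp_host: str, smtp_port: int = 587,
               username: str = "", password: str = "") -> None:
    """Send the message over SMTP with STARTTLS (assumed server setup)."""
    with smtplib.SMTP(smtp_host, smtp_port, timeout=30) as smtp:
        smtp.starttls()
        if username:
            smtp.login(username, password)
        smtp.send_message(message)


test_message = build_alert(
    sender="nas@example.com",
    recipient="you@example.com",
    hostname="nas01",
    body="This is a test alert. If you can read this, alerting works.",
)
# Uncomment and fill in your own mail server details to actually send it:
# send_alert(test_message, "smtp.example.com")
```

The point of the test message is the same as in the vendor menus: an alert path that has never delivered a message should not be assumed to work.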

Closing words

So I would like to advise you to do two things:

Make sure your NAS runs a data scrub once a month.
Make sure your NAS is able to email alerts about failed disks or scrubs.

These actions allow you to fix problems before they become catastrophic.

P.S. S.M.A.R.T. monitoring

Hard drives have a built-in monitoring system called S.M.A.R.T.

If you have a NAS from one of the NAS vendors, it will alert on SMART monitoring information that indicates a drive is failing. DIY builders may have to spend time setting up this kind of monitoring manually.
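As a starting point for manual monitoring, the Python sketch below asks `smartctl` (from smartmontools) for a JSON report and picks out the attributes usually associated with bad sectors on ATA drives. Several things here are assumptions to verify on your own system: JSON output needs smartmontools 7.0 or newer, the field layout (`ata_smart_attributes` → `table` → `raw` → `value`) is what I expect that output to look like, `/dev/sdX` is a placeholder, and the function names are hypothetical.

```python
import json
import subprocess
from typing import Dict

# Attributes commonly associated with bad sectors on ATA drives.
WATCHED_ATTRIBUTES = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def read_smart(device: str) -> dict:
    """Run smartctl and return its JSON report (usually needs root)."""
    # smartctl uses its exit status as a bitmask of findings, so a
    # non-zero status is not treated as a failure to run here.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def sector_warnings(report: dict) -> Dict[str, int]:
    """Return watched attributes whose raw value is above zero."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["name"]: attr["raw"]["value"]
        for attr in table
        if attr.get("id") in WATCHED_ATTRIBUTES and attr["raw"]["value"] > 0
    }
```

Calling `sector_warnings(read_smart("/dev/sdX"))` periodically and alerting on a non-empty result would give a DIY build roughly the kind of warning a vendor NAS sends.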

For more information about SMART I would recommend this article and also this one.

Linux users can take a look at the SMART status of their hard drives with this tool (which I made).

¹ QNAP, Synology, Netgear, Buffalo, Thecus, Western Digital, and so on.
² FreeNAS, Unraid, Windows/Linux with Snapraid, OpenMediaVault, or a custom solution, and so on.
³ Bad sectors cause 'unrecoverable read errors' or UREs. Bad sectors have nothing to do with 'silent data corruption'. There's nothing silent about unrecoverable read errors. Hard drives report read errors back to the operating system; they won't go unnoticed.
⁴ A DIY NAS based on ZFS (FreeNAS is based on ZFS) may help mitigate the impact of such an event. ZFS can continue reading data from the remaining drives, even if bad sectors are encountered. Some files will be corrupted, but most of the data would still be readable. I think this capability is by itself not enough reason to pick a NAS based on ZFS, because ZFS also comes with a cost that you need to accept. For my large NAS I have chosen ZFS because I was prepared to 'pay the cost'.
⁵ Some people may choose to go with RAID 6, which tolerates two simultaneous drive failures, but they also tend to run larger arrays with more drives, which increases the risk of a drive failing or developing bad sectors.
⁶ Enterprise storage solutions (even entry-level storage arrays) often run patrol reads both on individual hard drives and on the RAID arrays on top of them. These are also enabled by default.
⁷ At one time I worked for a small company that ran their own (single) email server. One of the system administrators discovered totally by accident that one of the two drives in a RAID 1 had failed. It turned out we had been running on a single drive for months before we discovered it, because we forgot to set up email alerting. We didn't lose data, but we came close.

Louwrentius