If something goes wrong with my zpool, I'd like to be notified by email. On Linux using MDADM, the MDADM daemon took care of that.
With the release of ZoL 0.6.3, a brand new 'ZFS Event Daemon' or ZED has been introduced.
I could not find much information about it, so consider this article my notes on this new service.
If you want to receive alerts, there is only one requirement: you must set up an MTA on your machine, which is outside the scope of this article.
When you install ZoL, ZED is installed automatically and starts on boot.
The configuration file for ZED can be found here: /etc/zfs/zed.d/zed.rc. Just uncomment the "ZED_EMAIL=" line and fill in your email address. Don't forget to restart the service.
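If you manage several machines, this edit can be scripted. The following is a minimal Python sketch, not something that ships with ZED: `enable_setting` is a hypothetical helper, the sample text assumes the setting is present as a commented-out `#ZED_EMAIL=...` line, and `admin@example.com` is a placeholder address.

```python
import re


def enable_setting(rc_text: str, key: str, value: str) -> str:
    """Uncomment (or append) a KEY="value" line in shell-style rc text."""
    # Match the key at the start of a line, optionally commented out with '#'.
    pattern = re.compile(rf'^[ \t]*#?[ \t]*{re.escape(key)}=.*$', re.MULTILINE)
    new_line = f'{key}="{value}"'
    if pattern.search(rc_text):
        # Replace only the first match; a function is used so the value is
        # never interpreted as a regex replacement template.
        return pattern.sub(lambda _m: new_line, rc_text, count=1)
    # Key not present at all: append it on its own line.
    sep = "" if rc_text.endswith("\n") or not rc_text else "\n"
    return rc_text + sep + new_line + "\n"


if __name__ == "__main__":
    # Assumed sample content, not a verbatim copy of the shipped zed.rc.
    sample = '# Email address of the zpool administrator.\n#ZED_EMAIL="root"\n'
    print(enable_setting(sample, "ZED_EMAIL", "admin@example.com"))
```

The function only transforms text. Writing the result back to the file and restarting the service remain separate steps.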
ZED seems to hook into the zpool event log that is kept in the kernel and monitors these events in real-time.
You can see those events yourself:
root@debian:/etc/zfs/zed.d# zpool events
TIME                           CLASS
Aug 29 2014 16:53:01.872269662 resource.fs.zfs.statechange
Aug 29 2014 16:53:01.873291940 resource.fs.zfs.statechange
Aug 29 2014 16:53:01.962528911 ereport.fs.zfs.config.sync
Aug 29 2014 16:58:40.662619739 ereport.fs.zfs.scrub.start
Aug 29 2014 16:58:40.670865689 ereport.fs.zfs.checksum
Aug 29 2014 16:58:40.671888655 ereport.fs.zfs.checksum
Aug 29 2014 16:58:40.671905612 ereport.fs.zfs.checksum
...
You can see that a scrub was started and that incorrect checksums were discovered. A few seconds later I received the first email:
A ZFS checksum error has been detected:

  eid: 5
 host: debian
 time: 2014-08-29 16:58:40+0200
 pool: storage
 vdev: disk:/dev/sdc1
And soon thereafter:
A ZFS pool has finished scrubbing:

  eid: 908
 host: debian
 time: 2014-08-29 16:58:51+0200
 pool: storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 100M in 0h0m with 0 errors on Fri Aug 29 16:58:51 2014
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0   903

errors: No known data errors
ZED executes commands based on the event class, so it can do more than just send emails: you can customise different actions per event class. The event class is shown in the zpool events output.
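To illustrate the idea of class-based dispatch, here is a minimal Python sketch. It is not how ZED itself is implemented: the `HANDLERS` table, the handler functions, and the suffix-matching rule are all assumptions made for illustration. Only the class names are taken from the zpool events output shown earlier.

```python
from typing import Callable, Dict, List


# Hypothetical handlers; each returns a description of the action taken.
def send_email(event_class: str) -> str:
    return f"email: {event_class}"


def log_only(event_class: str) -> str:
    return f"log: {event_class}"


# Map a class suffix to the handlers that should run for it (assumed rule).
HANDLERS: Dict[str, List[Callable[[str], str]]] = {
    "checksum": [send_email],
    "scrub.start": [log_only],
}


def dispatch(event_class: str) -> List[str]:
    """Run every handler whose key is a dot-delimited suffix of the class."""
    results: List[str] = []
    for suffix, handlers in HANDLERS.items():
        if event_class == suffix or event_class.endswith("." + suffix):
            results.extend(handler(event_class) for handler in handlers)
    return results


if __name__ == "__main__":
    for cls in ("ereport.fs.zfs.checksum",
                "ereport.fs.zfs.scrub.start",
                "resource.fs.zfs.statechange"):
        print(cls, "->", dispatch(cls))
```

A class with no matching entry, such as the statechange events here, simply triggers no action.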
One of the more interesting features is automatic replacement of a defective drive with a hot spare, so full fault tolerance is restored as soon as possible.
I've not been able to get this to work. The ZED scripts would not automatically replace a failed/faulted drive.
There seem to be some known issues. The fixes seem to be in a pending pull request.
To make sure I would actually be alerted, I replicated the ZED configuration of my production environment in a VM.
I simulated a drive failure with dd as stated earlier, but the result was that for every checksum error I received one email. With thousands of checksum errors, I had to clear 1000+ emails from my inbox.
It seems that this option, which is commented out by default, was not enabled.
This option implements a cool-down period where an event is just reported once and suppressed afterwards until the interval expires.
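The cool-down behaviour described here can be sketched in a few lines of Python. This is an illustration of the general technique, not ZED's actual code: the `CoolDown` class, the choice of (pool, event class) as the suppression key, and the 3600-second interval are all assumptions.

```python
import time
from typing import Callable, Dict, Hashable


class CoolDown:
    """Allow one notification per key per interval; suppress the rest."""

    def __init__(self, interval_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval_secs = interval_secs
        self.clock = clock  # injectable so the logic can be tested
        self._last_sent: Dict[Hashable, float] = {}

    def should_send(self, key: Hashable) -> bool:
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.interval_secs:
            return False  # still inside the cool-down window
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    fake_now = [0.0]  # fake clock: time does not advance during the burst
    gate = CoolDown(3600, clock=lambda: fake_now[0])
    key = ("storage", "checksum")  # assumed key: pool name and event class
    sent = sum(gate.should_send(key) for _ in range(1000))
    print(sent)  # 1: the other 999 notifications are suppressed
```

With a gate like this in front of the mailer, a burst of a thousand checksum errors on one pool produces a single email per interval instead of a thousand.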
It would be best if this option were enabled by default.
The ZED authors acknowledge that ZED is a bit rough around the edges, but it sends out alerts consistently and that's what I was looking for, so I'm happy.