1. The ZFS Event Daemon on Linux

    August 29, 2014

    If something goes wrong with my zpool, I'd like to be notified by email. On Linux using MDADM, the MDADM daemon took care of that.

    With the release of ZoL 0.6.3, a brand new 'ZFS Event Daemon' or ZED has been introduced.

    I could not find much information about it, so consider this article my notes on this new service.

    If you want to receive alerts, there is only one requirement: you must set up an MTA on your machine, which is outside the scope of this article.

    When you install ZoL, the ZED daemon is installed automatically and will start on boot.

    The configuration file for ZED can be found at /etc/zfs/zed.d/zed.rc. Just uncomment the "ZED_EMAIL=" line and fill in your email address. Don't forget to restart the service.
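
    As a rough sketch (the email address is obviously a placeholder, and the service name may differ per distribution, e.g. 'zed' or 'zfs-zed'):

    # in /etc/zfs/zed.d/zed.rc:
    ZED_EMAIL="you@example.org"

    # restart the daemon so the setting takes effect:
    service zed restart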

    ZED seems to hook into the zpool event log that is kept in the kernel and monitors these events in real-time.

    You can see those events yourself:

    root@debian:/etc/zfs/zed.d# zpool events
    TIME                           CLASS
    Aug 29 2014 16:53:01.872269662 resource.fs.zfs.statechange
    Aug 29 2014 16:53:01.873291940 resource.fs.zfs.statechange
    Aug 29 2014 16:53:01.962528911 ereport.fs.zfs.config.sync
    Aug 29 2014 16:58:40.662619739 ereport.fs.zfs.scrub.start
    Aug 29 2014 16:58:40.670865689 ereport.fs.zfs.checksum
    Aug 29 2014 16:58:40.671888655 ereport.fs.zfs.checksum
    Aug 29 2014 16:58:40.671905612 ereport.fs.zfs.checksum
    ...
    

    You can see that a scrub was started and that incorrect checksums were discovered. A few seconds later I received an email:

    The first email:

    A ZFS checksum error has been detected:
    
      eid: 5
     host: debian
     time: 2014-08-29 16:58:40+0200
     pool: storage
     vdev: disk:/dev/sdc1
    

    And soon thereafter:

    A ZFS pool has finished scrubbing:
    
      eid: 908
     host: debian
     time: 2014-08-29 16:58:51+0200
     pool: storage
    state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
      see: http://zfsonlinux.org/msg/ZFS-8000-9P
     scan: scrub repaired 100M in 0h0m with 0 errors on Fri Aug 29 16:58:51 2014
    config:
    
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0   903
    
    errors: No known data errors
    

    Awesome!

    The ZED daemon executes scripts based on the event class, so it can do more than just send email: you can customise different actions per event class. The event class can be seen in the zpool events output.
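
    The scripts ('zedlets') that ZED runs live in the same directory as zed.rc, and ZED hands them the event details as ZEVENT_* environment variables. A quick way to poke around (script names vary per version, so treat this as a sketch):

    # list the scripts ZED will execute per event class
    ls /etc/zfs/zed.d/

    # show the full event payload that ZED passes to those scripts
    zpool events -v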

    One of the more interesting features is automatic replacement of a defective drive with a hot spare, so that full fault tolerance is restored as soon as possible.

    I've not been able to get this to work. The ZED scripts would not automatically replace a failed/faulted drive.

    There seem to be some known issues. The fixes seem to be in a pending pull request.

    Just to make sure I would get alerted, I replicated the ZED configuration of my production environment in a VM.

    I simulated a drive failure with dd, as stated earlier, but the result was that I received one email for every checksum error. With thousands of checksum errors, I had to clear 1000+ emails from my inbox.
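
    For completeness, this is roughly what such a dd-based corruption test looks like; these are not my exact commands, the device and offset are made up, and it destroys data, so only do this on a throwaway test pool in a VM:

    # overwrite part of one mirror member with random data (destructive!)
    dd if=/dev/urandom of=/dev/sdc1 bs=1M count=100 seek=1000 oflag=direct

    # trigger a scrub so ZFS detects (and repairs) the checksum errors
    zpool scrub storage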

    It turned out that this option, which is commented out by default, was not enabled:

    ZED_EMAIL_INTERVAL_SECS="3600"
    

    This option implements a cool-down period in which a recurring event is reported once and then suppressed until the interval expires.

    It would be best if this option were enabled by default.

    The ZED authors acknowledge that ZED is a bit rough around the edges, but it sends out alerts consistently and that's what I was looking for, so I'm happy.

    Tagged as : ZFS event daemon
  2. Installation of ZFS on Linux Hangs on Debian Wheezy

    August 29, 2014

    After a fresh net-install of Debian Wheezy, I was unable to compile the ZFS on Linux kernel module. I had installed build-essential (apt-get install build-essential), but that wasn't enough.

    The apt-get install debian-zfs command would just hang.

    I noticed a hung 'configure' process and killed it; after a few seconds, the installer continued after spewing out this error:

    Building initial module for 3.2.0-4-amd64
    Error! Bad return status for module build on kernel: 3.2.0-4-amd64 (x86_64)
    Consult /var/lib/dkms/zfs/0.6.3/build/make.log for more information.
    

    So I ran ./configure manually inside the mentioned directory and then I got this error:

    checking for zlib.h... no
    configure: error: in `/var/lib/dkms/zfs/0.6.3/build':
    configure: error: 
        *** zlib.h missing, zlib-devel package required
    See `config.log' for more details
    

    So I ran apt-get install zlib1g-dev, but still no luck:

    checking for uuid/uuid.h... no
    configure: error: in `/var/lib/dkms/zfs/0.6.3/build':
    configure: error: 
        *** uuid/uuid.h missing, libuuid-devel package required
    See `config.log' for more details
    

    I searched a bit online and found this link, which listed some additional packages that may be missing, and I installed them all with:

    apt-get install zlib1g-dev uuid-dev libblkid-dev libselinux-dev parted lsscsi wget
    

    This time ./configure ran fine and I could manually build and install the kernel module and import my existing pool.
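
    Roughly, the manual steps looked like this (the pool name at the end is a placeholder; yours will differ):

    cd /var/lib/dkms/zfs/0.6.3/build
    ./configure
    make
    make install

    modprobe zfs
    zpool import tank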

    Tagged as : ZFS Wheezy
  3. Please Use ZFS With ECC Memory

    August 27, 2014

    Some people say that it's OK (acceptable risk) to run a ZFS NAS without ECC memory.

    I'd like to make the case that this is very bad advice and that these people are doing other people a disservice.

    Running ZFS without ECC memory gives you a false sense of security, and it can lead to serious data corruption or even the loss of the whole zpool. You may lose all your data. Here is a nice paper about ZFS and how it handles corrupt memory (it doesn't!).

    ZFS was designed to run on hardware with ECC memory and it trusts memory blindly. ZFS addresses data integrity on disk. ECC memory addresses data integrity in memory. Each tool has its own purpose: use the right tool for the job.

    ZFS combined with bad RAM may be a significantly bigger threat to the data on your NAS than using EXT4/XFS/UFS. Not only because the file system may get corrupted and can no longer be imported, but also because there are no file system recovery tools available for ZFS. With the older file systems, you at least stand some chance of saving some of your data.

    ZFS amplifies the impact of bad memory

    Aaron Toponce explains the danger of bad non-ECC memory with some examples. ZFS tries to repair data if it thinks it is corrupt. But since ZFS trusts RAM, it cannot distinguish between bad RAM and bad disk data, and it will start to 'repair' good data. This causes further corruption and further damages the data on disk. Imagine what happens if you perform regular scrubs of your data.

    Personally, I think that even for a home NAS it's best to use ECC memory, regardless of whether you use ZFS. It makes for more stable hardware. If money is a real constraint, it's better to take a look at AMD's offerings than to skip ECC memory for a bit more performance.

    ZFS is just one part of the data integrity/availability puzzle

    From a technical perspective, it is always a bad choice to buy non-ECC memory for your DIY NAS. But you may have non-technical reasons not to buy ECC memory, like 'monetary' reasons.

    ECC memory is a bit more expensive, but the question is: what is your goal?

    If you care about your data and would lose sleep over the risk of silent data corruption, you need to go all the way to be safe. ZFS covers the risk of drives spewing corrupt data, extra drives cover the risk of drive failures and ECC memory covers the risk of bad memory.

    ZFS itself is free. But data integrity and availability are not. We know that hardware can fail, in particular hard drives. So we buy some extra drives and sacrifice capacity in exchange for reliability. We pay real money to gain some safety. Why not with memory? Why would it suddenly not be necessary to do for memory exactly what ZFS and extra drives do for hard disk drives?

    The ECC vs. non-ECC debate is about whether the likelihood and the impact of a RAM bit flip warrant the extra cost of ECC memory for home usage. But before we look at the numbers, let's just think about this for a moment.

    The only argument is that the likelihood of memory corruption is low. But there is no data on this for home environments; it's just anecdotes and hearsay. The trouble is that non-ECC machines never tell you to your face that you just encountered a memory bit error. The machine just crashes or reboots, some application crashes, or some file is suddenly lost. How do you know you've never experienced bad memory?

    The chance is low, but if it goes wrong, the impact could be very high.

    I like this argument from Andrew Galloway, who has an even stronger opinion in this debate:

    Would you press a button with a 100$ reward if there's a one in ten thousand
    chance that you will get zapped by a lightning strike and die instead of
    getting that 100$?
    

    Is the small risk of losing all your data worth the $100 reward?

    Vendors like HP and Dell do not ship a single server or workstation with non-ECC RAM. Even the cheapest tower-model servers for small businesses contain ECC memory. Please let that sink in for a moment.

    On the FreeNAS forum, they've seen multiple people lose their data because of memory corruption rendering their zpool unusable. For a nice and very opinionated read, check this topic.

    non-ECC hardware will not warn you

    How long will it take you to notice that your NAS has memory problems? By the very nature of non-ECC memory and the related hardware (motherboard), there is no way to tell when memory has gone bad. By the time you notice, it may be too late. Just think about what will happen when a scrub starts.

    ECC motherboards log memory events to the BIOS and those events can often be read through IPMI from within the operating system.

    The Google study

    Now let's take a look at some data. I'm using the Google study that some of you may already be familiar with.

    Our first observation is that memory errors are not rare events.
    About a third of all machines in the fleet experience at least one memory
    error per year [...]
    

    One in three machines faces at least one memory error per year. But a machine contains multiple memory modules.

    Around 20% of DIMMs in Platform A and B are affected by correctable errors
    per year, compared to less than 4% of DIMMs in Platform C and D.
    

    So let's assume that your hardware is of a better design, like platforms C and D. In that case, each memory module has a four percent chance per year of seeing a correctable error. Remember that your NAS has at least two memory modules.

    So the chance of a single module seeing no errors in a year is 96%. With two modules, that is 0.96 x 0.96 = 92% chance that everything will be fine that year, or in other words an 8% chance that some error will occur. With four memory modules, the chance of a clean year drops to 0.96^4 = 85%, so the risk is about 15% per year that you will face at least one memory error.

    A memory error may not immediately lead to the total loss of your pool, but still, I find this number quite high.

    There are more interesting observations in this paper.

    Memory errors can be classified into soft errors, which randomly corrupt
    bits, but do not leave any physical damage; and hard errors, which corrupt
    bits in a repeatable manner because of a physical defect (e.g. “stuck bits”).
    
    [...]
    
    Conclusion 7: Error rates are unlikely to be dominated by soft errors.
    We observe that CE rates are highly correlated with system utilization,
    even when isolating utilization effects from the effects of temperature.
    

    So if I understand this correctly, error rates are driven mainly by how heavily the CPU and RAM are used, which points at hard errors rather than random soft errors; cosmic radiation does not seem to be the cause that often.

    Please note that Google did not measure hard or soft errors directly as they can't distinguish between them.

    Brian Moses blogged about his reasons why he did not choose ECC memory for his NAS box. Although most of his arguments are not very strong in my opinion, he pointed out something interesting.

    Google found that there is a strong correlation between memory errors and the CPU/RAM usage of the machine.

    We observe clear trends of increasing correctable error rates with
    increasing CPU utilization and allocated memory. Averaging across all
    platforms, it seems that correctable error rates grow roughly
    logarithmically as a function of utilization levels (based on the roughly
    linear increase of error rates in the graphs, which have log scales on the
    X-axis).
    

    A major difference between Google's servers and your home NAS is that your home NAS won't see much usage of either memory or CPU in general, so if the relation is logarithmic in nature, the risk of seeing memory errors in a low-utilisation environment should be reduced. But what kind of number can we put on that? 1% per memory module per year? Or 0.1%?

    Are you the person who is going to find out?

    This information may be taken as an indication that in a home environment, memory problems are less likely than on heavily used systems in a data center, but are you going to bet your data on that assumption?

    Most people run their NAS 24/7. Often, it has other tasks besides storing files, and those may put a load on the system. Furthermore, ZFS tends to use as much memory as possible for caching purposes, increasing the risk of hitting bad memory. And ZFS users need to perform regular scrubs of their pool, which cause a lot of disk, CPU and RAM activity.

    Inform people and give them a choice

    When people seek advice on their NAS builds, ECC memory should always be recommended, and I think that nobody should create the impression that, for technical reasons, it's OK not to use ECC RAM at home.

    Even if it were true that home builds are less susceptible to memory errors, it would not be fair to create the impression that the likelihood of bad memory is so small that we can just ignore the impact and save a few bucks.

    People are free to choose not to go for ECC memory for monetary reasons, but that does not justify the choice from a technical perspective and they should be aware that they are taking a risk.

    Tagged as : ZFS ECC
  4. 71 TiB DIY NAS Based on ZFS on Linux

    August 02, 2014

    Update 2014-08-23: I have decided to switch from a single RAIDZ3 to a large and small RAIDZ2 and use 4K aligned VDEV sizes.


    This is my new 71 TiB DIY NAS. This server is the successor to my six-year-old, twenty-drive 18 TB NAS (17 TiB). With a storage capacity four times higher than the original and incredible read (2.5 GB/s) / write (1.9 GB/s) performance, it's a worthy successor.


    Purpose

    The purpose of this machine is to store backups and media, primarily video.

    The specs

    Part            Description
    Case            Ri-vier RV-4324-01A
    Processor       Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
    RAM             16 GB
    Motherboard     Supermicro X9SCM-F
    LAN             Intel Gigabit (Quad-port) (Bonding)
    PSU             Seasonic Platinum 860
    Controller      3 x IBM M1015
    Disk            24 x HGST HDS724040ALE640 4 TB (7200RPM)
    SSD             2 x Crucial M500 120GB
    Arrays          Boot: 2 x 60 GB RAID 1; storage: 18 disk RAIDZ2 + 6 disk RAIDZ2
    Brutto storage  86 TiB (96 TB)
    Netto storage   71 TiB (78 TB)
    OS              Linux Debian Wheezy
    Filesystem      ZFS
    Rebuild time    Depends on used space
    UPS             Back-UPS RS 1200 LCD using Apcupsd
    Power usage     About 200 Watt idle


    CPU

    The Intel Xeon E3-1230 V2 is not the latest generation but one of the cheapest Xeons you can buy and it supports ECC memory. It's a quad-core processor with hyper-threading.

    Here you can see how it performs compared to other processors.

    Memory

    The system has sixteen GB ECC RAM. Memory is relatively cheap these days but I don't have any reason to upgrade to thirty-two GB. I think that eight GB would have been fine with this system.

    Motherboard

    The server is built around the SuperMicro X9SCM-F motherboard.

    This is a server grade motherboard and comes with typical features you might expect from such a board, like ECC memory support and out-of-band management (IPMI).

    [Image: top view of the Supermicro motherboard]

    This motherboard has four PCIe slots (2 x 8x and 2 x 4x), all with an 8x physical connector. My build requires four PCIe 4x-or-faster slots, and there aren't (m)any other server boards at this price point that offer four PCIe slots in 8x-sized connectors.

    The chassis

    The chassis has six rows of four drive bays that are kept cool by three 120mm fans in a fan wall behind the drive bays. At the rear of the case, there are two 'powerful' 80mm fans that remove the heat from the case, together with the PSU.

    The chassis has six SAS backplanes that connect four drives each. The backplanes have dual molex power connectors, so you can put redundant power supplies into the chassis. Redundant power supplies are more expensive and due to their size, often have smaller, thus noisier fans. As this is a home build, I opted for just a single regular PSU.

    When facing the front, there is a place at the left side of the chassis to mount a single 3.5 inch or two 2.5 inch drives next to each other as boot drives. I've mounted two SSDs (RAID1).

    This particular chassis version has support for SGPIO, which should help identify which drive has failed. The IBM M1015 cards I use do support SGPIO. Through the LSI MegaCLI tool I have verified that SGPIO works, as you can use this tool as a drive locator. I'm not entirely sure how well SGPIO works with ZFS.
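
    As an example of what I mean by drive locating, the MegaCLI invocations look roughly like this (the binary name varies per packaging, and the enclosure:slot values are placeholders; check the PDList output for your own values):

    # list physical drives with their enclosure:slot IDs
    megacli -PDList -aALL

    # blink and then stop the locate LED of the drive in enclosure 8, slot 4
    megacli -PdLocate -start -physdrv[8:4] -aALL
    megacli -PdLocate -stop  -physdrv[8:4] -aALL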

    Power supply

    I was using a Corsair 860i before, but it was unstable and died on me.

    The Seasonic Platinum 860 may seem like overkill for this system. However, I'm not using staggered spinup for the twenty-four drives. So the drives all spinup at once and this results in a peak power usage of 600+ watts.

    The PSU has a silent mode that causes the fan to spin only when the load reaches a certain threshold. Since the PSU fan also helps remove warm air from the chassis, I've disabled this feature, so the fan spins at all times.

    Drive management

    I've written a tool called lsidrivemap that displays each drive in an ASCII table that reflects the physical layout of the chassis.

    The data is based on the output of the LSI 'megacli' tool for my IBM 1015 controllers.

    root@nano:~# lsidrivemap disk
    
    | sdr | sds | sdt | sdq |
    | sdu | sdv | sdx | sdw |
    | sdi | sdl | sdp | sdm |
    | sdj | sdk | sdn | sdo |
    | sdb | sdc | sde | sdf |
    | sda | sdd | sdh | sdg |
    

    This layout is 'hardcoded' for my chassis but the Python script can be easily tailored for your own server, if you're interested.

    It can also show the temperature of the disk drives in the same table:

    root@nano:~# lsidrivemap temp
    
    | 36 | 39 | 40 | 38 |
    | 36 | 36 | 37 | 36 |
    | 35 | 38 | 36 | 36 |
    | 35 | 37 | 36 | 35 |
    | 35 | 36 | 36 | 35 |
    | 34 | 35 | 36 | 35 |
    

    These temperatures show that the top drives run a bit hotter than the other drives. An unverified explanation could be that the three 120mm fans are not in the center of the fan wall. They are skewed to the bottom of the wall, so they may favor the lower drive bays.

    Filesystem (ZFS)

    I'm using ZFS as the file system for the storage array. At this moment, there is no other file system that has the same features and stability as ZFS. BTRFS is not even finished.

    The number one design goal of ZFS was assuring data integrity. ZFS checksums all data and if you use RAIDZ or a mirror, it can even repair data. Even if it can't repair a file, it can at least tell you which files are corrupt.

    ZFS is not primarily focused on performance, but to get the best performance possible, it makes heavy use of RAM to cache both reads and writes. This is why ECC memory is so important.

    ZFS also implements RAID, so there is no need to use MDADM. My previous file server ran a RAID 6 of twenty 1 TB drives. With twenty-four 4 TB drives, rebuild times will be longer and the risk of an unrecoverable read error is increased as well, which is why I'd like the array to survive more than two drive failures.

    The fun thing is that ZFS supports triple-parity RAID, a feature not found in MDADM or many other RAID solutions. With RAIDZ3, the server can lose three drives and the data will still be intact.


    Update 2014-08-23: I have decided to switch from a single twenty-four disk RAIDZ3 with triple parity and ashift=9 to two RAIDZ2 VDEVs, mainly for performance reasons.


    ZFS has a nice feature where it can use fast, low-latency SSD storage as both read and write cache. For my home NAS build, both are entirely unnecessary and would only wear down my SSDs.

    Capacity

    Vendors still advertise the capacity of their hard drives in TB whereas the operating system works with TiB. So the 4 TB drives I use are in fact 3.64 TiB.

    The total raw storage capacity of the system is about 86 TiB. I've placed the twenty-four drives in a single RAIDZ3 VDEV. This gives me a netto capacity of 74 TiB.


    Update 2014-08-23:

    My zpool now uses the appropriate number of disks (2^n + parity) in its VDEVs: one 18-disk RAIDZ2 VDEV (2^4 + 2) and one 6-disk RAIDZ2 VDEV (2^2 + 2), for a total of twenty-four drives.

    Different VDEV sizes are often not recommended, but ZFS is very smart and cool: it load-balances the data across the VDEVs based on the size of each VDEV. I could verify this with zpool iostat -v 5 and witness it in real time. The small VDEV got just a fraction of the data compared to the large VDEV.
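
    To illustrate, creating such a pool and watching the distribution looks roughly like this (device names are abbreviated placeholders; in practice you list all twenty-four /dev/disk/by-id/ paths):

    # one 18-disk and one 6-disk RAIDZ2 VDEV in the same pool
    zpool create storage -o ashift=12 \
        raidz2 disk01 disk02 [...] disk18 \
        raidz2 disk19 disk20 [...] disk24

    # watch how writes are spread across the two VDEVs
    zpool iostat -v storage 5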

    This choice leaves me with less capacity (71 TiB vs. 74 TiB) and also has a bit more risk to it, with the eighteen-disk RAIDZ2 VDEV. Regarding this latter risk, I've been running a twenty-disk MDADM RAID6 for the last 6 years and haven't seen any issues. That does not tell everything, but I'm comfortable with this risk.

    So why did I change my mind? Because the performance of my ashift=9 pool on my 4K drives deteriorated so much that a resilver of a failed drive would take ages.


    Storage controllers

    The IBM M1015 HBAs are reasonably priced, and buying three of them is often cheaper than buying just one HBA plus a SAS expander. However, it may be cheaper still to hunt down an HP SAS expander and use it with just one M1015, saving a PCIe slot.

    I have not flashed the controllers to 'IT mode', as most people do. They worked out-of-the-box as HBAs and although it may take a little bit longer to boot the system, I decided not to go through the hassle.

    The main risk here is how the controller handles a drive if a sector is not properly read. It may disable the drive entirely, which is not necessary for ZFS and often not preferred.

    Storage performance

    With twenty-four drives in a chassis, it's interesting to see what kind of performance you can get from the system.

    Let's start with a twenty-four drive RAID 0. The drives I use have a sustained read/write speed of 160 MB/s so it should be possible to reach 3840 MB/s or 3.8 GB/s. That would be amazing.

    This is the performance of a RAID 0 (MDADM) of all twenty-four drives.

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000
    1048576000000 bytes (1.0 TB) copied, 397.325 s, 2.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    1048576000000 bytes (1.0 TB) copied, 276.869 s, 3.8 GB/s
    

    Dead on, you would say, but if you divide 1 TB with 276 seconds, it's more like 3.6 GB/s. I would say that's still quite close.

    This machine will be used as a file server and a bit of redundancy would be nice. So what happens if we run the same benchmark on a RAID6 of all drives?

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=100000
    104857600000 bytes (105 GB) copied, 66.3935 s, 1.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 38.256 s, 2.7 GB/s
    

    I'm quite pleased with these results, especially for a RAID6. However, RAID6 with twenty-four drives feels a bit risky. And since there is no support for triple-parity RAID in MDADM/Linux, I use ZFS.

    Sacrificing performance, I decided - as I mentioned earlier - to use ashift=9 on those 4K sector drives, because I gained about 5 TiB of storage in exchange.

    This is the performance of twenty-four drives in a RAIDZ3 VDEV with ashift=9.

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    Compared to the other results, write performance is way down, but 1 GB/s isn't too shabby, is it? I'm not complaining. And with this ashift=9 setting, I get 74 TiB of netto (actual usable) storage.


    Update 2014-08-23: as I put my data on this zpool, write performance deteriorated to about 830 MB/s and became worse. I tested a resilver of a drive and it seemed to take ages. This is why I abandoned the RAIDZ3 setup.

    This is the write performance of the 18 disk RAIDZ2 + 6 disk RAIDZ2 zpool (ashift=12):

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000 
    1048576000000 bytes (1.0 TB) copied, 543.072 s, 1.9 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M 
    1048576000000 bytes (1.0 TB) copied, 400.539 s, 2.6 GB/s
    

    As you may notice, the write performance is better than that of either the ashift=9 or the ashift=12 RAIDZ3 VDEV.


    I have not benchmarked random I/O performance as it is not relevant for this system. And with ZFS, the random I/O performance of a VDEV is that of a single drive.

    Boot drives

    I'm using two Crucial M500 120GB SSD drives. They are configured in a RAID1 (MDADM) and I've installed Debian Wheezy on top of them.

    At first, I was planning on using part of their capacity for ZFS caching purposes. However, there's no real need to do so. In hindsight, I could also have used two very cheap 2.5" hard drives (similar to my older NAS), which would have cost less than a single M500.

    Update 2014-09-01: I actually reinstalled Debian and kept about 50% free space on both M500s and put this space in a partition. That partition has been provided to the ZFS pool as L2ARC cache. I did this because I could, but on the other hand, I wonder if I'm just wearing out my SSDs faster.
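
    For reference, adding such partitions as L2ARC cache devices is a one-liner (the partition names below are placeholders):

    zpool add storage cache /dev/sda3 /dev/sdb3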

    Networking

    Maybe I will invest in 10Gbit ethernet or infiniband hardware in the future, but for now I settled on a quad-port gigabit adapter. With Linux bonding, I can still get 450+ MB/s data transfers, which is sufficient for my needs.
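
    For reference, a minimal Debian ifupdown bonding config might look like the sketch below. This is not my exact setup: the interface names, mode and addressing are made up, and as noted below the four ports actually sit in different VLANs.

    # /etc/network/interfaces (requires the ifenslave package)
    auto bond0
    iface bond0 inet static
        address 192.168.0.10
        netmask 255.255.255.0
        bond-slaves eth2 eth3 eth4 eth5
        bond-mode balance-rr
        bond-miimon 100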

    The quad-port card is in addition to the two on-board gigabit network cards. I use one of the on-board ports for client access. The four ports on the quad-port card are all in different VLANs and not accessible for client devices.

    The storage will be accessible over NFS and SMB.

    Keeping things cool and quiet

    It's important to keep the drive temperature at acceptable levels, and with 24 drives packed together, there is an increased risk of overheating.

    The chassis is well-equipped to keep the drives cool with three 120mm fans and two strong 80mm fans, all supporting PWM (pulse-width modulation).

    The problem is that, by default, the BIOS runs the fans at too low a speed to keep the drives at a reasonable temperature. I'd like to keep the hottest drive at about forty degrees Celsius, but I also want to keep the noise at reasonable levels.

    I wrote a python script called storagefancontrol that automatically adjusts the fan speed based on the temperature of the hottest drive.
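
    The linked script does the real work; the shell sketch below only illustrates the idea. The hwmon PWM path is a placeholder that differs per board, and drives behind the M1015 may need smartctl's '-d megaraid,N' option instead of the plain device name.

    # find the hottest drive temperature via SMART
    HOTTEST=$(for d in /dev/sd?; do
        smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}'
    done | sort -n | tail -1)

    # map the temperature to a PWM value and write it to the fan controller
    # (placeholder sysfs path; find the right hwmon node for your board)
    if [ "$HOTTEST" -gt 40 ]; then PWM=255; else PWM=120; fi
    echo "$PWM" > /sys/class/hwmon/hwmon1/device/pwm2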

    UPS

    I'm running a HP N40L micro server as my firewall/router. My APC Back-UPS RS 1200 LCD (720 Watt) is connected with USB to this machine. I'm using apcupsd to monitor the UPS and shutdown servers if the battery runs low.

    All servers, including my new build, run apcupsd in network mode and talk to the N40L to learn if power is still OK.
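
    In apcupsd terms, that means the N40L exposes the Network Information Server and the other machines point their apcupsd at it. Roughly (hostname and port are examples):

    # on the N40L (the machine with the UPS on USB), in /etc/apcupsd/apcupsd.conf:
    NETSERVER on
    NISPORT 3551

    # on the other servers, also in /etc/apcupsd/apcupsd.conf:
    UPSCABLE ether
    UPSTYPE net
    DEVICE n40l.example.lan:3551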

    Keeping power consumption reasonable

    So these are the power usage numbers.

     96 Watt with disks in spin down.
    176 Watt with disks spinning but idle.
    253 Watt with disks writing.
    

    But the most important stat is that it's using 0 Watt if powered off. The system will be turned on only when necessary through wake-on-lan. It will be powered off most of the time, like when I'm at work or sleeping.
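
    Waking the machine up is then a matter of sending a magic packet from another host (the MAC address below is a placeholder; wake-on-lan must also be enabled in the BIOS and, if needed, on the NIC with ethtool):

    # make sure the NIC keeps listening for magic packets
    ethtool -s eth0 wol g

    # from another machine on the same LAN:
    wakeonlan 00:11:22:33:44:55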

    Cost

    The system has cost me about €6000. All costs below are in Euro and include taxes (21%).

    Description      Product                                        Price  Amount  Total
    Chassis          Ri-vier 4U 24bay storage chassis RV-4324-01A     554       1    554
    CPU              Intel Xeon E3-1230V2                             197       1    197
    Mobo             SuperMicro X9SCM-F                               157       1    157
    RAM              Kingston DDR3 ECC KVR1333D3E9SK2/16G             152       1    152
    PSU              AX860i 80Plus Platinum                           175       1    175
    Network card     NC364T PCI Express Quad Port Gigabit             145       1    145
    HBA controller   IBM ServeRAID M1015                              118       3    354
    SSDs             Crucial M500 120GB                                62       2    124
    Fan              Zalman FB123 Casefan Bracket + 92mm Fan            7       1      7
    Hard drive       Hitachi 3.5" 4TB 7200RPM (0S03356)               166      24   3984
    SAS cables                                                         25       6    150
    Fan cables                                                          6       1      6
    SATA-to-Molex                                                     3.5       1    3.5
    Molex splitter                                                      3       1      3
    Total                                                                          6012

    Closing words

    If you have any questions or remarks about what could have been done differently feel free to leave a comment, I appreciate it.

  5. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    July 31, 2014

    Update 2014-8-23: I was testing with ashift for my new NAS. The ashift=9 write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. Also I noticed that resilvering was very slow. This is why I decided to abandon my 24 drive RAIDZ3 configuration.

    I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue further.

    My final setup will be a RAIDZ2 18 drive VDEV + RAIDZ2 6 drive VDEV which will give me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. I've written about 40+ TiB to the array and after those 40 TiB, write performance was about 1.7 GB/s, so still very good and what I would expect as drives fill up.

    So actually, based on these results, I have learned not to deviate from the ZFS best practices too much. Use ashift=12 and put drives in VDEVs that adhere to the 2^n + parity rule.

    The uneven VDEVs (18 disks vs. 6 disks) are not according to best practice, but ZFS is smart: it distributes data across the VDEVs based on their size, so they fill up equally.


    Choosing between ashift=9 and ashift=12 for 4K sector drives is not always a clear cut case. You have to choose between raw performance or storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    Note: NEVER use /dev/sd? drive names for a real array; this is just for testing. Always use /dev/disk/by-id/ names.
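
    To double-check which ashift a newly created pool actually got, you can grep the cached pool configuration with zdb (assuming the pool is present in the default zpool.cache):

    # shows one 'ashift: ...' line per VDEV
    zdb | grep ashift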

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
    

    This is quite impressive. With these speeds, you can saturate 10Gbe ethernet. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage  1.66M  68.4T   435K  /storage
    

    Only 68.4 TiB of storage? That's not good. There should be 24 drives minus 3 for parity, which is 21 x 3.6 TiB = 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    So we lose about a third of our write performance, but the read performance is not affected, probably thanks to read-ahead caching, though I'm not sure.

    With ashift=9, we do lose some write performance, but we can still saturate 10Gbe.

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   271K  73.9T  89.8K  /storage
    

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
