Articles in the Storage category

  1. ZFS Performance on HP Proliant Microserver Gen8 G1610T

    August 14, 2015

    I think the HP Proliant Microserver Gen8 is a very interesting little box if you want to build your own ZFS-based NAS. The benchmarks I've performed seem to confirm this.

    The Microserver Gen8 has nice features such as:

    • iLO (KVM over IP with dedicated network interface)
    • Support for ECC memory
    • 2 x Gigabit network ports
    • Free PCIe slot (half-height)
    • Small footprint
    • Fairly silent
    • Good build quality

    The Microserver Gen8 can be a better solution than the offerings of - for example - Synology or QNAP, because you can create a more reliable system based on ECC memory and ZFS.

    gen8

    Please note that the G1610T version of the Microserver Gen8 does not ship with a DVD/CD drive as depicted in the image above.

    The Gen8 can be found fairly cheap on the European market at around 240 Euro including taxes. If you add an extra 8 GB of memory on top of the 2 GB that comes installed, you have a total of 10 GB, which is more than enough to support ZFS.

    The Gen8 has room for 4 x 3.5" hard drives, so with today's large disk sizes you can pack quite a bit of storage inside this compact machine.

    gen82

    Net storage capacity:

    This table gives you a quick overview of the net storage capacity you get, depending on the chosen drive size and redundancy.

    Drive size   RAIDZ   RAIDZ2 or Mirror
    3 TB          9 TB    6 TB
    4 TB         12 TB    8 TB
    6 TB         18 TB   12 TB
    8 TB         24 TB   16 TB

    Boot device

    If you want to use all four drive slots for storage, you need to boot this machine from either the fifth internal SATA port, the internal USB 2.0 port or the microSD card slot.

    The fifth SATA port is not bootable if you disable the on-board RAID controller and run in pure AHCI mode. This mode is probably the best mode for ZFS as there seems to be no RAID controller firmware active between the disks and ZFS. However, only the four 3.5" drive bays are bootable.

    The fifth SATA port is bootable if you configure SATA to operate in Legacy mode. This is not recommended as you lose the benefits of AHCI such as hot-swap of disks and there are probably also performance penalties.

    The fifth SATA port is also bootable if you enable the on-board RAID controller, but do not configure any RAID arrays with the drives you plan to use with ZFS (Thanks Mikko Rytilahti). You do need to put the boot drive in a RAID volume in order to be able to boot from the fifth SATA port.

    The unconfigured drives will just be passed as AHCI devices to the OS and thus can be used in your ZFS array. The big question here is what happens if you encounter read errors or other drive problems that ZFS could handle, but would be a reason for the RAID controller to kick a drive off the SATA bus. I have no information on that.

    I myself used an old 2.5" hard drive with a SATA-to-USB converter, which I stuck in the case (use double-sided tape or velcro to mount it to the PSU). Booting from a USB stick is also an option, although a regular 2.5" hard drive or SSD is probably more reliable (no flash wear) and faster.

    Boot performance

    The Microserver Gen8 takes about 1 minute and 50 seconds just to pass the BIOS boot process and start booting the operating system (you will hear a beep).

    Test method and equipment

    I'm running Debian Jessie with the latest stable ZFS-on-Linux 0.6.4. Please note that reportedly FreeNAS also runs perfectly fine on this box.

    I had to run my tests with the disks I had available:

    root@debian:~# show disk -sm
    -----------------------------------
    | Dev | Model              | GB   |   
    -----------------------------------
    | sda | SAMSUNG HD103UJ    | 1000 |   
    | sdb | ST2000DM001-1CH164 | 2000 |   
    | sdc | ST2000DM001-1ER164 | 2000 |   
    | sdd | SAMSUNG HM250HI    | 250  |   
    | sde | ST2000DM001-1ER164 | 2000 |   
    -----------------------------------
    

    The 250 GB drive is a portable disk connected to the internal USB port; it is used as the OS boot device. The other disks, 1 x 1 TB and 3 x 2 TB, are put together in a single RAIDZ pool, which results in 3 TB of net storage (each disk only contributes 1 TB, the size of the smallest drive).

    Tests with 4-disk RAIDZ VDEV

    root@debian:~# zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    testpool  48.8G  2.54T  48.8G  /testpool
    root@debian:~# zpool status
      pool: testpool
     state: ONLINE
      scan: none requested
    config:
    
        NAME                        STATE     READ WRITE CKSUM
        testpool                    ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50000f0008064806  ONLINE       0     0     0
            wwn-0x5000c5006518af8f  ONLINE       0     0     0
            wwn-0x5000c5007cebaf42  ONLINE       0     0     0
            wwn-0x5000c5007ceba5a5  ONLINE       0     0     0
    
    errors: No known data errors
    

    Because a NAS will mostly face data transfers that are sequential in nature, I've done some tests with 'dd' to measure sequential performance.

    Read performance:

    root@debian:~# dd if=/testpool/test.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 162.429 s, 323 MB/s

    Write performance:

    root@debian:~# dd if=/dev/zero of=/testpool/test.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 169.572 s, 309 MB/s

    Test with 3-disk RAIDZ VDEV

    After the previous test I wondered what would happen if I would exclude the older 1 TB disk and create a pool with just the 3 x 2 TB drives. This is the result:

    Read performance:

    root@debian:~# dd if=/testpool/test.bin of=/dev/null bs=1M conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 149.509 s, 351 MB/s

    Write performance:

    root@debian:~# dd if=/dev/zero of=/testpool/test.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 144.832 s, 362 MB/s

    The performance is clearly better, even though there is one disk fewer in the VDEV. I would have liked to test what kind of performance four 2 TB drives would achieve, but I only have three.

    The result does show that the pool is more than capable of sustaining gigabit network transfer speeds.

    This is confirmed when performing actual network file transfers. In the example below, I copy a 50 GB test file from the Gen8 to a test system over NFS. Tests are performed using the 3-disk pool.
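
    For reference, this is roughly what the NFS setup for such a test looks like; the export options, network range and mount point are illustrative, not necessarily my exact configuration:

    # on the Gen8 (NFS server), in /etc/exports:
    /testpool 192.168.1.0/24(rw,no_subtree_check)

    root@debian:~# exportfs -ra

    # on the client:
    root@nano:~# mount -t nfs debian:/testpool /mnt/server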

    NFS read performance:

    root@nano:~# dd if=/mnt/server/test2.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 443.085 s, 118 MB/s
    

    NFS write performance:

    root@nano:~# dd if=/dev/zero of=/mnt/server/test2.bin bs=1M count=50000 conv=sync 
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 453.233 s, 116 MB/s
    

    I think these results are excellent. Tests with the 'cp' command give the same results.

    I've also done some tests with the SMB/CIFS protocol. I used a second Linux box as a CIFS client to connect to the Gen8.

    CIFS read performance:

    root@nano:~# dd if=/mnt/test/test.bin of=/dev/null bs=1M
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 527.778 s, 99.3 MB/s
    

    CIFS write performance:

    root@nano:~# dd if=/dev/zero of=/mnt/test/test3.bin bs=1M count=50000 conv=sync
    50000+0 records in
    50000+0 records out
    52428800000 bytes (52 GB) copied, 448.677 s, 117 MB/s
    

    Hot-swap support

    Although it's even printed on the hard drive caddies that hot-swap is not supported, it does seem to work perfectly fine if you run the SATA controller in AHCI mode.

    Fifth SATA port for SSD SLOG/L2ARC?

    If you buy a converter cable from a floppy power connector to SATA power, you can install an SSD on the fifth SATA port. This SSD can then be used as a dedicated SLOG device and/or L2ARC cache if you have a need for this.
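
    For what it's worth, adding such an SSD to an existing pool is a one-liner per role. A sketch, assuming a pool called 'testpool' and hypothetical SSD partition names:

    # dedicated SLOG (ZIL) device
    zpool add testpool log /dev/disk/by-id/ata-SOME_SSD-part1
    # L2ARC read cache
    zpool add testpool cache /dev/disk/by-id/ata-SOME_SSD-part2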

    RAIDZ, is that OK?

    If you want maximum storage capacity with redundancy, RAIDZ is the only option. RAIDZ2 (RAID6) or two mirrored VDEVs is more reliable, but reduces the available storage space by a third (from three drives' worth to two).

    The main risk of RAIDZ is a double drive failure. With larger drive sizes, a resilver of a VDEV takes quite some time; it can take more than a day before the pool is resilvered, during which you run without redundancy.

    With the low number of drives in the VDEV the risk of a second drive failure may be low enough to be acceptable. That's up to you.

    Noise levels

    In the past, there have been reports about the Gen8 making tons of noise because the rear chassis fan spins at a high RPM if the RAID controller is set to AHCI mode.

    I myself have not encountered this problem. The machine is almost silent.

    Power consumption

    With drives spinning: 50-55 Watt. With drives in standby: 30-35 Watt.

    Conclusion

    I think my benchmarks show that the Microserver Gen8 could be an interesting platform if you want to create your own ZFS-based NAS.

    Please note that since the Gen9 server platform has been out for some time, HP may release a Gen9 version of the Microserver in the near future. However, as of August 2015, there is no information on this and it is not clear whether a successor will be released.

    Tagged as : ZFS microserver
  2. The Sorry State of CoW File Systems

    March 01, 2015

    I'd like to argue that ZFS and BTRFS are both incomplete file systems with their own drawbacks, and that it may still be a long way off before we have something truly great.

    Both ZFS and BTRFS are heroic feats of engineering, created by people who are probably ten times more capable and smarter than me. There is no question about my appreciation for these file systems and what they accomplish.

    Still, as an end-user, I would like to see some features that are often either missing or not complete. Make no mistake, I believe that both ZFS and BTRFS are probably the best file systems we have today. But they can be much better.

    I want to start with a quick overview of why both ZFS and BTRFS are such great file systems and why you should take some interest in them.

    Then I'd like to discuss their individual drawbacks and explain my argument.

    Why ZFS and BTRFS are so great

    Both ZFS and BTRFS are great for two reasons:

    1. They focus on preserving data integrity
    2. They simplify storage management

    Data integrity

    ZFS and BTRFS implement two important techniques that help preserve data.

    1. Data is checksummed and its checksum is verified to guard against bit rot due to broken hard drives or flaky storage controllers. If redundancy is available (RAID), errors can even be corrected.

    2. Copy-on-Write (CoW): existing data is never overwritten in place, so a calamity like sudden power loss cannot leave existing data in an inconsistent state.

    Simplified storage management

    In the old days, we had MDADM or hardware RAID for redundancy, LVM for logical volume management, and on top of that the file system of choice (EXT3/4, XFS, ReiserFS, etc.).

    The main problem with this approach is that the layers are not aware of each other, which makes things inefficient and more difficult to administer. Each layer needs its own attention.

    For example, if you simply want to expand storage capacity, you need to add drives to your RAID array and expand it. Then, you have to alert the LVM layer of the extra storage and as a last step, grow the file system.

    Both ZFS and BTRFS make capacity expansion a simple one-line command that covers all three steps above; a rough sketch of the difference follows below.
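
    To illustrate, this is roughly what both approaches look like on the command line (device, volume group and pool names are made up for the example):

    # legacy stack: every layer needs attention
    mdadm /dev/md0 --add /dev/sde
    mdadm --grow /dev/md0 --raid-devices=5
    pvresize /dev/md0
    lvextend -l +100%FREE /dev/vg0/data
    resize2fs /dev/vg0/data

    # ZFS: one command adds a VDEV and the extra capacity is available immediately
    zpool add tank mirror /dev/sdf /dev/sdg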

    Why are ZFS and BTRFS capable of doing this? Because they incorporate RAID, LVM and the file system in one single integrated solution. Each 'layer' is aware of the others; they are tightly integrated. Because of this integration, rebuilds after a drive failure are often faster than with 'legacy RAID' solutions, since only the actual data needs to be rebuilt, not the entire drive.

    And I'm not even talking about the joy of snapshots here.

    The inflexibility of ZFS

    The storage building block of ZFS is a VDEV. A VDEV is either a single disk (not so interesting) or some RAID scheme, such as mirroring, single-parity (RAIDZ), dual-parity (RAIDZ2) and even triple-parity (RAIDZ3).

    To me, a big downside of ZFS is that you cannot expand a VDEV by simply adding drives. The only way to grow a VDEV is quite convoluted: you have to replace all of the existing drives, one by one, with bigger ones, and let the VDEV resilver each time you replace a drive. Only when all drives are of the higher capacity can you expand the VDEV. This is quite impractical and time-consuming, if you ask me.
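
    For completeness, this is roughly what that replace-and-grow procedure looks like (pool and device names are hypothetical):

    zpool set autoexpand=on tank
    # repeat for every drive in the VDEV, waiting for each resilver to finish:
    zpool replace tank /dev/disk/by-id/old-2tb-drive /dev/disk/by-id/new-6tb-drive
    zpool status tank   # wait until the resilver completes before the next replace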

    ZFS expects you just to add extra VDEVS. So if you start with a single 6-drive RAIDZ2 (RAID6), you are expected to add another 6-drive RAIDZ2 if you want to expand capacity.

    What I would want to do is just add one or two more drives and grow the VDEV, as has been possible with many hardware RAID solutions and with "mdadm --grow" for ages.

    Why do I prefer this over adding VDEVs? Because it's quite evident that this is way more economical. If I can just expand my RAIDZ2 from 6 drives to 12 drives, I only sacrifice two drives for parity. If I instead end up with two 6-drive RAIDZ2 VDEVs, I sacrifice four drives (16% vs 33% capacity loss).

    I can imagine that in the enterprise world, this is just not that big of a deal: a bunch of drives is a rounding error on the total budget, and availability and performance are more important. Still, I'd like to have this option.

    Either you are forced to buy up front the storage you expect to need in the future, or you must add it later on, sacrificing drives to parity that you would otherwise not have lost.

    Maybe my wish for a 'zpool grow' option is more geared towards hobbyist or home usage of ZFS, and ZFS was always focused on enterprise needs, not the needs of hobbyists. So I'm aware of the context here.

    I'm not done with ZFS, however, because there is another great inflexibility. If you don't put the 'right' number of drives in a VDEV, you may lose significant portions of storage as a side effect of how ZFS allocates data.

    The following ZFS pool configurations are optimal for modern 4K sector harddrives:
    RAID-Z: 3, 5, 9, 17, 33 drives
    RAID-Z2: 4, 6, 10, 18, 34 drives
    RAID-Z3: 5, 7, 11, 19, 35 drives
    

    I've seen first-hand with my 71 TiB NAS that if you don't use the optimal number of drives in a VDEV, you may lose whole drives' worth of net storage capacity. In that regard, my 24-drive chassis is very suboptimal.

    The sad state of RAID on BTRFS

    BTRFS has none of the downsides of ZFS described in the previous section, as far as I'm aware. It has plenty of its own, though. First of all: BTRFS is still not stable, and especially the RAID 5/6 part is unstable.

    The RAID 5 and RAID 6 implementations are so new that the ink they were written with is still wet (February 8th, 2015). Not something you want to trust your important data to, I suppose.

    I did set up a test environment to play a bit with this new Linux kernel (3.19.0) and BTRFS to see how it works, and although it is not production-ready yet, I really like what I see.

    With BTRFS you can just add drives to or remove drives from a RAID6 array as you see fit. Add two? Remove three? Whatever; the only thing you have to wait for is BTRFS rebalancing the data over the new or remaining drives.

    This is friggin' awesome.

    If you want to remove a drive, just wait for BTRFS to copy the data from that drive to the remaining drives and then remove it. You want to expand storage? Just add the drives to your storage pool and have BTRFS rebalance the data (which may take a while, but it works). A rough sketch of these commands follows below.
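
    This is roughly what growing and shrinking look like with the btrfs tool (mount point and device names are made up):

    # grow: add two drives and spread the existing data over them
    btrfs device add /dev/sdg /dev/sdh /mnt/storage
    btrfs balance start /mnt/storage

    # shrink: BTRFS migrates the data off the drive before releasing it
    btrfs device delete /dev/sdg /mnt/storage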

    But I'm still a bit sad, because BTRFS does not support anything beyond RAID6. No multiple RAID6 (RAID60) arrays or triple-parity, which ZFS has supported for ages. As with my 24-drive file server, putting 24 drives in a single RAID6 starts to feel like asking for trouble. Triple-parity or RAID60 would probably be more reasonable. But no luck with BTRFS.

    However, what really frustrates me is this article by Ronny Egner. The author of SnapRAID, Andrea Mazzoleni, has written a functional patch for BTRFS that implements not only triple-parity RAID, but up to six parity disks for a volume.

    The maddening thing is that the BTRFS maintainers are not planning to include this patch in the BTRFS code base. Please read Ronny's blog. The people working on BTRFS work for enterprises that want enterprise features. They don't care about triple-parity or features like that, because they have access to something presumably better: distributed file systems, which may do away with the need for large disk arrays and thus for triple-parity.

    BTRFS has been in development for a very long time and only recently has RAID 5/6 support been introduced. The risk of the write hole, something ZFS addressed ages ago, is still an open issue. Considering all of this, BTRFS is still a very long way off from being the file system of choice for larger storage arrays.

    BTRFS seems to be way more flexible in terms of storage expansion and shrinking, but its slow pace of development makes it unusable for anything serious for at least the next year, I guess.

    Conclusion

    BTRFS addresses all the inflexibilities of ZFS, but its immaturity and lack of more advanced RAID schemes make it unusable for larger storage solutions. This is sad, because by design it seems to be the better, way more flexible option compared to ZFS.

    I do understand the view of the BTRFS developers. With enterprise data sets, at scale, it's better to handle storage and redundancy with distributed file systems than at the level of a single machine. But that kind of environment is out of reach for many.

    So at the moment, compared to BTRFS, ZFS is still the better option for people who want to setup large, reliable storage arrays.

    Tagged as : ZFS BTRFS
  3. Configuring SCST iSCSI Target on Debian Linux (Wheezy)

    February 01, 2015

    My goal is to export ZFS zvol volumes through iSCSI to other machines. The platform I'm using is Debian Wheezy.
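
    For context: a zvol is simply a block device backed by a ZFS dataset. A minimal sketch of creating one (pool and volume names are just examples):

    zfs create -V 100G tank/iscsivol01
    ls -l /dev/zvol/tank/iscsivol01

    The resulting device node is what you would point the 'filename' directive in the SCST configuration at, instead of a raw disk like /dev/sdb.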

    There are three iSCSI target solutions available for Linux:

    1. LIO
    2. IET
    3. SCST

    I've briefly played with LIO but the targetcli tool is interactive only. If you want to automate and use scripts, you need to learn the Python API. I wonder what's wrong with a plain old text-based configuration file.

    iscsitarget or IET is broken on Debian Wheezy. If you just 'apt-get install iscsitarget', the iSCSI service will crash as soon as you connect to it. This has been the case for years, and I wonder why they don't just drop this package. It is true that you can manually download the "latest" version of IET, but don't bother: it seems abandoned. The latest release dates from 2010.

    SCST, on the other hand, is at least maintained and uses plain old text-based configuration files. So it has that going for it, which is nice. SCST does not require kernel patches to run, but a patch regarding "CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION" is said to improve performance:

    To use full power of TCP zero-copy transmit functions, especially
    dealing with user space supplied via scst_user module memory, iSCSI-SCST
    needs to be notified when Linux networking finished data transmission.
    For that you should enable CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION
    kernel config option. This is highly recommended, but not required.
    Basically, iSCSI-SCST works fine with an unpatched Linux kernel with the
    same or better speed as other open source iSCSI targets, including IET,
    but if you want even better performance you have to patch and rebuild
    the kernel.
    

    So in general, patching your kernel is not required, but an example is given below anyway.

    Getting the source

    cd /usr/src
    

    We need the following files:

    wget http://heanet.dl.sourceforge.net/project/scst/scst/scst-3.0.0.tar.bz2
    wget http://heanet.dl.sourceforge.net/project/scst/iscsi-scst/iscsi-scst-3.0.0.tar.bz2
    wget http://heanet.dl.sourceforge.net/project/scst/scstadmin/scstadmin-3.0.0.tar.bz2
    

    We extract them with:

    tar xjf scst-3.0.0.tar.bz2
    tar xjf iscsi-scst-3.0.0.tar.bz2
    tar xjf scstadmin-3.0.0.tar.bz2
    

    Patching the kernel

    You can skip this part if you don't feel like you need or want to patch your kernel.

    apt-get install linux-source kernel-package
    

    We need to extract the kernel source:

    cd /usr/src
    tar xjf linux-source-3.2.tar.bz2
    cd linux-source-3.2
    

    Now we first copy the kernel configuration from the current system:

    cp /boot/config-3.2.0-4-amd64 .config
    

    We patch the kernel with two patches:

    patch -p1 < /usr/src/scst-3.0.0/kernel/scst_exec_req_fifo-3.2.patch
    patch -p1 < /usr/src/iscsi-scst-3.0.0/kernel/patches/put_page_callback-3.2.57.patch
    

    It seems that for many different kernel versions, separate patches can be found in the above paths. If you follow these steps at a later date, please check the version numbers.

    The patches are based on stock kernels from kernel.org. I've applied the patches against the Debian-patched kernel and faced no problems, but your mileage may vary.

    Let's build the kernel (will take a while):

    yes | make-kpkg -j $(nproc) --initrd --revision=1.0.custom.scst kernel_image
    

    The 'yes' is piped into the make-kpkg command to answer some questions with 'yes' during compilation. You could also add the appropriate value in the .config file.

    The end-result of this command is a kernel package in .deb format in /usr/src. Install it like this:

    dpkg -i /usr/src/<custom kernel image>.deb
    

    Now reboot into the new kernel:

    reboot
    

    Compiling SCST, ISCSI-SCST and SCSTADMIN

    cd /usr/src/scst-3.0.0
    make install
    
    cd /usr/src/iscsi-scst-3.0.0
    make install
    
    cd /usr/src/scstadmin-3.0.0
    make install
    

    Make SCST start at boot

    On Debian Jessie:

    systemctl enable scst.service
    

    Configure SCST

    Copy the example configuration file to /etc:

    cp /usr/src/iscsi-scst-3.0.0/etc/scst.conf /etc
    

    Edit /etc/scst.conf to your liking. This is an example:

    HANDLER vdisk_fileio {
            DEVICE disk01 {
                    filename /dev/sdb
                    nv_cache 1
            }
    }
    
    TARGET_DRIVER iscsi {
            enabled 1
    
            TARGET iqn.2015-10.net.vlnb:tgt {
                    IncomingUser "someuser somepasswordof12+chars"
                    HeaderDigest   "CRC32C,None"
                    DataDigest   "CRC32C,None"
                    LUN 0 disk01
    
                    enabled 1
            }
    }
    

    Please note that the password must be at least 12 characters.

    After this, you can start the SCST module and connect your initiator to the appropriate LUN.

    /etc/init.d/scst start
    
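    As a quick check from the initiator side, this is roughly how you would discover and log in to the target with open-iscsi (the IP address is a placeholder, and the CHAP credentials match the example configuration above):

    apt-get install open-iscsi

    # configure the CHAP credentials from scst.conf in /etc/iscsi/iscsid.conf:
    #   node.session.auth.authmethod = CHAP
    #   node.session.auth.username = someuser
    #   node.session.auth.password = somepasswordof12+chars

    iscsiadm -m discovery -t sendtargets -p 192.168.1.10
    iscsiadm -m node --targetname iqn.2015-10.net.vlnb:tgt --login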

    Closing words

    It turned out that setting up SCST and compiling a kernel wasn't that much of a hassle. The main issue with patching kernels is that you have to repeat the procedure every time a new kernel version is released. And there is always a risk that a new kernel version breaks the SCST patches.

    However, the whole process can be easily automated and thus run as a test in a virtual environment.

    Tagged as : iSCSI SCST
  4. 71 TiB DIY NAS Based on ZFS on Linux

    August 02, 2014

    This is my new 71 TiB DIY NAS. This server is the successor to my six-year-old, twenty-drive 18 TB NAS (17 TiB). With four times the storage capacity of the original and incredible read (2.5 GB/s) and write (1.9 GB/s) performance, it's a worthy successor.

    zfs nas

    Purpose

    The purpose of this machine is to store backups and media, primarily video.

    The specs

    Part           Description
    Case           Ri-vier RV-4324-01A
    Processor      Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
    RAM            16 GB
    Motherboard    Supermicro X9SCM-F
    LAN            Intel Gigabit (Quad-port) (Bonding)
    PSU            Seasonic Platinum 860
    Controller     3 x IBM M1015
    Disk           24 x HGST HDS724040ALE640 4 TB (7200RPM)
    SSD            2 x Crucial M500 120GB
    Arrays         Boot: 2 x 60 GB RAID1; storage: 18-disk RAIDZ2 + 6-disk RAIDZ2
    Gross storage  86 TiB (96 TB)
    Net storage    71 TiB (78 TB)
    OS             Linux Debian Wheezy
    Filesystem     ZFS
    Rebuild time   Depends on amount of data (rate is 4 TB/hour)
    UPS            Back-UPS RS 1200 LCD using Apcupsd
    Power usage    About 200 Watt idle


    CPU

    The Intel Xeon E3-1230 V2 is not the latest generation but one of the cheapest Xeons you can buy and it supports ECC memory. It's a quad-core processor with hyper-threading.

    Here you can see how it performs compared to other processors.

    Memory

    The system has 16 GB ECC RAM. Memory is relatively cheap these days but I don't have any reason to upgrade to 32 GB. I think that 8 GB would have been fine with this system.

    Motherboard

    The server is built around the SuperMicro X9SCM-F motherboard.

    This is a server-grade motherboard and comes with typical features you might expect, like ECC memory support and out-of-band management (IPMI).

    smboard top view

    This motherboard has four PCIe slots (2 x 8x and 2 x 4x), all with 8x physical connectors. My build requires four PCIe 4x or faster slots, and there aren't (m)any other server boards at this price point that offer four PCIe slots with 8x-sized connectors.

    The chassis

    The chassis has six rows of four drive bays that are kept cool by three 120mm fans in a fan wall behind the drive bays. At the rear of the case, there are two 'powerful' 80mm fans that remove the heat from the case, together with the PSU.

    The chassis has six SAS backplanes that connect four drives each. The backplanes have dual molex power connectors, so you can put redundant power supplies into the chassis. Redundant power supplies are more expensive and due to their size, often have smaller, thus noisier fans. As this is a home build, I opted for just a single regular PSU.

    When facing the front, there is a place at the left side of the chassis to mount a single 3.5 inch or two 2.5 inch drives next to each other as boot drives. I've mounted two SSDs (RAID1).

    This particular chassis version has support for SGPIO, which should help identify which drive has failed. The IBM M1015 cards I use do support SGPIO. Through the LSI MegaCLI tool I have verified that SGPIO works, as you can use this tool as a drive locator. I'm not entirely sure how well SGPIO works with ZFS.
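
    If memory serves, locating a drive through MegaCLI looks roughly like this; the enclosure and slot numbers are of course specific to your own setup:

    # blink the locate LED of the drive in enclosure 252, slot 3 on adapter 0
    MegaCli64 -PdLocate -start -physdrv[252:3] -a0
    MegaCli64 -PdLocate -stop  -physdrv[252:3] -a0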

    Power supply

    I was using a Corsair 860i before, but it was unstable and died on me.

    The Seasonic Platinum 860 may seem like overkill for this system. However, I'm not using staggered spin-up for the 24 drives, so they all spin up at once, which results in a peak power usage of 600+ Watt.

    The PSU has a silent mode that causes the fan only to spin if the load reaches a certain threshold. Since the PSU fan also helps removing warm air from the chassis, I've disabled this feature, so the fan is spinning at all times.

    Drive management

    I've written a tool called lsidrivemap that displays each drive in an ASCII table that reflects the physical layout of the chassis.

    The data is based on the output of the LSI 'megacli' tool for my IBM 1015 controllers.

    root@nano:~# lsidrivemap disk
    
    | sdr | sds | sdt | sdq |
    | sdu | sdv | sdx | sdw |
    | sdi | sdl | sdp | sdm |
    | sdj | sdk | sdn | sdo |
    | sdb | sdc | sde | sdf |
    | sda | sdd | sdh | sdg |
    

    This layout is 'hardcoded' for my chassis but the Python script can be easily tailored for your own server, if you're interested.

    It can also show the temperature of the disk drives in the same table:

    root@nano:~# lsidrivemap temp
    
    | 36 | 39 | 40 | 38 |
    | 36 | 36 | 37 | 36 |
    | 35 | 38 | 36 | 36 |
    | 35 | 37 | 36 | 35 |
    | 35 | 36 | 36 | 35 |
    | 34 | 35 | 36 | 35 |
    

    These temperatures show that the top drives run a bit hotter than the other drives. An unverified explanation could be that the three 120mm fans are not in the center of the fan wall. They are skewed to the bottom of the wall, so they may favor the lower drive bays.

    Filesystem (ZFS)

    I'm using ZFS as the file system for the storage array. At this moment, there is no other file system that has the same features and stability as ZFS. BTRFS is not even finished.

    The number one design goal of ZFS was assuring data integrity. ZFS checksums all data and if you use RAIDZ or a mirror, it can even repair data. Even if it can't repair a file, it can at least tell you which files are corrupt.

    ZFS is not primarily focused on performance, but to get the best performance possible, it makes heavy use of RAM to cache both reads and writes. This is why ECC memory is so important.

    ZFS also implements RAID. So there is no need to use MDADM. My previous file server was running a single RAID 6 of 20 x 1TB drives. With this new system I've created a single pool with two RAIDZ2 VDEVs.

    Capacity

    Vendors still advertise the capacity of their hard drives in TB whereas the operating system works with TiB. So the 4 TB drives I use are in fact 3.64 TiB.
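
    The conversion is simple (1 TiB = 1024^4 bytes):

    4 TB = 4,000,000,000,000 bytes
    4,000,000,000,000 / 1024^4 ≈ 3.64 TiB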

    The total raw storage capacity of the system is about 86 TiB. Originally, I planned to put all 24 drives in a single RAIDZ3 VDEV, which would have given me a net capacity of 74 TiB.

    My zpool now uses the appropriate number of disks (2^n + parity) in each VDEV: one 18-disk RAIDZ2 VDEV (2^4 + 2) and one 6-disk RAIDZ2 VDEV (2^2 + 2), for a total of 24 drives.

    Different VDEV sizes in a single pool are often not recommended, but ZFS is very smart and cool: it load-balances the data across the VDEVs based on the size of the VDEV. I could verify this with zpool iostat -v 5 and witness this in real-time. The small VDEV got just a fraction of the data compared to the large VDEV.

    This choice leaves me with less capacity (71 TiB vs. 74 TiB for RAIDZ3) and also has a bit more risk to it, with the eighteen-disk RAIDZ2 VDEV. Regarding the latter risk: I've been running a twenty-disk MDADM RAID6 for the last six years and haven't seen any issues. That does not tell everything, but I'm comfortable with this risk.

    Originally I was planning on using RAIDZ3 and, by using ashift=9 (512-byte sectors), I would recover most of the space lost to the non-optimal number of drives in the VDEV. So why did I change my mind? Because the performance of my ashift=9 pool on 4K drives deteriorated so much that a resilver of a failed drive would take ages.


    Storage controllers

    The IBM M1015 HBAs are reasonably priced, and buying three of them is often cheaper than buying just one HBA with a SAS expander. However, it may be cheaper to search for an HP SAS expander and use it with just one M1015 and save a PCIe slot.

    I have not flashed the controllers to 'IT mode', as most people do. They worked out-of-the-box as HBAs and although it may take a little bit longer to boot the system, I decided not to go through the hassle.

    The main risk here is how the controller handles a drive if a sector is not properly read. It may disable the drive entirely, which is not necessary for ZFS and often not preferred.

    Storage performance

    With twenty-four drives in a chassis, it's interesting to see what kind of performance you can get from the system.

    Let's start with a twenty-four drive RAID 0. The drives I use have a sustained read/write speed of 160 MB/s so it should be possible to reach 3840 MB/s or 3.8 GB/s. That would be amazing.

    This is the performance of a RAID 0 (MDADM) of all twenty-four drives.

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000
    1048576000000 bytes (1.0 TB) copied, 397.325 s, 2.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    1048576000000 bytes (1.0 TB) copied, 276.869 s, 3.8 GB/s
    

    Dead on, you would say, but if you divide 1 TB by 276 seconds, it's more like 3.6 GB/s. I would say that's still quite close.

    This machine will be used as a file server and a bit of redundancy would be nice. So what happens if we run the same benchmark on a RAID6 of all drives?

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=100000
    104857600000 bytes (105 GB) copied, 66.3935 s, 1.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 38.256 s, 2.7 GB/s
    

    I'm quite pleased with these results, especially for a RAID6. However, RAID6 with twenty-four drives feels a bit risky. So since there is no support for a three-parity disk RAID in MDADM/Linux, I use ZFS.

    Sacrificing performance, I initially decided - as I mentioned earlier - to use ashift=9 on these 4K-sector drives, because I gained about 5 TiB of storage in exchange.

    This is the performance of twenty-four drives in a RAIDZ3 VDEV with ashift=9.

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    Compared to the other results, write performance is way down, although not too bad.

    This is the write performance of the 18 disk RAIDZ2 + 6 disk RAIDZ2 zpool (ashift=12):

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000 
    1048576000000 bytes (1.0 TB) copied, 543.072 s, 1.9 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M 
    1048576000000 bytes (1.0 TB) copied, 400.539 s, 2.6 GB/s
    

    As you may notice, the write performance is better than that of either the ashift=9 or the ashift=12 RAIDZ3 VDEV.


    I have not benchmarked random I/O performance as it is not relevant for this system. And with ZFS, the random I/O performance of a VDEV is that of a single drive.

    Boot drives

    I'm using two Crucial M500 120GB SSD drives. They are configured in a RAID1 (MDADM) and I've installed Debian Wheezy on top of them.

    At first, I was planning on using part of the capacity for caching purposes in combination with ZFS. However, there's no real need to do so. In hindsight, I could also have used two very cheap 2.5" hard drives (similar to my older NAS), which would have cost less than a single M500.

    Update 2014-09-01: I actually reinstalled Debian and kept about 50% free space on both M500s and put this space in a partition. These partitions have been provided to the ZFS pool as L2ARC cache. I did this because I could, but on the other hand, I wonder if I'm really just wearing out my SSDs faster.

    Update 2015-10-04: I saw no reason why I would wear out my SSDs as a L2ARC so I removed them from my pool. There is absolutely no benefit in my case.

    Networking

    Maybe I will invest in 10Gbit ethernet or infiniband hardware in the future, but for now I settled on a quad-port gigabit adapter. With Linux bonding, I can still get 450+ MB/s data transfers, which is sufficient for my needs.
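
    For reference, a bonding setup on Debian boils down to a stanza like this in /etc/network/interfaces; interface names, addresses and the bonding mode are just an example, not necessarily my exact configuration, and the ifenslave package is required:

    auto bond0
    iface bond0 inet static
        address 10.0.10.10
        netmask 255.255.255.0
        bond-slaves eth2 eth3 eth4 eth5
        bond-mode balance-rr
        bond-miimon 100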

    The quad-port card is in addition to the two on-board gigabit network cards. I use one of the on-board ports for client access. The four ports on the quad-port card are all in different VLANs and not accessible for client devices.

    The storage will be accessible over NFS and SMB.

    Keeping things cool and quiet

    It's important to keep the drive temperature at acceptable levels, and with 24 drives packed together, there is an increased risk of overheating.

    The chassis is well-equipped to keep the drives cool with three 120mm fans and two strong 80mm fans, all supporting PWM (pulse-width modulation).

    The problem is that by default, the BIOS runs the fans at too low a speed to keep the drives at a reasonable temperature. I'd like to keep the hottest drive at about forty degrees Celsius, but I also want to keep the noise at reasonable levels.

    I wrote a Python script called storagefancontrol that automatically adjusts the fan speed based on the temperature of the hottest drive.
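
    The actual script is more elaborate, but the core idea can be sketched in a few lines of shell; the PWM sysfs path is hypothetical and the temperature-to-speed mapping is simplified:

    #!/bin/bash
    # find the hottest drive temperature via SMART attribute 194 (Temperature_Celsius)
    HOTTEST=$(for d in /dev/sd?; do
        smartctl -A "$d" | awk '$1 == 194 {print $10}'
    done | sort -n | tail -1)

    # map temperature to a PWM value (0-255) and write it to the fan controller
    # (the pwm*_enable knob usually has to be set to manual mode first)
    PWM=/sys/class/hwmon/hwmon0/pwm2   # hypothetical path, differs per board
    if   [ "$HOTTEST" -ge 45 ]; then SPEED=255
    elif [ "$HOTTEST" -le 35 ]; then SPEED=100
    else SPEED=$(( 100 + (HOTTEST - 35) * 15 ))
    fi
    echo "$SPEED" > "$PWM"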

    UPS

    I'm running an HP N40L microserver as my firewall/router. My APC Back-UPS RS 1200 LCD (720 Watt) is connected with USB to this machine. I'm using apcupsd to monitor the UPS and shut down servers if the battery runs low.

    All servers, including my new build, run apcupsd in network mode and talk to the N40L to learn if power is still OK.
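
    The relevant apcupsd configuration is small; roughly the following, with example addresses, on the N40L that has the UPS attached via USB and on the clients that only listen over the network:

    # /etc/apcupsd/apcupsd.conf on the N40L (USB-attached UPS)
    UPSCABLE usb
    UPSTYPE usb
    DEVICE
    NETSERVER on
    NISIP 0.0.0.0
    NISPORT 3551

    # /etc/apcupsd/apcupsd.conf on the other servers (network clients)
    UPSCABLE ether
    UPSTYPE net
    DEVICE 192.168.1.2:3551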

    Keeping power consumption reasonable

    So these are the power usage numbers.

     96 Watt with disks in spin down.
    176 Watt with disks spinning but idle.
    253 Watt with disks writing.
    

    Edit 2015-10-04: I have an unresolved issue where the drives keep spinning up even with all services on the box killed, including cron. So the system is now configured with the drives always spinning. /end edit

    But the most important stat is that it uses 0 Watt when powered off. The system is turned on only when necessary, through wake-on-LAN. It is powered off most of the time, like when I'm at work or sleeping.
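
    Waking the machine is a matter of enabling wake-on-LAN on the NIC and sending a magic packet from another host; a sketch with a made-up MAC address:

    # on the NAS: make sure WOL is enabled on the on-board NIC
    ethtool -s eth0 wol g

    # from another machine on the same LAN (Debian package 'wakeonlan')
    wakeonlan 00:25:90:aa:bb:cc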

    Cost

    The system has cost me about €6000. All costs below are in Euro and include taxes (21%).

    Description     Product                                         Price  Amount  Total
    Chassis         Ri-vier 4U 24-bay storage chassis RV-4324-01A     554       1    554
    CPU             Intel Xeon E3-1230V2                              197       1    197
    Mobo            SuperMicro X9SCM-F                                157       1    157
    RAM             Kingston DDR3 ECC KVR1333D3E9SK2/16G              152       1    152
    PSU             AX860i 80Plus Platinum                            175       1    175
    Network card    NC364T PCI Express Quad Port Gigabit              145       1    145
    HBA controller  IBM ServeRAID M1015                               118       3    354
    SSDs            Crucial M500 120GB                                 62       2    124
    Fan             Zalman FB123 casefan bracket + 92mm fan             7       1      7
    Hard drive      Hitachi 3.5" 4TB 7200RPM (0S03356)                166      24   3984
    SAS cables                                                         25       6    150
    Fan cables                                                          6       1      6
    SATA-to-Molex                                                    3.50       1   3.50
    Molex splitter                                                      3       1      3
    Total                                                                          6012

    Closing words

    If you have any questions or remarks about what could have been done differently, feel free to leave a comment. I appreciate it.

  5. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    July 31, 2014

    Update 2014-8-23: I was testing ashift settings for my new NAS. The ashift=9 write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. I also noticed that resilvering was very slow. This is why I decided to abandon my 24-drive RAIDZ3 configuration.

    I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue.

    My final setup will be an 18-drive RAIDZ2 VDEV + a 6-drive RAIDZ2 VDEV, which will give me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. I've written 40+ TiB to the array, and after those 40 TiB, write performance was about 1.7 GB/s, so still very good and what I would expect as drives fill up.

    So, based on these results, I have learned not to deviate too much from the ZFS best practices. Use ashift=12 and put drives in VDEVs that adhere to the 2^n + parity rule.

    The uneven VDEVs (18 disks vs. 6 disks) are not according to best practice, but ZFS is smart: it distributes data across the VDEVs based on their size, so they fill up equally.


    Choosing between ashift=9 and ashift=12 for 4K-sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    Note: NEVER use /dev/sd? device names for a real array; this is just for testing. Always use /dev/disk/by-id/ names.

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
    

    This is quite impressive. With these speeds, you can saturate 10Gbe ethernet. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage  1.66M  68.4T   435K  /storage
    

    Only 68.4 TiB of storage? That's not good. There should be 24 drives minus 3 for parity, which leaves 21 x 3.6 TiB ≈ 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    So we lose about a third of our write performance, but the read performance is not affected, probably thanks to read-ahead caching, although I'm not sure.

    With ashift=9, we do lose some write performance, but we can still saturate 10Gbe.

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   271K  73.9T  89.8K  /storage
    

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
