1. 74TB DIY NAS Based on ZFS on Linux

    August 02, 2014

    This is my new 74 TiB DIY NAS. This server is the successor to my six-year-old, twenty-drive 18 TB NAS (17 TiB). With four times the storage capacity of the original and incredible read (2.5 GB/s) and write (1.1 GB/s) performance, it's a worthy successor.

    [photo: the finished ZFS NAS]

    Purpose

    The purpose of this machine is to store backups and media, primarily video.

    The specs

    Part            Description
    Case            Ri-vier RV-4324-01A
    Processor       Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
    RAM             16 GB
    Motherboard     Supermicro X9SCM-F
    LAN             Intel Gigabit (quad-port, bonded)
    PSU             Seasonic Platinum 860
    Controllers     3 x IBM M1015
    Disks           24 x HGST HDS724040ALE640 4 TB (7200 RPM)
    SSDs            2 x Crucial M500 120 GB
    Arrays          Boot: 2 x 60 GB RAID 1; storage: 24 x 4 TB RAIDZ3
    Gross storage   86 TiB (96 TB)
    Net storage     74 TiB (81 TB)
    OS              Debian Linux (Wheezy)
    Filesystem      ZFS
    Rebuild time    Depends on used space
    UPS             Back-UPS RS 1200 LCD, monitored with apcupsd
    Power usage     About 200 Watt idle

    [photos: front views of the chassis]

    CPU

    The Intel Xeon E3-1230 V2 is not the latest generation but one of the cheapest Xeons you can buy and it supports ECC memory. It's a quad-core processor with hyper-threading.

    Here you can see how it performs compared to other processors.

    Memory

    The system has sixteen GB of ECC RAM. Memory is relatively cheap these days, but I don't have any reason to upgrade to thirty-two GB. I think eight GB would have been fine for this system.

    Motherboard

    The server is built around the Supermicro X9SCM-F motherboard.

    This is a server grade motherboard and comes with typical features you might expect from such a board, like ECC memory support and out-of-band management (IPMI).

    [photo: top view of the motherboard]

    This motherboard has four PCIe slots (two x8 and two x4), all with x8 physical connectors. My build requires four PCIe slots of at least x4, and there aren't (m)any other server boards at this price point that offer four slots with x8-sized connectors.

    The chassis

    The chassis has six rows of four drive bays that are kept cool by three 120mm fans in a fan wall behind the drive bays. At the rear of the case, there are two 'powerful' 80mm fans that remove the heat from the case, together with the PSU.

    The chassis has six SAS backplanes that connect four drives each. The backplanes have dual molex power connectors, so you can put redundant power supplies into the chassis. Redundant power supplies are more expensive and due to their size, often have smaller, thus noisier fans. As this is a home build, I opted for just a single regular PSU.

    When facing the front, there is a place at the left side of the chassis to mount a single 3.5 inch or two 2.5 inch drives next to each other as boot drives. I've mounted two SSDs (RAID1).

    This particular chassis version has support for SGPIO, which should help identify which drive has failed. The IBM 1015 cards I use do support SGPIO. Through the LSI megaraid CLI I have verified that SGPIO works, as you can use this tool as a drive locator. I'm not entirely sure how well SGPIO works with ZFS.
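    For example, blinking the locate LED of a specific drive can be done with the MegaCli tool directly. This is just a sketch: the exact binary name (megacli, MegaCli or MegaCli64) and the enclosure:slot numbers below are assumptions that depend on your installation.

    # list physical drives to find enclosure and slot IDs
    megacli -PDList -aALL | egrep 'Enclosure Device ID|Slot Number'
    # start blinking the locate LED of the drive in enclosure 8, slot 4 on adapter 0
    megacli -PdLocate -start -physdrv[8:4] -a0
    # stop blinking
    megacli -PdLocate -stop -physdrv[8:4] -a0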

    Power supply

    I was using a Corsair 860i before, but it was unstable and died on me.

    The Seasonic Platinum 860 may seem like overkill for this system. However, I'm not using staggered spin-up for the twenty-four drives, so they all spin up at once, which results in a peak power usage of 600+ watts.

    The PSU has a silent mode that causes the fan to spin only when the load reaches a certain threshold. Since the PSU fan also helps remove warm air from the chassis, I've disabled this feature, so the fan spins at all times.

    Drive management

    I've written a tool called lsidrivemap that displays each drive in an ASCII table that reflects the physical layout of the chassis.

    The data is based on the output of the LSI 'megacli' tool for my IBM 1015 controllers.

    root@nano:~# lsidrivemap disk
    
    | sdr | sds | sdt | sdq |
    | sdu | sdv | sdx | sdw |
    | sdi | sdl | sdp | sdm |
    | sdj | sdk | sdn | sdo |
    | sdb | sdc | sde | sdf |
    | sda | sdd | sdh | sdg |
    

    This layout is 'hardcoded' for my chassis but the Python script can be easily tailored for your own server, if you're interested.

    It can also show the temperature of the disk drives in the same table:

    root@nano:~# lsidrivemap temp
    
    | 36 | 39 | 40 | 38 |
    | 36 | 36 | 37 | 36 |
    | 35 | 38 | 36 | 36 |
    | 35 | 37 | 36 | 35 |
    | 35 | 36 | 36 | 35 |
    | 34 | 35 | 36 | 35 |
    

    These temperatures show that the top drives run a bit hotter than the other drives. An unverified explanation could be that the three 120mm fans are not in the center of the fan wall. They are skewed to the bottom of the wall, so they may favor the lower drive bays.
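    For a quick manual check outside of lsidrivemap, per-drive temperatures can also be pulled straight from the controller or from SMART. A minimal sketch, assuming the megacli and smartctl binaries are installed:

    # temperature of every physical drive known to the controllers
    megacli -PDList -aALL | grep 'Drive Temperature'
    # or via SMART for a single drive
    smartctl -A /dev/sdb | grep -i temperature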

    Filesystem (ZFS)

    I'm using ZFS as the file system for the storage array. At this moment, there is no other file system that has the same features and stability as ZFS. BTRFS is not even finished.

    The number one design goal of ZFS was assuring data integrity. ZFS checksums all data and if you use RAIDZ or a mirror, it can even repair data. Even if it can't repair a file, it can at least tell you which files are corrupt.

    ZFS is not primarily focused on performance, but to get the best performance possible, it makes heavy use of RAM to cache both reads and writes. This is why ECC memory is so important.

    ZFS also implements RAID. So there is no need to use MDADM. My previous file server was running a RAID 6 of twenty 1TB drives. With twenty-four 4 TB drives, rebuild times will be higher and the risk of an unrecoverable read error will be increased as well, so that's why I'd like the array to survive more than two drive failures.

    The fun thing is that ZFS supports triple-parity RAID, a feature not found in MDADM or many other RAID solutions. With RAIDZ3, the server can lose three drives and the data will still be intact.

    ZFS has a nice feature where it can use fast, low-latency SSD storage as both read and write cache. For my home NAS build, both are entirely unnecessary and would only wear down my SSDs.
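    For reference, this is roughly what adding such cache devices to an existing pool would look like; I'm not doing this in my build, and the device names below are made-up placeholders:

    # add an SSD partition as L2ARC read cache to the pool 'storage'
    zpool add storage cache /dev/disk/by-id/ata-EXAMPLE_SSD1-part2
    # add a mirrored SLOG device pair for synchronous writes
    zpool add storage log mirror /dev/disk/by-id/ata-EXAMPLE_SSD1-part3 /dev/disk/by-id/ata-EXAMPLE_SSD2-part3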

    Capacity

    Vendors still advertise the capacity of their hard drives in TB whereas the operating system works with TiB. So the 4 TB drives I use are in fact 3.64 TiB.

    The total raw storage capacity of the system is about 86 TiB. I've placed the twenty-four drives in a single RAIDZ3 VDEV. This gives me a net capacity of 74 TiB.

    If you are familiar with ZFS, you may realise that twenty-four drives in a single VDEV is not according to best practice. So as not to waste capacity and performance, it is advised to use VDEVs with 2^n (2, 4, 8, 16, 32) data drives. So a RAIDZ3 VDEV should consist of 19 or 35 drives (data drives plus three parity drives).

    Indeed, if I create the VDEV with the default ashift=12, I don't get 74 TiB but only 69 TiB, so a lot of space is lost. This has to do with the fact that my drives are 4K-sector drives instead of the 512-byte-sector drives used by older, often smaller disks.

    This is why I've created my pool with ashift=9 (512 bytes) instead of ashift=12 (4k). Performance is significantly reduced but still excellent and I gain a lot of storage space (5 TiB).

    This is just a file server for home usage, so in this case I think switching back to ashift=9 is reasonable.
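    If you want to double-check which ashift a pool actually uses, zdb can show it. A sketch; this assumes the pool is listed in the zpool cache file:

    # show the ashift value(s) recorded in the pool configuration
    zdb -C storage | grep ashift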

    Earlier versions of ZFS did not support replacing 512-byte-sector drives with 4K-sector drives, but later ZFS versions do, with the -o ashift=9 option:

    root@nano:/storage# zpool replace storage /dev/sda /dev/sdg
    cannot replace /dev/sda with /dev/sdg: devices have different sector alignment
    

    Solved:

    root@nano:/storage# zpool replace -o ashift=9 storage /dev/sda /dev/sdg -f
    root@nano:/storage# zpool status
    

    Storage controllers

    The IBM 1015 HBAs are reasonably priced, and buying three of them is often cheaper than buying just one HBA with a SAS expander.

    I have not flashed the controllers to 'IT mode', as most people do. They worked out-of-the-box as HBAs and although it may take a little bit longer to boot the system, I decided not to go through the hassle.

    The main risk here is how the controller handles a drive if a sector is not properly read. It may disable the drive entirely, which is not necessary for ZFS and often not preferred.
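    A common mitigation, although not something I'm presenting as part of this build, is to limit the drives' internal error recovery timeout (SCT ERC, also known as TLER), so a struggling sector doesn't stall the controller for minutes:

    # show the current SCT error recovery control setting
    smartctl -l scterc /dev/sda
    # limit read/write error recovery to 7 seconds (values are in tenths of a second)
    smartctl -l scterc,70,70 /dev/sda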

    Storage performance

    With twenty-four drives in a chassis, it's interesting to see what kind of performance you can get from the system.

    Let's start with a twenty-four drive RAID 0. The drives I use have a sustained read/write speed of 160 MB/s so it should be possible to reach 3840 MB/s or 3.8 GB/s. That would be amazing.
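    For reference, assembling such a striped array with MDADM looks roughly like this. This is a sketch; the device names, filesystem and mount point are assumptions:

    # create a 24-drive RAID 0 and put a filesystem on it
    mdadm --create /dev/md0 --level=0 --raid-devices=24 /dev/sd[a-x]
    mkfs.xfs /dev/md0
    mount /dev/md0 /storage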

    This is the performance of a RAID 0 (MDADM) of all twenty-four drives.

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=1000000
    1048576000000 bytes (1.0 TB) copied, 397.325 s, 2.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    1048576000000 bytes (1.0 TB) copied, 276.869 s, 3.8 GB/s
    

    Dead on, you would say, but if you divide 1 TB by 276 seconds, it's more like 3.6 GB/s. I would say that's still quite close.

    This machine will be used as a file server and a bit of redundancy would be nice. So what happens if we run the same benchmark on a RAID6 of all drives?

    root@nano:/storage# dd if=/dev/zero of=test.bin bs=1M count=100000
    104857600000 bytes (105 GB) copied, 66.3935 s, 1.6 GB/s
    
    root@nano:/storage# dd if=test.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 38.256 s, 2.7 GB/s
    

    I'm quite pleased with these results, especially for a RAID6. However, RAID6 with twenty-four drives feels a bit risky. So since there is no support for triple-parity RAID in MDADM/Linux, I use ZFS.

    Sacrificing performance, I decided - as I mentioned earlier - to use ashift=9 on those 4K sector drives, because I gained about 5 TiB of storage in exchange.

    This is the performance of twenty-four drives in a RAIDZ3 VDEV with ashift=9.

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    Compared to the other results, write performance is way down, but 1 GB/s isn't too shabby, is it? I'm not complaining. And with this ashift=9 setting, I get 74 TiB of net (actual usable) storage.

    I have not benchmarked random I/O performance as it is not relevant for this system. And with ZFS, the random I/O performance of a VDEV is that of a single drive.

    Boot drives

    I'm using two Crucial M500 120GB drives. They are configured in a RAID1 (MDADM) and I've installed Debian Wheezy on top of them.

    At first, I was planning on using a part of the capacity for caching purposes in combination with ZFS. However, there's no real need to do so. In hindsight, I could also have used two very cheap 2.5" hard drives (similar to my older NAS), which would have cost less than a single M500.

    Networking

    Maybe I will invest in 10Gbit ethernet or infiniband hardware in the future, but for now I settled on a quad-port gigabit adapter. With Linux bonding, I can still get 450+ MB/s data transfers, which is sufficient for my needs.

    The quad-port card is in addition to the two on-board gigabit network cards. I use one of the on-board ports for client access. The four ports on the quad-port card are all in different VLANs and not accessible for client devices.
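    A Debian Wheezy bonding configuration looks roughly like the fragment below. This is a generic sketch, not my exact setup: the interface names, addresses and bonding mode are assumptions, and it requires the ifenslave package (ifenslave-2.6 on Wheezy).

    # /etc/network/interfaces (fragment)
    auto bond0
    iface bond0 inet static
        address 192.168.10.10
        netmask 255.255.255.0
        bond-slaves eth2 eth3 eth4 eth5
        bond-mode balance-rr
        bond-miimon 100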

    The storage will be accessible over NFS and SMB.

    Keeping things cool and quiet

    It's important to keep the drive temperature at acceptable levels, and with twenty-four drives packed together, there is an increased risk of overheating.

    The chassis is well-equipped to keep the drives cool with three 120mm fans and two strong 80mm fans, all supporting PWM (pulse-width modulation).

    The problem is that, by default, the BIOS runs the fans at too low a speed to keep the drives at a reasonable temperature. I'd like to keep the hottest drive at about forty degrees Celsius, but I also want to keep the noise at reasonable levels.

    I wrote a Python script called storagefancontrol that automatically adjusts the fan speed based on the temperature of the hottest drive.
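    The script itself is written in Python; the basic idea can be sketched in a few lines of shell. Everything below (the drive list, the PWM sysfs path, the linear temperature-to-PWM mapping) is an assumption of mine, not the actual storagefancontrol logic:

    #!/bin/bash
    # crude loop: scale a chassis fan PWM value with the hottest drive temperature
    PWM=/sys/class/hwmon/hwmon0/pwm2   # path depends on your board and sensor driver
    # note: the matching pwm2_enable file may need to be set to 1 (manual control) first
    while true; do
        hottest=0
        for d in /dev/sd[a-x]; do
            # raw value of the Temperature_Celsius SMART attribute
            t=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}')
            [ -n "$t" ] && [ "$t" -gt "$hottest" ] && hottest=$t
        done
        # map 30-45 degrees Celsius to PWM 100-255, clamped
        pwm=$(( (hottest - 30) * 10 + 100 ))
        [ "$pwm" -lt 100 ] && pwm=100
        [ "$pwm" -gt 255 ] && pwm=255
        echo "$pwm" > "$PWM"
        sleep 60
    done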

    UPS

    I'm running an HP N40L micro server as my firewall/router. My APC Back-UPS RS 1200 LCD (720 Watt) is connected to this machine over USB. I'm using apcupsd to monitor the UPS and shut down servers if the battery runs low.

    All servers, including my new build, run apcupsd in network mode and talk to the N40L to learn if power is still OK.
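    A sketch of what the apcupsd network-mode configuration looks like (not my literal config files; the IP address is a placeholder for the N40L):

    # /etc/apcupsd/apcupsd.conf on a client (fragment)
    UPSCABLE ether
    UPSTYPE net
    DEVICE 192.168.1.1:3551
    BATTERYLEVEL 10
    MINUTES 5

    # /etc/apcupsd/apcupsd.conf on the machine the UPS is attached to (fragment)
    NETSERVER on
    NISIP 0.0.0.0
    NISPORT 3551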

    Keeping power consumption reasonable

    So these are the power usage numbers.

     85 Watt with disks in spin down.
    200 Watt with disks spinning but idle.
    260 Watt with disks writing.
    

    But the most important stat is that it's using 0 Watt if powered off. The system will be turned on only when necessary through wake-on-lan. It will be powered off most of the time, like when I'm at work or sleeping.
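    Waking the machine remotely only takes standard tools. A sketch; the interface name and MAC address are placeholders:

    # on the NAS: make sure the NIC keeps wake-on-lan enabled (g = magic packet)
    ethtool -s eth0 wol g
    # from another machine on the LAN: send the magic packet
    wakeonlan 00:25:90:aa:bb:cc
    # or: etherwake -i eth0 00:25:90:aa:bb:cc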

    Cost

    The system has cost me about €6000. All costs below are in Euro and include taxes (21%).

    Description      Product                                        Price  Amount  Total
    Chassis          Ri-vier 4U 24-bay storage chassis RV-4324-01A    554       1    554
    CPU              Intel Xeon E3-1230V2                             197       1    197
    Mobo             SuperMicro X9SCM-F                               157       1    157
    RAM              Kingston DDR3 ECC KVR1333D3E9SK2/16G             152       1    152
    PSU              AX860i 80Plus Platinum                           175       1    175
    Network card     NC364T PCI Express Quad Port Gigabit             145       1    145
    HBA controller   IBM SERVERAID M1015                              118       3    354
    SSDs             Crucial M500 120GB                                62       2    124
    Fan              Zalman FB123 casefan bracket + 92mm fan            7       1      7
    Hard drives      Hitachi 3.5" 4TB 7200RPM (0S03356)               166      24   3984
    SAS cables                                                         25       6    150
    Fan cables                                                          6       1      6
    SATA-to-Molex                                                     3.5       1    3.5
    Molex splitter                                                      3       1      3
    Total                                                                           6012

    Closing words

    If you have any questions or remarks about what could have been done differently, feel free to leave a comment. I appreciate it.

  2. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    July 31, 2014

    Choosing between ashift=9 and ashift=12 for 4K-sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    Note: NEVER use /dev/sd? device names for a real array; they are used here just for testing. Always use /dev/disk/by-id/ names.
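    With by-id names, the same command looks roughly like this (the serial-based pattern below is a made-up placeholder; make sure it matches exactly the drives you intend to use):

    zpool create storage -o ashift=12 raidz3 /dev/disk/by-id/ata-HGST_HDS724040ALE640_*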

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
    

    This is quite impressive. With these speeds, you can saturate 10 Gbit Ethernet. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage  1.66M  68.4T   435K  /storage
    

    Only 68.4 TiB of storage? That's not good. There should be 24 drives minus 3 for parity = 21 drives x 3.6 TiB = about 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
    

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
    

    So we lose about a third of our write performance, while the read performance is not affected, probably thanks to read-ahead caching, but I'm not sure.

    With ashift=9, we do lose some write performance, but we can still saturate 10 Gbit Ethernet.

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage
    

    zfs list:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   271K  73.9T  89.8K  /storage
    

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
  3. Achieving 2.3 GB/s With 16 X 4 TB Drives

    July 12, 2014

    I'm in the process of building a new storage server to replace my 18 TB NAS.

    The server is almost finished, it's now down to adding disk drives. I'm using the HGST 4 TB 7200 RPM drive for this build (SKU 0S03356) (review).

    I have not bought all drives at once, but slowly adding them in smaller quantities. I just don't want to feel too much pain in my wallet at once I guess.

    According to my own tests, this drive has a read/write throughput of 160 MB/s, which is in line with its specification.

    So the theoretical performance of a RAID 0 with 16 drives is 16 x 160 MB/s = 2560 MB/s. That's over 2.5 gigabytes per second.

    This is the actual real-life performance I was able to achieve.

    root@nano:/storage# dd if=pureawesomeness.dd of=/dev/null bs=1M
    1000000+0 records in
    1000000+0 records out
    1048576000000 bytes (1.0 TB) copied, 453.155 s, 2.3 GB/s
    

    2.3 GB/s is not too shabby, in my opinion. Please note that I used a test file of one terabyte, so the 16 GB of RAM in my server doesn't skew the result.

    This result is very nice, but in practice almost useless. This system could saturate dual 10 Gbit NICs, but I don't have that kind of equipment, or any other device that could handle such performance.

    But I think it's amazing anyway.

    I'm quite curious how the final 24 drive array will perform in a RAID 0.

    Tagged as : Storage
  4. Affordable Server With Server-Grade Hardware Part II

    June 20, 2014

    If you want to build a home server, it may be advisable to actually use server-grade components. I already documented the reasons for choosing server-grade hardware in an earlier post on this topic.

    It is recommended to read the old post first. In this new post, I only show newer hardware that could be chosen as a more modern option.

    My original post dates back to December 2013 and centers around the popular X9SCM-F which is based on the LGA 1155 socket. Please note that the X9SCM-F / LGA 1155 based solution may be cheaper if you want the Xeon processor.

    So I'd like to introduce two Supermicro motherboards that may be of interest.

    Supermicro X10SLL-F

    Some key features are:

    • 2 x Gigabit NIC on-board
    • 6 onboard SATA ports
    • 3 x PCIe (2 x 8x + 1 x 4x)
    • Costs $169 or €160

    This board is one of the cheapest Supermicro boards you can get, and it has three PCIe slots, which may be of interest if you need to install extra HBAs or RAID cards, SAS expanders and/or network controllers.

    Supermicro X10SL7-F

    This board is about $80 or €90 more expensive than the X10SLL-F, but in return you get eight extra SAS/SATA ports, for a total of 14 SATA ports. With 4 TB drives, this would give you 56 TB of raw storage capacity. This motherboard provides a cheaper solution than an add-on HBA card, which would occupy a PCIe slot. However, there's a caveat: this board has 'only' two PCIe slots. But there's still room for an additional quad-port or 10 Gbit NIC and an extra HBA if required.

    • 2 x Gigabit NIC on-board
    • 6 onboard SATA ports
    • 8 onboard SAS/SATA ports via LSI 2308 chip
    • 2 x PCIe (8x and 4x)
    • Costs $242 or €250

    Overview of CPUs

    CPU                               Passmark score  Price in Euro  Price in Dollars
    Intel Pentium G3420 @ 3.20GHz               3459        55 Euro         74 Dollar
    Intel Core i3-4130 @ 3.40GHz                4827        94 Euro        124 Dollar
    Intel Xeon E3-1230 V3 @ 3.30GHz             9459       216 Euro        279 Dollar

    • Dollar prices are from Newegg, Euro prices are from Tweakers.net.
    • Euro prices include taxes.
    Tagged as : Supermicro Intel ECC
  5. How to Resolve Extreme Memory Usage on Windows 2008 R2-Based File Servers

    June 15, 2014

    I'm responsible for a file server with about 5 terabytes of data. The file server is based on Windows 2008 R2. I've noticed extreme memory usage on the server: after a reboot, it slowly builds up until almost all RAM is consumed.

    So I googled around and found this post, and it turned out I had the exact same issue.

    I've confirmed with the tool 'RAMmap' that NTFS metadata is the issue. Microsoft also created a blog post about this.

    The author of the first article resolved the issue by adding more RAM. But with 16 GB already assigned, I was not too happy to add even more memory to the virtual file server, eating away at the RAM resources of our virtualisation platform.

    I could never find the root cause of the issue. In that case, you need to obtain the 'Microsoft Windows Dynamic Cache Service'. This application allows you to configure how large the metadata cache may grow.

    Please note that this service is not a next-next-finish installation. Follow the included Word document with instructions carefully and configure a sane memory limit for your server. I limited the cache to half the RAM available to the server and this works out well.

    Tagged as : Windows file server
