1. Using InfiniBand for Cheap and Fast Point-To-Point Networking

    March 25, 2017

    InfiniBand networking is quite awesome. It's mainly used for two reasons:

    1. low latency
    2. high bandwidth

    As a home user, I'm mainly interested in setting up a high bandwidth link between two servers.

    I was using quad-port network cards with Linux Bonding, but this solution has some downsides:

    1. you can only go to 4 Gbit with Linux bonding (or you need more ports)
    2. you need a lot of cabling
    3. it is similar in price as InfiniBand

    So I've decided to take a gamble on some InfiniBand gear. You only need InfiniBand PCIe network cards and a cable.

    1 x SFF-8470 CX4 cable                                              $16
                                                                Total:  $66

    view of installed infiniband card and cable

    I find $66 quite cheap for 20 Gbit networking. Regular 10Gbit Ethernet networking is often still more expensive that using older InfiniBand cards.

    InfiniBand is similar to Ethernet, you can run your own protocol over it (for lower latency) but you can use IP over InfiniBand. The InfiniBand card will just show up as a regular network device (one per port).

    ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::202:c902:29:8e01/64 Scope:Link
          RX packets:7988691 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17853128 errors:0 dropped:10 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:590717840 (563.3 MiB)  TX bytes:1074521257501 (1000.7 GiB)


    I've followed these instructions to get IP over InfiniBand working.


    First, you need to assure the following modules are loaded at a minimum:


    I only had to add the ib_ipoib module to /etc/modules. As soon as this module is loaded, you will notice you have some ibX interfaces available which can be configured like regular ethernet cards

    Subnet manager

    In addition to loading the modules, you also may need a subnet manager but this seems only relevant if you have an InfiniBand switch. Such switches either have a build-in subnet manager or you can just install and use 'opensm'

    Link status

    if you want you can check the link status of your InfiniBand connection like this:

    # ibstat
    CA 'mthca0'
        CA type: MT25208
        Number of ports: 2
        Firmware version: 5.3.0
        Hardware version: 20
        Node GUID: 0x0002c90200298e00
        System image GUID: 0x0002c90200298e03
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 20
            Base lid: 1
            LMC: 0
            SM lid: 2
            Capability mask: 0x02510a68
            Port GUID: 0x0002c90200298e01
            Link layer: InfiniBand
        Port 2:
            State: Down
            Physical state: Polling
            Rate: 10
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x02510a68
            Port GUID: 0x0002c90200298e02
            Link layer: InfiniBand

    Set mode and MTU

    Since my systems run Debian Linux, I've configured /etc/network/interfaces like this:

    auto ib0
    iface ib0 inet static
        mtu 65520
        pre-up echo connected > /sys/class/net/ib0/mode

    Please take note of the 'mode' setting. The 'datagram' mode gave abysmal network performance (< Gigabit). The 'connected' mode made everything perform acceptable.

    The MTU setting of 65520 improved performance by another 30 percent.


    I've tested the card on two systems based on the Supermicro X9SCM-F motherboard. Using these systems, I was able to achieve file transfer speeds up to 750 MB (Megabytes) per second or about 6.5 Gbit as measured with iperf.

    ~# iperf -c
    Client connecting to, TCP port 5001
    TCP window size: 2.50 MByte (default)
    [  3] local port 40098 connected with port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec  7.49 GBytes  6.43 Gbits/sec

    Similar test with netcat and dd:

    ~# dd if=/dev/zero bs=1M count=100000 | nc 1234
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 128.882 s, 814 MB/s

    Testing was done on Debian Jessie.

    During earlier testing, I've also used these cards in HP Micro Proliant G8 servers. On those servers, I was running Ubuntu 16.04 LTS.

    As tested on Ubuntu with the HP Microserver:

    Client connecting to, TCP port 5001
    TCP window size: 4.00 MByte (default)
    [  5] local port 52572 connected with port 5001
    [  4] local port 5001 connected with port 44124
    [ ID] Interval       Transfer     Bandwidth
    [  5]  0.0-60.0 sec  71.9 GBytes  10.3 Gbits/sec
    [  4]  0.0-60.0 sec  72.2 GBytes  10.3 Gbits/sec

    Using these systems, I was able eventually able to achieve 15 Gbit as measured with iperf, although I have no 'console screenshot' from it.

    Closing words

    IP over InfiniBand seems to be a nice way to get high-performance networking on the cheap. The main downside is that when using IP over IB, CPU usage will be high.

    Another thing I have not researched, but could be of interest is running NFS or other protocols directly over InfiniBand using RDMA, so you would bypass the overhead of IP.

  2. Tracking Down a Faulty Storage Array Controller With ZFS

    December 15, 2016

    One day, I lost two virtual machines on our DR environment after a storage vMotion.

    Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.

    I could easily restore the affected virtual machines from backup and once that was done, continued my investigation.

    I needed a way to quickly verifying if a virtual hard drive of a virtual machine was corrupted after a storage vMotion to understand what the pattern was.

    First, I created a virtual machine based on Linux and installed ZFS. Then, I attached a second disk of about 50 gigabytes and formatted this drive with ZFS. Once I filled the drive using 'dd' to about 40 gigabytes I was ready to test.

    ZFS was chosen for testing purposes because it stores hashes of all blocks of data. This makes it very simple to quickly detect any data corruption. If the hash doesn't match the hash generated from the data, you just detected corruption.

    Other file systems don't store hashes and don't check for data corruption so they just trust the storage layer. It may take a while before you find out that data is corrupted.

    I performed a storage vMotion of this secondary disk towards different datastores and then ran a 'zfs scrub' to track down any corruption. This worked better than expected: the scrub command would hang if the drive was corrupted by the storage vMotion. The test virtual machine required a reboot and a reformat of the secondary hard drive with ZFS as the previous file system, including data got corrupted.

    After performing a storage vMotion on the drive in different directions, from different datastores to other datastores slowly a pattern emerged.

    1. Storage vMotion corruption happened independent of the VMware ESXi host used.

    2. a Storage vMotion never caused any issues when the disk was residing on our production storage array.

    3. the corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.

    Now it got really 'interesting'. The thing is that our DR storage array has two separate storage controllers running in active-active mode. However, the LUNs are always owned by a particular controller. Although the other controller can take over from the controller who 'owns' the LUNs in case of a failure, the owner will process the I/O when everything is fine. Particular LUNs are thus handled by a particular controller.

    So first I made a table where I listed the controllers and the LUNs it had ownership over, like this:

    Controller      a               b
                LUN001          LUN002
                LUN003          LUN004
                LUN005          LUN006

    Then I started to perform Storage vMotions of the ZFS disk from one LUN to the other. After performing several test, the pattern became quite obvious.

                LUN001  ->  LUN002  =   BAD
                LUN001  ->  LUN004  =   BAD
                LUN004  ->  LUN003  =   BAD
                LUN003  ->  LUN005  =   GOOD
                LUN005  ->  LUN001  =   GOOD

    I continued to test some additional permutations but it became clear that only LUNs owned by controller b caused problems.

    With the evidence in hand, I managed to convince our vendor support to replace storage controller b and that indeed resolved the problem. Data corruption due to a Storage vMotion never occurred after the controller was replaced.

    There is no need to name/shame the vendor in this regard. The thing is that all equipment can fail and what can happen will happen. What really counts is: are you prepared?

    Tagged as : ZFS
  3. RAID 5 Is Perfectly Fine for Home Usage

    September 08, 2016

    RAID 5 gets a lot of flak these days. You either run RAID 1, RAID 10 or you use RAID 6, but if you run RAID 5 you're told that you are a crazy person.

    Using RAID 5 is portrayed as an unreasonable risk to the availability of your data. It is suggested that it is likely that you will lose your RAID array at some point.

    That's an unfair representation of the actual risk that surrounds RAID 5. As I see it, the scare about RAID 5 is totally blown out of proportion.

    I would argue that for small RAID arrays with a maximum of five to six drives, it's totally reasonable to use RAID 5 for your home NAS.

    As far as I can tell, the campaign against RAID 5 mainly started with this article from zdnet.

    As you know RAID 5 can tollerate a single drive failure. If a second drive dies and the first drive was not yet replaced or rebuild, you lose all contents of the array.

    In the article the author argues that because drives become bigger but not more reliable, the risk of losing a second drive during a rebuild is so high that running RAID 5 is becoming risky.

    You don't need a second drive failure for you to lose your data. A bad sector, also known as an Unrecoverable Read Error (URE), can also cause problems during a rebuild. Depending on the RAID implementation, you may lose some files or the entire array.

    The author calculates and argues that the risk of such a bad sector or URE is so high with modern high-capacity drives, that this risk of a second drive failure during rebuild is almost unavoidable.

    Most drives have a URE specification of 1 bit error in 12.5 TB of data (10^14). That number is used as an absolute, it's what drives do experience in our daily lives, but that's not true.

    It's a worst-case number. You will see a read error in at-most 10^14 bits, but in practice drives are way more reliable.

    I run ZFS on my 71 TB ZFS NAS and I scrub from time to time.

    If that worst-case number were 'real', I would have caught some data errors by now. However, in line with my personal experience, ZFS hasn't corrected a single byte since the system came online a few years ago.

    And I've performed so many scrubs that my system has read over a petabyte of data. No silent data corruption, no regular bad sectors.

    It seems to me that all those risk aren't nearly as high as it seems.

    I would argue that choosing RAID-5/Z in the right circumstances is reasonable. RAID-6 is clearly safer than RAID-5 as you can survive the loss of two drives instead of a single drive, but that doesn't mean that RAID-5 is unsafe.

    If you are going to run a RAID 5 array, make sure you run a scrub or patrol read or whatever the name is that your RAID solution uses. A scrub is nothing more than attempt to try and read all data from disk.

    Scrubbing allows detection of bad sectors in advance, so you can replace drives before they cause real problems (like failing during a rebuild).

    If you keep the number of drives in a RAID-5 array low, maybe at most 5 or 6, I think for home users, who need to find a balance between cost and capacity, RAID-5 is an acceptable option.

    And remember: if you care about your data, you need a backup anyway.

    This topic was also discussed on reddit.

    Tagged as : RAID

Page 3 / 64