1. Creating a Basic ZFS File System on Linux

    Sat 01 February 2014

    Here are some notes on creating a basic ZFS file system on Linux, using ZFS on Linux.

    I'm documenting the scenario where I just want to create a file system that can tolerate at least a single drive failure and can be shared over NFS.

    Identify the drives you want to use for the ZFS pool

    The ZFS on Linux project advises not to use plain /dev/sdx (/dev/sda, etc.) devices but to use /dev/disk/by-id/ or /dev/disk/by-path/ device names.

    Device names for storage devices are not fixed, so /dev/sdx devices may not always point to the same physical disk. I've been bitten by this when first experimenting with ZFS: I did not follow this advice, later removed a drive from the system, and after a reboot I could no longer access my zpool.

    So you should pick the appropriate device from the /dev/disk/by-[id|path] folder. However, it's often difficult to determine which device in those folders corresponds to an actual disk drive.
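
    If you want to check the mapping by hand, you can list those directories directly; the symlinks point back to the plain /dev/sdx nodes. This gets tedious with many drives, but it works:

    ls -l /dev/disk/by-id/ | grep -v part
    ls -l /dev/disk/by-path/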

    So I wrote a simple tool called showdisks which helps you identify which identifiers you need to use to create your ZFS pool.

    [Screenshot: showdisks output listing drives with their /dev/disk/by-path identifiers]

    You can install showdisks yourself by cloning the project:

    git clone https://github.com/louwrentius/showtools.git
    

    And then run showdisks with the -s (show size) and -p (show by-path) options:

    ./showdisks -sp
    

    For this example, I'd like to use all the 500 GB disk drives for a six-drive RAIDZ1 vdev. Based on the information from showdisks, this is the command to create the pool with that vdev:

    zpool create tank raidz1 pci-0000:03:00.0-scsi-0:0:21:0 pci-0000:03:00.0-scsi-0:0:19:0 pci-0000:02:00.0-scsi-0:0:9:0 pci-0000:02:00.0-scsi-0:0:11:0 pci-0000:03:00.0-scsi-0:0:22:0 pci-0000:03:00.0-scsi-0:0:18:0
    

    The 'tank' name can be anything you want; it's just the name of the pool.
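
    By default, the new pool is also mounted right away under /tank. A quick way to verify the result (assuming the pool name 'tank' from the example above):

    zpool status tank
    zfs list tank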

    Please note that with newer, bigger disk drives (which use 4K sectors), you should test whether the ashift=12 option gives you better performance.

    zpool create -o ashift=12 tank raidz1 <devices>
    

    I used this option on 2 TB disk drives and the overall performance, read performance in particular, improved twofold.
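
    If you want to check which ashift value a pool was actually created with, zdb can show it. A minimal sketch, assuming the pool is called 'tank':

    zdb -C tank | grep ashift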

    How to set up a RAID10-style pool

    This is how to create the ZFS equivalent of a RAID10 setup:

    zpool create tank mirror <device 1> <device 2> mirror <device 3> <device 4> mirror <device 5> <device 6>
    
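    A nice property of this layout is that you can grow the pool later by adding extra mirror pairs. A sketch, using the same kind of placeholder device names as above:

    zpool add tank mirror <device 7> <device 8>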

    How many drives should I use in a vdev

    I've learned to use a 'power of two' number of data drives (2, 4, 8, 16) in a vdev, plus the appropriate number of drives for parity: RAIDZ1 = 1 disk, RAIDZ2 = 2 disks, etc.

    So the optimal number of drives for RAIDZ1 would be 3, 5, 9 or 17, and for RAIDZ2 it would be 4, 6, 10 or 18, and so on. Clearly, with six drives in a RAIDZ1 configuration in the example above, I'm violating this rule of thumb.
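
    The reasoning behind this rule, as I understand it, is that the default 128K record is striped across the data drives, and with a 'power of two' number of data drives each drive gets a whole number of 4K sectors. A rough illustration:

    # RAIDZ1 with 5 drives = 4 data drives + 1 parity drive:
    #   128 KiB / 4 = 32 KiB per drive = 8 full 4K sectors (no padding)
    # RAIDZ1 with 6 drives = 5 data drives + 1 parity drive:
    #   128 KiB / 5 = 25.6 KiB per drive, which does not divide evenly
    #   into 4K sectors, so some space is lost to padding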

    How to disable the ZIL or disable sync writes

    You can expect poor throughput if your workload uses the ZIL, i.e. if it issues synchronous writes. For safety reasons, ZFS honours sync writes by default; it's an important feature of ZFS to guarantee data integrity. For storage of virtual machines or databases, you should not turn off the ZIL, but instead use an SSD as a separate log device (SLOG) to get performance to acceptable levels.
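
    Adding an SSD as a separate log device (SLOG) works much like adding a cache device (see further below). A sketch, with a placeholder device name:

    zpool add tank log <ssd-device>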

    For a simple (home) NAS box, the ZIL is not so important and can quite safely be disabled, as long as you have your server on a UPS and have it shut down cleanly before the UPS battery runs out.

    This is how you turn off the ZIL / support for synchronous writes:

    zfs set sync=disabled <pool name>
    

    Disabling sync writes makes an especially big difference if you use NFS, which issues sync writes by default.

    Example:

    zfs set sync=disabled tank
    
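    You can check the current setting and later revert to the default behaviour like this:

    zfs get sync tank
    zfs set sync=standard tank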

    How to add an L2ARC cache device

    Use showdisks to look up the actual /dev/disk/by-path identifier and add it like this:

    zpool add tank cache <device>
    

    Example:

    zpool add tank cache pci-0000:00:1f.2-scsi-2:0:0:0
    

    This is the result (on another zpool called 'server'):

    root@server:~# zpool status
      pool: server
     state: ONLINE
      scan: none requested
    config:
    
        NAME                               STATE     READ WRITE CKSUM
        server                             ONLINE       0     0     0
          raidz1-0                         ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:0:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:1:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:2:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:3:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:4:0  ONLINE       0     0     0
            pci-0000:03:04.0-scsi-0:0:5:0  ONLINE       0     0     0
        cache
          pci-0000:00:1f.2-scsi-2:0:0:0    ONLINE       0     0     0
    

    How to monitor performance / I/O statistics

    One time sample:

    zpool iostat
    

    A sample every 2 seconds:

    zpool iostat 2
    

    More detailed information every 5 seconds:

    zpool iostat -v 5
    

    Example output:

                                          capacity     operations    bandwidth
    pool                               alloc   free   read  write   read  write
    ---------------------------------  -----  -----  -----  -----  -----  -----
    server                             3.54T  7.33T      4    577   470K  68.1M
      raidz1                           3.54T  7.33T      4    577   470K  68.1M
        pci-0000:03:04.0-scsi-0:0:0:0      -      -      1    143  92.7K  14.2M
        pci-0000:03:04.0-scsi-0:0:1:0      -      -      1    142  91.1K  14.2M
        pci-0000:03:04.0-scsi-0:0:2:0      -      -      1    143  92.8K  14.2M
        pci-0000:03:04.0-scsi-0:0:3:0      -      -      1    142  91.0K  14.2M
        pci-0000:03:04.0-scsi-0:0:4:0      -      -      1    143  92.5K  14.2M
        pci-0000:03:04.0-scsi-0:0:5:0      -      -      1    142  90.8K  14.2M
    cache                                  -      -      -      -      -      -
      pci-0000:00:1f.2-scsi-2:0:0:0    55.9G     8M      0     70    349  8.69M
    ---------------------------------  -----  -----  -----  -----  -----  -----
    

    How to start / stop a scrub

    Start:

    zpool scrub <pool>
    

    Stop:

    zpool scrub -s <pool>
    
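    The progress of a running scrub shows up in the pool status:

    zpool status <pool>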

    Mount ZFS file systems on boot

    Edit /etc/default/zfs and set this parameter:

    ZFS_MOUNT='yes'
    
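    To mount all ZFS file systems by hand, for example after changing this setting, you can run:

    zfs mount -a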

    How to enable sharing a file system over NFS

    zfs set sharenfs=on <poolname>
    
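    Instead of just 'on', the sharenfs property also accepts NFS export options. A sketch that restricts access to a hypothetical subnet and then shows the current setting:

    zfs set sharenfs="rw=@192.168.2.0/24" tank
    zfs get sharenfs tank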

    How to create a zvol for use with iSCSI

    zfs create -V 500G <poolname>/volume-name
    
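    The resulting block device shows up under /dev/zvol/ and can then be exported by your iSCSI target software:

    ls -l /dev/zvol/<poolname>/volume-name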

    How to force ZFS to import the pool using disk/by-path

    Edit /etc/default/zfs and add

    ZPOOL_IMPORT_PATH=/dev/disk/by-path/
    
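    For a pool that is already imported, you can also export it and re-import it using the by-path names directly. A sketch, assuming the pool is called 'tank':

    zpool export tank
    zpool import -d /dev/disk/by-path/ tank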

    Links to important ZFS information sources:

    Tons of information on using ZFS on Linux by Aaron Toponce:

    https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

    Understanding the ZIL (ZFS Intent Log)

    http://nex7.blogspot.nl/2013/04/zfs-intent-log.html

    Information about 4K sector alignment problems

    http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

    Important read about using the proper number of drives in a vdev

    http://forums.freenas.org/threads/getting-the-most-out-of-zfs-pools.16/

    Tagged as : ZFS
  2. Why You Should Not Use IPsec for VPN Connectivity

    Tue 28 January 2014

    IPsec is a well-known and widely-used VPN solution. It seems that it's not widely known that Niels Ferguson and Bruce Schneier performed a detailed security analysis of IPsec and that the results were not very positive.

    We strongly discourage the use of IPsec in its current form for protection of any kind of valuable information, and hope that future iterations of the design will be improved.
    

    I conveniently left out the second part:

    However, we even more strongly discourage any current alternatives, and recommend IPsec when the alternative is an insecure network. Such are the realities of the world.
    

    To put this in context: keep in mind that this paper was released in 2003 and the actual research may be even older (1999!). OpenVPN, an open-source SSL-based VPN solution, was born in 2001 and was still maturing in 2003. So there actually was no real alternative back then.

    It worries me that this research by Ferguson and Schneier is more than a decade old. I've been looking for more recent articles on the current security status of IPsec, but I couldn't find much. Some new RFCs about IPsec have been published, but I'm not familiar enough with the material to understand their implications. Ferguson and Schneier make a lot of recommendations in the paper to improve IPsec security, but have those actually been implemented?

    I did find a presentation from 2013 by Peter Gutmann (University of Auckland). Based on his Wikipedia page, he seems to 'have some knowledge' about cryptography. The presentation addresses the Snowden leaks about the NSA and also touches on IPsec. On that subject, he basically relies on the paper written by Ferguson and Schneier.

    But let's think about this: Ferguson and Schneier criticise the design of IPsec. It is flawed by design. That's one of the worst criticisms anything related to cryptography can get. That design has probably not changed much, from what I understand. So if their critique of IPsec is still mostly valid, that's all the more reason not to use it.

    So this is part of the conclusion and it doesn't beat around the bush:

    We have found serious security weaknesses in all major components of IPsec.
    As always in security, there is no prize for getting 90% right; you have to get
    everything right. IPsec falls well short of that target, and will require some major
    changes before it can possibly provide a good level of security.
    What worries us more than the weaknesses we have identified is the complexity
    of the system. In our opinion, current evaluation methods cannot handle
    systems of such a high complexity, and current implementation methods are not
    capable of creating a secure implementation of a system as complex as this.
    

    So if not IPsec, what should you use? I would opt to use an SSL/TLS-based VPN solution like OpenVPN.

    I can't vouch for the security of OpenVPN, but the well-known Dutch security firm Fox-IT has released a stripped-down version of the OpenVPN software (with some features removed) that they consider fit for (Dutch) governmental use. That's not to say that you should use that particular OpenVPN version: the point is that OpenVPN is deemed secure enough for governmental use. For whatever that's worth.

    SSL-based VPN solutions have the benefit that they use SSL/TLS, which may have its own problems, but is at least not as complex as IPsec.

    Tagged as : IPsec security
  3. Achieving 450 MB/s Network File Transfers Using Linux Bonding

    Tue 07 January 2014

    Linux Bonding

    In this article I'd like to show the results of using regular 1 Gigabit network connections to achieve 450 MB/s file transfers over NFS.

    I'm again using Linux interface bonding for this purpose.

    Linux interface bonding can be used to create a virtual network interface with the aggregate bandwidth of all the interfaces added to the bond. Two gigabit network interfaces will give you - guess what - two gigabit or ~220 MB/s of bandwidth.

    This bandwidth can even be used by a single TCP connection.

    So how is this achieved? The Linux bonding kernel module has a special bonding mode: mode 0, or round-robin bonding. In this mode, the kernel stripes packets across the interfaces in the 'bond', like RAID 0 does with hard drives. As with RAID 0, you gain additional performance with each device you add.
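
    Under the hood this means loading the bonding kernel module with round-robin mode selected. A minimal sketch (the options can also be made persistent via a file in /etc/modprobe.d/):

    modprobe bonding mode=balance-rr miimon=100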

    So I've added HP NC364T quad-port network cards to my servers. Thus each server has a theoretical bandwidth of 4 Gigabit. These HP network cards cost just 145 Euro and I even found the card for 105 Dollar on Amazon.

    [Image: HP NC364T quad-port network card]

    With two servers, you just need four UTP cables to connect the two network cards directly and you're done. This would cost you ~300 Euro or ~200 Dollar in total.

    If you want to connect additional servers, you need a managed gigabit switch with VLAN-support and sufficient ports. Each additional server will use 4 ports on the switch, excluding interfaces for remote access and other purposes.

    Managed gigabit switches are quite inexpensive these days. I bought a 24 port switch: HP 1810-24G v2 (J9803A) for about 180 euros (209 Dollars on Newegg) and it can even be rack-mounted.

    [Image: HP 1810-24G v2 gigabit switch]

    Using VLANS

    So you can't just use a gigabit switch and connect all these quad-port network cards to a single VLAN. I tested this scenario first and only got a maximum transfer speed of 270 MB/s while copying a file between servers over NFS.

    The trick is to create a separate VLAN for every network port. So if you use a quad-port network card, you need four VLANs. Also, you must make sure that the corresponding port on every network card is in the same VLAN: port 1 on every card goes in VLAN 21, port 2 in VLAN 22, and so on. You also must add the appropriate switch port to the correct VLAN. Last, you must add the network interfaces to the bond in the right order.

    [Diagram: bonding and VLAN schema]

    So why do you need to use VLANs? The reason is quite simple. Bonding works by spoofing the same hardware or MAC address on all interfaces. So the switch sees the same hardware address on four ports and gets confused: to which port should the packet be sent?

    If you put each port in its own VLAN, the 'spoofed' MAC address is seen only once in each VLAN. So the switch won't be confused. What you are in fact doing by creating VLANs is creating four separate switches. So if you have - for example - four cheap 8-port unmanaged gigabit switches, this would work too.

    So assuming that you have four ethernet interfaces, this is an example of how you can create the bond:

    ifenslave bond0 eth1 eth2 eth3 eth4
    

    Next, you just assign an IP address to the bond0 interface, like you would with a regular eth(x) interface.

    ifconfig bond0 192.168.2.10 netmask 255.255.255.0
    
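    To make this persistent across reboots on Debian/Ubuntu, the ifenslave package lets you describe the bond in /etc/network/interfaces. A sketch, under the assumption that eth1 through eth4 are the four ports of the quad-port card (you can also add 'mtu 9000' here for the jumbo frames discussed below):

    auto bond0
    iface bond0 inet static
        address 192.168.2.10
        netmask 255.255.255.0
        bond-slaves eth1 eth2 eth3 eth4
        bond-mode balance-rr
        bond-miimon 100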

    Up to this point, I only achieved about 350 MB/s. I needed to enable jumbo frames on all interfaces and the switch to achieve 450 MB/s.

    ifconfig bond0 mtu 9000 up
    

    Next, you can just mount any NFS share over the interface and start copying files. That's all.
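
    For example, with a hypothetical export named /storage on the server at 192.168.2.10:

    mount -t nfs 192.168.2.10:/storage /mnt/storage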

    Once I added each interface to the appropriate VLAN, I got about 450 MB/s for a single file transfer over NFS.

    [Screenshot: ~450 MB/s network throughput during the NFS transfer]

    I did not perform a 'cp' but a 'dd', because I don't have a disk array fast enough (yet) to write at 450 MB/s.
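
    A dd test along these lines generates the data stream without involving a local source array (mount point and file name are hypothetical):

    dd if=/dev/zero of=/mnt/storage/testfile.bin bs=1M count=20000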

    So for three servers, this solution will cost me 600 Euro or 520 Dollar.

    What about LAGs and LACP?

    Don't configure your clients or switches for LACP: it doesn't give you this speed benefit for a single connection, and it's not required.

    10 Gigabit ethernet?

    Frankly, it's quite expensive if you need to connect more than two servers. An entry-level 10Gbe NIC like the Intel X540-T1 costs about 300 Euro or 350 Dollar. This card allows you to use Cat 6/6a UTP cabling. (Pricing from Dutch webshops in Euros and Newegg in Dollars.)

    With two of those, you can have 10Gbit ethernet between two servers for 600 Euro or 700 Dollar. If you need to connect more servers, you need a switch. The problem is that 10Gbit switches are not cheap. An 8-port unmanaged switch from Netgear (ProSAFE Plus XS708E) costs about 720 Euro or 900 Dollar.

    If you want to connect three servers, you need three network cards and a switch. That's 900 Euro (1050 Dollar) for the network cards and 720 Euro (900 Dollar) for the switch, totalling 1620 Euro or 1950 Dollar.

    You will get higher transfer speeds, but at a significantly higher price.

    For many business purposes this higher price can be easily justified, and I would select 10 Gb over bonded 1 Gb in a heartbeat. Fewer cables, higher performance, lower latency.

    However, bonding gigabit interfaces allows you to use off-the-shelf equipment and may be a nice compromise between cost, usability and performance.

    Operating System Support

    As far as I'm aware, round-robin bonding is only supported on Linux; other operating systems do not support it.
