1. ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

    July 31, 2014

    Update 2014-8-23: I was testing with ashift for my new NAS. The ashift=9 write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. Also I noticed that resilvering was very slow. This is why I decided to abandon my 24 drive RAIDZ3 configuration.

    I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue.

    My final setup will be a RAIDZ2 18 drive VDEV + RAIDZ2 6 drive VDEV which will give me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. I've written about 40+ TiB to the array and after those 40 TiB, write performance was about 1.7 GB/s, so still very good and what I would expect as drives fill up.

    So, based on these results, I have learned not to deviate from the ZFS best practices too much. Use ashift=12 and put drives in VDEVs that adhere to the 2^n+parity rule.

    The uneven VDEVs (18 disk vs. 6 disks) are not according to best practice but ZFS is smart: it distributes data across the VDEVs based on their size. So they fill up equally.
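
    As a small illustration of the 2^n+parity rule, the sketch below checks whether the data disks in a VDEV (total disks minus parity) form a power of two, and estimates how writes would split between the two VDEVs if ZFS divided them purely by data-disk count. That proportional split is a simplifying assumption, not a description of the real allocator, and the function name is made up for this example.

```python
def follows_rule(disks, parity):
    """True if the number of data disks (disks - parity) is a power of two."""
    data = disks - parity
    return data > 0 and (data & (data - 1)) == 0

# (total disks, parity disks) for the layouts mentioned in this post.
layouts = {
    "24-disk RAIDZ3": (24, 3),
    "18-disk RAIDZ2": (18, 2),
    "6-disk RAIDZ2": (6, 2),
}

for name, (disks, parity) in layouts.items():
    print(f"{name}: {disks - parity} data disks, rule ok: {follows_rule(disks, parity)}")

# Assumed model: writes are split in proportion to the data disks per VDEV.
data_18 = 18 - 2
data_6 = 6 - 2
share_18 = data_18 / (data_18 + data_6)
print(f"share of writes landing on the 18-disk VDEV: {share_18:.0%}")
```

    Under this simple model the 18-disk VDEV takes about four times as much data as the 6-disk one, which is consistent with both filling up at the same relative rate.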

    Choosing between ashift=9 and ashift=12 for 4K sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.

    My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is formatted with ashift=12.

    First we create the array like this:

    zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]

    Note: NEVER use /dev/sd? drive names for a real array; this is just for testing. Always use /dev/disk/by-id/ names.

    Then we run a simple sequential transfer benchmark with dd:

    root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
    root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s

    This is quite impressive. With these speeds, you can saturate 10GbE. But how much storage space do we get?

    df -h:

    Filesystem                            Size  Used Avail Use% Mounted on
    storage                                69T  512K   69T   1% /storage

    zfs list:

    storage  1.66M  68.4T   435K  /storage

    Only 68.4 TiB of storage? That's not good. With 24 drives minus 3 for parity, there should be 21 x 3.6 TiB = about 75 TiB of storage.

    So the performance is great, but somehow, we lost about 6 TiB of storage, more than a whole drive.
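
    The expected figure can be re-derived with a few lines of Python. This is only a sanity check of the arithmetic, assuming each "4 TB" drive holds exactly 4 x 10^12 bytes; the variable names are illustrative.

```python
DRIVES = 24
PARITY = 3
DRIVE_BYTES = 4 * 10**12   # assumed exact size of a "4 TB" drive
TIB = 2**40

tib_per_drive = DRIVE_BYTES / TIB
expected_tib = (DRIVES - PARITY) * tib_per_drive
reported_tib = 68.4        # from `zfs list` above

print(f"per drive : {tib_per_drive:.2f} TiB")
print(f"expected  : {expected_tib:.1f} TiB")
print(f"missing   : {expected_tib - reported_tib:.1f} TiB")
```

    Without rounding each drive down to 3.6 TiB, the expected capacity comes out slightly higher than 75 TiB, so the gap to the reported 68.4 TiB is, if anything, a little larger than stated.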

    So what happens if you create the same array with ashift=9?

    zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]

    These are the benchmarks:

    root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
    root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
    100000+0 records in
    100000+0 records out
    104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s

    So we lose about a third of our write performance, but the read performance is not affected. This is probably thanks to read-ahead caching, but I'm not sure.

    With ashift=9, we do lose some write performance, but we can still come close to saturating 10GbE.
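
    To put numbers on the comparison, the following sketch recomputes the write throughput from the byte counts and elapsed times that dd printed above and expresses the ashift=9 result in Gb/s. Nothing here is new data; only the arithmetic is added.

```python
BYTES = 104857600000          # bytes written by dd in both runs

secs_ashift12 = 66.4922       # elapsed time of the ashift=12 write
secs_ashift9 = 97.4231        # elapsed time of the ashift=9 write

gbs_ashift12 = BYTES / secs_ashift12 / 1e9   # GB/s, decimal like dd reports
gbs_ashift9 = BYTES / secs_ashift9 / 1e9

loss_pct = (1 - gbs_ashift9 / gbs_ashift12) * 100
gbit_ashift9 = gbs_ashift9 * 8               # Gb/s, to compare with 10GbE

print(f"ashift=12 write: {gbs_ashift12:.2f} GB/s")
print(f"ashift=9  write: {gbs_ashift9:.2f} GB/s ({gbit_ashift9:.1f} Gb/s)")
print(f"write loss     : {loss_pct:.1f} %")
```

    The loss works out to a little under a third, and the ashift=9 write speed is a bit below the 10 Gb/s line rate.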

    Now look what happens to the available storage capacity:

    df -h:

    Filesystem                         Size  Used Avail Use% Mounted on
    storage                             74T   98G   74T   1% /storage

    zfs list:

    storage   271K  73.9T  89.8K  /storage

    Now we have a capacity of 74 TiB, so we just gained 5 TiB with ashift=9 over ashift=12, at the cost of some write performance.
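
    One possible explanation for the difference, offered here as an assumption rather than something measured in this post: RAIDZ allocates space in whole sectors, adds parity sectors per stripe row, and pads every allocation to a multiple of (parity + 1) sectors. With 4K sectors a record consists of fewer, larger sectors, so the parity and padding overhead weighs more heavily. The sketch below applies that rule to a 128 KiB record (the assumed default recordsize) on a 24-disk RAIDZ3; the function name is made up for this example.

```python
def raidz_allocated_bytes(record_bytes, ashift, disks, parity):
    """Approximate on-disk size of one record on a RAIDZ VDEV."""
    data_disks = disks - parity
    # Data sectors needed, rounded up to whole sectors.
    data_sectors = ((record_bytes - 1) >> ashift) + 1
    # Each stripe row of data sectors carries `parity` parity sectors.
    rows = -(-data_sectors // data_disks)          # ceiling division
    total = data_sectors + parity * rows
    # Allocations are padded to a multiple of (parity + 1) sectors.
    total = -(-total // (parity + 1)) * (parity + 1)
    return total << ashift

RECORD = 128 * 1024   # assumed recordsize

for ashift in (12, 9):
    allocated = raidz_allocated_bytes(RECORD, ashift, disks=24, parity=3)
    print(f"ashift={ashift}: {allocated // 1024} KiB on disk, "
          f"efficiency {RECORD / allocated:.1%}")
```

    Under these assumptions a 128 KiB record occupies 160 KiB at ashift=12 and 148 KiB at ashift=9, against an ideal of 21/24 = 87.5% efficiency. The direction and rough size of that gap match the capacities reported above, but treat it as a plausible model, not a confirmed cause.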

    So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

    The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

    Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the reported free space is according to df -h or zfs list.

    Edit: I have added a bit of my own opinion on the results.

    Tagged as : ZFS Linux
  2. Linux: Script That Creates Table of Network Interface Properties

    August 15, 2013

    My server has 5 network interfaces and I wanted a quick overview of some properties. There may be an existing Linux command for this, but I couldn't find it, so I quickly wrote my own script (download).

    This is the output:


    The only requirement for this script is that you have 'ethtool' installed.

    Update 2013-08-17

    I recreated the script in Python (download) so I can format the table dynamically, without the ugly hacks I used in the bash script.
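
    The downloadable scripts are not reproduced here. As a rough idea of what dynamic table formatting can look like, here is a minimal sketch that is not the author's script: instead of calling ethtool it reads a few attributes from /sys/class/net (an assumption about where the data comes from), and all function names are made up for this example.

```python
from pathlib import Path

# Attributes read per interface; these are standard sysfs file names.
FIELDS = ("operstate", "speed", "mtu", "address")

def read_attr(iface_dir, name):
    """Return one sysfs attribute, or '-' if it cannot be read."""
    try:
        return (iface_dir / name).read_text().strip()
    except OSError:
        return "-"

def collect(sys_net=Path("/sys/class/net")):
    """Build one row per network interface found under sys_net."""
    rows = []
    if sys_net.is_dir():
        for iface_dir in sorted(sys_net.iterdir()):
            rows.append([iface_dir.name] + [read_attr(iface_dir, f) for f in FIELDS])
    return rows

def format_table(header, rows):
    """Pad every column to the width of its longest cell."""
    table = [list(header)] + rows
    widths = [max(len(str(row[i])) for row in table) for i in range(len(header))]
    lines = []
    for row in table:
        cells = (str(cell).ljust(width) for cell, width in zip(row, widths))
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)

if __name__ == "__main__":
    print(format_table(("iface",) + FIELDS, collect()))
```

    Computing the column widths from the data is what removes the need for hard-coded padding.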

    Tagged as : Linux Networking
  3. Why Filtering DHCP Traffic Is Not Always Possible With Iptables

    December 27, 2010

    When configuring my new firewall using iptables, I noticed something very peculiar. Even if all input, forward and output traffic was dropped, DHCP traffic to and from my DHCP server was not blocked even if there were no rules permitting this traffic.

    I even flushed all rules, put a drop all rule on all chains and only allowed SSH to the box. It did not matter. The DHCP server received the DHCP requests and happily answered back.

    How on earth is this possible? In my opinion, a firewall should block all traffic no matter what.

    But at least I found out the cause of this peculiar behaviour. The ISC DHCP daemon does not use the TCP/UDP/IP stack of the kernel. It uses RAW sockets. Raw sockets bypass the whole netfilter mechanism and thus the firewall.

    So remember: applications using RAW sockets cannot be firewalled by default. Applications need root privileges to use RAW sockets, so RAW sockets thankfully cannot be used by arbitrary unprivileged users on a system. Nevertheless, be aware of this issue.
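
    The privilege requirement is easy to check. The sketch below tries to open a link-layer packet socket (AF_PACKET on Linux), which is assumed here to be the kind of RAW socket meant above, and reports whether the current process is allowed to. The function name is made up for this example.

```python
import socket

ETH_P_IP = 0x0800  # EtherType for IPv4

def can_open_packet_socket():
    """Return True if this process may open a link-layer packet socket."""
    family = getattr(socket, "AF_PACKET", None)   # only defined on Linux
    if family is None:
        return False
    try:
        sock = socket.socket(family, socket.SOCK_RAW, socket.htons(ETH_P_IP))
    except OSError:                                # includes PermissionError
        return False
    sock.close()
    return True

if __name__ == "__main__":
    print("packet socket allowed:", can_open_packet_socket())
```

    Run as a normal user this should print False, and as root it should print True on a typical Linux system.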

    Please understand that if a serious security vulnerability is found in the ISC DHCP daemon, you cannot protect your daemon with a local firewall on your system. Patching or disabling would then be the only solution.

  4. Linux Network Interface Bonding / Trunking or How to Get Beyond 1 Gb/s

    November 11, 2010

    This article discusses Linux bonding and how to achieve 2 Gb/s transfer speeds with a single TCP/UDP connection.

    UPDATE July 2011

    Due to hardware problems, I was not able to achieve transfer speeds beyond 150 MB/s. By replacing a network card with one from another vendor (HP Broadcom) I managed to obtain 220 MB/s which is about 110 MB/s per network interface.

    So I am now able to copy a single file with the 'cp' command over an NFS share with 220 MB/s.

    Update January 2014

    See this new article on how I got 340 MB/s transfer speeds.

    I had problems with an Intel e1000e PCIe card in an Intel DH67BL. I tested with different e1000e PCIe models but to no avail. RX was 110 MB/s. TX was never faster than 80 MB/s. An HP Broadcom card gave no problems and also provided 110 MB/s for RX traffic. lspci output:

    Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express

    The on-board e1000e NIC performed normally, but the PCIe e1000e cards, regardless of chipset, never got above 80 MB/s.

    A gigabit network card provides about 110 MB/s (megabytes) of bandwidth. If you want to go faster, the options are:

    1. buy InfiniBand stuff: I have no experience with it; it may be a smart thing to do but seems expensive.
    2. buy 10 Gigabit network cards: very expensive compared to other solutions.
    3. strap multiple network interfaces together to get 2 Gb/s or more with more cards.

    This article discusses the third option. Teaming or bonding two network cards into a single virtual card that provides twice the bandwidth will give you the extra performance that you were looking for. But the 64000 dollar question is:

    How to obtain 2 Gb/s with a single transfer? Thus with a single TCP connection?

    Answer: The trick is to use Linux network bonding.

    Most bonding options only provide an accumulated performance of 2 Gb/s, by balancing different network connections over different interfaces. Individual transfers will never reach beyond 1 Gbit/s but it is possible to have two 1 Gb/s transfers going on at the same time.

    That is not what I was looking for. I want to copy a file using NFS and get more than 120 MB/s.

    The only bonding mode that supports single TCP or UDP connections to go beyond 1 Gb/s is mode 0: Round Robin. This bonding mode is kinda like RAID 0 over two or more network interfaces.
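
    The difference between the two approaches can be shown with a toy model. The hash below is a simplified version of the layer3+4 transmit hash as described in older kernel bonding documentation, and the addresses and ports are hypothetical; treat it as an illustration of the principle, not of the exact kernel behaviour.

```python
import ipaddress
from collections import Counter

def layer34_slave(src_ip, dst_ip, src_port, dst_port, n_slaves):
    """Pick a slave interface from a hash of addresses and ports."""
    ip_xor = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % n_slaves

def round_robin_slaves(n_packets, n_slaves):
    """Mode 0: every packet simply goes to the next interface in turn."""
    return [i % n_slaves for i in range(n_packets)]

# One hypothetical NFS connection (client port 40000 to server port 2049).
flow = ("192.0.2.10", "192.0.2.20", 40000, 2049)

hashed = Counter(layer34_slave(*flow, 2) for _ in range(1000))
striped = Counter(round_robin_slaves(1000, 2))

print("hash policy, packets per interface :", dict(hashed))
print("round robin, packets per interface :", dict(striped))
```

    With a hash policy all packets of a single connection land on one interface, so that connection is limited to 1 Gb/s. Round robin spreads the same packets evenly over both interfaces.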

    However, you cannot use Round Robin with a standard switch. You need an advanced switch that is capable of creating "trunks". A trunk is a virtual network interface that consists of individual ports that are grouped together. So you cannot use Round Robin mode with an average unmanaged switch. The only other option is to use direct cables between two hosts, although I didn't test this.


    UPDATE July 2011 : Read the update at the top.

    Now the results: I was able to obtain a transfer speed (read) of 155 MB/s with a file copy using NFS. Normal transfers capped at 109 MB/s. To be honest: I had hoped to achieve way more, like 180 MB/s. However, the actual transfer speeds that will be obtained will depend on the hardware used. I recommend using Intel or Broadcom hardware for this purpose.

    Also, I was not able to obtain a write speed that surpasses 1 Gb/s. Since I used a fast RAID array to write the data to, the underlying storage subsystem was not the bottleneck.

    So the bottom line is that it is possible to get more than 1 Gb/s, but the performance gain is not as high as you may want.



    # Mode 0 (balance-rr): round robin over both interfaces
    modprobe bonding mode=0
    ifconfig bond0 up
    ifenslave bond0 eth0 eth1
    ifconfig bond0 netmask


    # Mode 4 (802.3ad): LACP with a layer3+4 transmit hash
    modprobe bonding mode=4 lacp_rate=0 xmit_hash_policy=layer3+4
    ifconfig bond0 up
    ifenslave bond0 eth0 eth1
    ifconfig bond0 netmask

    Bonding status:

    cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)
    Bonding Mode: IEEE 802.3ad Dynamic link aggregation
    Transmit Hash Policy: layer3+4 (1)
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    802.3ad info
    LACP rate: slow
    Active Aggregator Info:
    Aggregator ID: 2
    Number of ports: 2
    Actor Key: 9
    Partner Key: 26
    Partner Mac Address: 00:de:ad:be:ef:90
    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:co:ff:ee:aa:00
    Aggregator ID: 2
    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:de:ca:fe:b1:7d
    Aggregator ID: 2
  5. Secure Caching DNS Server on Linux With DJBDNS

    June 12, 2010

    The most commonly used DNS server software is ISC BIND, the "Berkeley Internet Name Domain". However, this software has a bad security track record and is in my opinion a pain to configure.

    Mr. D.J. Bernstein developed "djbdns", which comes with a guarantee: if anyone finds a security vulnerability within djbdns, they will get one thousand dollars. This prize has been claimed once. But djbdns has a far better track record than BIND.

    Well, attaching your own name to your DNS implementation and tying a prize to it if someone finds a vulnerability in it does show some confidence. But there is more to it. D.J. Bernstein already pointed out some important security risks regarding DNS and made djbdns immune against them, even before they became a serious world-wide security issue. However, djbdns is to this day vulnerable to a variant of this type of attack and the dbndns package is as of 2010 still not patched. Although the risk is small, you must be aware of this. I still think that djbdns is less of a security risk, especially regarding buffer overflows, but it is up to you to decide which risk you want to take.

    The nice thing about djbdns is that it consists of several separate programs that each perform a dedicated task. This is in stark contrast with BIND, which is one single program that performs all DNS functionality. One can argue that djbdns is far simpler and easier to use.

    So this post is about setting up djbdns on a Debian Linux host as a forwarding server, thus a 'DNS cache'. This is often used to speed up DNS queries. Clients do not have to connect to the DNS server of your ISP but can use your local DNS server. This server will also cache the results of queries, so it will reduce the number of DNS queries that will be sent out to your ISP DNS server or the Internet.

    Debian Lenny has a patched version of djbdns in its repository. The applied patch adds IPV6 support to djbdns. This is how you install it:

    apt-get install dbndns

    The dbndns package is actually a fork of the original djbdns software. Now the program we need to configure is called 'dnscache', which only does one thing: performing recursive DNS queries. This is exactly what we want.

    To keep things secure, the djbdns software must not be run with superuser (root) privileges, so two accounts must be made: one for the service, and one for logging.

    groupadd dnscache

    useradd -g dnscache dnscache

    useradd -g dnscache dnscachelog

    The next step is to configure the dnscache software like this:

    dnscache-conf dnscache dnscachelog /etc/dnscache

    The first two options tell dnscache which system user accounts to use for this service. The /etc/dnscache directory stores the dnscache configuration. The last option specifies which IP address to listen on. If you don't specify an IP address, localhost is used. If you want to run a forwarding DNS server for your local network, you need to make dnscache listen on the IP address on your local network, as in the example.

    Djbdns relies on daemontools and in order to be started by daemontools we need to perform one last step:

    ln -s /etc/dnscache /etc/service/

    Within a couple of seconds, the dnscache software will be started by the daemontools software. You can check it out like this:

    svstat /etc/service/dnscache

    A positive result will look like this:

    /etc/service/dnscache: up (pid 6560) 159 seconds

    However, the cache cannot be used just yet. Dnscache is governed by some text-based configuration files in the /etc/dnscache directory. For example, the ./env/IP file contains the IP address that we configured previously on which the service will listen.

    By default, only localhost will be able to access the dnscache. To allow access to all clients on the local network you have to create a file with the name of the network in ./root/ip/. If your network uses the prefix 192.168.0 (thus 254 hosts), create a file named 192.168.0:

    Mini:/etc/dnscache/root/ip# pwd


    Mini:/etc/dnscache/root/ip# ls


    Now clients will be able to use the dnscache. Now you are running a simple forwarding DNS server and it probably took you under ten minutes to configure it. Although djbdns is not very well maintained in Debian Lenny, there is currently not a really good alternative for BIND. PowerDNS is not very secure (buffer overflows) and djbdns / dbndns has in more than 10 years never been affected by this type of vulnerability.
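
    To check from a client that the cache actually answers, a hand-built DNS query is enough. The sketch below is a minimal example using only the Python standard library; the function names and the query ID are made up for this example, and the server address is something you have to supply yourself.

```python
import socket
import struct

def build_query(name, qid=0x1234):
    """Build a DNS query packet for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)   # QTYPE=A, QCLASS=IN

def ask(server, name, timeout=2.0):
    """Send the query over UDP and return (rcode, number of answers)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(512)
    _, flags, _, answers, _, _ = struct.unpack(">HHHHHH", data[:12])
    return flags & 0x000F, answers
```

    Calling ask() with the address dnscache listens on and a name such as "example.com" should return an rcode of 0 and at least one answer if the client is allowed to use the cache. A timeout suggests that the client's network is not listed in ./root/ip/.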
