ZFS: Performance and Capacity Impact of Ashift=9 on 4K Sector Drives

July 31, 2014 Category: Storage

Update 2014-08-23: I was testing ashift settings for my new NAS. Write performance with ashift=9 deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. I also noticed that resilvering was very slow. This is why I decided to abandon my 24-drive RAIDZ3 configuration.

I'm aware that drives are faster at the outside of the platter and slower on the inside, but the performance deteriorated so dramatically that I did not want to continue.

My final setup will be an 18-drive RAIDZ2 VDEV plus a 6-drive RAIDZ2 VDEV, which gives me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. After writing 40+ TiB to the array, write performance was still about 1.7 GB/s, so still very good and what I would expect as drives fill up.

So based on these results, I have learned not to deviate too much from the ZFS best practices. Use ashift=12 and put drives in VDEVs that adhere to the 2^n + parity rule.
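The 2^n + parity rule simply means that the number of data disks in a VDEV should be a power of two. A quick sketch (the helper is my own, not a ZFS tool) of the VDEV widths that qualify:

```python
# Sketch (my own helper, not a ZFS tool): RAIDZ vdev widths where the
# number of data disks is a power of two, i.e. width = 2^n + parity.

def valid_widths(parity, max_width=24):
    return [2**n + parity for n in range(1, 6) if 2**n + parity <= max_width]

print("RAIDZ2:", valid_widths(2))  # [4, 6, 10, 18]
print("RAIDZ3:", valid_widths(3))  # [5, 7, 11, 19]
```

The 18-drive and 6-drive RAIDZ2 VDEVs above both qualify (16 + 2 and 4 + 2), while the 24-drive RAIDZ3 I abandoned (21 data disks) does not.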

The uneven VDEVs (18 disks vs. 6 disks) are not according to best practice, but ZFS is smart: it distributes data across the VDEVs based on their size, so they fill up at the same rate.
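A toy model illustrates this (assumption: writes are spread purely in proportion to each VDEV's free space; the real ZFS metaslab allocator has more heuristics):

```python
# Toy model (not real ZFS code): spread each write over the VDEVs in
# proportion to their free space and watch the fill percentages.

vdevs = {"raidz2-18": {"size": 18.0, "used": 0.0},
         "raidz2-6":  {"size": 6.0,  "used": 0.0}}

def write(amount):
    free = {name: v["size"] - v["used"] for name, v in vdevs.items()}
    total_free = sum(free.values())
    for name, v in vdevs.items():
        v["used"] += amount * free[name] / total_free

for _ in range(10):
    write(1.0)  # write 10 units in total

for name, v in vdevs.items():
    print(f"{name}: {100 * v['used'] / v['size']:.1f}% full")
```

Both VDEVs end up 41.7% full: the larger one simply absorbs three times as much data per write.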


Choosing between ashift=9 and ashift=12 for 4K sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.

My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is first formatted with ashift=12.

First we create the array like this:

zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]

Note: NEVER use /dev/sd? device names for a real array; always use /dev/disk/by-id/ names. The short names are used here only for testing.

Then we run a simple sequential transfer benchmark with dd:

root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000 
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s

This is quite impressive. With these speeds, you can saturate 10 gigabit Ethernet. But how much storage space do we get?
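10GbE moves at most 10 Gbit/s, which is 1.25 GB/s in the same decimal units dd reports, so both numbers are above line rate. A quick sanity check:

```python
# 10GbE line rate expressed in the same (decimal) GB/s that dd reports.
line_rate_gbs = 10e9 / 8 / 1e9   # = 1.25 GB/s

write_gbs, read_gbs = 1.6, 2.5   # the dd results above
print(write_gbs > line_rate_gbs, read_gbs > line_rate_gbs)  # True True
```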

df -h:

Filesystem                            Size  Used Avail Use% Mounted on
storage                                69T  512K   69T   1% /storage

zfs list:

NAME      USED  AVAIL  REFER  MOUNTPOINT
storage  1.66M  68.4T   435K  /storage

Only 68.4 TiB of storage? That's not good. There should be more: 24 drives minus 3 for parity leaves 21 drives x 3.6 TiB ≈ 75 TiB of storage.

So the performance is great, but somehow we lost about 6 TiB of storage, more than a whole drive's worth.
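The loss is consistent with RAIDZ allocation padding: every RAIDZ allocation is rounded up to a multiple of (parity + 1) sectors, and with 4K sectors that rounding wastes proportionally more space. A back-of-the-envelope sketch, assuming 128 KiB records (this mirrors my understanding of the ZFS allocator and its space estimate, so treat the exact numbers as an approximation):

```python
import math

# Back-of-the-envelope RAIDZ space efficiency for a single record:
# data sectors, plus parity sectors per stripe row, plus the round-up
# of the allocation to a multiple of (parity + 1) sectors (padding).
def raidz_efficiency(disks, parity, ashift, recordsize=128 * 1024):
    sector = 1 << ashift
    data = recordsize // sector                  # data sectors per record
    rows = math.ceil(data / (disks - parity))    # stripe rows needed
    total = data + rows * parity                 # add parity sectors
    total = math.ceil(total / (parity + 1)) * (parity + 1)  # padding
    return data / total

raw_tib = 24 * 4e12 / 2**40                            # 24 x 4 TB drives
print(f"parity only:  {raw_tib * 21 / 24:.1f} TiB")    # 76.4 TiB
print(f"with padding: {raw_tib * raidz_efficiency(24, 3, 12):.1f} TiB")
```

With padding the estimate drops to about 69.8 TiB, much closer to the reported 68.4 TiB; I assume the remaining gap is metaslab rounding and reserved space.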

So what happens if you create the same array with ashift=9?

zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]

These are the benchmarks:

root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000 
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s

So we lose about a third of our write performance, while the read performance is not affected, probably thanks to read-ahead caching, but I'm not sure.

With ashift=9 we do lose some write performance, but at 1.1 GB/s we can still just about saturate 10 gigabit Ethernet.

Now look what happens to the available storage capacity:

df -h:

Filesystem                         Size  Used Avail Use% Mounted on
storage                             74T   98G   74T   1% /storage

zfs list:

NAME      USED  AVAIL  REFER  MOUNTPOINT
storage   271K  73.9T  89.8K  /storage

Now we have a capacity of 73.9 TiB, so we gained about 5.5 TiB with ashift=9 over ashift=12, at the cost of some write performance.
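Putting the trade-off into numbers, using the dd and zfs list figures from this test:

```python
cap9, cap12 = 73.9, 68.4   # TiB available, from zfs list
w9, w12 = 1.1, 1.6         # GB/s sequential write, from dd

print(f"capacity gain: +{cap9 - cap12:.1f} TiB (+{100 * (cap9 / cap12 - 1):.0f}%)")
print(f"write penalty: -{w12 - w9:.1f} GB/s (-{100 * (1 - w9 / w12):.0f}%)")
```

So ashift=9 buys roughly 8% more capacity for roughly 31% less sequential write throughput on this array.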

So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.

The performance of ashift=9 on 4K drives is always described as 'horrible' but I think it's best to run your own benchmarks and decide for yourself.

Caveat: I'm quite sure about the benchmark performance. I'm not 100% sure how reliable the free space reported by df -h and zfs list is.

Edit: I have added a bit of my own opinion on the results.
