Update 2014-08-23: I was testing ashift settings for my new NAS. With ashift=9, write performance deteriorated from 1.1 GB/s to 830 MB/s with just 16 TB of data on the pool. I also noticed that resilvering was very slow. This is why I decided to abandon my 24-drive RAIDZ3 configuration.
I'm aware that drives are faster at the outside of the platter and slower on the inside, but performance deteriorated so dramatically that I did not want to continue.
My final setup will be an 18-drive RAIDZ2 VDEV plus a 6-drive RAIDZ2 VDEV, which gives me 'only' 71 TiB of storage, but read performance is 2.6 GB/s and write performance is excellent at 1.9 GB/s. I've written 40+ TiB to the array, and after those 40 TiB write performance was still about 1.7 GB/s, so still very good and what I would expect as drives fill up.
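For reference, a pool with that layout can be created in one command by listing both VDEVs. This is just a sketch: the device names below are placeholders (see the note about /dev/disk/by-id/ names further down), and zpool will refuse the mismatched VDEV widths without -f:

zpool create -f storage -o ashift=12 \
    raidz2 /dev/sd[abcdefghijklmnopqr] \
    raidz2 /dev/sd[stuvwx]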
So actually, based on these results, I have learned not to deviate from the ZFS best practices too much. Use ashift=12 and put drives in VDEVs that adhere to the 2^n + parity rule (the 18-drive RAIDZ2 VDEV is 16 data drives plus 2 parity drives, and the 6-drive VDEV is 4 plus 2).
The uneven VDEVs (18 disks vs. 6 disks) are not according to best practice, but ZFS is smart: it distributes data across the VDEVs in proportion to their size, so they fill up at the same rate.
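You can watch this proportional distribution yourself: zpool iostat -v reports the allocated and free space for each VDEV separately.

zpool iostat -v storage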
Choosing between ashift=9 and ashift=12 for 4K-sector drives is not always a clear-cut case. You have to choose between raw performance and storage capacity.
My test platform is Debian Wheezy with ZFS on Linux. I'm using a system with 24 x 4 TB drives in a RAIDZ3. The drives have a native sector size of 4K, and the array is first formatted with ashift=12.
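You can verify a drive's native sector size on Linux through sysfs (sda is just an example device):

cat /sys/block/sda/queue/physical_block_size   # prints 4096 for a 4K-native drive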
First we create the array like this:
zpool create storage -o ashift=12 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
Note: NEVER use /dev/sd? device names for a real array, this is just for testing; always use /dev/disk/by-id/ names.
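To check which sector size the pool actually got, you can ask zdb; this reads the cached pool configuration, so it assumes the pool is listed in /etc/zfs/zpool.cache. It should print an ashift: 12 line for the VDEV:

zdb | grep ashift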
Then we run a simple sequential transfer benchmark with dd:
root@nano:/storage# dd if=/dev/zero of=ashift12.bin bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 66.4922 s, 1.6 GB/s
root@nano:/storage# dd if=ashift12.bin of=/dev/null bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 42.0371 s, 2.5 GB/s
This is quite impressive. At these speeds, you can saturate 10 GbE. But how much storage space do we get?
This is what df -h reports:

Filesystem      Size  Used Avail Use% Mounted on
storage          69T  512K   69T   1% /storage

And zfs list:

NAME      USED  AVAIL  REFER  MOUNTPOINT
storage  1.66M  68.4T   435K  /storage
Only 68.4 TiB of storage? That's not good. There should be 24 drives minus 3 for parity, which is 21 x 3.6 TiB ≈ 75 TiB of storage.
So the performance is great, but somehow we lost about 6 TiB of storage, more than a whole drive's worth. Presumably this is RAIDZ allocation overhead: with 4K sectors, parity and padding consume a proportionally larger share of every block written, and ZFS subtracts that expected overhead from the capacity it reports.
So what happens if you create the same array with ashift=9?
zpool create storage -o ashift=9 raidz3 /dev/sd[abcdefghijklmnopqrstuvwx]
These are the benchmarks:
root@nano:/storage# dd if=/dev/zero of=ashift9.bin bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 97.4231 s, 1.1 GB/s
root@nano:/storage# dd if=ashift9.bin of=/dev/null bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 42.3805 s, 2.5 GB/s
So we lose about a third of our write performance (1.6 GB/s down to 1.1 GB/s), but read performance is not affected, probably thanks to read-ahead caching, although I'm not sure.
So with ashift=9 we do lose some write performance, but we can still saturate 10 GbE.
Now look what happens to the available storage capacity:
According to df -h:

Filesystem      Size  Used Avail Use% Mounted on
storage          74T   98G   74T   1% /storage

And according to zfs list:

NAME      USED  AVAIL  REFER  MOUNTPOINT
storage   271K  73.9T  89.8K  /storage
Now we have a capacity of 73.9 TiB, so we gained about 5.5 TiB with ashift=9 over ashift=12, at the cost of some write performance.
So if you really care about sequential write performance, ashift=12 is the better option. If storage capacity is more important, ashift=9 seems to be the best solution for 4K drives.
The performance of ashift=9 on 4K drives is often described as 'horrible', but I think it's best to run your own benchmarks and decide for yourself.
Caveat: I'm quite sure about the benchmark results. I'm not 100% sure how reliable the free space reported by df -h or zfs list is.
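One way to cross-check those numbers: zpool list reports the raw pool size including parity, while zfs list reports usable space after parity, so the difference between the two should roughly match the three parity drives:

zpool list storage
zfs list storage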
Edit: I have added a bit of my own opinion on the results.