Tracking Down a Faulty Storage Array Controller With ZFS

Thu 15 December 2016 Category: Storage

One day, I lost two virtual machines in our DR environment after a storage vMotion.

Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.

I could easily restore the affected virtual machines from backup and, once that was done, I continued my investigation.

I needed a way to quickly verify whether a virtual machine's virtual hard drive was corrupted after a storage vMotion, so I could understand the pattern.

First, I created a Linux-based virtual machine and installed ZFS. Then I attached a second disk of about 50 gigabytes and formatted it with ZFS. Once I had filled the drive to about 40 gigabytes using 'dd', I was ready to test.
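The setup boils down to two commands. This is a sketch: the pool name 'tank' and the device name '/dev/sdb' are assumptions, so adjust them to your environment.

```shell
# Create a ZFS pool on the second (~50 GB) disk; 'tank' and /dev/sdb are example names
zpool create tank /dev/sdb

# Fill the pool to roughly 40 GB so a later scrub has plenty of blocks to verify
dd if=/dev/urandom of=/tank/testdata.bin bs=1M count=40960 status=progress
```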

ZFS was chosen for testing because it stores checksums of all blocks of data. This makes it very simple to quickly detect any data corruption: if the stored checksum doesn't match the checksum computed from the data just read, you have detected corruption.

Most other file systems don't store checksums and don't check for data corruption; they simply trust the storage layer. It may take a while before you find out that your data is corrupted.
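The principle ZFS applies per block can be illustrated with ordinary shell tools: store a checksum, silently corrupt one byte, and re-verify. The file paths here are arbitrary examples.

```shell
# Write some data and record its checksum (ZFS does this per block, automatically)
printf 'important data' > /tmp/blockfile
sha256sum /tmp/blockfile > /tmp/blockfile.sum

# Overwrite a single byte in place, simulating silent corruption by the storage layer
printf 'X' | dd of=/tmp/blockfile bs=1 count=1 conv=notrunc 2>/dev/null

# The verification now fails, exposing the corruption immediately
if sha256sum -c /tmp/blockfile.sum >/dev/null 2>&1; then
  echo "data intact"
else
  echo "corruption detected"
fi
```

A file system without checksums would happily return the corrupted bytes and report success.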

I performed a storage vMotion of this secondary disk to different datastores and then ran a 'zpool scrub' to track down any corruption. This worked better than expected: the scrub command would hang if the drive had been corrupted by the storage vMotion. The test virtual machine then required a reboot, and the secondary hard drive had to be reformatted with ZFS, as the previous file system, including its data, had been corrupted.
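The check after each storage vMotion amounts to this (a sketch, assuming the test pool is named 'tank'):

```shell
# Start a scrub: ZFS re-reads every block and verifies it against its stored checksum
zpool scrub tank

# Inspect the result; CKSUM errors (or, in this case, a hang) indicate corruption
zpool status -v tank
```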

After performing storage vMotions on the drive in different directions, from various datastores to others, a pattern slowly emerged.

  1. Storage vMotion corruption happened independently of the VMware ESXi host used.

  2. A storage vMotion never caused any issues when the disk resided on our production storage array.

  3. The corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.

Now it got really 'interesting'. Our DR storage array has two separate storage controllers running in active-active mode, but each LUN is always owned by one particular controller. Although the other controller can take over from the controller that 'owns' a LUN in case of a failure, the owner processes all I/O when everything is fine. Particular LUNs are thus handled by a particular controller.

So first I made a table listing each controller and the LUNs it owned, like this:

Controller a    Controller b
------------    ------------
LUN001          LUN002
LUN003          LUN004
LUN005          LUN006

Then I started to perform storage vMotions of the ZFS disk from one LUN to another. After performing several tests, the pattern became quite obvious.

            LUN001  ->  LUN002  =   BAD
            LUN001  ->  LUN004  =   BAD
            LUN004  ->  LUN003  =   BAD
            LUN003  ->  LUN005  =   GOOD
            LUN005  ->  LUN001  =   GOOD

I continued to test some additional permutations, but it became clear that only LUNs owned by controller b caused problems.

With the evidence in hand, I managed to convince our vendor support to replace storage controller b and that indeed resolved the problem. Data corruption due to a Storage vMotion never occurred after the controller was replaced.

There is no need to name and shame the vendor here. The point is that all equipment can fail, and what can happen will happen. What really counts is: are you prepared?