One day, I lost two virtual machines in our DR environment after a storage vMotion.
Further investigation uncovered that any storage vMotion of a virtual machine residing on our DR storage array would corrupt the virtual machine's disks.
I could easily restore the affected virtual machines from backup and, once that was done, I continued my investigation.
To understand the pattern, I needed a way to quickly verify whether a virtual machine's virtual hard drive was corrupted after a storage vMotion.
First, I created a Linux virtual machine and installed ZFS. Then, I attached a second disk of about 50 gigabytes and formatted this drive with ZFS. Once I had filled the drive to about 40 gigabytes using 'dd', I was ready to test.
I chose ZFS for testing because it stores hashes of all blocks of data. This makes it very simple to quickly detect any data corruption. If the stored hash doesn't match the hash computed from the data as it is read back, you have just detected corruption.
Most other file systems don't store hashes of the data and don't check for corruption, so they just trust the storage layer. It may take a while before you find out that data is corrupted.
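The idea of storing a hash per block and comparing it on read can be shown with a small Python sketch. This only illustrates the principle and is not how ZFS lays out data on disk: the 4096-byte block size, the use of SHA-256 and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a ZFS setting


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(
    data: bytes, stored: list[str], block_size: int = BLOCK_SIZE
) -> list[int]:
    """Return the indexes of blocks whose digest no longer matches the stored one."""
    current = checksum_blocks(data, block_size)
    return [i for i, (old, new) in enumerate(zip(stored, current)) if old != new]


# 16 KiB of sample data, which is four blocks of 4096 bytes.
original = bytes(range(256)) * 64
stored_checksums = checksum_blocks(original)

# Simulate silent corruption: flip one byte that lives in the second block.
damaged = bytearray(original)
damaged[5000] ^= 0xFF

corrupt = find_corrupt_blocks(bytes(damaged), stored_checksums)
print(corrupt)  # [1]
```

A file system without such checksums has nothing to compare against, so the flipped byte would simply be returned to the application as if it were valid data.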
I performed a storage vMotion of this secondary disk to different datastores and then ran a 'zpool scrub' to track down any corruption. This worked better than expected: the scrub would hang if the drive had been corrupted by the storage vMotion. Recovering required a full reboot of the virtual machine, a reformat of the affected drive and another 'dd' run to fill it up to 40 gigabytes for the next test.
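Because a hanging scrub was the tell-tale sign, a check like this could be wrapped in a timeout so that each test run ends with a clear verdict. The following is a minimal sketch and not the procedure I actually used: the pool name 'testpool', the one-hour timeout and the helper name are hypothetical, and the '-w' (wait) flag of 'zpool scrub' assumes OpenZFS 2.0 or newer.

```python
import subprocess

# Hypothetical command: scrub the pool 'testpool' and wait for completion.
# The -w flag is assumed to be available (OpenZFS 2.0 or newer).
SCRUB_CMD = ["zpool", "scrub", "-w", "testpool"]


def run_check(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as 'ok', 'failed' or 'hung'."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command did not finish in time; treat it as a hang.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

It would be called as run_check(SCRUB_CMD, timeout_s=3600). An 'ok' result only means the scrub command finished; any checksum errors it found would still have to be read from 'zpool status'. A process that is truly stuck on I/O may also resist being killed, so the reboot described above may still be necessary.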
After performing storage vMotions of the drive in different directions, from various datastores to others, a pattern slowly emerged:
Storage vMotion corruption happened independently of the VMware ESXi host used.
A Storage vMotion never caused any issues when the disk resided on our production storage array.
The corruption only happened when the virtual machine was stored on particular datastores on our DR storage array.
Now it got really 'interesting'. The thing is that our DR storage array has two separate storage controllers running in active-active mode. However, each LUN is always owned by a particular controller. Although the other controller can take over from the controller that 'owns' the LUNs in case of a failure, the owner will process the I/O when everything is fine. Particular LUNs are thus handled by a particular controller.
So first I made a table listing each controller and the LUNs it had ownership over, like this:
Owner controller    a         b
                    LUN001    LUN002
                    LUN003    LUN004
                    LUN005    LUN006
Then I started to perform Storage vMotions of the ZFS disk from one LUN to another. After performing several tests, the pattern became quite obvious.
LUN001 -> LUN002 = BAD
LUN001 -> LUN004 = BAD
LUN004 -> LUN003 = BAD
LUN003 -> LUN005 = GOOD
LUN005 -> LUN001 = GOOD
I continued to test some additional permutations, but it became clear that only Storage vMotions involving a LUN owned by controller b, whether as source or as destination, caused problems.
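The reasoning can be checked mechanically. The sketch below is a small Python illustration, written after the fact. It assumes the ownership table is read row by row, so controller a owns LUN001, LUN003 and LUN005, and controller b owns LUN002, LUN004 and LUN006. It then looks for a controller that was involved in every bad migration and in none of the good ones. The function and variable names are made up for this example.

```python
# Assumed ownership: odd-numbered LUNs on controller a, even-numbered on b.
OWNER = {
    "LUN001": "a", "LUN002": "b",
    "LUN003": "a", "LUN004": "b",
    "LUN005": "a", "LUN006": "b",
}

# (source LUN, destination LUN, True if the disk was still good afterwards)
RESULTS = [
    ("LUN001", "LUN002", False),
    ("LUN001", "LUN004", False),
    ("LUN004", "LUN003", False),
    ("LUN003", "LUN005", True),
    ("LUN005", "LUN001", True),
]


def suspect_controllers(owner: dict[str, str], results: list[tuple[str, str, bool]]) -> list[str]:
    """Return controllers involved in every bad migration and in no good one."""
    suspects = []
    for controller in sorted(set(owner.values())):
        explains_all = all(
            (controller in (owner[src], owner[dst])) == (not good)
            for src, dst, good in results
        )
        if explains_all:
            suspects.append(controller)
    return suspects


suspects = suspect_controllers(OWNER, RESULTS)
print(suspects)  # ['b']
```

Controller a is ruled out because it handled both ends of the two good migrations, while every bad migration had controller b on at least one end.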
With the evidence in hand, I managed to convince our vendor support to replace storage controller b and that indeed resolved the problem. Data corruption due to a Storage vMotion never occurred after the controller was replaced.
There is no need to name/shame the vendor in this regard although some of you may recognise the equipment from the way I describe it. The thing is that all equipment can fail and what can happen will happen. What really counts is: are you prepared?