Tuesday, December 11, 2018

VMware Changed Block Tracking (CBT) and the issue with incremental backups

One of my customers is experiencing a weird issue with a traditional enterprise backup solution (IBM TSM / Spectrum Protect in this particular case) that leverages the VMware vSphere Storage APIs - Data Protection (VADP/VDDK) for image-level backups of vSphere 6.5 virtual machines. They observed strange behavior in the size of incremental backups. The IBM TSM backup solution should do a full backup once and incremental backups forever. This is a great approach to save space on backup (secondary) storage. However, for some virtual machines, seemingly at random over time, my customer observed almost full backups instead of the expected incremental ones. This obviously has a very negative impact on the capacity of the backup storage system and also on backup windows.

The customer has vSphere 6.5 U2 (build 9298722) and IBM TSM VE 8.1.4.1. They observed the problem only on VMs whose VM hardware had been upgraded to version 13. The customer opened a support case with both VMware GSS and IBM Support.

IBM Support observed that the VADP/VDDK API function QueryChangedDiskAreas was failing with a TSM log message similar to this:

10/19/2018 12:04:26.230 [007260] [11900] : ..\..\common\vm\vmvisdk.cpp(2436): ANS9385W Error returned from VMware vStorage API for virtual machine 'VM-NAME' in vSphere API function __ns2__QueryChangedDiskAreas. RC=12, Detail message: SOAP 1.1 fault: "":ServerFaultCode[no subcode]
"Error caused by file /vmfs/volumes/583eb2d3-4345fd68-0c28-3464a9908b34/VM-NAME/VM-NAME.vmdk"

VMware Support (GSS) instructed my customer to either reset CBT (https://kb.vmware.com/kb/2139574) or disable and re-enable CBT (https://kb.vmware.com/kb/1031873) and then observe whether that solves the problem.

A few days after the CBT reset, the backup problem occurred again, so the reset was not a resolution.
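To see which VMs hit the error again after a CBT reset, the backup client log can be scanned for the warning quoted above. The following Python snippet is only a minimal sketch: it assumes the log lines look like the ANS9385W sample in this post, and the regular expression and the function name vms_with_cbt_errors are my own illustrative choices, not an IBM-provided tool.

```python
import re

# Matches the ANS9385W warning for a failing QueryChangedDiskAreas call and
# captures the VM name between the single quotes (format assumed from the
# sample log line quoted in this post).
CBT_ERROR = re.compile(
    r"ANS9385W .*? virtual machine '(?P<vm>[^']+)' "
    r"in vSphere API function \S*QueryChangedDiskAreas"
)


def vms_with_cbt_errors(log_lines):
    """Return a sorted list of VM names that logged a CBT query failure."""
    vms = set()
    for line in log_lines:
        match = CBT_ERROR.search(line)
        if match:
            vms.add(match.group("vm"))
    return sorted(vms)


# Shortened copy of the log line from this post, used as sample input.
SAMPLE = (
    r"10/19/2018 12:04:26.230 [007260] [11900] : ..\..\common\vm\vmvisdk.cpp(2436): "
    r"ANS9385W Error returned from VMware vStorage API for virtual machine 'VM-NAME' "
    r"in vSphere API function __ns2__QueryChangedDiskAreas. RC=12"
)

if __name__ == "__main__":
    print(vms_with_cbt_errors([SAMPLE, "some unrelated log line"]))
```

In practice the list of lines would come from the client's error log file; its location depends on the installation, so it is not hard-coded here.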

I did some research and found another KB: "CBT reports larger area of changed blocks than expected if guest OS performed unmap on a disk" (59608). We believe that this is the root cause, and the KB contains both a workaround and the final resolution.

The root cause mentioned in VMware KB 59608 ...
When an unmap is triggered in the guest, the OS issues UNMAP requests to underlying storage. However, the requested blocks include not only unmapped blocks but also unallocated blocks. And all those blocks are captured by CBT and considered as changed blocks then returned to backup software upon calling the vSphere API queryChangedDiskAreas(changeId).
Workaround recommended in KB ...
Disable unmap in guest VM.
For example, in MS Windows operating systems, UNMAP can be disabled with the command

fsutil behavior set DisableDeleteNotify 1

and re-enabled with the command

fsutil behavior set DisableDeleteNotify 0

Warning! Disabling UNMAP in the guest OS can have a tremendous negative impact on storage space reclamation, so fixing a space issue on secondary storage can cause a space issue on your primary storage. Check your specific design before making the final decision on how to work around this issue.

Anyway, the final problem resolution has to be done by the backup software vendor ...
If you have VDDK 6.7 or later libraries, take the intersection of VixDiskLib_QueryAllocatedBlocks() and queryChangedDiskAreas(changeId) to calculate the actually changed blocks.
In other words, the backup software should not rely on the API function QueryChangedDiskAreas alone, but should also call QueryAllocatedBlocks and back up only the disk blocks reported by both. Based on the VDDK 6.7 Release Notes, VDDK 6.7 can be leveraged even with vSphere 6.5 and 6.0. For more info, read the VDDK 6.7 Release Notes.
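The intersection mentioned in the KB can be illustrated with a few lines of Python. This is only a sketch of the idea, not how IBM or VMware implement it: it assumes the results of both API calls have already been fetched and converted into lists of (start, length) byte extents that do not overlap within each list. The function name intersect_extents and the tuple representation are my assumptions and are not part of VDDK.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start offset in bytes, length in bytes)


def intersect_extents(changed: List[Extent], allocated: List[Extent]) -> List[Extent]:
    """Return the parts of the changed extents that are also allocated."""
    changed = sorted(changed)
    allocated = sorted(allocated)
    result: List[Extent] = []
    i = j = 0
    while i < len(changed) and j < len(allocated):
        c_start, c_len = changed[i]
        a_start, a_len = allocated[j]
        c_end = c_start + c_len
        a_end = a_start + a_len
        # The overlap of the two current extents, if there is one.
        start = max(c_start, a_start)
        end = min(c_end, a_end)
        if start < end:
            result.append((start, end - start))
        # Advance the list whose current extent ends first.
        if c_end <= a_end:
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    # CBT reports two changed areas, but only parts of them are allocated.
    changed_areas = [(0, 100), (200, 300)]
    allocated_areas = [(50, 100), (400, 50)]
    print(intersect_extents(changed_areas, allocated_areas))
```

With this approach, blocks that CBT flags only because of an UNMAP on unallocated space drop out of the incremental backup, because they never appear in the allocated list.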

I believe the problem occurs only under the following conditions:
  • The virtual disk must be thin-provisioned.
  • The VM hardware version is 11 or later. Older VM hardware versions do not pass UNMAP SCSI commands through.
  • The guest operating system must be able to identify the virtual disk as thin and issue UNMAP SCSI commands down to the storage system.
Based on the conditions above, I personally believe that another workaround for this issue would be to stop using thin-provisioned virtual disks and convert them into thick virtual disks. As far as I know, UNMAP commands are not passed through the VM hardware for thick virtual disks, so they should not cause CBT issues.
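The conditions above can be written down as a simple check, for example to pre-filter an inventory export for VMs worth a closer look. This is a minimal Python sketch that only encodes my belief stated above; the DiskInfo class, its field names and the function name are hypothetical and do not come from any VMware or IBM API.

```python
from dataclasses import dataclass


@dataclass
class DiskInfo:
    """Hypothetical record describing one virtual disk of a VM."""
    vm_hardware_version: int
    thin_provisioned: bool
    guest_unmap_enabled: bool  # guest OS issues UNMAP for this disk


def may_hit_cbt_unmap_issue(disk: DiskInfo) -> bool:
    """True only if all three conditions from the list above hold."""
    return (
        disk.thin_provisioned
        and disk.vm_hardware_version >= 11
        and disk.guest_unmap_enabled
    )


if __name__ == "__main__":
    # Thin disk on VM hardware 13 with UNMAP enabled in the guest.
    print(may_hit_cbt_unmap_issue(DiskInfo(13, True, True)))
    # Same VM after converting the disk to thick.
    print(may_hit_cbt_unmap_issue(DiskInfo(13, False, True)))
```

Both workarounds discussed in this post map to flipping one of the inputs: disabling UNMAP in the guest clears guest_unmap_enabled, and converting the disk to thick clears thin_provisioned.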

My customer is not leveraging thin provisioning on the physical storage layer, so he is going to test the workaround recommended in KB 59608 (disabling UNMAP in the guest OSes) as a short-term solution and start investigating a long-term fix with IBM Spectrum Protect (aka TSM). It seems that IBM Spectrum Protect Data Mover 8.1.6 leverages VDDK 6.7.1, so an upgrade from the current version 8.1.4 to 8.1.6 could solve the issue.

2 comments:

Hema said...

Hi David ,

We are facing exactly the same issue with IBM DP version 8.1.7 and vSphere 6.7.
We tried to disable unmap, but the issue was still not fixed. Do we need a reboot for the change to take effect?

What points should we consider?

David Pasek said...

First things first, you should definitely open a support request with VMware GSS and IBM.

If you use the workaround to disable unmap within the guest OS (MS Windows) leveraging FSUTIL, I do not think a reboot is required, but test it yourself. I do not have IBM TSM in my lab, so I cannot help here. Here are details about FSUTIL:
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-R2-and-2012/cc785435(v=ws.11)

Did you try another workaround to convert (svMotion) virtual disk to thick? It should effectively disable UNMAP commands as well.

Anyway, the final recommendation should be done by VMware GSS and IBM.