Thursday, 18 August 2016

VDP Backup Fails: Failed To Download VM Metadata

Suhas Savkoor August 18, 2016 Home, Troubleshooting, VDP No comments

A virtual machine backup was continuously failing with the error: VDP: Miscellaneous error.

When you look at the backup job log at the location:

# cd /usr/local/avamarclient/var

2016-08-18T08:24:59.099+07:00 avvcbimage Warning <40722>: The VM 'Win2012R2/' could not be located on the datastore:[FreeNAS-iSCSI] Win2012R2/Win2012R2.vmx
2016-08-18T08:24:59.099+07:00 avvcbimage Warning <40649>: DataStore/File Metadata download failed.
2016-08-18T08:24:59.099+07:00 avvcbimage Warning <0000>: [IMG0016] The datastore from VMX '[FreeNAS-iSCSI] Win2012R2/Win2012R2.vmx' could not be fully inspected.
2016-08-18T08:24:59.128+07:00 avvcbimage Info <0000>: initial attempt at downloading VM file failed, now trying by re-creating the HREF, (CURL) /folder/Win2012R2%2fWin2012R2.vmx?dcPath=%252fDatacenter&dsName=FreeNAS-iSCSI, (LEGACY) /fold
er/Win2012R2%2FWin2012R2.vmx?dcPath=%252FDatacenter&dsName=FreeNAS%252DiSCSI.

2016-08-18T08:25:29.460+07:00 avvcbimage Info <40756>: Attempting to Download file:/folder/Win2012R2%2FWin2012R2.vmx?dcPath=%252FDatacenter&dsName=FreeNAS%252DiSCSI

2016-08-18T08:25:29.515+07:00 avvcbimage Warning <15995>: HTTP fault detected, Problem with returning page from HTTP, Msg:'SOAP 1.1 fault: SOAP-ENV:Server [no subcode]

"HTTP Error"

Detail: HTTP/1.1 404 Not Found

2016-08-18T08:25:59.645+07:00 avvcbimage Warning <40650>: Download VM .vmx file failed.

2016-08-18T08:25:59.645+07:00 avvcbimage Error <17821>: failed to download vm metadata, try later

2016-08-18T08:25:59.645+07:00 avvcbimage Info <9772>: Starting graceful (staged) termination, failed to download vm metadata (wrap-up stage)

2016-08-18T08:25:59.645+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation or pre/post snapshot script failed

2016-08-18T08:25:59.645+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation/pre-script/post-script failed

2016-08-18T08:25:59.645+07:00 avvcbimage Info <40654>: isExitOK()=157

Here we see that the backup is failing as it is unable to gather the vmx file details of the client it is trying to backup. It is unable to gather the vmx details is because it is unable to locate the client on the datastore.

This issue occurs when the virtual machine files or the folder itself are manually moved to a different folder.

If you look at the below screenshot, you see my virtual machine is under a folder called "do not power on"

Now, if you have a look the actual vmx file of this Windows2012R2 machine you see it is different.

From this, my virtual machine was initially under the FreeNAS-iSCSI / VM folder / VM Files and then this was moved to the "do not power on" folder. If you just move the virtual machine, the vmx file will not be updated with the new path. VDP during the backup will look at this path of the virtual machine and will fail because we do not have the vmx file in this location.

To fix this, you will have to remove the virtual machine from the inventory and re-add it back to the inventory. Doing this will update the vmx file location as seen below. Once the correct location is populated you can run the backups for this virtual machine.

To register VM to inventory you can follow this KB article here.

That's pretty much it!

Tuesday, 16 August 2016

VDP Backup Fails: PAX stream append at sector offset 128 length 128 sectors=65536 failed (32)

Suhas Savkoor August 16, 2016 Home, Troubleshooting, VDP No comments

So I have ran into this issue a couple of times now and was able to finally drill down to a solution, and so thought of sharing this as there is nothing about this issue anywhere on the web.

In the few scenarios where I came across this issue, some were due to expanding the VMDK causes the backup to fail, backup for VM with a disk greater than 2 TB fails and sometimes without any reason it fails continuously.

This failure occurs prominently when:

1) VMDK is larger than 2 TB.
2) VMDK is not an integral multiple of 1 MB.
3) VMDK has changes in the last extent of the VMDK

Let's see what time means a little down the line.

To start off, when you open the backup job log to see for a failure, you will see the following:
Location:

/usr/local/avamarclient/var

2016-07-25T20:50:48.091+04:00 avvcbimage Info <14700>: submitting pax container in-use block extents:
2016-07-25T20:50:48.221+04:00 avvcbimage Info <6688>: Process 9604 (/usr/local/avamarclient/bin/avtar) finished (code 176: fatal signal)
2016-07-25T20:50:48.221+04:00 avvcbimage Error <0000>: [IMG0010] PAX stream append ([Store_1] test/test.vmdk) at sector offset 128 length 128 sectors=65536 failed (32)
2016-07-25T20:50:48.221+04:00 avvcbimage Warning <6690>: CTL workorder "Test-backup-1469491200006" non-zero exit status 'code 176: fatal signal'
2016-07-25T20:50:48.221+04:00 avvcbimage Info <9772>: Starting graceful (staged) termination, PAX stream append failed (wrap-up stage)
2016-07-25T20:50:48.221+04:00 avvcbimage Info <40654>: isExitOK()=157
2016-07-25T20:50:48.221+04:00 avvcbimage Info <16022>: Cancel detected(miscellaneous error), isExitOK(0).
2016-07-25T20:50:48.221+04:00 avvcbimage Info <9746>: bytes submitted was 65536
2016-07-25T20:50:48.221+04:00 avvcbimage Error <0000>: [IMG0010] pax_container::endfile(virtdisk-flat.vmdk,65536) returned problem:32
2016-07-25T20:50:48.221+04:00 avvcbimage Error <0000>: [IMG0010] pax_container::enddir returned problem:32
2016-07-25T20:50:48.222+04:00 avvcbimage Info <40654>: isExitOK()=157
2016-07-25T20:50:48.222+04:00 avvcbimage Info <16022>: Cancel detected(miscellaneous error), isExitOK(0).
2016-07-25T20:50:48.222+04:00 avvcbimage Info <12032>: backup (Store_1] test/test.vmdk) terminated 3145728.00 MB
2016-07-25T20:50:48.222+04:00 avvcbimage Info <40654>: isExitOK()=157
2016-07-25T20:50:48.222+04:00 avvcbimage Error <0000>: [IMG1003] Backup of [Store_1] test/test.vmdk failed
2016-07-25T20:50:48.222+04:00 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_Close: Close disk.

The avtar.log for this backup job in the same directory has the following back trace.

2016-08-08T20:42:17.180+04:00 avtar FATAL <5889>: Fatal signal 11 in pid 28769
> 2016-08-08T20:42:17.186+04:00 [avtar] FATAL ERROR: <0001> uapp::handlefatal: Fatal signal 11
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000b3abe1
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000b3b957
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000b3bc7b
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000b3bd6e
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000a86530
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00007f9388482850
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000ac7364
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006762ef
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000678363
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006c8db8
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006cc683
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006cd083
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000009a5b76
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000009ad0ed
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000612bce
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006c1492
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000006c36b6
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000004cfcfb
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000004d6d4e
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000004d82e0
> 2016-08-08T20:42:17.186+04:00 [avtar] | 0000000000a88129
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00000000004ea627
> 2016-08-08T20:42:17.186+04:00 [avtar] | 00007f93864b1c36
> 2016-08-08T20:42:17.186+04:00 [avtar] | 000000000048a029
> 2016-08-08T20:42:17.186+04:00 [avtar] ERROR: <0001> uapp::handlefatal: aborting program pid=28769, sig=11
> 2016-08-08T20:42:17.186+04:00 avtar FATAL <5890>: handlefatal: Aborting program with code 176, pid=28769, sig=11

So I had one disk for this VM which was expanded beyond 2 TB and when converted into MB it was 3397386.24 MB (3.24 TB). In plain words, the format in MB is fractional and not a whole number and hence it is not an integral multiple of 1.
When your VDP sees the disk as a fractional value and beyond 2 TB, the backup fails. This is a known issue in the avTar version running on VDP.

To find your current avTar version you can run the following;

> This command from root of the VDP SSH

status.dpn

> These two commands from admin mode only of VDP SSH

gsan --version

mcserver.sh --version

A VDP 6.1 has an avTar version of 7.2 with a varying minor build.

How do we resolve this?

Step 1: For a quick backup to make sure everything is saved for this VM. Not recommended as a permanent solution.

Disable CBT on the VM, so that incremental backups are disabled for this VM and all backups are always full backup. If you do not have a restore point for a long time and would badly need one, then you can implement this. Else, if you have a good known restore point, then this is not required as it unnecessarily consumes space. To disable CBT you can follow this VMware KB article here

Step 2: Recommended. Make sure a good backup is available before this.

Here, you will have to locate the sizes of all your disks for that VM and convert them to MB. If the converted value in MB is a whole number, then that disk is good to go.

If not, like in my case, where the drive was 3397386.24 MB, then we will have to round off this drive to the next largest whole number. Here it would be 3397387 MB. You can do this from the edit settings of the virtual machine. There is no need of expanding the drive from within the guest. This is done on the VMDK level only so that the VDP can look at this drive as an integral multiple of 1.

Post this, you can have incremental backups running.

This is acknowledged as a known issue and the fix is in the future release of VDP. Release version and ETA are unknown currently.

If you have questions before implementation, let me know.

Thanks!

Wednesday, 3 August 2016

VDP Automatic Backup Verification Fails With VMware Tools Heartbeat Timeout

Suhas Savkoor August 03, 2016 Home, Troubleshooting, VDP No comments

So recently I was working on a colleague's case where the backup verification, both manual and automatic were failing with VMware tools timeout. Now, there were multiple restore points being verified here and only one of them was failing. The VDP admin guide for verification states the below:

"Set the heartbeat timeout interval to its optimal value, depending on the environment. Note that some VMs may take longer to send and receive the VMware Tools heartbeat than others"

On the verification job logs, the following was observed. " >> " Marks the explanation of the event:

2016-07-14T21:52:10.541-02:00 avvcbimage Info <19672>: vmAction powerOnVM()
>> The VM power On function is called

2016-07-14T21:52:10.549-02:00 avvcbimage Info <17787>: Modifying E1000 adapter: Network adapter 1 to not Connect at Power On
>> The network adapter is disconnected on power on to avoid IP conflict

2016-07-14T21:52:17.665-02:00 avvcbimage Info <14632>: VM '[virtual_machines1] VDP_VERIFICATION_camserver_1468522800276/VDP_VERIFICATION_camserver_1468522800276.vmx' Power On task completed, moref=
>> VM is powered ON

2016-07-14T21:52:17.665-02:00 avvcbimage Info <19674>: vmAction waitforToolRunning()
>> Waiting for heartbeat verification

2016-07-14T21:57:22.177-02:00 avvcbimage Warning <19675>: wait for toolrunning is timed out or cancelled
>> Verification timed out

2016-07-14T21:57:22.178-02:00 avvcbimage Info <19685>: vmAction poweroffVM()
>> power off VM function is called

2016-07-14T21:57:25.225-02:00 avvcbimage Info <14632>: VM '[virtual_machines1] VDP_VERIFICATION_camserver_1468522800276/VDP_VERIFICATION_camserver_1468522800276.vmx' Power Off task completed, moref=
>> power off completed

2016-07-14T21:57:25.225-02:00 avvcbimage Info <19686>: vmAction deleteVM()
>> delete VM function is called

2016-07-14T21:57:28.295-02:00 avvcbimage Info <14632>: VM '[virtual_machines1] VDP_VERIFICATION_camserver_1468522800276/VDP_VERIFICATION_camserver_1468522800276.vmx' deletion task completed, moref=
>> VM is deleted successfully

2016-07-14T21:57:28.311-02:00 avvcbimage Error <19702>: ABV failed
>> Automatic backup verification failed.

So, there is a parameter that we need to add in the avvcbimageAll.cmd file to increase the heartbeat wait timeout for the appliance.

The steps would be:

1. Open a SSH to the VDP appliance
2. Browse to the following directory

# usr/local/avamarclient/var/avvcbimageAll.cmd

3. Open this file in a vi editor and add the following parameter:

--validate_script_max_in_min=20

4. Save the file and restart the avagent daemon using the below command:

service avagent-vmware restart

Post this, the VDP appliance will wait for 20 minutes for VM tools heartbeat to be received before it could time out.
If the tools validation complete before 20 minutes, then it will proceed to the next steps. So setting the timeout to a large value will not have any performance impact on the VMs which are not facing the VM tools timeout during verification.

Simple, yes!

Friday, 22 July 2016

Creating A Local User And Granting Shell Access In ESXi 6.0

Suhas Savkoor July 22, 2016 Home, How To, Troubleshooting No comments

So, in ESXi 6.0 onward, if you login to ESXi directly from vSphere client you do not have the option to specify Shell Access when you are creating a local user. The screen that you will see when creating a new user here is:

If you create this user and login to the Putty, you get the message saying Access denied.

The access.conf file should be updated automatically once the users are created and since it does not, perhaps due to security enhancements, there is a need of little tweaking that needs to be done.

Note: Please test this in your lab before you implement this in production. All the steps were implemented in a non production environment.

What you need to do is:

1. Create the user locally from the above wizard

2. Login to SSH for that ESXi host

3. Change the directory to:

# cd /etc/security

4. You will have a file called access.conf file. (Backup the file before editing) Open this file with a vi editor.

# vi access.conf

The contents look like below:

5. You need to add your user here in the format

+:<username>:ALL

6. Save the file

7. Restart the SSH session.

8. Now you can login to your ESXi host with the local user.

This user has shell access but not the root access. If I run any command to list the details of the devices connected to this host it displays the following:

Well that's pretty much it.

Friday, 15 July 2016

Web Client Login Page Displays vRA or vCAC As The Banner Name

Suhas Savkoor July 15, 2016 Home, How To, Troubleshooting No comments

So in 6.0, you should have VMware vCenter Single Sign On as the Web Client banner page login name. However, if you apply the Branding Name for the vRA appliance the web client banner display gets renamed. In some cases, the change occurs even when the branding is not selected. This is a bug in vRA and is discussed in this link here.

Now I am not a vRA guy and that is left to the handful in my organization. However, the case where I worked on was something similar:

Customer had 5.5 vCenter with vCAC installed. The vCAC was discarded and the vCenter was upgraded to 6.0. When he logged in to web client, instead of the Single Sign On banner it displayed vCloud Automation Center. This was indeed confusing and needed a fix.

Upon installation of vRA/vCAC the parameter vmwSTSBrandName gets populated with the banner image information for the vRA. Upon removal of this product this parameter is not cleared leading to the issue. This had to be removed from the vmdir for all the tenants present under the Identity Manager.

Before you perform the workaround, please have a snapshot and/or a backup of the vCenter machine.

1. Download Jxplorer by clicking this link here
2. Login to the PSC machine from Jxplorer using this link here
3. Expand Services > Identity Manager > Tenants
4. Click the tenant and switch to Table Editor View
5. In the table editor view, locate the filed called vmwSTSBrandName. This will be populated with a value as displayed in the below screenshot

4. Right click this attribute and select Delete.
5. Click Submit
6. If this is a Windows Server go to services.msc and restart the VMware STS Service. If it requests to restart the dependent services, click Ok.
7. If it is an appliance restart the STS identity manager service using the below commad:

# service vmware-sts-idmd restart

8. Reload the web client page. Once the vmwSTSBrandName attribute is empty it will display VMware vCenter Single Sign On by default.

Friday, 8 July 2016

Automatic/Manual Backup Verification Fails In VDP

Suhas Savkoor July 08, 2016 Home, Troubleshooting, VDP No comments

To check the consistency of the restore points you have backup verification jobs. These verification jobs can be either Automatic (ABV) or Manual backup verification. The backup verification flow on a high level basis goes as:

>> Restore: Restores the restore point as a temporary VM on the ESXi host and datastore which is defined on the backup verification job

>> Power On: Powers On the VM.

>> Heartbeat Verification: Verifies the heartbeat for the restored virtual machine

>> Power Off: Powers Off the VM once the verification is done

>> Delete VM: Remove the temporary restored VM from the inventory and delete from disk.

The issue I am going to be discussing here is not a general issue, and this caused due to a very specific cause. However, the troubleshooting steps can be used and you might have similar causes due to which you will run into verification jobs to fail.

All the verification job logs are present under the following directory:

/usr/local/avamarclient/var/

The verification job that I had was something as:

xyz-backup-verify-1467724890971-c2857d179f4b9e67465bf496709d8bc1f43149ef-1016-vmimagel.log

I created it as xyz because I have a VM named xyz, hence the temporary restored VM would have a name similar VDP_VERIFICATION_xyz

So in the verification job the initial logging refers to the following:

>> Which vCenter this VM is going to be restored for verification

>> The ESXi host

>> The VMFS/NFS datastore

Their logging would be in the start of the verification log and would look something as:


2016-07-05T09:21:32.956+04:00 avvcbimage Info <16010>: vCenter 'ABC.vcloud.local' is 192.168.1.1

2016-07-05T09:21:32.956+04:00 avvcbimage Info <11981>: VM's host is ESXi.vcloud.local

2016-07-05T09:21:32.956+04:00 avvcbimage Info <11982>: VM's primary storage location is [Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx

So, to the error. To the very end of the verification log is the final moments of the process in my case were the errors:

2016-07-05T09:35:00.128+04:00 avvcbimage Info <19670>: vmAction runRemote()

2016-07-05T09:35:00.177+04:00 avvcbimage Info <19672>: vmAction powerOnVM()

2016-07-05T09:35:00.187+04:00 avvcbimage Info <17789>: Modifying VmxNet3 adapter: Network adapter 1 to not Connect at Power On

2016-07-05T09:35:00.213+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' setNics config at PowerOn task still in progress, sleep for 3 sec

2016-07-05T09:35:03.243+04:00 avvcbimage Info <14632>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' setNics config at PowerOn task completed, moref=

2016-07-05T09:35:03.266+04:00 avvcbimage Info <14629>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power On task queued, sleep for 1 sec

2016-07-05T09:35:04.287+04:00 avvcbimage Error <16006>: vSphere Task failed: 'The operation is not allowed in the current state.'.

2016-07-05T09:35:04.287+04:00 avvcbimage Error <14635>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power On task creation encountered a problem

2016-07-05T09:35:04.287+04:00 avvcbimage Warning <19673>: PowerOnVM failed or cancelled

2016-07-05T09:35:04.287+04:00 avvcbimage Info <19684>: vmAction cleanupVM() DeletingVM=0

2016-07-05T09:35:04.287+04:00 avvcbimage Info <19685>: vmAction poweroffVM()

2016-07-05T09:35:04.311+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power Off task still in progress, sleep for 3 sec

2016-07-05T09:35:07.345+04:00 avvcbimage Error <16006>: vSphere Task failed: 'The attempted operation cannot be performed in the current state (Powered off).'.

2016-07-05T09:35:07.345+04:00 avvcbimage Error <14635>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power Off task creation encountered a problem

2016-07-05T09:35:07.345+04:00 avvcbimage Info <19686>: vmAction deleteVM()

2016-07-05T09:35:07.387+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' deletion task still in progress, sleep for 3 sec

2016-07-05T09:35:10.416+04:00 avvcbimage Info <14632>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' deletion task completed, moref=

2016-07-05T09:35:10.416+04:00 avvcbimage Info <9772>: Starting graceful (staged) termination, ABV failed (wrap-up stage)

2016-07-05T09:35:10.416+04:00 avvcbimage Error <19702>: ABV failed

2016-07-05T09:35:10.419+04:00 avvcbimage Info <16038>: Final summary, cancelled/aborted 0, snapview 0, exitcode 170: completed with errors, client log should be examined

So here, the restore was done successfully and the network adapter is always disconnected for the verification VM to avoid IP conflict.

Then there was several tries done to Power On the virtual machine and all of them failed. Since the Power On was not completed the Power off failed as well.

The step to verify the heartbeat is excluded since the virtual machine was not powered On which led to the final state, delete the VM which was completed successfully.

That's pretty much it in the verification logs. This was not sufficient to find a cause, which led me to implement the next couple of tests:

1. For this verification job, I changed the destination host and datastore. Basically, I am doing the restore on a different host and a different datastore and it went through successfully. So something was either wrong with the host or the datastore.

2. So I changed the datastore location to the old path and the host still on the new one. The verification job completed successfully again. And when edited the job back to the old host, it failed with the same error.

So something is going on with this host! So we need to troubleshoot on the host level.

From the vobd.log during this time, I saw the following:

2016-07-06T14:01:49.787Z: [UserWorldCorrelator] 3011315291947us: [vob.uw.core.dumped] /bin/hostd(2038251) /var/core/hostd-zdump.003

2016-07-06T14:04:26.406Z: [UserWorldCorrelator] 3011471909126us: [vob.uw.core.dumped] /bin/hostd(2043706) /var/core/hostd-zdump.000

2016-07-06T14:08:35.096Z: [UserWorldCorrelator] 3011720596785us: [vob.uw.core.dumped] /bin/hostd(2099166) /var/core/hostd-zdump.001

2016-07-06T14:11:12.313Z: [UserWorldCorrelator] 3011877811665us: [vob.uw.core.dumped] /bin/hostd(2849623) /var/core/hostd-zdump.002

2016-07-06T14:13:48.795Z: [UserWorldCorrelator] 3012034293301us: [vob.uw.core.dumped] /bin/hostd(2040079) /var/core/hostd-zdump.003

So here the hostd daemon on the host has crashed and a zdump is created. So the hostd and hostd-worker threads were in a crash state and hence I rebooted the host.

After a reboot the hostd and worker thread were not in an inconsistent state anymore allowing me to perform verification tasks without any issues on this host.

There can be multiple causes for backup verification failure. Well this is one of them!

Friday, 1 July 2016

Perform A VDP Backup When The Datastore For Client VM Is Running On Low Space

Suhas Savkoor July 01, 2016 Home, How To, VDP No comments

If you try to backup a virtual machine which is residing on a datastore running low on space, the backup fails. The VDP will try to issue a snapshot take call, however, the task is not initiated for either Create Snapshot or VDP: Backup. Instead you see the following in the reports tab.

VDP: Failed to create snapshot

The backup job log located at /usr/local/avamarclient/var will have the following logging:

2016-06-30T09:59:11.371+07:00 avvcbimage Info <19704>: DataStore Storage Info:Local-esxi02 capacity=4831838208 free=138412032

2016-06-30T09:59:11.372+07:00 avvcbimage Info <19716>: DS Capacity=4831838208 FreeSpace=138412032 / HD committed=3951319082 unCommitted=1680 unShared=3758096384

2016-06-30T09:59:11.372+07:00 avvcbimage Info <19717>: DS(Local-esxi02) does not have enough free space (138412032 ) for disks used (197565952).

2016-06-30T09:59:11.372+07:00 avvcbimage Error <19661>: Datastore does not have enough free space for snapshot

2016-06-30T09:59:11.372+07:00 avvcbimage Info <9772>: Starting graceful (staged) termination, failed to create snapshot (wrap-up stage)

2016-06-30T09:59:11.372+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation or pre/post snapshot script failed

2016-06-30T09:59:11.372+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation/pre-script/post-script failed

2016-06-30T09:59:11.372+07:00 avvcbimage Info <40654>: isExitOK()=202

2016-06-30T09:59:11.372+07:00 avvcbimage Info <40659>: snapshot created:false NOMC:false ChangeBlTrackingAvail:true UsingChBl:true, ExitOK:false, cancelled:false, fatal: true

2016-06-30T09:59:11.372+07:00 avvcbimage Info <40654>: isExitOK()=202

2016-06-30T09:59:11.372+07:00 avvcbimage Info <40660>: vcbimage_progress::terminate

2016-06-30T09:59:11.373+07:00 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_EndAccess: Disk access completed.

Now there is a parameter that can be added to avvcb daemon to ignore the free space on the datastore and still proceed with the backup operation. This workaround is tricky, why because let's consider the following situation:

You have a virtual machine of about 500 GB on a datastore of 550 GB. So the free space on the datastore is 50 GB. Let's say we added the parameter to ignore this and still take a backup, and the new data that has come into the VM has reached the 50 GB free space limit, then the VM will straight up stop to function because it has no space to get the new data. At this point of time, you will end up expanding the datastore so that the VM can be powered on.

This is why, it is always recommended you expand the datastore then perform a backup. Now, there are certain cases where that 500 GB VM is a file server where no new data is coming in, In that case, we are good to add this parameter and run the backup.

How do we do this?

1. SSH into your VDP appliance and change to the following directory:

# cd /usr/local/avamarclient/var

2. You will have a file called as avvcbimageAll.cmd

3. Open this file using a vi editor and edit the file to add the following parameter:

--snapshot_max_change_percent=0

4. Restart the avagent daemon using the below command

service avagent-vmware restart

5. Now you should be able to run the backup job even when your datastore is running low on space.

Note:

If you are using internal proxy, then this step will be done on the VDP appliance itself as the avvcbimage lies on the appliance itself. If you are using external proxy, then you will have to edit the avvcbimageAll.cmd file on the proxy machine, as the proxy VM would be responsible for opening/closing VMDK as this machine runs the avvcb daemon. Also, if you have multiple proxy VMs, then add this parameter to all your external proxy machines and restart the avagent service on all of them.

This is supported on 6.1 VDP as well.

virtuallyPeculiar