Friday, 1 July 2016

Perform A VDP Backup When The Datastore For The Client VM Is Running Low On Space

If you try to back up a virtual machine residing on a datastore that is running low on space, the backup fails. VDP tries to issue the snapshot call, but the task is never initiated, so you will not see a Create Snapshot or VDP: Backup task. Instead, you see the following in the Reports tab.

VDP: Failed to create snapshot

The backup job log located at /usr/local/avamarclient/var will have the following logging:

2016-06-30T09:59:11.371+07:00 avvcbimage Info <19704>: DataStore Storage Info:Local-esxi02 capacity=4831838208     free=138412032
2016-06-30T09:59:11.372+07:00 avvcbimage Info <19716>: DS Capacity=4831838208     FreeSpace=138412032      / HD committed=3951319082     unCommitted=1680           unShared=3758096384
2016-06-30T09:59:11.372+07:00 avvcbimage Info <19717>: DS(Local-esxi02) does not have enough free space (138412032     ) for disks used (197565952).
2016-06-30T09:59:11.372+07:00 avvcbimage Error <19661>: Datastore does not have enough free space for snapshot
2016-06-30T09:59:11.372+07:00 avvcbimage Info <9772>: Starting graceful (staged) termination, failed to create snapshot (wrap-up stage)
2016-06-30T09:59:11.372+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation  or pre/post snapshot script failed
2016-06-30T09:59:11.372+07:00 avvcbimage Error <0000>: [IMG0009] createSnapshot: snapshot creation/pre-script/post-script failed
2016-06-30T09:59:11.372+07:00 avvcbimage Info <40654>: isExitOK()=202
2016-06-30T09:59:11.372+07:00 avvcbimage Info <40659>: snapshot created:false NOMC:false ChangeBlTrackingAvail:true UsingChBl:true, ExitOK:false, cancelled:false, fatal: true
2016-06-30T09:59:11.372+07:00 avvcbimage Info <40654>: isExitOK()=202
2016-06-30T09:59:11.372+07:00 avvcbimage Info <40660>: vcbimage_progress::terminate
2016-06-30T09:59:11.373+07:00 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_EndAccess: Disk access completed.

In this example, the <19717> line shows the datastore has only 138412032 bytes (about 132 MB) free while the disks need 197565952 bytes (about 188 MB) for the snapshot, so the backup aborts before a snapshot is ever taken.

There is a parameter that can be added to the avvcbimage command file to make it ignore the free space on the datastore and proceed with the backup anyway. This workaround is tricky. Why? Consider the following situation:

Say you have a virtual machine of about 500 GB on a 550 GB datastore, so the datastore has 50 GB free. If we add the parameter to ignore this and take the backup anyway, and the new data coming into the VM (plus the snapshot delta) eats through that 50 GB, the VM will simply stop running because there is no space left for new writes. At that point you will end up expanding the datastore anyway just to get the VM powered back on.

This is why it is always recommended to expand the datastore first and then perform the backup. That said, there are cases where that 500 GB VM is, say, a file server with no new data coming in. In that case, we are fine to add this parameter and run the backup.

How do we do this?

1. SSH into your VDP appliance and change to the following directory:
# cd /usr/local/avamarclient/var
2. In this directory you will find a file called avvcbimageAll.cmd
3. Open this file in a vi editor and add the following parameter:
--snapshot_max_change_percent=0
4. Restart the avagent daemon using the below command:
# service avagent-vmware restart
5. You should now be able to run the backup job even when the datastore is running low on space. A quick sketch of these edits follows.
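
If you prefer making the change from the shell in one pass, the below is a minimal sketch of the same steps. It assumes the default file location mentioned above, and the .bak filename is just my own choice for keeping a copy of the original:

# cd /usr/local/avamarclient/var
# cp avvcbimageAll.cmd avvcbimageAll.cmd.bak
# echo "--snapshot_max_change_percent=0" >> avvcbimageAll.cmd
# grep snapshot_max_change_percent avvcbimageAll.cmd
# service avagent-vmware restart

The cp keeps a backup of the original file, the echo appends the flag on its own line, the grep confirms it is in place, and the restart makes avagent pick it up.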

Note:
If you are using the internal proxy, this change is done on the VDP appliance itself, since avvcbimage runs on the appliance. If you are using an external proxy, you will have to edit the avvcbimageAll.cmd file on the proxy machine instead, as the proxy VM is the one that opens and closes the VMDKs and runs the avvcbimage process. Also, if you have multiple external proxy VMs, add this parameter on all of them and restart the avagent service on each.

This is applicable to VDP 6.1 as well.

Wednesday, 29 June 2016

VDP Backup Fails: Error In Creating Quiesce Snapshot


There are many reasons a virtual machine can fail to quiesce while taking a snapshot; this is one of them that I came across while working on an issue.

A brief summary of what was happening here: a virtual machine being backed up by VDP kept failing whenever VDP initiated a quiesced snapshot on it.

The backup job logs were showing the below:

2016-06-10T14:00:47.451+04:00 avvcbimage Info <0000>: vm-2412
2016-06-10T14:00:47.469+04:00 avvcbimage Info <0000>: Snapshot 'VDP-1465581646a5cf9db1234fe3558e090fec0f29a7eaa4807b2e' creation for VM '[XYZ] abc/abc.vmx' task still in progress, sleep for 3 sec
2016-06-10T14:00:50.524+04:00 avvcbimage Info <0000>: Snapshot 'VDP-1465581646a5cf9db1234fe3558e090fec0f29a7eaa4807b2e' creation for VM '[XYZ] abc/abc.vmx' task still in progress, sleep for 3 sec
2016-06-10T14:00:53.576+04:00 avvcbimage Warning <19733>: vSphere Task failed (quiesce, snapshot error=45): 'An error occurred while saving the snapshot: Failed to quiesce the virtual machine.'.
2016-06-10T14:00:53.576+04:00 avvcbimage Error <17775>: Snapshot 'VDP-1465581646a5cf9db1234fe3558e090fec0f29a7eaa4807b2e' creation for VM '[XYZ] abc/abc.vmx' task creation encountered a quiesce problem
2016-06-10T14:00:53.576+04:00 avvcbimage Warning <0000>: The VM could not be quiesced prior to snapshot creation and this backup will not be used as a base for subsequent CBT backups if successful.

We tried a couple of basic troubleshooting steps, such as uninstalling and reinstalling VMware Tools and moving the virtual machine to a different ESXi host and a different datastore. Everything produced the same result: Failed to quiesce the virtual machine. Interestingly, a quiesced snapshot taken manually from vCenter completed successfully; only the quiesce from VDP failed.

The next step was to check the VSS writer status inside the guest. From an elevated command prompt, I ran the below command:
vssadmin list writers

All the VSS writers were in a healthy state without any errors. I then checked the Event Logs for VSS issues, and there were none.

Next, the vmware.log for this virtual machine displayed the below logging:

2016-06-13T18:01:55.531Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'VDP-1465581646a5cf9db1234fe3558e090fec0f29a7eaa4807b2e', deviceState=0, lazy=0, logging
=0, quiesced=1, forceNative=0, tryNative=1, sibling=0 saveAllocMaps=0 cb=B986C80, cbData=323CE050
2016-06-13T18:01:55.739Z| vcpu-2| I120: ToolsBackup: changing quiesce state: IDLE -> STARTED
2016-06-13T18:01:57.770Z| vcpu-0| I120: Msg_Post: Warning
2016-06-13T18:01:57.770Z| vcpu-0| I120: [msg.snapshot.quiesce.vmerr] The guest OS has reported an error during quiescing.
2016-06-13T18:01:57.770Z| vcpu-0| I120+ The error code was: 5
2016-06-13T18:01:57.770Z| vcpu-0| I120+ The error message was: 'VssSyncStart' operation failed: IDispatch error #8451 (0x80042303)
2016-06-13T18:01:57.770Z| vcpu-0| I120: ----------------------------------------
2016-06-13T18:01:57.775Z| vcpu-0| I120: ToolsBackup: changing quiesce state: STARTED -> ERROR_WAIT
2016-06-13T18:01:59.797Z| vcpu-0| I120: ToolsBackup: changing quiesce state: ERROR_WAIT -> IDLE
2016-06-13T18:01:59.797Z| vcpu-0| I120: ToolsBackup: changing quiesce state: IDLE -> DONE
2016-06-13T18:01:59.797Z| vcpu-0| I120: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'VDP-1465581646a5cf9db1234fe3558e090fec0f29a7eaa4807b2e': 0
2016-06-13T18:01:59.797Z| vcpu-0| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (40).
2016-06-13T18:02:08.245Z| vmx| I120: SnapshotVMX_Consolidate: Starting
2016-06-13T18:02:08.245Z| vmx| I120: SnapshotVMXConsolidateOnlineCB: nextState = 0 uid 0


From vmware.log I could see that the guest OS reported an error during quiescing and the 'VssSyncStart' operation failed, which led to the snapshot creation failure.

Something was going on at the VSS end. The VSS writers were good, so I proceeded to check the VSS providers:
vssadmin list providers

This listed two providers, each with its own UUID. However, in the registry there were three UUID entries. See the below image:


If you look here, the entry ending in 7669 does not correspond to either listed provider's UUID. So I took a backup of the registry and deleted this stale entry, after which quiesced snapshots from VDP were successful.
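
For reference, here is a rough sketch of how that registry backup and deletion can be done from an elevated command prompt. This assumes the providers are registered under the standard HKLM\SYSTEM\CurrentControlSet\Services\VSS\Providers key; the GUID below is only a placeholder for whatever stale entry you identify in your own registry, and the export path is just an example:

REM Back up the VSS Providers key before touching anything
reg export "HKLM\SYSTEM\CurrentControlSet\Services\VSS\Providers" C:\vss-providers-backup.reg
REM Delete only the stale provider subkey identified earlier (placeholder GUID shown)
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\VSS\Providers\{00000000-0000-0000-0000-000000007669}" /f

If anything goes wrong, importing the exported .reg file restores the key.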

Hope this helps someone out there facing this issue.

Friday, 24 June 2016

Configure Syslog For VDP Appliance

So you want to forward VDP logs to a syslog server? Okay. Is it officially supported? No, VDP does not support forwarding its logs to a different machine or logging solution. Is a workaround available? Yes.

The VDP appliance is based on SUSE Linux, and its syslog-ng configuration file can be used to set up remote syslog for this machine.

In my lab, I have a VDP 6.1.2 appliance connected to a 6.0 U2 vCenter Server. The syslog collector solution I have here is VMware vRealize Log Insight.

So how do we achieve this?

1. William Lam has written a script which can be used to configure syslog for various VMware products, and we will be using it to configure syslog for the VDP appliance. Download the script from the link here.

2. Once this shell script is downloaded, copy it to the root directory or the /tmp directory of your VDP appliance. I used WinSCP to move the file to the appliance.

3. Run the script using the below format:
# ./configurevCloudSuiteSyslog.sh vdp <Remote-Syslog-IP>

4. Once this command is executed, browse to the following directory to view the syslog-ng configuration file:
# cd /etc/syslog-ng
# less syslog-ng.conf

5. The end of the file will have these lines appended:

# Configured using vCloud Suite Syslog Configuration Script by William Lam
source vdp {
       file("/space/avamar/var/log/av_boot.rb.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/space/avamar/var/log/dpnctl.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/space/avamar/var/log/dpnnetutil-av_boot.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/log/dpnctl.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/log/av_boot.rb.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/log/av_boot.rb.err.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/log/dpnnetutil-av_boot.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/avi/server_log/flush.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/avi/server_log/avinstaller.log.0" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/vdr/server_logs/vdr-server.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/vdr/server_logs/vdr-configure.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamar/var/flr/server_logs/flr-server.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/data01/cur/err.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamarclient/bin/logs/VmMgr.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamarclient/bin/logs/MountMgr.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamarclient/bin/logs/VmwareFlrWs.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
       file("/usr/local/avamarclient/bin/logs/VmwareFlr.log" log_prefix("vdp: ") follow_freq(1) flags(no-parse));
};

# Remote Syslog Host
destination remote_syslog {
      udp("192.168.10.6" port (514));

};

log {
        source(vdp);
        destination(remote_syslog);
};

The destination uses UDP port 514, and the IP 192.168.10.6 is my Log Insight server; this will of course vary for you.
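
If the script does not restart the syslog service for you, restart it manually so the new configuration is picked up, and you can also push a test datagram towards the collector to verify that UDP 514 is reachable. This is a minimal sketch assuming the SUSE syslog service wrapper and netcat are present on the appliance; adjust to whatever tooling your build ships with:

# service syslog restart
# echo "<14>vdp: syslog forwarding test" | nc -u -w1 192.168.10.6 514

The test message should show up in Log Insight within a few seconds if the path is open.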

6. Over in Log Insight, under the Hosts section, you can see my VDP appliance added:


7. When I click vdpnew (the hostname of my VDP appliance), I can see its logs coming in:



Well, that's pretty much it. Just a workaround, not officially supported though. 

Monday, 6 June 2016

Unable To Connect VDP To Web Client: Failed To Validate vCenter certificate

So out of nowhere today, when I tried to connect the VDP appliance to the web client, I got the following error message:

"Failed to validate the vCenter certificate. Either install or verify the certificate by using the vSphere Data configuration utility"

A screenshot of the message:


The vdr-configure logs displayed the following: 
Location: /usr/local/avamar/var/vdr/server_logs

2016-06-06 20:42:26,067 INFO  [http-nio-8543-exec-6]-rest.AuthenticationService: Vcentre Certificate validation had failed
2016-06-06 20:42:27,454 INFO  [Thread-15]-vi.ViJavaServiceInstanceProviderImpl: vcenter-ignore-cert ? true
2016-06-06 20:42:27,514 INFO  [Thread-15]-vi.ViJavaServiceInstanceProviderImpl: visdkUrl = https://192.168.1.1:443/sdk
2016-06-06 20:42:31,127 ERROR [Thread-15]-vi.ViJavaServiceInstanceProviderImpl: Failed To Create ViJava ServiceInstance owing to Remote VCenter connection error

I was not using any custom certificates here; all default VMware certificates on my 5.5 vCenter.

The resolution:

1. Restart all VDP services using the below commands:

dpnctl stop all
dpnctl start all
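
Between the stop and the start (and again once everything is back up), you can check where each service stands with dpnctl's status option; the exact output wording varies between VDP builds:

dpnctl status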

2. When I logged into the https://VDP-IP:8543/vdp-configure page, I noticed the following:


I have an external proxy here, and it looked like the proxy VM was down. I restarted the proxy VM, yet the services and proxy status did not update, which is when I proceeded to step 3.

3. Re-register your VDP appliance to vCenter. You can follow the link here to get this task done.

Once the re-register was done (with the same VDP user or a different user, and without any changes to the vCenter port or IP), the appliance was able to connect to the web client successfully.

Interestingly enough, I was unable to find the root cause of this, although I suspect that the connectivity between vCenter and VDP was broken at some point. If you have something more detailed, do comment below. Always looking for VDP deep-dives.

That's pretty much it!

Thursday, 2 June 2016

Unable To Add A Client To A Backup Job In VDP: Already Registered

So today I was working on a case where a vCenter virtual machine and a PSC virtual machine were added to a backup job. One fine day, the PSC machine started misbehaving and had to be discarded, so the end solution was to completely re-deploy the PSC. The name of the faulty PSC machine was XYZ; it was renamed to XYZ-Old, and a new PSC was deployed with the same old name, XYZ.

The vCenter was now back up and running fine. We went to the web client, connected to the VDP plugin, and deleted the old backup job, so the Backup tab had no backup jobs at all. Then we created a new backup job, added the new PSC machine, XYZ, to it and clicked Finish. The screen grayed out for a few seconds and the task finished with an error stating:

Backup job was created successfully, but the system failed to add the following clients to the job. ERROR: Host Named (<host-name>) is already registered as........



Why this happens:

The error says that the PSC, XYZ, is already registered because the old machine entry was still present in the protected client list under the VDP backup client domain. This stale entry was preventing us from adding the new machine with the old name, even though the old PSC had been renamed and the backup job corresponding to it had been removed.

The solution:

1. Open an SSH session to the VDP appliance and run the below command:
mccli client show --recursive

This shows all the protected clients associated with this VDP appliance.
Here I noticed that the XYZ PSC machine entry was still present; this was the stale entry I needed to remove.

2. Run the below command to retire that VM from the protected client list:
mccli client retire --domain=/vCenter-IP/VirtualMachines --name="VM-Name"

You will receive an output saying the virtual machine was retired successfully.
Once this was done, we went ahead and added the XYZ PSC virtual machine to the backup job successfully. A short end-to-end sketch of these commands follows.
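
Putting it together, the whole check, retire and verify sequence looks roughly like the below. XYZ is simply the example client name from this post, and the grep is only there to filter the output; substitute your own vCenter IP, domain path and VM name:

mccli client show --recursive | grep -i "XYZ"
mccli client retire --domain=/vCenter-IP/VirtualMachines --name="XYZ"
mccli client show --recursive | grep -i "XYZ"

The second show should no longer list the stale entry.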

Sharing is caring! You will not find many VDP articles out there, believe me!!

Thank you!

Unable to vMotion Due To Difference In vDS Vendor Information Between Source And Destination vDS

So I was trying to vMotion my virtual machines between two vDS and it was failing with the following error: 

"The destination distributed switch has a different version or vendor than the source distributed switch"

A quick search brings up the following KB article, which is great; however, there are certain cases where people do not want to upgrade the vDS just to change the vendor ID.

In that situation, there are two workarounds:

1. Changing the entry from the MOB page of the vCenter:

  • Open a browser and go to the following address to open the Managed Object Browser page:
https://<vCenter-IP>/mob
  • Click content
  • Under Name: rootFolder, select the value group-d1 (Datacenters)
  • Under Name: childEntity, select your datacenter value
  • Under Name: networkFolder, select group-n* (network)
  • Under Name: childEntity, locate the DV Switch; before the name of the DVSwitch you will see the moID of the switch in the format dvs-<number>
  • Once the moID is obtained, go to the following location:
https://<vCenter-IP>/mob/?moid=dvs-141
  • Click PerformDvsProductSpecOperation_Task on the page
  • For operation, enter: upgrade
  • For productSpec, fill in:

<productSpec>
    <vendor>VMware, Inc.</vendor>
</productSpec>

  • Click Invoke Method
  • Go back to the vCenter Server and verify the vendor name on your vDS.

2. If you are using a SQL database, you can run the below query instead (replace XXX with your dvs ID, which can be found from the steps above):
update vpx_dvs set product_vendor='VMware, Inc.' where id=XXX;

However, before performing any of these steps, take snapshots and backups of your vCenter Server and its database.
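
If you want to double-check the value before and after the update, a simple select against the same table works. This sketch only uses the id and product_vendor columns referenced in the update above, and XXX is again your dvs ID:

select id, product_vendor from vpx_dvs where id=XXX;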

Monday, 30 May 2016

VDP Full and File Level Restore Fails: Failed to get disks: Unable to browse as proxies are unavailable

When trying to perform a full virtual machine restore or a file level restore for any virtual machine, the task fails at various percentages. The failure described here is:

Failed to get disks: Unable to browse as proxies are unavailable

If I try to restore this virtual machine to any other host or datastore, or restore it as a new virtual machine, or replace the existing virtual machine, the task still fails with the same error.

The restore logs in the following location show the failure.

Location: /usr/local/avamarclient/var

The log named MOD-*EPOCH*-vmimage1.log is the restore log, and it had the following:

2016-05-23T16:01:39.723-02:00 avvcbimage Error <0000>: [IMG0011] Timeout on wait for spawned restore metadata avtar process to complete
2016-05-23T16:06:11.947-02:00 avvcbimage FATAL <17824>: GetDiskAttributed Failed
2016-05-23T16:06:12.121-02:00 avvcbimage Error <17771>: Invalid request to create a VM.
2016-05-23T16:06:12.121-02:00 avvcbimage Error <0000>: [IMG2012] VM creation failed during restore.

Here a timeout is occurring while avvcbimage is spawning the avtar process for the metadata restore, which in turn causes the virtual machine creation for the restore to fail.

The resolution:

Increase the avvcbimage sub-process timeout using the below steps:

1. If the VDP is using an external proxy, SSH to your proxy machine; if the VDP is using the internal proxy, SSH to the VDP appliance itself. Change to the directory:
# cd /usr/local/avamarclient/var
2. Edit the following file:
# vi avvcbimage.cmd
3. Add the below line and save the file:
--subprocesstimeout=600
4. Restart the avagent service using the below command:
# service avagent-vmware restart
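
As with the low-space workaround earlier on this page, the below is a minimal shell sketch of the same change, assuming the default file location; the .bak filename is just my own choice for keeping a copy of the original:

# cd /usr/local/avamarclient/var
# cp avvcbimage.cmd avvcbimage.cmd.bak
# echo "--subprocesstimeout=600" >> avvcbimage.cmd
# grep subprocesstimeout avvcbimage.cmd
# service avagent-vmware restart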

Run the restore operation now, and it should work fine.
If not, leave a comment! Thank you.