Thursday, 25 January 2018

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In few scenarios when you run a test recovery or a planned migration, the SRM service will crash. This might happen when you run a specific recovery plan or any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
-->
--> Backtrace:
-->
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]
-->

The timedFunc back-trace is seen when "Wait For VMware Tools" is set to 0 minutes and 0 seconds

And just about few lines top of this back trace you will see the faulty VM which caused this crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM ID is given here. To find this VM ID you will need to go to the vCenter MOB page.

The way I found out to correlate this is:
1. Login to MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Respective datacenter under "Child Entity"
4. Then under vmFolder group-v4 (vm)
5. Expand childEntity and this will list out all the VMs in that vCenter.

My output was similar to:


The VM was CentOS7.2

> Then navigate to the Recovery plan in SRM
> Select the affected Recovery plan this VM is part of > Related Objects > Virtual Machines
> Right click this VM and select Configure Recovery

Here the Wait For VMware Tools were set to 0,0 timeout. We had to change this to a valid non zero value. 


Post this, the recovery plan completed fine without crashing the SRM service. This should ideally be fixed in the newer SRM releases as it would not let you set a 0 timeout. 

Hope this helps!

VDP Restore Fails: Virtual Machine Must Be Powered Off To Restore

There are few cases where you try to perform an in place restore of a virtual machine and it fails and complains the virtual machine is not powered off. The virtual machine is indeed powered off, however, the VDP does not understand this. You will see something like:


There might be two possible causes.

1. The MCS cache might not be updated with the power state.
Run the below command:
# mccli vmcache show --name=/vc-domain-name/VirtualMachines/<vm-name> | grep -i "power status"

Example command:
# mccli vmcache show --name=/cartman.southpark.local/VirtualMachines/Test-1 | grep -i "power status"

The output you should ideally see on a powered off VM is:
Power Status         poweredOff

In the above case, you might see:
Power Status         poweredOn

If this is the issue, then update the MCS VM cache by issuing the below command:
# mccli vmcache sync --name=/vc-domain-name/VirtualMachines/<vm-name>

Example command:
# mccli vmcache sync --name=/cartman.southpark.local/VirtualMachines/Test-1

Then the power state should be updated and the in place restore should work. 

2. The MCS cache might be updated but the tomcat is out of sync with MCS. 

In many cases the MCS and tomcat don't sync together. Due to this, the CLI will show one set of results and the GUI will say otherwise. To sync them up simply restart the tomcat service by issuing:
# emwebapp.sh --restart

Note that post restarting the tomcat service, it will take a while to connect the appliance back in the web client as it has to rebuild the cache. 
If you tail the vdr-server.log located under /usr/local/avamar/var/vdr/server_logs/vdr-server.log then once you see the below logging, it would indicate the connection has completed successfully:

2018-01-25 09:05:57,566 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state
2018-01-25 09:05:57,567 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state

The restore should then work successfully. 

If it still continues to fail, then try performing a restore to a different location as a workaround. 

Hope this helps!

Tuesday, 23 January 2018

File Level Restore - Select Destination Is Empty

In few cases, when you try to perform a File Level Restore on a windows server, the "Select Destination" window is empty.


This can happen due to few reasons. If this is happening on ESX 6.0 then it would imply that the vSphere Guest API would be used instead of the VIX API. We need to modified the configuration of the config.xml to use the VIX API only.

Make sure the values in the config.xml file are what is expected. Run the below commands:

# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/config.xml
# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/MountPoint/config.xml

The above two commands should display the below output:

<vmmgrflags>2</vmmgrflags>
<mountmgrflags>3221225476</mountmgrflags>

If you see this, skip to "Other possible solutions" section. If not, then edit these two files and make sure the values reflect the above mentioned output. 

Then restart the FLR service using:
# service vmwareflr restart

Then re-try the FLR operation. 

Other possible solutions:
In most recent VDP releases this value should be already set to VIX API only, and we might still run into the same issues. In that case a strong case be:

> Outdated VM Tools on that windows machine. Update the VMware tools and restart the VM and then re-attempt the FLR task. 
> Also, add the FLR URL to the browser's trusted website list. 

If the issue still persists, open a case with VMware Support.

Hope this helps!

Friday, 19 January 2018

MCS Service Crashes Due To Locked User Account

This is issue is only seen in an upgraded 6.1.5 instance, where the MCS service constantly crashes with a locked user account. In some cases, the MCS might be running, but none of the mccli commands run and in few cases the backup scheduler service would not start.

If you try to start MCS from the admin mode using mcserver.sh --start --verbose, it starts successfully but crashes immediately.

If you run the below command:
# grep locked /usr/local/avamar/var/mc/server_log/mcserver.log.0

You will notice the account being locked:

WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product MCCLI
WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product VDP

When you start backup scheduler, you might see it fail with:

root@Jimbo:/usr/local/avamar-tomcat/lib/#: dpnctl start sched
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Resuming backup scheduler...
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mccli mcs resume-scheduler" - exit status 1
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

In the dpnctl.log you will notice the following:

2018/01/19-06:13:55 output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mccli mcs resume-scheduler":
2018/01/19-06:13:55 - - - - - - - - - - - - - - - BEGIN
2018/01/19-06:13:55 1,22801,User login failure.
2018/01/19-06:13:55 Attribute Value
2018/01/19-06:13:55 --------- --------------------
2018/01/19-06:13:55 reason    Locked user account.
2018/01/19-06:13:55 - - - - - - - - - - - - - - - END

This is because of a missing java archive file for MCS service due to which the account locks out. 

If you run the below command you should be able to see the symbolic link. 
# ls -lh /usr/local/avamar-tomcat/lib/mccommons.jar

lrwxrwxrwx 1 root root 35 Jan 19 11:44 /usr/local/avamar-tomcat/lib/mccommons.jar -> /usr/local/avamar/lib/mccommons.jar

If this is missing, then you will run into the above issue. To fix this:

1. Be logged in as root user into VDP
2. Navigate to the the below directory:
# cd /usr/local/avamar-tomcat/lib/mccommons.jar

3. Then run the below command to recreate the symbolic link:
# /bin/ln -fs /usr/local/avamar/lib/mccommons.jar .

The . is also required as you are linked to file to the current working directory.

4. Restart tomcat service using:
# emwebapp.sh --restart

5. Restart MCS using:
# mcserver.sh --restart --verbose

That should fix it. Hope this helps!

Wednesday, 17 January 2018

MCS Start Up Fails In VDP With: Permission Denied At MCServer.pm

In few rare cases the management services might fail to start up with a permission denied message.
If you run dpnctl start mcs you will notice the almost generic failure:

root@vdp:/#: dpnctl start mcs
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-21990
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start" - exit status 13
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

If you start MCS from admin mode using mcserver.sh --start --verbose you will notice:

admin@vdp:~/>: mcserver.sh --start --verbose
Permission deniedPermission denied at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/LibXML.pm line 587.
Could not create file parser context for file "/usr/local/avamar/var/mc/server_data/prefs/mcserver.xml": Permission denied at /usr/local/avamar/lib/MCServer.pm line 128

The same above is seen if you run mcserver.sh --test

In the dpnctl.log you will notice the same:

2018/01/17-15:13:55 dpnctl: INFO: - - - - - - - - - - - - - - - BEGIN
2018/01/17-15:13:55 Permission deniedPermission denied at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/LibXML.pm line 587.
2018/01/17-15:13:55 Could not create file parser context for file "/usr/local/avamar/var/mc/server_data/prefs/mcserver.xml": Permission denied at /usr/local/avamar/lib/MCServer.pm line 128
2018/01/17-15:13:55 dpnctl: INFO: - - - - - - - - - - - - - - - END
2018/01/17-15:13:55 /bin/cat /tmp/dpnctl-get-mcs-status-status-21389 2>&1
2018/01/17-15:13:55 [ "/bin/cat /tmp/dpnctl-get-mcs-status-status-21389 2>&1" exit status = 0 ]
2018/01/17-15:13:55 dpnctl: INFO: "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --test" exit status = 13
2018/01/17-15:13:55 rm -f /tmp/dpnctl-get-mcs-status-status-21389 /tmp/dpnctl-get-mcs-status-output-21389
2018/01/17-15:13:55 dpnctl: INFO: "rm -f /tmp/dpnctl-get-mcs-status-status-21389 /tmp/dpnctl-get-mcs-status-output-21389" - exit status 0
2018/01/17-15:13:55 dpnctl: INFO: MCS status - - - - - - - - - - END
2018/01/17-15:13:55 dpnctl: INFO: MCS status: down.

This is because of incorrect permissions and ownership on the mcserver.xml file. I do not know the reason why the permissions and ownership was flipped.

The ideal permission and ownership on this file should be:
-rwxrwx--- 1 admin admin 50320 Jan 13 01:48 /usr/local/avamar/var/mc/server_data/prefs/mcserver.xml

To change ownership (Be logged in as root)
# chown admin:admin mcserver.xml

To change permission run the below:
# chmod u+rwx,g+rwx mcserver.xml

Then you should be able to proceed further with the restart of the MCS service.

Hope this helps!

VDP Upgrade ISO Not Detected

There are a ton of reasons why VDP ISO might not be available in the vdp-configure page (Few of those are available on my blog here) This is another one of the possible causes.

> When we run df -h the ISO shows as mounted
> In the avinstaller.log.0 under /usr/local/avamar/var/avi/server_log/ shows the package extraction and package checksum verification is completed successfully.

The avinstaller.pl is responsible for all the upgrade related tasks. And a restart of this service should help putting few things back in place. However, when I restarted this service, I ended up with the below backtrace:

2017-07-19 10:29:32.441 1776 [main] WARN  o.e.jetty.webapp.WebAppContext - Failed startup of context o.e.j.w.WebAppContext@5d11346a{/avi,file:/usr/local/avamar/lib/jetty/work/jetty-0.0.0.0-7543-avi.war-_avi-any-/webapp/,STARTING}{/usr/local/avamar/lib/jetty/avi.war}
java.lang.IllegalStateException: java.lang.IllegalStateException: Failed to locate resource[avinstaller.properties] in ClassLoader
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized(LifecycleListener.java:64) ~[na:na]
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:788) ~[jetty-server-9.0.6.v20130930.jar:9.0.6.v20130930]
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:434) ~[jetty-servlet-9.0.6.v20130930.jar:9.0.6.v20130930]
========================
Cutting down log snippet
========================
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:69) [jetty-util-9.0.6.v20130930.jar:9.0.6.v20130930]
        at com.avamar.avinstaller.AVIMain.start(AVIMain.java:313) [avistart.jar:na]
        at com.avamar.avinstaller.AVIMain.main(AVIMain.java:345) [avistart.jar:na]
Caused by: java.lang.IllegalStateException: Failed to locate resource[avinstaller.properties] in ClassLoader
        at com.avamar.avinstaller.util.AppConfig.getInstance(AppConfig.java:52) ~[na:na]
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized(LifecycleListener.java:59) ~[na:na]

So, there was a missing repo symlink in webapp directory. You can verify this by navigating to:
# cd /usr/local/avamar/lib/jetty/work/jetty-0.0.0.0-7543-avi.war-_avi-any-/webapp

If you ls -lh you should see this symbolic link here:
lrwxrwxrwx 1 root root    19 Jan 11 20:25 datalink -> /data01/avamar/repo

If this is missing, recreate it by using:
# ln -s /data01/avamar/repo datalink

Next, navigate to the below location:
# cd /usr/local/avamar/lib/jetty/work/jetty-0.0.0.0-7543-avi.war-_avi-any-/webapp/WEB-INF/classes/

Here you should have 13 files and I was missing avinstaller.properties file. I had to recreate this file. If you have a second VDP appliance copy this file from that server. Else, create a new file with the name avinstaller.properties using vi editor and copy paste the below content:

messageLogDirectory=/tmp/av

# db settings
dbserver=localhost
dbport=5555
dbname=avidb
dbuser=admin
dbpass=
dbmax=10


# db settings for EM
dbserverEM=localhost
dbportEM=5556
dbnameEM=emdb
dbuserEM=admin
dbpassEM=

# local repository
repo_location=/tmp/repo
repo_packages=/tmp/repo/packages
repo_downloads=/tmp/repo/downloads
repo_temp=/tmp/repo/temp

#client package
plugin_location=/usr/local/avamar/lib/plugins
waiting_timer_interval_in_min=60

The avinstaller.pl was still failing with this backtrace:

java.lang.IllegalStateException: org.jbpm.api.JbpmException: resource jbpm.cfg.xml does not exist
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized(LifecycleListener.java:64) ~[na:na]
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:788) ~[jetty-server-9.0.6.v20130930.jar:9.0.6.v20130930]
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:434) ~[jetty-servlet-9.0.6.v20130930.jar:9.0.6.v20130930]
                ========================
Cutting down log snippet
========================
        at com.avamar.avinstaller.AVIMain.start(AVIMain.java:313) [avistart.jar:na]
        at com.avamar.avinstaller.AVIMain.main(AVIMain.java:345) [avistart.jar:na]
Caused by: org.jbpm.api.JbpmException: resource jbpm.cfg.xml does not exist
        at org.jbpm.pvm.internal.stream.ResourceStreamInput.openStream(ResourceStreamInput.java:60) ~[na:na]
        at org.jbpm.pvm.internal.xml.Parse.getInputSource(Parse.java:146) ~[na:na]
        at org.jbpm.pvm.internal.xml.Parser.buildDocument(Parser.java:453) ~[na:na]
                ========================
Cutting down log snippet
========================
        at com.avamar.avinstaller.process.PVM.getInstance(PVM.java:65) ~[na:na]
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized(LifecycleListener.java:60) ~[na:na]
        ... 20 common frames omitted

I was also missing jbpm.cfg.xml file. Recreate this file and enter the below contents:

<?xml version="1.0" encoding="UTF-8"?>

<jbpm-configuration>

  <import resource="jbpm.default.cfg.xml" />
  <import resource="jbpm.businesscalendar.cfg.xml" />
  <import resource="jbpm.tx.hibernate.cfg.xml" />
  <import resource="jbpm.jpdl.cfg.xml" />
  <import resource="jbpm.identity.cfg.xml" />

  <!-- Job executor is excluded for running the example test cases. -->
  <!-- To enable timers and messages in production use, this should be included. -->
  <!--
  <import resource="jbpm.jobexecutor.cfg.xml" />
  -->

  <import resource="jbpm.mail.templates.examples.xml" />

</jbpm-configuration>

Restart the avinstaller again using avinstaller.pl --restart and everything went fine and the ISO was detected in the configure page. 

Hope this helps!

Resetting Site Recovery Manager's Embedded DB Password

This article is applicable only to Postgres database which is the embedded DB option available during install. If you have forgotten the database password then you will not be able to login to DB to alter tables or perform a repair / modify on the SRM instance.

Before resetting the password, make sure the SRM machine is on a snapshot and a backup is available for the DB.

1. First we will need to edit the pg_hba.conf file to allow all users as trusted users so that a password-less authentication will be performed. The pg_hba.conf file is located under:
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data\pg_hba.conf
Make a backup of the file before editing it.

Locate this section in the conf file:

# TYPE DATABASE USER ADDRESS METHOD
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5

Replace that complete set with this:

# TYPE DATABASE USER ADDRESS METHOD
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5

Save the file.

2. Restart the SRM Embedded database service from services.msc

3. Open a command prompt in admin mode and now we will have to login to the database. Navigate to the below bin directory:
C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin

4. Connect to the postgres database using:
psql -U postgres -p 5678

Port might vary if you had a custom install of the database. 

5. Run the below query to change the password:
ALTER USER "enter srm db user here" PASSWORD 'new_password';

The srm db user / port information can be found from the 64 bit ODBC connection. 
A successful execution will return the output: "ALTER ROLE"

6. Revert the changes performed the pg_hba.conf file so that md5 authentication is required for users to login to SRM database. 

7. Restart the SRM Embedded DB service again

Post this, the SRM service will fail to restart and you will notice the following backtrace in vmware-dr.log

2018-01-17T02:25:15.628Z [01748 error 'WindowsService'] Application error:
--> std::exception 'class Vmacore::Exception' "DBManager error: Could not initialize Vdb connection: ODBC error: (08001) - FATAL:  password authentication failed for user "srmadmin"
--> "
--> 
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x0001362B]
--> backtrace[04] vmware-dr.exe[0x0015C59A]
--> backtrace[05] MSVCR90.dll[0x00074830]
--> backtrace[06] MSVCR90.dll[0x00043B3C]
--> backtrace[07] ntdll.dll[0x0009CED3]
--> backtrace[08] vmware-dr.exe[0x000060AF]
--> backtrace[09] vmware-dr.exe[0x00006A5E]
--> backtrace[10] windowsService.dll[0x00002BCE]
--> backtrace[11] windowsService.dll[0x000020DD]
--> backtrace[12] sechost.dll[0x000081D5]
--> backtrace[13] KERNEL32.DLL[0x0000168D]
--> backtrace[14] ntdll.dll[0x00074629]
--> [backtrace end]

Run a Modify on the SRM instance from Add/Remove programs and provide the new database password during this process and the service will start up just fine. 

Hope this helps.