Monday, 17 January 2022

Removing A Stale Inaccessible NFS Datastore From vCenter

I recently ran into an issue where one of the NFS datastores, backed by a NAS share from one of my backup appliances (Rubrik), had gone inaccessible.

I did a bit of looking around to see what had happened, and I noticed that the NAS share had been destroyed from the backup end in a rather unclean way, which left some stale entries on the VMware side.


The highlighted datastore was the one in question, and I had to get rid of its entry from both vCenter and the ESXi host. The VM residing on it no longer needed to be salvaged, so it was fine to let go of any data on it, since the backing share was destroyed anyway.

First, I had to disassociate the datastore from the ESXi host, which was straightforward. From an SSH session to the ESXi host, all I had to do was run:

esxcli storage nfs remove -v <datastore_name> 
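
If you are unsure of the exact volume name to pass in, you can list the NFS mounts on the host first; the stale datastore typically shows up with its Accessible flag set to false:

esxcli storage nfs list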

Running the remove command severed the host's relationship with the datastore, and the UI displayed something like the below:



Now, the reason you cannot do an Unmount Datastore in such a state is that there are no longer any ESXi hosts associated with this datastore, so it needs to be cleaned up manually from the vCenter Postgres database.

Before making any changes to the VCDB, ensure you have a vCenter backup, snapshot, or whatever is necessary to revert to a working state if something goes wrong.

Stop the vmware-vpxd service using:

service-control --stop vmware-vpxd

Connect to the VCDB using:

/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

Then run the below query, replacing the datastore name with the one you want to clean up:

SELECT id from vpx_entity where name='rubrik_9c65faa5cbe34dd6a9a796f1b4d12c20';

You will then see an output such as

VCDB=# SELECT id from vpx_entity where name='rubrik_9c65faa5cbe34dd6a9a796f1b4d12c20';
id
-------
 10369
(1 row)

Using this ID as the ds_id, you can see the host assignment:

VCDB=# select * from vpx_ds_assignment where ds_id=10369;
 ds_id | entity_id | accessible | mount_path | mount_id | mount_mode | mounted
-------+-----------+------------+------------+----------+------------+---------
 10369 |     10370 |            |            |          |            |
(1 row)

Then all you need to do is clean up the entries from the below three tables:

VCDB=# delete from vpx_ds_assignment where ds_id=10369;
VCDB=# delete from vpx_datastore where id=10369;
VCDB=# delete from vpx_vm_ds_space where ds_id=10369;

Replace the IDs accordingly.
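
If you would rather have the option to back out before committing, the same three deletes can be wrapped in a single transaction; a minimal sketch driven from the shell, assuming the same example ID of 10369:

/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres <<'EOF'
BEGIN;
DELETE FROM vpx_ds_assignment WHERE ds_id=10369;
DELETE FROM vpx_datastore WHERE id=10369;
DELETE FROM vpx_vm_ds_space WHERE ds_id=10369;
-- Review the reported row counts; run ROLLBACK; instead of COMMIT; to abort.
COMMIT;
EOF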

Once done, finally start the vmware-vpxd service using:

service-control --start vmware-vpxd

The stale datastore entry should no longer be visible.


Hope this helps.

Sunday, 16 January 2022

What Is Linux Load Average

If you have ever used a Linux system or been a system administrator, you would have come across the term load average. Whenever there is a performance issue with a system, you commonly hear the question: what is the load average on the server?

In this article, let's break the question down into a few parts:

  • What is load average? 
  • How do we determine if the system has sufficient capacity? 
  • How do we determine if the system is overloaded? 

What is load average

Load average is technically the system load, calculated as the average number of process threads that are either running or waiting to run (on Linux this also includes threads in uninterruptible sleep, typically waiting on disk I/O). It is reported over 1-, 5-, and 15-minute intervals.

The load average is generally observed using common commands like top and uptime.

For example: 

[root@centos7 ~]# uptime
08:32:53 up 47 days, 21:09,  1 user,  load average: 0.05, 0.15, 0.20

This tells us that the 1-, 5-, and 15-minute load averages are 0.05, 0.15, and 0.20 respectively.

To understand these values, let's take an example that covers both whether the system has spare capacity and whether it is overloaded:

Consider a system with a single-core processor, which is equivalent to a one-lane bridge. If there are no cars on the bridge, it is completely free of load and the load average is effectively 0. If one car is crossing, the lane is occupied and the bridge is at capacity, which corresponds to a load average of 1. If additional cars arrive, they must queue up behind it; the load now exceeds what the bridge can handle, and by the time the load average reaches 2 there is a full lane's worth of traffic waiting on top of the lane being serviced, meaning the system is overloaded.

The same concept extrapolates to multiple cores, which essentially adds more lanes to the bridge.

What about multiprocessor and multicore systems


If you have a single processor with four cores, for example, then it's a single-processor quad-core system, also called a multicore system.
If you have multiple processors with one or more cores each, then it is essentially called a multiprocessor system.

So for a single-core system the load average effectively ranges from 0 to 1; for dual-core, 0 to 2; for quad-core, 0 to 4; and so on.
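
In other words, a load average is only meaningful relative to the CPU count. A small sketch that reads the 1-minute figure from /proc/loadavg and normalizes it per CPU:

load1=$(cut -d ' ' -f1 /proc/loadavg)
cpus=$(nproc)
# A normalized value above 1.00 means more runnable threads than cores
awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "load per CPU: %.2f\n", l / c }'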

On top of this, you can have a system with hyperthreading enabled. With hyperthreading, a single physical CPU core appears as two logical CPUs to the operating system. Since a single core can execute only one task at a time, hyperthreading was introduced to allow multiple threads to be scheduled within the same core.

You can find your CPU and core counts, along with hyperthreading details, from the lscpu command:

[root@centos7 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
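
The logical CPU count your load average should be compared against is Socket(s) x Core(s) per socket x Thread(s) per core; nproc reports that number directly:

nproc
lscpu | grep -E '^(Socket|Core|Thread)'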


So the next time you are troubleshooting system performance, you should definitely take the load average into consideration.


Tuesday, 24 July 2018

Understanding Perfbeat Logging In GSAN

If you have ever come across the GSAN logs in VDP, located under /data01/cur, you will sometimes notice the below logging:

2018/05/28-10:34:31.01397 {0.0} [perfbeat.3:196]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=1397.27 limit=139.7273 mbpersec=0.79
2018/05/28-10:35:38.67619 {0.0} [perfbeat.2:194]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=53.72 limit=5.3722 mbpersec=0.88

The perfbeat::outoftolerance warning is logged against various processes. In the above example, the tasks running are garbage collection and flush; it can also be hfscheck, backup, restore, and so on. You will generally see this logging whenever that particular task is performing slowly, causing the respective maintenance or backup jobs to take a long time to complete. If you are in a situation where a backup, restore, or filesystem check is taking a suspiciously long time to complete, this is the best place to look.

At a high level, GSAN compares the current performance against an average measured over a preceding period.

A simple explanation of the above logging is this: the average performance for the tasks within the brackets was 53.72 MB/sec, measured over a period of time. The limit is set at 10 percent of that average (10 percent of 53.72 is 5.3722), and the warning fires because the current throughput of 0.88 MB/sec has fallen below that limit.
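
To make the arithmetic explicit, here is the check from the second log line reproduced as a one-liner (any tiny difference from the logged 5.3722 is just rounding in the displayed average):

awk 'BEGIN { avg=53.72; limit=avg*0.10; cur=0.88; printf "limit=%.4f current=%.2f warn=%s\n", limit, cur, (cur<limit) ? "yes" : "no" }'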

This indicates stress on the underlying storage, or something wrong with that particular storage in terms of performance. Since VDP runs as a virtual machine, the troubleshooting flow would be:

> Check the load on the VDP appliance itself. See if there is unusual load on the system and, if so, determine whether a process is hogging the resources (see the sketch after this list)
> If the VM level checks out, see if there are any issues with the DAVG or the VMFS file system. Perhaps there are multiple high-I/O VMs running on this storage and resource contention is occurring? I would start with vobd.log and vmkernel.log for that particular datastore's naa.ID, and then verify the device average (DAVG) latency for that device
> If this checks out too, the last part would be the storage array itself. Moving VDP to another datastore is not an ideal test, since these appliances are fairly large in size
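
For the first two checks, these are the commands I would start with; treat this as a sketch, with naa.ID standing in for the device backing your datastore and the grep pattern matching the usual ESXi latency warning:

# On the VDP appliance: check for unusual load and the top consumers
uptime
top -b -n 1 | head -20
# On the ESXi host: scan for latency warnings against the backing device
grep naa.ID /var/log/vmkernel.log | grep -i "performance has deteriorated"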

Hope this helps!

Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as of this post, you might know that I have moved out of VMware and ventured further into the backup and recovery solutions domain. Currently, I work as a solutions engineer at Rubrik.

There are a lot of instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this also applies to vSphere Replication), and the common error that pops up at the bottom right of the web client is Server certificate chain not verified:

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception.CertificateValidationException: Server certificate chain not verified.

This article will briefly cover only the embedded Platform Services Controller deployment model; similar logic can be extrapolated to external deployments.

These issues are typically seen when:
> The PSC is migrated from embedded to external
> Certificates are replaced on the vCenter

I will be simplifying this KB article here for reference. Before proceeding, take a powered-off snapshot of the PSCs and vCenters involved.

So, for embedded deployment of VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command will give you the SSL trust that is currently stored on your PSC. Now, consider that you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you should see just one output, where the URL section is the FQDN of your current PSC node, and associated with it is its current SSL trust.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=

If this is your case, proceed to step (2); if not, jump to step (3).

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This is the current SSL certificate used by your deployment after the certificate replacement. Here, look at the extract which shows the server certificate chain.

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So, from here, the chain obtained from the first command (the SSL trust currently stored in the PSC) does not match the chain from the second command (the actual SSL certificate in use). It is this mismatch that produces the chain not verified message in the UI.

To fix this, the logic is: find all the services using the thumbprint of the old SSL trust (step 1) and update them with the thumbprint from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust you obtained from the first command into Notepad++; everything starting from MIID... in my case (no need to include the SSL trust label itself).

B) Each line of the chain should contain 64 characters. In Notepad++, place the cursor after a character and watch what the Col indicator reads at the bottom; hit Enter at the point where it reads Col: 65 (that is, after the 64th character).
Format the complete chain this way (the last line may have fewer characters, which is okay).

C) Append -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- before and after the chain (5 hyphens are used before and after)

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will notice a hexadecimal string with a space after every two characters. Copy this into Notepad++ and replace each space with a colon, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88
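
If you prefer the command line over Notepad++, steps (B) through (E) can be collapsed into a few commands on the appliance itself; a minimal sketch, assuming the base64 blob from step 1 was saved into a file named chain.b64 (a name used here purely for illustration):

( echo "-----BEGIN CERTIFICATE-----"
  fold -w 64 chain.b64
  echo "-----END CERTIFICATE-----" ) > old_trust.cer
# Prints the SHA1 fingerprint already colon-separated
openssl x509 -in old_trust.cer -noout -fingerprint -sha1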

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory. 

G) Run the update thumbprint operation using the below commands:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will basically look for all the services using the old thumbprint (the one ending in 04:83:88) and update them with the current thumbprint from new_machine.crt.

Note: Once you paste the thumbprint into the SSH session, remove any extra spaces at the beginning and end of the thumbprint. I have seen the update fail because of this, as it picks up special characters in a few cases (special characters that you would not see in the terminal). So remove the space after the fingerprint and re-add it, and do the same before the --certfile switch too.
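
To sidestep invisible characters entirely, you can sanitize the string before pasting it; a small sketch using the example thumbprint from above:

fp=' 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 '
# Drop everything that is not a hex digit or a colon
printf '%s' "$fp" | tr -cd '0-9A-Fa-f:'; echo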

Re-run the commands in steps 1 and 2, and the SSL trusts should now match. If they do, log back into the web client and you should be good to go.

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you might see two PSC entries in the output of the step 1 command. Both PSCs have the same URL, and they might or might not have the same SSL trust.

Case 1:
Multiple entries with the same URL but different SSL trusts.

This is the easy one. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts, and one of those SSL trusts will match the current certificate from step 2. The one that does not match is therefore the stale entry and can be removed from the STS store.

You can remove it from the CLI; however, I stick to using the Jxplorer tool to remove it from the GUI. You can connect to the PSC from Jxplorer using this KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations. 
One of the fields in the output of the step 1 command is the Service ID, which looks similar to:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate this similar service ID in the service registrations and you should be good to remove it. 

Case 2:
Multiple entries with the same URL and the same SSL trust.

In this case, the output from step 1 shows two identical PSC URLs with the same SSL trust, which might or might not match the output from step 2.

The first step of this fix is:

Note down both service IDs from the output of step 1. Connect with Jxplorer as mentioned above. Select a service ID and, on the right side, click the Table Editor view and click Submit. You can then view the last-modified date of that service registration. The service ID with the older last-modified date is the stale registration and can be removed via Jxplorer. Now, when you run the command from step 1, it should return a single entry. If its SSL trust matches the thumbprint from step 2, great! If not, the additional step of updating the thumbprint needs to be performed.

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in ELM, the command from step 1 should return two entries with two different URLs (the production and DR PSCs), since they replicate with each other. This of course changes if there are multiple PSCs replicating with or without a load balancer. That process is too complex to explain in text, so in that event it is best to involve VMware Support for assistance.

Hope this helps!

Wednesday, 30 May 2018

Unable To Configure Or Reconfigure Protection Groups In SRM: java.net.SocketTimeoutException: Read timed out

When you try to reconfigure or create a new protection group, you might run into the following message when you click the Finish option.

 java.net.SocketTimeoutException: Read timed out

Below is a screenshot of this error:


In the web client logs for the respective vCenter, you will see the below logging:

[2018-05-30T13:49:14.548Z] [INFO ] health-status-65 com.vmware.vise.vim.cm.healthstatus.AppServerHealthService Memory usage: used=406,846,376; max=1,139,277,824; percentage=35.7109010137285%. Status: GREEN
[2018-05-30T13:49:14.549Z] [INFO ] health-status-65   c.v.v.v.cm.HealthStatusRequestHandler$HealthStatusCollectorTask Determined health status 'GREEN' in 0 ms
[2018-05-30T13:49:20.604Z] [ERROR] http-bio-9090-exec-16 70002318 100003 200007 c.v.s.c.g.wizard.addEditGroup.ProtectionGroupMutationProvider Failed to reconfigure PG [DrReplicationVmProtectionGroup:vm-protection-group-11552:672e1d34-cbad-46a4-ac83-a7c100547484]:  com.vmware.srm.client.topology.client.view.PairSetup$RemoteLoginFailed: java.net.SocketTimeoutException: Read timed out
        at com.vmware.srm.client.topology.impl.view.PairSetupImpl.remoteLogin(PairSetupImpl.java:126)
        at com.vmware.srm.client.infraservice.util.TopologyHelper.loginRemoteSite(TopologyHelper.java:398)
        at com.vmware.srm.client.groupservice.wizard.addEditGroup.ProtectionGroupMutationProvider.apply(ProtectionGroupMutationProvider.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vise.data.provider.DelegatingServiceBase.invokeProviderInternal(DelegatingServiceBase.java:400)


Caused by: com.vmware.vim.vmomi.client.exception.ConnectionException: java.net.SocketTimeoutException: Read timed out
        at com.vmware.vim.vmomi.client.common.impl.ResponseImpl.setError(ResponseImpl.java:252)
        at com.vmware.vim.vmomi.client.http.impl.HttpExchange.run(HttpExchange.java:51)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingBase.executeRunnable(HttpProtocolBindingBase.java:226)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingImpl.send(HttpProtocolBindingImpl.java:110)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.sendCall(MethodInvocationHandlerImpl.java:613)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.executeCall(MethodInvocationHandlerImpl.java:594)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl.completeCall(MethodInvocationHandlerImpl.java:345)

There might be an issue with the ports between vCenter and the SRM server; you can validate those ports using this KB here.

If the ports are fine, then validate that no guest-level security agents on the SRM or vCenter (Windows) servers are blocking this traffic.
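
A quick way to test basic reachability of the vCenter SDK endpoint from the SRM server is a plain HTTPS request; a sketch assuming curl is available on the host you test from:

# Any HTTP response (even an error body) is fine; a hang or timeout here
# points at ports or an intermediate firewall
curl -vk --max-time 10 https://vCenter_FQDN:443/sdk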

In my case the network connection and firewall / security settings were fine, and the fix was to perform a Modify of the SRM installation at both sites. Once this was done, the site pairing was reconfigured, and post this we were able to reconfigure the protection groups successfully.

Tuesday, 22 May 2018

VDP 6.1.8 Upgrade Stalls At 97 Percent

When upgrading a vSphere Data Protection appliance to 6.1.8, there might be a case where the upgrade stalls at 97 percent, with the below task stuck in progress:


The avinstaller.log.0, located under /usr/local/avamar/var/avi/server_logs, has the following message:

May 21, 2018 12:07:38 PM org.hibernate.transaction.JDBCTransaction commit
SEVERE: JDBC commit failed
java.sql.SQLException: database is locked
        at org.sqlite.DB.throwex(DB.java:855)
        at org.sqlite.DB.exec(DB.java:138)
        at org.sqlite.SQLiteConnection.commit(SQLiteConnection.java:512)
        at org.hibernate.transaction.JDBCTransaction.commitAndResetAutoCommit(JDBCTransaction.java:170)
        at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:146)
        at org.jbpm.pvm.internal.tx.HibernateSessionResource.commit(HibernateSessionResource.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransaction.commit(StandardTransaction.java:139)
        at org.jbpm.pvm.internal.tx.StandardTransaction.complete(StandardTransaction.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransactionInterceptor.execute(StandardTransactionInterceptor.java:57)

This is because the avidb is locked. To resume the upgrade, restart the avinstaller service by issuing the below command:
# avinstaller.pl --restart

Post the restart, the upgrade should resume successfully. Since this is the 126th package, the upgrade completes fairly quickly after the service restart, and you might not notice further progress in the UI. In that case, go back to avinstaller.log.0 and verify that the upgrade has completed.
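
To watch for completion without relying on the UI, you can follow the installer log directly (same path as above):

tail -f /usr/local/avamar/var/avi/server_logs/avinstaller.log.0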

Hope this helps.

Tuesday, 15 May 2018

Bad Exit Code: 1 During Upgrade Of vSphere Replication To 8.1

With the release of vSphere Replication 8.1 comes a ton of new upgrade and deployment issues. The one common issue is the Bad Exit Code: 1 error during the upgrade phase. This is valid for 6.1.2 or 6.5.x to 8.1 upgrades.

The first thing you will notice in the GUI is the following error message.


If you Retry, the upgrade will still fail; if you Ignore, the upgrade will proceed, but you will then notice a failure during the configuration section.


Only after a "successful" failed upgrade can we access the logs to see what the issue is.

There is a log called hms-boot.log which records all of this information; it can be found under /opt/vmware/hms/logs.

Here, the first error was this:

----------------------------------------------------
# Upgrade Services
Stopping hms service ... OK
Stopping vcta service ... OK
Stopping hbr service ... OK
Downloading file [/opt/vmware/hms/conf/hms-configuration.xml] to [/opt/vmware/upgrade/oldvr] ...Failure during upgrade procedure at Upgrade Services phase: java.io.IOException: inputstream is closed

com.jcraft.jsch.JSchException: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:315)
        at com.jcraft.jsch.Channel.connect(Channel.java:152)
        at com.jcraft.jsch.Channel.connect(Channel.java:145)
        at com.vmware.hms.apps.util.upgrade.SshUtil.getSftpChannel(SshUtil.java:66)
        at com.vmware.hms.apps.util.upgrade.SshUtil.downloadFile(SshUtil.java:88)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.downloadConfigFiles(Vr81MigrationUpgradeWorkflow.java:578)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$3(Vr81MigrationUpgradeWorkflow.java:1222)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)
Caused by: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2911)
        at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:262)
        ... 24 more

Then, when I proceeded with Ignore, the error was this:

# Reconfigure VR
Failure during upgrade procedure at Reconfigure VR phase: null

java.lang.NullPointerException
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.vrReconfig(Vr81MigrationUpgradeWorkflow.java:1031)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$5(Vr81MigrationUpgradeWorkflow.java:1253)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)

When we proceeded with Ignore once more, the last stack was this:

Initialization error: Bad exit code: 1
Traceback (most recent call last):
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 178, in main
    __ROUTINES__[name]()
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 714, in get_default_sitename
    ovf.hms_cache_sitename()
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 686, in hms_cache_sitename
    cache_f.write(hms_get_sitename(ext_key, jks, passwd, alias))
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 679, in hms_get_sitename
    ext_key, jks, passwd, alias
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 412, in get_sitename
    output = commands.execute(cmd, None, __HMS_HOME__)[0]
  File "/opt/vmware/share/htdocs/service/hms/cgi/commands.py", line 324, in execute
    raise Exception('Bad exit code: %d' % proc.returncode)
Exception: Bad exit code: 1

So it looks like there was an issue with copying files from the old vR server to the new one. In the sshd_config file under /etc/ssh/ on the old vR server, the following entry was present:

Subsystem sftp /usr/lib64/ssh/sftp-server

Edit this line so that it reads:
Subsystem sftp /usr/lib/ssh/sftp-server
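
After editing, it is worth confirming that the corrected path actually exists and reloading sshd so the change takes effect; a sketch, assuming root access on the old vR appliance:

# Confirm the binary really lives at the corrected path
ls -l /usr/lib/ssh/sftp-server
# Restart sshd with whichever init system the appliance uses
systemctl restart sshd 2>/dev/null || service sshd restart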

Then retry the upgrade by deploying a fresh 8.1 appliance and going through the "upgrade" process again; this time it should complete successfully.

Hope this helps!