Monday, 17 January 2022

Removing A Stale Inaccessible NFS Datastore From vCenter

I recently ran into an issue where one of the NFS datastores, backed by a NAS share from one of my backup appliances (Rubrik), had become inaccessible.

I looked around to see what had happened and noticed that the NAS share had been destroyed from the backup end in a rather unclean way, which left some stale entries on the VMware side.


The highlighted datastore was one of the datastores in question, and I had to get rid of its entry from both vCenter and the ESXi host. The VM residing on it no longer needed to be salvaged, so it was fine to let go of any data on it, since the backing share had been destroyed anyway.

First, I had to disassociate the datastore from the ESX host, which was straightforward. From an SSH session to ESXi, all I had to do was run:

esxcli storage nfs remove -v <datastore_name> 

This removed the host's relationship with the datastore, and the UI displayed something like the below:



You cannot unmount the datastore in this state, because no ESXi hosts are associated with it any longer. The entry has to be cleaned up manually from the vCenter Postgres database.

Before making any changes to VCDB, ensure you have a vCenter backup, snapshot or whatever else is necessary to revert to a working state if something goes wrong.

Stop the vmware-vpxd service using

service-control --stop vmware-vpxd

Connect to the VCDB using 

/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

Then run the below query, replacing the datastore name that you want to clean up 

SELECT id from vpx_entity where name='rubrik_9c65faa5cbe34dd6a9a796f1b4d12c20';

You will then see an output such as

VCDB=# SELECT id from vpx_entity where name='rubrik_9c65faa5cbe34dd6a9a796f1b4d12c20';
id
-------
 10369
(1 row)

Using this ID as the ds_id, you can view the datastore assignment:

VCDB=# select * from vpx_ds_assignment where ds_id=10369;
 ds_id | entity_id | accessible | mount_path | mount_id | mount_mode | mounted
-------+-----------+------------+------------+----------+------------+---------
 10369 |     10370 |            |            |          |            |
(1 row)

Then all you need to do is remove the entry from the following three tables:

VCDB=# delete from vpx_ds_assignment where ds_id=10369;
VCDB=# delete from vpx_datastore where id=10369;
VCDB=# delete from vpx_vm_ds_space where ds_id=10369;

Replace the IDs accordingly.

Once done, start the vmware-vpxd service using
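To reduce the risk of a typo when substituting the ID, the three statements can be generated from a single value and reviewed before they are pasted into psql. The following is only a minimal sketch: `gen_ds_cleanup_sql` is a hypothetical helper name, and wrapping the deletes in `BEGIN`/`COMMIT` is my own assumption, added so that the three deletes are applied together or not at all. The table and column names are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of vCenter): prints the cleanup SQL for one
# datastore ID so it can be reviewed before being run in psql.
gen_ds_cleanup_sql() {
    ds_id="$1"
    # Accept only a plain non-negative integer, nothing else.
    case "$ds_id" in
        ''|*[!0-9]*)
            echo "usage: gen_ds_cleanup_sql <numeric_ds_id>" >&2
            return 1
            ;;
    esac
    # BEGIN/COMMIT is an assumption of this sketch, not part of the
    # original procedure.
    cat <<EOF
BEGIN;
delete from vpx_ds_assignment where ds_id=${ds_id};
delete from vpx_datastore where id=${ds_id};
delete from vpx_vm_ds_space where ds_id=${ds_id};
COMMIT;
EOF
}

# Example: print the statements for the ID found earlier.
gen_ds_cleanup_sql 10369
```

Swapping `COMMIT;` for `ROLLBACK;` in the printed text gives a dry run that reports how many rows each delete would touch without changing anything.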

service-control --start vmware-vpxd

The stale datastore entry should no longer be visible.


Hope this helps.

Sunday, 16 January 2022

What Is Linux Load Average

If you have ever used a Linux system or worked as a system administrator, you will have come across the term load average. Whenever there is a performance issue with a system, one of the first questions you hear is: what is the load average on the server?

In this article, let's break the topic down into three parts:

  • What is load average? 
  • How do we calculate if we have sufficient load? 
  • How do we determine if the system is overloaded? 

What is load average

Load average is the system load expressed as the average number of process threads that are either running or waiting to run. On Linux it also counts threads in uninterruptible sleep, which are typically waiting on disk I/O. It is reported over three intervals: the last 1, 5 and 15 minutes.

The load average is generally observed from common commands like top and uptime.

For example: 

[root@centos7 ~]# uptime
08:32:53 up 47 days, 21:09,  1 user,  load average: 0.05, 0.15, 0.20
This output shows that the load averages over the last 1, 5 and 15 minutes are 0.05, 0.15 and 0.20 respectively.
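If you want those three numbers in a script rather than reading them by eye, they can be pulled out of the uptime line with sed. This is a rough sketch; `parse_load_averages` is a name I made up for illustration, and it assumes the output uses a dot as the decimal separator, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: reads a line of uptime output on stdin and prints
# the 1-, 5- and 15-minute load averages separated by spaces.
parse_load_averages() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Example using the uptime line shown above.
echo "08:32:53 up 47 days, 21:09,  1 user,  load average: 0.05, 0.15, 0.20" | parse_load_averages
# prints: 0.05 0.15 0.20
```

On Linux the same three values are also the first three fields of /proc/loadavg, which avoids parsing altogether.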

To understand these values, let's take an example that covers both questions: whether the system has capacity to spare and whether it is overloaded.

Consider a system with a single-core processor, which is equivalent to a one-lane bridge. If there are no cars on the bridge, it is completely free of load and the load average is 0. If one car is crossing the bridge, the lane is occupied, the bridge is at capacity and the load average is 1. If any additional cars arrive, they have to queue: the load exceeds what the bridge can handle, and with one car crossing and one waiting the load average rises to 2. The system is now overloaded.

The same concept extends to systems with multiple cores, where each additional core is essentially another lane on the bridge.
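The bridge analogy can be written down as a small comparison between the load average and the number of lanes (cores). This is only an illustrative sketch: `classify_load` and the four labels are my own naming, and real systems rarely sit on exactly 0 or exactly the core count.

```shell
#!/bin/sh
# Hypothetical helper: classify_load <load_average> <core_count>
# Maps a load average onto the bridge analogy for a given number of cores.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        load += 0; cores += 0            # force numeric comparison
        if (load == 0)          print "idle"
        else if (load < cores)  print "below capacity"
        else if (load == cores) print "at capacity"
        else                    print "overloaded"
    }'
}

classify_load 0.00 1   # no cars on the bridge
classify_load 1.00 1   # one car on a one-lane bridge
classify_load 2.00 1   # one car crossing, one waiting
classify_load 2.00 4   # two cars on a four-lane bridge
```

The four calls print idle, at capacity, overloaded and below capacity, in that order.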

What about multiprocessor and multicore systems


If you have a single processor with 4 cores, for example, it is a single-processor quad-core system, also called a multicore system.
If you have multiple processors with one or more cores each, it is called a multiprocessor system.

So for a single-core system, a load average between 0 and 1 is within capacity; for dual-core, 0 to 2; for quad-core, 0 to 4; and so on. A value above the core count means threads are waiting for CPU time.

On top of this, you can have a system with hyperthreading enabled. Under hyperthreading, a single physical CPU core appears as two logical CPU cores to the operating system. Since a single core can otherwise execute only one thread at a time, hyperthreading was introduced to allow two threads to share the same core.
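A convenient way to compare systems with different core counts is to divide the load average by the number of cores, so that 1.00 always means "at capacity". The sketch below assumes that simple division is a good enough normalisation; `load_per_core` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: load_per_core <load_average> <core_count>
# Prints the load average divided by the core count, to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores + 0 <= 0) {
            print "core count must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", load / cores
    }'
}

load_per_core 3.00 4   # prints 0.75: a quad-core system at 75% of capacity
load_per_core 0.20 1   # prints 0.20
```

On a live Linux system the two arguments could come from `cut -d' ' -f1 /proc/loadavg` and `nproc`.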

You can find your CPU and core count, along with hyperthreading details, from the lscpu command:

[root@centos7 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
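The fields that matter for interpreting load average are Thread(s) per core, Core(s) per socket and Socket(s): multiplied together they give the number of logical CPUs, and a thread count above 1 indicates hyperthreading. As a rough sketch, assuming lscpu prints those three labels exactly as in the output above (some platforms omit or rename them), they can be summarised like this. `summarize_lscpu` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: reads lscpu output on stdin and prints the number of
# logical CPUs and whether more than one thread per core is reported.
summarize_lscpu() {
    awk -F': *' '
        /^Thread\(s\) per core:/ { threads = $2 }
        /^Core\(s\) per socket:/ { cores   = $2 }
        /^Socket\(s\):/          { sockets = $2 }
        END {
            printf "logical_cpus=%d smt=%s\n",
                   threads * cores * sockets,
                   (threads > 1 ? "yes" : "no")
        }'
}

# Typical use on a live system:
#   lscpu | summarize_lscpu
```

For the output shown above, this would report one logical CPU and no hyperthreading.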


So the next time you are troubleshooting system performance, keep the load average in mind and read it against the number of cores available.