Saturday, 2 January 2016

VMware Labs: ESXtopNGC Plugin Fling

Written by Suhas Savkoor



Esxtop is a command-line utility available with vSphere that shows the resource usage of your environment in terms of CPU, memory, disk latency and so on. Until now, if you wanted to observe these values, you had to run the "esxtop" command over an SSH session to the host.

With this fling, esxtop is now integrated as a GUI in the vSphere Web Client.
You can download the fling at this link.

The documentation states that this is available only for the Linux-based vCenter, which is the appliance. However, it can also be applied to a Windows-based vCenter.

Procedure:
1. Download the zip file from the provided link.
2. Extract the zip into a desired location.
3. Copy the zip file to the vSphere Web Client installation location. The default location is:

C:\Program Files\VMware\Infrastructure\vSphereWebClient\plugins-packages

4. Restart the Web Client service (see the example after this list).
5. Log in to the Web Client, select a host, and navigate to the Monitor tab. You will notice a sub-tab called "top", which is the esxtop GUI.
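On a Windows-based vCenter, the Web Client service from step 4 can be restarted from an elevated command prompt. The service name below is the one usually found on 5.x/6.0 installs, but verify it in services.msc before relying on it:

net stop vspherewebclientsvc
net start vspherewebclientsvc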



You can change the refresh rate of the displayed counters by modifying the value under the "Set Refresh Rate" option.
You can also export the readings to a CSV file using the Start/Stop exporting stats option.

You can also cycle through the CPU, memory, disk and adapter performance tabs, and choose which counter values are displayed by changing the "Select Display Counters" option.

For a complete list of esxtop counter explanations:
http://www.yellow-bricks.com/esxtop/

The counter updates are noticeably slower than the command-line esxtop, but the GUI is much easier to navigate.

+1 to this plugin.

VMware Labs: ESXi Embedded Host Client Fling



Written by Suhas Savkoor



Until now, ESXi management was done in two ways: either by connecting the host to a vCenter Server and logging into that vCenter via the vSphere Client or Web Client, or by logging into the host directly with the vSphere Client. That worked well enough, but I recently came across the ESXi Embedded Host Client and found it better than connecting to the host with the vSphere Client for two main reasons: easy navigation, and no need to open a PuTTY session just to get at the logs.

You can download the ESXi Embedded Host Client from this link.

The setup is pretty simple and straightforward. 

1. Enable SSH on the host for which you want to configure the Host Client.

Two paths to proceed further:

Path 1:
From a machine with SSH access to the host, run the following command:

 ssh root@<esxip> esxcli software vib install -v <URL>

How do you get the URL?
On the Fling page, from the drop-down on the left side, select esxui-signed.vib and right-click the download button to copy the link address.
Then go back to the SSH session and, in place of <URL>, press Shift+Insert or right-click to paste the URL.

Run the command, and it will prompt for the root password. The VIB installation completion status will be displayed. 

Path 2:
Offline Install:

If you are using a 5.x host, choose the 5.x bundle from the drop-down; otherwise choose the 6.x bundle.
Check the Agree and download box and click Download.

Log in to the host via the vSphere Client and upload the VIB to a datastore.


Open an SSH session to this host and run the command below:
scp /path/to/downloaded/vib/esxui.vib root@<esxip>:/tmp

For my environment, this looks something like:
scp /vmfs/volumes/Protected_LUN/VMware_bootbank_esx-ui_0.0.2-0.1.3357452.vib root@192.168.1.176:/tmp 

Enter the password when prompted. This copies the VIB from the datastore to the /tmp folder of the host.

Then to install the VIB run the following command:

ssh root@<esxip> esxcli software vib install -v /tmp/esxui.vib 

It will ask for the password of the ESXi host. Once completed, the output looks something like this:



Host reboot is not required.
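If you want to confirm the VIB actually landed, an optional sanity check (not part of the Fling instructions) is to list the installed VIBs and filter for the UI package:

esxcli software vib list | grep -i esx-ui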

To manage this:

Open a browser and enter the URL: https://<host_IP>/ui
You might receive: 503 Service unavailable


To resolve this:
1. Go back to the SSH session of this host.
2. Change to the following directory:
# cd /etc/vmware/rhttpproxy/
3. Here you will see a file called endpoints.conf.
Open this file using an editor:

#vi endpoints.conf
4. Press "i" to begin edit and remove the following line:
/ui local 8308 redirect allow
5. Save the file by pressing ESC and then typing :wq!
6. Restart the rhttpproxy service using the below command:
/etc/init.d/rhttpproxy restart

Log in to the client again using the same address and you will be presented with the login screen.


Logging in with root credentials will give you the management page:


Now, there are a lot of options here that you can explore, from deploying VMs to managing the host.
The part I liked best is the ESXi logs. If you click the Monitor option on the left side and go to the Logs tab, you get the list of ESXi logs that are available. Selecting a log lets you view it in the window below.

The Host Client is great; I will be actively testing it and will update this article if need be.

Cheers!

Friday, 1 January 2016

vCenter 6.0 deployment models

Written by Suhas Savkoor



With 5.1 and 5.5, the vCenter components were Single Sign-On, Web Client, Inventory Service and the vCenter Server itself. These components could all be deployed on one single machine as a Simple Install. If we wanted redundancy, or to remove a single point of failure, we could deploy each component on a different machine. We could also mix the installation, grouping certain components on certain sets of machines.

With vCenter 6.0, we have the concept of the Platform Services Controller (PSC) and the management node. With the PSC and management node in the picture, most of the vCenter 5.x components have been consolidated into these two roles.

The PSC includes the below services:
  • VMware Single Sign On
  • VMware License Service
  • VMware Lookup Service and Directory Service
  • VMware Certificate Authority

The Management Server node includes the following:
  • vCenter Server
  • Web Client 
  • Inventory Service
  • Other Services that were seen in the 5.5 installation ISO: (Auto Deploy, Dump Collector, Syslog Collector)
With these two components in the picture, vCenter 6.0 has a flexible deployment topology. Let's take a look at a couple of recommended and non-recommended types:

Recommended Types:
1. vCenter Server with Embedded PSC:

In an embedded PSC deployment, the PSC component and the vCenter component reside on the same machine. This can be a physical machine or a virtual machine running on an ESXi host.
This type of deployment is used for small environments and is the easiest of all to deploy and maintain. However, it comes with certain drawbacks: the node is a single point of failure, and replication between multiple embedded nodes is not supported.


This type of deployment is supported both on Windows and Appliance based vCenter.

2. vCenter Server with External PSC:

In this type of deployment, the PSC is deployed on a separate machine (physical or virtual).
You can have multiple vCenter Servers (management nodes) connected to and served by one PSC node.
This also lets you deploy vCenter Servers in linked mode, with two or more vCenter Servers connected to the same PSC instance and therefore joined to the same PSC domain.


This type of deployment is supported both on Windows and Appliance based vCenter.

This type of deployment is also known as Enhanced Linked Mode with External PSC Without HA. 

3. vCenter Server with External PSC in HA:

In the previous deployment type, if the PSC node goes down, vCenter authentication fails. In the PSC-with-HA deployment, the PSC is made highly available by adding another PSC node. Each PSC node resides on a VM of its own, and both sit behind a load balancer. The vCenter Servers are joined to the PSC domain using the load balancer address, which is shared among the PSC nodes. So even if one PSC node goes down, authentication for the vCenter nodes is provided by the remaining PSC node.


This deployment is again supported for both Windows and appliance-based vCenter.

These 3 types of deployments are supported and recommended by VMware. 

There are a couple more vCenter deployment models which are supported but not recommended by VMware. Let's take a look at these deployments:

Non Recommended Types:
1. Enhanced Linked Mode with Embedded PSC:

As we saw earlier, linked mode with embedded PSCs is not recommended. However, the deployment can still be done for a test environment. Here you deploy your first embedded node (PSC and management node) on a single physical or virtual machine. You then deploy the other embedded nodes, with their embedded PSCs joined to the domain of the first embedded node.


2. Combination of Embedded PSC and External PSC:

In this type of deployment, the first node is an embedded node: the PSC and management server are deployed on the same virtual machine.
The second node is an external PSC, deployed on a separate machine and joined to the domain of the embedded PSC. Multiple management nodes can then be deployed and connected to this secondary (external) PSC.


3. Enhanced linked mode using Embedded PSC only:

In this type of deployment, the first node is again an embedded node. The second node does not have a PSC of its own; it is just a management node, connected to the embedded PSC's domain.


Bottom Line:

If only one vCenter instance is required, you can go for the embedded PSC deployment. You can protect the node by using vSphere HA or Fault Tolerance. Small environments generally fall into the embedded deployment category.


Upgrade Basics from 5.1 or 5.5 to 6.0 

1. Upgrade Simple Install 5.1/5.5 to Embedded PSC 6.0

Here, you currently have a Simple Install of a 5.1/5.5 vCenter: SSO, Web Client, Inventory Service and vCenter all on the same machine. This machine needs to be upgraded to 6.0 with an embedded PSC. As mentioned above, in an embedded deployment the vCenter and PSC components reside on the same machine.

So the upgrade process is straightforward: mount the ISO, upgrade the PSC components first and then the management components.

2. Upgrading Simple Install 5.1/5.5 to External PSC 6.0

In this scenario the 5.1/5.5 deployment is again a Simple Install. However, when upgrading to 6.0, you want the PSC on a separate machine to provide Enhanced Linked Mode capabilities.

In this case:
  • Deploy a new Windows Server machine and install only the current (5.x) version of SSO on it
  • Re-point your Web Client, Inventory Service and vCenter Server to this new SSO instance on the separate machine
  • Upgrade the SSO node to 6.0 first; it becomes the external PSC
  • Upgrade the vCenter node to 6.0 next
  • Uninstall the old 5.x SSO that was left on the original Simple Install machine, as it is no longer being used

More information and detailed steps for installs and upgrades can be found at the link below:


Picture credits to PenguinPunk

Friday, 25 December 2015

Understanding VMkernel.log for vMotion Operation

Written by Suhas Savkoor



Let's decode the vMotion logging in VMkernel.log.

Open an SSH (PuTTY) session to the host where the virtual machine currently resides. Change to the directory where the live logs are kept:
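On a default install the live ESXi logs sit under /var/log (which links into the scratch location), so assuming that layout:

# cd /var/log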



Capture the live logging of VMkernel using the following command:
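A plain tail of vmkernel.log is the usual way to do this:

# tail -f /var/log/vmkernel.log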



Perform a vMotion of a virtual machine residing on this host to any other available host with shared storage. You will see logging like the below:

I will break down the logging using "//" for comments.

2015-12-25T16:39:25.565Z cpu4:2758489)Migrate: vm 2758492: 3284: Setting VMOTION info: Source ts = 1451061663105920, src ip = <192.168.1.176> dest ip = <192.168.1.177> Dest wid = 1830931 using SHARED swap

// The first line, Migrate: vm 2758492, does not tell you by name which virtual machine is being migrated; it gives the world ID of the virtual machine that is going to be migrated. To map the world ID to a virtual machine, run the command # esxcli vm process list before migrating. This command lists the world IDs of all the virtual machines residing on the host.

// In the Setting VMOTION info line, 1451061663105920 is the vMotion ID. This ID is useful because grepping for it in hostd.log or in vmware.log (residing in the virtual machine's directory) gives you further information about the vMotion. In vmware.log you can see the transitioning states of the vMotion, with each state performing a set of steps.
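For example, something along these lines pulls the matching entries out of the logs; the datastore and VM folder names are placeholders for your own environment:

# grep 1451061663105920 /vmfs/volumes/<datastore>/<vm_folder>/vmware.log
# grep 1451061663105920 /var/log/hostd.log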

// The source IP, where this virtual machine currently resides, is 192.168.1.176, and the destination it is being migrated to is 192.168.1.177.

// The dest wid 1830931 is the world ID this virtual machine will have on the destination host once the vMotion is completed.


2015-12-25T16:39:25.567Z cpu4:2758489)Tcpip_Vmk: 1288: Affinitizing 192.168.1.176 to world 2772001, Success
2015-12-25T16:39:25.567Z cpu4:2758489)VMotion: 2734: 1451061663105920 S: Set ip address '192.168.1.176' worldlet affinity to send World ID 2772001
2015-12-25T16:39:25.567Z cpu4:2758489)Hbr: 3340: Migration start received (worldID=2758492) (migrateType=1) (event=0) (isSource=1) (sharedConfig=1)

// Here the host is being prepared for migration, taking its vMotion IP address into consideration.

// The Migration start received line logs the vMotion type. The world ID 2758492 is recorded, and migrateType=1 indicates a host migration (vMotion).

// The host I am currently logged into via SSH is the source host, which is why it shows isSource=1 and sharedConfig=1.


2015-12-25T16:39:25.567Z cpu5:2771999)CpuSched: 583: user latency of 2771999 vmotionStreamHelper0-2758492 0 changed by 2771999 vmotionStreamHelper0-2758492 -1
2015-12-25T16:39:25.568Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'

// Here the connection is bound to the source host's vMotion VMkernel interface.

2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2096: Accepted connection from <::ffff:192.168.1.177>

// Here a connection from the destination host (192.168.1.177) has been accepted for the vMotion.


2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2166: dataSocket 0x410958a8dc00 receive buffer size is -565184049
2015-12-25T16:39:25.570Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.571Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 1 added.
2015-12-25T16:39:25.571Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.572Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 2 added.

// Both the source and destination have established their connections and the vMotion proceeds. The vmkernel.log does not record the fine-grained details of the vMotion; if you check the vmware.log for this virtual machine, you can see the states and progress of the vMotion in detail.

2015-12-25T16:39:25.848Z cpu3:2758492)VMotion: 4531: 1451061663105920 S: Stopping pre-copy: only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~2.116 MB/s, 52403100% t2d)

// In short, this is how vMotion works:


  • A shadow VM is created on the destination host.
  • Each memory page is copied from the source to the destination over the vMotion network. This is known as the pre-copy.
  • Another pass is made over the VM's memory, copying any pages that changed during the last pre-copy iteration.
  • The pre-copy iterations continue until no changed pages remain.
  • The source VM is stunned and the destination VM is resumed.
// Basically, the memory state of the virtual machine is transferred to the shadow virtual machine created on the destination host. The memory is nothing but pages, and those pages are transferred to the shadow VM over the vMotion network. The busier the VM's memory, the longer the vMotion process takes.

// Towards the end of the vMotion the source VM must be destroyed and operations must continue at the destination end. For this, ESXi has to determine that the last few memory pages can be transferred to the destination quickly enough, which is the switchover time goal of 0.5 seconds.

// So when it says only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds, it means there are no more active memory pages left to transfer. The host therefore declares that the source VM can be destroyed, the vMotion can complete and the destination VM can resume, all within the feasible switchover time.


2015-12-25T16:39:25.952Z cpu5:2772001)VMotionSend: 3643: 1451061663105920 S: Sent all modified pages to destination (no network bandwidth estimate)

// Here it tells us that, for this vMotion ID, the source ("S") has sent all the modified memory pages to the destination.


2015-12-25T16:39:26.900Z cpu0:2758489)Hbr: 3434: Migration end received (worldID=2758492) (migrateType=1) (event=1) (isSource=1) (sharedConfig=1)
2015-12-25T16:39:26.908Z cpu3:32820)Net: 3354: disconnected client from port 0x200000c
2015-12-25T16:39:26.967Z cpu3:34039)DLX: 3768: vol 'Recovery_LUN', lock at 116094976: [Req mode 1] Checking liveness:

// Here the migration has completed for that world ID and migration type. The virtual machine, which in my case resides on Recovery_LUN, is now locked by the destination host under the new world ID that was assigned during the vMotion.


So now you know what a successful vMotion looks like in the vmkernel.log.
An in-depth view of the vMotion can be found in the vmware.log, which is fairly self-explanatory once you know what to look for and where to look.

Tuesday, 22 December 2015

Unable To Delete Orphaned/Stale VMDK File

Written by Suhas Savkoor



So today I got a case where we were trying to delete an orphaned flat.vmdk file.

A brief background of what was being experienced here:

There were three ESXi hosts and two datastores shared among these hosts. A couple of folders on these two shared datastores contained only -flat.vmdk files. These flat files were not associated with any virtual machines, and their last modified dates were about a year old.

However, every time we tried to delete the file from the datastore browser GUI, we got the error:

Cannot Delete File [Datastore Name] File_Name.vmdk


When we tried to delete the file from the command line using "rm -f <file_name>", we got the error:

rm: cannot remove 'File.vmdk': No such file or directory

Also, we were able to move the file to another datastore and remove it there successfully, but a stale copy of the file was still left behind on the original datastore.

So, how do we remove this stale file?

Step 1:

  • Take an SSH session to all the hosts that have access to the datastore where the stale file resides.
  • In my case, all three hosts in the cluster.

Step 2:

  • Run the command sketched below against the stale file. This command has to be executed from the SSH (PuTTY) session of every host that has connectivity to that datastore.
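What produces the output below is a lock release attempt against the stale flat file; a likely form is vmkfstools with its lock option, though treat the exact syntax as an assumption and adjust the path to your own datastore:

# vmkfstools -L release /vmfs/volumes/<datastore>/<folder>/<file>-flat.vmdk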

This can result in two error outputs:

First error:
Could not open /vmfs/volumes/xxxxxxxx/xxxxxxx/xxxxxx-flat.vmdk 
Command release failed Error: Device or resource busy

Second error:
Command release failed 
Error: Inappropriate ioctl for device


In my case it was the second error.

The host that gives you the second error has the stale lock on the file. In my case all three hosts returned the second error, and I had to reboot all three of them.

Once the hosts are rebooted, you can successfully remove the stale flat.vmdk files.

Note:
If the remove operation still fails, you will have to Storage vMotion all the VMs off the affected datastore, then delete the VMFS volume and reformat it.

Sunday, 20 December 2015

Configure Remote Syslog for ESXi host

Written by Suhas Savkoor



When you installed and set up an ESXi host, you would have configured a scratch location for all the host logging to go to, either on a local datastore or on a SAN.
You can also preserve your host logs on a remote machine, and configure log rotation to retain them for longer, by using syslog.

Here, I am going to configure my host logging so that all the ESXi logs go to a remote machine, in my case the Windows vCenter machine.

Step 1:

Installing the Syslog Collector:

On the ISO that you used to install your vCenter Server, you will find an option for the Syslog Collector.



Go Next and accept the EULA


Once you go next, you get an option to configure a couple of things:

  • First, where you want the Syslog Collector to be installed
  • Second, where the syslog data should be written
  • The log rotation file size for the host logs, which are created in .txt format
  • And how many log rotations should be retained

So basically, once the syslog text file reaches the rotation size, which by default is 2 MB, it is zipped and new logging goes into a new text file. By default, 8 rotated zipped files are retained at any one time.


Choose a type of installation that is required and go Next


The default TCP and UDP port used for syslog is 514; provide a custom port if required. If you use a custom port, document it, as you will need it later when configuring the hosts.


You can choose how the Syslog Collector is identified on the network, by either the vCenter IP or FQDN.


Click Next > Install and Finish once the installation is complete. 

Step 2:

Once the syslog collector is installed, it is then time to configure syslog for the required ESXi host. 

Take an SSH session to the host that needs the syslog configuration. Run the following command to check the current settings:
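On 5.x and later this is a single esxcli call:

# esxcli system syslog config get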


This shows the current logging configuration of the ESXi host. The output looks something like the below:


Notice that I do not have a remote syslog host configured yet.

Next, run the following command to point syslog at the required machine, using the required protocol and port:

For udp:
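Assuming the default port 514 and substituting your collector's IP, the UDP form is:

# esxcli system syslog config set --loghost='udp://<syslog_server_IP>:514'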


For tcp:
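And the TCP form, with the same placeholders:

# esxcli system syslog config set --loghost='tcp://<syslog_server_IP>:514'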


If you are using a custom port, then specify that custom port in the above command. 

Next, run the syslog reload command for the changes to take effect:
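The reload is a single esxcli call:

# esxcli system syslog reload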


Now, you may need to manually open the firewall rule set for syslog when redirecting logs. For this, we enable the syslog rule set in the host firewall and refresh the rules.
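The usual pair of commands for this, assuming the standard syslog ruleset name, is:

# esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
# esxcli network firewall refresh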



Now, let's check the Syslog Collector's data directory to see whether a log file has been created for the host.


The log file has been created, and when you review the syslog configuration on the host, you can now see the remote server's IP.


Cheers!

Thursday, 17 December 2015

How To Analyze PSOD

Written by Suhas Savkoor



The Purple Screen of Death, commonly known as a PSOD, is something most of us come across at some point when running ESXi hosts.

Usually when we experience a PSOD, we reboot the host (which is a must), then gather the logs and upload them to VMware support for analysis (where I spend a good amount of time going through them).

Why not take a look at the dumps by yourself?

Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged into the host over SSH. The command is:
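One way to trigger a test PSOD is through the vsish interface; the node path below is the commonly documented one for a manual crash, and obviously this should only ever be run on a lab host:

# vsish -e set /reliability/crashMe/Panic 1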



And when you open the DCUI of the ESXi host, you can see the PSOD:


Step 2:
Sometimes we might miss taking a screenshot of the PSOD. Well, that's alright! If a coredump location is configured for the ESXi host, we can extract the dump files to gather the crash logs.

Reboot the host if it is still sitting at the PSOD screen. Once the host is back up, log in over SSH (PuTTY) and go to the core directory, which is where the PSOD dump is written.
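With the default configuration the extracted dumps land in /var/core; assuming that location:

# cd /var/core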



Then list out the files here:
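A simple listing shows the dump file:

# ls -lh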



Here you can see the vmkernel dump file, and the file is in the zdump format.

Step 3:
How do we extract it?

Well, we have a nice extraction utility that does all the work, "vmkdump_extract". This command is executed against the zdump file, which looks something like this:
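Assuming the dump file carries the usual vmkernel-zdump name, the invocation looks like this (substitute whatever file name the listing above showed):

# vmkdump_extract vmkernel-zdump.1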



It creates four files:
a) vmkernel-log.1
b) vmkernel-core.1
c) visorFS.tar
d) vmkernel-pci

All we require for analysis is the vmkernel-log.1 file

Step 4:
Open the vmkernel-log.1 file using the below command:
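Any pager will do; less supports the Shift+G and PageUp navigation described below:

# less vmkernel-log.1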



Skip to the end of the file by pressing Shift+G, then slowly work your way back up with PageUp.
You will come across a line that says @BlueScreen: <event>

In my case, the dumps were:




  • The first line, @BlueScreen:, tells you the crash exception, such as Exception 13 or 14; in my case it is CrashMe, which indicates a manually triggered crash.
  • The VMK uptime line tells you the kernel uptime before the crash.
  • The logging after that is the information we need to look at: the cause of the crash.
The crash dump contents vary for every crash. The causes can range from hardware errors to driver issues to problems with the ESXi build, and a lot more.

Each dump analysis will be different, but the basics are the same.

So, you can try analyzing the dumps by yourself. However, if you are entitled to VMware support, I will do the job for you.


Cheers!