
Tuesday, 1 May 2018

VDP Backup And Maintenance Tasks Fail With "DDR result code: 5004, desc: nothing matched"

I came across a scenario where backups on a VDP appliance connected to a Data Domain failed consistently, and the backup logs for the avtar service contained the following snippet:

2018-04-24T20:01:16.180+07:00 avtar Info <19156>: - Establishing a connection to the Data Domain system with encryption (Connection mode: A:3 E:2).
2018-04-24T20:01:26.473+07:00 avtar Error <10542>: Data Domain server "data-domain.vcloud.local" open failed DDR result code: 5004, desc: nothing matched
2018-04-24T20:01:26.473+07:00 avtar Error <10509>: Problem logging into the DDR server:'', only GSAN communication was enabled.
2018-04-24T20:01:26.474+07:00 avtar FATAL <17964>: Backup is incomplete because file "/ddr_files.xml" is missing
2018-04-24T20:01:26.474+07:00 avtar Info <10642>: DDR errors caused the backup to not be posted, errors=0, fatals=0
2018-04-24T20:01:26.474+07:00 avtar Info <12530>: Backup was not committed to the DDR.
2018-04-24T20:01:26.475+07:00 avtar FATAL <8941>: Fatal server connection problem, aborting initialization. Verify correct server address and login credentials.
2018-04-24T20:01:26.475+07:00 avtar Info <6149>: Error summary: 4 errors: 10542, 8941, 10509, 17964
2018-04-24T20:01:26.476+07:00 avtar Info <8468>: Sending wrapup message to parent
2018-04-24T20:01:26.476+07:00 avtar Info <5314>: Command failed (4 errors, exit code 10008: cannot establish connection with server (possible network or DNS failure))

Whenever we have issues with a VDP system backed by a Data Domain, the first place to look is the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs

In these logs, I saw the following:

Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Data Domain Engine (3.1.0.1 build 481386)
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig ddr_info version is 5.
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig found 1 ddrconfig records in ddr_info
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: dpnid 1494515613 from flag
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: LSU: avamar-1494515613
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: server=data-domain.vcloud.local(1),id=4D752F9C9A984D026520ACA64AA465388352BAB1,user=vmware_vdp
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Establishing a connection to the Data Domain system with basic authentication (Connection mode: A:0 E:0).
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::open_ddr: DDR_Open failed: data-domain.vcloud.local(1) lsu: avamar-1494515613, DDR result code: 5004, desc: nothing matched
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::body - ddr_info file from the persistent store has no ddr_config entries
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: <4750>Datadomain get checkpoint list operation failed.
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist finished in 11 seconds
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist cmd finished =============================

Out of curiosity, I tried to list the Data Domain checkpoints manually, and that failed too:

root@vdp:~/#: ddrmaint cplist
<4750>Datadomain get checkpoint list operation failed.

The DDOS version appeared to be supported for this VDP node:

root@vdp:/usr/local/avamar/var/ddrmaintlogs/#: ddrmaint read-ddr-info --format=full
====================== Read-DDR-Info ======================

 System name        : data-domain.vcloud.local
 System ID          : 4D752F9C9A984D026520ACB64AA465388352BAB1
 DDBoost user       : vmware_vdp
 System index       : 1
 Replication        : True
 CP Backup          : True
 Model number       : DD670
 Serialno           : 3FA0807274
 DDOS version       : 6.0.1.10-561375
 System attached    : 1970-01-01 00:00:00 (0)
 System max streams : 16

You can SSH into the Data Domain and view the ddfs logs using the below command:
# log view debug/ddfs.info

I noticed the following in this log:

04/25 12:08:57.141 (tid 0x7fbf4fa68490): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:57.141 (tid 0x7fbf4fa68490): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:58.147 (tid 0x6314fb0): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:58.147 (tid 0x6314fb0): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:58.147 (tid 0x6314fb0): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:59.153 (tid 0x7fc24501bd40): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1

The VDP appliance relies on the ost folder on the Data Domain to perform its maintenance tasks. This folder was missing from the Data Domain, which caused both the maintenance and the backup tasks to fail.

To fix this, we need to recreate the ost folder and re-create its NFS export:

1. Log in to the bash shell of the Data Domain. You can refer to this article for the steps.

2. Navigate to the below directory:
# cd /data/col1/backup

3. Verify that the ost folder is indeed missing when you list the files and folders in this directory.

4. Create the directory and set the ownership and permissions with the below commands:
# mkdir ost
# chmod 777 ost
# chown <your_ddboost_user> ost

5. Exit the bash shell by typing exit.

6. Create the NFS mount with:
# nfs add /backup/ost *

You will see the below message:
NFS export for "/backup/ost" added.

Exit the Data Domain and, back in the VDP SSH session, run ddrmaint cplist; it should now return the checkpoint list successfully, and the maintenance and backup tasks should complete without errors.
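
For reference, a minimal verification pass from the VDP side might look like the sketch below. The commands are the ones used earlier in this post; the exact log file name under ddrmaintlogs will differ in your environment, so pick the most recent one:

# ddrmaint cplist
# ls -lrt /usr/local/avamar/var/ddrmaintlogs/
# tail -n 50 /usr/local/avamar/var/ddrmaintlogs/<latest-log-file>

The cplist output should list the checkpoints without the <4750> error, and the latest ddrmaint log should show the cplist command finishing cleanly.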

Hope this helps!

Thursday, 3 November 2016

VDP Status Error: Data Domain Storage Status Is Not Available

Today, when I logged into my lab to do a bit of testing on my vSphere Data Protection 6.1 appliance, I noticed the following in the VDP plugin in the Web Client:

It was stating, "The Data Domain storage status is not available. Unable to get system information."
So I logged into the vdp-configure page, switched to the Storage tab, and noticed a similar error message.

So the first instinct was to go ahead and edit the Data Domain settings to try re-adding it, but that failed too, with the below message.


When a VDP appliance is configured with a Data Domain, these configuration errors are logged in the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs

Here, I noticed the following:

Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Error: get-system-info::body - DDR_Open failed: 192.168.1.200, DDR result code: 5040, desc: calling system(), returns nonzero
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Error: <4780>Datadomain get system info failed.
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Info: ============================= get-system-info finished in 62 seconds
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Info: ============================= get-system-info cmd finished =============================
Oct 29 01:55:06 vdp58 ddrmaint.bin[30570]: Warning: Calling DDR_OPEN returned result code:5040 message:calling system(), returns nonzero

This tells only part of the story: the appliance is unable to fetch the Data Domain system information. The next step is to check the vdr-configure logs located under /usr/local/avamar/var/vdr/server_logs.
All operations performed in the vdp-configure page are logged in the vdr-configure logs.

Here, the following was seen:

2016-10-29 01:54:28,564 INFO  [http-nio-8543-exec-9]-services.DdrService: Error Code='E30973' Message='Description: The file system is disabled on the Data Domain system. The file system must be enabled in order to perform backups and restores. Data: null Remedy: Enable the file system by running the 'filesys enable' command on the Data Domain system. Domain:  Publish Time: 0 Severity: PROCESS Source Software: MCS:DD Summary: The file system is disabled. Type: ERROR'

This is a much more detailed error. So I logged into my Data Domain system with the "sysadmin" credentials and ran the below command to check the status of the file system:
# filesys status

The output was:
The filesystem is enabled and running.

The Data Domain was reporting that the file system was already up and running; perhaps it was in a non-responding / stale state. So I re-enabled the file system using:
# filesys enable

After this, the Data Domain automatically reconnected to the VDP appliance and the correct status was displayed.
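
If the status does not refresh on its own, a hedged way to re-check the connection from the VDP side is to query the Data Domain details with the same ddrmaint command used elsewhere in this series; a clean run indicates the appliance can reach the Data Domain again:

# ddrmaint read-ddr-info --format=full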

Saturday, 29 October 2016

Migrating VDP From 5.8 and 6.0 To 6.1.x With Data Domain

You cannot upgrade a vSphere Data Protection appliance directly from 5.8.x or 6.0.x to 6.1.x due to the difference in the underlying SUSE Linux version: the earlier versions of vSphere Data Protection used SLES 11 SP1, while 6.1.x uses SLES 11 SP3. Instead, we will be performing a migration.

This article only covers migrating a VDP appliance from 5.8.x or 6.0.x with a Data Domain attached. For a VDP appliance without a Data Domain, you would choose the "Migrate" option in the vdp-configure wizard during the setup of the new 6.1.x appliance. That is not the path we follow when the destination storage is an EMC Data Domain; a VDP appliance with a Data Domain is migrated through a process called a checkpoint restore. Let's discuss these steps below.

For this instance let's consider the following setup:
1. A vSphere Data Protection 5.8 appliance
2. A virtual edition of the EMC Data Domain appliance (the process is the same for a physical Data Domain as well)
3. The 5.8 VDP was deployed as a 512GB deployment.
4. The IP address of this VDP appliance was 192.168.1.203
5. The IP address of the Data Domain appliance is 192.168.1.200

Pre-requisites:
1. In point (3) above, you saw that the 5.8 VDP appliance was set up with 512 GB of local drives. The first question here is: why have local drives at all when the backups reside on the Data Domain?
A vSphere Data Protection appliance paired with a Data Domain still keeps local VMDKs to store the metadata of the client backups. The actual client data is deduplicated and stored on the DD appliance, while the metadata of each backup is stored under the /data0?/cur directories on the VDP appliance. So if your source appliance was a 512 GB deployment, the destination has to be equal to or greater than the source deployment (a quick way to check the local partition sizes is sketched after this list).

2. The IP address, DNS name, domain and all other networking configuration of the destination appliance should be same as the source.

3. It is best to keep the same password on the destination appliance during the initial setup process.

4. On the source appliance make sure the Checkpoint Copy is Enabled. To verify this, go to https://vdp-ip:8543/vdp-configure page, select the Storage tab, click the Gear Icon and click Edit Data Domain. The first page displays this option. If this is not checked, then the checkpoint on the source appliance will not be copied over to the Data Domain, and you will not be able to perform a checkpoint restore.
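
As referenced in point (1), a quick, hedged way to see how large the local data partitions on the source appliance are is a plain df from the VDP shell; the data0? mount points are the ones that matter when sizing the destination:

# df -h | grep data0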

The migration process:
1. Take a SSH to the source VDP appliance and run the below command to get the checkpoint list:
# cplist

The output would be similar to:
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161011033312 Tue Oct 11 09:03:12 2016   valid --- ---  nodes   1/1 stripes     25

Make a note of this output.

2. Run the below command to obtain the Avamar System ID:
# avmaint config --ava | grep -i "system"
The output would be similar to:
  systemname="vdp58.vcloud.local"
  systemcreatetime="1476126720"
  systemcreateaddr="00:50:56:B9:3E:6D"

Make a note of this output as well. 1476126720 (the systemcreatetime value) is the Avamar system ID; it is used to determine which mTree on the Data Domain corresponds to this VDP appliance.

3. Run the below command to obtain the hashed Avamar root password. This is used to test the GSAN login if the migration fails and is mainly useful to VMware Support, so you can skip this step.
# grep ap /usr/local/avamar/etc/usersettings.cfg
The output would be similar to:
password=6cbd70a95847fc58beb381e72600a4cb33d322cc3d9a262fdc17acdbeee80860a285534ab1427048

4. Power off the source appliance

5. Deploy the VDP 6.1.x appliance from the OVF template, provide the same networking details during the OVA deployment, and power on the 6.1.x appliance once the deployment completes successfully.

6. Go to the https://vdp-ip:8543/vdp-configure page and complete the configuration process for the new appliance. As mentioned above, during the "Create Storage" section in the wizard specify the local storage space, either equal to or greater than the source VDP appliance system. Once the appliance configuration completes, it will reboot the new 6.1.x system.

7. Once the reboot is complete, open an SSH session to the 6.1.x appliance and run the below command to list the available checkpoints on the Data Domain.
# ddrmaint cp-backup-list --full --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --ddr-password=<ddboost-password>

Sample command from my lab:
# ddrmaint cp-backup-list --full --ddr-server=192.168.1.200 --ddr-user=ddboost-user --ddr-password=VMware123!
The output would be similar to:
================== Checkpoint ==================
 Avamar Server Name           : vdp58.vcloud.local
 Avamar Server MTree/LSU      : avamar-1476126720
 Data Domain System Name      : 192.168.1.200
 Avamar Client Path           : /MC_SYSTEM/avamar-1476126720
 Avamar Client ID             : 200e7808ddcde518fe08b6778567fa4f397e97fc
 Checkpoint Name              : cp.20161011033032
 Checkpoint Backup Date       : 2016-10-11 09:02:07
 Data Partitions              : 3
 Attached Data Domain systems : 192.168.1.200

The two values we need are the mTree name and the checkpoint name: avamar-1476126720 is the Avamar mTree on the Data Domain (matching the system ID we recorded earlier), and cp.20161011033032 is the checkpoint from the source VDP appliance that was copied over to the Data Domain.

8. Now, we will perform a cprestore to this checkpoint. The command to perform the cprestore is:
# /usr/local/avamar/bin/#: cprestore --hfscreatetime=<avamar-ID> --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --cptag=<checkpoint-name>

Sample command from my lab:
# /usr/local/avamar/bin/#: cprestore --hfscreatetime=1476126720 --ddr-server=192.168.1.200 --ddr-user=ddboost-user --cptag=cp.20161011033032
Here, 1476126720 is the Avamar system ID and cp.20161011033032 is a valid checkpoint. Do not roll back to a checkpoint that is not valid. If the checkpoint is not validated, you will have to run an integrity check on the source VDP appliance to generate a valid checkpoint and have it copied over to the Data Domain system.

The output would be:
Version: 1.11.1
Current working directory: /space/avamar/var
Log file: cprestore-cp.20161011033032.log
Checking node type.
Node type: single-node server
Create DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@192.168.1.200 nfs add /data/col1/avamar-1476126720/GSAN 192.168.1.203 "(ro,no_root_squash,no_all_squash,secure)"
Execute: ssh ddboost-user@192.168.1.200 nfs add /data/col1/avamar-1476126720/GSAN 192.168.1.203 "(ro,no_root_squash,no_all_squash,secure)"
Warning: Permanently added '192.168.1.200' (RSA) to the list of known hosts.
Data Domain OS
Password:

Enter the Data Domain password when prompted. Once the password is authenticated, the cprestore starts; it copies the metadata of the backups for the selected checkpoint onto the 6.1.x appliance.

The output would be similar to:
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.chd' -> '/data01/cp.20161011033032/0000000000000015.chd'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/0000000000000019.wlg' -> '/data02/cp.20161011033032/0000000000000019.wlg'
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.wlg' -> '/data01/cp.20161011033032/0000000000000015.wlg'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000014.wlg' -> '/data03/cp.20161011033032/0000000000000014.wlg'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/checkpoint-complete' -> '/data02/cp.20161011033032/checkpoint-complete'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000016.chd' -> '/data03/cp.20161011033032/0000000000000016.chd'

This continues until all the metadata is copied over; the duration of the cprestore process depends on the amount of backup data. Once the process is complete, you will see the below messages.

Restore data01 finished.
Cleanup restore for data01
Changing owner/group and permissions: /data01/cp.20161011033032
PID 22497 returned with exit code 0
Restore data03 finished.
Cleanup restore for data03
Changing owner/group and permissions: /data03/cp.20161011033032
PID 22499 returned with exit code 0
Finished restoring files in 00:00:04.
Restoring ddr_info.
Copy: 'ddnfs_gsan/cp.20161011033032/ddr_info' -> '/usr/local/avamar/var/ddr_info'
Unmount NFS path 'ddnfs_gsan' in 3 seconds
Execute: sudo umount "ddnfs_gsan"
Remove DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@192.168.1.200 nfs del /data/col1/avamar-1476126720/GSAN 192.168.1.203
Execute: ssh ddboost-user@192.168.1.200 nfs del /data/col1/avamar-1476126720/GSAN 192.168.1.203
Data Domain OS
Password:
kthxbye

Once the data domain password is entered, the cprestore process completes with a kthxbye message.

9. Run the cplist command on the 6.1.x appliance; the checkpoint that was displayed in the cp-backup-list output should now appear among the 6.1.x checkpoints:

cp.20161006013247 Thu Oct  6 07:02:47 2016   valid hfs ---  nodes   1/1 stripes     25
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25

cp.20161006013247 is the 6.1.x appliance's own local checkpoint, and cp.20161011033032 is the source appliance's checkpoint that was copied back from the Data Domain during the cprestore.

10. Once the restore is complete, we need to perform a rollback to this checkpoint. So first, you will have to stop all core services on the 6.1.x appliance using the below command:
# dpnctl stop
11. Initiate the force rollback using the below command:
# dpnctl start --force_rollback

You will see the following output:
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Action: starting all
Have you contacted Avamar Technical Support to ensure that this
  is the right thing to do?
Answering y(es) proceeds with starting all;
          n(o) or q(uit) exits
y(es), n(o), q(uit/exit):

Select yes (y) to initiate the rollback. The next set of output you will see is:

dpnctl: INFO: Checking that gsan was shut down cleanly...
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Here is the most recent available checkpoint:
  Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)
A rollback was requested.
The gsan was shut down cleanly.

The choices are as follows:
  1   roll back to the most recent checkpoint, whether or not validated
  2   roll back to the most recent validated checkpoint
  3   select a specific checkpoint to which to roll back
  4   restart, but do not roll back
  5   do not restart
  q   quit/exit

Choose option 3 and the next set of output you will see is:

Here is the list of available checkpoints:

     2   Thu Oct  6 01:32:47 2016 UTC Validated(type=full)
     1   Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)

Please select the number of a checkpoint to which to roll back.

Alternatively:
     q   return to previous menu without selecting a checkpoint
(Entering an empty (blank) line twice quits/exits.)

In the earlier cplist output you will notice that cp.20161011033032 has a timestamp of Oct 11, so choose option (1). The next output you will see is:
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
You have selected this checkpoint:
  name:       cp.20161011033032
  date:       Tue Oct 11 03:30:32 2016 UTC
  validated:  yes
  age:        -7229 minutes

Roll back to this checkpoint?
Answering y(es)  accepts this checkpoint and initiates rollback
          n(o)   rejects this checkpoint and returns to the main menu
          q(uit) exits

Verify that this is indeed the right checkpoint and answer yes (y) to confirm. The GSAN and MCS rollback begins, and you will see the following in the console:

dpnctl: INFO: rolling back to checkpoint "cp.20161011033032" and restarting the gsan succeeded.
dpnctl: INFO: gsan started.
dpnctl: INFO: Restoring MCS data...
dpnctl: INFO: MCS data restored.
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-24536
dpnctl: WARNING: 1 warning seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: MCS started.

**If this process fails, open a ticket with VMware Support. I cannot publish the troubleshooting steps for this as they are confidential. If needed, add a note in your support ticket asking the assigned engineer to run the case past me.**

If the rollback goes through successfully, you may be presented with an option to restore the Tomcat (EMS) database.

Do you wish to do a restore of the local EMS data?

Answering y(es) will restore the local EMS data
          n(o) will leave the existing EMS data alone
          q(uit) exits with no further action.

Please consult with Avamar Technical Support before answering y(es).

Answer n(o) here unless you have a special need to restore
  the EMS data, e.g., you are restoring this node from scratch,
  or you know for a fact that you are having EMS database problems
  that require restoring the database.

y(es), n(o), q(uit/exit):

I would choose no, since my database was not causing issues in my environment. After this, the remaining services are started. The output:

dpnctl: INFO: EM Tomcat started.
dpnctl: INFO: Resuming backup scheduler...
dpnctl: INFO: Backup scheduler resumed.
dpnctl: INFO: AvInstaller is already running.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

That should be pretty much it. When you log in to the https://vdp-ip:8543/vdp-configure page, you should see the Data Domain automatically in the Storage tab. If not, open a support ticket with VMware.

There are a couple of post-migration steps:
1. If you are using the internal proxy, un-register the proxy and re-register it from the VDP configure page.
2. External proxies (if used) will be orphaned, so you will have to delete the external proxies, change the VDP root password, and re-add the external proxies.
3. If you are using guest-level backups, the agents for SQL, Exchange, and SharePoint have to be re-installed.
4. If this appliance replicates to another VDP appliance, the replication agent needs to be re-registered. Run the below four commands in this order (a quick status check is sketched after the commands):
# service avagent-replicate stop
# service avagent-replicate unregister 127.0.0.1 /MC_SYSTEM
# service avagent-replicate register 127.0.0.1 /MC_SYSTEM
# service avagent-replicate start
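
To confirm the replication agent came back up after re-registering, a hedged check is the standard init-script status action (whether the avagent-replicate script supports it is an assumption on my part; if not, check the avagent logs instead):

# service avagent-replicate status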

And that should be it...

Wednesday, 12 October 2016

vSphere Data Protection /data0? Partitions Are 100 Percent.

VDP can be connected to a Data Domain or use a local deduplication store to hold all the backup data. This article specifically discusses the case where VDP is connected to a Data Domain. As far as the deployment process goes, a VDP appliance with a Data Domain attached still has local data partitions as well: the sda disk holds your OS partitions, and sdb, sdc, and so on hold your data partitions (Hard disk 1, 2, 3, and so on).

These partitions, data01, data02, and so on (grouped as data0?), contain the metadata for the backups whose data is stored on the Data Domain. So if you cd to /data01/cur and run "ls", you will see the metadata stripes:

0000000000000000.tab 
0000000000000008.wlg 
0000000000000012.cdt 
0000000000000017.chd

Before we get into the cause of this issue, let's take a quick look at what a retention policy is. When you create a backup job for a client or a group of clients, you define a retention policy for the restore points. The retention policy determines how long your restore points are kept after a backup; the default is 60 days and can be adjusted as needed.

Once a restore point reaches its expiration date, it is deleted. Then, during the maintenance window, Garbage Collection (GC) is executed to perform the space reclamation. If you run status.dpn, you will see "Last GC" and the amount of space that was reclaimed.

Space reclamation by GC is done only on the data0? partitions. So if your data0? partitions are at 100 percent, there are a few possible explanations:

1. The retention period for all backups is set to "Never Expire", which is not recommended.
2. GC was not executed at all during the maintenance window.

If your backups are set to never expire, go ahead and set an expiration date for them; otherwise your data0? partitions will keep hitting 100 percent space usage.

To check whether your GC executed successfully, run the below command:
# status.dpn
The field to look at in the output is "Last GC". You will either see an error here, such as DDR_ERROR, or see that the last GC ran weeks ago.
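
To pull just that field out of the output, a simple hedged filter (assuming the usual shell tools are present on the appliance) is:

# status.dpn | grep -i "Last GC"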

Also, if you log in to the vdp-configure page, you will likely notice that your maintenance services are not running. If that is the case, the space reclamation task will not run, and if space reclamation is not running, the metadata for expired backups is never cleared.

To understand why this happens, let's take a basic look at how MCS talks to the Data Domain. MCS runs on the VDP appliance, and if a Data Domain is attached to the appliance, MCS queries the Data Domain over SSH using the DD SSH keys.

This means there is a public/private key pair used between the VDP appliance and the Data Domain system. With key-based authentication in place, MCS does not need a password to connect to the Data Domain: the VDP side authenticates with its private key, and the Data Domain accepts it because the VDP's public key is registered on the Data Domain end.

You can do a simple test to see if this is working by performing the below steps:

1. On the VDP appliance load and add the private key. 
# ssh-agent bash 
# ssh-add ~admin/.ssh/ddr_key
2. Once the key is added, you can log in to the Data Domain directly from the VDP SSH session without a password. This is how MCS connects as well.
# ssh sysadmin@192.168.1.200
Two outcomes here: 

1. If there is no prompt for a password, you are connected directly to the Data Domain console, and we are good to go.

2. You are prompted for a passphrase and/or a password to log in to the Data Domain. If you run into this, it means the VDP's public key is not loaded / is unavailable on the Data Domain end.

For this issue, we are most likely hitting outcome (2).

How to verify public key availability on the Data Domain end:

1. On the VDP appliance run the following command to list the public key:
# cat ~admin/.ssh/ddr_key.pub
The output would be similar to:

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAw7XWjEK0jVPrT0z6JDmdKUDLfvvoizdzTpWPoCWNhJ/LerUs9L4UkNr0Q0mTK6U1tnlzlQlqeezIsWvhYJHTcU8rh
yufw1/YZLoGeA0tsHl6ruFAeCIYuf5+mmLXluPhYrjGMdsDa6czjIAtoA4RMY9WjAtSOPX3L2B73Wf3BScigzC/D83aX8GnaldwQU88qkfmhN+dpy2IdxiFm4
hnK+2m4XMtveBTq/8/7medeBTMXYYe7j7DVffViU4DizeEpGj2TBxHIe2dGe0epFDDc9wpa8W5a/XPOeiz4WelHfKtqS1hYUpFEQWXUOngwjDPpqG+6k1t
1HoOp/+OVC3lGw== admin@vmrel-ts2014-vdp

2. On the Data Domain end, run the following command:
# adminaccess show ssh-keys user <ddboost-user>
Specify your custom ddboost user, or sysadmin if that account was itself used as the ddboost user.

In our case, the public key shown above will be missing from this list.
The DD has its own key pair and the VDP has its own key pair, but the VDP's public key is not present on the Data Domain end, which leads to a password prompt when connecting over SSH from the VDP to the DD. Because of this, GC will not run, since MCS ends up waiting for a manual password entry.

To fix this:

1. Copy the public key of the VDP appliance obtained from the "cat" command mentioned earlier. Copy the entire string, starting from and including ssh-rsa, all the way through the trailing user@hostname comment (admin@vmrel-ts2014-vdp in this example).
Make sure no extra spaces or line breaks are copied, or this will not work.

2. Login to DD with sysadmin and run the following command:
# adminaccess add ssh-keys user <ddboost-user>
You will see a prompt like below:

Enter the key and then press Control-D, or press Control-C to cancel.

Then paste the copied key and press Ctrl+D; you will see the "SSH key accepted" message. (A non-interactive alternative is sketched after step 3.)

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAw7XWjEK0jVPrT0z6JDmdKUDLfvvoizdzTpWPoCWNhJ/LerUs9L4UkNr0Q0mTK6U1tnlzlQlqeezIsWvhYJHTcU8rh
yufw1/YZLoGeA0tsHl6ruFAeCIYuf5+mmLXluPhYrjGMdsDa6czjIAtoA4RMY9WjAtSOPX3L2B73Wf3BScigzC/D83aX8GnaldwQU88qkfmhN+dpy2IdxiFm4
hnK+2m4XMtveBTq/8/7medeBTMXYYe7j7DVffViU4DizeEpGj2TBxHIe2dGe0epFDDc9wpa8W5a/XPOeiz4WelHfKtqS1hYUpFEQWXUOngwjDPpqG+6k1t
1HoOp/+OVC3lGw== admin@vmrel-ts2014-vdpSSH key accepted.

3. Now test the login from the VDP to the DD using ssh sysadmin@192.168.1.200; you should be connected directly to the Data Domain.
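
As an alternative to pasting the key interactively, some DDOS versions accept the key on standard input over SSH. This is a hedged sketch that I have not verified on every release, so treat the exact form as an assumption and fall back to the interactive method above if it does not work:

# ssh sysadmin@192.168.1.200 "adminaccess add ssh-keys user <ddboost-user>" < ~admin/.ssh/ddr_key.pub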

Even though we have re-established MCS connectivity to the DD, we now have to manually run a garbage collection to clear out the expired metadata.

You first have to stop the backup scheduler and the maintenance service; otherwise you will receive the below error when trying to run GC:
ERROR: avmaint: garbagecollect: server_exception(MSG_ERR_SCHEDULER_RUNNING)

To stop the backup scheduler and maintenance service:
# dpnctl stop maint
# dpnctl stop sched
Then, run the below command to force start a GC:
# avmaint garbagecollect --timeout=<how many seconds should GC run> --ava
4. Run df -h again; the space usage should come down considerably, provided all the backups have a sensible retention policy set. (A consolidated sketch of this sequence follows below.)
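
For reference, a hedged sketch of the full sequence looks like the following. The timeout value of 86400 seconds (24 hours) is only an example and should be sized to your environment, and the dpnctl start commands assume the usual counterparts of the stop commands above:

# dpnctl stop maint
# dpnctl stop sched
# avmaint garbagecollect --timeout=86400 --ava
# dpnctl start sched
# dpnctl start maint
# df -h | grep data0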


**If you are unsure about this process, open a ticket with VMware to drive this further**

Wednesday, 5 October 2016

Understanding VDP Backups In A Data Domain Mtree

This is going to be a high-level view of how to find out which backups on the Data Domain relate to which clients on the VDP. Previously, we saw how to deploy and configure a Data Domain and connect it to a vSphere Data Protection 5.8 appliance. Here I will discuss what an MTree is and some details about it, as this is needed for the next article, which covers migrating VDP from 5.8/6.0 to 6.1.

In the Data Domain file system, an MTree is created under the Avamar ID of the respective appliance to store the backup files, the VDP checkpoint data, and Data Domain snapshots.

Now, on the Data Domain appliance, run the below command to display the MTree list.
# mtree list
The output is seen as:

Name                           Pre-Comp (GiB)   Status
----------------------------   --------------   ------
/data/col1/avamar-1475309625             32.0   RW
/data/col1/backup                         0.0   RW
----------------------------   --------------   ------

So the Avamar ID here is 1475309625. To confirm that this is the MTree created for your VDP appliance, run the following on the VDP appliance:
# avmaint config --ava | grep -i "system"
The output is:
  systemname="vdp58.vcloud.local"
  systemcreatetime="1475309625"
  systemcreateaddr="00:50:56:B9:54:56"

The system create time is nothing but the Avamar ID. Using these two commands, you can confirm which VDP appliance corresponds to which MTree on the Data Domain.

Now, I mentioned earlier that the MTree is where your backup data, VDP checkpoints, and other related files reside. So the next question is: how do we see the directories under this MTree? To check them, run the following command on the Data Domain appliance.
 # ddboost storage-unit show avamar-<ID>
The output is:

Name                Pre-Comp (GiB)   Status   User
-----------------   --------------   ------   ------------
avamar-1475309625             32.0   RW       ddboost-user
-----------------   --------------   ------   ------------
cur
ddrid
VALIDATED
GSAN
STAGING
  • The cur directory is also called the "current" directory; this is where all your backup data is stored.
  • The VALIDATED folder is where the validated checkpoints reside.
  • The GSAN folder contains the checkpoints that were copied over from the VDP appliance. Remember, in the previous article we checked the option "Enable Checkpoint Copy"; this is what copies the daily checkpoints generated on the VDP over to the Data Domain. Why is this required? We will look at this in more detail during the migrate operation.
  • The STAGING folder is where in-progress backup jobs are saved. Once a backup job completes successfully, its data is moved to the cur directory. If the backup job fails, it remains in the STAGING folder and is cleared out during the next GC on the Data Domain.
Now, as mentioned before, DDOS does not expose the full set of commands available in Linux, which is why you have to enter SE (System Engineering) mode and enable the bash shell to obtain the superset of commands needed to browse and modify directories.

Please note: this is meant to be handled by an EMC technician only. Everything shown here is purely from my lab. Try this at your own risk; if you are uncomfortable, stop now and involve EMC Support.

To enable the bash shell, we first have to enter SE mode. The SE password is your system serial number, which can be obtained with the below command:
# system show serialno
The output is similar to:
Serial number: XXXXXXXXXXX

Enable SE mode using the below command, and enter the serial number as the password when prompted:
 # priv set se
Once the password is accepted, the prompt changes from sysadmin@data-domain to SE@data-domain.

Now, we need to enable the bash shell for your data domain. Run these commands in the same order:

1. Display the OS information using:
# uname
You will see:
Data Domain OS 5.5.0.4-430231

2. Check the file system status using (fi st is the abbreviated form of filesys status):
# fi st
You will see:
The filesystem is enabled and running.

3. Run the below command to show the filesystem space:
 # filesys show space
You will see

Active Tier:
Resource           Size GiB   Used GiB   Avail GiB   Use%   Cleanable GiB*
----------------   --------   --------   ---------   ----   --------------
/data: pre-comp           -       32.0           -      -                -
/data: post-comp      404.8        2.0       402.8     0%              0.0
/ddvar                 49.2        2.3        44.4     5%                -
----------------   --------   --------   ---------   ----   --------------
 * Estimated based on last cleaning of 2016/10/04 06:00:58.

4. Press "Ctrl+C" three times and then type shell-escape.
This drops you into the bash shell, and you will see the following screen:

*************************************************************************
****                            WARNING                              ****
*************************************************************************
****   Unlocking 'shell-escape' may compromise your data integrity   ****
****                and void your support contract.                  ****
*************************************************************************
!!!! datadomain YOUR DATA IS IN DANGER !!!! #

Again, proceed at your own risk and, without exception, involve EMC when you do this.

You saw that the MTree is located at the path /data/col1/avamar-ID. The data partition is not mounted in the shell by default and needs to be mounted and unmounted manually.

To mount the data partition run the below command:
# mount localhost:/data /data
This command returns to the prompt without showing any output. Once the partition has been mounted successfully, you can use your regular Linux commands to browse the MTree.

So a cd to /data/col1/avamar-ID followed by an ls shows the following:

drwxrwxrwx  3 ddboost-user users 167 Oct  1 01:46 GSAN
drwxrwxrwx  3 ddboost-user users 190 Oct  2 02:40 STAGING
drwxrwxrwx  9 ddboost-user users 563 Oct  3 20:33 VALIDATED
drwxrwxrwx  4 ddboost-user users 279 Oct  2 02:40 cur
-rw-rw-rw-  1 ddboost-user users  40 Oct  1 01:43 ddrid

As mentioned before, the "cur" directory holds all your successfully backed-up data. If you change directory to cur and run "ls", you will find the following:

drwxrwxrwx  4 ddboost-user users 229 Oct  2 07:30 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

This is the client ID (CID) of the client (VM) that was successfully backed up by VDP.
To find out which client on the VDP corresponds to which CID on the Data Domain, we have two simple commands.

To understand this, I presume you have a fair idea of what MCS and GSAN are on vSphere Data Protection. The GSAN node stores all the actual backup data when you use local VMDK storage; if your VDP is connected to a Data Domain, GSAN holds only the metadata of the backups and not the actual backup data (as that lives on the Data Domain).
MCS, in brief, waits for the work order and calls avagent and avtar to perform the backup. If MCS detects that a Data Domain is attached, it talks to the DD using the DD public/private key pair (also called the SSH keys) to perform the regular maintenance tasks.

So first we run the avmgr command (avmgr talks only to GSAN and will not work if GSAN is not running) to display the client ID on the GSAN node. The command is:
# avmgr getl --path=/VC-IP/VirtualMachines
The output is:

1  Request succeeded
1  RHEL_UDlVr74uB7JdXN8jgjRLlQ  location: 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d      pswd: 0d0d7c6b09f2a2234c108e4f0647c277e8bf2562

The location value, 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d, is the client ID on the GSAN for the client RHEL (a virtual machine).

Then we run the mccli command (mccli talks only to MCS and needs MCS to be up and running) to display the client ID on the MCS server. The command is:
# mccli client show --domain=/VC-IP/VirtualMachines --name="Client_name"
For example,
# mccli client show --domain=/192.168.1.1/VirtualMachines --name="RHEL_UDlVr74uB7JdXN8jgjRLlQ"
The output is quite detailed; the line we are interested in is this one:
CID                      5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

So we see that the client ID on the Data Domain = the client ID on the GSAN = the client ID on the MCS.

If the client ID on the GSAN does not match the client ID on the MCS, full VM restores and file-level restores will not work; the CID has to be corrected in case of a mismatch to get restores working.
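
A quick, hedged way to compare the two IDs side by side is to filter the same two commands for the client in question (RHEL here); any mismatch between the location value and the CID line indicates the problem described above:

# avmgr getl --path=/192.168.1.1/VirtualMachines | grep RHEL
# mccli client show --domain=/192.168.1.1/VirtualMachines --name="RHEL_UDlVr74uB7JdXN8jgjRLlQ" | grep CID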

Now, back on the Data Domain end, we were in the cur directory. Next, I will change directory into the CID:

# cd 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

I then run another "ls" to list the subdirectories under it, and you will see one or more entries like the following:

drwxrwxrwx  2 ddboost-user users 1.2K Oct  2 02:55 1D21C9327C2E4C6
drwxrwxrwx  2 ddboost-user users 1.4K Oct  2 07:30 1D21CB99431214C

If there is a single folder (a sub-client ID), it means only one backup has been executed and completed successfully for this virtual machine. If you see multiple folders, multiple backups have completed for this VM.

To find out which backup was done first and which ones followed, we have to query the GSAN because, as you know, the GSAN holds the metadata of the backups.

Hence, on the VDP appliance, run the below command:
# avmgr getb --path=/VC-IP/VirtualMachines/Client-Name --format=xml
For example:
# avmgr getb --path=/192.168.1.1/VirtualMachines/RHEL_UDlVr74uB7JdXN8jgjRLlQ --format=xml
The output will be:

<backuplist version="3.0">

  <backuplistrec flags="32768001" labelnum="2" label="RHEL-DD-Job-RHEL-DD-Job-1475418600010" created="1475418652" roothash="505f1aba07f19d64df74670afa59ed39a3ece85d" totalbytes="17180938240.00" ispresentbytes="0.00" pidnum="1016" percentnew="0" expires="1476282600" created_prectime="0x1d21cb99431214c" partial="0" retentiontype="daily,weekly,monthly,yearly" backuptype="Full" ddrindex="1" locked="1"/>
  
  <backuplistrec flags="16777217" labelnum="1" label="RHEL-DD-Job-1475401181065" created="1475402150" roothash="22dc0dddea797d909a2587291e0e33916c35d7a2" totalbytes="17180938240.00" ispresentbytes="0.00" pidnum="1016" percentnew="0" expires="1476265181" created_prectime="0x1d21c9327c2e4c6" partial="0" retentiontype="none" backuptype="Full" ddrindex="1" locked="0"/>
</backuplist>

Looks confusing? Let's look at the specific fields:

The labelnum field shows the order of the backups:
labelnum=1 means the first backup, 2 the second, and so on.

roothash is the hash value of the backup job. The next time an incremental backup runs, the existing hashes are checked and DD Boost backs up only the new hashes. The atomic hashes are then combined to form one unique root hash, so the root hash of each backup is unique.

created_prectime is the main field we need; this is what we referred to above as the sub-client ID.
For labelnum=1, we see the sub CID is 0x1d21c9327c2e4c6
For labelnum=2, we see the sub CID is 0x1d21cb99431214c

Now let's go further into the CID. For example, if I cd into 1D21C9327C2E4C6 (the directory corresponding to created_prectime 0x1d21c9327c2e4c6) and run "ls", I see the following:

-rw-rw-rw-  1 ddboost-user users  485 Oct  2 02:40 1188BE924964359A5C8F5EAEF552E523FBA83566
-rw-rw-rw-  1 ddboost-user users 1.1K Oct  2 02:40 140A189746A6EC3C49D24EA43A7811205345F1F4
-rw-rw-rw-  1 ddboost-user users 3.8K Oct  2 02:40 2CE724F2760C46CB67F679B76657C23606C06869
-rw-rw-rw-  1 ddboost-user users 2.5K Oct  2 02:40 400206DF07A942C066971D84F0CF063D2DE50F08
-rw-rw-rw-  1 ddboost-user users 1.0M Oct  2 02:55 4F50E1E506477801D0A566DEE50E5364B0F04BF0
-rw-rw-rw-  1 ddboost-user users  451 Oct  2 02:55 79DDA236EEEF192EED66CF605CD710B720A41E1F
-rw-rw-rw-  1 ddboost-user users 1.1K Oct  2 02:55 AFB6C8621EB6FA86DD8590841F80C7C78AC7BEEC
-rw-rw-rw-  1 ddboost-user users 1.9K Oct  2 02:40 B17DD9B7E8B2B6EE68294248D8FA42A955539C4C
-rw-rw-rw-  1 ddboost-user users  16G Oct  2 02:55 B212DB46684FFD5AFA41B87FD71A44469B04A38C
-rw-rw-rw-  1 ddboost-user users   15 Oct  2 02:40 D2CFFD87930DAEABB63EAEAA3C8C2AA9554286B5
-rw-rw-rw-  1 ddboost-user users 9.4K Oct  2 02:40 E2FF0829A0F02C1C6FA4A38324A5D9C23B07719B
-rw-rw-rw-  1 ddboost-user users 3.6K Oct  2 02:55 ddr_files.xml

Now there is a main record file called ddr_files.xml. This file holds the information about what each of the other files in this directory is.

So if I take the first hex name and grep for it in ddr_files.xml, I see the following:
# grep -i 1188BE924964359A5C8F5EAEF552E523FBA83566 ddr_files.xml
The relevant output is:
clientfile="virtdisk-descriptor.vmdk"

So this is the VMDK descriptor file that was backed up.

Similarly,
# grep -i 400206DF07A942C066971D84F0CF063D2DE50F08 ddr_files.xml
The relevant output is:
clientfile="vm.nvram"

And one more example:
# grep -i 4F50E1E506477801D0A566DEE50E5364B0F04BF0 ddr_files.xml
The relevant output is:
clientfile="virtdisk-flat.vmdk"

So if the file entries for your VM are not populated correctly in ddr_files.xml, then again your restores will not work. Engage EMC to get this corrected, because I am stressing again: do not fiddle with this in your production environment.

That's pretty much it for this. If you have questions feel free to comment or in-mail. The next article is going to be about Migrating VDP 5.8/6.0 to 6.1 with a data domain.

Saturday, 1 October 2016

Configuring Data Domain Virtual Edition And Connecting It To A VDP Appliance

Recently, while browsing through my VMware repository, I came across EMC Data Domain Virtual Edition. Before this, the only times I had worked on EMC DD cases were on WebEx sessions dealing with integration / backup issues between VDP and Data Domain. My expertise on Data Domain is currently limited, but with this virtual appliance you can expect more updates on VDP-DD issues.

First things first: this article is about deploying the virtual edition of Data Domain and connecting it to VDP; you can call this the initial setup. I am going to skip certain parts, such as the OVF deployment, assuming they are already familiar to you.

1. The Data Domain OVA can be acquired from the EMC download site. It contains an OVF file, a VMDK file, and an .mf (manifest) file.
A prerequisite is to have forward and reverse DNS configured for your DD appliance. Once this is in place, use either the Web Client or the vSphere Client to complete the deployment. Do not power on the appliance after deployment.

2. Once the deployment is complete, go to Edit Settings on the Data Domain appliance and you will see two hard drives for the system, 120 GB and 10 GB. These are the system drives, and you will need to add an additional drive (500 GB here) for storing the backup data. At this point, perform an "Add Device" operation and add a new VMDK of 500 GB in size. Once the drive has been added successfully, power on the Data Domain appliance.

3. The default credentials are username sysadmin, password abc123. On first login, a script runs automatically to configure the network; I have not included this in a screenshot as it is a fairly simple task. The options cover the IPv4 configuration: IP address, DNS server address, hostname, domain name, subnet mask, and default gateway. IPv6 is optional and can be ignored if not required. The prompt will also ask whether you would like to change the password for sysadmin.

4. The next step is to add the 500 GB disk that was created for this appliance. Run the following command over SSH on the Data Domain appliance to add it to the active tier:
# storage add dev3
This would be seen as:


5. Once the additional disk is added to the active tier, you need to create a file system on it. Run the following command to create the file system:
# filesys create
This would be seen as:


This provisions the storage and creates the file system. The progress is displayed once the prompt is acknowledged with "yes" as above.


6. Once the file system is created, the next part is to enable the file system, which again is done by a pretty simple command:
# filesys enable

Note that this is not a full Linux OS, so most Linux commands will not work on a Data Domain. The Data Domain has its own OS, DDOS, which supports only its own set of commands.

The output for enabling file system would be seen as:


Perform a restart of the data domain appliance before proceeding with the next steps. 

7. Once the reboot is completed, open a browser and go to the following link: 
https://data-domain-ip/ddem
If you see an "unable to connect" message in the browser, go back to the SSH session and run the following command:
# admin show
The output which we need would be:

Service   Enabled   Allowed Hosts
-------   -------   -------------
ssh       yes       -
scp       yes       (same as ssh)
telnet    no        -
ftp       no        -
ftps      no        -
http      no        -
https     no        -
-------   -------   -------------

Web options:
Option            Value
---------------   -----
http-port         80
https-port        443
session-timeout   10800

So the ports are already set; however, the HTTP and HTTPS services are not enabled. We need to enable them here. The commands are:
sysadmin@datadomain# adminaccess enable http

HTTP Access:    enabled
sysadmin@datadomain# adminaccess enable https

HTTPS Access:   enabled
This should do the trick, and you should now be able to log in to the Data Domain manager from the browser.


The login username is sysadmin along with its password (either the default, if you chose not to change it during the setup script, or the newly configured password).

8. VDP connects to the Data Domain through a DD Boost user. You need to create a ddboost user with admin rights and enable DD Boost. This can be done from the GUI or from the command line; the command line is much easier.

The first step would be to "add" a ddboost user:
# user add ddboost-user password VMware123!
The next step would be to "set" the created ddboost user:
# ddboost set user-name ddboost-user
The last step would be to enable ddboost:
# ddboost enable 
You can also check ddboost status with:
# ddboost status

This will automatically be reflected in the GUI. The DD Boost settings in the Data Domain web manager are under Data Management > DD Boost > Settings, so if you choose not to add the user from the command line, you can do the same from there.


9. Now, you need to grant this ddboost user admin rights; otherwise you will get the following error when connecting the Data Domain to the VDP appliance:


To grant the ddboost user admin rights, navigate in the web GUI to System Settings > Access Management > Local Users. Select the ddboost user, click Edit, choose admin from the Management Role drop-down, and click OK.


10. After this, there is one more small configuration needed before connecting the VDP appliance: the NTP settings. In the Data Domain web GUI, navigate to System Settings > General Configuration > Time and Date Settings and add your NTP server.

11. With this completed, we will now go ahead and connect this Data Domain to the VDP appliance.
Log in to the vdp-configure page at:
https://vdp-ip:8543/vdp-configure
12. Click Storage > Gear Icon > Add Data Domain.


13. Provide the primary IP of the Data Domain, the ddboost user name and password, and select Enable Checkpoint Copy.

14. Enter the community string for the SNMP configuration.


Once the data domain addition completes successfully, you should be able to see the data domain storage details under the "Storage" section:


If you SSH into the VDP appliance and navigate to /usr/local/avamar/var, you will see a file called ddr_info, which holds the information about the Data Domain.

This is the basic setup for connecting the virtual edition of Data Domain to a VDP appliance.