The Oracle Zero Downtime Migration (ZDM) tool was created to help with Oracle database migrations. It saves time and automates many tasks, and the saved effort accumulates quickly when you need to move multiple databases. Behind the scenes it uses the well-known Oracle Data Guard, so you get solid, proven technology as the foundation, but you are also limited to what Data Guard can and cannot do. All details and documentation are available here. The tool works fine when all prerequisites are met, but when you hit an issue you need to dig in and troubleshoot. Here I will share some experience with ZDM troubleshooting. Please note that the information in this blog applies to ZDM version 19.2, and the behaviour may be different in future versions.
We set up the tool, verified all prerequisites, and ran a migration job in evaluation mode using the "-eval" parameter, but the job failed.
To monitor job execution, you can use a command like "zdmcli query job -jobid 5". The output provides basic information about the job and the result of each phase. On success it looks like this:
[zdmuser@zdmserver ~]$ /opt/oracle/app/zdmhome/bin/zdmcli query job -jobid 8
zdmserver: Audit ID: 516
Job ID: 8
User: zdmuser
Client: zdmuser
Scheduled job command: "zdmcli migrate database -sourcesid SOURCEDB -sourcenode source.localdomain -srcauth zdmauth -srcarg1 user:zdmuser -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode target.localdomain -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -targethome /u02/app/oracle/product/12.2.0/dbhome1 -rsp /home/zdmuser/zdm_template_sourcedbstg.rsp -eval"
Scheduled job execution start time: 2020-04-20T14:26:25-03. Equivalent local time: 2020-04-20 14:26:25
Current status: SUCCEEDED
Result file path: "/opt/oracle/app/zdmbase/chkbase/scheduled/job-8-2020-04-20-14:26:38.log"
Job execution start time: 2020-04-20 14:26:38
Job execution end time: 2020-04-20 14:32:29
Job execution elapsed time: 5 minutes 50 seconds
ZDM_GET_SRC_INFO .......... COMPLETED
ZDM_GET_TGT_INFO .......... COMPLETED
ZDM_SETUP_SRC ............. COMPLETED
ZDM_SETUP_TGT ............. COMPLETED
ZDM_GEN_RMAN_PASSWD ....... COMPLETED
ZDM_PREUSERACTIONS ........ COMPLETED
ZDM_PREUSERACTIONS_TGT .... COMPLETED
ZDM_VALIDATE_SRC .......... COMPLETED
ZDM_VALIDATE_TGT .......... COMPLETED
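Rather than re-running the query by hand, polling until the job reaches a terminal state can be sketched as a small helper. This is only a sketch: the query command is passed in as a string (so nothing here hard-codes the zdmcli path), and the list of terminal states is an assumption based on the statuses shown in this post.

```shell
# Minimal polling sketch. Pass the full query command as a string, e.g.
#   wait_for_job "/opt/oracle/app/zdmhome/bin/zdmcli query job -jobid 8"
# The command is injectable, so the helper is not tied to a zdmcli path.
wait_for_job() {
  query_cmd="$1"
  interval="${2:-30}"   # seconds between polls
  while :; do
    # Extract the value after "Current status:" from the query output.
    status=$(eval "$query_cmd" | sed -n 's/.*Current status: *//p')
    case "$status" in
      # Assumed terminal states, based on the outputs shown above.
      SUCCEEDED|FAILED|PAUSED|ABORTED) printf '%s\n' "$status"; return ;;
    esac
    sleep "$interval"
  done
}
```

You could then run it in the background and get notified only when the job finishes or pauses.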
But unfortunately it failed in our case, and the output looked like this:
zdmserver: Processing response file ...
zdmserver: Starting zero downtime migrate operation ...
zdmserver: Executing phase ZDM_GET_SRC_INFO
zdmserver: retrieving information about database "DSST" ...
zdmserver: Executing phase ZDM_GET_TGT_INFO
zdmserver: Retrieving information from target node "target.localhost" ...
zdmserver: Executing phase ZDM_SETUP_SRC
zdmserver: Setting up ZDM on the source node source.localhost ...
zdmserver: Executing phase ZDM_SETUP_TGT
zdmserver: Setting up ZDM on the target node target.localhost ...
zdmserver: Executing phase ZDM_GEN_RMAN_PASSWD
zdmserver: Executing phase ZDM_PREUSERACTIONS
zdmserver: Executing phase ZDM_PREUSERACTIONS_TGT
zdmserver: Executing phase ZDM_VALIDATE_SRC
zdmserver: Validating standby on the source node source.localhost ...
zdmserver: Executing phase ZDM_VALIDATE_TGT
zdmserver: Validating standby on the target node target.localhost ...
That was apparently not enough to troubleshoot the problem; we needed more logs. Luckily, we had a full set of logs on the source and target in the /tmp/zdm-*/log/ directory.
[zdmuser@plxde746 ~]$ view /tmp/zdm-237637609/zdm/log/mZDM_oss_standby_validate_src_3119.log
19:25:17.000: Command received is : mZDM_oss_standby_validate_src -sdbsid SOURCEDB -sdbdomain localdomain -sdbhome /opt/oracle/release/12.2.0.1 -dbid 111111111 -scn 331201554581 -tdbname sourcedb -tdbhome /u02/app/oracle/product/12.2.0/dbhome1 -sdbScanName source.localdomain -tdbScanName test-scan.localdomain -tdbScanPort 1521 -tdatadg +DATAC1 -tredodg +DATAC1 -trecodg +RECOC1 -bkpPath /migration/staging
19:25:17.000: ### Printing the configuration values from files:
19:25:17.000: /tmp/zdm-237637609/zdm/mZDMconfig_params
19:25:17.000: DATA_DG=+DATAC1
That log provides a fully detailed execution trace with all parameters, commands, and values, which helped to nail down and resolve the problem.
But that doesn't work in all cases. Sometimes, when for some reason ZDM cannot even create the /tmp/zdm-* directory on the source or target, you have no logs at all.
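When the work directories do exist, a quick way to jump to the freshest phase log is a small helper like the one below. This is only a sketch: the zdm-*/zdm/log layout is taken from the example above as observed with ZDM 19.2 and may differ in other versions.

```shell
# Print the most recently modified ZDM phase log under a base directory.
# ZDM creates its work directories as <base>/zdm-<id>/zdm/log/ (layout
# observed with ZDM 19.2; may change in other versions).
latest_zdm_log() {
  base="${1:-/tmp}"
  # Sort matching logs by modification time, newest first, keep the first.
  ls -1t "$base"/zdm-*/zdm/log/*.log 2>/dev/null | head -1
}
```

Usage on a source or target node would be something like `view "$(latest_zdm_log)"`.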
For example, when the zdm user connected to the source lacks the privilege to execute some commands as root, the job fails on the very first steps. In such a case you have no option but to run the migration job in real migration mode rather than evaluation mode (-eval). In such cases I recommend adding the -pauseafter parameter to specify where you want the job to stop; in my case I used "-pauseafter ZDM_SETUP_SRC". We ran the job and the execution failed on the very first step.
[zdmuser@zdmserver ~]$ /opt/oracle/app/zdmhome/bin/zdmcli query job -jobid 6
zdmserver: Audit ID: 514
Job ID: 6
User: zdmuser
Client: vlxpr1008
Scheduled job command: "zdmcli migrate database -sourcesid SOURCEDB -sourcenode source.localdomain -srcauth zdmauth -srcarg1 user:zdmuser -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode target.localdomain -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -targethome /u02/app/oracle/product/12.2.0./dbhome1 -rsp /home/zdmuser/zdm_template_sourcedbstg.rsp -pauseafter ZDM_SETUP_SRC"
Scheduled job execution start time: 2020-03-27T09:20:30-03. Equivalent local time: 2020-03-27 09:20:30
Current status: FAILED
Result file path: "/opt/oracle/app/zdmbase/chkbase/scheduled/job-6-2020-03-27-09:20:31.log"
Job execution start time: 2020-03-27 09:20:31
Job execution end time: 2020-03-27 09:20:48
Job execution elapsed time: 16 seconds
ZDM_GET_SRC_INFO ............. FAILED
ZDM_GET_TGT_INFO ............. PENDING
ZDM_SETUP_SRC ................ PENDING
ZDM_SETUP_TGT ................ PENDING
I checked the log and found the following.
[zdmuser@zdmserver ~]$ cat /opt/oracle/app/zdmbase/chkbase/scheduled/job-6-2020-03-27-09\:20\:31.log
zdmserver: Processing response file ...
zdmserver: Starting zero downtime migrate operation ...
zdmserver: Executing phase ZDM_GET_SRC_INFO
zdmserver: retrieving information about database "DSST" ...
PRCF-2056 : The copy operation failed on node: "source.localdomain". Details: {1}
PRCZ-4002 : failed to execute command "/bin/cp" using the privileged execution plugin "zdmauth" on nodes "source.localdomain"
[zdmuser@zdmserver ~]$
At first glance it looked like we were unable to use "sudo cp", but after several tests we discovered that we lacked the privilege to run "/bin/scp" on the source and therefore could not copy the ZDM files from the zdmserver. After fixing the problem you can either resume the job using "zdmcli resume job -jobid 6" or destroy it and run the evaluation job again. To destroy a job, run "zdmcli abort job -jobid 6".
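A simple pre-flight check would have caught this before the job ran. The sketch below is only an illustration: the example command list (/bin/cp, /bin/scp) reflects the failures we hit, and the SUDO_PROBE hook is an assumption added so the helper can be tried without real root access; by default it asks sudo non-interactively whether each command is permitted.

```shell
# Check that the migration user may run each given command as root.
# SUDO_PROBE is a hypothetical override for dry runs; the default
# "sudo -n -l <cmd>" probe queries the sudoers policy non-interactively.
check_root_cmds() {
  probe="${SUDO_PROBE:-sudo -n -l}"
  rc=0
  for cmd in "$@"; do
    if ! $probe "$cmd" >/dev/null 2>&1; then
      echo "missing root privilege for: $cmd" >&2
      rc=1
    fi
  done
  return $rc
}
# Example: check_root_cmds /bin/cp /bin/scp
```

Running a check like this on every source and target node before submitting the job is much cheaper than discovering the gap from a PRCZ-4002 error.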
In my experience working with ZDM in several different environments, most of the problems boiled down to network, database software, instance configuration, and permission issues. Let me dwell on the last category. The ZDM user on the source and target is supposed to have full superuser privileges. In Oracle Cloud, that is the "opc" user, which can run any command through "sudo". But if you move the database from on-prem, you might encounter some difficulties getting such privileges. In my case we got help from the Oracle ZDM team and from the Oracle product manager for ZDM, and we also did some troubleshooting and adjusting ourselves to add all the required commands to the /etc/sudoers list for the ZDM user on the source machines.
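For reference, a grant of that kind can look like the fragment below. This is purely an illustration: the user name and command list are assumptions, and the exact list depends on your ZDM version and security policy.

```
# Hypothetical /etc/sudoers.d/zdmuser fragment (illustration only;
# validate any change with "visudo -c" before relying on it):
zdmuser ALL=(root) NOPASSWD: /bin/cp, /bin/scp, /bin/mkdir, /bin/rm
```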
A couple of other problems were related to a discrepancy in software level between source and target. The documented parameter "-ignore PATCH_CHECK" didn't work for us, so we used "-ignore ALL" instead. Also, I found that for 12.1 and 11g the ZDM tool didn't encrypt tablespaces on the cloud side during standby creation: it used the "restoreDatabase" subprogram instead of the "restoreAndEncryptDatabase" used for 12.2 and later versions.
In summary, despite a few bumps and problems with the tool, ZDM was able to significantly reduce the effort and the number of errors during migrations, even in cases where it was used only as part of the migration process. I am looking forward to the new version and hope it provides more options for migrations. Shoot me an email or reach me on Twitter if you need help with a migration or with making ZDM work.