Resolving ODA VM FileSystem Corruption


We recently experienced an issue on a client's virtualized ODA (Oracle Database Appliance) with the following details:

 
VM-ODA_BASE: ODA X5-2
Version: 12.1.2.12.0

Our client mentioned that one of the VMs hosted on the ODA was inaccessible, so we started by gathering more information on the two-node ODA X5-2.

Checking from ODA_BASE


[root@eng-oda-base0 ~]# oakcli show vm

NAME NODENUM MEMORY VCPU STATE REPOSITORY

eng-dns01 1 2048M 2 ONLINE vmrepo
eng-mgmt-web-01 1 8192M 3 ONLINE vmrepo
eng-prod-recovery-01 0 29696M 12 ONLINE vmrepo
eng-prod-recovery-02 1 29696M 12 ONLINE vmrepo
eng-prod-recovery-03 1 29696M 12 ONLINE vmrepo
eng-prod-mgmt-01 1 16384M 2 ONLINE vmrepo
eng-prod-mgmt-02 1 16384M 2 ONLINE vmrepo
eng-prod-oam-01 1 8192M 2 ONLINE vmrepo
eng-prod-oam-02 1 8192M 2 ONLINE vmrepo
eng-prod-ssa-01 1 7500M 3 ONLINE vmrepo
eng-prod-web-01 1 8192M 3 ONLINE vmrepo
eng-prod-web-02 1 8192M 3 ONLINE vmrepo
eng-oau-mgmt-01 0 16384M 2 ONLINE devvms
eng-oau-mgmt-01_clone 0 16384M 2 OFFLINE vmrepo
eng-oau-oam-01 1 8192M 2 ONLINE devvms
eng-oau-web-01 0 8192M 2 ONLINE devvms
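As a quick sanity check, the inventory above can be summarised with standard text tools. A minimal sketch that counts ONLINE VMs per node, using a few sample rows copied from the output above (on the appliance you would pipe `oakcli show vm` in directly instead of the string):

```shell
# Sample rows copied from the 'oakcli show vm' output above:
# NAME NODENUM MEMORY VCPU STATE REPOSITORY
vm_list='eng-dns01 1 2048M 2 ONLINE vmrepo
eng-prod-recovery-01 0 29696M 12 ONLINE vmrepo
eng-oau-mgmt-01_clone 0 16384M 2 OFFLINE vmrepo
eng-oau-web-01 0 8192M 2 ONLINE devvms'

# Count ONLINE VMs per node (column 2 = NODENUM, column 5 = STATE)
summary=$(printf '%s\n' "$vm_list" |
  awk '$5 == "ONLINE" {count[$2]++}
       END {for (n in count) print "node " n ": " count[n] " online"}')
printf '%s\n' "$summary"
```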

The VM eng-prod-mgmt-01 appeared to be online at the ODA_BASE level even though it had been reported as inaccessible.

We then tried pinging the problematic VM, which was unsuccessful with constant timeouts.

A further check on the VM showed it to still be online, as shown below:

[root@eng-oda-base0 ~]# oakcli show vm eng-prod-mgmt-01
The Resource is : eng-prod-mgmt-01
AutoStart : restore
CPUPriority : 100
Disks : |file:/OVS/Repositories/vmrepo/.ACF
S/snaps/eng-prod-mgmt-01/VirtualMac
hines/eng-prod-mgmt-01/OEL7.4.img,x
vda,w|
Domain : XEN_PVM
DriverDomain : False
ExpectedState : online
FailOver : true
IsSharedRepo : true
Keyboard : en-us
MaxMemory : 16384M
MaxVcpu : 2
Memory : 16384M
Mouse : OS_DEFAULT
Name : eng-prod-mgmt-01
Networks : |bridge=vsok1||bridge=prodv|
NodeNumStart : 1
OS : OL_5
PrefNodeNum : 0
PrivateIP : None
ProcessorCap : 0
RepoName : vmrepo
State : Online
TemplateName : otml_OEL_7_5
VDisks : |0|
Vcpu : 2
cpupool : default-unpinned-pool
vncport : 5907

One interesting observation was that the preferred node is node 0, but the VM appeared to have been started on, or migrated to, node 1:

NodeNumStart : 1
OS : OL_5
PrefNodeNum : 0
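That mismatch can also be detected mechanically. A minimal sketch, parsing the two fields out of the captured output (fed in here as a string; on the appliance you would substitute the live `oakcli show vm eng-prod-mgmt-01` output):

```shell
# Fields copied from the 'oakcli show vm eng-prod-mgmt-01' output above
vm_info='NodeNumStart : 1
PrefNodeNum : 0'

# Split on ' : ' so the field name lands in $1 and the value in $2
start=$(printf '%s\n' "$vm_info" | awk -F' : ' '/NodeNumStart/ {print $2}')
pref=$(printf '%s\n' "$vm_info" | awk -F' : ' '/PrefNodeNum/ {print $2}')

if [ "$start" != "$pref" ]; then
  echo "VM started on node $start but prefers node $pref"
fi
```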

We attempted an online migration to the preferred node 0 as follows:

[root@eng-oda-base0 ~]# oakcli migrate vm eng-prod-mgmt-01

OAKERR : 9002 : Repo: vmrepo is not online on node: 0

The action failed, so we took a closer look at the repository concerned:

[root@eng-oda-base0 ~]# oakcli show repo

 

NAME TYPE NODENUM FREE SPACE STATE SIZE

odarepo1 local 0 N/A N/A N/A

odarepo2 local 1 N/A N/A N/A

vmrepo shared 0 N/A UNKNOWN N/A

vmrepo shared 1 N/A UNKNOWN N/A

devvms shared 0 N/A UNKNOWN N/A

devvms shared 1 N/A UNKNOWN N/A
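Every shared repository was reporting an UNKNOWN state at this point. A state like this can be flagged with a short filter over the `oakcli show repo` output (sample rows copied from above; on the appliance you would pipe the live command in):

```shell
# Sample rows from 'oakcli show repo' above:
# NAME TYPE NODENUM FREE_SPACE STATE SIZE
repo_out='odarepo1 local 0 N/A N/A N/A
odarepo2 local 1 N/A N/A N/A
vmrepo shared 0 N/A UNKNOWN N/A
vmrepo shared 1 N/A UNKNOWN N/A
devvms shared 0 N/A UNKNOWN N/A
devvms shared 1 N/A UNKNOWN N/A'

# List any shared repo whose STATE (column 5) is not ONLINE
bad=$(printf '%s\n' "$repo_out" |
  awk '$2 == "shared" && $5 != "ONLINE" {print $1 " (node " $3 ")"}')
[ -n "$bad" ] && printf 'Repos not ONLINE:\n%s\n' "$bad"
```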

 

[root@eng-oda-base0 ~]# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base has been set to /u01/app/grid
[root@eng-oda-base0 ~]# oakcli show repo

NAME TYPE NODENUM FREE SPACE STATE SIZE

odarepo1 local 0 N/A N/A N/A

odarepo2 local 1 N/A N/A N/A

vmrepo shared 0 7.71% ONLINE 4072960.0M

vmrepo shared 1 7.71% ONLINE 4072960.0M

devvms shared 0 52.76% ONLINE 1512448.0M

devvms shared 1 52.76% ONLINE 1512448.0M

 

The repositories showed ONLINE after setting the ASM environment.

We attempted the VM migration again:

[root@eng-oda-base0 ~]# oakcli migrate vm eng-prod-mgmt-01

OAKERR:7079 Error encountered while migrating VM eng-prod-mgmt-01 - OAKERR:7079 Error encountered while migrating VM eng-prod-mgmt-01 - Error: /usr/lib64/xen/bin/xc_restore 4 23 3 5 1 1 1 0 1 failed

Checking on Dom0:

 

[root@eng-oak2-dom0 ~]# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 3863 20 r—– 19335893.9
eng-dns01 5 2051 2 -b—- 73429.3
eng-mgmt-web-01 10 8195 3 -b—- 354394.6
eng-prod-recovery-02 134 29699 12 -b—- 12829439.8
eng-prod-recovery-03 121 29699 12 r—– 23506726.9
eng-prod-mgmt-01 146 16387 2 -b—- 1473.6
eng-prod-mgmt-02 12 16387 2 -b—- 254483.2
eng-prod-oam-01 145 8195 2 -b—- 1324.9
eng-prod-oam-02 3 8195 2 -b—- 863358.7
eng-prod-ssa-01 148 7503 3 -b—- 9644.3
eng-prod-web-01 136 8195 3 -b—- 26821.4
eng-prod-web-02 11 8195 3 -b—- 1203707.9
eng-oau-oam-01 13 8195 2 -b—- 874714.3
oakDom1 1 81923 24 r—– 50625431.6

The VM was found to be running on node 1 instead of node 0.

Log in to the VM using VNC (Virtual Network Computing).

 

Run the following on the Dom0 where the problematic VM is currently running:

xm list -l 146 | grep 59
59
59
(uuid 018e3193-36c9-9159-02df-d0930275427b)
(location 0.0.0.0:5906)

This indicated that we could use VNC port 5906 to log in and see what was going on.
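The port number can also be pulled out of the location line mechanically. A small sketch, fed with the lines captured above (on Dom0 you would pipe `xm list -l <domid>` in directly):

```shell
# Lines captured from the 'xm list -l 146' output above
xm_out='(uuid 018e3193-36c9-9159-02df-d0930275427b)
(location 0.0.0.0:5906)'

# Extract the port from the '(location host:port)' line
port=$(printf '%s\n' "$xm_out" |
  sed -n 's/.*(location .*:\([0-9][0-9]*\)).*/\1/p')
echo "VNC port: $port"
```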

After logging in with a VNC viewer (connecting to the Dom0 node IP on port 5906), we realised that the VM was running at run level 1 (safe mode), so we took a look at journalctl, searching for any recent errors.

The following was discovered


Found device root=/dev/mapper/ol-root
started filesystem check on /dev/mapper/ol-root
started dracut initqueue hook
Reached target Remote File Systems (Pre)
Reached target Remote File Systems
Mounting /sysroot…
[ ***] A start job is running for /sysroot (3min 59s / 4min 31s)[240.527013] INFO: task mount:406 blocked for more than 120 seconds.
[ 240.527056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for more details.
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
[ OK ] Stopped dracut pre-pivot and cleanup hook.
[ OK ] Stopped target Initrd Default Target.
[ OK ] Reached target Initrd File System.
[ OK ] Stopped dracut mount hook.
[ OK ] Stopped target Basic System.
[ OK ] Stopped System Initialization.
Starting Emergency Shell…

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attaching them to a bug report.

We ran the following and issued a reboot afterwards:

# xfs_repair /dev/mapper/ol-root

If the above does not work (typically because the XFS log is dirty), zero the log with the following and reboot. Note that -L destroys the metadata log, so the most recent metadata changes may be lost:

# xfs_repair -L /dev/mapper/ol-root

 

The VM is now running on the correct node and is accessible:

Last login: Fri Apr 17 09:16:28 2020 from 192.168.199.151

[root@eng-oak1-dom0 ~]# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 3852 20 r—– 20699947.9
eng-mgmt-web-01 15 8195 3 -b—- 31474.3
eng-prod-recovery-01 21 29699 12 r—– 7436775.4
eng-prod-mgmt-01 24 16387 2 -b—- 9.1
eng-prod-mgmt-02 13 16387 2 -b—- 25035.3
eng-prod-oam-01 8 8195 2 -b—- 1577418.4
eng-prod-oam-02 10 8195 2 r—– 3543915.9
eng-prod-ssa-01 9 7503 3 -b—- 1532317.8
eng-prod-web-01 5 8195 3 -b—- 1505332.2
eng-prod-web-02 22 8195 3 -b—- 4841.3
eng-oau-mgmt-01 4 16387 2 -b—- 1325935.3
eng-oau-web-01 3 8195 2 -b—- 361317.7
oakDom1 1 81923 24 r—– 79419898.1
