Resolving ODA VM FileSystem Corruption


We recently experienced an issue on a client's virtualized ODA (Oracle Database Appliance) with the following details:

 
VM-ODA_BASE: ODA X5-2
Version: 12.1.2.12.0

Our client mentioned that one of the VMs hosted on the ODA was inaccessible, so we started by gathering more information on the two-node ODA X5-2.

Checking from ODA_BASE


[root@eng-oda-base0 ~]# oakcli show vm

NAME NODENUM MEMORY VCPU STATE REPOSITORY

eng-dns01 1 2048M 2 ONLINE vmrepo
eng-mgmt-web-01 1 8192M 3 ONLINE vmrepo
eng-prod-recovery-01 0 29696M 12 ONLINE vmrepo
eng-prod-recovery-02 1 29696M 12 ONLINE vmrepo
eng-prod-recovery-03 1 29696M 12 ONLINE vmrepo
eng-prod-mgmt-01 1 16384M 2 ONLINE vmrepo
eng-prod-mgmt-02 1 16384M 2 ONLINE vmrepo
eng-prod-oam-01 1 8192M 2 ONLINE vmrepo
eng-prod-oam-02 1 8192M 2 ONLINE vmrepo
eng-prod-ssa-01 1 7500M 3 ONLINE vmrepo
eng-prod-web-01 1 8192M 3 ONLINE vmrepo
eng-prod-web-02 1 8192M 3 ONLINE vmrepo
eng-oau-mgmt-01 0 16384M 2 ONLINE devvms
eng-oau-mgmt-01_clone 0 16384M 2 OFFLINE vmrepo
eng-oau-oam-01 1 8192M 2 ONLINE devvms
eng-oau-web-01 0 8192M 2 ONLINE devvms
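As a quick sanity check, the inventory above can be summarised with standard text tools. A minimal sketch that counts ONLINE VMs per node, using a few sample rows copied from the output above (on the appliance you would pipe `oakcli show vm` in directly instead of the string):

```shell
# Sample rows copied from the 'oakcli show vm' output above:
# NAME NODENUM MEMORY VCPU STATE REPOSITORY
vm_list='eng-dns01 1 2048M 2 ONLINE vmrepo
eng-prod-recovery-01 0 29696M 12 ONLINE vmrepo
eng-oau-mgmt-01_clone 0 16384M 2 OFFLINE vmrepo
eng-oau-web-01 0 8192M 2 ONLINE devvms'

# Count ONLINE VMs per node (column 2 = NODENUM, column 5 = STATE)
summary=$(printf '%s\n' "$vm_list" |
  awk '$5 == "ONLINE" {count[$2]++}
       END {for (n in count) print "node " n ": " count[n] " online"}')
printf '%s\n' "$summary"
```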

The VM eng-prod-mgmt-01 appeared to be online at the ODA_BASE level even though it had been reported as inaccessible.

We then tried pinging the problematic VM, which was unsuccessful with constant timeouts.

A further check on the VM showed it to still be online, as shown below:

[root@eng-oda-base0 ~]# oakcli show vm eng-prod-mgmt-01
The Resource is : eng-prod-mgmt-01
AutoStart : restore
CPUPriority : 100
Disks : |file:/OVS/Repositories/vmrepo/.ACF
S/snaps/eng-prod-mgmt-01/VirtualMac
hines/eng-prod-mgmt-01/OEL7.4.img,x
vda,w|
Domain : XEN_PVM
DriverDomain : False
ExpectedState : online
FailOver : true
IsSharedRepo : true
Keyboard : en-us
MaxMemory : 16384M
MaxVcpu : 2
Memory : 16384M
Mouse : OS_DEFAULT
Name : eng-prod-mgmt-01
Networks : |bridge=vsok1||bridge=prodv|
NodeNumStart : 1
OS : OL_5
PrefNodeNum : 0
PrivateIP : None
ProcessorCap : 0
RepoName : vmrepo
State : Online
TemplateName : otml_OEL_7_5
VDisks : |0|
Vcpu : 2
cpupool : default-unpinned-pool
vncport : 5907

One interesting observation was that the preferred node is node 0, but the VM appeared to have been started on, or migrated to, node 1:

NodeNumStart : 1
OS : OL_5
PrefNodeNum : 0
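That mismatch can also be detected mechanically. A minimal sketch, parsing the two fields out of the captured output (fed in here as a string; on the appliance you would substitute the live `oakcli show vm eng-prod-mgmt-01` output):

```shell
# Fields copied from the 'oakcli show vm eng-prod-mgmt-01' output above
vm_info='NodeNumStart : 1
PrefNodeNum : 0'

# Split on ' : ' so the field name lands in $1 and the value in $2
start=$(printf '%s\n' "$vm_info" | awk -F' : ' '/NodeNumStart/ {print $2}')
pref=$(printf '%s\n' "$vm_info" | awk -F' : ' '/PrefNodeNum/ {print $2}')

if [ "$start" != "$pref" ]; then
  echo "VM started on node $start but prefers node $pref"
fi
```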

We attempted an online migration to the preferred node 0 as follows:

[root@eng-oda-base0 ~]# oakcli migrate vm eng-prod-mgmt-01

OAKERR : 9002 : Repo: vmrepo is not online on node: 0

The action failed, so we took a closer look at the repository concerned:

[root@eng-oda-base0 ~]# oakcli show repo

 

NAME TYPE NODENUM FREE SPACE STATE SIZE

odarepo1 local 0 N/A N/A N/A

odarepo2 local 1 N/A N/A N/A

vmrepo shared 0 N/A UNKNOWN N/A

vmrepo shared 1 N/A UNKNOWN N/A

devvms shared 0 N/A UNKNOWN N/A

devvms shared 1 N/A UNKNOWN N/A
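Every shared repository was reporting an UNKNOWN state at this point. A state like this can be flagged with a short filter over the `oakcli show repo` output (sample rows copied from above; on the appliance you would pipe the live command in):

```shell
# Sample rows from 'oakcli show repo' above:
# NAME TYPE NODENUM FREE_SPACE STATE SIZE
repo_out='odarepo1 local 0 N/A N/A N/A
odarepo2 local 1 N/A N/A N/A
vmrepo shared 0 N/A UNKNOWN N/A
vmrepo shared 1 N/A UNKNOWN N/A
devvms shared 0 N/A UNKNOWN N/A
devvms shared 1 N/A UNKNOWN N/A'

# List any shared repo whose STATE (column 5) is not ONLINE
bad=$(printf '%s\n' "$repo_out" |
  awk '$2 == "shared" && $5 != "ONLINE" {print $1 " (node " $3 ")"}')
[ -n "$bad" ] && printf 'Repos not ONLINE:\n%s\n' "$bad"
```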

 

[root@eng-oda-base0 ~]# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base has been set to /u01/app/grid
[root@eng-oda-base0 ~]# oakcli show repo

NAME TYPE NODENUM FREE SPACE STATE SIZE

odarepo1 local 0 N/A N/A N/A

odarepo2 local 1 N/A N/A N/A

vmrepo shared 0 7.71% ONLINE 4072960.0M

vmrepo shared 1 7.71% ONLINE 4072960.0M

devvms shared 0 52.76% ONLINE 1512448.0M

devvms shared 1 52.76% ONLINE 1512448.0M

 

The repositories showed ONLINE after setting the ASM environment.

We attempted the VM migration again:

[root@eng-oda-base0 ~]# oakcli migrate vm eng-prod-mgmt-01

OAKERR:7079 Error encountered while migrating VM eng-prod-mgmt-01 - OAKERR:7079 Error encountered while migrating VM eng-prod-mgmt-01 - Error: /usr/lib64/xen/bin/xc_restore 4 23 3 5 1 1 1 0 1 failed

Checking on Dom0:

 

[root@eng-oak2-dom0 ~]# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 3863 20 r—– 19335893.9
eng-dns01 5 2051 2 -b—- 73429.3
eng-mgmt-web-01 10 8195 3 -b—- 354394.6
eng-prod-recovery-02 134 29699 12 -b—- 12829439.8
eng-prod-recovery-03 121 29699 12 r—– 23506726.9
eng-prod-mgmt-01 146 16387 2 -b—- 1473.6
eng-prod-mgmt-02 12 16387 2 -b—- 254483.2
eng-prod-oam-01 145 8195 2 -b—- 1324.9
eng-prod-oam-02 3 8195 2 -b—- 863358.7
eng-prod-ssa-01 148 7503 3 -b—- 9644.3
eng-prod-web-01 136 8195 3 -b—- 26821.4
eng-prod-web-02 11 8195 3 -b—- 1203707.9
eng-oau-oam-01 13 8195 2 -b—- 874714.3
oakDom1 1 81923 24 r—– 50625431.6

The VM was found to be running on node 1 instead of node 0.

Log in to the VM using VNC (Virtual Network Computing).

 

Run the following on the Dom0 where the problematic VM is currently running:

xm list -l 146 | grep 59
59
59
(uuid 018e3193-36c9-9159-02df-d0930275427b)
(location 0.0.0.0:5906)

This indicated that we could use VNC port 5906 to log in and see what was going on.
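The port number can also be pulled out of the location line mechanically. A small sketch, fed with the lines captured above (on Dom0 you would pipe `xm list -l <domid>` in directly):

```shell
# Lines captured from the 'xm list -l 146' output above
xm_out='(uuid 018e3193-36c9-9159-02df-d0930275427b)
(location 0.0.0.0:5906)'

# Extract the port from the '(location host:port)' line
port=$(printf '%s\n' "$xm_out" |
  sed -n 's/.*(location .*:\([0-9][0-9]*\)).*/\1/p')
echo "VNC port: $port"
```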

After logging in with a VNC viewer (connecting to the Dom0 node IP on port 5906), we realised that the VM was running at run level 1 (safe mode), so we took a look at journalctl, searching for any recent errors.

The following was discovered


Found device root=/dev/mapper/ol-root
started filesystem check on /dev/mapper/ol-root
started dracut initqueue hook
Reached target Remote File Systems (Pre)
Reached target Remote File Systems
Mounting /sysroot…
[ ***] A start job is running for /sysroot (3min 59s / 4min 31s)[240.527013] INFO: task mount:406 blocked for more than 120 seconds.
[ 240.527056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for more details.
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
[ OK ] Stopped dracut pre-pivot and cleanup hook.
[ OK ] Stopped target Initrd Default Target.
[ OK ] Reached target Initrd File System.
[ OK ] Stopped dracut mount hook.
[ OK ] Stopped target Basic System.
[ OK ] Stopped System Initialization.
Starting Emergency Shell…

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attaching them to a bug report.

We ran the following and issued a reboot afterwards:

# xfs_repair /dev/mapper/ol-root

If the above does not work (typically because the XFS log is dirty), zero the log with the following and reboot. Note that -L destroys the metadata log, so the most recent metadata changes may be lost:

# xfs_repair -L /dev/mapper/ol-root

 

The VM is now running on the correct node and is accessible:

Last login: Fri Apr 17 09:16:28 2020 from 192.168.199.151

[root@eng-oak1-dom0 ~]# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 3852 20 r—– 20699947.9
eng-mgmt-web-01 15 8195 3 -b—- 31474.3
eng-prod-recovery-01 21 29699 12 r—– 7436775.4
eng-prod-mgmt-01 24 16387 2 -b—- 9.1
eng-prod-mgmt-02 13 16387 2 -b—- 25035.3
eng-prod-oam-01 8 8195 2 -b—- 1577418.4
eng-prod-oam-02 10 8195 2 r—– 3543915.9
eng-prod-ssa-01 9 7503 3 -b—- 1532317.8
eng-prod-web-01 5 8195 3 -b—- 1505332.2
eng-prod-web-02 22 8195 3 -b—- 4841.3
eng-oau-mgmt-01 4 16387 2 -b—- 1325935.3
eng-oau-web-01 3 8195 2 -b—- 361317.7
oakDom1 1 81923 24 r—– 79419898.1
