Root Cause Analysis of CVM Reboots
Root Cause Analysis of CVM Reboots
Root Cause Analysis of CVM Reboots
Description
This article describes how to troubleshoot and perform root cause analysis when a CVM (Controller VM) suddenly reboots.
Logs to look for inside the CVM:
dmesg
/var/log/messages
/home/log/messages (Will have the detailed kernel logs at the time of the reboot.)
Logs to look for on AHV host:
/tmp/NTNX.serial.out.0
/var/tmp/NTNX.serial.out.0
/var/log/libvirt/qemu/NTNX-<blockserial>-<nodeposition>-CVM.log
Logs to look for inside the ESXi:
/vmfs/volumes/NTNX-local-ds-<serial>-<pos>/ServiceVM_Centos/ServiceVM_Centos.0.out
/vmfs/volumes/NTNX-local-ds-<serial>-<pos>/ServiceVM_Centos/vmware.log
/var/log/vmksummary.log
To review memory/cpu usage/disk latency of the CVM at the time of the reboot, the sysstats under /home/nutanix/data/logs/sysstats logs can be reviewed. Note that the logs are in UTC timestamp.
/home/nutanix/data/logs/sysstats/meminfo.INFO
/home/nutanix/data/logs/sysstats/mpstat.INFO
/home/nutanix/data/logs/sysstats/iostat.INFO
Solution
Examples
- CVM command last reboot:
nutanix@cvm$ last reboot reboot system boot 2.6.32-279.9.1.e Mon Dec 23 09:40 - 12:16 (02:36)
- Logs on CVM /var/log/messages and kern.log:
Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Waiting for /dev/fct0 to be created Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: probed fct0 Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: sector_size=512 Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: Device is operating as a block device. Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: setting channel range data to [2 .. 2047] Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: *** unclean shutdown detected, re-scanning log. *** Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: *** this may take several minutes. *** Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: *************************************************** Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: Powercut detected Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: Successfully reattached after unclean shutdown. (AP: 1942+228114432) Dec 23 09:40:06 NTNX-CVM-A kernel: fioinf Fusion-io ioDrive2 365GB 0000:03:00.0: Creating block device fioa: major: 252 minor: 0 sector size: 512... Dec 23 09:40:06 NTNX-CVM-A kernel: fioa: fioa1
- ESXi logs /vmfs/volumes/xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx/ServiceVM*/vmware.log:
2013-12-23T17:35:25.959Z| vcpu-0| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-2| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-7| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-1| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-5| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-4| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-3| I120: CPU reset: soft (mode 1) 2013-12-23T17:35:25.960Z| vcpu-6| I120: CPU reset: soft (mode 1)
"Restart Guest OS" on CVM initiated from vCentre results in the following signature in the cvm's vmware.log
(Note that this entry does not occur in the vmware.log if the CVM has been gracefully restarted from within the Nutanix Cluster via AOS Upgrade or cvm_shutdown command)2022-03-01T23:24:30.638Z| vmx| I125: Tools: sending 'OS_Reboot' (state = 2) state change request
"Shutdown Guest OS" on CVM initiated from vCentre results in the following signature in the cvm's vmware.log
(Note that this entry does not occur in the vmware.log if the CVM has been gracefully shutdown from within the Nutanix Cluster via AOS Upgrade or cvm_shutdown command)2022-03-02T00:22:15.448Z| vmx| I125: Tools: sending 'OS_Halt' (state = 1) state change request
Another example of vmware.log (based on VMware bug nr. 676321):2013-07-17T22:35:53.907Z| vcpu-0| W110: MONITOR PANIC: vcpu-7:ASSERT vmcore/exts/hv/vt/hv-vt.c:1933 bugNr=676321 2013-07-17T22:35:53.907Z| vcpu-0| I120: Core dump with build build-838463 2013-07-17T22:35:53.907Z| vcpu-6| I120: Exiting vcpu-6 2013-07-17T22:35:53.907Z| vcpu-4| I120: Exiting vcpu-7 2013-07-17T22:35:53.907Z| vcpu-0| W110: Writing monitor corefile "/vmfs/volumes/50630639-74fa7b98-830d-0025904c8605/ServiceVM-1.24_Ubuntu/vmmcores.gz"
Another vmware.log (EPT misconfiguration - VMware KB 1036775):2013-05-03T17:27:43.262Z| vcpu-1| MONITOR PANIC: vcpu-0:EPT misconfiguration: PA b49b405b0 2013-05-03T17:27:43.262Z| vcpu-1| Core dump with build build-623860 2013-05-03T17:27:43.262Z| vcpu-1| Writing monitor corefile "/vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmmcores.gz" 2013-05-03T17:27:43.262Z| vcpu-6| Exiting vcpu-6
- ESXi logs /vmfs/volumes/xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx/ServiceVM*/ServiceVM.out.0 shows jbd2/fio driver issue in this example:
last sysfs file: /sys/devices/pci0000:00/0000:00:10.0/host2/target2:0:2/2:0:2:0/block/sdb/queue/scheduler CPU 0 Modules linked in: be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr i Pid: 3403, comm: jbd2/fioa1-8 Tainted: P --------------- 2.6.32-279.9.1.el6.nutanix.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop RIP: 0010:[<ffffffffa01d888c>] [<ffffffffa01d888c>] jbd2_journal_commit_transaction+0x120c/0x14b0 [jbd2] RSP: 0018:ffff880431113d30 EFLAGS: 00010246 RAX: 0000000000000008 RBX: ffff8804330d9800 RCX: 0000000000000000 RDX: ffff8804060ff000 RSI: 0000000000000286 RDI: ffff8804330d9800 RBP: ffff880431113e60 R08: ffff880028216e90 R09: ffff880028216f00 R10: 0000000000000018 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8804330d9800 R14: ffff8804220a4ae0 R15: ffff8804330d9898 FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007fbeca8a4916 CR3: 0000000378ef3000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process jbd2/fioa1-8 (pid: 3403, threadinfo ffff880431112000, task ffff8804220a4ae0) Stack:
-
For any recent hard drive failure, check hades.out log.
If the SSD is the metadata drive, AOS will force a CVM to reboot. Also, if AOS has trouble removing an HDD and a forced removal is triggered by hades, a CVM will reboot.
The output of ServiceVM.out.0 (Bug 735768):
kernel BUG at fs/jbd2/commit.c:353! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/port-2:2/end_device-2:2/target2:0:2/2:0:2:0/block/sdc/dev CPU 1
ESXi vmksummary to see if the ESXi host rebooted:[root@esxi]# grep -i bootstop /var/log/vmksummary.log 2015-02-07T02:54:17Z bootstop: Host is powering off 2015-02-07T08:43:04Z bootstop: Host has booted
AHV:System Boot logs from Audit logs on the hypervisor 11277 type=SYSTEM_BOOT msg=audit(1556350213.112:4): pid=4405 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:initrc_t:s0 msg='init exe="/sbin/telinit" hostname=? addr=? terminal=console res=success' 11278 type=SYSTEM_RUNLEVEL msg=audit(1556350213.112:5): pid=4405 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:initrc_t:s0 msg='old-level=N new-level=3 exe="/sbin/telinit" hostname=? addr=? terminal=console res=succe ss'
CVM:nutanix@cvm$ sudo grep -i "kmsg started" /home/log/messages 2015-01-30T10:59:39.957663-08:00 NTNX-A-CVM kernel: imklog 5.8.10, log source = /proc/kmsg started. 2015-02-07T00:46:55.164530-08:00 NTNX-A-CVM kernel: imklog 5.8.10, log source = /proc/kmsg started.
Scroll a few lines above to have more information:nutanix@cvm$ sudo grep -i -B 5 "kmsg started" /home/log/messages 2015-02-06T18:00:02.539862-08:00 NTNX-C-CVM audispd: node=NTNX-C-CVM type=EOE msg=audit(1423274402.537:7498): 2015-02-06T18:00:02.578946-08:00 NTNX-C-CVM audispd: node=NTNX-C-CVM type=SYSCALL msg=audit(1423274402.577:7499): arch=c000003e syscall=90 success=yes exit=0 a0=251b700 a1=1ed a2=7f1ddb485a08 a3=7fff69bbdf30 items=1 ppid=8586 pid=9025 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=(none) ses=150912 comm="python" exe="/usr/bin/python" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="perm_mod" 2015-02-06T18:00:02.585360-08:00 NTNX-C-CVM audispd: node=NTNX-C-CVM type=SYSCALL msg=audit(1423274402.584:7500): arch=c000003e syscall=90 success=yes exit=0 a0=2894550 a1=1ed a2=7f1e2b955a08 a3=7fff0e433a48 items=1 ppid=8570 pid=9026 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=(none) ses=150897 comm="python" exe="/usr/bin/python" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="perm_mod" 2015-02-06T18:00:02.585392-08:00 NTNX-C-CVM audispd: node=NTNX-C-CVM type=PATH msg=audit(1423274402.584:7500): item=0 name="/home/nutanix/.python-eggs/simplejson-3.4.1-py2.6-linux-x86_64.egg-tmp/simplejson/tmp0cHe62.$extract" inode=365 dev=09:02 mode=0100600 ouid=1000 ogid=1000 rdev=00:00 obj=unconfined_u:object_r:user_home_t:s0 nametype=NORMAL
For newer versions of CVM, you may have to grep for "rsyslogd.*start" rather than "kmsg started":
nutanix@cvm$ sudo grep -i "rsyslogd.*start" /var/log/messages
2018-03-06T03:28:13.648673-07:00 NTNX-C-CVM rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="1273" x-info="http://www.rsyslog.com"] start
2018-03-06T03:28:13.647853-07:00 NTNX-C-CVM rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
2018-03-06T03:28:13.651494-07:00 NTNX-C-CVM systemd[1]: Started System Logging Service.
Additional Information
- Nutanix KB 1252 - Original document in Nutanix Portal
- Nutanix landing page
- Lenovo ISG Support Plan - ThinkAgile HX Appliance and Lenovo Converged HX Series