NCC Health Check: cluster_services_status
Description
The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.
Prior to NCC-3.10.1
This check executes every 4 hours and looks for FATAL logs generated in the last 24 hours.
Although this check inspects both the current service status and previous crashes, it returns a FAIL status only if one or more services have crashed multiple times and generated a FATAL log within the last 24 hours (5 times on a single Controller VM, or 10 times across the cluster).
NCC-3.10.1 and later
This check executes every 10 minutes and looks for FATAL logs generated in the last 24 hours.
The check fails in either of the following cases:
- A service FATALs, within one day, 10 times across the cluster (for clusters with more than 10 nodes), or a number of times greater than or equal to the number of nodes in the cluster (for clusters with up to 10 nodes).
- A service FATALs 5 times on a single CVM within one day.
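The two FAIL conditions above can be sketched as a small shell function. This is illustrative only; the actual check parses FATAL logs internally, and the function and argument names here are hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch of the NCC-3.10.1+ FAIL thresholds.
# Arguments: total FATALs across the cluster in 24h, highest FATAL
# count on any single CVM in 24h, and the number of nodes.
check_status() {
  fatals_cluster=$1
  fatals_single_cvm=$2
  num_nodes=$3
  # Cluster-wide threshold: 10 for clusters with more than 10 nodes,
  # otherwise the number of nodes in the cluster.
  if [ "$num_nodes" -gt 10 ]; then
    cluster_threshold=10
  else
    cluster_threshold=$num_nodes
  fi
  if [ "$fatals_cluster" -ge "$cluster_threshold" ] || \
     [ "$fatals_single_cvm" -ge 5 ]; then
    echo FAIL
  else
    echo PASS
  fi
}
```

For example, on a 4-node cluster, 4 cluster-wide FATALs in a day is enough to FAIL, whereas a 12-node cluster requires 10.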
A single node in the cluster reports a FAIL status of the cluster_services_status check on behalf of all other CVMs in the cluster. When investigating for FATAL logs, ensure that you look at all CVMs, using the list of affected services from the FAIL status message as a guide.
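When checking all CVMs, a quick way to count FATAL log files touched in the last 24 hours is a small helper like the sketch below. It assumes the standard CVM service log directory /home/nutanix/data/logs, which is parameterized here so the function can also be pointed at another directory:

```shell
#!/bin/sh
# Count FATAL log files modified within the last 24 hours (1440 min).
# Defaults to the standard CVM log path; pass a directory to override.
count_recent_fatals() {
  log_dir="${1:-/home/nutanix/data/logs}"
  find "$log_dir" -name '*.FATAL*' -mmin -1440 2>/dev/null | wc -l
}
```

Wrapped in allssh, the same find command gives a per-CVM count across the cluster.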
If maintenance activities have recently been performed on the cluster, a FAIL status of this check indicates that the services are unstable, which might potentially affect the cluster performance or serviceability.
From NCC version 3.5.1, this check also applies to Prism Central VMs in a scale-out Prism Central (PC) cluster.
Running the NCC Check
Run this check as part of the complete NCC Health Checks.
nutanix@cvm$ ncc health_checks run_all
Or you can run this check individually.
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
You can also run the checks from the Prism web console Health page: select Actions > Run Checks. Select All checks and click Run.
This check is scheduled to run every 10 minutes, by default.
This check generates an alert after 1 failure.
Sample output
For Status: PASS
Running /health_checks/system_checks/cluster_services_status on all nodes [ PASS ]
+-------+-------+
| State | Count |
+-------+-------+
| Pass  | 1     |
| Total | 1     |
+-------+-------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
For Status: FAIL
Detailed information for cluster_services_status:
Node x.x.x.x:
FAIL: Components core dumped in last 24 hours: ['cerebro', 'curator']
Refer to KB 3378 (http://portal.nutanix.com/kb/3378) for details on cluster_services_status or Recheck with: ncc health_checks system_checks cluster_services_status
Output messaging
Check ID | 3034 |
Description | Check if services have restarted recently across the cluster. |
Causes of failure | This alert indicates that one or more services in the cluster were restarted. |
Resolutions | If this alert occurs once or infrequently, no action is necessary. If it is frequent, contact Nutanix support. |
Impact | Cluster performance may be significantly degraded. In the case of multiple services with the same condition, the cluster may become unable to service I/O requests. |
Alert ID | A3034 |
Alert Smart Title | Cluster Service Restarting Frequently |
Alert Title | Cluster Service services Restarting Frequently |
Alert Message | There have been multiple service restarts of services across all Controller VM(s). The latest crashes of these services occurred at timestamps respectively. |
Solution
If the cluster_services_status check returns a FAIL status, do the following:
- Check the list of core dumps generated on all the Controller VMs:
nutanix@cvm$ allssh 'ls -ltr /home/nutanix/data/cores'
- Run logbay from any Controller VM to collect the log files of the last 24 hours. (For more information on logbay, see Nutanix KB 6691.)
nutanix@cvm$ logbay collect --aggregate=true --duration=-24h
This generates a zip file in the directory /home/nutanix/data/logbay/bundles/.
Note: The aggregated log bundle might become too large on bigger clusters for a 24-hour collection. In that case, run logbay without the --aggregate option and upload the log bundle from each CVM to the support case.
- Look for files matching *.stack_trace.txt.gz on the CVMs that generated core dumps:
nutanix@cvm$ allssh 'ls -ltr /home/nutanix/data/cores'
- Create a new case on the Nutanix Support Portal and attach the output of the above commands and the logbay bundle to the support case.
Additional Information
- Nutanix KB 3378 - Original document in Nutanix Portal
- Nutanix landing page
- Lenovo ISG Support Plan - ThinkAgile HX Appliance and Lenovo Converged HX Series