NCC Health Check: cluster_services_status

Description

The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.

Prior to NCC-3.10.1

This check executes every 4 hours and looks for FATAL logs generated in the last 24 hours.

This check examines both the current service status and previous crashes, but it returns a FAIL status only if one or more services have crashed repeatedly and generated FATAL logs within the last 24 hours (5 times on a single Controller VM or 10 times across the cluster).

From NCC-3.10.1 onward

This check executes every 10 minutes and looks for FATAL logs generated in the last 24 hours.

The check fails in the following two cases:

  • A service FATALs 10 times across the cluster in one day (clusters with more than 10 nodes), or the number of FATALs in one day is greater than or equal to the number of nodes in the cluster (clusters with up to 10 nodes).
  • A service FATALs 5 times on a single CVM in one day.
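The two failure conditions above can be read as a simple decision rule. The following Python sketch is an illustration only and is not NCC's implementation: the function name check_fails and the per-CVM count mapping are hypothetical, and "N times" is interpreted here as "at least N times".

```python
from typing import Mapping

PER_CVM_LIMIT = 5        # FATALs of one service on a single CVM in one day
LARGE_CLUSTER_LIMIT = 10  # cluster-wide limit for clusters with more than 10 nodes


def check_fails(fatals_per_cvm: Mapping[str, int], node_count: int) -> bool:
    """Return True if one service's FATAL counts would trip either condition.

    fatals_per_cvm maps a CVM identifier to the number of FATALs that a
    single service generated on that CVM in the last 24 hours (hypothetical
    input shape, assumed for this sketch).
    """
    # Clusters with up to 10 nodes use the node count as the cluster-wide limit.
    if node_count > 10:
        cluster_limit = LARGE_CLUSTER_LIMIT
    else:
        cluster_limit = node_count

    cluster_total = sum(fatals_per_cvm.values())
    single_cvm_hit = any(count >= PER_CVM_LIMIT for count in fatals_per_cvm.values())
    return cluster_total >= cluster_limit or single_cvm_hit
```

Under these assumptions, a 12-node cluster with 4 FATALs on each of two CVMs does not fail (8 in total, fewer than 5 per CVM), while a 3-node cluster with one FATAL on each CVM does.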

A single node in the cluster reports a FAIL status of the cluster_services_status check on behalf of all other CVMs in the cluster. When investigating for FATAL logs, ensure that you look at all CVMs, using the list of affected services from the FAIL status message as a guide.

If maintenance activities have recently been performed on the cluster, a FAIL status of this check indicates that the services are unstable, which might affect cluster performance or serviceability.

From NCC version 3.5.1, this check also applies to Prism Central VMs in a scale-out PC cluster.

Running the NCC Check

Run this check as part of the complete NCC Health Checks.

nutanix@cvm$ ncc health_checks run_all

Or you can run this check individually.

nutanix@cvm$ ncc health_checks system_checks cluster_services_status

You can also run the checks from the Prism web console Health page: select Actions > Run Checks. Select All checks and click Run.

This check is scheduled to run every 10 minutes, by default.
This check will generate an alert after 1 failure.

Sample output

For Status: PASS

Running /health_checks/system_checks/cluster_services_status on all nodes [ PASS ]
------------------------------------------------------------------------+
+---------------+
| State | Count |
+---------------+
| Pass  | 1     |
| Total | 1     |
+---------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

For Status: FAIL

Detailed information for cluster_services_status: 
Node x.x.x.x: 
FAIL: Components core dumped in last 24 hours: ['cerebro', 'curator'] 
Refer to KB 3378 (http://portal.nutanix.com/kb/3378) for details on cluster_services_status or Recheck with: ncc health_checks system_checks cluster_services_status 
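When the output is saved to a file, the list of affected services can be pulled out of the FAIL line programmatically. The Python sketch below assumes the FAIL message has exactly the wording shown in the sample above; the function name affected_services is hypothetical, and the pattern would need adjusting if the message format differs between NCC versions.

```python
import ast
import re

# Matches the bracketed service list in the sample FAIL line shown above.
FAIL_PATTERN = re.compile(
    r"FAIL: Components core dumped in last 24 hours: (\[.*?\])"
)


def affected_services(ncc_output: str) -> list:
    """Return the sorted, de-duplicated services named in FAIL lines."""
    services = set()
    for match in FAIL_PATTERN.finditer(ncc_output):
        # The list is printed as a Python-style literal, e.g. ['cerebro', 'curator'].
        services.update(ast.literal_eval(match.group(1)))
    return sorted(services)


sample = (
    "Node x.x.x.x: \n"
    "FAIL: Components core dumped in last 24 hours: ['cerebro', 'curator'] \n"
)
print(affected_services(sample))  # ['cerebro', 'curator']
```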

Output messaging

Check ID: 3034
Description: Check if services have restarted recently across the cluster.
Causes of failure: This alert indicates that one or more services in the cluster were restarted.
Resolutions: If this alert occurs once or infrequently, no action is necessary. If it is frequent, contact Nutanix support.
Impact: Cluster performance may be significantly degraded. In the case of multiple services with the same condition, the cluster may become unable to service I/O requests.
Alert ID: A3034
Alert Smart Title: Cluster Service Restarting Frequently
Alert Title: Cluster Service services Restarting Frequently
Alert Message: There have been multiple service restarts of services across all Controller VM(s). Latest crash of these services have occurred at timestamps respectively.

Solution

If the cluster_services_status check returns a FAIL status, do the following:

  1. Check the list of core dumps generated on all the Controller VMs:
    nutanix@cvm$ allssh 'ls -ltr /home/nutanix/data/cores'
  2. Run logbay from any Controller VM to collect the log files of the last 24 hours. (For more information on logbay, see Nutanix KB 6691.)
    nutanix@cvm$ logbay collect --aggregate=true --duration=-24h
    This generates a zip file in the directory /home/nutanix/data/logbay/bundles/.
    Note: On larger clusters, an aggregated 24-hour log bundle might become too large. In that case, run logbay without the --aggregate option and upload the log bundle from each CVM to the support case.
  3. In the same directory, look for files matching *.stack_trace.txt.gz, which should be present on any CVM that generated core dumps:
    nutanix@cvm$ allssh 'ls -ltr /home/nutanix/data/cores'
  4. Create a new case on the Nutanix Support Portal and attach the output of the above commands and the logbay bundle to the support case.
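If the directory listing is long, the stack trace files can be separated from the rest with a simple pattern match. This Python sketch is only an illustration: the function name stack_trace_files is hypothetical, and the file names in the example are made up to show the shape of the input, not taken from a real cluster.

```python
from fnmatch import fnmatchcase

STACK_TRACE_PATTERN = "*.stack_trace.txt.gz"


def stack_trace_files(names):
    """Return only the file names that match the stack trace pattern."""
    return [name for name in names if fnmatchcase(name, STACK_TRACE_PATTERN)]


# Hypothetical file names, for illustration only.
listing = [
    "example_service.1234.core",
    "example_service.1234.stack_trace.txt.gz",
    "notes.txt",
]
print(stack_trace_files(listing))
```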

Additional Information

Document ID:HT516511
Original Publish Date:05/21/2024
Last Modified Date:05/23/2024