NCC Health Check: cfs_fatal_check

Description

The NCC health check cfs_fatal_check determines whether the CFS (Collector Framework Service) process is stable. The CFS process runs under the cluster_health service and sends Pulse data to Insights. The check triggers if the CFS process has restarted at least 4 times in the past 2 hours. Until the CFS process has stabilized, the sending of remote support and Pulse data to Insights, and therefore proactive support, may be delayed.
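The restart threshold can be illustrated with a small shell function. This is only a sketch of the rule as stated here (at least 4 restarts within 7200 seconds), not the NCC implementation, which this article does not show. The function name cfs_is_unstable and its arguments (the current time followed by the restart times, all in epoch seconds) are hypothetical.

```shell
# Hypothetical helper, not part of NCC.
# Usage: cfs_is_unstable NOW RESTART_TIME...   (all values in epoch seconds)
# Returns success (0) when at least 4 restarts fall inside the last 2 hours.
cfs_is_unstable() {
  local now=$1 window=7200 threshold=4 count=0 ts age
  shift
  for ts in "$@"; do
    age=$((now - ts))
    # Count only restarts that are not in the future and not older than the window.
    if [ "$age" -ge 0 ] && [ "$age" -le "$window" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -ge "$threshold" ]
}
```

With a current time of 10000, the restart times 9000, 8000, 7000, and 6000 all fall inside the window, so the function reports an unstable process. With only three of them it does not.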

Before running this check, upgrade NCC to the latest version. This check was introduced in NCC 4.6.1.

Running the NCC Check

You can run this check as part of the complete NCC Health Checks.

nutanix@cvm$ ncc health_checks run_all

Or you can run this check separately.

nutanix@cvm$ ncc health_checks pulse_checks cfs_fatal_check

You can also run the checks from the Prism web console Health page. Select Actions > Run Checks. Select All checks and click Run.

This check is scheduled to run every 7200 seconds (2 hours).
This check generates the alert "CFS process is not in a stable state".

Sample Outputs

For Status: PASS

Running : health_checks pulse_checks cfs_fatal_check
[==================================================] 100%

/health_checks/pulse_checks/cfs_fatal_check                                                                                                                        [ PASS ] 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+-----------------------+
| State         | Count |
+-----------------------+
| Pass          | 1     |
| Total Plugins | 1     |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

For Status: WARN

Running : health_checks pulse_checks cfs_fatal_check
[==================================================] 100%
/health_checks/pulse_checks/cfs_fatal_check                                                                                                                        [ WARN ] 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Detailed information for cfs_fatal_check:
Node X.Y.Z.240: 
Warn: CFS process is not in a stable state.
Refer to KB 13150 (http://portal.nutanix.com/kb/13150) for details on cfs_fatal_check or Recheck with: ncc health_checks pulse_checks cfs_fatal_check --cvm_list=X.Y.Z.240
+-----------------------+
| State         | Count |
+-----------------------+
| Fail          | 1     |
| Total Plugins | 1     |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
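When the check is run from a script, the status can be read from the status line shown in the sample outputs. The following is a minimal sketch, assuming plain-text output in exactly the format shown above (check path, then the status between square brackets). ncc_check_status is a hypothetical helper, not an NCC command.

```shell
# Hypothetical helper, not part of NCC.
# Reads NCC output on stdin and prints the status (for example PASS or WARN)
# of the check whose path ends in "/<check name>".
# Assumes the status line looks like: /health_checks/.../<check name>   [ PASS ]
ncc_check_status() {
  awk -v check="$1" '
    $1 ~ ("/" check "$") && $2 == "[" && $4 == "]" {
      print $3   # the token between the brackets
      exit       # stop at the first matching line
    }'
}
```

For example, piping the second sample output above into ncc_check_status cfs_fatal_check would print WARN.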

Output messaging

Check ID	140005
Description	This is a check to determine if the CFS process is stable. The CFS process resides under the cluster_health service and sends Pulse data to Insights. The check will trigger if the CFS process has restarted at least 4 times in the past 2 hours.
Causes of failure	The CFS process could repeatedly restart if it hits an unrecoverable error, or the CFS process fails to perform a task dependent on the filesystem or zookeeper process, or if it is killed by Out Of Memory (OOM) killer.
Resolutions	The CFS process sits under the cluster_health service on the Controller VM. Verify that the cluster_health service on the affected node is running. Check the cfs.out and cluster_health service logs for more details about the cause of the crash.
Impact	Insights data and proactive support can be delayed until the CFS process has stabilized.

Solution

Pulse provides diagnostic system data to Nutanix Support to deliver proactive, context-aware support for Nutanix solutions. Nutanix recommends that customers enable Pulse. Refer to Pulse Health Monitoring for more details.

If your cluster runs an NCC version older than 4.6.3.1, upgrade NCC to the latest version using Life Cycle Manager (LCM).
Resolve the alert "CFS process is not in a stable state" from the Prism web console.
Re-run the NCC check as instructed above.
- If the NCC check still fails, run the following NCC check to verify connectivity to the Nutanix Pulse Insights servers.

nutanix@cvm$ ncc health_checks pulse_checks rest_connection_checks

If the above check fails, follow the instructions in KB-5490 to restore connectivity. An upstream network connectivity issue may need to be resolved. Review the DNS, routing, and firewall or ACL configuration for your network.

The alert can also be raised because the CFS process restarts repeatedly after hitting an unrecoverable error, failing to perform a task that depends on the filesystem or the zookeeper process, or being killed due to an out-of-memory (OOM) condition.
- Verify that the CFS service is running on the cluster:

nutanix@CVM:~$ ps aux | grep /home/nutanix/ncc/bin/nusights/cfs | grep -v grep | awk '$11 == "/home/nutanix/ncc/bin/nusights/cfs" { print $0 }'

nutanix   4899  0.2  0.2 1438992 83792 ?       Sl   Jan03   6:31 /home/nutanix/ncc/bin/nusights/cfs -use_iam=True -log_dir=/home/nutanix/data/logs/ 
-logtostderr=True -logstacktostderr=True -useUTC=True -config_dir=/home/nutanix/ncc/config/nusights -protocol=https -tls_host_name= 
-ca_cert_path=/home/nutanix/ncc/cert/insights_collector/cacert.pem -rest_base_url=/nusights/services -rest_protocol_version=v1 
-use_pc_as_proxy=True -experimental_dump_to_file=True -experimental_dump_transported_data_to_file=False -stats_flush_frequency_secs=900 
-num_os_threads=1 -max_rss_memory_limit_mb=628 -high_rss_mb=130 -low_rss_pt=70 -resource_check_interval_secs=5 -enable_self_monitoring=false 
-prof_dir=/home/nutanix/data/cores/ -mem_profile_rate=-1 -enable_live_debug=False -v=0 -cgroup_subsystems=cpu,cpuacct,memory 
-use_resumable_file_upload=True -enable_metering_mode_monitoring=True -enable_message_batching=True -max_batch_message_size_in_kb=64 
-batch_msg_send_duration_in_sec=120 -enable_local_stats_storage=True -read_additional_cvmconfig_info=true -commit_log_read_buf_size_mb=2 
-token_generation_rate_per_sec=100.000000 -burst_size=200

Check whether the CFS service has crashed recently. In the sample output below, the first column is the elapsed time since the process started: the CFS process has been running for 2 days, 6 hours, 2 minutes, and 33 seconds.

nutanix@CVM:~$ ps -eo etime,args | grep /home/nutanix/ncc/bin/nusights/cfs | grep -v grep | awk '$2 == "/home/nutanix/ncc/bin/nusights/cfs" { print $0 }'

2-06:02:33 /home/nutanix/ncc/bin/nusights/cfs -use_iam=True -log_dir=/home/nutanix/data/logs/ -logtostderr=True -logstacktostderr=True -useUTC=True 
-config_dir=/home/nutanix/ncc/config/nusights -protocol=https -tls_host_name= -ca_cert_path=/home/nutanix/ncc/cert/insights_collector/cacert.pem 
-rest_base_url=/nusights/services -rest_protocol_version=v1 -use_pc_as_proxy=True -experimental_dump_to_file=True 
-experimental_dump_transported_data_to_file=False -stats_flush_frequency_secs=900 -num_os_threads=1 -max_rss_memory_limit_mb=628 -high_rss_mb=130 
-low_rss_pt=70 -resource_check_interval_secs=5 -enable_self_monitoring=false -prof_dir=/home/nutanix/data/cores/ -mem_profile_rate=-1 
-enable_live_debug=False -v=0 -cgroup_subsystems=cpu,cpuacct,memory -use_resumable_file_upload=True -enable_metering_mode_monitoring=True 
-enable_message_batching=True -max_batch_message_size_in_kb=64 -batch_msg_send_duration_in_sec=120 -enable_local_stats_storage=True 
-read_additional_cvmconfig_info=true -commit_log_read_buf_size_mb=2 -token_generation_rate_per_sec=100.000000 -burst_size=200
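To compare the elapsed time with the 2-hour window used by the check, the etime value can be converted to seconds. The following is a minimal sketch that relies on the [[DD-]hh:]mm:ss layout that ps uses for etime. etime_to_seconds is a hypothetical helper name.

```shell
# Hypothetical helper: convert a ps "etime" value ([[DD-]hh:]mm:ss) to seconds.
# Prints the number of seconds, or returns a non-zero status for other input.
etime_to_seconds() {
  printf '%s\n' "$1" | awk -F'[-:]' '
    NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }  # DD-hh:mm:ss
    NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }               # hh:mm:ss
    NF == 2 { print $1 * 60 + $2; next }                           # mm:ss
    { exit 1 }                                                     # unrecognized layout
  '
}
```

For the sample output above, etime_to_seconds 2-06:02:33 prints 194553, which is well above 7200, so this CFS process has not restarted within the last 2 hours.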

Check for recent FATAL entries in cfs.out.

nutanix@CVM:~$ grep -B8 ^F ~/data/logs/cfs.out*

I0418 08:22:18.217482Z   13365 transport.go:993] HTTP(S) proxy: Testing connectivity to end point https://insights.nutanix.com:443/nusights/services/v1/test by making a http POST without any proxy with timeoutSecs: 60.
I0418 08:22:46.813213Z   13365 cvmconfig.go:838] current status has remained to be the same from prevStatus: false
I0418 08:22:47.794832Z   13365 cfs_stats.go:356] Publishing the commitlog stats to DB.
I0418 08:23:18.218412Z   13365 transport.go:2231] Reset cached transport 0xc0000e57c0 for transportKey PULSE:DIRECT:insights.nutanix.com.
E0418 08:23:18.218466Z   13365 transport.go:1026] HTTP(S) proxy: Test request to https://insights.nutanix.com:443/nusights/services/v1/test without any proxy failed with error Post "https://insights.nutanix.com:443/nusights/services/v1/test": context deadline exceeded and response nil
I0418 08:23:18.218479Z   13365 transport.go:1083] Server endpoint(https://insights.nutanix.com:443/nusights/services/v1/test) is not reachable directly without any proxy.
I0418 08:23:18.218486Z   13365 transport.go:1044] Trying connectivity tests for proxy type PC Proxy
I0418 08:23:18.218493Z   13365 transport.go:1144] 10830.378976167 Seconds lapsed since the connectivity test is started.
F0418 08:23:18.218505Z   13365 transport.go:1161] QFATAL Exiting CFS since POST Endpoint https://insights.nutanix.com:443/nusights/services/ is not reachable via any of the configured proxies.
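When cfs.out contains many entries, it can help to reduce each FATAL line to its date, time, and source location. The following is a minimal sketch, assuming every log line follows the layout of the sample above (severity letter and four-digit date, time, process ID, then file:line followed by a closing bracket). summarize_fatals is a hypothetical helper name.

```shell
# Hypothetical helper: read cfs.out lines on stdin and print one
# "MMDD HH:MM:SS file:line" summary per FATAL entry.
# Assumes the layout seen in the sample: F0418 08:23:18.218505Z 13365 transport.go:1161] ...
summarize_fatals() {
  awk '/^F[0-9][0-9][0-9][0-9] / {
    date = substr($1, 2)      # drop the leading severity letter, e.g. "0418"
    split($2, t, ".")         # t[1] is the time without fractional seconds
    src = $4
    sub(/\]$/, "", src)       # strip the closing bracket after file:line
    print date, t[1], src
  }'
}
```

For example, cat ~/data/logs/cfs.out* | summarize_fatals prints 0418 08:23:18 transport.go:1161 for the FATAL entry shown above. The function expects raw log lines, so use cat rather than the output of a multi-file grep, which prefixes each line with the file name.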

Restart the cluster_health service to attempt to stabilize the CFS process:

nutanix@CVM:~$ genesis stop cluster_health
nutanix@CVM:~$ cluster start

Monitor the stability of the CFS process by re-running the NCC check.

Original article in Nutanix Portal: Nutanix KB Article : 13150

Document ID:HT516498

Original Publish Date:05/17/2024

Last Modified Date:05/23/2024