NCC Health Check: cfs_fatal_check
Description
The NCC health check cfs_fatal_check determines whether the CFS (Collector Framework Service) process is stable. The CFS process resides under the cluster_health service and sends Pulse data to Insights. The check triggers if the CFS process has restarted at least 4 times in the past 2 hours. Until the CFS process stabilizes, the delivery of Remote Support and Pulse data to Insights may be delayed, which in turn delays proactive support.
Before running this check, upgrade NCC to the latest version. This check was introduced in NCC 4.6.1.
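The trigger rule described above can be sketched as a small shell function. This only illustrates the "at least 4 restarts in the past 2 hours" rule: the function name is_unstable, its arguments, and the timestamps are hypothetical, and the way NCC actually counts restarts is not described in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NCC): succeeds when at least 4 of the given
# restart timestamps fall inside the 2 hours (7200 seconds) before NOW.
# Usage: is_unstable NOW_EPOCH RESTART_EPOCH...
is_unstable() {
  local now=$1; shift
  local window=7200 threshold=4 count=0 ts
  for ts in "$@"; do
    # Count only restarts inside the trailing 2-hour window.
    if (( ts <= now && now - ts <= window )); then
      count=$((count + 1))
    fi
  done
  (( count >= threshold ))
}

# Made-up epoch values: four restarts within the last hour.
if is_unstable 100000 99000 98000 97000 96500; then
  echo "unstable"
else
  echo "stable"
fi
```

With the sample values, all four restarts fall inside the window, so the sketch prints "unstable".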
Running the NCC Check
You can run this check as part of the complete NCC Health Checks.
nutanix@cvm$ ncc health_checks run_all
Or you can run this check separately.
nutanix@cvm$ ncc health_checks pulse_checks cfs_fatal_check
You can also run the checks from the Prism web console Health page. Select Actions > Run Checks. Select All checks and click Run.
This check is scheduled to run every 7200 seconds.
This check generates the alert "CFS process is not in a stable state".
Sample Outputs
For Status: PASS
Running : health_checks pulse_checks cfs_fatal_check
[==================================================] 100%
/health_checks/pulse_checks/cfs_fatal_check [ PASS ]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+-----------------------+
| State | Count |
+-----------------------+
| Pass | 1 |
| Total Plugins | 1 |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
For Status: Warning
Running : health_checks pulse_checks cfs_fatal_check
[==================================================] 100%
/health_checks/pulse_checks/cfs_fatal_check [ WARN ]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Detailed information for cfs_fatal_check:
Node X.Y.Z.240:
Warn: CFS process is not in a stable state.
Refer to KB 13150 (http://portal.nutanix.com/kb/13150) for details on cfs_fatal_check or Recheck with: ncc health_checks pulse_checks cfs_fatal_check --cvm_list=X.Y.Z.240
+-----------------------+
| State | Count |
+-----------------------+
| Fail | 1 |
| Total Plugins | 1 |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
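When the check is run from a script, the status can be pulled out of console output shaped like the samples above. The following is a minimal sketch, assuming the result line looks exactly like the one in the samples; the function name cfs_check_status is hypothetical, and real console output may contain progress-bar or control characters that need extra handling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NCC): read NCC console output on stdin and
# print the status token (for example PASS or WARN) reported for cfs_fatal_check.
cfs_check_status() {
  sed -n 's|^/health_checks/pulse_checks/cfs_fatal_check \[ *\([A-Z][A-Z]*\) *] *$|\1|p' | head -n 1
}

# Example using lines copied from the WARN sample output; prints "WARN".
printf '%s\n' \
  'Running : health_checks pulse_checks cfs_fatal_check' \
  '/health_checks/pulse_checks/cfs_fatal_check [ WARN ]' | cfs_check_status
```

On a Controller VM, the same function could be fed by piping the output of the ncc command shown earlier into it.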
Output messaging
| Check ID | 140005 |
|---|---|
| Description | This is a check to determine if the CFS process is stable. The CFS process resides under the cluster_health service and sends Pulse data to Insights. The check will trigger if the CFS process has restarted at least 4 times in the past 2 hours. |
| Causes of failure | The CFS process could repeatedly restart if it hits an unrecoverable error, if it fails to perform a task dependent on the filesystem or zookeeper process, or if it is killed by the Out Of Memory (OOM) killer. |
| Resolutions | The CFS process sits under the cluster_health service on the Controller VM. Verify that the cluster_health service on the affected node is running. Check the cfs.out and cluster_health service logs for more details about the cause of the crash. |
| Impact | Insights data and proactive support can be delayed until the CFS process has stabilized. |
Solution
Pulse provides diagnostic system data to Nutanix Support to deliver proactive, context-aware support for Nutanix solutions. Nutanix recommends that customers enable Pulse. Refer to Pulse Health Monitoring for more details.
- If your cluster runs an NCC version older than 4.6.3.1, upgrade NCC to the latest version using Life Cycle Manager (LCM).
- Resolve the alert "CFS process is not in a stable state" from the Prism Web Console.
- Re-run the NCC check as instructed above.
- If you still see the NCC check failure, run the following NCC check to verify connectivity to Nutanix Pulse Insights servers.
nutanix@cvm$ ncc health_checks pulse_checks rest_connection_checks
- If the above check fails, follow the instructions in KB-5490 to restore connectivity. An upstream network connectivity issue may need to be resolved: review the DNS, routing, and firewall or ACL configuration for your network.
- The alert can also be raised when the CFS process repeatedly restarts because it hits an unrecoverable error, fails to perform a task that depends on the filesystem or the zookeeper process, or is killed by the Out Of Memory (OOM) killer.
- Verify that the CFS service is running on the cluster:
nutanix@CVM:~$ ps aux | grep /home/nutanix/ncc/bin/nusights/cfs | grep -v grep | awk '$11 == "/home/nutanix/ncc/bin/nusights/cfs" { print $0 }'
nutanix 4899 0.2 0.2 1438992 83792 ? Sl Jan03 6:31 /home/nutanix/ncc/bin/nusights/cfs -use_iam=True -log_dir=/home/nutanix/data/logs/
-logtostderr=True -logstacktostderr=True -useUTC=True -config_dir=/home/nutanix/ncc/config/nusights -protocol=https -tls_host_name=
-ca_cert_path=/home/nutanix/ncc/cert/insights_collector/cacert.pem -rest_base_url=/nusights/services -rest_protocol_version=v1
-use_pc_as_proxy=True -experimental_dump_to_file=True -experimental_dump_transported_data_to_file=False -stats_flush_frequency_secs=900
-num_os_threads=1 -max_rss_memory_limit_mb=628 -high_rss_mb=130 -low_rss_pt=70 -resource_check_interval_secs=5 -enable_self_monitoring=false
-prof_dir=/home/nutanix/data/cores/ -mem_profile_rate=-1 -enable_live_debug=False -v=0 -cgroup_subsystems=cpu,cpuacct,memory
-use_resumable_file_upload=True -enable_metering_mode_monitoring=True -enable_message_batching=True -max_batch_message_size_in_kb=64
-batch_msg_send_duration_in_sec=120 -enable_local_stats_storage=True -read_additional_cvmconfig_info=true -commit_log_read_buf_size_mb=2
-token_generation_rate_per_sec=100.000000 -burst_size=200
- Check if the CFS service has recently crashed. The first column of the output is the elapsed time since the process started. In the sample output below, the CFS process has been running for 2 days, 6 hours, 2 minutes, and 33 seconds.
nutanix@CVM:~$ ps -eo etime,args | grep /home/nutanix/ncc/bin/nusights/cfs | grep -v grep | awk '$2 == "/home/nutanix/ncc/bin/nusights/cfs" { print $0 }'
2-06:02:33 /home/nutanix/ncc/bin/nusights/cfs -use_iam=True -log_dir=/home/nutanix/data/logs/ -logtostderr=True -logstacktostderr=True -useUTC=True
-config_dir=/home/nutanix/ncc/config/nusights -protocol=https -tls_host_name= -ca_cert_path=/home/nutanix/ncc/cert/insights_collector/cacert.pem
-rest_base_url=/nusights/services -rest_protocol_version=v1 -use_pc_as_proxy=True -experimental_dump_to_file=True
-experimental_dump_transported_data_to_file=False -stats_flush_frequency_secs=900 -num_os_threads=1 -max_rss_memory_limit_mb=628 -high_rss_mb=130
-low_rss_pt=70 -resource_check_interval_secs=5 -enable_self_monitoring=false -prof_dir=/home/nutanix/data/cores/ -mem_profile_rate=-1
-enable_live_debug=False -v=0 -cgroup_subsystems=cpu,cpuacct,memory -use_resumable_file_upload=True -enable_metering_mode_monitoring=True
-enable_message_batching=True -max_batch_message_size_in_kb=64 -batch_msg_send_duration_in_sec=120 -enable_local_stats_storage=True
-read_additional_cvmconfig_info=true -commit_log_read_buf_size_mb=2 -token_generation_rate_per_sec=100.000000 -burst_size=200
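To compare the elapsed time with the 2-hour window used by the check, the etime value can be converted to seconds. The sketch below assumes the standard ps etime layout of [[DD-]hh:]mm:ss; the function name etime_to_seconds is hypothetical and is not an NCC or Nutanix tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a ps etime value ([[DD-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  local etime=$1 days=0 hours=0 minutes=0 seconds=0
  # ps right-justifies the column, so drop any padding spaces first.
  etime=${etime//[[:space:]]/}
  # Split off the optional "DD-" day prefix.
  if [[ $etime == *-* ]]; then
    days=${etime%%-*}
    etime=${etime#*-}
  fi
  # The remainder is either hh:mm:ss or mm:ss.
  local IFS=:
  local parts=($etime)
  if (( ${#parts[@]} == 3 )); then
    hours=${parts[0]} minutes=${parts[1]} seconds=${parts[2]}
  else
    minutes=${parts[0]} seconds=${parts[1]}
  fi
  # Force base 10 so fields such as "08" are not parsed as octal.
  echo "$(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))"
}

# Example with the elapsed time from the sample output above.
uptime_secs=$(etime_to_seconds "2-06:02:33")
if (( uptime_secs < 7200 )); then
  echo "CFS started ${uptime_secs}s ago: restarted within the last 2 hours"
else
  echo "CFS started ${uptime_secs}s ago: no restart in the last 2 hours"
fi
```

For the sample value 2-06:02:33 the conversion gives 194553 seconds, well above the 7200-second window, so a single long-running process like this one points to a crash that is not recent.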
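- Check for recent FATAL entries in cfs.out. FATAL lines start with the letter F, and the -B8 option also prints the 8 lines that precede each match.
nutanix@CVM:~$ grep -B8 '^F' ~/data/logs/cfs.out*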
I0418 08:22:18.217482Z 13365 transport.go:993] HTTP(S) proxy: Testing connectivity to end point https://insights.nutanix.com:443/nusights/services/v1/test by making a http POST without any proxy with timeoutSecs: 60.
I0418 08:22:46.813213Z 13365 cvmconfig.go:838] current status has remained to be the same from prevStatus: false
I0418 08:22:47.794832Z 13365 cfs_stats.go:356] Publishing the commitlog stats to DB.
I0418 08:23:18.218412Z 13365 transport.go:2231] Reset cached transport 0xc0000e57c0 for transportKey PULSE:DIRECT:insights.nutanix.com.
E0418 08:23:18.218466Z 13365 transport.go:1026] HTTP(S) proxy: Test request to https://insights.nutanix.com:443/nusights/services/v1/test without any proxy failed with error Post "https://insights.nutanix.com:443/nusights/services/v1/test": context deadline exceeded and response nil
I0418 08:23:18.218479Z 13365 transport.go:1083] Server endpoint(https://insights.nutanix.com:443/nusights/services/v1/test) is not reachable directly without any proxy.
I0418 08:23:18.218486Z 13365 transport.go:1044] Trying connectivity tests for proxy type PC Proxy
I0418 08:23:18.218493Z 13365 transport.go:1144] 10830.378976167 Seconds lapsed since the connectivity test is started.
F0418 08:23:18.218505Z 13365 transport.go:1161] QFATAL Exiting CFS since POST Endpoint https://insights.nutanix.com:443/nusights/services/ is not reachable via any of the configured proxies.
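When several crashes are logged, a one-line summary per FATAL entry can make the pattern easier to see. This is a minimal sketch that assumes every FATAL line has the same layout as the sample above (F followed by what appears to be a month-and-day stamp, then time, process ID, and source location); the function name fatal_summary is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read cfs.out-style log lines on stdin and print the
# date stamp, time, and source location of each FATAL line.
fatal_summary() {
  awk '/^F[0-9][0-9][0-9][0-9] / {
    loc = $4
    sub(/\]$/, "", loc)          # drop the trailing "]" after file:line
    print substr($1, 2), $2, loc
  }'
}

# Example using two lines from the sample output above.
printf '%s\n' \
  'I0418 08:23:18.218493Z 13365 transport.go:1144] 10830.378976167 Seconds lapsed since the connectivity test is started.' \
  'F0418 08:23:18.218505Z 13365 transport.go:1161] QFATAL Exiting CFS since POST Endpoint https://insights.nutanix.com:443/nusights/services/ is not reachable via any of the configured proxies.' \
  | fatal_summary
```

For the sample lines this prints `0418 08:23:18.218505Z transport.go:1161`. On a Controller VM the function could be fed with `cat ~/data/logs/cfs.out* | fatal_summary`. Piping the raw files through cat rather than grep avoids the file-name prefixes that grep adds when it searches several files.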
- Restart the cluster_health service to attempt to stabilize the CFS process:
nutanix@CVM:~$ genesis stop cluster_health
nutanix@CVM:~$ cluster start
- Monitor the stability of the CFS process by re-running the NCC check.
Related Articles
- Original article in Nutanix Portal: Nutanix KB Article: 13150