HDD, SSD, and HBA troubleshooting

Description

When a drive is experiencing recoverable errors, warnings, or a complete failure, the Stargate service marks the disk as offline. If the disk is detected to be offline 3 times within the hour, it is removed from the cluster automatically, and an alert is generated (KB-4158 or KB-6287).

If an alert is generated in Prism, the disk must be replaced. Troubleshooting steps do not need to be performed.

NOTE: If a failed disk is encountered in a Nutanix Clusters on AWS deployment, once the disk is confirmed to have failed, proceed to condemn the affected node. Condemning the affected node replaces it with a new bare-metal instance of the same type.

Solution

Once the disk is replaced, an NCC health check should be performed to ensure optimal cluster health.
However, if an alert was not generated in the first place or further analysis is required, the steps below can be used to troubleshoot further.
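As a reference, a full NCC run can be started from any CVM (standard NCC usage; not specific to this article):
  nutanix@cvm$ ncc health_checks run_all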

Before you begin troubleshooting, verify the type of HBA controller.

Caution:
Using the SAS3IRCU command against an LSI 3408 or higher HBA can cause NMI events that could lead to storage unavailability.
Confirm the HBA controller before using the following commands.  

To determine what type of HBA is used, look for the controller name located in /etc/nutanix/hardware_config.json on the CVM.
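For example, the following grep (a suggested shortcut, not part of the original procedure) prints the controller "name" and "led_address" entries from this file on every CVM:

  nutanix@cvm$ allssh "grep -E 'name|led_address' /etc/nutanix/hardware_config.json"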

  • Example of the output when SAS3008 is used:

    In this case, the command SAS3IRCU is the correct command to use.

    Note the "led_address": "sas3ircu:0,1:0" line:

    "node": {
        "storage_controllers": [
          {
            "subsystem": "15d9:0808",
            "name": "LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3",
            "mapping": [
              {
                "slot_designation": "1",
                "hba_address": "0",
                "slot_id": null,
                "location": {
                  "access_plane": 1,
                  "cell_x": 6,
                  "width": 6,
                  "cell_y": 2,
                  "height": 1
                },
                "led_address": "sas3ircu:0,1:0"
              },
  • Example of the output when SAS3400/3800 (or newer) is used: 

    In this case, using SAS3IRCU would be ill-advised. Use the storcli command instead. For information on StorCLI refer to KB-10951.

    Note "led_address": "storcli:0" line.

    "storage_controllers_v2": [
          {
            "subsystem": "15d9:1b64",
            "name": "Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx",
            "hba_hints": {
              "sas_address": "0x50030480208d9939"
            },
            "mapping": [
              {
                "slot_designation": "1",
                "hba_address": "0",
                "slot_id": 1,
                "location": {
                  "access_plane": 1,
                  "height": 3,
                  "width": 4,
                  "cell_y": 0,
                  "cell_x": 78
                },
                "led_address": "storcli:0"
              },
    
  1. Identify the problematic disks

    1. Check the Prism Web console for the failed disk. In the Diagram view, a missing or failed disk is shown in red or grey.
    2. Check the Prism Web console for disk alerts, or use the following command to check for disks that generate the failure messages.
      nutanix@cvm$ ncli alert ls
    3. Check if any nodes are missing mounted disks. The two outputs should match numerically.
      1. Check the disks that are mounted on the CVM (Controller VM).
        nutanix@cvm$ allssh "df -h | grep -i stargate-storage | wc -l"
      2. Check the physical disks present in the CVM.
        nutanix@cvm$ allssh "lsscsi | grep -v DVD-ROM | wc -l"
      3. Check if the status of the disks is all Online and indicated as Normal.
        nutanix@cvm$ ncli disk ls | egrep -i 'Online|Status'
    4. Validate the expected number of disks in the cluster.
      nutanix@cvm$ ncli disk ls | grep -i 'Status' | wc -l

      The output of the command above should match the mounted-disk and physical-disk counts from step 3, each summed across all nodes.

      The number can be higher or lower than expected in some cases, so it is an important metric to compare against the disk alerts found in step 2.

    5. Look for extra or missing disks.
      nutanix@cvm$ ncli disk ls
    6. Check that all disks are indicated as mounted rw (read-write) and not ro (read-only).
      nutanix@cvm$ sudo mount | grep -E 'stargate-storage.*rw'
      nutanix@cvm$ sudo mount | grep -E 'stargate-storage.*ro'
  2. Identify the problems with the disks or nodes

    1. Orphaned disk ID

      This is a disk ID that the system no longer uses but that was not properly removed. Symptoms include an extra disk ID listed in the output of ncli disk ls.

      To fix the orphaned disk ID:

      nutanix@cvm$ ncli disk rm-start id=<diskID> force=true

      Ensure that you validate the disk serial number and confirm that the device is no longer present in the system. Also, verify that all disks are populated by using lsscsi, mount, and df -h, and by counting the disks to confirm the full disk population, as shown in the example below.
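
      For illustration, the following commands (a sketch using standard tools; the removal-status command may vary slightly by AOS version, and you should validate the disk ID and serial before acting) list the devices and mounts and report the status of the removal started above:

      nutanix@cvm$ lsscsi
      nutanix@cvm$ df -h | grep -i stargate-storage
      nutanix@cvm$ ncli disk get-rm-status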

    2. Failed disk and/or missing disk

      Check whether the disk is visible to the controller, since the controller is the device on whose bus the disk resides. The following commands can be used:

      1. lspci - displays the PCI devices seen by the CVM.
        • NVME device - Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01).
        • SAS3008 controller - Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) - LSI.
        • SAS2308 controller (Dell) - Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05).
        • MegaRaid LSI 3108 (Dell) - RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02).
        • LSI SAS3108 (UCS) - Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3108 PCI-Express Fusion-MPT SAS-3 (rev 02).
      2. lsiutil - displays the HBA (Host Bus Adapter) card's perspective of the ports and whether the ports are in an UP state. If a port is not up, either the device did not respond, or the port or connection to the device is bad. The most likely culprit is the device (disk).
        nutanix@cvm$ sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -a 12,0,0 20
      3. lsscsi - lists the SCSI bus devices seen, which include any HDD or SSD (except NVMe, which does not pass through the SAS/SATA controller), as shown below.
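        For example (standard lsscsi usage), list the devices and count them to confirm that every expected HDD/SSD is present:
        nutanix@cvm$ lsscsi
        nutanix@cvm$ lsscsi | grep -v DVD-ROM | wc -l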
      4. sas3ircu - reports slot position and disk state. It is useful for missing disks or verifying that disks are in the correct slot. (Do NOT run the following command on Lenovo HX hardware as it may lead to HBA lockups and resets)  
        nutanix@cvm$ sudo /home/nutanix/cluster/lib/lsi-sas/sas3ircu 0 display
      5. storcli - reports drive errors similar to lsiutil. Also reports slot position and disk state.
        nutanix@cvm$ sudo ~/cluster/lib/storcli/storcli64 /call/pall show phyerrorcounters | tail -n+6   - Show phy error counts in concise output
        nutanix@cvm$ sudo ~/cluster/lib/storcli/storcli64 /call/pall show | tail -n+6   - Show detected speeds and interfaces
        nutanix@cvm$ sudo ~/cluster/lib/storcli/storcli64 /call show all   - Show everything
        
      6. Check the CVM's dmesg for LSI mpt3sas messages. We should typically see one entry for each physical slot. (The example below shows SAS address "0x5000c5007286a3f5" being repeatedly checked due to a bad/failed disk. Note how the other addresses are detected once, while the suspect address is repeatedly polled.)
        nutanix@cvm$ sudo dmesg | grep "detecting\: handle"
        [ 3.693032] mpt3sas_cm0: detecting: handle(0x0009), sas_address(0x5000c40074c6d56d), phy(0)
        [ 3.702423] mpt3sas_cm0: detecting: handle(0x000a), sas_address(0x4431221107000000), phy(7)
        [ 3.941624] mpt3sas_cm0: detecting: handle(0x000b), sas_address(0x4431221106000000), phy(6)
        [ 4.191170] mpt3sas_cm0: detecting: handle(0x000c), sas_address(0x5000c500856f9e51), phy(1)
        [ 4.211879] mpt3sas_cm0: detecting: handle(0x000d), sas_address(0x5000c5006286a3f5), phy(2)
        [ 4.213080] mpt3sas_cm0: detecting: handle(0x000e), sas_address(0x5000c500856fa075), phy(3)
        [ 4.231194] mpt3sas_cm0: detecting: handle(0x000f), sas_address(0x5000c500856f9735), phy(4)
        [ 4.245974] mpt3sas_cm0: detecting: handle(0x0010), sas_address(0x5000c50084e02b31), phy(5)
        [ 4.942347] mpt3sas_cm0: detecting: handle(0x000a), sas_address(0x4431221107000000), phy(7)
        [ 5.214032] mpt3sas_cm0: detecting: handle(0x000d), sas_address(0x5000c5007286a3f5), phy(2)
        [ 6.215092] mpt3sas_cm0: detecting: handle(0x000d), sas_address(0x5000c5007286a3f5), phy(2)
        .
        .
        [ 12.233236] mpt3sas_cm0: detecting: handle(0x000d), sas_address(0x5000c5007286a3f5), phy(2)
        
      7. smartctl - reports drive health. If Hades has to check a disk with smartctl 3 times within an hour, the disk is automatically marked as failed.
        nutanix@cvm$ sudo smartctl -x /dev/sdX -T permissive
        • See KB-8094 for troubleshooting with smartctl.
      8. Check for offline disks using NCC check disk_online_check.
        nutanix@cvm$ ncc health_checks hardware_checks disk_checks disk_online_check
        • See KB 1536 for further troubleshooting offline disks.
      9. Confirm whether disks are seen from the LSI Config Utility. This can be useful for ruling out potential driver or CVM/hypervisor configuration issues that could prevent certain drives from being detected. The LSI Config Utility gives you an interface directly to the HBA firmware without relying on a software operating system. It can be used to do many of the same things that you can do with lsiutil: (a) check if a disk is detected in a particular slot, (b) check a disk's link speed, (c) activate an LED beacon on a particular drive. On G6 and G7 platforms, the LSI Config Menu is disabled by default, so you have to enable it in the BIOS before you can use it. On G8 platforms, you must view the attached drives directly through the BIOS menu.
        • G8: View attached drives directly through the BIOS
          • Enter the BIOS Menu by hitting the DEL key at the "Nutanix" splash screen while the node is booting-up.
          • Go to the "Advanced" Tab and select "SCC-B8SB80-B1 (PCISlot=0x8) Configuration". This is what the menu option is called on 3060-G8. It may be named slightly differently on other models.
          • If the "Device Properties" option is greyed out, select "Refresh Topology".
          • Select "Drive Properties" to see a list of the SATA drives visible to the host.
  • G6 & G7: How to enable and access LSI HBA OPROM
    • Enter the BIOS Menu by hitting the DEL key at the "Nutanix" splash screen while the node is booting-up.
    • Go to "Advanced" tab and find "LSI HBA OPROM". Set this to "Enabled". Then hit "F4" to "Save & Exit" the BIOS menu. This will cause the node to reboot.
    • Note: After you have obtained the information you need, make sure to go back into the BIOS and DISABLE the OPROM. You can also press F3 to Load Optimized Defaults, which will bring the BIOS back to it's original factory settings where the OPROM is disabled.
      4
  • On the next boot-up, look for the screen titled "Avago Technologies MPT SAS3 BIOS" and hit CRTL+C to enter the "SAS Configuration Utility".
    5
  • Once inside the Config Utility, select the HBA card you are interested in. Multi-node models (2U4N, 2U2N) will only have a maximum of one HBA card, while single-node platforms (2U1N) may have as many as three. In multi-HBA systems, each HBA will be serving a different subset of drives on each node. 
    6
  • On the next screen, select "SAS Topology" and then "Direct Attach Devices" to see information about the drives associated with that HBA.
    7

    8

    9

    10

    11

    12
  • If the HBA you selected does not detect any drives at all, it will report "No devices to display."
    13
      10. There can be a case where the disk is DOWN in lsiutil, usually after a replacement or an upgrade of the disks. When all the above checks have been carried out and the disk is still not visible, compare the old and new disk caddies or trays and ensure the type is the same. There can be cases where an incorrect disk type is dispatched and it does not seat properly in the disk bay, hence not being detected by the controller.
     
  3. Identify the node type or the problematic node.
    Run ncli host ls and find the matching node ID. The node slot location, node serial, and node type are important information to document in case of recurring issues. This also helps track field issues with HBAs, node locations, and node types. See the example below.
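    For example (standard ncli usage), listing the hosts shows the node IDs and serial information referenced above:
      nutanix@cvm$ ncli host ls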
     
  4. Identify the failure occurrence.
    1. Check the Stargate log. The stargate.INFO log for the corresponding period indicates whether Stargate saw an issue with a disk and sent it to the Disk Manager (Hades) to be checked, or had other errors accessing the disk. Grep for the disk ID number and serial number in the Stargate log on the node the disk belongs to (example greps are shown after this list).
    2. Check the Hades log. It contains information about the disks Hades sees and the health of those disks. Hades also checks which disk is the metadata or Curator disk and selects one if it did not already exist in the system or was removed/disappeared from the system.
    3. Check the df -h history in /home/nutanix/data/logs/sysstats/df.INFO to see when the disk was last seen as mounted.
    4. Check /home/nutanix/data/logs/sysstats/iostat.INFO to see when the device was last seen.
    5. Check /var/log/messages for errors on the device, specifically using the device name, for example, sda or sdc.
    6. Check dmesg for errors on the controller or device. Run dmesg | less for the current messages in the ring, or look at the logged dmesg output in /var/log.
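
    The following greps are one possible way to search these logs (standard CVM log locations as referenced above; the Hades log file name, shown here as hades.out, may vary by AOS version; replace the serial and device placeholders with your own values):
      nutanix@cvm$ grep -i "<disk_serial_or_id>" /home/nutanix/data/logs/stargate.INFO
      nutanix@cvm$ grep -i "<disk_serial_or_id>" /home/nutanix/data/logs/hades.out
      nutanix@cvm$ grep -i "sdX" /home/nutanix/data/logs/sysstats/iostat.INFO | tail
      nutanix@cvm$ sudo grep -i "sdX" /var/log/messages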
  5. Identify the cause of the disk failure.
    • If the disk's last-usage data is not available, check when the CVM was last started. Again, reference the Stargate and Hades logs.
    • Check the Stargate log around the time of the disk failure. Stargate sends a disk to Hades to be checked if the disk does not respond within a given time and ops time out against that disk. Different errors and versions represent this differently, so always search by disk ID and disk serial.
  6. Check the count of disk failures.
    If a drive has failed more than once in the same slot even after the disk was replaced, this indicates a potential chassis issue.

  7. Check if lsiutil is showing errors.
    If lsiutil shows errors evenly on multiple slots, it can indicate a bad controller.

  8. Check for known issues with the drive firmware (FW) related to the disk errors.

  9. If this is a G8, verify that the MCU version is 1.1A or higher and that the backplanes were upgraded as well:
    Reference this document: NX-G8: Nutanix Backplane CPLD, Motherboard CPLD, and Multinode EC firmware manual upgrade guide.

  10. If this is a G8, check that the LSI controller FW is 25.00.00 or higher:
    There are fixes related to SSD stability when TRIM is in use, correcting an issue that caused PHY errors and instability to be seen on drives. From a troubleshooting standpoint, it is also important to be on FW 25.00.00 or higher. One possible way to check the firmware version is shown below.
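
    For example, the controller firmware version can be read with storcli (using the path shown earlier in this article; field names in the output may vary by controller model):
      nutanix@cvm$ sudo ~/cluster/lib/storcli/storcli64 /call show | grep -i version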

Note: SMART attribute 191 (G-Sense_Error_Rate) in smartctl output for Seagate HDDs can be safely ignored unless there is performance degradation. The G-Sense_Error_Rate value only indicates the HDD adapting to shock or vibration detection. Seagate recommends not trusting these values, as this counter dynamically changes its threshold during runtime.

Document ID: HT516504
Original Publish Date: 05/16/2024
Last Modified Date: 05/28/2024