ESXi node PSOD or multiple drives missing when using lsi-msgpt35 driver version equal or prior to 18.00.01.00 - Lenovo ThinkSystem
Symptom
An ESXi server, or a node that is part of a vSAN cluster, may report a PSOD with the message "... disk name: naa.5000xxxxxxxxx detected suspended I/Os..." or may report that all drives attached to an HBA are missing.
There is a small possibility of the issue occurring once in every 49 days of up-time when using lsi-msgpt35 driver version 18.00.01.00 (or prior).
(where PSOD = Purple Screen of Death, HBA = Host Bus Adapter)
Affected Configurations
The system may be any of the following Lenovo servers:
- ThinkAgile VX 4-Socket 4U Certified Node, Type 7Z12, any model
- ThinkAgile VX Series VX2330/VX3330/VX3331/VX7330-N, Type 7Z62, any model
- ThinkAgile VX Series VX3530-G/VX5530/VX7530/VX7531, Type 7Z63, any model
- ThinkAgile VX2320, Type 7Y13, any model
- ThinkAgile VX2320, Type 7Y93, any model
- ThinkAgile VX3320, Type 7Y13, any model
- ThinkAgile VX3320, Type 7Y93, any model
- ThinkAgile VX3520-G, Type 7Y14, any model
- ThinkAgile VX3520-G, Type 7Y94, any model
- ThinkAgile VX3720, Type 7Y12, any model
- ThinkAgile VX3720, Type 7Y92, any model
- ThinkAgile VX5520, Type 7Y14, any model
- ThinkAgile VX5520, Type 7Y94, any model
- ThinkAgile VX7520, Type 7Y14, any model
- ThinkAgile VX7520-N, Type 7Y94, any model
- ThinkAgile VX7820 Appliance, Type 7Z13, any model, any CTO1WW
- ThinkSystem SD530, Type 7X21, any model
- ThinkSystem SD630 v2, Type 7D1K, any model
- ThinkSystem SD650 Dual Node WCT Tray, Type 7X58, any model
- ThinkSystem SD650 v2, Type 7D1M, any model
- ThinkSystem SD650-N V2, Type 7D1N, any model
- ThinkSystem SR150, Type 7Y54, any model
- ThinkSystem SR158, Type 7Y55, any model
- ThinkSystem SR250, Type 7Y51, any model
- ThinkSystem SR250, Type 7Y52, any model
- ThinkSystem SR250, Type 7Y72, any model
- ThinkSystem SR250, Type 7Y73, any model
- ThinkSystem SR258, Type 7Y53, any model
- ThinkSystem SR530, Type 7X07, any model
- ThinkSystem SR530, Type 7X08, any model
- ThinkSystem SR550, Type 7X03, any model
- ThinkSystem SR550, Type 7X04, any model
- ThinkSystem SR570, Type 7Y02, any model
- ThinkSystem SR570, Type 7Y03, any model
- ThinkSystem SR590, Type 7X98, any model
- ThinkSystem SR590, Type 7X99, any model
- ThinkSystem SR630 V2, Type 7Z70/7Z71, any model
- ThinkSystem SR630, Type 7X01, any model
- ThinkSystem SR630, Type 7X02, any model
- ThinkSystem SR650 V2, Type 7Z72/7Z73, any model
- ThinkSystem SR650, Type 7X05, any model
- ThinkSystem SR650, Type 7X06, any model
- ThinkSystem SR670 V2, Type 7Z22/7Z23/7D47, any model, any 19A/MLK
- ThinkSystem SR670, Type 7Y36, any model
- ThinkSystem SR670, Type 7Y37, any model
- ThinkSystem ST250, Type 7Y45, any model
- ThinkSystem ST250, Type 7Y46, any model
- ThinkSystem ST258, Type 7Y47, any model
- ThinkSystem ST550, Type 7X09, any model
- ThinkSystem ST550, Type 7X10, any model
- ThinkSystem ST558, Type 7Y15, any model
- ThinkSystem ST558, Type 7Y16, any model
- ThinkSystem ST650 V2, Type 7Z74/7Z75, any model
- ThinkSystem ST658 V2, Type 7Z76, any model
The system is configured with one or more of the following Lenovo Options:
- 430-16e HBA, Option 7Y37A01091, any FRU
- 430-8e HBA, Option 7Y37A01090, any FRU
- RAID 430-16i, Option 7Y37A01089, any FRU
- RAID 430-8i, Option 7Y37A01088, any FRU
- ThinkSystem 440-16e SAS/SATA PCIe Gen4 12Gb HBA, Option SR17A32420, any model
- ThinkSystem 440-16i SAS/SATA PCIe Gen4 12Gb Internal HBA, any model
This tip is not software specific.
The system has the symptom described above.
Solution
This behavior is corrected in ThinkSystem series server SAS HBA VMware driver version 18.00.02.00.
The file is available by selecting the appropriate Product Group, type of System, Product name, Product machine type, and Operating system on the Lenovo Support web page, at the following URL:
http://datacentersupport.lenovo.com/
Workaround
Power cycle the ESXi system before 49 days of up-time.
If the system has encountered this issue, a reboot will resume normal server operation without any functional side effects. Alternatively, force a storage re-discovery.
Additional Information
On each drive call, the sum of up-time and the I/O wait period is stored in a variable which can hold a value up to 2^32-1.
This value is in milliseconds, and 2^32-1 = 4294967295 milliseconds = 49 days 17 hours 2 minutes 47.295 seconds.
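The conversion above can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic; the function name split_ms is hypothetical and is not part of any Lenovo or VMware tool.

```python
# Largest value a 32-bit unsigned variable can hold, interpreted as milliseconds.
MAX_MS = 2**32 - 1  # 4294967295

def split_ms(ms):
    """Break a millisecond count into (days, hours, minutes, seconds, milliseconds)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds, millis

days, hours, minutes, seconds, millis = split_ms(MAX_MS)
print(f"{MAX_MS} ms = {days} days {hours} hours {minutes} min {seconds}.{millis:03d} seconds")
```

Integer division is used throughout so that the result is not affected by floating-point rounding.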
When up-time + I/O wait exceeds 2^32-1, the HBA will lose communication with all drives.
When up-time reaches 2^32-1, it resets back to 0.
As a result, there is a window of a few milliseconds in which the issue can occur, once every 49 days 17 hours 2 minutes 47.295 seconds of up-time.
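The overflow condition described above can be modeled as follows. This is a simplified sketch, not the driver's actual code: it assumes the up-time counter wraps modulo 2^32, and the I/O wait value of 5 milliseconds in the usage lines is a hypothetical figure chosen only for illustration. The function names are likewise hypothetical.

```python
COUNTER_MAX = 2**32 - 1  # largest value the 32-bit millisecond variable can hold

def sum_overflows(uptime_ms, io_wait_ms):
    """Return True if up-time + I/O wait would exceed the 32-bit range."""
    counter = uptime_ms & COUNTER_MAX  # assumption: counter wraps modulo 2^32
    return counter + io_wait_ms > COUNTER_MAX

def ms_until_window(uptime_ms, io_wait_ms):
    """Milliseconds until the next risk window opens (0 if already inside it)."""
    counter = uptime_ms & COUNTER_MAX
    window_start = COUNTER_MAX - io_wait_ms + 1  # first counter value that overflows
    return max(0, window_start - counter)

# Hypothetical 5 ms I/O wait: only the last few milliseconds before the
# counter wraps are at risk.
print(sum_overflows(1_000_000, 5))        # far from the limit
print(sum_overflows(COUNTER_MAX - 2, 5))  # inside the window
print(ms_until_window(1_000_000, 5))
```

Under these assumptions the risk window is only as wide as the I/O wait period, which is consistent with the low likelihood of hitting the issue in any given 49-day cycle.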
lsi-msgpt35 driver version 18.00.02.00 has been signed for ESXi and is available on the Lenovo support site. vSAN certification and inclusion in the Lenovo Custom Image (CI) will be available in December 2021.
lsi-msgpt35 driver versions 15.xx to 18.00.01.00 are affected by this issue. The issue is fixed in version 18.00.02.00. Versions prior to 15.x are not affected.
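The affected version range can be expressed as a simple check, for example when auditing an inventory of hosts. This is a minimal sketch: it assumes the installed driver version has already been obtained as a dotted numeric string such as "18.00.01.00", and that any build suffix following a hyphen can be ignored. The helper names parse_version and is_affected are hypothetical.

```python
FIRST_AFFECTED = (15,)           # versions 15.xx and later are affected...
LAST_AFFECTED = (18, 0, 1, 0)    # ...up to and including 18.00.01.00

def parse_version(text):
    """Convert a string such as '18.00.01.00' into a tuple of integers."""
    core = text.split("-", 1)[0]  # assumption: drop any suffix after a hyphen
    return tuple(int(part) for part in core.split("."))

def is_affected(version_text):
    """Return True if the lsi-msgpt35 version falls in the affected range."""
    return FIRST_AFFECTED <= parse_version(version_text) <= LAST_AFFECTED

for version in ("14.15.00.00", "17.00.12.00", "18.00.01.00", "18.00.02.00"):
    print(version, "affected" if is_affected(version) else "not affected")
```

Comparing tuples of integers rather than raw strings avoids mistakes caused by leading zeros or fields of differing width.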