Considerations when using ThinkSystem SD650, SD650 V2, SD650 V3 and ConnectX-6 HDR, ConnectX-7 NDR SharedIO - Lenovo ThinkSystem and Lenovo Server

Symptom

The Mellanox ConnectX-6 HDR / NVIDIA ConnectX-7 NDR adapter implements SharedIO, also known as NVIDIA/Mellanox Multi-Host technology. With SharedIO, an NVIDIA/Mellanox Virtual Protocol Interconnect® (VPI) adapter is installed in a slot in one ThinkSystem SD650, SD650 V2, or SD650 V3 server, and an auxiliary adapter is installed in a slot in the adjacent server in the same tray. A cable connects the two adapters. The two servers then share the network connection of the VPI adapter, which significantly reduces both the cost of the adapters and the cost of switch ports.

Certain considerations apply when working on a ThinkSystem SD650, SD650 V2, or SD650 V3 server with a Shared I/O HDR/NDR adapter installed.

For more information on the NVIDIA/Mellanox ConnectX-6 HDR and ConnectX-7 NDR adapters and the ThinkSystem SD650, SD650 V2, or SD650 V3 server, visit the following URLs:

Affected Configurations

The system may be any of the following Lenovo servers:

  • Lenovo Client Site Integration Kit, machine type 7X74, any model
  • Lenovo NeXtScale n1200 DWC Enclosure, Type 5468, any model
  • Lenovo Scalable Infrastructure (LeSI) Cluster, type 1410, any model DSS
  • ThinkSystem DW612/DW612S DWC Enclosure, Type 7D1L, any model
  • ThinkSystem SD650 Dual Node WCT Tray, Type 7X58, any model
  • ThinkSystem SD650 v2, Type 7D1M, any model
  • ThinkSystem SD650 v3, Type 7D7M, any model

The system is configured with one or more of the following Lenovo Options:

  • ThinkSystem Mellanox ConnectX-6 HDR/200GbE QSFP56 1-port PCIe VPI Adapter (SharedIO) WCT, Option 4C57A14925, any model
  • ThinkSystem Mellanox HDR/200GbE 2x PCIe Aux Kit, Option 4C57A14179, any model
  • ThinkSystem Mellanox ConnectX-6 HDR/200GbE QSFP56 1-Port PCIe 4 VPI Adapter (SharedIO) DWC, 4XC7A86672, any model
  • ThinkSystem NVIDIA ConnectX-7 NDR OSFP400 1-port PCIe Gen5 x16 InfiniBand Adapter (SharedIO) DWC, 4XC7A86670, any model
  • ThinkSystem NVIDIA ConnectX-7 NDR200/HDR QSFP112 2-port PCIe Gen5 x16 InfiniBand Adapter (SharedIO) DWC, 4XC7A86669, any model

This tip is not software specific.

The system has the symptom described above.

Workaround

Not applicable.

Additional Information

Power up

When powering up nodes with Shared I/O adapters from an A/C down state or after a virtual reseat, the Primary node must be powered on before the Auxiliary node. It is recommended to wait until the Primary node completes POST before powering up the Auxiliary node or, ideally, until the Primary node has finished booting to the operating system. If the Auxiliary node is powered up too early, it is not granted power permission and therefore does not boot. The System Event Log (SEL) for the Auxiliary node will also report one of the following events:

Module/Board - SharedIO fail Asserted

Sensor Aux/Pri SharedIO has transitioned to critical from a less severe state.
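
The power-up ordering described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a Lenovo tool: `power_on` and `is_ready` are hypothetical callables supplied by the caller (for example, wrappers around whatever management interface is in use), and the poll interval and timeout values are illustrative only.

```python
import time
from typing import Callable, List


def power_up_shared_io_pair(
    power_on: Callable[[str], None],   # hypothetical: powers on the named node
    is_ready: Callable[[str], bool],   # hypothetical: True once POST/OS boot is complete
    primary: str,
    auxiliary: str,
    poll_seconds: float = 10.0,        # assumed value, tune for your environment
    timeout_seconds: float = 900.0,    # assumed value, tune for your environment
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Power on the Primary node, wait until it is ready, then power on the Auxiliary node.

    Returns the node names in the order they were powered on.
    """
    order: List[str] = []

    # The Primary node must always be powered on first.
    power_on(primary)
    order.append(primary)

    # Do not touch the Auxiliary node until the Primary node reports ready.
    waited = 0.0
    while not is_ready(primary):
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"{primary} not ready after {timeout_seconds} s; "
                f"{auxiliary} was not powered on"
            )
        sleep(poll_seconds)
        waited += poll_seconds

    power_on(auxiliary)
    order.append(auxiliary)
    return order
```

If the Primary node never becomes ready, the sketch raises an error instead of powering on the Auxiliary node, which avoids the denied power permission case described above.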

Power down or Rebooting

When powering down or rebooting nodes with Shared I/O adapters, the Auxiliary node should always be powered down before the Primary node, because the Aux adapter cannot operate unless the Primary node adapter has power. There is no mechanism in place to prevent the Primary node from powering down while the Auxiliary node is still powered up, so it is important to pay close attention to the order in which the nodes are powered off. Failure to power down the Auxiliary node first will result in a fault reported in the System Event Log (SEL) on the Auxiliary node or, in some cases, a software NMI once the Aux adapter loses power and is no longer visible:

Slot/Connector - PCIe 1 - Fault - PCIe 1

Critical interrupt - NMI State - Software NMI
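
Because nothing enforces the power-down order, a script that shuts down several trays can compute the order up front. The following is a minimal sketch: the function name and the `(primary, auxiliary)` pair layout are assumptions made for illustration, not part of any Lenovo interface.

```python
from typing import List, Tuple


def power_down_order(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return node names in a safe power-down order.

    `pairs` holds one (primary, auxiliary) tuple per Shared I/O tray.
    All Auxiliary nodes are listed before any Primary node, so no
    Aux adapter loses its Primary while its node is still powered up.
    """
    primaries = [primary for primary, _aux in pairs]
    auxiliaries = [aux for _primary, aux in pairs]

    # A node cannot hold both roles; reject inconsistent input early.
    overlap = set(primaries) & set(auxiliaries)
    if overlap:
        raise ValueError(f"nodes listed in both roles: {sorted(overlap)}")

    return auxiliaries + primaries
```

For example, `power_down_order([("n1", "n2"), ("n3", "n4")])` lists both Auxiliary nodes (`n2`, `n4`) ahead of both Primary nodes (`n1`, `n3`).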

Other considerations

When installing the Shared I/O adapters, the primary adapter should be installed on the right side of the chassis, with the auxiliary adapter on the left side.

To update firmware on the Shared I/O adapter, first power down the Auxiliary node, then apply the firmware to the primary adapter. Once the firmware has been applied, power down the Primary node and power it back up. Once the operating system has booted on the Primary node, power up the Auxiliary node.
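
The firmware update sequence above can be written down as an ordered plan that a script or checklist can follow. This is only a sketch: the action labels (`power_off`, `apply_firmware`, and so on) are hypothetical names chosen for illustration and do not correspond to any real tool's commands.

```python
from typing import List, Tuple


def firmware_update_plan(primary: str, auxiliary: str) -> List[Tuple[str, str]]:
    """Return the ordered (action, node) steps for a Shared I/O firmware update.

    Action labels are illustrative placeholders, not real commands.
    """
    return [
        ("power_off", auxiliary),     # Auxiliary node goes down first
        ("apply_firmware", primary),  # update is applied to the primary adapter
        ("power_off", primary),       # power-cycle the Primary node...
        ("power_on", primary),        # ...so the new firmware takes effect
        ("wait_for_os", primary),     # wait until the OS has booted
        ("power_on", auxiliary),      # only then bring the Auxiliary node back
    ]
```

Keeping the plan as data makes it easy to print for an operator or to feed into whatever automation is already in use.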

If at any point a PCI bus fault or Software NMI has been generated in the System Event Log because of an incorrect power off sequence, a virtual reseat can be done to clear the event.
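
The SEL events listed in this tip can be matched against an exported log to hint at which sequencing rule was broken. The sketch below assumes the SEL has been exported as plain text with one event per line; the actual export format may differ, and the matching strings are taken only from the events quoted in this tip. A PCIe fault or NMI can also have unrelated causes, so treat the result as a hint rather than a diagnosis.

```python
from typing import Iterable, List, Tuple

POWER_UP_ORDER = "Auxiliary node powered up before the Primary node was ready"
POWER_DOWN_ORDER = "Primary node powered down before the Auxiliary node"

# Lower-cased fragments of the event texts quoted in this tip.
KNOWN_EVENTS = [
    ("sharedio fail asserted", POWER_UP_ORDER),
    ("sharedio has transitioned to critical", POWER_UP_ORDER),
    ("pcie 1 - fault", POWER_DOWN_ORDER),
    ("software nmi", POWER_DOWN_ORDER),
]


def classify_sel_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (event line, possible cause) for each line matching a known event."""
    findings: List[Tuple[str, str]] = []
    for line in lines:
        lowered = line.lower()
        for fragment, cause in KNOWN_EVENTS:
            if fragment in lowered:
                findings.append((line.strip(), cause))
                break  # one cause per line is enough
    return findings
```

Lines that match none of the known fragments are ignored, so the function can be run over a full log export.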

Alias Id:102262
Document ID:HT510888
Original Publish Date:07/28/2020
Last Modified Date:01/18/2024