
How Upgrades Work at Nutanix

Description

Nutanix upgrades are designed to complete without downtime for User VMs and their workloads. This document introduces how each type of upgrade works and shares some useful best practices for administrators. You will find similar information in the Acropolis Upgrade Guide (always choose the guide that matches the AOS version currently running on your cluster).

Versions affected: All Versions, All Nutanix Files Versions, All LCM Versions, All AOS Versions, All AHV Versions
The following is true for ALL Nutanix upgrades:

Is downtime required?

No. User VMs may live migrate between hosts depending on the type of upgrade performed, but there should be no impact to their services. Users should maintain access to their VMs and be able to work as normal during the upgrade. VMs that cannot live migrate, such as those with vGPUs or Affinity Rules, must be powered down or have these settings removed before any upgrade that requires a host reboot. Otherwise, the upgrade will become stuck while evacuating User VMs.
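
As a minimal sketch of that pre-upgrade review, the snippet below flags VMs that would block host evacuation. The `VM` record and its field names are hypothetical and only illustrate the rule described above; they do not correspond to any Nutanix API or export format.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    has_vgpu: bool = False
    has_affinity_rule: bool = False
    powered_on: bool = True


def vms_blocking_evacuation(vms):
    """Return names of powered-on VMs that cannot live migrate."""
    return [
        vm.name
        for vm in vms
        if vm.powered_on and (vm.has_vgpu or vm.has_affinity_rule)
    ]


# Made-up sample inventory.
inventory = [
    VM("web-01"),
    VM("cad-01", has_vgpu=True),
    VM("db-01", has_affinity_rule=True),
    VM("cad-02", has_vgpu=True, powered_on=False),  # already powered down
]
print(vms_blocking_evacuation(inventory))
```

Any VM the function returns would need to be powered down, or have the blocking setting removed, before a host-reboot upgrade is started.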

Is there a performance impact?

Nutanix recommends performing upgrades during your scheduled maintenance window or outside your normal business hours; otherwise, users might experience latency during the upgrade process. This latency may be especially noticeable on clusters that use only 1 Gbps network uplinks, because of the limited bandwidth available in this configuration.

What is the recommended Upgrade Order?

The Recommended Upgrade Order section of the Acropolis Upgrade Guide describes the order in which upgrades should be performed.

What happens if the upgrade becomes stuck?

  • If you encounter a failure during pre-upgrade checks, review the article cited in Prism for information on how to resolve the issue. Once the issue is resolved, click the link to go back to the available versions in Prism and then try the upgrade again.
  • If the upgrade itself is stuck, contact Nutanix Support for assistance. Do not attempt to intervene, as this can result in an outage.
  • Nutanix does not support roll-back for software upgrades.

How do I know if a version is compatible?

  • If a version appears in the Upgrade Software or Life Cycle Manager (LCM) sections of Prism, it has already been confirmed as compatible with the cluster in its current state, and you can upgrade to it at any time.
  • If you do not see the version you want listed, it could be for several reasons. Newer releases take some time before they are made available for One-Click Download, but you can still get the binaries and metadata files directly from the Nutanix Portal and then upload them to Prism manually. It also may be the case that you need to go to an intermediary version (multi-step upgrade) in order to first bring the cluster to a version which is able to upgrade to your desired version.
  • The Upgrade Paths page on the Nutanix Portal will show you what versions of AOS, Prism Central (PC), or Nutanix Files your cluster can be brought onto right now based on what you are currently running. If you need to go to a later version than what is shown in the Upgrade Paths page, start by upgrading the cluster to the latest possible version first. Once that is done you should be able to reach the version you want on your next attempt. To save time, remember that a cluster running AOS on a Long-Term Support (LTS) release branch (such as 5.5.x) can always upgrade directly to the next available LTS release branch (such as 5.10.x).
  • To see whether given versions of AOS, Prism Central, and Nutanix Files are compatible with each other, check the Software Product Interoperability page.
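
The multi-step upgrade idea above can be sketched as a shortest-path search over a table of supported hops. The table below is entirely made up for illustration (the version numbers are placeholders, not real support statements); real data must come from the Upgrade Paths page on the Nutanix Portal.

```python
from collections import deque

# Hypothetical table: current version -> versions reachable in one upgrade.
# These entries are invented for the example and are NOT real upgrade paths.
UPGRADE_PATHS = {
    "5.5.9": ["5.6.2", "5.10.11"],
    "5.6.2": ["5.10.11"],
    "5.10.11": ["5.15.7"],
    "5.15.7": [],
}


def plan_upgrade(current, target, paths):
    """Breadth-first search for the shortest chain of supported hops."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        route = queue.popleft()
        if route[-1] == target:
            return route
        for nxt in paths.get(route[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    return None  # no supported route found in the table


print(plan_upgrade("5.5.9", "5.15.7", UPGRADE_PATHS))
```

With this sample table, the search skips the unnecessary intermediate hop and returns a two-step route; `None` means the table contains no route to the target at all.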

What should I do in advance?

  • It is always a good idea to install and run the latest version of Nutanix Cluster Check (NCC) to make sure your cluster is in the best shape possible before starting an upgrade.
  • To ensure that Prism has access to the software and firmware you wish to choose from, review the port and firewall requirements and verify that your network is configured accordingly. If you are using Prism Central, make sure SSL port 9440 is open in both directions between the Prism Central VM and any registered clusters.
  • If your cluster is registered to Prism Central, make sure that Prism Central is brought up to date before upgrading AOS on the Prism Element cluster. Prism Central is designed to manage Prism Element clusters that are on the same major build or an earlier version. For example, Prism Central 5.10.0.1 is supported to manage Prism Element clusters running 5.10.0.2, since the major build (5.10.0) is the same. However, PC 5.10.0.1 is not supported to manage PE clusters running 5.10.1, since this PE version is a later major build. In such cases, the PC should be upgraded to 5.10.1 or a later build to bring it back into compatibility with the clusters it manages.
  • Check the Upgrade Paths and Compatibility Matrix pages on the Nutanix Portal to make sure the new software is compatible. The Compatibility Matrix also contains guidance on software compatibility with Nutanix Ready Partner Solutions and AHV Guest Operating Systems.
  • Read the Release Notes on the Support Portal to get information on known issues in the release as well as what bug fixes, improvements, or features come with it.
  • If you are using a third-party hypervisor or application, check the vendor's website to make sure it is compatible with the desired version of AOS.
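
The Prism Central rule in the list above can be expressed as a small comparison. This is only a sketch of the rule as worded in this document, assuming that "major build" means the first three numeric components of the version string; the function names are hypothetical, and the Software Product Interoperability information remains the authoritative source.

```python
def major_build(version):
    """First three numeric components, e.g. '5.10.0.1' -> (5, 10, 0)."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '5.10'
    return tuple(parts[:3])


def pc_can_manage(pc_version, pe_version):
    """True if the PE major build is the same as or earlier than the PC's."""
    return major_build(pe_version) <= major_build(pc_version)


print(pc_can_manage("5.10.0.1", "5.10.0.2"))  # same major build (5.10.0)
print(pc_can_manage("5.10.0.1", "5.10.1"))    # PE is on a later major build
```

The two calls mirror the two examples given above: the first is supported and the second is not.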

Solution

Below you will find a summary of the prerequisites for each type of upgrade, what happens on the backend, and how long the operation can be expected to take.

AOS Software

Upgrade Prerequisites

What happens when I click Upgrade Now?

  • First, the pre-upgrade checks will run to make sure that the cluster is able to be upgraded. If any of the pre-upgrade checks fail, you will see information about this in Prism and the actual AOS upgrade will not start. Users will have to click Back to Versions and start the upgrade again after the issue reported by the pre-checks is resolved. To see the full list of pre-checks and their related articles, check out KB 6524.
  • Next, the AOS software is copied to each CVM (Controller VM) in the cluster.
  • In the last stage, the Controller VMs in the cluster reboot one-at-a-time onto the new AOS version. Storage traffic from User VMs will be redirected to a neighboring CVM while the local one is upgrading. During this short period (about 10 minutes) the local User VMs may experience a small amount of additional latency since they are receiving their storage I/O from a remote CVM.
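
The stages above can be sketched as an ordered event list: the software is staged on every CVM first, then the CVMs are handled strictly one at a time. The event names, and the choice of the next CVM in the list as the "neighboring" CVM, are assumptions made for illustration; the actual selection logic is not described in this document.

```python
def rolling_cvm_upgrade(cvms, new_version):
    """Return the ordered events of a one-at-a-time CVM upgrade (sketch)."""
    if len(cvms) < 2:
        # Assumption of this sketch: a second CVM must exist to serve I/O.
        raise ValueError("need at least two CVMs to redirect storage I/O")
    # Stage 2 in the text: copy the software to each CVM.
    events = [("stage", cvm, new_version) for cvm in cvms]
    # Stage 3 in the text: reboot CVMs one at a time.
    for i, cvm in enumerate(cvms):
        neighbor = cvms[(i + 1) % len(cvms)]  # illustrative choice only
        events.append(("redirect_io", cvm, neighbor))
        events.append(("reboot", cvm, new_version))
    return events


for event in rolling_cvm_upgrade(["cvm-a", "cvm-b", "cvm-c"], "new"):
    print(event)
```

Because each CVM's reboot is appended only after the previous CVM's events, at most one CVM is out of service at any point in the sequence.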

How long does it take?

This may take 15-20 minutes per node. The upgrade process in a two-node cluster will take longer than the usual process because of the additional step of syncing data while transitioning between single and two node state. Nevertheless, the cluster remains operational during the upgrade.

Prism Central Software

Overview and Requirements

What happens when I click Upgrade Now?

  • First, the pre-upgrade checks will run to make sure that the cluster is able to be upgraded. If any of the pre-upgrade checks fail, you will see information about this in Prism and the actual upgrade will not start. Users will have to click Back to Versions and start the upgrade again after the issue reported by the pre-checks is resolved. To see the full list of pre-checks and their related articles, check out KB 6524.
  • If you have a regular Single-VM Prism Central, the new software will be staged and then the PCVM will reboot to come up onto the new version. During this short time the UI will not be available, but there will be no effect to the Prism Element clusters that are managed by Prism Central.
  • If you have a Scale-Out Prism Central (three PCVMs) the software will be copied to each PCVM and then they will reboot one at a time to come up on the new software. The Prism Central services and UI will still be available during the upgrade.
  • After the PCVM boots up from the upgrade, it will take a few minutes for the UI to become available. Log in and make sure that the task for Prism Central Upgrade has completed successfully (100%).

How long does it take?

For Single-VM Prism Central, about 25 minutes.
For Scale-Out Prism Central (three PCVMs), about 1 hour.

Hypervisor Software

What happens when I click Upgrade Now?

  • First, the pre-upgrade checks will run to make sure that the cluster is able to be upgraded. If any of the pre-upgrade checks fail, you will see information about this in Prism and the actual hypervisor upgrade will not start. Users will have to click Back to Versions and start the upgrade again after the issue reported by the pre-checks is resolved. To see the full list of pre-checks and their related articles, check out KB 6524.
  • A host in the cluster is chosen by the upgrade and a task is submitted to migrate User VMs from the host.
  • Once the CVM is the only virtual machine left on the host, it is placed into Maintenance Mode and the new software for the hypervisor is staged.
  • After the new hypervisor version is installed, a reboot of the host is issued.
  • Once the host comes up from the reboot onto the new software version, the host is taken out of Maintenance Mode and the CVM is powered-up.
  • The cluster will wait for the Controller VM and its services to come online before selecting the next host to undergo the upgrade. The hypervisor will balance User VMs across the upgraded node as needed based on its existing configuration.
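
The key safeguard in the steps above is the gate between hosts: the next host is not selected until the current host's CVM and services are back online. A minimal sketch of that gate follows. The callbacks `upgrade_host` and `cvm_services_up` are hypothetical stand-ins supplied by the caller, not Nutanix functions, and a real loop would sleep between polls.

```python
def upgrade_hosts(hosts, upgrade_host, cvm_services_up, max_polls=5):
    """Upgrade hosts one at a time, gating on CVM service health (sketch)."""
    upgraded = []
    for host in hosts:
        upgrade_host(host)  # evacuate, stage, install, reboot (stand-in)
        for _ in range(max_polls):
            if cvm_services_up(host):
                break
        else:
            # Never continue to the next host if this one is not healthy.
            raise RuntimeError(f"{host}: CVM services did not come up")
        upgraded.append(host)
    return upgraded


# Stub callbacks for demonstration only.
log = []
polls = {"host-1": 0, "host-2": 0}


def fake_upgrade(host):
    log.append(f"upgrade {host}")


def fake_services_up(host):
    polls[host] += 1
    return polls[host] >= 2  # report ready on the second poll


print(upgrade_hosts(["host-1", "host-2"], fake_upgrade, fake_services_up))
```

Raising instead of continuing reflects the guidance earlier in this document: if an upgrade stalls, stop and contact Nutanix Support instead of pushing on to the next host.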

How long does it take?

This depends on how long it takes to evacuate User VMs from each host before it goes down for upgrade. A good estimate is about 30-45 minutes per node.
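
For planning a maintenance window, the per-node figures quoted in this document for AOS (15-20 minutes) and hypervisor (30-45 minutes) upgrades can be multiplied by the node count, since nodes are upgraded one at a time. The sketch below does only that arithmetic; it ignores pre-checks, software staging, and the extra sync time on two-node clusters, so treat the result as a rough lower bound.

```python
# Per-node estimates in minutes, taken from the figures in this document.
PER_NODE_MINUTES = {
    "aos": (15, 20),
    "hypervisor": (30, 45),
}


def estimate_window(upgrade_type, node_count):
    """Return (low, high) total minutes for a sequential rolling upgrade."""
    low, high = PER_NODE_MINUTES[upgrade_type]
    return node_count * low, node_count * high


for kind in PER_NODE_MINUTES:
    low, high = estimate_window(kind, 4)
    print(f"{kind}: {low}-{high} minutes for 4 nodes")
```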

Firmware with Life Cycle Manager (LCM)

This section will focus mainly on firmware updates with LCM; however, you can also use LCM to upgrade software like AOS and Foundation. Updating software entities with LCM utilizes the same mechanisms that were available in the legacy One Click Upgrade Software section of Prism. If you're interested in how these work, please refer to the sections of this document that describe those workflows.

See KB 7536 for an FAQ on this feature. You must configure rules in your external firewall to allow LCM updates. See the Prism Web Console Guide: Firewall Requirements for details. Consult the LCM Guide for full details on using the feature.

LCM’s ability to inventory or update certain components may depend on which versions of the AOS and Foundation are running on the cluster. Users wishing to see a full list of available updates should consider bringing these software up-to-date first or check the LCM Release Notes to see if any of these dependencies exist for your environment.

What happens when I click Update?

  • First, pre-checks will run to make sure that the cluster is in a good state for the upgrade to proceed. Prism will report any pre-checks that fail; consult KB 4584 for an explanation of each pre-check and how to resolve the issue. Once the issue that caused the pre-check to fail is resolved, run a fresh Inventory and then try the upgrade operation again.
  • Nearly all firmware updates performed through LCM require the hosts to boot into a CentOS-based staging area called Phoenix, with the following exceptions:
    • Certain modules for Dell platforms.
    • From LCM 2.3.2 onwards, LCM uses an IVU-based update mechanism for disk firmware, which does not require a host reboot.
    • From LCM 2.4.0 onwards, when certain conditions are met, LCM uses the Redfish update mechanism for BIOS and BMC firmware, which does not require a host reboot.
  • LCM has built-in intelligence that tells it what order to do the firmware updates, so there is no need for users to worry about what updates to perform first. Users can simply select the action Update All and LCM will automatically satisfy all dependencies between the firmware.
  • If multiple hosts are selected to have firmware updates performed, LCM will evacuate User VMs from the hosts one-at-a-time and boot them into the Phoenix staging area to perform the updates. No user VMs will be powered-off and your workload should continue to be served without disruption.
  • Depending on the firmware being upgraded, you may see your hypervisor reboot several times back into Phoenix. This is expected behaviour and you should not try to intervene.
  • Once the firmware updates are completed, the selected node will boot back into the hypervisor and power up the local Controller VM, making sure that all cluster services are up and running.
  • Finally, LCM will make sure that the local hypervisor can once again host User VMs before the upgrade continues to the next node.
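
The version thresholds in the list above can be summarized as a small decision function. This is a sketch of the rules as stated in this document only: the component labels and the `redfish_conditions_met` flag are hypothetical, the "certain conditions" for Redfish are not spelled out here, and the Dell module exception is not modeled.

```python
def parse_version(version):
    """'2.4' -> (2, 4, 0); pads to three components for safe comparison."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def update_mechanism(component, lcm_version, redfish_conditions_met=False):
    """Pick the firmware update mechanism described in the text (sketch)."""
    lcm = parse_version(lcm_version)
    if component == "disk" and lcm >= (2, 3, 2):
        return "IVU"      # no host reboot required
    if component in ("bios", "bmc") and lcm >= (2, 4, 0) and redfish_conditions_met:
        return "Redfish"  # no host reboot required
    return "Phoenix"      # host reboots into the staging area


print(update_mechanism("disk", "2.3.2"))
print(update_mechanism("bios", "2.4.0", redfish_conditions_met=True))
print(update_mechanism("bmc", "2.4.0"))
```

Anything that does not match one of the two exceptions falls through to Phoenix, which matches the default behavior described above.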

How long does it take?

This depends on the number of firmware updates being performed on a given node and how long it takes to evacuate User VMs from each host. For reference:

  • SATA DOM firmware upgrade (Phoenix) tends to take about 45 minutes per node.
  • BIOS and BMC firmware upgrades (Phoenix) tend to take about the same amount of time as SATA DOM.
  • BIOS and BMC firmware upgrades (Redfish) tend to take about 10 - 15 minutes.
  • Disk firmware upgrades (IVU) take less time than the Phoenix mechanism, but the total time is proportional to the number of disks being upgraded.

Foundation Software

The only prerequisite for a Foundation software upgrade is that all CVMs are up and that the Foundation service is in a stopped state across the cluster. This service is typically not running unless an LCM upgrade or Cluster Expand operation is taking place.

What happens when I click Upgrade Now?

The Foundation binaries are updated across all CVMs. No running services, CVMs, or hypervisors are restarted.

How long does it take?

This takes about one minute.

Nutanix Cluster Check (NCC) Software

The only prerequisite for an NCC upgrade is that all CVMs are up. Check out the NCC Guide for instructions on how to upgrade.

What happens when I click Upgrade Now?

The new NCC software is copied to each CVM and then the cluster_health service, which is responsible for health monitoring and the logic underlying cluster alerts, is restarted on each node. No services involved in the data path are restarted.

How long does it take?

This takes about five minutes.

File Server (Nutanix Files) Software

Installing (or Upgrading) Files

What happens when I click Upgrade Now?

  • First, the pre-upgrade checks will run to make sure that the cluster is able to be upgraded. If any of the pre-upgrade checks fail, you will see information about this in Prism and the actual File Server upgrade will not start. Users will have to click Back to Versions and start the upgrade again after the issue reported by the pre-checks is resolved. To see the full list of pre-checks and their related articles, check out KB 6524.
  • Once the File Server upgrade begins, each File Server VM is upgraded one-at-a-time onto the new Nutanix Files version. While an FSVM is down for the upgrade, users connected to shares hosted by this node may experience a loss of connectivity for a duration of roughly 20-30 seconds. After this short period, another FSVM will pick up on hosting those shares, and users will regain access to their files.
  • After each FSVM completes its reboot onto the new version of Nutanix Files, the upgrade will make sure that it can once again host shares before starting to upgrade the next FSVM.

How long does it take?

About 20 minutes per File Server VM.

Additional Information

Document ID: HT514179
Original Publish Date: 09/08/2022
Last Modified Date: 08/27/2024