VMware vSAN
June 26

VMware vSAN 8.0 Update 3 - What's New (Complete and in-depth list)

VMware vSAN Data Protection

I think that vSAN Data Protection deserves a separate article, but let's briefly describe what it is, how it works and why it is needed.
One of the major improvements of the vSAN ESA architecture was the introduction of new B-tree snapshots: snapshots that can, in theory, be created in any number and have no performance impact either at runtime or on deletion. But before vSAN 8.0 U3 the maximum number of snapshots per VM was limited to 32, all workflows remained the same, and you could only notice the benefits of the new snapshots when you created them manually or when some external system (such as a backup product using VADP) created them.
But that changed in vSAN 8.0 Update 3 with the introduction of VMware vSAN Data Protection. It is essentially an orchestrator on top of the existing ESA snapshot system that adds new workflows.
The first and most obvious (even by name) use case is automatically protecting data on a vSAN ESA cluster from accidental or malicious deletion.
How it works:
1.) You create a Protection Group (PG) that includes a number of VMs. You can form the list by VM name pattern or select them manually. A single VM can belong to up to 3 different PGs.
2.) For this Protection Group, you specify the snapshot creation schedule (up to 10 schedules per PG), the retention period (a fixed snapshot deletion date, a number of retention days, or indefinite retention) and immutability (the inability to manually delete a snapshot before its retention period expires).
3.) vSAN Data Protection creates crash-consistent snapshots (up to 200 per VM) according to the schedule and stores them on the vSAN ESA datastore. Please note that snapshots for the VMs within a PG are not created at exactly the same moment.
4.) If the vSAN datastore is more than 70% full, scheduled snapshot creation is paused to prevent the datastore from filling up.
5.) At any time, you can restore a VM to the state of any stored snapshot or create a VM clone from it. Even if the VM is deleted from vCenter/ESXi, its snapshots will still be stored on the vSAN datastore and you can restore and re-register it.
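The rules above (the per-VM snapshot limit, the 70% capacity gate, and immutability) can be sketched as a small model. This is purely illustrative Python under my reading of the behaviour described - it is not the vSAN Data Protection API, and all names are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_SNAPSHOTS_PER_VM = 200  # ESA snapshot limit per VM in 8.0 U3
CAPACITY_GATE = 0.70        # scheduled snapshots pause above 70% datastore fill

@dataclass
class Snapshot:
    taken_at: datetime
    expires_at: Optional[datetime]  # None = indefinite retention
    immutable: bool = False

def can_take_scheduled_snapshot(datastore_used_fraction: float,
                                vm_snapshot_count: int) -> bool:
    """Mirror the two gates described above: capacity and per-VM limit."""
    if datastore_used_fraction >= CAPACITY_GATE:
        return False  # datastore 70%+ full: scheduling pauses
    return vm_snapshot_count < MAX_SNAPSHOTS_PER_VM

def can_delete(snap: Snapshot, now: datetime) -> bool:
    """An immutable snapshot cannot be deleted before retention expires."""
    if not snap.immutable:
        return True
    return snap.expires_at is not None and now >= snap.expires_at

now = datetime(2024, 7, 1)
snap = Snapshot(taken_at=now, expires_at=now + timedelta(days=7), immutable=True)
print(can_take_scheduled_snapshot(0.72, 10))      # False: capacity gate hit
print(can_delete(snap, now))                      # False: still immutable
print(can_delete(snap, now + timedelta(days=8)))  # True: retention expired
```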

The second big use case I see is creating test landscapes from a production environment. You can create a Linked Clone of a production service consisting of one or more VMs for our developers, testers, DBAs or support team to work on. And these snapshots and clones have zero performance impact on the production environment.

Also, vSAN Data Protection integrates with VMware Live Cyber Recovery (ex-VCDR), which allows us to send snapshots there, start VMs from them in a clean Isolated Recovery Environment (IRE), and accelerate Failback.

That said, vSAN Data Protection as a service is an additional VM on the cluster that needs to be deployed from an OVA. The important thing is that it is stateless: all snapshots and metadata are stored on vSAN itself, so even if this VM is deleted, you just deploy a new one and all snapshots will be rediscovered and ready for recovery.

Finally, I want to highlight that ESA snapshots are local to the datastore, which is why, if something goes wrong with the vSAN datastore itself, snapshots may be lost together with the source VM. Thus, you should consider vSAN Data Protection not as a replacement for classic backup and DR solutions, but as an extension of data protection that offers ultra-fast recovery from failures within the VM, operator mistakes, or the actions of an intruder.

vSAN Management and Monitoring Improvements

Capacity-based licensing support.

Added support for the new subscription licences for vSAN, which, I remind you, are now per TiB of RAW capacity. VCF licences include 1 TiB per core, which can be extended with additional vSAN per-TiB licences, and VVF includes a 100 GiB promo licence per core (if this is not enough, you have to buy licences for the whole required volume, and these 100 GiB are not counted).
The old perpetual licences still work as usual.
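The entitlement arithmetic is easy to get wrong, especially the VVF rule where exceeding the promo means licensing the whole volume. A back-of-the-envelope sketch of my reading of the rules above - an illustration, not an official calculator:

```python
def vsan_tib_to_license(cores: int, raw_tib_needed: float, edition: str) -> float:
    """Return the additional per-TiB vSAN licences to buy.

    VCF: includes 1 TiB of RAW vSAN capacity per core; buy the remainder.
    VVF: includes a 0.1 TiB (100 GiB) promo per core, but if you exceed it,
         you license the WHOLE required volume (the promo is not counted).
    """
    if edition == "VCF":
        included = cores * 1.0
        return max(0.0, raw_tib_needed - included)
    if edition == "VVF":
        included = cores * 0.1
        return 0.0 if raw_tib_needed <= included else raw_tib_needed
    raise ValueError(f"unknown edition: {edition}")

# Example: a 64-core cluster that needs 100 TiB RAW
print(vsan_tib_to_license(64, 100, "VCF"))  # 36.0 extra TiB
print(vsan_tib_to_license(64, 100, "VVF"))  # 100.0 TiB (promo not counted)
```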

Proactive Hardware Management

This functionality facilitates the collection of detailed telemetry data from storage devices across various server vendors. The system utilizes this data to identify potential hardware issues in advance of failure. Server vendors can now integrate their storage device telemetry data through a standardized interface, allowing vSAN to aggregate and analyze this information. The collected data is then processed and presented to system administrators in a structured format, providing insights into the health and performance of storage devices. This approach aims to enhance predictive maintenance capabilities and reduce unexpected downtime by enabling administrators to address potential hardware problems before they escalate into critical issues.

One important thing to understand is that Proactive Hardware Management (PHM) is an optional extension and API that hardware vendors can (but are not required to) use. PHM relies on the Hardware Support Manager (HSM) plugin, and together they are part of the vLCM framework.

So far, Dell, HPE and Lenovo have announced support, but you can verify it in the HCL - there is a separate section that indicates such support.

Customizable Endurance Alerts

Starting with vSAN 8.0 Update 2, SMART values of NVMe drives are monitored on ESA clusters. One parameter that often gets attention is drive Endurance. In vSAN 8.0 U2 there are two automatic alerts - a Warning at 90% and a Critical at 100%.
In vSAN 8.0 Update 3, the ability to customise these values has been added. Now you can specify your own thresholds and also choose which clusters/hosts/disks to apply them to (ESA only). For example, you can set the alert threshold for a production cluster to 75% and for a test cluster to 85%. Or you can set one threshold for the Read-Intensive disks of vendor A and another for the Mixed-Use disks of vendor B. This is done by creating a custom alert.
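The threshold logic itself is simple; the new part is being able to override the defaults per cluster, host or disk. An illustrative sketch of the evaluation (vSAN does this internally - the function and parameter names here are my own):

```python
def endurance_alert(endurance_pct: float,
                    warning_pct: float = 90.0,
                    critical_pct: float = 100.0) -> str:
    """Map a drive's consumed-endurance percentage to an alert level.
    The defaults match the pre-8.0 U3 fixed thresholds (90/100); 8.0 U3
    lets you supply custom values per cluster/host/disk (ESA only)."""
    if endurance_pct >= critical_pct:
        return "CRITICAL"
    if endurance_pct >= warning_pct:
        return "WARNING"
    return "OK"

print(endurance_alert(92))                    # WARNING at the default 90%
print(endurance_alert(80, warning_pct=75.0))  # WARNING under a 75% custom rule
print(endurance_alert(80))                    # OK under the defaults
```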

Multi-VM I/O Trip Analyzer

Previously, I/O Trip Analyzer could only be run for a single VM. Now you can select up to 8 VMs - for example, all the VMs that make up a service that is experiencing problems. It works for both OSA and ESA and requires both vCenter and ESXi to be version 8.0 U3.

New RDMA NIC/FW/Drivers Health Check

vSAN Health Check now includes a NIC check when RDMA is enabled on the cluster. It verifies that the NIC is certified for your version of ESXi, certified for your version of vSAN, and that the drivers and firmware match those specified in the HCL.

VCF Operations supports vSAN Max

VCF Operations (ex-vRops, ex-Aria Operations) has added support for vSAN Max: you can see the connectivity topology (which clusters/hosts vSAN Max is connected to), and alerting and capacity management have been added. In general, VCF Operations now realises that there is such a thing as vSAN Max and that it is not just another standard vSAN cluster. Although the work is not yet complete (on many dashboards vSAN Max cannot be distinguished from a regular cluster), this is the first step towards supporting vSAN Max within VCF Operations.

Federated vSAN health monitoring in VCF Operations

The latest VCF Operations (ex-Aria Operations) introduces federated vSAN cluster health monitoring for clusters spanning across multiple vCenters.

Security Configuration Guide for vSAN

Widely used in critical and secured infrastructures, the vSphere Security Configuration & Hardening Guide now includes vSAN guidance as well.

Merging the vSAN Management SDK with the Python SDK for the VMware vSphere API

Starting with vSphere 8.0 Update 3, the vSAN Management SDK for Python is integrated into the Python SDK for the VMware vSphere API (pyVmomi). From the Python Package Index (PyPI), you can download a single package to manage vSAN, vCenter, and ESXi. This integration streamlines the discovery and installation process and enables automated pipelines instead of the series of manual steps previously required.

Other vSAN Improvements

Congestion remediation.

vSAN 8.0 Update 3 enhances vSAN OSA's ability to detect and remediate various types of congestion early, preventing cluster-wide I/O latencies.

There was actually some very deep and extensive work done here, but the details are not published because they are very low-level. The key point is that, for a large number of congestion types at the LSOM level (OSA), technology has been added for early detection of congestion, as well as for remediating it where possible. This has the potential to dramatically reduce the number of cases where a congestion triggers significant back pressure in the cluster, resulting in increased latency and performance degradation.

Adaptive delete congestion.

vSAN now provides adaptive delete congestion for compression-only disk groups in vSAN OSA, improving IOPS performance and delivering more predictable application responses.

In short, there is a "Delete congestion" (one of many congestion types in vSAN) at the LSOM level, whose purpose is to prevent a disk from filling up completely in scenarios where the average load is high and we get heavy write-burst traffic on a single component. This is relevant for compression-only disk groups (because in a group with deduplication, data is written more or less evenly across all disks in the disk group), and we also have to take into account the (unknown) compression ratio and incoming zeros.
Past versions of vSAN used static thresholds based on the current fill level, hence the static congestion values (i.e. two fixed backpressure levels). As a result, two things could happen: a sharp spike in latency (when congestion kicks in), and, in some extreme cases, congestion might not react in time and the SSD would still fill up.
In vSAN 8.0 U3, the fill rate is forecast and congestion varies linearly over a range.
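The difference between the two behaviours can be sketched in a few lines. This is a conceptual model only - the real thresholds, units and forecasting logic are internal to LSOM and not published:

```python
def static_backpressure(fill: float, low: float = 0.80, high: float = 0.95) -> float:
    """Pre-8.0 U3 style: two fixed steps, so latency jumps sharply
    the moment a threshold is crossed."""
    if fill >= high:
        return 1.0
    if fill >= low:
        return 0.5
    return 0.0

def adaptive_backpressure(fill: float, start: float = 0.80, full: float = 0.95) -> float:
    """8.0 U3 style sketch: backpressure ramps linearly over a range
    (the fill-rate forecast that drives the range is omitted for brevity)."""
    if fill <= start:
        return 0.0
    if fill >= full:
        return 1.0
    return (fill - start) / (full - start)

for f in (0.79, 0.80, 0.87, 0.95):
    print(f, static_backpressure(f), round(adaptive_backpressure(f), 2))
```

Note how the static scheme jumps from 0.0 straight to 0.5, while the adaptive one rises smoothly - that smooth ramp is what avoids the latency spike described above.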

Device level unmap support for vSAN ESA.

This release enhances vSAN ESA to send UNMAP commands when space is freed, improving SSD garbage collection efficiency and overall I/O performance.

Most modern SSDs support the UNMAP/Deallocate command, which helps Garbage Collection on the SSD. The command sends the disk a list of blocks that can be cleaned/emptied by the SSD controller. To do this, the controller moves data blocks between pages and then erases the free pages. This allows the next write to go to a clean block, rather than having to erase one first. And yes, while Garbage Collection on enterprise SSDs (unlike consumer-grade SSDs) always works in the background and can absorb writes on the fly, for write-intensive workloads, and especially on Read-Intensive SSDs, pre-clearing space via UNMAP can somewhat improve write speed and write consistency.
Until now, vSAN did not send the disks a list of blocks to be cleared, so UNMAP/Deallocate was not used. In vSAN 8.0 Update 3, when using ESA, the UNMAP/Deallocate command is sent to the SSD (provided the disk supports it and has correctly reported it to ESXi) to help the SSD controller and its Garbage Collection mechanisms.
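Why this matters can be shown with a toy model: without UNMAP the SSD does not know a freed block is reusable, so the next write to it pays for an erase. Purely illustrative - real flash management (pages vs. blocks, wear levelling) is far more complex:

```python
class SsdModel:
    """Toy SSD: erased blocks accept writes immediately, dirty blocks
    need an erase first. Illustrates the benefit of UNMAP, nothing more."""

    def __init__(self, blocks: int):
        self.state = ["erased"] * blocks  # erased | live | dirty

    def write(self, lba: int) -> str:
        cost = "fast" if self.state[lba] == "erased" else "erase-then-write"
        self.state[lba] = "live"
        return cost

    def delete(self, lba: int) -> None:
        # The upper layer frees the block, but the SSD doesn't know...
        self.state[lba] = "dirty"

    def unmap(self, lbas) -> None:
        # ...unless it is told via UNMAP, letting background GC pre-erase.
        for lba in lbas:
            if self.state[lba] == "dirty":
                self.state[lba] = "erased"

ssd = SsdModel(4)
ssd.write(0); ssd.delete(0)
print(ssd.write(0))   # erase-then-write: no UNMAP was sent
ssd.delete(0)
ssd.unmap([0])
print(ssd.write(0))   # fast: the block was pre-erased after UNMAP
```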

vSAN File Services now support up to 250 NFS File Shares per cluster.

It's pretty self-explanatory. Now you can create up to 250 NFS shares per vSAN cluster. This is useful in two main scenarios: if you simply need more NFS Shares, but most of all for K8s, where each Read-Write-Many Persistent Volume (RWM PV) is a separate NFS Share. The limit has increased by 150%, from 100 NFS Shares to 250.
Things to note:
1.) This is ONLY for ESA. For OSA, the limit is the same as before (100).
2.) The maximum number of SMB Shares is still 100.
3.) In vSAN 8.0U3, each container can serve up to 25 NFS Shares. But several containers can run on each host/FSVM, which lets you reach a large number of NFS Shares even on small clusters - though you need to take into account the balancing of containers across FSVMs and the resources available to them.
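A quick back-of-the-envelope helper for sizing, based only on the 25-shares-per-container and 250-shares-per-cluster figures above (placement and FSVM balancing are up to vSAN itself):

```python
import math

SHARES_PER_CONTAINER = 25  # vSAN 8.0 U3: NFS shares a single container serves
MAX_SHARES_ESA = 250       # per-cluster limit (ESA only)

def containers_needed(nfs_shares: int) -> int:
    """Minimum number of file-service containers (across all FSVMs)
    required to serve the given number of NFS shares."""
    if not 0 < nfs_shares <= MAX_SHARES_ESA:
        raise ValueError("share count must be 1..250 on an ESA cluster")
    return math.ceil(nfs_shares / SHARES_PER_CONTAINER)

print(containers_needed(60))   # 3 containers
print(containers_needed(250))  # 10 containers at the cluster maximum
```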

As well as many other small improvements under the hood that didn't make it into Release Notes and announcements.

I'll add them if I see public mentions of them.

VCF Related Improvements

VMware Cloud Foundation brownfield ingest.

VMware Cloud Foundation now lets users import existing vSphere and vSAN clusters, including stand-alone vSAN deployments. It simplifies onboarding, speeds up integration, and reduces migration complexity for users upgrading to a full-stack private cloud platform.

VMware vSAN ESA Stretched Cluster Support in VMware Cloud Foundation

You can now use an ESA-based vSAN Stretched Cluster in the same way that you could previously use an OSA-based Stretched Cluster in VCF environments. The operation/enabling processes, limitations and so on are the same for OSA and ESA. The only nuance is that vSAN Max in a Stretched Cluster configuration is not supported in VCF 5.2.

VMware vSAN Max Support with VMware Cloud Foundation

You can now use vSAN Max as Primary Storage for Workload Domains (including compute-only clusters). The creation and management processes are integrated into the VMware Cloud Foundation workflows and console.

Kubernetes Related Improvements

Use CNS on TKGs on Stretched vSAN

Support Stretched vSAN Cluster for TKGs to ensure High Availability.

Everything is obvious here too, although some specifics should be noted: the Control Plane must live in one site (because etcd uses an odd number of nodes and needs quorum), and you then use Affinity/Anti-Affinity rules and vSAN Storage Policies to distribute data, workers and the control plane correctly between the sites.

Enable PV Migration Across Non-shared Datastores within the Same VC

The ability to move a PV, either attached or detached, from one vSAN datastore to another when they share no common host. An example would be moving a K8s workload from a vSAN OSA cluster to a vSAN ESA cluster.

Use CNS on vSAN Max

Enable support for vSphere Container Storage Plug-in consumers to deploy CSI Volumes to vSAN Max deployments.

Enable File Volume in HCI Mesh Topology within a Single vCenter.

Enables file volumes in an HCI Mesh topology within a single vCenter.

Up to 250 RWM PV per Cluster

To be honest, this is a repeat because I already wrote about it above in the vSAN File Services enhancements section, but I thought it was important to note it here as well. Because of the increase in the maximum number of NFS File Shares per cluster to 250, you will now be able to use up to 250 RWM PVs per cluster. Again, please note that this is for ESA only. You can see the details above.

vSphere/vCenter Related Improvements

This set of features is not directly related to vSAN, but since vSAN and ESXi are two parts of the same whole, I can't help but mention them in one line. For details and other new features in vSphere 8.0 Update 3, go here.

ESXi Live Patching

With vSphere 8.0 Update 3 we can address critical bugs in the virtual machine execution environment (vmx) without the need to reboot or evacuate the entire host. Examples of fixes include those in the virtual devices space.

Virtual machines are fast-suspend-resumed (FSR) as part of the host remediation process, which is non-disruptive for most virtual machines. FSR is already used when adding or removing virtual hardware devices on powered-on virtual machines.

vCenter Reduced Downtime

Patching and updating vCenter with minimal downtime now includes complete topology support and the ability to perform the switchover phase automatically.

vSphere Configuration Profiles

vSphere Configuration Profiles can now also manage the configuration of clusters that still use baselines (formerly Update Manager) and have not yet transitioned to cluster images with vSphere Lifecycle Manager.

Enhanced Image Customization

You can now exclude individual vendor components from Vendor Addon, VMware Tools, and Host Client from the image. This can reduce the size of the image and also eliminate unnecessary components on the host.

Embedded vSphere Cluster Service

vCLS is becoming more convenient and seamless. Now only two VMs are needed, but, most importantly, they are now built into ESXi (rather than pulled from vCenter) in the form of a CRX runtime, and run directly in RAM rather than from a datastore.