<?xml version="1.0" encoding="utf-8" ?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:tt="http://teletype.in/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"><title>Nikolay Kulikov</title><subtitle>Hey! Hi all, this is a resource where I plan to put my findings and materials on IT and primarily on cloud services and storage technologies.</subtitle><author><name>Nikolay Kulikov</name></author><id>https://teletype.in/atom/nkulikov</id><link rel="self" type="application/atom+xml" href="https://teletype.in/atom/nkulikov?offset=0"></link><link rel="alternate" type="text/html" href="https://nkulikov.com/?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><link rel="next" type="application/rss+xml" href="https://teletype.in/atom/nkulikov?offset=10"></link><link rel="search" type="application/opensearchdescription+xml" title="Teletype" href="https://teletype.in/opensearch.xml"></link><updated>2026-04-05T01:01:34.077Z</updated><entry><id>nkulikov:WhatsNew_vSAN80U3</id><link rel="alternate" type="text/html" href="https://nkulikov.com/WhatsNew_vSAN80U3?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMware vSAN 8.0 Update 3 - What's New (Complete and in-depth list)</title><published>2024-06-26T13:39:27.278Z</published><updated>2024-06-27T15:50:42.085Z</updated><category term="v-mware-v-san" label="VMware vSAN"></category><summary type="html">&lt;img src=&quot;https://img4.teletype.in/files/7e/c7/7ec7f687-7683-40d6-971a-4325870f5e81.png&quot;&gt;VMware vSAN 8.0 Update 3 - Whats New (Complete and in-depth list)</summary><content type="html">
  &lt;figure id=&quot;if6L&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/7e/c7/7ec7f687-7683-40d6-971a-4325870f5e81.png&quot; width=&quot;576&quot; /&gt;
  &lt;/figure&gt;
  &lt;h2 id=&quot;dkiu&quot;&gt;VMware vSAN Data Protection&lt;/h2&gt;
  &lt;p id=&quot;vaWa&quot;&gt;I think vSAN Data Protection deserves a separate article, but let&amp;#x27;s briefly describe what it is, how it works and why it is needed.&lt;br /&gt;One of the major improvements of the vSAN ESA architecture was the introduction of new B-tree snapshots - snapshots that can, in theory, be created in any number and that have no performance impact either at runtime or on deletion. But before vSAN 8.0U3 the maximum number of snapshots per VM was limited to 32, all workflows remained the same, and you could notice the benefits of the new snapshots only when you created them manually or when some external system (such as a backup product using VADP) created them.&lt;br /&gt;That has changed in vSAN 8.0 Update 3 with the introduction of VMware vSAN Data Protection. It is essentially an orchestrator on top of the existing ESA snapshot system that adds new workflows.&lt;br /&gt;The first and most obvious (even by name) use case is to automatically protect data on a vSAN ESA cluster from accidental or malicious deletion.&lt;br /&gt;How it works:&lt;br /&gt;1.) You create a Protection Group (PG) that includes a number of VMs. You can form the list by VM name pattern or select them manually. One VM can belong to up to 3 different PGs.&lt;br /&gt;2.) You specify for this Protection Group the snapshot schedule (up to 10 schedules per PG), the retention period (a fixed snapshot deletion date, a number of retention days, or indefinite retention) and immutability (the impossibility of manually deleting a snapshot before the retention period expires).&lt;br /&gt;3.) vSAN Data Protection creates crash-consistent snapshots (up to 200 per VM) according to the schedule and stores them on the vSAN ESA datastore. Please note that snapshots for all VMs within the PG are not created at exactly the same time.&lt;br /&gt;4.) If the vSAN datastore is more than 70% full, scheduled snapshot creation stops to prevent the datastore from filling up.&lt;br /&gt;5.) At any time, you can restore a VM to the state from any stored snapshot or create a VM clone from it. Even if the VM is deleted from vCenter/ESXi, its snapshots are still stored on the vSAN datastore and you can restore and register it again.&lt;br /&gt;&lt;/p&gt;
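  &lt;p&gt;To make the moving parts concrete, here is a purely illustrative Python sketch of what a Protection Group definition from steps 1-2 boils down to. The field names are hypothetical - vSAN Data Protection is configured through its UI, not through this structure:&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;# Illustrative only: hypothetical field names, not a real vSAN API.
protection_group = {
    &quot;name&quot;: &quot;prod-web&quot;,
    &quot;vm_name_patterns&quot;: [&quot;web-*&quot;, &quot;app-*&quot;],  # or an explicit VM list
    &quot;schedules&quot;: [                            # up to 10 schedules per PG
        {&quot;every_hours&quot;: 1,  &quot;retention_days&quot;: 2},
        {&quot;every_hours&quot;: 24, &quot;retention_days&quot;: 30},
    ],
    &quot;immutable&quot;: True,  # snapshots cannot be deleted before retention expires
}

MAX_PGS_PER_VM = 3               # one VM can belong to at most 3 PGs
MAX_SNAPSHOTS_PER_VM = 200       # crash-consistent snapshots per VM
CAPACITY_STOP_THRESHOLD = 0.70   # scheduled snapshots pause at 70% datastore fill
&lt;/code&gt;&lt;/pre&gt;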
  &lt;p id=&quot;eBjq&quot;&gt;The second big use case I see is creating test landscapes from a production environment. You can create a Linked Clone of a production service consisting of one or more VMs for our developers, testers, DBAs or support team to work on. And these snapshots and clones have zero performance impact on the production environment. &lt;/p&gt;
  &lt;p id=&quot;egDL&quot;&gt;Also, vSAN Data Protection integrates with VMware Live Cyber Recovery (ex-VCDR), which allows us to send snapshots there, start VMs from them in a clean Isolated Recovery Environment (IRE), and accelerate Failback.&lt;/p&gt;
  &lt;p id=&quot;bmte&quot;&gt;&lt;br /&gt;Note that the vSAN Data Protection service runs as an additional VM on the cluster that needs to be deployed from an OVA. The important thing is that it is stateless and all snapshots and metadata are stored on vSAN itself, so even if this VM is deleted, you just need to deploy a new one and all snapshots will be rediscovered and ready for recovery.&lt;/p&gt;
  &lt;p id=&quot;p3AP&quot;&gt;Finally, I want to highlight that ESA snapshots are local to the datastore, which is why, if something goes wrong with the vSAN datastore itself, snapshots may be lost along with the source VM. Thus, you have to consider vSAN Data Protection not as a replacement for classical backup and DR solutions, but as an extension of data protection with ultra-fast recovery from failures within the VM, operator mistakes, or the actions of an intruder.&lt;/p&gt;
  &lt;h2 id=&quot;6cAD&quot;&gt;vSAN Management and Monitoring Improvements&lt;/h2&gt;
  &lt;h3 id=&quot;oMAE&quot;&gt;&lt;strong&gt;Capacity-based licensing support&lt;/strong&gt;. &lt;/h3&gt;
  &lt;p id=&quot;4gxl&quot;&gt;Added support for the new subscription licences for vSAN, which, I remind you, are now per-TiB of raw capacity. VCF licences include 1 TiB per core, which can be extended with additional vSAN per-TiB licences, and VVF includes 100 GiB of promo licences per core (if this is not enough, you have to buy licences for the whole required volume, and these 100 GiB are not counted).&lt;br /&gt;The old perpetual licences still work as usual.&lt;/p&gt;
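  &lt;p&gt;A quick back-of-the-envelope Python sketch of how I read these entitlement rules (my own illustration, not an official calculator):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;def vcf_entitled_tib(cores, addon_tib=0):
    # VCF: 1 TiB of raw vSAN capacity per licensed core, extendable per TiB
    return cores * 1.0 + addon_tib

def vvf_licensed_tib(cores, required_tib):
    # VVF: 100 GiB promo per core; if that is not enough, you license the
    # whole required volume and the promo capacity is not counted
    promo_tib = cores * 100 / 1024
    return 0 if required_tib &amp;lt;= promo_tib else required_tib

print(vcf_entitled_tib(cores=128, addon_tib=50))     # 178.0 TiB entitled
print(vvf_licensed_tib(cores=128, required_tib=40))  # promo 12.5 TiB short, buy 40
&lt;/code&gt;&lt;/pre&gt;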
  &lt;h3 id=&quot;bCQ5&quot;&gt;Proactive Hardware Management&lt;/h3&gt;
  &lt;p id=&quot;KU8B&quot;&gt;This functionality facilitates the collection of detailed telemetry data from storage devices across various server vendors. The system utilizes this data to identify potential hardware issues in advance of failure. Server vendors can now integrate their storage device telemetry data through a standardized interface, allowing vSAN to aggregate and analyze this information. The collected data is then processed and presented to system administrators in a structured format, providing insights into the health and performance of storage devices. This approach aims to enhance predictive maintenance capabilities and reduce unexpected downtime by enabling administrators to address potential hardware problems before they escalate into critical issues.&lt;/p&gt;
  &lt;p id=&quot;mR9M&quot;&gt;&lt;br /&gt;One important thing to understand is that Proactive Hardware Management (PHM) is an optional extension and API that hardware vendors can (but are not required to) use. PHM relies on the Hardware Support Manager (HSM) plugin, and together they are part of the vLCM framework.&lt;/p&gt;
  &lt;p id=&quot;8ZCh&quot;&gt;So far, Dell, HPE and Lenovo have announced support, but you can check the HCL to be sure - it has a separate section that indicates such support.&lt;/p&gt;
  &lt;h3 id=&quot;9G6M&quot;&gt;Customizable Endurance Alerts&lt;/h3&gt;
  &lt;p id=&quot;UG0u&quot;&gt;Starting with vSAN 8.0 Update 2, SMART values for NVMe drives are monitored on ESA clusters. One of the most closely watched parameters is drive endurance. In vSAN 8.0U2 there are two automatic alerts - Warning at 90% and Critical at 100%.&lt;br /&gt;vSAN 8.0 Update 3 adds the ability to customise these values. Now you can specify your own thresholds and also choose which clusters/hosts/disks to apply them to (ESA only). For example, you can set the alert threshold for a production cluster to 75% and for a test cluster to 85%. Or you can set one threshold for Read-Intensive disks of vendor A and another for Mixed-Use disks of vendor B. This is done by creating a custom alert:&lt;/p&gt;
  &lt;figure id=&quot;PLGS&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/3f/f0/3ff05563-3915-482e-8002-d0b41b7b58e6.jpeg&quot; width=&quot;702&quot; /&gt;
  &lt;/figure&gt;
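  &lt;p&gt;Conceptually, the customisation boils down to mapping a scope (cluster, host or disk) to its own threshold pair instead of the fixed 90/100 - a small sketch of that logic (my illustration, not vSAN code; the threshold values are the examples above):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;THRESHOLDS = {
    &quot;prod-cluster&quot;: (75, 90),   # (warning %, critical %) - example values
    &quot;test-cluster&quot;: (85, 95),
}

def endurance_alert(scope, percent_used, default=(90, 100)):
    warning, critical = THRESHOLDS.get(scope, default)
    if percent_used &amp;gt;= critical:
        return &quot;CRITICAL&quot;
    if percent_used &amp;gt;= warning:
        return &quot;WARNING&quot;
    return &quot;OK&quot;

print(endurance_alert(&quot;prod-cluster&quot;, 80))  # WARNING at the custom 75% threshold
&lt;/code&gt;&lt;/pre&gt;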
  &lt;h3 id=&quot;cNHX&quot;&gt;Multi-VM I/O Trip Analyzer&lt;/h3&gt;
  &lt;p id=&quot;LZtL&quot;&gt;Previously, I/O Trip Analyzer could only be run for a single VM. Now you can select up to 8 VMs - for example, all the VMs that make up a service that is experiencing problems. It works for both OSA and ESA and requires both vCenter and ESXi to be version 8.0U3.&lt;/p&gt;
  &lt;figure id=&quot;SmlL&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/4a/82/4a821cc3-f062-4076-b3bd-7fda2fb7c332.jpeg&quot; width=&quot;709&quot; /&gt;
  &lt;/figure&gt;
  &lt;h3 id=&quot;dA2g&quot;&gt;New RDMA NIC/FW/Drivers Health Check&lt;/h3&gt;
  &lt;p id=&quot;i9uL&quot;&gt;vSAN Health Check now includes a NIC check when RDMA is enabled on the cluster. It verifies that the NIC is certified for your version of ESXi and your version of vSAN, and also checks that the drivers and firmware match those specified in the HCL.&lt;/p&gt;
  &lt;figure id=&quot;eGn1&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/fd/a1/fda10cda-fa08-4f99-a1ac-6c9b74adc02f.jpeg&quot; width=&quot;1280&quot; /&gt;
  &lt;/figure&gt;
  &lt;h3 id=&quot;TijR&quot;&gt;VCF Operations support vSAN Max&lt;/h3&gt;
  &lt;p id=&quot;TMZB&quot;&gt;VCF Operations (ex-vRops, ex-Aria Operations) has added support for vSAN Max, within which you can see the connectivity topology (which clusters/hosts vSAN Max is connected to), added Alerting and capacity management. In general, VCF Operations now realises that there is such a thing as vSAN Max and that it is not just another standard vSAN cluster. Although the work is not yet complete (and on many dashboards vSAN Max cannot be distinguished from a regular cluster), this is the first step towards supporting vSAN Max within VCF Operations.&lt;/p&gt;
  &lt;h3 id=&quot;FooF&quot;&gt;&lt;strong&gt;Federated vSAN health monitoring in VCF Operations&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;Release-Note-Item-58127&quot;&gt;The latest VCF Operations (ex-Aria Operations) introduces federated vSAN cluster health monitoring for clusters spanning across multiple vCenters.&lt;/p&gt;
  &lt;h3 id=&quot;JMqh&quot;&gt;Security Configuration Guide for vSAN&lt;/h3&gt;
  &lt;p id=&quot;eiDq&quot;&gt;Widely used in critical and secured infrastructures, the vSphere Security Configuration &amp;amp; Hardening Guide now includes vSAN guidance as well.&lt;/p&gt;
  &lt;h3 id=&quot;id-6f73e117-3af3-4316-bec3-b80d714cb7ab&quot;&gt;&lt;strong&gt;Merging the vSAN Management SDK with the Python SDK for the VMware vSphere API&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;Tq6H&quot;&gt;Starting with vSphere 8.0 Update 3, the &lt;a href=&quot;https://developer.broadcom.com/sdks/vsan-management-sdk-for-python/latest&quot; target=&quot;_blank&quot;&gt;vSAN Management SDK for Python&lt;/a&gt; is integrated into the Python SDK for the VMware vSphere API (&lt;a href=&quot;https://developer.broadcom.com/sdks/pyvmomi/latest&quot; target=&quot;_blank&quot;&gt;pyVmomi&lt;/a&gt;). From the Python Package Index (&lt;a href=&quot;https://pypi.org/&quot; target=&quot;_blank&quot;&gt;PyPI&lt;/a&gt;), you can download a single package to manage vSAN, vCenter, and ESXi. This integration streamlines the discovery and installation process and enables automated pipelines instead of the series of manual steps previously required.&lt;/p&gt;
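  &lt;p&gt;A minimal connection sketch, assuming pyVmomi 8.0.3+ (where the vSAN bindings ship in the same package) and the vsanapiutils helper known from the vSAN SDK samples; adjust the host and credentials to your environment:&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host=&quot;vcenter.example.com&quot;,
                  user=&quot;administrator@vsphere.local&quot;,
                  pwd=&quot;***&quot;, sslContext=ctx)
try:
    # With the merged package, the vSAN managed objects come from the same install
    import vsanapiutils
    mos = vsanapiutils.GetVsanVcMos(si._stub, context=ctx)
    health = mos[&quot;vsan-cluster-health-system&quot;]  # vSAN cluster health entry point
    print(&quot;vSAN health system:&quot;, health)
finally:
    Disconnect(si)
&lt;/code&gt;&lt;/pre&gt;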
  &lt;h2 id=&quot;CKPt&quot;&gt;Other vSAN Improvements&lt;/h2&gt;
  &lt;h3 id=&quot;id-9fe120cb-5f0b-43ac-bbde-b4f76711e673&quot;&gt;&lt;strong&gt;Congestion remediation&lt;/strong&gt;. &lt;/h3&gt;
  &lt;p id=&quot;ybp7&quot;&gt;&lt;em&gt;vSAN 8.0 Update 3 enhances vSAN OSA&amp;#x27;s ability to detect and remediate various types of congestion early, preventing cluster-wide I/O latencies.&lt;/em&gt; &lt;/p&gt;
  &lt;p id=&quot;FYvo&quot;&gt;There was actually some very deep and extensive work done here, but its details are not published because they are very low-level. The key point is that for a large number of congestion types at the LSOM level (OSA), technology has been added for early detection of congestion, as well as remediation where possible. This has the potential to dramatically reduce the number of cases where congestion is triggered and creates significant back pressure in the cluster, resulting in increased latency and degraded performance.&lt;/p&gt;
  &lt;h3 id=&quot;id-d995b2df-4fe9-4393-b4fe-7e83a0dcf5b6&quot;&gt;&lt;strong&gt;Adaptive delete congestion&lt;/strong&gt;. &lt;/h3&gt;
  &lt;p id=&quot;kOxM&quot;&gt;&lt;em&gt;vSAN now provides adaptive delete congestion for compression-only disk groups in vSAN OSA, improving IOPS performance and delivering more predictable application responses.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;S2tJ&quot;&gt;In short, there is a &amp;quot;delete congestion&amp;quot; (one of many types of congestion in vSAN) at the LSOM level, whose purpose is to prevent a disk from completely filling up in scenarios where the average load is high and we get heavy write-burst traffic on a single component. This is relevant for compression-only disk groups (in a group with deduplication, data is written to all disks in the disk group more or less evenly), and we also have to take into account the (unknown) degree of compression and incoming zeros.&lt;br /&gt;Past versions of vSAN used static thresholds and the current fill level, hence static congestion values (i.e. two fixed backpressure steps). As a result, two things could happen - a sharp spike in latency when congestion kicked in, and, in some extreme cases, congestion might not react in time and the SSD would still fill up.&lt;br /&gt;In vSAN 8.0 U3, the fill rate is forecast and the congestion value varies linearly over a range.&lt;/p&gt;
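  &lt;p&gt;My conceptual sketch of the difference (not the actual LSOM code, and the thresholds are made up): a static two-step backpressure versus a value that ramps linearly as the forecast fill approaches the limit:&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;def static_congestion(fill):
    # Old behaviour, conceptually: fixed jumps at fixed thresholds
    if fill &amp;gt;= 0.95:
        return 255   # hard backpressure
    if fill &amp;gt;= 0.85:
        return 128   # fixed step - this is where latency spiked
    return 0

def adaptive_congestion(forecast_fill, lo=0.80, hi=0.95):
    # New behaviour, conceptually: scale linearly over a range using the
    # forecast fill rate, so backpressure rises smoothly instead of jumping
    if forecast_fill &amp;lt;= lo:
        return 0
    if forecast_fill &amp;gt;= hi:
        return 255
    return int(255 * (forecast_fill - lo) / (hi - lo))
&lt;/code&gt;&lt;/pre&gt;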
  &lt;h3 id=&quot;id-08a7a1e7-8fe6-49ec-b2bc-621c2e1544ae&quot;&gt;&lt;strong&gt;Device level unmap support for vSAN ESA&lt;/strong&gt;. &lt;/h3&gt;
  &lt;p id=&quot;Kwoq&quot;&gt;&lt;em&gt;This release enhances vSAN ESA to send UNMAP commands when space is freed, improving SSD garbage collection efficiency and overall I/O performance.&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;ogVQ&quot;&gt;Most modern SSDs support the UNMAP/Deallocate command, which helps garbage collection on the SSD. This command sends the disk a list of blocks that the SSD controller can clean/empty. To do this, the controller moves data blocks between pages and then erases the freed pages. This allows the next write to go to a clean block, rather than having to erase it first. And yes, while garbage collection on enterprise SSDs (unlike consumer-grade SSDs) always works in the background and can absorb incoming writes on the fly, for write-intensive workloads, and especially on Read-Intensive SSDs, pre-clearing space via UNMAP can somewhat improve write speed and write consistency.&lt;br /&gt;Until now, vSAN did not send the disks a list of blocks to be cleared, so UNMAP/Deallocate was not used. Now, in vSAN 8.0 Update 3 with ESA, the UNMAP/Deallocate command is sent to the SSD (provided the disk supports it and has correctly reported it to ESXi) to help the SSD controller and its garbage collection mechanisms.&lt;/p&gt;
  &lt;h3 id=&quot;BR5T&quot;&gt;&lt;strong&gt;vSAN File Services now support up to 250 NFS File Shares per cluster.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;aKpX&quot;&gt;It&amp;#x27;s pretty self-explanatory: you can now create up to 250 NFS shares per vSAN cluster. This is useful in two main scenarios - when you simply need more NFS shares, and, most relevantly, for K8s, where each Read-Write-Many Persistent Volume (RWM PV) is a separate NFS share. The limit has increased by 150%, from 100 NFS shares to 250.&lt;br /&gt;Things to note:&lt;br /&gt;1.) This is ONLY for ESA. For OSA, the limit remains the same (100).&lt;br /&gt;2.) The maximum number of SMB shares is still 100.&lt;br /&gt;3.) In vSAN 8.0U3, each container can serve up to 25 NFS shares. But several containers can run on each host/FSVM, which allows you to get a large number of available NFS shares even on small clusters; just take into account the balancing of containers across FSVMs and the resources available to them.&lt;/p&gt;
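  &lt;p&gt;A small sketch of the arithmetic from point 3 (the 25-shares-per-container figure is from the text above; the actual placement of containers across FSVMs is up to vSAN File Services):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;import math

MAX_NFS_SHARES_ESA = 250
SHARES_PER_CONTAINER = 25  # vSAN 8.0U3: each container serves up to 25 NFS shares

def containers_needed(nfs_shares):
    if nfs_shares &amp;gt; MAX_NFS_SHARES_ESA:
        raise ValueError(&quot;over the 250-share ESA cluster limit&quot;)
    return math.ceil(nfs_shares / SHARES_PER_CONTAINER)

print(containers_needed(250))  # 10 containers, balanced across host FSVMs
print(containers_needed(60))   # 3 containers - feasible even on a small cluster
&lt;/code&gt;&lt;/pre&gt;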
  &lt;h3 id=&quot;nk5t&quot;&gt;As well as many other small improvements under the hood that didn&amp;#x27;t make it into Release Notes and announcements.&lt;/h3&gt;
  &lt;p id=&quot;JGXv&quot;&gt;I&amp;#x27;ll add them if I see public mentions of them.&lt;/p&gt;
  &lt;h2 id=&quot;bL36&quot;&gt;VCF Related Improvements&lt;/h2&gt;
  &lt;h3 id=&quot;pjVT&quot;&gt;VMware Cloud Foundation brownfield ingest.&lt;/h3&gt;
  &lt;p id=&quot;9WHQ&quot;&gt;VMware Cloud Foundation now lets users import existing vSphere and vSAN clusters, including stand-alone vSAN deployments. It simplifies onboarding, speeds up integration, and reduces migration complexity for users upgrading to a full-stack private cloud platform. &lt;/p&gt;
  &lt;figure id=&quot;TqGW&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/f2/94/f294310d-62d2-449f-b176-470b39af2223.png&quot; width=&quot;1500&quot; /&gt;
  &lt;/figure&gt;
  &lt;h3 id=&quot;ADq1&quot;&gt;VMware vSAN ESA Stretched Cluster Support in VMware Cloud Foundation&lt;/h3&gt;
  &lt;p id=&quot;MT6P&quot;&gt;You can now use an ESA-based vSAN Stretched Cluster in the same way that you could previously use an OSA-based Stretched Cluster in VCF environments. The operation/enabling processes, limitations and so on are the same for OSA and ESA. The only nuance is that vSAN Max in a Stretched Cluster configuration is not supported in 5.2.&lt;/p&gt;
  &lt;h3 id=&quot;Y3Pw&quot;&gt;VMware vSAN Max Support with VMware Cloud Foundation&lt;/h3&gt;
  &lt;p id=&quot;pXmf&quot;&gt;You can now use vSAN Max as Primary Storage for Workload Domains (including compute-only clusters). The creation and management processes are integrated into the VMware Cloud Foundation workflows and console.  &lt;/p&gt;
  &lt;h2 id=&quot;9Xu1&quot;&gt;Kubernetes Related Improvements&lt;/h2&gt;
  &lt;h3 id=&quot;hUMX&quot;&gt;Use CNS on TKGs on Stretched vSAN&lt;/h3&gt;
  &lt;p id=&quot;8R6G&quot;&gt;&lt;em&gt;Support Stretched vSAN Cluster for TKGs to ensure High Availability.&lt;/em&gt;&lt;/p&gt;
  &lt;figure id=&quot;eLUJ&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/91/35/91355eae-52e9-4147-8676-7b1a710f5a6a.png&quot; width=&quot;954&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;NXCh&quot;&gt;Everything is fairly obvious here too, although some specifics should be noted - the Control Plane must live in one site (because etcd uses an odd number of nodes and needs quorum), and you then use Affinity/Anti-Affinity rules and vSAN Storage Policies to distribute data, workers and the control plane between sites correctly.&lt;/p&gt;
  &lt;h3 id=&quot;ywU8&quot;&gt;Enable PV Migration Across Non-shared Datastores within the Same VC&lt;/h3&gt;
  &lt;p id=&quot;M8Xz&quot;&gt;&lt;em&gt;Ability to move a PV either attached or detached from vSAN to vSAN where there is no common host. An example for this would be the ability to move K8s workload from a vSAN OSA cluster to a vSAN ESA cluster.&lt;/em&gt;&lt;/p&gt;
  &lt;h3 id=&quot;gqRV&quot;&gt;Use CNS on vSAN Max&lt;/h3&gt;
  &lt;p id=&quot;dzmF&quot;&gt;&lt;em&gt;Enable support for vSphere Container Storage Plug-in consumers to deploy CSI Volumes to vSAN Max deployments.&lt;/em&gt;&lt;/p&gt;
  &lt;h3 id=&quot;3OmQ&quot;&gt;Enable File Volume in HCI Mesh Topology within a Single vCenter.&lt;/h3&gt;
  &lt;p id=&quot;S3BJ&quot;&gt;Enables file volumes in an HCI Mesh topology within a single vCenter.&lt;/p&gt;
  &lt;h3 id=&quot;l249&quot;&gt;Up to 250 RWM PV per Cluster&lt;/h3&gt;
  &lt;p id=&quot;ZQGm&quot;&gt;To be honest, this is a repeat because I already wrote about it above in the vSAN File Services enhancements section, but I thought it was important to note it here as well. Because of the increase in the maximum number of NFS File Shares per cluster to 250, you will now be able to use up to 250 RWM PVs per cluster. Again, please note that this is for ESA only. You can see the details above.&lt;/p&gt;
  &lt;h2 id=&quot;eA9R&quot;&gt;vSphere/vCenter Related Improvements&lt;/h2&gt;
  &lt;p id=&quot;OMwk&quot;&gt;This set of features is not directly related to vSAN, but since vSAN and ESXi are two parts of the same whole, I can&amp;#x27;t help but mention them in one line. For details and other new features in vSphere 8.0 Update 3, go &lt;a href=&quot;https://core.vmware.com/resource/whats-new-vsphere-8-update-3&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
  &lt;h4 id=&quot;h-faster-upgrades-and-no-downtime-with-esxi-live-patching&quot;&gt;ESXi Live Patching&lt;/h4&gt;
  &lt;p id=&quot;iQm1&quot;&gt;With vSphere 8.0 Update 3 we can address critical bugs in the virtual machine execution environment (vmx) without the need to reboot or evacuate the entire host. Examples of fixes include those in the virtual devices space.&lt;/p&gt;
  &lt;p id=&quot;aQU3&quot;&gt;Virtual machines are fast-suspend-resumed (FSR) as part of the host remediation process. FSR is non-disruptive to most virtual machines and is already used when adding or removing virtual hardware devices on powered-on virtual machines.&lt;/p&gt;
  &lt;h3 id=&quot;iPbS&quot;&gt;&lt;strong&gt;vCenter Reduced Downtime&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;sou4&quot;&gt;Patching and updating vCenter with minimal downtime now includes complete topology support and the ability to perform the switchover phase automatically.&lt;/p&gt;
  &lt;h3 id=&quot;ofCj&quot;&gt;&lt;strong&gt;vSphere Configuration Profiles&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;TgQk&quot;&gt;vSphere Configuration Profiles can now manage the configuration of clusters that still use baselines (formerly Update Manager) and have not yet transitioned to cluster images with vSphere Lifecycle Manager.&lt;/p&gt;
  &lt;h3 id=&quot;sec36129-sub3&quot;&gt;Enhanced Image Customization&lt;/h3&gt;
  &lt;p id=&quot;oQ1z&quot;&gt;You can now exclude individual vendor components from Vendor Addon, VMware Tools, and Host Client from the image. This can reduce the size of the image and also eliminate unnecessary components on the host.&lt;/p&gt;
  &lt;figure id=&quot;ox9i&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/85/15/85156eee-75a7-4cf4-986c-ed71f8dcd315.png&quot; width=&quot;584&quot; /&gt;
  &lt;/figure&gt;
  &lt;h3 id=&quot;sec36135-sub1&quot;&gt;Embedded vSphere Cluster Service&lt;/h3&gt;
  &lt;p id=&quot;qn17&quot;&gt;vCLS is becoming more and more convenient and seamless. Now only two VMs are needed and, most importantly, they are built into ESXi (rather than pulled from vCenter) in the form of a CRX runtime, and they run directly in RAM rather than on a datastore.&lt;/p&gt;

</content></entry><entry><id>nkulikov:VMC_Instances_Update</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMC_Instances_Update?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>UPDATE: VMware Cloud on AWS (VMC/A) Host Types Comparison </title><published>2024-01-09T12:27:59.759Z</published><updated>2024-01-09T12:27:59.759Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img2.teletype.in/files/5e/43/5e433323-3732-4318-a254-19b9a32fd980.png"></media:thumbnail><category term="vmc" label="VMC"></category><summary type="html">&lt;img src=&quot;https://img3.teletype.in/files/21/cc/21cc9224-dc8b-42d7-908d-ae72b5f524e1.png&quot;&gt;Due to updates, I've refreshed the comparison table of available node types. The main changes are:
1.) New nodes M7i.metal-24xl
2.) Available vSAN architecture (ESA and/or OSA)
3.) i3.metal are no longer available for new contracts as Reserved Instances.</summary><content type="html">
  &lt;p id=&quot;RGn1&quot;&gt;Due to updates, I&amp;#x27;ve refreshed the comparison table of available node types. The main changes are:&lt;br /&gt;1.) New nodes M7i.metal-24xl&lt;br /&gt;2.) Available vSAN architecture (ESA and/or OSA)&lt;br /&gt;3.) i3.metal are no longer available for new contracts as Reserved Instances.&lt;/p&gt;
  &lt;p id=&quot;CNCQ&quot;&gt;You are welcome to take advantage of this! &lt;/p&gt;
  &lt;figure id=&quot;D6ei&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/21/cc/21cc9224-dc8b-42d7-908d-ae72b5f524e1.png&quot; width=&quot;1995&quot; /&gt;
    &lt;figcaption&gt;VMC/A Instances&lt;/figcaption&gt;
  &lt;/figure&gt;

</content></entry><entry><id>nkulikov:VMC_Storage_Optimizations_Part2</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMC_Storage_Optimizations_Part2?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMware Cloud on AWS Optimizations in Storage-bounded environments. Part 2. Optimize VMC on AWS infrastructure.</title><published>2023-12-18T17:43:50.767Z</published><updated>2024-04-23T23:10:53.954Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img4.teletype.in/files/3f/1c/3f1cb7d7-7445-4d1a-a4f2-50e5f8366b92.png"></media:thumbnail><category term="vmc-storage-optimizations" label="VMC Storage Optimizations"></category><summary type="html">&lt;img src=&quot;https://img4.teletype.in/files/7e/77/7e776454-5a9d-42e5-9d04-690f1c61ccde.png&quot;&gt;So, we come to the second part. You've done all the steps in part one, but you still need more space (or you realize that you can reduce the number of hosts from a Compute perspective, but not Storage). Then you need to start optimizing the VMC on AWS infrastructure itself. There will be 3 main steps here, just like in the last article - optimize storage policies, optimize host types and vSAN, and optimize clusters.</summary><content type="html">
  &lt;p id=&quot;4bm3&quot;&gt;So, we come to the second part. You&amp;#x27;ve done all the steps in &lt;a href=&quot;https://nkulikov.com/VMC_Storage_Optimizations_Part1&quot; target=&quot;_blank&quot;&gt;part one&lt;/a&gt;, but you still need more space (or you realize that you can reduce the number of hosts from a Compute perspective, but not from a Storage one). Then you need to start optimizing the VMC on AWS infrastructure itself. There will be three main steps here, just like in the last article - optimize storage policies, optimize host types and vSAN type, and optimize clusters.&lt;/p&gt;
  &lt;h2 id=&quot;D7ej&quot; data-align=&quot;center&quot;&gt;Storage Policies.&lt;/h2&gt;
  &lt;h3 id=&quot;ZeWP&quot; data-align=&quot;center&quot;&gt;SLA-compliant storage policies.&lt;/h3&gt;
  &lt;p id=&quot;nSDt&quot;&gt;Note: This part is dedicated exclusively to vSAN OSA based clusters as the main architecture used at the moment. I will talk about vSAN ESA later.&lt;/p&gt;
  &lt;p id=&quot;Q9B6&quot;&gt;Many customers use &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vsphere.vmc-aws-manage-data-center-vms.doc/GUID-EDBB551B-51B0-421B-9C44-6ECB66ED660B.html&quot; target=&quot;_blank&quot;&gt;Managed Storage Policies&lt;/a&gt; by default. This is a Storage Policy that is created (and changed) automatically, depending on the size of your cluster, to meet SLAs. This is convenient, but not always optimal in terms of storage efficiency. The fact is that the default policy is as follows:&lt;br /&gt;&lt;/p&gt;
  &lt;ul id=&quot;e3IU&quot;&gt;
    &lt;li id=&quot;ihJu&quot;&gt;Single AZ - Mirror FTT=1 for clusters of 2 to 5 hosts, and RAID6 for 6+ hosts.&lt;/li&gt;
    &lt;li id=&quot;Um09&quot;&gt;Multi AZ (Stretched Cluster) - with 4 hosts or fewer (i.e. up to two in each AZ), no local copies; with 6 hosts or more (i.e. from 3 in each AZ), Mirror FTT=1 within each AZ. Plus, of course, a copy between AZs.&lt;/li&gt;
  &lt;/ul&gt;
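  &lt;p&gt;A compact restatement of those defaults (my paraphrase of the rules above, not VMC code):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;def default_managed_policy(hosts, stretched):
    # Paraphrase of the Managed Storage Policy defaults described above
    if not stretched:
        return &quot;RAID6 (FTT=2)&quot; if hosts &amp;gt;= 6 else &quot;Mirror FTT=1&quot;
    if hosts &amp;lt;= 4:  # up to two hosts in each AZ
        return &quot;Dual-site mirroring, no local copies&quot;
    return &quot;Dual-site mirroring + Mirror FTT=1 within each AZ&quot;  # 6+ hosts
&lt;/code&gt;&lt;/pre&gt;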
  &lt;p id=&quot;9Gl9&quot;&gt;Now if you look at the &lt;a href=&quot;https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/support/vmw-cloud-aws-service-level-agreement.pdf&quot; target=&quot;_blank&quot;&gt;VMC/A SLA document&lt;/a&gt;, it says the following:&lt;/p&gt;
  &lt;ul id=&quot;kGz9&quot;&gt;
    &lt;li id=&quot;YA5m&quot;&gt;&lt;em&gt;For non-stretched clusters, you must have a minimum configuration for all VM storage policy Numbers of Failures to Tolerate (FTT) = 1 when the cluster has 2 to 5 hosts, and a minimum configuration of FTT = 2 when the cluster has 6 to 16 hosts. &lt;strong&gt;This is not dependent on RAID levels&lt;/strong&gt;.&lt;/em&gt;&lt;/li&gt;
    &lt;li id=&quot;gmpW&quot;&gt;&lt;em&gt;For stretched clusters with four hosts or less, spanning across more than one availability zone, you must have a minimum configuration for all VM storage policy Site Disaster Tolerance (PFTT) = Dual Site Mirroring.&lt;/em&gt;&lt;/li&gt;
    &lt;li id=&quot;0bis&quot;&gt;&lt;em&gt;For stretched clusters with six hosts or more, spanning across more than one availability zone, you must have a minimum configuration for all VM storage policy Site Disaster Tolerance (PFTT) = Dual Site Mirroring and Secondary level of failures to tolerate (SFTT) = 1. &lt;strong&gt;This is not dependent on RAID levels&lt;/strong&gt;.&lt;/em&gt;&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;mXP4&quot;&gt;Note the phrase &amp;quot;&lt;strong&gt;This is not dependent on RAID levels&lt;/strong&gt;&amp;quot;, i.e. you can change the FTT Method (between Mirroring and Erasure Coding) as you like and meet the SLA requirements, as long as the FTT is not less than the required for your cluster size.&lt;/p&gt;
  &lt;p id=&quot;AGYt&quot;&gt;Let&amp;#x27;s start with Single AZ case and now turn &lt;a href=&quot;https://core.vmware.com/resource/vmware-vsan-design-guide#sec6854-sub5&quot; target=&quot;_blank&quot;&gt;to the table&lt;/a&gt; where we can see the available vSAN Storage Policies for Single AZ for each number of nodes and the capacity overhead from each of them:&lt;/p&gt;
  &lt;figure id=&quot;1NPB&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/7e/77/7e776454-5a9d-42e5-9d04-690f1c61ccde.png&quot; width=&quot;1794&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;yKiJ&quot;&gt;&lt;br /&gt;We can clearly see that for 6+ nodes, the most efficient policy used (since we need FTT=2) is RAID 6 with overhead 1.5. But for small clusters of 2 to 5 nodes, Mirror with x2 overhead is used, even though RAID5 with 1.33 is available for cluster sizes of 4-5 nodes.&lt;/p&gt;
  &lt;p id=&quot;8ifF&quot;&gt;Now for Multi-AZ (Stretched Cluster) a similar table&lt;a href=&quot;https://core.vmware.com/resource/vsan-stretched-cluster-guide#per-site-policies&quot; target=&quot;_blank&quot;&gt; is here&lt;/a&gt;. I won&amp;#x27;t give you the whole table because it&amp;#x27;s quite long, but it&amp;#x27;s obvious that for clusters larger than 8 nodes (i.e. 4+ in each AZ) we can use RAID5 instead of Mirror. This will reduce our total overhead (including copies between AZs) from x4 to x2.66, i.e. almost by half. &lt;/p&gt;
  &lt;p id=&quot;pQOt&quot;&gt;I realize that numbers in this form can be hard to take in, so I made a spreadsheet for clarity:&lt;/p&gt;
  &lt;figure id=&quot;7MuN&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/0c/9d/0c9da066-a07d-407d-b63b-4d4fbd257e6d.png&quot; width=&quot;2408&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;TnOc&quot;&gt;As a result, if your cluster size is within the range marked in green, you can gain additional capacity simply by changing your storage policy. But of course, everything has a price, so I&amp;#x27;ll point out the disadvantages of this approach:&lt;/p&gt;
  &lt;ul id=&quot;o6Xt&quot;&gt;
    &lt;li id=&quot;GjrB&quot;&gt;For Single AZ, you will have to manually monitor cluster size and change the policy if your cluster grows to 6 nodes or more to meet SLA requirements.&lt;/li&gt;
    &lt;li id=&quot;rIWE&quot;&gt;Performance of RAID5 in OSA is lower than for Mirror (especially in terms of latency). Therefore, the most performance- and latency-critical workloads may need to be left on Mirror. As always, mileage may vary, so look at your workloads and test.&lt;/li&gt;
    &lt;li id=&quot;SpYp&quot;&gt;Changing the policy from Mirror to Erasure Coding (RAID5, RAID6) can take a relatively long time while putting additional load on vSAN. So, the general recommendation is quite standard - apply the policy not to all VMs at once, but one by one, and not at the time of peak workloads.&lt;/li&gt;
  &lt;/ul&gt;
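  &lt;p&gt;As a quick sanity check on the numbers above, a small sketch that turns raw capacity into usable capacity with the overhead factors from the tables (it deliberately ignores slack space and other vSAN overheads):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;OVERHEAD = {
    &quot;mirror_ftt1&quot;: 2.0,
    &quot;raid5_3p1&quot;: 4 / 3,   # x1.33, available on 4-5 node OSA clusters
    &quot;raid6_4p2&quot;: 1.5,
    &quot;stretched_mirror_mirror&quot;: 4.0,   # mirror across AZs + mirror in each AZ
    &quot;stretched_mirror_raid5&quot;: 8 / 3,  # mirror across AZs + RAID5 in each AZ
}

def usable_tib(raw_tib, policy):
    return raw_tib / OVERHEAD[policy]

raw = 100
print(usable_tib(raw, &quot;mirror_ftt1&quot;))             # 50.0
print(usable_tib(raw, &quot;raid5_3p1&quot;))               # 75.0 - the green-zone gain
print(usable_tib(raw, &quot;stretched_mirror_raid5&quot;))  # 37.5 vs 25.0 on full mirror
&lt;/code&gt;&lt;/pre&gt;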
  &lt;h3 id=&quot;whnO&quot; data-align=&quot;center&quot;&gt;Storage policies non-compliant with SLA requirements.&lt;br /&gt;&lt;/h3&gt;
  &lt;p id=&quot;NMpE&quot;&gt;Another way (albeit risky and not always appropriate) is to break the SLA requirements and make the storage policy even more cost-effective. Here again we will refer to the document describing the SLA in VMC on AWS. The following line can be found there:&lt;br /&gt;&lt;em&gt;&amp;quot;If an SLA Event occurs for your SDDC Infrastructure, it applies to a cluster within the SDDC. For each SLA Event for a cluster, you are entitled to an SLA Credit proportional to the number of hosts in that cluster.&amp;quot;&lt;/em&gt;&lt;/p&gt;
  &lt;p id=&quot;hr2q&quot;&gt;&lt;br /&gt;I.e. the SLA for workloads is evaluated for each cluster separately. This means that we can create a separate cluster for our non-critical workloads, or for those workloads where we are not interested in SLA compliance and are ready to take risks. At the same time, the workloads on the main/production clusters will still be subject to the SLA conditions, provided the Storage Policy there is compliant. We could, of course, do this for the main/production cluster as well, but that is an even more extreme scenario.&lt;/p&gt;
  &lt;p id=&quot;VW2i&quot;&gt;I will not repeat all the calculations I did above, simply because they are completely similar, and I will give you the final table separately for Single AZ and Multi AZ. In it I have &lt;strong&gt;marked in red the fields where we don&amp;#x27;t have any data protection and in green where we have at least one copy&lt;/strong&gt;.&lt;/p&gt;
  &lt;p id=&quot;Qy0b&quot; data-align=&quot;center&quot;&gt;&lt;br /&gt;&lt;strong&gt;Non-compliant Storage Policies for Single AZ&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;J41y&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/24/f4/24f49536-a70c-4629-adf5-82b6bd3ca24b.png&quot; width=&quot;1559&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ge3Q&quot;&gt;You can clearly see that the only way to save capacity in any meaningful way is to switch to FTT=0, since the other options do not provide any significant savings while also breaking the SLA. As a reminder, FTT=0 means that you will only have a single copy of your data. This means that in case of any failure (disk failure, node failure) you will lose your data with no possibility of recovering it. Moreover, since data placement on disks/nodes is done automatically by vSAN itself based on current disk utilization and you cannot control it (the vSAN Host Affinity feature is not available in VMC on AWS), you also cannot protect workloads at the application level. The disks of two VMs of a software cluster can be located on the same node, and if it fails, you will lose both copies. The only way to protect the service/data is to place the VMs of the software cluster on different vSAN clusters, which is not always convenient or sensible, and you will still have to manually recreate VMs and rebuild the software cluster. While this is possible, I suggest assuming that all VMs/services hosted with an FTT=0 policy will lose data if any failure happens.&lt;/p&gt;
  &lt;p id=&quot;eQlI&quot;&gt;This means that there are fundamentally several main types of candidates to be placed on FTT=0 - stateless services, and workloads that can be easily recreated or that are unimportant, whose failure does not have any impact on operations.&lt;/p&gt;
  &lt;p id=&quot;0nU7&quot;&gt;As an example of stateless services, I can mention stateless caching services or data analysis that is performed on a copy of the data (while the original data is stored in durable storage). In this scenario, if a failure occurs, we risk either losing some performance or just the time it takes to restart the job, which in some cases is quite acceptable. Test environments for automated testing can also be considered suitable workloads: in case of failure it is enough to restart the test. Other &amp;quot;unimportant workloads&amp;quot; include various test environments, temporary workloads, etc. In general, two factors are a good indicator of applicability - whether a backup of this system is even needed, and how long it would take to redeploy the VM to restore its function.&lt;/p&gt;
  &lt;p id=&quot;HmH4&quot;&gt; &lt;br /&gt;But as usual, evaluation criteria and their importance are different for everyone, so approach this task carefully and individually, and consult with the application/service owner. The last thing I would like to point out is that using FTT=0 not only saves a lot of space, but also allows for significant performance improvements, even compared to Mirror FTT=1. So that may be an additional factor.&lt;/p&gt;
  &lt;p id=&quot;XVKP&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Non-compliant Storage Policies for Multi AZ&lt;/strong&gt;&lt;/p&gt;
  &lt;figure id=&quot;DIiw&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/e5/f7/e5f74dce-5308-46dc-9845-5d5d24af156f.png&quot; width=&quot;1560&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;Fs0l&quot;&gt;But in the stretched cluster scenario, we see that the situation is fundamentally different. The point is that in VMC on AWS, all clusters within a single SDDC must be of the same type - either all MultiAZ or all Single AZ. Having said that, it is obvious that availability requirements can vary significantly and not all workloads in a company may require AZ failure protection. Also, in a stretched cluster, SLA compliance requires mirroring between AZs, which is not space efficient (x2) and also adds latency to writes due to synchronous replication to the second AZ. &lt;/p&gt;
  &lt;p id=&quot;IedA&quot;&gt;So, first of all, you can of course use FTT=0, but the considerations are completely similar to the last paragraph, so I won&amp;#x27;t dwell on that again, but instead consider the scenario where we still want data protection, but we are happy with protection against any single failure or even two.&lt;/p&gt;
  &lt;p id=&quot;Cugq&quot;&gt;In the first case, for clusters larger than 6 nodes (3 at each site) you may not use local copies (within AZs), but only replicate between AZs. So, it works out like a normal Mirror FTT=1 (and similar to the scenario where you have less than 6 nodes in the stretched cluster), and the systems will still be available if one of the AZs fails. If a separate host failure occurs, the data will be available from the second AZ (but keep in mind this will add latency on reads equal to RTT because there will be no more local read until the rebuild is complete).&lt;/p&gt;
  &lt;p id=&quot;Mo3K&quot;&gt;In the second case, we don&amp;#x27;t protect the data from AZ failure, but we store local copies within the AZ. And you can store them not only as a mirror, but also using Erasure Coding (RAID5, RAID6) if you have enough nodes to do so (4+). This means that you can use the space more efficiently and also provide protection against up to two failures. Also, in terms of performance you have no additional latency due to writing to the remote AZ. The downside to this approach (aside from the obvious lack of AZ crash protection) is that you have to manually monitor the capacity in each AZ and manually place VMs on one or another AZ. Also, to avoid reads from a remote AZ, use &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vmc-aws-operations/GUID-41F54C2A-7C1C-4BE2-8C66-717EC4991BD4.html&quot; target=&quot;_blank&quot;&gt;VM-Host Affinity Compute Profiles&lt;/a&gt; to ensure that VMs are running where their data resides.&lt;/p&gt;
  &lt;h2 id=&quot;cn4l&quot; data-align=&quot;center&quot;&gt;Migrate to vSAN ESA-based clusters.&lt;/h2&gt;
  &lt;p id=&quot;Sn5V&quot;&gt;VMC on AWS 1.24 introduced support for vSAN Express Storage Architecture (ESA). An overview of this biggest change since vSAN version 1.0/5.5 is &lt;a href=&quot;https://core.vmware.com/blog/introduction-vsan-express-storage-architecture&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;, but I&amp;#x27;ll give the main differences:&lt;/p&gt;
  &lt;ul id=&quot;E92M&quot;&gt;
    &lt;li id=&quot;g0ak&quot;&gt;New architecture for local disk handling (LSOM) - no more Disk Groups with separate cache and capacity disks. Now all host disks are a single pool, each providing both usable capacity and performance.&lt;/li&gt;
    &lt;li id=&quot;lMDv&quot;&gt;New Erasure Coding (RAID5/6). All writes go first to the small Performance-leg components in Mirror, then the data from them is written full stripe to the Capacity-leg in the EC. In short, this allows for performance (in terms of both latency and IOPS) similar to Mirror for EC.&lt;/li&gt;
    &lt;li id=&quot;NgW1&quot;&gt;Data Services such as compression and encryption now operate at the individual object level (DOM-layer) rather than at the disk group level. This allows for a significant reduction in overhead, as well as more granular management.&lt;/li&gt;
    &lt;li id=&quot;I4xb&quot;&gt;The long-awaited performance-penalty-free snapshots. There is now no impact from creating, storing, or deleting a chain of snapshots.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;j1Mg&quot;&gt;I&amp;#x27;ve deliberately simplified a lot of things and also left a lot out, but if you&amp;#x27;re interested in the details go &lt;a href=&quot;https://core.vmware.com/vsan-esa&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
  &lt;p id=&quot;FyYy&quot;&gt;Getting back to our topic of optimizing storage capacity in VMC on AWS, here are the following advantages that ESA has over OSA:&lt;/p&gt;
  &lt;ul id=&quot;Dc5Z&quot;&gt;
    &lt;li id=&quot;jtmQ&quot;&gt;Each disk in the host now provides capacity, which increases the total available capacity. Previously some of the disks only acted as a write buffer and did not contribute to persistent storage.&lt;/li&gt;
    &lt;li id=&quot;uHlt&quot;&gt;The new Erasure Coding allows it to be used for all workloads without exception from a performance point of view. And since Erasure Coding is more space-efficient than Mirror, it allows you to gain additional capacity if some data was previously stored in Mirror for performance reasons.&lt;/li&gt;
    &lt;li id=&quot;qIA6&quot;&gt;A new RAID5 level with a 2 data + 1 parity scheme was introduced. This allows you to use R5 (2+1) with x1.5 overhead on as few as three nodes, where previously only Mirror with x2 overhead was available and RAID5 required a minimum of four nodes. There is also an R5 (4+1) scheme with x1.25 overhead.&lt;/li&gt;
    &lt;li id=&quot;nLbF&quot;&gt;Much more efficient compression. In ESA, each 4K block is compressed in 512-byte increments, i.e. 4K, 3.5K, 3K, ..., 512b are possible. This significantly increases compression efficiency, because in OSA the mechanism is different - if a 4K block compresses by 50% or more, exactly 2K is written, and if less, the original 4K block is written. In ESA we not only get the possibility of up to 8x maximum compression (versus 2x in OSA), but the intermediate values also significantly increase the average compression ratio (see the sketch after this list).&lt;/li&gt;
    &lt;li id=&quot;0Xws&quot;&gt;As I wrote in the last article, TRIM/UNMAP is enabled by default. &lt;/li&gt;
  &lt;/ul&gt;
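  &lt;p&gt;A sketch of the allocation rules described above - the OSA 2K-or-nothing behaviour versus the ESA 512-byte increments (my illustration of the mechanism, not vSAN code):&lt;/p&gt;
  &lt;pre&gt;&lt;code&gt;import math

def osa_stored_bytes(compressed):
    # OSA: write exactly 2K if a 4K block compresses to 50% or better,
    # otherwise write the original 4K block
    return 2048 if compressed &amp;lt;= 2048 else 4096

def esa_stored_bytes(compressed):
    # ESA: round up to the next 512-byte increment (512, 1024, ..., 4096)
    return max(512, math.ceil(compressed / 512) * 512)

for c in (400, 1500, 2048, 3000):
    print(c, osa_stored_bytes(c), esa_stored_bytes(c))
# 400  -&amp;gt; OSA 2048, ESA 512   (up to 8x vs 2x maximum compression)
# 3000 -&amp;gt; OSA 4096, ESA 3072  (intermediate values lift the average ratio)
&lt;/code&gt;&lt;/pre&gt;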
  &lt;p id=&quot;dI5c&quot;&gt;If we compare the available usable capacity on the first/main ESA-based cluster (so counting capacity for management components) with an OSA-based (with the most efficient SLA-compliant storage policy) with the same compression efficiency, we get the following values:&lt;/p&gt;
  &lt;figure id=&quot;PpGT&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/c7/2c/c72c5780-2742-4593-9f10-648d7fd5c05c.png&quot; width=&quot;1236&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;9K9m&quot;&gt;You can clearly see that the largest effect (almost 60%) is on the 3-host cluster due to the new Erasure Coding 2+1. For all clusters starting with 6 hosts there is about 12% more available space due to the absence of dedicated cache disks (in both cases using RAID6 4+2). However, for four- and five-host clusters there is almost no difference (technically about minus 1%) - this is because ESA has no 3+1 RAID5 scheme, so the efficiency for ESA is 1.5 vs. 1.33 in OSA, which cancels out the extra raw storage available.&lt;/p&gt;
  &lt;p id=&quot;JkJg&quot;&gt;In addition, you can get better compression efficiency on top. Unfortunately, there are no public numbers on average compression ratios yet and VMware is still gathering data based on the early real customer installations of ESA-based clusters in VMC on AWS, but it&amp;#x27;s pretty clear that it will be better than OSA and the question is just how much better. Also note that compression is now a storage policy level setting, not a cluster-wide setting. So, you can disable it for individual VMs where you don&amp;#x27;t expect any compression - such as already compressed databases, media files, data encrypted at the guest OS or application level.&lt;/p&gt;
  &lt;p id=&quot;KSHz&quot;&gt;It&amp;#x27;s hard to think of any particular disadvantages of vSAN ESA over OSA that would be relevant for use in VMC on AWS, but there are limitations that may currently make it impossible to use:&lt;br /&gt;&lt;/p&gt;
  &lt;ul id=&quot;zN0B&quot;&gt;
    &lt;li id=&quot;aSkt&quot;&gt;SDDC must be updated to the latest currently available version 1.24&lt;/li&gt;
    &lt;li id=&quot;7Y9o&quot;&gt;Only i4i nodes are supported. ESA is not available on i3, i3en and of course on M7i (because there are no disks at all).&lt;/li&gt;
    &lt;li id=&quot;cI1F&quot;&gt;Stretched cluster / Multi-AZ including the particular case of a 2-node cluster is not supported at the moment.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;E2j5&quot;&gt;If you don&amp;#x27;t fall within these limitations, the obvious recommendation is to go to ESA.&lt;/p&gt;
  &lt;h3 id=&quot;62qx&quot; data-align=&quot;center&quot;&gt;Migration from OSA to ESA&lt;/h3&gt;
  &lt;p id=&quot;Wjb5&quot;&gt;Currently, in-place migration is not supported, so there are two ways to migrate data from OSA to ESA.&lt;/p&gt;
  &lt;p id=&quot;LHeT&quot;&gt;&lt;br /&gt;The first is to create a new separate cluster based on ESA, migrate (vMotion, Cold Migration, etc.) VMs from the current clusters to the new one, and then remove nodes from the OSA-based clusters. This works fine for every cluster except the primary/first one, because you cannot migrate the management components this way. In that case, you can do the following - move all (or almost all, to make the most efficient use of available resources) the productive workloads and leave the first cluster on OSA as the minimum possible management cluster (e.g., two nodes). This usually makes sense if you have a large enough number of hosts in VMC on AWS and/or if the benefit of moving to ESA (and reducing the number of nodes required as a result) outweighs the cost of creating a dedicated management cluster (but do not forget that if you place management components on a shared cluster, resources are still consumed there and therefore cannot be allocated to workloads, and this should be taken into account). There are other benefits to building a dedicated management cluster, such as isolating production workloads from management workloads, but that is a bit beyond the scope of this article. A great document about planning a management cluster in VMC on AWS is &lt;a href=&quot;https://vmc.techzone.vmware.com/resource/designlet-vmware-cloud-aws-management-cluster-planning#summary-and-considerations&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
  &lt;p id=&quot;93cn&quot;&gt;The second option, which is particularly well suited for smaller VMC on AWS installations, is to create a new SDDC based on ESA. While this is noticeably more time-consuming (because at a minimum, you&amp;#x27;ll need to migrate network settings, security, etc.), if the infrastructure in VMC on AWS is simple and/or small enough, it can make sense. When you need it and what it can give you:&lt;br /&gt;&lt;/p&gt;
  &lt;ul id=&quot;ZTFS&quot;&gt;
    &lt;li id=&quot;kUdW&quot;&gt;You can migrate your entire infrastructure to an ESA, including the first/primary cluster.&lt;/li&gt;
    &lt;li id=&quot;jwdX&quot;&gt;You need or want to run VMC on AWS in a different region (due to price or business requirements).&lt;/li&gt;
    &lt;li id=&quot;POYg&quot;&gt;You need or want to move from Single-AZ SDDC to Multi-AZ SDDC or vice versa. &lt;/li&gt;
    &lt;li id=&quot;NWU0&quot;&gt;You see the value or need to change Subnets for Management network and/or AWS VPC. &lt;/li&gt;
    &lt;li id=&quot;ZE32&quot;&gt;You need or want to change the host type. Although you &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vmc-aws-operations/GUID-41A16A95-CFC5-4F02-8B92-1D294ADE3564.html&quot; target=&quot;_blank&quot;&gt;can do this in the current cluster/SDDC as well&lt;/a&gt;, here it will be done at the same time.&lt;/li&gt;
    &lt;li id=&quot;3hyG&quot;&gt;Finally, you don&amp;#x27;t have to wait for your SDDC to be updated to 1.24.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;1dW8&quot;&gt;A document that describes one option for such a migration is available &lt;a href=&quot;https://vmc.techzone.vmware.com/resource/designlet-vmware-cloud-aws-hcx-cloud-cloud-c2c-migration-between-sddcs#introduction&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. &lt;/p&gt;
  &lt;h2 id=&quot;y7kk&quot; data-align=&quot;center&quot;&gt;Optimize Host Type. &lt;/h2&gt;
  &lt;p id=&quot;lkFc&quot;&gt;There are currently four node types available in VMC on AWS:&lt;/p&gt;
  &lt;ol id=&quot;Yg7i&quot;&gt;
    &lt;li id=&quot;VfKK&quot;&gt;i3 is the very first and probably still the most popular node type. These are General Purpose nodes with a balanced ratio of capacity and compute resources. But in January 2023 it was &lt;a href=&quot;https://blogs.vmware.com/cloud/2023/01/13/announcement-of-the-end-of-sale-end-of-support-and-end-of-life-timeline-of-the-i3-metal-instance-type-of-vmware-cloud-on-aws/&quot; target=&quot;_blank&quot;&gt;announced&lt;/a&gt; that these nodes are no longer available as Reserved Instances, though many customers purchased them earlier and are still using them.&lt;/li&gt;
    &lt;li id=&quot;qYJI&quot;&gt;i4i - these nodes came to replace &lt;a href=&quot;https://blogs.vmware.com/cloud/2022/08/30/announcing-a-new-instance-type-for-vmware-cloud-on-aws-i4i-metal/&quot; target=&quot;_blank&quot;&gt;i3 in summer 2022&lt;/a&gt;. In terms of resources this is a doubled i3, but based on much more modern hardware. You could say that these are now the main host type, and customers are migrating to them from i3.&lt;/li&gt;
    &lt;li id=&quot;R3wb&quot;&gt;i3en - Storage-heavy hosts. They are popular with customers who need to store a lot of data relative to compute resource requirements.&lt;/li&gt;
    &lt;li id=&quot;6ILK&quot;&gt;M7i - The newest nodes that were &lt;a href=&quot;https://blogs.vmware.com/cloud/2023/11/28/announcing-leading-edge-architectural-innovation-for-vmware-cloud-on-aws-to-support-next-gen-workloads/&quot; target=&quot;_blank&quot;&gt;announced&lt;/a&gt; in November 2023. The distinguishing feature is that they do not use local SSDs and vSAN but storage resources are provided based on &lt;a href=&quot;https://www.vmware.com/products/vmc-flex-storage.html&quot; target=&quot;_blank&quot;&gt;VMware Flex Storage&lt;/a&gt; or &lt;a href=&quot;https://aws.amazon.com/blogs/apn/amazon-fsx-for-netapp-ontap-with-vmware-cloud-on-aws-virtual-machines/&quot; target=&quot;_blank&quot;&gt;FSx for NetApp ONTAP&lt;/a&gt;. At the time of this writing, they are in Tech.Preview and are only available in a limited number of regions. &lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;uYnJ&quot;&gt;It sometimes happens that when a VMC on AWS infrastructure is sized, the resource requirements are not yet fully known, or they have changed since the project started. Also, at the time of the original host procurement, some host types (e.g. i4i or M7i) may not have been available yet, so current clusters may not be built on the optimal host type.&lt;br /&gt;Second, after making optimizations in terms of storage capacity, your requirements may have decreased, along with your Compute-to-Capacity ratio.&lt;br /&gt;And finally, as you use VMC on AWS, your competency has grown, as well as your understanding of your workloads and how VMC on AWS matches them.&lt;/p&gt;
  &lt;p id=&quot;Ip3B&quot;&gt;Therefore, one way to further optimize costs is to distribute your workloads across the most optimal types of nodes and clusters. And while for relatively small VMC on AWS installations you should consider simply converting the cluster to other nodes in the first place, for larger installations it makes sense to create separate clusters for different tasks based on different types of nodes.&lt;/p&gt;
  &lt;p id=&quot;e3UX&quot;&gt;Unfortunately, there is no one-size-fits-all solution for this optimization, but I will show you a way to simplify the task a bit (a scripted sketch of the first steps follows the list):&lt;br /&gt;&lt;/p&gt;
  &lt;ol id=&quot;2hRc&quot;&gt;
    &lt;li id=&quot;u7o7&quot;&gt;Make an inventory of all VMs in your VMC on AWS. You can do this with a custom report in Aria Operations or with RVTools, exporting it to an Excel file. It should contain a list of VMs and their main characteristics (vCPU, RAM, Usable Storage).&lt;/li&gt;
    &lt;li id=&quot;WAuy&quot;&gt;Divide all workloads into several classes. Obviously, it makes sense to classify by workload type - Production, Test/Dev, VDI, etc, but from the point of view of capacity optimization it is interesting to understand the requirements for the storage subsystem. I propose to categorize them as follows:&lt;/li&gt;
    &lt;ol id=&quot;xRx8&quot;&gt;
      &lt;li id=&quot;bvZp&quot;&gt;Large (in terms of capacity) VMs, but which do not have special requirements for disk subsystem performance, as well as Compute. These are candidates for migrating to External NFS (I will cover this in the next article) with i4i or possibly M7i (if they are already available in your region).&lt;/li&gt;
      &lt;li id=&quot;NkvD&quot;&gt;Large VMs (in terms of capacity footprint), but which require a fast storage subsystem and/or have tight latency requirements. For example, Databases. These are candidates for i3en type nodes.&lt;/li&gt;
      &lt;li id=&quot;G3is&quot;&gt;VMs with high Compute requirements but minimal requirements for the storage subsystem. These are candidates for i4i or M7i.&lt;/li&gt;
      &lt;li id=&quot;06zM&quot;&gt;All other VMs without any special requirements.&lt;/li&gt;
    &lt;/ol&gt;
    &lt;li id=&quot;oL21&quot;&gt;Calculate the total resource requirements for each type.&lt;/li&gt;
    &lt;li id=&quot;65Wx&quot;&gt;Try to allocate them to clusters of different types. Use &lt;a href=&quot;https://vmc.vmware.com/sizer/&quot; target=&quot;_blank&quot;&gt;VMC Sizer&lt;/a&gt; for this purpose; a great guide to VMC Sizer is available &lt;a href=&quot;https://vmc.techzone.vmware.com/resource/feature-brief-vmware-cloud-sizer&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. Your task is to minimize the total cost of all nodes, which you can usually achieve by utilizing the resources of each cluster as evenly as possible. I would suggest starting with i4i to check whether everything can be placed there. If not, remove the storage-intensive workloads (as candidates for i3en) and check again. If there are enough such workloads to justify a dedicated cluster, try that. Then, to fill each cluster evenly in terms of both compute resources and capacity, add individual workloads from category 4 until you achieve more or less even utilization.&lt;/li&gt;
    &lt;li id=&quot;GKew&quot;&gt;Evaluate the cost of all the resulting nodes. If it is lower than the current one, consider whether it is worth shifting the workloads (taking into account the increased complexity of infrastructure management, scaling, etc.). If it does not reduce the cost, look at what makes the most noticeable contribution and try redistributing again.&lt;/li&gt;
  &lt;/ol&gt;
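  &lt;p&gt;As a quick illustration of step 1, here is a minimal PowerCLI sketch (my own, assuming an active Connect-VIServer session to your VMC vCenter; the CSV path is arbitrary) that exports the key per-VM characteristics:&lt;/p&gt;
  &lt;pre&gt;# Minimal inventory sketch - assumes you are already connected via Connect-VIServer
# Collects the per-VM characteristics used for the classification above
Get-VM | Select-Object Name, NumCpu, MemoryGB,
    @{N='UsedSpaceGB';E={[math]::Round($_.UsedSpaceGB,1)}},
    @{N='ProvisionedSpaceGB';E={[math]::Round($_.ProvisionedSpaceGB,1)}} |
  Export-Csv -Path 'vm-inventory.csv' -NoTypeInformation&lt;/pre&gt;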
  &lt;p id=&quot;ix1L&quot;&gt;&lt;br /&gt;In general, it is an iterative process with no guarantee that the optimization will actually pay off. As a rule, it works only on sufficiently large infrastructures, where it is possible and sensible to create several large clusters.&lt;/p&gt;

</content></entry><entry><id>nkulikov:VMC_Storage_Optimizations_Part1</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMC_Storage_Optimizations_Part1?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMware Cloud on AWS Optimizations in Storage-bounded environments. Part 1. Reduce amount of data you store.</title><published>2023-12-04T12:05:45.295Z</published><updated>2024-04-23T23:01:28.157Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img3.teletype.in/files/eb/4e/eb4ec3f4-445f-427c-b6fc-76be0c67bfd4.png"></media:thumbnail><category term="vmc-storage-optimizations" label="VMC Storage Optimizations"></category><summary type="html">&lt;img src=&quot;https://img1.teletype.in/files/83/4b/834bb2f7-9f7c-43f0-b2ac-6c6744df1957.png&quot;&gt;It's no secret that many infrastructures in VMC on AWS are bounded by storage capacity rather than compute resources. That is, the number of hosts is determined by the usable capacity requirements on the cluster, while there are spare/free compute resources. This relationship follows directly from the fact that VMC on AWS utilizes VMware vSAN-based hyperconverged infrastructure as its primary storage, and therefore there is an explicit capacity/compute resource ratio. That said, there are three types of nodes available in VMC on AWS (or even two if we're talking about Reserved Instances since i3 is now available only on demand) and these ratios are not always exactly what is required. Within this article I would like to talk about what...</summary><content type="html">
  &lt;h2 id=&quot;A11m&quot; data-align=&quot;center&quot;&gt;Intro&lt;/h2&gt;
  &lt;p id=&quot;swOU&quot;&gt;It&amp;#x27;s no secret that &lt;strong&gt;many infrastructures in VMC on AWS are bounded by storage capacity rather than compute resources&lt;/strong&gt;. That is, the number of hosts is determined by the usable capacity requirements on the cluster, while there are spare/free compute resources. This situation follows directly from the fact that VMC on AWS uses VMware vSAN-based hyperconverged infrastructure as its primary storage, and therefore there is an explicit capacity/compute resource ratio. That said, there are four types of nodes available in VMC on AWS (or even two if we&amp;#x27;re talking about Reserved Instances, since i3 is now available only on demand and M7i is in Tech Preview and diskless by design) and these ratios are not always exactly what is required. &lt;br /&gt;Within this article I would like to talk about what you can use and what approaches you can take to optimize your VMC on AWS infrastructure, reduce the number of nodes, and thus &lt;strong&gt;pay less&lt;/strong&gt; money to VMware and AWS. 😊&lt;/p&gt;
  &lt;p id=&quot;PuDB&quot;&gt;I will describe the various steps in the order I &lt;strong&gt;personally&lt;/strong&gt; &lt;strong&gt;recommend&lt;/strong&gt;. &lt;/p&gt;
  &lt;h2 id=&quot;zWQw&quot; data-align=&quot;center&quot;&gt;Reduce amount of data you store&lt;/h2&gt;
  &lt;p id=&quot;W622&quot;&gt;First step is related to basic IT hygiene. You don&amp;#x27;t need any additional solutions or technology, but you will need time to figure it out, as well as perhaps some organizational efforts.&lt;/p&gt;
  &lt;p id=&quot;NWfm&quot;&gt;The main idea is to keep only what you really need and get rid of waste. Below are a few examples of the most popular and effective steps to reduce the amount of stored data.&lt;/p&gt;
  &lt;h2 id=&quot;VBcD&quot; data-align=&quot;center&quot;&gt;Get rid of all unnecessary VMs/vmdks and snapshots&lt;/h2&gt;
  &lt;p id=&quot;TMNq&quot;&gt;I don&amp;#x27;t want to dwell on this for too long, simply because I think it&amp;#x27;s pretty obvious, there are already a lot of &lt;a href=&quot;https://www.vmwareopsguide.com/operations-management/chapter-3-capacity-management/1.3.11-reclamation/&quot; target=&quot;_blank&quot;&gt;great papers written on this topic&lt;/a&gt;, and there are pretty well-known ways to do it.&lt;/p&gt;
  &lt;p id=&quot;g6Cy&quot;&gt;&lt;br /&gt;It is well known that in particularly large infrastructures &lt;strong&gt;there are many VMs that are forgotten, were created by mistake, or are not doing any useful work anymore at all&lt;/strong&gt; but are still stored or even running on the infrastructure. &lt;/p&gt;
  &lt;p id=&quot;L39t&quot;&gt;&lt;br /&gt;The first correct step is to get rid of such workloads. The easiest and fastest way is to run an analysis of your infrastructure using &lt;a href=&quot;https://docs.vmware.com/en/vRealize-Operations/8.10/com.vmware.vcom.user.doc/GUID-7DBBF4C4-D4A0-4212-80DE-31E8EA1BB4F0.html&quot; target=&quot;_blank&quot;&gt;Aria Operations (ex vRealize Operations) to find such VMs&lt;/a&gt;. After some time (yes, it is necessary to avoid false positive errors, so the sooner you run the analysis the better) you &lt;a href=&quot;https://www.brockpeterson.com/post/vmware-vrealize-operations-capacity-reclamation&quot; target=&quot;_blank&quot;&gt;will get a detailed list of such powered off, orphaned and idle VMs&lt;/a&gt;. You need to make sure that these VMs are really no longer needed and remove them. Similarly, you should do the same with orphaned vmdks, snapshots, etc.&lt;/p&gt;
  &lt;figure id=&quot;rTlv&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/83/4b/834bb2f7-9f7c-43f0-b2ac-6c6744df1957.png&quot; width=&quot;3980&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ISTe&quot;&gt;&lt;br /&gt;It is a good idea to go through the entire list of VMs on your infrastructure to see who needs them and why. Yes, it can be really long and annoying, but it will give you a better understanding of your infrastructure and will also be useful for further optimization and/or better response in case of any incidents. I encourage you to actively use tags, VM comments, folders, etc. - so that this information can be easily retrieved anytime and by anyone in the organization who needs it. &lt;/p&gt;
  &lt;p id=&quot;hXap&quot;&gt;&lt;/p&gt;
  &lt;h2 id=&quot;HwR1&quot; data-align=&quot;center&quot;&gt;Thin/Thick Disks&lt;/h2&gt;
  &lt;p id=&quot;X6Hg&quot;&gt;Many IT administrators, especially those who have been working with vSphere for a long time, are somewhat wary of thin disks. I have even seen a number of companies where the use of thick disks is written into IT policy as the default requirement. Historically, this is because when &lt;strong&gt;using&lt;/strong&gt; &lt;strong&gt;thin disks on VMFS volumes, there &lt;a href=&quot;https://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf&quot; target=&quot;_blank&quot;&gt;could be a performance degradation&lt;/a&gt; of thin disks as they grow/expand&lt;/strong&gt; (note that in a normal, steady state there is generally no difference between thick and thin disks). Although &lt;a href=&quot;https://kb.vmware.com/s/article/83363&quot; target=&quot;_blank&quot;&gt;this effect was gradually reduced&lt;/a&gt; in subsequent vSphere versions, the principle remains unchanged if we are talking about VMFS datastores.&lt;/p&gt;
  &lt;p id=&quot;BGJj&quot;&gt;But in the case of VMC you have to remember that VMFS is not used there at all. And the main type of storage is vSAN, which is an object storage system, and which does not use VMFS. For this reason, vmdk growth is not an issue and &lt;a href=&quot;https://blogs.vmware.com/virtualblocks/2020/11/16/thick-vs-thin-on-vsan-2020-edition/&quot; target=&quot;_blank&quot;&gt;the performance for Thin and Thick disks for vSAN is the same&lt;/a&gt;. Therefore, &lt;strong&gt;there is no reason to use thick disks in VMC/A from a performance perspective&lt;/strong&gt;. &lt;/p&gt;
  &lt;p id=&quot;bZxe&quot;&gt;&lt;strong&gt;The second global reason for using thick disks is to simplify capacity management in the infrastructure.&lt;/strong&gt; Roughly speaking, using thin disks can (and usually does) lead to capacity oversubscription (&lt;a href=&quot;https://docs.vmware.com/en/VMware-vSphere/8.0/vsan-monitoring-troubleshooting/GUID-6F7F134E-A6F7-4459-8C31-C021FF2B1F54.html#:~:text=Oversubscription%20reports%20the%20vSAN%20capacity,with%20the%20total%20available%20capacity.&quot; target=&quot;_blank&quot;&gt;Here&amp;#x27;s how you can check this value for vSAN&lt;/a&gt;). And if the amount of data inside the VMs starts growing for one reason or another, we can run out of space on the datastore and, consequently, face an I/O freeze for many VMs located on it. Of course, this can have catastrophic consequences for production workloads.&lt;/p&gt;
  &lt;p id=&quot;IVOD&quot;&gt;&lt;br /&gt;The main way to avoid this is to monitor datastore occupancy and manage capacity efficiently and promptly. Using thick disks allows you to completely eliminate such risks at the cost of decreased storage efficiency. &lt;/p&gt;
  &lt;h3 id=&quot;25eN&quot; data-align=&quot;center&quot;&gt;vSAN Object Space Reservation&lt;/h3&gt;
  &lt;p id=&quot;hZWD&quot;&gt;&lt;br /&gt;But for vSAN, there is a separate way to manage capacity oversubscription with a Storage Policy - &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vsphere.vmc-aws-manage-data-center-vms.doc/GUID-EDBB551B-51B0-421B-9C44-6ECB66ED660B.html&quot; target=&quot;_blank&quot;&gt;Object Space Reservation&lt;/a&gt;. At a more granular level, it allows you to set the oversubscription factor in advance by reserving 0/25/50/75/100% of capacity for selected vmdks on the vSAN datastore. This setting makes sense for business-critical workloads, where we want to minimize the risk of running out of free space, or to ensure that critical VMs always get the capacity they need and their I/O does not stop even if the datastore is full. The main reason for this is that &lt;strong&gt;expanding a vSAN cluster in an on-premises infrastructure is time consuming&lt;/strong&gt;. It can take hours to days (if we have spare nodes and disks, which is usually quite rare) or even weeks to months (if we have to order new hardware, install it in the datacenter, configure it, etc.). Thus, we cannot react quickly to a sudden and unexpected growth of data, which carries risks for infrastructure availability.&lt;/p&gt;
  &lt;p id=&quot;lAF4&quot;&gt;&lt;strong&gt;But the fundamental difference with VMC/A is that it takes only MINUTES (10-20 minutes in most cases) to expand the infrastructure instead of days and weeks&lt;/strong&gt;. Adding a new node or multiple nodes is just a couple of clicks in the Cloud Console, and then the new nodes will be added to the cluster, which of course results in an expansion of available capacity.&lt;/p&gt;
  &lt;p id=&quot;oOFY&quot;&gt;&lt;br /&gt;What&amp;#x27;s more, you don&amp;#x27;t even need to monitor datastore occupancy to avoid capacity overflow. VMC/A has a default (and it cannot be disabled) Elastic DRS behavior: when the vSAN datastore is 70% full you will be notified of the high level of utilization, and when it reaches 80% utilization a host will be added automatically. &lt;strong&gt;Thus, the probability of the vSAN datastore running out of capacity in VMC/A is practically zero.&lt;/strong&gt; To be fair, if all your clusters have already reached the maximum size of 16 nodes, further expansion will be impossible and such risks remain, but the solution is quite trivial - do not use clusters of maximum size and leave a reserve for adding at least a couple of nodes, for example by splitting one large cluster into two smaller ones.&lt;/p&gt;
  &lt;p id=&quot;ziFU&quot;&gt;The last important thing to mention: the &lt;strong&gt;TRIM/UNMAP feature&lt;/strong&gt; (I will talk about it later) is available and &lt;strong&gt;makes sense only with thin disks&lt;/strong&gt;.&lt;/p&gt;
  &lt;h3 id=&quot;tWKT&quot; data-align=&quot;center&quot;&gt;Identifying thick disks in the infrastructure&lt;/h3&gt;
  &lt;p id=&quot;flQN&quot;&gt;&lt;br /&gt;So, I hope I have convinced you that &lt;strong&gt;using thick disks (or Object space reservation) in VMC/A does not make enough sense from either a performance or capacity management perspective&lt;/strong&gt; to pay for it in terms of reduced efficiency and paying for more nodes. &lt;/p&gt;
  &lt;p id=&quot;lBD1&quot;&gt;So, the first thing I urge you to do is to &lt;strong&gt;check your VMC infrastructure for thick disks or Object space reservation and seriously evaluate&lt;/strong&gt; whether they are really required in your case for each specific VM.  &lt;/p&gt;
  &lt;p id=&quot;pqyy&quot;&gt;To do this you can follow the steps below:&lt;/p&gt;
  &lt;p id=&quot;yPOW&quot;&gt;1.) Make a list of all vmdk/VMs that use thick disks. You can use the following tools to do this:&lt;/p&gt;
  &lt;ul id=&quot;CukF&quot;&gt;
    &lt;li id=&quot;wQwC&quot;&gt;Built-in check in &lt;a href=&quot;https://kb.vmware.com/s/article/66758&quot; target=&quot;_blank&quot;&gt;vSAN/Skyline Health Check&lt;/a&gt;.&lt;/li&gt;
    &lt;li id=&quot;MPcH&quot;&gt;Check the vSAN Storage Polices in use for Object space reservation customizations. &lt;/li&gt;
    &lt;li id=&quot;OtKY&quot;&gt;Build the report in PowerShell (&lt;a href=&quot;https://virtual-simon.co.uk/vmware-powershell-how-to-detect-thick-disks/&quot; target=&quot;_blank&quot;&gt;example&lt;/a&gt;, and see the sketch below), Aria Operations/vROps (&lt;a href=&quot;https://docs.vmware.com/en/vRealize-Operations/8.10/com.vmware.vcom.metrics.doc/GUID-E68037C7-7FC3-41F6-9AAC-1DE6D0D36213.html&quot; target=&quot;_blank&quot;&gt;VM properties&lt;/a&gt;) or any other tool like RVTools, LiveOptics, etc.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;figure id=&quot;tvDj&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/3d/0b/3d0bba23-b676-4a3a-8144-07b9eb00dbac.png&quot; width=&quot;4096&quot; /&gt;
  &lt;/figure&gt;
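  &lt;p&gt;For the PowerShell option, a minimal PowerCLI sketch (mine, not taken from the linked example) that lists every non-thin vmdk could look like this:&lt;/p&gt;
  &lt;pre&gt;# List every vmdk that is not thin provisioned
# StorageFormat is Thin, Thick (lazy zeroed) or EagerZeroedThick
Get-VM | Get-HardDisk |
  Where-Object { $_.StorageFormat -ne 'Thin' } |
  Select-Object @{N='VM';E={$_.Parent.Name}}, Name, StorageFormat,
    @{N='CapacityGB';E={[math]::Round($_.CapacityGB,1)}}&lt;/pre&gt;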
  &lt;p id=&quot;PyOL&quot;&gt;2.) Review whether you really need to use thick disks or Object space reservation for them. &lt;/p&gt;
  &lt;figure id=&quot;sKoG&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/9c/56/9c562bdf-4d77-4ce9-9aea-e677f7bdff20.png&quot; width=&quot;624&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;TUHp&quot;&gt;3.) Change Object space reservation in the Storage Policy. To avoid a massive rebuild, it is better to create a new Storage Policy and attach it to VMs one at a time.&lt;/p&gt;
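  &lt;p&gt;A minimal PowerCLI sketch of this step with the SPBM cmdlets (the policy name &amp;quot;No-OSR&amp;quot; and the VM names are hypothetical - substitute the new policy and VMs from your own review):&lt;/p&gt;
  &lt;pre&gt;# Reassign VMs one at a time to a policy without Object Space Reservation
# 'No-OSR', 'VM1' and 'VM2' are hypothetical names - replace with your own
$policy = Get-SpbmStoragePolicy -Name 'No-OSR'
foreach ($vm in Get-VM -Name 'VM1','VM2') {
  # Re-apply the new policy to the VM home object and to each of its vmdks
  $vm | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy
  $vm | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy
}&lt;/pre&gt;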
  &lt;p id=&quot;zd9K&quot;&gt;4.) &lt;a href=&quot;https://kb.vmware.com/s/article/2014832&quot; target=&quot;_blank&quot;&gt;Convert from thick to thin.&lt;/a&gt; This may require secondary cluster or additional temporary &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vmc-aws-operations/GUID-AE7EB54A-E445-4060-A0F2-DC6742CB68A6.html&quot; target=&quot;_blank&quot;&gt;external NFS datastore&lt;/a&gt; (Flex Storage or FSxN) or cloning the VM. &lt;/p&gt;
  &lt;h2 id=&quot;iqTB&quot; data-align=&quot;center&quot;&gt;TRIM/UNMAP&lt;/h2&gt;
  &lt;h3 id=&quot;1nvb&quot; data-align=&quot;center&quot;&gt;General overview of TRIM/UNMAP&lt;/h3&gt;
  &lt;p id=&quot;Uue9&quot;&gt;The next point that follows the discussion of thin disks (whose size, remember, is equal to the actually utilized amount of data, not the allocated amount) is: what counts as actually utilized storage and how is it defined? Let me give you the simplest example: you have created a thin vmdk of 1TB, but you haven&amp;#x27;t written anything to it yet. Then its size will be equal to zero (formally, of course, not strictly zero, but a few tens of MB for metadata, which is not essential in this case). Then our guest OS writes 500GB of data there. The vmdk size becomes 500GB, which is obvious. But if we remove 300GB of this data at the guest OS level, what will the size of the vmdk be? In the general case, it can remain the same 500GB, because the &lt;strong&gt;hypervisor&amp;#x27;s storage subsystem has no idea which blocks have been deleted by the guest OS&lt;/strong&gt;, so it cannot free them up for use by other VMs.&lt;/p&gt;
  &lt;p id=&quot;ebfB&quot;&gt;To solve this problem, the TRIM/UNMAP/DEALLOCATE commands were added to the block storage protocols. TRIM, UNMAP and DEALLOCATE are ATA, SCSI and NVMe commands, respectively, that allow the initiator to &lt;strong&gt;signal that previously allocated and occupied data blocks have been cleared and are no longer in use&lt;/strong&gt;.&lt;/p&gt;
  &lt;p id=&quot;4BVp&quot;&gt;First of all, when we talk about Space Reclamation, we need to clarify that for virtualization environments there are fundamentally two levels at which it can be performed.&lt;/p&gt;
  &lt;p id=&quot;Szwf&quot;&gt;&lt;strong&gt;The first level is when we free up space at the hypervisor level itself&lt;/strong&gt;, for example when we delete a VM/vmdk or reduce the size of the vmdk and we want that space to be freed up on the datastore/storage. In the case of VMFS, &lt;a href=&quot;https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-BC1A172C-E649-4812-B8B2-A9E45AC97051.html&quot; target=&quot;_blank&quot;&gt;ESXi acts as an initiator and sends the list of freed blocks via the UNMAP command to the external storage&lt;/a&gt;. But because VMC on AWS does not use block-level datastores and the primary storage is vSAN, it works even more simply. Since vSAN is fundamentally an object storage system that stores your objects (mostly vmdk and snapshot files in terms of capacity contribution, although there are other types of objects as well) in the form of multiple components distributed across the nodes and disks of the cluster, the moment we delete an object all its associated components are automatically deleted and that space becomes available for other data. Thus, when &lt;strong&gt;a vmdk is deleted or downsized, space on the vSAN datastore is freed up automatically&lt;/strong&gt; and almost instantly (although on i3 clusters where dedupe is enabled this can take some time due to updates to the metadata tables and block map on the disk groups).&lt;/p&gt;
  &lt;p id=&quot;Fr1G&quot;&gt;The second level, and the somewhat more complex one, is often referred to as &lt;strong&gt;Guest UNMAP&lt;/strong&gt;. The idea is to make it possible to reduce the size of the vmdk itself when blocks are freed up within the guest OS, which in turn frees up capacity on the storage, as described in the previous paragraph. Naturally, the vmdk should be thin, as there is no point in doing this on thick vmdks - but I hope you&amp;#x27;ve already converted all your thick disks by now, as we discussed earlier. So, we want to make sure that &lt;strong&gt;when data is deleted inside the guest OS, the space on the storage system is freed up&lt;/strong&gt;.&lt;/p&gt;
  &lt;p id=&quot;TKvX&quot;&gt;It is especially important to note that quite often we can observe a very unpleasant effect where the thin disk size gradually grows to the allocated vmdk size, despite the fact that we never filled it completely from the OS point of view. This is due to the behavior of some OS/file systems that prefer to write to new/empty or random blocks instead of previously filled (but previously freed up) blocks. Thus, it can almost always be observed that &lt;strong&gt;the longer a VM runs, the more we have such unused data stored&lt;/strong&gt;. &lt;/p&gt;
  &lt;p id=&quot;vAEN&quot;&gt;An important point to discuss is the &lt;strong&gt;potential side effects &lt;/strong&gt;of enabling TRIM/UNMAP. Setting aside the configuration efforts (more on that later), the only possible downside is the &lt;strong&gt;impact on storage performance&lt;/strong&gt;, i.e. on vSAN in the case of VMC on AWS. The point is that when the storage system receives commands from the guest OS to clean blocks, it must start cleaning and deleting them at its own level. This creates additional &amp;quot;&lt;strong&gt;background pressure&lt;/strong&gt;&amp;quot; on the storage system processing such requests. In the case of vSAN, this load is somewhat greater with deduplication enabled, as each deletion needs to be additionally reflected in the hash map and metadata tables and recalculated if necessary. In VMC on AWS, deduplication is only enabled on i3-based clusters, while i3en and i4i-based clusters only have compression enabled, which has minimal impact compared to a configuration without Space Efficiency. Also, the impact obviously depends on the size of the vSAN payload as well as the amount of data that needs to be deleted - &lt;strong&gt;the more data has to be cleaned, the longer it will take&lt;/strong&gt;. &lt;br /&gt;Thus, it is extremely difficult or even impossible to give a general estimate of the performance impact, but in my personal experience, in the &lt;strong&gt;vast majority of cases it does not have any noticeable impact&lt;/strong&gt;; in rare cases it can be observed only at initial startup, when the amount of data to be freed is particularly large. But I can&amp;#x27;t help but advise you to test on your infrastructure and with your workload, if possible, to be absolutely comfortable and confident.&lt;br /&gt;Finally, I would like to point out that TRIM/UNMAP is a &lt;strong&gt;vmdk-level setting&lt;/strong&gt;: you can turn off TRIM/UNMAP for individual VMs or disks by specifying disk.scsiUnmapAllowed = false in the Advanced Settings of the VM. Largely because each vmdk on vSAN is processed relatively independently of the others (especially on clusters without deduplication), this &lt;strong&gt;can minimize the impact of TRIM/UNMAP for a particular VM&lt;/strong&gt;.&lt;/p&gt;
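  &lt;p&gt;As an illustration, a sketch of this per-VM opt-out in PowerCLI (the VM name here is hypothetical):&lt;/p&gt;
  &lt;pre&gt;# Disable TRIM/UNMAP processing for a single VM via the advanced setting above
# 'MyVM' is a hypothetical name - replace with the VM you want to exclude
New-AdvancedSetting -Entity (Get-VM -Name 'MyVM') -Name 'disk.scsiUnmapAllowed' -Value 'FALSE' -Confirm:$false&lt;/pre&gt;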
  &lt;h3 id=&quot;si5M&quot; data-align=&quot;center&quot;&gt;Assessing the efficiency of enabling TRIM/UNMAP&lt;/h3&gt;
  &lt;p id=&quot;3QUK&quot;&gt;The next obvious question is &lt;strong&gt;exactly how much data you can save by doing this&lt;/strong&gt;. While this of course depends on your infrastructure, the answer is usually A LOT. It is not uncommon for the amount of stored data after TRIM/UNMAP to be reduced by &lt;strong&gt;50% or even more&lt;/strong&gt;. Fortunately, there are very simple and quick ways to verify this and get accurate numbers specific to your infrastructure. To do this, you need to compare two metrics: the size of the thin vmdk and the amount of utilized data from the guest OS perspective. Both of these metrics are available through the regular vCenter API (though you need VMware Tools installed to get utilization information from within the guest OS), so many monitoring and inventory tools have this information. The only thing that can skew the numbers is shared and network disks: from VMware&amp;#x27;s perspective it is one vmdk, but each OS will see it as its own and report its own fill percentage. Because of this, the total amount of utilized data from the guest OS perspective may appear slightly larger than it actually is. So this method gives a conservative estimate of the savings - you can definitely free up that amount of space, and in reality it may be even more.&lt;/p&gt;
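  &lt;p&gt;If you prefer to pull the two metrics directly, here is a rough PowerCLI sketch of the same comparison (it needs running VMware Tools, and remember the shared-disk caveat above - treat the delta as a conservative estimate):&lt;/p&gt;
  &lt;pre&gt;# Compare vmdk usage with in-guest utilization per VM (VMware Tools required)
Get-VM | Where-Object { $_.Guest.Disks } |
  Select-Object Name,
    @{N='VmdkUsedGB';E={[math]::Round($_.UsedSpaceGB,1)}},
    @{N='GuestUsedGB';E={[math]::Round(
      ($_.Guest.Disks | Measure-Object CapacityGB -Sum).Sum -
      ($_.Guest.Disks | Measure-Object FreeSpaceGB -Sum).Sum, 1)}}&lt;/pre&gt;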
  &lt;p id=&quot;sRs4&quot;&gt;As an example of such tools that will help to capture the necessary data, I would like to point out some of the most popular ones - &lt;a href=&quot;https://www.liveoptics.com/&quot; target=&quot;_blank&quot;&gt;LiveOptics&lt;/a&gt;, &lt;a href=&quot;https://www.robware.net/rvtools/&quot; target=&quot;_blank&quot;&gt;RVTools&lt;/a&gt;, &lt;a href=&quot;https://www.vmware.com/uk/products/aria-operations.html&quot; target=&quot;_blank&quot;&gt;Aria Operations&lt;/a&gt; (ex vRealize Operations).&lt;br /&gt;In &lt;strong&gt;LiveOptics&lt;/strong&gt;, once you run the assessment (in this case, the analysis time doesn&amp;#x27;t matter), you will immediately see a comparison of the two metrics:&lt;/p&gt;
  &lt;figure id=&quot;gzRe&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/b9/5a/b95a1f93-d132-41e4-af90-1b4c540d8e7b.png&quot; width=&quot;773&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;Wdo1&quot;&gt;&lt;br /&gt;In &lt;strong&gt;RVTools&lt;/strong&gt;, you need to compare two columns - &amp;quot;In Use&amp;quot; from the &amp;quot;vInfo&amp;quot; tab (this is the vmdk size) and &amp;quot;Consumed MiB&amp;quot; from the &amp;quot;vPartitions&amp;quot; tab (this is utilization from the guest OS perspective).&lt;/p&gt;
  &lt;p id=&quot;3ZtI&quot;&gt;&lt;strong&gt;Aria Operations&lt;/strong&gt; unfortunately doesn&amp;#x27;t have a built-in report or View, but we can make one very quickly ourselves:&lt;/p&gt;
  &lt;figure id=&quot;BmXi&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/ba/3c/ba3cf6df-f5b8-48e3-8055-6d290811f5d9.png&quot; width=&quot;3846&quot; /&gt;
  &lt;/figure&gt;
  &lt;h3 id=&quot;UuBv&quot; data-align=&quot;center&quot;&gt;How to enable TRIM/UNMAP&lt;/h3&gt;
  &lt;p id=&quot;wRQX&quot;&gt;In order for everything to work as it should, we need two things: the guest OS must send TRIM/UNMAP/DEALLOCATE commands (depending on the virtual controller used), and the storage system (vSAN in our case) must properly process them.&lt;/p&gt;
  &lt;p id=&quot;ksNQ&quot;&gt;As for vSAN itself, TRIM/UNMAP support was introduced more than five years ago, in vSAN 6.7U1. In VMC on AWS, the &lt;strong&gt;minimum versions that support TRIM/UNMAP are 1.18v12 and 1.20v4&lt;/strong&gt;. For vSAN, &lt;strong&gt;TRIM/UNMAP is a cluster-level feature&lt;/strong&gt; and is disabled by default in OSA (in ESA, which I will talk about later, it is enabled out of the box); since vSAN 8 it can be enabled with a single button in the vSphere Client. But since in VMC on AWS you don&amp;#x27;t have permissions to change the cluster configuration, you &lt;a href=&quot;https://kb.vmware.com/s/article/85590&quot; target=&quot;_blank&quot;&gt;just need to contact support/CSM and ask them to enable it, specifying the cluster where you want to do it.&lt;/a&gt; If you have any doubts about whether it is enabled or not, you can &lt;a href=&quot;https://kb.vmware.com/s/article/82020&quot; target=&quot;_blank&quot;&gt;look it up in the vSphere Client or with PowerCLI&lt;/a&gt;.&lt;/p&gt;
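  &lt;p&gt;With recent PowerCLI (which bundles the vSAN cmdlets), the check from the linked KB boils down to something like this:&lt;/p&gt;
  &lt;pre&gt;# Show whether Guest TRIM/UNMAP is enabled on each vSAN cluster
Get-Cluster | Get-VsanClusterConfiguration |
  Select-Object Name, GuestTrimUnmap&lt;/pre&gt;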
  &lt;p id=&quot;OwMF&quot;&gt;The second part of this task is to make the OS send commands to free up space. TRIM/UNMAP is supported for both Linux and Windows and requires &lt;strong&gt;Virtual Machine Hardware version 11 for Windows and 13 for Linux&lt;/strong&gt; (&lt;a href=&quot;https://www.codyhosterman.com/2018/04/whats-new-in-core-storage-in-vsphere-6-7-part-iv-nvme-controller-in-guest-unmap-support/&quot; target=&quot;_blank&quot;&gt;for vNVMe controllers the minimum version is v14&lt;/a&gt; for both Windows and Linux). As a reminder, the default version for all new VMs on all current VMC versions is 14, so this shouldn&amp;#x27;t be a problem, but when migrating from on-premises some machines may have an older version, so it is &lt;strong&gt;worth checking and upgrading vHW if needed&lt;/strong&gt;.&lt;/p&gt;
  &lt;p id=&quot;QuGY&quot;&gt;Next, the guest operating system must detect that the block device (the vmdk in this case) supports TRIM/UNMAP. If the VMs are already running on a cluster that did not have TRIM/UNMAP, this flag is not active, because it is set when the VM starts up. Thus, a &lt;strong&gt;Power Cycle (not just a reboot) will need to be performed for all such VMs&lt;/strong&gt;. On the one hand, this may look hard and complicated, but the good news is that VMs can safely continue to run on the cluster after TRIM/UNMAP is enabled without any consequences - Space Reclamation simply will not work for them yet. &lt;strong&gt;So perform the Power Cycle during the next scheduled updates, maintenance, etc.&lt;/strong&gt; To take this off your mind, you can &lt;a href=&quot;https://blogs.vmware.com/vsphere/2019/10/vmx-reboot-powercycle-makes-cpu-vulnerability-remediation-easy.html&quot; target=&quot;_blank&quot;&gt;set an additional parameter vmx.reboot.PowerCycle=True&lt;/a&gt; for the VMs, which will perform a full Power Cycle on the next reboot operation instead (BTW, I wrote above about the possible need to raise the vHW version, so you can use &lt;a href=&quot;https://williamlam.com/2022/09/power-off-vm-from-guest-os-reboot-capability-in-vsphere-8.html&quot; target=&quot;_blank&quot;&gt;a somewhat similar function that shuts down the VM on the next reboot&lt;/a&gt;). After the Power Cycle, this value is automatically reset to its default state (False). &lt;strong&gt;For all new VMs created on the cluster after TRIM/UNMAP is enabled, it works immediately&lt;/strong&gt;.&lt;/p&gt;
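  &lt;p&gt;Setting this flag in bulk is a one-liner in PowerCLI (the cluster name below is hypothetical; as noted, the flag resets itself to False after the power cycle):&lt;/p&gt;
  &lt;pre&gt;# Schedule a full power cycle on the next guest reboot for all VMs in a cluster
# 'Cluster-1' is a hypothetical name - replace with your own cluster
Get-Cluster -Name 'Cluster-1' | Get-VM |
  New-AdvancedSetting -Name 'vmx.reboot.PowerCycle' -Value 'TRUE' -Confirm:$false -Force&lt;/pre&gt;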
  &lt;p id=&quot;HDO6&quot;&gt;The last step is for the operating system to transmit a list of freed blocks. I don&amp;#x27;t want to go into detail here, because it&amp;#x27;s better to refer to your OS manual instead. &lt;br /&gt;I will only mention that &lt;strong&gt;starting from Windows Server 2012 automatic space clearing is enabled by default&lt;/strong&gt;, but it is worth checking, for example, via PowerShell (Get-ItemProperty -Path &amp;quot;HKLM:\System\CurrentControlSet\Control\FileSystem&amp;quot; -Name DisableDeleteNotification). You can also run the cleanup manually by running Defrag or Optimize-Volume, but of course it is better to let it run automatically. &lt;br /&gt;&lt;strong&gt;For Linux, this is usually done by running fstrim&lt;/strong&gt;. The availability of automatic cleanup depends on the distro - some of them (e.g. Ubuntu, SUSE, etc) have it enabled by default, and the frequency can be customized in the corresponding configuration files. Of course, you can also run it manually as well.&lt;/p&gt;
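  &lt;p&gt;For convenience, here are the Windows-side checks mentioned above in runnable form (run inside the guest; for Linux the rough equivalent is fstrim, e.g. &amp;quot;fstrim -av&amp;quot;):&lt;/p&gt;
  &lt;pre&gt;# 0 means Delete Notification (automatic TRIM/UNMAP) is enabled
Get-ItemProperty -Path 'HKLM:\System\CurrentControlSet\Control\FileSystem' -Name DisableDeleteNotification

# Trigger space reclamation manually on a thin-provisioned volume
Optimize-Volume -DriveLetter C -ReTrim -Verbose&lt;/pre&gt;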
  &lt;p id=&quot;TvNL&quot;&gt;&lt;/p&gt;
  &lt;p id=&quot;vv7F&quot;&gt;So, if you enable TRIM/UNMAP on the cluster, power cycle the VMs, and make sure that the cleanup is scheduled and executed automatically at the guest OS level, then after some time (usually 1-7 days) you will see additional space &amp;quot;magically&amp;quot; appear on vSAN. Note that at first this can be quite an active process (simply because a rather large list of freed blocks may have accumulated), but afterwards it becomes a light background process that keeps the VMs&amp;#x27; thin disks from growing further.&lt;/p&gt;
  &lt;h2 id=&quot;MZIJ&quot; data-align=&quot;center&quot;&gt;Summary and Next Steps&lt;/h2&gt;
  &lt;p id=&quot;gw1j&quot;&gt;Surprisingly, it turns out that following these steps is more than enough in many cases. And further optimization may not be necessary because vSAN already provides you with all the capacity you need. &lt;/p&gt;
  &lt;p id=&quot;u7ic&quot;&gt;&lt;br /&gt;But if that&amp;#x27;s not the case, let&amp;#x27;s talk about other ways to ensure even greater storage efficiency in VMC on AWS in the &lt;a href=&quot;https://nkulikov.com/VMC_Storage_Optimizations_Part2&quot; target=&quot;_blank&quot;&gt;Part2&lt;/a&gt;.&lt;/p&gt;

</content></entry><entry><id>nkulikov:VMwareCloudRegionsMap</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMwareCloudRegionsMap?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMware Cloud Regions Map</title><published>2023-03-17T15:13:13.267Z</published><updated>2023-11-17T11:45:26.673Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img4.teletype.in/files/37/9d/379dd75c-7324-4765-bd3a-b9c14439a08f.png"></media:thumbnail><category term="v-mware-cloud" label="VMware Cloud"></category><summary type="html">&lt;img src=&quot;https://img1.teletype.in/files/8e/97/8e976633-d470-44f3-a31a-17ec989265f8.png&quot;&gt;I have created a map with every region available for VMware Cloud on AWS (VMC on AWS), Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE) and Oracle Cloud VMware Solution (OVS). These are cloud services that allow you to run VMware workloads natively on different cloud platforms.</summary><content type="html">
  &lt;figure id=&quot;LfwH&quot; class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/8e/97/8e976633-d470-44f3-a31a-17ec989265f8.png&quot; width=&quot;568&quot; /&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;D2l3&quot;&gt;I have created a map with every region available for VMware Cloud on AWS (VMC on AWS), Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE) and Oracle Cloud VMware Solution (OVS). These are cloud services that allow you to run VMware workloads natively on different cloud platforms. &lt;/p&gt;
  &lt;p id=&quot;wO7P&quot;&gt;This map is based on the latest information from the official websites of these cloud providers as of November 2023.&lt;/p&gt;
  &lt;p id=&quot;aGkc&quot;&gt;&lt;a href=&quot;https://www.google.com/maps/d/edit?mid=1TbcSUvRhHxaOlQhccdzpAQC8NtIus6g&amp;usp=sharing&quot; target=&quot;_blank&quot;&gt;Here is the link to it. &lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;XlX7&quot;&gt;Here is a brief overview of each solution:&lt;/p&gt;
  &lt;p id=&quot;gtDG&quot;&gt;- VMC on AWS: This is a service that delivers a VMware SDDC on AWS infrastructure. You can use the same tools and processes as on-premises, and benefit from AWS services and global reach.&lt;/p&gt;
  &lt;p id=&quot;vu3Y&quot;&gt;- AVS: This is a Microsoft service, verified by VMware, that runs on Azure infrastructure. You can migrate or extend your VMware workloads to Azure without refactoring or rearchitecting them.&lt;/p&gt;
  &lt;p id=&quot;BYgd&quot;&gt;- GCVE: This is a fully managed service that lets you run the VMware platform in Google Cloud. You can use Google&amp;#x27;s high-performance infrastructure and networking services, and access Google Cloud services such as AI/ML and Big Data.&lt;/p&gt;
  &lt;p id=&quot;Ms0j&quot;&gt;- OVS: This is your implementation of VMware running on top of Oracle Cloud Infrastructure (OCI). You have full control over your VMware SDDC, and you can leverage OCI&amp;#x27;s security, scalability and price-performance advantages.&lt;/p&gt;

</content></entry><entry><id>nkulikov:VMC_Regions</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMC_Regions?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMC on AWS Regions</title><published>2022-12-07T11:43:33.074Z</published><updated>2023-03-21T12:59:29.879Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img3.teletype.in/files/6a/52/6a525397-6063-4045-b8bd-587ac6816571.png"></media:thumbnail><category term="vmc" label="VMC"></category><summary type="html">&lt;img src=&quot;https://img2.teletype.in/files/11/08/110801a4-58a1-4841-9d4d-4f185abfe181.png&quot;&gt;I created the spreadsheet with all the available regions for VMware Cloud on AWS (VMC-A), including the services available there (like VSR, VCDR, etc), node types (i3, i3en, i4i), and other characteristics (like stretched cluster support and compliance). Also add plans from public roadmaps for both AWS and VMC-A. </summary><content type="html">
  &lt;figure id=&quot;MBPd&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/11/08/110801a4-58a1-4841-9d4d-4f185abfe181.png&quot; width=&quot;2376&quot; /&gt;
    &lt;figcaption&gt;VMC on AWS Regions&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;k0zt&quot;&gt;I created the &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ivRqRzFv94lEITHYC53ogL2w4NTOjHT9HrLL9Jt89yA/edit?usp=sharing&quot; target=&quot;_blank&quot;&gt;spreadsheet&lt;/a&gt; with all the available regions for VMware Cloud on AWS (VMC-A), including the services available there (like VSR, VCDR, etc), node types (i3, i3en, i4i), and other characteristics (like stretched cluster support and compliance). I also added plans from the public roadmaps for both AWS and VMC-A.&lt;/p&gt;
  &lt;p id=&quot;hnui&quot;&gt;All this information is publicly available, but is unfortunately scattered across various places, which takes time to search and check.&lt;/p&gt;
  &lt;p id=&quot;feuH&quot;&gt;I am planning to keep this table up to date (I have already added the newest VMC region, Cape Town) so it serves as a single pane of glass. Also, as new services come out, I will add them to this table.&lt;/p&gt;
  &lt;p id=&quot;FfDD&quot;&gt;&lt;br /&gt;If you notice any inaccuracies, missed updates, or would like to see any additional information (for example VMware Cloud on other Hyperscalers) - let me know in the comments to this post, in the comments in the table itself or in the &lt;a href=&quot;https://t.me/VMware_VMC&quot; target=&quot;_blank&quot;&gt;unofficial VMC Telegram channel.&lt;/a&gt;&lt;/p&gt;
  &lt;p id=&quot;CLr6&quot;&gt;&lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ivRqRzFv94lEITHYC53ogL2w4NTOjHT9HrLL9Jt89yA/edit?usp=sharing&quot; target=&quot;_blank&quot;&gt;The link to Google Spreadsheet. &lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>nkulikov:VMC_Instances</id><link rel="alternate" type="text/html" href="https://nkulikov.com/VMC_Instances?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>VMware Cloud on AWS (VMC-A) Host Types Comparison</title><published>2022-10-29T13:34:53.269Z</published><updated>2022-10-29T13:34:53.269Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img3.teletype.in/files/e2/b7/e2b74e2d-dcdd-4158-91be-a859f42d71df.png"></media:thumbnail><category term="vmc" label="VMC"></category><summary type="html">&lt;img src=&quot;https://img3.teletype.in/files/a1/62/a1620e9e-1269-4474-bcce-7b6cb40e1a35.png&quot;&gt;With the appearance of i4i nodes in VMC, I realized that I was missing a table in which I could compare all available nodes (i3, i3en, i4i) in details. </summary><content type="html">
  &lt;p id=&quot;nTRh&quot;&gt;With the appearance of i4i nodes in VMC, I realized that I was missing a table in which I could compare all available nodes (i3, i3en, i4i) in detail.&lt;/p&gt;
  &lt;p id=&quot;7tDg&quot;&gt;Despite the fact that an excellent description of each type is available &lt;a href=&quot;https://vmc.techzone.vmware.com/resource/feature-brief-sddc-host-types#section3&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt; and a simplified version of the table is available &lt;a href=&quot;https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vmc-aws-operations/GUID-98FD3BA9-8A1B-4500-99FB-C40DF6B3DA95.html&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;, I decided to create an extended version of such a table myself. And then I thought that it might be useful for someone else.&lt;br /&gt;So, I&amp;#x27;m laying it out below as a picture - enjoy!&lt;/p&gt;
  &lt;p id=&quot;RMTz&quot;&gt;&lt;/p&gt;
  &lt;figure id=&quot;pwNA&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/a1/62/a1620e9e-1269-4474-bcce-7b6cb40e1a35.png&quot; width=&quot;1874&quot; /&gt;
  &lt;/figure&gt;

</content></entry><entry><id>nkulikov:StorageBenchmarkPart3</id><link rel="alternate" type="text/html" href="https://nkulikov.com/StorageBenchmarkPart3?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>Enterprise Storage Synthetic Benchmarking Guide and Best Practices. Part 3. Practical Illustration based on Benchmarking VMware vSAN Case.</title><published>2022-04-27T14:56:07.098Z</published><updated>2022-04-27T18:05:24.983Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img1.teletype.in/files/c4/eb/c4ebcde7-bcbc-4758-9614-ea8256cac2e7.png"></media:thumbnail><category term="enterprise-storage-benchmarking-guide" label="Enterprise Storage Benchmarking Guide"></category><summary type="html">&lt;img src=&quot;https://img2.teletype.in/files/d6/81/d6818eb9-eab9-4a1e-97d5-26bfed60f47e.png&quot;&gt;Well, let's go through the points outlined in Part 1 and Part 2 and look at an example of how testing might look end-to-end.</summary><content type="html">
  &lt;nav&gt;
    &lt;ul&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#icM8&quot;&gt;Scenario and definition of objectives.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#rABh&quot;&gt;Workers VMs&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#O9a6&quot;&gt;Size Of the Virtual Disk and the Overall Capacity Utilization.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#G1H9&quot;&gt;Active Workload Set Performance Impact Tests&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#PoUi&quot;&gt;Analyzing Test and Warm-up Duration Impact&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#rN6H&quot;&gt;Test the OIO Impact&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#ngH5&quot;&gt;IOPS Limits.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#vmwG&quot;&gt;Testing Results&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#WtCq&quot;&gt;Conclusion.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#cnxi&quot;&gt;Special Thanks.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#Ug4w&quot;&gt;Appendix A.&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/nav&gt;
  &lt;hr /&gt;
  &lt;p id=&quot;eWtD&quot;&gt;Well, let&amp;#x27;s go through the points outlined in &lt;a href=&quot;https://nkulikov.com/StorageBenchmarkPart1&quot; target=&quot;_blank&quot;&gt;Part 1&lt;/a&gt; and &lt;a href=&quot;https://nkulikov.com/StorageBenchmarkPart2&quot; target=&quot;_blank&quot;&gt;Part 2&lt;/a&gt; and look at an example of how testing might look end-to-end.&lt;/p&gt;
  &lt;p id=&quot;Ap9O&quot;&gt;As I said, it is always worth starting with defining the goals and objectives. The true purpose of the following benchmarks is to demonstrate, with a specific example, the process of conducting tests, as well as to illustrate the influence of individual parameters on the results. To make it closer to reality, let&amp;#x27;s come up with a hypothetical, but still realistic, situation.&lt;/p&gt;
  &lt;h2 id=&quot;icM8&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Scenario and definition of objectives.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;5pkW&quot;&gt;For starters, let’s imagine a company that built a VMware-based private cloud infrastructure some time ago. But it is used exclusively for software development and testing tasks, and even then it provides only partial coverage. The rest of the services run on a conventional VMware vSphere virtualization platform outside of the private cloud. The private cloud infrastructure, as well as the production environment, was built according to the traditional 3-tier architecture with blade servers, an FC network, SAN storage arrays and external hardware-defined network and security solutions.&lt;/p&gt;
  &lt;p id=&quot;bHdr&quot;&gt;It&amp;#x27;s been in use like this for quite a while, but, at some point in time, business stakeholders started to push for more speed and flexibility on the IT side. Responding to business needs, several strategic decisions were made:&lt;/p&gt;
  &lt;ul id=&quot;zivs&quot;&gt;
    &lt;li id=&quot;grJP&quot;&gt;Cloud-first approach. All and every resource should be provisioned from the cloud management platform.&lt;/li&gt;
    &lt;li id=&quot;FFrC&quot;&gt;Enable hybrid-cloud. Apps will be deployed both in the VMware-based private cloud infrastructure and in public clouds.&lt;/li&gt;
    &lt;li id=&quot;UBd7&quot;&gt;Self-governance. Application owners can choose where to provision their workloads, factoring in compliance, security, cost, availability, and other requirements.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;OueF&quot;&gt;At the same time, there was a need to refresh the underlying private cloud infrastructure, as it was already outdated and no longer met the requirements. When developing the new platform’s architecture, the following key requirements were identified:&lt;/p&gt;
  &lt;ul id=&quot;5zvl&quot;&gt;
    &lt;li id=&quot;2viK&quot;&gt;Flexible and fast scaling that can be done in small steps. This is important because it is difficult to predict where developers will deploy their applications and how many, so infrastructure must be able to adapt fast. &lt;/li&gt;
    &lt;li id=&quot;Pl2J&quot;&gt;Cost efficiency. Onprem infrastructure must offer competitive resource costs when compared to public clouds.&lt;/li&gt;
    &lt;li id=&quot;FFiD&quot;&gt;Significant simplification of infrastructure operation, as many IT specialists will be reassigned to create a set of new high-value easy-to-consume onprem services similar to those available in the public clouds.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;wrKu&quot;&gt;To meet those requirements, VMware hyper-converged infrastructure was considered as the primary option for the next generation of private cloud infrastructure. The next task was to draft a design and size the clusters running the new private cloud, including the underlying hardware. With that in mind, the existing infrastructure was analyzed to obtain the averaged ratio of VMs’ vCPU/vRAM/Disk. That allowed a coarse dimensioning of the new cluster, but the &lt;strong&gt;storage performance assessment absolutely required confirmation and refinement. &lt;/strong&gt;&lt;/p&gt;
  &lt;hr /&gt;
  &lt;p id=&quot;khgq&quot;&gt;So, the benchmark &lt;strong&gt;goals were defined as follows:&lt;/strong&gt;&lt;/p&gt;
  &lt;ul id=&quot;dSYi&quot;&gt;
    &lt;li id=&quot;U9p2&quot;&gt;Determine how many “averaged” VMs could be deployed on the cluster from the storage performance perspective to assess an averaged VM’s TCO for the new private cloud and to achieve a better hardware utilization.&lt;/li&gt;
    &lt;li id=&quot;DdnN&quot;&gt;Understand the cluster&amp;#x27;s performance potential.&lt;/li&gt;
    &lt;li id=&quot;Jf5W&quot;&gt;Admins need to learn the specifics of the hyper converged platform, its behavior in different situations and under different loads, as well as potential bottlenecks.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;yfAc&quot;&gt;&lt;strong&gt;Synthetic tests were chosen as a testing method&lt;/strong&gt; because the workloads are expected to be very diverse, there are no workloads identified as particularly critical to be assessed individually, and, moreover, most workloads do not exist yet. The second reason is related to the fact that the use of synthetic benchmarks allows for the most flexible testing process and storage system’s behavior understanding.&lt;/p&gt;
  &lt;p id=&quot;Kmj7&quot;&gt;It was decided to leverage &lt;strong&gt;HCIBench&lt;/strong&gt; for the tests. The first reason for this was because the new private cloud platform will be built with VMware solutions. Another reason was that the test program included a large number of more or less standard tests that &lt;strong&gt;HCIBench fitted very well with.&lt;/strong&gt;&lt;/p&gt;
  &lt;hr /&gt;
  &lt;p id=&quot;7Mn7&quot;&gt;Once we have decided on the benchmarking goals and methods, we need to collect some key metrics allowing us to define the success criteria. First, we need to understand from which segment of the infrastructure we can get them and which tools to use. As you remember, the company’s legacy private cloud platform hosts mostly lightly loaded test environments. And this will continue in the future, due to the expected growth in volume of test environments. That said, we are going to use the existing monitoring system (VMware vRealize Operations Manager) to assess the load’s profile, as well as to extract average and peak load values from the last year. After that, we will evaluate similar metrics for the existing mixed production clusters, because production workloads are also expected to be hosted in a new private cloud.&lt;/p&gt;
  &lt;p id=&quot;Kef5&quot;&gt;Let&amp;#x27;s say we get the following results:&lt;/p&gt;
  &lt;ul id=&quot;X8Sf&quot;&gt;
    &lt;li id=&quot;1NeI&quot;&gt;8K 70/30 Read/Write 90% Random — close to the average numbers from the test/dev environments.&lt;/li&gt;
    &lt;li id=&quot;o8Tq&quot;&gt;64K 50/50 Read/Write 90% Random — some sort of worst case obtained from the analysis of production clusters.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;vW4t&quot;&gt;Thus, our success criteria should be set as follows:&lt;/p&gt;
  &lt;ol id=&quot;q38J&quot;&gt;
    &lt;li id=&quot;vMns&quot;&gt;Each node of the hyper converged platform must have a performance of at least 25,000 IOPS (observed year high from current production clusters) with a 64K 50/50 90% Random profile with the latency below 3 ms.&lt;/li&gt;
    &lt;li id=&quot;vP82&quot;&gt;Each node of the hyper converged platform must have a performance of at least 50,000 IOPS with an 8K 70/30 90% Random profile with the latency below 3 ms.&lt;/li&gt;
    &lt;li id=&quot;g0SD&quot;&gt;Testing conditions - ~10-20 vmdks per host, normal operational state, capacity utilization in line with the vendor’s best practices, the active workload set of 10%.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;okIx&quot;&gt;We will also specifically add several types of synthetic load profiles, which are rarely seen in the infrastructure as they are, but which will be valuable for a better behavior analysis of the storage system:&lt;/p&gt;
  &lt;ul id=&quot;6vFH&quot;&gt;
    &lt;li id=&quot;fC8k&quot;&gt;4K 100% Read 100% Random — it is a synthetic workload that is commonly used to generate maximum IOPS from storage and stress storage controllers/processors. Random IO pattern is selected to make it appear more &amp;quot;realistic&amp;quot;.&lt;/li&gt;
    &lt;li id=&quot;poVE&quot;&gt;512K 100% Read 100% Sequential — workload heavily uses the bandwidth and reveals batch read operations performance. &lt;/li&gt;
    &lt;li id=&quot;w2ln&quot;&gt;512K 100% Write 100% Sequential — write-heavy workload, washes out controllers’ write buffers and reveals batch upload operations performance.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;hr /&gt;
  &lt;p id=&quot;6frI&quot;&gt;To replicate the scaled-in future production environment, the testbed consisted of 6 all-flash vSAN 7.0U2 nodes in a configuration almost identical to the target (except for the compute resources). At the same time, we have evidence that performance scales linearly, provided that both the number of nodes and the number of VMs increase at the same time. Six nodes are required in order to run tests in various configurations, including tests with Erasure Coding FTT=2 (aka RAID 6) enabled, which requires a minimum of 6 nodes. But for the sake of this guide’s conciseness, all tests in this document will refer to the storage policy FTT=1 Mirror (aka RAID 1). You can find the detailed configuration of the testbed in Appendix A. &lt;/p&gt;
  &lt;h2 id=&quot;rABh&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Workers VMs&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;Q2E5&quot;&gt;Based on the existing infrastructure’s assessment, about 16 vmdks per host are required. Since the computing resources on each node are quite limited, I decided to place 4 VMs with 4 vmdks each on every node. Preliminary tests, as well as observations during the main tests, showed that 4 vCPUs are enough to avoid a bottleneck.&lt;/p&gt;
  &lt;p id=&quot;zMFy&quot;&gt;Further scaling-out the number of VMs or scaling up VMs themselves can only worsen the results due to the high compute oversubscription and CPU resources competition between the VMs.&lt;/p&gt;
  &lt;h2 id=&quot;O9a6&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Size Of the Virtual Disk and the Overall Capacity Utilization.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;cQOX&quot;&gt;In the test conditions, it is said that it is necessary to fill the storage to the recommended levels that are used when designing the future infrastructure.&lt;/p&gt;
  &lt;p id=&quot;5ton&quot;&gt;VMware recommends keeping the free capacity of vSAN at &lt;a href=&quot;https://core.vmware.com/blog/understanding-%E2%80%9Creserved-capacity%E2%80%9D-concepts-vsan&quot; target=&quot;_blank&quot;&gt;about 15–30%&lt;/a&gt;, depending on the cluster size and configuration. So, I calculated the size of vmdk to achieve the total capacity utilization of ~75% (~420GB) and this value was taken as a baseline.&lt;/p&gt;
  &lt;p id=&quot;PMxm&quot;&gt;Before that, I ran the test with capacity utilization increased from 17% to 98.7% with every workload profile to understand the impact. The most important part here — &lt;strong&gt;I did not change the absolute active workload set size in TB&lt;/strong&gt; during the tests. Also, tests were executed in the steady state (meaning the data had been written and some time had passed, so no active rebalance operations were running in the cluster). On the graph, the performance at 75% capacity utilization was set as 100%.&lt;/p&gt;
  &lt;figure id=&quot;KMJX&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/d6/81/d6818eb9-eab9-4a1e-97d5-26bfed60f47e.png&quot; width=&quot;1676&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance impact of datastore capacity utilization.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;X2gt&quot;&gt;We can see that there is no significant performance impact from higher utilization, regardless of the workload profile, for VMware vSAN. An interesting finding here is the 7–17% performance degradation at the lowest utilization (17%). The reason is that at this level most of the data resides on the write buffer SSDs without being destaged to the capacity tier, which decreases the number of SSDs processing the IO requests.&lt;/p&gt;
  &lt;p id=&quot;cLfa&quot;&gt;All subsequent tests use 75% capacity utilization. &lt;/p&gt;
  &lt;h2 id=&quot;G1H9&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Active Workload Set Performance Impact Tests&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;mzXX&quot;&gt;Now let’s investigate how the Workload Set dimensioning affects the performance of the vSAN Cluster. I varied the workload set size from 1% to 100% and measured IOPS and latency for every workload profile.&lt;/p&gt;
  &lt;figure id=&quot;yFnR&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/c2/fe/c2fe5766-23ac-4856-84e1-20b224278d97.png&quot; width=&quot;1986&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance impact of the active workload set during a random read workload.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;Q8Ar&quot;&gt;At 4K 100% reads, we observe almost no impact from the active workload set size. The reason is that this is an all-flash vSAN cluster: it serves reads directly from its capacity tier without extra overhead, and the capacity-tier SSDs process them really fast. The same picture holds for the sequential read test.&lt;/p&gt;
  &lt;p id=&quot;625z&quot;&gt;At the same time, it is quite obvious that a hybrid cluster would show a very sharp performance drop once the active set outgrows the SSD cache.&lt;/p&gt;
  &lt;figure id=&quot;b1Y0&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/a9/7d/a97d95f2-c73d-4dab-ac62-4be053b0afa4.png&quot; width=&quot;1918&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance impact of the active workload set during a sequential write workload.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;FlP5&quot;&gt;With the 512K sequential write test, the picture changes dramatically. You can see two performance layers — an active workload set of 1–10% and of 30%+. At the 1–10% level, the system responds at write-buffer speed. We can write data into the write buffer really fast (with the write-intensive Optane SSDs), but once our active data no longer fits into the write buffer, the destaging process is triggered. This process affects front-end performance — vSAN needs to free up space in the buffer by destaging data to the capacity tier before it can process new data coming from the worker VMs. Beyond 30%, destaging becomes continuous and massive; however, the system reaches a balanced, steady state.&lt;/p&gt;
  &lt;figure id=&quot;VTip&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/87/9b/879bbaa3-b47f-4e99-814b-fbd4734d09a5.png&quot; width=&quot;1896&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance impact of the active workload set during a mixed workload.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;sLsG&quot;&gt;The mixed workload test is, naturally, a mix of the 100% read and 100% write tests. Every read is processed fast without any impact, and the writes are not that massive, so the destaging process is less intensive. It affects the performance but does not cap it. This leads to a smoother graph and a smaller performance impact overall.&lt;/p&gt;
  &lt;p id=&quot;YN5z&quot;&gt;&lt;strong&gt;Conclusions About Workload Set&lt;/strong&gt;&lt;/p&gt;
  &lt;ol id=&quot;iPqT&quot;&gt;
    &lt;li id=&quot;k2L1&quot;&gt;The active workload set can always affect performance, so you should test its impact.&lt;/li&gt;
    &lt;li id=&quot;SCvz&quot;&gt;The performance impact varies with different workload profiles as well as with different storage settings (deduplication, compression, tiering, caching, etc.).&lt;/li&gt;
    &lt;li id=&quot;B5WJ&quot;&gt;Honestly, you should not care much about the exact amount of cache/tier in the storage system as long as the cache-to-capacity ratio is the same as in your target system. Think of these as black-box tests.&lt;/li&gt;
    &lt;li id=&quot;fmba&quot;&gt;It is smart to analyze two cases — the normal or expected one with an active workload set of ~10% (or whatever number you got by assessing your existing environment), and a reasonable worst case of ~30%.&lt;/li&gt;
    &lt;li id=&quot;OOfQ&quot;&gt;Be accurate and change only one variable at a time.&lt;/li&gt;
    &lt;li id=&quot;XvGh&quot;&gt;Test duration must be sufficient (I will demonstrate this later in the document) to achieve the system’s steady state when caches are warmed up and buffers are full.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;6mtf&quot;&gt;For the next tests, I will use both 10% WS (“Normal WS”) and 50% WS (“Huge WS”) for a more holistic view.&lt;/p&gt;
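  &lt;p id=&quot;wsF1&quot;&gt;For reference, here is how such a workload set can be expressed for FIO, which HCIBench uses under the hood. This is a minimal, hypothetical invocation: FIO’s size option restricts the region of each disk the job touches, which is exactly what defines the active workload set; the device path and profile values are placeholders:&lt;/p&gt;
  &lt;pre id=&quot;wsF2&quot;&gt;&lt;code&gt;
# Minimal sketch: one FIO run against a worker disk with a 10% active
# workload set. The device path and profile values are placeholders.
import subprocess

cmd = [
    &quot;fio&quot;,
    &quot;--name=ws10&quot;,
    &quot;--filename=/dev/sdb&quot;,      # the worker test vmdk (placeholder)
    &quot;--direct=1&quot;,               # bypass the guest page cache
    &quot;--ioengine=libaio&quot;,
    &quot;--rw=randrw&quot;, &quot;--rwmixread=70&quot;,
    &quot;--bs=8k&quot;,
    &quot;--iodepth=8&quot;,
    &quot;--size=10%&quot;,               # touch only 10% of the disk = active WS
    &quot;--time_based&quot;, &quot;--ramp_time=900&quot;, &quot;--runtime=1800&quot;,
]
subprocess.run(cmd, check=True)
&lt;/code&gt;&lt;/pre&gt;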
  &lt;h2 id=&quot;PoUi&quot;&gt;&lt;strong&gt;Analyzing Test and Warm-up Duration Impact&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;ywEU&quot;&gt;Here are some examples of vSAN cluster performance behavior. For this test I’ve used a “Huge” 50% workload set for better visibility:&lt;/p&gt;
  &lt;figure id=&quot;U2DV&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/1f/a4/1fa4651d-9721-4b38-9ea0-db1da438d731.png&quot; width=&quot;1999&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance over time during random read testing.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;gkSh&quot;&gt;For the reads, we can see that the system achieves a steady state in a few minutes, and the difference is quite small. This happens because All-Flash vSAN does not have a sizable read cache (well, technically, it does have a small RAM cache, but it is minuscule compared to the “Huge” workload set of 50%, so the RAM cache hit ratio is close to zero, leading to no-cache warm-up behavior).&lt;/p&gt;
  &lt;figure id=&quot;k2lL&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/c7/ef/c7ef7b95-e746-45b9-b554-93f6db6bae04.png&quot; width=&quot;1638&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Performance over time during the seq write and mixed workload testing.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;k5mZ&quot;&gt;For the writes, the behavior changes dramatically. Because of the write buffer and destaging process, which I explained before, at first we fill the buffer (and achieve maximum performance), and, once the buffer is full, vSAN starts destaging. vSAN manages the destaging intensity over time, and its performance impact usually manifests itself within 1800–3600 seconds (30–60 minutes).&lt;/p&gt;
  &lt;p id=&quot;hNZq&quot;&gt;&lt;strong&gt;Conclusions About Test and Warm-up Duration&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;lmdR&quot;&gt;On the one hand, longer test and warm-up durations provide more confidence in the quality of the results:&lt;/p&gt;
  &lt;ul id=&quot;bb3q&quot;&gt;
    &lt;li id=&quot;z1GJ&quot;&gt;Impact and volatility vary significantly, depending on the test’s configuration, storage system, its settings, etc.&lt;/li&gt;
    &lt;li id=&quot;ZUA7&quot;&gt;The performance difference between the beginning of the test and its end can be significant.&lt;/li&gt;
    &lt;li id=&quot;3Zm5&quot;&gt;The longer you test, the more clearly you see any volatility in the 95th/98th percentiles.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;FKu3&quot;&gt;But if we run each individual test for many hours or days, the total time we need to invest becomes enormous. So, we need to find the right balance and ways to optimize:&lt;/p&gt;
  &lt;ul id=&quot;BC4j&quot;&gt;
    &lt;li id=&quot;63ZL&quot;&gt;Look through the data points of each test to evaluate its steadiness (see the sketch right after this list).&lt;/li&gt;
    &lt;li id=&quot;BYOe&quot;&gt;You can significantly decrease the test duration when you are running a set of related tests, for instance, the same workload but with a different number of outstanding IO. The only side effect is that the opening tests might not be representative because of the short warm-up, so you should include some “dummy tests” before recording the results.&lt;/li&gt;
    &lt;li id=&quot;kgIS&quot;&gt;You can use shorter tests (like 15–30 minutes) as warm-up pre-runs and, once you understand the needed parameters, proceed to the final long test and record its result.&lt;/li&gt;
  &lt;/ul&gt;
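  &lt;p id=&quot;stD1&quot;&gt;One way to automate that steadiness check: have FIO write an IOPS log (with --write_iops_log and --log_avg_msec) and compare the beginning and the end of the run. A rough sketch, assuming the usual comma-separated FIO log format where the second column is the sampled value; the 10% threshold is my own judgment call:&lt;/p&gt;
  &lt;pre id=&quot;stD2&quot;&gt;&lt;code&gt;
# Rough steady-state check on a FIO IOPS log (produced with
# --write_iops_log=test --log_avg_msec=1000, which yields test_iops.1.log).
# Compares the averages of the first and the last third of the run.
import csv

def end_to_start_ratio(log_path):
    values = []
    with open(log_path) as f:
        for row in csv.reader(f):
            values.append(float(row[1]))  # column 1 holds the IOPS sample
    third = len(values) // 3
    start = sum(values[:third]) / third
    end = sum(values[-third:]) / third
    return end / start                    # ~1.0 indicates a steady run

ratio = end_to_start_ratio(&quot;test_iops.1.log&quot;)
print(f&quot;end/start IOPS ratio: {ratio:.2f}&quot;)
if abs(ratio - 1.0) &gt; 0.1:                # threshold is a judgment call
    print(&quot;Not steady yet: increase the warm-up or test duration&quot;)
&lt;/code&gt;&lt;/pre&gt;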
  &lt;p id=&quot;SQZg&quot;&gt;The tests in this guide were run with 15 min of warm-up + 30 minutes of test or with 30 minutes of warm-up + 1-hour of test.&lt;/p&gt;
  &lt;h2 id=&quot;rN6H&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Testing the OIO Impact&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;y3U5&quot;&gt;Now we are approaching the main stage of testing: determining the performance of the storage system under the required workload profiles. To do this, we run tests at various OIO values, recording performance and latency. As a result, we get two graphs (in fact, the latency-vs-IOPS graph alone is enough, but I will keep both for clarity). Drawing a line that indicates our latency threshold (or the adequate maximum, if it stays below the threshold), we can see how many IOPS the system can deliver under the required conditions:&lt;/p&gt;
  &lt;figure id=&quot;I271&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/94/a2/94a21b00-4069-481a-92c4-896a86a6b156.png&quot; width=&quot;1784&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Absolute values of vSAN cluster performance at different OIOs under 8K mixed workload&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;vjuq&quot;&gt;For an 8K 70/30 workload profile, the six-node cluster could achieve ~730K IOPS at ~1.1ms latency, or ~115K IOPS per node, which is more than required.&lt;/p&gt;
  &lt;figure id=&quot;Mv8Z&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/d2/cb/d2cbeecb-d201-40ec-b5f6-ae902e3a9f8d.png&quot; width=&quot;1814&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Absolute values of vSAN cluster performance at different OIOs under 64K mixed workload&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;iwuh&quot;&gt;For a 64K 70/30 workload profile, the six-node cluster could achieve ~200K IOPS at ~1.5ms latency and ~220K at 2.8ms, i.e. ~35K IOPS per node, which is also more than required.&lt;/p&gt;
  &lt;p id=&quot;a7iC&quot;&gt;Additionally, we will conduct tests for all other workload profiles:&lt;/p&gt;
  &lt;figure id=&quot;cYkR&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/9d/4c/9d4c985e-d7bf-40f7-b0f0-5e90fdf798e0.png&quot; width=&quot;1720&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Absolute values of vSAN cluster performance at different OIOs under 4K read workload &lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;sMmc&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/c2/81/c28155d5-59a3-4ac2-be92-6e71070dbba7.png&quot; width=&quot;1636&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Absolute values of vSAN cluster performance at different OIOs under Seq Write workload &lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;GM3V&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/8b/d0/8bd02131-4342-4381-b83e-19c59ebcf2da.png&quot; width=&quot;1722&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Absolute values of vSAN cluster performance at different OIOs under Seq Read workload &lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;bNLb&quot;&gt;&lt;strong&gt;Conclusions about OIO&lt;/strong&gt;&lt;/p&gt;
  &lt;ol id=&quot;KsFd&quot;&gt;
    &lt;li id=&quot;Wno8&quot;&gt;In case you want to understand the storage potential, run the benchmark with different OIO values to get OIO/IOPS/latency dependency graphs (see the sketch after this list).&lt;/li&gt;
    &lt;li id=&quot;aOB8&quot;&gt;From the IOPS/Latency graphs get the value of the maximum IOPS achievable with the latency equal to or below your requirements.&lt;/li&gt;
    &lt;li id=&quot;jTaF&quot;&gt;You should do this for every workload profile or storage setting because the optimal OIO value will vary.&lt;/li&gt;
  &lt;/ol&gt;
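  &lt;p id=&quot;sw0a&quot;&gt;For those who want to automate such a sweep, here is a minimal sketch driving FIO directly (with HCIBench you would vary the same parameter in the workload profile instead). The JSON field names follow FIO’s output format; the device path, profile, and OIO values are placeholders:&lt;/p&gt;
  &lt;pre id=&quot;sw0b&quot;&gt;&lt;code&gt;
# Sweep the outstanding IO (iodepth) and collect total IOPS plus the p95
# read latency from FIO JSON output. Placeholder device and profile values.
import json, subprocess

for oio in [1, 2, 4, 8, 16, 32, 64]:
    out = subprocess.run(
        [&quot;fio&quot;, &quot;--name=oio_sweep&quot;, &quot;--filename=/dev/sdb&quot;,
         &quot;--direct=1&quot;, &quot;--ioengine=libaio&quot;,
         &quot;--rw=randrw&quot;, &quot;--rwmixread=70&quot;, &quot;--bs=8k&quot;,
         f&quot;--iodepth={oio}&quot;,
         &quot;--time_based&quot;, &quot;--ramp_time=900&quot;, &quot;--runtime=1800&quot;,
         &quot;--output-format=json&quot;],
        capture_output=True, text=True, check=True)
    job = json.loads(out.stdout)[&quot;jobs&quot;][0]
    iops = job[&quot;read&quot;][&quot;iops&quot;] + job[&quot;write&quot;][&quot;iops&quot;]
    p95_ms = job[&quot;read&quot;][&quot;clat_ns&quot;][&quot;percentile&quot;][&quot;95.000000&quot;] / 1e6
    print(f&quot;OIO={oio}: {iops:.0f} IOPS, read p95 {p95_ms:.2f} ms&quot;)
&lt;/code&gt;&lt;/pre&gt;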
  &lt;h2 id=&quot;ngH5&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;IOPS Limits.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;7zxM&quot;&gt;As an example, I will run tests with the optimal number of OIO but different values of the IO limit, which allows me to build a more accurate latency-vs-IOPS dependency and, at the same time, evaluate the range from almost zero to the maximum:&lt;/p&gt;
  &lt;figure id=&quot;B2MC&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/17/b4/17b477a2-8919-4aa6-b4e3-72217dd628de.png&quot; width=&quot;1806&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Latency and IOPS graph at various IOPS limits for random read workload.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;MOcc&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/70/7d/707d9262-c5f0-445e-b660-69dd5b13ec8f.png&quot; width=&quot;1794&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Latency and IOPS graph at various IOPS limits for mixed workloads.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;h2 id=&quot;vmwG&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Testing Results&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;dkAo&quot;&gt;Summarizing the intermediate results of the testing, we confirmed the viability of placing the required number of workloads on the vSAN cluster in the planned configuration in the normal operational state, provided the RAID 1 storage policy without Space Efficiency is used.&lt;/p&gt;
  &lt;p id=&quot;OMeL&quot;&gt;We also learned how each of the parameters can affect the test results, what you should pay attention to, and in which cases you can expect an additional drop in performance. Going further, we do not need to carry out all these tests (utilization, workload set, etc.); it will be enough to use the individual parameters we selected.&lt;/p&gt;
  &lt;p id=&quot;hgzU&quot;&gt;Now it&amp;#x27;s worth testing the required load profiles against the other storage policies to determine whether they can accommodate different workloads. But I will not include these tests in this guide for the sake of brevity, since the method is exactly the same.&lt;/p&gt;
  &lt;h2 id=&quot;WtCq&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Conclusion.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;qBFQ&quot;&gt;I hope I was able to show you how the general approaches outlined in the first part can be applied in practice, and to convince you that, by approaching the task consistently, you can conduct benchmarking with relative ease and, most importantly, get meaningful results.&lt;/p&gt;
  &lt;p id=&quot;ABse&quot;&gt;This material may be of particular interest to those of you who are evaluating or planning to use VMware vSAN in your infrastructure: this guide is based on a fairly popular hardware configuration that can be found in a large number of vSAN installations.&lt;/p&gt;
  &lt;p id=&quot;HRxI&quot;&gt;Thanks to those who found the time and energy to read such long articles. I invite everyone to the discussion: please share your own experiences and recommendations, so we can improve these articles together and bring more value to the IT pro community!&lt;/p&gt;
  &lt;p id=&quot;WCBR&quot;&gt;Happy Benchmarking!&lt;/p&gt;
  &lt;h2 id=&quot;cnxi&quot;&gt;&lt;strong&gt;Special Thanks.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;T3zA&quot;&gt;I would like to express my special thanks to &lt;a href=&quot;https://www.linkedin.com/in/adarchenkov/&quot; target=&quot;_blank&quot;&gt;Alexey Darchenkov&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/mikhailmikheev/&quot; target=&quot;_blank&quot;&gt;Mikhail Mikheev&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/ageniev/&quot; target=&quot;_blank&quot;&gt;Artem Geniev&lt;/a&gt; for a huge amount of their time, experience, competencies, and invaluable support while creating this guide. Without them it would not have been possible.&lt;/p&gt;
  &lt;p id=&quot;LTkJ&quot;&gt;I also express my gratitude to &lt;a href=&quot;https://www.asbis.com/&quot; target=&quot;_blank&quot;&gt;Asbis&lt;/a&gt; and especially Nikolay Neuchev for providing the vSAN demo stand and making these tests possible.&lt;/p&gt;
  &lt;h2 id=&quot;Ug4w&quot;&gt;&lt;strong&gt;Appendix A.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;vbTe&quot;&gt;The Hardware BOM:&lt;/p&gt;
  &lt;figure id=&quot;e0o8&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/50/07/5007a3c0-067b-4ee3-81f0-725e490af10e.png&quot; width=&quot;1333&quot; /&gt;
    &lt;figcaption&gt;Hardware Bill of Materials&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;IV4t&quot;&gt;The software BOM:&lt;/p&gt;
  &lt;ul id=&quot;UgXf&quot;&gt;
    &lt;li id=&quot;KwIN&quot;&gt;ESXi 7.0 Update 2a (build 17867351) + vCenter 7.0.2.00100 (build 17920168)&lt;/li&gt;
    &lt;li id=&quot;gZKA&quot;&gt;Drivers and Firmware — latest certified from VMware HCL in May 2021&lt;/li&gt;
    &lt;li id=&quot;TFWd&quot;&gt;HCIBench 2.5.3 with FIO&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;osEv&quot;&gt;The vSAN settings:&lt;/p&gt;
  &lt;ul id=&quot;bvAy&quot;&gt;
    &lt;li id=&quot;oYNo&quot;&gt;Storage Policy — FTT-1 Mirror&lt;/li&gt;
    &lt;li id=&quot;hDM9&quot;&gt;RDMA = off&lt;/li&gt;
    &lt;li id=&quot;xnK3&quot;&gt;Adaptive Resync — On&lt;/li&gt;
    &lt;li id=&quot;XbEd&quot;&gt;All other settings are left at their defaults.&lt;/li&gt;
  &lt;/ul&gt;

</content></entry><entry><id>nkulikov:StorageBenchmarkPart2</id><link rel="alternate" type="text/html" href="https://nkulikov.com/StorageBenchmarkPart2?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>Enterprise Storage Synthetic Benchmarking Guide and Best Practices. Part 2. Success Criteria and Benchmarking Parameters.</title><published>2022-04-27T14:54:21.539Z</published><updated>2022-04-27T18:06:13.196Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img1.teletype.in/files/c4/eb/c4ebcde7-bcbc-4758-9614-ea8256cac2e7.png"></media:thumbnail><category term="enterprise-storage-benchmarking-guide" label="Enterprise Storage Benchmarking Guide"></category><summary type="html">&lt;img src=&quot;https://img2.teletype.in/files/19/93/199307b3-9c8d-4c62-b1d2-1e207e4ea45c.png&quot;&gt;As I mentioned in the first part of the guide, defining the goals and success criteria is a mandatory step for a successful POC. Of course, success criteria will vary, but they all have something in common — they should follow the S.M.A.R.T. rules:</summary><content type="html">
  &lt;nav&gt;
    &lt;ul&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#o8E4&quot;&gt;Define the Success Criteria for the Synthetic Benchmark.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#uUjf&quot;&gt;Initial data and inputs.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#5cwt&quot;&gt;Collecting Metrics from Existing Systems.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#1WEw&quot;&gt;The List of Parameters to Be Defined.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#2hsB&quot;&gt;Workload profile&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#ohas&quot;&gt;Worker VMs&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#c4HJ&quot;&gt;Size Of the Virtual Disk and the Overall Capacity Utilization.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#YkHp&quot;&gt;Active Workload Set&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#ivtF&quot;&gt;Test and Warm-Up Durations.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#o8Ml&quot;&gt;Outstanding IO or iodepth (OIO).&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#O8ZD&quot;&gt;IOPS Limits.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#1ROi&quot;&gt;Conclusion. &lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/nav&gt;
  &lt;hr /&gt;
  &lt;h2 id=&quot;o8E4&quot;&gt;Define the Success Criteria for the Synthetic Benchmark.&lt;/h2&gt;
  &lt;p id=&quot;8Mtd&quot;&gt;As I mentioned in &lt;a href=&quot;https://nkulikov.com/StorageBenchmarkPart1&quot; target=&quot;_blank&quot;&gt;the first part&lt;/a&gt; of the guide, defining the goals and success criteria is a mandatory step for a successful POC. Of course, success criteria will vary, but they all have something in common — they should follow the S.M.A.R.T. rules:&lt;/p&gt;
  &lt;ul id=&quot;DHA6&quot;&gt;
    &lt;li id=&quot;ginE&quot;&gt;Specific — the goal should be specific and clear for everyone involved in testing. There should be no misaligned or ambiguous interpretations or “it goes without saying”.&lt;/li&gt;
    &lt;li id=&quot;lPDl&quot;&gt;Measurable — this means there is a metric or number that can unequivocally decide pass/fail, or that result A is better than B.&lt;/li&gt;
    &lt;li id=&quot;AkKF&quot;&gt;Attainable — the requirements should be realistic and achievable (at least in theory) by every participant. Otherwise, there is no reason to spend time on it.&lt;/li&gt;
    &lt;li id=&quot;tO3S&quot;&gt;Relevant — it is the most important item here. You should be able to clarify and very clearly show why such goals and metrics were chosen. How are they linked to the business needs? Why exactly this metric or number and not another? And be careful with references to the “industry standards,” “this is how we did it previously,” “experience in other companies.” It is too easy to make a mistake and lose relevance to your specific business needs.&lt;/li&gt;
    &lt;li id=&quot;J3zm&quot;&gt;Time-bound — your POC plan should have explicitly and upfront defined timelines and deadlines.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;V2cJ&quot;&gt;&lt;strong&gt;The goals, as well as a detailed benchmark program (that describes exactly how the tests will be done), should be written BEFORE the beginning of the POC and approved/confirmed by every participant.&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;I6Vf&quot;&gt;I also recommend sending the draft of the program to the participants: they can often provide useful advice, recommendations, or comments, and it gets them prepared. Likewise, stay in touch with vendors/partners during the benchmarking, especially if there are issues or the system cannot achieve the goals. Often, such issues can be solved simply with proper setup and tuning.&lt;/p&gt;
  &lt;p id=&quot;fGDl&quot;&gt;A simplified example of the success criteria could look like this: “The latency in guest OSes running on the storage system under normal conditions and fully operational state should be less than 1.5ms at 95th percentile while providing 100K IOPS with profile 8K 70/30 Read/Write 90% Random. The workload is generated by 12 VMs with 5 virtual disks in each, storage utilization (total amount of data stored by these VMs) should be 70% of the useful storage capacity, active workload set size is 10%, test duration 2 hours after 1 hour of the warm-up”.&lt;/p&gt;
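  &lt;p id=&quot;crT1&quot;&gt;To make this less abstract, such criteria translate almost one-to-one into benchmark parameters. Here is a hedged sketch of the corresponding FIO options (FIO expresses the randomness share via percentage_random; the fan-out of 12 VMs with 5 disks each and the 70% capacity preparation are handled by the deployment tool, e.g. HCIBench):&lt;/p&gt;
  &lt;pre id=&quot;crT2&quot;&gt;&lt;code&gt;
# The simplified success criteria from above, expressed as FIO options.
# A sketch only: the worker fan-out and capacity preparation are handled
# outside of this job definition.
fio_options = {
    &quot;bs&quot;: &quot;8k&quot;,               # 8K blocks
    &quot;rw&quot;: &quot;randrw&quot;,           # mixed random read/write
    &quot;rwmixread&quot;: 70,          # 70/30 read/write ratio
    &quot;percentage_random&quot;: 90,  # 90% random, 10% sequential
    &quot;size&quot;: &quot;10%&quot;,            # active workload set of 10%
    &quot;ramp_time&quot;: 3600,        # 1 hour of warm-up (results discarded)
    &quot;runtime&quot;: 7200,          # 2 hours of measured test
    &quot;time_based&quot;: 1,
}
# Pass/fail: total IOPS &gt;= 100K while p95 latency stays below 1.5 ms.
&lt;/code&gt;&lt;/pre&gt;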
  &lt;h2 id=&quot;uUjf&quot;&gt;&lt;strong&gt;Initial data and inputs.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;hSth&quot;&gt;The first and the biggest challenge of the benchmarking that appears at the preparation phase is to gather data and the metrics to specify the goals. In the above example there were a lot of numbers, like workload profile, workload sets, required response time, etc. The question is “How, Where, and from Whom can I get all of these numbers?”&lt;/p&gt;
  &lt;p id=&quot;n2dE&quot;&gt;In your specific situation, you can explore various ways to collect the data, but the most common are:&lt;/p&gt;
  &lt;ol id=&quot;HsNu&quot;&gt;
    &lt;li id=&quot;M0Jr&quot;&gt;Get the data directly from the application owners. In theory, they know the best about their app’s requirements. Practically, it is quite rare for the application team to be that much aware and be able to provide infrastructure-related metrics. But at the same time, they have other extremely valuable information - peak periods, components and processes that have a key impact on business users, response time objectives, types of services that run inside the application, etc.&lt;/li&gt;
    &lt;li id=&quot;M14C&quot;&gt;Profiling. We can describe the required business services and deconstruct them down to the elements they consist of. Then pick the most critical and/or heavily loaded components and conduct the in-depth analysis of such. We do that by collecting the workload profile stats related to those components. At this point, you should get an accurate application load profile, response time objectives, and an estimate of performance requirements (which can either be equal to the current load or scaled up). This way we can reproduce the workload on the tested system as close as possible. The results will indicate if a particular service is able to run on the tested system or not (from the performance point of view). Having done with one service we will then move to the next one. Just mind the possibility of workload interference if your storage is shared amongst multiple services and explore the options to neutralize or minimize the potential impact in advance.&lt;/li&gt;
    &lt;li id=&quot;BWey&quot;&gt;Averaging and generalization. With this approach, we look at the problem from the other side — apps and services are the “black boxes” which create the load onto your storage infrastructure. They show the overall load profile that comes to the storage system from applications as well as current average response time and performance. We can analyze the workload profile and patterns from the storage point of view and try to reproduce the load. Just be aware that it will be harder to notice finer performance issues hidden within this mass, while those can have a significant impact on a particular service.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;aY6o&quot;&gt;I recommend combining these approaches for achieving the best results. It makes sense to join forces with app teams to analyze separately the Business-critical apps and services consuming a major part of the storage infrastructure. And augment this analysis with the more generic infrastructure tests if there is a possibility to do so.&lt;/p&gt;
  &lt;p id=&quot;3xZe&quot;&gt;Based on my experience, I would recommend the following course of action to you:&lt;/p&gt;
  &lt;ol id=&quot;g8xH&quot;&gt;
    &lt;li id=&quot;2B7m&quot;&gt;Define the list of most business-critical workloads, services and apps that are planned to be deployed on the storage system.&lt;/li&gt;
    &lt;li id=&quot;pxjk&quot;&gt;Have a discussion with the applications’ owners on apps’ components, specifics, key requirements and what you should pay special attention to.&lt;/li&gt;
    &lt;li id=&quot;Cmwt&quot;&gt;Accurately analyze the load profile of these applications in low-level details.&lt;/li&gt;
    &lt;li id=&quot;oULL&quot;&gt;Analyze overall metrics from the infrastructure.&lt;/li&gt;
    &lt;li id=&quot;4FoX&quot;&gt;Create a table showing the workload profile for every business-critical app and for infrastructure overall. &lt;/li&gt;
    &lt;li id=&quot;4JUv&quot;&gt;Specify the success criteria for each of them (which usually contain acceptable response time and required performance in IOPS).&lt;/li&gt;
    &lt;li id=&quot;WxMs&quot;&gt;Agree on the criteria with application owners and vendors.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;eLeZ&quot;&gt;Once the workload profile is captured there are several ways to reproduce it. Subject to your goals and objectives, one option is to generate exactly the same amount of load (bandwidth, IOPS with the same block size) and measure the response time to confirm if it is below the required level. This is an effective way for constant and stable infrastructure. Alternatively, you can measure the maximum load that the system can handle. You do so by setting the response time goal and gradually increasing the workload corresponding to the production one till the latency hits the threshold.&lt;/p&gt;
  &lt;p id=&quot;pst4&quot;&gt;Finally, it is also important to analyze a reasonable worst-case scenario. The most obvious example is to test storage in a degraded state. We all know that disks and nodes will fail during operations, and this leads to performance degradation and rebuilds. Another example would be massive trim/unmap requests due to the bulk deletion of data/VMs, or dedup metadata recalculation, or low-level block management, etc. The idea here is to ensure that the system can always provide the required performance, even in a partially degraded or impaired state. But keep it reasonable: in most situations there is little sense in testing the absolutely worst-case scenario where everything blows up and all the bad things happen simultaneously (unless it is a mission-critical nuclear reactor safety system or something like that).&lt;/p&gt;
  &lt;h2 id=&quot;5cwt&quot;&gt;&lt;strong&gt;Collecting Metrics from Existing Systems.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;Glrc&quot;&gt;Ok, so you have chosen the benchmarking approach, and defined which systems and components to analyze. Now you need to collect the data and metrics from these components and infrastructure to create workload profiles, so let’s discuss the popular options. For my example I am going to leverage VMware virtualization platform, however most tools are cross-platform or have an equivalent for other platforms, so the approach will be remarkably similar. I will start from the high-level data and then move to low-level statistics.&lt;/p&gt;
  &lt;p id=&quot;HnKK&quot;&gt;First of all, investigate existing monitoring dashboards. Probably there is some amount of required and important data there. Discuss with application owners and infrastructure teams the metrics they pay attention to and that are critical for them. But highly likely, you’ll have to augment that data using additional sources.&lt;/p&gt;
  &lt;p id=&quot;3y3p&quot;&gt;Various infrastructure assessment tools can serve as such additional data sources. One of the fastest and easiest assessment tools that still provides most of the required metrics is Live Optics. I frequently use Live Optics as it is a free and cross-platform tool. It collects the required metrics for 1 to 7 days with absolutely minimal configuration effort. Just download a portable app, run it on any machine that has access to the infrastructure, provide read-only credentials, and select the infrastructure segment and the analysis duration. That’s it. When it is done, you download the report directly from the cloud portal. If your environment is air-gapped and the Live Optics machine has no Internet access, you can manually upload the source data into the portal. The main benefits of Live Optics are ease of use, a good enough selection of metrics (including averages and peaks), and a configurable analysis duration (up to one week). On the other hand, it is not customizable, and 7 days are sometimes not enough to capture possible spikes or peaks. &lt;/p&gt;
  &lt;figure id=&quot;CXDN&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/19/93/199307b3-9c8d-4c62-b1d2-1e207e4ea45c.png&quot; width=&quot;1999&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Example of the Live Optics Summary Report. &lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;TS4L&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/87/71/8771264c-66f7-4171-b985-f37a363ffda6.png&quot; width=&quot;1184&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Example of Live Optics Block Size Chart.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;s6Fe&quot;&gt;If you need a more flexible tool, you can leverage custom dashboards or reports in your existing monitoring system. It is great if you already have most of the historical values available in your monitoring system, but if not — you can install it fresh and start analyzing the data. One of the commonly used monitoring solutions for the VMware Software-Defined Datacenter is vRealize Operations (vRops). By default it stores most of the required data for a long time, but it lacks out-of-the-box dashboards and reports that immediately fit the storage benchmarking purposes. However, it is not a big deal to create custom dashboards and reports with the required metrics in vRops. You can also create filters for a specific service or a subset of VMs. Just be aware that by default vRops averages the metric values collected at 20-second intervals into a single five-minute data point, so it is sometimes hard to discover short spikes and peaks. This, however, does not create any major challenges for a longer-term analysis.&lt;/p&gt;
  &lt;p id=&quot;G7hb&quot;&gt;Ideally, you need both high-resolution metrics (about 20-60 seconds), at least for a few days or weeks, as well as a large amount of historical data, which should take into account periodic peaks (closing quarters, seasonal sales, regular reports, etc.). If it is difficult to obtain both at the same time, then a compromise can be found in the form of metrics of lower resolution (for example, 1-5 minutes), but for an extended period such as a year.&lt;/p&gt;
  &lt;figure id=&quot;V2c6&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img4.teletype.in/files/34/01/340139f5-572c-4809-9dc8-f02a550559dd.png&quot; width=&quot;1130&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Example of the Custom Metrics Chart in vRops.&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;Qg2m&quot;&gt;Finally, if averages are not good enough for your purposes, you can use traces. For the VMware platform, there is a great tool called vscsiStats. It collects data at the vSCSI device level inside the ESXi kernel and provides a lot of the data required for storage profiling. Examples include the distribution of block sizes (not averaged but the actual split), seek distance, outstanding IO, latencies, read/write ratio, and so on. More detailed information can be found in this article: &lt;a href=&quot;https://communities.vmware.com/t5/Storage-Performance/Using-vscsiStats-for-Storage-Performance-Analysis/ta-p/2796805&quot; target=&quot;_blank&quot;&gt;Using vscsiStats for Storage Performance Analysis — VMware Technology Network VMTN&lt;/a&gt;. For vscsiStats profiling, you select a VM or VMs and start collecting metrics. By default, vscsiStats collects data points over 30 minutes and saves the data into a file for further analysis. However, keep in mind that this tool also has some drawbacks that limit its usage: it might itself impact the performance, data collection has to be started on each host, vMotion of VMs hinders the collection, and it is short-term only. &lt;/p&gt;
  &lt;p id=&quot;tXbM&quot;&gt;If you are using VMware vSAN, there is a tool that simplifies the collection of vscsiStats: vSAN IOInsight. It collects the same metrics and shows the results in vSphere Client, so it is just an easier way to collect traces. Here is an example of metrics from vSAN IOInsight:&lt;/p&gt;
  &lt;figure id=&quot;QPT6&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/91/5c/915ccd04-b970-4d3a-a7bd-22a90099fc51.png&quot; width=&quot;1448&quot; /&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;mdmx&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/94/81/94816bfe-0ac7-4396-beab-1a1b96ad843b.png&quot; width=&quot;1448&quot; /&gt;
    &lt;figcaption&gt;Example of vSAN IOInsight Report&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ieNg&quot;&gt;As an example of a similar tool for Linux platforms, there is bpftrace, which can provide the storage metrics at the guest OS level (take a look at the bpftrace utilities &lt;a href=&quot;https://github.com/iovisor/bpftrace/blob/master/tools/biolatency_example.txt&quot; target=&quot;_blank&quot;&gt;biolatency&lt;/a&gt;,&lt;a href=&quot;https://github.com/iovisor/bpftrace/blob/master/tools/bitesize_example.txt&quot; target=&quot;_blank&quot;&gt; bitesize&lt;/a&gt;, and others).&lt;/p&gt;
  &lt;p id=&quot;3N4o&quot;&gt;For the most holistic evaluation, you can combine different methods and tools to achieve a comprehensive assessment of the workload. Use the monitoring tool to analyze the long-term performance requirements and identify the most loaded components. And then analyze their workload profiles with vSCSI stats tracing tools.&lt;/p&gt;
  &lt;h2 id=&quot;1WEw&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;The List of Parameters to Be Defined.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;qX4f&quot;&gt;To run the tests, you must enter several parameters into the benchmarking tool, each of those can have a serious impact on the behavior of the storage system. &lt;/p&gt;
  &lt;p id=&quot;LXcn&quot;&gt;The main principle of the valid synthetic benchmarking — &lt;strong&gt;EVERY variable/parameter/number:&lt;/strong&gt;&lt;/p&gt;
  &lt;ol id=&quot;gb2S&quot;&gt;
    &lt;li id=&quot;n834&quot;&gt;Must be explained and understood, there must be a clear understanding why exactly this number is chosen.&lt;/li&gt;
    &lt;li id=&quot;IrRv&quot;&gt;Must be written and documented.&lt;/li&gt;
    &lt;li id=&quot;rzCp&quot;&gt;Must be agreed with every POC participant.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;kPWM&quot;&gt;The easiest way to self-check is to ask yourself: “Why did I choose this number and not another?”. If the answer is “I took this number from the assessment of the existing infrastructure” or “This is the requirement from the business/application owners” or “This is a recommendation from Best Practice and it&amp;#x27;s suitable for our future use” then it’s ok. If there is no answer or “some guy on the Internet told me it’s right” or “I saw this number in someone else’s benchmark” please pause and re-consider. Think about what this parameter really means and how you can define its value.&lt;/p&gt;
  &lt;p id=&quot;UaFY&quot;&gt;&lt;strong&gt;Let’s define the list of the general variables you should define based on the HCIBench example:&lt;/strong&gt;&lt;/p&gt;
  &lt;ol id=&quot;MDlZ&quot;&gt;
    &lt;li id=&quot;TaxG&quot;&gt;Workload profile:&lt;/li&gt;
    &lt;ol id=&quot;N9VX&quot;&gt;
      &lt;li id=&quot;IQEy&quot;&gt;Block Size&lt;/li&gt;
      &lt;li id=&quot;pkld&quot;&gt;Read/Write Ratio&lt;/li&gt;
      &lt;li id=&quot;F0Eo&quot;&gt;Random/Sequential Ratio&lt;/li&gt;
    &lt;/ol&gt;
    &lt;li id=&quot;411b&quot;&gt;A number of workers VMs and their configuration.&lt;/li&gt;
    &lt;li id=&quot;7op0&quot;&gt;Size of the virtual disks&lt;/li&gt;
    &lt;li id=&quot;swjE&quot;&gt;Active Working Set&lt;/li&gt;
    &lt;li id=&quot;X3Wa&quot;&gt;Test duration and Warm-Up Time&lt;/li&gt;
    &lt;li id=&quot;h8TB&quot;&gt;Number of Threads/Outstanding IO&lt;/li&gt;
    &lt;li id=&quot;WCrQ&quot;&gt;IO Limit&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;fTGh&quot;&gt;Let&amp;#x27;s go one by one explaining what they mean and how to define them. &lt;/p&gt;
  &lt;h3 id=&quot;2hsB&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Workload profile&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;iVgX&quot;&gt;Workload profile defines the specific pattern(s) of IO requests sent to the storage system. It has the following dimensions:&lt;/p&gt;
  &lt;ul id=&quot;DKJd&quot;&gt;
    &lt;li id=&quot;UIyw&quot;&gt;Block size. It is the size (typically in KB or MB) of read and/or write request sent from Guest OS/Application to the storage.&lt;/li&gt;
    &lt;li id=&quot;Suv2&quot;&gt;Read/Write Ratio. It is the ratio of read operations to write operations.&lt;/li&gt;
    &lt;li id=&quot;idAz&quot;&gt;Random/Sequential Ratio. Sequential IO is a sequence of requests accessing data blocks that are adjacent to each other. On the contrary, Random IOs are requests to unrelated blocks each time (with greater distance between them). Good illustration is &lt;a href=&quot;http://en.wikipedia.org/wiki/File:Random_vs_sequential_access.svg&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;DJ6S&quot;&gt;Every metrics-collection tool I described above will provide you with the average block size and read/write ratio. &lt;strong&gt;So, you just need to investigate it, and you will get the average numbers outlining your infrastructure.&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;s6qg&quot;&gt;It is a bit harder to assess the requests’ “randomness,” since this requires tracking LBA addresses of every request to determine seek distance. And it’s possible only with tracing tools like vSCSI stats or bpftrace. However, if you are running a virtualized or mixed environment, the IO requests will be of mostly random nature (exceptions to this rule include backup jobs, copy, upload operations and so on, but they rarely directly affect business processes). &lt;strong&gt;Hence, for such environments it makes a lot of sense to run most of the tests at 90-100% of Random IOs.&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;hPuW&quot;&gt;Most often averaged numbers are used to describe the workload profile. However, I strongly recommend analyzing &lt;strong&gt;and reproducing the exact block size distributions.&lt;/strong&gt; Because in real life it’s always a non-uniform mix of blocks with different sizes. Here is just an IO pattern example, captured with vscsiStats for a real-world application:&lt;/p&gt;
  &lt;figure id=&quot;i849&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/2e/c5/2ec5605a-b385-4d17-bdb6-c3ed224cdcbc.png&quot; width=&quot;1911&quot; /&gt;
  &lt;/figure&gt;
  &lt;figure id=&quot;l9KY&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/55/0a/550a242f-3ce9-4b67-8687-4ae9f096dd5c.png&quot; width=&quot;1928&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Application’s IO block size distribution pattern in bytes captured with vscsiStats&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;ct6s&quot;&gt;In this example, most blocks are 8K and 16K for reads, and most of the writes are 4K and 8K, but there are also some ~128K and 512K blocks. This small number of huge blocks can make a significant difference, because they fill up the storage write buffers and consume bandwidth, preventing the fast processing of small blocks. Therefore, even though the most popular write block size here is 4–8K and the average is ~30KB, it is incorrect to run a benchmark with 30K blocks, let alone 4–8K blocks.&lt;/p&gt;
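  &lt;p id=&quot;bsP1&quot;&gt;FIO can reproduce such a non-uniform mix directly via its bssplit option (vdbench has a similar distribution mechanism). A hedged sketch follows; the percentages are illustrative placeholders, not the exact distribution from the chart above:&lt;/p&gt;
  &lt;pre id=&quot;bsP2&quot;&gt;&lt;code&gt;
# Reproducing a block-size distribution with FIO bssplit
# (format: blocksize/percentage entries separated by colons;
# read and write distributions are separated by a comma).
# All percentages below are illustrative placeholders.
read_split = &quot;8k/45:16k/35:64k/10:128k/7:512k/3&quot;
write_split = &quot;4k/40:8k/35:16k/15:128k/7:512k/3&quot;
fio_args = [
    &quot;fio&quot;, &quot;--name=distribution&quot;, &quot;--filename=/dev/sdb&quot;,
    &quot;--direct=1&quot;, &quot;--ioengine=libaio&quot;,
    &quot;--rw=randrw&quot;, &quot;--rwmixread=70&quot;,
    f&quot;--bssplit={read_split},{write_split}&quot;,  # replaces a single --bs value
    &quot;--iodepth=16&quot;, &quot;--time_based&quot;, &quot;--ramp_time=900&quot;, &quot;--runtime=1800&quot;,
]
&lt;/code&gt;&lt;/pre&gt;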
  &lt;h3 id=&quot;ohas&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Worker VMs&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;ycNM&quot;&gt;You also should define the number of worker VMs and their configurations, especially the number of virtual disks, number of vCPU, and vRAM amount per worker.&lt;/p&gt;
  &lt;p id=&quot;vOeB&quot;&gt;The choice of vCPU and vRAM per worker is quite straightforward. &lt;strong&gt;The idea is to avoid the bottlenecks within the workers themselves and at the virtualization platform level, thus making sure that the benchmarking results are isolated to the storage system only.&lt;/strong&gt; So, you should just verify there is no 90–100% CPU utilization inside the workers and the platform. In ordinary cases (when it is not one worker rushing the whole storage but a lot of them), &lt;strong&gt;4 vCPU/8GB RAM or 8vCPU/16GB RAM are ok.&lt;/strong&gt; The most indicative test to check for the reasonable CPU utilization is 4K 100% Read. Immediately after the deployment, you can run this test to verify that you have not hit the compute limits. But even after, you should periodically check the utilization during the tests.&lt;/p&gt;
  &lt;p id=&quot;z5pN&quot;&gt;The number of VMs and the number of virtual disks (vmdks) is a little bit more complex. &lt;strong&gt;You should analyze the rough ratio of vmdks per host in your environment and replicate it as closely as possible during the benchmarking.&lt;/strong&gt; By placing multiple vmdks, we avoid a potential bottleneck in the vSCSI adapter and balance the load. Using only one vmdk per worker VM would dramatically increase the number of VMs, leading to high vCPU oversubscription ratios and hiding the true storage system performance behind the stressed platform. This may be a challenge when your physical hosts lack resources. That said, it makes sense to use several vmdks per worker VM. &lt;strong&gt;A typical number of workers is 2–8 VMs per host and 2–10 vmdks per VM&lt;/strong&gt; (which covers most cases in the range of 4–80 vmdks per host), but it is always better to validate the numbers yourself in your own environment. &lt;/p&gt;
  &lt;h3 id=&quot;c4HJ&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Size Of the Virtual Disk and the Overall Capacity Utilization.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;Og2f&quot;&gt;&lt;strong&gt;The size of the vmdk defines how much data you STORE&lt;/strong&gt;. Total capacity is defined as the number of VMs multiplied by the number of vmdks per VM multiplied by the size of each vmdk. It has nothing to do with data access. For example, it could be a cold archive (write once, read almost never), but you still store it. How do you define the total capacity required? In case you have a scaled-down system for the test, it should be defined as a percentage of the total capacity. &lt;strong&gt;This number should be aligned with your future storage environment.&lt;/strong&gt; For example, if you need to store 100TB of usable capacity and the storage system vendor recommends not exceeding 80% utilization because of the performance impact, you need to purchase 125TB of usable capacity (after raid/spares/system/etc). So, if a vendor provides you a 50TB system for a test, you should fill it to the 80% level by uploading 40TB of data. &lt;/p&gt;
  &lt;figure id=&quot;crTD&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/a1/43/a1434b93-d277-48c1-aa9d-7e7aad0fca0f.png&quot; width=&quot;974&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Various levels of vSanDatastore utilization&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;HGtL&quot;&gt;Also, the critical part is &lt;strong&gt;to enable the PREPARATION of this capacity.&lt;/strong&gt; This means you need to write down the whole amount of data before starting any test. I always recommend running the preparation with random data, not zeros to avoid undesirable specific behavior storage might have for the processing of bulk zeros (e.g. deduplication, compression, or zero-detection).&lt;/p&gt;
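  &lt;p id=&quot;ppR1&quot;&gt;In FIO terms, such preparation is simply a sequential write pass over the whole device with incompressible data (this is what HCIBench’s disk preparation option with a random fill automates). A minimal sketch; the device path is a placeholder:&lt;/p&gt;
  &lt;pre id=&quot;ppR2&quot;&gt;&lt;code&gt;
# Preconditioning pass: fill the whole device once with random,
# incompressible data before any measured test runs.
import subprocess

subprocess.run([
    &quot;fio&quot;, &quot;--name=prepare&quot;, &quot;--filename=/dev/sdb&quot;,  # placeholder device
    &quot;--direct=1&quot;, &quot;--ioengine=libaio&quot;,
    &quot;--rw=write&quot;, &quot;--bs=1m&quot;,   # large sequential writes fill the fastest
    &quot;--iodepth=8&quot;,
    &quot;--refill_buffers&quot;,        # fresh random payload for every write
    &quot;--size=100%&quot;,
], check=True)
&lt;/code&gt;&lt;/pre&gt;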
  &lt;p id=&quot;rXBQ&quot;&gt;It is quite interesting to analyze the impact of high-capacity utilization, especially when you are evaluating the worst-case scenarios too. In some storage systems, a high-capacity utilization can dramatically impact the performance due to internal processes. And the unexpected storage space overconsumption can happen due to an operator’s mistake or poor planning or other unforeseen circumstances. Thus, it makes sense to test not only with the planned/expected utilization but also with the higher numbers to understand the behavior, risks, and the potential impact.&lt;/p&gt;
  &lt;h3 id=&quot;YkHp&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Active Workload Set&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;giVM&quot;&gt;Active Workload Set or just Workload Set (WS) sometimes confuses people. &lt;strong&gt;To make it clear — it is an amount of data you ACCESS or in other words actively working with. &lt;/strong&gt;This heatmap is good example to illustrate an active workload set vs the total capacity:&lt;/p&gt;
  &lt;figure id=&quot;4wiE&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/17/89/1789246a-bca7-4fa9-b061-a4d5c404f916.png&quot; width=&quot;922&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Illustration of blocks of data with active access&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;2FCK&quot;&gt;Here the total amount of bricks illustrates the amount of data you store, and their color indicates the amount of data you actively work with. For example, if you evenly process every GB you have written, the active workload set will be equal to the total capacity and every brick should be marked in red.&lt;/p&gt;
  &lt;p id=&quot;XSQi&quot;&gt;In fact, this almost never happens. Even if most of the data is analyzed, it is not done at the same time — one day we are working with one block of data, and with another the next day. Also, most of the data is rarely touched at all. Just imagine the system disk of any OS: most of the data, files, and packages are used extremely rarely, while others are processed at every reboot.&lt;/p&gt;
  &lt;p id=&quot;t40K&quot;&gt;The problem is how to analyze the amount of active workload set as the percentage of the usable capacity. And it is one of the most challenging metrics to analyze in your existing infrastructure. Some options we have:&lt;/p&gt;
  &lt;ol id=&quot;JeZV&quot;&gt;
    &lt;li id=&quot;3iAU&quot;&gt;&lt;strong&gt;Daily writes in TB.&lt;/strong&gt; It is quite easy to acquire (write performance in MB/s multiplied by the duration of the writes, or more strictly speaking, the integral of the write throughput over time). The problem here is that we can’t separate “net new writes” from “existing records re-writes.” For example, if an application constantly rewrites the same 10GB in a loop for an entire day, we will see a lot of daily writes, but the true active workload set is only 10GB. The second issue: it does not account for reads and the amount of data read. But still, it is the easiest way to somehow evaluate the active workload set.&lt;/li&gt;
    &lt;li id=&quot;fFkn&quot;&gt;The second option is to analyze the &lt;strong&gt;size of the incremental backup.&lt;/strong&gt; It is also easy to do (of course you make backups, right?) and provides the real amount of data changed. But still, there is no accounting for reads.&lt;/li&gt;
    &lt;li id=&quot;R7EW&quot;&gt;The third way is to&lt;strong&gt; examine your existing storage system.&lt;/strong&gt; Almost every storage has a write buffer and a read cache. You know the size of both and by looking into the destaging activity (data eviction from the write buffer to the next tier), read cache miss ratio as well as how full they are you can more or less accurately estimate the active workload size.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;aGQF&quot;&gt;Speaking about averages and industry guidelines (Yes, I remember that I asked not to rely on “industry averages,” but sometimes it is really hard to avoid this), you will see quite close &lt;strong&gt;estimates — the workload set is 5–15% of the total capacity.&lt;/strong&gt; VMware &lt;a href=&quot;https://core.vmware.com/resource/vmware-vsan-design-guide#sec6843-sub2&quot; target=&quot;_blank&quot;&gt;points&lt;/a&gt; to ~10%. In case you want to test&lt;strong&gt; the worst-case scenario, 30% is a reasonable&lt;/strong&gt; number that well overlaps the regular ~10%.&lt;/p&gt;
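  &lt;p id=&quot;wsE1&quot;&gt;As a quick plausibility check, the first two options from the list above boil down to simple arithmetic. A sketch with made-up numbers; substitute the figures from your own monitoring and backup systems:&lt;/p&gt;
  &lt;pre id=&quot;wsE2&quot;&gt;&lt;code&gt;
# Rough active-workload-set estimates from daily change metrics.
# All inputs are illustrative placeholders.
usable_capacity_tb = 100.0
daily_writes_tb = 9.0        # integral of write throughput over one day
incremental_backup_tb = 6.0  # real changed data (rewrites deduplicated)

# Daily writes overestimate the set (rewrites are counted repeatedly and
# reads are ignored); the backup increment is a lower bound for the
# written part of the workload set.
for name, tb in [(&quot;daily writes&quot;, daily_writes_tb),
                 (&quot;backup increment&quot;, incremental_backup_tb)]:
    print(f&quot;{name}: ~{tb / usable_capacity_tb:.0%} of capacity&quot;)
# Both estimates land near the 5–15% range quoted above.
&lt;/code&gt;&lt;/pre&gt;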
  &lt;h3 id=&quot;ivtF&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Test and Warm-Up Durations.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;r44c&quot;&gt;During the tests there are processes that lead to performance variations. Most of them are caused by the effect of various buffers, caches, and tiers. Because enterprise workloads generate continuous load to the storage system it is important to measure not the transient but the steady-state performance.&lt;/p&gt;
  &lt;p id=&quot;RnPC&quot;&gt;&lt;strong&gt;Warm-up helps eliminate initial transitional processes and to enter a steady state.&lt;/strong&gt; It is just a period of time when workers process the requested workload, but the results are not captured. The warm-up duration should not be less than it takes for the transitional processes to settle down.&lt;/p&gt;
  &lt;p id=&quot;dT6o&quot;&gt;&lt;strong&gt;The duration of the test should be sufficient to analyze the behavior of the storage system in the long run.&lt;/strong&gt; It is important to measure how &lt;strong&gt;consistent the results are and, in particular, the response time.&lt;/strong&gt; Steady-state response time is a critical metric. Imagine that your app is enjoying a 1ms response time for the first 30 minutes, then for the next 5 minutes it goes up to 30ms, returning to 1ms once the buffers are flushed. The average latency will look fine, but these spikes should be discovered. Thus, it makes sense to measure not only the average latency but also its percentiles (there is a great post about this from Dynatrace: &lt;a href=&quot;https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/&quot; target=&quot;_blank&quot;&gt;Why averages suck and percentiles are great | Dynatrace news&lt;/a&gt;). You can pick your own percentile to measure, but the most common are the 90th, 95th, or 98th. The higher the percentile you set, the more stable and constant latency you will get. On the other hand, with a very high percentile, the result can be dominated by rare outliers, potentially giving you a noisier, less representative number.&lt;/p&gt;
  &lt;p id=&quot;Wja4&quot;&gt;So how do you determine sufficient warm-up and test durations? &lt;strong&gt;Look at the data points in the FIO/vdbench report.&lt;/strong&gt; They not only show the final results but can also record the intermediate values during the test. &lt;strong&gt;Look into them, and if you notice a significant difference over the test duration, then the tests are not performed in a steady state, and you should increase the warm-up duration.&lt;/strong&gt; Test duration can be set twice as long as the warm-up duration to be sure that you catch the impact of any extra internal operations of the storage system under test. Most likely, if the initial transitional processes take X time to settle down once the tests start, those extra internal processes should reveal their impact earlier on the loaded system.&lt;/p&gt;
  &lt;h3 id=&quot;o8Ml&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Outstanding IO or iodepth (OIO).&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;HPBS&quot;&gt;Outstanding IO (OIO) or iodepth indicates in a nutshell the degree of parallelism. The more OIO are configured the more requests will be sent to the storage system from workers in parallel.&lt;/p&gt;
  &lt;p id=&quot;PikR&quot;&gt;If you run validation tests with a number of IOPS equal to your production workload, you can simply configure the same total OIO from the workers. &lt;strong&gt;But if you want to uncover the full potential of the storage system you must vary the OIO to find out the point of best performance.&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;kMIR&quot;&gt;It is imperative to understand that &lt;strong&gt;storage is a reactive system. It does not generate anything itself; it only responds to the requests from its clients.&lt;/strong&gt; This is why there may be situations where clients generate a light load easily handled by the storage system. The storage is underutilized and has enormous potential, but to demonstrate that potential we need to increase the load and the number of requests. Or, alternatively, the storage system may be fully utilized and unable to provide more IOPS. In this case, should you increase the number of requests, you will observe higher latencies, but not an increase in IOPS. And only somewhere in the middle is there a balanced trade-off of latency vs IOPS in the operating mode. Hence, the dependency between IOPS and OIO from the latency perspective manifests itself as in the graph below:&lt;/p&gt;
  &lt;figure id=&quot;Xnq5&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img1.teletype.in/files/0e/13/0e13237b-ed91-49c5-89e7-8b7882fbbda9.png&quot; width=&quot;1999&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;Storage system performance modes&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;MX5n&quot;&gt;Therefore, you should vary OIO to obtain the “optimal” number of OIO. What is “optimal”? &lt;strong&gt;It is the amount of OIO where the system provides maximum performance (IOPS) while latency is equal to or below your required level.&lt;/strong&gt; The challenge is that this “optimal OIO” will be different for different workload profiles, system settings, etc. This means that every test should be run not at a single OIO setting but with multiple OIO values in the neighborhood of the “optimal point.” Thus, you will ensure that you’ve discovered the correct OIO value.&lt;/p&gt;
  &lt;p id=&quot;74NK&quot;&gt;To do this, run tests starting at low values and gradually increase them. By analyzing the results, you can determine which mode the storage system is currently in. Thus, if you see a significant increase in performance at about the same latency, you should specify even higher OIO values. And vice versa: if latency increases sharply while performance barely changes, reduce the OIO values. Do this until you get a few results in the normal mode in the vicinity of the latency threshold.&lt;/p&gt;
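  &lt;p id=&quot;fioE&quot;&gt;A simple way to automate such a sweep is a shell loop over iodepth values that saves each run as JSON for later plotting (a sketch with a placeholder device and shortened durations; in a real test keep the warm-up and duration rules discussed above):&lt;/p&gt;
  &lt;pre id=&quot;fioF&quot;&gt;#!/bin/sh
# Hypothetical OIO sweep: one FIO run per queue depth, results saved as JSON.
for qd in 1 2 4 8 16 32 64 128; do
  fio --name=sweep-qd${qd} --filename=/dev/sdx --direct=1 \
      --ioengine=libaio --rw=randread --bs=8k --iodepth=${qd} \
      --ramp_time=300 --time_based --runtime=600 --group_reporting \
      --output-format=json --output=result_qd${qd}.json
done&lt;/pre&gt;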
  &lt;p id=&quot;Fp6T&quot;&gt;You will end up with two graphs: IOPS/OIO and IOPS/Latency. After that, just cut them off at your required latency value and you will get your “optimal OIO” as well as “Max IOPS.”&lt;/p&gt;
  &lt;h3 id=&quot;O8ZD&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;IOPS Limits.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;EElP&quot;&gt;Additionally, if you &lt;strong&gt;need more precise IOPS/Latency dependency graphs, especially over a wider range of values, you can run the benchmark with IOPS limits set.&lt;/strong&gt; The point here is not to change the parallelism but the request rate, while keeping the optimal OIO.&lt;/p&gt;
  &lt;p id=&quot;BBx3&quot;&gt;To do so, after defining the optimal OIO, you run multiple tests with different IOPS limits, from small values up to unlimited. Be aware that starting from some value of the IOPS limit, the data points (IOPS and latency) will start to converge and match. This is fine because it means that you’ve hit the IOPS maximum, and no matter what limit you set you will get the same result.&lt;/p&gt;
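  &lt;p id=&quot;fioG&quot;&gt;In FIO this can be done with the rate_iops option. A sketch of such a series (placeholder device and durations, run at the previously found optimal iodepth, with one final uncapped run to confirm the maximum) might look like this:&lt;/p&gt;
  &lt;pre id=&quot;fioH&quot;&gt;#!/bin/sh
# Hypothetical IOPS-limit series at a fixed, already-optimal queue depth.
for limit in 5000 10000 20000 40000 80000; do
  fio --name=cap-${limit} --filename=/dev/sdx --direct=1 \
      --ioengine=libaio --rw=randread --bs=8k --iodepth=32 \
      --rate_iops=${limit} --ramp_time=300 --time_based --runtime=600 \
      --group_reporting --output-format=json --output=result_cap_${limit}.json
done&lt;/pre&gt;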
  &lt;h2 id=&quot;1ROi&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Conclusion. &lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;huJc&quot;&gt;In order to conduct a successful benchmark, move consistently and do not skip any of the steps - from defining a goal to describing methods and approaches, and then to the specific parameters and conditions for conducting the tests.&lt;/p&gt;
  &lt;p id=&quot;1rQx&quot;&gt;And remember that one of the key indicators of quality testing is a clear description of EVERY parameter used in the test, together with an explanation of how it links to your tasks and goals. &lt;/p&gt;
  &lt;p id=&quot;DuUa&quot;&gt;If you follow the general principles described in this document and adapt them to your situation, you can be confident in the quality of the results and conclusions. &lt;/p&gt;
  &lt;p id=&quot;GJhf&quot;&gt;And now let&amp;#x27;s move on to a specific example in &lt;a href=&quot;https://nkulikov.com/StorageBenchmarkPart3&quot; target=&quot;_blank&quot;&gt;Part 3&lt;/a&gt;, where most of the points previously described in this document will be considered in practice.&lt;/p&gt;

</content></entry><entry><id>nkulikov:StorageBenchmarkPart1</id><link rel="alternate" type="text/html" href="https://nkulikov.com/StorageBenchmarkPart1?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=nkulikov"></link><title>Enterprise Storage Synthetic Benchmarking Guide and Best Practices. Part 1. General Theory, Methods, and Approaches.</title><published>2022-04-27T13:11:21.161Z</published><updated>2022-04-27T14:46:38.232Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img1.teletype.in/files/c4/eb/c4ebcde7-bcbc-4758-9614-ea8256cac2e7.png"></media:thumbnail><category term="enterprise-storage-benchmarking-guide" label="Enterprise Storage Benchmarking Guide"></category><summary type="html">&lt;img src=&quot;https://img3.teletype.in/files/26/43/26432f4a-ce8b-4c9b-8a02-20240a748105.jpeg&quot;&gt;For more than ten years in the IT industry, I have conducted a lot of tests and proof-of-concepts (POC) of various storage systems. While it was a completely different type of storage — traditional FC-connected SAN storage, NAS, SDS (Software Defined Storage), and HCI (Hyper Converged Infrastructure) platforms, the challenges were nearly the same. And quite often I faced a situation where tests were not anyhow aligned with the business needs and even were technically incorrect which led to the impossibility to deliver any meaningful results.</summary><content type="html">
  &lt;nav&gt;
    &lt;ul&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#zBwZ&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#i6n3&quot;&gt;Define the Goal of the Benchmarking.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#IU93&quot;&gt;Choose the approach and benchmarking methodology.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#lSad&quot;&gt;Move or clone the existing production workload to the test system.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#qZu1&quot;&gt;Application-level Synthetic Benchmarks.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#NlEs&quot;&gt;​Synthetic Storage Benchmark.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#Rqp9&quot;&gt;Select the Tools and Utilities for Synthetic Storage Benchmarking.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#MRhS&quot;&gt;Establish Demo Environment Configuration.&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#do5a&quot;&gt;Hardware&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_2&quot;&gt;&lt;a href=&quot;#fZhB&quot;&gt;Software&lt;/a&gt;&lt;/li&gt;
      &lt;li class=&quot;m_level_1&quot;&gt;&lt;a href=&quot;#c3YB&quot;&gt;Conclusion.&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/nav&gt;
  &lt;hr /&gt;
  &lt;figure id=&quot;uqks&quot; class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://img3.teletype.in/files/26/43/26432f4a-ce8b-4c9b-8a02-20240a748105.jpeg&quot; width=&quot;800&quot; /&gt;
  &lt;/figure&gt;
  &lt;h2 id=&quot;zBwZ&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;hxk1&quot;&gt;For more than ten years in the IT industry, I have conducted a lot of tests and proof-of-concepts (POCs) of various storage systems. While these were completely different types of storage — traditional FC-connected SAN storage, NAS, SDS (Software Defined Storage), and HCI (Hyper Converged Infrastructure) platforms — the challenges were nearly the same. &lt;strong&gt;And quite often I faced situations where tests were not aligned with the business needs in any way, or were even technically incorrect, which made it impossible to deliver any meaningful results.&lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;vLBx&quot;&gt;We all know how critically important storage can be for IT and business operations. We also know how much effort is needed from all the involved parties to properly test a storage system and produce meaningful results supporting the decision-making process. That&amp;#x27;s why I believe that a well-structured approach with full attention to detail is a must for any kind of storage system test and/or benchmark.&lt;/p&gt;
  &lt;p id=&quot;YJzI&quot;&gt;Most storage validations/POCs have three parts — availability tests, functional tests, and performance tests (aka benchmarking):&lt;/p&gt;
  &lt;ol id=&quot;Vvp8&quot;&gt;
    &lt;li id=&quot;WaKX&quot;&gt;Functional tests are very organization-specific since they must be aligned with the customer’s IT landscape, business processes, and other specific aspects.&lt;/li&gt;
    &lt;li id=&quot;lXet&quot;&gt;Availability tests are time-consuming, but it is the most obvious part — the customer should create a list of potential risks, describe the expected behavior in each case, define with the vendor/value-added reseller/value-added distributor the options to simulate such scenarios and execute such tests.&lt;/li&gt;
    &lt;li id=&quot;5KEP&quot;&gt;Performance tests, or benchmarking, are a complicated and often contradictory part of the test plan, because they are always a balance between applicability and accuracy vs feasibility and complexity.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;fRGJ&quot;&gt;In this guide, I am going to cover only the performance testing dimension, mostly related to synthetic benchmarking of a primary, general-purpose storage system. &lt;strong&gt;The main goal of this guide is to provide guidelines and recommendations on how to benchmark any type of enterprise primary storage to receive valuable results while minimizing the effort.&lt;/strong&gt; I will not talk much about testing in general or about organizing a POC. &lt;/p&gt;
  &lt;p id=&quot;rBjD&quot;&gt;This guide is in two parts - the first part is more versatile and suitable for almost any situation where you need to conduct synthetic load testing. Therefore, it mainly consists of general techniques, best practices, and recommendations (although I will try to add more examples), rather than specific actions. The second part is practical and specific, where I show by example how load tests can be carried out, as well as how each of the load testing parameters affects the result.&lt;/p&gt;
  &lt;p id=&quot;ZL6k&quot;&gt;In order to be more specific and, in fact, to reflect my most recent experience, all further examples will be related to the storage systems tests in a VMware virtual infrastructure environment.  However, these guidelines can be applied to other cases with minimal modifications.&lt;/p&gt;
  &lt;h2 id=&quot;i6n3&quot; data-align=&quot;center&quot;&gt;Define the Goal of the Benchmarking.&lt;/h2&gt;
  &lt;p id=&quot;B1H5&quot;&gt;The first and the most important question you must accurately answer in detail before starting the benchmarking planning is &lt;strong&gt;“What is the goal (aligned with the business needs) of this activity? What should be the result and what conclusion must I ultimately reach?”&lt;/strong&gt; To answer this question, you should have a conversation with all the key stakeholders, including the business owner of this project, system administrators, and the application owners. This is a critical step before moving on.&lt;/p&gt;
  &lt;p id=&quot;uQyF&quot;&gt;The goals can differ depending on the situation, but the main ones are:&lt;/p&gt;
  &lt;ol id=&quot;7N9q&quot;&gt;
    &lt;li id=&quot;RY1U&quot;&gt;&lt;strong&gt;Prove&lt;/strong&gt; that the storage system(s) &lt;strong&gt;satisfies the needs of your apps&lt;/strong&gt; and services.&lt;/li&gt;
    &lt;li id=&quot;ylD9&quot;&gt;&lt;strong&gt;Define the limits and maximums&lt;/strong&gt; of the system to be prepared before the system becomes saturated and is not able to further satisfy the consumers’ needs.&lt;/li&gt;
    &lt;li id=&quot;ZuuJ&quot;&gt;&lt;strong&gt;Define the system’s tunable settings and parameters&lt;/strong&gt; to be used in production, because they can impact performance, and understand that impact. Examples of such settings are deduplication, compression, type of erasure coding, RAID levels, and the implementation of disaster recovery/avoidance technologies like replication (and the type of replication). Also define whether any advanced settings should be tuned.&lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;UFbx&quot;&gt;Quite often the goal is described as “compare different storage systems to choose which one to purchase.” If you think about this for a minute, you will realize that this is not the goal itself — it is just the consequence of one of the above goals. First of all, you need to decide what storage system configuration (licenses, RAW capacity, nodes, etc) you should purchase. You always want to compare apples to apples and to do this you need accurate numbers at your disposal.&lt;/p&gt;
  &lt;p id=&quot;vpOf&quot;&gt;In case you are choosing a storage system for a well-defined and predictable workload (e.g., an established production environment), you shouldn&amp;#x27;t base your decision on pure performance. It does not matter whether one storage system is 2 or 3 times faster than required — both are capable of handling your workloads, and the decision should be made based on functional, financial, and other criteria.&lt;/p&gt;
  &lt;p id=&quot;tQgD&quot;&gt;Otherwise, if you are looking for a storage system for new and/or unpredictable workloads, the system maximums are important for understanding the potential and limits of the system.&lt;/p&gt;
  &lt;p id=&quot;SshE&quot;&gt;&lt;strong&gt;So, before the beginning of any benchmarking, you should define and establish the goals, success criteria, and agree with all the interested parties. &lt;/strong&gt;&lt;/p&gt;
  &lt;p id=&quot;f5Gj&quot;&gt;Popular examples of benchmark goals:&lt;/p&gt;
  &lt;p id=&quot;5Hz1&quot;&gt;“Validate that the storage system is able to handle the workload from the applications A, B and C (or any part of the infrastructure) and define the storage settings and configuration where this is possible.”&lt;/p&gt;
  &lt;p id=&quot;g9pm&quot;&gt;“Determine the maximum number of application X, Y, Z users that can be processed on the storage system while remaining within the acceptable response time limits. Analyze the impact of storage settings/configuration (like deduplication and compression) on the number of users served.”&lt;/p&gt;
  &lt;p id=&quot;ieqd&quot;&gt;“Define the possible number of virtual machines that can be deployed from the private cloud, as long as they match the current averages from the performance and capacity perspective.”&lt;/p&gt;
  &lt;h2 id=&quot;IU93&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Choose the approach and benchmarking methodology.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;QQ3K&quot;&gt;After defining your goals, you need to select the way of creating a load and measuring the performance. Generally speaking, I could distinguish three types of approaches: &lt;/p&gt;
  &lt;ol id=&quot;bTnN&quot;&gt;
    &lt;li id=&quot;N6TR&quot;&gt;move or clone your production workload (in other words, one way or another, replicating the real load)&lt;/li&gt;
    &lt;li id=&quot;Hbrk&quot;&gt;application-level synthetic benchmark&lt;/li&gt;
    &lt;li id=&quot;HOjm&quot;&gt;storage-level synthetic benchmark. &lt;/li&gt;
  &lt;/ol&gt;
  &lt;p id=&quot;VJhp&quot;&gt;Let’s take a detailed look at each approach.&lt;/p&gt;
  &lt;h3 id=&quot;lSad&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Move or clone the existing production workload to the test system.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;2iaA&quot;&gt;In this case, you should define the key metrics to measure and the success criteria, then reproduce and analyze the existing workload on the storage system being tested. I strongly recommend choosing business-specific metrics, not IT-focused ones. &lt;strong&gt;These metrics must be aligned with the business operations and the corresponding goals and objectives, such as customer experience or others. &lt;/strong&gt;For example, it can be the response time of the app from the end-user perspective, business request processing time, report generation time, the number of users who can use the application with an acceptable response time, etc. After that, you move or clone the application to the new system and compare the key metrics. If you decide to clone a workload, remember that the number of users and the intensity of their interaction with the system must be commensurate with the original system or scaled accordingly.&lt;/p&gt;
  &lt;p id=&quot;Z8cS&quot;&gt;This approach is the most accurate way to evaluate your workload behavior on the tested system. It is also an efficient way to tune and select parameters of the storage system together with a vendor, partner, and customer.&lt;/p&gt;
  &lt;p id=&quot;2OTh&quot;&gt;At the same time, it is the hardest way, requiring the most resources, time, and deep involvement of the application owners. The complexity of such tests often makes this approach almost non-viable, especially when the storage will be shared among several business applications. If you choose to migrate existing applications to the new system being tested, you should carefully manage the related risks, because if anything goes wrong it will affect the production environment and hence current business operations. Also, this approach is applicable only to existing workloads, not to future needs.&lt;/p&gt;
  &lt;p id=&quot;xt6g&quot;&gt;With that said, this method is usually applied as a final confirmation when the system has already been shortlisted after the first round of tests and the storage is intended to be used for the specific business-critical application such as the core banking system, ERP, etc.&lt;/p&gt;
  &lt;h3 id=&quot;qZu1&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Application-level Synthetic Benchmarks.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;l9oe&quot;&gt;Typically, the application-level benchmark is a combination of the real application and load generating software. Examples include:&lt;/p&gt;
  &lt;ul id=&quot;ZPQs&quot;&gt;
    &lt;li id=&quot;OCPW&quot;&gt;Any SQL Database (MS SQL, Oracle, PostgreSQL, MariaDB, etc) +&lt;a href=&quot;https://www.hammerdb.com/index.html&quot; target=&quot;_blank&quot;&gt; HammerDB&lt;/a&gt;.&lt;/li&gt;
    &lt;li id=&quot;IPdY&quot;&gt;Web server + &lt;a href=&quot;https://github.com/wg/wrk&quot; target=&quot;_blank&quot;&gt;wrk&lt;/a&gt;/&lt;a href=&quot;https://httpd.apache.org/docs/2.4/programs/ab.html&quot; target=&quot;_blank&quot;&gt;ApacheBenchmark&lt;/a&gt; (or &lt;a href=&quot;https://jmeter.apache.org/&quot; target=&quot;_blank&quot;&gt;JMeter&lt;/a&gt;, but it is more complex and usually used for end-to-end web app benchmark).&lt;/li&gt;
    &lt;li id=&quot;4LMp&quot;&gt;SAP + &lt;a href=&quot;https://www.sap.com/about/benchmark.html&quot; target=&quot;_blank&quot;&gt;SAP Standard Application Benchmark&lt;/a&gt;.&lt;/li&gt;
    &lt;li id=&quot;Jc6v&quot;&gt;VDI environment + &lt;a href=&quot;https://www.loginvsi.com/products/application-and-infrastructure-load-testing/&quot; target=&quot;_blank&quot;&gt;LoginVSI&lt;/a&gt;.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;uaEX&quot;&gt;Depending on the storage system use cases, it is worth using one or more of such application-level benchmarks.&lt;/p&gt;
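  &lt;p id=&quot;wrkA&quot;&gt;As a small illustration of this approach (a sketch only; the URL, thread/connection counts, and duration are placeholders), a web-tier run with wrk that also reports the latency distribution could look like this:&lt;/p&gt;
  &lt;pre id=&quot;wrkB&quot;&gt;# Hypothetical example: 8 threads, 200 concurrent connections, 2 minutes,
# with percentile latency output enabled (--latency).
wrk -t8 -c200 -d120s --latency http://web-under-test.example.com/&lt;/pre&gt;
  &lt;p id=&quot;wrkC&quot;&gt;The storage-related work then lies in correlating such application-level results with the storage metrics collected during the run.&lt;/p&gt;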
  &lt;p id=&quot;xOB3&quot;&gt;These tests are quite representative (although they may not fully respect customer application specifics), easier to leverage because of ready-to-use test plans, and they provide the possibility to push the system to its limits by scaling the workers/load generators.&lt;/p&gt;
  &lt;p id=&quot;2W2o&quot;&gt;But this still often requires the application owner’s participation, and sometimes it is hard to interpret the results, especially in mixed environments. Also, it can be problematic to distinguish the impact of storage on the overall result from the impact of CPU, network, etc.&lt;/p&gt;
  &lt;p id=&quot;mFlh&quot;&gt;&lt;strong&gt;For these reasons this method is commonly used to compare the storage systems or to test the existing system&amp;#x27;s applicability for the particular workload.&lt;/strong&gt;&lt;/p&gt;
  &lt;h3 id=&quot;NlEs&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;​Synthetic Storage Benchmark.&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;RbHo&quot;&gt;The concept is completely different from the previous approaches — there is no real application, just a tool that sends IO requests with the required parameters, depending on the expected workload pattern, to the underlying infrastructure. There is no data processing at the application level, and there is no real-world data.&lt;/p&gt;
  &lt;p id=&quot;Xvwx&quot;&gt;&lt;strong&gt;The main benefits of this method are flexibility and ease of use.&lt;/strong&gt; Participation of the application teams is not required; the infrastructure team can complete all the tests themselves. The tests can be easily automated, so it is possible to test many storage systems in parallel, significantly reducing the duration of POCs.&lt;/p&gt;
  &lt;p id=&quot;Gjlr&quot;&gt;But you should be really diligent with synthetic storage benchmarks, since it is extremely &lt;strong&gt;easy to produce absolutely meaningless tests and end up with irrelevant conclusions that look fine at first glance.&lt;/strong&gt; At the same time, although synthetic tests may show representative results in terms of overall performance, it is harder to predict some of the nuances for a particular application.&lt;/p&gt;
  &lt;p id=&quot;ehFk&quot;&gt;As a matter of fact, the falsely perceived simplicity of synthetic storage benchmarking and its error-proneness were the main reasons why I decided to write this guide. Further on,&lt;strong&gt; I will focus only on the synthetic storage benchmark&lt;/strong&gt;, not because it is the best way to test a storage system’s performance, but because it is the most popular approach and the least dependent on a particular environment&amp;#x27;s specifics.&lt;/p&gt;
  &lt;h2 id=&quot;Rqp9&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Select the Tools and Utilities for Synthetic Storage Benchmarking.&lt;/strong&gt;&lt;/h2&gt;
  &lt;p id=&quot;5bIF&quot;&gt;As I mentioned previously, for synthetic benchmarks we need a tool that prepares the data, sends the IO requests to the storage, collects, and analyzes the results. It has to be flexible and customizable enough to maintain relevance to your infrastructure and goals. Also, it should support automation and orchestration scenarios.&lt;/p&gt;
  &lt;p id=&quot;MudH&quot;&gt;&lt;strong&gt;The gold standard in the industry is FIO, Vdbench, and DiskSPD. &lt;/strong&gt;All of them are proven, widely applied, and cross-platform, but historically DiskSPD is more often used in Microsoft environments and FIO or Vdbench in Linux environments. Without going into details, they provide similar capabilities for storage benchmarking and have similar learning curves, so the selection of a tool is usually based on experience and habits.&lt;/p&gt;
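  &lt;p id=&quot;fioI&quot;&gt;For orientation, even the smallest FIO invocation already makes its key parameters explicit (a sketch; the directory, size, and workload profile are placeholders for whatever your methodology defines):&lt;/p&gt;
  &lt;pre id=&quot;fioJ&quot;&gt;# Hypothetical example: 4k random read over a 10 GiB file,
# bypassing the page cache (--direct=1).
fio --name=basic-randread --directory=/mnt/storage-under-test \
    --size=10g --rw=randread --bs=4k --iodepth=8 --direct=1 \
    --ioengine=libaio --time_based --runtime=300 --group_reporting&lt;/pre&gt;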
  &lt;p id=&quot;Lb5U&quot;&gt;There are also several PC-focused storage benchmarks like CrystalDiskMark/HDTune/PCMark/etc. PLEASE do not use them for enterprise storage benchmarking. They are not bad by themselves (for example, CrystalDiskMark is built on top of DiskSPD), but their goals and use cases are completely different. The lack of control and customization, as well as the lack of clusterization support, makes them hardly applicable to enterprise needs. Most enterprise environments produce compound workload profiles from many sources in parallel, not from a single OS; datasets are tens and hundreds of TBs; and load is generated continuously for hours and days, not for minutes and a few GBs as happens on PCs.&lt;/p&gt;
  &lt;p id=&quot;99tY&quot;&gt;Another class of benchmarking utilities is built on top of FIO/Vdbench and leverages them as the workload generator engine. These tools provide extra features like a GUI, out-of-the-box deployment and automation, scheduling, scripting, etc., while still keeping FIO/Vdbench as the workload generator.&lt;/p&gt;
  &lt;p id=&quot;f8av&quot;&gt;&lt;a href=&quot;https://flings.vmware.com/hcibench&quot; target=&quot;_blank&quot;&gt;HCIBench&lt;/a&gt; is one such tool, extremely popular for storage benchmarking in VMware vSphere environments. Initially it was created for HCI platform benchmarking, but now it can be used universally with any type of storage. In a nutshell, it&amp;#x27;s a ready-to-deploy virtual appliance with a control VM and a fleet of worker VMs with workload engines. The control VM provides the UI for configuration and results review, contains the image of a micro-VM with preinstalled FIO/Vdbench, and includes scripts for deployment automation, start-up, data collection, etc. More on how to customize it can be found in &lt;a href=&quot;https://blogs.vmware.com/virtualblocks/2016/11/03/use-hcibench-like-pro-part-2/&quot; target=&quot;_blank&quot;&gt;this&lt;/a&gt; VMware blog post, with examples of test profiles on &lt;a href=&quot;https://github.com/cleeistaken/Test-Profiles&quot; target=&quot;_blank&quot;&gt;this&lt;/a&gt; GitHub page. If you want to automate HCIBench itself, have a look at &lt;a href=&quot;https://blog.ovhcloud.com/industrialising-storage-benchmarks-with-hosted-private-cloud-from-ovhcloud/&quot; target=&quot;_blank&quot;&gt;Industrialising storage benchmarks with Hosted Private Cloud from OVHcloud — OVHcloud Blog&lt;/a&gt;.&lt;/p&gt;
  &lt;figure id=&quot;yf5M&quot; class=&quot;m_column&quot; data-caption-align=&quot;center&quot;&gt;
    &lt;img src=&quot;https://img2.teletype.in/files/95/86/958626ce-3781-4a55-a0a9-872fe71a04ee.png&quot; width=&quot;886&quot; /&gt;
    &lt;figcaption&gt;&lt;strong&gt;High Level HCIBench Architecture&lt;/strong&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;p id=&quot;oE6y&quot;&gt;Technically, you can achieve the same or even better level of automation via custom scripts, but it rarely makes sense due to the significant time investment. &lt;strong&gt;So, if the primary use case for your storage system is hosting the datastores of a VMware virtual infrastructure, it is reasonable to use HCIBench as the default benchmarking tool and switch to a custom FIO/Vdbench setup only if there is a reason for it.&lt;/strong&gt; The particular value of HCIBench is that it uses common workload generators (you can choose FIO or Vdbench) and gives you full control over the benchmark parameters — nothing is decided for you, configured by default, or hidden behind the scenes. Such control is critical for any benchmarking tool.&lt;/p&gt;
  &lt;h2 id=&quot;MRhS&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Establish Demo Environment Configuration.&lt;/strong&gt;&lt;/h2&gt;
  &lt;h3 id=&quot;do5a&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;pQMq&quot;&gt;Ideally, you would test the same storage system configuration as you plan to buy. However, this rarely happens in real life due to the limitations of the demo systems pool on the vendor/SI side. Sometimes you can use a “try-and-buy” program, but usually vendors offer this option only for the final validation. Under such a program&amp;#x27;s conditions, the customer commits to buy out the system if the pre-defined qualification criteria are met. Otherwise, the system is returned to the vendor/SI. This can be a feasible validation option, but not so much from the vendor comparison perspective. &lt;/p&gt;
  &lt;p id=&quot;2gbH&quot;&gt;Most often, the system you get for tests will be scaled down from your target configuration. In this case, you should test storage of the same class, generation, and model. Components should be of the same performance class and type — if you plan to buy storage with SATA/SAS drives and FC/SCSI interfaces, there is no sense in testing all-NVMe storage with NVMe-oF over RoCEv2 connectivity, even if the storage controller matches. The storage software versions and the configuration/mode of the storage software should also be the same (for example, NetApp ASA/AFF options, Unified/Block Optimized software modes for Dell EMC PowerStore T, etc).&lt;/p&gt;
  &lt;p id=&quot;HDDp&quot;&gt;On the other hand, it can be fine if the system has less capacity or fewer drives or nodes, as long as you understand the scaling behavior. But be careful, because in many cases it is not so obvious. For instance, a lot of all-flash arrays reach a performance plateau with 10-30 SSDs because the controller becomes the bottleneck. This leads to a situation where the total performance of a storage system with 24 SSDs and one with 100+ SSDs will be the same. Another example: increasing the number of controllers (SAN/NAS) or nodes (SDS, HCI) does not always lead to performance gains for a single LUN/share/vmdk, despite an overall performance increase.&lt;/p&gt;
  &lt;p id=&quot;FHBv&quot;&gt;So, it is best to discuss all this in detail with the vendor, ask for proofs/tests/benchmarks, and after that try to carry out your own test (even at a minimal scale) to prove the concept.&lt;/p&gt;
  &lt;p id=&quot;rGiq&quot;&gt;&lt;strong&gt;After preparing and defining the goals/success criteria, you should specify the system configuration suitable for the benchmark and request it for the POC.&lt;/strong&gt;&lt;/p&gt;
  &lt;h3 id=&quot;fZhB&quot; data-align=&quot;center&quot;&gt;&lt;strong&gt;Software&lt;/strong&gt;&lt;/h3&gt;
  &lt;p id=&quot;rEEA&quot;&gt;This part can be complicated in practice, but easy in theory:&lt;/p&gt;
  &lt;ul id=&quot;9OxQ&quot;&gt;
    &lt;li id=&quot;Nv5Z&quot;&gt;Check the software versions and install the latest (supported by the hardware and software components) updates on every component of the test environment.&lt;/li&gt;
    &lt;li id=&quot;6yER&quot;&gt;Accurately check compatibility lists top down.&lt;/li&gt;
    &lt;li id=&quot;PeeD&quot;&gt;Setup and tune the test environment according to the applicable Best Practices Guides.&lt;/li&gt;
    &lt;li id=&quot;yklc&quot;&gt;Have the setup reviewed with the vendor / SI and obtain their written confirmation that the setup is correct (adheres to the best practices).&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p id=&quot;LXlW&quot;&gt;&lt;strong&gt;While setting up the environment, document all the settings and options configured. &lt;/strong&gt;You are likely to feel that it is easy to remember just a few checkboxes, but I assure you that after several months you will not remember anything. A good practice is to perform any setup and operations with screen recording enabled (I usually use Zoom cloud-recording sessions even if I&amp;#x27;m the only one doing the setup). This way you will be able to check any details and/or prove a point at any time.&lt;/p&gt;
  &lt;h2 id=&quot;c3YB&quot; data-align=&quot;center&quot;&gt;Conclusion.&lt;/h2&gt;
  &lt;p id=&quot;mFiz&quot;&gt;Performance testing is not a simple procedure, but it is certainly an extremely important one, since it provides the basis for long-term decisions that can have an immediate impact on production processes and the attainment of business goals. &lt;/p&gt;
  &lt;p id=&quot;cdDN&quot;&gt;Careful preparation and planning are critical factors for successful load tests. This applies equally to the goal definition, the collection of data, and the detailed description of the test program. If the preparation is carried through to the end, the test itself becomes simple, understandable, and even, in a sense, a routine action. &lt;/p&gt;
  &lt;p id=&quot;c3Y0&quot;&gt;In the &lt;a href=&quot;https://nkulikov.com/StorageBenchmarkPart2&quot; target=&quot;_blank&quot;&gt;second&lt;/a&gt; part, I will show exactly what data is needed for successful synthetic benchmarking, the ways to get it, and how to formulate success criteria.&lt;/p&gt;

</content></entry></feed>