Enterprise Storage Benchmarking Guide
April 27, 2022

Enterprise Storage Synthetic Benchmarking Guide and Best Practices. Part 1. General Theory, Methods, and Approaches.


Introduction

Over more than ten years in the IT industry, I have conducted many tests and proofs of concept (POCs) of various storage systems. Although the storage itself varied widely — traditional FC-connected SAN arrays, NAS, SDS (Software-Defined Storage), and HCI (Hyper-Converged Infrastructure) platforms — the challenges were nearly the same. Quite often I ran into tests that were not aligned with the business needs at all, or were simply technically incorrect, which made it impossible to deliver any meaningful results.

We all know how critically important storage can be for IT and business operations. We also know how much effort from all the involved parties it takes to properly test a storage system and produce meaningful results that support the decision-making process. That is why I believe a well-structured approach with full attention to detail is a must for any kind of storage system testing and/or benchmarking.

Most storage validations/POCs consist of three parts — functional tests, availability tests, and performance tests (aka benchmarking):

  1. Functional tests are very organization-specific, since they must be aligned with the customer’s IT landscape, business processes, and other specific aspects.
  2. Availability tests are time-consuming, but they are the most straightforward part: the customer creates a list of potential failure scenarios, describes the expected behavior in each case, agrees with the vendor/value-added reseller/value-added distributor on how to simulate those scenarios, and executes the tests.
  3. Performance tests, or benchmarking, are a complicated and often contentious part of the test plan, because they are always a balance between applicability and accuracy on one side and feasibility and complexity on the other.

In this guide, I am going to cover only the performance testing dimension, mostly as it relates to synthetic benchmarking of a primary, general-purpose storage system. The main goal is to provide guidelines and recommendations on how to benchmark any type of enterprise primary storage so that you receive valuable results while minimizing the effort. I will not dwell on testing in general or on how to organize a POC.

This guide is in two parts. The first part is more general and suitable for almost any situation where you need to conduct synthetic load testing; it consists mainly of general techniques, best practices, and recommendations (with examples where possible) rather than specific actions. The second part is practical and specific: there I show by example how load tests can be carried out and how each load-testing parameter affects the result.

To be more specific, and to reflect my most recent experience, all further examples will relate to storage system tests in a VMware virtual infrastructure environment. However, these guidelines can be applied to other cases with minimal modifications.

Define the Goal of the Benchmarking.

The first and most important question you must answer accurately and in detail before you start planning the benchmark is: “What is the goal of this activity, aligned with the business needs? What should the result be, and what conclusion must I ultimately reach?” To answer it, talk to all the key stakeholders, including the business owner of the project, the system administrators, and the application owners. This is a critical step before moving on.

The goals differ depending on the situation, but the main ones are:

  1. Prove that the storage system(s) satisfies the needs of your applications and services.
  2. Define the limits and maximums of the system, so you are prepared before it becomes saturated and can no longer satisfy the consumers’ needs.
  3. Define the system’s tunable settings and parameters to be used in production and understand their performance impact. Examples of such settings are deduplication, compression, erasure coding type, RAID levels, and disaster recovery/avoidance technologies such as replication (and the replication type). Also decide whether any advanced settings should be tuned.

Quite often the goal is described as “compare different storage systems to choose which one to purchase.” If you think about this for a minute, you will realize that this is not a goal in itself — it is just a consequence of one of the goals above. First of all, you need to decide which storage system configuration (licenses, raw capacity, number of nodes, etc.) you should purchase. You always want to compare apples to apples, and to do that you need accurate numbers at your disposal.

If you are choosing a storage system for a well-defined and predictable workload (e.g., an established production environment), you shouldn't base your decision on raw performance alone. It does not matter whether one system is 2 or 3 times faster than required — both are capable of handling your workloads, and the decision should be made on functional, financial, and other criteria.

If, on the other hand, you are looking for a storage system for new and/or unpredictable workloads, the system maximums are important for understanding its potential and limits.

So, before any benchmarking begins, you should define the goals and success criteria and agree on them with all interested parties.

Popular examples of benchmark goals:

“Validate that the storage system is able to handle the workload from applications A, B, and C (or any part of the infrastructure), and define the storage settings and configuration under which this is possible.”

“Determine the maximum number of application X, Y, and Z users that can be served by the storage system while staying within acceptable response time limits. Analyze the impact of storage settings/configuration (such as deduplication and compression) on the number of users served.”

“Define how many virtual machines can be deployed in the private cloud, assuming they match the current averages from a performance and capacity perspective.”

Choose the Approach and Benchmarking Methodology.

After defining your goals, you need to select the way you will create the load and measure performance. Generally speaking, I distinguish three approaches:

  1. Move or clone your production workload (in other words, replicate the real load one way or another).
  2. Application-level synthetic benchmark.
  3. Storage-level synthetic benchmark.

Let’s take a detailed look at each approach.

Move or clone the existing production workload to the test system.

In this case, you should define the key metrics to measure and the success criteria, and then reproduce/analyze the existing workload on the storage system under test. I strongly recommend choosing business-specific metrics, not IT-focused ones. These metrics must be aligned with business operations and the corresponding goals and objectives, such as customer experience. For example, they can be the application response time from the end-user perspective, business request processing time, report generation time, the number of users who can work with the application at an acceptable response time, etc. After that, you move or clone the application to the new system and compare the key metrics. If you decide to clone a workload, remember that the number of users and the intensity of their interaction with the system must be commensurate with the original system or scaled accordingly.

This approach is the most accurate way to evaluate how your workload will behave on the system under test. It is also an efficient way to tune and select the storage system’s parameters together with the vendor, partner, and customer.

At the same time, it is the hardest way, requiring the most resources and time and deep involvement of the application owners. The complexity of such tests often makes this approach nearly non-viable, especially when the storage will be shared among several business applications. If you choose to migrate existing applications to the system under test, you should carefully manage the related risks, because if anything goes wrong it will affect the production environment and hence current business operations. Also, this approach covers only existing workloads, not future needs.

With that said, this method is usually applied as a final confirmation, when the system has already been shortlisted after the first round of tests and the storage is intended for a specific business-critical application such as a core banking system, ERP, etc.

Application-level Synthetic Benchmarks.

Typically, an application-level benchmark combines the real application with load-generating software (for example, database, mail, or VDI load generators).

Depending on the storage system’s use cases, it may be worth using one or more such application-level benchmarks.

These tests are quite representative (although they may not fully reflect the specifics of the customer’s applications), are easier to run thanks to ready-to-use test plans, and make it possible to push the system to its limits by scaling out the workers/load generators.

However, this approach still often requires the application owner’s participation, and the results can be hard to interpret, especially in mixed environments. It can also be difficult to isolate the impact of storage on the overall result from the impact of CPU, network, etc.

For these reasons, this method is commonly used to compare storage systems or to test an existing system’s suitability for a particular workload.

Synthetic Storage Benchmark.

The concept here is completely different from the previous approaches — there is no real application, just a tool that sends IO requests to the underlying infrastructure with parameters matching the expected workload pattern. There is no data processing at the application level, and there is no real-world data.

The main benefits of this method are flexibility and ease of use. Participation of application teams is not required; the infrastructure team can complete all the tests on its own. The tests can easily be automated, so many storage systems can be tested in parallel, significantly reducing the duration of POCs.

But you should be really diligent with synthetic storage benchmarks, since it is extremely easy to run completely meaningless tests and end up with irrelevant conclusions that still look fine at first glance. And even though synthetic tests may show representative results in terms of overall performance, it is harder to predict the nuances of a particular application’s behavior.

In fact, the falsely perceived simplicity of synthetic storage benchmarking and its error-proneness were the main reasons I decided to write this guide. From here on, I will focus only on synthetic storage benchmarks, not because they are the best way to test a storage system’s performance, but because they are the most popular approach and the least dependent on the specifics of a particular environment.

Select the Tools and Utilities for Synthetic Storage Benchmarking.

As I mentioned previously, for synthetic benchmarks we need a tool that prepares the data, sends IO requests to the storage, and collects and analyzes the results. It has to be flexible and customizable enough to stay relevant to your infrastructure and goals, and it should support automation and orchestration scenarios.

The de facto industry standards are FIO, Vdbench, and DiskSPD. All of them are proven, widely used, and cross-platform, but historically DiskSPD is used more often in Microsoft environments and FIO or Vdbench in Linux environments. Without going into the details, they provide similar capabilities for storage benchmarking and have similar learning curves, so the choice of tool is usually based on experience and habit.
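To give a feel for what a single run looks like, below is a minimal sketch that drives FIO from Python and parses its JSON output. The workload parameters (8 KiB blocks, a 70/30 read/write mix, queue depth 32, the /dev/sdx target) are illustrative assumptions only, not recommendations; how to derive such parameters from your real workload is covered in the second part.

```python
import json
import subprocess

# Illustrative parameters only: block size, read/write mix, queue depth and the
# target device are assumptions, not recommendations for your environment.
FIO_CMD = [
    "fio",
    "--name=oltp-like",
    "--filename=/dev/sdx",   # hypothetical raw test device (its data will be destroyed!)
    "--ioengine=libaio",
    "--direct=1",            # bypass the page cache and measure the storage itself
    "--rw=randrw",
    "--rwmixread=70",        # 70% reads / 30% writes
    "--bs=8k",
    "--iodepth=32",
    "--numjobs=4",
    "--time_based",
    "--runtime=600",         # 10 minutes of sustained load
    "--group_reporting",
    "--output-format=json",
]

result = subprocess.run(FIO_CMD, capture_output=True, text=True, check=True)
report = json.loads(result.stdout)

job = report["jobs"][0]      # a single aggregated job thanks to --group_reporting
print(f"read IOPS:  {job['read']['iops']:.0f}")
print(f"write IOPS: {job['write']['iops']:.0f}")
```

The same parameters map directly onto an FIO job file or a Vdbench/HCIBench workload profile; the important point is that every parameter is explicit and under your control.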

There are also several PC-focused storage benchmarks such as CrystalDiskMark, HD Tune, PCMark, etc. PLEASE do not use them for enterprise storage benchmarking. They are not bad in themselves (CrystalDiskMark, for example, is built on top of DiskSPD), but their goals and use cases are completely different. The lack of control, customization, and multi-host support makes them hardly applicable to enterprise needs: enterprise environments produce compound workload profiles from many sources in parallel rather than from a single OS, datasets are tens or hundreds of terabytes, and load is generated continuously for hours or days, not for minutes over a few gigabytes as on a PC.

Another type of benchmarking utility is built on top of FIO/Vdbench and leverages them as the workload-generating engine. These utilities add features such as a GUI, out-of-the-box deployment and automation, scheduling, and scripting, while still keeping FIO/Vdbench as the workload generator.

HCIBench is one such tool, extremely popular for storage benchmarking in VMware vSphere environments. It was initially created for benchmarking HCI platforms but can now be used with any type of storage. In a nutshell, it is a ready-to-deploy virtual appliance consisting of a control VM and a fleet of worker VMs running the workload engines. The control VM provides the UI for configuration and results review, contains the image of a micro-VM with preinstalled FIO/Vdbench, and includes scripts for deployment automation, start-up, data collection, etc. More details on customizing it can be found in VMware’s blog post on the subject, with examples of test profiles available on GitHub. If you want to automate HCIBench itself, have a look at the article “Industrialising storage benchmarks with Hosted Private Cloud from OVHcloud” on the OVHcloud blog.

High Level HCIBench Architecture

Technically, you can achieve the same or even better level of automation with custom scripts, but this rarely makes sense because of the significant time investment (see the sketch below for an idea of what such scripting involves). So, if the primary use case for your storage system is hosting datastores for a VMware virtual infrastructure, it is reasonable to use HCIBench as the default benchmarking tool and switch to a custom FIO/Vdbench setup only if there is a specific reason for it. The great value of HCIBench is that it uses common workload generators (you can choose FIO or Vdbench) and provides full control over the benchmark parameters. Such control over the tool is critical: nothing should be decided for you, configured by default, or hidden behind the scenes.
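For comparison, here is a rough sketch of what the “custom scripts” route can look like: a Python wrapper that starts the same FIO job on several load-generator hosts over SSH and aggregates the per-host IOPS. The host names and fio options are hypothetical; a production-grade harness would also need data preparation, warm-up handling, scheduling, and result storage, which is exactly the work HCIBench already does for you.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical load-generator hosts; in a real setup these would be your worker VMs.
WORKERS = ["worker-01", "worker-02", "worker-03"]

# Illustrative fio options, same assumptions as in the earlier example.
FIO_OPTS = ("--name=oltp-like --filename=/dev/sdx --ioengine=libaio --direct=1 "
            "--rw=randrw --rwmixread=70 --bs=8k --iodepth=32 --time_based "
            "--runtime=600 --group_reporting --output-format=json")

def run_on_worker(host: str) -> dict:
    """Start fio on one worker over SSH and return its parsed JSON report."""
    out = subprocess.run(
        ["ssh", host, f"fio {FIO_OPTS}"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

# Start all workers in parallel so the array sees the combined load.
with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
    reports = list(pool.map(run_on_worker, WORKERS))

total_read = sum(r["jobs"][0]["read"]["iops"] for r in reports)
total_write = sum(r["jobs"][0]["write"]["iops"] for r in reports)
print(f"aggregate read IOPS:  {total_read:.0f}")
print(f"aggregate write IOPS: {total_write:.0f}")
```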

Establish Demo Environment Configuration.

Hardware

Ideally, you would test the same storage system configuration you plan to buy. However, this rarely happens in real life due to the limited pool of demo systems on the vendor/SI side. Sometimes you can use a “try-and-buy” program, but vendors usually offer this option only for the final validation. Under such a program, the customer commits to buy the system if the pre-defined qualification criteria are met; otherwise, the system is returned to the vendor/SI. This can be a viable validation option, but it is less useful for comparing vendors.

Most often, the system you get for tests will be scaled down from your target configuration. In that case, you should still test storage of the same class, generation, and model. Components should be of the same performance class and type: if you plan to buy storage with SATA/SAS drives and FC/SCSI connectivity, there is no point in testing an all-NVMe array with NVMe-oF over RoCEv2, even if the storage controller matches. The storage software versions and the software configuration/mode should also be the same (for example, NetApp ASA/AFF options, Unified vs. Block Optimized modes for Dell EMC PowerStore T, etc.).

On the other hand, it can be acceptable if the system has less capacity, fewer drives, or fewer nodes, as long as you understand how performance scales. But be careful, because scaling is often not obvious. For instance, many all-flash arrays reach a performance plateau with 10-30 SSDs because the controllers become the bottleneck, so the total performance of a system with 24 SSDs and of one with 100+ will be the same. Another example: increasing the number of controllers (SAN/NAS) or nodes (SDS, HCI) does not always improve the performance of a single LUN/share/VMDK, even though overall performance increases.

So, it is worth discussing all of this in detail with the vendor, asking for supporting tests/benchmarks, and then carrying out your own test (even at minimal scale) to prove the concept.

Once the goals and success criteria are defined, specify a system configuration suitable for the benchmark and request it for the POC.

Software

This part can be complicated in practice, but easy in theory:

  • Check the software versions and install the latest updates (supported by the hardware and software components) on every component of the test environment.
  • Carefully check the compatibility lists top to bottom.
  • Set up and tune the test environment according to the applicable best practices guides.
  • Have the setup reviewed by the vendor/SI and obtain their written confirmation that it is correct (i.e., adheres to the best practices).

While setting up the environment, document every setting and option you configure. It may feel like a few checkboxes are easy to remember, but I assure you that after several months you will not remember anything. A good practice is to perform all setup and operations with screen recording enabled (I usually use Zoom cloud-recording sessions even when I am the only one doing the setup). This way you can check any detail and/or prove a point at any time.
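To complement screen recordings, it also helps to snapshot the environment configuration in machine-readable form at the start and end of testing. Below is a minimal sketch that collects the output of a few ESXi commands over SSH into a timestamped file; the host name and the command list are assumptions and should be replaced with whatever is relevant to the components in your environment (arrays, switches, HBAs, hypervisors, and so on).

```python
import datetime
import subprocess

# Hypothetical ESXi host and command set; adjust to your actual environment.
HOST = "esxi-01.lab.local"
COMMANDS = [
    "esxcli system version get",
    "esxcli storage core adapter list",
    "esxcli system settings advanced list",
]

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
with open(f"env-snapshot-{HOST}-{stamp}.txt", "w") as report:
    for cmd in COMMANDS:
        out = subprocess.run(["ssh", HOST, cmd], capture_output=True, text=True)
        report.write(f"### {cmd}\n{out.stdout}\n")
```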

Conclusion.

Performance testing is not a simple procedure, but it is certainly an extremely important one, since it provides the basis for long-term decisions that can directly affect production processes and the attainment of business goals.

Careful preparation and planning are critical factors for successful load tests. This applies equally to defining the goals, collecting the data, and describing the test program in detail. If the preparation is carried through to the end, the test itself becomes simple, understandable, and even, in a sense, routine.

In the second part, I will show exactly what data is needed for successful synthetic benchmarking, how to obtain it, and how to formulate success criteria.