WEKA is a distributed, parallel filesystem designed for AI workloads, and we wanted to know how the MLPerf Storage AI workload scales on a high-performance software-defined storage (SDS) solution. The results are enlightening: they help us make sizing recommendations for current-generation AI systems, and they hint at the massive throughput future AI storage systems will require.
First, a quick refresher on MLPerf Storage
MLCommons maintains and develops six different benchmark suites and is developing open datasets to support future state-of-the-art model development. The MLPerf Storage Benchmark Suite is the latest addition to the MLCommons benchmark collection.
MLPerf Storage sets out to address two challenges, among others, when characterizing the storage workload for AI training systems: the cost of AI accelerators and the small size of available datasets.
For a deeper dive into the workload generated by MLPerf Storage and a discussion of the benchmark, see our previous blog posts:
- The Micron 9400 NVMe SSD is the top PCIe Gen4 SSD for AI Storage
- Storage for AI training: MLPerf Storage on the Micron 9400 NVMe SSD
My teammate, Sujit, wrote a post earlier this year describing how the WEKA cluster used here performs in synthetic workloads. See that post for the full results.
The cluster is made up of six storage nodes, and each node is configured with the following:
- Supermicro AS-1115CS-TNR
- Single-socket AMD EPYC™ 9554P CPU (64 cores / 3.1 GHz base / 3.75 GHz boost)
- 384 GB Micron DDR5 DRAM
- 10 Micron 30TB 6500 NVMe SSDs
- 400 GbE networking
Finally, let’s review how this cluster performs in MLPerf Storage
Quick note: The results presented here are unvalidated, as they have not been submitted to MLPerf Storage for review. Also, the MLPerf Storage benchmark is changing between v0.5 and the next version, which is the first 2024 release. The numbers presented here use the same methodology as the v0.5 release (independent datasets for each client, independent clients, and accelerators in a client share a barrier).
In v0.5, the MLPerf Storage benchmark emulates NVIDIA® V100 accelerators. The NVIDIA DGX-2 server has 16 V100 accelerators. For this testing, we show the number of clients supported on the WEKA cluster, where each client emulates 16 V100 accelerators, as in the NVIDIA DGX-2.
Additionally, v0.5 of the MLPerf Storage benchmark implements two different models, Unet3D and BERT. Through testing, we find that BERT does not generate significant storage traffic, so we’re going to focus on Unet3D for the testing here. (Unet3D is a 3D medical imaging model.)
This plot shows the total throughput to the storage system for a given number of client nodes. Remember, each node has 16 emulated accelerators. Furthermore, to be considered a “success,” a given number of nodes and accelerators needs to maintain greater than 90% accelerator utilization. If utilization drops below 90%, the accelerators are sitting idle while they wait for data.
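The pass/fail criterion can be sketched in a few lines. This is a simplified illustration, not the benchmark's own code: the function names are hypothetical, utilization is modeled as compute time divided by wall-clock time, and we assume the average across accelerators is what gets compared with the 90% threshold.

```python
AU_THRESHOLD = 0.90  # from the text: a run must stay above 90% utilization


def accelerator_utilization(compute_seconds: float, wall_seconds: float) -> float:
    """Fraction of wall-clock time an emulated accelerator spent computing.

    Assumption: the remaining time is spent idle, waiting on storage.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return compute_seconds / wall_seconds


def run_passes(per_accel_compute_seconds, wall_seconds, threshold=AU_THRESHOLD):
    """Return True if mean utilization across accelerators exceeds the threshold.

    Assumption: averaging across accelerators is an illustrative choice.
    """
    utilizations = [
        accelerator_utilization(c, wall_seconds) for c in per_accel_compute_seconds
    ]
    mean_au = sum(utilizations) / len(utilizations)
    return mean_au > threshold


# Two hypothetical accelerators over a 100-second run.
print(run_passes([95.0, 92.0], 100.0))  # mean AU 0.935
print(run_passes([95.0, 80.0], 100.0))  # mean AU 0.875
```

In this sketch the first run passes and the second fails, because its accelerators spend too much of the run waiting for data.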
Here we see that the six-node WEKA storage cluster supports 16 clients, each emulating 16 accelerators — for a total of 256 emulated accelerators — and reaching 91 GB/s of throughput.
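For sizing purposes, it can help to break the headline number down. The sketch below uses only the figures quoted above (91 GB/s, 16 clients, 16 accelerators per client, six storage nodes); the function and key names are our own, and the results are simple averages, not measured per-device rates.

```python
# Figures taken from the text above.
TOTAL_THROUGHPUT_GB_PER_S = 91.0
CLIENTS = 16
ACCELS_PER_CLIENT = 16
STORAGE_NODES = 6


def breakdown(total_gb_per_s, clients, accels_per_client, storage_nodes):
    """Average throughput per accelerator, per client, and per storage node."""
    total_accels = clients * accels_per_client
    return {
        "total_accelerators": total_accels,
        "gb_per_s_per_accelerator": total_gb_per_s / total_accels,
        "gb_per_s_per_client": total_gb_per_s / clients,
        "gb_per_s_per_storage_node": total_gb_per_s / storage_nodes,
    }


result = breakdown(
    TOTAL_THROUGHPUT_GB_PER_S, CLIENTS, ACCELS_PER_CLIENT, STORAGE_NODES
)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

That works out to roughly 0.36 GB/s per emulated V100, about 5.7 GB/s per client, and about 15.2 GB/s served by each storage node on average.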
That is the equivalent of feeding 16 NVIDIA DGX-2 systems (with 16 V100 GPUs each), which is a remarkably high number of AI systems for a six-node WEKA cluster to support.
The V100 is a PCIe Gen3 GPU, and the pace of performance increases across NVIDIA’s GPU generations is far surpassing that of platform and PCIe generations. In a single-node system, we find that an emulated NVIDIA A100 GPU is four times faster than an emulated V100 in this workload.
With a maximum throughput of 91 GB/s, the 256 emulated V100s correspond to roughly 64 A100s at four times the per-GPU data rate, so we can estimate that this WEKA deployment would support 8 DGX A100 systems (with 8 A100 GPUs each).
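The arithmetic behind that estimate is short enough to write out. This is a back-of-the-envelope sketch with a hypothetical helper name; it assumes storage demand scales linearly with accelerator speed and that the cluster's maximum throughput stays the same regardless of client type.

```python
def supported_systems(max_v100_accels: int,
                      speedup_vs_v100: float,
                      accels_per_system: int) -> int:
    """Estimate how many whole systems the cluster can feed.

    Assumption: an accelerator that is N times faster than a V100 pulls
    N times the data, so the cluster supports 1/N as many of them.
    """
    equivalent_accels = max_v100_accels / speedup_vs_v100
    return int(equivalent_accels // accels_per_system)


MAX_V100_ACCELS = 16 * 16  # 16 clients x 16 emulated V100s, from the results

# DGX-2: 16 V100s per system, baseline speed.
print(supported_systems(MAX_V100_ACCELS, 1.0, 16))

# DGX A100: 8 A100s per system, each about 4x a V100 in this workload.
print(supported_systems(MAX_V100_ACCELS, 4.0, 8))
```

The same helper can be reused for newer accelerators once a measured speedup over the V100 is available for this workload.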
Looking further into the future at H100 / H200 (PCIe Gen5) and X100 (PCIe Gen6), cutting-edge AI training servers are going to push a massive amount of throughput.
For today, WEKA storage and the Micron 6500 NVMe SSD are the perfect combination of capacity, performance and scalability for your AI workloads.
Stay tuned as we continue to explore storage for AI!