Micron 9400 NVMe SSDs Explore Big Accelerator Memory Using NVIDIA Technology

John Mazzie | January 2024

Training dataset sizes continue to grow, and models now exceed billions of parameters. While some models fit entirely in system memory, larger ones do not. In that case, data loaders must access models located on flash storage through various methods. One such method is a memory-mapped file stored on SSDs. This lets the data loader access the file as if it were in memory, but the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big Accelerator Memory (BaM)* and the GPU-Initiated Direct Storage (GIDS)* data loader come in.
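To make the memory-mapped approach concrete, here is a minimal Python sketch using only the standard library. The file layout (fixed-size rows of float32 values), the `FEATURE_DIM` value, and the helper names are assumptions made for illustration; they are not taken from DGL or from the system tested in this article. Every read goes through the CPU and the operating system's page cache, which is the overhead described above.

```python
import mmap
import os
import struct
import tempfile

FEATURE_DIM = 4              # hypothetical number of float32 values per node
ROW_BYTES = FEATURE_DIM * 4  # 4 bytes per float32


def write_features(path, num_nodes):
    """Write a toy feature file: node i holds FEATURE_DIM copies of float(i)."""
    with open(path, "wb") as f:
        for node in range(num_nodes):
            row = [float(node)] * FEATURE_DIM
            f.write(struct.pack(f"<{FEATURE_DIM}f", *row))


def read_node_features(mm, node_id):
    """Read one node's row from the mapping; the OS pages data in on demand."""
    offset = node_id * ROW_BYTES
    return struct.unpack_from(f"<{FEATURE_DIM}f", mm, offset)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "features.bin")
        write_features(path, num_nodes=1000)
        # Map the whole file read-only and index into it like a byte array.
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return read_node_features(mm, 42)


if __name__ == "__main__":
    print(demo())
```

The calling code never issues an explicit read; the page fault handler and file system do that work on the CPU, one page at a time.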

What are BaM and GIDS?

BaM is a system architecture that takes advantage of the low latency, extremely high throughput, high density, and endurance of SSDs. BaM’s goal is to provide efficient abstractions that let GPU threads make fine-grained accesses to datasets on SSDs and achieve much higher performance than solutions that require the CPU(s) to issue storage requests on behalf of the GPUs. BaM acceleration uses a custom storage driver designed specifically to let the inherent parallelism of GPUs access storage devices directly. BaM differs from NVIDIA Magnum IO™ GPUDirect® Storage (GDS) in that BaM doesn’t rely on the CPU to prepare the communication from GPU to SSD.

Micron has done previous work with NVIDIA GDS.

The GIDS data loader is built on the BaM subsystem to address memory capacity requirements for GPU-accelerated Graph Neural Network (GNN) training while also masking storage latency. GIDS does this by storing the feature data of the graph on the SSD, since this data is typically the largest part of the total graph dataset for large-scale graphs. The graph structure data, which is typically much smaller than the feature data, is pinned in system memory to enable rapid GPU graph sampling. Lastly, the GIDS data loader allocates a software-defined cache in GPU memory for recently accessed nodes to reduce storage accesses.
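The idea of a software-defined cache for recently accessed nodes can be illustrated with a small Python sketch. This is not the GIDS implementation, which runs on the GPU. The least-recently-used eviction policy, the `FeatureCache` class, and the `fake_ssd_read` function are all assumptions chosen for illustration, since the article does not state how GIDS selects entries to evict.

```python
from collections import OrderedDict


class FeatureCache:
    """Toy LRU cache that sits in front of a slower feature store."""

    def __init__(self, capacity, fetch_from_storage):
        self.capacity = capacity
        self.fetch_from_storage = fetch_from_storage  # called only on a miss
        self.entries = OrderedDict()                  # node_id -> features
        self.hits = 0
        self.misses = 0

    def get(self, node_id):
        if node_id in self.entries:
            # Hit: mark the node as most recently used and skip storage.
            self.entries.move_to_end(node_id)
            self.hits += 1
            return self.entries[node_id]
        # Miss: go to storage, then cache the result.
        self.misses += 1
        features = self.fetch_from_storage(node_id)
        self.entries[node_id] = features
        if len(self.entries) > self.capacity:
            # Evict the least recently used node.
            self.entries.popitem(last=False)
        return features


def fake_ssd_read(node_id):
    """Hypothetical stand-in for an SSD read of one node's features."""
    return [float(node_id)] * 4


cache = FeatureCache(capacity=2, fetch_from_storage=fake_ssd_read)
for node in [1, 2, 1, 3, 2]:
    cache.get(node)
print(cache.hits, cache.misses)
```

Each hit is one storage access avoided, so the benefit depends on how often sampling revisits the same nodes.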

Graph neural network training using GIDS

To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. This dataset is 2.28 TB in size and would not fit into most platforms’ system memory. We timed the training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as seen in Figure 1 and Table 1.

Figure 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations

| Phase | GIDS (4 SSDs) | GIDS (2 SSDs) | GIDS (1 SSD) | DGL Memory Map Abstraction |
|---|---|---|---|---|
| Sampling | 4.75 | 4.93 | 4.08 | 4.65 |
| Feature Aggregation | 8.57 | 15.9 | 31.6 | 1,130 |
| Training | 1.98 | 1.97 | 1.87 | 2.13 |
| End-to-End | 15.3 | 22.8 | 37.6 | 1,143 |

Table 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations


The first part of training is graph sampling, which the GPU performs by accessing the graph structure data in system memory (seen in blue). This value varies little across the test configurations because the structure stored in system memory does not change between these tests.

Another part is the actual training time (seen at the far right in green). This part is highly dependent on the GPU, and we can see that this does not change much between the multiple test configurations as expected.

The most important section, where we see the largest difference, is feature aggregation (shown in gold). As the feature data is stored on the Micron 9400 SSDs for this system, we see that scaling from 1 to 4 Micron 9400 SSDs drastically improves (reduces) the feature aggregation processing time. Feature aggregation improves by 3.68x as we scale from 1 SSD to 4 SSDs. 

We also included a baseline measurement, which uses a memory map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method of accessing the feature data requires the use of the CPU software stack instead of direct access by the GPU, we can see how inefficient the CPU software stack is at keeping the GPU saturated during training. The feature aggregation improvement versus baseline is 35.76x for 1 Micron 9400 NVMe SSD using GIDS and 131.87x for 4 Micron 9400 NVMe SSDs. Another view of this data can be seen in Figure 2 and Table 2, which show the effective bandwidth and IOPS during these tests.
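The improvement factors quoted above can be checked against the feature aggregation row of Table 1 with a few lines of Python. Because the table values are rounded, the computed ratios can differ from the quoted figures in the last digit (for example, about 3.69x rather than 3.68x for 1 to 4 SSDs); the quoted figures presumably come from unrounded measurements. The dictionary and function names are illustrative only.

```python
# Feature aggregation times from Table 1 (units as reported in the table).
feature_aggregation = {
    "GIDS (4 SSDs)": 8.57,
    "GIDS (2 SSDs)": 15.9,
    "GIDS (1 SSD)": 31.6,
    "DGL Memory Map Abstraction": 1130.0,
}


def speedup(slower, faster):
    """Return how many times faster `faster` is than `slower`."""
    return feature_aggregation[slower] / feature_aggregation[faster]


scaling_1_to_4 = speedup("GIDS (1 SSD)", "GIDS (4 SSDs)")
baseline_vs_1 = speedup("DGL Memory Map Abstraction", "GIDS (1 SSD)")
baseline_vs_4 = speedup("DGL Memory Map Abstraction", "GIDS (4 SSDs)")

print(f"1 SSD -> 4 SSDs:       {scaling_1_to_4:.2f}x")
print(f"Baseline vs 1 SSD:     {baseline_vs_1:.2f}x")
print(f"Baseline vs 4 SSDs:    {baseline_vs_4:.2f}x")
```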

Figure 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline

| Metric | DGL Memory Map | GIDS (1 SSD) | GIDS (2 SSDs) | GIDS (4 SSDs) |
|---|---|---|---|---|
| Effective Bandwidth (GB/s) | 0.194 | 6.9 | 13.8 | 25.6 |
| Achieved IOPS (millions) | 0.049 | 1.7 | 3.4 | 6.3 |

Table 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline
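One way to read Table 2 is to divide bandwidth by IOPS to get the implied average size of each access. The sketch below does that, assuming 1 GB = 10^9 bytes and that the IOPS row is in millions of operations per second; the article does not state the access size, so the result is an inference from the table rather than a reported measurement.

```python
# (effective bandwidth in GB/s, achieved IOPS in millions) from Table 2.
results = {
    "DGL Memory Map": (0.194, 0.049),
    "GIDS (1 SSD)": (6.9, 1.7),
    "GIDS (2 SSDs)": (13.8, 3.4),
    "GIDS (4 SSDs)": (25.6, 6.3),
}


def implied_io_size_bytes(bandwidth_gb_s, iops_millions):
    """Average bytes per I/O, assuming 1 GB = 1e9 bytes."""
    return (bandwidth_gb_s * 1e9) / (iops_millions * 1e6)


sizes = {name: implied_io_size_bytes(bw, iops)
         for name, (bw, iops) in results.items()}

for name, size in sizes.items():
    print(f"{name}: about {size:.0f} bytes per I/O")
```

Under these assumptions, every configuration works out to roughly 4 KB per access, which is consistent with the fine-grained accesses that BaM is designed to serve.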


As datasets continue to grow, we can see the need for a shift in paradigm in order to train these models in a reasonable amount of time and to take advantage of the improvements provided by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more of these types of systems in the future.

Test System

| Component | Details |
|---|---|
| Server | Supermicro® AS 4124GS-TNR |
| CPU | 2x AMD EPYC™ 7702 (64 Core) |
| Memory | 1 TB Micron DDR4-3200 |
| GPU | NVIDIA A100 80GB; Memory Clock: 1512 MHz; SM Clock: 1410 MHz |
| SSDs | 4x Micron 9400 MAX 6.4TB |
| OS | Ubuntu 20.04, Kernel 5.4.0 |
| NVIDIA Driver | 535.113.01 |

MTS, Systems Performance Engineer

John Mazzie

John is a Member of the Technical Staff in the Data Center Workload Engineering group in Austin, TX. He graduated from West Virginia University in 2008 with an MSEE with an emphasis in wireless communications. John has worked for Dell on its MD3 Series of storage arrays on both the development and sustaining sides. John joined Micron in 2016, where he has worked on Cassandra, MongoDB, Ceph, and other advanced storage workloads.