Why so much interest in high-capacity SSDs?

Currie Munce | September 2024

Anyone who attended the Future of Memory and Storage 2024 (FMS 2024) conference this year in Santa Clara, California, was inundated with presentations on large-capacity SSDs. Just a year ago, 64TB was considered too large a capacity by many customers. In multiple presentations and exhibit booths at FMS 2024, SSD roadmaps for products coming out in the next few years showed capacities as high as 128TB and 256TB. Why this sudden change? After a difficult financial period last year, has the flash industry collectively lost its mind? What’s going on?

There must be an explanation

The explanation for the rapid shift is the same as for many other changes happening today in the IT industry: the explosive emergence of generative AI. There have long been whispers in the storage industry about the day when HDDs become too slow and are replaced by fast, cheap SSDs. The challenge has been that HDDs are inexpensive, and clever storage software developers keep finding ways to squeeze just enough performance out of them.

That is, until large GPU clusters came along that consume enormous amounts of training data at high speed. Exponentially growing large language models (LLMs) need more and more data for training, and GPUs ingest that data far faster than traditional CPUs do. The massive increase in data demand is breaking HDDs' ability to keep up, even when users stripe data across thousands of HDDs; the power and space required to do so are simply too high.

Why not simply place SSDs near the GPUs for speed and keep bulk data on HDDs? Because generative AI is a workflow, not just an application. The process involves ingesting, curating and formatting data for training, feeding it iteratively to the GPUs, and periodically checkpointing to avoid restarting from scratch. Public LLMs need to be optimized and fine-tuned with user data, and retrieval-augmented generation (RAG) during inferencing requires quick access to application-specific data. Moving data between different storage systems is complex, expensive and power-inefficient, and it diverts focus from developing better models and leveraging existing ones.

That’s where high-capacity, low-cost SSDs come in. In compute systems, SSD performance is typically measured in IOPS (input/output operations per second). For storage systems, device performance is measured by throughput per capacity (MB/s / TB). For large GPU training clusters, the system requirement can be as high as 100 MB/s of bandwidth per terabyte of storage capacity. These large storage systems — which hold text, images and videos for multimodal models — need system capacities of petabytes to exabytes, hence requiring hundreds to tens of thousands of individual drives. 
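
To make the throughput-per-capacity metric concrete, here is a minimal sizing sketch in Python. The drive figures (a 24TB HDD at roughly 280 MB/s, a 61.44TB SSD at roughly 7,000 MB/s) and the 10PB system target are illustrative assumptions, not product specifications; the point is only that the drive count is set by whichever of the capacity target or the MB/s-per-TB target is harder to meet.

```python
import math

# Back-of-the-envelope storage-fleet sizing for a GPU training cluster.
# All drive figures below are illustrative assumptions, not product specs.

def drives_needed(system_capacity_tb: float,
                  mbps_per_tb: float,
                  drive_capacity_tb: float,
                  drive_bandwidth_mbps: float) -> int:
    """Drive count that satisfies BOTH the capacity target and the
    aggregate-bandwidth target expressed as MB/s per TB of capacity."""
    for_capacity = system_capacity_tb / drive_capacity_tb
    total_mbps = system_capacity_tb * mbps_per_tb      # required aggregate MB/s
    for_bandwidth = total_mbps / drive_bandwidth_mbps
    return math.ceil(max(for_capacity, for_bandwidth))

# Hypothetical 10 PB training store that must deliver 100 MB/s per TB.
capacity_tb = 10_000          # 10 PB
target_mbps_per_tb = 100

hdds = drives_needed(capacity_tb, target_mbps_per_tb,
                     drive_capacity_tb=24, drive_bandwidth_mbps=280)
ssds = drives_needed(capacity_tb, target_mbps_per_tb,
                     drive_capacity_tb=61.44, drive_bandwidth_mbps=7000)

print(f"HDDs needed: {hdds}")   # ~3,572 -- bandwidth-bound, not capacity-bound
print(f"SSDs needed: {ssds}")   # ~163   -- capacity-bound
```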

SSDs boast bandwidths up to 50 times greater than HDDs, so far fewer SSDs can deliver the same system throughput as a large HDD fleet. With fewer drives, however, each SSD must hold more than an HDD does to meet the system capacity requirement. How much larger?

That depends on performance requirements and network bandwidth. These storage systems are usually connected to the GPU clusters with ultrafast networks, but the aggregate bandwidth of those networks is still much lower than the aggregate bandwidth of the SSDs. For the largest GPU clusters (requiring up to 100 MB/s / TB), capacities up to 64TB are often the limit. For smaller clusters or systems with lower performance demands, some users want to scale capacities up to 128TB or even 256TB once those SSDs become available.
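
A second sketch, using similarly hypothetical numbers, shows where capacity ceilings like 64TB come from: dividing the bandwidth a single SSD can usefully contribute (often capped by the network rather than the drive itself) by the required MB/s per TB gives the largest capacity at which that drive still carries its share of the load.

```python
# Largest useful capacity per SSD for a given throughput-per-capacity target.
# The effective per-drive bandwidth is a hypothetical figure; in practice it
# is often limited by the storage network rather than by the drive itself.

def max_useful_capacity_tb(effective_drive_mbps: float,
                           required_mbps_per_tb: float) -> float:
    """Capacity beyond which one drive can no longer meet the MB/s-per-TB target."""
    return effective_drive_mbps / required_mbps_per_tb

# Largest GPU clusters: ~100 MB/s per TB of capacity.
print(max_useful_capacity_tb(6_400, 100))   # 64.0 TB

# A lower-performance tier at ~25 MB/s per TB leaves room for much larger drives.
print(max_useful_capacity_tb(6_400, 25))    # 256.0 TB
```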

Because SSDs in these networked storage systems don't operate at maximum speed, they use significantly less power than they would in standard compute applications. And since top speed and high write-cycle endurance aren't paramount, design trade-offs can be made to reduce cost relative to traditional mainstream compute SSDs.

Here's the bottom-line reason

As a result, all-SSD storage systems, compared with a mix of SSDs and HDDs, have fewer drives, storage servers and racks; consume less energy; offer higher reliability, longer useful lifetimes and better latency; leave GPUs idle less while awaiting data; and are simpler to manage.

Where does it go from here?

Large GPU clusters and GPU-as-a-service cloud providers are opting for large-capacity, low-cost SSDs for their storage needs. These initial applications justify the higher cost of the SSD for its performance, power and capacity advantages over the HDD. Other high-performance use cases are expected to convert to SSDs over the next several years as HDD throughput per capacity (MB/s / TB) continues to decline with growing drive capacities. It's great to be cheaper, but if users can't meet their performance requirements, idling CPUs, GPUs and other accelerators is expensive in both power and system cost.

We've long been acquainted with the storage and memory hierarchy, continuously updating it with new technologies and adjusting where the boundaries fall between the blocks in the pyramid. Micron has now introduced a new block to the pyramid, capacity SSDs, to recognize the emerging role for this new class of SSD that caters to storage applications requiring large capacities while balancing performance, power and cost.

Figure 1: Pyramid diagram showing how SSDs fit into the storage and memory hierarchy


For more information on Micron's offering in this new data center segment, see: Micron 6500 ION NVMe SSD

Currie Munce is a Senior Technology Advisor and Strategist for Micron's Storage Business, helping to define storage architecture and technology directions for the company.