Data Center

Why latency in data center SSDs matters and how Micron became best in class

Steven Wells | June 2024

Why are latency outliers so important to mitigate?

In a 2015 paper, Meta presented some real-life implementation details of the social graph used by Facebook.1 The authors start with a hypothetical case of two posts by Alice that friends Bob and Carol each comment on and like. When Alice picks up her phone and opens Facebook, the news feed needs to identify who her friends are and what their posts are, as well as to run the queries that notify Alice that Bob and Carol liked and commented on her posts.

First, let’s consider a solution (Figure 1) where a single worker executes the n subqueries one after another. In this case, execution time is O(n), and each lookup costs roughly the average (not worst-case) lookup latency.

Figure 1. Single worker doing N queries where execution time is average latency.

Hyperscalers take a different approach (Figure 2): n workers each perform one lookup, and the n results are then aggregated. In this case, execution depth is O(1), but execution time becomes the worst-case latency across the n nodes.

Figure 2. Hyperscaler approach issuing multiple subqueries in parallel with aggregation. Execution time is longest lookup latency.
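As a sketch of the difference between the two figures, the toy model below (all numbers hypothetical: a lookup of roughly 100 µs with a rare 10 ms outlier) compares the sequential single-worker time against the fork/join time, which is gated by the slowest of the n lookups:

```python
import random

random.seed(42)

def lookup_latency_us():
    """Hypothetical lookup latency: ~100 us typical, rare 10 ms outliers."""
    return 10_000 if random.random() < 0.001 else random.gauss(100, 10)

N = 1000  # subqueries per news-feed request

# Figure 1: one worker issues the N lookups back-to-back.
sequential = sum(lookup_latency_us() for _ in range(N))

# Figure 2: N workers each do one lookup; the join waits for the slowest.
parallel = max(lookup_latency_us() for _ in range(N))

print(f"sequential total  ~ {sequential / 1000:.1f} ms")
print(f"parallel fork/join ~ {parallel / 1000:.1f} ms")
```

The parallel version finishes orders of magnitude sooner in total, but its latency is entirely determined by the tail of the distribution rather than its average.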

The Meta paper goes on to show that the situation is actually worse: a single request fans out into thousands of subqueries through multiple fork/join stages, with critical paths dozens of joins deep (Figure 3). At that scale, even a single outlier, in this case at three nines, would impact nearly every query. This outcome justifies looking at latencies at four nines at a minimum, if not at six or seven nines (six-nines latency is shown in Figure 6).

Figure 3. When Alice opens her Facebook app, thousands of servers do queries with critical paths that are dozens deep. A storage device with a high latency event — even at three or four nines — affects nearly every Meta user every day.
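The arithmetic behind this claim is simple: if each subquery independently exceeds its tail threshold with probability p, then a fan-out of n subqueries hits at least one outlier with probability 1 - (1 - p)^n. The fan-out size below is illustrative, not taken from the Meta paper:

```python
# Probability that a fork/join over n subqueries hits at least one
# outlier, when each subquery exceeds the tail threshold with prob p.
def p_any_outlier(n, p):
    return 1 - (1 - p) ** n

# At three nines (p = 0.001), a 1,000-way fan-out is more likely
# than not to hit an outlier on every single request:
print(f"{p_any_outlier(1000, 0.001):.1%}")     # ~ 63.2%

# At six nines (p = 0.000001), the same fan-out rarely does:
print(f"{p_any_outlier(1000, 0.000001):.2%}")  # ~ 0.10%
```

This is why a device-level latency spec at three nines is nearly meaningless to a hyperscaler, while six or seven nines maps directly to end-user experience.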

This situation extends well beyond Meta and its social graph to many database-intensive applications. A good discussion in another Micron blog looks at a YCSB database workload against various storage solutions, including the Micron 7450.2,3

At a minimum, it’s good to look at read-intensive workloads that include a “firestorm” of writes (such as 70% reads, 30% writes) and are deeply queued to ensure full pressure on the NAND array and controller. Examining read tail latencies under various pressures is also important, as those conditions are closer to the typical daily server experience.
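One minimal way to report such tail latencies, assuming you already have per-read latency samples (the sample generator below is entirely hypothetical), is to sort them and index at the desired number of nines:

```python
import random

random.seed(7)

def read_latency_us():
    """Hypothetical read latencies under a 70/30 mixed workload:
    most reads ~90 us, a small fraction stalled behind writes/GC."""
    if random.random() < 0.0005:  # read stalled behind a program/erase
        return random.uniform(2_000, 8_000)
    return random.gauss(90, 15)

samples = sorted(read_latency_us() for _ in range(1_000_000))

def nines(samples, k):
    """Latency at k nines, e.g. k=4 -> 99.99th percentile."""
    q = 1 - 10 ** -k
    return samples[int(q * (len(samples) - 1))]

for k in (2, 3, 4):
    print(f"{k} nines: {nines(samples, k):.0f} us")
```

Note how the two-nines figure looks healthy while the four-nines figure exposes the stalls, which is exactly why the deeper percentiles matter.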

What causes latency variations and mitigations?

To borrow the nomenclature of CPU architecture, an SSD is both deeply pipelined (many stages) and superscalar (many parallel units). Avoiding pipeline stalls is key to the performance of CPUs and SSDs alike. In the case of SSDs (Figure 4), pipeline stalls can come from many sources, which we’ll consider below.

Figure 4. An SSD is a deeply pipelined and highly parallel system servicing multiple read requests. Pipeline stalls due to resource collisions, such as busy die and/or channels, need careful mitigation.

Some of the lowest-order impacts on this idealized latency result from attempting to read data while the target die or NAND bus is busy servicing the request of another pipeline stage, an occurrence commonly called a plane, die or channel collision. When a read conflicts with another read, it’s common to stall the pipe on the later read and complete the in-progress read first.
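A toy queuing sketch of die collisions (die count, read time and arrival rate are all hypothetical) shows how even a moderately loaded array produces a read tail well above the median:

```python
import random

random.seed(3)

DIES = 32      # hypothetical number of NAND dies
READ_US = 80   # hypothetical page-read time

# Each die is free again at busy_until[d]; a read landing on a busy
# die stalls until the in-progress read completes (a "die collision").
busy_until = [0.0] * DIES
latencies = []
t = 0.0
for _ in range(100_000):
    t += 10                          # a new read arrives every 10 us
    d = random.randrange(DIES)       # which die holds the data
    start = max(t, busy_until[d])    # stall here if the die is busy
    busy_until[d] = start + READ_US
    latencies.append(start + READ_US - t)

latencies.sort()
print(f"median: {latencies[50_000]:.0f} us, "
      f"4 nines: {latencies[int(0.9999 * 99_999)]:.0f} us")
```

Most reads land on an idle die and see the raw read time; the tail is made up of the unlucky reads that queue behind one or more in-progress reads on the same die.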

Adding in host writes (30% in this case) and their associated garbage collection creates not only additional reads but also programs and erases to the NAND. Program latencies can be five to 10 times as long as NAND reads, and NAND erases can be an order of magnitude longer than NAND reads. Figure 5 gives a somewhat humorous view of GC’s impact on host activity, while Figure 6 details what the read pipeline stalls would look like without suspends.

Figure 5. Why garbage collection on an SSD is a meaningful contributor to overall SSD latency.

Figure 6. A circa 2017 conceptual view of garbage collection impacts on read latencies without available program/erase suspends.4

This is where pipeline “suspends” for programs and erases come in, allowing host reads to be serviced. Working closely together, NAND component engineers, system-on-chip engineers and firmware engineers invented program and erase suspends to help mitigate these latencies. Today we see well under 2 ms of latency impact at five nines for the above workload, a result that is at least five times better.
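A toy model of this effect (all timings hypothetical, and the suspend modeled as a fixed-granularity wait rather than any particular NAND’s mechanism) shows how suspends compress the read tail:

```python
import random

random.seed(1)

READ_US = 80               # hypothetical NAND page-read time
ERASE_US = 5_000           # hypothetical block-erase time (>> read)
SUSPEND_GRANULE_US = 100   # how quickly an erase can be suspended

def read_latency(suspend_enabled):
    """A read arriving at a die that is mid-erase 10% of the time."""
    if random.random() < 0.10:  # die busy erasing
        remaining = random.uniform(0, ERASE_US)
        # Without suspend, the read stalls for the whole remaining erase.
        # With suspend, it stalls only until the next suspend point.
        wait = min(remaining, SUSPEND_GRANULE_US) if suspend_enabled else remaining
        return wait + READ_US
    return READ_US

def p999(enabled, n=100_000):
    lat = sorted(read_latency(enabled) for _ in range(n))
    return lat[int(0.999 * (n - 1))]

print(f"3-nines read latency, no suspend:   {p999(False):.0f} us")
print(f"3-nines read latency, with suspend: {p999(True):.0f} us")
```

With suspends, the worst a read can see is one suspend granule plus the read itself; without them, the tail stretches to the full erase time.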

How can pipeline stalls be resolved?

Let’s head back to the freeway visual in Figure 5. An SSD with deeply queued reads and writes with the associated garbage collection is like a multilane freeway. Latency outliers on a freeway (aka traffic jams) are very similar to latency outliers in an SSD. A couple of latency outlier strategies can best be understood through the analogy of preventing freeway traffic jams.

Analogy 1: Don’t let freight trains block busy freeways (I’m not joking)

Obviously, a freight train blocking all lanes of traffic on a busy freeway at rush hour would guarantee a traffic jam (Figure 7). Well, it has taken time for SSD designers to fully appreciate that fact. Although it seems obvious, the OCP data center NVMe™ SSD specification more than once specifically calls out the need to not let a freight train cross a busy freeway, likely stemming from the OCP authors’ experience with prior designs:

  • Smart IO shall not block any host IO (SLOG-6)
  • Other periodically monitored logs (LMLOG-4 and TEL-5) shall constrain I/O blocking to a small figure, approximately 1 ms.

Figure 7. Having a train block lanes of traffic is bad practice in both freeway design and SSD design. Never block the flow of I/O (source Dall-E).

Analogy 2: Employ freeway on-ramp regulation

Another useful tool to prevent traffic jams is freeway on-ramp metering (Figure 8). We have probably all experienced this approach, which seems counterintuitive as a way to speed up traffic. Data from the U.S. Department of Transportation shows a travel time reduction (latency) of 20% or more.5 SSDs follow the same concept to prevent the congestion that leads to latency outliers. We throttle not only writes but also overall I/O, including garbage collection, to ensure peak performance without the frustrating delay of an internal traffic jam on the SSD.

Figure 8. Freeway on-ramp meters reduce travel time by >20% on near-full-capacity freeways. The same ingress throttling concept is used in SSDs to prevent tail latency outliers.
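Ingress throttling of this kind is often sketched as a token bucket: host and garbage-collection I/O alike must take a token before entering the pipeline, capping admission at a rate the NAND array can sustainably drain. A minimal illustration, with entirely hypothetical rates:

```python
# Toy token-bucket meter: every command takes a token before entering
# the pipeline, so ingress never exceeds what the NAND array can drain.
class TokenBucket:
    def __init__(self, rate_per_tick, burst):
        self.rate = rate_per_tick   # sustainable admissions per tick
        self.burst = burst          # short-term burst allowance
        self.tokens = burst

    def tick(self):
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_admit(self, cost=1):
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # command waits at the "on-ramp"

bucket = TokenBucket(rate_per_tick=4, burst=8)
admitted = deferred = 0
for _ in range(1000):           # one iteration per tick
    bucket.tick()
    for _ in range(6):          # offered load exceeds sustainable rate
        if bucket.try_admit():
            admitted += 1
        else:
            deferred += 1
print(admitted, deferred)
```

After the initial burst allowance drains, admission settles at exactly the sustainable rate, and the excess demand queues at the meter instead of jamming the pipeline.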

Why is Micron committed to being best in class for tail latency mitigation?

As mentioned before, the industry struggled with latency outliers six to seven years ago, before the advent of program and erase suspends. That approach, plus ingress throttling and optimization of NAND and controller interactions, has evolved Micron SSDs (Figure 9) to the degree that, today, we consider them best in class.

Figure 9. Progression of latency outlier improvement over four generations of Micron SSDs. The Micron 7500 offers best-in-class latency mitigation for mainstream data center SSDs.

So next time you pick up your mobile device and surf social media, admire the great end-user experience you get in terms of responsiveness. Admire the wonder of literally thousands of servers running thousands of parallel queries that are deeply fork-joined. Even if you’re running a server farm on a much smaller scale, having consistent and predictable performance is key to ensuring consistent service to your customers. This performance is where Micron’s mainstream data center SSDs shine. 

Fellow, Architect Storage Systems

Steven Wells

Steven Wells is a Fellow at Micron focusing on next-generation SSD solutions, with more than 65 patents in the area of non-volatile storage. He has been involved in flash component and SSD design since 1987 and has published at multiple conferences, including ISSCC, JSSC, Flash Memory Summit, Storage Developer Conference, and OCP Global Summit.