RocksDB is a storage-focused key-value database that is the backbone of many operations at Meta. Here is its description from RocksDB.org:
“RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.”
The goal of the Yahoo! Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different “key-value” and “cloud” serving stores. The project comprises two major components:
- The YCSB Client: an extensible workload generator
- The Core workloads: a set of workload scenarios to be executed by the generator
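To make the two components concrete, here is a minimal Python sketch of what a core-workload generator does: it emits a stream of operations and picks keys from a skewed (Zipfian) popularity distribution. The 50/50 read/update mix mirrors YCSB's workload A and the Zipfian constant of 0.99 mirrors YCSB's default. The function names and the `user<N>` key format are hypothetical simplifications. The real YCSB client, written in Java, does considerably more, including key hashing, latency measurement, and pluggable database bindings.

```python
import bisect
import itertools
import random


def zipfian_cdf(n_keys: int, theta: float = 0.99) -> list[float]:
    """Cumulative probabilities for a Zipfian distribution over ranks 1..n_keys."""
    weights = [1.0 / rank ** theta for rank in range(1, n_keys + 1)]
    total = sum(weights)
    return [c / total for c in itertools.accumulate(weights)]


def generate_ops(n_ops: int, n_keys: int, read_proportion: float = 0.5,
                 seed: int = 42) -> list[tuple[str, str]]:
    """Return (operation, key) pairs for a workload-A-style read/update mix."""
    rng = random.Random(seed)
    cdf = zipfian_cdf(n_keys)
    ops = []
    for _ in range(n_ops):
        # Inverse-CDF sampling: low ranks (popular keys) are chosen most often.
        # min() guards against float rounding leaving cdf[-1] slightly below 1.
        idx = min(bisect.bisect_left(cdf, rng.random()), n_keys - 1)
        op = "READ" if rng.random() < read_proportion else "UPDATE"
        ops.append((op, f"user{idx}"))  # hypothetical key format
    return ops
```

Sending each generated operation to a database binding and timing the call is, in essence, the job of the YCSB client.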
In the case of FIO, we examined the transaction log for a job file and found a stretch of time in which nothing occurred, equal in length to the observed latency. However, at the NVMe™ driver layer no transaction had a latency anywhere near 18 ms, so the source of the latency must be the application.
The latency excursion is not a drive issue; it is most likely caused by the behavior of a layer above the drive.
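The log-gap check described above can be sketched in Python. The sketch assumes an FIO per-I/O log, for example one written with the `write_lat_log` option, in which each line begins with a timestamp in milliseconds followed by comma-separated fields. Only the timestamp column is used, because the units of the value column have changed between FIO versions. `largest_gap` is a hypothetical helper, not part of FIO.

```python
def largest_gap(log_lines: list[str]) -> tuple[float, float, float]:
    """Return (gap_ms, start_ms, end_ms) for the widest silence in an FIO log.

    Assumes each non-blank line starts with a timestamp in milliseconds,
    followed by comma-separated fields such as value, direction, block size
    and offset. Only the timestamp is used.
    """
    times = sorted(float(line.split(",")[0]) for line in log_lines if line.strip())
    if len(times) < 2:
        raise ValueError("need at least two log entries")
    # Compare each entry with the next one and keep the widest interval.
    return max((end - start, start, end) for start, end in zip(times, times[1:]))
```

If the widest gap roughly matches the reported maximum latency while the driver layer shows no correspondingly slow command, the stall happened above the device.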
- Even simple tools like FIO can be affected by system effects that cause them to report high latencies.
- Noise on the system can affect the maximum latency that FIO reports.
- QoS latencies out to seven nines (the 99.99999th percentile) look consistent.
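Why the maximum and the seven-nines figure can tell different stories: with the nearest-rank method, a percentile only differs from the maximum once there are enough samples. The sketch below, using hypothetical helper names, computes the percentile for a given number of nines. With fewer than 10,000,000 samples, the 99.99999th percentile is simply the largest observed latency, so a single noisy outlier sets it.

```python
def nines_percentile(latencies: list[float], nines: int) -> float:
    """Nearest-rank percentile for a QoS level given as a number of nines.

    nines=2 -> 99th, nines=3 -> 99.9th, ..., nines=7 -> 99.99999th percentile.
    """
    if not latencies:
        raise ValueError("no samples")
    ordered = sorted(latencies)
    scale = 10 ** nines
    # rank = ceil(n * (1 - 10**-nines)), computed with integers to avoid
    # floating-point rounding at high nines.
    rank = -(-len(ordered) * (scale - 1) // scale)
    return ordered[max(rank, 1) - 1]


def min_samples_for(nines: int) -> int:
    """Samples needed before the percentile can differ from the maximum."""
    return 10 ** nines
```

For 1,000 samples, the 99th and 99.9th percentiles exclude the worst few values, but the seven-nines value equals the maximum.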
A RocksDB latency of 113 ms was observed at the block layer, measured with bpftrace scripts. bpftrace is a tracing framework for Linux that lets the user trace and profile various aspects of the system at runtime. We used the biosnoop script from Brendan Gregg’s BPF Performance Tools, which prints details for each storage device I/O, including its latency, so that time-ordered patterns can be identified.
It uses kprobes (kernel probes), a Linux kernel mechanism for dynamic tracing that lets the user insert a breakpoint at almost any kernel function, invoke a handler, and then continue execution.
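biosnoop prints one line per block I/O, so outliers such as the 113 ms I/O can be picked out by filtering its output afterwards. The Python sketch below assumes the column layout of the bpftrace `biosnoop.bt` tool (TIME(ms), DISK, COMM, PID, LAT(ms)). The layout may differ between versions, so check the header your copy prints. The parser, the threshold, and the file name in the usage note are hypothetical.

```python
from typing import NamedTuple


class BioEvent(NamedTuple):
    time_ms: int
    disk: str
    comm: str
    pid: int
    lat_ms: int


def parse_biosnoop(lines: list[str]) -> list[BioEvent]:
    """Parse biosnoop-style output, skipping banners, headers and blank lines."""
    events = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # e.g. "Attaching N probes..." or the column header
        events.append(BioEvent(
            time_ms=int(fields[0]),
            disk=fields[1],
            comm=" ".join(fields[2:-2]),  # process names may contain spaces
            pid=int(fields[-2]),
            lat_ms=int(fields[-1]),
        ))
    return events


def slow_ios(events: list[BioEvent], threshold_ms: int) -> list[BioEvent]:
    """Keep only I/Os whose latency meets or exceeds the threshold."""
    return [e for e in events if e.lat_ms >= threshold_ms]
```

For example, `slow_ios(parse_biosnoop(open("biosnoop.txt")), threshold_ms=100)` would list every captured I/O that took 100 ms or longer, in time order.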
To find the root cause, we collected several statistics to determine where the high latencies were coming from.
- iostat
  - System input/output statistics for devices and partitions
  - Measured at the kernel layer
- Bus analyzer: captures and analyzes the data transmitted on the PCIe bus
  - Individual transaction information, such as transaction type, address, and data payload
  - Shows the signal timing
- OCP latency monitor
  - Measured at the hardware and firmware layers
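iostat derives its per-device await figures from the kernel's `/proc/diskstats` counters. The sketch below reproduces the `r_await`/`w_await` calculation from two snapshots taken an interval apart, assuming the standard field order documented for `/proc/diskstats`. The helper names are hypothetical.

```python
from typing import NamedTuple


class DiskSnapshot(NamedTuple):
    reads: int           # reads completed successfully
    read_ticks_ms: int   # total time spent on reads (ms)
    writes: int          # writes completed
    write_ticks_ms: int  # total time spent on writes (ms)


def parse_diskstats_line(line: str) -> tuple[str, DiskSnapshot]:
    """Parse one /proc/diskstats line into (device name, counters)."""
    f = line.split()
    # Fields: major minor name reads rd_merges rd_sectors rd_ticks
    #         writes wr_merges wr_sectors wr_ticks ...
    return f[2], DiskSnapshot(int(f[3]), int(f[6]), int(f[7]), int(f[10]))


def await_ms(before: DiskSnapshot, after: DiskSnapshot) -> tuple[float, float]:
    """Average read and write latency (ms per I/O) over the interval."""
    dr = after.reads - before.reads
    dw = after.writes - before.writes
    r_await = (after.read_ticks_ms - before.read_ticks_ms) / dr if dr else 0.0
    w_await = (after.write_ticks_ms - before.write_ticks_ms) / dw if dw else 0.0
    return r_await, w_await
```

Because await is an average over the interval, one 113 ms I/O among thousands of fast ones barely moves it. This is one reason per-I/O tracing and bus-level capture are useful alongside iostat.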