Production environments occasionally run into elevated tail latencies. The
source can be the underlying device, but it can also be higher in the
stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
Existing block IO statistics only provide throughput and average latency,
which fail to capture the critical tail end of the latency distribution
that often causes user-visible performance problems.

This patch adds windowed P99 latency tracking for block IO operations,
exposing the 99th percentile latency in /proc/diskstats and
/sys/block/<dev>/stat. System administrators can now monitor tail latency
trends over time using tools like iostat, enabling quick validation or
elimination of disk hardware as the source of latency issues.

Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
range) with minimal overhead. P99 values are computed by aggregating
recent 1-second slices when reading statistics, reported in microseconds
using bucket midpoints.

The added work on the IO path is intentionally small (bucket selection and
a per-CPU counter update, with occasional per-slice reset), and in our
testing it does not have a measurable impact on IO performance.

Diangang Li (1):
  block: export windowed IO P99 latency

 block/blk-core.c          |  5 ++-
 block/blk-flush.c         |  6 ++-
 block/blk-mq.c            |  5 ++-
 block/genhd.c             | 50 ++++++++++++++++++++++++-
 include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 139 insertions(+), 6 deletions(-)

--
2.39.5
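To make the bucketing scheme described above concrete, here is a minimal
userspace sketch (not the patch itself): 21 buckets, bucket 0 covering
latencies up to 8us, each following bucket doubling the bound (so the last
bucket reaches ~8.4s), with completions recorded into a small ring of
1-second slices. All names here (hist_slice, lat_to_bucket, hist_record,
LAT_SLICES, ...) are illustrative and do not come from the patch; the
kernel version would use per-CPU storage and the request timestamps.

#include <stdint.h>
#include <string.h>

#define LAT_BUCKETS	21
#define LAT_BASE_US	8		/* upper bound of bucket 0 */
#define LAT_SLICES	8		/* ring of 1-second slices */

struct hist_slice {
	uint64_t start_sec;		/* which second this slice covers */
	uint64_t count[LAT_BUCKETS];
};

/* Map a completion latency in microseconds to a bucket index. */
static int lat_to_bucket(uint64_t lat_us)
{
	uint64_t bound = LAT_BASE_US;
	int b = 0;

	while (b < LAT_BUCKETS - 1 && lat_us > bound) {
		bound <<= 1;
		b++;
	}
	return b;
}

/*
 * Record one IO completion into the slice for "now_sec".  A slice is
 * lazily reset the first time it is reused for a new second, which is
 * the "occasional per-slice reset" mentioned in the cover letter.
 * The ring is assumed to start zero-initialized.
 */
static void hist_record(struct hist_slice *ring, uint64_t now_sec,
			uint64_t lat_us)
{
	struct hist_slice *s = &ring[now_sec % LAT_SLICES];

	if (s->start_sec != now_sec) {
		memset(s->count, 0, sizeof(s->count));
		s->start_sec = now_sec;
	}
	s->count[lat_to_bucket(lat_us)]++;
}

The fast path is just the bucket search (in the kernel this would
typically be an fls/ilog2-style computation rather than a loop) and one
counter increment, which is why the per-IO overhead stays small.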
On 2026/1/9 16:31, Diangang Li wrote:
> Production environments occasionally run into elevated tail latencies. The
> source can be the underlying device, but it can also be higher in the
> stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
> Existing block IO statistics only provide throughput and average latency,
> which fail to capture the critical tail end of the latency distribution
> that often causes user-visible performance problems.
>
> This patch adds windowed P99 latency tracking for block IO operations,
> exposing the 99th percentile latency in /proc/diskstats and
> /sys/block/<dev>/stat. System administrators can now monitor tail latency
> trends over time using tools like iostat, enabling quick validation or
> elimination of disk hardware as the source of latency issues.
>
> Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
> range) with minimal overhead. P99 values are computed by aggregating
> recent 1-second slices when reading statistics, reported in microseconds
> using bucket midpoints.
>
> The added work on the IO path is intentionally small (bucket selection and
> a per-CPU counter update, with occasional per-slice reset), and in our
> testing it does not have a measurable impact on IO performance.
>
> Diangang Li (1):
>   block: export windowed IO P99 latency
>
>  block/blk-core.c          |  5 ++-
>  block/blk-flush.c         |  6 ++-
>  block/blk-mq.c            |  5 ++-
>  block/genhd.c             | 50 ++++++++++++++++++++++++-
>  include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 139 insertions(+), 6 deletions(-)
>

Hi Jens, hi all,

Quick sanity check on the motivation/design before I respin.

I want to expose a simple tail metric (P99) via diskstats/sysfs stat,
since avg latency/throughput often miss the spikes seen in prod.

I considered read-to-read deltas, but diskstats is polled frequently
(often sub-second, multiple agents), so the effective window becomes
reader-dependent and too short/noisy. The current approach uses a fixed
window (per-CPU 1s slices in a small ring histogram) and aggregates on
read.

Does this direction make sense? Is diskstats/sysfs the right place for
it? Any better low-overhead, polling-independent approach (and
preferences on percentile/window/buckets)?

Best regards,
Diangang Li
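The read-side aggregation the reply refers to (fixed window, aggregate the
recent 1-second slices on read, report P99 via bucket midpoints) can be
modelled roughly as below. This continues the illustrative sketch above
and reuses its struct hist_slice and constants; again, the function and
variable names are hypothetical, not taken from the patch.

/* Midpoint of bucket b in microseconds: bucket b covers (8us << (b-1), 8us << b]. */
static uint64_t bucket_mid_us(int b)
{
	uint64_t hi = (uint64_t)LAT_BASE_US << b;
	uint64_t lo = b ? (uint64_t)LAT_BASE_US << (b - 1) : 0;

	return (lo + hi) / 2;
}

/*
 * Sum the slices from the last "window" seconds (skipping stale ones)
 * and walk the summed histogram until 99% of the samples are covered.
 */
static uint64_t hist_p99_us(struct hist_slice *ring, uint64_t now_sec,
			    int window)
{
	uint64_t sum[LAT_BUCKETS] = {0};
	uint64_t total = 0, need, seen = 0;
	int b, w;

	for (w = 0; w < window && w < LAT_SLICES; w++) {
		struct hist_slice *s = &ring[(now_sec - w) % LAT_SLICES];

		if (s->start_sec != now_sec - w)	/* slice too old */
			continue;
		for (b = 0; b < LAT_BUCKETS; b++) {
			sum[b] += s->count[b];
			total += s->count[b];
		}
	}
	if (!total)
		return 0;

	need = (total * 99 + 99) / 100;			/* ceil(99% of samples) */
	for (b = 0; b < LAT_BUCKETS; b++) {
		seen += sum[b];
		if (seen >= need)
			return bucket_mid_us(b);
	}
	return bucket_mid_us(LAT_BUCKETS - 1);
}

Because the histogram is only walked when statistics are read, the cost of
computing the percentile is kept off the IO path, and the window length is
fixed by the slice ring rather than by how often readers poll diskstats.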