[RFC 0/1] block: export windowed IO P99 latency
Posted by Diangang Li 4 weeks, 1 day ago
Production environments occasionally run into elevated tail latencies. The
source can be the underlying device, but it can also be higher in the
stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
Existing block IO statistics only provide throughput and average latency,
which fail to capture the critical tail end of the latency distribution
that often causes user-visible performance problems.

This patch adds windowed P99 latency tracking for block IO operations,
exposing the 99th percentile latency in /proc/diskstats and
/sys/block/<dev>/stat. System administrators can now monitor tail latency
trends over time using tools like iostat, enabling quick validation or
elimination of disk hardware as the source of latency issues.

Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
range) with minimal overhead. P99 values are computed by aggregating
recent 1-second slices when reading statistics, reported in microseconds
using bucket midpoints.
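
For reference, a rough userspace sketch of the bucketing and percentile
math described above (the exact bucket boundaries, the rounding of the
P99 target, and all names here are illustrative assumptions, not the
patch's actual code):

#include <stdint.h>
#include <stdio.h>

#define LAT_BUCKETS     21      /* power-of-two buckets, ~8us .. ~8s */
#define LAT_BASE_US     8ULL

/* Map a completion latency (in microseconds) to a histogram bucket. */
static unsigned int lat_to_bucket(uint64_t lat_us)
{
        unsigned int b = 0;

        while (b < LAT_BUCKETS - 1 && lat_us >= (LAT_BASE_US << (b + 1)))
                b++;
        return b;
}

/* Report P99 as the midpoint of the bucket holding the 99th percentile. */
static uint64_t histogram_p99_us(const uint64_t cnt[LAT_BUCKETS])
{
        uint64_t total = 0, seen = 0, target;
        unsigned int b;

        for (b = 0; b < LAT_BUCKETS; b++)
                total += cnt[b];
        if (!total)
                return 0;

        target = (total * 99 + 99) / 100;       /* ceil(0.99 * total) */
        for (b = 0; b < LAT_BUCKETS; b++) {
                seen += cnt[b];
                if (seen >= target)
                        break;
        }
        /* midpoint of [LAT_BASE_US << b, LAT_BASE_US << (b + 1)) */
        return (LAT_BASE_US << b) + (LAT_BASE_US << b) / 2;
}

int main(void)
{
        uint64_t cnt[LAT_BUCKETS] = { 0 };
        uint64_t samples[] = { 50, 120, 95, 4000, 80, 75, 60, 130, 110, 25000 };
        size_t i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                cnt[lat_to_bucket(samples[i])]++;
        printf("p99 ~= %llu us\n", (unsigned long long)histogram_p99_us(cnt));
        return 0;
}

Midpoint reporting means the value is only as precise as the bucket
width, which seems acceptable for spotting tail-latency trends.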

The added work on the IO path is intentionally small (bucket selection and
a per-CPU counter update, with occasional per-slice reset), and in our
testing it does not have a measurable impact on IO performance.
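
Continuing the sketch above, the per-IO work amounts to roughly the
following (the per-CPU plumbing is elided, and the ring depth and the
lazy reset-on-reuse are assumptions of the sketch, not the patch):

#include <string.h>

#define LAT_SLICES      8       /* ring of 1-second slices (depth assumed) */

struct lat_ring {
        uint64_t slice_sec[LAT_SLICES];         /* second each slice covers */
        uint64_t cnt[LAT_SLICES][LAT_BUCKETS];
};

/* Per-IO work: pick a bucket, bump a counter in the current 1s slice. */
static void lat_record(struct lat_ring *r, uint64_t now_sec, uint64_t lat_us)
{
        unsigned int s = now_sec % LAT_SLICES;

        /* Reset a slice lazily the first time it is reused for a new second. */
        if (r->slice_sec[s] != now_sec) {
                memset(r->cnt[s], 0, sizeof(r->cnt[s]));
                r->slice_sec[s] = now_sec;
        }
        r->cnt[s][lat_to_bucket(lat_us)]++;
}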

Diangang Li (1):
  block: export windowed IO P99 latency

 block/blk-core.c          |  5 ++-
 block/blk-flush.c         |  6 ++-
 block/blk-mq.c            |  5 ++-
 block/genhd.c             | 50 ++++++++++++++++++++++++-
 include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 139 insertions(+), 6 deletions(-)

-- 
2.39.5
Re: [RFC 0/1] block: export windowed IO P99 latency
Posted by Diangang Li 1 week, 1 day ago
On 2026/1/9 16:31, Diangang Li wrote:
> Production environments occasionally run into elevated tail latencies. The
> source can be the underlying device, but it can also be higher in the
> stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
> Existing block IO statistics only provide throughput and average latency,
> which fail to capture the critical tail end of the latency distribution
> that often causes user-visible performance problems.
> 
> This patch adds windowed P99 latency tracking for block IO operations,
> exposing the 99th percentile latency in /proc/diskstats and
> /sys/block/<dev>/stat. System administrators can now monitor tail latency
> trends over time using tools like iostat, enabling quick validation or
> elimination of disk hardware as the source of latency issues.
> 
> Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
> range) with minimal overhead. P99 values are computed by aggregating
> recent 1-second slices when reading statistics, reported in microseconds
> using bucket midpoints.
> 
> The added work on the IO path is intentionally small (bucket selection and
> a per-CPU counter update, with occasional per-slice reset), and in our
> testing it does not have a measurable impact on IO performance.
> 
> Diangang Li (1):
>    block: export windowed IO P99 latency
> 
>   block/blk-core.c          |  5 ++-
>   block/blk-flush.c         |  6 ++-
>   block/blk-mq.c            |  5 ++-
>   block/genhd.c             | 50 ++++++++++++++++++++++++-
>   include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
>   5 files changed, 139 insertions(+), 6 deletions(-)
> 

Hi Jens, hi all,

Quick sanity check on the motivation/design before I respin.

I want to expose a simple tail-latency metric (P99) via diskstats and
the sysfs stat file, since average latency and throughput often miss
the spikes we see in production.

I considered computing read-to-read deltas, but diskstats is polled
frequently (often at sub-second intervals, by multiple agents), so the
effective window becomes reader-dependent and too short/noisy. The
current approach instead uses a fixed window (per-CPU 1-second slices
in a small ring histogram) and aggregates the recent slices on read.
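
To make that concrete, here is a rough continuation of the userspace
sketch from the cover letter (struct lat_ring, lat_to_bucket() and
histogram_p99_us() come from there; the window length and the flat
per-CPU array are assumptions of the sketch): regardless of how often
a reader polls, it folds the most recent completed slices across CPUs
into one histogram and takes P99 from that.

#define LAT_WINDOW_SEC  4       /* assumed aggregation window */

static uint64_t lat_read_p99(const struct lat_ring *rings,
                             unsigned int nr_cpus, uint64_t now_sec)
{
        uint64_t sum[LAT_BUCKETS] = { 0 };
        unsigned int cpu, b, i;

        for (cpu = 0; cpu < nr_cpus; cpu++) {
                for (i = 1; i <= LAT_WINDOW_SEC; i++) {
                        uint64_t sec = now_sec - i;
                        unsigned int s = sec % LAT_SLICES;

                        /* Skip slices that saw no IO for that second. */
                        if (rings[cpu].slice_sec[s] != sec)
                                continue;
                        for (b = 0; b < LAT_BUCKETS; b++)
                                sum[b] += rings[cpu].cnt[s][b];
                }
        }
        return histogram_p99_us(sum);
}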

Does this direction make sense? Is diskstats/sysfs the right place for
it? And is there a better low-overhead, polling-independent approach
(or any preferences on the percentile, window length, or bucket layout)?

Best regards,
Diangang Li