Hi Jens, Steve, Masami,
In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states, which lacks a simple,
out-of-the-box diagnostic path.
This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:
- Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
allocation slow-path to capture precise, event-based starvation.
- Patch 2 complements this by exposing "wait_on_hw_tag" and
"wait_on_sched_tag" per-CPU counters via debugfs for quick,
point-in-time cumulative polling.
Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.
Please let me know your thoughts.
Changes since v5 [1]:
- Replaced this_cpu_inc() with raw_cpu_inc() within
blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
preemptible context immediately prior to io_schedule(). This adjustment
deliberately prioritises the reduction of execution overhead over
absolute statistical precision for this diagnostic interface.
Changes since v4 [2]:
- Prevented a NULL pointer dereference in the tracepoint fast-assign for
disk-less request queues by safely checking q->disk before resolving the
dev_t
- Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
the per-CPU counter allocation from the volatile debugfs lifecycle and
tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
and blk_mq_exit_hctx())
- Fixed a potential compiler double-fetch bug by wrapping the per-CPU
pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
- Passed the appropriate gfp_t flags down to the allocation routines to
maintain the strict GFP_NOIO context
- Updated kernel-doc descriptions to clarify that the NULL pointer
checks guard against memory allocation failures under pressure, rather
than initialisation race conditions
Changes since v3 [3]:
- Transitioned tracking architecture from shared atomic_t variables to
dynamically allocated per-CPU counters to resolve cache line bouncing
(Bart Van Assche)
Changes since v2 [4]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [5]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
Aaron Tomlin (2):
blk-mq: add tracepoint block_rq_tag_wait
blk-mq: expose tag starvation counts via debugfs
block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++
block/blk-mq-debugfs.h | 19 ++++++
block/blk-mq-tag.c | 8 +++
block/blk-mq.c | 5 ++
include/linux/blk-mq.h | 12 ++++
include/trace/events/block.h | 43 ++++++++++++++
6 files changed, 196 insertions(+)
--
2.51.0
On 5/17/26 3:36 PM, Aaron Tomlin wrote:
> Hi Jens, Steve, Masami,
>
> In high-performance storage environments, particularly when utilising RAID
> controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
> spikes can occur when fast devices are starved of available tags.
> Currently, diagnosing this specific queue contention requires deploying
> dynamic kprobes or inferring sleep states, which lacks a simple,
> out-of-the-box diagnostic path.
>
> This short series introduces dedicated, low-overhead observability for tag
> exhaustion events in the block layer:
>
> - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
> allocation slow-path to capture precise, event-based starvation.
>
> - Patch 2 complements this by exposing "wait_on_hw_tag" and
> "wait_on_sched_tag" per-CPU counters via debugfs for quick,
> point-in-time cumulative polling.
>
> Together, these provide storage engineers with zero-configuration
> mechanisms to definitively identify shared-tag bottlenecks.

Why not just issue the trace points? Then there's close to zero
overhead, rather than needing added counters for this, and the
kernel to keep track. If you just issue the get/put tag kind of traces,
then userspace can keep track. That's what blktrace has done for decades
for things like inflight/queue depth accounting.

IOW, seems to me, this could be done with basically zero kernel
additions outside of perhaps a trace point or two.

--
Jens Axboe
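For illustration only, the userspace accounting suggested here could look roughly like the sketch below. It assumes the trace consumer has already decoded events into (timestamp, device, kind) tuples, where kind is "get" or "put". Those event names, the tuple layout, and the track_inflight() helper are hypothetical; they are not taken from blktrace or from this series.

```python
from collections import defaultdict


def track_inflight(events):
    """Replay get/put events and return (current in-flight, peak in-flight) per device.

    `events` is an iterable of (timestamp, dev, kind) tuples. The layout is an
    assumption made for this sketch, not an existing trace format.
    """
    inflight = defaultdict(int)
    peak = defaultdict(int)
    for _ts, dev, kind in events:
        if kind == "get":
            inflight[dev] += 1
            peak[dev] = max(peak[dev], inflight[dev])
        elif kind == "put":
            # Tracing may start mid-stream, so a put can arrive without a
            # matching get; clamp at zero rather than going negative.
            inflight[dev] = max(0, inflight[dev] - 1)
    return dict(inflight), dict(peak)


# Hypothetical sample data: two devices identified by "major:minor" strings.
sample = [
    (0.001, "8:0", "get"),
    (0.002, "8:0", "get"),
    (0.003, "8:16", "get"),
    (0.004, "8:0", "put"),
    (0.005, "8:16", "put"),
]
inflight, peak = track_inflight(sample)
print("in flight:", inflight)
print("peak depth:", peak)
```

The kernel only emits events in this model; all state (current and peak depth) lives in the consumer, which is the trade-off being argued for above.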
On Mon, May 18, 2026 at 07:31:45AM -0600, Jens Axboe wrote:
> Why not just issue the trace points? Then there's close to zero
> overhead, rather than needing added counters for this, and the
> kernel to keep track. If you just issue the get/put tag kind of traces,
> then userspace can keep track. That's what blktrace has done for decades
> for things like inflight/queue depth accounting.
>
> IOW, seems to me, this could be done with basically zero kernel
> additions outside of perhaps a trace point or two.
Hi Jens,
Thanks for taking a look.
You make a completely fair point.
I agree that pushing the accounting to userspace is the right approach,
especially given the proposed hard-coded tracepoint. For example, with
bpftrace(8):
# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
Attaching 1 probe...
^C
@tag_waits[4]: 12
@tag_waits[12]: 87
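If a cumulative figure is wanted, the per-CPU map printed above can be summed in userspace. The following is a minimal sketch that parses the bpftrace output shown in this mail; the parse_bpftrace_counts() helper and its regular expression are illustrative assumptions and only handle the simple "@name[key]: value" form printed here.

```python
import re

# Matches lines such as "@tag_waits[12]: 87" (map name, integer key, count).
MAP_LINE = re.compile(r"^@(\w+)\[(\d+)\]:\s*(\d+)\s*$")


def parse_bpftrace_counts(text, map_name="tag_waits"):
    """Return {cpu: count} for the named map found in bpftrace text output."""
    counts = {}
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match and match.group(1) == map_name:
            counts[int(match.group(2))] = int(match.group(3))
    return counts


# The output from the one-liner above, copied verbatim.
output = """Attaching 1 probe...
^C
@tag_waits[4]: 12
@tag_waits[12]: 87
"""

per_cpu = parse_bpftrace_counts(output)
total = sum(per_cpu.values())
print("per-CPU tag waits:", per_cpu)
print("total tag waits:", total)
```
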
I will drop Patch 2 from the series in the next iteration.
Kind regards,
--
Aaron Tomlin