There is currently no easy way to monitor how many RCU callbacks are
pending system-wide. The existing trace points provide per-event data
but require active tracing, which makes them awkward for fleet-wide
monitoring. Knowing the depth and stage of pending callbacks helps
admins reason about RCU health, gives an indirect signal of memory
held back by RCU, and is useful when tuning RCU parameters.
This series adds a debugfs file at:
/sys/kernel/debug/rcu/pending_cbs
that reports per-CPU pending callback counts with a "total" row.
Patch 1 introduces the file with per-CPU columns for each segcblist
segment (done, wait, next_ready, next) plus a "lazy" column.
Patch 2 extends the file with a "kfree_rcu" column reporting objects
queued in the batched kfree_rcu()/kvfree_rcu() path
(CONFIG_KVFREE_RCU_BATCHED), which has its own per-CPU queues outside
the main segmented callback list.
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
---
Gustavo Luiz Duarte (2):
rcu: Expose per-CPU segmented callback counts via debugfs
rcu: Include kfree_rcu/kvfree_rcu batched counts in pending_cbs
kernel/rcu/rcu.h | 1 +
kernel/rcu/tree_stall.h | 72 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/slab_common.c | 18 +++++++++++++
3 files changed, 91 insertions(+)
---
base-commit: 8ab992f815d6736b5c7a6f5fd7bfe7bc106bb3dc
change-id: 20260318-rcu-pending-cbs-stats-f72f5ca03415
Best regards,
--
Gustavo Luiz Duarte <gustavold@gmail.com>
On 5/7/2026 1:37 PM, Gustavo Luiz Duarte wrote:
> There is currently no easy way to monitor how many RCU callbacks are
> pending system-wide. The existing trace points provide per-event data
> but require active tracing, which makes them awkward for fleet-wide
> monitoring. Knowing the depth and stage of pending callbacks helps
> admins reason about RCU health, gives an indirect signal of memory
> held back by RCU, and is useful when tuning RCU parameters.
>
> This series adds a debugfs file at:
>
> /sys/kernel/debug/rcu/pending_cbs
>
> that reports per-CPU pending callback counts with a "total" row.
>
> Patch 1 introduces the file with per-CPU columns for each segcblist
> segment (done, wait, next_ready, next) plus a "lazy" column.
>
> Patch 2 extends the file with a "kfree_rcu" column reporting objects
> queued in the batched kfree_rcu()/kvfree_rcu() path
> (CONFIG_KVFREE_RCU_BATCHED), which has its own per-CPU queues outside
> the main segmented callback list.
>
> Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>

You actually don't need debugfs for this. You can just use bpftrace and
instrument trace_rcu_ (with other RCU tracing Kconfig options enabled?). I had
something like that working sometime ago.

Generally RCU doesn't add userspace interfaces randomly like that. I remember
Paul ripped similar things out some time ago.
Hi Joel,
On Thu, May 7, 2026 at 7:59 PM Joel Fernandes <joelagnelf@nvidia.com> wrote:
>
>
>
> On 5/7/2026 1:37 PM, Gustavo Luiz Duarte wrote:
> > There is currently no easy way to monitor how many RCU callbacks are
> > pending system-wide. The existing trace points provide per-event data
> > but require active tracing, which makes them awkward for fleet-wide
> > monitoring. Knowing the depth and stage of pending callbacks helps
> > admins reason about RCU health, gives an indirect signal of memory
> > held back by RCU, and is useful when tuning RCU parameters.
> >
> > This series adds a debugfs file at:
> >
> > /sys/kernel/debug/rcu/pending_cbs
> >
> > that reports per-CPU pending callback counts with a "total" row.
> >
> > Patch 1 introduces the file with per-CPU columns for each segcblist
> > segment (done, wait, next_ready, next) plus a "lazy" column.
> >
> > Patch 2 extends the file with a "kfree_rcu" column reporting objects
> > queued in the batched kfree_rcu()/kvfree_rcu() path
> > (CONFIG_KVFREE_RCU_BATCHED), which has its own per-CPU queues outside
> > the main segmented callback list.
> >
> > Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
>
> You actually don't need debugfs for this. You can just use bpftrace and
> instrument trace_rcu_ (with other RCU tracing Kconfig options enabled?). I had
> something like that working sometime ago.
My initial attempt to do this using tracepoints was probing
trace_rcu_segcb_stats, but this would add significant overhead to
every callback enqueue/dequeue event, which is too expensive for a
production environment. I played a bit more with bpftrace and managed
to get this working with an interval probe plus some __per_cpu_offset
pointer arithmetic (see below). It is not the most maintainable code
and has some race issues, but it is probably acceptable for us if you
believe having this information easily available doesn't add value for
other use cases.
If anyone is interested, here is what I came up with:
interval:s:5 {
    printf("===== %s =====\n", strftime("%H:%M:%S", nsecs));
    // Base addresses of the per-CPU symbols and the per-CPU offset table.
    $rdp_base = kaddr("rcu_data");
    $krc_base = kaddr("krc");
    $offsets = (uint64 *)kaddr("__per_cpu_offset");
    for ($cpu : 0..ncpus) {
        // This CPU's copy lives at the symbol address plus its offset.
        $rdp = (struct rcu_data *)($rdp_base + $offsets[$cpu]);
        $krcp = (struct kfree_rcu_cpu *)($krc_base + $offsets[$cpu]);
        // Objects queued in the batched kfree_rcu()/kvfree_rcu() path.
        $kfree = $krcp->head_count.counter
               + $krcp->bulk_count[0].counter
               + $krcp->bulk_count[1].counter;
        printf("cpu: %d done: %ld wait: %ld nr: %ld next: %ld lazy: %ld kfree: %d\n",
               $cpu,
               $rdp->cblist.seglen[0],
               $rdp->cblist.seglen[1],
               $rdp->cblist.seglen[2],
               $rdp->cblist.seglen[3],
               $rdp->lazy_len,
               $kfree);
    }
}
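The __per_cpu_offset arithmetic used in the script above can be modelled
in a few lines. This is only a toy sketch, written in Python rather than
bpftrace: the helper name per_cpu_addr and every address in it are
hypothetical, and it just illustrates that each CPU's copy of a per-CPU
variable sits at the symbol's base address plus that CPU's offset.

```python
# Toy model of per-CPU address arithmetic. All addresses below are made up.

def per_cpu_addr(base: int, offsets: list[int], cpu: int) -> int:
    """Return the address of `cpu`'s copy of a per-CPU variable.

    `base` stands in for kaddr("rcu_data") and `offsets` for the
    __per_cpu_offset array; both hold hypothetical values here.
    """
    if not 0 <= cpu < len(offsets):
        raise IndexError(f"no such CPU: {cpu}")
    # Mask to 64 bits to mimic unsigned pointer arithmetic.
    return (base + offsets[cpu]) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    rcu_data_base = 0x2A000  # hypothetical address of the per-CPU symbol
    offsets = [0xFFFF888100000000, 0xFFFF888100040000]  # hypothetical offsets
    for cpu in range(len(offsets)):
        addr = per_cpu_addr(rcu_data_base, offsets, cpu)
        print(f"cpu {cpu}: rcu_data at {addr:#x}")
```

The bounds check matters in the real script too: indexing the offset
table past the last CPU would read an unrelated address.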
>
> Generally RCU doesn't add userspace interfaces randomly like that. I remember
> Paul ripped similar things out some time ago.
Debugfs is intentionally not a stable ABI, so the bar for adding
things useful for debugging and tuning seems lower than /proc or /sys
-- which is why I went with debugfs here.
On Mon, May 11, 2026 at 06:08:53PM +0100, Gustavo Luiz Duarte wrote:
> > You actually don't need debugfs for this. You can just use bpftrace and
> > instrument trace_rcu_ (with other RCU tracing Kconfig options enabled?). I had
> > something like that working sometime ago.
>
> My initial attempt to do this using tracepoints was probing
> trace_rcu_segcb_stats, but this would add significant overhead to
> every callback enqueue/dequeue event which is too expensive for a
> production environment

An additional benefit of this debugfs-based approach is that it
eliminates the dependency on bpftrace and custom scripts.

While instrumenting tracepoints with bpftrace is certainly feasible for
a limited number of servers, deploying it fleet-wide becomes
problematic. It requires distributing and maintaining additional
binaries on every host, just to collect metrics that a few lines of
kernel code can expose more efficiently.

So while bpftrace can technically accomplish this, it may not be the
most appropriate solution for this particular use case.
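To illustrate the point about tooling, a collector for a debugfs file
needs little more than a read and a split. The following is a minimal
sketch under stated assumptions: the table layout in SAMPLE (a header
row, one row per CPU, a trailing "total" row) is invented for
illustration, since the thread does not show the actual pending_cbs
output, and the function names are hypothetical.

```python
# Sketch of a collector parsing a per-CPU table with a trailing "total" row.
# The layout in SAMPLE is an assumption made for illustration; the real
# pending_cbs format is defined by the patches, not by this example.

SAMPLE = """\
cpu done wait next_ready next lazy kfree_rcu
0 1 4 0 7 0 12
1 0 2 3 5 1 8
total 1 6 3 12 1 20
"""


def parse_pending_cbs(text: str) -> dict[str, dict[str, int]]:
    """Map each row label ("0", "1", ..., "total") to its column counts."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0][1:], lines[1:]
    return {row[0]: dict(zip(header, map(int, row[1:]))) for row in rows}


def totals_consistent(table: dict[str, dict[str, int]]) -> bool:
    """Check that the "total" row equals the sum of the per-CPU rows."""
    total = table["total"]
    cpus = [cols for label, cols in table.items() if label != "total"]
    return all(sum(c[name] for c in cpus) == value
               for name, value in total.items())


if __name__ == "__main__":
    # On a real host one would read /sys/kernel/debug/rcu/pending_cbs instead.
    table = parse_pending_cbs(SAMPLE)
    print(table["total"], totals_consistent(table))
```

On a live system the per-CPU rows and the total may be sampled at
slightly different moments, so a consistency check like this is only
meaningful on a static snapshot such as SAMPLE.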