When a task is preempted while holding an RCU read-side lock, the kernel
must track it on its rcu_node's blocked-task list. This requires acquiring
the rnp->lock, which is shared by all CPUs in that node's subtree.
Posting this as RFC for early feedback. There could be bugs lurking,
especially related to expedited GPs which I have not yet taken a close
look at. Several TODOs are added. It passed light TREE03 rcutorture
testing.
On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
rcu_node, making rnp->lock effectively a global lock for all blocked task
operations. Every context switch where a task holds an RCU read-side lock
contends on this single lock.
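For reference, the number of leaf rcu_node structures follows roughly from
nr_cpu_ids and CONFIG_RCU_FANOUT_LEAF (default 16). A simplified illustration
of that relationship (not the kernel's actual rcu_init_geometry() code, just
the arithmetic behind the "single rcu_node" claim above):

	/* Illustrative only; the real computation lives in rcu_init_geometry(). */
	static int nr_leaf_rcu_nodes(int nr_cpus, int fanout_leaf)
	{
		return DIV_ROUND_UP(nr_cpus, fanout_leaf);
	}

	/*
	 * With the default CONFIG_RCU_FANOUT_LEAF=16:
	 *   nr_leaf_rcu_nodes( 8, 16) == 1  -> single rcu_node, ->lock shared by all CPUs
	 *   nr_leaf_rcu_nodes(16, 16) == 1  -> still a single (root) rcu_node
	 *   nr_leaf_rcu_nodes(64, 16) == 4  -> four leaf locks under one root
	 */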
Enter Virtualization
--------------------
In virtualized environments, the problem becomes dramatically worse due to vCPU
preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption Problem in
VMs" by Aravinda Prasad, K. Gopinath, and Paul E. McKenney) [1] shows that RCU
reader preemption in VMs can cause multi-second latency spikes and large
increases in grace-period duration.
When a vCPU is preempted by the hypervisor while holding rnp->lock, other
vCPUs spin waiting for a lock holder that isn't even running. In testing
with host RT preemptors to inject vCPU preemption, lock hold times extended
from ~4us to over 4000us - a 1000x increase.
The Solution
------------
This series introduces per-CPU lists for tracking blocked RCU readers. The
key insight is that when no grace period is active, blocked tasks usually
complete their critical sections before any rnp->lock acquisition is actually
required.
1. Fast path: At context switch, add the task only to the
   per-CPU list - no rnp->lock needed (see the sketch after this list).
2. Promotion on demand: When a grace period starts, promote tasks from
per-CPU lists to the rcu_node list.
3. Normal path: If a grace period is already waiting, tasks go directly
to the rcu_node list as before.
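As a rough illustration of how these three paths fit together, here is a
minimal sketch. The names below are made up for this cover letter, and the
races with unlock, task migration, and the grace-period kthread are ignored;
the actual patches protect the per-CPU list and handle those cases:

	/* Per-CPU list of readers that blocked while no GP was in progress
	 * (initialized at boot). */
	static DEFINE_PER_CPU(struct list_head, rcu_blkd_cpu_list);

	/* Context-switch path, for a task preempted within an RCU read-side
	 * critical section (runs with interrupts disabled). */
	static void rcu_preempt_queue_blocked(struct task_struct *t, struct rcu_node *rnp)
	{
		if (!rcu_gp_in_progress()) {
			/* 1. Fast path: CPU-local, no rnp->lock. */
			list_add(&t->rcu_node_entry, this_cpu_ptr(&rcu_blkd_cpu_list));
			return;
		}

		/* 3. Normal path: a GP is already waiting, queue on the rnp list. */
		raw_spin_lock_rcu_node(rnp);
		list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
		raw_spin_unlock_rcu_node(rnp);
	}

	/* 2. Promotion on demand: at GP start, move still-blocked readers from
	 * the per-CPU lists onto their leaf rcu_node's blkd_tasks list. */
	static void rcu_promote_blkd_tasks(struct rcu_node *rnp)
	{
		unsigned long flags;
		int cpu;

		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		for_each_leaf_node_possible_cpu(rnp, cpu)
			list_splice_init(per_cpu_ptr(&rcu_blkd_cpu_list, cpu),
					 &rnp->blkd_tasks);
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}

The real series also has to promote before expedited GPs and before the
various quiescent-state reports (patches 5-10), which this sketch omits.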
Results
-------
Testing with 64 reader threads under vCPU preemption from 32 host SCHED_FIFO
preemptors (100 runs each). Throughput is measured in read lock/unlock
iterations per second.
                        Baseline         Optimized
Mean throughput         66,980 iter/s    97,719 iter/s (+46%)
Lock hold time (mean)   1,069 us         ~0 us
The optimized version maintains stable performance with essentially zero
rnp->lock overhead.
rcutorture Testing
------------------
TREE03 rcutorture testing completed without RCU or hotplug errors. More
testing is in progress.
Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS option to guard the
feature, but the plan is to eventually enable it unconditionally.
[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad
Joel Fernandes (14):
rcu: Add WARN_ON_ONCE for blocked flag invariant in exit_rcu()
rcu: Add per-CPU blocked task lists for PREEMPT_RCU
rcu: Early return during unlock for tasks only on per-CPU blocked list
rcu: Promote blocked tasks from per-CPU to rnp lists
rcu: Promote blocked tasks for expedited GPs
rcu: Promote per-CPU blocked tasks before checking for blocked readers
rcu: Promote late-arriving blocked tasks before reporting QS
rcu: Promote blocked tasks before QS report in force_qs_rnp()
rcu: Promote blocked tasks before QS report in
rcutree_report_cpu_dead()
rcu: Promote blocked tasks before QS report in rcu_gp_init()
rcu: Add per-CPU blocked list check in exit_rcu()
rcu: Skip per-CPU list addition when GP already started
rcu: Skip rnp addition when no grace period waiting
rcu: Remove checking of per-cpu blocked list against the node list
include/linux/sched.h | 4 +
kernel/fork.c | 4 +
kernel/rcu/Kconfig | 12 +++
kernel/rcu/tree.c | 60 +++++++++--
kernel/rcu/tree.h | 11 +-
kernel/rcu/tree_exp.h | 5 +
kernel/rcu/tree_plugin.h | 211 +++++++++++++++++++++++++++++++++++----
kernel/rcu/tree_stall.h | 4 +-
8 files changed, 279 insertions(+), 32 deletions(-)
base-commit: f8f9c1f4d0c7a64600e2ca312dec824a0bc2f1da
--
2.34.1
On Fri, Jan 02, 2026 at 07:23:29PM -0500, Joel Fernandes wrote:
> When a task is preempted while holding an RCU read-side lock, the kernel
> must track it on the rcu_node's blocked task list. This requires acquiring
> rnp->lock shared by all CPUs in that node's subtree.
>
> Posting this as RFC for early feedback. There could be bugs lurking,
> especially related to expedited GPs which I have not yet taken a close
> look at. Several TODOs are added. It passed light TREE03 rcutorture
> testing.
>
> On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
> rcu_node, making rnp->lock effectively a global lock for all blocked task
> operations. Every context switch where a task holds an RCU read-side lock
> contends on this single lock.
>
> Enter Virtualization
> --------------------
> In virtualized environments, the problem becomes dramatically worse due to vCPU
> preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption Problem in
> VMs" by Gopinath and Paul McKenney) [1] explores the issue that RCU
> reader preemption in VMs causes multi-second latency spikes and huge increases
> in grace period duration.
>
> When a vCPU is preempted by the hypervisor while holding rnp->lock, other
> vCPUs spin waiting for a lock holder that isn't even running. In testing
> with host RT preemptors to inject vCPU preemption, lock hold times extended
> from ~4us to over 4000us - a 1000x increase.
>
> The Solution
> ------------
> This series introduces per-CPU lists for tracking blocked RCU readers. The
> key insight is that when no grace period is active, blocked tasks complete
> their critical sections before really requiring any rnp locking.
>
> 1. Fast path: At context switch, Add the task only to the
> per-CPU list - no rnp->lock needed.
>
> 2. Promotion on demand: When a grace period starts, promote tasks from
> per-CPU lists to the rcu_node list.
>
> 3. Normal path: If a grace period is already waiting, tasks go directly
> to the rcu_node list as before.
>
> Results
> -------
> Testing with 64 reader threads under vCPU preemption from 32 host SCHED_FIFO
> preemptors), 100 runs each. Throughput measured of read lock/unlock iterations
> per second.
>
> Baseline Optimized
> Mean throughput 66,980 iter/s 97,719 iter/s (+46%)
> Lock hold time (mean) 1,069 us ~0 us
Excellent performance improvement!
It would be good to simplify the management of the blocked-tasks lists,
and to make it more exact, as in never unnecessarily priority-boost
a task. But it is not like people have been complaining, at least not
to me. And earlier attempts in that direction added more mess than
simplification. :-(
> The optimized version maintains stable performance with essentially close to
> zero rnp->lock overhead.
>
> rcutorture Testing
> ------------------
> TREE03 Testing with rcutorture without RCU or hotplug errors. More testing is
> in progress.
>
> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the feature but
> the plan is to eventually turn this on all the time.
Yes, Aravinda, Gopinath, and I did publish that paper back in the day
(with Aravinda having done almost all the work), but it was an artificial
workload. Which is OK given that it was an academic effort. It has also
provided some entertainment, for example, an audience member asking me
if I was aware of this work in a linguistic-kill-shot manner. ;-)
So are we finally seeing this effect in the wild?
The main point of this patch series is to avoid lock contention due to
vCPU preemption, correct? If so, will we need similar work on the other
locks in the Linux kernel, both within RCU and elsewhere? I vaguely
recall your doing some work along those lines a few years back, and
maybe Thomas Gleixner's deferred-preemption work could help with this.
Or not, who knows? Keeping the hypervisor informed of lock state is
not necessarily free.
Also if so, would the following rather simpler patch do the same trick,
if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
------------------------------------------------------------------------
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 6a319e2926589..04dbee983b37d 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -198,9 +198,9 @@ config RCU_FANOUT
config RCU_FANOUT_LEAF
int "Tree-based hierarchical RCU leaf-level fanout value"
- range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
- range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
- range 2 3 if RCU_STRICT_GRACE_PERIOD
+ range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
+ range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
+ range 1 3 if RCU_STRICT_GRACE_PERIOD
depends on TREE_RCU && RCU_EXPERT
default 16 if !RCU_STRICT_GRACE_PERIOD
default 2 if RCU_STRICT_GRACE_PERIOD
------------------------------------------------------------------------
This passes a quick 20-minute rcutorture smoke test. Does it provide
similar performance benefits?
Thanx, Paul
> [1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad
>
> Joel Fernandes (14):
> rcu: Add WARN_ON_ONCE for blocked flag invariant in exit_rcu()
> rcu: Add per-CPU blocked task lists for PREEMPT_RCU
> rcu: Early return during unlock for tasks only on per-CPU blocked list
> rcu: Promote blocked tasks from per-CPU to rnp lists
> rcu: Promote blocked tasks for expedited GPs
> rcu: Promote per-CPU blocked tasks before checking for blocked readers
> rcu: Promote late-arriving blocked tasks before reporting QS
> rcu: Promote blocked tasks before QS report in force_qs_rnp()
> rcu: Promote blocked tasks before QS report in
> rcutree_report_cpu_dead()
> rcu: Promote blocked tasks before QS report in rcu_gp_init()
> rcu: Add per-CPU blocked list check in exit_rcu()
> rcu: Skip per-CPU list addition when GP already started
> rcu: Skip rnp addition when no grace period waiting
> rcu: Remove checking of per-cpu blocked list against the node list
>
> include/linux/sched.h | 4 +
> kernel/fork.c | 4 +
> kernel/rcu/Kconfig | 12 +++
> kernel/rcu/tree.c | 60 +++++++++--
> kernel/rcu/tree.h | 11 +-
> kernel/rcu/tree_exp.h | 5 +
> kernel/rcu/tree_plugin.h | 211 +++++++++++++++++++++++++++++++++++----
> kernel/rcu/tree_stall.h | 4 +-
> 8 files changed, 279 insertions(+), 32 deletions(-)
>
>
> base-commit: f8f9c1f4d0c7a64600e2ca312dec824a0bc2f1da
> --
> 2.34.1
>
Hi Paul,
On 1/5/2026 11:46 AM, Paul E. McKenney wrote:
> On Fri, Jan 02, 2026 at 07:23:29PM -0500, Joel Fernandes wrote:
>> When a task is preempted while holding an RCU read-side lock, the kernel
>> must track it on the rcu_node's blocked task list. This requires acquiring
>> rnp->lock shared by all CPUs in that node's subtree.
>>
>> Posting this as RFC for early feedback. There could be bugs lurking,
>> especially related to expedited GPs which I have not yet taken a close
>> look at. Several TODOs are added. It passed light TREE03 rcutorture
>> testing.
>>
>> On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
>> rcu_node, making rnp->lock effectively a global lock for all blocked task
>> operations. Every context switch where a task holds an RCU read-side lock
>> contends on this single lock.
>>
>> Enter Virtualization
>> --------------------
>> In virtualized environments, the problem becomes dramatically worse due to vCPU
>> preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption Problem in
>> VMs" by Gopinath and Paul McKenney) [1] explores the issue that RCU
>> reader preemption in VMs causes multi-second latency spikes and huge increases
>> in grace period duration.
>>
>> When a vCPU is preempted by the hypervisor while holding rnp->lock, other
>> vCPUs spin waiting for a lock holder that isn't even running. In testing
>> with host RT preemptors to inject vCPU preemption, lock hold times extended
>> from ~4us to over 4000us - a 1000x increase.
>>
>> The Solution
>> ------------
>> This series introduces per-CPU lists for tracking blocked RCU readers. The
>> key insight is that when no grace period is active, blocked tasks complete
>> their critical sections before really requiring any rnp locking.
>>
>> 1. Fast path: At context switch, Add the task only to the
>> per-CPU list - no rnp->lock needed.
>>
>> 2. Promotion on demand: When a grace period starts, promote tasks from
>> per-CPU lists to the rcu_node list.
>>
>> 3. Normal path: If a grace period is already waiting, tasks go directly
>> to the rcu_node list as before.
>>
>> Results
>> -------
>> Testing with 64 reader threads under vCPU preemption from 32 host SCHED_FIFO
>> preemptors), 100 runs each. Throughput measured of read lock/unlock iterations
>> per second.
>>
>> Baseline Optimized
>> Mean throughput 66,980 iter/s 97,719 iter/s (+46%)
>> Lock hold time (mean) 1,069 us ~0 us
>
> Excellent performance improvement!
Thanks. :)
> It would be good to simplify the management of the blocked-tasks lists,
> and to make it more exact, as in never unnecessarily priority-boost
> a task. But it is not like people have been complaining, at least not
> to me. And earlier attempts in that direction added more mess than
> simplification. :-(
Interesting. I might look into the boosting logic to see whether we can avoid
boosting certain tasks depending on whether they help the grace period complete
or not. Thank you for the suggestion.
>> The optimized version maintains stable performance with essentially close to
>> zero rnp->lock overhead.
>>
>> rcutorture Testing
>> ------------------
>> TREE03 Testing with rcutorture without RCU or hotplug errors. More testing is
>> in progress.
>>
>> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the feature but
>> the plan is to eventually turn this on all the time.
>
> Yes, Aravinda, Gopinath, and I did publish that paper back in the day
> (with Aravinda having done almost all the work), but it was an artificial
> workload. Which is OK given that it was an academic effort. It has also
> provided some entertainment, for example, an audience member asking me
> if I was aware of this work in a linguistic-kill-shot manner. ;-)
>
> So are we finally seeing this effect in the wild?
This patch set is also targeting a synthetic test I wrote to see if I could
reproduce a preemption problem. I know of several instances over the years where
my teams (mainly at Google) were trying to resolve spinlock preemption inside
virtual machines by boosting vCPU threads. In the spirit of RCU performance and
VMs, we should probably optimize node locking IMO, but I do see your point of
view about optimizing for real-world use cases as well.
What bothers me about the current state of affairs is that even without any
grace period in progress, any task blocking in an RCU read-side critical section
will take an (almost-)global lock that is shared by other CPUs that might also be
preempting/blocking RCU readers. Further, if one of these happens to be a vCPU
that was preempted while holding the node lock, then every other vCPU thread that
blocks in an RCU critical section will also block on that lock, slowing down the
preemption path inside the VM even further. My preference would be to keep the
readers fast while moving the overhead to the slow path (the overhead being
promoting the blocked tasks at the right time). In fact, in these patches, I'm
going directly to the node list if there is a grace period in progress.
> The main point of this patch series is to avoid lock contention due to
> vCPU preemption, correct? If so, will we need similar work on the other
> locks in the Linux kernel, both within RCU and elsewhere? I vaguely
> recall your doing some work along those lines a few years back, and
> maybe Thomas Gleixner's deferred-preemption work could help with this.
> Or not, who knows? Keeping the hypervisor informed of lock state is
> not necessarily free.
Yes, I did some work on this at Google, but it turned out to be a very
fragmented effort in terms of where (which subsystem - KVM, scheduler, etc.) the
priority boosting of vCPU threads should be done. In the end, we ended up with an
internal prototype that worked pretty well but was not upstreamable, and we only
had time to take that into production (a lesson I learned there is that we should
probably work on upstream solutions first, but life is not always that easy).
About deferred preemption, I believe Steven Rostedt was at one point looking at
that for VMs, but the effort stalled because Peter was concerned that it would
mess up the scheduler. The idea (AFAIU) is to use the rseq page to communicate
locking information between vCPU threads and the host, and then let the host
avoid vCPU preemption - but the scheduler needs to do something with that
information, otherwise it's no use.
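Purely to illustrate the kind of protocol I mean - every name and field below
is hypothetical, not the actual rseq, KVM, or scheduler interface:

	/*
	 * Hypothetical page shared between a vCPU thread (guest side) and the
	 * host scheduler; illustration only.
	 */
	struct vcpu_preempt_hint {
		unsigned int in_critical;	/* guest: lock/RCU-reader nesting depth */
		unsigned int preempt_pending;	/* host: a preemption was deferred      */
	};

	/* Guest side, wrapped around a critical section such as rnp->lock. */
	static inline void guest_critical_enter(struct vcpu_preempt_hint *h)
	{
		WRITE_ONCE(h->in_critical, READ_ONCE(h->in_critical) + 1);
	}

	static inline void guest_critical_exit(struct vcpu_preempt_hint *h)
	{
		WRITE_ONCE(h->in_critical, READ_ONCE(h->in_critical) - 1);
		if (READ_ONCE(h->preempt_pending))
			hypothetical_yield_to_host();	/* give the pCPU back promptly */
	}

	/*
	 * Host side (sketch): when deciding to preempt the vCPU thread, check
	 * in_critical; if set, record preempt_pending and defer the preemption
	 * for a bounded time rather than descheduling the vCPU immediately.
	 */

The hard part is that last comment: the host scheduler has to actually honor
and bound the deferral, which is exactly where the concern was.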
> Also if so, would the following rather simpler patch do the same trick,
> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 6a319e2926589..04dbee983b37d 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -198,9 +198,9 @@ config RCU_FANOUT
>
> config RCU_FANOUT_LEAF
> int "Tree-based hierarchical RCU leaf-level fanout value"
> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> - range 2 3 if RCU_STRICT_GRACE_PERIOD
> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> + range 1 3 if RCU_STRICT_GRACE_PERIOD
> 	depends on TREE_RCU && RCU_EXPERT
> 	default 16 if !RCU_STRICT_GRACE_PERIOD
> default 2 if RCU_STRICT_GRACE_PERIOD
>
> ------------------------------------------------------------------------
>
> This passes a quick 20-minute rcutorture smoke test. Does it provide
> similar performance benefits?
I tried this out, and it also brings down the contention and solves the problem
I saw (in testing so far).

Would this also work if the test had grace-period init/cleanup racing with
preempted RCU read-side critical sections? I'm doing longer tests now to see how
this performs under GP stress versus my solution. I am also seeing that with
just the node lists (no per-CPU lists) there is a dramatic throughput drop after
some amount of time, which I can't yet explain. I do not see this with the
per-CPU list solution (I'm currently testing whether the fanout solution you
proposed shows the same drop).
I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
reasonable, considering this is not a default. Are you suggesting defaulting to
this for small systems? If not, then I guess the optimization will not be
enabled by default. Eventually, if we move forward with this patch set's
approach, I will remove the per-CPU blocked-list config option altogether so
that it is enabled by default. That's roughly my plan if we agree on this, but
it is just the RFC stage :).
thanks,
- Joel
On Mon, Jan 05, 2026 at 07:55:18PM -0500, Joel Fernandes wrote:
> Hi Paul,
>
> On 1/5/2026 11:46 AM, Paul E. McKenney wrote:
> > On Fri, Jan 02, 2026 at 07:23:29PM -0500, Joel Fernandes wrote:
> >> When a task is preempted while holding an RCU read-side lock, the kernel
> >> must track it on the rcu_node's blocked task list. This requires acquiring
> >> rnp->lock shared by all CPUs in that node's subtree.
> >>
> >> Posting this as RFC for early feedback. There could be bugs lurking,
> >> especially related to expedited GPs which I have not yet taken a close
> >> look at. Several TODOs are added. It passed light TREE03 rcutorture
> >> testing.
> >>
> >> On systems with 16 or fewer CPUs, the RCU hierarchy often has just a single
> >> rcu_node, making rnp->lock effectively a global lock for all blocked task
> >> operations. Every context switch where a task holds an RCU read-side lock
> >> contends on this single lock.
> >>
> >> Enter Virtualization
> >> --------------------
> >> In virtualized environments, the problem becomes dramatically worse due to vCPU
> >> preemption. Research from USENIX ATC'17 ("The RCU-Reader Preemption Problem in
> >> VMs" by Gopinath and Paul McKenney) [1] explores the issue that RCU
> >> reader preemption in VMs causes multi-second latency spikes and huge increases
> >> in grace period duration.
> >>
> >> When a vCPU is preempted by the hypervisor while holding rnp->lock, other
> >> vCPUs spin waiting for a lock holder that isn't even running. In testing
> >> with host RT preemptors to inject vCPU preemption, lock hold times extended
> >> from ~4us to over 4000us - a 1000x increase.
> >>
> >> The Solution
> >> ------------
> >> This series introduces per-CPU lists for tracking blocked RCU readers. The
> >> key insight is that when no grace period is active, blocked tasks complete
> >> their critical sections before really requiring any rnp locking.
> >>
> >> 1. Fast path: At context switch, Add the task only to the
> >> per-CPU list - no rnp->lock needed.
> >>
> >> 2. Promotion on demand: When a grace period starts, promote tasks from
> >> per-CPU lists to the rcu_node list.
> >>
> >> 3. Normal path: If a grace period is already waiting, tasks go directly
> >> to the rcu_node list as before.
> >>
> >> Results
> >> -------
> >> Testing with 64 reader threads under vCPU preemption from 32 host SCHED_FIFO
> >> preemptors), 100 runs each. Throughput measured of read lock/unlock iterations
> >> per second.
> >>
> >> Baseline Optimized
> >> Mean throughput 66,980 iter/s 97,719 iter/s (+46%)
> >> Lock hold time (mean) 1,069 us ~0 us
> >
> > Excellent performance improvement!
>
> Thanks. :)
> > It would be good to simplify the management of the blocked-tasks lists,
> > and to make it more exact, as in never unnecessarily priority-boost
> > a task. But it is not like people have been complaining, at least not
> > to me. And earlier attempts in that direction added more mess than
> > simplification. :-(
>
> Interesting. I might look into the boosting logic to see whether we can avoid
> boosting certain tasks depending on whether they help the grace period complete
> or not. Thank you for the suggestion.
Just so you know, all of my simplification efforts thus far have instead
made it more complex, but who knows what I might have been missing?
> >> The optimized version maintains stable performance with essentially close to
> >> zero rnp->lock overhead.
> >>
> >> rcutorture Testing
> >> ------------------
> >> TREE03 Testing with rcutorture without RCU or hotplug errors. More testing is
> >> in progress.
> >>
> >> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the feature but
> >> the plan is to eventually turn this on all the time.
> >
> > Yes, Aravinda, Gopinath, and I did publish that paper back in the day
> > (with Aravinda having done almost all the work), but it was an artificial
> > workload. Which is OK given that it was an academic effort. It has also
> > provided some entertainment, for example, an audience member asking me
> > if I was aware of this work in a linguistic-kill-shot manner. ;-)
> >
> > So are we finally seeing this effect in the wild?
>
> This patch set is also targeting a synthetic test I wrote to see if I could
> reproduce a preemption problem. I know several instances over the years where my
> teams (mainly at Google) were trying to resolve spin lock preemption inside
> virtual machines by boosting vCPU threads. In the spirit of RCU performance and
> VMs, we should probably optimize node locking IMO, but I do see your point of
> view about optimizing real-world use cases as well.
Also taking care of all spinlocks instead of doing large numbers of
per-spinlock workarounds would be good. There are a *lot* of spinlocks
in the Linux kernel!
> What bothers me about the current state of affairs is that even without any
> grace period in progress, any task blocking in an RCU Read Side critical section
> will take a (almost-)global lock that is shared by other CPUs who might also be
> preempting/blocking RCU readers. Further, if this happens to be a vCPU that was
> preempted while holding the node lock, then every other vCPU thread that blocks
> in an RCU critical section will also block and end up slowing preemption down in
> the vCPU. My preference would be to keep the readers fast while moving the
> overhead to the slow path (the overhead being promoting tasks at the right time
> that were blocked). In fact, in these patches, I'm directly going to the node
> list if there is a grace period in progress.
Not "(almost-)global"!
That lock replicates itself automatically with increasing numbers of CPUs.
That 16 used to be the full (at the time) 32-bit cpumask, but we decreased
it to 16 based on performance feedback from Andi Kleen back in the day.
If we are seeing real-world contention on that lock in real-world
workloads on real-world systems, further adjustments could be made,
either reducing CONFIG_RCU_FANOUT_LEAF further or offloading the lock,
where your series is one example of the latter.
I could easily believe that the vCPU preemption problem needs to be
addressed, but doing so on a per-spinlock basis would lead to greatly
increased complexity throughout the kernel, not just RCU.
> > The main point of this patch series is to avoid lock contention due to
> > vCPU preemption, correct? If so, will we need similar work on the other
> > locks in the Linux kernel, both within RCU and elsewhere? I vaguely
> > recall your doing some work along those lines a few years back, and
> > maybe Thomas Gleixner's deferred-preemption work could help with this.
> > Or not, who knows? Keeping the hypervisor informed of lock state is
> > not necessarily free.
>
> Yes, I did some work on this at Google, but it turned out to be a very
> fragmented effort in terms of where (which subsystem - KVM, scheduler etc)
> should we do the priority boosting of vCPU threads. In the end, we just ended up
> with an internal prototype that was not upstreamable but worked pretty well and
> only had time for production (a lesson I learned there is we should probably
> work on upstream solutions first, but life is not that easy sometimes).
Which is one reason deferred preemption would be attractive.
> About the deferred-preemption, I believe Steven Rostedt at one point was looking
> at that for VMs, but that effort stalled as Peter is concerned about doing that
> would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
> communicate locking information between vCPU threads and the host and then let
> the host avoid vCPU preemption - but the scheduler needs to do something with
> that information. Otherwise, it's no use.
Has deferred preemption for userspace locking also stalled? If not,
then the scheduler's support for userspace should apply directly to
guest OSes, right?
> > Also if so, would the following rather simpler patch do the same trick,
> > if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index 6a319e2926589..04dbee983b37d 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -198,9 +198,9 @@ config RCU_FANOUT
> >
> > config RCU_FANOUT_LEAF
> > int "Tree-based hierarchical RCU leaf-level fanout value"
> > - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> > - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> > - range 2 3 if RCU_STRICT_GRACE_PERIOD
> > + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> > + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> > + range 1 3 if RCU_STRICT_GRACE_PERIOD
> > 	depends on TREE_RCU && RCU_EXPERT
> > 	default 16 if !RCU_STRICT_GRACE_PERIOD
> > default 2 if RCU_STRICT_GRACE_PERIOD
> >
> > ------------------------------------------------------------------------
> >
> > This passes a quick 20-minute rcutorture smoke test. Does it provide
> > similar performance benefits?
>
> I tried this out, and it also brings down the contention and solves the problem
> I saw (in testing so far).
>
> Would this work also if the test had grace periods init/cleanup racing with
> preempted RCU read-side critical sections? I'm doing longer tests now to see how
> this performs under GP-stress, versus my solution. I am also seeing that with
> just the node lists, not per-cpu list, I see a dramatic throughput drop after
> some amount of time, but I can't explain it. And I do not see this with the
> per-cpu list solution (I'm currently testing if I see the same throughput drop
> with the fan-out solution you proposed).
Might the throughput drop be due to increased load on the host?
Another possibility is that tasks/vCPUs got shuffled so as to increase
the probability of preemption.
Also, doesn't your patch also cause the grace-period kthread to acquire
that per-CPU lock, thus also possibly resulting in contention, vCPU
preemption, and so on?
> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
> reasonable, considering this is not a default. Are you suggesting defaulting to
> this for small systems? If not, then I guess the optimization will not be
> enabled by default. Eventually, with this patch set, if we are moving forward
> with this approach, I will remove the config option for per-CPU block list
> altogether so that it is enabled by default. That's kind of my plan if we agreed
> on this, but it is just an RFC stage :).
Right now, we are experimenting, so the usability issue is less pressing.
Once we find out what is really going on for real-world systems, we
can make adjustments if and as appropriate, said adjustments including
usability.
Thanx, Paul
On 1/6/2026 2:17 PM, Paul E. McKenney wrote:
> On Mon, Jan 05, 2026 at 07:55:18PM -0500, Joel Fernandes wrote:
[..]
>>>> The optimized version maintains stable performance with essentially close to
>>>> zero rnp->lock overhead.
>>>>
>>>> rcutorture Testing
>>>> ------------------
>>>> TREE03 Testing with rcutorture without RCU or hotplug errors. More testing is
>>>> in progress.
>>>>
>>>> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the feature but
>>>> the plan is to eventually turn this on all the time.
>>>
>>> Yes, Aravinda, Gopinath, and I did publish that paper back in the day
>>> (with Aravinda having done almost all the work), but it was an artificial
>>> workload. Which is OK given that it was an academic effort. It has also
>>> provided some entertainment, for example, an audience member asking me
>>> if I was aware of this work in a linguistic-kill-shot manner. ;-)
>>>
>>> So are we finally seeing this effect in the wild?
>>
>> This patch set is also targeting a synthetic test I wrote to see if I could
>> reproduce a preemption problem. I know several instances over the years where my
>> teams (mainly at Google) were trying to resolve spin lock preemption inside
>> virtual machines by boosting vCPU threads. In the spirit of RCU performance and
>> VMs, we should probably optimize node locking IMO, but I do see your point of
>> view about optimizing real-world use cases as well.
>
> Also taking care of all spinlocks instead of doing large numbers of
> per-spinlock workarounds would be good. There are a *lot* of spinlocks
> in the Linux kernel!

I wouldn't call it a workaround yet. Avoiding lock contention by using per CPU
list is an optimization we have done before right? (Example the synthetic RCU
callback-flooding use case where we used a per-cpu list). We can call it
defensive programming, if you will. ;-) Especially in the scheduler hot path
where we are blocking/preempting. Again, I'm not saying we should do it for this
case since we are still studying the issue, but just on the fact that we are
optimizing a spin lock we acquire *a lot* shouldn't be categorized as a
workaround in my opinion.

This blocking is even more likely on preempt RT in read-side critical sections.
Again, I'm not saying that we should do this optimization, but I don't think we
can ignore it. At least not based on the data I have so far.

>> What bothers me about the current state of affairs is that even without any
>> grace period in progress, any task blocking in an RCU Read Side critical section
>> will take a (almost-)global lock that is shared by other CPUs who might also be
>> preempting/blocking RCU readers. Further, if this happens to be a vCPU that was
>> preempted while holding the node lock, then every other vCPU thread that blocks
>> in an RCU critical section will also block and end up slowing preemption down in
>> the vCPU. My preference would be to keep the readers fast while moving the
>> overhead to the slow path (the overhead being promoting tasks at the right time
>> that were blocked). In fact, in these patches, I'm directly going to the node
>> list if there is a grace period in progress.
>
> Not "(almost-)global"!
>
> That lock replicates itself automatically with increasing numbers of CPUs.
> That 16 used to be the full (at the time) 32-bit cpumask, but we decreased
> it to 16 based on performance feedback from Andi Kleen back in the day.
> If we are seeing real-world contention on that lock in real-world
> workloads on real-world systems, further adjustments could be made,
> either reducing CONFIG_RCU_FANOUT_LEAF further or offloading the lock,
> where your series is one example of the latter.

I meant it is global or almost global depending on the number of CPUs. So for
example on an 8 CPU system with the default fanout, it is a global lock, correct?

> I could easily believe that the vCPU preemption problem needs to be
> addressed, but doing so on a per-spinlock basis would lead to greatly
> increased complexity throughout the kernel, not just RCU.

I agree with this. I was not intending to solve this for the entire kernel, at
first at least.

>> About the deferred-preemption, I believe Steven Rostedt at one point was looking
>> at that for VMs, but that effort stalled as Peter is concerned about doing that
>> would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
>> communicate locking information between vCPU threads and the host and then let
>> the host avoid vCPU preemption - but the scheduler needs to do something with
>> that information. Otherwise, it's no use.
>
> Has deferred preemption for userspace locking also stalled? If not,
> then the scheduler's support for userspace should apply directly to
> guest OSes, right?

I don't think there have been any user space locking optimizations for
preemption that has made it upstream (AFAIK). I know there were efforts, but I
could be out of date there. I think the devil is in the details as well because
user space optimizations cannot always be applied to guests in my experience.
The VM exit path and the syscall entry/exit paths are quite different, including
the API boundary.

>>> Also if so, would the following rather simpler patch do the same trick,
>>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>> index 6a319e2926589..04dbee983b37d 100644
>>> --- a/kernel/rcu/Kconfig
>>> +++ b/kernel/rcu/Kconfig
>>> @@ -198,9 +198,9 @@ config RCU_FANOUT
>>>
>>> config RCU_FANOUT_LEAF
>>> 	int "Tree-based hierarchical RCU leaf-level fanout value"
>>> -	range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>>> -	range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>>> -	range 2 3 if RCU_STRICT_GRACE_PERIOD
>>> +	range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>>> +	range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>>> +	range 1 3 if RCU_STRICT_GRACE_PERIOD
>>> 	depends on TREE_RCU && RCU_EXPERT
>>> 	default 16 if !RCU_STRICT_GRACE_PERIOD
>>> 	default 2 if RCU_STRICT_GRACE_PERIOD
>>>
>>> ------------------------------------------------------------------------
>>>
>>> This passes a quick 20-minute rcutorture smoke test. Does it provide
>>> similar performance benefits?
>>
>> I tried this out, and it also brings down the contention and solves the problem
>> I saw (in testing so far).
>>
>> Would this work also if the test had grace periods init/cleanup racing with
>> preempted RCU read-side critical sections? I'm doing longer tests now to see how
>> this performs under GP-stress, versus my solution. I am also seeing that with
>> just the node lists, not per-cpu list, I see a dramatic throughput drop after
>> some amount of time, but I can't explain it. And I do not see this with the
>> per-cpu list solution (I'm currently testing if I see the same throughput drop
>> with the fan-out solution you proposed).
>
> Might the throughput drop be due to increased load on the host?

The load is constant with the benchmark, and the data is repeatable and
consistent. So random load on the host is unlikely.

> Another possibility is that tasks/vCPUs got shuffled so as to increase
> the probability of preemption.
>
> Also, doesn't your patch also cause the grace-period kthread to acquire
> that per-CPU lock, thus also possibly resulting in contention, vCPU
> preemption, and so on?

Yes, I'm tracing it more. Even with baseline (without these patches), I see this
throughput drop so it is worth investigating. I think it's something possibly
like a lock convoy forming, but the fact that if I don't use RNP locking, the
lock convoy disappears, and the throughput is completely stable. That tells me
that that has something to do with that or something related. I also measured
the exact RNP lock time and counted the number of contentions, so I am not
really guessing here. The RNP lock is contended consistently. I think it's a
great idea for me to extend this lock contention measurement to the run queue
locks as well, for me to measure how they are doing (or even extending it to all
locks, as you mentioned) - at least for me to confirm the theory that the same
test severely contends other locks as well.

>> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
>> reasonable, considering this is not a default. Are you suggesting defaulting to
>> this for small systems? If not, then I guess the optimization will not be
>> enabled by default. Eventually, with this patch set, if we are moving forward
>> with this approach, I will remove the config option for per-CPU block list
>> altogether so that it is enabled by default. That's kind of my plan if we agreed
>> on this, but it is just an RFC stage :).
>
> Right now, we are experimenting, so the usability issue is less pressing.
> Once we find out what is really going on for real-world systems, we
> can make adjustments if and as appropriate, said adjustments including
> usability.

Sure, thanks.

- Joel
On Tue, Jan 06, 2026 at 03:40:04PM -0500, Joel Fernandes wrote:
> On 1/6/2026 2:17 PM, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2026 at 07:55:18PM -0500, Joel Fernandes wrote:
> [..]
> >>>> The optimized version maintains stable performance with essentially close to
> >>>> zero rnp->lock overhead.
> >>>>
> >>>> rcutorture Testing
> >>>> ------------------
> >>>> TREE03 Testing with rcutorture without RCU or hotplug errors. More testing is
> >>>> in progress.
> >>>>
> >>>> Note: I have added a CONFIG_RCU_PER_CPU_BLOCKED_LISTS to guard the feature but
> >>>> the plan is to eventually turn this on all the time.
> >>>
> >>> Yes, Aravinda, Gopinath, and I did publish that paper back in the day
> >>> (with Aravinda having done almost all the work), but it was an artificial
> >>> workload. Which is OK given that it was an academic effort. It has also
> >>> provided some entertainment, for example, an audience member asking me
> >>> if I was aware of this work in a linguistic-kill-shot manner. ;-)
> >>>
> >>> So are we finally seeing this effect in the wild?
> >>
> >> This patch set is also targeting a synthetic test I wrote to see if I could
> >> reproduce a preemption problem. I know several instances over the years where my
> >> teams (mainly at Google) were trying to resolve spin lock preemption inside
> >> virtual machines by boosting vCPU threads. In the spirit of RCU performance and
> >> VMs, we should probably optimize node locking IMO, but I do see your point of
> >> view about optimizing real-world use cases as well.
> >
> > Also taking care of all spinlocks instead of doing large numbers of
> > per-spinlock workarounds would be good. There are a *lot* of spinlocks
> > in the Linux kernel!
>
> I wouldn't call it a workaround yet. Avoiding lock contention by using per CPU
> list is an optimization we have done before right? (Example the synthetic RCU
> callback-flooding use case where we used a per-cpu list). We can call it
> defensive programming, if you will. ;-) Especially in the scheduler hot path
> where we are blocking/preempting. Again, I'm not saying we should do it for this
> case since we are still studying the issue, but just on the fact that we are
> optimizing a spin lock we acquire *a lot* shouldn't be categorized as a
> workaround in my opinion.
>
> This blocking is even more likely on preempt RT in read-side critical sections.
> Again, I'm not saying that we should do this optimization, but I don't think we
> can ignore it. At least not based on the data I have so far.

If the main motivation is vCPU preemption, I consider this to be a
workaround for the lack of awareness of guest-OS locks by the host OS.
If there is some other reasonable way of generating contention on this
lock, then per-CPU locking is one specific way of addressing that
contention. As is reducing the value of CONFIG_RCU_FANOUT_LEAF and who
knows what all else.

> >> What bothers me about the current state of affairs is that even without any
> >> grace period in progress, any task blocking in an RCU Read Side critical section
> >> will take a (almost-)global lock that is shared by other CPUs who might also be
> >> preempting/blocking RCU readers. Further, if this happens to be a vCPU that was
> >> preempted while holding the node lock, then every other vCPU thread that blocks
> >> in an RCU critical section will also block and end up slowing preemption down in
> >> the vCPU. My preference would be to keep the readers fast while moving the
> >> overhead to the slow path (the overhead being promoting tasks at the right time
> >> that were blocked). In fact, in these patches, I'm directly going to the node
> >> list if there is a grace period in progress.
> >
> > Not "(almost-)global"!
> >
> > That lock replicates itself automatically with increasing numbers of CPUs.
> > That 16 used to be the full (at the time) 32-bit cpumask, but we decreased
> > it to 16 based on performance feedback from Andi Kleen back in the day.
> > If we are seeing real-world contention on that lock in real-world
> > workloads on real-world systems, further adjustments could be made,
> > either reducing CONFIG_RCU_FANOUT_LEAF further or offloading the lock,
> > where your series is one example of the latter.
>
> I meant it is global or almost global depending on the number of CPUs. So for
> example on an 8 CPU system with the default fanout, it is a global lock, correct?

Yes, but only assuming the default CONFIG_RCU_FANOUT_LEAF value of 16,
or some other value of 8 or larger. But in that case, there are only
8 CPUs contending for that lock, so is there really a problem? (In the
absence of vCPU contention, that is.)

And the default value can be changed if needed.

> > I could easily believe that the vCPU preemption problem needs to be
> > addressed, but doing so on a per-spinlock basis would lead to greatly
> > increased complexity throughout the kernel, not just RCU.
>
> I agree with this. I was not intending to solve this for the entire kernel, at
> first at least.

If addressing vCPU contention is the goal, how many locks are
individually adjusted before solving it for the whole kernel becomes
easier and less complex?

> >> About the deferred-preemption, I believe Steven Rostedt at one point was looking
> >> at that for VMs, but that effort stalled as Peter is concerned about doing that
> >> would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
> >> communicate locking information between vCPU threads and the host and then let
> >> the host avoid vCPU preemption - but the scheduler needs to do something with
> >> that information. Otherwise, it's no use.
> >
> > Has deferred preemption for userspace locking also stalled? If not,
> > then the scheduler's support for userspace should apply directly to
> > guest OSes, right?
>
> I don't think there have been any user space locking optimizations for
> preemption that has made it upstream (AFAIK). I know there were efforts, but I
> could be out of date there. I think the devil is in the details as well because
> user space optimizations cannot always be applied to guests in my experience.
> The VM exit path and the syscall entry/exit paths are quite different, including
> the API boundary.

Thomas and Steve are having another go at the userspace portion of this
problem. Should that make it in, the guest-OS portion might not be that
big an ask.

> >>> Also if so, would the following rather simpler patch do the same trick,
> >>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> >>> index 6a319e2926589..04dbee983b37d 100644
> >>> --- a/kernel/rcu/Kconfig
> >>> +++ b/kernel/rcu/Kconfig
> >>> @@ -198,9 +198,9 @@ config RCU_FANOUT
> >>>
> >>> config RCU_FANOUT_LEAF
> >>> 	int "Tree-based hierarchical RCU leaf-level fanout value"
> >>> -	range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> -	range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> -	range 2 3 if RCU_STRICT_GRACE_PERIOD
> >>> +	range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> +	range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
> >>> +	range 1 3 if RCU_STRICT_GRACE_PERIOD
> >>> 	depends on TREE_RCU && RCU_EXPERT
> >>> 	default 16 if !RCU_STRICT_GRACE_PERIOD
> >>> 	default 2 if RCU_STRICT_GRACE_PERIOD
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> This passes a quick 20-minute rcutorture smoke test. Does it provide
> >>> similar performance benefits?
> >>
> >> I tried this out, and it also brings down the contention and solves the problem
> >> I saw (in testing so far).
> >>
> >> Would this work also if the test had grace periods init/cleanup racing with
> >> preempted RCU read-side critical sections? I'm doing longer tests now to see how
> >> this performs under GP-stress, versus my solution. I am also seeing that with
> >> just the node lists, not per-cpu list, I see a dramatic throughput drop after
> >> some amount of time, but I can't explain it. And I do not see this with the
> >> per-cpu list solution (I'm currently testing if I see the same throughput drop
> >> with the fan-out solution you proposed).
> >
> > Might the throughput drop be due to increased load on the host?
>
> The load is constant with the benchmark, and the data is repeatable and
> consistent. So random load on the host is unlikely.

So you have a system with the various background threads corralled or
disabled?

> > Another possibility is that tasks/vCPUs got shuffled so as to increase
> > the probability of preemption.
> >
> > Also, doesn't your patch also cause the grace-period kthread to acquire
> > that per-CPU lock, thus also possibly resulting in contention, vCPU
> > preemption, and so on?
>
> Yes, I'm tracing it more. Even with baseline (without these patches), I see this
> throughput drop so it is worth investigating. I think it's something possibly
> like a lock convoy forming, but the fact that if I don't use RNP locking, the
> lock convoy disappears, and the throughput is completely stable. That tells me
> that that has something to do with that or something related. I also measured
> the exact RNP lock time and counted the number of contentions, so I am not
> really guessing here. The RNP lock is contended consistently. I think it's a
> great idea for me to extend this lock contention measurement to the run queue
> locks as well, for me to measure how they are doing (or even extending it to all
> locks, as you mentioned) - at least for me to confirm the theory that the same
> test severely contends other locks as well.

Is the RNP lock contended under non-overload conditions? If I remember
correctly, you were running 2x CPU overload.

Is the RNP lock contended in bare-metal kernels?

Thanx, Paul

> >> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
> >> reasonable, considering this is not a default. Are you suggesting defaulting to
> >> this for small systems? If not, then I guess the optimization will not be
> >> enabled by default. Eventually, with this patch set, if we are moving forward
> >> with this approach, I will remove the config option for per-CPU block list
> >> altogether so that it is enabled by default. That's kind of my plan if we agreed
> >> on this, but it is just an RFC stage :).
> >
> > Right now, we are experimenting, so the usability issue is less pressing.
> > Once we find out what is really going on for real-world systems, we
> > can make adjustments if and as appropriate, said adjustments including
> > usability.
>
> Sure, thanks.
>
> - Joel
>
On Tue, 6 Jan 2026 11:17:19 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Interesting. I might look into the boosting logic to see whether we can avoid
> > boosting certain tasks depending on whether they help the grace period complete
> > or not. Thank you for the suggestion.
>
> Just so you know, all of my simplification efforts thus far have instead
> made it more complex, but who knows what I might have been missing?

Maybe you are too smart to make it simple? ;-)

> I could easily believe that the vCPU preemption problem needs to be
> addressed, but doing so on a per-spinlock basis would lead to greatly
> increased complexity throughout the kernel, not just RCU.

Agreed.

>
> > > The main point of this patch series is to avoid lock contention due to
> > > vCPU preemption, correct? If so, will we need similar work on the other
> > > locks in the Linux kernel, both within RCU and elsewhere? I vaguely
> > > recall your doing some work along those lines a few years back, and
> > > maybe Thomas Gleixner's deferred-preemption work could help with this.
> > > Or not, who knows? Keeping the hypervisor informed of lock state is
> > > not necessarily free.
> >
> > Yes, I did some work on this at Google, but it turned out to be a very
> > fragmented effort in terms of where (which subsystem - KVM, scheduler etc)
> > should we do the priority boosting of vCPU threads. In the end, we just ended up
> > with an internal prototype that was not upstreamable but worked pretty well and
> > only had time for production (a lesson I learned there is we should probably
> > work on upstream solutions first, but life is not that easy sometimes).
>
> Which is one reason deferred preemption would be attractive.

Yes. That's why I've been pushing it.

>
> > About the deferred-preemption, I believe Steven Rostedt at one point was looking
> > at that for VMs, but that effort stalled as Peter is concerned about doing that
> > would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
> > communicate locking information between vCPU threads and the host and then let
> > the host avoid vCPU preemption - but the scheduler needs to do something with
> > that information. Otherwise, it's no use.
>
> Has deferred preemption for userspace locking also stalled? If not,
> then the scheduler's support for userspace should apply directly to
> guest OSes, right?

No, the user space deferred preemption is still moving along nicely (I
believe Thomas has completed most of it). The issue here is that the
deferred happens before going back to user space. That's a different
location than going back to the guest. The logic needs to be in that path
too.

One thing that Peter Zijlstra pushed was the limited amount of time that
deferred wait may happen. He says user space spinlocks are a bad design,
but it has been proven for that they are currently the most efficient when
coming to very short critical sections. That is, where the critical section
is shorter than the cost of a system call. Thus, he forces the deferred
scheduling to be at most 50us max (he's also suggested less than that).

But when it comes to the guest, where kernel spinlocks are user space
spinlocks, and can be held for more than 50us, I would like a way to have
the guests defer the scheduling for even longer than user space spin locks.

-- Steve
On Tue, Jan 06, 2026 at 03:19:24PM -0500, Steven Rostedt wrote:
> On Tue, 6 Jan 2026 11:17:19 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > Interesting. I might look into the boosting logic to see whether we can avoid
> > > boosting certain tasks depending on whether they help the grace period complete
> > > or not. Thank you for the suggestion.
> >
> > Just so you know, all of my simplification efforts thus far have instead
> > made it more complex, but who knows what I might have been missing?
>
> Maybe you are too smart to make it simple? ;-)

There is the old adage that the complexity of any software artifact
grows to just barely exceed the capabilities of those working on it. ;-)

But all that aside, getting fresh eyes on it would be a good thing.

> > I could easily believe that the vCPU preemption problem needs to be
> > addressed, but doing so on a per-spinlock basis would lead to greatly
> > increased complexity throughout the kernel, not just RCU.
>
> Agreed.
>
> > > > The main point of this patch series is to avoid lock contention due to
> > > > vCPU preemption, correct? If so, will we need similar work on the other
> > > > locks in the Linux kernel, both within RCU and elsewhere? I vaguely
> > > > recall your doing some work along those lines a few years back, and
> > > > maybe Thomas Gleixner's deferred-preemption work could help with this.
> > > > Or not, who knows? Keeping the hypervisor informed of lock state is
> > > > not necessarily free.
> > >
> > > Yes, I did some work on this at Google, but it turned out to be a very
> > > fragmented effort in terms of where (which subsystem - KVM, scheduler etc)
> > > should we do the priority boosting of vCPU threads. In the end, we just ended up
> > > with an internal prototype that was not upstreamable but worked pretty well and
> > > only had time for production (a lesson I learned there is we should probably
> > > work on upstream solutions first, but life is not that easy sometimes).
> >
> > Which is one reason deferred preemption would be attractive.
>
> Yes. That's why I've been pushing it.

Very good to hear!

> > > About the deferred-preemption, I believe Steven Rostedt at one point was looking
> > > at that for VMs, but that effort stalled as Peter is concerned about doing that
> > > would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
> > > communicate locking information between vCPU threads and the host and then let
> > > the host avoid vCPU preemption - but the scheduler needs to do something with
> > > that information. Otherwise, it's no use.
> >
> > Has deferred preemption for userspace locking also stalled? If not,
> > then the scheduler's support for userspace should apply directly to
> > guest OSes, right?
>
> No, the user space deferred preemption is still moving along nicely (I
> believe Thomas has completed most of it). The issue here is that the
> deferred happens before going back to user space. That's a different
> location than going back to the guest. The logic needs to be in that path
> too.

OK, got it, thank you!

> One thing that Peter Zijlstra pushed was the limited amount of time that
> deferred wait may happen. He says user space spinlocks are a bad design,
> but it has been proven for that they are currently the most efficient when
> coming to very short critical sections. That is, where the critical section
> is shorter than the cost of a system call. Thus, he forces the deferred
> scheduling to be at most 50us max (he's also suggested less than that).
>
> But when it comes to the guest, where kernel spinlocks are user space
> spinlocks, and can be held for more than 50us, I would like a way to have
> the guests defer the scheduling for even longer than user space spin locks.

I would *hope* that the rcu_node ->lock instances are held for less than
50us! At least in the absence of SMIs, NMIs, or vCPU preemption on systems
with at least 100MHz core CPU clock frequency. Besides, SMIs, NMIs, vCPU
preemption affect userspace locks, as do IRQs and softirqs.

Of course, hope springs eternal...

Thanx, Paul
On 1/6/2026 3:35 PM, Paul E. McKenney wrote:
>>>> About the deferred-preemption, I believe Steven Rostedt at one point was looking
>>>> at that for VMs, but that effort stalled as Peter is concerned about doing that
>>>> would mess up the scheduler. The idea (AFAIU) is to use the rseq page to
>>>> communicate locking information between vCPU threads and the host and then let
>>>> the host avoid vCPU preemption - but the scheduler needs to do something with
>>>> that information. Otherwise, it's no use.
>>> Has deferred preemption for userspace locking also stalled? If not,
>>> then the scheduler's support for userspace should apply directly to
>>> guest OSes, right?
>> No, the user space deferred preemption is still moving along nicely (I
>> believe Thomas has completed most of it). The issue here is that the
>> deferred happens before going back to user space. That's a different
>> location than going back to the guest. The logic needs to be in that path
>> too.
>
> OK, got it, thank you!

There's also the challenge of sharing the locking information with the guest
even when there is *no contention*. KVM being unaware of lock critical sections
in the VM-exit path. Then after that wiring it up with the deffered preemption
infra and moving beyond the 50 micro second limits. If we VM exited and then
made a decision, I think we are easily going to blow past 50 micro seconds
anyway.

But again to clarify, I didn't mean to use vCPU preemption as the driving
usecase for this.. but I ran into it when I wrote a benchmark to see how RCU
behaves in a VM.

- Joel
On Tue, Jan 06, 2026 at 03:49:07PM -0500, Joel Fernandes wrote: > > > On 1/6/2026 3:35 PM, Paul E. McKenney wrote: > >>>> About the deferred-preemption, I believe Steven Rostedt at one point was looking > >>>> at that for VMs, but that effort stalled as Peter is concerned about doing that > >>>> would mess up the scheduler. The idea (AFAIU) is to use the rseq page to > >>>> communicate locking information between vCPU threads and the host and then let > >>>> the host avoid vCPU preemption - but the scheduler needs to do something with > >>>> that information. Otherwise, it's no use. > >>> Has deferred preemption for userspace locking also stalled? If not, > >>> then the scheduler's support for userspace should apply directly to > >>> guest OSes, right? > >> No, the user space deferred preemption is still moving along nicely (I > >> believe Thomas has completed most of it). The issue here is that the > >> deferred happens before going back to user space. That's a different > >> location than going back to the guest. The logic needs to be in that path > >> too. > > > > OK, got it, thank you! > > There's also the challenge of sharing the locking information with the guest > even when there is *no contention*. KVM being unaware of lock critical sections > in the VM-exit path. Then after that wiring it up with the deffered preemption > infra and moving beyond the 50 micro second limits. If we VM exited and then > made a decision, I think we are easily going to blow past 50 micro seconds anyway. Yes, the VM-exit path would need to do its part. Could the 50 microseconds be measured up to but not including the VM exit? > But again to clarify, I didn't mean to use vCPU preemption as the driving > usecase for this.. but I ran into it when I wrote a benchmark to see how RCU > behaves in a VM. Me, I am just trying to keep the complexity down to a dull roar. So please do not take my pushback personally. "Just doing my job." Thanx, Paul
On 1/5/2026 7:55 PM, Joel Fernandes wrote: >> Also if so, would the following rather simpler patch do the same trick, >> if accompanied by CONFIG_RCU_FANOUT_LEAF=1? >> >> ------------------------------------------------------------------------ >> >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig >> index 6a319e2926589..04dbee983b37d 100644 >> --- a/kernel/rcu/Kconfig >> +++ b/kernel/rcu/Kconfig >> @@ -198,9 +198,9 @@ config RCU_FANOUT >> >> config RCU_FANOUT_LEAF >> int "Tree-based hierarchical RCU leaf-level fanout value" >> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD >> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD >> - range 2 3 if RCU_STRICT_GRACE_PERIOD >> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD >> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD >> + range 1 3 if RCU_STRICT_GRACE_PERIOD >> depends on TREE_RCU && RCU_EXPERT >> default 16 if !RCU_STRICT_GRACE_PERIOD >> default 2 if RCU_STRICT_GRACE_PERIOD >> >> ------------------------------------------------------------------------ >> >> This passes a quick 20-minute rcutorture smoke test. Does it provide >> similar performance benefits? > > I tried this out, and it also brings down the contention and solves the problem > I saw (in testing so far). > > Would this work also if the test had grace periods init/cleanup racing with > preempted RCU read-side critical sections? I'm doing longer tests now to see how > this performs under GP-stress, versus my solution. I am also seeing that with > just the node lists, not per-cpu list, I see a dramatic throughput drop after > some amount of time, but I can't explain it. And I do not see this with the > per-cpu list solution (I'm currently testing if I see the same throughput drop > with the fan-out solution you proposed). > > I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is > reasonable, considering this is not a default. Are you suggesting defaulting to > this for small systems? If not, then I guess the optimization will not be > enabled by default. Eventually, with this patch set, if we are moving forward > with this approach, I will remove the config option for per-CPU block list > altogether so that it is enabled by default. That's kind of my plan if we agreed > on this, but it is just an RFC stage 🙂.

So the fanout solution works great when there are grace periods in progress: I see no throughput drop and consistent performance with read-side critical sections. However, if we switch to having no grace periods in progress, I see the throughput dropping quite a bit here (-30%). I can't explain that, but I do not see that issue with per-CPU lists.

With the per-CPU list scheme, blocking does not involve the node at all as long as there is no grace period in progress. So, in that sense, the per-CPU blocked list is completely detached from RCU - it is a bit like lazy RCU in the sense that instead of a callback, it is the blocking task that sits on a per-CPU list, relieving RCU of the burden.

Maybe the extra layer of the node tree (with fanout == 1) somehow adds unnecessary overhead that does not exist with per-CPU lists? Even though there is this throughput drop, it still does better than the baseline with a common RCU node.

Based on this, I would say the per-CPU blocked list is still worth doing. Thoughts?

- Joel
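[Editorial aside: to make the comparison being discussed concrete, here is a toy userspace-C model of the per-CPU blocked-list fast path. It is a sketch of the idea only, not the kernel patch; all names (struct cpu_blkd, struct rcu_node_model, note_preempted_reader(), gp_in_progress) are illustrative. The point is simply that, with no grace period in progress, a preempted reader touches only a per-CPU list instead of the shared rnp->lock.]

------------------------------------------------------------------------

/*
 * Toy userspace model (not the actual patch): queue a preempted RCU
 * reader on a per-CPU list when no grace period is in progress, and on
 * the shared rcu_node list otherwise.  Lock initialization and the race
 * with a concurrently starting grace period (handled by "promotion" in
 * the real series) are omitted for brevity.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
	struct task *next;
};

struct cpu_blkd {
	struct task *head;		/* readers preempted on this CPU */
	pthread_spinlock_t lock;	/* per-CPU, essentially uncontended */
};

struct rcu_node_model {
	struct task *blkd_tasks;	/* readers blocking the current GP */
	pthread_mutex_t lock;		/* the shared, contended rnp->lock */
	bool gp_in_progress;
};

/* Called at context switch when the outgoing task holds rcu_read_lock(). */
static void note_preempted_reader(struct rcu_node_model *rnp,
				  struct cpu_blkd *pcpu, struct task *t)
{
	if (!__atomic_load_n(&rnp->gp_in_progress, __ATOMIC_ACQUIRE)) {
		/* Fast path: per-CPU list only, no shared lock taken. */
		pthread_spin_lock(&pcpu->lock);
		t->next = pcpu->head;
		pcpu->head = t;
		pthread_spin_unlock(&pcpu->lock);
		return;
	}
	/* Slow path: a GP is already waiting, so queue where it can see us. */
	pthread_mutex_lock(&rnp->lock);
	t->next = rnp->blkd_tasks;
	rnp->blkd_tasks = t;
	pthread_mutex_unlock(&rnp->lock);
}

------------------------------------------------------------------------

In this toy model the contended lock is taken only when a grace period is actually waiting on the reader, which matches the "no GPs active" condition Joel describes later in the thread; promotion, expedited GPs, and QS reporting are all left out.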
On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote: > > > On 1/5/2026 7:55 PM, Joel Fernandes wrote: > >> Also if so, would the following rather simpler patch do the same trick, > >> if accompanied by CONFIG_RCU_FANOUT_LEAF=1? > >> > >> ------------------------------------------------------------------------ > >> > >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > >> index 6a319e2926589..04dbee983b37d 100644 > >> --- a/kernel/rcu/Kconfig > >> +++ b/kernel/rcu/Kconfig > >> @@ -198,9 +198,9 @@ config RCU_FANOUT > >> > >> config RCU_FANOUT_LEAF > >> int "Tree-based hierarchical RCU leaf-level fanout value" > >> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD > >> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD > >> - range 2 3 if RCU_STRICT_GRACE_PERIOD > >> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD > >> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD > >> + range 1 3 if RCU_STRICT_GRACE_PERIOD > >> depends on TREE_RCU && RCU_EXPERT> default 16 if !RCU_STRICT_GRACE_PERIOD > >> default 2 if RCU_STRICT_GRACE_PERIOD > >> > >> ------------------------------------------------------------------------ > >> > >> This passes a quick 20-minute rcutorture smoke test. Does it provide > >> similar performance benefits? > > > > I tried this out, and it also brings down the contention and solves the problem > > I saw (in testing so far). > > > > Would this work also if the test had grace periods init/cleanup racing with > > preempted RCU read-side critical sections? I'm doing longer tests now to see how > > this performs under GP-stress, versus my solution. I am also seeing that with > > just the node lists, not per-cpu list, I see a dramatic throughput drop after > > some amount of time, but I can't explain it. And I do not see this with the > > per-cpu list solution (I'm currently testing if I see the same throughput drop > > with the fan-out solution you proposed). > > > > I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is > > reasonable, considering this is not a default. Are you suggesting defaulting to > > this for small systems? If not, then I guess the optimization will not be > > enabled by default. Eventually, with this patch set, if we are moving forward > > with this approach, I will remove the config option for per-CPU block list > > altogether so that it is enabled by default. That's kind of my plan if we agreed > > on this, but it is just an RFC stage 🙂. > > So the fanout solution works great when there are grace periods in progress. I > see no throughput drop, and consistent performance with read site critical > sections. However, if we switch to having no grace periods continuously > happening in progress, I can see the throughput dropping quite a bit here > (-30%). I can't explain that, but I do not see that issue with per-CPU lists. Might this be due to the change in number of tasks? Not having the thread that continuously runs grace periods might be affecting scheduling decisions, and with CPU overcommit, those scheduling decisions can cause large changes in throughput. Plus there are other spinlocks that might be subject to vCPU preemption, including the various scheduler spinlocks. > With the per-cpu list scheme, blocking does not involve the node at all, as long > as there is no grace period in progress. So, in that sense, per-CPU blocked list > is completely detached from RCU - it is a bit like lazy RCU in the sense instead > of a callback, it is the blocking task in a per-cpu list, relieving RCU of the > burden. 
Unless I am seriously misreading your patch, the grace-period kthread still acquires your per-CPU locks. Also, reducing the number of grace periods should *reduce* contention on the rcu_node ->lock. > Maybe the extra layer of the node tree (with fanout == 1) somehow adds > unnecessary overhead that does not exist with Per CPU lists? Even though there > is this throughput drop, it still does better than baseline with a common RCU node. > > Based on this, I would say per-cpu blocked list is still worth doing. Thoughts? I think that we need to understand the differences before jumping to conclusions. There are a lot of possible reasons for changes in throughput, especially given the CPU overload. After all, queuing theory suggests high variance in that case, possibly even on exactly the same setup. Thanx, Paul
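[Editorial aside: Paul's point about the grace-period kthread is the "promotion" step. Continuing the toy model from the sketch above (again, illustrative names only, not the actual patch), a newly started grace period sweeps each per-CPU list onto the shared rcu_node list, so the per-CPU locks are indeed acquired, but once per grace period rather than on every context switch.]

------------------------------------------------------------------------

/*
 * Toy userspace model, continued (same illustrative types as the
 * earlier sketch): when a grace period starts, promote readers from
 * every per-CPU list onto the shared rcu_node list so that the grace
 * period can wait on them.  Readers that raced past the flag check
 * just before it was set are not handled in this toy model.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct task { struct task *next; };

struct cpu_blkd {
	struct task *head;
	pthread_spinlock_t lock;
};

struct rcu_node_model {
	struct task *blkd_tasks;
	pthread_mutex_t lock;
	bool gp_in_progress;
};

/* Called from the grace-period path, before scanning for blocked readers. */
static void promote_blocked_readers(struct rcu_node_model *rnp,
				    struct cpu_blkd *percpu, int nr_cpus)
{
	pthread_mutex_lock(&rnp->lock);
	__atomic_store_n(&rnp->gp_in_progress, true, __ATOMIC_RELEASE);
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		struct cpu_blkd *c = &percpu[cpu];
		struct task *t;

		pthread_spin_lock(&c->lock);
		while ((t = c->head) != NULL) {	/* move the whole list */
			c->head = t->next;
			t->next = rnp->blkd_tasks;
			rnp->blkd_tasks = t;
		}
		pthread_spin_unlock(&c->lock);
	}
	pthread_mutex_unlock(&rnp->lock);
}

------------------------------------------------------------------------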
On 1/6/2026 2:24 PM, Paul E. McKenney wrote: > On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote: >> >> >> On 1/5/2026 7:55 PM, Joel Fernandes wrote: >>>> Also if so, would the following rather simpler patch do the same trick, >>>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1? >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig >>>> index 6a319e2926589..04dbee983b37d 100644 >>>> --- a/kernel/rcu/Kconfig >>>> +++ b/kernel/rcu/Kconfig >>>> @@ -198,9 +198,9 @@ config RCU_FANOUT >>>> >>>> config RCU_FANOUT_LEAF >>>> int "Tree-based hierarchical RCU leaf-level fanout value" >>>> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD >>>> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD >>>> - range 2 3 if RCU_STRICT_GRACE_PERIOD >>>> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD >>>> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD >>>> + range 1 3 if RCU_STRICT_GRACE_PERIOD >>>> depends on TREE_RCU && RCU_EXPERT> default 16 if !RCU_STRICT_GRACE_PERIOD >>>> default 2 if RCU_STRICT_GRACE_PERIOD >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> This passes a quick 20-minute rcutorture smoke test. Does it provide >>>> similar performance benefits? >>> >>> I tried this out, and it also brings down the contention and solves the problem >>> I saw (in testing so far). >>> >>> Would this work also if the test had grace periods init/cleanup racing with >>> preempted RCU read-side critical sections? I'm doing longer tests now to see how >>> this performs under GP-stress, versus my solution. I am also seeing that with >>> just the node lists, not per-cpu list, I see a dramatic throughput drop after >>> some amount of time, but I can't explain it. And I do not see this with the >>> per-cpu list solution (I'm currently testing if I see the same throughput drop >>> with the fan-out solution you proposed). >>> >>> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is >>> reasonable, considering this is not a default. Are you suggesting defaulting to >>> this for small systems? If not, then I guess the optimization will not be >>> enabled by default. Eventually, with this patch set, if we are moving forward >>> with this approach, I will remove the config option for per-CPU block list >>> altogether so that it is enabled by default. That's kind of my plan if we agreed >>> on this, but it is just an RFC stage 🙂. >> >> So the fanout solution works great when there are grace periods in progress. I >> see no throughput drop, and consistent performance with read site critical >> sections. However, if we switch to having no grace periods continuously >> happening in progress, I can see the throughput dropping quite a bit here >> (-30%). I can't explain that, but I do not see that issue with per-CPU lists. > > Might this be due to the change in number of tasks? Not having the > thread that continuously runs grace periods might be affecting scheduling > decisions, and with CPU overcommit, those scheduling decisions can cause > large changes in throughput. Plus there are other spinlocks that might > be subject to vCPU preemption, including the various scheduler spinlocks. Yeah these are all possible, currently studying it more :) >> With the per-cpu list scheme, blocking does not involve the node at all, as long >> as there is no grace period in progress. 
So, in that sense, per-CPU blocked list >> is completely detached from RCU - it is a bit like lazy RCU in the sense instead >> of a callback, it is the blocking task in a per-cpu list, relieving RCU of the >> burden. > > Unless I am seriously misreading your patch, the grace-period kthread still > acquires your per-CPU locks.

Yes, but I am not triggering grace periods (in the tests where I am expecting an improvement). It is in those tests that I am seeing the throughput drop with FANOUT, but let me confirm that again. I did run it 200 times and noticed this. I'm not sure what else a leaf fanout of one does, but this is my chance to learn about it :).

I am referring to the case when there are no GPs active (that is when the optimization in these patches kicks in). In one of the patches, if a grace period is in progress or has already started, I do not trigger the optimization; the optimization applies only when grace periods are not active. This is similar to lazy RCU, where, if we have active grace periods in progress, we don't really make new RCU callbacks lazy since it is pointless.

> Also, reducing the number of grace periods > should *reduce* contention on the rcu_node ->lock. > >> Maybe the extra layer of the node tree (with fanout == 1) somehow adds >> unnecessary overhead that does not exist with Per CPU lists? Even though there >> is this throughput drop, it still does better than baseline with a common RCU node. >> >> Based on this, I would say per-cpu blocked list is still worth doing. Thoughts? > > I think that we need to understand the differences before jumping > to conclusions. There are a lot of possible reasons for changes in > throughput, especially given the CPU overload. After all, queuing > theory suggests high variance in that case, possibly even on exactly > the same setup.

Sure, that's why I'm doing hundreds of runs to get repeatable results and cut back on the outliers. But it is quite challenging to study all the possibilities given the time constraints. I'm trying to collect as many traces as I can and study them. The synchronize_rcu() latency that I just improved, for instance, came out of one such exercise.

Thanks.
On Tue, Jan 06, 2026 at 04:24:07PM -0500, Joel Fernandes wrote: > > > On 1/6/2026 2:24 PM, Paul E. McKenney wrote: > > On Tue, Jan 06, 2026 at 10:08:51AM -0500, Joel Fernandes wrote: > >> > >> > >> On 1/5/2026 7:55 PM, Joel Fernandes wrote: > >>>> Also if so, would the following rather simpler patch do the same trick, > >>>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1? > >>>> > >>>> ------------------------------------------------------------------------ > >>>> > >>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > >>>> index 6a319e2926589..04dbee983b37d 100644 > >>>> --- a/kernel/rcu/Kconfig > >>>> +++ b/kernel/rcu/Kconfig > >>>> @@ -198,9 +198,9 @@ config RCU_FANOUT > >>>> > >>>> config RCU_FANOUT_LEAF > >>>> int "Tree-based hierarchical RCU leaf-level fanout value" > >>>> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD > >>>> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD > >>>> - range 2 3 if RCU_STRICT_GRACE_PERIOD > >>>> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD > >>>> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD > >>>> + range 1 3 if RCU_STRICT_GRACE_PERIOD > >>>> depends on TREE_RCU && RCU_EXPERT> default 16 if !RCU_STRICT_GRACE_PERIOD > >>>> default 2 if RCU_STRICT_GRACE_PERIOD > >>>> > >>>> ------------------------------------------------------------------------ > >>>> > >>>> This passes a quick 20-minute rcutorture smoke test. Does it provide > >>>> similar performance benefits? > >>> > >>> I tried this out, and it also brings down the contention and solves the problem > >>> I saw (in testing so far). > >>> > >>> Would this work also if the test had grace periods init/cleanup racing with > >>> preempted RCU read-side critical sections? I'm doing longer tests now to see how > >>> this performs under GP-stress, versus my solution. I am also seeing that with > >>> just the node lists, not per-cpu list, I see a dramatic throughput drop after > >>> some amount of time, but I can't explain it. And I do not see this with the > >>> per-cpu list solution (I'm currently testing if I see the same throughput drop > >>> with the fan-out solution you proposed). > >>> > >>> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is > >>> reasonable, considering this is not a default. Are you suggesting defaulting to > >>> this for small systems? If not, then I guess the optimization will not be > >>> enabled by default. Eventually, with this patch set, if we are moving forward > >>> with this approach, I will remove the config option for per-CPU block list > >>> altogether so that it is enabled by default. That's kind of my plan if we agreed > >>> on this, but it is just an RFC stage 🙂. > >> > >> So the fanout solution works great when there are grace periods in progress. I > >> see no throughput drop, and consistent performance with read site critical > >> sections. However, if we switch to having no grace periods continuously > >> happening in progress, I can see the throughput dropping quite a bit here > >> (-30%). I can't explain that, but I do not see that issue with per-CPU lists. > > > > Might this be due to the change in number of tasks? Not having the > > thread that continuously runs grace periods might be affecting scheduling > > decisions, and with CPU overcommit, those scheduling decisions can cause > > large changes in throughput. Plus there are other spinlocks that might > > be subject to vCPU preemption, including the various scheduler spinlocks. 
> > Yeah these are all possible, currently studying it more :) Looking forward to seeing what you find! > >> With the per-cpu list scheme, blocking does not involve the node at all, as long > >> as there is no grace period in progress. So, in that sense, per-CPU blocked list > >> is completely detached from RCU - it is a bit like lazy RCU in the sense instead > >> of a callback, it is the blocking task in a per-cpu list, relieving RCU of the > >> burden. > > > > Unless I am seriously misreading your patch, the grace-period kthread still > > acquires your per-CPU locks. > > Yes, but I am not triggering grace periods (in the tests where I am expecting an > improvement). It is in those tests that I am seeing the throughput drop with > FANOUT, but let me confirm that again. I did run it 200 times and notice this. > I'm not sure what else a fanout of one for leaves does, but this is my chance to > learn about it :). Well, TREE09 has tested at least one aspect of this configuration quite thoroughly over the years. ;-) > I am saying when there is no GPs active (that is when the optimization in these > patches is active). In one of the patches, if grace period is in progress or > already started, I do not trigger the optimization. The optimization is only > when grace periods are not active. This is similar to lazy RCU, where, if we > have active grace periods in progress, we don't really make new RCU callbacks > lazy since it is pointless. Interesting. When there is no grace period is also when it is least harmful to acquire the rcu_node structure's ->lock. > > Also, reducing the number of grace periods> should *reduce* contention on the > rcu_node ->lock. > > > >> Maybe the extra layer of the node tree (with fanout == 1) somehow adds > >> unnecessary overhead that does not exist with Per CPU lists? Even though there > >> is this throughput drop, it still does better than baseline with a common RCU node. > >> > >> Based on this, I would say per-cpu blocked list is still worth doing. Thoughts? > > > > I think that we need to understand the differences before jumping > > to conclusions. There are a lot of possible reasons for changes in > > throughput, especially given the CPU overload. After all, queuing > > theory suggests high variance in that case, possibly even on exactly > > the same setup. > > Sure, that's why I'm doing hundreds of runs to get repetitive results and cut > back on the outliers. But it is quite challenging to study all possibilities > given the time constraints. I'm trying to collect traces as much as I can and > study them. The synchronize RCU latency that I just improved, for instance, came > from one of such exercises. There is absolutely nothing wrong with experiments! Thanx, Paul