[PATCH] rcu: Fix delayed execution of hurry callbacks

Tze-nan Wu posted 1 patch 2 months, 3 weeks ago
kernel/rcu/tree_nocb.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Tze-nan Wu 2 months, 3 weeks ago
We observed a regression in our customer’s environment after enabling
CONFIG_LAZY_RCU. In the Android Update Engine scenario, where ioctl() is
used heavily, we found that callbacks queued via call_rcu_hurry (such as
percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up to 5
seconds before execution. This occurs because the new grace period does
not start immediately after the previous one completes.

The root cause is that the wake_nocb_gp_defer() function now checks
"rdp->nocb_defer_wakeup" instead of "rdp_gp->nocb_defer_wakeup". On CPUs
that are not rcuog, "rdp->nocb_defer_wakeup" may always be
RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup" to be
downgraded and the "rdp_gp->nocb_timer" to be postponed by up to 10
seconds, delaying the execution of hurry RCU callbacks.

The trace log of one scenario we encountered is as follow:
  // previous GP ends at this point
  rcu_preempt   [000] d..1.   137.240210: rcu_grace_period: rcu_preempt 8369 end
  rcu_preempt   [000] .....   137.240212: rcu_grace_period: rcu_preempt 8372 reqwait
  // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu", the callback waited on by UpdateEngine
  update_engine [002] d..1.   137.301593: __call_rcu_common: wyy: unlikely p_ref = 00000000********. lazy = 0
  // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1 jiffy (4ms)
  // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
  update_engine [002] d..2.   137.301595: rcu_nocb_wake: rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
  // FirstBQ event on cpu2 during the 1 jiffy, make the timer postpond 10 seconds later.
  // also, the rdp_gp->nocb_defer_wakeup is overwrite to RCU_NOCB_WAKE_LAZY
  update_engine [002] d..1.   137.301601: rcu_nocb_wake: rcu_preempt 2 WakeEmptyIsDeferred
  ...
  ...
  ...
  // before the 10 seconds timeout, cpu0 received another call_rcu_hurry
  // reset the timer to jiffies+1 and set the waketype = RCU_NOCB_WAKE.
  kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake: rcu_preempt 0 FirstQ
  kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake: rcu_preempt 0 WakeEmptyIsDeferred
  kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake: rcu_preempt 0 WakeNot
  kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake: rcu_preempt 0 WakeNot
  // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype == RCU_NOCB_WAKE
  <idle>        [000] d..1.   142.558786: rcu_nocb_wake: rcu_preempt 0 DoWake
  <idle>        [000] dN.1.   142.558839: rcu_nocb_wake: rcu_preempt 0 DeferredWake
  rcuog/0       [000] .....   142.558871: rcu_nocb_wake: rcu_preempt 0 EndSleep
  rcuog/0       [000] .....   142.558877: rcu_nocb_wake: rcu_preempt 0 Check
  // finally rcuog request a new GP at this point (5 seconds after the FirstQ event)
  rcuog/0       [000] d..2.   142.558886: rcu_grace_period: rcu_preempt 8372 newreq
  rcu_preempt   [001] d..1.   142.559458: rcu_grace_period: rcu_preempt 8373 start
  ...
  rcu_preempt   [000] d..1.   142.564258: rcu_grace_period: rcu_preempt 8373 end
  rcuop/2       [000] D..1.   142.566337: rcu_batch_start: rcu_preempt CBs=219 bl=10
  // the hurry CB is invoked at this point
  rcuop/2       [000] b....   142.566352: blk_queue_usage_counter_release: wyy: wakeup. p_ref = 00000000********.

This patch changes the condition to check "rdp_gp->nocb_defer_wakeup" in
the lazy path. This prevents an already scheduled "rdp_gp->nocb_timer"
from being postponed and avoids overwriting "rdp_gp->nocb_defer_wakeup"
when it is not RCU_NOCB_WAKE_NOT.

Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
Co-developed-by: Lorry.Luo@mediatek.com
Signed-off-by: Lorry.Luo@mediatek.com
Tested-by: weiyangyang@vivo.com
Signed-off-by: weiyangyang@vivo.com
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
---
The regression is first observed by wyy in the Update Engine scenario
with  CONFIG_LAZY_RCU enabled. there is an additional delay of 4–5
seconds during the heavy ioctl API call, waiting for
percpu_ref_switch_to_atomic_rcu (RCU hurry CB) to complete.

Initially, we suspected that the percpu_ref_switch_to_atomic_rcu
function itself was taking too long. However, after enabling some
custome and the following trace events: rcu_do_batch, rcu_nocb_wake, and
rcu_grace_period. we found that the root cause was that rcuog was not
being woken up in time to request a new GP. This led to the delay in
invoking the hurry RCU callback (percpu_ref_switch_to_atomic_rcu).

Environment:
  Android-16, Kernel: 6.12, 8 CPUs (ARM)

Configuration:
  CONFIG_TREE_RCU=y
  CONFIG_PREEMPT_RCU=y
  CONFIG_LAZY_RCU=y
  CONFIG_RCU_NOCB_CPU=y
  CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
  rcu_nocb_gp_stride = -1 (default is 4 for 8 cores)
  jiffies_lazy_flush = 10 * HZ

Contributions:
Tze-Nan Wu:
Collaborated with Cheng-Jui to discuss which tracepoints needed to be
added, jointly analyzed the trace logs, identified the root cause, and
proposed this upstream change.

Cheng-Jui Wang:
Provided many valuable suggestions during the debugging process,
repeatedly found breakthroughs when we were stuck, and helped identify
the root cause.

Lorry Luo:
Assisted in verifying whether rcu-hurry-callback was executed too long
or deferred, supported with testing, and helped with communication.

Weiyangyang:
Main tester who discovered the regression scenario, confirmed that
enabling CONFIG_LAZY_RCU caused the regression, and verified that this
patch resolves the issue.

Note:
With my limited understanding of lazy RCU, I am not fully confident that
this is a real issue. In my opinion, hurry callbacks should not be
delayed by other events such as firstBQ trace event.
If my understanding is incorrect, I would greatly appreciate any
guidance or clarification from the maintainers.
---
 kernel/rcu/tree_nocb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 08eb9b0e2fab..e6cd56603cad 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -276,7 +276,7 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	 * callback storms, no need to wake up too early.
 	 */
 	if (waketype == RCU_NOCB_WAKE_LAZY &&
-	    rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) {
+	    rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) {
 		mod_timer(&rdp_gp->nocb_timer, jiffies + rcu_get_jiffies_lazy_flush());
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
-- 
2.45.2

Re: [PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Frederic Weisbecker 2 months, 3 weeks ago
Le Thu, Jul 17, 2025 at 01:53:38PM +0800, Tze-nan Wu a écrit :
> We observed a regression in our customer’s environment after enabling
> CONFIG_LAZY_RCU. In the Android Update Engine scenario, where ioctl() is
> used heavily, we found that callbacks queued via call_rcu_hurry (such as
> percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up to 5
> seconds before execution. This occurs because the new grace period does
> not start immediately after the previous one completes.
> 
> The root cause is that the wake_nocb_gp_defer() function now checks
> "rdp->nocb_defer_wakeup" instead of "rdp_gp->nocb_defer_wakeup". On CPUs
> that are not rcuog, "rdp->nocb_defer_wakeup" may always be
> RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup" to be
> downgraded and the "rdp_gp->nocb_timer" to be postponed by up to 10
> seconds, delaying the execution of hurry RCU callbacks.
> 
> The trace log of one scenario we encountered is as follow:
>   // previous GP ends at this point
>   rcu_preempt   [000] d..1.   137.240210: rcu_grace_period: rcu_preempt 8369 end
>   rcu_preempt   [000] .....   137.240212: rcu_grace_period: rcu_preempt 8372 reqwait
>   // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu", the callback waited on by UpdateEngine
>   update_engine [002] d..1.   137.301593: __call_rcu_common: wyy: unlikely p_ref = 00000000********. lazy = 0
>   // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1 jiffy (4ms)
>   // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
>   update_engine [002] d..2.   137.301595: rcu_nocb_wake: rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
>   // FirstBQ event on cpu2 during the 1 jiffy, make the timer postpond 10 seconds later.
>   // also, the rdp_gp->nocb_defer_wakeup is overwrite to RCU_NOCB_WAKE_LAZY
>   update_engine [002] d..1.   137.301601: rcu_nocb_wake: rcu_preempt 2 WakeEmptyIsDeferred
>   ...
>   ...
>   ...
>   // before the 10 seconds timeout, cpu0 received another call_rcu_hurry
>   // reset the timer to jiffies+1 and set the waketype = RCU_NOCB_WAKE.
>   kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake: rcu_preempt 0 FirstQ
>   kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake: rcu_preempt 0 WakeEmptyIsDeferred
>   kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake: rcu_preempt 0 WakeNot
>   kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake: rcu_preempt 0 WakeNot
>   // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype == RCU_NOCB_WAKE
>   <idle>        [000] d..1.   142.558786: rcu_nocb_wake: rcu_preempt 0 DoWake
>   <idle>        [000] dN.1.   142.558839: rcu_nocb_wake: rcu_preempt 0 DeferredWake
>   rcuog/0       [000] .....   142.558871: rcu_nocb_wake: rcu_preempt 0 EndSleep
>   rcuog/0       [000] .....   142.558877: rcu_nocb_wake: rcu_preempt 0 Check
>   // finally rcuog request a new GP at this point (5 seconds after the FirstQ event)
>   rcuog/0       [000] d..2.   142.558886: rcu_grace_period: rcu_preempt 8372 newreq
>   rcu_preempt   [001] d..1.   142.559458: rcu_grace_period: rcu_preempt 8373 start
>   ...
>   rcu_preempt   [000] d..1.   142.564258: rcu_grace_period: rcu_preempt 8373 end
>   rcuop/2       [000] D..1.   142.566337: rcu_batch_start: rcu_preempt CBs=219 bl=10
>   // the hurry CB is invoked at this point
>   rcuop/2       [000] b....   142.566352: blk_queue_usage_counter_release: wyy: wakeup. p_ref = 00000000********.
> 
> This patch changes the condition to check "rdp_gp->nocb_defer_wakeup" in
> the lazy path. This prevents an already scheduled "rdp_gp->nocb_timer"
> from being postponed and avoids overwriting "rdp_gp->nocb_defer_wakeup"
> when it is not RCU_NOCB_WAKE_NOT.
> 
> Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
> Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> Co-developed-by: Lorry.Luo@mediatek.com
> Signed-off-by: Lorry.Luo@mediatek.com
> Tested-by: weiyangyang@vivo.com
> Signed-off-by: weiyangyang@vivo.com
> Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
> ---
> The regression is first observed by wyy in the Update Engine scenario
> with  CONFIG_LAZY_RCU enabled. there is an additional delay of 4–5
> seconds during the heavy ioctl API call, waiting for
> percpu_ref_switch_to_atomic_rcu (RCU hurry CB) to complete.
> 
> Initially, we suspected that the percpu_ref_switch_to_atomic_rcu
> function itself was taking too long. However, after enabling some
> custome and the following trace events: rcu_do_batch, rcu_nocb_wake, and
> rcu_grace_period. we found that the root cause was that rcuog was not
> being woken up in time to request a new GP. This led to the delay in
> invoking the hurry RCU callback (percpu_ref_switch_to_atomic_rcu).
> 
> Environment:
>   Android-16, Kernel: 6.12, 8 CPUs (ARM)
> 
> Configuration:
>   CONFIG_TREE_RCU=y
>   CONFIG_PREEMPT_RCU=y
>   CONFIG_LAZY_RCU=y
>   CONFIG_RCU_NOCB_CPU=y
>   CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
>   rcu_nocb_gp_stride = -1 (default is 4 for 8 cores)
>   jiffies_lazy_flush = 10 * HZ
> 
> Contributions:
> Tze-Nan Wu:
> Collaborated with Cheng-Jui to discuss which tracepoints needed to be
> added, jointly analyzed the trace logs, identified the root cause, and
> proposed this upstream change.
> 
> Cheng-Jui Wang:
> Provided many valuable suggestions during the debugging process,
> repeatedly found breakthroughs when we were stuck, and helped identify
> the root cause.
> 
> Lorry Luo:
> Assisted in verifying whether rcu-hurry-callback was executed too long
> or deferred, supported with testing, and helped with communication.
> 
> Weiyangyang:
> Main tester who discovered the regression scenario, confirmed that
> enabling CONFIG_LAZY_RCU caused the regression, and verified that this
> patch resolves the issue

Nice team work :-)

> 
> Note:
> With my limited understanding of lazy RCU, I am not fully confident that
> this is a real issue. In my opinion, hurry callbacks should not be
> delayed by other events such as firstBQ trace event.
> If my understanding is incorrect, I would greatly appreciate any
> guidance or clarification from the maintainers.

Your understanding looks quite right!

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Since the issue is there for 3 years now and was introduced with
the CONFIG_LAZY_RCU new feature, it can probably wait a few weeks
for the next merge window.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Tze-nan Wu (吳澤南) 2 months, 2 weeks ago
On Thu, 2025-07-17 at 15:34 +0200, Frederic Weisbecker wrote:
> 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> 
> 
> Le Thu, Jul 17, 2025 at 01:53:38PM +0800, Tze-nan Wu a écrit :
> > We observed a regression in our customer’s environment after
> > enabling
> > CONFIG_LAZY_RCU. In the Android Update Engine scenario, where
> > ioctl() is
> > used heavily, we found that callbacks queued via call_rcu_hurry
> > (such as
> > percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up to
> > 5
> > seconds before execution. This occurs because the new grace period
> > does
> > not start immediately after the previous one completes.
> > 
> > The root cause is that the wake_nocb_gp_defer() function now checks
> > "rdp->nocb_defer_wakeup" instead of "rdp_gp->nocb_defer_wakeup". On
> > CPUs
> > that are not rcuog, "rdp->nocb_defer_wakeup" may always be
> > RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup" to be
> > downgraded and the "rdp_gp->nocb_timer" to be postponed by up to 10
> > seconds, delaying the execution of hurry RCU callbacks.
> > 
> > The trace log of one scenario we encountered is as follow:
> >   // previous GP ends at this point
> >   rcu_preempt   [000] d..1.   137.240210: rcu_grace_period:
> > rcu_preempt 8369 end
> >   rcu_preempt   [000] .....   137.240212: rcu_grace_period:
> > rcu_preempt 8372 reqwait
> >   // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu", the
> > callback waited on by UpdateEngine
> >   update_engine [002] d..1.   137.301593: __call_rcu_common: wyy:
> > unlikely p_ref = 00000000********. lazy = 0
> >   // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1
> > jiffy (4ms)
> >   // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
> >   update_engine [002] d..2.   137.301595: rcu_nocb_wake:
> > rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
> >   // FirstBQ event on cpu2 during the 1 jiffy, make the timer
> > postpond 10 seconds later.
> >   // also, the rdp_gp->nocb_defer_wakeup is overwrite to
> > RCU_NOCB_WAKE_LAZY
> >   update_engine [002] d..1.   137.301601: rcu_nocb_wake:
> > rcu_preempt 2 WakeEmptyIsDeferred
> >   ...
> >   ...
> >   ...
> >   // before the 10 seconds timeout, cpu0 received another
> > call_rcu_hurry
> >   // reset the timer to jiffies+1 and set the waketype =
> > RCU_NOCB_WAKE.
> >   kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake:
> > rcu_preempt 0 FirstQ
> >   kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake:
> > rcu_preempt 0 WakeEmptyIsDeferred
> >   kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake:
> > rcu_preempt 0 WakeNot
> >   kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake:
> > rcu_preempt 0 WakeNot
> >   // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype ==
> > RCU_NOCB_WAKE
> >   <idle>        [000] d..1.   142.558786: rcu_nocb_wake:
> > rcu_preempt 0 DoWake
> >   <idle>        [000] dN.1.   142.558839: rcu_nocb_wake:
> > rcu_preempt 0 DeferredWake
> >   rcuog/0       [000] .....   142.558871: rcu_nocb_wake:
> > rcu_preempt 0 EndSleep
> >   rcuog/0       [000] .....   142.558877: rcu_nocb_wake:
> > rcu_preempt 0 Check
> >   // finally rcuog request a new GP at this point (5 seconds after
> > the FirstQ event)
> >   rcuog/0       [000] d..2.   142.558886: rcu_grace_period:
> > rcu_preempt 8372 newreq
> >   rcu_preempt   [001] d..1.   142.559458: rcu_grace_period:
> > rcu_preempt 8373 start
> >   ...
> >   rcu_preempt   [000] d..1.   142.564258: rcu_grace_period:
> > rcu_preempt 8373 end
> >   rcuop/2       [000] D..1.   142.566337: rcu_batch_start:
> > rcu_preempt CBs=219 bl=10
> >   // the hurry CB is invoked at this point
> >   rcuop/2       [000] b....   142.566352:
> > blk_queue_usage_counter_release: wyy: wakeup. p_ref =
> > 00000000********.
> > 
> > This patch changes the condition to check "rdp_gp-
> > >nocb_defer_wakeup" in
> > the lazy path. This prevents an already scheduled "rdp_gp-
> > >nocb_timer"
> > from being postponed and avoids overwriting "rdp_gp-
> > >nocb_defer_wakeup"
> > when it is not RCU_NOCB_WAKE_NOT.
> > 
> > Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
> > Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> > Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> > Co-developed-by: Lorry.Luo@mediatek.com
> > Signed-off-by: Lorry.Luo@mediatek.com
> > Tested-by: weiyangyang@vivo.com
> > Signed-off-by: weiyangyang@vivo.com
> > Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
> > ---
> > The regression is first observed by wyy in the Update Engine
> > scenario
> > with  CONFIG_LAZY_RCU enabled. there is an additional delay of 4–5
> > seconds during the heavy ioctl API call, waiting for
> > percpu_ref_switch_to_atomic_rcu (RCU hurry CB) to complete.
> > 
> > Initially, we suspected that the percpu_ref_switch_to_atomic_rcu
> > function itself was taking too long. However, after enabling some
> > custome and the following trace events: rcu_do_batch,
> > rcu_nocb_wake, and
> > rcu_grace_period. we found that the root cause was that rcuog was
> > not
> > being woken up in time to request a new GP. This led to the delay
> > in
> > invoking the hurry RCU callback (percpu_ref_switch_to_atomic_rcu).
> > 
> > Environment:
> >   Android-16, Kernel: 6.12, 8 CPUs (ARM)
> > 
> > Configuration:
> >   CONFIG_TREE_RCU=y
> >   CONFIG_PREEMPT_RCU=y
> >   CONFIG_LAZY_RCU=y
> >   CONFIG_RCU_NOCB_CPU=y
> >   CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
> >   rcu_nocb_gp_stride = -1 (default is 4 for 8 cores)
> >   jiffies_lazy_flush = 10 * HZ
> > 
> > Contributions:
> > Tze-Nan Wu:
> > Collaborated with Cheng-Jui to discuss which tracepoints needed to
> > be
> > added, jointly analyzed the trace logs, identified the root cause,
> > and
> > proposed this upstream change.
> > 
> > Cheng-Jui Wang:
> > Provided many valuable suggestions during the debugging process,
> > repeatedly found breakthroughs when we were stuck, and helped
> > identify
> > the root cause.
> > 
> > Lorry Luo:
> > Assisted in verifying whether rcu-hurry-callback was executed too
> > long
> > or deferred, supported with testing, and helped with communication.
> > 
> > Weiyangyang:
> > Main tester who discovered the regression scenario, confirmed that
> > enabling CONFIG_LAZY_RCU caused the regression, and verified that
> > this
> > patch resolves the issue
> 
> Nice team work :-)
> 
Thanks :)
> > 
> > Note:
> > With my limited understanding of lazy RCU, I am not fully confident
> > that
> > this is a real issue. In my opinion, hurry callbacks should not be
> > delayed by other events such as firstBQ trace event.
> > If my understanding is incorrect, I would greatly appreciate any
> > guidance or clarification from the maintainers.
> 
> Your understanding looks quite right!
> 
Glad to know I was on the right track.

> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> 
> Since the issue is there for 3 years now and was introduced with
> the CONFIG_LAZY_RCU new feature, it can probably wait a few weeks
> for the next merge window.
> 
> --
> Frederic Weisbecker
> SUSE Labs

Totally understand, we'll follow the kernel's development schedule.
--
Tze-Nan Wu


Re: [PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Joel Fernandes 2 months, 2 weeks ago

On 7/18/2025 5:10 AM, Tze-nan Wu (吳澤南) wrote:
> On Thu, 2025-07-17 at 15:34 +0200, Frederic Weisbecker wrote:
>> 
>> External email : Please do not click links or open attachments until
>> you have verified the sender or the content.
>> 
>> 
>> Le Thu, Jul 17, 2025 at 01:53:38PM +0800, Tze-nan Wu a écrit :
>> > We observed a regression in our customer’s environment after
>> > enabling
>> > CONFIG_LAZY_RCU. In the Android Update Engine scenario, where
>> > ioctl() is
>> > used heavily, we found that callbacks queued via call_rcu_hurry
>> > (such as
>> > percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up to
>> > 5
>> > seconds before execution. This occurs because the new grace period
>> > does
>> > not start immediately after the previous one completes.
>> > 
>> > The root cause is that the wake_nocb_gp_defer() function now checks
>> > "rdp->nocb_defer_wakeup" instead of "rdp_gp->nocb_defer_wakeup". On
>> > CPUs
>> > that are not rcuog, "rdp->nocb_defer_wakeup" may always be
>> > RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup" to be
>> > downgraded and the "rdp_gp->nocb_timer" to be postponed by up to 10
>> > seconds, delaying the execution of hurry RCU callbacks.
>> > 
>> > The trace log of one scenario we encountered is as follow:
>> >   // previous GP ends at this point
>> >   rcu_preempt   [000] d..1.   137.240210: rcu_grace_period:
>> > rcu_preempt 8369 end
>> >   rcu_preempt   [000] .....   137.240212: rcu_grace_period:
>> > rcu_preempt 8372 reqwait
>> >   // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu", the
>> > callback waited on by UpdateEngine
>> >   update_engine [002] d..1.   137.301593: __call_rcu_common: wyy:
>> > unlikely p_ref = 00000000********. lazy = 0
>> >   // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1
>> > jiffy (4ms)
>> >   // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
>> >   update_engine [002] d..2.   137.301595: rcu_nocb_wake:
>> > rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
>> >   // FirstBQ event on cpu2 during the 1 jiffy, make the timer
>> > postpond 10 seconds later.
>> >   // also, the rdp_gp->nocb_defer_wakeup is overwrite to
>> > RCU_NOCB_WAKE_LAZY
>> >   update_engine [002] d..1.   137.301601: rcu_nocb_wake:
>> > rcu_preempt 2 WakeEmptyIsDeferred
>> >   ...
>> >   ...
>> >   ...
>> >   // before the 10 seconds timeout, cpu0 received another
>> > call_rcu_hurry
>> >   // reset the timer to jiffies+1 and set the waketype =
>> > RCU_NOCB_WAKE.
>> >   kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake:
>> > rcu_preempt 0 FirstQ
>> >   kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake:
>> > rcu_preempt 0 WakeEmptyIsDeferred
>> >   kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake:
>> > rcu_preempt 0 WakeNot
>> >   kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake:
>> > rcu_preempt 0 WakeNot
>> >   // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype ==
>> > RCU_NOCB_WAKE
>> >   <idle>        [000] d..1.   142.558786: rcu_nocb_wake:
>> > rcu_preempt 0 DoWake
>> >   <idle>        [000] dN.1.   142.558839: rcu_nocb_wake:
>> > rcu_preempt 0 DeferredWake
>> >   rcuog/0       [000] .....   142.558871: rcu_nocb_wake:
>> > rcu_preempt 0 EndSleep
>> >   rcuog/0       [000] .....   142.558877: rcu_nocb_wake:
>> > rcu_preempt 0 Check
>> >   // finally rcuog request a new GP at this point (5 seconds after
>> > the FirstQ event)
>> >   rcuog/0       [000] d..2.   142.558886: rcu_grace_period:
>> > rcu_preempt 8372 newreq
>> >   rcu_preempt   [001] d..1.   142.559458: rcu_grace_period:
>> > rcu_preempt 8373 start
>> >   ...
>> >   rcu_preempt   [000] d..1.   142.564258: rcu_grace_period:
>> > rcu_preempt 8373 end
>> >   rcuop/2       [000] D..1.   142.566337: rcu_batch_start:
>> > rcu_preempt CBs=219 bl=10
>> >   // the hurry CB is invoked at this point
>> >   rcuop/2       [000] b....   142.566352:
>> > blk_queue_usage_counter_release: wyy: wakeup. p_ref =
>> > 00000000********.
>> > 
>> > This patch changes the condition to check "rdp_gp-
>> > >nocb_defer_wakeup" in
>> > the lazy path. This prevents an already scheduled "rdp_gp-
>> > >nocb_timer"
>> > from being postponed and avoids overwriting "rdp_gp-
>> > >nocb_defer_wakeup"
>> > when it is not RCU_NOCB_WAKE_NOT.
>> > 
>> > Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
>> > Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
>> > Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
>> > Co-developed-by: Lorry.Luo@mediatek.com
>> > Signed-off-by: Lorry.Luo@mediatek.com
>> > Tested-by: weiyangyang@vivo.com
>> > Signed-off-by: weiyangyang@vivo.com
>> > Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
>> > ---
>> > The regression is first observed by wyy in the Update Engine
>> > scenario
>> > with  CONFIG_LAZY_RCU enabled. there is an additional delay of 4–5
>> > seconds during the heavy ioctl API call, waiting for
>> > percpu_ref_switch_to_atomic_rcu (RCU hurry CB) to complete.
>> > 
>> > Initially, we suspected that the percpu_ref_switch_to_atomic_rcu
>> > function itself was taking too long. However, after enabling some
>> > custome and the following trace events: rcu_do_batch,
>> > rcu_nocb_wake, and
>> > rcu_grace_period. we found that the root cause was that rcuog was
>> > not
>> > being woken up in time to request a new GP. This led to the delay
>> > in
>> > invoking the hurry RCU callback (percpu_ref_switch_to_atomic_rcu).
>> > 
>> > Environment:
>> >   Android-16, Kernel: 6.12, 8 CPUs (ARM)
>> > 
>> > Configuration:
>> >   CONFIG_TREE_RCU=y
>> >   CONFIG_PREEMPT_RCU=y
>> >   CONFIG_LAZY_RCU=y
>> >   CONFIG_RCU_NOCB_CPU=y
>> >   CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
>> >   rcu_nocb_gp_stride = -1 (default is 4 for 8 cores)
>> >   jiffies_lazy_flush = 10 * HZ
>> > 
>> > Contributions:
>> > Tze-Nan Wu:
>> > Collaborated with Cheng-Jui to discuss which tracepoints needed to
>> > be
>> > added, jointly analyzed the trace logs, identified the root cause,
>> > and
>> > proposed this upstream change.
>> > 
>> > Cheng-Jui Wang:
>> > Provided many valuable suggestions during the debugging process,
>> > repeatedly found breakthroughs when we were stuck, and helped
>> > identify
>> > the root cause.
>> > 
>> > Lorry Luo:
>> > Assisted in verifying whether rcu-hurry-callback was executed too
>> > long
>> > or deferred, supported with testing, and helped with communication.
>> > 
>> > Weiyangyang:
>> > Main tester who discovered the regression scenario, confirmed that
>> > enabling CONFIG_LAZY_RCU caused the regression, and verified that
>> > this
>> > patch resolves the issue
>> 
>> Nice team work :-)
>> 
> Thanks :)
>> > 
>> > Note:
>> > With my limited understanding of lazy RCU, I am not fully confident
>> > that
>> > this is a real issue. In my opinion, hurry callbacks should not be
>> > delayed by other events such as firstBQ trace event.
>> > If my understanding is incorrect, I would greatly appreciate any
>> > guidance or clarification from the maintainers.
>> 
>> Your understanding looks quite right!
>> 
> Glad to know I was on the right track.
> 
>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>> 
>> Since the issue is there for 3 years now and was introduced with
>> the CONFIG_LAZY_RCU new feature, it can probably wait a few weeks
>> for the next merge window.


Nice, thanks!

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>

I suggest let us add the Fixes tag too so that stable gets it.

thanks,

 - Joel

Re: [PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Tze-nan Wu (吳澤南) 2 months, 2 weeks ago
On Wed, 2025-07-23 at 10:32 -0400, Joel Fernandes wrote:
> 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> 
> 
> On 7/18/2025 5:10 AM, Tze-nan Wu (吳澤南) wrote:
> > On Thu, 2025-07-17 at 15:34 +0200, Frederic Weisbecker wrote:
> > > 
> > > External email : Please do not click links or open attachments
> > > until
> > > you have verified the sender or the content.
> > > 
> > > 
> > > Le Thu, Jul 17, 2025 at 01:53:38PM +0800, Tze-nan Wu a écrit :
> > > > We observed a regression in our customer’s environment after
> > > > enabling
> > > > CONFIG_LAZY_RCU. In the Android Update Engine scenario, where
> > > > ioctl() is
> > > > used heavily, we found that callbacks queued via call_rcu_hurry
> > > > (such as
> > > > percpu_ref_switch_to_atomic_rcu) can sometimes be delayed by up
> > > > to
> > > > 5
> > > > seconds before execution. This occurs because the new grace
> > > > period
> > > > does
> > > > not start immediately after the previous one completes.
> > > > 
> > > > The root cause is that the wake_nocb_gp_defer() function now
> > > > checks
> > > > "rdp->nocb_defer_wakeup" instead of "rdp_gp-
> > > > >nocb_defer_wakeup". On
> > > > CPUs
> > > > that are not rcuog, "rdp->nocb_defer_wakeup" may always be
> > > > RCU_NOCB_WAKE_NOT. This can cause "rdp_gp->nocb_defer_wakeup"
> > > > to be
> > > > downgraded and the "rdp_gp->nocb_timer" to be postponed by up
> > > > to 10
> > > > seconds, delaying the execution of hurry RCU callbacks.
> > > > 
> > > > The trace log of one scenario we encountered is as follow:
> > > >   // previous GP ends at this point
> > > >   rcu_preempt   [000] d..1.   137.240210: rcu_grace_period:
> > > > rcu_preempt 8369 end
> > > >   rcu_preempt   [000] .....   137.240212: rcu_grace_period:
> > > > rcu_preempt 8372 reqwait
> > > >   // call_rcu_hurry enqueues "percpu_ref_switch_to_atomic_rcu",
> > > > the
> > > > callback waited on by UpdateEngine
> > > >   update_engine [002] d..1.   137.301593: __call_rcu_common:
> > > > wyy:
> > > > unlikely p_ref = 00000000********. lazy = 0
> > > >   // FirstQ on cpu 2 rdp_gp->nocb_timer is set to fire after 1
> > > > jiffy (4ms)
> > > >   // and the rdp_gp->nocb_defer_wakeup is set to RCU_NOCB_WAKE
> > > >   update_engine [002] d..2.   137.301595: rcu_nocb_wake:
> > > > rcu_preempt 2 FirstQ on cpu2 with rdp_gp (cpu0).
> > > >   // FirstBQ event on cpu2 during the 1 jiffy, make the timer
> > > > postpond 10 seconds later.
> > > >   // also, the rdp_gp->nocb_defer_wakeup is overwrite to
> > > > RCU_NOCB_WAKE_LAZY
> > > >   update_engine [002] d..1.   137.301601: rcu_nocb_wake:
> > > > rcu_preempt 2 WakeEmptyIsDeferred
> > > >   ...
> > > >   ...
> > > >   ...
> > > >   // before the 10 seconds timeout, cpu0 received another
> > > > call_rcu_hurry
> > > >   // reset the timer to jiffies+1 and set the waketype =
> > > > RCU_NOCB_WAKE.
> > > >   kworker/u32:0 [000] d..2.   142.557564: rcu_nocb_wake:
> > > > rcu_preempt 0 FirstQ
> > > >   kworker/u32:0 [000] d..1.   142.557576: rcu_nocb_wake:
> > > > rcu_preempt 0 WakeEmptyIsDeferred
> > > >   kworker/u32:0 [000] d..1.   142.558296: rcu_nocb_wake:
> > > > rcu_preempt 0 WakeNot
> > > >   kworker/u32:0 [000] d..1.   142.558562: rcu_nocb_wake:
> > > > rcu_preempt 0 WakeNot
> > > >   // idle(do_nocb_deferred_wakeup) wake rcuog due to waketype
> > > > ==
> > > > RCU_NOCB_WAKE
> > > >   <idle>        [000] d..1.   142.558786: rcu_nocb_wake:
> > > > rcu_preempt 0 DoWake
> > > >   <idle>        [000] dN.1.   142.558839: rcu_nocb_wake:
> > > > rcu_preempt 0 DeferredWake
> > > >   rcuog/0       [000] .....   142.558871: rcu_nocb_wake:
> > > > rcu_preempt 0 EndSleep
> > > >   rcuog/0       [000] .....   142.558877: rcu_nocb_wake:
> > > > rcu_preempt 0 Check
> > > >   // finally rcuog request a new GP at this point (5 seconds
> > > > after
> > > > the FirstQ event)
> > > >   rcuog/0       [000] d..2.   142.558886: rcu_grace_period:
> > > > rcu_preempt 8372 newreq
> > > >   rcu_preempt   [001] d..1.   142.559458: rcu_grace_period:
> > > > rcu_preempt 8373 start
> > > >   ...
> > > >   rcu_preempt   [000] d..1.   142.564258: rcu_grace_period:
> > > > rcu_preempt 8373 end
> > > >   rcuop/2       [000] D..1.   142.566337: rcu_batch_start:
> > > > rcu_preempt CBs=219 bl=10
> > > >   // the hurry CB is invoked at this point
> > > >   rcuop/2       [000] b....   142.566352:
> > > > blk_queue_usage_counter_release: wyy: wakeup. p_ref =
> > > > 00000000********.
> > > > 
> > > > This patch changes the condition to check "rdp_gp-
> > > > > nocb_defer_wakeup" in
> > > > the lazy path. This prevents an already scheduled "rdp_gp-
> > > > > nocb_timer"
> > > > from being postponed and avoids overwriting "rdp_gp-
> > > > > nocb_defer_wakeup"
> > > > when it is not RCU_NOCB_WAKE_NOT.
> > > > 
> > > > Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
> > > > Co-developed-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> > > > Signed-off-by: Cheng-jui Wang <cheng-jui.wang@mediatek.com>
> > > > Co-developed-by: Lorry.Luo@mediatek.com
> > > > Signed-off-by: Lorry.Luo@mediatek.com
> > > > Tested-by: weiyangyang@vivo.com
> > > > Signed-off-by: weiyangyang@vivo.com
> > > > Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
> > > > ---
> > > > The regression is first observed by wyy in the Update Engine
> > > > scenario
> > > > with  CONFIG_LAZY_RCU enabled. there is an additional delay of
> > > > 4–5
> > > > seconds during the heavy ioctl API call, waiting for
> > > > percpu_ref_switch_to_atomic_rcu (RCU hurry CB) to complete.
> > > > 
> > > > Initially, we suspected that the
> > > > percpu_ref_switch_to_atomic_rcu
> > > > function itself was taking too long. However, after enabling
> > > > some
> > > > custome and the following trace events: rcu_do_batch,
> > > > rcu_nocb_wake, and
> > > > rcu_grace_period. we found that the root cause was that rcuog
> > > > was
> > > > not
> > > > being woken up in time to request a new GP. This led to the
> > > > delay
> > > > in
> > > > invoking the hurry RCU callback
> > > > (percpu_ref_switch_to_atomic_rcu).
> > > > 
> > > > Environment:
> > > >   Android-16, Kernel: 6.12, 8 CPUs (ARM)
> > > > 
> > > > Configuration:
> > > >   CONFIG_TREE_RCU=y
> > > >   CONFIG_PREEMPT_RCU=y
> > > >   CONFIG_LAZY_RCU=y
> > > >   CONFIG_RCU_NOCB_CPU=y
> > > >   CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
> > > >   rcu_nocb_gp_stride = -1 (default is 4 for 8 cores)
> > > >   jiffies_lazy_flush = 10 * HZ
> > > > 
> > > > Contributions:
> > > > Tze-Nan Wu:
> > > > Collaborated with Cheng-Jui to discuss which tracepoints needed
> > > > to
> > > > be
> > > > added, jointly analyzed the trace logs, identified the root
> > > > cause,
> > > > and
> > > > proposed this upstream change.
> > > > 
> > > > Cheng-Jui Wang:
> > > > Provided many valuable suggestions during the debugging
> > > > process,
> > > > repeatedly found breakthroughs when we were stuck, and helped
> > > > identify
> > > > the root cause.
> > > > 
> > > > Lorry Luo:
> > > > Assisted in verifying whether rcu-hurry-callback was executed
> > > > too
> > > > long
> > > > or deferred, supported with testing, and helped with
> > > > communication.
> > > > 
> > > > Weiyangyang:
> > > > Main tester who discovered the regression scenario, confirmed
> > > > that
> > > > enabling CONFIG_LAZY_RCU caused the regression, and verified
> > > > that
> > > > this
> > > > patch resolves the issue
> > > 
> > > Nice team work :-)
> > > 
> > Thanks :)
> > > > 
> > > > Note:
> > > > With my limited understanding of lazy RCU, I am not fully
> > > > confident
> > > > that
> > > > this is a real issue. In my opinion, hurry callbacks should not
> > > > be
> > > > delayed by other events such as firstBQ trace event.
> > > > If my understanding is incorrect, I would greatly appreciate
> > > > any
> > > > guidance or clarification from the maintainers.
> > > 
> > > Your understanding looks quite right!
> > > 
> > Glad to know I was on the right track.
> > 
> > > Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> > > 
> > > Since the issue is there for 3 years now and was introduced with
> > > the CONFIG_LAZY_RCU new feature, it can probably wait a few weeks
> > > for the next merge window.
> 
> 
> Nice, thanks!
> 
> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> 
> I suggest let us add the Fixes tag too so that stable gets it.
> 
> thanks,
> 
>  - Joel
> 

I noticed that I’ve already added the FIXES: tag, did you mean that we
should also add this line "CC:stable@vger.kernel.org" ?

If that’s the case, should I submit a new version of this patch(V2)?

thanks,
--
Tze-Nan Wu

Re: [PATCH] rcu: Fix delayed execution of hurry callbacks
Posted by Neeraj Upadhyay 2 months, 2 weeks ago
> > Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> > 
> > I suggest let us add the Fixes tag too so that stable gets it.
> > 
> > thanks,
> > 
> >  - Joel
> > 
> 
> I noticed that I’ve already added the FIXES: tag, did you mean that we
> should also add this line "CC:stable@vger.kernel.org" ?
> 
> If that’s the case, should I submit a new version of this patch(V2)?
> 


Thanks Joel for reviewing the patch! I have already queued it for 6.17
[1]. So, v2 won't be required.


[1] https://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux.git/log/?h=next


- Neeraj