[PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early

Posted by Joel Fernandes 1 month, 2 weeks ago
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design, where the first FQS scan saves dyntick-idle snapshots and
the second FQS scan compares against them. This adds long and unnecessary
latency to synchronize_rcu() on idle systems (two FQS waits of ~3 ms each
at HZ=1000) even when a single FQS wait would have sufficed.

Investigation showed that the GP kthread's own CPU is often the holdout
CPU after the first FQS scan: it cannot be detected as "idle" because it
is busy running the FQS scan in the GP kthread.

Therefore, at the end of rcu_gp_init(), immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
GP initialization, so this is safe and results in significant latency
improvements.
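
Back-of-the-envelope with the numbers below: the roughly 4 ms drop in
mean latency (~10 ms -> ~6 ms) corresponds to eliminating one ~3 ms FQS
wait plus its associated wakeup slack in the common all-idle case.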

I benchmarked 100 synchronize_rcu() calls on a 32-CPU system, 10 runs
each, using the default fqs-jiffies settings (a minimal sketch of the
measurement loop follows the summary below). The results show significant
latency improvements:

Baseline (without fix):
| Run | Mean      | Min      | Max       |
|-----|-----------|----------|-----------|
| 1   | 10.088 ms | 9.989 ms | 18.848 ms |
| 2   | 10.064 ms | 9.982 ms | 16.470 ms |
| 3   | 10.051 ms | 9.988 ms | 15.113 ms |
| 4   | 10.125 ms | 9.929 ms | 22.411 ms |
| 5   |  8.695 ms | 5.996 ms | 15.471 ms |
| 6   | 10.157 ms | 9.977 ms | 25.723 ms |
| 7   | 10.102 ms | 9.990 ms | 20.224 ms |
| 8   |  8.050 ms | 5.985 ms | 10.007 ms |
| 9   | 10.059 ms | 9.978 ms | 15.934 ms |
| 10  | 10.077 ms | 9.984 ms | 17.703 ms |

With fix:
| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 6.027 ms | 5.915 ms |  8.589 ms |
| 2   | 6.032 ms | 5.984 ms |  9.241 ms |
| 3   | 6.010 ms | 5.986 ms |  7.004 ms |
| 4   | 6.076 ms | 5.993 ms | 10.001 ms |
| 5   | 6.084 ms | 5.893 ms | 10.250 ms |
| 6   | 6.034 ms | 5.908 ms |  9.456 ms |
| 7   | 6.051 ms | 5.993 ms | 10.000 ms |
| 8   | 6.057 ms | 5.941 ms | 10.001 ms |
| 9   | 6.016 ms | 5.927 ms |  7.540 ms |
| 10  | 6.036 ms | 5.993 ms |  9.579 ms |

Summary:
- Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
- Max latency:  25.72 ms -> 10.25 ms (60% improvement)

Tested rcutorture TREE and SRCU configurations.
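
For reference, the measurement loop was of roughly the following shape (a
minimal sketch only, not the exact test module used for the numbers above;
the module and symbol names below are made up):

#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/limits.h>
#include <linux/module.h>
#include <linux/rcupdate.h>

/* Time 100 back-to-back synchronize_rcu() calls, print mean/min/max. */
static int __init sync_rcu_lat_init(void)
{
	u64 sum_us = 0, min_us = U64_MAX, max_us = 0;
	int i;

	for (i = 0; i < 100; i++) {
		ktime_t start = ktime_get();
		u64 delta_us;

		synchronize_rcu();
		delta_us = ktime_us_delta(ktime_get(), start);

		sum_us += delta_us;
		if (delta_us < min_us)
			min_us = delta_us;
		if (delta_us > max_us)
			max_us = delta_us;
	}

	pr_info("synchronize_rcu: mean %llu us, min %llu us, max %llu us\n",
		(unsigned long long)(sum_us / 100),
		(unsigned long long)min_us, (unsigned long long)max_us);
	return 0;
}
module_init(sync_rcu_lat_init);

static void __exit sync_rcu_lat_exit(void)
{
}
module_exit(sync_rcu_lat_exit);

MODULE_LICENSE("GPL");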

[apply paulmck feedback on moving the logic to rcu_gp_init()]

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8293bae1dec1..0c7710caf041 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,6 +160,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
 			      unsigned long gps, unsigned long flags);
 static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
+static void rcu_report_qs_rdp(struct rcu_data *rdp);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -1983,6 +1984,17 @@ static noinline_for_stack bool rcu_gp_init(void)
 	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
 
+	/*
+	 * Immediately report QS for the GP kthread's CPU. The GP kthread
+	 * cannot be in an RCU read-side critical section while running
+	 * the FQS scan. This eliminates the need for a second FQS wait
+	 * when all CPUs are idle.
+	 */
+	preempt_disable();
+	rcu_qs();
+	rcu_report_qs_rdp(this_cpu_ptr(&rcu_data));
+	preempt_enable();
+
 	return true;
 }
 
-- 
2.34.1
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> [ . . . ]
> Summary:
> - Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
> - Max latency:  25.72 ms -> 10.25 ms (60% improvement)
> 
> Tested rcutorture TREE and SRCU configurations.
> 
> [apply paulmck feedack on moving logic to rcu_gp_init()]

If anything, these numbers look better, so good show!!!

Are there workloads that might be hurt by some side effect such
as increased CPU utilization by the RCU grace-period kthread?  One
non-mainstream hypothetical situation that comes to mind is a kernel
built with SMP=y but running on a single-CPU system with a high-frequency
periodic interrupt that does call_rcu().  Might that result in the RCU
grace-period kthread chewing up the entire CPU?

For a non-hypothetical case, could you please see if one of the
battery-powered embedded guys would be willing to test this?

							Thanx, Paul

Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 1 week ago
On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > [ . . . ]
> 
> If anything, these numbers look better, so good show!!!

Thanks, I ended up collecting more samples for v2 to further confirm the
improvements.

> Are there workloads that might be hurt by some side effect such
> as increased CPU utilization by the RCU grace-period kthread?  One
> non-mainstream hypothetical situation that comes to mind is a kernel
> built with SMP=y but running on a single-CPU system with a high-frequence
> periodic interrupt that does call_rcu().  Might that result in the RCU
> grace-period kthread chewing up the entire CPU?

There are still GP delays due to FQS even with this change, so I believe
it could not chew up the entire CPU. The GP cycle should still insert
delays into the GP kthread. In my testing I did not see synchronize_rcu()
latency drop to sub-millisecond; it was still limited by the timer-wheel
delays and the FQS delays.

> For a non-hypothetical case, could you please see if one of the
> battery-powered embedded guys would be willing to test this?

My suspicion is that the battery-powered folks are already running
RCU_LAZY to reduce RCU activity, so they wouldn't be affected: call_rcu()
invocations during idle periods will go to the bypass. Last I checked,
Android and ChromeOS were both enabling RCU_LAZY everywhere (back when I
was at Google).

Uladzislau works on embedded (or did until recently) and recently looked
at this area for improvements, so perhaps he can help quantify this too.
He is on CC. I personally don't work directly on embedded at the moment,
just big compute-hungry machines. ;-) Uladzislau, would you have some
time to test on your Android devices?

thanks,

 - Joel
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Uladzislau Rezki 1 month, 1 week ago
On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > > [ . . . ]
> > 
> > If anything, these numbers look better, so good show!!!
> 
> Thanks, I ended up collecting more samples in the v2 to further confirm the
> improvements.
> 
> > Are there workloads that might be hurt by some side effect such
> > as increased CPU utilization by the RCU grace-period kthread?  One
> > non-mainstream hypothetical situation that comes to mind is a kernel
> > built with SMP=y but running on a single-CPU system with a high-frequence
> > periodic interrupt that does call_rcu().  Might that result in the RCU
> > grace-period kthread chewing up the entire CPU?
> 
> There are still GP delays due to FQS, even with this change, so it could not
> chew up the entire CPU I believe. The GP cycle should still insert delays
> into the GP kthread. I did not notice in my testing that synchronize_rcu()
> latency dropping to sub millisecond, it was still limited by the timer wheel
> delays and the FQS delays.
> 
> > For a non-hypothetical case, could you please see if one of the
> > battery-powered embedded guys would be willing to test this?
> 
> My suspicion is the battery-powered folks are already running RCU_LAZY to
> reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> will be going to the bypass. Last I checked, Android and ChromeOS were both
> enabling RCU_LAZY everywhere (back when I was at Google).
> 
> Uladzislau works on embedded (or at least till recently) and had recently
> checked this area for improvements so I think he can help quantify too
> perhaps. He is on CC. I personally don't directly work on embedded at the
> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> time to test on your Android devices?
> 
I will check the patch on my home-based systems, which are also big machines :)
I no longer work in the mobile area, thus I do not have access to our
mobile devices. In fact I am glad that I have switched to something new.
I was a bit tired of the restrictions Google applies when it comes to
changes to the kernel and other Android layers.

--
Uladzislau Rezki
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 1 week ago
On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> > On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> > > On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > > > [ . . . ]
> > > 
> > > If anything, these numbers look better, so good show!!!
> > 
> > Thanks, I ended up collecting more samples in the v2 to further confirm the
> > improvements.
> > 
> > > Are there workloads that might be hurt by some side effect such
> > > as increased CPU utilization by the RCU grace-period kthread?  One
> > > non-mainstream hypothetical situation that comes to mind is a kernel
> > > built with SMP=y but running on a single-CPU system with a high-frequence
> > > periodic interrupt that does call_rcu().  Might that result in the RCU
> > > grace-period kthread chewing up the entire CPU?
> > 
> > There are still GP delays due to FQS, even with this change, so it could not
> > chew up the entire CPU I believe. The GP cycle should still insert delays
> > into the GP kthread. I did not notice in my testing that synchronize_rcu()
> > latency dropping to sub millisecond, it was still limited by the timer wheel
> > delays and the FQS delays.
> > 
> > > For a non-hypothetical case, could you please see if one of the
> > > battery-powered embedded guys would be willing to test this?
> > 
> > My suspicion is the battery-powered folks are already running RCU_LAZY to
> > reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> > will be going to the bypass. Last I checked, Android and ChromeOS were both
> > enabling RCU_LAZY everywhere (back when I was at Google).
> > 
> > Uladzislau works on embedded (or at least till recently) and had recently
> > checked this area for improvements so I think he can help quantify too
> > perhaps. He is on CC. I personally don't directly work on embedded at the
> > moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> > time to test on your Android devices?
> > 
> I will check the patch on my home based systems, big machines also :)
> I do not work with mobile area any more thus do not have access to our
> mobile devices. In fact i am glad that i have switched to something new.
> I was a bit tired by the applied Google restrictions when it comes to
> changes to the kernel and other Android layers.

How quickly I forget!  ;-)

Any thoughts on who would be a good person to ask about testing Joel's
patch on mobile platforms?

							Thanx, Paul
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Uladzislau Rezki 1 month, 1 week ago
On Sun, Dec 28, 2025 at 04:04:49PM -0800, Paul E. McKenney wrote:
> On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> > On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> > > On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > > > > [ . . . ]
> > > > 
> > > > If anything, these numbers look better, so good show!!!
> > > 
> > > Thanks, I ended up collecting more samples in the v2 to further confirm the
> > > improvements.
> > > 
> > > > Are there workloads that might be hurt by some side effect such
> > > > as increased CPU utilization by the RCU grace-period kthread?  One
> > > > non-mainstream hypothetical situation that comes to mind is a kernel
> > > > built with SMP=y but running on a single-CPU system with a high-frequence
> > > > periodic interrupt that does call_rcu().  Might that result in the RCU
> > > > grace-period kthread chewing up the entire CPU?
> > > 
> > > There are still GP delays due to FQS, even with this change, so it could not
> > > chew up the entire CPU I believe. The GP cycle should still insert delays
> > > into the GP kthread. I did not notice in my testing that synchronize_rcu()
> > > latency dropping to sub millisecond, it was still limited by the timer wheel
> > > delays and the FQS delays.
> > > 
> > > > For a non-hypothetical case, could you please see if one of the
> > > > battery-powered embedded guys would be willing to test this?
> > > 
> > > My suspicion is the battery-powered folks are already running RCU_LAZY to
> > > reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> > > will be going to the bypass. Last I checked, Android and ChromeOS were both
> > > enabling RCU_LAZY everywhere (back when I was at Google).
> > > 
> > > Uladzislau works on embedded (or at least till recently) and had recently
> > > checked this area for improvements so I think he can help quantify too
> > > perhaps. He is on CC. I personally don't directly work on embedded at the
> > > moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> > > time to test on your Android devices?
> > > 
> > I will check the patch on my home based systems, big machines also :)
> > I do not work with mobile area any more thus do not have access to our
> > mobile devices. In fact i am glad that i have switched to something new.
> > I was a bit tired by the applied Google restrictions when it comes to
> > changes to the kernel and other Android layers.
> 
> How quickly I forget!  ;-)
> 
> Any thoughts on who would be a good person to ask about testing Joel's
> patch on mobile platforms?
> 
As Joel already wrote, Suren probably is a good person to ask :)

--
Uladzislau Rezki
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 1 week ago

> On Dec 28, 2025, at 7:04 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
>>> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
>>> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
>>>> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
>>>>> [ . . . ]
>>>> 
>>>> If anything, these numbers look better, so good show!!!
>>> 
>>> Thanks, I ended up collecting more samples in the v2 to further confirm the
>>> improvements.
>>> 
>>>> Are there workloads that might be hurt by some side effect such
>>>> as increased CPU utilization by the RCU grace-period kthread?  One
>>>> non-mainstream hypothetical situation that comes to mind is a kernel
>>>> built with SMP=y but running on a single-CPU system with a high-frequence
>>>> periodic interrupt that does call_rcu().  Might that result in the RCU
>>>> grace-period kthread chewing up the entire CPU?
>>> 
>>> There are still GP delays due to FQS, even with this change, so it could not
>>> chew up the entire CPU I believe. The GP cycle should still insert delays
>>> into the GP kthread. I did not notice in my testing that synchronize_rcu()
>>> latency dropping to sub millisecond, it was still limited by the timer wheel
>>> delays and the FQS delays.
>>> 
>>>> For a non-hypothetical case, could you please see if one of the
>>>> battery-powered embedded guys would be willing to test this?
>>> 
>>> My suspicion is the battery-powered folks are already running RCU_LAZY to
>>> reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
>>> will be going to the bypass. Last I checked, Android and ChromeOS were both
>>> enabling RCU_LAZY everywhere (back when I was at Google).
>>> 
>>> Uladzislau works on embedded (or at least till recently) and had recently
>>> checked this area for improvements so I think he can help quantify too
>>> perhaps. He is on CC. I personally don't directly work on embedded at the
>>> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
>>> time to test on your Android devices?
>>> 
>> I will check the patch on my home based systems, big machines also :)
>> I do not work with mobile area any more thus do not have access to our
>> mobile devices. In fact i am glad that i have switched to something new.
>> I was a bit tired by the applied Google restrictions when it comes to
>> changes to the kernel and other Android layers.
> 
> How quickly I forget!  ;-)
> 
> Any thoughts on who would be a good person to ask about testing Joel's
> patch on mobile platforms?

Maybe Suren? As precedent, and FWIW, when the rcu_normal_wake_from_gp optimization happened, it only improved things for Android.

Also, Android already uses RCU_LAZY, so this should not affect power for non-hurry usages.

Also, networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this change.

thanks,

- Joel

Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Uladzislau Rezki 1 month, 1 week ago
On Sun, Dec 28, 2025 at 09:49:45PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 28, 2025, at 7:04 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> >>> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> >>> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> >>>> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> >>>>> [ . . . ]
> >>>> 
> >>>> If anything, these numbers look better, so good show!!!
> >>> 
> >>> Thanks, I ended up collecting more samples in the v2 to further confirm the
> >>> improvements.
> >>> 
> >>>> Are there workloads that might be hurt by some side effect such
> >>>> as increased CPU utilization by the RCU grace-period kthread?  One
> >>>> non-mainstream hypothetical situation that comes to mind is a kernel
> >>>> built with SMP=y but running on a single-CPU system with a high-frequence
> >>>> periodic interrupt that does call_rcu().  Might that result in the RCU
> >>>> grace-period kthread chewing up the entire CPU?
> >>> 
> >>> There are still GP delays due to FQS, even with this change, so it could not
> >>> chew up the entire CPU I believe. The GP cycle should still insert delays
> >>> into the GP kthread. I did not notice in my testing that synchronize_rcu()
> >>> latency dropping to sub millisecond, it was still limited by the timer wheel
> >>> delays and the FQS delays.
> >>> 
> >>>> For a non-hypothetical case, could you please see if one of the
> >>>> battery-powered embedded guys would be willing to test this?
> >>> 
> >>> My suspicion is the battery-powered folks are already running RCU_LAZY to
> >>> reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> >>> will be going to the bypass. Last I checked, Android and ChromeOS were both
> >>> enabling RCU_LAZY everywhere (back when I was at Google).
> >>> 
> >>> Uladzislau works on embedded (or at least till recently) and had recently
> >>> checked this area for improvements so I think he can help quantify too
> >>> perhaps. He is on CC. I personally don't directly work on embedded at the
> >>> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> >>> time to test on your Android devices?
> >>> 
> >> I will check the patch on my home based systems, big machines also :)
> >> I do not work with mobile area any more thus do not have access to our
> >> mobile devices. In fact i am glad that i have switched to something new.
> >> I was a bit tired by the applied Google restrictions when it comes to
> >> changes to the kernel and other Android layers.
> > 
> > How quickly I forget!  ;-)
> > 
> > Any thoughts on who would be a good person to ask about testing Joel's
> > patch on mobile platforms?
> 
> Maybe Suren? As precedent and fwiw, When rcu_normal_wake_from_gp optimization happened, it only improved things for Android.
> 
> Also Android already uses RCU_LAZY so this should not affect power for non-hurry usages.
> 
> Also networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this.
> 
Here we go with some results. I tested the bridge setup/teardown test case (100 loops):

<snip>
urezki@pc638:~$ cat bridge.sh
#!/bin/sh

BRIDGE="virbr0"
NETWORK="192.0.0.1"

# setup bridge
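# (bridge deletion at the end is the main step that depends on
#  synchronize_rcu() latency, per the discussion above, so that is
#  where the latency shows up)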
sudo brctl addbr ${BRIDGE}
sudo ifconfig ${BRIDGE} ${NETWORK} up
sudo ifconfig ${BRIDGE} ${NETWORK} down

sudo brctl delbr ${BRIDGE}
urezki@pc638:~$
<snip>

1)
# /tmp/default.txt
urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
real    0m24.221s
user    0m1.875s
sys     0m2.013s
urezki@pc638:~$

2)
# echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
# /tmp/enable_joel_patch.txt
urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
real    0m20.754s
user    0m1.950s
sys     0m1.888s
urezki@pc638:~$

3)
# echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
# /tmp/enable_joel_patch_enable_rcu_normal_wake_from_gp.txt
urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
real    0m15.895s
user    0m2.023s
sys     0m1.935s
urezki@pc638:~$

4)
# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
# /tmp/enable_rcu_normal_wake_from_gp.txt
urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
real    0m18.947s
user    0m2.145s
sys     0m1.735s
urezki@pc638:~$ 

x86_64 / 64 CPUs (in usec):
          1         2         3       4
median: 37249.5   31540.5   15765   22480
min:    7881      7918      9803    7857
max:    63651     55639     31861   32040

1 - default;
2 - Joel patch
3 - Joel patch + enable_rcu_normal_wake_from_gp
4 - enable_rcu_normal_wake_from_gp

Joel patch + enable_rcu_normal_wake_from_gp is a winner.
The time to complete the test dropped from 24 seconds to 15 seconds.

--
Uladzislau Rezki
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 1 week ago
On Mon, Dec 29, 2025 at 02:28:43PM +0100, Uladzislau Rezki wrote:
> On Sun, Dec 28, 2025 at 09:49:45PM -0500, Joel Fernandes wrote:
> > 
> > 
> > > On Dec 28, 2025, at 7:04 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > 
> > > On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> > >>> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> > >>> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> > >>>> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > >>>>> [ . . . ]
> > >>>> 
> > >>>> If anything, these numbers look better, so good show!!!
> > >>> 
> > >>> Thanks, I ended up collecting more samples in the v2 to further confirm the
> > >>> improvements.
> > >>> 
> > >>>> Are there workloads that might be hurt by some side effect such
> > >>>> as increased CPU utilization by the RCU grace-period kthread?  One
> > >>>> non-mainstream hypothetical situation that comes to mind is a kernel
> > >>>> built with SMP=y but running on a single-CPU system with a high-frequence
> > >>>> periodic interrupt that does call_rcu().  Might that result in the RCU
> > >>>> grace-period kthread chewing up the entire CPU?
> > >>> 
> > >>> There are still GP delays due to FQS, even with this change, so it could not
> > >>> chew up the entire CPU I believe. The GP cycle should still insert delays
> > >>> into the GP kthread. I did not notice in my testing that synchronize_rcu()
> > >>> latency dropping to sub millisecond, it was still limited by the timer wheel
> > >>> delays and the FQS delays.
> > >>> 
> > >>>> For a non-hypothetical case, could you please see if one of the
> > >>>> battery-powered embedded guys would be willing to test this?
> > >>> 
> > >>> My suspicion is the battery-powered folks are already running RCU_LAZY to
> > >>> reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> > >>> will be going to the bypass. Last I checked, Android and ChromeOS were both
> > >>> enabling RCU_LAZY everywhere (back when I was at Google).
> > >>> 
> > >>> Uladzislau works on embedded (or at least till recently) and had recently
> > >>> checked this area for improvements so I think he can help quantify too
> > >>> perhaps. He is on CC. I personally don't directly work on embedded at the
> > >>> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> > >>> time to test on your Android devices?
> > >>> 
> > >> I will check the patch on my home based systems, big machines also :)
> > >> I do not work with mobile area any more thus do not have access to our
> > >> mobile devices. In fact i am glad that i have switched to something new.
> > >> I was a bit tired by the applied Google restrictions when it comes to
> > >> changes to the kernel and other Android layers.
> > > 
> > > How quickly I forget!  ;-)
> > > 
> > > Any thoughts on who would be a good person to ask about testing Joel's
> > > patch on mobile platforms?
> > 
> > Maybe Suren? As precedent and fwiw, When rcu_normal_wake_from_gp optimization happened, it only improved things for Android.
> > 
> > Also Android already uses RCU_LAZY so this should not affect power for non-hurry usages.
> > 
> > Also networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this.
> > 
> Here we go with some results. I tested bridge setup test case(100 loops):
> 
> <snip>
> urezki@pc638:~$ cat bridge.sh
> #!/bin/sh
> 
> BRIDGE="virbr0"
> NETWORK="192.0.0.1"
> 
> # setup bridge
> sudo brctl addbr ${BRIDGE}
> sudo ifconfig ${BRIDGE} ${NETWORK} up
> sudo ifconfig ${BRIDGE} ${NETWORK} down
> 
> sudo brctl delbr ${BRIDGE}
> urezki@pc638:~$
> <snip>
> 
> 1)
> # /tmp/default.txt
> urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> real    0m24.221s
> user    0m1.875s
> sys     0m2.013s
> urezki@pc638:~$
> 
> 2)
> # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> # /tmp/enable_joel_patch.txt
> urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> real    0m20.754s
> user    0m1.950s
> sys     0m1.888s
> urezki@pc638:~$
> 
> 3)
> # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> # /tmp/enable_joel_patch_enable_rcu_normal_wake_from_gp.txt
> urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> real    0m15.895s
> user    0m2.023s
> sys     0m1.935s
> urezki@pc638:~$
> 
> 4)
> # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> # /tmp/enable_rcu_normal_wake_from_gp.txt
> urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> real    0m18.947s
> user    0m2.145s
> sys     0m1.735s
> urezki@pc638:~$ 
> 
> x86_64/64CPUs(in usec)
>           1         2         3       4
> median: 37249.5   31540.5   15765   22480
> min:    7881      7918      9803    7857
> max:    63651     55639     31861   32040
> 
> 1 - default;
> 2 - Joel patch
> 3 - Joel patch + enable_rcu_normal_wake_from_gp
> 4 - enable_rcu_normal_wake_from_gp
> 
> Joel patch + enable_rcu_normal_wake_from_gp is a winner.
> Time dropped from 24 seconds to 15 seconds to complete the test.

There was also an increase in system time from 1.735s to 1.935s with
Joel's patch, correct?  Or is that in the noise?

							Thanx, Paul
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Uladzislau Rezki 1 month, 1 week ago
On Mon, Dec 29, 2025 at 07:53:59AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 29, 2025 at 02:28:43PM +0100, Uladzislau Rezki wrote:
> > On Sun, Dec 28, 2025 at 09:49:45PM -0500, Joel Fernandes wrote:
> > > 
> > > 
> > > > On Dec 28, 2025, at 7:04 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > 
> > > > On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> > > >>> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> > > >>> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E. McKenney wrote:
> > > >>>> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > > >>>>> [ . . . ]
> > > >>>> 
> > > >>>> If anything, these numbers look better, so good show!!!
> > > >>> 
> > > >>> Thanks, I ended up collecting more samples in the v2 to further confirm the
> > > >>> improvements.
> > > >>> 
> > > >>>> Are there workloads that might be hurt by some side effect such
> > > >>>> as increased CPU utilization by the RCU grace-period kthread?  One
> > > >>>> non-mainstream hypothetical situation that comes to mind is a kernel
> > > >>>> built with SMP=y but running on a single-CPU system with a high-frequence
> > > >>>> periodic interrupt that does call_rcu().  Might that result in the RCU
> > > >>>> grace-period kthread chewing up the entire CPU?
> > > >>> 
> > > >>> There are still GP delays due to FQS, even with this change, so it could not
> > > >>> chew up the entire CPU I believe. The GP cycle should still insert delays
> > > >>> into the GP kthread. I did not notice in my testing that synchronize_rcu()
> > > >>> latency dropping to sub millisecond, it was still limited by the timer wheel
> > > >>> delays and the FQS delays.
> > > >>> 
> > > >>>> For a non-hypothetical case, could you please see if one of the
> > > >>>> battery-powered embedded guys would be willing to test this?
> > > >>> 
> > > >>> My suspicion is the battery-powered folks are already running RCU_LAZY to
> > > >>> reduce RCU activity, so they wouldn't be effected. call_rcu() during idleness
> > > >>> will be going to the bypass. Last I checked, Android and ChromeOS were both
> > > >>> enabling RCU_LAZY everywhere (back when I was at Google).
> > > >>> 
> > > >>> Uladzislau works on embedded (or at least till recently) and had recently
> > > >>> checked this area for improvements so I think he can help quantify too
> > > >>> perhaps. He is on CC. I personally don't directly work on embedded at the
> > > >>> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> > > >>> time to test on your Android devices?
> > > >>> 
> > > >> I will check the patch on my home based systems, big machines also :)
> > > >> I do not work with mobile area any more thus do not have access to our
> > > >> mobile devices. In fact i am glad that i have switched to something new.
> > > >> I was a bit tired by the applied Google restrictions when it comes to
> > > >> changes to the kernel and other Android layers.
> > > > 
> > > > How quickly I forget!  ;-)
> > > > 
> > > > Any thoughts on who would be a good person to ask about testing Joel's
> > > > patch on mobile platforms?
> > > 
> > > Maybe Suren? As precedent, and FWIW, when the rcu_normal_wake_from_gp optimization happened, it only improved things for Android.
> > > 
> > > Also Android already uses RCU_LAZY so this should not affect power for non-hurry usages.
> > > 
> > > Also networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this.
> > > 
> > Here we go with some results. I tested the bridge setup test case (100 loops):
> > 
> > <snip>
> > urezki@pc638:~$ cat bridge.sh
> > #!/bin/sh
> > 
> > BRIDGE="virbr0"
> > NETWORK="192.0.0.1"
> > 
> > # setup bridge
> > sudo brctl addbr ${BRIDGE}
> > sudo ifconfig ${BRIDGE} ${NETWORK} up
> > sudo ifconfig ${BRIDGE} ${NETWORK} down
> > 
> > sudo brctl delbr ${BRIDGE}
> > urezki@pc638:~$
> > <snip>
> > 
> > 1)
> > # /tmp/default.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real    0m24.221s
> > user    0m1.875s
> > sys     0m2.013s
> > urezki@pc638:~$
> > 
> > 2)
> > # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> > # /tmp/enable_joel_patch.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real    0m20.754s
> > user    0m1.950s
> > sys     0m1.888s
> > urezki@pc638:~$
> > 
> > 3)
> > # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> > # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > # /tmp/enable_joel_patch_enable_rcu_normal_wake_from_gp.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real    0m15.895s
> > user    0m2.023s
> > sys     0m1.935s
> > urezki@pc638:~$
> > 
> > 4)
> > # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > # /tmp/enable_rcu_normal_wake_from_gp.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real    0m18.947s
> > user    0m2.145s
> > sys     0m1.735s
> > urezki@pc638:~$ 
> > 
> > x86_64 / 64 CPUs (in usec)
> >           1         2         3       4
> > median: 37249.5   31540.5   15765   22480
> > min:    7881      7918      9803    7857
> > max:    63651     55639     31861   32040
> > 
> > 1 - default
> > 2 - Joel patch
> > 3 - Joel patch + enable_rcu_normal_wake_from_gp
> > 4 - enable_rcu_normal_wake_from_gp
> > 
> > Joel patch + enable_rcu_normal_wake_from_gp is a winner.
> > Time dropped from 24 seconds to 15 seconds to complete the test.
> 
> There was also an increase in system time from 1.735s to 1.935s with
> Joel's patch, correct?  Or is that in the noise?
> 

See below 5 runs, with just the "sys" time posted:

#default
sys     0m1.936s
sys     0m1.894s
sys     0m1.937s
sys     0m1.698s
sys     0m1.740s

# Joel patch
sys     0m1.753s
sys     0m1.667s
sys     0m1.861s
sys     0m1.930s
sys     0m1.896s

I do not see an increase; IMO it is noise.

--
Uladzislau Rezki
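
(For reference: the per-configuration median/min/max table above implies that each bridge.sh iteration was timed individually, not only the whole 100-iteration loop. The exact harness is not shown in the thread; the sketch below is merely one way such per-iteration numbers could be collected, assuming GNU date for nanosecond timestamps and awk for the statistics. The script and output file names are placeholders.)

<snip>
#!/bin/sh
# Illustrative harness (not the one used above): time each bridge.sh run in
# microseconds, then print min/median/max over all iterations.

N=100
OUT=/tmp/bridge_lat_usec.txt
: > "$OUT"

i=1
while [ "$i" -le "$N" ]; do
        t0=$(date +%s%N)                        # nanoseconds since the epoch (GNU date)
        ./bridge.sh
        t1=$(date +%s%N)
        echo $(( (t1 - t0) / 1000 )) >> "$OUT"  # per-iteration latency in usec
        i=$((i + 1))
done

sort -n "$OUT" | awk '
        { v[NR] = $1 }
        END {
                med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
                printf "min: %s  median: %s  max: %s\n", v[1], med, v[NR]
        }'
<snip>

The same awk idea, applied to the five "sys" samples in each column above, gives a mean and spread that support the "it is noise" reading.
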
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 1 week ago
On Mon, Dec 29, 2025 at 05:25:24PM +0100, Uladzislau Rezki wrote:
> On Mon, Dec 29, 2025 at 07:53:59AM -0800, Paul E. McKenney wrote:
> > There was also an increase in system time from 1.735s to 1.935s with
> > Joel's patch, correct?  Or is that in the noise?
> > 
> 
> See below 5 runs, with just the "sys" time posted:
> 
> #default
> sys     0m1.936s
> sys     0m1.894s
> sys     0m1.937s
> sys     0m1.698s
> sys     0m1.740s
> 
> # Joel patch
> sys     0m1.753s
> sys     0m1.667s
> sys     0m1.861s
> sys     0m1.930s
> sys     0m1.896s
> 
> I do not see an increase; IMO it is noise.

Even better, thank you!

							Thanx, Paul
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 1 week ago

> On Dec 29, 2025, at 12:02 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Mon, Dec 29, 2025 at 05:25:24PM +0100, Uladzislau Rezki wrote:
>>> There was also an increase in system time from 1.735s to 1.935s with
>>> Joel's patch, correct?  Or is that in the noise?
>>> 
>> 
>> See below 5 runs, with just the "sys" time posted:
>> 
>> #default
>> sys     0m1.936s
>> sys     0m1.894s
>> sys     0m1.937s
>> sys     0m1.698s
>> sys     0m1.740s
>> 
>> # Joel patch
>> sys     0m1.753s
>> sys     0m1.667s
>> sys     0m1.861s
>> sys     0m1.930s
>> sys     0m1.896s
>> 
>> I do not see an increase; IMO it is noise.
> 
> Even better, thank you!

Thanks a lot, Vlad and Paul. I will include these numbers in the respin as well (with a Tested-by from Vlad).

- Joel

Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 1 week ago
On Sun, Dec 28, 2025 at 09:49:45PM -0500, Joel Fernandes wrote:
> > Any thoughts on who would be a good person to ask about testing Joel's
> > patch on mobile platforms?
> 
> Maybe Suren? As precedent, and FWIW, when the rcu_normal_wake_from_gp optimization happened, it only improved things for Android.
> 
> Also Android already uses RCU_LAZY so this should not affect power for non-hurry usages.
> 
> Also networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this.

Could you please try running on a single-CPU system or VM to check the
CPU overhead from RCU's grace-period kthread?

							Thanx, Paul
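
(One way to stand up the requested environment, purely as an illustration: boot the same SMP=y kernel with a single online CPU, either on hardware via the nr_cpus=1 (or maxcpus=1) kernel command-line parameter, or in a one-vCPU VM. The image paths below are placeholders.)

<snip>
#!/bin/sh
# Single-vCPU VM for checking the GP kthread's CPU overhead; bzImage and
# rootfs.img stand in for whatever kernel build and root image are at hand.
qemu-system-x86_64 -smp 1 -m 2G -nographic \
        -kernel arch/x86/boot/bzImage \
        -append "console=ttyS0 root=/dev/vda nr_cpus=1" \
        -drive file=rootfs.img,format=raw,if=virtio
<snip>
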
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 1 week ago
On Sun, Dec 28, 2025 at 08:37:14PM -0800, Paul E. McKenney wrote:
> Could you please try running on a single-CPU system or VM to check the
> CPU overhead from RCU's grace-period kthread?

Hi, Paul,

I ran some tests with a single CPU and used perf to measure the overhead of
the GP kthread (rcu_preempt).

Actually, the GP kthread's CPU usage goes down; I believe this is because it
sleeps more.

I see similar (or the same) results with a synchronize_rcu() loop (200
iterations). I also tested call_rcu() stressing from timer interrupts and
call_rcu_hurry() flooding.

                          Baseline    With Patch    Change
  task-clock:             1008 ms     898 ms        -11%
  CPU cycles:             48M         44M           -8%

Should I add these results to the changelog and send out a v3 (preferably
with your review tag, if you approve)?

thanks,

 - Joel

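(Joel's exact perf invocation is not shown; the sketch below is one plausible way to get task-clock and cycles numbers for the GP kthread. "rcu_preempt" is the GP kthread's name on CONFIG_PREEMPT_RCU=y kernels, "rcu_sched" otherwise; the 60-second window is arbitrary and would run concurrently with the synchronize_rcu()/call_rcu() stress.)

<snip>
#!/bin/sh
# Attach perf counters to the RCU grace-period kthread for a fixed window
# while the benchmark runs elsewhere, then compare totals across kernels.
GP_PID=$(pgrep -x rcu_preempt || pgrep -x rcu_sched)

perf stat -e task-clock,cycles -p "$GP_PID" -- sleep 60
<snip>
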
Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 1 week ago
On Mon, Dec 29, 2025 at 11:28:40AM -0500, Joel Fernandes wrote:
> On Sun, Dec 28, 2025 at 08:37:14PM -0800, Paul E. McKenney wrote:
> > Could you please try running on a single-CPU system or VM to check the
> > CPU overhead from RCU's grace-period kthread?
> 
> Hi, Paul,
> 
> I ran some tests with a single CPU and used perf to measure the overhead of
> the GP kthread (rcu_preempt).
> 
> Actually, the GP kthread's CPU usage goes down; I believe this is because it
> sleeps more.
> 
> I see similar (or the same) results with a synchronize_rcu() loop (200
> iterations). I also tested call_rcu() stressing from timer interrupts and
> call_rcu_hurry() flooding.
> 
>                           Baseline    With Patch    Change
>   task-clock:             1008 ms     898 ms        -11%
>   CPU cycles:             48M         44M           -8%
> 
> Should I add these results to the changelog and send out a v3 (preferably
> with your review tag, if you approve)?

Very good, and sounds good!

							Thanx, Paul