[PATCH RFC] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early

Joel Fernandes posted 1 patch 1 month, 2 weeks ago
There is a newer version of this series
[PATCH RFC] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 2 weeks ago
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and
the second FQS compares them. This results in long and unnecessary latency
for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
1000HZ) even when one FQS wait would have sufficed.

Investigation showed that the GP kthread's CPU is often the holdout CPU
after the first FQS: it cannot be detected as "idle" because it is
actively running the FQS scan in the GP kthread.

Therefore, at the start of the first FQS, immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
the FQS scan, so this is safe and results in significant tail latency
improvements.
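
To illustrate the reasoning above, here is a toy user-space model (not kernel
code) that counts how many FQS waits pass before every CPU has reported a
quiescent state. It assumes an otherwise idle system, assumes each FQS scan is
preceded by exactly one wait, and assumes the GP kthread's CPU looks busy only
during the first scan. The names toy_fqs_waits() and TOY_MAX_CPUS are
hypothetical and exist only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CPUS 64	/* arbitrary bound for this toy model */

/*
 * Toy model, NOT kernel code. Returns the number of FQS waits needed
 * before all CPUs have reported a quiescent state, or -1 on bad input.
 * gp_cpu is the CPU assumed to run the GP kthread; early_report models
 * the patch by reporting that CPU's QS before the first scan.
 */
int toy_fqs_waits(size_t ncpus, size_t gp_cpu, bool early_report)
{
	bool qs_reported[TOY_MAX_CPUS] = { false };
	int waits = 0;

	if (ncpus == 0 || ncpus > TOY_MAX_CPUS || gp_cpu >= ncpus)
		return -1;

	for (bool first_time = true; ; first_time = false) {
		size_t holdouts = 0;

		waits++;	/* Assumption: one FQS wait precedes each scan. */

		/* Models the patch: report the scanning CPU's QS up front. */
		if (first_time && early_report)
			qs_reported[gp_cpu] = true;

		for (size_t cpu = 0; cpu < ncpus; cpu++) {
			if (qs_reported[cpu])
				continue;
			/*
			 * During the first scan the GP kthread's CPU is busy
			 * running the scan, so it does not look idle. By the
			 * next scan it is assumed to have idled during the
			 * wait, so it no longer holds out.
			 */
			if (first_time && cpu == gp_cpu)
				holdouts++;
			else
				qs_reported[cpu] = true;
		}

		if (holdouts == 0)
			return waits;
	}
}
```

Under these assumptions, toy_fqs_waits(4, 0, false) yields two waits and
toy_fqs_waits(4, 0, true) yields one, which is the saving the patch is after.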

I benchmarked 6 runs of 100 synchronize_rcu() calls each. The per-call
latencies show good tail latency improvements (default settings for fqs
jiffies):

Baseline (without fix):
| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
| 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
| 3   | 4.033 ms | 1.160 ms | 10.083 ms |
| 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
| 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
| 6   | 4.019 ms | 3.894 ms | 5.845 ms  |

With fix:
| Run | Mean     | Min      | Max      |
|-----|----------|----------|----------|
| 1   | 3.991 ms | 2.953 ms | 4.125 ms |
| 2   | 3.995 ms | 3.439 ms | 4.081 ms |
| 3   | 3.989 ms | 2.974 ms | 4.079 ms |
| 4   | 3.997 ms | 3.667 ms | 4.072 ms |
| 5   | 4.027 ms | 2.550 ms | 7.928 ms |
| 6   | 3.989 ms | 2.886 ms | 4.076 ms |

The fix reduces worst-case latency because the second FQS wait is skipped
when it is not needed.
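
For reference, the Mean/Min/Max columns in the tables above are ordinary
summary statistics over the per-call latencies of one run. A minimal
user-space sketch of such a reduction follows. The struct and function names
are hypothetical, and this is not the benchmark code that produced the
numbers above.

```c
#include <stddef.h>

/* Hypothetical summary record for one benchmark run. */
struct latency_summary {
	double mean_ms;
	double min_ms;
	double max_ms;
};

/*
 * Reduce n latency samples (in milliseconds) to mean/min/max.
 * Returns 0 on success, -1 if the input is empty or a pointer is NULL.
 */
int summarize_latencies(const double *samples_ms, size_t n,
			struct latency_summary *out)
{
	double sum, min, max;

	if (!samples_ms || !out || n == 0)
		return -1;

	sum = min = max = samples_ms[0];
	for (size_t i = 1; i < n; i++) {
		sum += samples_ms[i];
		if (samples_ms[i] < min)
			min = samples_ms[i];
		if (samples_ms[i] > max)
			max = samples_ms[i];
	}

	out->mean_ms = sum / (double)n;
	out->min_ms = min;
	out->max_ms = max;
	return 0;
}
```

The Max column is the one to watch here, because a single call that needs a
second FQS wait barely moves the mean of 100 samples but dominates the
maximum.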

Tested rcutorture TREE and SRCU configurations.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8293bae1dec1..c116ed7633d3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,6 +160,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
 			      unsigned long gps, unsigned long flags);
 static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
+static void rcu_report_qs_rdp(struct rcu_data *rdp);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -2032,6 +2033,17 @@ static void rcu_gp_fqs(bool first_time)
 	}
 
 	if (first_time) {
+		/*
+		 * Immediately report QS for the GP kthread's CPU. The GP kthread
+		 * cannot be in an RCU read-side critical section while running
+		 * the FQS scan. This eliminates the need for a second FQS wait
+		 * when all CPUs are idle.
+		 */
+		preempt_disable();
+		rcu_qs();
+		rcu_report_qs_rdp(this_cpu_ptr(&rcu_data));
+		preempt_enable();
+
 		/* Collect dyntick-idle snapshots. */
 		force_qs_rnp(rcu_watching_snap_save);
 	} else {
-- 
2.34.1
Re: [PATCH RFC] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Paul E. McKenney 1 month, 2 weeks ago
On Mon, Dec 22, 2025 at 07:30:39PM -0500, Joel Fernandes wrote:
> The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
> State) design where the first FQS saves dyntick-idle snapshots and
> the second FQS compares them. This results in long and unnecessary latency
> for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
> 1000HZ) even when one FQS wait would have sufficed.
> 
> Investigation showed that the GP kthread's CPU is often the holdout CPU
> after the first FQS: it cannot be detected as "idle" because it is
> actively running the FQS scan in the GP kthread.
> 
> Therefore, at the start of the first FQS, immediately report a quiescent
> state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
> GP kthread cannot be in an RCU read-side critical section while running
> the FQS scan, so this is safe and results in significant tail latency
> improvements.
> 
> I benchmarked 6 runs of 100 synchronize_rcu() calls each. The per-call
> latencies show good tail latency improvements (default settings for fqs
> jiffies):
> 
> Baseline (without fix):
> | Run | Mean     | Min      | Max       |
> |-----|----------|----------|-----------|
> | 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
> | 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
> | 3   | 4.033 ms | 1.160 ms | 10.083 ms |
> | 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
> | 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
> | 6   | 4.019 ms | 3.894 ms | 5.845 ms  |
> 
> With fix:
> | Run | Mean     | Min      | Max      |
> |-----|----------|----------|----------|
> | 1   | 3.991 ms | 2.953 ms | 4.125 ms |
> | 2   | 3.995 ms | 3.439 ms | 4.081 ms |
> | 3   | 3.989 ms | 2.974 ms | 4.079 ms |
> | 4   | 3.997 ms | 3.667 ms | 4.072 ms |
> | 5   | 4.027 ms | 2.550 ms | 7.928 ms |
> | 6   | 3.989 ms | 2.886 ms | 4.076 ms |
> 
> The fix reduces worst-case latency because the second FQS wait is skipped
> when it is not needed.
> 
> Tested rcutorture TREE and SRCU configurations.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Nice results!!!

But why not do this at the end of rcu_gp_init()?

							Thanx, Paul

Re: [PATCH RFC] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
Posted by Joel Fernandes 1 month, 2 weeks ago

> On Dec 22, 2025, at 8:21 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Mon, Dec 22, 2025 at 07:30:39PM -0500, Joel Fernandes wrote:
>> The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
>> State) design where the first FQS saves dyntick-idle snapshots and
>> the second FQS compares them. This results in long and unnecessary latency
>> for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
>> 1000HZ) even when one FQS wait would have sufficed.
>> 
>> Investigation showed that the GP kthread's CPU is often the holdout CPU
>> after the first FQS: it cannot be detected as "idle" because it is
>> actively running the FQS scan in the GP kthread.
>> 
>> Therefore, at the start of the first FQS, immediately report a quiescent
>> state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
>> GP kthread cannot be in an RCU read-side critical section while running
>> the FQS scan, so this is safe and results in significant tail latency
>> improvements.
>> 
>> I benchmarked 6 runs of 100 synchronize_rcu() calls each. The per-call
>> latencies show good tail latency improvements (default settings for fqs
>> jiffies):
>> 
>> Baseline (without fix):
>> | Run | Mean     | Min      | Max       |
>> |-----|----------|----------|-----------|
>> | 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
>> | 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
>> | 3   | 4.033 ms | 1.160 ms | 10.083 ms |
>> | 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
>> | 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
>> | 6   | 4.019 ms | 3.894 ms | 5.845 ms  |
>> 
>> With fix:
>> | Run | Mean     | Min      | Max      |
>> |-----|----------|----------|----------|
>> | 1   | 3.991 ms | 2.953 ms | 4.125 ms |
>> | 2   | 3.995 ms | 3.439 ms | 4.081 ms |
>> | 3   | 3.989 ms | 2.974 ms | 4.079 ms |
>> | 4   | 3.997 ms | 3.667 ms | 4.072 ms |
>> | 5   | 4.027 ms | 2.550 ms | 7.928 ms |
>> | 6   | 3.989 ms | 2.886 ms | 4.076 ms |
>> 
>> The fix reduces worst-case latency because the second FQS wait is skipped
>> when it is not needed.
>> 
>> Tested rcutorture TREE and SRCU configurations.
>> 
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> 
> Nice results!!!

Thanks!

> 
> But why not do this at the end of rcu_gp_init()?

Yes that is better, I will give that a try. Thanks,

 - Joel

