[PATCH 1/5] rcu/exp: Protect against early QS report

Frederic Weisbecker posted 5 patches 9 months, 1 week ago
[PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Frederic Weisbecker 9 months, 1 week ago
When a grace period is started, the ->expmask of each node is set up
from sync_exp_reset_tree(). Then, later on, each leaf node also
initializes its ->exp_tasks pointer.

This means that the initialization of a node's quiescent state and the
initialization of its blocking tasks happen in two separate steps, with
the node lock released in between.

It happens to be fine because nothing is expected to report an exp
quiescent state within this gap, since no IPIs have been issued yet and
every rdp's ->cpu_no_qs.b.exp should be false.

However if it were to happen by accident, the quiescent state could be
reported and propagated while ignoring tasks that blocked _before_ the
start of the grace period.

Prevent such trouble from happening in the future by initializing both
the quiescent-state mask to report and the blocked-tasks list head from
the same node-locked block.

If a task blocks within an RCU read side critical section before
sync_exp_reset_tree() is called and is then unblocked between
sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
won't be reported because no RCU exp IPI had been issued to request it
through the setting of srdp->cpu_no_qs.b.exp.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_exp.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index c36c7d5575ca..2fa7aa9155bd 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		WARN_ON_ONCE(rnp->expmask);
 		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
+		/*
+		 * Need to wait for any blocked tasks as well.	Note that
+		 * additional blocking tasks will also block the expedited GP
+		 * until such time as the ->expmask bits are cleared.
+		 */
+		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
+			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
 }
@@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 	}
 	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
 
-	/*
-	 * Need to wait for any blocked tasks as well.	Note that
-	 * additional blocking tasks will also block the expedited GP
-	 * until such time as the ->expmask bits are cleared.
-	 */
-	if (rcu_preempt_has_tasks(rnp))
-		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
 	/* IPI the remaining CPUs for expedited quiescent state. */
-- 
2.48.1
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Paul E. McKenney 9 months ago
On Fri, Mar 14, 2025 at 03:36:38PM +0100, Frederic Weisbecker wrote:
> When a grace period is started, the ->expmask of each node is set up
> from sync_exp_reset_tree(). Then, later on, each leaf node also
> initializes its ->exp_tasks pointer.
> 
> This means that the initialization of a node's quiescent state and the
> initialization of its blocking tasks happen in two separate steps, with
> the node lock released in between.
> 
> It happens to be fine because nothing is expected to report an exp
> quiescent state within this gap, since no IPIs have been issued yet and
> every rdp's ->cpu_no_qs.b.exp should be false.
> 
> However if it were to happen by accident, the quiescent state could be
> reported and propagated while ignoring tasks that blocked _before_ the
> start of the grace period.
> 
> Prevent such trouble from happening in the future by initializing both
> the quiescent-state mask to report and the blocked-tasks list head from
> the same node-locked block.
> 
> If a task blocks within an RCU read side critical section before
> sync_exp_reset_tree() is called and is then unblocked between
> sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
> won't be reported because no RCU exp IPI had been issued to request it
> through the setting of srdp->cpu_no_qs.b.exp.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

OK, and because ->expmaskinit has all bits set for all CPUs that have
ever been online, the ends of any corresponding readers will give up at
the beginning of the first pass of the loop in __rcu_report_exp_rnp().
This is because the ->expmask is guaranteed to be non-zero.  (Which is
kind of what you are saying in the last paragraph of your commit log,
just digging down another layer.)

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree_exp.h | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index c36c7d5575ca..2fa7aa9155bd 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		WARN_ON_ONCE(rnp->expmask);
>  		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> +		/*
> +		 * Need to wait for any blocked tasks as well.	Note that
> +		 * additional blocking tasks will also block the expedited GP
> +		 * until such time as the ->expmask bits are cleared.
> +		 */
> +		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> +			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	}
>  }
> @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
>  	}
>  	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
>  
> -	/*
> -	 * Need to wait for any blocked tasks as well.	Note that
> -	 * additional blocking tasks will also block the expedited GP
> -	 * until such time as the ->expmask bits are cleared.
> -	 */
> -	if (rcu_preempt_has_tasks(rnp))
> -		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  
>  	/* IPI the remaining CPUs for expedited quiescent state. */
> -- 
> 2.48.1
>
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Frederic Weisbecker 9 months ago
On Tue, Mar 18, 2025 at 10:17:16AM -0700, Paul E. McKenney wrote:
> On Fri, Mar 14, 2025 at 03:36:38PM +0100, Frederic Weisbecker wrote:
> > When a grace period is started, the ->expmask of each node is set up
> > from sync_exp_reset_tree(). Then, later on, each leaf node also
> > initializes its ->exp_tasks pointer.
> > 
> > This means that the initialization of a node's quiescent state and the
> > initialization of its blocking tasks happen in two separate steps, with
> > the node lock released in between.
> > 
> > It happens to be fine because nothing is expected to report an exp
> > quiescent state within this gap, since no IPIs have been issued yet and
> > every rdp's ->cpu_no_qs.b.exp should be false.
> > 
> > However if it were to happen by accident, the quiescent state could be
> > reported and propagated while ignoring tasks that blocked _before_ the
> > start of the grace period.
> > 
> > Prevent such trouble from happening in the future by initializing both
> > the quiescent-state mask to report and the blocked-tasks list head from
> > the same node-locked block.
> > 
> > If a task blocks within an RCU read side critical section before
> > sync_exp_reset_tree() is called and is then unblocked between
> > sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
> > won't be reported because no RCU exp IPI had been issued to request it
> > through the setting of srdp->cpu_no_qs.b.exp.
> > 
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> 
> OK, and because ->expmaskinit has all bits set for all CPUs that have
> ever been online, the ends of any corresponding readers will give up at
> the beginning of the first pass of the loop in __rcu_report_exp_rnp().
> This is because the ->expmask is guaranteed to be non-zero.  (Which is
> kind of what you are saying in the last paragraph of your commit log,
> just digging down another layer.)

Exactly!

Thanks.
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Joel Fernandes 9 months ago
Hi Frederic,

On Fri, Mar 14, 2025 at 03:36:38PM +0100, Frederic Weisbecker wrote:
> When a grace period is started, the ->expmask of each node is set up
> from sync_exp_reset_tree(). Then, later on, each leaf node also
> initializes its ->exp_tasks pointer.
> 
> This means that the initialization of a node's quiescent state and the
> initialization of its blocking tasks happen in two separate steps, with
> the node lock released in between.
> 
> It happens to be fine because nothing is expected to report an exp
> quiescent state within this gap, since no IPIs have been issued yet and
> every rdp's ->cpu_no_qs.b.exp should be false.
> 
> However if it were to happen by accident, the quiescent state could be
> reported and propagated while ignoring tasks that blocked _before_ the
> start of the grace period.
> 
> Prevent such trouble from happening in the future by initializing both
> the quiescent-state mask to report and the blocked-tasks list head from
> the same node-locked block.
> 
> If a task blocks within an RCU read side critical section before
> sync_exp_reset_tree() is called and is then unblocked between
> sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
> won't be reported because no RCU exp IPI had been issued to request it
> through the setting of srdp->cpu_no_qs.b.exp.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree_exp.h | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index c36c7d5575ca..2fa7aa9155bd 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		WARN_ON_ONCE(rnp->expmask);
>  		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> +		/*
> +		 * Need to wait for any blocked tasks as well.	Note that
> +		 * additional blocking tasks will also block the expedited GP
> +		 * until such time as the ->expmask bits are cleared.
> +		 */
> +		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> +			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	}
>  }
> @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
>  	}
>  	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
>  
> -	/*
> -	 * Need to wait for any blocked tasks as well.	Note that
> -	 * additional blocking tasks will also block the expedited GP
> -	 * until such time as the ->expmask bits are cleared.
> -	 */
> -	if (rcu_preempt_has_tasks(rnp))
> -		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

A small side effect of this patch could be:

In the existing code, if a pre-existing reader unblocked and completed
between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(),
then I think it wouldn't be responsible for blocking the GP anymore.

Whereas with this patch, it would not get a chance to be removed from the
blocked list because it would have to wait on the rnp lock, which after this
patch would now be held across the setting of expmask and exp_tasks?

But I think it is not a big deal either way, and if you feel it is more
future proof to do it this way, that sounds good to me.

thanks,

 - Joel


>  
>  	/* IPI the remaining CPUs for expedited quiescent state. */
> -- 
> 2.48.1
>
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Frederic Weisbecker 9 months ago
On Sat, Mar 15, 2025 at 07:59:45PM -0400, Joel Fernandes wrote:
> Hi Frederic,
> 
> On Fri, Mar 14, 2025 at 03:36:38PM +0100, Frederic Weisbecker wrote:
> > When a grace period is started, the ->expmask of each node is set up
> > from sync_exp_reset_tree(). Then, later on, each leaf node also
> > initializes its ->exp_tasks pointer.
> > 
> > This means that the initialization of a node's quiescent state and the
> > initialization of its blocking tasks happen in two separate steps, with
> > the node lock released in between.
> > 
> > It happens to be fine because nothing is expected to report an exp
> > quiescent state within this gap, since no IPIs have been issued yet and
> > every rdp's ->cpu_no_qs.b.exp should be false.
> > 
> > However if it were to happen by accident, the quiescent state could be
> > reported and propagated while ignoring tasks that blocked _before_ the
> > start of the grace period.
> > 
> > Prevent such trouble from happening in the future by initializing both
> > the quiescent-state mask to report and the blocked-tasks list head from
> > the same node-locked block.
> > 
> > If a task blocks within an RCU read side critical section before
> > sync_exp_reset_tree() is called and is then unblocked between
> > sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
> > won't be reported because no RCU exp IPI had been issued to request it
> > through the setting of srdp->cpu_no_qs.b.exp.
> > 
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> >  kernel/rcu/tree_exp.h | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index c36c7d5575ca..2fa7aa9155bd 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
> >  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> >  		WARN_ON_ONCE(rnp->expmask);
> >  		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> > +		/*
> > +		 * Need to wait for any blocked tasks as well.	Note that
> > +		 * additional blocking tasks will also block the expedited GP
> > +		 * until such time as the ->expmask bits are cleared.
> > +		 */
> > +		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> > +			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> >  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  	}
> >  }
> > @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> >  	}
> >  	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
> >  
> > -	/*
> > -	 * Need to wait for any blocked tasks as well.	Note that
> > -	 * additional blocking tasks will also block the expedited GP
> > -	 * until such time as the ->expmask bits are cleared.
> > -	 */
> > -	if (rcu_preempt_has_tasks(rnp))
> > -		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> >  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 
> A small side effect of this patch could be:
> 
> In the existing code, if a pre-existing reader unblocked and completed
> between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(),
> then I think it wouldn't be responsible for blocking the GP anymore.

Hmm, I don't see how that changes after this patch.

> 
> Whereas with this patch, it would not get a chance to be removed from the
> blocked list because it would have to wait on the rnp lock, which after this
> patch would now be held across the setting of expmask and exp_tasks?

So that's sync_exp_reset_tree(). I'm a bit confused. An unblocking task
contends on the rnp lock in any case. But after this patch, it is still
going to remove itself from the blocked-tasks list once the rnp lock is
released by sync_exp_reset_tree().

What am I missing?

Thanks.

> 
> But I think it is not a big deal either way, and if you feel it is more
> future proof to do it this way, that sounds good to me.
> 
> thanks,
> 
>  - Joel
> 
> 
> >  
> >  	/* IPI the remaining CPUs for expedited quiescent state. */
> > -- 
> > 2.48.1
> > 
> 
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Joel Fernandes 9 months ago

On 3/16/2025 7:07 AM, Frederic Weisbecker wrote:
>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>> index c36c7d5575ca..2fa7aa9155bd 100644
>>> --- a/kernel/rcu/tree_exp.h
>>> +++ b/kernel/rcu/tree_exp.h
>>> @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
>>>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>>>  		WARN_ON_ONCE(rnp->expmask);
>>>  		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
>>> +		/*
>>> +		 * Need to wait for any blocked tasks as well.	Note that
>>> +		 * additional blocking tasks will also block the expedited GP
>>> +		 * until such time as the ->expmask bits are cleared.
>>> +		 */
>>> +		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
>>> +			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>>>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>>  	}
>>>  }
>>> @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
>>>  	}
>>>  	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
>>>  
>>> -	/*
>>> -	 * Need to wait for any blocked tasks as well.	Note that
>>> -	 * additional blocking tasks will also block the expedited GP
>>> -	 * until such time as the ->expmask bits are cleared.
>>> -	 */
>>> -	if (rcu_preempt_has_tasks(rnp))
>>> -		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
>>>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> A small side effect of this patch could be:
>>
>> In the existing code, if a pre-existing reader unblocked and completed
>> between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(),
>> then I think it wouldn't be responsible for blocking the GP anymore.
> Hmm, I don't see how that changes after this patch.
> 
>> Whereas with this patch, it would not get a chance to be removed from the
>> blocked list because it would have to wait on the rnp lock, which after this
>> patch would now be held across the setting of expmask and exp_tasks?
> So that's sync_exp_reset_tree(). I'm a bit confused. An unblocking task
> contends on the rnp lock in any case. But after this patch, it is still
> going to remove itself from the blocked-tasks list once the rnp lock is
> released by sync_exp_reset_tree().
> 
> What am I missing?
You are probably not missing anything and I'm the one missing something.

But I was thinking:

In the original code, in __sync_rcu_exp_select_node_cpus(), if
rcu_preempt_has_tasks() returns FALSE because of the finer-grained
locking, then there is a chance for the GP to conclude sooner.

On the other hand, after the patch, because the unblocking task had to
wait (on the lock) to remove itself from the blocked-task list, the GP
may conclude later than usual. This is just an intuitive guess.

Because this is an expedited GP, my intuition is to unblock + reader
unlock and get out of the way ASAP rather than hoping that it will get
access to the lock before any IPIs go out or quiescent state
reports/checks happen, which are required to conclude the GP.

It's just a theory and you're right: if it acquires the lock soon
enough and gets out of the way, then it doesn't matter either way.

Thanks!

 - Joel
Re: [PATCH 1/5] rcu/exp: Protect against early QS report
Posted by Frederic Weisbecker 9 months ago
On Sun, Mar 16, 2025 at 10:23:45AM -0400, Joel Fernandes wrote:
> >> A small side effect of this patch could be:
> >>
> >> In the existing code, if a pre-existing reader unblocked and completed
> >> between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(),
> >> then I think it wouldn't be responsible for blocking the GP anymore.
> > Hmm, I don't see how that changes after this patch.
> > 
> >> Whereas with this patch, it would not get a chance to be removed from the
> >> blocked list because it would have to wait on the rnp lock, which after this
> >> patch would now be held across the setting of expmask and exp_tasks?
> > So that's sync_exp_reset_tree(). I'm a bit confused. An unblocking task
> > contends on the rnp lock in any case. But after this patch, it is still
> > going to remove itself from the blocked-tasks list once the rnp lock is
> > released by sync_exp_reset_tree().
> > 
> > What am I missing?
> You are probably not missing anything and I'm the one missing something.
> 
> But I was thinking:
> 
> In the original code, in __sync_rcu_exp_select_node_cpus(), if
> rcu_preempt_has_tasks() returns FALSE because of the finer-grained
> locking, then there is a chance for the GP to conclude sooner.

Why do you think it's finer grained locking?
 
> On the other hand, after the patch, because the unblocking task had to
> wait (on the lock) to remove itself from the blocked-task list, the GP
> may conclude later than usual. This is just an intuitive guess.
> 
> Because this is an expedited GP, my intuition is to unblock + reader
> unlock and get out of the way ASAP rather than hoping that it will get
> access to the lock before any IPIs go out or quiescent state
> reports/checks happen, which are required to conclude the GP.
> 
> It's just a theory and you're right: if it acquires the lock soon
> enough and gets out of the way, then it doesn't matter either way.

I think I understand where the confusion is. A task that is preempted within an
RCU read side section _always_ adds itself to the rnp's list of blocked tasks
(&rnp->blkd_tasks). The only thing that changes with expedited GPs is that
rnp->exp_tasks may or may not be updated on the way. But rnp->exp_tasks is only
a pointer to an arbitrary element within the rnp->blkd_tasks list.

This means that an unblocking task must always delete itself from
rnp->blkd_tasks, and possibly update rnp->exp_tasks along the way.

Both the add and the delete happen with rnp locked.

Therefore a task unblocking before __sync_rcu_exp_select_node_cpus()
can make __sync_rcu_exp_select_node_cpus() contend on the rnp lock.

But this patch doesn't change the behaviour in this regard.

Thanks.


> 
> Thanks!
> 
>  - Joel
> 
> 
>