[PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path

david.laight.linux@gmail.com posted 5 patches 1 month ago
Posted by david.laight.linux@gmail.com 1 month ago
From: David Laight <david.laight.linux@gmail.com>

When osq_lock() returns false, or when osq_unlock() returns, static
analysis shows that node->next should always be NULL.
This means that it isn't necessary to explicitly set it to NULL
prior to the atomic_xchg(&lock->tail, curr) on entry to osq_lock().

Defer determining the address of the CPU's 'node' until after the
atomic_xchg() so that it isn't done in the uncontended path.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 kernel/locking/osq_lock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 0619691e2756..3f0cfdf1cd0f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -92,13 +92,10 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
-	struct optimistic_spin_node *prev, *next;
+	struct optimistic_spin_node *node, *prev, *next;
 	unsigned int curr = encode_cpu(smp_processor_id());
 	unsigned int prev_cpu;
 
-	node->next = NULL;
-
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
 	 * unlock() uncontended, or fastpath) and RELEASE (to publish
@@ -109,6 +106,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	if (prev_cpu == OSQ_UNLOCKED_VAL)
 		return true;
 
+	node = this_cpu_ptr(&osq_node);
 	WRITE_ONCE(node->prev_cpu, prev_cpu);
 	prev = decode_cpu(prev_cpu);
 	node->locked = 0;
-- 
2.39.5
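The invariant the patch relies on can be modelled in userspace. The following is a sketch, not kernel code: the struct layout, `model_encode_cpu()` and the single-word tail protocol are simplified stand-ins invented here. It demonstrates the point of the change: on the uncontended path only lock->tail is touched, so the per-CPU node (and node->next in particular) is never written.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define OSQ_UNLOCKED_VAL 0

struct model_node {
	struct model_node *next;
	int locked;
	int prev_cpu;
};

struct model_queue {
	atomic_int tail;
};

/* stand-in for the per-CPU osq_node area; zero-initialised, so
 * every node starts with next == NULL */
static struct model_node model_nodes[2];

static int model_encode_cpu(int cpu)
{
	return cpu + 1;		/* 0 is reserved for OSQ_UNLOCKED_VAL */
}

/* Uncontended acquire: xchg the tail; if nobody was queued we return
 * without ever dereferencing (or writing) our node. */
static int model_lock(struct model_queue *q, int cpu)
{
	int curr = model_encode_cpu(cpu);
	int prev = atomic_exchange(&q->tail, curr);

	if (prev == OSQ_UNLOCKED_VAL)
		return 1;	/* fast path: node untouched */

	/* contended path omitted from this sketch */
	return 0;
}

/* Uncontended release: cmpxchg the tail back to unlocked; again the
 * node is not touched, so next stays NULL for the next lock attempt. */
static void model_unlock(struct model_queue *q, int cpu)
{
	int curr = model_encode_cpu(cpu);

	atomic_compare_exchange_strong(&q->tail, &curr, OSQ_UNLOCKED_VAL);
}
```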
Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by Waiman Long 3 weeks, 5 days ago
On 3/6/26 5:51 PM, david.laight.linux@gmail.com wrote:
> From: David Laight <david.laight.linux@gmail.com>
>
> When osq_lock() returns false, or when osq_unlock() returns, static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to the atomic_xchg(&lock->tail, curr) on entry to osq_lock().
>
> Defer determining the address of the CPU's 'node' until after the
> atomic_xchg() so that it isn't done in the uncontended path.
>
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
>   kernel/locking/osq_lock.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 0619691e2756..3f0cfdf1cd0f 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -92,13 +92,10 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>   
>   bool osq_lock(struct optimistic_spin_queue *lock)
>   {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> -	struct optimistic_spin_node *prev, *next;
> +	struct optimistic_spin_node *node, *prev, *next;
>   	unsigned int curr = encode_cpu(smp_processor_id());
>   	unsigned int prev_cpu;
>   
> -	node->next = NULL;

Although it does look like node->next should always be NULL when 
entering osq_lock(), any future change may invalidate this assumption. I 
know you want to not touch the osq_node cacheline on fast path, but we 
will need a big comment here to explicitly spell out this assumption to 
make sure that we won't break it in the future.

BTW, how much performance gain have you measured with this change? Can
we just leave it there to be safe?

> -
>   	/*
>   	 * We need both ACQUIRE (pairs with corresponding RELEASE in
>   	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> @@ -109,6 +106,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   	if (prev_cpu == OSQ_UNLOCKED_VAL)
>   		return true;
>   
> +	node = this_cpu_ptr(&osq_node);
>   	WRITE_ONCE(node->prev_cpu, prev_cpu);
>   	prev = decode_cpu(prev_cpu);
>   	node->locked = 0;

I am fine with moving the initialization here. The other patches also 
look good to me.

Cheers,
Longman
Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by David Laight 3 weeks, 5 days ago
On Wed, 11 Mar 2026 15:27:08 -0400
Waiman Long <longman@redhat.com> wrote:

> On 3/6/26 5:51 PM, david.laight.linux@gmail.com wrote:
> > From: David Laight <david.laight.linux@gmail.com>
> >
> > When osq_lock() returns false, or when osq_unlock() returns, static
> > analysis shows that node->next should always be NULL.
> > This means that it isn't necessary to explicitly set it to NULL
> > prior to the atomic_xchg(&lock->tail, curr) on entry to osq_lock().
> >
> > Defer determining the address of the CPU's 'node' until after the
> > atomic_xchg() so that it isn't done in the uncontended path.
> >
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> > ---
> >   kernel/locking/osq_lock.c | 6 ++----
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 0619691e2756..3f0cfdf1cd0f 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -92,13 +92,10 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >   
> >   bool osq_lock(struct optimistic_spin_queue *lock)
> >   {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > -	struct optimistic_spin_node *prev, *next;
> > +	struct optimistic_spin_node *node, *prev, *next;
> >   	unsigned int curr = encode_cpu(smp_processor_id());
> >   	unsigned int prev_cpu;
> >   
> > -	node->next = NULL;  
> 
> Although it does look like node->next should always be NULL when 
> entering osq_lock(), any future change may invalidate this assumption. I 
> know you want to not touch the osq_node cacheline on fast path, but we 
> will need a big comment here to explicitly spell out this assumption to 
> make sure that we won't break it in the future.

I've been looking at the code some more.
The first new patch adds a load of comments.
Unfortunately they were rather wrong (or in the wrong place) by the time
I'd finished, so I need to go around again at least once.
The main extra changes:
- Set node->prev to zero instead of node->locked = 1.
  This makes the loop after the smp_cond_load_relaxed() better.
  And the write to next->prev can move into osq_wait_next().
- Just call osq_wait_next() from osq_unlock() - it is the same code.
  Removes the unconditional cmpxchg on lock->tail (which needs to be
  _release in both places).
- Change node->next to be the cpu number - it is only used as a pointer
  once (I think Linus suggested that once).
- Stop 'node' being cache-line aligned - it only contains two cpu numbers.
  Just 8 byte align so it is all in one cache line.
  The data for the 'wrong' cpu is hardly ever accessed.
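The last two bullets could shrink the node to something like the following. This is a hypothetical layout with invented names, not code from the series: both CPU-number fields fit in 8 bytes, so 8-byte alignment keeps the whole node in one cache line without the old cache-line padding.

```c
#include <assert.h>
#include <stdalign.h>

/* Hypothetical node layout for the last two bullets above: 'next' as a
 * CPU number rather than a pointer, and the node 8-byte aligned instead
 * of cache-line aligned, so both fields always share one cache line. */
struct osq_node_sketch {
	unsigned int next_cpu;	/* successor's CPU number, 0 if none */
	unsigned int prev_cpu;	/* predecessor's CPU number */
} __attribute__((aligned(8)));
```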

> 
> BTW, how much performance gain have you measured with this change? Can
> we just leave it there to be safe?

You think I've been brave enough to run my changes? :-)
Actually I've written a little test harness that lets me compile and run
the actual kernel source in userspace and 'prod' it with requests.
That let me add delays at particular points to check the unusual paths
(but not the effects of acquire/release).

I will run the next changes, hopefully there won't be anything nasty in
the rc kernel to catch me out.
But I've no real idea of how to exercise the code.
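As a rough illustration, a userspace harness of the kind described can be as small as a couple of pthreads hammering a lock. This sketch uses a plain test-and-set spinlock as a stand-in, NOT the real osq code; it only shows the "prod it from userspace and check nothing is lost" idea.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Stand-in lock: a simple test-and-set spinlock, not the osq code. */
static atomic_flag lock_taken = ATOMIC_FLAG_INIT;
static long shared_counter;

static void stand_in_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&lock_taken,
						 memory_order_acquire))
		;	/* spin */
}

static void stand_in_unlock(void)
{
	atomic_flag_clear_explicit(&lock_taken, memory_order_release);
}

/* Each thread 'prods' the lock many times; if mutual exclusion held,
 * no increments are lost. */
static void *prodder(void *unused)
{
	(void)unused;
	for (int i = 0; i < 100000; i++) {
		stand_in_lock();
		shared_counter++;
		stand_in_unlock();
	}
	return NULL;
}
```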

	David

> 
> > -
> >   	/*
> >   	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> >   	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> > @@ -109,6 +106,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >   	if (prev_cpu == OSQ_UNLOCKED_VAL)
> >   		return true;
> >   
> > +	node = this_cpu_ptr(&osq_node);
> >   	WRITE_ONCE(node->prev_cpu, prev_cpu);
> >   	prev = decode_cpu(prev_cpu);
> >   	node->locked = 0;  
> 
> I am fine with moving the initialization here. The other patches also 
> look good to me.
> 
> Cheers,
> Longman
>
Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by Waiman Long 3 weeks, 5 days ago
On 3/11/26 3:27 PM, Waiman Long wrote:
> On 3/6/26 5:51 PM, david.laight.linux@gmail.com wrote:
>> From: David Laight <david.laight.linux@gmail.com>
>>
>> When osq_lock() returns false, or when osq_unlock() returns, static
>> analysis shows that node->next should always be NULL.
>> This means that it isn't necessary to explicitly set it to NULL
>> prior to the atomic_xchg(&lock->tail, curr) on entry to osq_lock().
>>
>> Defer determining the address of the CPU's 'node' until after the
>> atomic_xchg() so that it isn't done in the uncontended path.
>>
>> Signed-off-by: David Laight <david.laight.linux@gmail.com>
>> ---
>>   kernel/locking/osq_lock.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 0619691e2756..3f0cfdf1cd0f 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -92,13 +92,10 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>     bool osq_lock(struct optimistic_spin_queue *lock)
>>   {
>> -    struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>> -    struct optimistic_spin_node *prev, *next;
>> +    struct optimistic_spin_node *node, *prev, *next;
>>       unsigned int curr = encode_cpu(smp_processor_id());
>>       unsigned int prev_cpu;
>>   -    node->next = NULL;
>
> Although it does look like node->next should always be NULL when 
> entering osq_lock(), any future change may invalidate this assumption. 
> I know you want to not touch the osq_node cacheline on fast path, but 
> we will need a big comment here to explicitly spell out this 
> assumption to make sure that we won't break it in the future.
>
> BTW, how much performance gain have you measured with this change? Can
> we just leave it there to be safe?
>
Or you could have something like

         if (IS_ENABLED(CONFIG_PROVE_LOCKING))
                 WARN_ON_ONCE(node->next != NULL);
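A userspace sketch of that guard, with CONFIG_PROVE_LOCKING modelled as a plain compile-time flag and all names invented here:

```c
#include <assert.h>
#include <stddef.h>

/* Model of the suggested debug-only entry check: under a lock-proving
 * config a stale node->next is caught loudly instead of silently
 * relied upon. PROVE_LOCKING_MODEL stands in for
 * IS_ENABLED(CONFIG_PROVE_LOCKING). */
#define PROVE_LOCKING_MODEL 1

struct check_node {
	struct check_node *next;
};

/* Returns 1 if the invariant was violated, i.e. the WARN would fire. */
static int entry_check(struct check_node *node)
{
	if (PROVE_LOCKING_MODEL && node->next != NULL)
		return 1;
	return 0;
}
```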

Cheers,
Longman


Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by Linus Torvalds 1 month ago
On Fri, 6 Mar 2026 at 14:52, <david.laight.linux@gmail.com> wrote:
>
> From: David Laight <david.laight.linux@gmail.com>
>
> When osq_lock() returns false, or when osq_unlock() returns, static
> analysis shows that node->next should always be NULL.

This explanation makes me nervous.

*What* static analysis? It's very unclear. And the "should be NULL"
doesn't make me get the warm and fuzzies.

For example, osq_unlock() does do

        node = this_cpu_ptr(&osq_node);
        next = xchg(&node->next, NULL);

so it's clearly NULL after that. But it's not obvious this will be
reached, because osq_unlock() does that

        /*
         * Fast path for the uncontended case.
         */
        if (atomic_try_cmpxchg_release(&lock->tail, &curr, OSQ_UNLOCKED_VAL))
                return;

before it actually gets to this point.

And yes, I'm very willing to believe that if we hit that fast-path,
node->next (which is "curr->next" in that path) is indeed NULL, but I
think this commit message really needs to spell it all out.
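That sequence can be modelled sequentially (userspace sketch with simplified names, not kernel code): the only writer of node->next is a queued successor, and the slow-path unlock xchg's it back to NULL before the hand-off, so the fast path is only taken when next was never set at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mnode {
	struct mnode *_Atomic next;
	int locked;
};

static struct mnode cpu0_node, cpu1_node;

/* The only place next is ever set: a successor queueing behind prev. */
static void model_queue_behind(struct mnode *prev, struct mnode *self)
{
	atomic_store(&prev->next, self);
}

/* Slow-path unlock: clear our next and wake the successor, so next is
 * NULL again by the time unlock returns. If no successor ever queued,
 * next was never non-NULL and the fast path can return early. */
static void model_unlock_slow(struct mnode *self)
{
	struct mnode *next = atomic_exchange(&self->next, NULL);

	if (next)
		next->locked = 1;
}
```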

No "should be NULL", in other words. I want a rock-solid "node->next
is always NULL because XYZ" explanation, not a wishy-washy "static
analysis says" without spelling it out.

            Linus
Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by David Laight 1 month ago
On Fri, 6 Mar 2026 16:06:25 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 6 Mar 2026 at 14:52, <david.laight.linux@gmail.com> wrote:
> >
> > From: David Laight <david.laight.linux@gmail.com>
> >
> > When osq_lock() returns false, or when osq_unlock() returns, static
> > analysis shows that node->next should always be NULL.
> 
> This explanation makes me nervous.
> 
> *What* static analysis? It's very unclear. And the "should be NULL"
> doesn't make me get the warm and fuzzies.

The analysis was 'my brain' about two years ago :-)

> For example, osq_unlock() does do
> 
>         node = this_cpu_ptr(&osq_node);
>         next = xchg(&node->next, NULL);
> 
> so it's clearly NULL after that. But it's not obvious this will be
> reached, because osq_unlock() does that
> 
>         /*
>          * Fast path for the uncontended case.
>          */
>         if (atomic_try_cmpxchg_release(&lock->tail, &curr, OSQ_UNLOCKED_VAL))
>                 return;
> 
> before it actually gets to this point.

That is (should be) checking that the list only contains a single node.
So the 'next' pointer can't be set.

I'll drink 10 double-espressos and read the code again.

I'll also check what happens if the code is hit by an ethernet interrupt
just after the initial xchg().
Other CPUs can also acquire the lock and then get preempted (so they need
to unlink themselves), as well as the holder trying to release the lock.

It might be that the initial xchg needs to be a cmpxchg - which would
massively simplify the 'lock' path.
It does have the 'no forwards progress' issue.
I can't remember whether xchg has to be implemented as cmpxchg on any
architectures; if it does, then the current complex locking code
is unnecessary.
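A sketch of that cmpxchg variant (userspace model, invented names; as noted it trades away the xchg's unconditional forward progress):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define OSQ_UNLOCKED_VAL 0

/* If the initial xchg were a cmpxchg, the uncontended acquire would
 * only write lock->tail when it actually takes the lock -- no queue
 * node is needed on failure -- but a contended caller can fail
 * repeatedly, which is the forward-progress concern above. */
static bool try_acquire_cmpxchg(atomic_int *tail, int curr)
{
	int expected = OSQ_UNLOCKED_VAL;

	return atomic_compare_exchange_strong(tail, &expected, curr);
}
```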

	David


> 
> And yes, I'm very willing to believe that if we hit that fast-path,
> node->next (which is "curr->next" in that path) is indeed NULL, but I
> think this commit message really needs to spell it all out.
> 
> No "should be NULL", in other words. I want a rock-solid "node->next
> is always NULL because XYZ" explanation, not a wishy-washy "static
> analysis says" without spelling it out.
> 
>             Linus
Re: [PATCH v3 next 5/5] Avoid writing to node->next in the osq_lock() fast path
Posted by David Laight 1 month ago
On Fri,  6 Mar 2026 22:51:50 +0000
david.laight.linux@gmail.com wrote:

> From: David Laight <david.laight.linux@gmail.com>
> 
> When osq_lock() returns false, or when osq_unlock() returns, static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to the atomic_xchg(&lock->tail, curr) on entry to osq_lock().
>
> Defer determining the address of the CPU's 'node' until after the
> atomic_xchg() so that it isn't done in the uncontended path.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
>  kernel/locking/osq_lock.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 0619691e2756..3f0cfdf1cd0f 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -92,13 +92,10 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  
>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> -	struct optimistic_spin_node *prev, *next;
> +	struct optimistic_spin_node *node, *prev, *next;
>  	unsigned int curr = encode_cpu(smp_processor_id());
>  	unsigned int prev_cpu;
>  
> -	node->next = NULL;
> -
>  	/*
>  	 * We need both ACQUIRE (pairs with corresponding RELEASE in
>  	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> @@ -109,6 +106,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	if (prev_cpu == OSQ_UNLOCKED_VAL)
>  		return true;
>  
> +	node = this_cpu_ptr(&osq_node);
>  	WRITE_ONCE(node->prev_cpu, prev_cpu);
>  	prev = decode_cpu(prev_cpu);
>  	node->locked = 0;