osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.
node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).
Only initialise node->cpu once; after that use its value instead
of smp_processor_id() - which is probably a real function call.
Should reduce cache-line bouncing a little.
Signed-off-by: David Laight <david.laight@aculab.com>
---
Re-send without the 'RE:' on the subject line.
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;
/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());
/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;
/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);
/*
* Fast path for the uncontended case.
--
2.17.1
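For reference, with the patch applied the start of osq_lock() would read
roughly as follows (a sketch assembled from the hunks above, not a
substitute for the diff; note that node->next is deliberately left
untouched on entry):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
		struct optimistic_spin_node *prev, *next;
		int old;

		/* node->cpu is encoded once and is then stable for this cpu. */
		if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
			node->cpu = encode_cpu(smp_processor_id());

		old = atomic_xchg(&lock->tail, node->cpu);
		if (old == OSQ_UNLOCKED_VAL)
			return true;

		prev = decode_cpu(old);
		node->prev = prev;
		/* Moved from function entry: only the slow path needs it. */
		node->locked = 0;

		/* ... remainder of the slow path is unchanged ... */
	}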
On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
[...]
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> - int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> - node->next = NULL;
I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not cleared
before it is used again. So you either have to prove that such a race
does not exist, or keep initializing it to NULL at the beginning as is
done today.
Cheers,
Longman
From: Waiman Long
> Sent: 30 December 2023 03:20
>
> On 12/29/23 17:11, David Laight wrote:
> > [...]
> > - node->locked = 0;
> > - node->next = NULL;
>
> I have some concern about not clearing node->next at the beginning. I
> know that next is supposed to be cleared at the end. I do have some
> worry that there may exist a race condition that leaves next not cleared
> before it is used again. So you either have to prove that such a race
> does not exist, or keep initializing it to NULL at the beginning as is
> done today.
I've stared at the code some more :-)
There are two places where node->next is written non-NULL, both in osq_lock().
The first is at the top of the slow-path, where prev->next = node
(this is the store that overwrites the NULL set (or not) on entry).
The second is at the bottom of osq_lock(), the unqueue path's
prev->next = next (Step C), where the NULL written in Step A is
overwritten with the correct 'next' node.
After either of those the 'node' cpu must later either take the unqueue
exit from osq_lock() or call osq_unlock().
Both require it to wait for node->next to become non-NULL and then NULL it.
If code calls osq_lock() twice all bets are off!
Even if (somehow) node->next were non-NULL on entry it would be
overwritten by an osq_lock() call from another cpu.
The only problem would be if osq_unlock() were called before the queueing
cpu set prev->next = node.
That in itself is unlikely - but would go wrong if node->next had been
left set.
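For concreteness, the two places where node->next is written non-NULL
look roughly like this in the current kernel/locking/osq_lock.c
(paraphrased excerpt, with comments added here):

	/*
	 * Site 1: queueing in osq_lock().  The new waiter publishes
	 * itself by storing into its predecessor's ->next, overwriting
	 * the NULL the predecessor set (or, with this patch, did not
	 * need to set) on entry.
	 */
	WRITE_ONCE(prev->next, node);

	/*
	 * Site 2: unqueue Step C in osq_lock().  The NULL written into
	 * prev->next in Step A is replaced with our successor, splicing
	 * this node out of the queue.
	 */
	WRITE_ONCE(prev->next, next);
	WRITE_ONCE(next->prev, prev);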
I don't completely understand the 'acquire'/'release' semantics (they didn't
exist when I started doing SMP kernel code in the late 1980s).
But it looks odd that osq_unlock()'s fast path uses _release but the very
similar code in osq_wait_next() uses _acquire.
Indeed, apart from some (assumed) optimisations, I think osq_unlock()
could just be:
	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;
I don't think the order of the tests for lock->tail and node->next
matters in osq_wait_next().
If they were swapped the 'Second most likely case' code from osq_unlock()
could be removed.
(The 'uncontended case' doesn't need to load the address of 'node'.)
David
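To make that reordering concrete, here is a speculative sketch of the
osq_wait_next() wait loop with the two tests swapped (untested, based
on the mainline loop structure; not part of the patch under review):

	for (;;) {
		/* Look for a queued successor first ... */
		if (READ_ONCE(node->next)) {
			next = xchg(&node->next, NULL);
			if (next)
				break;
		}

		/* ... then try to pull the tail back to our predecessor. */
		if (atomic_read(&lock->tail) == curr &&
		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr)
			break;

		cpu_relax();
	}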
On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they didn't
> exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.

The _release in osq_unlock() is needed since unlocks need to be RELEASE
so that lock+unlock can form a critical section (i.e. no memory accesses
can escape). When osq_wait_next() is used in non-unlock cases, the
RELEASE is not required. As for the case where osq_wait_next() is used
in osq_unlock(), there is an xchg() preceding it, which provides a full
barrier, so things are fine.

/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.

> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
> 	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> 	if (next)
> 		next->locked = 1;

If so we need to provide some sort of RELEASE semantics for
osq_unlock() in all cases.

Regards,
Boqun

> I don't think the order of the tests for lock->tail and node->next
> matters in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
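For orientation, the primitives Boqun is referring to appear roughly as
follows in mainline osq_lock.c (paraphrased excerpt, comments added):

	/*
	 * osq_unlock() fast path: RELEASE, so accesses inside the
	 * critical section cannot escape past the unlock.
	 */
	if (likely(atomic_cmpxchg_release(&lock->tail, curr,
					  OSQ_UNLOCKED_VAL) == curr))
		return;

	/* osq_wait_next(): ACQUIRE when pulling the tail back. */
	if (atomic_read(&lock->tail) == curr &&
	    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr)
		break;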
From: Boqun Feng
> Sent: 02 January 2024 18:54
>
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
>
> The _release in osq_unlock() is needed since unlocks need to be RELEASE
> so that lock+unlock can form a critical section (i.e. no memory accesses
> can escape). When osq_wait_next() is used in non-unlock cases, the
> RELEASE is not required. As for the case where osq_wait_next() is used
> in osq_unlock(), there is an xchg() preceding it, which provides a full
> barrier, so things are fine.

I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this
code, but is a FULL xchg always needed on node->next?

> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.

I wouldn't have worried about relaxed v release.

> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > 	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> > 	if (next)
> > 		next->locked = 1;
>
> If so we need to provide some sort of RELEASE semantics for
> osq_unlock() in all cases.

I wonder how often the unqueue code happens, and especially for the
last cpu in the list?
I'd only expect need_resched() to return true after spinning for a
while - in which case perhaps it is more likely that there are a lot
of cpus in the queue and the cpu being removed won't be the last.
In any case, osq_wait_next() exits on xchg(&node->next, NULL) != NULL,
which is a full barrier.

On a slightly different note, I've also wondered whether 'osq_node'
actually needs to be cache-line aligned.
You definitely don't want it spanning two cache lines, but I'd expect
that per-cpu data is mostly accessed by its own cpu, so you really
aren't going to get false sharing with some other per-cpu data since
the cpu is busy in this code.
So __aligned(16) would do?

	David
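For concreteness, the declaration in question, together with a sketch of
the alternative David floats (hypothetical and untested):

	/* Current mainline: each cpu's node gets its own cache line. */
	static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

	/* Floated alternative: only keep the node from straddling two lines. */
	struct optimistic_spin_node {
		struct optimistic_spin_node *next, *prev;
		int locked;	/* 1 if lock acquired */
		int cpu;	/* encoded CPU # + 1 value */
	} __aligned(16);

	static DEFINE_PER_CPU(struct optimistic_spin_node, osq_node);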