[PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

David Laight posted 5 patches 1 year, 12 months ago
[PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(); it is initialised on the first call.

Signed-off-by: David Laight <david.laight@aculab.com>
---
 kernel/locking/osq_lock.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
  */
 
 struct optimistic_spin_node {
-	struct optimistic_spin_node *next, *prev;
+	struct optimistic_spin_node *self, *next, *prev;
 	int locked; /* 1 if lock acquired */
 	int cpu; /* encoded CPU # + 1 value */
 };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
 	struct optimistic_spin_node *prev, *next;
 	int old;
 
-	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
-		node->cpu = encode_cpu(smp_processor_id());
+	if (unlikely(!node)) {
+		int cpu = encode_cpu(smp_processor_id());
+		node = decode_cpu(cpu);
+		node->self = node;
+		node->cpu = cpu;
+	}
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Second most likely case.
 	 */
-	node = this_cpu_ptr(&osq_node);
+	node = raw_cpu_read(osq_node.self);
 	next = xchg(&node->next, NULL);
 	if (next) {
 		WRITE_ONCE(next->locked, 1);
-- 
2.17.1

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Linus Torvalds 1 year, 12 months ago
On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -       struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +       struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

>         struct optimistic_spin_node *prev, *next;
>         int old;
>
> -       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -               node->cpu = encode_cpu(smp_processor_id());
> +       if (unlikely(!node)) {
> +               int cpu = encode_cpu(smp_processor_id());
> +               node = decode_cpu(cpu);
> +               node->self = node;
> +               node->cpu = cpu;
> +       }

The proper fix here is to not do that silly

        node = this_cpu_ptr(&osq_node);
        ..
        node->next = NULL;

dance at all, but to simply do

        this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.
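
For illustration, an untested sketch of that shape (this is not the attached
patch; it just reuses the existing encode_cpu()/decode_cpu() helpers and the
osq_node per-cpu variable):

bool osq_lock(struct optimistic_spin_queue *lock)
{
        struct optimistic_spin_node *node, *prev, *next;
        int curr = encode_cpu(smp_processor_id());
        int old;

        /* single %gs-relative stores, no this_cpu_ptr() needed yet */
        this_cpu_write(osq_node.locked, 0);
        this_cpu_write(osq_node.next, NULL);
        this_cpu_write(osq_node.cpu, curr);

        old = atomic_xchg(&lock->tail, curr);
        if (old == OSQ_UNLOCKED_VAL)
                return true;

        /* the raw pointer is only needed off the critical path */
        node = this_cpu_ptr(&osq_node);
        prev = decode_cpu(old);
        node->prev = prev;
        ...
}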

Yes, you'll eventually end up doing that

        node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

        WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

        prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.
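
For illustration, an untested sketch of that layout (the next_cpu/prev_cpu
field names here are made up):

struct optimistic_spin_node {
        int next_cpu;   /* encoded CPU # + 1 of the next waiter, 0 = none */
        int prev_cpu;   /* encoded CPU # + 1 of the previous waiter */
        int locked;     /* 1 if lock acquired */
        int cpu;        /* encoded CPU # + 1 value */
};

with the decode happening only at the places that actually dereference, e.g.

        prev = decode_cpu(READ_ONCE(node->prev_cpu));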

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

  #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

        subl    $1, %eax        #, cpu_nr
...
        cltq
        addq    __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

        addq    __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

        unsigned int cpu_nr = encoded_cpu_val;

        return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

                 Linus
RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
From: Linus Torvalds
> Sent: 30 December 2023 20:41
> 
> On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> 
> No, this is horrible.
> 
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
> 
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -       struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +       struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> 
> No. Both of these are crap.
> 
> >         struct optimistic_spin_node *prev, *next;
> >         int old;
> >
> > -       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -               node->cpu = encode_cpu(smp_processor_id());
> > +       if (unlikely(!node)) {
> > +               int cpu = encode_cpu(smp_processor_id());
> > +               node = decode_cpu(cpu);
> > +               node->self = node;
> > +               node->cpu = cpu;
> > +       }
> 
> The proper fix here is to not do that silly
> 
>         node = this_cpu_ptr(&osq_node);
>         ..
>         node->next = NULL;
> 
> dance at all, but to simply do
> 
>         this_cpu_write(osq_node.next, NULL);
> 
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.

Is the equivalent true (ie offset from fixed register) for all SMP archs?
Or do some have to do something rather more complicated?

> Yes, you'll eventually end up doing that
> 
>         node = this_cpu_ptr(&osq_node);

For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember
setting it, and can't imagine why I might have.
Best guess is it was inherited from the ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function
in order to check that preemption is disabled.
I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!

> thing because it then wants to use that raw pointer to do
> 
>         WRITE_ONCE(prev->next, node);
> 
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.

I could claim that loading it is one instruction shorter and that
if the cache line containing 'node' is needed and 'pcpu_hot'
is (unexpectedly) not cached it saves a cache line load.
But I probably won't!

> 
> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
> 
>         prev = decode_cpu(old);
> 
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.

Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.

> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
> 
>   #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires
the structure be correctly initialised, not just be all zero.
That might show up some very strange bugs.

> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
> 
>         subl    $1, %eax        #, cpu_nr
> ...
>         cltq
>         addq    __per_cpu_offset(,%rax,8), %rcx
> 
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.

Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits
are zero so shouldn't need to zero extend.
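
Something like this (an untested sketch of the unsigned variant):

	static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
	{
		unsigned int cpu_nr = encoded_cpu_val - 1;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}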

> At that point, I think gcc might be able to just generate
> 
>         addq    __per_cpu_offset-8(,%rax,8), %rcx

That would need the decrement to be 64bit.
A quick test failed to make that work.
Probably (as you mentioned in the next email) because gcc
doesn't know that the high bits from atomic_xchg() are zero.

> but honestly, I think it would be nicer to just have decode_cpu() do
> 
>         unsigned int cpu_nr = encoded_cpu_val;
> 
>         return per_cpu_ptr(&osq_node, cpu_nr);
> 
> and not have the -1/+1 at all.

Unless you can persuade gcc that the high bits from atomic_xchg()
are zero, that will require a zero extend.

	David

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Linus Torvalds 1 year, 12 months ago
On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

  -#define per_cpu_offset(x) (__per_cpu_offset[x])
  +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

        subl    $1, %eax        #, cpu_nr
        addq    __per_cpu_offset(,%rax,8), %rcx

instead of just doing

        addq    __per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

              Linus
RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
From: Linus Torvalds
> Sent: 30 December 2023 20:59
> 
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
> 
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
> 
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

> 
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
> 
>   -#define per_cpu_offset(x) (__per_cpu_offset[x])
>   +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
> 
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
		__per_cpu_offset[(x) + 0u]

> And gcc still generates
> 
>         subl    $1, %eax        #, cpu_nr
>         addq    __per_cpu_offset(,%rax,8), %rcx
> 
> instead of just doing
> 
>         addq    __per_cpu_offset-8(,%rax,8), %rcx
> 
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location would be read if cpu_nr were zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
> 
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

	David

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Ingo Molnar 1 year, 12 months ago
* David Laight <David.Laight@ACULAB.COM> wrote:

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>  	struct optimistic_spin_node *prev, *next;
>  	int old;
>  
> -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -		node->cpu = encode_cpu(smp_processor_id());
> +	if (unlikely(!node)) {
> +		int cpu = encode_cpu(smp_processor_id());
> +		node = decode_cpu(cpu);
> +		node->self = node;
> +		node->cpu = cpu;

This whole initialization sequence is suboptimal and needs to be 
cleaned up first: the node->cpu field is constant once initialized, so 
it should be initialized from appropriate init methods, not at runtime
in osq_lock(), right?

Eliminating that initialization branch is a useful micro-optimization 
in itself for the hot path.
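
For instance, something along these lines would do (an untested sketch; the
osq_node_init() name is made up, the rest are the existing osq_lock.c helpers):

	static int __init osq_node_init(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			per_cpu(osq_node, cpu).cpu = encode_cpu(cpu);

		return 0;
	}
	early_initcall(osq_node_init);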

Thanks,

	Ingo
RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
From: Ingo Molnar
> Sent: 30 December 2023 20:38
> 
> * David Laight <David.Laight@ACULAB.COM> wrote:
> 
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >  	struct optimistic_spin_node *prev, *next;
> >  	int old;
> >
> > -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -		node->cpu = encode_cpu(smp_processor_id());
> > +	if (unlikely(!node)) {
> > +		int cpu = encode_cpu(smp_processor_id());
> > +		node = decode_cpu(cpu);
> > +		node->self = node;
> > +		node->cpu = cpu;
> 
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not runtime in
> osq_lock(), right?

I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.

	David

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Waiman Long 1 year, 12 months ago
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>   kernel/locking/osq_lock.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
>    */
>   
>   struct optimistic_spin_node {
> -	struct optimistic_spin_node *next, *prev;
> +	struct optimistic_spin_node *self, *next, *prev;
>   	int locked; /* 1 if lock acquired */
>   	int cpu; /* encoded CPU # + 1 value */
>   };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>   
>   bool osq_lock(struct optimistic_spin_queue *lock)
>   {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
    0x0000000000000029 <+25>:    mov    %rcx,%rdx
    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34 <osq_lock+36>

Which looks pretty optimized to me. An older compiler may generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.

Cheers,
Longman

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Ingo Molnar 1 year, 12 months ago
* Waiman Long <longman@redhat.com> wrote:

> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> > 
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> > 
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >   kernel/locking/osq_lock.c | 14 +++++++++-----
> >   1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> >    */
> >   struct optimistic_spin_node {
> > -	struct optimistic_spin_node *next, *prev;
> > +	struct optimistic_spin_node *self, *next, *prev;
> >   	int locked; /* 1 if lock acquired */
> >   	int cpu; /* encoded CPU # + 1 value */
> >   };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >   bool osq_lock(struct optimistic_spin_queue *lock)
> >   {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> 
> My gcc 11 compiler produces the following x86-64 code:
> 
> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> <osq_lock+36>
> 
> Which looks pretty optimized for me. Maybe older compiler may generate more
> complex code. However, I do have some doubt as to the benefit of this patch
> at the expense of making the code a bit more complex.

GCC-11 is plenty of a look-back window in terms of compiler efficiency: 
latest enterprise distros use GCC-11 or newer, while recent desktop 
distros use GCC-13. Anything older won't matter, because no major 
distribution is going to use new kernels with old compilers.

Thanks,

	Ingo
RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
From: Ingo Molnar
> Sent: 30 December 2023 11:09
> 
> 
> * Waiman Long <longman@redhat.com> wrote:
> 
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> > >   kernel/locking/osq_lock.c | 14 +++++++++-----
> > >   1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > >    */
> > >   struct optimistic_spin_node {
> > > -	struct optimistic_spin_node *next, *prev;
> > > +	struct optimistic_spin_node *self, *next, *prev;
> > >   	int locked; /* 1 if lock acquired */
> > >   	int cpu; /* encoded CPU # + 1 value */
> > >   };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > >   bool osq_lock(struct optimistic_spin_queue *lock)
> > >   {
> > > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> > 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >    0x0000000000000029 <+25>:    mov    %rcx,%rdx
> >    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> > <osq_lock+36>
> >
> > Which looks pretty optimized for me. Maybe older compiler may generate more
> > complex code. However, I do have some doubt as to the benefit of this patch
> > at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
  18:   65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 20 <osq_lock+0x20>
  1f:   00
                        1c: R_X86_64_PC32       .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.

> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
  raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):
  10:   49 89 fd                mov    %rdi,%r13
  13:   49 c7 c4 00 00 00 00    mov    $0x0,%r12
                        16: R_X86_64_32S        .data..percpu..shared_aligned
  1a:   e8 00 00 00 00          callq  1f <osq_lock+0x1f>
                        1b: R_X86_64_PC32       debug_smp_processor_id-0x4
  1f:   89 c0                   mov    %eax,%eax
  21:   48 8b 1c c5 00 00 00    mov    0x0(,%rax,8),%rbx
  28:   00
                        25: R_X86_64_32S        __per_cpu_offset
  29:   e8 00 00 00 00          callq  2e <osq_lock+0x2e>
                        2a: R_X86_64_PC32       debug_smp_processor_id-0x4
  2e:   4c 01 e3                add    %r12,%rbx
  31:   83 c0 01                add    $0x1,%eax
  34:   c7 43 10 00 00 00 00    movl   $0x0,0x10(%rbx)
  3b:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
  42:   89 43 14                mov    %eax,0x14(%rbx)
  45:   41 87 45 00             xchg   %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.

	David

Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by Waiman Long 1 year, 12 months ago
On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>>
>> * Waiman Long <longman@redhat.com> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <david.laight@aculab.com>
>>>> ---
>>>>    kernel/locking/osq_lock.c | 14 +++++++++-----
>>>>    1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>>     */
>>>>    struct optimistic_spin_node {
>>>> -	struct optimistic_spin_node *next, *prev;
>>>> +	struct optimistic_spin_node *self, *next, *prev;
>>>>    	int locked; /* 1 if lock acquired */
>>>>    	int cpu; /* encoded CPU # + 1 value */
>>>>    };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>>    bool osq_lock(struct optimistic_spin_queue *lock)
>>>>    {
>>>> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>     0x0000000000000029 <+25>:    mov    %rcx,%rdx
>>>     0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
>>> <osq_lock+36>
>>>
>>> Which looks pretty optimized for me. Maybe older compiler may generate more
>>> complex code. However, I do have some doubt as to the benefit of this patch
>>> at the expense of making the code a bit more complex.
> My changed code is one instruction shorter!
>    18:   65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 20 <osq_lock+0x20>
>    1f:   00
>                          1c: R_X86_64_PC32       .data..percpu..shared_aligned-0x4
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
>    raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
>    10:   49 89 fd                mov    %rdi,%r13
>    13:   49 c7 c4 00 00 00 00    mov    $0x0,%r12
>                          16: R_X86_64_32S        .data..percpu..shared_aligned
>    1a:   e8 00 00 00 00          callq  1f <osq_lock+0x1f>
>                          1b: R_X86_64_PC32       debug_smp_processor_id-0x4
>    1f:   89 c0                   mov    %eax,%eax
>    21:   48 8b 1c c5 00 00 00    mov    0x0(,%rax,8),%rbx
>    28:   00
>                          25: R_X86_64_32S        __per_cpu_offset
>    29:   e8 00 00 00 00          callq  2e <osq_lock+0x2e>
>                          2a: R_X86_64_PC32       debug_smp_processor_id-0x4
>    2e:   4c 01 e3                add    %r12,%rbx
>    31:   83 c0 01                add    $0x1,%eax
>    34:   c7 43 10 00 00 00 00    movl   $0x0,0x10(%rbx)
>    3b:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
>    42:   89 43 14                mov    %eax,0x14(%rbx)
>    45:   41 87 45 00             xchg   %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and gotten the following
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
#define arch_raw_cpu_ptr(ptr)                           \
({                                                      \
         unsigned long tcp_ptr__;                        \
         asm ("add " __percpu_arg(1) ", %0"              \
              : "=r" (tcp_ptr__)                         \
              : "m" (this_cpu_off), "0" (ptr));          \
         (typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
})

The presence of debug_smp_processor_id in your compiled code is likely 
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
   extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2 
instructions for this_cpu_ptr(). We are not going to optimize the code 
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Cheers,
Longman

RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
Posted by David Laight 1 year, 12 months ago
From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
> 
> #ifdef CONFIG_DEBUG_PREEMPT
>    extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
> 
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Yes, I discovered that late last night.
I've no idea why it is set.
It might even be inherited from the ubuntu 18.04 (I think) .config
I started with (mostly removing extra modules to reduce compile
time and the size of uramdisk).

I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save
the cache-line bounce.
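
I.e. roughly this (sketch only):

	struct optimistic_spin_node {
		struct optimistic_spin_node *next, *prev;
		int locked;	/* 1 if lock acquired */
		int cpu;	/* encoded CPU # + 1 value */
		int prev_cpu;	/* copy of prev->cpu, avoids re-reading prev's cache line */
	};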

	David
