per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number to be 64bit.
However the value in osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are), so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0).
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).

Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not a massive win, but it saves two instructions.
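The effect described above can be sketched in standalone C (outside the kernel; the offset table and its values here are stand-ins, not the real __per_cpu_offset[]):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's __per_cpu_offset[] table; the values
 * are arbitrary and purely illustrative. */
static uintptr_t __per_cpu_offset[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };

/* As in osq_lock.c: the encoded value is cpu number + 1, so that
 * 0 can mean "no CPU". */
static inline int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* With a signed int parameter, gcc on x86-64 has to sign-extend
 * (encoded_cpu_val - 1) to 64 bits before using it as an array
 * index.  With unsigned int, the 32-bit subtract is already known
 * to leave the high 32 bits of the register zero (32-bit operations
 * implicitly zero-extend on x86-64), so the extra cltq/movslq
 * disappears. */
static inline uintptr_t decode_offset(unsigned int encoded_cpu_val)
{
	return __per_cpu_offset[encoded_cpu_val - 1];
}
```

Comparing the generated asm for a signed and an unsigned parameter (e.g. with `gcc -O2 -S`) shows the extra extension instruction in the signed case.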
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}
-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
{
- int cpu_nr = encoded_cpu_val - 1;
-
- return per_cpu_ptr(&osq_node, cpu_nr);
+ return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
}
/*
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* David Laight <David.Laight@ACULAB.COM> wrote:
> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value in osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
So why do we 'encode' the CPU number to begin with?
Why not use -1 as the special value? Checks for negative values
generate similarly fast machine code compared to checking for 0, if
the value is also used (which it is in most cases here). What am I
missing? We seem to be going through a lot of unnecessary hoops, and
some of that is in the runtime path.
Thanks,
Ingo
From: Ingo Molnar
> Sent: 02 January 2024 09:54
>
>
> * David Laight <David.Laight@ACULAB.COM> wrote:
>
> > per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> > This requires the cpu number be 64bit.
> > However the value in osq_lock() comes from a 32bit xchg() and there
> > isn't a way of telling gcc the high bits are zero (they are) so
> > there will always be an instruction to clear the high bits.
> >
> > The cpu number is also offset by one (to make the initialiser 0)
> > It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> > into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
> >
> > Converting the cpu number to 32bit unsigned prior to the decrement means
> > that gcc knows the decrement has set the high bits to zero and doesn't
> > add a register-register move (or cltq) to zero/sign extend the value.
> >
> > Not massive but saves two instructions.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> > kernel/locking/osq_lock.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 35bb99e96697..37a4fa872989 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> > return cpu_nr + 1;
> > }
> >
> > -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> > +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> > {
> > - int cpu_nr = encoded_cpu_val - 1;
> > -
> > - return per_cpu_ptr(&osq_node, cpu_nr);
> > + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>
> So why do we 'encode' the CPU number to begin with?
>
> Why not use -1 as the special value? Checks for negative values
> generate similarly fast machine code compared to checking for 0, if
> the value is also used (which it is in most cases here). What am I
> missing? We seem to be going through a lot of unnecessary hoops, and
> some of that is in the runtime path.
You'd really have to ask the person who did the original patch
that changed lock->tail from a pointer to an int (saving 8 bytes)
in every mutex and rwsem.
I suspect the reason is that it is so much safer to have the
initialiser being zero, rather than a non-zero value with zero
being a valid value.
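A toy version makes the initialiser point concrete (the struct and field names follow the kernel, but this is a standalone sketch: the real lock uses atomic_t for tail, and the -1 variant is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* As in the kernel: tail == 0 means unlocked, so an all-zero
 * structure (static storage, kzalloc()'d memory) is a valid
 * unlocked lock with no explicit initialisation needed. */
#define OSQ_UNLOCKED_VAL (0)

struct optimistic_spin_queue {
	int tail;		/* 0 = unlocked, else cpu_nr + 1 */
};

static inline int osq_is_locked(const struct optimistic_spin_queue *lock)
{
	return lock->tail != OSQ_UNLOCKED_VAL;
}

/* With a hypothetical -1 sentinel instead, zeroed memory would
 * read as "locked by CPU 0", so every embedded lock would need
 * explicit initialisation to -1. */
```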
It is also hard to avoid an extra instruction in the per_cpu_ptr()
code - something has to extend the 32bit result from xchg() to
a 64bit one for the array index.
The asm for an unsigned 32bit exchange could return a 64bit result
(which would have the desired effect), but that won't work for
a signed value.
The '-1' in the vcpu_is_preempted() path will be executed in parallel
with something else and is likely to have no measurable effect.
So it is a slightly risky change that has less benefit than the
other changes (which save cache line accesses).
David
On 12/31/23 16:55, David Laight wrote:
> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value in osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> }
>
> /*
You really like micro-optimization.
Anyway,
Reviewed-by: Waiman Long <longman@redhat.com>
On 12/31/23 23:14, Waiman Long wrote:
>
> On 12/31/23 16:55, David Laight wrote:
>> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
>> This requires the cpu number be 64bit.
>> However the value in osq_lock() comes from a 32bit xchg() and there
>> isn't a way of telling gcc the high bits are zero (they are) so
>> there will always be an instruction to clear the high bits.
>>
>> The cpu number is also offset by one (to make the initialiser 0)
>> It seems to be impossible to get gcc to convert
>> __per_cpu_offset[cpu_p1 - 1]
>> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
>> address).
>>
>> Converting the cpu number to 32bit unsigned prior to the decrement means
>> that gcc knows the decrement has set the high bits to zero and doesn't
>> add a register-register move (or cltq) to zero/sign extend the value.
>>
>> Not massive but saves two instructions.
>>
>> Signed-off-by: David Laight <david.laight@aculab.com>
>> ---
>> kernel/locking/osq_lock.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 35bb99e96697..37a4fa872989 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
>> return cpu_nr + 1;
>> }
>> -static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
>> encoded_cpu_val)
>> {
>> - int cpu_nr = encoded_cpu_val - 1;
>> -
>> - return per_cpu_ptr(&osq_node, cpu_nr);
>> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>> }
>> /*
>
> You really like micro-optimization.
>
> Anyway,
>
> Reviewed-by: Waiman Long <longman@redhat.com>
>
David,
Could you respin the series based on the latest upstream code?
Thanks,
Longman
From: Waiman Long
> Sent: 03 May 2024 17:00
...
> David,
>
> Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly, and diffing the new patches against the old ones gives no
differences. So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changes either.
(Apart from a very local variable I used to keep a patch isolated.)

David
On 5/3/24 17:10, David Laight wrote:
> From: Waiman Long
>> Sent: 03 May 2024 17:00
> ...
>> David,
>>
>> Could you respin the series based on the latest upstream code?
> I've just reapplied the patches to 'master' and they all apply
> cleanly and diffing the new patches against the old ones gives no differences.
> So I think they should still apply.
>
> Were you seeing a specific problem?
>
> I don't remember any suggested changes either.
> (Apart from a very local variable I used to keep a patch isolated.)

No, I just want to make sure that your patches will still apply. Anyway,
it will be easier for the maintainer to merge your remaining patches if
you can send out a new version, even if they are almost the same as the
old ones.

Thanks,
Longman
From: Waiman Long
> Sent: 03 May 2024 23:14
>
> On 5/3/24 17:10, David Laight wrote:
> > From: Waiman Long
> >> Sent: 03 May 2024 17:00
> > ...
> >> David,
> >>
> >> Could you respin the series based on the latest upstream code?
> > I've just reapplied the patches to 'master' and they all apply
> > cleanly and diffing the new patches against the old ones gives no differences.
> > So I think they should still apply.
> >
> > Were you seeing a specific problem?
> >
> > I don't remember any suggested changes either.
> > (Apart from a very local variable I used to keep a patch isolated.)
>
> No, I just want to make sure that your patches will still apply. Anyway,
> it will be easier for the maintainer to merge your remaining patches if
> you can send out a new version even if they are almost the same as the
> old ones.

I don't think any changes are needed, so the existing versions are fine.
They applied (well, my copy of what I think I sent applied) and built,
so there shouldn't be any issues.

David
From: Waiman Long
> Sent: 03 May 2024 17:00
> To: David Laight <David.Laight@ACULAB.COM>; 'linux-kernel@vger.kernel.org' <linux-
> kernel@vger.kernel.org>; 'peterz@infradead.org' <peterz@infradead.org>
> Cc: 'mingo@redhat.com' <mingo@redhat.com>; 'will@kernel.org' <will@kernel.org>; 'boqun.feng@gmail.com'
> <boqun.feng@gmail.com>; 'Linus Torvalds' <torvalds@linux-foundation.org>; 'virtualization@lists.linux-
> foundation.org' <virtualization@lists.linux-foundation.org>; 'Zeng Heng' <zengheng4@huawei.com>
> Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
>
>
> On 12/31/23 23:14, Waiman Long wrote:
> >
> > On 12/31/23 16:55, David Laight wrote:
> >> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> >> This requires the cpu number be 64bit.
> >> However the value in osq_lock() comes from a 32bit xchg() and there
> >> isn't a way of telling gcc the high bits are zero (they are) so
> >> there will always be an instruction to clear the high bits.
> >>
> >> The cpu number is also offset by one (to make the initialiser 0)
> >> It seems to be impossible to get gcc to convert
> >> __per_cpu_offset[cpu_p1 - 1]
> >> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
> >> address).
> >>
> >> Converting the cpu number to 32bit unsigned prior to the decrement means
> >> that gcc knows the decrement has set the high bits to zero and doesn't
> >> add a register-register move (or cltq) to zero/sign extend the value.
> >>
> >> Not massive but saves two instructions.
> >>
> >> Signed-off-by: David Laight <david.laight@aculab.com>
> >> ---
> >> kernel/locking/osq_lock.c | 6 ++----
> >> 1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 35bb99e96697..37a4fa872989 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> >> return cpu_nr + 1;
> >> }
> >> -static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
> >> encoded_cpu_val)
> >> {
> >> - int cpu_nr = encoded_cpu_val - 1;
> >> -
> >> - return per_cpu_ptr(&osq_node, cpu_nr);
> >> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> >> }
> >> /*
> >
> > You really like micro-optimization.
> >
> > Anyway,
> >
> > Reviewed-by: Waiman Long <longman@redhat.com>
> >
> David,
>
> Could you respin the series based on the latest upstream code?
Looks like a wet bank holiday weekend.....
David
From: Waiman Long
> Sent: 01 January 2024 04:14
...
> You really like micro-optimization.

They all add up :-)

David