[PATCH 0/5] powerpc: Implement masked user access
Posted by Christophe Leroy 3 months, 2 weeks ago
Masked user access avoids the address/size verification by access_ok().
Although its main purpose is to avoid speculation in the
verification of user address and size, and hence the need for
speculation mitigation, it also has the advantage of reducing the
number of instructions needed, so it also benefits platforms that
don't need speculation mitigation, especially when the size of the
copy is not known at build time.

Unlike x86_64, which masks the address to 'all bits set' when the
user address is invalid, here the address is set to an address in
the gap. That avoids relying on the zero page to catch offset
accesses. On book3s/32 it makes sure the opened window remains on the
user segment. The extra cost is a single instruction in the masking.
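
For illustration, the masking boils down to something like this (a
hedged sketch with a hypothetical helper name, not the exact patch
code):

	/* Sketch only: an invalid user pointer is redirected to an
	 * address in the gap between user and kernel space, which is
	 * guaranteed to fault; the real implementation does this
	 * without a conditional branch. */
	static inline void __user *mask_user_ptr(const void __user *ptr)
	{
		if ((unsigned long)ptr > TASK_SIZE)
			ptr = (const void __user *)TASK_SIZE;
		return (void __user *)ptr;
	}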

First patch adds masked_user_read_access_begin() and
masked_user_write_access_begin() to match user_read_access_end()
and user_write_access_end().

Second patch adds speculation barrier to copy_from_user_iter() so that
the barrier in powerpc raw_copy_from_user() which is redundant with
the one in copy_from_user() can be removed.

Third patch removes the redundant barrier_nospec() in
raw_copy_from_user().

Fourth patch removes the unused size parameter when enabling/disabling
user access.

Last patch implements masked user access.

Christophe Leroy (5):
  uaccess: Add masked_user_{read/write}_access_begin
  uaccess: Add speculation barrier to copy_from_user_iter()
  powerpc: Remove unused size parameter to KUAP enabling/disabling
    functions
  powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
  powerpc: Implement masked user access

 arch/powerpc/Kconfig                         |   2 +-
 arch/powerpc/include/asm/book3s/32/kup.h     |   2 +-
 arch/powerpc/include/asm/book3s/64/kup.h     |   4 +-
 arch/powerpc/include/asm/kup.h               |  24 ++--
 arch/powerpc/include/asm/nohash/32/kup-8xx.h |   2 +-
 arch/powerpc/include/asm/nohash/kup-booke.h  |   2 +-
 arch/powerpc/include/asm/uaccess.h           | 140 ++++++++++++++++---
 fs/select.c                                  |   2 +-
 include/linux/uaccess.h                      |   8 ++
 kernel/futex/futex.h                         |   4 +-
 lib/iov_iter.c                               |   7 +
 lib/strncpy_from_user.c                      |   2 +-
 lib/strnlen_user.c                           |   2 +-
 13 files changed, 158 insertions(+), 43 deletions(-)

-- 
2.49.0
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Sun, 22 Jun 2025 11:52:38 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> Masked user access avoids the address/size verification by access_ok().
> Although its main purpose is to avoid speculation in the
> verification of user address and size, and hence the need for
> speculation mitigation, it also has the advantage of reducing the
> number of instructions needed, so it also benefits platforms that
> don't need speculation mitigation, especially when the size of the
> copy is not known at build time.

It also removes a conditional branch that is quite likely to be
statically predicted 'the wrong way'.

> Unlike x86_64, which masks the address to 'all bits set' when the
> user address is invalid, here the address is set to an address in
> the gap. That avoids relying on the zero page to catch offset
> accesses. On book3s/32 it makes sure the opened window remains on the
> user segment. The extra cost is a single instruction in the masking.

That isn't true (any more).
Linus changed the check to (approx):
	if (uaddr > TASK_SIZE)
		uaddr = TASK_SIZE;
(Implemented with a conditional move)
Replacing the original version that used 'cmp; sbb; or' to get 'all bits set'.
Quite likely the comments are wrong!

I thought there was a second architecture that implemented it - and might
still set ~0u?
As you noted returning 'TASK_SIZE' (or, at least, the base of a page that
is guaranteed to fault) means that the caller only has to do 'reasonably
sequential' accesses, and not guarantee to read offset zero first.

As a separate patch, provided there is a guard page between user and kernel,
and user accesses are 'reasonably sequential', even access_ok() need not
check the transfer length. Linus wasn't that brave :-)
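
In code it would reduce to something like this (a hedged sketch under
those assumptions):

	/* Sketch only: with a guard gap above TASK_SIZE and 'reasonably
	 * sequential' accesses, checking the base address alone is
	 * enough - the transfer length need not be checked. */
	#define access_ok_base_only(addr) \
		((unsigned long)(addr) <= TASK_SIZE)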

I think some of the 'API' is still based on the original 386 code where
the page tables had to be checked by hand for CoW.

	David

> 
> First patch adds masked_user_read_access_begin() and
> masked_user_write_access_begin() to match user_read_access_end()
> and user_write_access_end().
> 
> Second patch adds speculation barrier to copy_from_user_iter() so that
> the barrier in powerpc raw_copy_from_user() which is redundant with
> the one in copy_from_user() can be removed.
> 
> Third patch removes the redundant barrier_nospec() in
> raw_copy_from_user().
> 
> Fourth patch removes the unused size parameter when enabling/disabling
> user access.
> 
> Last patch implements masked user access.
> 
> Christophe Leroy (5):
>   uaccess: Add masked_user_{read/write}_access_begin
>   uaccess: Add speculation barrier to copy_from_user_iter()
>   powerpc: Remove unused size parameter to KUAP enabling/disabling
>     functions
>   powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
>   powerpc: Implement masked user access
> 
>  arch/powerpc/Kconfig                         |   2 +-
>  arch/powerpc/include/asm/book3s/32/kup.h     |   2 +-
>  arch/powerpc/include/asm/book3s/64/kup.h     |   4 +-
>  arch/powerpc/include/asm/kup.h               |  24 ++--
>  arch/powerpc/include/asm/nohash/32/kup-8xx.h |   2 +-
>  arch/powerpc/include/asm/nohash/kup-booke.h  |   2 +-
>  arch/powerpc/include/asm/uaccess.h           | 140 ++++++++++++++++---
>  fs/select.c                                  |   2 +-
>  include/linux/uaccess.h                      |   8 ++
>  kernel/futex/futex.h                         |   4 +-
>  lib/iov_iter.c                               |   7 +
>  lib/strncpy_from_user.c                      |   2 +-
>  lib/strnlen_user.c                           |   2 +-
>  13 files changed, 158 insertions(+), 43 deletions(-)
>
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Christophe Leroy 3 months, 2 weeks ago

On 22/06/2025 at 18:20, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:38 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> 
>> Masked user access avoids the address/size verification by access_ok().
>> Although its main purpose is to avoid speculation in the
>> verification of user address and size, and hence the need for
>> speculation mitigation, it also has the advantage of reducing the
>> number of instructions needed, so it also benefits platforms that
>> don't need speculation mitigation, especially when the size of the
>> copy is not known at build time.
> 
> It also removes a conditional branch that is quite likely to be
> statically predicted 'the wrong way'.

But include/asm-generic/access_ok.h defines access_ok() as:

	#define access_ok(addr, size) likely(__access_ok(addr, size))

So GCC uses the 'unlikely' variant of the branch instruction to force
the correct prediction, doesn't it?

> 
>> Unlike x86_64, which masks the address to 'all bits set' when the
>> user address is invalid, here the address is set to an address in
>> the gap. That avoids relying on the zero page to catch offset
>> accesses. On book3s/32 it makes sure the opened window remains on the
>> user segment. The extra cost is a single instruction in the masking.
> 
> That isn't true (any more).
> Linus changed the check to (approx):
> 	if (uaddr > TASK_SIZE)
> 		uaddr = TASK_SIZE;
> (Implemented with a conditional move)

Ah ok, I overlooked that. I didn't know the cmove instruction; it
seems similar to the isel instruction on powerpc e500.

Christophe

Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> Ah ok, I overlooked that. I didn't know the cmove instruction; it
> seems similar to the isel instruction on powerpc e500.

cmove does a move (register or memory) when some condition is true.
isel (which is base PowerPC, not something "e500" only) is a
computational instruction, it copies one of two registers to a third,
which of the two is decided by any bit in the condition register.

But sure, seen from very far off both isel and cmove can be used to
implement the ternary operator ("?:"), so they are similar in that way :-)
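
For instance, a ternary like the one below can compile to isel instead
of a conditional branch on cores that have it (a hedged example; the
exact code generation depends on -mcpu / -misel):

	/* May compile to "cmpld; iselgt" - no branch to mispredict. */
	unsigned long clamp_addr(unsigned long addr, unsigned long limit)
	{
		return addr > limit ? limit : addr;
	}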


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Tue, 24 Jun 2025 08:17:14 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > seems similar to the isel instruction on powerpc e500.
> 
> cmove does a move (register or memory) when some condition is true.

The destination of x86 'cmov' is always a register (only the source can be
memory - and is probably always read).
It is also a computational instruction.
It may well always do the register write - hard to detect.

There is a planned new instruction that would do a conditional write
to memory - but not on any cpu yet.

> isel (which is base PowerPC, not something "e500" only) is a
> computational instruction, it copies one of two registers to a third,
> which of the two is decided by any bit in the condition register.

Does that mean it could be used for all the ppc cpu variants?

> But sure, seen from very far off both isel and cmove can be used to
> implement the ternary operator ("?:"), are similar in that way :-)

Which is exactly what you want to avoid speculation.

	David

> 
> 
> Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
Hi!

On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 08:17:14 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> 
> > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > > seems similar to the isel instruction on powerpc e500.
> > 
> > cmove does a move (register or memory) when some condition is true.
> 
> The destination of x86 'cmov' is always a register (only the source can be
> memory - and is probably always read).

Both source operands can be mem, right?  But probably not both at the
same time.

> It is also a computational instruction.

Terminology...

x86 is not a RISC architecture, or more generally, a load/store
architecture.

A computational instruction is one that doesn't touch memory, do a
branch, or perform some system function (some supervisor or hypervisor
instruction, maybe).

x86 does not have many computational insns, most insns can touch
memory :-)

(The important thing is that most computational insns do not ever cause
exceptions, the only exceptions are if you divide by zero or
similar :-) )

> It may well always do the register write - hard to detect.
> 
> There is a planned new instruction that would do a conditional write
> to memory - but not on any cpu yet.

Interesting!  Instructions like the atomic store insns we got for p9,
maybe?  They can do minimum/maximum and various kinds of more generic
reductions and similar.

> > isel (which is base PowerPC, not something "e500" only) is a
> > computational instruction, it copies one of two registers to a third,
> > which of the two is decided by any bit in the condition register.
> 
> Does that mean it could be used for all the ppc cpu variants?

No, only things that implement architecture version 2.03 or later.
That is from 2006, so essentially everything that is still made
implements it :-)

But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
do not have it yet either, but the newer ones do.

And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)

> > But sure, seen from very far off both isel and cmove can be used to
> > implement the ternary operator ("?:"), are similar in that way :-)
> 
> Which is exactly what you want to avoid speculation.

There are cheaper / simpler / more effective / better ways to get that,
but sure, everything is better than a conditional branch, always :-)


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Tue, 24 Jun 2025 13:25:05 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> Hi!
> 
> On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > On Tue, 24 Jun 2025 08:17:14 -0500
> > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >   
> > > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:  
> > > > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > > > seems similar to the isel instruction on powerpc e500.
> > > 
> > > cmove does a move (register or memory) when some condition is true.  
> > 
> > The destination of x86 'cmov' is always a register (only the source can be
> > memory - and is probably always read).  
> 
> Both source operands can be mem, right?  But probably not both at the
> same time.

It only has one 'real' source, but the implementation could easily
read the destination register and then decide which value to write
back - rather than doing a conditional write to the register file.
A conditional write would be a right PITA for the alu result
forwarding logic


> 
> > It is also a computational instruction.
> 
> Terminology...
> 
> x86 is not a RISC architecture, or more generally, a load/store
> architecture.

It sort of is these days.
The memory transfers are separate u-ops, so a 'reg += mem' instruction
is split into two by the decoder.
Although some u-ops get merged together and executed in one clock -
an obvious example is some 'compare+branch' pairs.

> A computational instruction is one that doesn't touch memory, do a
> branch, or perform some system function (some supervisor or hypervisor
> instruction, maybe).
> 
> x86 does not have many computational insns, most insns can touch
> memory :-)

Except that the memory 'bit' is executed separately from any alu 'stuff'.
So for a 'reg += mem' instruction the memory read can be started as soon
as the registers that contain the address are valid; the 'add' requires
the memory read to have completed and the instruction that generated the
old value of 'reg' to have completed - which could be waiting on all sorts
of things (like a divide). Once both values are ready the 'add' can be
executed (provided a suitable alu is available).

 
> (The important thing is that most computational insns do not ever cause
> exceptions, the only exceptions are if you divide by zero or
> similar :-) )
> 
> > It may well always do the register write - hard to detect.
> > 
> > There is a planned new instruction that would do a conditional write
> > to memory - but not on any cpu yet.  
> 
> Interesting!  Instructions like the atomic store insns we got for p9,
> maybe?  They can do minimum/maximum and various kinds of more generic
> reductions and similar.

I think they are only conditional stores.
But they do save a conditional branch.
A late disable of a memory write is far less problematic than a disabled
register file write. No one minds (too much) about slight delays between
writes and reads of the same location (reduced by a store to load forwarder)
but you don't want to lose clocks between adjacent simple alu instructions.

For my sins I re-implemented a soft cpu last year...
Which doesn't have a 'cmov' :-(

> 
> > > isel (which is base PowerPC, not something "e500" only) is a
> > > computational instruction, it copies one of two registers to a third,
> > > which of the two is decided by any bit in the condition register.  
> > 
> > Does that mean it could be used for all the ppc cpu variants?  
> 
> No, only things that implement architecture version 2.03 or later.
> That is from 2006, so essentially everything that is still made
> implements it :-)
> 
> But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
> have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
> do not have it yet either, but the newer ones do.
> 
> And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)

That makes more sense than the list in patch 5/5.

> 
> > > But sure, seen from very far off both isel and cmove can be used to
> > > implement the ternary operator ("?:"), are similar in that way :-)  
> > 
> > Which is exactly what you want to avoid speculation.  
> 
> There are cheaper / simpler / more effective / better ways to get that,
> but sure, everything is better than a conditional branch, always :-)

Everything except a TLB miss :-)

And for access_ok() avoiding the conditional is a good enough reason
to use a 'conditional move' instruction.
Avoiding speculation is actually free.

> 
> 
> Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
Hi!

On Tue, Jun 24, 2025 at 10:08:16PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > > On Tue, 24 Jun 2025 08:17:14 -0500
> > > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > >   
> > > > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:  
> > > > > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > > > > seems similar to the isel instruction on powerpc e500.
> > > > 
> > > > cmove does a move (register or memory) when some condition is true.  
> > > 
> > > The destination of x86 'cmov' is always a register (only the source can be
> > > memory - and is probably always read).  
> > 
> > Both source operands can be mem, right?  But probably not both at the
> > same time.
> 
> It only has one 'real' source, but the implementation could easily
> read the destination register and then decide which value to write
> back - rather than doing a conditional write to the register file.

Yeah, in x86 many (most insns?) can read any reg that they write.  Not
a great design, but heh.

> A conditional write would be a right PITA for the alu result
> forwarding logic

Depends.  An implementation can always do the register forwarding etc.,
just annul the actual store where appropriate (and not put it in the
various store queues either, heh -- annul all the effects of the store).

> > x86 is not a RISC architecture, or more generally, a load/store
> > architecture.
> 
> It sort of is these days.

Not at all.  Most *implementations* are, the uarchs, but the
architecture (which determines the required visible semantics) is not.
That impedance difference is quite painful, yes, for code generation
more than for the processor implementation even -- as usual the
compilers have to save the day!

> The memory transfers are separate u-ops, so a 'reg += mem' instruction
> is split into two by the decoder.

Yup.  Very expensive.  Both for the implementation, and for the
performance of eventual code running on it.

> Although some u-ops get merged together and executed in one clock -
> an obvious example is some 'compare+branch' pairs.

On many other architectures such things run in 0 cycles anyway :-)

> > A computational instruction is one that doesn't touch memory, do a
> > branch, or perform some system function (some supervisor or hypervisor
> > instruction, maybe).
> > 
> > x86 does not have many computational insns, most insns can touch
> > memory :-)
> 
> Except that the memory 'bit' is executed separately from any alu 'stuff'.

On many uarchs, yes.  But not in the arch.  No uarch can decide to just
not implement these difficult and expensive insns :-)

> > > There is a planned new instruction that would do a conditional write
> > > to memory - but not on any cpu yet.  
> > 
> > Interesting!  Instructions like the atomic store insns we got for p9,
> > maybe?  They can do minimum/maximum and various kinds of more generic
> > reductions and similar.
> 
> I think they are only conditional stores.
> But they do save a conditional branch.

Yeah, but those are not ever executed *anyway*, there is branch
prediction and we require that to be pretty good to get reasonable
performance anyway.

A branch around the store insns is just fine if it can be predicted
correctly.  If it cannot be predicted correctly, you can do the store
always, just have the address that is stored to depend on the condition
(such that the data is stored to some dummy memory if it "should not be
done").  Such a transform is already done manually in source code in the
performance critical paths, not infrequently.
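
A hedged sketch of that transform:

	/* Sketch only: make the store unconditional and let the address
	 * depend on the condition, so there is no branch to mispredict. */
	static long dummy_sink;	/* dummy memory for the "do not store" case */

	static inline void store_if(long *dst, long val, int cond)
	{
		long *p = cond ? dst : &dummy_sink;	/* cmov/isel material */
		*p = val;				/* the store always happens */
	}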

GCC does not currently do such a transformation AFAIK, but it is a
pretty basic thing to do.  Conditional stores are not often written in
source code programs, or there would probably be an implementation for
this already :-)

> A late disable of a memory write is far less problematic than a disabled
> register file write. No one minds (too much) about slight delays between
> writes and reads of the same location (reduced by a store to load forwarder)
> but you don't want to lose clocks between adjacent simple alu instructions.

Writes to memory take tens of cycles *anyway*, but all of that is hidden
by the various memory load and store queues (which let you do forwarding
in just a few cycles).

> For my sins I re-implemented a soft cpu last year...

Ouch :-)  But it was fun to do I hope?

> Which doesn't have a 'cmov' :-(

The x86 flag register bits are so limited and complicated in the first
place, cmov is the easier part there ;-) 

> > But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
> > have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
> > do not have it yet either, but the newer ones do.
> > 
> > And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)
> 
> That makes more sense than the list in patch 5/5.

Not sure what list that is.  I'll find it :-)

> > > > But sure, seen from very far off both isel and cmove can be used to
> > > > implement the ternary operator ("?:"), are similar in that way :-)  
> > > 
> > > Which is exactly what you want to avoid speculation.  
> > 
> > There are cheaper / simpler / more effective / better ways to get that,
> > but sure, everything is better than a conditional branch, always :-)
> 
> Everything except a TLB miss :-)

Heh.  TLBs are just a tiny part of translation on Power.  We mostly care
about the ERATs.  Look it up, if you want to be introduced to another
level of pain :-)

> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.

*Assuming* that avoiding speculation is actually free, you mean?


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Christophe Leroy 3 months, 2 weeks ago

On 24/06/2025 at 23:08, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> 
>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>> computational instruction, it copies one of two registers to a third,
>>>> which of the two is decided by any bit in the condition register.
>>>
>>> Does that mean it could be used for all the ppc cpu variants?
>>
>> No, only things that implement architecture version 2.03 or later.
>> That is from 2006, so essentially everything that is still made
>> implements it :-)
>>
>> But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
>> have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
>> do not have it yet either, but the newer ones do.

For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10

>>
>> And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)
> 
> That makes more sense than the list in patch 5/5.

Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32, 
and as far as I know the only powerpc/32 supported by Linux that has 
isel is the 85xx which has an e500 core.

For powerpc/64 we have fewer constraints than on powerpc/32:
- Kernel memory starts at 0xc000000000000000
- User memory stops at 0x0010000000000000

So we can easily use 0x8000000000000000 as the demarcation line, which
allows a 3-instruction sequence:

	sradi	r9,r3,63   => shift right algebraic: r9 = 0 when r3 >= 0; r9 = -1 when r3 < 0
	andc	r3,r3,r9   => and with complement: r3 unchanged when r9 = 0; r3 = 0 when r9 = -1
	rldimi	r3,r9,0,1  => rotate left and mask insert: insert highest bit of r9 into r3
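
In C terms the sequence computes roughly this (a hedged sketch):

	/* Sketch only: an address with the top bit set is redirected to
	 * the 0x8000000000000000 demarcation line inside the gap. */
	static inline unsigned long mask_addr64(unsigned long addr)
	{
		long m = (long)addr >> 63;		/* sradi: 0 or -1 */
		addr &= ~m;				/* andc: clear when invalid */
		addr |= m & 0x8000000000000000UL;	/* rldimi: put top bit back */
		return addr;
	}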

Whereas using isel requires a 4-instruction sequence:

	li	r9, 1		=> load immediate: r9 = 1
	rotldi	r9, r9, 63	=> rotate left: r9 = 0x8000000000000000
	cmpld	r3, r9		=> compare logically
	iselgt	r3, r9, r3	=> integer select greater-than: select r3 when the compare result is <=; select r9 when it is >

> 
>>
>>>> But sure, seen from very far off both isel and cmove can be used to
>>>> implement the ternary operator ("?:"), are similar in that way :-)
>>>
>>> Which is exactly what you want to avoid speculation.
>>
>> There are cheaper / simpler / more effective / better ways to get that,
>> but sure, everything is better than a conditional branch, always :-)
> 
> Everything except a TLB miss :-)
> 
> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.
> 

And on CPUs that are not affected by Spectre and Meltdown like powerpc 
8xx or powerpc 603, masking the pointer is still worth it as it 
generates better code, even if the masking involves branching.

That's the reason why I did:
- e500 (affected by Spectre v1) ==> use isel ==> 3-instruction sequence:

	lis	r9, TASK_SIZE@h	=> load immediate shifted: r9 = (TASK_SIZE >> 16) << 16
	cmplw	r3, r9		=> compare addr with TASK_SIZE
	iselgt	r3, r9, r3	=> addr = TASK_SIZE when addr > TASK_SIZE; addr unchanged when <= TASK_SIZE

For others:
- If TASK_SIZE <= 0x80000000 and kernel >= 0x80000000 ==> 3-instruction
sequence similar to the 64-bit one above:

	srawi	r9,r3,63
	andc	r3,r3,r9
	rlwimi	r3,r9,0,0,0

- Otherwise, if speculation mitigation is required (e600), use the
6-instruction masking sequence (r3 contains the address to mask):

	srwi	r9,r3,17	=> shift right: shift r3 by 17 to keep 15 bits (positive 16 bits)
	subfic	r9,r9,22527	=> subtract from immediate: r9 = (TASK_SIZE >> 17) - r9 => r9 < 0 when r3 is greater
	srawi	r9,r9,31	=> shift right algebraic: r9 = 0 when r9 >= 0; r9 = -1 when r9 < 0
	andis.	r10,r9,45056	=> and immediate shifted: r10 = r9 and upper part of TASK_SIZE => r10 = TASK_SIZE when r3 > TASK_SIZE, r10 = 0 otherwise
	andc	r3,r3,r9	=> and with complement: r3 unchanged when r9 = 0 (r3 <= TASK_SIZE); r3 cleared when r9 = -1 (r3 > TASK_SIZE)
	or	r3,r3,r10	=> r3 = r3 | r10
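
In C terms that is roughly the following (a hedged sketch, assuming
TASK_SIZE = 0xB0000000 so that the subfic immediate 22527 equals
(TASK_SIZE >> 17) - 1):

	/* Sketch only: branchless clamp of an out-of-range address to
	 * TASK_SIZE, mirroring the srwi/subfic/srawi/andis./andc/or
	 * sequence above; the unsigned wrap plus cast mirrors subfic. */
	static inline unsigned int mask_addr_e600(unsigned int addr)
	{
		int m = (int)(22527u - (addr >> 17)) >> 31;	/* 0 or -1 */
		return (addr & ~(unsigned int)m)
		     | ((unsigned int)m & 0xB0000000u);		/* TASK_SIZE */
	}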


- If no speculation mitigation is required, let GCC do as it wants

List of affected CPUs is available here: 
https://docs.nxp.com/bundle/GUID-805CC0EA-4001-47AD-86CD-4F340751F6B7/page/GUID-17B5D04F-6471-4EC6-BEB9-DE4D0AFA034A.html
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> On 24/06/2025 at 23:08, David Laight wrote:
> >On Tue, 24 Jun 2025 13:25:05 -0500
> >Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >>>>isel (which is base PowerPC, not something "e500" only) is a
> >>>>computational instruction, it copies one of two registers to a third,
> >>>>which of the two is decided by any bit in the condition register.
> >>>
> >>>Does that mean it could be used for all the ppc cpu variants?
> >>
> >>No, only things that implement architecture version 2.03 or later.
> >>That is from 2006, so essentially everything that is still made
> >>implements it :-)
> >>
> >>But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
> >>have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
> >>do not have it yet either, but the newer ones do.
> 
> For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10

I have no idea what "book3s64" means.

Some ancient Power architecture versions had something called
"Book III-S", which was juxtaposed to "Book III-E", which essentially
corresponds to the old aborted BookE stuff.

I guess you mean almost all non-FSL implementations?  Most of those
support the isel insns.  Like, Power5+ (GS).  And everything after that.

I have no idea why you think power9 has it while older CPUS do not.  In
the GCC source code we have this comment:
  /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
     altivec is a win so enable it.  */
and in fact we do not enable it for ISA 2.06 (p8) either, probably for
a similar reason.

> >>And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)
> >
> >That makes more sense than the list in patch 5/5.
> 
> Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32, 
> and as far as I know the only powerpc/32 supported by Linux that has 
> isel is the 85xx which has an e500 core.

What is "powerpc/32"?  It does not help if you use different names from
what everyone else does.

The name "powerpc32" is sometimes used colloquially to mean PowerPC code
running in SF=0 mode (MSR[SF]=0), but perhaps more often it is used for
32-bit only implementations (so, those that do not even have that bit:
it's bit 0 in the 64-bit MSR, so all implementations that have an only
32-bit MSR, for example).

> For powerpc/64 we have fewer constraints than on powerpc/32:
> - Kernel memory starts at 0xc000000000000000
> - User memory stops at 0x0010000000000000

That isn't true, not even if you mean some existing name.  Usually
userspace code is mapped at 256MB (0x10000000).  On powerpc64-linux
anyway, different default on different ABIs of course :-)

> >And for access_ok() avoiding the conditional is a good enough reason
> >to use a 'conditional move' instruction.
> >Avoiding speculation is actually free.
> 
> And on CPUs that are not affected by Spectre and Meltdown like powerpc 
> 8xx or powerpc 603,

Erm.


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months ago
On Thu, 26 Jun 2025 17:01:48 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
...
> I have no idea why you think power9 has it while older CPUS do not.  In
> the GCC source code we have this comment:
>   /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
>      altivec is a win so enable it.  */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably for
> a similar reason.

Odd, I'd have thought that replacing a conditional branch with a
conditional move would pretty much always be a win.
Unless, of course, you only consider benchmark loops where the
branch predictor is 100% accurate.

OTOH isn't altivec a set of 'simd' instructions?
They pretty much only help for loops with lots of iterations.
I don't know about ppc, but I've seen gcc make a real 'pig's breakfast'
of loop vectorisation on x86.

For the linux kernel, which (as Linus keeps reminding people) tends
to run 'cold cache', you probably want conditional moves in order
to avoid mis-predicted branches and non-linear execution, but
don't want loop vectorisation because the setup and end cases
cost too much compared to the gain for each iteration.

	David
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months ago
Hi!

On Sat, Jul 05, 2025 at 07:33:32PM +0100, David Laight wrote:
> On Thu, 26 Jun 2025 17:01:48 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> ...
> > I have no idea why you think power9 has it while older CPUS do not.  In
> > the GCC source code we have this comment:
> >   /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> >      altivec is a win so enable it.  */
> > and in fact we do not enable it for ISA 2.06 (p8) either, probably for

2.07 I meant of course.  Sigh.

> > a similar reason.
> 
> Odd, I'd have thought that replacing a conditional branch with a
> conditional move would pretty much always be a win.
> Unless, of course, you only consider benchmark loops where the
> branch predictor is 100% accurate.

The isel machine instruction is super expensive on p8: it is marked as
first in an instruction group, and has latency 5 for the GPR sources,
and 8 for the CR field source.

On p7 it wasn't great either, it was actually converted to a branch
sequence internally!

On p8 there are bc+8 optimisations done by the core as well, conditional
branches that skip one insn are faster than equivalent isel insns!

Since p9 it is a lot better :-)

> OTOH isn't altivec a set of 'simd' instructions?

AltiVec is the old motorola marketing name for what is called the
"Vector Facility" in the architecture, and which at IBM is still called
VMX, the name it was developed under ("Vector Multimedia Extension").

Since p7 (ISA 2.06, 2010) there also is the Vector-Scalar Extension
Facility, VSX, which adds another 32 vector registers, and the
traditional floating point registers are physically the same (but those
use only the first half of each vector reg).  Many new VSX instructions
can do simple floating point stuff on all 64 VSX registers, either just
on the first lane ("scalar") or on all lanes ("vector").

This does largely mean that all floating point is stored in IEEE DP
format internally (on older cores usually some close to 70-bit format
was used internally), which in olden times actually allowed the cores
to be made faster.  Only when storing a value to memory was it actually
converted to IEEE format (but of course it was always rounded correctly,
etc.)

> They pretty much only help for loops with lots of iterations.
> I don't know about ppc, but I've seen gcc make a real 'pig's breakfast'
> of loop vectorisation on x86.

For PowerPC (or Power, the more modern name) of course we also have our
fair share of problems with vectorisation.  It does help that we were
the first architecture used by GCC that had a serious Vector thing;
the C syntax extension for Vector literals is taken from the old
extensions in the AltiVec PIM but using curly brackets {} instead of
round brackets (), for example.

> For the linux kernel, which (as Linus keeps reminding people) tends
> to run 'cold cache', you probably want conditional moves in order
> to avoid mis-predicted branches and non-linear execution, but
> don't want loop vectorisation because the setup and end cases
> cost too much compared to the gain for each iteration.

You are best off using what GCC gives you, usually.  It is very well
tuned, both the generic and the machine-specific code :-)

The kernel of course disables all Vector and FP stuff, essentially it
disables use of any of the associated registers, and that's pretty much
the end of it ;-)

(The reason for that is that it would make task switches more expensive;
long ago all task switches, but nowadays still user<->kernel switches.)


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months ago
On Sat, 5 Jul 2025 15:15:57 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

...
> The isel machine instruction is super expensive on p8: it is marked as
> first in an instruction group, and has latency 5 for the GPR sources,
> and 8 for the CR field source.
> 
> On p7 it wasn't great either, it was actually converted to a branch
> sequence internally!

Ugg...

You'd think they'd add instructions that can be implemented.
It isn't as though isel is any harder than 'add with carry'.

Not that uncommon, IIRC amd added adox/adcx (add carry using the
overflow/carry flag and without changing any other flags) as very
slow instructions. Intel invented them without making jcxz (dec %cx
and jump non-zero) fast - so you can't (easily) put them in a loop.
Not to mention all the AVX512 fubars. 

Conditional move is more of a problem with a mips-like cpu where
alu ops read two registers and write a third.
You don't want to do a conditional write because it messes up
the decision of whether to forward the alu result to the following
instruction.
So I think you might need to do 'cmov odd/even' and read the LSB
from a third copy (or third read port) of the registers indexed
by what would normally be the 'output' register number.
Then tweak the register numbers early in the pipeline so that the
result goes to one of the 'input' registers rather than the normal
'output' one.
Not really that hard - could add to the cpu I did in 1/2 a day :-)

	David
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months ago
Hi!

On Sat, Jul 05, 2025 at 10:05:38PM +0100, David Laight wrote:
> On Sat, 5 Jul 2025 15:15:57 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> 
> ...
> > The isel machine instruction is super expensive on p8: it is marked as
> > first in an instruction group, and has latency 5 for the GPR sources,
> > and 8 for the CR field source.
> > 
> > On p7 it wasn't great either, it was actually converted to a branch
> > sequence internally!
> 
> Ugg...
> 
> You'd think they'd add instructions that can be implemented.
> It isn't as though isel is any harder than 'add with carry'.

It is though!  isel was the first instruction that takes both GPR inputs
and a CR field input.  We now have more, the ISA 3.0 (p9) setb insn,
and esp. the extremely useful ISA 3.1 (p10) set[n]bc[r] insns -- well,
those don't take any GPR inputs actually, but like isel their output is
a GPR :-)

> Not that uncommon, IIRC amd added adox/adcx (add carry using the
> overflow/carry flag and without changing any other flags) as very

We have a similar "addex" insn since p9, which allows using the OV
bit instead of the CA bit, and prepares to allow an extra three possible
bits as carry bits, too.  Using it you can run multiple carry chains
in parallel using insns very close to the traditional stuff.

The compiler still doesn't ever generate this, it is mostly useful
for handcoded assembler routines.

The carry bits are stored in the XER register, the "fixed-point
exception register", while results from comparison instructions are
stored in the CR, which holds eight four-bit CR fields, which you can
use in conditional jumps, or in isel and the like, or in the crlogical
insns (which can do any logic function on two CR field inputs and store
in a third, just like the logical insns on GPRs that also have the full
complement of 14 two-source functions).

> slow instructions. Intel invented them without making jcxz (jump if
> %cx is zero) fast - so you can't (easily) put them in a loop.
> Not to mention all the AVX512 fubars. 

Sounds lovely :-)

> Conditional move is more of a problem with a mips-like cpu where
> alu ops read two registers and write a third.

Like most Power implementations as well.

> You don't want to do a conditional write because it messes up
> the decision of whether to forward the alu result to the following
> instruction.
> So I think you might need to do 'cmov odd/even' and read the LSB
> from a third copy (or third read port) of the registers indexed
> by what would normally be the 'output' register number.
> Then tweak the register numbers early in the pipeline so that the
> result goes to one of the 'input' registers rather than the normal
> 'output' one.
> Not really that hard - could add to the cpu I did in 1/2 a day :-)

On p9 and later both GPR (or constant) inputs are fed into the execution
unit as well as some CR bit, and it just writes to a GPR.  Easier for
the hardware, easier for the compiler, and easier for the programmer!
Win-win-win.  The kind of tradeoffs I like best :-)


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Christophe Leroy 3 months ago

On 27/06/2025 at 00:01, Segher Boessenkool wrote:
> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
>> On 24/06/2025 at 23:08, David Laight wrote:
>>> On Tue, 24 Jun 2025 13:25:05 -0500
>>> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>>>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>>>> computational instruction, it copies one of two registers to a third,
>>>>>> which of the two is decided by any bit in the condition register.
>>>>>
>>>>> Does that mean it could be used for all the ppc cpu variants?
>>>>
>>>> No, only things that implement architecture version 2.03 or later.
>>>> That is from 2006, so essentially everything that is still made
>>>> implements it :-)
>>>>
>>>> But ancient things do not.  Both 970 (Apple G5) and Cell BE do not yet
>>>> have it (they are ISA 2.01 and 2.02 respectively).  And the older p5's
>>>> do not have it yet either, but the newer ones do.
>>
>> For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10
> 
> I have no idea what "book3s64" means.

Well, that's the name given in the Linux kernel to the 64-bit Power CPU
processors. See commit subjects:

f5164797284d book3s64/radix : Optimize vmemmap start alignment
58450938f771 book3s64/radix : Handle error conditions properly in 
radix_vmemmap_populate
9cf7e13fecba book3s64/radix : Align section vmemmap start address to 
PAGE_SIZE
29bdc1f1c1df book3s64/radix: Fix compile errors when 
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=n
d629d7a8efc3 powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump 
is active
5959ffabbb67 arch/powerpc: teach book3s64 
arch_get_unmapped_area{_topdown} to handle hugetlb mappings
8846d9683884 book3s64/hash: Early detect debug_pagealloc size requirement
76b7d6463fc5 book3s64/hash: Disable kfence if not early init
b5fbf7e2c6a4 book3s64/radix: Refactoring common kfence related functions
8fec58f503b2 book3s64/hash: Add kfence functionality
47dd2e63d42a book3s64/hash: Disable debug_pagealloc if it requires more 
memory
...

> 
> Some ancient Power architecture versions had something called
> "Book III-S", which was juxtaposed to "Book III-E", which essentially
> corresponds to the old aborted BookE stuff.
> 
> I guess you mean almost all non-FSL implementations?  Most of those
> support the isel insns.  Like, Power5+ (GS).  And everything after that.
> 
> I have no idea why you think power9 has it while older CPUS do not.  In
> the GCC source code we have this comment:

I think nothing, I just observed that GCC doesn't use it unless you tell
it the CPU is power9 or power10. But OK, the comment explains why.

>    /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
>       altivec is a win so enable it.  */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably for
> a similar reason.
> 
>>>> And all classic PowerPC is ISA 1.xx of course.  Medieval CPUs :-)
>>>
>>> That makes more sense than the list in patch 5/5.
>>
>> Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32,
>> and as far as I know the only powerpc/32 supported by Linux that has
>> isel is the 85xx which has an e500 core.
> 
> What is "powerpc/32"?  It does not help if you use different names from
> what everyone else does.

Again, that's what it is called in the Linux kernel; see the commit
subjects below:

$ git log --oneline arch/powerpc/ | grep powerpc/32
2bf3caa7cc3b powerpc/32: Stop printing Kernel virtual memory layout
2a17a5bebc9a powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
dca5b1d69aea powerpc/32: Implement validation of emergency stack
2f2b9a3adc66 powerpc/32s: Reduce default size of module/execmem area
5799cd765fea powerpc/32: Convert patch_instruction() to patch_uint()
6035e7e35482 powerpc/32: Curb objtool unannotated intra-function call 
warning
b72c066ba85a powerpc/32: fix ADB_CUDA kconfig warning
cb615bbe5526 powerpc/32: fix ADB_CUDA kconfig warning
c8a1634145c2 powerpc/32: Drop unused grackle_set_stg()
aad26d3b6af1 powerpc/32s: Implement local_flush_tlb_page_psize()
bac4cffc7c4a powerpc/32s: Introduce _PAGE_READ and remove _PAGE_USER
46ebef51fd92 powerpc/32s: Add _PAGE_WRITE to supplement _PAGE_RW
f84b727d132c powerpc/32: Enable POWER_RESET in pmac32_defconfig
a3ef2fef198c powerpc/32: Add dependencies of POWER_RESET for pmac32
7cb0094be4a5 powerpc/32s: Cleanup the mess in __set_pte_at()
6958ad05d578 powerpc/32: Rearrange _switch to prepare for 32/64 merge
fc8562c9b69a powerpc/32: Remove sync from _switch
...

It means everything built with CONFIG_PPC32

> 
> The name "powerpc32" is sometimes used colloquially to mean PowerPC code
> running in SF=0 mode (MSR[SF]=0), but perhaps more often it is used for
> 32-bit only implementations (so, those that do not even have that bit:
> it's bit 0 in the 64-bit MSR, so all implementations that have an only
> 32-bit MSR, for example).
> 
>> For powerpc/64 we have fewer constraints than on powerpc/32:
>> - Kernel memory starts at 0xc000000000000000
>> - User memory stops at 0x0010000000000000
> 
> That isn't true, not even if you mean some existing name.  Usually
> userspace code is mapped at 256MB (0x10000000).  On powerpc64-linux
> anyway, different default on different ABIs of course :-)

0x10000000 is below 0x0010000000000000, isn't it? So why isn't it true?

On 64-bit powerpc, everything related to user is between 0 and
TASK_SIZE_MAX.

TASK_SIZE_MAX is either TASK_SIZE_4PB or TASK_SIZE_64TB depending on
page size (64k or 4k).

TASK_SIZE_4PB is 0x0010000000000000UL.

Christophe

> 
>>> And for access_ok() avoiding the conditional is a good enough reason
>>> to use a 'conditional move' instruction.
>>> Avoiding speculation is actually free.
>>
>> And on CPUs that are not affected by Spectre and Meltdown like powerpc
>> 8xx or powerpc 603,
> 
> Erm.
> 
> 
> Segher

Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months ago
Hi!

On Sat, Jul 05, 2025 at 12:55:06PM +0200, Christophe Leroy wrote:
> > > For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10
> > 
> > I have no idea what "book3s64" means.
> 
> Well, that's the name given in the Linux kernel to the 64-bit Power CPU
> processors.

A fantasy name.  Great.

> > What is "powerpc/32"?  It does not help if you use different names from
> > what everyone else does.
> 
> Again, that's what it is called in the Linux kernel; see the commit
> subjects below:

And another.

> It means everything built with CONFIG_PPC32

Similar names for very dissimilar concepts, even!  Woohoo!

> > > For powerpc/64 we have fewer constraints than on powerpc/32:
> > > - Kernel memory starts at 0xc000000000000000
> > > - User memory stops at 0x0010000000000000
> > 
> > That isn't true, not even if you mean some existing name.  Usually
> > userspace code is mapped at 256MB (0x10000000).  On powerpc64-linux
> > anyway, different default on different ABIs of course :-)
> 
> 0x10000000 is below 0x0010000000000000, isn't it? So why isn't it true?

I understood "starts at".  I read cross-eyed maybe, hehe.


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Tue, 24 Jun 2025 07:27:47 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> On 22/06/2025 at 18:20, David Laight wrote:
> > On Sun, 22 Jun 2025 11:52:38 +0200
> > Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> >   
> >> Masked user access avoids the address/size verification by access_ok().
> >> Although its main purpose is to avoid speculation in the
> >> verification of user address and size, and hence the need for
> >> speculation mitigation, it also has the advantage of reducing the
> >> number of instructions needed, so it also benefits platforms that
> >> don't need speculation mitigation, especially when the size of the
> >> copy is not known at build time.
> > 
> > It also removes a conditional branch that is quite likely to be
> > statically predicted 'the wrong way'.  
> 
> But include/asm-generic/access_ok.h defines access_ok() as:
> 
> 	#define access_ok(addr, size) likely(__access_ok(addr, size))
> 
> So GCC uses the 'unlikely' variant of the branch instruction to force
> the correct prediction, doesn't it?

Nope...
Most architectures don't have likely/unlikely variants of branches.
So all gcc can do is decide which path is the fall-through and
whether the branch is forwards or backwards.
Additionally, unless there is code in both the 'if' and 'else' clauses
the [un]likely seems to have no effect.
So on simple cpus that predict 'backwards branches taken' you can get
the desired effect - but it may need an 'asm comment' to force the
compiler to generate the required branches (eg forwards branch directly
to a backwards unconditional jump).

On x86 it is all more complicated.
I think the pre-fetch code is likely to assume 'not taken' (but might
use stale info on the cache line).
The predictor itself never does 'static prediction' - it is always
based on the referenced branch prediction data structure.
So, unless you are in a loop (eg running a benchmark!) there is pretty
much a 50% chance of a branch mispredict.

I've been trying to benchmark different versions of the u64 * u64 / u64
function - and I think mispredicted branches make a big difference.
I need to sit down and sequence the test cases so that I can see
the effect of each branch!

> 
> >   
> >> Unlike x86_64, which masks the address to 'all bits set' when the
> >> user address is invalid, here the address is set to an address in
> >> the gap. That avoids relying on the zero page to catch offset
> >> accesses. On book3s/32 it makes sure the opened window remains on the
> >> user segment. The extra cost is a single instruction in the masking.
> > 
> > That isn't true (any more).
> > Linus changed the check to (approx):
> > 	if (uaddr > TASK_SIZE)
> > 		uaddr = TASK_SIZE;
> > (Implemented with a conditional move)  
> 
> Ah ok, I overlooked that. I didn't know the cmove instruction; it
> seems similar to the isel instruction on powerpc e500.

It got added for the Pentium Pro - I learnt 8086 :-)
I suspect x86 got there first...

Although called 'conditional move' I very much suspect the write is
actually unconditional.
So the hardware implementation is much the same as 'add carry' except
the ALU operation is a simple multiplex.
Which means it is unlikely to be speculative.

	David


> 
> Christophe
> 
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
Hi!

On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > So GCC uses the 'unlikely' variant of the branch instruction to force 
> > the correct prediction, doesn't it ?
> 
> Nope...
> Most architectures don't have likely/unlikely variants of branches.

In GCC, "likely" means 80%. "Very likely" means 99.95%.  Most things get
something more appropriate than such coarse things predicted.

Most of the time GCC uses these predicted branch probabilities to lay
out code in such a way that the fall-through path is the expected one.

Target backends can do special things with it as well, but usually that
isn't necessary.

There are many different predictors.  GCC usually can predict things
not bad by just looking at the shape of the code, using various
heuristics.  Things like profile-guided optimisation allow to use a
profile from an actual execution to optimise the code such that it will
work faster (assuming that future executions of the code will execute
similarly!)

You also can use __builtin_expect() in the source code, to put coarse
static prediction in.  That is what the kernel "{un,}likely" macros do.
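
For reference, those kernel macros are defined (in
include/linux/compiler.h) as:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)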

If the compiler knows some branch is not very predictable, it can
optimise the code knowing that.  Like, it could use other strategies
than conditional branches.

On old CPUs something like "this branch is taken 50% of the time" makes
it a totally unpredictable branch.  But if say it branches exactly every
second time, it is 100% predicted correctly by more advanced predictors,
not just 50%.

To properly model modern branch predictors we need to record a "how
predictable is this branch" score as well for every branch, not just a
"how often does it branch instead of falling through" score.  We're not
there yet.


Segher
Re: [PATCH 0/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Tue, 24 Jun 2025 16:37:12 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> Hi!
> 
> On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > > So GCC uses the 'unlikely' variant of the branch instruction to force
> > > the correct prediction, doesn't it?
> > 
> > Nope...
> > Most architectures don't have likely/unlikely variants of branches.  
> 
> In GCC, "likely" means 80%. "Very likely" means 99.95%.  Most things get
> something more appropriate than such coarse things predicted.
> 
> Most of the time GCC uses these predicted branch probabilities to lay
> out code in such a way that the fall-through path is the expected one.

That is fine provided the cpu doesn't predict the 'taken' path.
If you write:
	if (unlikely(x))
		continue;
gcc is very likely to generate a backwards conditional branch that
will get predicted taken (by a cpu without dynamic branch prediction).
You need to put something (an asm comment will do) before the 'continue'
to force gcc to generate a forwards (predicted not taken) branch to
the backwards jump.

> Target backends can do special things with it as well, but usually that
> isn't necessary.
> 
> There are many different predictors.  GCC usually can predict things
> not bad by just looking at the shape of the code, using various
> heuristics.  Things like profile-guided optimisation allow to use a
> profile from an actual execution to optimise the code such that it will
> work faster (assuming that future executions of the code will execute
> similarly!)

Without cpu instructions to force static prediction I don't see how that
helps as much as you might think.
Each time the code is loaded into the I-cache the branch predictor state
is likely to have been destroyed by other code.
So the branches get predicted by 'the other code' regardless of any layout.

> 
> You also can use __builtin_expect() in the source code, to put coarse
> static prediction in.  That is what the kernel "{un,}likely" macros do.
> 
> If the compiler knows some branch is not very predictable, it can
> optimise the code knowing that.  Like, it could use other strategies
> than conditional branches.
> 
> On old CPUs something like "this branch is taken 50% of the time" makes
> it a totally unpredictable branch.  But if say it branches exactly every
> second time, it is 100% predicted correctly by more advanced predictors,
> not just 50%.

Only once you are in a code loop.
Dynamic branch prediction is pretty hopeless for linear code.
The first time you execute a branch it is likely to be predicted taken
50% of the time.
(I guess a bit less than 50% - it will be the percentage of branches
that are taken.)

> 
> To properly model modern branch predictors we need to record a "how
> predictable is this branch" score as well for every branch, not just a
> "how often does it branch instead of falling through" score.  We're not
> there yet.

If you are going to adjust the source code you want to determine correct
static prediction for most branches.
That probably requires an 'every other' static prediction.

I spent a lot of time optimising some code to minimise the worst case
path; the first thing I had to do was disable the dynamic branch
prediction logic.

	David

> 
> 
> Segher