Masked user access avoids the address/size verification by access_ok().
Although its main purpose is to avoid speculation around the
verification of the user address and size, and hence the need for
speculation mitigation, it also has the advantage of reducing the
number of instructions needed, so it also benefits platforms that
don't need speculation mitigation, especially when the size of the
copy is not known at build time.

Unlike x86_64, which masks the address to 'all bits set' when the
user address is invalid, here the address is set to an address in
the gap. This avoids relying on the zero page to catch offset
accesses. On book3s/32 it makes sure the opening remains on the user
segment. The extra cost is a single instruction in the masking.

First patch adds masked_user_read_access_begin() and
masked_user_write_access_begin() to match user_read_access_end()
and user_write_access_end().

Second patch adds a speculation barrier to copy_from_user_iter() so
that the barrier in powerpc raw_copy_from_user(), which is redundant
with the one in copy_from_user(), can be removed.

Third patch removes the redundant barrier_nospec() in
raw_copy_from_user().

Fourth patch removes the unused size parameter when enabling/disabling
user access.

Last patch implements masked user access.

Christophe Leroy (5):
  uaccess: Add masked_user_{read/write}_access_begin
  uaccess: Add speculation barrier to copy_from_user_iter()
  powerpc: Remove unused size parameter to KUAP enabling/disabling
    functions
  powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
  powerpc: Implement masked user access

 arch/powerpc/Kconfig                         |   2 +-
 arch/powerpc/include/asm/book3s/32/kup.h     |   2 +-
 arch/powerpc/include/asm/book3s/64/kup.h     |   4 +-
 arch/powerpc/include/asm/kup.h               |  24 ++--
 arch/powerpc/include/asm/nohash/32/kup-8xx.h |   2 +-
 arch/powerpc/include/asm/nohash/kup-booke.h  |   2 +-
 arch/powerpc/include/asm/uaccess.h           | 140 ++++++++++++++++---
 fs/select.c                                  |   2 +-
 include/linux/uaccess.h                      |   8 ++
 kernel/futex/futex.h                         |   4 +-
 lib/iov_iter.c                               |   7 +
 lib/strncpy_from_user.c                      |   2 +-
 lib/strnlen_user.c                           |   2 +-
 13 files changed, 158 insertions(+), 43 deletions(-)

-- 
2.49.0
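To illustrate the pattern the series extends, here is a minimal sketch
of a get_user()-style fast path using the new accessor (hedged:
modelled on the existing masked_user_access_begin() call sites such as
fs/select.c; the _read_ variant is what patch 1 introduces, and 'val'
and 'from' are placeholder names):

	if (can_do_masked_user_access()) {
		/* No access_ok(): an invalid pointer is masked into
		 * the guard gap and the access faults instead. */
		from = masked_user_read_access_begin(from);
	} else if (!user_read_access_begin(from, sizeof(*from))) {
		return -EFAULT;
	}
	unsafe_get_user(val, from, Efault);
	user_read_access_end();
	return 0;
Efault:
	user_read_access_end();
	return -EFAULT;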
On Sun, 22 Jun 2025 11:52:38 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> Masked user access avoids the address/size verification by access_ok().
> Although its main purpose is to avoid speculation around the
> verification of the user address and size, and hence the need for
> speculation mitigation, it also has the advantage of reducing the
> number of instructions needed, so it also benefits platforms that
> don't need speculation mitigation, especially when the size of the
> copy is not known at build time.

It also removes a conditional branch that is quite likely to be
statically predicted 'the wrong way'.

> Unlike x86_64, which masks the address to 'all bits set' when the
> user address is invalid, here the address is set to an address in
> the gap. This avoids relying on the zero page to catch offset
> accesses. On book3s/32 it makes sure the opening remains on the user
> segment. The extra cost is a single instruction in the masking.

That isn't true (any more).
Linus changed the check to (approx):
	if (uaddr > TASK_SIZE)
		uaddr = TASK_SIZE;
(Implemented with a conditional move.)
This replaced the original version that used cmp, sbb, or to get
'all bits set'. Quite likely the comments are wrong!

I thought there was a second architecture that implemented it - and
might still set ~0u?

As you noted, returning 'TASK_SIZE' (or, at least, the base of a page
that is guaranteed to fault) means that the caller only has to do
'reasonably sequential' accesses, and not guarantee to read offset
zero first.

As a separate patch, provided there is a guard page between user and
kernel, and user accesses are 'reasonably sequential', even access_ok()
need not check the transfer length. Linus wasn't that brave :-)

I think some of the 'API' is still based on the original 386 code
where the page tables had to be checked by hand for CoW.

	David
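For reference, the x86-64 helper David paraphrases looks roughly like
this (a simplified sketch of mask_user_address() from
arch/x86/include/asm/uaccess_64.h; the real code fetches USER_PTR_MAX
as a runtime constant):

	static inline void __user *mask_user_address(const void __user *ptr)
	{
		void __user *ret;

		/* Clamp any address above USER_PTR_MAX to USER_PTR_MAX,
		 * using a conditional move instead of a branch. */
		asm("cmp %1,%0\n\t"
		    "cmova %1,%0"
		    : "=r" (ret)
		    : "r" (USER_PTR_MAX), "0" (ptr));
		return ret;
	}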
On 22/06/2025 at 18:20, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:38 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
>
>> Masked user access avoids the address/size verification by access_ok().
>> Although its main purpose is to avoid speculation around the
>> verification of the user address and size, and hence the need for
>> speculation mitigation, it also has the advantage of reducing the
>> number of instructions needed, so it also benefits platforms that
>> don't need speculation mitigation, especially when the size of the
>> copy is not known at build time.
>
> It also removes a conditional branch that is quite likely to be
> statically predicted 'the wrong way'.

But include/asm-generic/access_ok.h defines access_ok() as:

	#define access_ok(addr, size) likely(__access_ok(addr, size))

So GCC uses the 'unlikely' variant of the branch instruction to force
the correct prediction, doesn't it?

>> Unlike x86_64, which masks the address to 'all bits set' when the
>> user address is invalid, here the address is set to an address in
>> the gap. This avoids relying on the zero page to catch offset
>> accesses. On book3s/32 it makes sure the opening remains on the user
>> segment. The extra cost is a single instruction in the masking.
>
> That isn't true (any more).
> Linus changed the check to (approx):
> 	if (uaddr > TASK_SIZE)
> 		uaddr = TASK_SIZE;
> (Implemented with a conditional move.)

Ah ok, I overlooked that. I didn't know the cmove instruction; it
seems similar to the isel instruction on powerpc e500.

Christophe
On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> Ah ok, I overlooked that. I didn't know the cmove instruction; it
> seems similar to the isel instruction on powerpc e500.

cmove does a move (register or memory) when some condition is true.

isel (which is base PowerPC, not something "e500" only) is a
computational instruction: it copies one of two registers to a third,
and which of the two is decided by any bit in the condition register.

But sure, seen from very far off both isel and cmove can be used to
implement the ternary operator ("?:"), and are similar in that way :-)


Segher
On Tue, 24 Jun 2025 08:17:14 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > seems similar to the isel instruction on powerpc e500.
>
> cmove does a move (register or memory) when some condition is true.

The destination of x86 'cmov' is always a register (only the source
can be memory - and is probably always read).
It is also a computational instruction.
It may well always do the register write - hard to detect.

There is a planned new instruction that would do a conditional write
to memory - but it is not on any cpu yet.

> isel (which is base PowerPC, not something "e500" only) is a
> computational instruction: it copies one of two registers to a third,
> and which of the two is decided by any bit in the condition register.

Does that mean it could be used for all the ppc cpu variants?

> But sure, seen from very far off both isel and cmove can be used to
> implement the ternary operator ("?:"), and are similar in that way :-)

Which is exactly what you want to avoid speculation.

	David
Hi!

On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 08:17:14 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
> > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > Ah ok, I overlooked that. I didn't know the cmove instruction; it
> > > seems similar to the isel instruction on powerpc e500.
> >
> > cmove does a move (register or memory) when some condition is true.
>
> The destination of x86 'cmov' is always a register (only the source
> can be memory - and is probably always read).

Both source operands can be mem, right? But probably not both at the
same time.

> It is also a computational instruction.

Terminology...

x86 is not a RISC architecture, or more generally, a load/store
architecture. A computational instruction is one that doesn't touch
memory, doesn't branch, and doesn't do some system function (some
supervisor or hypervisor instruction, maybe). x86 does not have many
computational insns, most insns can touch memory :-)

(The important thing is that most computational insns do not ever
cause exceptions; the only exceptions are if you divide by zero or
similar :-) )

> It may well always do the register write - hard to detect.
>
> There is a planned new instruction that would do a conditional write
> to memory - but it is not on any cpu yet.

Interesting! Instructions like the atomic store insns we got for p9,
maybe? They can do minimum/maximum and various kinds of more generic
reductions and similar.

> > isel (which is base PowerPC, not something "e500" only) is a
> > computational instruction: it copies one of two registers to a
> > third, and which of the two is decided by any bit in the condition
> > register.
>
> Does that mean it could be used for all the ppc cpu variants?

No, only things that implement architecture version 2.03 or later.
That is from 2006, so essentially everything that is still made
implements it :-)

But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
do not have it yet either, but the newer ones do.

And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)

> > But sure, seen from very far off both isel and cmove can be used to
> > implement the ternary operator ("?:"), and are similar in that
> > way :-)
>
> Which is exactly what you want to avoid speculation.

There are cheaper / simpler / more effective / better ways to get
that, but sure, everything is better than a conditional branch,
always :-)


Segher
On Tue, 24 Jun 2025 13:25:05 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> Hi!
>
> On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > On Tue, 24 Jun 2025 08:17:14 -0500
> > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >
> > > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > > Ah ok, I overlooked that. I didn't know the cmove instruction;
> > > > it seems similar to the isel instruction on powerpc e500.
> > >
> > > cmove does a move (register or memory) when some condition is
> > > true.
> >
> > The destination of x86 'cmov' is always a register (only the source
> > can be memory - and is probably always read).
>
> Both source operands can be mem, right? But probably not both at the
> same time.

It only has one 'real' source, but the implementation could easily
read the destination register and then decide which value to write
back - rather than doing a conditional write to the register file.
A conditional write would be a right PITA for the alu result
forwarding logic.

> > It is also a computational instruction.
>
> Terminology...
>
> x86 is not a RISC architecture, or more generally, a load/store
> architecture.

It sort of is these days.
The memory transfers are separate u-ops, so a 'reg += mem' instruction
is split into two by the decoder.
Although some u-ops get merged together and executed in one clock; an
obvious example is some 'compare+branch' pairs.

> A computational instruction is one that doesn't touch memory, doesn't
> branch, and doesn't do some system function (some supervisor or
> hypervisor instruction, maybe).
>
> x86 does not have many computational insns, most insns can touch
> memory :-)

Except that the memory 'bit' is executed separately from any alu
'stuff'.
So for a 'reg += mem' instruction the memory read can be started as
soon as the registers that contain the address are valid; the 'add'
requires the memory read to have completed and the instruction that
generated the old value of 'reg' to have completed - which could be
waiting on all sorts of things (like a divide).
Once both values are ready the 'add' can be executed (provided a
suitable alu is available).

> (The important thing is that most computational insns do not ever
> cause exceptions; the only exceptions are if you divide by zero or
> similar :-) )
>
> > It may well always do the register write - hard to detect.
> >
> > There is a planned new instruction that would do a conditional
> > write to memory - but it is not on any cpu yet.
>
> Interesting! Instructions like the atomic store insns we got for p9,
> maybe? They can do minimum/maximum and various kinds of more generic
> reductions and similar.

I think they are only conditional stores.
But they do save a conditional branch.
A late disable of a memory write is far less problematic than a
disabled register file write. No one minds (too much) about slight
delays between writes and reads of the same location (reduced by a
store to load forwarder) but you don't want to lose clocks between
adjacent simple alu instructions.

For my sins I re-implemented a soft cpu last year...
Which doesn't have a 'cmov' :-(

> > > isel (which is base PowerPC, not something "e500" only) is a
> > > computational instruction: it copies one of two registers to a
> > > third, and which of the two is decided by any bit in the
> > > condition register.
> >
> > Does that mean it could be used for all the ppc cpu variants?
>
> No, only things that implement architecture version 2.03 or later.
> That is from 2006, so essentially everything that is still made
> implements it :-)
>
> But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
> have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
> do not have it yet either, but the newer ones do.
>
> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)

That makes more sense than the list in patch 5/5.

> > > But sure, seen from very far off both isel and cmove can be used
> > > to implement the ternary operator ("?:"), and are similar in that
> > > way :-)
> >
> > Which is exactly what you want to avoid speculation.
>
> There are cheaper / simpler / more effective / better ways to get
> that, but sure, everything is better than a conditional branch,
> always :-)

Everything except a TLB miss :-)

And for access_ok() avoiding the conditional is a good enough reason
to use a 'conditional move' instruction.
Avoiding speculation is actually free.

	David
Hi!

On Tue, Jun 24, 2025 at 10:08:16PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > > The destination of x86 'cmov' is always a register (only the
> > > source can be memory - and is probably always read).
> >
> > Both source operands can be mem, right? But probably not both at
> > the same time.
>
> It only has one 'real' source, but the implementation could easily
> read the destination register and then decide which value to write
> back - rather than doing a conditional write to the register file.

Yeah, in x86 many (most insns?) can read any reg that they write. Not
a great design, but heh.

> A conditional write would be a right PITA for the alu result
> forwarding logic.

Depends. An implementation can always do the register forwarding
etc., and just annul the actual store where appropriate (and not put
it in the various store queues either, heh -- annul all the effects of
the store).

> > x86 is not a RISC architecture, or more generally, a load/store
> > architecture.
>
> It sort of is these days.

Not at all. Most *implementations* are, the uarchs, but the
architecture (which determines the required visible semantics) is not.
That impedance difference is quite painful, yes, for code generation
more than for the processor implementation even -- as usual the
compilers have to save the day!

> The memory transfers are separate u-ops, so a 'reg += mem'
> instruction is split into two by the decoder.

Yup. Very expensive. Both for the implementation, and for the
performance of eventual code running on it.

> Although some u-ops get merged together and executed in one clock; an
> obvious example is some 'compare+branch' pairs.

On many other architectures such things run in 0 cycles anyway :-)

> > A computational instruction is one that doesn't touch memory,
> > doesn't branch, and doesn't do some system function (some
> > supervisor or hypervisor instruction, maybe).
> >
> > x86 does not have many computational insns, most insns can touch
> > memory :-)
>
> Except that the memory 'bit' is executed separately from any alu
> 'stuff'.

On many uarchs, yes. But not in the arch. No uarch can decide to just
not implement these difficult and expensive insns :-)

> > > There is a planned new instruction that would do a conditional
> > > write to memory - but it is not on any cpu yet.
> >
> > Interesting! Instructions like the atomic store insns we got for
> > p9, maybe? They can do minimum/maximum and various kinds of more
> > generic reductions and similar.
>
> I think they are only conditional stores.
> But they do save a conditional branch.

Yeah, but those are not ever executed *anyway*; there is branch
prediction, and we require that to be pretty good to get reasonable
performance anyway. A branch around the store insns is just fine if
it can be predicted correctly.

If it cannot be predicted correctly, you can do the store always, and
just have the address that is stored to depend on the condition (such
that the data is stored to some dummy memory if it "should not be
done").

Source code gets such a transform done manually in the performance
critical paths not infrequently, already. GCC does not currently do
such a transformation AFAIK, but it is a pretty basic thing to do.
Conditional stores are not often written in source code programs, or
there would probably be an implementation for this already :-)

> A late disable of a memory write is far less problematic than a
> disabled register file write. No one minds (too much) about slight
> delays between writes and reads of the same location (reduced by a
> store to load forwarder) but you don't want to lose clocks between
> adjacent simple alu instructions.

Writes to memory take tens of cycles *anyway*, but all of that is
hidden by the various memory load and store queues (which let you do
forwarding in just a few cycles).

> For my sins I re-implemented a soft cpu last year...

Ouch :-) But it was fun to do, I hope?

> Which doesn't have a 'cmov' :-(

The x86 flag register bits are so limited and complicated in the first
place, cmov is the easier part there ;-)

> > But ancient things do not. Both 970 (Apple G5) and Cell BE do not
> > yet have it (they are ISA 2.01 and 2.02 respectively). And the
> > older p5's do not have it yet either, but the newer ones do.
> >
> > And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>
> That makes more sense than the list in patch 5/5.

Not sure what list that is. I'll find it :-)

> > > But sure, seen from very far off both isel and cmove can be used
> > > to implement the ternary operator ("?:"), and are similar in that
> > > way :-)
> >
> > Which is exactly what you want to avoid speculation.
> >
> > There are cheaper / simpler / more effective / better ways to get
> > that, but sure, everything is better than a conditional branch,
> > always :-)
>
> Everything except a TLB miss :-)

Heh. TLBs are just a tiny part of translation on Power. We mostly
care about the ERATs. Look it up, if you want to be introduced to
another level of pain :-)

> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.

*Assuming* that avoiding speculation is actually free, you mean?


Segher
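A minimal C sketch of the branch-free conditional store Segher
describes (hypothetical helper and names; the ternary on the address
is expected to compile to cmov/isel rather than a branch):

	/* Always perform the store; steer it to a dummy slot when the
	 * condition is false, so no conditional branch is needed. */
	static long discard_slot;

	static inline void store_if(long *p, long val, int cond)
	{
		long *target = cond ? p : &discard_slot;

		*target = val;
	}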
On 24/06/2025 at 23:08, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>> computational instruction: it copies one of two registers to a
>>>> third, and which of the two is decided by any bit in the condition
>>>> register.
>>>
>>> Does that mean it could be used for all the ppc cpu variants?
>>
>> No, only things that implement architecture version 2.03 or later.
>> That is from 2006, so essentially everything that is still made
>> implements it :-)
>>
>> But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
>> have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
>> do not have it yet either, but the newer ones do.

For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10.

>>
>> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>
> That makes more sense than the list in patch 5/5.

Sorry for the ambiguity. In patch 5/5 I was addressing only
powerpc/32, and as far as I know the only powerpc/32 supported by
Linux that has isel is the 85xx, which has an e500 core.

For powerpc/64 we have fewer constraints than on powerpc/32:
- Kernel memory starts at 0xc000000000000000
- User memory stops at 0x0010000000000000

So we can easily use 0x8000000000000000 as the demarcation line, which
allows a 3-instruction sequence:

	sradi r9,r3,63    => shift right algebraic: r9 = 0 when r3 >= 0;
	                     r9 = -1 when r3 < 0
	andc r3,r3,r9     => and with complement: r3 unchanged when
	                     r9 = 0; r3 = 0 when r9 = -1
	rldimi r3,r9,0,1  => rotate left and mask insert: insert the
	                     highest bit of r9 into r3

Whereas using isel requires a 4-instruction sequence:

	li r9, 1          => load immediate: r9 = 1
	rotldi r9, r9, 63 => rotate left: r9 = 0x8000000000000000
	cmpld r3, r9      => compare logically
	iselgt r3, r9, r3 => integer select greater-than: select r3 when
	                     the result of the above compare is <=;
	                     select r9 when the result of the compare is >

>
>>
>>>> But sure, seen from very far off both isel and cmove can be used to
>>>> implement the ternary operator ("?:"), and are similar in that
>>>> way :-)
>>>
>>> Which is exactly what you want to avoid speculation.
>>
>> There are cheaper / simpler / more effective / better ways to get
>> that, but sure, everything is better than a conditional branch,
>> always :-)
>
> Everything except a TLB miss :-)
>
> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.
>

And on CPUs that are not affected by Spectre and Meltdown, like
powerpc 8xx or powerpc 603, masking the pointer is still worth it as
it generates better code, even if the masking involves branching.

That's the reason why I did:

- e500 (affected by Spectre v1) ==> use isel ==> 3-instruction
  sequence:

	lis r9, TASK_SIZE@h => load immediate shifted:
	                       r9 = (TASK_SIZE >> 16) << 16
	cmplw r3, r9        => compare addr with TASK_SIZE
	iselgt r3, r9, r3   => addr = TASK_SIZE when addr > TASK_SIZE;
	                       addr unchanged when addr <= TASK_SIZE

For the others:

- If TASK_SIZE <= 0x80000000 and the kernel is >= 0x80000000 ==>
  3-instruction sequence similar to the 64-bit one above:

	srawi r9,r3,31
	andc r3,r3,r9
	rlwimi r3,r9,0,0,0

- Otherwise, if speculation mitigation is required (e600), use the
  6-instruction masking sequence (r3 contains the address to mask):

	srwi r9,r3,17      => shift right: shift r3 by 17 to keep 15
	                      bits (a positive 16-bit value)
	subfic r9,r9,22527 => subtract from immediate:
	                      r9 = (TASK_SIZE >> 17) - r9 => r9 < 0 when
	                      r3 is greater
	srawi r9,r9,31     => shift right algebraic: r9 = 0 when
	                      r9 >= 0; r9 = -1 when r9 < 0
	andis. r10,r9,45056 => and immediate shifted: r10 = r9 and the
	                      upper part of TASK_SIZE =>
	                      r10 = TASK_SIZE when r3 > TASK_SIZE,
	                      r10 = 0 otherwise
	andc r3,r3,r9      => and with complement: r3 unchanged when
	                      r9 = 0 (r3 <= TASK_SIZE); r3 cleared when
	                      r9 = -1 (r3 > TASK_SIZE)
	or r3,r3,r10       => r3 = r3 | r10

- If no speculation mitigation is required, let GCC do as it wants.

The list of affected CPUs is available here:
https://docs.nxp.com/bundle/GUID-805CC0EA-4001-47AD-86CD-4F340751F6B7/page/GUID-17B5D04F-6471-4EC6-BEB9-DE4D0AFA034A.html
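A C rendering of the three-instruction 64-bit sequence above may make
the logic easier to follow (a sketch under the stated layout
assumptions; the helper name is made up):

	static inline void __user *mask_user_addr64(const void __user *ptr)
	{
		unsigned long addr = (unsigned long)ptr;
		unsigned long sign = (long)addr >> 63;	/* sradi: 0, or ~0UL if bit 63 set */

		addr &= ~sign;				/* andc: clear the address if invalid */
		addr |= sign & 0x8000000000000000UL;	/* rldimi: park it in the gap */
		return (void __user *)addr;
	}

Any address at or above 0x8000000000000000 collapses to the
demarcation line itself, which sits in the unmapped gap between the
end of user space and the start of kernel memory, so a subsequent
'reasonably sequential' access is guaranteed to fault.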
On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> On 24/06/2025 at 23:08, David Laight wrote:
> > On Tue, 24 Jun 2025 13:25:05 -0500
> > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > > > > isel (which is base PowerPC, not something "e500" only) is a
> > > > > computational instruction: it copies one of two registers to
> > > > > a third, and which of the two is decided by any bit in the
> > > > > condition register.
> > > >
> > > > Does that mean it could be used for all the ppc cpu variants?
> > >
> > > No, only things that implement architecture version 2.03 or
> > > later. That is from 2006, so essentially everything that is
> > > still made implements it :-)
> > >
> > > But ancient things do not. Both 970 (Apple G5) and Cell BE do
> > > not yet have it (they are ISA 2.01 and 2.02 respectively). And
> > > the older p5's do not have it yet either, but the newer ones do.
>
> For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10.

I have no idea what "book3s64" means.

Some ancient Power architecture versions had something called
"Book III-S", which was juxtaposed to "Book III-E", which essentially
corresponds to the old aborted BookE stuff.

I guess you mean almost all non-FSL implementations? Most of those
support the isel insns. Like, Power5+ (GS). And everything after
that.

I have no idea why you think power9 has it while older CPUs do not.
In the GCC source code we have this comment:
  /* For ISA 2.06, don't add ISEL, since in general it isn't a win,
     but altivec is a win so enable it.  */
and in fact we do not enable it for ISA 2.06 (p8) either, probably for
a similar reason.

> > > And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
> >
> > That makes more sense than the list in patch 5/5.
>
> Sorry for the ambiguity. In patch 5/5 I was addressing only
> powerpc/32, and as far as I know the only powerpc/32 supported by
> Linux that has isel is the 85xx, which has an e500 core.

What is "powerpc/32"? It does not help if you use different names
from what everyone else does.

The name "powerpc32" is sometimes used colloquially to mean PowerPC
code running in SF=0 mode (MSR[SF]=0), but perhaps more often it is
used for 32-bit only implementations (so, those that do not even have
that bit: it's bit 0 in the 64-bit MSR, so all implementations that
have an only-32-bit MSR, for example).

> For powerpc/64 we have fewer constraints than on powerpc/32:
> - Kernel memory starts at 0xc000000000000000
> - User memory stops at 0x0010000000000000

That isn't true, not even if you mean some existing name. Usually
userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
anyway, different default on different ABIs of course :-)

> > And for access_ok() avoiding the conditional is a good enough
> > reason to use a 'conditional move' instruction.
> > Avoiding speculation is actually free.
>
> And on CPUs that are not affected by Spectre and Meltdown, like
> powerpc 8xx or powerpc 603,

Erm.


Segher
On Thu, 26 Jun 2025 17:01:48 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
...
> I have no idea why you think power9 has it while older CPUs do not.
> In the GCC source code we have this comment:
>   /* For ISA 2.06, don't add ISEL, since in general it isn't a win,
>      but altivec is a win so enable it.  */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably
> for a similar reason.

Odd, I'd have thought that replacing a conditional branch with a
conditional move would pretty much always be a win.
Unless, of course, you only consider benchmark loops where the branch
predictor is 100% accurate.

OTOH isn't altivec 'simd' instructions?
They pretty much only help for loops with lots of iterations.
I don't know about ppc, but I've seen gcc make a real 'pigs breakfast'
of loop vectorisation on x86.

For the linux kernel (which, as Linus keeps reminding people, tends to
run 'cold cache'), you probably want conditional moves in order to
avoid mis-predicted branches and non-linear execution, but don't want
loop vectorisation because the setup and end cases cost too much
compared to the gain for each iteration.

	David
Hi!

On Sat, Jul 05, 2025 at 07:33:32PM +0100, David Laight wrote:
> On Thu, 26 Jun 2025 17:01:48 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> ...
> > I have no idea why you think power9 has it while older CPUs do not.
> > In the GCC source code we have this comment:
> >   /* For ISA 2.06, don't add ISEL, since in general it isn't a win,
> >      but altivec is a win so enable it.  */
> > and in fact we do not enable it for ISA 2.06 (p8) either, probably
> > for

2.07 I meant of course. Sigh.

> > a similar reason.
>
> Odd, I'd have thought that replacing a conditional branch with a
> conditional move would pretty much always be a win.
> Unless, of course, you only consider benchmark loops where the branch
> predictor is 100% accurate.

The isel machine instruction is super expensive on p8: it is marked as
first in an instruction group, and has latency 5 for the GPR sources,
and 8 for the CR field source.

On p7 it wasn't great either; it was actually converted to a branch
sequence internally! On p8 there are bc+8 optimisations done by the
core as well: conditional branches that skip one insn are faster than
equivalent isel insns!

Since p9 it is a lot better :-)

> OTOH isn't altivec 'simd' instructions?

AltiVec is the old Motorola marketing name for what is called the
"Vector Facility" in the architecture, and which at IBM is still
called VMX, the name it was developed under ("Vector Multimedia
Extension").

Since p7 (ISA 2.06, 2010) there also is the Vector-Scalar Extension
Facility, VSX, which adds another 32 vector registers, and the
traditional floating point registers are physically the same (but
those use only the first half of each vector reg). Many new VSX
instructions can do simple floating point stuff on all 64 VSX
registers, either just on the first lane ("scalar") or on all lanes
("vector").

This does largely mean that all floating point is stored in IEEE DP
format internally (on older cores usually some close-to-70-bit format
was used internally), which in olden times actually allowed making the
cores faster. Only when storing a value to memory was it actually
converted to IEEE format (but of course it was always rounded
correctly, etc.)

> They pretty much only help for loops with lots of iterations.
> I don't know about ppc, but I've seen gcc make a real 'pigs
> breakfast' of loop vectorisation on x86.

For PowerPC (or Power, the more modern name) of course we also have
our fair share of problems with vectorisation. It does help that we
were the first architecture used by GCC that had a serious Vector
thing; the C syntax extension for Vector literals is taken from the
old extensions in the AltiVec PIM, but using curly brackets {} instead
of round brackets (), for example.

> For the linux kernel (which, as Linus keeps reminding people, tends
> to run 'cold cache'), you probably want conditional moves in order
> to avoid mis-predicted branches and non-linear execution, but don't
> want loop vectorisation because the setup and end cases cost too
> much compared to the gain for each iteration.

You are best off using what GCC gives you, usually. It is very well
tuned, both the generic and the machine-specific code :-)

The kernel of course disables all Vector and FP stuff; essentially it
disables use of any of the associated registers, and that's pretty
much the end of it ;-) (The reason for that is that it would make
task switches more expensive; long ago all task switches, but
nowadays still user<->kernel switches.)


Segher
On Sat, 5 Jul 2025 15:15:57 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
...
> The isel machine instruction is super expensive on p8: it is marked
> as first in an instruction group, and has latency 5 for the GPR
> sources, and 8 for the CR field source.
>
> On p7 it wasn't great either; it was actually converted to a branch
> sequence internally!

Ugg...

You'd think they'd add instructions that can be implemented.
It isn't as though isel is any harder than 'add with carry'.

Not that uncommon, IIRC amd added adox/adcx (add carry using the
overflow/carry flag and without changing any other flags) as very slow
instructions. Intel invented them without making jcxz (dec %cx and
jump non-zero) fast - so you can't (easily) put them in a loop.
Not to mention all the AVX512 fubars.

Conditional move is more of a problem with a mips-like cpu where alu
ops read two registers and write a third.
You don't want to do a conditional write because it messes up the
decision of whether to forward the alu result to the following
instruction.
So I think you might need to do 'cmov odd/even' and read the LSB from
a third copy (or third read port) of the registers indexed by what
would normally be the 'output' register number.
Then tweak the register numbers early in the pipeline so that the
result goes to one of the 'input' registers rather than the normal
'output' one.
Not really that hard - could add to the cpu I did in 1/2 a day :-)

	David
Hi!

On Sat, Jul 05, 2025 at 10:05:38PM +0100, David Laight wrote:
> On Sat, 5 Jul 2025 15:15:57 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
> ...
> > The isel machine instruction is super expensive on p8: it is marked
> > as first in an instruction group, and has latency 5 for the GPR
> > sources, and 8 for the CR field source.
> >
> > On p7 it wasn't great either; it was actually converted to a branch
> > sequence internally!
>
> Ugg...
>
> You'd think they'd add instructions that can be implemented.
> It isn't as though isel is any harder than 'add with carry'.

It is though! isel was the first instruction that takes both GPR
inputs and a CR field input. We now have more: the ISA 3.0 (p9) setb
insn, and esp. the extremely useful ISA 3.1 (p10) set[n]bc[r] insns --
well, those don't take any GPR inputs actually, but like isel their
output is a GPR :-)

> Not that uncommon, IIRC amd added adox/adcx (add carry using the
> overflow/carry flag and without changing any other flags) as very

We have a similar "addex" insn since p9, which allows to use the OV
bit instead of the CA bit, and prepares to allow an extra three
possible bits as carry bits, too. Using it you can run multiple carry
chains in parallel using insns very close to the traditional stuff.
The compiler still doesn't ever generate this; it is mostly useful for
handcoded assembler routines.

The carry bits are stored in the XER register, the "fixed-point
exception register", while results from comparison instructions are
stored in the CR, which holds eight four-bit CR fields, which you can
use in conditional jumps, or in isel and the like, or in the
crlogical insns (which can do any logic function on two CR field
inputs and store in a third, just like the logical insns on GPRs that
also have the full complement of 14 two-source functions).

> slow instructions. Intel invented them without making jcxz (dec %cx
> and jump non-zero) fast - so you can't (easily) put them in a loop.
> Not to mention all the AVX512 fubars.

Sounds lovely :-)

> Conditional move is more of a problem with a mips-like cpu where
> alu ops read two registers and write a third.

Like most Power implementations as well.

> You don't want to do a conditional write because it messes up
> the decision of whether to forward the alu result to the following
> instruction.
> So I think you might need to do 'cmov odd/even' and read the LSB
> from a third copy (or third read port) of the registers indexed
> by what would normally be the 'output' register number.
> Then tweak the register numbers early in the pipeline so that the
> result goes to one of the 'input' registers rather than the normal
> 'output' one.
> Not really that hard - could add to the cpu I did in 1/2 a day :-)

On p9 and later both GPR (or constant) inputs are fed into the
execution unit as well as some CR bit, and it just writes to a GPR.
Easier for the hardware, easier for the compiler, and easier for the
programmer! Win-win-win. The kind of tradeoffs I like best :-)


Segher
On 27/06/2025 at 00:01, Segher Boessenkool wrote:
> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
>> On 24/06/2025 at 23:08, David Laight wrote:
>>> On Tue, 24 Jun 2025 13:25:05 -0500
>>> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>>>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>>>> computational instruction: it copies one of two registers to a
>>>>>> third, and which of the two is decided by any bit in the
>>>>>> condition register.
>>>>>
>>>>> Does that mean it could be used for all the ppc cpu variants?
>>>>
>>>> No, only things that implement architecture version 2.03 or later.
>>>> That is from 2006, so essentially everything that is still made
>>>> implements it :-)
>>>>
>>>> But ancient things do not. Both 970 (Apple G5) and Cell BE do not
>>>> yet have it (they are ISA 2.01 and 2.02 respectively). And the
>>>> older p5's do not have it yet either, but the newer ones do.
>>
>> For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10.
>
> I have no idea what "book3s64" means.

Well, that's the name given in the Linux kernel to the 64-bit Power
processors. See these commit subjects:

f5164797284d book3s64/radix : Optimize vmemmap start alignment
58450938f771 book3s64/radix : Handle error conditions properly in radix_vmemmap_populate
9cf7e13fecba book3s64/radix : Align section vmemmap start address to PAGE_SIZE
29bdc1f1c1df book3s64/radix: Fix compile errors when CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=n
d629d7a8efc3 powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump is active
5959ffabbb67 arch/powerpc: teach book3s64 arch_get_unmapped_area{_topdown} to handle hugetlb mappings
8846d9683884 book3s64/hash: Early detect debug_pagealloc size requirement
76b7d6463fc5 book3s64/hash: Disable kfence if not early init
b5fbf7e2c6a4 book3s64/radix: Refactoring common kfence related functions
8fec58f503b2 book3s64/hash: Add kfence functionality
47dd2e63d42a book3s64/hash: Disable debug_pagealloc if it requires more memory
...

> Some ancient Power architecture versions had something called
> "Book III-S", which was juxtaposed to "Book III-E", which essentially
> corresponds to the old aborted BookE stuff.
>
> I guess you mean almost all non-FSL implementations? Most of those
> support the isel insns. Like, Power5+ (GS). And everything after
> that.
>
> I have no idea why you think power9 has it while older CPUs do not.
> In the GCC source code we have this comment:

I think nothing, I just observed that GCC doesn't use it unless you
tell it the CPU is power9 or power10. But OK, the comment explains
why.

>   /* For ISA 2.06, don't add ISEL, since in general it isn't a win,
>      but altivec is a win so enable it.  */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably
> for a similar reason.
>
>>>> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>>>
>>> That makes more sense than the list in patch 5/5.
>>
>> Sorry for the ambiguity. In patch 5/5 I was addressing only
>> powerpc/32, and as far as I know the only powerpc/32 supported by
>> Linux that has isel is the 85xx, which has an e500 core.
>
> What is "powerpc/32"? It does not help if you use different names
> from what everyone else does.

Again, that's what it is called in the Linux kernel; see these commit
subjects:

$ git log --oneline arch/powerpc/ | grep powerpc/32
2bf3caa7cc3b powerpc/32: Stop printing Kernel virtual memory layout
2a17a5bebc9a powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
dca5b1d69aea powerpc/32: Implement validation of emergency stack
2f2b9a3adc66 powerpc/32s: Reduce default size of module/execmem area
5799cd765fea powerpc/32: Convert patch_instruction() to patch_uint()
6035e7e35482 powerpc/32: Curb objtool unannotated intra-function call warning
b72c066ba85a powerpc/32: fix ADB_CUDA kconfig warning
cb615bbe5526 powerpc/32: fix ADB_CUDA kconfig warning
c8a1634145c2 powerpc/32: Drop unused grackle_set_stg()
aad26d3b6af1 powerpc/32s: Implement local_flush_tlb_page_psize()
bac4cffc7c4a powerpc/32s: Introduce _PAGE_READ and remove _PAGE_USER
46ebef51fd92 powerpc/32s: Add _PAGE_WRITE to supplement _PAGE_RW
f84b727d132c powerpc/32: Enable POWER_RESET in pmac32_defconfig
a3ef2fef198c powerpc/32: Add dependencies of POWER_RESET for pmac32
7cb0094be4a5 powerpc/32s: Cleanup the mess in __set_pte_at()
6958ad05d578 powerpc/32: Rearrange _switch to prepare for 32/64 merge
fc8562c9b69a powerpc/32: Remove sync from _switch
...

It means everything built with CONFIG_PPC32.

> The name "powerpc32" is sometimes used colloquially to mean PowerPC
> code running in SF=0 mode (MSR[SF]=0), but perhaps more often it is
> used for 32-bit only implementations (so, those that do not even
> have that bit: it's bit 0 in the 64-bit MSR, so all implementations
> that have an only-32-bit MSR, for example).
>
>> For powerpc/64 we have fewer constraints than on powerpc/32:
>> - Kernel memory starts at 0xc000000000000000
>> - User memory stops at 0x0010000000000000
>
> That isn't true, not even if you mean some existing name. Usually
> userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
> anyway, different default on different ABIs of course :-)

0x10000000 is below 0x0010000000000000, isn't it? So why isn't it
true?

On 64-bit powerpc, everything related to user is between 0 and
TASK_SIZE_MAX. TASK_SIZE_MAX is either TASK_SIZE_4PB or
TASK_SIZE_64TB depending on page size (64k or 4k).
TASK_SIZE_4PB is 0x0010000000000000UL.

Christophe

>
>>> And for access_ok() avoiding the conditional is a good enough
>>> reason to use a 'conditional move' instruction.
>>> Avoiding speculation is actually free.
>>
>> And on CPUs that are not affected by Spectre and Meltdown, like
>> powerpc 8xx or powerpc 603,
>
> Erm.
>
>
> Segher
Hi!

On Sat, Jul 05, 2025 at 12:55:06PM +0200, Christophe Leroy wrote:
> > > For book3s64, GCC only uses isel with -mcpu=power9 or
> > > -mcpu=power10.
> >
> > I have no idea what "book3s64" means.
>
> Well, that's the name given in the Linux kernel to the 64-bit Power
> processors.

A fantasy name. Great.

> > What is "powerpc/32"? It does not help if you use different names
> > from what everyone else does.
>
> Again, that's what it is called in the Linux kernel; see these
> commit subjects:

And another.

> It means everything built with CONFIG_PPC32.

Similar names for very dissimilar concepts, even! Woohoo!

> > > For powerpc/64 we have fewer constraints than on powerpc/32:
> > > - Kernel memory starts at 0xc000000000000000
> > > - User memory stops at 0x0010000000000000
> >
> > That isn't true, not even if you mean some existing name. Usually
> > userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
> > anyway, different default on different ABIs of course :-)
>
> 0x10000000 is below 0x0010000000000000, isn't it? So why isn't it
> true?

I understood "starts at". I read cross-eyed maybe, hehe.


Segher
On Tue, 24 Jun 2025 07:27:47 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> On 22/06/2025 at 18:20, David Laight wrote:
> > On Sun, 22 Jun 2025 11:52:38 +0200
> > Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> >
> >> Masked user access avoids the address/size verification by
> >> access_ok(). Although its main purpose is to avoid speculation
> >> around the verification of the user address and size, and hence
> >> the need for speculation mitigation, it also has the advantage of
> >> reducing the number of instructions needed, so it also benefits
> >> platforms that don't need speculation mitigation, especially when
> >> the size of the copy is not known at build time.
> >
> > It also removes a conditional branch that is quite likely to be
> > statically predicted 'the wrong way'.
>
> But include/asm-generic/access_ok.h defines access_ok() as:
>
> 	#define access_ok(addr, size) likely(__access_ok(addr, size))
>
> So GCC uses the 'unlikely' variant of the branch instruction to force
> the correct prediction, doesn't it?

Nope...
Most architectures don't have likely/unlikely variants of branches.
So all gcc can do is decide which path is the fall-through and whether
the branch is forwards or backwards.
Additionally, unless there is code in both the 'if' and 'else' clauses
the [un]likely seems to have no effect.
So on simple cpus that predict 'backwards branches taken' you can get
the desired effect - but it may need an 'asm comment' to force the
compiler to generate the required branches (eg a forwards branch
directly to a backwards unconditional jump).

On x86 it is all more complicated.
I think the pre-fetch code is likely to assume 'not taken' (but might
use stale info on the cache line).
The predictor itself never does 'static prediction' - it is always
based on the referenced branch prediction data structure.
So, unless you are in a loop (eg running a benchmark!) there is pretty
much a 50% chance of a branch mispredict.

I've been trying to benchmark different versions of the
u64 * u64 / u64 function - and I think mispredicted branches make a
big difference. I need to sit down and sequence the test cases so
that I can see the effect of each branch!

> >> Unlike x86_64, which masks the address to 'all bits set' when the
> >> user address is invalid, here the address is set to an address in
> >> the gap. This avoids relying on the zero page to catch offset
> >> accesses. On book3s/32 it makes sure the opening remains on the
> >> user segment. The extra cost is a single instruction in the
> >> masking.
> >
> > That isn't true (any more).
> > Linus changed the check to (approx):
> > 	if (uaddr > TASK_SIZE)
> > 		uaddr = TASK_SIZE;
> > (Implemented with a conditional move.)
>
> Ah ok, I overlooked that. I didn't know the cmove instruction; it
> seems similar to the isel instruction on powerpc e500.

It got added for the 386 - I learnt 8086 :-)
I suspect x86 got there first...

Although called 'conditional move' I very much suspect the write is
actually unconditional. So the hardware implementation is much the
same as 'add carry' except the ALU operation is a simple multiplex.
Which means it is unlikely to be speculative.

	David

> 
> Christophe
> 
Hi!

On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > So GCC uses the 'unlikely' variant of the branch instruction to
> > force the correct prediction, doesn't it?
>
> Nope...
> Most architectures don't have likely/unlikely variants of branches.

In GCC, "likely" means 80%. "Very likely" means 99.95%. Most things
get something more appropriate than such coarse things predicted.

Most of the time GCC uses these predicted branch probabilities to lay
out code in such a way that the fall-through path is the expected one.
Target backends can do special things with it as well, but usually
that isn't necessary.

There are many different predictors. GCC usually can predict things
not badly by just looking at the shape of the code, using various
heuristics. Things like profile-guided optimisation allow using a
profile from an actual execution to optimise the code so that it will
work faster (assuming that future executions of the code will execute
similarly!)

You also can use __builtin_expect() in the source code, to put coarse
static prediction in. That is what the kernel "{un,}likely" macros
do.

If the compiler knows some branch is not very predictable, it can
optimise the code knowing that. Like, it could use other strategies
than conditional branches.

On old CPUs something like "this branch is taken 50% of the time"
makes it a totally unpredictable branch. But if, say, it branches
exactly every second time, it is 100% predicted correctly by more
advanced predictors, not just 50%.

To properly model modern branch predictors we need to record a "how
predictable is this branch" score as well for every branch, not just
a "how often does it branch instead of falling through" score. We're
not there yet.


Segher
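Concretely, the kernel macros Segher refers to are thin wrappers
around __builtin_expect() (simplified from include/linux/compiler.h):

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

They only bias GCC's block layout and branch-probability estimates; on
most targets there is no 'predict taken/not-taken' bit in the branch
encoding itself, which is the point David makes above.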
On Tue, 24 Jun 2025 16:37:12 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:

> Hi!
>
> On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > > So GCC uses the 'unlikely' variant of the branch instruction to
> > > force the correct prediction, doesn't it?
> >
> > Nope...
> > Most architectures don't have likely/unlikely variants of branches.
>
> In GCC, "likely" means 80%. "Very likely" means 99.95%. Most things
> get something more appropriate than such coarse things predicted.
>
> Most of the time GCC uses these predicted branch probabilities to lay
> out code in such a way that the fall-through path is the expected
> one.

That is fine provided the cpu doesn't predict the 'taken' path.
If you write:
	if (unlikely(x))
		continue;
gcc is very likely to generate a backwards conditional branch that
will get predicted taken (by a cpu without dynamic branch prediction).
You need to put something (an asm comment will do) before the
'continue' to force gcc to generate a forwards (predicted not taken)
branch to the backwards jump.

> Target backends can do special things with it as well, but usually
> that isn't necessary.
>
> There are many different predictors. GCC usually can predict things
> not badly by just looking at the shape of the code, using various
> heuristics. Things like profile-guided optimisation allow using a
> profile from an actual execution to optimise the code so that it
> will work faster (assuming that future executions of the code will
> execute similarly!)

Without cpu instructions to force static prediction I don't see how
that helps as much as you might think.
Each time the code is loaded into the I-cache the branch predictor
state is likely to have been destroyed by other code.
So the branches get predicted by 'the other code' regardless of any
layout.

> You also can use __builtin_expect() in the source code, to put coarse
> static prediction in. That is what the kernel "{un,}likely" macros
> do.
>
> If the compiler knows some branch is not very predictable, it can
> optimise the code knowing that. Like, it could use other strategies
> than conditional branches.
>
> On old CPUs something like "this branch is taken 50% of the time"
> makes it a totally unpredictable branch. But if, say, it branches
> exactly every second time, it is 100% predicted correctly by more
> advanced predictors, not just 50%.

Only once you are in a code loop.
Dynamic branch prediction is pretty hopeless for linear code.
The first time you execute a branch it is likely to be predicted
taken 50% of the time.
(I guess a bit less than 50% - it will be the percentage of branches
that are taken.)

> To properly model modern branch predictors we need to record a "how
> predictable is this branch" score as well for every branch, not just
> a "how often does it branch instead of falling through" score.
> We're not there yet.

If you are going to adjust the source code you want to determine
correct static prediction for most branches. That probably requires
an 'every other' static prediction.
I spent a lot of time optimising some code to minimise the worst case
path; the first thing I had to do was disable the dynamic branch
prediction logic.

	David