arch/x86/include/asm/futex.h | 12 ++---- fs/select.c | 4 -- include/linux/uaccess.h | 78 +++++++++++++++++++++++++++++++++++++++++++ kernel/futex/futex.h | 37 +------------------- 4 files changed, 85 insertions(+), 46 deletions(-)
commit 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") provided an optimization for unsafe_get/put_user(), which optimizes the Spectre-V1 mitigation in an architecture specific way. Currently only x86_64 supports that. The required code pattern screams for helper functions before it is copied all over the kernel. So far the exposure is limited to futex, x86 and fs/select. Provide a set of helpers for common single size access patterns: - get/put_user_masked_uNN() where $NN is the variable size in bits, i.e. 8, 16, 32, 64. - user_read/write_masked_begin() for scenarios, where several unsafe_put/get_user() invocations are required. Convert the existing users over to this. The series applies on top of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent which has a bug fix pending for the futex implementation of this. It's also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git uaccess/masked there is another candidate for conversion: RSEQ. This has not been converted yet because there are more severe performance issues with this code, which starts to become prominent in exit path profiling because glibc started to use it. This will be addressed in a seperate series. As this creates dependencies, merging it should go through one central tree, i.e. tip, but that's a debate for later :) Thanks, tglx --- arch/x86/include/asm/futex.h | 12 ++---- fs/select.c | 4 -- include/linux/uaccess.h | 78 +++++++++++++++++++++++++++++++++++++++++++ kernel/futex/futex.h | 37 +------------------- 4 files changed, 85 insertions(+), 46 deletions(-)
On Wed, 13 Aug 2025 17:57:00 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote: > commit 2865baf54077 ("x86: support user address masking instead of > non-speculative conditional") provided an optimization for > unsafe_get/put_user(), which optimizes the Spectre-V1 mitigation in an > architecture specific way. Currently only x86_64 supports that. > > The required code pattern screams for helper functions before it is copied > all over the kernel. So far the exposure is limited to futex, x86 and > fs/select. > > Provide a set of helpers for common single size access patterns: (gmail hasn't decided to accept 1/4 yet - I need to find a better mail relay...) +/* + * Conveniance macros to avoid spreading this pattern all over the place ^ spelling... + */ +#define user_read_masked_begin(src) ({ \ + bool __ret = true; \ + \ + if (can_do_masked_user_access()) \ + src = masked_user_access_begin(src); \ + else if (!user_read_access_begin(src, sizeof(*src))) \ + __ret = false; \ + __ret; \ +}) I proposed something very similar a while back. Since it updated 'src' it really ought to be passed by address. For the general case you also need the a parameter for the size. Linus didn't like it, but I've forgotten why. I'm also not convinced of the name. There isn't any 'masking' involved, so it shouldn't be propagated. There is also an implementation issue. The original masker_user_access_begin() returned ~0 for kernel addresses. That requires that the code always access offset zero first. I looked up some candidates for this code and found one (possibly epoll) that did the accesses in the wrong order. The current x86-64 'cmp+cmov' version returns the base of the guard page, so is safe provided the accesses are 'reasonably sequential'. That probably ought to be a requirement. David
On Sun, 17 Aug 2025 14:49:43 +0100 David Laight <david.laight.linux@gmail.com> wrote: > On Wed, 13 Aug 2025 17:57:00 +0200 (CEST) > Thomas Gleixner <tglx@linutronix.de> wrote: > > > commit 2865baf54077 ("x86: support user address masking instead of > > non-speculative conditional") provided an optimization for > > unsafe_get/put_user(), which optimizes the Spectre-V1 mitigation in an > > architecture specific way. Currently only x86_64 supports that. > > > > The required code pattern screams for helper functions before it is copied > > all over the kernel. So far the exposure is limited to futex, x86 and > > fs/select. > > > > Provide a set of helpers for common single size access patterns: > > (gmail hasn't decided to accept 1/4 yet - I need to find a better > mail relay...) > > +/* > + * Conveniance macros to avoid spreading this pattern all over the place > ^ spelling... > + */ > +#define user_read_masked_begin(src) ({ \ > + bool __ret = true; \ > + \ > + if (can_do_masked_user_access()) \ > + src = masked_user_access_begin(src); \ > + else if (!user_read_access_begin(src, sizeof(*src))) \ > + __ret = false; \ > + __ret; \ > +}) Would something like this work (to avoid the hidden update)? #define user_read_begin(uaddr, size, error_code) ({ \ typeof(uaddr) __uaddr; \ if (can_do_masked_user_access()) \ __uaddr = masked_user_access_begin(uaddr);\ else if (user_read_access_begin(uaddr, size)) \ __uaddr = uaddr; \ else { \ error_code; \ } \ __uaddr; \ }) With typical use being either: uaddr = user_read_begin(uaddr, sizeof (*uaddr), return -EFAULT); or: uaddr = user_read_begin(uaddr, sizeof (*uaddr), goto bad_uaddr); One problem is I don't think you can easily enforce the assignment. Ideally you'd want something that made the compiler think that 'uaddr' was unset. It could be done for in a debug/diagnostic compile by adding 'uaddr = NULL' at the bottom of the #define and COMPILE_ASSERT(!staticically_true(uaddr == NULL)) inside unsafe_get/put_user(). David
On 2025-08-18 22:21:06+0100, David Laight wrote: > On Sun, 17 Aug 2025 14:49:43 +0100 > David Laight <david.laight.linux@gmail.com> wrote: (...) > Would something like this work (to avoid the hidden update)? > > #define user_read_begin(uaddr, size, error_code) ({ \ > typeof(uaddr) __uaddr; \ > if (can_do_masked_user_access()) \ > __uaddr = masked_user_access_begin(uaddr);\ > else if (user_read_access_begin(uaddr, size)) \ > __uaddr = uaddr; \ > else { \ > error_code; \ > } \ > __uaddr; \ > }) > > With typical use being either: > uaddr = user_read_begin(uaddr, sizeof (*uaddr), return -EFAULT); > or: > uaddr = user_read_begin(uaddr, sizeof (*uaddr), goto bad_uaddr); > > One problem is I don't think you can easily enforce the assignment. > Ideally you'd want something that made the compiler think that 'uaddr' was unset. > It could be done for in a debug/diagnostic compile by adding 'uaddr = NULL' > at the bottom of the #define and COMPILE_ASSERT(!staticically_true(uaddr == NULL)) > inside unsafe_get/put_user(). To enforce some assignment, but not to the exact same variable as the argument, you can wrap user_read_begin() in a function marked as __must_check. #define __user_read_begin(uaddr, size, error_code) ({ \ /* See above */ }) static __always_inline void __must_check __user *__user_read_check(void __user *val) { return val; } #define user_read_begin(uaddr, size, error_code) \ ((typeof(uaddr))__user_read_check(__user_read_begin(uaddr, size, error_code))) Ignoring the return value gives: error: ignoring return value of ‘__user_read_check’ declared with attribute ‘warn_unused_result’ [-Werror=unused-result] 1629 | ((typeof(uaddr))__user_read_check(__user_read_begin(uaddr, size, error_code))) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ note: in expansion of macro ‘user_read_begin’ 1635 | user_read_begin(uaddr, sizeof (*uaddr), return -EFAULT); | ^~~~~~~~~~~~~~~ Thomas
On Mon, 18 Aug 2025 at 14:21, David Laight <david.laight.linux@gmail.com> wrote: > > Would something like this work (to avoid the hidden update)? It would certainly work, but I despise code inside macro arguments even more than I dislike the hidden update. If we want something like this, we should just make that last argument be a label, the same way unsafe_{get,put}_user() already works. That would not only match existing user access exception handling, it might allow for architecture-specific asm code that uses synchronous trap instructions (ie the label might turn into an exception entry) It's basically "manual exception handling", whether it then uses actual exceptions (like user accesses do) or ends up being some software implementation with just a "goto label" for the error case. I realize some people have grown up being told that "goto is bad". Or have been told that exception handling should be baked into the language and be asynchronous. Both of those ideas are complete and utter garbage, and the result of minds that cannot comprehend reality. Asynchronous exceptions are horrific and tend to cause huge performance problems (think setjmp()). The Linux kernel exception model with explicit exception points is not only "that's how you have to do it in C", it's also technically superior. And "goto" is fine, as long as you have legible syntax and don't use it to generate spaghetti code. Being able to write bad code with goto doesn't make 'goto' bad - you can write bad code with *anything*. Linus
On Mon, 18 Aug 2025 14:36:31 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 18 Aug 2025 at 14:21, David Laight <david.laight.linux@gmail.com> wrote: > > > > Would something like this work (to avoid the hidden update)? > > It would certainly work, but I despise code inside macro arguments > even more than I dislike the hidden update. > > If we want something like this, we should just make that last argument > be a label, the same way unsafe_{get,put}_user() already works. A 'goto label' is probably a bit more readable that just 'label'. But there will be similar code in the same function. But it can't use the same label - one needs the 'user_access_end()'. I wanted to allow an immediate 'return -EFAULT' as well as a goto. But not really expect anything more than 'rval = -EFAULT; goto label'. I do agree than some of the 'condvar wait' macros are a PITA when a chunk of code is executed repeatedly in a loop. There are also the iover iter ones, did they get better? > That would not only match existing user access exception handling, it > might allow for architecture-specific asm code that uses synchronous > trap instructions (ie the label might turn into an exception entry) Unlikely to be a trap in this case, but I guess you might want jump directly from asm. OTOH the real aim of this code has to be for all architectures to have a guard/invalid page that kernel addresses get converted to. So eventually the conditional jump disappears. > > It's basically "manual exception handling", whether it then uses > actual exceptions (like user accesses do) or ends up being some > software implementation with just a "goto label" for the error case. > > I realize some people have grown up being told that "goto is bad". Or > have been told that exception handling should be baked into the > language and be asynchronous. Both of those ideas are complete and > utter garbage, and the result of minds that cannot comprehend reality. Test: which language has a 'whenever' statement? IIRC the use is (effectively) 'whenever variable == value goto label'. (I hope my brain does remember that correctly, the implementation of that language I used didn't support it.) > Asynchronous exceptions are horrific and tend to cause huge > performance problems (think setjmp()). The original setjmp was trivial, no callee saved registers, so just saved the program counter and stack pointer. The original SYSV kernel used setjmp/longjmp to exit the kernel on error (after writing the errno value to u.u_error). The real killer is having to follow the stack to execute all the destructors. > The Linux kernel exception > model with explicit exception points is not only "that's how you have > to do it in C", it's also technically superior. Indeed. I've seen C++ code that did 'new buffer', called into some deep code that would normally save the pointer, but had a try/catch block that always freed it. The code had no way of knowing whether the exception happened before or after the pointer was saved. And it is pretty impossible to check all the places that might 'throw'. > > And "goto" is fine, as long as you have legible syntax and don't use > it to generate spaghetti code. Being able to write bad code with goto > doesn't make 'goto' bad - you can write bad code with *anything*. I fixed some 'dayjob' code that tried not to use goto, break or return. The error paths were just all wrong. David > > Linus
On Mon, Aug 18, 2025 at 02:36:31PM -0700, Linus Torvalds wrote: > I realize some people have grown up being told that "goto is bad". Or > have been told that exception handling should be baked into the > language and be asynchronous. Both of those ideas are complete and > utter garbage, and the result of minds that cannot comprehend reality. > > Asynchronous exceptions are horrific and tend to cause huge > performance problems (think setjmp()). The Linux kernel exception > model with explicit exception points is not only "that's how you have > to do it in C", it's also technically superior. > > And "goto" is fine, as long as you have legible syntax and don't use > it to generate spaghetti code. Being able to write bad code with goto > doesn't make 'goto' bad - you can write bad code with *anything*. Even Dijkstra doesn't argue against using goto for exception handling, "I remember having read the explicit recommendation to restrict the use of the go to statement to alarm exits, but I have not been able to trace it; presumably, it has been made by C.A.R. Hoare." https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD215.html
On Mon, Aug 18, 2025 at 02:36:31PM -0700, Linus Torvalds wrote: > And "goto" is fine, as long as you have legible syntax and don't use > it to generate spaghetti code. Being able to write bad code with goto > doesn't make 'goto' bad - you can write bad code with *anything*. Put it another way, one can massage a code into a strictly structured (no goto, only one exit from each block, etc.) equivalent and every hard-to-answer question about the original will map to the replacement - just as hard as it had been. I suspect that folks with "goto is a Bad Word(tm)" hardon had been told that goto was always avoidable, but had never bothered to read the proof...
On Mon, 18 Aug 2025 at 15:21, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I suspect that folks with "goto is a Bad Word(tm)" hardon had been told > that goto was always avoidable, but had never bothered to read the proof... I think it's just a combination of compiler people historically finding purely structured loops easier to analyze (back before modern things like SSA). Together with language people who wanted to point out that "modern" languages had structures loop constructs. Both issues go back to the 1960s, and both are entirely irrelevant today - and have been for decades. It's not like you need to actively teach people to use for-loops instead of 'goto' these days. Now, I don't advocate 'goto' as a general programming model, but for exception handling it's superior to any alternative I know of. Exceptions simply DO NOT NEST, and 'try-catch-finally' is an insane model for exceptions that has only made things worse both for compilers and for programmers. So I do think using labels (without any crazy attempt nesting syntax) is objectively the superior model. And the 'finally' mess is much better handled by compilers dealing with cleanup - again without any pointless artificial nesting structures. I think most of our <linux/cleanup.h> models have been quite successful. Linus
On Mon, Aug 18, 2025 at 04:00:44PM -0700, Linus Torvalds wrote: > Now, I don't advocate 'goto' as a general programming model, but for > exception handling it's superior to any alternative I know of. > > Exceptions simply DO NOT NEST, and 'try-catch-finally' is an insane > model for exceptions that has only made things worse both for > compilers and for programmers. > > So I do think using labels (without any crazy attempt nesting syntax) > is objectively the superior model. > > And the 'finally' mess is much better handled by compilers dealing > with cleanup - again without any pointless artificial nesting > structures. I think most of our <linux/cleanup.h> models have been > quite successful. I'm still rather cautious about the uses related to locks - it's very easy to overextend the area where lock is held (witness the fs/namespace.c bugs of the "oops, that should've been scoped_guard(), not guard()" variety - we had several this year) and "grab lock, except it might fail" stuff appears to be all awful - when macro is supposed to be used like scoped_cond_guard(lock_timer, return -EINVAL, _id) (hidden in the bowels of another macro, no less)... I'm still trying to come up with something edible for lock_mount() - the best approximation I've got so far is CLASS(lock_mount, mp)(path); if (IS_ERR(mp.mp)) bugger off ... with things like do_add_mount() avoiding the IS_ERR(...) part by starting with if (IS_ERR(mp)) return PTR_ERR(mp); With that we get e.g. CLASS(lock_mount, mp)(mountpoint); error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) // mnt is consumed by successful do_add_mount() retain_and_null_ptr(mnt); return error; but it takes some massage to get there...
On Tue, Aug 19, 2025 at 01:39:09AM +0100, Al Viro wrote: > I'm still trying to come up with something edible for lock_mount() - > the best approximation I've got so far is > > CLASS(lock_mount, mp)(path); > if (IS_ERR(mp.mp)) > bugger off ... and that does not work, since DEFINE_CLASS() has constructor return a value that gets copied into the local variable in question. Which is unusable for situations when a part of what constructor is doing is insertion of that local variable into a list. __cleanup() per se is still usable, but... no DEFINE_CLASS for that kind of data structures ;-/
On Thu, Aug 21, 2025 at 12:48:15AM +0100, Al Viro wrote: > On Tue, Aug 19, 2025 at 01:39:09AM +0100, Al Viro wrote: > > I'm still trying to come up with something edible for lock_mount() - > > the best approximation I've got so far is > > > > CLASS(lock_mount, mp)(path); > > if (IS_ERR(mp.mp)) > > bugger off > > ... and that does not work, since DEFINE_CLASS() has constructor return > a value that gets copied into the local variable in question. > > Which is unusable for situations when a part of what constructor is > doing is insertion of that local variable into a list. > > __cleanup() per se is still usable, but... no DEFINE_CLASS for that kind > of data structures ;-/ Just add the custom infrastructure that we need for this to work out imho. If it's useful outside of our own realm then we can add it to cleanup.h and if not we can just add our own header...
On Thu, Aug 21, 2025 at 09:45:22AM +0200, Christian Brauner wrote: > On Thu, Aug 21, 2025 at 12:48:15AM +0100, Al Viro wrote: > > On Tue, Aug 19, 2025 at 01:39:09AM +0100, Al Viro wrote: > > > I'm still trying to come up with something edible for lock_mount() - > > > the best approximation I've got so far is > > > > > > CLASS(lock_mount, mp)(path); > > > if (IS_ERR(mp.mp)) > > > bugger off > > > > ... and that does not work, since DEFINE_CLASS() has constructor return > > a value that gets copied into the local variable in question. > > > > Which is unusable for situations when a part of what constructor is > > doing is insertion of that local variable into a list. > > > > __cleanup() per se is still usable, but... no DEFINE_CLASS for that kind > > of data structures ;-/ > > Just add the custom infrastructure that we need for this to work out imho. Obviously... I'm going to put that into a branch on top of -rc3 and keep the more infrastructural parts in the beginning, so they could be merged into other branches in vfs/vfs.git without disrupting things on reordering. > If it's useful outside of our own realm then we can add it to cleanup.h > and if not we can just add our own header... lock_mount() et.al. are purely fs/namespace.c, so no header is needed at all. FWIW, existing guards in there have problems - I ended up with DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock()) DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem), up_read(&namespace_sem)) in fs/namespace.c and DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock), write_sequnlock(&mount_lock)) DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock), read_sequnlock_excl(&mount_lock)) in fs/mount.h; I'm doing conversions to those where they clearly are good fit and documenting as I go. mount_lock ones really should not be done in a blanket way - right now they are wrong in quite a few cases, where writer is used instead of the locked reader; we'll need to sort that out and I'd rather keep the open-coded ones for the stuff yet to be considered and/or tricky. BTW, the comments I'm using for functions are along the lines of * locks: mount_locked_reader || namespace_shared && is_mounted(mnt) this one - for is_path_reachable(). If you look through the comments there you'll see things like "vfsmount lock must be held for write" and the rwlock those are refering to had been gone for more than a decade... DEFINE_LOCK_GUARD_0 vs. DEFINE_GUARD makes for saner code generation; having it essenitally check IS_ERR_OR_NULL(&namespace_sem) is already ridiculous, but when it decides to sacrifice a register for that, complete with a bunch of spills...
On Sun, 17 Aug 2025 at 06:50, David Laight <david.laight.linux@gmail.com> wrote: > > Linus didn't like it, but I've forgotten why. I think the reason I didn't love it is that it has a bit subtle semantics, and I think you just proved my point: > I'm also not convinced of the name. > There isn't any 'masking' involved, so it shouldn't be propagated. Sure there is. Look closer at that patch: + if (can_do_masked_user_access()) \ + src = masked_user_access_begin(src); \ IOW, that macro changes the argument and masks it. So it's actually really easy to use, but it's also really easy to miss that it does that. We've done this before, and I have done it myself. The classic example is the whole "do_div()" macro that everybody hated because it did exactly the same thing (we also have "save_flags()" etc that have this behavior). So I don't love it - but I can't claim I've not done the same thing, and honestly, it does make it very easy to use, so when Thomas sent this series out I didn't speak out against it. Linus
On Sun, 17 Aug 2025 07:00:17 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 17 Aug 2025 at 06:50, David Laight <david.laight.linux@gmail.com> wrote: > > > > Linus didn't like it, but I've forgotten why. > > I think the reason I didn't love it is that it has a bit subtle > semantics, and I think you just proved my point: Just requiring the caller pass &user_ptr would make it more obvious. The generated code (with 'src' -> *&src) will be the same. > > > I'm also not convinced of the name. > > There isn't any 'masking' involved, so it shouldn't be propagated. > > Sure there is. Look closer at that patch: > > + if (can_do_masked_user_access()) \ > + src = masked_user_access_begin(src); \ > > IOW, that macro changes the argument and masks it. Except the change has never been a 'mask' in the traditional sense. Neither the original cmp+sbb+or nor current cmp+cmov is really applying a mask. I think the 'guard page' might even be the highest user page, so it isn't even the case that kernel addresses get their low bits masked off. The function could just be user_read_begin(void __user *addr, unsigned long len); Although since it is the start of an unsafe_get_user() sequence perhaps is should be unsafe_get_user_begin() ? > > So it's actually really easy to use, but it's also really easy to miss > that it does that. > > We've done this before, and I have done it myself. The classic example > is the whole "do_div()" macro that everybody hated because it did > exactly the same thing Divide is (well was, I think my zen5 has a fast divide) also slow enough that I doubt it would have mattered. - can you drop a 'must_check' on the div_u64() that people keep putting in patches as a drop-in replacement for do_div()? David > (we also have "save_flags()" etc that have this > behavior). > > So I don't love it - but I can't claim I've not done the same thing, > and honestly, it does make it very easy to use, so when Thomas sent > this series out I didn't speak out against it. > > Linus
On Sun, 17 Aug 2025 at 08:29, David Laight <david.laight.linux@gmail.com> wrote: > > Just requiring the caller pass &user_ptr would make it more obvious. > The generated code (with 'src' -> *&src) will be the same. I personally absolutely detest passing pointers in and then hoping the compiler will fix it up and not make the assembler do the stupid thing that the source code does. That's actually true in general - I strive to make the source code and the generated code line up fairly closely, rather than "compilers fix up the mistakes in the source code". > > We've done this before, and I have done it myself. The classic example > > is the whole "do_div()" macro that everybody hated because it did > > exactly the same thing > > Divide is (well was, I think my zen5 has a fast divide) also slow enough that > I doubt it would have mattered. It mattered for code generation quality and smaller simpler code to look at. I still look at the generated asm (not for do_div(), but other things), and having compiler-generated code that makes sense and matches the source is a big plus in my book. Linus
On Sun, 17 Aug 2025 08:36:05 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 17 Aug 2025 at 08:29, David Laight <david.laight.linux@gmail.com> wrote: > > > > Just requiring the caller pass &user_ptr would make it more obvious. > > The generated code (with 'src' -> *&src) will be the same. > > I personally absolutely detest passing pointers in and then hoping the > compiler will fix it up and not make the assembler do the stupid thing > that the source code does. I do generally dislike passing integers and pointers by reference. It typically generates horrid code further down the function as well as all the costs of the memory references themselves. But hidden updates to variables are worse. And we know (and can verify) that the generated code here is sane. A possible problem with the 'hidden update' is that someone could easily write code (eg in an ioctl handler) where the user pointer is in a buffer that gets written back to user space. Passing the pointer by reference makes it rather more obvious it can get changed. > That's actually true in general - I strive to make the source code and > the generated code line up fairly closely, rather than "compilers fix > up the mistakes in the source code". Especially when you are trying to code what it thinks is 'a mistake' :-) > > > We've done this before, and I have done it myself. The classic example > > > is the whole "do_div()" macro that everybody hated because it did > > > exactly the same thing > > > > Divide is (well was, I think my zen5 has a fast divide) also slow enough that > > I doubt it would have mattered. > > It mattered for code generation quality and smaller simpler code to look at. > > I still look at the generated asm (not for do_div(), but other > things), and having compiler-generated code that makes sense and > matches the source is a big plus in my book. Don't worry I also keep looking at generated code, some of it is stunningly bad (and seems to get worse with each new compiler version). Don't even think about what happens for C++ std::ostringstream. David > > Linus
© 2016 - 2025 Red Hat, Inc.