[PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()

Josh Poimboeuf posted 6 patches 3 weeks, 6 days ago
[PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Josh Poimboeuf 3 weeks, 6 days ago
The barrier_nospec() in 64-bit __get_user() is slow.  Instead use
pointer masking to force the user pointer to all 1's if a previous
access_ok() mispredicted true for an invalid address.

Note that for safety on some AMD CPUs, this relies on recent commit
86e6b1547b3d ("x86: fix user address masking non-canonical speculation
issue").

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/lib/getuser.S | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 4357ec2a0bfc..998d5be6b794 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -112,8 +112,12 @@ EXPORT_SYMBOL(__get_user_8)
 
 /* .. and the same for __get_user, just without the range checks */
 SYM_FUNC_START(__get_user_nocheck_1)
-	ASM_STAC
+#ifdef CONFIG_X86_64
+	check_range size=1
+#else
 	ASM_BARRIER_NOSPEC
+#endif
+	ASM_STAC
 	UACCESS movzbl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -122,8 +126,12 @@ SYM_FUNC_END(__get_user_nocheck_1)
 EXPORT_SYMBOL(__get_user_nocheck_1)
 
 SYM_FUNC_START(__get_user_nocheck_2)
-	ASM_STAC
+#ifdef CONFIG_X86_64
+	check_range size=2
+#else
 	ASM_BARRIER_NOSPEC
+#endif
+	ASM_STAC
 	UACCESS movzwl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -132,8 +140,12 @@ SYM_FUNC_END(__get_user_nocheck_2)
 EXPORT_SYMBOL(__get_user_nocheck_2)
 
 SYM_FUNC_START(__get_user_nocheck_4)
-	ASM_STAC
+#ifdef CONFIG_X86_64
+	check_range size=4
+#else
 	ASM_BARRIER_NOSPEC
+#endif
+	ASM_STAC
 	UACCESS movl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -142,8 +154,12 @@ SYM_FUNC_END(__get_user_nocheck_4)
 EXPORT_SYMBOL(__get_user_nocheck_4)
 
 SYM_FUNC_START(__get_user_nocheck_8)
-	ASM_STAC
+#ifdef CONFIG_X86_64
+	check_range size=8
+#else
 	ASM_BARRIER_NOSPEC
+#endif
+	ASM_STAC
 #ifdef CONFIG_X86_64
 	UACCESS movq (%_ASM_AX),%rdx
 #else
-- 
2.47.0
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Josh Poimboeuf 3 weeks, 6 days ago
On Mon, Oct 28, 2024 at 06:56:15PM -0700, Josh Poimboeuf wrote:
> The barrier_nospec() in 64-bit __get_user() is slow.  Instead use
> pointer masking to force the user pointer to all 1's if a previous
> access_ok() mispredicted true for an invalid address.

Linus pointed out that __get_user() may be used by some code to access
both kernel and user space and in fact I found one such usage in
vc_read_mem()....

So I self-NAK this patch for now.

Still, it would be great if patch 1 could get merged as that gives a
significant performance boost.

-- 
Josh
RE: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by David Laight 2 weeks, 2 days ago
From: Josh Poimboeuf
> Sent: 29 October 2024 03:28
> 
> On Mon, Oct 28, 2024 at 06:56:15PM -0700, Josh Poimboeuf wrote:
> > The barrier_nospec() in 64-bit __get_user() is slow.  Instead use
> > pointer masking to force the user pointer to all 1's if a previous
> > access_ok() mispredicted true for an invalid address.
> 
> Linus pointed out that __get_user() may be used by some code to access
> both kernel and user space and in fact I found one such usage in
> vc_read_mem()....
> 
> So I self-NAK this patch for now.
> 
> Still, it would be great if patch 1 could get merged as that gives a
> significant performance boost.

I'm a bit late to the party and still a week behind :-(

But I've wondered if access_ok() ought to be implemented using an
'asm goto with output' - much like get_user().

Then the use would be:
	masked_address = access_ok(maybe_bad_address, size, jump_label);
with later user accesses using the masked_address.

Once you've done that __get_user() doesn't need to contain address masking.

Given that clac/stac iare so slow should there are be something that
combines stac with access_ok() bracketed with a 'user_access_end'
or an actual fault.

I've sure there is code (maybe reading iovec[] or in sys_poll())
that wants to do multiple get/put_user in a short loop rather that
calling copy_to/from_user().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by 'Josh Poimboeuf' 1 week, 2 days ago
On Fri, Nov 08, 2024 at 05:12:53PM +0000, David Laight wrote:
> From: Josh Poimboeuf
> > On Mon, Oct 28, 2024 at 06:56:15PM -0700, Josh Poimboeuf wrote:
> > > The barrier_nospec() in 64-bit __get_user() is slow.  Instead use
> > > pointer masking to force the user pointer to all 1's if a previous
> > > access_ok() mispredicted true for an invalid address.
> > 
> > Linus pointed out that __get_user() may be used by some code to access
> > both kernel and user space and in fact I found one such usage in
> > vc_read_mem()....

.. which sucks because I got a "will-it-scale.per_process_ops 1.9%
improvement" report for this patch.

It's sad that __get_user() is now slower than get_user() on x86, it kind
of defeats the whole point!

I know at least the "coco" code is misusing __get_user().  Unless
somebody wants to audit all the other callers, we could do something
horrific:

.macro __get_user_nocheck_nospec
#ifdef CONFIG_X86_64
	movq $0x0123456789abcdef, %rdx
 1:
.pushsection runtime_ptr_USER_PTR_MAX, "a"
	.long 1b - 8 - .
.popsection
	cmp %rax, %rdx
	jb 10f
	sbb %rdx, %rdx
	or %rdx, %rax
	jmp 11f
10:	/*
	 * Stop access_ok() branch misprediction -- both of them ;-)
	 *
	 * As a benefit this also punishes callers who intentionally call this
	 * with a kernel address.  Once they're rooted out, __get_user() can
	 * just become an alias of get_user().
	 *
	 * TODO: Add WARN_ON()
	 */
#endif
	ASM_BARRIER_NOSPEC
11:
.endm

/* .. and the same for __get_user, just without the range checks */
SYM_FUNC_START(__get_user_nocheck_1)
	__get_user_nocheck_nospec
	ASM_STAC
	UACCESS movzbl (%_ASM_AX),%edx
	xor %eax,%eax
	ASM_CLAC
	RET
SYM_FUNC_END(__get_user_nocheck_1)
EXPORT_SYMBOL(__get_user_nocheck_1)

Yes, I know adding another access_ok() is bad, but it would be a
definite speedup.  And adding a WARN_ON() would root out any other bad
callers pretty quick.

> But I've wondered if access_ok() ought to be implemented using an
> 'asm goto with output' - much like get_user().
> 
> Then the use would be:
> 	masked_address = access_ok(maybe_bad_address, size, jump_label);
> with later user accesses using the masked_address.
> 
> Once you've done that __get_user() doesn't need to contain address masking.

Sure, we just need a volunteer to change all the access_ok() implementations
and callers tree-wide ;-)

> Given that clac/stac iare so slow should there are be something that
> combines stac with access_ok() bracketed with a 'user_access_end'
> or an actual fault.
> 
> I've sure there is code (maybe reading iovec[] or in sys_poll())
> that wants to do multiple get/put_user in a short loop rather that
> calling copy_to/from_user().

We already have this with user_access_begin() + unsafe_get_user().
There's also a version which masks the address: masked_user_access_begin().

We just need to start porting things over.

-- 
Josh
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 1 week, 2 days ago
On Fri, 15 Nov 2024 at 15:06, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> It's sad that __get_user() is now slower than get_user() on x86, it kind
> of defeats the whole point!

Well, honestly, we've been trying to get away from __get_user() and
__put_user() for a long long time.

With CLAC/STAC, it's been probably a decade or two since __get_user()
and friends were actually a worthwhile optimization, so let's just
strive to get rid of the ones that matter.

So I think the thing to do is

 (a) find out which __get_user() it is that matters so much for that load

Do you have a profile somewhere?

 (b) convert them to use "unsafe_get_user()", with that whole

                if (can_do_masked_user_access())
                        from = masked_user_access_begin(from);
                else if (!user_read_access_begin(from, sizeof(*from)))
                        return -EFAULT;

     sequence before it.

And if it's just a single __get_user() (rather than a sequence of
them), just convert it to get_user().

Hmm?

                Linus
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Josh Poimboeuf 3 days, 10 hours ago
> On Fri, 15 Nov 2024 at 15:06, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> So I think the thing to do is
> 
>  (a) find out which __get_user() it is that matters so much for that load
> 
> Do you have a profile somewhere?
> 
>  (b) convert them to use "unsafe_get_user()", with that whole
> 
>                 if (can_do_masked_user_access())
>                         from = masked_user_access_begin(from);
>                 else if (!user_read_access_begin(from, sizeof(*from)))
>                         return -EFAULT;
> 
>      sequence before it.
> 
> And if it's just a single __get_user() (rather than a sequence of
> them), just convert it to get_user().
> 
> Hmm?

The profile is showing futex_get_value_locked():

int futex_get_value_locked(u32 *dest, u32 __user *from)
{
	int ret;

	pagefault_disable();
	ret = __get_user(*dest, from);
	pagefault_enable();

	return ret ? -EFAULT : 0;
}

That has several callers, so we can probably just use get_user() there?

Also, is there any harm in speeding up __get_user()?  It still has ~80
callers and it's likely to be slowing down things we don't know about.

It's usually only the regressions which get noticed, and that LFENCE
went in almost 7 years ago, when there was much less automated
performance regression testing.

As a bonus, that patch will root out any "bad" users, which will
eventually allow us to simplify things and just make __get_user() an
alias of get_user().

In fact, if we aliased it for all arches, that could help in getting rid
of __get_user() altogether as there would no longer be any (real or
advertised) benefit to using it.

-- 
Josh
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 3 days, 9 hours ago
On Thu, 21 Nov 2024 at 13:40, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> The profile is showing futex_get_value_locked():

Ahh.

> That has several callers, so we can probably just use get_user() there?

Yeah, that's the simplest thing. That thing isn't even some inline
function, so the real cost is the call.

That said, exactly because it's not inlined, and calls are expensive,
and this is apparently really critical, we can just do it with the
full "unsafe_get_user()" model.

It's not so complicated. The attached patch is untested, but I did
check that it generates almost perfect code:

    mov    %gs:0x0,%rax                 # current
    incl   0x1a9c(%rax)                 # current->pagefault_disable++
    movabs $0x123456789abcdef,%rcx      # magic virtual address size
    cmp    %rsi,%rcx                    # address masking
    sbb    %rcx,%rcx
    or     %rsi,%rcx
    stac                                # enable user space acccess
    mov    (%rcx),%ecx                  # get the value
    clac                                # disable user space access
    decl   0x1a9c(%rax)                 # current->pagefault_disable--
    mov    %ecx,(%rdi)                  # save the value
    xor    %eax,%eax                    # return 0
    ret

(with the error case for the page fault all out-of-line).

So this should be _faster_ than the old __get_user(), because while
the address masking is not needed, it's cheaper than the function call
used to be and the error handling is better.

If you can test this and verify that it actually help, I'll take it as
a patch. Consider it signed-off after testing.

> Also, is there any harm in speeding up __get_user()?  It still has ~80
> callers and it's likely to be slowing down things we don't know about.

How would you speed it up?  We definitely can't replace the fence with
addressing tricks. So we can't just replace it with "get_user()",
because of those horrid architecture-specific kernel uses.

Now, we could possibly say "just remove the fence in __get_user()
entirely", but that would involve moving it to access_ok().

And then it wouldn't actually speed anything up (except the horrid
architecture-specific kernel uses that then don't call access_ok() at
all - and we don't care about *those*).

               Linus
RE: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by David Laight 2 days, 22 hours ago
From: Linus Torvalds
> Sent: 21 November 2024 22:16
> 
> On Thu, 21 Nov 2024 at 13:40, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > The profile is showing futex_get_value_locked():
> 
> Ahh.
> 
> > That has several callers, so we can probably just use get_user() there?
> 
> Yeah, that's the simplest thing. That thing isn't even some inline
> function, so the real cost is the call.
> 
> That said, exactly because it's not inlined, and calls are expensive,
> and this is apparently really critical, we can just do it with the
> full "unsafe_get_user()" model.
> 
> It's not so complicated. The attached patch is untested, but I did
> check that it generates almost perfect code:
> 
>     mov    %gs:0x0,%rax                 # current
>     incl   0x1a9c(%rax)                 # current->pagefault_disable++
>     movabs $0x123456789abcdef,%rcx      # magic virtual address size
>     cmp    %rsi,%rcx                    # address masking
>     sbb    %rcx,%rcx
>     or     %rsi,%rcx
>     stac                                # enable user space acccess
>     mov    (%rcx),%ecx                  # get the value
>     clac                                # disable user space access
>     decl   0x1a9c(%rax)                 # current->pagefault_disable--
>     mov    %ecx,(%rdi)                  # save the value
>     xor    %eax,%eax                    # return 0
>     ret
> 
> (with the error case for the page fault all out-of-line).

I presume you are assuming an earlier access_ok() call? 

IIRC all x86 from 286 onwards fault accesses that cross the ~0 to 0 boundary.
So you don't need to have called access_ok() prior to the above
for non-byte accesses.
Even for byte accesses (and with ~0 being a valid address) do the
address mask before pagefault_disable++ and the error test is 'jc label'
after the 'cmp'.

Don't you also get better code for an 'asm output with goto' form?
While it requires the caller have a 'label: return -EFAULT;' somewhere
that is quite common in kernel code.

> So this should be _faster_ than the old __get_user(), because while
> the address masking is not needed, it's cheaper than the function call
> used to be and the error handling is better.

Perhaps get_user() should be the function call and __get_user() inlined.
Both including address validation but different calling conventions?
(After fixing the code that doesn't want address masking.)

...
> Now, we could possibly say "just remove the fence in __get_user()
> entirely", but that would involve moving it to access_ok().

How valid would it be to assume an explicit access_ok() was far
enough away from the get_user() that you couldn't speculate all the
way to something that used the read value to perform another
kernel access?

> And then it wouldn't actually speed anything up (except the horrid
> architecture-specific kernel uses that then don't call access_ok() at
> all - and we don't care about *those*).

Find and kill :-)

	David

> 
>                Linus

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Josh Poimboeuf 3 days, 7 hours ago
On Thu, Nov 21, 2024 at 02:16:12PM -0800, Linus Torvalds wrote:
>     mov    %gs:0x0,%rax                 # current
>     incl   0x1a9c(%rax)                 # current->pagefault_disable++
>     movabs $0x123456789abcdef,%rcx      # magic virtual address size
>     cmp    %rsi,%rcx                    # address masking
>     sbb    %rcx,%rcx
>     or     %rsi,%rcx
>     stac                                # enable user space acccess
>     mov    (%rcx),%ecx                  # get the value
>     clac                                # disable user space access
>     decl   0x1a9c(%rax)                 # current->pagefault_disable--
>     mov    %ecx,(%rdi)                  # save the value
>     xor    %eax,%eax                    # return 0
>     ret

The asm looks good, but the C exploded a bit... why not just have an
inline get_user()?

> If you can test this and verify that it actually help, I'll take it as
> a patch. Consider it signed-off after testing.

Let me see if I can recreate the original report (or get the automatic
testing to see the commit).

> > Also, is there any harm in speeding up __get_user()?  It still has ~80
> > callers and it's likely to be slowing down things we don't know about.
> 
> How would you speed it up?  We definitely can't replace the fence with
> addressing tricks. So we can't just replace it with "get_user()",
> because of those horrid architecture-specific kernel uses.

I'm not sure if you saw the example code snippet I posted up-thread,
here it is below.

It adds a conditional branch to test if it's a user address.  If so, it
does pointer masking.  If not, it does LFENCE.  So "bad" users get the
slow path, and we could even add a WARN_ON().

Yes, another conditional branch isn't ideal, but it's still much faster
than the current code, and it would root out any bad users with a
WARN_ON() so that eventually it can just become a get_user() alias.

.macro __get_user_nocheck_nospec
#ifdef CONFIG_X86_64
	movq $0x0123456789abcdef, %rdx
 1:
.pushsection runtime_ptr_USER_PTR_MAX, "a"
	.long 1b - 8 - .
.popsection
	cmp %rax, %rdx
	jb 10f
	sbb %rdx, %rdx
	or %rdx, %rax
	jmp 11f
10:	/*
	 * Stop access_ok() branch misprediction -- both of them ;-)
	 *
	 * As a benefit this also punishes callers who intentionally call this
	 * with a kernel address.  Once they're rooted out, __get_user() can
	 * just become an alias of get_user().
	 *
	 * TODO: Add WARN_ON()
	 */
#endif
	ASM_BARRIER_NOSPEC
11:
.endm

/* .. and the same for __get_user, just without the range checks */
SYM_FUNC_START(__get_user_nocheck_1)
	__get_user_nocheck_nospec
	ASM_STAC
	UACCESS movzbl (%_ASM_AX),%edx
	xor %eax,%eax
	ASM_CLAC
	RET
SYM_FUNC_END(__get_user_nocheck_1)
EXPORT_SYMBOL(__get_user_nocheck_1)

-- 
Josh
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 3 days, 6 hours ago
On Thu, 21 Nov 2024 at 16:12, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> The asm looks good, but the C exploded a bit... why not just have an
> inline get_user()?

That was originally one of my goals for the "unsafe" ones - if done
right, they'd be the proper building blocks for a get_user(), and we'd
only really need one single good implementation.

But it really does blow up the code size quite noticeably. The old
sign-based thing wasn't quite so bad, and was one of the main reasons
I really didn't like having to switch to the big constant thing, but
with the constant, the "get_user()" part is basically 27 bytes per
location.

The uninlined call is 5 bytes.

(And both then have the error handling - the inlined one can use the
asm goto to *maybe* make up for some of it because it avoids tests,
but I suspect it ends up being pretty similar in the end).

So I'm not really sure inlining makes sense - except if you have code
that you've profiled.

Part of the problem is that you can't really just make an inline
function. You need to make one for each size. Which we've done, but it
gets real messy real quick. I don't want to have yet another "switch
(sizeof..)" thing.

Or you'd make it a macro (which makes dealing with different types
easy), but then it would have to be a statement expression to return
the error, and to take advantage of that exception handling error
handling gets messed up real quick too.

End result: the

        if (can_do_masked_user_access())
                from = masked_user_access_begin(from);
        else if (!user_read_access_begin(from, sizeof(*from)))
                goto enable_and_error;

        unsafe_get_user(val, from, Efault);
        user_access_end();

pattern is very good for generating fine code, but it's rather nasty
to encapsulate as one single macro somewhere. It really ends up having
those two error cases: the one that just returns the error, and the
one that has to do user_access_end().

[ Time passes ]

Ugh. I tried it. It looks like this:

#define inlined_get_user(res, ptr) ({                           \
        __label__ fail2, fail1;                                 \
        __auto_type __up = (ptr);                               \
        int __ret = 0;                                          \
        if (can_do_masked_user_access())                        \
                __up = masked_user_access_begin(__up);          \
        else if (!user_read_access_begin(__up, sizeof(*__up)))  \
                goto fail1;                                     \
        unsafe_get_user(res, ptr, fail2);                       \
        user_access_end();                                      \
        if (0) {                                                \
fail2:  user_access_end();                                      \
fail1:  __ret = -EFAULT;                                        \
        }                                                       \
        __ret; })

and maybe that works. It generated ok code in this case, where I did

  int futex_get_value_locked(u32 *dest, u32 __user *from)
  {
        int ret;

        pagefault_disable();
        ret = inlined_get_user(*dest, from);
        pagefault_enable();
        return ret;
  }

but I'm not convinced it's better than open-coding it. We have some
ugly macros in the kernel, but that may be one of the ugliest I've
ever written.

> > How would you speed it up?
>
> I'm not sure if you saw the example code snippet I posted up-thread,
> here it is below.

Oh, ok, no I hadn't seen that one.

Yeah, it looks like that would work. What a horrendous crock. But your
point that it would find the nasty __get_user() cases is nicely made.

             Linus
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Josh Poimboeuf 3 days, 4 hours ago
On Thu, Nov 21, 2024 at 05:02:06PM -0800, Linus Torvalds wrote:
> [ Time passes ]
> 
> Ugh. I tried it. It looks like this:
> 
> #define inlined_get_user(res, ptr) ({                           \
>         __label__ fail2, fail1;                                 \
>         __auto_type __up = (ptr);                               \
>         int __ret = 0;                                          \
>         if (can_do_masked_user_access())                        \
>                 __up = masked_user_access_begin(__up);          \
>         else if (!user_read_access_begin(__up, sizeof(*__up)))  \
>                 goto fail1;                                     \
>         unsafe_get_user(res, ptr, fail2);                       \
>         user_access_end();                                      \
>         if (0) {                                                \
> fail2:  user_access_end();                                      \
> fail1:  __ret = -EFAULT;                                        \
>         }                                                       \
>         __ret; })

That actually doesn't seem so bad, it's easy enough to follow the logic.
And it contains the ugly/fidgety all in one place so the callers' hands
don't have to get dirty.

We could easily use that macro in size-specific inline functions
selected by a macro with a sizeof(type) switch statement -- not so bad
IMO if they improve code usage and generation.

So all the user has to do is get_user_new_and_improved() -- resolving to
get_user_new_and_improved_x() -- and the compiler decides on the
inlining.  Which on average is hopefully better than Joe Developer's
inlining decisions?  Otherwise we've got bigger problems?

Then all the arches have to do is implement unsafe_*_user_{1,2,4,8} and
the "one good implementation" idea comes together?

BTW, looking at some other arches, I notice that get_user() is already
unconditionally inline for arm64, riscv, powerpc, and s390.

I also see that arm64 already defines get_user() to __get_user(), with
__get_user() having an access_ok().

It would be really nice to have the same behavior and shared code across
all the arches.

-- 
Josh
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 3 days, 3 hours ago
On Thu, 21 Nov 2024 at 19:11, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Thu, Nov 21, 2024 at 05:02:06PM -0800, Linus Torvalds wrote:
> > [ Time passes ]
> >
> > Ugh. I tried it. It looks like this:
> >
> > #define inlined_get_user(res, ptr) ({                           \
> >         __label__ fail2, fail1;                                 \
> >         __auto_type __up = (ptr);                               \
> >         int __ret = 0;                                          \
> >         if (can_do_masked_user_access())                        \
> >                 __up = masked_user_access_begin(__up);          \
> >         else if (!user_read_access_begin(__up, sizeof(*__up)))  \
> >                 goto fail1;                                     \
> >         unsafe_get_user(res, ptr, fail2);                       \

That 'ptr' needs to be '__up' of course.

Other than that it actually seems to work.

And by "seems to work" I mean "I replaced get_user() with this macro
instead, and my default kernel continued to build fine".

I didn't actually *test* the end result, although i did look at a few
more cases of code generation, and visually it looks sane enough.

> That actually doesn't seem so bad, it's easy enough to follow the logic.

I'm not loving the "if (0)" with the labels inside of it. But it's the
only way I can see to make a statement expression like this work,
since you can't have a "return"  inside of it.

> We could easily use that macro in size-specific inline functions
> selected by a macro with a sizeof(type) switch statement -- not so bad
> IMO if they improve code usage and generation.

The point of it being a macro is that then it doesn't need the
size-specific games at all, because it "just works" since the types
end up percolating inside the logic.

Of course, that depends on unsafe_get_user() itself having the
size-specific games in it, so that's a bit of a lie, but at least it
needs the games in just one place.

And yes, having inline wrappers anyway could then allow for the
compiler making the "inline or not" choice, but the compiler really
doesn't end up having any idea of whether something is more important
from a performance angle, so that wouldn't actually be a real
improvement.

> Then all the arches have to do is implement unsafe_*_user_{1,2,4,8} and
> the "one good implementation" idea comes together?

Yeah, except honestly, basically *none* of them do.

The list or architectures that actually implement the unsafe accessors
is sad: it's x86, powerpc, and arm64. That's it.

Which is - along with my bloat worry - what makes me go "it's not
worth fighting".

Also, honestly, it is *very* rare that "get_user()" and "put_user()"
is performance-critical or even noticeable. We have lots of important
user copies, and the path and argument copy (aka
"strncpy_from_user()") is very important, but very few other places
tend to be.

So I see "copy_from_user()" in profiles, and I see
"strncpy_from_user()", and I've seen a few special cases that I've
converted (look at get_sigset_argpack(), for example - it shows up on
some select-heavy loads).

And now you found another one in that futex code, and that is indeed special.

But honestly, most get_user() cases are really in things like random ioctls etc.

Which is why I suspect we'd be better off just doing the important
ones one by one.

And doing the important ones individually may sound really nasty, but
if they are important, it does kind of make sense.

For example, one place I suspect *is* worth looking at is the execve()
argument handling. But to optimize that isn't about inlining
get_user(). It's about doing more than one word for each user access,
particularly with CLAC/STAC being as slow as they often are.

So what you actually would want to do is to combine these two

                ret = -EFAULT;
                str = get_user_arg_ptr(argv, argc);
                if (IS_ERR(str))
                        goto out;

                len = strnlen_user(str, MAX_ARG_STRLEN);
                if (!len)
                        goto out;

into one thing, if you really cared enough. I've looked at it, and
always gone "I don't _quite_ care that much", even though argument
handling definitely shows up on some benchmarks (partly because it
mostly shows up on fairly artificial ones - the rest of execve is so
expensive that even when we waste some time on argument handling, it's
not noticeable enough to be worth spending huge amount of effort on).

But you could also look at the pure argv counting code (aka "count()"
in fs/exec.c). That *also* shows up quite a bit, and there batching
things migth be a much easier thing. But again, it's not about
inlining get_user(), it's about doing multiple accesses without
turning user accesses on and off all the time.

Anyway, that was a long way of saying: I really think we should just
special-case the (few) important cases that get reported. Because any
*big* improvements will come not from just inlining.

              Linus
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 2 days, 12 hours ago
On Thu, 21 Nov 2024 at 19:57, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, that was a long way of saying: I really think we should just
> special-case the (few) important cases that get reported. Because any
> *big* improvements will come not from just inlining.

Looking around at the futex code some more, I note:

 - the cmpxchg case and futex ops use an explicit barrier too, which is bad

 - we'd actually be better off inlining not just the user access, but
the whole futex_get_value_locked(), because then the compiler will be
able to do CSE on the user address masking, and only do it once
(several places do multiple different futex_get_value_locked() calls).

iow, I think the fix for the futex case ends up being a patch
something like the attached.

[ Annoyingly, we need "can_do_masked_user_access()" even on x86,
because the 32-bit case doesn't do the address masking trick ]

I've only compiled it so far, about to actually boot into it. Pray for me,

               Linus
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 2 days, 12 hours ago
On Fri, 22 Nov 2024 at 11:13, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've only compiled it so far, about to actually boot into it.

Looks fine. Sent out a proper patch with commit message etc at

   https://lore.kernel.org/all/20241122193305.7316-1-torvalds@linux-foundation.org/

because it looks good to me. Comments?

                 Linus
RE: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by David Laight 15 hours ago
From: Linus Torvalds
> Sent: 22 November 2024 19:35
> 
> On Fri, 22 Nov 2024 at 11:13, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I've only compiled it so far, about to actually boot into it.
> 
> Looks fine. Sent out a proper patch with commit message etc at
> 
>    https://lore.kernel.org/all/20241122193305.7316-1-torvalds@linux-foundation.org/
> 
> because it looks good to me. Comments?

+static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
+{
+	u32 val;
+
+	if (can_do_masked_user_access())
+		from = masked_user_access_begin(from);
+	else if (!user_read_access_begin(from, sizeof(*from)))
+		return -EFAULT;
+	unsafe_get_user(val, from, Efault);
+	user_access_end();
+	*dest = val;
+	return 0;
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
+{
+	int ret;
+
+	pagefault_disable();
+	ret = futex_read_inatomic(dest, from);
+	pagefault_enable();
+
+	return ret;
+}

Is there an 'unsafe_get_user_nofault()' that uses a trap handler
that won't fault in a page?
That would save the inc/dec done by pagefault_en/disable().

I'd also have thought that the trap handler for unsafe_get_user()
would jump to the Efault label having already done user_access_end().
But maybe it doesn't work out that way?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 13 hours ago
On Sun, 24 Nov 2024 at 08:11, David Laight <David.Laight@aculab.com> wrote:
>
> Is there an 'unsafe_get_user_nofault()' that uses a trap handler
> that won't fault in a page?

Nope. I was thinking about the same thing, but we actually don't look
up the fault handler early - we only do it at failure time.

So the pagefault_disable() thus acts as the failure trigger that makes
us look up the fault handler. Without that, we'd never even check if
there's a exception note on the instruction.

> I'd also have thought that the trap handler for unsafe_get_user()
> would jump to the Efault label having already done user_access_end().
> But maybe it doesn't work out that way?

I actually at one point had a local version that did exactly that,
because it allowed us to avoid doing the user_access_end in the
exception path.

It got ugly. In particular, it gets really ugly for the
"copy_to/from_user()" case where we want to be byte-accurate, and a
64-bit access fails, and we go back to doing the last few accesses one
byte at a time.

See the exception table in arch/x86/lib/copy_user_64.S where it jumps
to .Lcopy_user_tail for an example of this.

Yes, yes, you could just do a "stac" again in the exception path to
undo the fact that the fault handler would have turned off user
accesses again...

But look at that copy_user_64 code again and you'll see that it's
actually a generic replacement for "rep movs" with fault handling, and
can be used for the "copy_from_kernel_nofault" cases too.

So I decided that it was just too ugly for words to have the fault
handler basically change the state of the faultee that way.

            Linus
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Christophe Leroy 2 days, 22 hours ago

Le 22/11/2024 à 04:57, Linus Torvalds a écrit :
> 
> I'm not loving the "if (0)" with the labels inside of it. But it's the
> only way I can see to make a statement expression like this work,
> since you can't have a "return"  inside of it.
> 

On powerpc for this kind of stuff I have been using do { } while (0); 
with a break; in the middle, see for instance __put_user() or 
__get_user_size_allowed() in arch/powerpc/include/asm/uaccess.h

Christophe
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 2 days, 13 hours ago
On Fri, 22 Nov 2024 at 01:20, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> On powerpc for this kind of stuff I have been using do { } while (0);
> with a break; in the middle, see for instance __put_user() or
> __get_user_size_allowed() in arch/powerpc/include/asm/uaccess.h

Oh, that's indeed a nicer syntax. I'll try to keep that in mind for
next time I come up with disgusting macros (although as mentioned, in
this case I think we're better off not playing that particular game at
all).

                Linus
RE: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by David Laight 1 week, 1 day ago
From: Linus Torvalds
> Sent: 16 November 2024 01:27
> 
> On Fri, 15 Nov 2024 at 15:06, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > It's sad that __get_user() is now slower than get_user() on x86, it kind
> > of defeats the whole point!
> 
> Well, honestly, we've been trying to get away from __get_user() and
> __put_user() for a long long time.
> 
> With CLAC/STAC, it's been probably a decade or two since __get_user()
> and friends were actually a worthwhile optimization, so let's just
> strive to get rid of the ones that matter.

Thinks....

If __get_user() is the same as get_user() then all the access_ok()
outside of get/put_user() and copy_to/from_user() can be removed
because they are pointless (anyone that brave?).
There is no point optimising the code to fast-path bad user pointers.

> We already have this with user_access_begin() + unsafe_get_user().
> There's also a version which masks the address: masked_user_access_begin().

That sounds as though it is begging for a simple to use:
	masked_addr = user_access_begin(addr, size, error_label);
and
	val = unsafe_get_user(masked_addr, error_label);
form?
Probably with objtool checking they are all in a valid sequence
with no functions calls (etc).

If address masking isn't needed (by an architecture) the address can be left
unchanged.

A quick grep shows access_ok() in 66 .c and 8 .h files outside the arch code.
And 69 .c file in arch, most of the arch .h are uaccess.h and futex.h.
I suspect the audit wouldn't tale that long.
Getting any patches accepted is another matter.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH v3 2/6] x86/uaccess: Avoid barrier_nospec() in 64-bit __get_user()
Posted by Linus Torvalds 1 week, 1 day ago
On Sat, 16 Nov 2024 at 13:38, David Laight <David.Laight@aculab.com> wrote:
>
> If __get_user() is the same as get_user() [..]

No, the problem is that it's the same from a performance angle (and
now it's actually slower), but some hacky code paths depend on
__get_user() not checking the address.

They then use that to read from either user space _or_ kernel space.

Wrong? Yes. Architecture-specific? Yes. But it sadly happens.

             Linus