[PATCH v2] x86: Allow user accesses to the base of the guard page

Posted by David Laight 1 year, 2 months ago
__access_ok() calls valid_user_address() with the address after
the last byte of the user buffer.
It is valid for a buffer to end with the last valid user address
so valid_user_address() must allow accesses to the base of the
guard page.

Fixes: 86e6b1547b3d0 ("x86: fix user address masking non-canonical speculation issue")
Signed-off-by: David Laight <david.laight@aculab.com>
---

v2: Rewritten commit message.

 arch/x86/kernel/cpu/common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 06a516f6795b..ca327cfa42ae 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2389,12 +2389,12 @@ void __init arch_cpu_finalize_init(void)
 	alternative_instructions();
 
 	if (IS_ENABLED(CONFIG_X86_64)) {
-		unsigned long USER_PTR_MAX = TASK_SIZE_MAX-1;
+		unsigned long USER_PTR_MAX = TASK_SIZE_MAX;
 
 		/*
 		 * Enable this when LAM is gated on LASS support
 		if (cpu_feature_enabled(X86_FEATURE_LAM))
-			USER_PTR_MAX = (1ul << 63) - PAGE_SIZE - 1;
+			USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
 		 */
 		runtime_const_init(ptr, USER_PTR_MAX);
 
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
RE: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by David Laight 1 year, 2 months ago
CC stable.

This needs picking up for 6.12

Head commit 573f45a9f9a47 applied by Linus with a modified commit message.

	David

> -----Original Message-----
> From: David Laight
> Sent: 24 November 2024 15:39
> To: 'Linus Torvalds' <torvalds@linux-foundation.org>; 'Andrew Cooper' <andrew.cooper3@citrix.com>;
> 'bp@alien8.de' <bp@alien8.de>; 'Josh Poimboeuf' <jpoimboe@kernel.org>
> Cc: 'x86@kernel.org' <x86@kernel.org>; 'linux-kernel@vger.kernel.org' <linux-kernel@vger.kernel.org>;
> 'Arnd Bergmann' <arnd@kernel.org>; 'Mikel Rychliski' <mikel@mikelr.com>; 'Thomas Gleixner'
> <tglx@linutronix.de>; 'Ingo Molnar' <mingo@redhat.com>; 'Borislav Petkov' <bp@alien8.de>; 'Dave
> Hansen' <dave.hansen@linux.intel.com>; 'H. Peter Anvin' <hpa@zytor.com>
> Subject: [PATCH v2] x86: Allow user accesses to the base of the guard page
> 
> __access_ok() calls valid_user_address() with the address after
> the last byte of the user buffer.
> It is valid for a buffer to end with the last valid user address
> so valid_user_address() must allow accesses to the base of the
> guard page.
> 
> Fixes: 86e6b1547b3d0 ("x86: fix user address masking non-canonical speculation issue")
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> 
> v2: Rewritten commit message.
> 
>  arch/x86/kernel/cpu/common.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 06a516f6795b..ca327cfa42ae 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2389,12 +2389,12 @@ void __init arch_cpu_finalize_init(void)
>  	alternative_instructions();
> 
>  	if (IS_ENABLED(CONFIG_X86_64)) {
> -		unsigned long USER_PTR_MAX = TASK_SIZE_MAX-1;
> +		unsigned long USER_PTR_MAX = TASK_SIZE_MAX;
> 
>  		/*
>  		 * Enable this when LAM is gated on LASS support
>  		if (cpu_feature_enabled(X86_FEATURE_LAM))
> -			USER_PTR_MAX = (1ul << 63) - PAGE_SIZE - 1;
> +			USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
>  		 */
>  		runtime_const_init(ptr, USER_PTR_MAX);
> 
> --
> 2.17.1

Re: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by Linus Torvalds 1 year, 2 months ago
On Sun, 24 Nov 2024 at 07:39, David Laight <David.Laight@aculab.com> wrote:
>
> v2: Rewritten commit message.

Grr. Now I remember why I did it this way - I started looking around
for the bigger context and history.

I wanted that "valid_user_address()" to really be "is this a valid
user address", because it's also used by the fault handling code (for
that reason).

And that means that I wanted valid_user_address() to be the actual
"this address is accessible".

But then it also gets used by that nasty

                unsigned long sum = size + (__force unsigned long)ptr;

                return valid_user_address(sum) && sum >= (__force unsigned long)ptr;

case in __access_ok(), and there "sum" is indeed that "possibly one
past the last valid user address".
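
The shape of that sum-based path is easy to see in a small stand-alone C model (the limit constant here is made up; the real one is a runtime constant):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the x86-64 check, not kernel code.
 * USER_MAX stands in for the last valid user address; the page
 * above it is the guard page. */
#define USER_MAX   0xffffUL
#define GUARD_BASE (USER_MAX + 1)

/* With the fix applied, valid_user_address() accepts GUARD_BASE,
 * because __access_ok() passes "ptr + size", i.e. one past the
 * last byte of the buffer. */
static bool valid_user_address(unsigned long addr)
{
	return addr <= GUARD_BASE;
}

static bool access_ok_model(unsigned long ptr, unsigned long size)
{
	unsigned long sum = size + ptr;

	/* "sum >= ptr" rejects sizes large enough to wrap around. */
	return valid_user_address(sum) && sum >= ptr;
}
```

A buffer ending exactly at the last valid user address passes, one byte more does not, and a wrapping size is caught by the second comparison.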

I really would want to just remove that size-based horror as per the
comment above it all:

 * In fact, we could probably remove the size check entirely, since
 * any kernel accesses will be in increasing address order starting
 * at 'ptr'.

and that would make this all go away, and that was why I was
(incorrectly) fixating on the zero-sized access at the end of the
address space, because I wasn't even thinking about this part of
__access_ok().

IOW, my *preferred* fix for this all would actually look like this:

  --- a/arch/x86/include/asm/uaccess_64.h
  +++ b/arch/x86/include/asm/uaccess_64.h
  @@ -86,24 +86,12 @@ static inline void __user *mask_user_address(const void __user *ptr)
    *
    * Note that we always have at least one guard page between the
    * max user address and the non-canonical gap, allowing us to
  - * ignore small sizes entirely.
  - *
  - * In fact, we could probably remove the size check entirely, since
  - * any kernel accesses will be in increasing address order starting
  - * at 'ptr'.
  - *
  - * That's a separate optimization, for now just handle the small
  - * constant case.
  + * ignore the size entirely, since any kernel accesses will be in
  + * increasing address order starting at 'ptr'.
    */
   static inline bool __access_ok(const void __user *ptr, unsigned long size)
   {
  -     if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
  -             return valid_user_address(ptr);
  -     } else {
  -             unsigned long sum = size + (__force unsigned long)ptr;
  -
  -             return valid_user_address(sum) && sum >= (__force unsigned long)ptr;
  -     }
  +     return valid_user_address(ptr);
   }
   #define __access_ok __access_ok

but I suspect that I'm too chicken to actually do that.

Please somebody convince me.

                 Linus "Bawk bawk bawk" Torvalds
RE: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by David Laight 1 year, 2 months ago
From: Linus Torvalds
> Sent: 24 November 2024 18:53

> On Sun, 24 Nov 2024 at 07:39, David Laight <David.Laight@aculab.com> wrote:
> >
> > v2: Rewritten commit message.
> 
> Grr. Now I remember why I did it this way - I started looking around
> for the bigger context and history.
> 
> I wanted that "valid_user_address()" to really be "is this a valid
> user address", because it's also used by the fault handling code (for
> that reason).

Doesn't that just need a <= changed to < ?
(And possibly a change of name)

...
> and that would make this all go away, and that was why I was
> (incorrectly) fixating on the zero-sized access at the end of the
> address space, because I wasn't even thinking about this part of
> __access_ok().

access_ok(NULL, 0) is probably the annoying case that stops it using
valid_user_address(ptr + size - 1).
And the 'lea' that will do 'x + y - 1' runs on fewer 'ports' than add.
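
That corner case can be demonstrated with a toy model (the limit is made up):

```c
#include <assert.h>
#include <stdbool.h>

#define USER_MAX 0xffffUL   /* illustrative last valid user address */

static bool valid_user_address(unsigned long addr)
{
	return addr <= USER_MAX;
}

/* Hypothetical variant that checks the last byte of the buffer
 * rather than one-past-the-end: correct for size >= 1, but
 * access_ok(NULL, 0) computes 0 + 0 - 1, which wraps to ~0UL
 * and is rejected. */
static bool access_ok_last_byte(unsigned long ptr, unsigned long size)
{
	return valid_user_address(ptr + size - 1);
}
```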

> IOW, my *preferred* fix for this all would actually look like this:
> 
>   --- a/arch/x86/include/asm/uaccess_64.h
>   +++ b/arch/x86/include/asm/uaccess_64.h
>   @@ -86,24 +86,12 @@ static inline void __user *mask_user_address(const void __user *ptr)
>     *
>     * Note that we always have at least one guard page between the
>     * max user address and the non-canonical gap, allowing us to
>   + * ignore the size entirely, since any kernel accesses will be in
>   + * increasing address order starting at 'ptr'.
>     */
>    static inline bool __access_ok(const void __user *ptr, unsigned long size)
>    {
>   +     return valid_user_address(ptr);
>    }
>    #define __access_ok __access_ok
> 
> but I suspect that I'm too chicken to actually do that.
> 
> Please somebody convince me.

I didn't know you really had a 'chicken streak' :-)

You'd need to double-check that nothing parses TLV data
(like CMSG or netlink buffers) directly from userspace after
doing an outer access_ok() and then using __get_user().
OTOH there are few enough calls to access_ok() they can all
be checked.

Another place might be a copy_to/from_user() implementation
that does a copy of the final 'word' first and then copies
a whole number of words from the start of the buffer.
x86 should just be using 'rep movsb' (except for some constant sizes)
because I doubt anything else is worth the overhead of the (mispredicted
half the time) branch.

As an aside the:
>     movabs $0x123456789abcdef,%rcx      # magic virtual address size
>     cmp    %rsi,%rcx                    # address masking
>     sbb    %rcx,%rcx
>     or     %rsi,%rcx
sequence could be
>     movabs $0x123456789abcdef,%rcx      # magic virtual address size
>     cmp    %rsi,%rcx                    # address masking
>     cmovc  %rsi,%rcx
Provided the constant is TASK_SIZE_MAX (without the -1).

Remember cmov is an arithmetic instruction much like adc except
that it contains a multiplexor/selector not an adder.
Interestingly, the Intel implementation dropped to 1 clock a CPU
family earlier than adc/sbb did (the microarchitecture didn't support
3 register inputs to one u-op).
(Ryzen might actually implement cmov as a conditional register rename.)

In either case it won't be subject to misprediction.
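
In C terms the two sequences behave roughly like this (the limit value is illustrative; whether a compiler actually emits sbb or cmov for these patterns is up to it):

```c
#include <assert.h>

#define LIMIT 0xff000UL   /* stand-in for the USER_PTR_MAX constant */

/* Model of the cmp/sbb/or sequence: an out-of-range pointer is
 * forced to all-ones, so even a speculative dereference cannot
 * reach a user-chosen kernel address. */
static unsigned long mask_sbb(unsigned long addr)
{
	unsigned long mask = (addr > LIMIT) ? ~0UL : 0UL;  /* the sbb */

	return addr | mask;                                /* the or  */
}

/* Model of the cmov alternative: clamp to the limit instead of
 * masking, i.e. min(addr, LIMIT); still branch-free when the
 * compiler picks cmov. */
static unsigned long clamp_cmov(unsigned long addr)
{
	return addr < LIMIT ? addr : LIMIT;
}
```

Either way the result is never a user-chosen kernel address, which is all the speculation fix needs.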

That actually removes the requirement to access the base address first.
Just need to avoid jumps of PAGE_SIZE.

	David

Re: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by Linus Torvalds 1 year, 2 months ago
On Sun, 24 Nov 2024 at 12:49, David Laight <David.Laight@aculab.com> wrote:
>
> Doesn't that just need a <= changed to < ?
> (And possibly of name)

Well, more importantly, it means that we can't use the same helper
function at all.

> > but I suspect that I'm too chicken to actually do that.
> >
> > Please somebody convince me.
>
> I didn't know you really had a 'chicken streak' :-)

Actually, looking at some of the users of access_ok(), my chicken
streak just got wider.

Most of them are trivial, but when masking out the pure 'sizeof(x)'
things (that the "statically smaller than one page" case will filter
out anyway), there are a few scary ones in the vhost code and the drm
code that make me go "those are checking ring buffer sizes and may
not access the end result in any order".

So trying to optimize access_ok() is probably not worth the heartburn.  Oh well.

               Linus
Re: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by Linus Torvalds 1 year, 2 months ago
On Sun, 24 Nov 2024 at 14:03, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 24 Nov 2024 at 12:49, David Laight <David.Laight@aculab.com> wrote:
> >
> > Doesn't that just need a <= changed to < ?
> > (And possibly of name)
>
> Well, more importantly, it means that we can't use the same helper
> function at all.

Oh, the 'sbb' trick also ends up being one byte off if we increase
USER_PTR_MAX from being the "maximum valid address" to being "one past
the max valid address".

And yeah, none of this *matters*, since "one byte off" is then covered
by the fact that we have that guard page, but this just ends up
annoying my sense of how it all *should* work.

I'm starting to think that instead of changing the USER_PTR_MAX value
(that was selected to be the right constant for things that don't care
about the size), we just say "the slow case of access_ok() takes a
tiny hit".

Kind of like Mikel Rychliski's patch, but we can certainly do better,
ie something like

  --- a/arch/x86/include/asm/uaccess_64.h
  +++ b/arch/x86/include/asm/uaccess_64.h
  @@ -101,8 +101,9 @@ static inline bool __access_ok(..
                return valid_user_address(ptr);
        } else {
                unsigned long sum = size + (__force unsigned long)ptr;
  +             unsigned long max = runtime_const_ptr(USER_PTR_MAX)+1;

  -             return valid_user_address(sum) && sum >= (__force unsigned long)ptr;
  +             return sum <= max && sum >= (__force unsigned long)ptr;
        }
   }
   #define __access_ok __access_ok

would seem like it should work, and now doesn't make the fast-cases be
off by one.
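
A stand-alone sketch of that revised slow path (constants made up), with the "+1" happening at run time while the constant itself stays the maximum valid address:

```c
#include <assert.h>
#include <stdbool.h>

#define USER_PTR_MAX 0xffffUL   /* illustrative "maximum valid address" */

static bool access_ok_slow(unsigned long ptr, unsigned long size)
{
	unsigned long sum = size + ptr;
	unsigned long max = USER_PTR_MAX + 1;   /* the runtime "add 1" */

	/* sum may equal max: a buffer is allowed to end exactly at
	 * the last valid address.  "sum >= ptr" catches wraparound. */
	return sum <= max && sum >= ptr;
}
```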

Yes, it adds a "add 1" to the runtime path (instead of the init-time
constant fixup), which is admittedly annoying. But this really
*should* be the slow path.

We do have a few annoying non-constant access_ok() calls in core code.
The iter code uses access_ok + raw_copy_to_user, because it's evil and
bad. I'm really not sure why it does that. I think it's *purely* bad
history, ie we used to do access_ok() far away from the iov_iter copy,
and then did __copy_from_user(), and then at some point it got changed
to have the access_ok() closer, rather than just use
"copy_from_user()".

So I get the feeling that those access_ok() cases could be removed
entirely in favor of just using the right user copy function. Getting
rid of the whole silly size checking thing.

David, does that patch above work for you?

           Linus
RE: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by David Laight 1 year, 2 months ago
From: Linus Torvalds
> Sent: 24 November 2024 22:39
> 
> On Sun, 24 Nov 2024 at 14:03, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, 24 Nov 2024 at 12:49, David Laight <David.Laight@aculab.com> wrote:
> > >
> > > Doesn't that just need a <= changed to < ?
> > > (And possibly of name)
> >
> > Well, more importantly, it means that we can't use the same helper
> > function at all.
> 
> Oh, the 'sbb' trick also ends up being one byte off if we increase
> USER_PTR_MAX from being the "maximum valid address" to being "one past
> the max valid address".
> 
> And yeah, none of this *matters*, since "one byte off" is then covered
> by the fact that we have that guard page, but this just ends up
> annoying my sense of how it all *should* work.

Maybe we need to look at what this code is trying to achieve.
It seems to be trying to do two different things:
1) Stop real user accesses to kernel memory.
2) Stop speculative accesses to user-defined kernel addresses. 

But they get rather mixed together in this pattern:
+	if (can_do_masked_user_access())
+		from = masked_user_access_begin(from);
+	else if (!user_read_access_begin(from, sizeof(*from)))
+		return -EFAULT;
+	unsafe_get_user(val, from, Efault);
+	user_access_end();
+	*dest = val;
+	return 0;
+Efault:
+	user_access_end();
+	return -EFAULT;

where the masked address designed to stop speculative accesses is
used for a real access.

I was looking at the epoll code.
It does an early access_ok() and then two __put_user() calls to write a 32bit
and 64bit value (with an architecture-dependent stride of 12 or 16 bytes).

With access_ok() (possibly just checking the base address if there is a
guard page) it doesn't matter which order the accesses are done in.
But masked_user_access_begin() is going to convert a kernel address to ~0,
this is fine for speculative accesses but for real accesses it becomes
imperative that the first address is to the base address.
(Otherwise the accesses will go to page 0 at the bottom of userspace
which (IIRC) it is possible to map - so won't always fault.)

Relying on that seems dangerous.

So perhaps rename things a bit so the above starts:
	if (have_guard_page())
		from = bound_to_guard_page_user_access_begin(from);
The default bound_to_guard_page() would be min(addr, guard_page).
Architectures that have 'speculative read issues' would need
that min() to be done without a branch (eg with cmov) but
others may not care.

ppc, of course, needs the length (I don't know whether a guard page
would help or is needed).

After that the order of the accesses doesn't matter - provided they
don't have page sized jumps.

Then rename USER_PTR_MAX to USER_GUARD_PAGE_ADDRESS.
After all it doesn't matter at all where the unwanted speculative
reads end up - provided it isn't a user-defined kernel address.
The base of the guard page is as good as anywhere else.

> I'm starting to think that instead of changing the USER_PTR_MAX value
> (that was selected to be the right constant for things that don't care
> about the size), we just say "the slow case of access_ok() takes a
> tiny hit".
> 
> Kind of like Mikel Rychliski's patch, but we can certainly do better,
> ie something like
> 
>   --- a/arch/x86/include/asm/uaccess_64.h
>   +++ b/arch/x86/include/asm/uaccess_64.h
>   @@ -101,8 +101,9 @@ static inline bool __access_ok(..
>                 return valid_user_address(ptr);
>         } else {
>                 unsigned long sum = size + (__force unsigned long)ptr;
>   +             unsigned long max = runtime_const_ptr(USER_PTR_MAX)+1;
> 
>   -             return valid_user_address(sum) && sum >= (__force unsigned long)ptr;
>   +             return sum <= max && sum >= (__force unsigned long)ptr;
>         }
>    }
>    #define __access_ok __access_ok
> 
> would seem like it should work, and now doesn't make the fast-cases be
> off by one.

I don't think it really matters whether access_ok() returns 0 or
a later access faults.
Particularly in the corner case of an access to the base of the guard page.

> 
> Yes, it adds a "add 1" to the runtime path (instead of the init-time
> constant fixup), which is admittedly annoying. But this really
> *should* be the slow path.

There is no real reason why you can't have two constants that are
one apart.

> We do have a few annoying non-constant access_ok() calls in core code.
> The iter code uses access_ok + raw_copy_to_user, because it's evil and
> bad. I'm really not sure why it does that. I think it's *purely* bad
> history, ie we used to do access_ok() far away from the iov_iter copy,
> and then did __copy_from_user(), and then at some point it got changed
> to have the access_ok() closer, rather than just use
> "copy_from_user()".

Is there still an access_ok() in iovec_import? (I've not got a tree handy.)
It all seemed too far away from any actual copy to me.
(IIRC there is a pretty pointless 'direction' flag as well.)

> So I get the feeling that those access_ok() cases could be removed
> entirely in favor of just using the right user copy function. Getting
> rid of the whole silly size checking thing.

That would be a plan.
Would be interesting to unravel what io_uring (or was it bpf) was
doing where it casts a 'long' to a user pointer just to validate it.

> David, does that patch above work for you?

You are the boss :-)

	David

Re: [PATCH v2] x86: Allow user accesses to the base of the guard page
Posted by Linus Torvalds 1 year, 2 months ago
On Mon, 25 Nov 2024 at 08:49, David Laight <David.Laight@aculab.com> wrote:
>
> > David, does that patch above work for you?
>
> You are the boss :-)

Bah. I decided that I'll just apply your patch anyway, simply because
the one-off the other way won't matter, and this way you get proper
credit and it's all minimal and simple to back-port.

Maybe I'll waffle some more in the future and decide to do things some
other way, but at least for now that simple patch is what is in my
tree.

Thanks,
               Linus