[PATCH 5/5] powerpc: Implement masked user access
Posted by Christophe Leroy 3 months, 2 weeks ago
Masked user access avoids the address/size verification by access_ok().
Although its main purpose is to avoid speculation in the verification
of the user address and size, and hence avoid the need for speculation
mitigation, it also has the advantage of reducing the number of
instructions needed, so it also benefits platforms that don't need
speculation mitigation, especially when the size of the copy is not
known at build time.
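
For reference, callers are expected to use the same pattern as other
architectures that already provide masked_user_access_begin(). A sketch
only, with a made-up get_one_word() helper; can_do_masked_user_access(),
user_read_access_begin(), unsafe_get_user() and user_read_access_end()
are the existing generic uaccess helpers:

  long get_one_word(unsigned long __user *from, unsigned long *val)
  {
  	if (can_do_masked_user_access())
  		from = masked_user_access_begin(from);
  	else if (!user_read_access_begin(from, sizeof(*from)))
  		return -EFAULT;
  	unsafe_get_user(*val, from, Efault);
  	user_read_access_end();
  	return 0;
  Efault:
  	user_read_access_end();
  	return -EFAULT;
  }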

So implement masked user access on powerpc. The only requirement is
to have a memory gap that faults between the top of user space and the
real start of the kernel area. On 64-bit platforms it is easy: bit 0
(the most significant bit) is always 0 for user addresses and always 1
for kernel addresses, and user addresses stop long before the end of
their area. On 32 bits it is more tricky. In theory user space can go
up to 0xbfffffff while the kernel will usually start at 0xc0000000, so
a gap needs to be added in between. Although in theory a single 4k page
would suffice, it is easier and more efficient to enforce a 128k gap
below the kernel, as it simplifies the masking.

Unlike x86_64, which masks the address to 'all bits set' when the
user address is invalid, here the address is set to an address in
the gap. This avoids relying on the zero page to catch offset
accesses.

e500 has the isel instruction, which allows selecting one value or
the other without a branch, and that instruction is not speculative,
so use it. Although GCC usually generates code using that instruction,
it is safer to use inline assembly to be sure. The result is:

  14:	3d 20 bf fe 	lis     r9,-16386
  18:	7c 03 48 40 	cmplw   r3,r9
  1c:	7c 69 18 5e 	iselgt  r3,r9,r3

On other platforms, when kernel space is above 0x80000000 and user
space is below it, the logic in mask_user_address_simple() leads to a
3-instruction sequence:

  14:	7c 69 fe 70 	srawi   r9,r3,31
  18:	7c 63 48 78 	andc    r3,r3,r9
  1c:	51 23 00 00 	rlwimi  r3,r9,0,0,0

This is the default on powerpc 8xx.
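
The arithmetic can be checked stand-alone with plain C (an illustration
only, assuming the 8xx layout where TASK_SIZE is 0x80000000 and the
kernel starts at 0xc0000000, so 0x80000000 falls in the faulting gap):

  #include <stdio.h>
  #include <stdint.h>

  /* same arithmetic as the 32-bit mask_user_address_simple() */
  static uint32_t mask_simple(uint32_t addr)
  {
  	uint32_t mask = (uint32_t)((int32_t)addr >> 31);

  	return ((addr & ~mask) & 0x7fffffffu) | (mask & 0x80000000u);
  }

  int main(void)
  {
  	/* user address: mask is 0, the address goes through unchanged */
  	printf("%#x -> %#x\n", 0x10002000u, mask_simple(0x10002000u));
  	/* kernel address: mask is ~0, the result is 0x80000000, in the gap */
  	printf("%#x -> %#x\n", 0xc0001234u, mask_simple(0xc0001234u));
  	return 0;
  }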

When the limit between user space and kernel space is not 0x80000000,
mask_user_address_32() is used and a 6-instruction sequence is
generated:

  24:	54 69 7c 7e 	srwi    r9,r3,17
  28:	21 29 57 ff 	subfic  r9,r9,22527
  2c:	7d 29 fe 70 	srawi   r9,r9,31
  30:	75 2a b0 00 	andis.  r10,r9,45056
  34:	7c 63 48 78 	andc    r3,r3,r9
  38:	7c 63 53 78 	or      r3,r3,r10

The constraint is that TASK_SIZE must be aligned to 128K in order to
get the minimal number of instructions.
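
The same check can be done for mask_user_address_32() (an illustration
only; TASK_SIZE is taken as 0xb0000000 here because it matches the
constants 22527 and 45056 in the listing above):

  #include <stdio.h>
  #include <stdint.h>

  #define TASK_SIZE 0xb0000000u	/* (0xb0000000 >> 17) - 1 == 22527 */

  /* same arithmetic as mask_user_address_32() */
  static uint32_t mask_32(uint32_t addr)
  {
  	uint32_t mask = (uint32_t)((int32_t)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31);

  	return (addr & ~mask) | (TASK_SIZE & mask);
  }

  int main(void)
  {
  	/* below TASK_SIZE: the subtraction stays >= 0, mask is 0, unchanged */
  	printf("%#x -> %#x\n", 0xaffe0000u, mask_32(0xaffe0000u));
  	/* at or above TASK_SIZE: the subtraction goes negative, mask is ~0,
  	 * the address becomes TASK_SIZE, the bottom of the 128K gap */
  	printf("%#x -> %#x\n", 0xc0001234u, mask_32(0xc0001234u));
  	return 0;
  }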

When CONFIG_PPC_BARRIER_NOSPEC is not defined, fall back on the
test-based masking, as it is quicker than the 6-instruction sequence
but not necessarily quicker than the 3-instruction sequences above.

On 64 bits, the kernel is always above 0x8000000000000000 and user
space always below, which leads to a 4-instruction sequence:

  80:	7c 69 1b 78 	mr      r9,r3
  84:	7c 63 fe 76 	sradi   r3,r3,63
  88:	7d 29 18 78 	andc    r9,r9,r3
  8c:	79 23 00 4c 	rldimi  r3,r9,0,1

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/Kconfig               |   2 +-
 arch/powerpc/include/asm/uaccess.h | 100 +++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c3e0cc83f120..c26a39b4504a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1303,7 +1303,7 @@ config TASK_SIZE
 	hex "Size of user task space" if TASK_SIZE_BOOL
 	default "0x80000000" if PPC_8xx
 	default "0xb0000000" if PPC_BOOK3S_32 && EXECMEM
-	default "0xc0000000"
+	default "0xbffe0000"
 
 config MODULES_SIZE_BOOL
 	bool "Set custom size for modules/execmem area"
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 89d53d4c2236..19743ee80523 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -2,6 +2,8 @@
 #ifndef _ARCH_POWERPC_UACCESS_H
 #define _ARCH_POWERPC_UACCESS_H
 
+#include <linux/sizes.h>
+
 #include <asm/processor.h>
 #include <asm/page.h>
 #include <asm/extable.h>
@@ -455,6 +457,104 @@ user_write_access_begin(const void __user *ptr, size_t len)
 #define user_write_access_begin	user_write_access_begin
 #define user_write_access_end		prevent_current_write_to_user
 
+/*
+ * Masking the user address is an alternative to a conditional
+ * user_access_begin that can avoid the fencing. This only works
+ * for dense accesses starting at the address.
+ */
+static inline void __user *mask_user_address_simple(const void __user *ptr)
+{
+	unsigned long addr = (unsigned long)ptr;
+	unsigned long mask = (unsigned long)((long)addr >> (BITS_PER_LONG - 1));
+
+	addr = ((addr & ~mask) & (~0UL >> 1)) | (mask & (1UL << (BITS_PER_LONG - 1)));
+
+	return (void __user *)addr;
+}
+
+static inline void __user *mask_user_address_e500(const void __user *ptr)
+{
+	unsigned long addr;
+
+	asm("cmplw %1, %2; iselgt %0, %2, %1" : "=r"(addr) : "r"(ptr), "r"(TASK_SIZE): "cr0");
+
+	return (void __user *)addr;
+}
+
+/* Make sure TASK_SIZE is a multiple of 128K for shifting by 17 to the right */
+static inline void __user *mask_user_address_32(const void __user *ptr)
+{
+	unsigned long addr = (unsigned long)ptr;
+	unsigned long mask = (unsigned long)((long)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31);
+
+	addr = (addr & ~mask) | (TASK_SIZE & mask);
+
+	return (void __user *)addr;
+}
+
+static inline void __user *mask_user_address_fallback(const void __user *ptr)
+{
+	unsigned long addr = (unsigned long)ptr;
+
+	return (void __user *)(addr < TASK_SIZE ? addr : TASK_SIZE);
+}
+
+static inline void __user *mask_user_address(const void __user *ptr)
+{
+#ifdef MODULES_VADDR
+	const unsigned long border = MODULES_VADDR;
+#else
+	const unsigned long border = PAGE_OFFSET;
+#endif
+	BUILD_BUG_ON(TASK_SIZE_MAX & (SZ_128K - 1));
+	BUILD_BUG_ON(TASK_SIZE_MAX + SZ_128K > border);
+	BUILD_BUG_ON(TASK_SIZE_MAX & 0x8000000000000000ULL);
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC64) && !(PAGE_OFFSET & 0x8000000000000000ULL));
+
+	if (IS_ENABLED(CONFIG_PPC64))
+		return mask_user_address_simple(ptr);
+	if (IS_ENABLED(CONFIG_E500))
+		return mask_user_address_e500(ptr);
+	if (TASK_SIZE <= SZ_2G && border >= SZ_2G)
+		return mask_user_address_simple(ptr);
+	if (IS_ENABLED(CONFIG_PPC_BARRIER_NOSPEC))
+		return mask_user_address_32(ptr);
+	return mask_user_address_fallback(ptr);
+}
+
+static inline void __user *masked_user_access_begin(const void __user *p)
+{
+	void __user *ptr = mask_user_address(p);
+
+	might_fault();
+	allow_read_write_user(ptr, ptr);
+
+	return ptr;
+}
+#define masked_user_access_begin masked_user_access_begin
+
+static inline void __user *masked_user_read_access_begin(const void __user *p)
+{
+	void __user *ptr = mask_user_address(p);
+
+	might_fault();
+	allow_read_from_user(ptr);
+
+	return ptr;
+}
+#define masked_user_read_access_begin masked_user_read_access_begin
+
+static inline void __user *masked_user_write_access_begin(const void __user *p)
+{
+	void __user *ptr = mask_user_address(p);
+
+	might_fault();
+	allow_write_to_user(ptr);
+
+	return ptr;
+}
+#define masked_user_write_access_begin masked_user_write_access_begin
+
 #define unsafe_get_user(x, p, e) do {					\
 	__long_type(*(p)) __gu_val;				\
 	__typeof__(*(p)) __user *__gu_addr = (p);		\
-- 
2.49.0
Re: [PATCH 5/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Sun, 22 Jun 2025 11:52:43 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> Masked user access avoids the address/size verification by access_ok().
> Although its main purpose is to avoid speculation in the verification
> of the user address and size, and hence avoid the need for speculation
> mitigation, it also has the advantage of reducing the number of
> instructions needed, so it also benefits platforms that don't need
> speculation mitigation, especially when the size of the copy is not
> known at build time.

Not checking the size is slightly orthogonal.
It really just depends on the accesses being 'reasonably sequential'.
That is probably always true, since access_ok() covers a single copy.

> 
> So implement masked user access on powerpc. The only requirement is
> to have a memory gap that faults between the top of user space and the
> real start of the kernel area. On 64-bit platforms it is easy: bit 0
> (the most significant bit) is always 0 for user addresses and always 1
> for kernel addresses, and user addresses stop long before the end of
> their area. On 32 bits it is more tricky. In theory user space can go
> up to 0xbfffffff while the kernel will usually start at 0xc0000000, so
> a gap needs to be added in between. Although in theory a single 4k page
> would suffice, it is easier and more efficient to enforce a 128k gap
> below the kernel, as it simplifies the masking.

The gap isn't strictly necessary - provided the first access is guaranteed
to be at the specified address and the transfers are guaranteed sequential.
But that is hard to guarantee.

Where does the vdso end up?
My guess is 'near the top of userspace' - but maybe not.

> 
> Unlike x86_64, which masks the address to 'all bits set' when the
> user address is invalid, here the address is set to an address in
> the gap. This avoids relying on the zero page to catch offset
> accesses.

Not true.
Using 'cmov' also removed an instruction.

> 
> e500 has the isel instruction, which allows selecting one value or
> the other without a branch, and that instruction is not speculative,
> so use it. Although GCC usually generates code using that instruction,
> it is safer to use inline assembly to be sure. The result is:
> 
>   14:	3d 20 bf fe 	lis     r9,-16386
>   18:	7c 03 48 40 	cmplw   r3,r9
>   1c:	7c 69 18 5e 	iselgt  r3,r9,r3
> 
> On other platforms, when kernel space is above 0x80000000 and user
> space is below it, the logic in mask_user_address_simple() leads to a
> 3-instruction sequence:
> 
>   14:	7c 69 fe 70 	srawi   r9,r3,31
>   18:	7c 63 48 78 	andc    r3,r3,r9
>   1c:	51 23 00 00 	rlwimi  r3,r9,0,0,0
> 
> This is the default on powerpc 8xx.
> 
> When the limit between user space and kernel space is not 0x80000000,
> mask_user_address_32() is used and a 6-instruction sequence is
> generated:
> 
>   24:	54 69 7c 7e 	srwi    r9,r3,17
>   28:	21 29 57 ff 	subfic  r9,r9,22527
>   2c:	7d 29 fe 70 	srawi   r9,r9,31
>   30:	75 2a b0 00 	andis.  r10,r9,45056
>   34:	7c 63 48 78 	andc    r3,r3,r9
>   38:	7c 63 53 78 	or      r3,r3,r10
> 
> The constraint is that TASK_SIZE must be aligned to 128K in order to
> get the minimal number of instructions.
> 
> When CONFIG_PPC_BARRIER_NOSPEC is not defined, fall back on the
> test-based masking, as it is quicker than the 6-instruction sequence
> but not necessarily quicker than the 3-instruction sequences above.

Doesn't that depend on whether the branch is predicted correctly?

I can't read ppc asm well enough to check the above.
And the C is also a bit tortuous.


> 
> On 64 bits, the kernel is always above 0x8000000000000000 and user
> space always below, which leads to a 4-instruction sequence:
> 
>   80:	7c 69 1b 78 	mr      r9,r3
>   84:	7c 63 fe 76 	sradi   r3,r3,63
>   88:	7d 29 18 78 	andc    r9,r9,r3
>   8c:	79 23 00 4c 	rldimi  r3,r9,0,1
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/Kconfig               |   2 +-
>  arch/powerpc/include/asm/uaccess.h | 100 +++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index c3e0cc83f120..c26a39b4504a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -1303,7 +1303,7 @@ config TASK_SIZE
>  	hex "Size of user task space" if TASK_SIZE_BOOL
>  	default "0x80000000" if PPC_8xx
>  	default "0xb0000000" if PPC_BOOK3S_32 && EXECMEM
> -	default "0xc0000000"
> +	default "0xbffe0000"
>  
>  config MODULES_SIZE_BOOL
>  	bool "Set custom size for modules/execmem area"
> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
> index 89d53d4c2236..19743ee80523 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -2,6 +2,8 @@
>  #ifndef _ARCH_POWERPC_UACCESS_H
>  #define _ARCH_POWERPC_UACCESS_H
>  
> +#include <linux/sizes.h>
> +
>  #include <asm/processor.h>
>  #include <asm/page.h>
>  #include <asm/extable.h>
> @@ -455,6 +457,104 @@ user_write_access_begin(const void __user *ptr, size_t len)
>  #define user_write_access_begin	user_write_access_begin
>  #define user_write_access_end		prevent_current_write_to_user
>  
> +/*
> + * Masking the user address is an alternative to a conditional
> + * user_access_begin that can avoid the fencing. This only works
> + * for dense accesses starting at the address.

I think you need to say that kernel addresses get converted to an invalid
address between user and kernel addresses.
It works provided accesses are 'reasonably dense'.

	David

> + */
> +static inline void __user *mask_user_address_simple(const void __user *ptr)
> +{
> +	unsigned long addr = (unsigned long)ptr;
> +	unsigned long mask = (unsigned long)((long)addr >> (BITS_PER_LONG - 1));
> +
> +	addr = ((addr & ~mask) & (~0UL >> 1)) | (mask & (1UL << (BITS_PER_LONG - 1)));
> +
> +	return (void __user *)addr;
> +}
> +
> +static inline void __user *mask_user_address_e500(const void __user *ptr)
> +{
> +	unsigned long addr;
> +
> +	asm("cmplw %1, %2; iselgt %0, %2, %1" : "=r"(addr) : "r"(ptr), "r"(TASK_SIZE): "cr0");
> +
> +	return (void __user *)addr;
> +}
> +
> +/* Make sure TASK_SIZE is a multiple of 128K for shifting by 17 to the right */
> +static inline void __user *mask_user_address_32(const void __user *ptr)
> +{
> +	unsigned long addr = (unsigned long)ptr;
> +	unsigned long mask = (unsigned long)((long)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31);
> +
> +	addr = (addr & ~mask) | (TASK_SIZE & mask);
> +
> +	return (void __user *)addr;
> +}
> +
> +static inline void __user *mask_user_address_fallback(const void __user *ptr)
> +{
> +	unsigned long addr = (unsigned long)ptr;
> +
> +	return (void __user *)(addr < TASK_SIZE ? addr : TASK_SIZE);
> +}
> +
> +static inline void __user *mask_user_address(const void __user *ptr)
> +{
> +#ifdef MODULES_VADDR
> +	const unsigned long border = MODULES_VADDR;
> +#else
> +	const unsigned long border = PAGE_OFFSET;
> +#endif
> +	BUILD_BUG_ON(TASK_SIZE_MAX & (SZ_128K - 1));
> +	BUILD_BUG_ON(TASK_SIZE_MAX + SZ_128K > border);
> +	BUILD_BUG_ON(TASK_SIZE_MAX & 0x8000000000000000ULL);
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC64) && !(PAGE_OFFSET & 0x8000000000000000ULL));
> +
> +	if (IS_ENABLED(CONFIG_PPC64))
> +		return mask_user_address_simple(ptr);
> +	if (IS_ENABLED(CONFIG_E500))
> +		return mask_user_address_e500(ptr);
> +	if (TASK_SIZE <= SZ_2G && border >= SZ_2G)
> +		return mask_user_address_simple(ptr);
> +	if (IS_ENABLED(CONFIG_PPC_BARRIER_NOSPEC))
> +		return mask_user_address_32(ptr);
> +	return mask_user_address_fallback(ptr);
> +}
> +
> +static inline void __user *masked_user_access_begin(const void __user *p)
> +{
> +	void __user *ptr = mask_user_address(p);
> +
> +	might_fault();
> +	allow_read_write_user(ptr, ptr);
> +
> +	return ptr;
> +}
> +#define masked_user_access_begin masked_user_access_begin
> +
> +static inline void __user *masked_user_read_access_begin(const void __user *p)
> +{
> +	void __user *ptr = mask_user_address(p);
> +
> +	might_fault();
> +	allow_read_from_user(ptr);
> +
> +	return ptr;
> +}
> +#define masked_user_read_access_begin masked_user_read_access_begin
> +
> +static inline void __user *masked_user_write_access_begin(const void __user *p)
> +{
> +	void __user *ptr = mask_user_address(p);
> +
> +	might_fault();
> +	allow_write_to_user(ptr);
> +
> +	return ptr;
> +}
> +#define masked_user_write_access_begin masked_user_write_access_begin
> +
>  #define unsafe_get_user(x, p, e) do {					\
>  	__long_type(*(p)) __gu_val;				\
>  	__typeof__(*(p)) __user *__gu_addr = (p);		\
Re: [PATCH 5/5] powerpc: Implement masked user access
Posted by Segher Boessenkool 3 months, 2 weeks ago
Hi!

On Sun, Jun 22, 2025 at 06:13:51PM +0100, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:43 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> > e500 has the isel instruction, which allows selecting one value or
> > the other without a branch, and that instruction is not speculative,
> > so use it. Although GCC usually generates code using that instruction,
> > it is safer to use inline assembly to be sure. The result is:

The instruction (which is a standard Power instruction since
architecture version 2.03, published in 2006) can in principle be
speculative, but there exist no Power implementations that do any data
speculation like this at all.

If you want a particular machine instruction to be generated you have
to write it manually, sure, in inline asm or preferably in actual asm.
But you can be sure that GCC will generate isel or similar (like the
v3.1 set[n]bc[r] insns, best instructions ever!), whenever appropriate,
i.e. when it is a) allowed at all, and b) advantageous.

> >   14:	3d 20 bf fe 	lis     r9,-16386
> >   18:	7c 03 48 40 	cmplw   r3,r9
> >   1c:	7c 69 18 5e 	iselgt  r3,r9,r3
> > 
> > On other platforms, when kernel space is above 0x80000000 and user
> > space is below it, the logic in mask_user_address_simple() leads to a
> > 3-instruction sequence:
> > 
> >   14:	7c 69 fe 70 	srawi   r9,r3,31
> >   18:	7c 63 48 78 	andc    r3,r3,r9
> >   1c:	51 23 00 00 	rlwimi  r3,r9,0,0,0
> > 
> > This is the default on powerpc 8xx.
> > 
> > When the limit between user space and kernel space is not 0x80000000,
> > mask_user_address_32() is used and a 6-instruction sequence is
> > generated:
> > 
> >   24:	54 69 7c 7e 	srwi    r9,r3,17
> >   28:	21 29 57 ff 	subfic  r9,r9,22527
> >   2c:	7d 29 fe 70 	srawi   r9,r9,31
> >   30:	75 2a b0 00 	andis.  r10,r9,45056
> >   34:	7c 63 48 78 	andc    r3,r3,r9
> >   38:	7c 63 53 78 	or      r3,r3,r10
> > 
> > The constraint is that TASK_SIZE must be aligned to 128K in order to
> > get the minimal number of instructions.
> > 
> > When CONFIG_PPC_BARRIER_NOSPEC is not defined, fall back on the
> > test-based masking, as it is quicker than the 6-instruction sequence
> > but not necessarily quicker than the 3-instruction sequences above.
> 
> Doesn't that depend on whether the branch is predicted correctly?
> 
> I can't read ppc asm well enough to check the above.

[ PowerPC or Power (or Power Architecture, or Power ISA) ]

> And the C is also a bit tortuous.

I can read the code ;-)  All those instructions are normal simple
integer instructions.  Shifts, adds, logicals.

In general, correctly predicted non-taken branches cost absolutely
nothing.  Correctly predicted taken branches cost the same as any taken
branch, so a refetch, maybe resulting in a cycle or so of decode bubble.
And a mispredicted branch can be very expensive, say on the order of a
hundred cycles (but usually more like ten, which is still a lot of insns
worth).

So branches are great for predictable stuff, and "not so great" for
not so predictable stuff.


Segher
Re: [PATCH 5/5] powerpc: Implement masked user access
Posted by Linus Torvalds 3 months, 2 weeks ago
On Sun, 22 Jun 2025 at 10:13, David Laight <david.laight.linux@gmail.com> wrote:
>
> Not checking the size is slightly orthogonal.
> It really just depends on the accesses being 'reasonably sequential'.
> That is probably always true since access_ok() covers a single copy.

It is probably true in practice, but yeah, it's worth thinking about.
Particularly for various user level structure accesses, we do end up
often accessing the members individually and thus potentially out of
order, but as you say "reasonably sequential" is still true: the
accesses are within a reasonably small offset of each other.

And when we have potentially very big accesses with large offsets from
the beginning (ie things like read/write() calls), we do them
sequentially.

There *might* be odd ioctls and such that get offsets from user space,
though. So any conversion to using 'masked_user_access_begin()' needs
to have at least *some* thought and not be just a mindless conversion
from access_ok().

We have this same issue in access_ok() itself, and on x86-64 that does

  static inline bool __access_ok(const void __user *ptr, unsigned long size)
  {
        if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
                return valid_user_address(ptr);
        .. do the more careful one that actually uses the 'size' ...

so it turns access_ok() itself into just a simple single-ended
comparison with the starting address for small sizes, because we know
it's ok to overflow by a bit (because of how valid_user_address()
works on x86).

           Linus
Re: [PATCH 5/5] powerpc: Implement masked user access
Posted by David Laight 3 months, 2 weeks ago
On Sun, 22 Jun 2025 10:40:00 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 22 Jun 2025 at 10:13, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Not checking the size is slightly orthogonal.
> > It really just depends on the accesses being 'reasonably sequential'.
> > That is probably always true since access_ok() covers a single copy.  
> 
> It is probably true in practice, but yeah, it's worth thinking about.
> Particularly for various user level structure accesses, we do end up
> often accessing the members individually and thus potentially out of
> order, but as you say "reasonably sequential" is still true: the
> accesses are within a reasonably small offset of each other.

I found one that did ptr[4] followed by ptr[0], which was potentially
problematic if changed to use 'masked' accesses before you changed the
code to use cmov.

> 
> And when we have potentially very big accesses with large offsets from
> the beginning (ie things like read/write() calls), we do them
> sequentially.
> 
> There *might* be odd ioctls and such that get offsets from user space,
> though. So any conversion to using 'masked_user_access_begin()' needs
> to have at least *some* thought and not be just a mindless conversion
> from access_ok().

True - but ioctl (and similar) code is more likely to be using
copy_to/from_user() on the offset address rather than trying to be too clever.

> 
> We have this same issue in access_ok() itself, and on x86-64 that does
> 
>   static inline bool __access_ok(const void __user *ptr, unsigned long size)
>   {
>         if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
>                 return valid_user_address(ptr);
>         .. do the more careful one that actually uses the 'size' ...
> 
> so it turns access_ok() itself into just a simple single-ended
> comparison with the starting address for small sizes, because we know
> it's ok to overflow by a bit (because of how valid_user_address()
> works on x86).

IIRC there is a comment just below that which says the size could (probably)
just be ignored.
Given how few access_ok() calls there ought to be, checking them shouldn't be
a problem.
But I bet either io_uring or bpf does something strange and unexpected
that is probably a bug waiting to be found.

I remember some very strange code that has two iovec[] arrays for reading
data from a second process.
I think I failed to find all the access_ok() tests.
IIRC it isn't used by anything 'important' and could be nuked on
security grounds.

	David

> 
>            Linus