net: move .getsockopt away from __user buffers

[PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by Breno Leitao 1 week ago

Currently, .getsockopt callback cannot be called with kernel buffers
because it requires userspace addresses:

  int (*getsockopt)(struct socket *sock, int level,
		    int optname, char __user *optval, int __user *optlen);

This prevents kernel callers (io_uring, BPF, etc) from using getsockopt
on levels other than SOL_SOCKET, since they pass kernel pointers rather
than __user pointers.

Following Linus' suggestion [0], this series introduces a wrapper
around iov_iter (sockopt_t) and a temporary getsockopt_iter callback:

  typedef struct sockopt {
	  struct iov_iter iter;
	  int optlen;
  } sockopt_t;

Note: optlen was not suggested by Linus, but I believe it is needed, given
that arbitrary values could be passed by protocols back to userspace.

And the callback becomes:

  int (*getsockopt_iter)(struct socket *sock, int level,
			 int optname, sockopt_t *opt);

The sockopt_t structure encapsulates:
- An iov_iter for reading/writing option data (works with both user
  and kernel buffers)
- An optlen field for buffer size (input) and returned data size
  (output)
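
For illustration, the in/out role of optlen can be modelled in plain
userspace C. This is only a sketch under stated assumptions: iter_sketch
is a hypothetical single-buffer stand-in for struct iov_iter, and
sockopt_sketch, iter_sketch_copy_to() and demo_getsockopt_iter() are
made-up names, not anything from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct iov_iter: one flat destination buffer. */
struct iter_sketch {
	void *buf;
	size_t remaining;
};

/* Same shape as the proposed sockopt_t, but NOT the kernel type. */
struct sockopt_sketch {
	struct iter_sketch iter;
	int optlen;	/* in: buffer size, out: bytes returned */
};

/* Hypothetical copy helper: copies at most what the buffer can hold. */
static size_t iter_sketch_copy_to(struct iter_sketch *it, const void *src,
				  size_t len)
{
	assert(it && src);
	if (len > it->remaining)
		len = it->remaining;
	memcpy(it->buf, src, len);
	it->buf = (char *)it->buf + len;
	it->remaining -= len;
	return len;
}

/* Hypothetical protocol handler that returns one int-sized option. */
static int demo_getsockopt_iter(int value, struct sockopt_sketch *opt)
{
	if (opt->optlen < (int)sizeof(value))
		return -EINVAL;
	if (iter_sketch_copy_to(&opt->iter, &value, sizeof(value)) !=
	    sizeof(value))
		return -EFAULT;
	opt->optlen = sizeof(value);	/* report how much was written */
	return 0;
}
```

The handler never sees whether the destination is a user or a kernel
buffer; it only reads optlen on entry and updates it on success.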

The plan is to enable getsockopt to leverage kernel buffers initially,
but then move .setsockopt from sockptr_t into this as well.

This series:

 1. Adds the sockopt_t type and getsockopt_iter callback to proto_ops
 2. Adds do_sock_getsockopt_iter() helper that prefers getsockopt_iter
 3. Converts one protocol (netlink) to use getsockopt_iter as a proof of
    concept
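
As a rough model of step 2, a dispatcher that prefers the new callback
and falls back to the legacy one could look like the sketch below. It is
userspace C with hypothetical names (proto_ops_sketch,
dispatch_getsockopt(), the two demo handlers). The cover letter does not
show how the real helper treats the legacy path, so the fallback here is
an assumption; in the kernel the legacy callback only accepts __user
pointers, which this model ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sockopt_t: flat buffer plus in/out length. */
struct sockopt_sketch {
	void *buf;
	int optlen;
};

/* Hypothetical subset of proto_ops with both callbacks side by side. */
struct proto_ops_sketch {
	int (*getsockopt)(int level, int optname, char *optval, int *optlen);
	int (*getsockopt_iter)(int level, int optname,
			       struct sockopt_sketch *opt);
};

/* Prefer the new callback; fall back to the legacy one if it is absent. */
static int dispatch_getsockopt(const struct proto_ops_sketch *ops, int level,
			       int optname, struct sockopt_sketch *opt)
{
	assert(ops && opt);
	if (ops->getsockopt_iter)
		return ops->getsockopt_iter(level, optname, opt);
	if (ops->getsockopt)
		return ops->getsockopt(level, optname, opt->buf, &opt->optlen);
	return -EOPNOTSUPP;
}

/* Demo handlers that tag the buffer so the chosen path is visible. */
static int legacy_demo(int level, int optname, char *optval, int *optlen)
{
	(void)level;
	(void)optname;
	optval[0] = 'L';
	*optlen = 1;
	return 0;
}

static int iter_demo(int level, int optname, struct sockopt_sketch *opt)
{
	(void)level;
	(void)optname;
	((char *)opt->buf)[0] = 'I';
	opt->optlen = 1;
	return 0;
}
```

A protocol that has been converted sets getsockopt_iter and is picked up
automatically; unconverted protocols keep working through the old slot.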

This is what I have in mind for this work stream, to make it more
digestible:

 * Keeping the temporary getsockopt_iter callback allows protocols to
   migrate gradually.
 * Once all protocols have been converted, getsockopt can be removed and
   getsockopt_iter renamed back to getsockopt with the new API.
 * Once the protocols are converted, the SOL_SOCKET limitation in
   io_uring_cmd_getsockopt() will be removed.
 * Convert setsockopt() to also use a similar strategy, moving it away
   from sockptr_t.
 * Remove sockptr_t in the front end (do_sock_getsockopt(),
   io_uring_cmd_getsockopt()) and start with sockopt_t (instead of
   sockptr_t) in __sys_getsockopt() and io_uring_cmd_getsockopt()

Link: https://lore.kernel.org/all/CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com/ [0]
---
Breno Leitao (3):
      net: add getsockopt_iter callback to proto_ops
      net: prefer getsockopt_iter in do_sock_getsockopt
      netlink: convert to getsockopt_iter

 include/linux/net.h      | 19 +++++++++++++++++++
 net/netlink/af_netlink.c | 22 ++++++++++++----------
 net/socket.c             | 42 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 70 insertions(+), 13 deletions(-)
---
base-commit: 4d310797262f0ddf129e76c2aad2b950adaf1fda
change-id: 20260130-getsockopt-9f36625eedcb

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by David Laight 1 week ago

On Fri, 30 Jan 2026 10:46:16 -0800
Breno Leitao <leitao@debian.org> wrote:

> Currently, .getsockopt callback cannot be called with kernel buffers
> because it requires userspace addresses:
> 
>   int (*getsockopt)(struct socket *sock, int level,
> 		    int optname, char __user *optval, int __user *optlen);
> 
> This prevents kernel callers (io_uring, BPF, etc) from using getsockopt
> on levels other than SOL_SOCKET, since they pass kernel pointers rather
> than __user pointers.

I had thoughts about this as well.
I think using iov_iter is over the top and may have measurable performance
impact for some paths.

I think the first thing to do is sort out 'optlen'.
There is absolutely no reason for the user pointer being passed into
all the per-protocol functions.
(and the code changes that use sockptr_t are just stupid...)
The system call wrapper can do the user copies, it can also suppress
the write if the value is unchanged (which matters with stac/clac).
The obvious change would be to pass the length itself and make the
return value -ERRNO or the size.

The annoyance is the few places that want to return an error and
change optlen.
That might be best addressed by something like:
#define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
which would get picked in the rval < 0 path.
It would also let 'return 0' mean 'don't change the size' requiring
a special return for the one (or two?) places that want to set the
size to zero and return success.

The length passed should also be 'unsigned int' - with a check for
negative values in the system call wrapper.
(There are many broken drivers that treat negative lengths as 4.)
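
For illustration, the common path of such a wrapper might look like the
sketch below: it rejects negative lengths, hands the handler an unsigned
length, and only stores the length back when it changed. This is
userspace C with hypothetical names (getsockopt_len_fn,
wrapper_getsockopt(), demo_int_option()); plain loads and stores stand
in for get_user()/put_user(), and returning -EINVAL for a negative
length is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical handler type: returns -errno, or the number of bytes written. */
typedef int (*getsockopt_len_fn)(void *optval, unsigned int len);

/* Counts stores to the caller's length, to make the suppression visible. */
static int optlen_writes;

static int wrapper_getsockopt(getsockopt_len_fn fn, void *optval, int *optlen)
{
	int len, rval;

	assert(fn && optlen);
	len = *optlen;			/* stands in for get_user() */
	if (len < 0)
		return -EINVAL;		/* never pass a negative length down */

	rval = fn(optval, (unsigned int)len);
	if (rval < 0)
		return rval;
	/* 0 means "leave the size alone"; skip the store if nothing changed */
	if (rval > 0 && rval != len) {
		*optlen = rval;		/* stands in for put_user() */
		optlen_writes++;
	}
	return 0;
}

/* Hypothetical protocol handler for one int-sized option. */
static int demo_int_option(void *optval, unsigned int len)
{
	int value = 1500;

	if (len < sizeof(value))
		return -EINVAL;
	memcpy(optval, &value, sizeof(value));
	return sizeof(value);
}
```

With a caller that already passes the exact size, the second store never
happens, which is the case where avoiding the user access pays off.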

There is not much point making the 'optval' parameter more than
a structure of a user and kernel address - one of which will be NULL.
(This is safer than sockptr_t's discriminated union.)
You can't police the length because it is sometimes only the length
of a header (and in some recent code as well).

I have looked at some of this change - it is enormous.

	David


Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by Breno Leitao 4 days, 21 hours ago

Hello David,

On Fri, Jan 30, 2026 at 08:52:27PM +0000, David Laight wrote:

> The system call wrapper can do the user copies, it can also suppress
> the write if the value is unchanged (which matters with stac/clac).

This aligns with my proposal: using an in-kernel optlen that protocol
functions can operate on directly:

	typedef struct sockopt {
		struct iov_iter iter;
		int optlen;
	} sockopt_t;

> The obvious change would be to pass the length itself and make the
> return value -ERRNO or the size.

I explored this approach to avoid embedding optlen in sockopt (which was
Linus' original suggestion). I attempted returning the length both via
iov_iter and as a return value, but neither proved ideal.

> #define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
> which would get picked in the rval < 0 path.
> It would also let 'return 0' mean 'don't change the size' requiring
> a special return for the one (or two?) places that want to set the
> size to zero and return success.

My conclusion is that encoding both optlen and error in the return value
requires pointer manipulation that isn't justified for this slow path.
While technically feasible, the resulting "mixed pointer abomination"
won't be worth it.

> There is not much point making the 'optval' parameter more than
> a structure of a user and kernel address - one of which will be NULL.
> (This is safer than sockptr_t's discriminated union.)

This approach forces every protocol to distinguish between userspace and
kernelspace, then perform the appropriate copy:

  static inline int mgetsockopt(void *kernel_optlen, void *user_optlen, ..)
  {
	....
	if (kernel_optlen)
		memcpy(kernel_optlen, newoptlen, ...
	else
		copy_to_user(user_optlen, newoptlen, ...
  }

Additionally, you'd need safeguards ensuring callers never pass both user
and kernel pointers simultaneously. This seems significantly worse than
using sockptr.

--breno

Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by David Laight 4 days, 11 hours ago

On Mon, 2 Feb 2026 04:32:42 -0800
Breno Leitao <leitao@debian.org> wrote:

> Hello David,
> 
> On Fri, Jan 30, 2026 at 08:52:27PM +0000, David Laight wrote:
> 
> > The system call wrapper can do the user copies, it can also suppress
> > the write if the value is unchanged (which matters with stac/clac).
> 
> This aligns with my proposal: using an in-kernel optlen that protocol
> functions can operate on directly:
> 
> 	typedef struct sockopt {
> 		struct iov_iter iter;
> 		int optlen;
> 	} sockopt_t;
> 
> > The obvious change would be to pass the length itself and make the
> > return value -ERRNO or the size.  
> 
> I explored this approach to avoid embedding optlen in sockopt (which was
> Linus' original suggestion). I attempted returning the length both via
> iov_iter and as a return value, but neither proved ideal.
> 
> > #define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
> > which would get picked in the rval < 0 path.
> > It would also let 'return 0' mean 'don't change the size' requiring
> > a special return for the one (or two?) places that want to set the
> > size to zero and return success.  
> 
> My conclusion is that encoding both optlen and error in the return value
> requires pointer manipulation that isn't justified for this slow path.
> While technically feasible, the resulting "mixed pointer abomination"
> won't be worth it.

Not really, they are both just numbers.
99% of the protocol code can just do 'return -Exxxx' or 'return size'.
That is all simple and foolproof.
The calling code (not many copies) does:
	rval = foo->getsockopt(..., size_in);
	size_out = size_in;
	if (rval >= 0) {
		if (rval > 0)
			size_out = rval;
		rval = 0;
	} else {
		/* abnormal path: only GETSOCKOPT_RVAL() values have bit 30 clear */
		if (!(rval & (1 << 30))) {
			size_out = rval & 0xfffff;
			rval = -((rval & 0x7fffffff) >> 20);
		}
	}
	if (size_out != size_in)
		put_user(size_out, optlen);
	return rval;
(Or something similar depending on exactly how the values are merged.)
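
One self-consistent way to merge the values, written out as a userspace
sketch: bit 31 makes the result negative, the errno sits in bits 20-29,
the size in bits 0-19, and bit 30 stays clear so a packed value can be
told apart from a plain -errno (where bit 30 is always set). The layout
and the names getsockopt_rval() and decode_rval() are assumptions for
illustration; it only works for errno values below 1024 and sizes below
1 MiB.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define RVAL_SIZE_BITS	20
#define RVAL_SIZE_MASK	((1u << RVAL_SIZE_BITS) - 1)
#define RVAL_ERR_MASK	0x3ffu		/* 10 bits: errno values below 1024 */
#define RVAL_PLAIN_BIT	(1u << 30)	/* set in every plain -errno */

/* Pack a positive errno and a size into one negative int. */
static int getsockopt_rval(unsigned int errval, unsigned int size)
{
	uint32_t v;

	assert(errval <= RVAL_ERR_MASK && size <= RVAL_SIZE_MASK);
	v = (1u << 31) | (errval << RVAL_SIZE_BITS) | size;
	return (int)v;	/* relies on two's complement conversion */
}

/* Caller side: split a handler's return value into status and size. */
static int decode_rval(int rval, unsigned int size_in, unsigned int *size_out)
{
	uint32_t v = (uint32_t)rval;

	*size_out = size_in;
	if (rval >= 0) {
		if (rval > 0)
			*size_out = (unsigned int)rval;
		return 0;
	}
	if (!(v & RVAL_PLAIN_BIT)) {
		/* packed value: recover both the size and the errno */
		*size_out = v & RVAL_SIZE_MASK;
		return -(int)((v >> RVAL_SIZE_BITS) & RVAL_ERR_MASK);
	}
	return rval;	/* plain -errno, size untouched */
}
```

Under this layout getsockopt_rval(0, 0) doubles as the special "set the
size to zero and return success" value, since it decodes to status 0
with size 0.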

> 
> > There is not much point making the 'optval' parameter more than
> > a structure of a user and kernel address - one of which will be NULL.
> > (This is safer than sockptr_t's discriminated union.)
> 
> This approach forces every protocol to distinguish between userspace and
> kernelspace, then perform the appropriate copy:
> 
>   static inline int mgetsockopt(void *kernel_optlen, void *user_optlen, ..)
>   {
> 	....
> 	if (kernel_optlen)
> 		memcpy(kernel_optlen, newoptlen, ...
> 	else
> 		copy_to_user(user_optlen, newoptlen, ...
>   }

That is a function provided by the implementation.
It is no different from using the ones that act on iov_iter.
The real difficulty is stopping the usual culprits (bpf and io_uring)
from cheating and looking inside the structures.

> Additionally, you'd need safeguards ensuring callers never pass both user
> and kernel pointers simultaneously. This seems significantly worse than
> using sockptr.

Sockptr has the real disadvantage that it is very easy to mix up the
kernel and user pointers (there is some horrid code that looks inside).
If you have separate pointers that can't happen.
You might access NULL, but you are never going to use the wrong address.
Remember some systems (s390?) use the same numbers for user and kernel
addresses - you have to get it right.
In any case, if both addresses are set you can just have a rule that
one is used by preference - it isn't a problem.

There might be legitimate reasons for setting both pointers.
Consider setsockopt, the wrapper could copy small user structures
into an on-stack buffer.
The structure would then need to contain the address/length of the
kernel buffer as well as the actual user address in case the code
wants to read more than the expected data length.
For a kernel caller you also want the actual length of the buffer
as a separate field from the length of the [sg]etsockopt().

I'm not sure what fields you need for the address buffer.
Probably 'user address', 'kernel address' and 'kernel length',
what you don't need is support for scatter-gather, page list,
pipes etc.
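
A descriptor with just those three fields, plus an implementation-owned
copy helper, might be sketched as follows. This is userspace C with
hypothetical names (optval_sketch, optval_copy_out()); memcpy() stands
in for copy_to_user(), and preferring the kernel buffer when both
pointers are set is an assumption, since the rule is left open above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: no scatter-gather, page lists or pipes. */
struct optval_sketch {
	void *user_addr;	/* stands in for a __user pointer */
	void *kernel_addr;
	size_t kernel_len;	/* real size of the kernel buffer */
};

/*
 * Copy option data out through the descriptor. Protocol code would call
 * this instead of looking at the pointers itself.
 */
static int optval_copy_out(const struct optval_sketch *ov, const void *src,
			   size_t len)
{
	assert(ov && src);
	if (ov->kernel_addr) {
		/* assumed rule: the kernel buffer wins if both are set */
		if (len > ov->kernel_len)
			return -EINVAL;
		memcpy(ov->kernel_addr, src, len);
		return 0;
	}
	if (ov->user_addr) {
		memcpy(ov->user_addr, src, len);  /* copy_to_user() in a kernel */
		return 0;
	}
	return -EFAULT;		/* neither pointer set */
}
```

Because the two addresses live in separate fields, a mix-up can at worst
hit a NULL pointer; it cannot turn a user address into a kernel one.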

> 
> --breno
>

Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by Linus Torvalds 1 week ago

On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
>
> There is not much point making the 'optval' parameter more than
> a structure of a user and kernel address - one of which will be NULL.

That's exactly what we do *NOT* want. Because people will get it
wrong, and then we're back to the bad old days where trivial bugs
result in security issues.

Can you point to an actual case where setsockopt / getsockopt would be
performance-critical? Typically you do it once or twice.

              Linus

Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by David Laight 6 days, 18 hours ago

On Fri, 30 Jan 2026 17:19:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > There is not much point making the 'optval' parameter more than
> > a structure of a user and kernel address - one of which will be NULL.  
> 
> That's exactly what we do *NOT* want. Because people will get it
> wrong, and then we're back to the bad old days where trivial bugs
> result in security issues.

It can still be a (semi-)transparent structure that code isn't allowed to change.
That is no different from using iov_iter.

> Can you point to an actual case where setsockopt / getsockopt would be
> performance-critical? Typically you do it once or twice.

IIRC a really horrid one - I think for async io.
That is also one of the few where the supplied length is a lie.

	David

> 
>               Linus
>

Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers

Posted by Jens Axboe 6 days, 18 hours ago

On 1/31/26 8:37 AM, David Laight wrote:
> On Fri, 30 Jan 2026 17:19:55 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
>>>
>>> There is not much point making the 'optval' parameter more than
>>> a structure of a user and kernel address - one of which will be NULL.  
>>
>> That's exactly what we do *NOT* want. Because people will get it
>> wrong, and then we're back to the bad old days where trivial bugs
>> result in security issues.
> 
> It can still be a (semi-)transparent structure that code isn't allowed
> to change. That is no different from using iov_iter.

Then why not just use iov_iter?! FWIW, I fully agree with Linus on this
one. We have an existing abstraction, we should use it. We've previously
optimized common cases, like ITER_UBUF, if that ended up being
important. We're better off using iov_iter and improving that, rather
than some new mixed pointer abomination.

>> Can you point to an actual case where setsockopt / getsockopt would be
>> performance-critical? Typically you do it once or twice.
> 
> IIRC a really horrid one - I think for async io.
> That is also one of the few where the supplied length is a lie.

Huh?

-- 
Jens Axboe