This series reduces the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator and instead use a per-binding niov
array and a uref field in niov.
The improvement is ~13% CPU utilization per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, with no improvement: 1) using a
hashmap for tokens, and 2) keeping an xarray of atomic counters but
using RCU so that the hot path could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
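To make the chosen approach concrete, here is a minimal sketch of the
idea; the structure and helper names are illustrative assumptions, not
the actual fields and functions added by the patches:

#include <linux/atomic.h>

/* Illustrative only: per-binding, per-niov user reference counts,
 * indexed by the niov's position within the binding.  Handing a token
 * to user space bumps the counter; returning the token drops it.  The
 * hot path touches a flat array instead of going through an xarray
 * allocator keyed by token.
 */
struct demo_binding {
        atomic_t *urefs;        /* one counter per niov in the dmabuf */
        size_t nr_niovs;
};

static void demo_uref_get(struct demo_binding *b, size_t niov_idx)
{
        atomic_inc(&b->urefs[niov_idx]);
}

static void demo_uref_put(struct demo_binding *b, size_t niov_idx)
{
        atomic_dec(&b->urefs[niov_idx]);
}

Counters still elevated at unbind time can be drained in bulk, which is
what the "clean up references in dmabuf unbind" fallback in the v3
changelog below refers to.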
The NETDEV_A_DMABUF_AUTORELEASE attribute is added to toggle the
optimization. It is optional and defaults to 0 (i.e., the optimization
is on).
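From the application side, opting in would look roughly like the sketch
below. It reuses the libynl helpers that ncdevmem already calls; the
*_set_autorelease() helper is an assumption about what ynl would
generate from the new spec attribute, not a confirmed interface:

#include <ynl.h>
#include "netdev-user.h"        /* generated from netdev.yaml */

/* Sketch: bind a dmabuf to RX queues with autorelease = 0, i.e. the
 * optimized per-binding token management.  Returns the dmabuf id on
 * success, -1 on failure (error handling trimmed for brevity).
 */
static int bind_rx_no_autorelease(struct ynl_sock *ys, unsigned int ifindex,
                                  int dmabuf_fd,
                                  struct netdev_queue_id *queues,
                                  unsigned int n_queues)
{
        struct netdev_bind_rx_req *req;
        struct netdev_bind_rx_rsp *rsp;
        int id = -1;

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, ifindex);
        netdev_bind_rx_req_set_fd(req, dmabuf_fd);
        /* assumed generated setter for NETDEV_A_DMABUF_AUTORELEASE */
        netdev_bind_rx_req_set_autorelease(req, 0);
        __netdev_bind_rx_req_set_queues(req, queues, n_queues);

        rsp = netdev_bind_rx(ys, req);
        if (rsp) {
                id = rsp->id;
                netdev_bind_rx_rsp_free(rsp);
        }
        netdev_bind_rx_req_free(req);
        return id;
}

Per the paragraph above, omitting the attribute (or passing 0) selects
the optimized path, while 1 keeps the original token release behavior.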
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Changes in v10:
- add new tests for edge cases
- add new binding->users to binding for tracking socket/rxq users
- remove rx binding count (use xarray instead)
- Link to v9: https://lore.kernel.org/r/20260109-scratch-bobbyeshleman-devmem-tcp-token-upstream-v9-0-8042930d00d7@meta.com
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-upstream-v8-0-92c968631496@meta.com
Changes in v8:
- change static branch logic (only set when enabled, otherwise just
always revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-upstream-v7-0-1abc8467354c@meta.com
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist
with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-0-ea98cf4d40b3@meta.com
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-upstream-v5-0-47cb85f5259e@meta.com
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v3-0-084b46bda88f@meta.com
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fall back to cleaning up references in dmabuf unbind if the socket
  leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease tests
Documentation/netlink/specs/netdev.yaml | 12 ++
Documentation/networking/devmem.rst | 73 +++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 148 ++++++++++++++++++----
net/core/devmem.h | 66 +++++++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 +++++++++++----
net/ipv4/tcp.c | 87 ++++++++++---
net/ipv4/tcp_ipv4.c | 15 ++-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 98 +++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 68 +++++++++-
16 files changed, 611 insertions(+), 87 deletions(-)
---
base-commit: d4596891e72cbf155d61798a81ce9d36b69bfaf4
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> This series reduces the CPU cost of RX token management by adding an
> attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> binding to avoid the xarray allocator and instead use a per-binding
> niov array and a uref field in niov.
[...]

IDK if the cmsg approach is still right for this flow TBH.
IIRC when Stan talked about this a while back we were considering doing
this via Netlink. Anything that proves that the user owns the binding
would work. IIUC the TCP socket in this design just proves that socket
has received a token from a given binding right?
On Tue, Jan 20, 2026 at 5:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?

Doesn't 'doing this via netlink' imply it's a control-path operation
that acquires rtnl_lock or netdev_lock or some other heavy lock,
expecting you to do some config change? Returning tokens is a data-path
operation; IIRC we don't even lock the socket to do it in the
setsockopt.

Is there precedent/path to doing fast data-path operations via netlink?
There may be value in not biting off more than we can chew in one
series. Maybe an alternative non-setsockopt dontneeding scheme should
be its own patch series.

--
Thanks,
Mina
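For context, the setsockopt flow being discussed is the existing devmem
TCP receive path from Documentation/networking/devmem.rst, condensed
here with error handling omitted (it assumes uapi headers new enough to
provide MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, SO_DEVMEM_DONTNEED and
struct dmabuf_cmsg/dmabuf_token):

#include <sys/socket.h>
#include <linux/uio.h>          /* struct dmabuf_cmsg, struct dmabuf_token */

/* Frags arrive as cmsgs on recvmsg(); each token is handed back with
 * the SO_DEVMEM_DONTNEED setsockopt once the payload is consumed.
 */
static void recv_and_return_tokens(int fd, struct msghdr *msg)
{
        struct dmabuf_cmsg *dmabuf_cmsg;
        struct dmabuf_token token;
        struct cmsghdr *cm;

        recvmsg(fd, msg, MSG_SOCK_DEVMEM);

        for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                    cm->cmsg_type != SCM_DEVMEM_DMABUF)
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
                /* ... consume the frag described by dmabuf_cmsg ... */

                token.token_start = dmabuf_cmsg->frag_token;
                token.token_count = 1;
                setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                           &token, sizeof(token));
        }
}

Real applications typically batch many tokens per setsockopt() call; the
one-at-a-time return above just mirrors the documentation example.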
On Wed, Jan 21, 2026 at 08:21:36PM -0800, Mina Almasry wrote:
> Is there precedent/path to doing fast data-path operations via netlink?
> There may be value in not biting off more than we can chew in one
> series. Maybe an alternative non-setsockopt dontneeding scheme should
> be its own patch series.

I'm on board with improving what we have since it helps all of us
currently using this API, though I'm not opposed to discussing a
redesign in another thread/RFC. I do see the attraction of locating the
core logic in one place and possibly reducing some complexity around
socket/binding relationships.

FWIW regarding nl, I do see it supports rtnl lock-free operations via
'62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED', and routing was
recently made lockless with that. I don't see / know of any fast-path
precedent. There are also some things I'm not sure about
performance-wise, like hitting the skb allocator an additional time for
every release batch. I'd want to do some minimal latency comparisons
between that path and the sockopt before diving in head-first.

Best,
Bobby
On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
> I'm on board with improving what we have since it helps all of us
> currently using this API, though I'm not opposed to discussing a
> redesign in another thread/RFC. I do see the attraction of locating the
> core logic in one place and possibly reducing some complexity around
> socket/binding relationships.
[...]

FTR I'm not really pushing Netlink specifically, it may work it
may not. Perhaps some other ioctl-y thing exists. Just in general
setsockopt() on a specific socket feels increasingly awkward for
buffer flow. Maybe y'all disagree.

I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
On 01/26, Jakub Kicinski wrote:
> FTR I'm not really pushing Netlink specifically, it may work it
> may not. Perhaps some other ioctl-y thing exists. Just in general
> setsockopt() on a specific socket feels increasingly awkward for
> buffer flow. Maybe y'all disagree.

From my side, if we do a completely new uapi, my preference would be
for af_xdp-like mapped rings (presumably on a netlink socket?) to
completely avoid the user-kernel copies.
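Purely as an illustration of the idea being floated here (none of these
names exist in the kernel today), an AF_XDP-style return ring might look
something like:

#include <linux/types.h>

/* Hypothetical mmap()ed token-return ring: user space is the producer
 * of returned tokens and the kernel is the consumer, so returning a
 * batch needs neither a syscall nor a user->kernel copy of a token
 * array.
 */
struct token_return_ring {
        __u32 producer;         /* advanced by user space */
        __u32 consumer;         /* advanced by the kernel */
        __u32 mask;             /* ring_size - 1, ring_size a power of two */
        __u32 tokens[];         /* frag tokens being handed back */
};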
On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> From my side, if we do a completely new uapi, my preference would be
> for af_xdp-like mapped rings (presumably on a netlink socket?) to
> completely avoid the user-kernel copies.

I second liking that approach. No put_cmsg() or token alloc overhead
(both jump up in my profiling).
On 1/27/26 06:48, Bobby Eshleman wrote:
> On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
>> From my side, if we do a completely new uapi, my preference would be
>> for af_xdp-like mapped rings (presumably on a netlink socket?) to
>> completely avoid the user-kernel copies.
>
> I second liking that approach. No put_cmsg() or token alloc overhead
> (both jump up in my profiling).

Hmm, makes me wonder why not use zcrx instead of reinventing it? It
doesn't bind net_iovs to sockets, just as you do in this series, and it
also returns buffers back via a shared ring. Otherwise you'll be facing
the same issues, like rings running out of space, and so you will need
a fallback path. User space will also need to synchronise the ring if
it's shared with other threads, and there will be a question of how to
scale it next, possibly by creating multiple rings, as I'm likely to do
soon for zcrx.

--
Pavel Begunkov
On 1/30/26 4:13 AM, Pavel Begunkov wrote:
> Hmm, makes me wonder why not use zcrx instead of reinventing it? It
> doesn't bind net_iovs to sockets, just as you do in this series, and it
> also returns buffers back via a shared ring.
[...]

Was thinking the same throughout most of this later discussion... We
already have an API for this.

--
Jens Axboe
On Tue, Jan 20, 2026 at 05:07:49PM -0800, Jakub Kicinski wrote:
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?

In both designs the owner of the binding starts off as the netlink
opener, and then ownership spreads out to TCP sockets as packets are
steered to them. Tokens are received by the user, which gives them a
share in the form of references on the pp and binding. This design
follows the same approach... but I may be misinterpreting what you mean
by ownership?

Best,
Bobby
On Tue, 20 Jan 2026 21:29:36 -0800 Bobby Eshleman wrote:
> In both designs the owner of the binding starts off as the netlink
> opener, and then ownership spreads out to TCP sockets as packets are
> steered to them. Tokens are received by the user, which gives them a
> share in the form of references on the pp and binding. This design
> follows the same approach... but I may be misinterpreting what you mean
> by ownership?

What I was getting at was the same point about socket A vs socket B as
I made on the doc patch. IOW the kernel only tracks how many tokens it
gave out for a net_iov; there's no socket state beyond the binding
pointer. Right?