This series reduces the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator and instead use a per-binding niov
array and a uref field in niov.
The improvement is ~13% CPU utilization per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, with no improvement: 1) using a
hashmap for tokens, and 2) keeping an xarray of atomic counters but
using RCU so that the hot path could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
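To make the chosen approach concrete, here is a minimal sketch of the
idea; the structure and helper names are illustrative assumptions, not
the actual fields and functions added by the patches:

#include <linux/atomic.h>

/* Illustrative only: per-binding, per-niov user reference counts,
 * indexed by the niov's position within the binding.  Handing a token
 * to user space bumps the counter; returning the token drops it.  The
 * hot path touches a flat array instead of going through an xarray
 * allocator keyed by token.
 */
struct demo_binding {
        atomic_t *urefs;        /* one counter per niov in the dmabuf */
        size_t nr_niovs;
};

static void demo_uref_get(struct demo_binding *b, size_t niov_idx)
{
        atomic_inc(&b->urefs[niov_idx]);
}

static void demo_uref_put(struct demo_binding *b, size_t niov_idx)
{
        atomic_dec(&b->urefs[niov_idx]);
}

Counters still elevated at unbind time can be drained in bulk, which is
what the "clean up references in dmabuf unbind" fallback in the v3
changelog below refers to.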
The NETDEV_A_DMABUF_AUTORELEASE attribute is added to toggle the
optimization. It is optional and defaults to 0 (i.e., the optimization
is on).
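From the application side, opting in would look roughly like the sketch
below. It reuses the libynl helpers that ncdevmem already calls; the
*_set_autorelease() helper is an assumption about what ynl would
generate from the new spec attribute, not a confirmed interface:

#include <ynl.h>
#include "netdev-user.h"        /* generated from netdev.yaml */

/* Sketch: bind a dmabuf to RX queues with autorelease = 0, i.e. the
 * optimized per-binding token management.  Returns the dmabuf id on
 * success, -1 on failure (error handling trimmed for brevity).
 */
static int bind_rx_no_autorelease(struct ynl_sock *ys, unsigned int ifindex,
                                  int dmabuf_fd,
                                  struct netdev_queue_id *queues,
                                  unsigned int n_queues)
{
        struct netdev_bind_rx_req *req;
        struct netdev_bind_rx_rsp *rsp;
        int id = -1;

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, ifindex);
        netdev_bind_rx_req_set_fd(req, dmabuf_fd);
        /* assumed generated setter for NETDEV_A_DMABUF_AUTORELEASE */
        netdev_bind_rx_req_set_autorelease(req, 0);
        __netdev_bind_rx_req_set_queues(req, queues, n_queues);

        rsp = netdev_bind_rx(ys, req);
        if (rsp) {
                id = rsp->id;
                netdev_bind_rx_rsp_free(rsp);
        }
        netdev_bind_rx_req_free(req);
        return id;
}

Per the paragraph above, omitting the attribute (or passing 0) selects
the optimized path, while 1 keeps the original token release behavior.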
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Changes in v10:
- add new tests for edge cases
- add new binding->users to binding for tracking socket/rxq users
- remove rx binding count (use xarray instead)
- Link to v9: https://lore.kernel.org/r/20260109-scratch-bobbyeshleman-devmem-tcp-token-upstream-v9-0-8042930d00d7@meta.com
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-upstream-v8-0-92c968631496@meta.com
Changes in v8:
- change static branch logic (only set when enabled, otherwise just
always revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-upstream-v7-0-1abc8467354c@meta.com
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist
with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-upstream-v6-0-ea98cf4d40b3@meta.com
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-upstream-v5-0-47cb85f5259e@meta.com
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v4-0-39156563c3ea@meta.com
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-upstream-v3-0-084b46bda88f@meta.com
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fall back to cleaning up references in dmabuf unbind if the socket
  leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease tests
Documentation/netlink/specs/netdev.yaml | 12 ++
Documentation/networking/devmem.rst | 73 +++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 148 ++++++++++++++++++----
net/core/devmem.h | 66 +++++++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 +++++++++++----
net/ipv4/tcp.c | 87 ++++++++++---
net/ipv4/tcp_ipv4.c | 15 ++-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 98 +++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 68 +++++++++-
16 files changed, 611 insertions(+), 87 deletions(-)
---
base-commit: d4596891e72cbf155d61798a81ce9d36b69bfaf4
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
On Thu, 15 Jan 2026 21:02:11 -0800 Bobby Eshleman wrote:
> This series reduces the CPU cost of RX token management by adding an
> attribute to NETDEV_CMD_BIND_RX that configures sockets using the
> binding to avoid the xarray allocator and instead use a per-binding
> niov array and a uref field in niov.
[...]

IDK if the cmsg approach is still right for this flow TBH.
IIRC when Stan talked about this a while back we were considering doing
this via Netlink. Anything that proves that the user owns the binding
would work. IIUC the TCP socket in this design just proves that socket
has received a token from a given binding right?
On Tue, Jan 20, 2026 at 5:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?

Doesn't 'doing this via netlink' imply it's a control-path operation
that acquires rtnl_lock or netdev_lock or some other heavy lock,
expecting you to do some config change? Returning tokens is a data-path
operation; IIRC we don't even lock the socket to do it in the
setsockopt.

Is there precedent/path to doing fast data-path operations via netlink?
There may be value in not biting off more than we can chew in one
series. Maybe an alternative non-setsockopt dontneeding scheme should
be its own patch series.

--
Thanks,
Mina
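For context, the setsockopt flow being discussed is the existing devmem
TCP receive path from Documentation/networking/devmem.rst, condensed
here with error handling omitted (it assumes uapi headers new enough to
provide MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, SO_DEVMEM_DONTNEED and
struct dmabuf_cmsg/dmabuf_token):

#include <sys/socket.h>
#include <linux/uio.h>          /* struct dmabuf_cmsg, struct dmabuf_token */

/* Frags arrive as cmsgs on recvmsg(); each token is handed back with
 * the SO_DEVMEM_DONTNEED setsockopt once the payload is consumed.
 */
static void recv_and_return_tokens(int fd, struct msghdr *msg)
{
        struct dmabuf_cmsg *dmabuf_cmsg;
        struct dmabuf_token token;
        struct cmsghdr *cm;

        recvmsg(fd, msg, MSG_SOCK_DEVMEM);

        for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                    cm->cmsg_type != SCM_DEVMEM_DMABUF)
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
                /* ... consume the frag described by dmabuf_cmsg ... */

                token.token_start = dmabuf_cmsg->frag_token;
                token.token_count = 1;
                setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                           &token, sizeof(token));
        }
}

Real applications typically batch many tokens per setsockopt() call; the
one-at-a-time return above just mirrors the documentation example.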
On Wed, Jan 21, 2026 at 08:21:36PM -0800, Mina Almasry wrote:
> Is there precedent/path to doing fast data-path operations via netlink?
> There may be value in not biting off more than we can chew in one
> series. Maybe an alternative non-setsockopt dontneeding scheme should
> be its own patch series.

I'm on board with improving what we have since it helps all of us
currently using this API, though I'm not opposed to discussing a
redesign in another thread/RFC. I do see the attraction of locating the
core logic in one place and possibly reducing some complexity around
socket/binding relationships.

FWIW regarding nl, I do see it supports rtnl lock-free operations via
'62256f98f244 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED', and routing was
recently made lockless with that. I don't see / know of any fast-path
precedent. There are also some things I'm not sure about
performance-wise, like hitting the skb allocator an additional time for
every release batch. I'd want to do some minimal latency comparisons
between that path and the sockopt before diving in head-first.

Best,
Bobby
On Mon, 26 Jan 2026 10:45:22 -0800 Bobby Eshleman wrote:
> I'm on board with improving what we have since it helps all of us
> currently using this API, though I'm not opposed to discussing a
> redesign in another thread/RFC. I do see the attraction of locating the
> core logic in one place and possibly reducing some complexity around
> socket/binding relationships.
[...]

FTR I'm not really pushing Netlink specifically, it may work it
may not. Perhaps some other ioctl-y thing exists. Just in general
setsockopt() on a specific socket feels increasingly awkward for
buffer flow. Maybe y'all disagree.

I thought I'd clarify since I may be seen as "Mr Netlink Everywhere" :)
On 01/26, Jakub Kicinski wrote:
> FTR I'm not really pushing Netlink specifically, it may work it
> may not. Perhaps some other ioctl-y thing exists. Just in general
> setsockopt() on a specific socket feels increasingly awkward for
> buffer flow. Maybe y'all disagree.

From my side, if we do a completely new uapi, my preference would be
for af_xdp-like mapped rings (presumably on a netlink socket?) to
completely avoid the user-kernel copies.
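Purely as an illustration of the idea being floated here (none of these
names exist in the kernel today), an AF_XDP-style return ring might look
something like:

#include <linux/types.h>

/* Hypothetical mmap()ed token-return ring: user space is the producer
 * of returned tokens and the kernel is the consumer, so returning a
 * batch needs neither a syscall nor a user->kernel copy of a token
 * array.
 */
struct token_return_ring {
        __u32 producer;         /* advanced by user space */
        __u32 consumer;         /* advanced by the kernel */
        __u32 mask;             /* ring_size - 1, ring_size a power of two */
        __u32 tokens[];         /* frag tokens being handed back */
};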
On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> From my side, if we do a completely new uapi, my preference would be
> for af_xdp-like mapped rings (presumably on a netlink socket?) to
> completely avoid the user-kernel copies.

I second liking that approach. No put_cmsg() or token alloc overhead
(both jump up in my profiling).
On 1/27/26 06:48, Bobby Eshleman wrote:
> On Mon, Jan 26, 2026 at 10:00 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
>> From my side, if we do a completely new uapi, my preference would be
>> for af_xdp-like mapped rings (presumably on a netlink socket?) to
>> completely avoid the user-kernel copies.
>
> I second liking that approach. No put_cmsg() or token alloc overhead
> (both jump up in my profiling).

Hmm, makes me wonder why not use zcrx instead of reinventing it? It
doesn't bind net_iovs to sockets, just as you do in this series, and it
also returns buffers back via a shared ring. Otherwise you'll be facing
the same issues, like rings running out of space, and so you will need
a fallback path. User space will also need to synchronise the ring if
it's shared with other threads, and there will be a question of how to
scale it next, possibly by creating multiple rings, as I'm likely to do
soon for zcrx.

--
Pavel Begunkov
On 1/30/26 4:13 AM, Pavel Begunkov wrote:
> Hmm, makes me wonder why not use zcrx instead of reinventing it? It
> doesn't bind net_iovs to sockets, just as you do in this series, and it
> also returns buffers back via a shared ring.
[...]

Was thinking the same throughout most of this later discussion... We
already have an API for this.

--
Jens Axboe
On Tue, Jan 20, 2026 at 05:07:49PM -0800, Jakub Kicinski wrote:
> IDK if the cmsg approach is still right for this flow TBH.
> IIRC when Stan talked about this a while back we were considering doing
> this via Netlink. Anything that proves that the user owns the binding
> would work. IIUC the TCP socket in this design just proves that socket
> has received a token from a given binding right?

In both designs the owner of the binding starts off as the netlink
opener, and then ownership spreads out to TCP sockets as packets are
steered to them. Tokens are received by the user, which gives them a
share in the form of references on the pp and binding. This design
follows the same approach... but I may be misinterpreting what you mean
by ownership?

Best,
Bobby
On Tue, 20 Jan 2026 21:29:36 -0800 Bobby Eshleman wrote:
> In both designs the owner of the binding starts off as the netlink
> opener, and then ownership spreads out to TCP sockets as packets are
> steered to them. Tokens are received by the user, which gives them a
> share in the form of references on the pp and binding. This design
> follows the same approach... but I may be misinterpreting what you mean
> by ownership?

What I was getting at was the same point about socket A vs socket B as
I made on the doc patch. IOW the kernel only tracks how many tokens it
gave out for a net_iov; there's no socket state beyond the binding
pointer. Right?