From: Bobby Eshleman <bobbyeshleman@meta.com>

Update devmem.rst documentation to describe the autorelease netlink
attribute used during RX dmabuf binding.

The autorelease attribute is specified at bind-time via the netlink API
(NETDEV_CMD_BIND_RX) and controls what happens to outstanding tokens
when the socket closes.

Document the two token release modes (automatic vs manual), how to
configure the binding for autorelease, the perf benefits, new caveats
and restrictions, and the way the mode is enforced system-wide.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v7:
- Document netlink instead of sockopt
- Mention system-wide locked to one mode
---
Documentation/networking/devmem.rst | 73 +++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a6cd7236bfbd..f85f1dcc9621 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -235,6 +235,79 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
+
+Autorelease Control
+~~~~~~~~~~~~~~~~~~~
+
+The autorelease mode controls what happens to outstanding tokens (tokens not
+released via SO_DEVMEM_DONTNEED) when the socket closes. Autorelease is
+configured per-binding at binding creation time via the netlink API::
+
+ struct netdev_bind_rx_req *req;
+ struct netdev_bind_rx_rsp *rsp;
+ struct ynl_sock *ys;
+ struct ynl_error yerr;
+
+ ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, ifindex);
+ netdev_bind_rx_req_set_fd(req, dmabuf_fd);
+ netdev_bind_rx_req_set_autorelease(req, 0); /* 0 = manual, 1 = auto */
+ __netdev_bind_rx_req_set_queues(req, queues, n_queues);
+
+ rsp = netdev_bind_rx(ys, req);
+
+ dmabuf_id = rsp->id;
+
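+The example above omits error handling and cleanup. With the generated YNL C
+library, a failed bind returns NULL; an illustrative check (error reporting
+shown as in the ncdevmem selftest) might look like::
+
+ if (!rsp) {
+         /* bind rejected: e.g. bad ifindex/queues, or an autorelease
+          * mode that conflicts with an existing binding */
+         fprintf(stderr, "bind-rx failed: %s\n", ys->err.msg);
+         exit(1);
+ }
+ netdev_bind_rx_req_free(req);
+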
+When autorelease is disabled (0):
+
+- Outstanding tokens are NOT released when the socket closes
+- Outstanding tokens are only released when all RX queues are unbound AND all
+ sockets that called recvmsg() are closed
+- Provides better performance by eliminating xarray overhead (~13% CPU reduction)
+- Kernel tracks tokens via atomic reference counters in net_iov structures
+
+When autorelease is enabled (1):
+
+- Outstanding tokens are automatically released when the socket closes
+- Backwards compatible behavior
+- Kernel tracks tokens in an xarray per socket
+
+The default is autorelease disabled.
+
+Important: In both modes, applications should call SO_DEVMEM_DONTNEED to
+return tokens as soon as they are done processing. The autorelease setting only
+affects what happens to tokens that are still outstanding when close() is called.
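+
+For illustration, a minimal sketch of batching and returning tokens (names
+such as ``tokens``, ``n_tokens`` and ``client_fd`` are placeholders;
+``dmabuf_cmsg`` is the struct dmabuf_cmsg parsed from an SCM_DEVMEM_DMABUF
+control message as shown in the RX example earlier in this document)::
+
+ struct dmabuf_token tokens[128];
+ size_t n_tokens = 0;
+ int ret;
+
+ /* for each received devmem fragment, remember its token */
+ tokens[n_tokens].token_start = dmabuf_cmsg->frag_token;
+ tokens[n_tokens].token_count = 1;
+ n_tokens++;
+
+ /* once the referenced payload is no longer needed, return the tokens;
+  * the kernel processes at most 1024 frags per call
+  */
+ ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
+                  tokens, n_tokens * sizeof(tokens[0]));
+ if (ret < 0 || (size_t)ret < n_tokens)
+         fprintf(stderr, "not all tokens were released\n");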
+
+The mode is enforced system-wide: once a binding is created with a specific
+autorelease mode, all subsequent bindings must use the same mode.
+
+
+Performance Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Disabling autorelease reduces CPU utilization by roughly 13% in RX workloads.
+In exchange, applications must ensure all tokens are released
+via SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages
+will remain pinned until all RX queues are unbound AND all sockets that called
+recvmsg() are closed.
+
+
+Caveats
+~~~~~~~
+
+- Once a system-wide autorelease mode is selected (via the first binding),
+ all subsequent bindings must use the same mode. Attempts to create bindings
+ with a different mode will be rejected with -EBUSY.
+
+- Applications using manual release mode (autorelease=0) must ensure all tokens
+ are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
+ leaks during the lifetime of the dmabuf binding. Tokens not released before
+ close() will only be freed when all RX queues are unbound AND all sockets
+ that called recvmsg() are closed. An orderly teardown sequence is sketched
+ below.
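+
+For illustration, a rough sketch of an orderly teardown in manual release
+mode, assuming the binding was created over a YNL socket ``ys`` as in the
+example above (``return_all_tokens()`` stands in for the SO_DEVMEM_DONTNEED
+batching shown earlier)::
+
+ /* 1. return every outstanding token received via recvmsg() */
+ return_all_tokens(client_fd);
+
+ /* 2. close the TCP socket(s) that received devmem payloads */
+ close(client_fd);
+
+ /* 3. drop the binding: the binding is tied to the netlink socket that
+  *    created it, so closing that socket unbinds the RX queues and
+  *    releases the dmabuf
+  */
+ ynl_sock_destroy(ys);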
+
+
TX Interface
============
--
2.47.3
On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> +- Once a system-wide autorelease mode is selected (via the first binding),
> +  all subsequent bindings must use the same mode. Attempts to create bindings
> +  with a different mode will be rejected with -EBUSY.

Why?

> +- Applications using manual release mode (autorelease=0) must ensure all tokens
> +  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> +  leaks during the lifetime of the dmabuf binding. Tokens not released before
> +  close() will only be freed when all RX queues are unbound AND all sockets
> +  that called recvmsg() are closed.

Could you add a short example on how? by calling shutdown()?
On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > +- Once a system-wide autorelease mode is selected (via the first binding),
> > +  all subsequent bindings must use the same mode. Attempts to create bindings
> > +  with a different mode will be rejected with -EBUSY.
>
> Why?
>

Originally I was using EINVAL, but when writing the tests I noticed this
might be a confusing case for users to interpret EINVAL (i.e., some
binding possibly made by someone else is in a different mode). I thought
EBUSY could capture the semantic "the system is locked up in a different
mode, try again when it isn't".

I'm not married to it though. Happy to go back to EINVAL or another
errno.

> > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > +  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > +  leaks during the lifetime of the dmabuf binding. Tokens not released before
> > +  close() will only be freed when all RX queues are unbound AND all sockets
> > +  that called recvmsg() are closed.
>
> Could you add a short example on how? by calling shutdown()?

Show an example of the three steps: returning the tokens, unbinding, and closing the
sockets (TCP/NL)?

Best,
Bobby
On Tue, 20 Jan 2026 21:44:09 -0800 Bobby Eshleman wrote:
> On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> > On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > > +- Once a system-wide autorelease mode is selected (via the first binding),
> > > +  all subsequent bindings must use the same mode. Attempts to create bindings
> > > +  with a different mode will be rejected with -EBUSY.
> >
> > Why?
>
> Originally I was using EINVAL, but when writing the tests I noticed this
> might be a confusing case for users to interpret EINVAL (i.e., some
> binding possibly made by someone else is in a different mode). I thought
> EBUSY could capture the semantic "the system is locked up in a different
> mode, try again when it isn't".
>
> I'm not married to it though. Happy to go back to EINVAL or another
> errno.

My question was more why the system-wide policy exists, rather than
binding-by-binding. Naively I'd think that a single socket must pick
but system wide there could easily be multiple bindings not bothering
each other, doing different things?

> > > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > > +  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > > +  leaks during the lifetime of the dmabuf binding. Tokens not released before
> > > +  close() will only be freed when all RX queues are unbound AND all sockets
> > > +  that called recvmsg() are closed.
> >
> > Could you add a short example on how? by calling shutdown()?
>
> Show an example of the three steps: returning the tokens, unbinding, and closing the
> sockets (TCP/NL)?

TBH I read the doc before reading the code, which I guess may actually
be better since we don't expect users to read the code first either..

Now after reading the code I'm not sure the doc explains things
properly. AFAIU there's no association of token <> socket within the
same binding. User can close socket A and return the tokens via socket
B. As written the doc made me think that there will be a leak if socket
is closed without releasing tokens, or that there may be a race with
data queued but not read. Neither is true, really?
On Wed, Jan 21, 2026 at 05:35:12PM -0800, Jakub Kicinski wrote:
> On Tue, 20 Jan 2026 21:44:09 -0800 Bobby Eshleman wrote:
> > On Tue, Jan 20, 2026 at 04:36:50PM -0800, Jakub Kicinski wrote:
> > > On Thu, 15 Jan 2026 21:02:15 -0800 Bobby Eshleman wrote:
> > > > +- Once a system-wide autorelease mode is selected (via the first binding),
> > > > +  all subsequent bindings must use the same mode. Attempts to create bindings
> > > > +  with a different mode will be rejected with -EBUSY.
> > >
> > > Why?
> >
> > Originally I was using EINVAL, but when writing the tests I noticed this
> > might be a confusing case for users to interpret EINVAL (i.e., some
> > binding possibly made by someone else is in a different mode). I thought
> > EBUSY could capture the semantic "the system is locked up in a different
> > mode, try again when it isn't".
> >
> > I'm not married to it though. Happy to go back to EINVAL or another
> > errno.
>
> My question was more why the system-wide policy exists, rather than
> binding-by-binding. Naively I'd think that a single socket must pick
> but system wide there could easily be multiple bindings not bothering
> each other, doing different things?

Originally we allowed per-binding policy, but it seemed one-per-system
may 1) simplify reasoning through the code by only allowing one policy
per system, and 2) allow simpler deprecation of autorelease=on if its
found to be obsolete over time (just hack off that particular path of
the static branch set). It doesn't prevent any races/bugs or anything.

> > > > +- Applications using manual release mode (autorelease=0) must ensure all tokens
> > > > +  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
> > > > +  leaks during the lifetime of the dmabuf binding. Tokens not released before
> > > > +  close() will only be freed when all RX queues are unbound AND all sockets
> > > > +  that called recvmsg() are closed.
> > >
> > > Could you add a short example on how? by calling shutdown()?
> >
> > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > sockets (TCP/NL)?
>
> TBH I read the doc before reading the code, which I guess may actually
> be better since we don't expect users to read the code first either..
>
> Now after reading the code I'm not sure the doc explains things
> properly. AFAIU there's no association of token <> socket within the
> same binding. User can close socket A and return the tokens via socket
> B. As written the doc made me think that there will be a leak if socket
> is closed without releasing tokens, or that there may be a race with
> data queued but not read. Neither is true, really?

That is correct, neither is true. If the two sockets share a binding the
kernel doesn't care which socket received the token or which one
returned it. No token <> socket association. There is no
queued-but-not-read race either. If any tokens are not returned, as long
as all of the binding references are eventually released and all sockets
that used the binding are closed, then all references will be accounted
for and everything cleaned up.

Best,
Bobby
On Wed, 21 Jan 2026 18:37:56 -0800 Bobby Eshleman wrote:
> > > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > > sockets (TCP/NL)?
> >
> > TBH I read the doc before reading the code, which I guess may actually
> > be better since we don't expect users to read the code first either..
> >
> > Now after reading the code I'm not sure the doc explains things
> > properly. AFAIU there's no association of token <> socket within the
> > same binding. User can close socket A and return the tokens via socket
> > B. As written the doc made me think that there will be a leak if socket
> > is closed without releasing tokens, or that there may be a race with
> > data queued but not read. Neither is true, really?
>
> That is correct, neither is true. If the two sockets share a binding the
> kernel doesn't care which socket received the token or which one
> returned it. No token <> socket association. There is no
> queued-but-not-read race either. If any tokens are not returned, as long
> as all of the binding references are eventually released and all sockets
> that used the binding are closed, then all references will be accounted
> for and everything cleaned up.

Naming is hard, but I wonder whether the whole feature wouldn't be
better referred to as something to do with global token accounting
/ management? AUTORELEASE makes sense but seems like focusing on one
particular side effect.
On Wed, Jan 21, 2026 at 6:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 21 Jan 2026 18:37:56 -0800 Bobby Eshleman wrote:
> > > > Show an example of the three steps: returning the tokens, unbinding, and closing the
> > > > sockets (TCP/NL)?
> > >
> > > TBH I read the doc before reading the code, which I guess may actually
> > > be better since we don't expect users to read the code first either..
> > >
> > > Now after reading the code I'm not sure the doc explains things
> > > properly. AFAIU there's no association of token <> socket within the
> > > same binding. User can close socket A and return the tokens via socket
> > > B. As written the doc made me think that there will be a leak if socket
> > > is closed without releasing tokens, or that there may be a race with
> > > data queued but not read. Neither is true, really?
> >
> > That is correct, neither is true. If the two sockets share a binding the
> > kernel doesn't care which socket received the token or which one
> > returned it. No token <> socket association. There is no
> > queued-but-not-read race either. If any tokens are not returned, as long
> > as all of the binding references are eventually released and all sockets
> > that used the binding are closed, then all references will be accounted
> > for and everything cleaned up.
>
> Naming is hard, but I wonder whether the whole feature wouldn't be
> better referred to as something to do with global token accounting
> / management? AUTORELEASE makes sense but seems like focusing on one
> particular side effect.

Good point. The only real use case for autorelease=on is for backwards
compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > That is correct, neither is true. If the two sockets share a binding the
> > > kernel doesn't care which socket received the token or which one
> > > returned it. No token <> socket association. There is no
> > > queued-but-not-read race either. If any tokens are not returned, as long
> > > as all of the binding references are eventually released and all sockets
> > > that used the binding are closed, then all references will be accounted
> > > for and everything cleaned up.
> >
> > Naming is hard, but I wonder whether the whole feature wouldn't be
> > better referred to as something to do with global token accounting
> > / management? AUTORELEASE makes sense but seems like focusing on one
> > particular side effect.
>
> Good point. The only real use case for autorelease=on is for backwards
> compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?

Hm. Maybe let's return to naming once we have consensus on the uAPI.

Does everyone think that pushing this via TCP socket opts still makes
sense, even tho in practice the TCP socket is just how we find the
binding?
On 01/21, Jakub Kicinski wrote:
> On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > > That is correct, neither is true. If the two sockets share a binding the
> > > > kernel doesn't care which socket received the token or which one
> > > > returned it. No token <> socket association. There is no
> > > > queued-but-not-read race either. If any tokens are not returned, as long
> > > > as all of the binding references are eventually released and all sockets
> > > > that used the binding are closed, then all references will be accounted
> > > > for and everything cleaned up.
> > >
> > > Naming is hard, but I wonder whether the whole feature wouldn't be
> > > better referred to as something to do with global token accounting
> > > / management? AUTORELEASE makes sense but seems like focusing on one
> > > particular side effect.
> >
> > Good point. The only real use case for autorelease=on is for backwards
> > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
>
> Hm. Maybe let's return to naming once we have consensus on the uAPI.
>
> Does everyone think that pushing this via TCP socket opts still makes
> sense, even tho in practice the TCP socket is just how we find the
> binding?

I'm not a fan of the existing cmsg scheme, but we already have userspace
using it, so getting more performance out of it seems like an easy win?
On Wed, 21 Jan 2026 20:07:11 -0800 Stanislav Fomichev wrote:
> On 01/21, Jakub Kicinski wrote:
> > On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > Good point. The only real use case for autorelease=on is for backwards
> > > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
> >
> > Hm. Maybe let's return to naming once we have consensus on the uAPI.
> >
> > Does everyone think that pushing this via TCP socket opts still makes
> > sense, even tho in practice the TCP socket is just how we find the
> > binding?
>
> I'm not a fan of the existing cmsg scheme, but we already have userspace
> using it, so getting more performance out of it seems like an easy win?
I don't like:
- the fact that we have to add the binding to a socket (extra field)
- single socket can only serve single binding, there's no technical
reason for this really, AFAICT, just the fact that we have a single
pointer in the sock struct
- the 7 levels of indentation in tcp_recvmsg_dmabuf()
I understand your argument, but as is this series feels closer to a PoC
than an easy win (the easy part should imply minor changes and no
detrimental effect on code quality IMHO).
On Mon, Jan 26, 2026 at 05:26:46PM -0800, Jakub Kicinski wrote:
> On Wed, 21 Jan 2026 20:07:11 -0800 Stanislav Fomichev wrote:
> > On 01/21, Jakub Kicinski wrote:
> > > On Wed, 21 Jan 2026 19:25:27 -0800 Bobby Eshleman wrote:
> > > > Good point. The only real use case for autorelease=on is for backwards
> > > > compatibility... so I thought maybe DEVMEM_A_DMABUF_COMPAT_TOKEN
> > > > or DEVMEM_A_DMABUF_COMPAT_DONTNEED would be clearer?
> > >
> > > Hm. Maybe let's return to naming once we have consensus on the uAPI.
> > >
> > > Does everyone think that pushing this via TCP socket opts still makes
> > > sense, even tho in practice the TCP socket is just how we find the
> > > binding?
> >
> > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > using it, so getting more performance out of it seems like an easy win?
>
> I don't like:
> - the fact that we have to add the binding to a socket (extra field)
> - single socket can only serve single binding, there's no technical
>   reason for this really, AFAICT, just the fact that we have a single
>   pointer in the sock struct

The core reason is that sockets lose the ability to map a given token to
a given binding by no longer storing the niov ptr.

One proposal I had was to encode some number of bits in the token that
can be used to lookup the binding in an array, I could reboot that
approach.

With 32 bits, we can represent:

dmabuf max size = 512 GB, max dmabuf count = 8
dmabuf max size = 256 GB, max dmabuf count = 16
dmabuf max size = 128 GB, max dmabuf count = 32

etc...

Then, if the dmabuf count encoding space is exhausted, the socket would
have to wait until the user returns all of the tokens from one of the
dmabufs and frees the ID (or err out is another option).

This wouldn't change adding a field to the socket, we'd have to add one
or two more for allocating the dmabuf ID and fetching the dmabuf with
it. But it does fix the single binding thing.

> - the 7 levels of indentation in tcp_recvmsg_dmabuf()

For sure, it is getting hairy.

> I understand your argument, but as is this series feels closer to a PoC
> than an easy win (the easy part should imply minor changes and no
> detrimental effect on code quality IMHO).

Sure, let's try to find a way to minimize the changes.

Best,
Bobby
On Mon, 26 Jan 2026 18:30:45 -0800 Bobby Eshleman wrote:
> > > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > > using it, so getting more performance out of it seems like an easy win?
> >
> > I don't like:
> > - the fact that we have to add the binding to a socket (extra field)
> > - single socket can only serve single binding, there's no technical
> >   reason for this really, AFAICT, just the fact that we have a single
> >   pointer in the sock struct
>
> The core reason is that sockets lose the ability to map a given token to
> a given binding by no longer storing the niov ptr.
>
> One proposal I had was to encode some number of bits in the token that
> can be used to lookup the binding in an array, I could reboot that
> approach.
>
> With 32 bits, we can represent:
>
> dmabuf max size = 512 GB, max dmabuf count = 8
> dmabuf max size = 256 GB, max dmabuf count = 16
> dmabuf max size = 128 GB, max dmabuf count = 32
>
> etc...
>
> Then, if the dmabuf count encoding space is exhausted, the socket would
> have to wait until the user returns all of the tokens from one of the
> dmabufs and frees the ID (or err out is another option).
>
> This wouldn't change adding a field to the socket, we'd have to add one
> or two more for allocating the dmabuf ID and fetching the dmabuf with
> it. But it does fix the single binding thing.

I think the bigger problem (than space exhaustion) is that we'd also
have some understanding of permissions. If an application guesses
the binding ID of another app it can mess up its buffers. ENOBUENO..
On Mon, Jan 26, 2026 at 06:44:40PM -0800, Jakub Kicinski wrote:
> On Mon, 26 Jan 2026 18:30:45 -0800 Bobby Eshleman wrote:
> > > > I'm not a fan of the existing cmsg scheme, but we already have userspace
> > > > using it, so getting more performance out of it seems like an easy win?
> > >
> > > I don't like:
> > > - the fact that we have to add the binding to a socket (extra field)
> > > - single socket can only serve single binding, there's no technical
> > >   reason for this really, AFAICT, just the fact that we have a single
> > >   pointer in the sock struct
> >
> > The core reason is that sockets lose the ability to map a given token to
> > a given binding by no longer storing the niov ptr.
> >
> > One proposal I had was to encode some number of bits in the token that
> > can be used to lookup the binding in an array, I could reboot that
> > approach.
> >
> > With 32 bits, we can represent:
> >
> > dmabuf max size = 512 GB, max dmabuf count = 8
> > dmabuf max size = 256 GB, max dmabuf count = 16
> > dmabuf max size = 128 GB, max dmabuf count = 32
> >
> > etc...
> >
> > Then, if the dmabuf count encoding space is exhausted, the socket would
> > have to wait until the user returns all of the tokens from one of the
> > dmabufs and frees the ID (or err out is another option).
> >
> > This wouldn't change adding a field to the socket, we'd have to add one
> > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > it. But it does fix the single binding thing.
>
> I think the bigger problem (than space exhaustion) is that we'd also
> have some understanding of permissions. If an application guesses
> the binding ID of another app it can mess up its buffers. ENOBUENO..

I was thinking it would be per-socket, effectively:

sk->sk_devmem_info.bindings[binding_id_from_token(token)]

So sockets could only access those that they have already recv'd on.
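For concreteness, the encoding being discussed could look roughly like the
sketch below (the 3/29 bit split and all names are hypothetical, illustrating
the proposal above rather than any existing code):

    #include <stdint.h>

    /* Top bits select a per-socket binding slot, low bits give the frag
     * index within that binding's dmabuf.  More slot bits allow more
     * concurrent bindings per socket at the cost of a smaller maximum
     * dmabuf size, which is the size/count trade-off listed above.
     */
    #define BINDING_BITS  3                 /* up to 8 bindings per socket */
    #define FRAG_BITS     (32 - BINDING_BITS)
    #define FRAG_MASK     ((UINT32_C(1) << FRAG_BITS) - 1)

    static inline uint32_t token_pack(uint32_t slot, uint32_t frag_idx)
    {
            return (slot << FRAG_BITS) | (frag_idx & FRAG_MASK);
    }

    static inline uint32_t token_slot(uint32_t token)
    {
            return token >> FRAG_BITS;
    }

    static inline uint32_t token_frag(uint32_t token)
    {
            return token & FRAG_MASK;
    }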
On Mon, 26 Jan 2026 19:06:49 -0800 Bobby Eshleman wrote:
> > > Then, if the dmabuf count encoding space is exhausted, the socket would
> > > have to wait until the user returns all of the tokens from one of the
> > > dmabufs and frees the ID (or err out is another option).
> > >
> > > This wouldn't change adding a field to the socket, we'd have to add one
> > > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > > it. But it does fix the single binding thing.
> >
> > I think the bigger problem (than space exhaustion) is that we'd also
> > have some understanding of permissions. If an application guesses
> > the binding ID of another app it can mess up its buffers. ENOBUENO..
>
> I was thinking it would be per-socket, effectively:
>
> sk->sk_devmem_info.bindings[binding_id_from_token(token)]
>
> So sockets could only access those that they have already recv'd on.

Ah, missed that the array would be per socket. I guess it'd have to be
reusing the token xarray otherwise we're taking up even more space in
the socket struct? Dunno.
On Mon, Jan 26, 2026 at 07:43:59PM -0800, Jakub Kicinski wrote:
> On Mon, 26 Jan 2026 19:06:49 -0800 Bobby Eshleman wrote:
> > > > Then, if the dmabuf count encoding space is exhausted, the socket would
> > > > have to wait until the user returns all of the tokens from one of the
> > > > dmabufs and frees the ID (or err out is another option).
> > > >
> > > > This wouldn't change adding a field to the socket, we'd have to add one
> > > > or two more for allocating the dmabuf ID and fetching the dmabuf with
> > > > it. But it does fix the single binding thing.
> > >
> > > I think the bigger problem (than space exhaustion) is that we'd also
> > > have some understanding of permissions. If an application guesses
> > > the binding ID of another app it can mess up its buffers. ENOBUENO..
> >
> > I was thinking it would be per-socket, effectively:
> >
> > sk->sk_devmem_info.bindings[binding_id_from_token(token)]
> >
> > So sockets could only access those that they have already recv'd on.
>
> Ah, missed that the array would be per socket. I guess it'd have to be
> reusing the token xarray otherwise we're taking up even more space in
> the socket struct? Dunno.

Yeah, unless we just want to break this all off into a malloc'd struct
we point to... or put into tcp_sock (not sure if either addresses the
unappealing bit of adding to struct sock)?