[v3] net: devmem: improve cpu cost of RX token management

[PATCH net-next v3 0/2] net: devmem: improve cpu cost of RX token management

Posted by Bobby Eshleman 4 months, 2 weeks ago

This series improves the CPU cost of RX token management by replacing
the xarray allocator with an niov array and a uref field in niov.

Improvement is ~5% per RX user thread.

Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hot path could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
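
As a rough illustration of the array-based scheme described above, here is a
minimal userspace sketch, not the code from the patches. It assumes the token
handed to userspace is just the index of the niov in a per-binding array, and
that each entry carries an atomic user reference count. All names
(sketch_niov, sketch_binding, sketch_token_get, sketch_token_put,
sketch_binding_reclaim) are hypothetical, and C11 atomics stand in for the
kernel's atomic_t helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a net_iov: only the user refcount is modeled. */
struct sketch_niov {
	atomic_uint uref;	/* references currently held by userspace */
};

/* Hypothetical stand-in for a dmabuf binding with its array of niovs. */
struct sketch_binding {
	struct sketch_niov *vec;
	size_t num_niovs;
};

/* Allocate the zeroed array once, up front. Returns 0 on success. */
static int sketch_binding_init(struct sketch_binding *b, size_t n)
{
	b->vec = calloc(n, sizeof(*b->vec));
	if (!b->vec)
		return -1;
	for (size_t i = 0; i < n; i++)
		atomic_init(&b->vec[i].uref, 0);
	b->num_niovs = n;
	return 0;
}

/* RX path: take a user reference; the token is simply the array index. */
static unsigned int sketch_token_get(struct sketch_binding *b, size_t idx)
{
	assert(idx < b->num_niovs);
	atomic_fetch_add(&b->vec[idx].uref, 1);
	return (unsigned int)idx;
}

/*
 * Release path: the token comes from userspace, so bounds-check it and
 * refuse to drop a reference that is not outstanding.
 */
static bool sketch_token_put(struct sketch_binding *b, unsigned int token)
{
	unsigned int old;

	if (token >= b->num_niovs)
		return false;

	old = atomic_load(&b->vec[token].uref);
	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&b->vec[token].uref,
						 &old, old - 1))
			return true;
	}
	return false;
}

/* Unbind fallback: drop any references userspace leaked; returns the count. */
static size_t sketch_binding_reclaim(struct sketch_binding *b)
{
	size_t leaked = 0;

	for (size_t i = 0; i < b->num_niovs; i++)
		leaked += atomic_exchange(&b->vec[i].uref, 0);
	return leaked;
}
```

In this sketch both the get and put paths are a bounds check plus one atomic
operation on a preallocated slot, with no allocation or tree walk per token,
which is the kind of saving an index-based array can offer over an allocating
xarray. The reclaim helper mirrors the unbind-time cleanup listed in the v3
changes only in spirit.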

Running with an NCCL workload is still TODO, but I will follow up on this
thread with those results when done.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
  footprint
- fall back to cleaning up references in dmabuf unbind if the socket
  leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com

Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
  (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-upstream-v1-0-d946169b5550@meta.com

---
Bobby Eshleman (2):
      net: devmem: rename tx_vec to vec in dmabuf binding
      net: devmem: use niov array for token management

 include/net/netmem.h     |  1 +
 include/net/sock.h       |  4 +--
 net/core/devmem.c        | 46 +++++++++++++++---------
 net/core/devmem.h        |  4 +--
 net/core/sock.c          | 38 ++++++++++++++------
 net/ipv4/tcp.c           | 94 +++++++++++-------------------------------------
 net/ipv4/tcp_ipv4.c      | 18 ++--------
 net/ipv4/tcp_minisocks.c |  2 --
 8 files changed, 85 insertions(+), 122 deletions(-)
---
base-commit: cd8a4cfa6bb43a441901e82f5c222dddc75a18a3
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>

Re: [PATCH net-next v3 0/2] net: devmem: improve cpu cost of RX token management

Posted by Simon Horman 4 months, 2 weeks ago

On Fri, Sep 26, 2025 at 08:02:52AM -0700, Bobby Eshleman wrote:
> This series improves the CPU cost of RX token management by replacing
> the xarray allocator with an niov array and a uref field in niov.
> 
> Improvement is ~5% per RX user thread.
> 
> Two other approaches were tested, but with no improvement. Namely, 1)
> using a hashmap for tokens and 2) keeping an xarray of atomic counters
> but using RCU so that the hot path could be mostly lockless. Neither of
> these approaches proved better than the simple array in terms of CPU.
> 
> Running with an NCCL workload is still TODO, but I will follow up on this
> thread with those results when done.
> 
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>

Hi Bobby,

Unfortunately this patchset doesn't apply cleanly to net-next.
So you'll need to rebase and repost at some point.

-- 
pw-bot: changes-requested

Re: [PATCH net-next v3 0/2] net: devmem: improve cpu cost of RX token management

Posted by Bobby Eshleman 4 months, 2 weeks ago

On Fri, Sep 26, 2025 at 04:55:01PM +0100, Simon Horman wrote:
> On Fri, Sep 26, 2025 at 08:02:52AM -0700, Bobby Eshleman wrote:
> > This series improves the CPU cost of RX token management by replacing
> > the xarray allocator with an niov array and a uref field in niov.
> > 
> > Improvement is ~5% per RX user thread.
> > 
> > Two other approaches were tested, but with no improvement. Namely, 1)
> > using a hashmap for tokens and 2) keeping an xarray of atomic counters
> > but using RCU so that the hot path could be mostly lockless. Neither of
> > these approaches proved better than the simple array in terms of CPU.
> > 
> > Running with an NCCL workload is still TODO, but I will follow up on this
> > thread with those results when done.
> > 
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> 
> Hi Bobby,
> 
> Unfortunately this patchset doesn't apply cleanly to net-next.
> So you'll need to rebase and repost at some point.
> 
> -- 
> pw-bot: changes-requested

Got it, just resent and added this check to my automation, thanks!

Best,
Bobby