From nobody Sun Feb 8 04:17:46 2026
Date: Thu, 24 Apr 2025 04:02:53 +0000
In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog
Message-ID: <20250424040301.2480876-2-almasrymina@google.com>
Subject: [PATCH net-next v11 1/8] netmem: add niov->type attribute to distinguish different net_iov types
From: Mina Almasry
To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org
Cc: Mina Almasry, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter, Jonathan Corbet, Andrew Lunn, Jeroen de Borst, Harshitha Ramamurthy, Kuniyuki Iwashima, Willem de Bruijn, Jens Axboe, Pavel Begunkov, David Ahern, Neal Cardwell, Stefan Hajnoczi, Stefano Garzarella, "Michael S. Tsirkin", Jason Wang, Xuan Zhuo, Eugenio Pérez, sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim, Victor Nogueira, Pedro Tammela, Samiullah Khawaja
Content-Type: text/plain; charset="utf-8"

Later patches in the series add TX net_iovs that have no associated pp,
so we can't rely on niov->pp->mp_ops to tell the type of a net_iov.

Add a type enum to the net_iov which tells us the net_iov type.
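A minimal userspace sketch (not kernel code) of why a stored type helps: the old check had to dereference niov->pp, which a TX net_iov without a page pool would not have. The struct layouts and the net_iov_init() helper are simplified, hypothetical stand-ins; only the enum names and net_is_devmem_iov() mirror the patch, and the model omits the NET_IOV_MAX = ULONG_MAX sentinel that the patch uses only to force the enum to unsigned-long size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins, not the kernel's types. */
enum net_iov_type {
	NET_IOV_DMABUF,
	NET_IOV_IOURING,
};

struct page_pool; /* opaque here; a TX net_iov has no pool attached */

struct net_iov {
	enum net_iov_type type;
	struct page_pool *pp; /* NULL on the TX path */
};

/* Hypothetical helper: each provider stamps the type once at setup,
 * as the patch does in io_zcrx_create_area() and
 * net_devmem_bind_dmabuf(). */
static void net_iov_init(struct net_iov *niov, enum net_iov_type type)
{
	assert(niov != NULL);
	niov->type = type;
	niov->pp = NULL;
}

/* The old check compared niov->pp->mp_ops and therefore required a
 * pool. Reading the stored type works with or without one. */
static bool net_is_devmem_iov(const struct net_iov *niov)
{
	return niov->type == NET_IOV_DMABUF;
}
```

Because the type is written by the provider that creates the net_iov, the check no longer depends on which page pool (if any) the net_iov is currently attached to.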
Signed-off-by: Mina Almasry

---

v8:
- Since io_uring zcrx is now in net-next, update io_uring net_iov type
  setting and remove the NET_IOV_UNSPECIFIED type

v7:
- New patch

fix iouring

---
 include/net/netmem.h | 11 ++++++++++-
 io_uring/zcrx.c      |  1 +
 net/core/devmem.c    |  3 ++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/net/netmem.h b/include/net/netmem.h
index c61d5b21e7b42..64af9a288c80c 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -20,8 +20,17 @@ DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
  */
 #define NET_IOV 0x01UL
 
+enum net_iov_type {
+	NET_IOV_DMABUF,
+	NET_IOV_IOURING,
+
+	/* Force size to unsigned long to make the NET_IOV_ASSERTS below pass.
+	 */
+	NET_IOV_MAX = ULONG_MAX,
+};
+
 struct net_iov {
-	unsigned long __unused_padding;
+	enum net_iov_type type;
 	unsigned long pp_magic;
 	struct page_pool *pp;
 	struct net_iov_area *owner;
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 0f46e0404c045..17a54e74ed5d5 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -247,6 +247,7 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 		niov->owner = &area->nia;
 		area->freelist[i] = i;
 		atomic_set(&area->user_refs[i], 0);
+		niov->type = NET_IOV_IOURING;
 	}
 
 	area->free_count = nr_iovs;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 6e27a47d04935..f5c3a7e6dbb7b 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -30,7 +30,7 @@ static const struct memory_provider_ops dmabuf_devmem_ops;
 
 bool net_is_devmem_iov(struct net_iov *niov)
 {
-	return niov->pp->mp_ops == &dmabuf_devmem_ops;
+	return niov->type == NET_IOV_DMABUF;
 }
 
 static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
@@ -266,6 +266,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 
 	for (i = 0; i < owner->area.num_niovs; i++) {
 		niov = &owner->area.niovs[i];
+		niov->type = NET_IOV_DMABUF;
 		niov->owner = &owner->area;
 		page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 					      net_devmem_get_dma_addr(niov));
-- 
2.49.0.805.g082f7c87e0-goog

From nobody Sun Feb 8 04:17:46 2026
Date: Thu, 24 Apr 2025 04:02:54 +0000
In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog
Message-ID: <20250424040301.2480876-3-almasrymina@google.com>
Subject: [PATCH net-next v11 2/8] net: add get_netmem/put_netmem support
From: Mina Almasry
To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org
Cc: Mina Almasry, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter, Jonathan Corbet, Andrew Lunn, Jeroen de Borst, Harshitha Ramamurthy, Kuniyuki Iwashima, Willem de Bruijn, Jens Axboe, Pavel Begunkov, David Ahern, Neal Cardwell, Stefan Hajnoczi, Stefano Garzarella, "Michael S. Tsirkin", Jason Wang, Xuan Zhuo, Eugenio Pérez, sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim, Victor Nogueira, Pedro Tammela, Samiullah Khawaja
Content-Type: text/plain; charset="utf-8"

Currently net_iovs support only pp ref counts, and do not support a
page ref equivalent.

This is fine for the RX path, as net_iovs are used exclusively with the
pp and only pp refcounting is needed there. The TX path, however, does
not use pp ref counts, so a get_page/put_page equivalent is needed for
netmem.

Support get_netmem/put_netmem. Check the type of the netmem before
passing it to page or net_iov specific code to obtain a page ref
equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed.
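The refcounting scheme described above can be modeled in a few lines of userspace C. This is a hypothetical sketch, not the kernel implementation: struct binding, struct netmem and enum netmem_kind are simplified stand-ins for the dmabuf binding, netmem_ref and the netmem type check, and a plain int replaces refcount_t. It shows that an in-flight net_iov keeps the binding alive after the user drops the initial reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model, not kernel code. */
struct binding {
	int ref;    /* stands in for refcount_t ref */
	bool freed; /* set once the last reference is dropped */
};

struct page {
	int refcount;
};

enum netmem_kind { NETMEM_PAGE, NETMEM_DMABUF_IOV };

struct netmem {
	enum netmem_kind kind;
	struct page *page;       /* used when kind == NETMEM_PAGE */
	struct binding *binding; /* used when kind == NETMEM_DMABUF_IOV */
};

static void binding_get(struct binding *b)
{
	assert(b->ref > 0); /* must not revive a dead binding */
	b->ref++;
}

static void binding_put(struct binding *b)
{
	assert(b->ref > 0);
	if (--b->ref == 0)
		b->freed = true; /* the kernel would unmap and free here */
}

/* Dispatch on the netmem type: pages take a page ref, dmabuf net_iovs
 * pin the whole binding instead of keeping a per-net_iov count. */
static void get_netmem(struct netmem *nm)
{
	if (nm->kind == NETMEM_DMABUF_IOV) {
		binding_get(nm->binding);
		return;
	}
	nm->page->refcount++;
}

static void put_netmem(struct netmem *nm)
{
	if (nm->kind == NETMEM_DMABUF_IOV) {
		binding_put(nm->binding);
		return;
	}
	nm->page->refcount--;
}
```

Pinning the binding rather than each net_iov keeps the per-net_iov state small; the `freed` flag only records that the last reference was dropped, where the kernel would tear the binding down.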
We do not need to track the refcount of individual dmabuf net_iovs, as
we don't allocate/free them from a pool the way the buddy allocator
does for pages.

This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to
the correct helper:

pages -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs -> new helpers

Signed-off-by: Mina Almasry
Acked-by: Stanislav Fomichev

---

v5: https://lore.kernel.org/netdev/20250227041209.2031104-2-almasrymina@google.com/

- Updated to check that the net_iov is devmem before calling
  net_devmem_put_net_iov().

- Jakub requested that callers of __skb_frag_ref()/skb_page_unref be
  inspected to make sure that they generate / anticipate skbs with the
  correct pp_recycle and unreadable setting:

skb_page_unref
==============

- skb_page_unref is unreachable from these callers due to unreadable
  checks returning early:

gro_pull_from_frag0, skb_copy_ubufs, __pskb_pull_tail

- callers that are reachable for unreadable skbs. These would only see
  rx unreadable skbs with pp_recycle set before this patchset and would
  drop a pp ref count. After this patchset they can see tx unreadable
  skbs with no pp attached and no pp_recycle set, and so now they will
  drop a net_iov ref via put_netmem:

__pskb_trim, __pskb_trim_head, skb_release_data, skb_shift

__skb_frag_ref
==============

Before this patchset, __skb_frag_ref would not do the right thing if it
saw any unreadable skb, with or without pp_recycle set, because it
unconditionally tries to acquire a page ref. However, with RX-only
support I can't reproduce calls to __skb_frag_ref, even after enabling
tc forwarding to TX.

After this patchset, __skb_frag_ref obtains a page ref equivalent on
dmabuf net_iovs by obtaining a ref on the binding.
Callers that are unreachable for unreadable skbs:

- veth_xdp_get

Callers that are reachable for unreadable skbs, and from code review
they look specific to the TX path:

- tcp_grow_skb, __skb_zcopy_downgrade_managed, __pskb_copy_fclone,
  pskb_expand_head, skb_zerocopy, skb_split, pskb_carve_inside_header,
  pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.

Callers that are reachable for unreadable skbs, and from code review
they look reachable in the RX path, although my testing never hit these
paths. These are concerning. Maybe we should put this patch in net and
cc stable? However, no drivers currently enable unreadable netmem, so
fixing this in net-next is fine as well maybe:

- skb_shift, skb_try_coalesce

v2:
- Add comment on top of refcount_t ref explaining the usage in the TX
  path.
- Fix missing definition of net_devmem_dmabuf_binding_put in this patch.

---
 include/linux/skbuff_ref.h |  4 ++--
 include/net/netmem.h       |  3 +++
 net/core/devmem.c          | 10 ++++++++++
 net/core/devmem.h          | 20 ++++++++++++++++++++
 net/core/skbuff.c          | 30 ++++++++++++++++++++++++++++++
 5 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
index 0f3c58007488a..9e49372ef1a05 100644
--- a/include/linux/skbuff_ref.h
+++ b/include/linux/skbuff_ref.h
@@ -17,7 +17,7 @@
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	get_netmem(skb_frag_netmem(frag));
 }
 
 /**
@@ -40,7 +40,7 @@ static inline void skb_page_unref(netmem_ref netmem, bool recycle)
 	if (recycle && napi_pp_put_page(netmem))
 		return;
 #endif
-	put_page(netmem_to_page(netmem));
+	put_netmem(netmem);
 }
 
 /**
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 64af9a288c80c..1b047cfb9e4f7 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -273,4 +273,7 @@ static inline unsigned long netmem_get_dma_addr(netmem_ref netmem)
 	return __netmem_clear_lsb(netmem)->dma_addr;
 }
 
+void get_netmem(netmem_ref netmem);
+void put_netmem(netmem_ref netmem);
+
 #endif /* _NET_NETMEM_H */
diff --git a/net/core/devmem.c b/net/core/devmem.c
index f5c3a7e6dbb7b..dca2ff7cf6923 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -295,6 +295,16 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 	return ERR_PTR(err);
 }
 
+void net_devmem_get_net_iov(struct net_iov *niov)
+{
+	net_devmem_dmabuf_binding_get(net_devmem_iov_binding(niov));
+}
+
+void net_devmem_put_net_iov(struct net_iov *niov)
+{
+	net_devmem_dmabuf_binding_put(net_devmem_iov_binding(niov));
+}
+
 /*** "Dmabuf devmem memory provider" ***/
 
 int mp_dmabuf_devmem_init(struct page_pool *pool)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 7fc158d527293..946f2e0157467 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -29,6 +29,10 @@ struct net_devmem_dmabuf_binding {
 	 * The binding undos itself and unmaps the underlying dmabuf once all
 	 * those refs are dropped and the binding is no longer desired or in
 	 * use.
+	 *
+	 * net_devmem_get_net_iov() on dmabuf net_iovs will increment this
+	 * reference, making sure that the binding remains alive until all the
+	 * net_iovs are no longer used.
 	 */
 	refcount_t ref;
 
@@ -111,6 +115,9 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
 		__net_devmem_dmabuf_binding_free(binding);
 }
 
+void net_devmem_get_net_iov(struct net_iov *niov);
+void net_devmem_put_net_iov(struct net_iov *niov);
+
 struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
 void net_devmem_free_dmabuf(struct net_iov *ppiov);
@@ -120,6 +127,19 @@ bool net_is_devmem_iov(struct net_iov *niov);
 #else
 struct net_devmem_dmabuf_binding;
 
+static inline void
+net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
+{
+}
+
+static inline void net_devmem_get_net_iov(struct net_iov *niov)
+{
+}
+
+static inline void net_devmem_put_net_iov(struct net_iov *niov)
+{
+}
+
 static inline void
 __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d73ad79fe739d..00c22bce98e44 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -89,6 +89,7 @@
 #include
 
 #include "dev.h"
+#include "devmem.h"
 #include "netmem_priv.h"
 #include "sock_destructor.h"
 
@@ -7313,3 +7314,32 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
 	return false;
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter_full);
+
+void get_netmem(netmem_ref netmem)
+{
+	struct net_iov *niov;
+
+	if (netmem_is_net_iov(netmem)) {
+		niov = netmem_to_net_iov(netmem);
+		if (net_is_devmem_iov(niov))
+			net_devmem_get_net_iov(netmem_to_net_iov(netmem));
+		return;
+	}
+	get_page(netmem_to_page(netmem));
+}
+EXPORT_SYMBOL(get_netmem);
+
+void put_netmem(netmem_ref netmem)
+{
+	struct net_iov *niov;
+
+	if (netmem_is_net_iov(netmem)) {
+		niov = netmem_to_net_iov(netmem);
+		if (net_is_devmem_iov(niov))
+			net_devmem_put_net_iov(netmem_to_net_iov(netmem));
+		return;
+	}
+
+	put_page(netmem_to_page(netmem));
+}
+EXPORT_SYMBOL(put_netmem);
-- 
2.49.0.805.g082f7c87e0-goog

From nobody Sun Feb 8 04:17:46 2026
Date: Thu, 24 Apr 2025 04:02:55 +0000
In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250424040301.2480876-1-almasrymina@google.com>
X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog
Message-ID: <20250424040301.2480876-4-almasrymina@google.com>
Subject: [PATCH net-next v11 3/8] net: devmem: TCP tx netlink api
From: Mina Almasry
To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org
Cc: Mina Almasry, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter, Jonathan Corbet, Andrew Lunn, Jeroen de Borst, Harshitha Ramamurthy, Kuniyuki Iwashima, Willem de Bruijn, Jens Axboe, Pavel Begunkov, David Ahern, Neal Cardwell, Stefan Hajnoczi, Stefano Garzarella, "Michael S. Tsirkin", Jason Wang, Xuan Zhuo, Eugenio Pérez, sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim, Victor Nogueira, Pedro Tammela, Samiullah Khawaja
Content-Type: text/plain; charset="utf-8"

From: Stanislav Fomichev

Add a bind-tx netlink call to attach a dmabuf for TX. No queue is
required; only the ifindex and the dmabuf fd are needed for the
attachment.

Signed-off-by: Stanislav Fomichev
Signed-off-by: Mina Almasry

---

v3:
- Fix ynl-regen.sh error (Simon).
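For illustration, the bind-tx request payload is just two u32 attributes, ifindex and fd, with no queue list. The sketch below hand-encodes them in the netlink attribute layout (a 4-byte header holding a u16 length that includes the header and a u16 type, then the payload padded to 4 bytes). The attribute IDs are assumptions mirroring the NETDEV_A_DMABUF_* ordering in <linux/netdev.h> (IFINDEX = 1, QUEUES = 2, FD = 3), and put_u32_attr() and build_bind_tx_attrs() are hypothetical helpers; real code would include the uapi header and use a netlink library such as libmnl or YNL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed IDs mirroring NETDEV_A_DMABUF_* in <linux/netdev.h>. */
enum {
	ATTR_DMABUF_IFINDEX = 1,
	ATTR_DMABUF_QUEUES = 2, /* bind-rx only; bind-tx does not send it */
	ATTR_DMABUF_FD = 3,
};

/* Netlink attribute framing: u16 length (header included), u16 type,
 * then the payload, padded to a 4-byte boundary. */
#define ATTR_HDRLEN 4u
#define ATTR_ALIGN(len) (((len) + 3u) & ~3u)

/* Hypothetical helper: append one u32 attribute at buf + off and return
 * the offset just past it (host byte order, as netlink uses). */
static size_t put_u32_attr(uint8_t *buf, size_t off, uint16_t type,
			   uint32_t val)
{
	uint16_t len = (uint16_t)(ATTR_HDRLEN + sizeof(val));

	assert(buf != NULL);
	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + 2, &type, sizeof(type));
	memcpy(buf + off + ATTR_HDRLEN, &val, sizeof(val));
	return off + ATTR_ALIGN(len);
}

/* Build the attributes of a bind-tx request: ifindex and dmabuf fd only.
 * Returns the payload size, or 0 for ifindex 0, which the
 * NLA_POLICY_MIN(NLA_U32, 1) policy in the patch rejects anyway. */
static size_t build_bind_tx_attrs(uint8_t *buf, uint32_t ifindex,
				  uint32_t dmabuf_fd)
{
	size_t off = 0;

	if (ifindex == 0)
		return 0;
	off = put_u32_attr(buf, off, ATTR_DMABUF_IFINDEX, ifindex);
	off = put_u32_attr(buf, off, ATTR_DMABUF_FD, dmabuf_fd);
	return off;
}
```

The early ifindex check only mirrors the kernel policy for clarity; the kernel validates the attributes itself.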
--- Documentation/netlink/specs/netdev.yaml | 12 ++++++++++++ include/uapi/linux/netdev.h | 1 + net/core/netdev-genl-gen.c | 13 +++++++++++++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 6 ++++++ tools/include/uapi/linux/netdev.h | 1 + 6 files changed, 34 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlin= k/specs/netdev.yaml index f5e0750ab71db..c0ef6d0d77865 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -743,6 +743,18 @@ operations: - defer-hard-irqs - gro-flush-timeout - irq-suspend-timeout + - + name: bind-tx + doc: Bind dmabuf to netdev for TX + attribute-set: dmabuf + do: + request: + attributes: + - ifindex + - fd + reply: + attributes: + - id =20 kernel-family: headers: [ "net/netdev_netlink.h"] diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 7600bf62dbdf0..7eb9571786b83 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -219,6 +219,7 @@ enum { NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, =20 __NETDEV_CMD_MAX, NETDEV_CMD_MAX =3D (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index 739f7b6506a6a..4fc44587f4936 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -99,6 +99,12 @@ static const struct nla_policy netdev_napi_set_nl_policy= [NETDEV_A_NAPI_IRQ_SUSPE [NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] =3D { .type =3D NLA_UINT, }, }; =20 +/* NETDEV_CMD_BIND_TX - do */ +static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD= + 1] =3D { + [NETDEV_A_DMABUF_IFINDEX] =3D NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_DMABUF_FD] =3D { .type =3D NLA_U32, }, +}; + /* Ops table for netdev */ static const struct genl_split_ops netdev_nl_ops[] =3D { { @@ -190,6 +196,13 @@ static const struct genl_split_ops netdev_nl_ops[] =3D= { .maxattr =3D NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT, .flags =3D 
GENL_ADMIN_PERM | GENL_CMD_CAP_DO, }, + { + .cmd =3D NETDEV_CMD_BIND_TX, + .doit =3D netdev_nl_bind_tx_doit, + .policy =3D netdev_bind_tx_nl_policy, + .maxattr =3D NETDEV_A_DMABUF_FD, + .flags =3D GENL_CMD_CAP_DO, + }, }; =20 static const struct genl_multicast_group netdev_nl_mcgrps[] =3D { diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index 17d39fd64c948..cf3fad74511f5 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -34,6 +34,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info); +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info); =20 enum { NETDEV_NLGRP_MGMT, diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 2c104947d224f..410df19d98d78 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -934,6 +934,12 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct= genl_info *info) return err; } =20 +/* stub */ +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv) { INIT_LIST_HEAD(&priv->bindings); diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/n= etdev.h index 7600bf62dbdf0..7eb9571786b83 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -219,6 +219,7 @@ enum { NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, =20 __NETDEV_CMD_MAX, NETDEV_CMD_MAX =3D (__NETDEV_CMD_MAX - 1) --=20 2.49.0.805.g082f7c87e0-goog From nobody Sun Feb 8 04:17:46 2026 Received: from mail-pf1-f201.google.com (mail-pf1-f201.google.com [209.85.210.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 
E3E401DF975 for ; Thu, 24 Apr 2025 04:03:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745467395; cv=none; b=UfhZbJgaxpZRHlXirvSD7Y0HrCkH/xnAWtMWic3G3fHtR6e6yhGUvkNGCEohnBILHPh4OJyPdfZueKSGaQfANTxk1KAVJ//l13V79+3xIcCP0jqXeQTQfwvWDJ14lWQ8Ik7QvuMB2upVRZIEBlIbti4XUAgvCwlGxcQXPbWMmVA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745467395; c=relaxed/simple; bh=Tf+nD2dtxPpDOtaGhOKVJ3u3Kks6MYWDiA+hTyQAYWo=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=eT64G38zFkoHy78+srtE5QvFhDkLqY+/oc14UMIFA0vZdGABYTkzpePUd72fNIhYrlJNyYrWuevIOL9pVHGvHRlbDU2xOqNkuboHzXCxWi8cLIr9IHqD1nSC/BX9Gcn3PFHKFej7nW1r5rnfMcZtwiw3Z/2HdG2UBc0nqjRCC04= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=BciRnmIt; arc=none smtp.client-ip=209.85.210.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="BciRnmIt" Received: by mail-pf1-f201.google.com with SMTP id d2e1a72fcca58-736cb72efd5so363238b3a.3 for ; Wed, 23 Apr 2025 21:03:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1745467391; x=1746072191; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=0JdfSry6TVRMjgTE4zxvHX0PgKqjTTsx1czXWl4vKPo=; b=BciRnmItQNb8Z1ZMaeAoIm4ZqN9RdtAO+4SpRxl6Xt+eMI6uousDQ/3Fx8c6CRJ8kN 
f8wuqoScWjUnEKoj4RKCXVX/nPagJUGFGl2RQdkVSYJkofPD6EQKOovnK+nwp/pBwoAk mlYFA97wDKYruraDBXZCfd1NQnqcwAMmOyGo8y6lLuFxikR5OVJTXy/K37pzum2B/4zW be4f0WvYR2nqtk+O4B7RxAD7UyeCjF2rbNz2zBILCbdKvtf+mtIinJDRm6H4w+ACur0G Lks/h9mQjgNh1+EjCeqfdpCw9F9bYYWuf7j8MzaRDtwHdJh5I3rAJzav32binKry64kW vHhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745467391; x=1746072191; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=0JdfSry6TVRMjgTE4zxvHX0PgKqjTTsx1czXWl4vKPo=; b=nTB/wZZrTtAE03rAbOkih1UupX+VuISdcH0GeOQ68Foxl9qqp8dU1D554fktmO21dy Ow+btK4EREVfvM4K9g0zHS+521HXm3xGib3s3Tu1+vm+P+hUmsdbo12MfUOlTQZfYlOX 4At6Eu5pAnByd24waqXpduBSQCOvDSruSfgVEsSY5T7pN7iLaXQl4xheMrd3PBzwznyK WPdOUt2DcFOjdyEiNSUli9xDJIT14sZLy0s9oVkY11Y9mAgzC+UHtat+Y7XrqQMxctAF eSYs5aLaDuZOduyfwI//wqJdnuhumux8yPGZQVO2axEffWqfNcI+B1qImzs6Nk5h91gW cqhA== X-Forwarded-Encrypted: i=1; AJvYcCVnUO1wNxoVMGsaJ4/Mvr6z3x2RIT6xgWMsmv5uT/ClPrR2iWpMC4uF3wA7bBsW4RsiIGg9jscat42KcmQ=@vger.kernel.org X-Gm-Message-State: AOJu0YzTLXEb8ziOp5176NVvKwdRyawEKaTzBvxmwRG9T+lkvrOKEqUu 6f8L+EPctpRjTby+i4u2GOhtjkqyj2jXpFAbZhHkuEKFwNFUYcg5MItrNH/Gm+nViTn2ml3RcDi RGgmY9+DNV7kc9IehyKT7Vg== X-Google-Smtp-Source: AGHT+IHkuFRiJlj+3jdQRm7baCR8VHt7QvmT5PDBLQnRStA0Oxm7ZZdr+CeOebrKLilfgJXQ9kb+tJZd636r9lNk/Q== X-Received: from pfib24.prod.google.com ([2002:aa7:8118:0:b0:736:adf0:d154]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:2d1c:b0:736:4110:5579 with SMTP id d2e1a72fcca58-73e244ba8bbmr1708916b3a.2.1745467390899; Wed, 23 Apr 2025 21:03:10 -0700 (PDT) Date: Thu, 24 Apr 2025 04:02:56 +0000 In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250424040301.2480876-1-almasrymina@google.com> X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog Message-ID: 
<20250424040301.2480876-5-almasrymina@google.com> Subject: [PATCH net-next v11 4/8] net: devmem: Implement TX path From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , Jeroen de Borst , Harshitha Ramamurthy , Kuniyuki Iwashima , Willem de Bruijn , Jens Axboe , Pavel Begunkov , David Ahern , Neal Cardwell , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "Eugenio Pérez" , sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Samiullah Khawaja , Kaiyuan Zhang Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Augment the dmabuf binding so that it can also handle TX. In addition to everything the RX binding sets up, we create the tx_vec needed by the TX path. Provide an API for sendmsg to send from dmabufs bound to this device: - An SCM_DEVMEM_DMABUF cmsg carries the __u32 id of the dma-buf binding to send from (the id returned by the bind-tx netlink command). - MSG_ZEROCOPY together with an SCM_DEVMEM_DMABUF cmsg indicates a send from that dma-buf; the iov_base of each iovec is interpreted as an offset in bytes into the dma-buf rather than as a user pointer. Devmem is uncopyable, so piggyback on the existing MSG_ZEROCOPY implementation, while disabling the cases where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem(), which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special-case skb_frag_dma_map() to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path.
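The following is a minimal userspace sketch of the sendmsg API described above. It assumes the socket already routes through the bound device and that the dma-buf id was taken from the bind-tx netlink reply (NETDEV_A_DMABUF_ID). The sketch only fills in the msghdr. The cmsg uses level SOL_SOCKET and type SCM_DEVMEM_DMABUF with exactly a u32 payload, and iov_base carries a byte offset into the dma-buf. devmem_prepare_tx() and union devmem_cmsg_buf are hypothetical names. The fallback value 79 for SCM_DEVMEM_DMABUF is an assumption based on the asm-generic socket.h numbering, and some architectures use different values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Assumption: asm-generic numbering; alpha, mips, parisc and sparc differ.
 * Prefer the value from the installed uapi headers when it is defined. */
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79
#endif

/* Hypothetical helper type: room for exactly one cmsg carrying a u32,
 * aligned like struct cmsghdr. */
union devmem_cmsg_buf {
	char buf[CMSG_SPACE(sizeof(uint32_t))];
	struct cmsghdr align;
};

/* Hypothetical helper: prepare msg so that sendmsg(fd, msg, MSG_ZEROCOPY)
 * asks the kernel to transmit len bytes starting at byte offset dmabuf_off
 * of the dma-buf whose binding id is dmabuf_id. */
static void devmem_prepare_tx(struct msghdr *msg, struct iovec *iov,
			      union devmem_cmsg_buf *ctrl, uint32_t dmabuf_id,
			      size_t dmabuf_off, size_t len)
{
	struct cmsghdr *cmsg;

	/* Not a user pointer: the devmem TX path reads iov_base as an
	 * offset in bytes into the dma-buf. */
	iov->iov_base = (void *)(uintptr_t)dmabuf_off;
	iov->iov_len = len;

	memset(msg, 0, sizeof(*msg));
	memset(ctrl, 0, sizeof(*ctrl));
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = ctrl->buf;
	msg->msg_controllen = sizeof(ctrl->buf);

	cmsg = CMSG_FIRSTHDR(msg);
	assert(cmsg != NULL); /* the control buffer holds one cmsghdr */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	/* __sock_cmsg_send() rejects any other length with -EINVAL. */
	cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
	memcpy(CMSG_DATA(cmsg), &dmabuf_id, sizeof(dmabuf_id));
}
```

The caller would then enable SO_ZEROCOPY on the socket and call sendmsg(fd, &msg, MSG_ZEROCOPY). Per tcp_sendmsg_locked() and net_devmem_get_binding() in this patch, the call fails as follows:

- A dmabuf id without both MSG_ZEROCOPY and SOCK_ZEROCOPY fails with -EINVAL.
- An unknown id or an RX-only binding also fails with -EINVAL.
- A socket whose route does not use the bound net_device fails with -ENODEV.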
In that case, the netmems are released with put_netmem() from a context where we cannot unmap the dmabuf. Resolve this by having net_devmem_dmabuf_binding_put() defer __net_devmem_dmabuf_binding_free() to a work item with schedule_work(). Based on work by Stanislav Fomichev . A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Acked-by: Stanislav Fomichev --- v11: - Address some nits from Michael. Whitespace adjustment and remove local var. (Michael) v10: - Make sure to release netdev lock if !netif_device_present (Jakub) - handle niov->pp == NULL in io_uring code (Pavel) v9: - Use priv->bindings instead of sock_bindings_list. This was missed in v8 during the rebase as net-next had been updated to use priv->bindings (thanks Stan!) v8: - move adding the binding to the net_devmem_dmabuf_bindings after it's been fully initialized to eliminate potential for races with send path (Stan). - Use netdev_get_by_index_lock instead of taking rtnl_lock v6: - Retain behavior that MSG_FASTOPEN succeeds even if cmsg is invalid (Paolo). - Rework the freeing of tx_vec slightly to improve readability. Now it has its own err label (Paolo). - Squash making unbinding scheduled work (Paolo). - Add comment to clarify that net_iovs stuck in the transmit path hold a ref on the underlying dmabuf binding (David). - Fix the comment on how binding refcounting works on RX (the comment was not matching the existing code behavior). v5: - Return -EFAULT from zerocopy_fill_skb_from_devmem (Stan) - don't null check before kvfree (Stan). v4: - Remove dmabuf_tx_cmsg definition and just use __u32 for the dma-buf id (Willem). - Check that iov_iter_type() is ITER_IOVEC in zerocopy_fill_skb_from_iter() (Pavel). - Fix binding->tx_vec not being freed on error paths (Paolo). - Make devmem patch mutually exclusive with msg->ubuf_info path (Pavel).
- Check that MSG_ZEROCOPY and SOCK_ZEROCOPY are provided when sockc.dmabuf_id is provided. - Don't mm_account_pinned_pages() on devmem TX (Pavel). v3: - Use kvmalloc_array instead of kcalloc (Stan). - Fix unreachable code warning (Simon). v2: - Remove dmabuf_offset from the dmabuf cmsg. - Update zerocopy_fill_skb_from_devmem to interpret the iov_base/iter_iov_addr as the offset into the dmabuf to send from (Stan). - Remove the confusing binding->tx_iter which is not needed if we interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan). - Remove check for binding->sgt and binding->sgt->nents in dmabuf binding. - Simplify the calculation of binding->tx_vec. - Check in net_devmem_get_binding that the binding we're returning has ifindex matching the sending socket (Willem). --- include/linux/skbuff.h | 17 +++- include/net/sock.h | 1 + io_uring/zcrx.c | 2 +- net/core/datagram.c | 48 +++++++++- net/core/devmem.c | 120 ++++++++++++++++++++---- net/core/devmem.h | 61 +++++++++--- net/core/netdev-genl.c | 71 +++++++++++++- net/core/skbuff.c | 18 ++-- net/core/sock.c | 6 ++ net/ipv4/ip_output.c | 3 +- net/ipv4/tcp.c | 50 ++++++++-- net/ipv6/ip6_output.c | 3 +- net/vmw_vsock/virtio_transport_common.c | 5 +- 13 files changed, 343 insertions(+), 62 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index beb084ee4f4d2..5d62e8e77546c 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1710,13 +1710,16 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset) extern const struct ubuf_info_ops msg_zerocopy_ubuf_ops; struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, - struct ubuf_info *uarg); + struct ubuf_info *uarg, bool devmem); void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref); +struct net_devmem_dmabuf_binding; + int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length); + size_t
length, + struct net_devmem_dmabuf_binding *binding); int zerocopy_fill_skb_from_iter(struct sk_buff *skb, struct iov_iter *from, size_t length); @@ -1724,12 +1727,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len) { - return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len); + return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len, + NULL); } int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, struct msghdr *msg, int len, - struct ubuf_info *uarg); + struct ubuf_info *uarg, + struct net_devmem_dmabuf_binding *binding); /* Internal */ #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) @@ -3700,6 +3705,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev, size_t offset, size_t size, enum dma_data_direction dir) { + if (skb_frag_is_net_iov(frag)) { + return netmem_to_net_iov(frag->netmem)->dma_addr + offset + + frag->offset; + } return dma_map_page(dev, skb_frag_page(frag), skb_frag_off(frag) + offset, size, dir); } diff --git a/include/net/sock.h b/include/net/sock.h index e223102337c77..f266757a37c87 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1851,6 +1851,7 @@ struct sockcm_cookie { u32 tsflags; u32 ts_opt_id; u32 priority; + u32 dmabuf_id; }; static inline void sockcm_init(struct sockcm_cookie *sockc, diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index 17a54e74ed5d5..f392320e7dede 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -791,7 +791,7 @@ static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq, return io_zcrx_copy_frag(req, ifq, frag, off, len); niov = netmem_to_net_iov(frag->netmem); - if (niov->pp->mp_ops != &io_uring_pp_zc_ops || + if (!niov->pp || niov->pp->mp_ops != &io_uring_pp_zc_ops || niov->pp->mp_priv != ifq) return -EFAULT; diff --git a/net/core/datagram.c b/net/core/datagram.c index
f0634f0cb8346..042a7dceb85ad 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -63,6 +63,8 @@ #include #include +#include "devmem.h" + /* * Is a socket 'connection oriented' ? */ @@ -691,9 +693,49 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, return 0; } +static int +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from, + int length, + struct net_devmem_dmabuf_binding *binding) +{ + int i = skb_shinfo(skb)->nr_frags; + size_t virt_addr, size, off; + struct net_iov *niov; + + /* Devmem filling works by taking an IOVEC from the user where the + * iov_addrs are interpreted as an offset in bytes into the dma-buf to + * send from. We do not support other iter types. + */ + if (iov_iter_type(from) != ITER_IOVEC) + return -EFAULT; + + while (length && iov_iter_count(from)) { + if (i == MAX_SKB_FRAGS) + return -EMSGSIZE; + + virt_addr = (size_t)iter_iov_addr(from); + niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size); + if (!niov) + return -EFAULT; + + size = min_t(size_t, size, length); + size = min_t(size_t, size, iter_iov_len(from)); + + get_netmem(net_iov_to_netmem(niov)); + skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off, + size, PAGE_SIZE); + iov_iter_advance(from, size); + length -= size; + i++; + } + + return 0; +} + int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length) + size_t length, + struct net_devmem_dmabuf_binding *binding) { unsigned long orig_size = skb->truesize; unsigned long truesize; @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, if (msg && msg->msg_ubuf && msg->sg_from_iter) ret = msg->sg_from_iter(skb, from, length); + else if (unlikely(binding)) + ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding); else ret = zerocopy_fill_skb_from_iter(skb, from, length); @@ -734,7 +778,7 @@ int zerocopy_sg_from_iter(struct sk_buff
*skb, struct iov_iter *from) if (skb_copy_datagram_from_iter(skb, 0, from, copy)) return -EFAULT; - return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U); + return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL); } EXPORT_SYMBOL(zerocopy_sg_from_iter); diff --git a/net/core/devmem.c b/net/core/devmem.c index dca2ff7cf6923..d2b47f647e1a1 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "devmem.h" @@ -52,8 +53,10 @@ static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov) ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT); } -void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) +void __net_devmem_dmabuf_binding_free(struct work_struct *wq) { + struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w); + size_t size, avail; gen_pool_for_each_chunk(binding->chunk_pool, @@ -71,8 +74,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) dma_buf_detach(binding->dmabuf, binding->attachment); dma_buf_put(binding->dmabuf); xa_destroy(&binding->bound_rxqs); + kvfree(binding->tx_vec); kfree(binding); } +EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free); struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) @@ -117,6 +122,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) unsigned long xa_idx; unsigned int rxq_idx; + xa_erase(&net_devmem_dmabuf_bindings, binding->id); + + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the + * erase.
+ */ + synchronize_net(); + if (binding->list.next) list_del(&binding->list); @@ -131,8 +143,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); } - xa_erase(&net_devmem_dmabuf_bindings, binding->id); - net_devmem_dmabuf_binding_put(binding); } @@ -166,8 +176,9 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, } struct net_devmem_dmabuf_binding * -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, - struct netlink_ext_ack *extack) +net_devmem_bind_dmabuf(struct net_device *dev, + enum dma_data_direction direction, + unsigned int dmabuf_fd, struct netlink_ext_ack *extack) { struct net_devmem_dmabuf_binding *binding; static u32 id_alloc_next; @@ -189,13 +200,6 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, } binding->dev = dev; - - err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id, - binding, xa_limit_32b, &id_alloc_next, - GFP_KERNEL); - if (err < 0) - goto err_free_binding; - xa_init_flags(&binding->bound_rxqs, XA_FLAGS_ALLOC); refcount_set(&binding->ref, 1); @@ -206,26 +210,36 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, if (IS_ERR(binding->attachment)) { err = PTR_ERR(binding->attachment); NL_SET_ERR_MSG(extack, "Failed to bind dmabuf to device"); - goto err_free_id; + goto err_free_binding; } binding->sgt = dma_buf_map_attachment_unlocked(binding->attachment, - DMA_FROM_DEVICE); + direction); if (IS_ERR(binding->sgt)) { err = PTR_ERR(binding->sgt); NL_SET_ERR_MSG(extack, "Failed to map dmabuf attachment"); goto err_detach; } + if (direction == DMA_TO_DEVICE) { + binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, + sizeof(struct net_iov *), + GFP_KERNEL); + if (!binding->tx_vec) { + err = -ENOMEM; + goto err_unmap; + } + } + /* For simplicity we expect to make PAGE_SIZE allocations, but the * binding can
be much more flexible than that. We may be able to * allocate MTU sized chunks here. Leave that for future work... */ - binding->chunk_pool = - gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev)); + binding->chunk_pool = gen_pool_create(PAGE_SHIFT, + dev_to_node(&dev->dev)); if (!binding->chunk_pool) { err = -ENOMEM; - goto err_unmap; + goto err_tx_vec; } virtual = 0; @@ -270,24 +284,34 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, niov->owner = &owner->area; page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), net_devmem_get_dma_addr(niov)); + if (direction == DMA_TO_DEVICE) + binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; } virtual += len; } + err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id, + binding, xa_limit_32b, &id_alloc_next, + GFP_KERNEL); + if (err < 0) + goto err_free_id; + return binding; +err_free_id: + xa_erase(&net_devmem_dmabuf_bindings, binding->id); err_free_chunks: gen_pool_for_each_chunk(binding->chunk_pool, net_devmem_dmabuf_free_chunk_owner, NULL); gen_pool_destroy(binding->chunk_pool); +err_tx_vec: + kvfree(binding->tx_vec); err_unmap: dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt, DMA_FROM_DEVICE); err_detach: dma_buf_detach(dmabuf, binding->attachment); -err_free_id: - xa_erase(&net_devmem_dmabuf_bindings, binding->id); err_free_binding: kfree(binding); err_put_dmabuf: @@ -295,6 +319,21 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, return ERR_PTR(err); } +struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id) +{ + struct net_devmem_dmabuf_binding *binding; + + rcu_read_lock(); + binding = xa_load(&net_devmem_dmabuf_bindings, id); + if (binding) { + if (!net_devmem_dmabuf_binding_get(binding)) + binding = NULL; + } + rcu_read_unlock(); + + return binding; +} + void net_devmem_get_net_iov(struct net_iov *niov) {
net_devmem_dmabuf_binding_get(net_devmem_iov_binding(niov)); @@ -305,6 +344,49 @@ void net_devmem_put_net_iov(struct net_iov *niov) net_devmem_dmabuf_binding_put(net_devmem_iov_binding(niov)); } +struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk, + unsigned int dmabuf_id) +{ + struct net_devmem_dmabuf_binding *binding; + struct dst_entry *dst = __sk_dst_get(sk); + int err = 0; + + binding = net_devmem_lookup_dmabuf(dmabuf_id); + if (!binding || !binding->tx_vec) { + err = -EINVAL; + goto out_err; + } + + /* The dma-addrs in this binding are only reachable to the corresponding + * net_device. + */ + if (!dst || !dst->dev || dst->dev->ifindex != binding->dev->ifindex) { + err = -ENODEV; + goto out_err; + } + + return binding; + +out_err: + if (binding) + net_devmem_dmabuf_binding_put(binding); + + return ERR_PTR(err); +} + +struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, + size_t virt_addr, size_t *off, size_t *size) +{ + if (virt_addr >= binding->dmabuf->size) + return NULL; + + *off = virt_addr % PAGE_SIZE; + *size = PAGE_SIZE - *off; + + return binding->tx_vec[virt_addr / PAGE_SIZE]; +} + /*** "Dmabuf devmem memory provider" ***/ int mp_dmabuf_devmem_init(struct page_pool *pool) diff --git a/net/core/devmem.h b/net/core/devmem.h index 946f2e0157467..67168aae5e5b3 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -23,8 +23,9 @@ struct net_devmem_dmabuf_binding { /* The user holds a ref (via the netlink API) for as long as they want * the binding to remain alive. Each page pool using this binding holds - * a ref to keep the binding alive. Each allocated net_iov holds a - * ref. + * a ref to keep the binding alive. The page_pool does not release the + * ref until all the net_iovs allocated from this binding are released + * back to the page_pool.
* * The binding undos itself and unmaps the underlying dmabuf once all * those refs are dropped and the binding is no longer desired or in @@ -32,7 +33,10 @@ struct net_devmem_dmabuf_binding { * * net_devmem_get_net_iov() on dmabuf net_iovs will increment this * reference, making sure that the binding remains alive until all the - * net_iovs are no longer used. + * net_iovs are no longer used. net_iovs allocated from this binding + * that are stuck in the TX path for any reason (such as awaiting + * retransmits) hold a reference to the binding until the skb holding + * them is freed. */ refcount_t ref; @@ -48,6 +52,14 @@ struct net_devmem_dmabuf_binding { * active. */ u32 id; + + /* Array of net_iov pointers for this binding, sorted by virtual + * address. This array is convenient to map the virtual addresses to + * net_iovs in the TX path. + */ + struct net_iov **tx_vec; + + struct work_struct unbind_w; }; #if defined(CONFIG_NET_DEVMEM) @@ -64,14 +76,17 @@ struct dmabuf_genpool_chunk_owner { dma_addr_t base_dma_addr; }; -void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding); +void __net_devmem_dmabuf_binding_free(struct work_struct *wq); struct net_devmem_dmabuf_binding * -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, - struct netlink_ext_ack *extack); +net_devmem_bind_dmabuf(struct net_device *dev, + enum dma_data_direction direction, + unsigned int dmabuf_fd, struct netlink_ext_ack *extack); +struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id); void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding); int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, struct net_devmem_dmabuf_binding *binding, struct netlink_ext_ack *extack); +void net_devmem_bind_tx_release(struct sock *sk); static inline struct dmabuf_genpool_chunk_owner * net_devmem_iov_to_chunk_owner(const struct net_iov *niov) @@ -100,10 +115,10 @@ static inline unsigned long
net_iov_virtual_addr(const struct net_iov *niov) ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT); } -static inline void +static inline bool net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) { - refcount_inc(&binding->ref); + return refcount_inc_not_zero(&binding->ref); } static inline void @@ -112,7 +127,8 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) if (!refcount_dec_and_test(&binding->ref)) return; - __net_devmem_dmabuf_binding_free(binding); + INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free); + schedule_work(&binding->unbind_w); } void net_devmem_get_net_iov(struct net_iov *niov); @@ -123,6 +139,11 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); void net_devmem_free_dmabuf(struct net_iov *ppiov); bool net_is_devmem_iov(struct net_iov *niov); +struct net_devmem_dmabuf_binding * +net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id); +struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, + size_t *off, size_t *size); #else struct net_devmem_dmabuf_binding; @@ -140,18 +161,23 @@ static inline void net_devmem_put_net_iov(struct net_iov *niov) { } -static inline void -__net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) +static inline void __net_devmem_dmabuf_binding_free(struct work_struct *wq) { } static inline struct net_devmem_dmabuf_binding * net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, + enum dma_data_direction direction, struct netlink_ext_ack *extack) { return ERR_PTR(-EOPNOTSUPP); } +static inline struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id) +{ + return NULL; +} + static inline void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) { @@ -190,6 +216,19 @@ static inline bool net_is_devmem_iov(struct net_iov *niov) { return false; } + +static inline struct net_devmem_dmabuf_binding
* +net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static inline struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, + size_t *off, size_t *size) +{ + return NULL; +} #endif #endif /* _NET_DEVMEM_H */ diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 410df19d98d78..292606df834de 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -873,7 +873,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) goto err_unlock; } - binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, info->extack); + binding = net_devmem_bind_dmabuf(netdev, DMA_FROM_DEVICE, dmabuf_fd, + info->extack); if (IS_ERR(binding)) { err = PTR_ERR(binding); goto err_unlock; @@ -934,10 +935,74 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) return err; } -/* stub */ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) { - return 0; + struct net_devmem_dmabuf_binding *binding; + struct netdev_nl_sock *priv; + struct net_device *netdev; + u32 ifindex, dmabuf_fd; + struct sk_buff *rsp; + int err = 0; + void *hdr; + + if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) || + GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD)) + return -EINVAL; + + ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); + dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); + + priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); + if (IS_ERR(priv)) + return PTR_ERR(priv); + + rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!rsp) + return -ENOMEM; + + hdr = genlmsg_iput(rsp, info); + if (!hdr) { + err = -EMSGSIZE; + goto err_genlmsg_free; + } + + mutex_lock(&priv->lock); + + netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex); + if (!netdev) { + err = -ENODEV; + goto err_unlock_sock; + } + + if (!netif_device_present(netdev)) { + err = -ENODEV; + goto
err_unlock_netdev; + } + + binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd, + info->extack); + if (IS_ERR(binding)) { + err = PTR_ERR(binding); + goto err_unlock_netdev; + } + + list_add(&binding->list, &priv->bindings); + + nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id); + genlmsg_end(rsp, hdr); + + netdev_unlock(netdev); + mutex_unlock(&priv->lock); + + return genlmsg_reply(rsp, info); + +err_unlock_netdev: + netdev_unlock(netdev); +err_unlock_sock: + mutex_unlock(&priv->lock); +err_genlmsg_free: + nlmsg_free(rsp); + return err; } void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 00c22bce98e44..4159107f1666c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1655,7 +1655,8 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp) } EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages); -static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size) +static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size, + bool devmem) { struct ubuf_info_msgzc *uarg; struct sk_buff *skb; @@ -1670,7 +1671,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size) uarg = (void *)skb->cb; uarg->mmp.user = NULL; - if (mm_account_pinned_pages(&uarg->mmp, size)) { + if (likely(!devmem) && mm_account_pinned_pages(&uarg->mmp, size)) { kfree_skb(skb); return NULL; } @@ -1693,7 +1694,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg) } struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, - struct ubuf_info *uarg) + struct ubuf_info *uarg, bool devmem) { if (uarg) { struct ubuf_info_msgzc *uarg_zc; @@ -1723,7 +1724,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, next = (u32)atomic_read(&sk->sk_zckey); if ((u32)(uarg_zc->id + uarg_zc->len) == next) { - if (mm_account_pinned_pages(&uarg_zc->mmp, size)) + if (likely(!devmem) && +
mm_account_pinned_pages(&uarg_zc->mmp, size)) return NULL; uarg_zc->len++; uarg_zc->bytelen = bytelen; @@ -1738,7 +1740,7 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, } new_alloc: - return msg_zerocopy_alloc(sk, size); + return msg_zerocopy_alloc(sk, size, devmem); } EXPORT_SYMBOL_GPL(msg_zerocopy_realloc); @@ -1842,7 +1844,8 @@ EXPORT_SYMBOL_GPL(msg_zerocopy_ubuf_ops); int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, struct msghdr *msg, int len, - struct ubuf_info *uarg) + struct ubuf_info *uarg, + struct net_devmem_dmabuf_binding *binding) { int err, orig_len = skb->len; @@ -1861,7 +1864,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, return -EEXIST; } - err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len); + err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len, + binding); if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) { struct sock *save_sk = skb->sk; diff --git a/net/core/sock.c b/net/core/sock.c index b64df2463300b..9dd2989040357 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3017,6 +3017,12 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, if (!sk_set_prio_allowed(sk, *(u32 *)CMSG_DATA(cmsg))) return -EPERM; sockc->priority = *(u32 *)CMSG_DATA(cmsg); + break; + case SCM_DEVMEM_DMABUF: + if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32))) + return -EINVAL; + sockc->dmabuf_id = *(u32 *)CMSG_DATA(cmsg); + break; default: return -EINVAL; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 6e18d7ec50624..a2705d454fd64 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1014,7 +1014,8 @@ static int __ip_append_data(struct sock *sk, uarg = msg->msg_ubuf; } } else if (sock_flag(sk, SOCK_ZEROCOPY)) { - uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb)); + uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), + false); if (!uarg) return -ENOBUFS;
extra_uref = !skb_zcopy(skb); /* only ref on new uarg */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e0e96f8fd47cb..9f7cd33444968 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1059,6 +1059,7 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied, int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) { + struct net_devmem_dmabuf_binding *binding = NULL; struct tcp_sock *tp = tcp_sk(sk); struct ubuf_info *uarg = NULL; struct sk_buff *skb; @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) int flags, err, copied = 0; int mss_now = 0, size_goal, copied_syn = 0; int process_backlog = 0; + bool sockc_valid = true; int zc = 0; long timeo; flags = msg->msg_flags; + sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags), + .dmabuf_id = 0 }; + if (msg->msg_controllen) { + err = sock_cmsg_send(sk, msg, &sockc); + if (unlikely(err)) + /* Don't return error until MSG_FASTOPEN has been + * processed; that may succeed even if the cmsg is + * invalid.
+ */ + sockc_valid = false; + } + if ((flags & MSG_ZEROCOPY) && size) { if (msg->msg_ubuf) { uarg = msg->msg_ubuf; @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) zc = MSG_ZEROCOPY; } else if (sock_flag(sk, SOCK_ZEROCOPY)) { skb = tcp_write_queue_tail(sk); - uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb)); + uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), + sockc_valid && !!sockc.dmabuf_id); if (!uarg) { err = -ENOBUFS; goto out_err; @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) zc = MSG_ZEROCOPY; else uarg_to_msgzc(uarg)->zerocopy = 0; + + if (sockc_valid && sockc.dmabuf_id) { + binding = net_devmem_get_binding(sk, sockc.dmabuf_id); + if (IS_ERR(binding)) { + err = PTR_ERR(binding); + binding = NULL; + goto out_err; + } + } } } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) { if (sk->sk_route_caps & NETIF_F_SG) zc = MSG_SPLICE_PAGES; } + if (sockc_valid && sockc.dmabuf_id && + (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) { + err = -EINVAL; + goto out_err; + } + if (unlikely(flags & MSG_FASTOPEN || inet_test_bit(DEFER_CONNECT, sk)) && !tp->repair) { @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) /* 'common' sending to sendq */ } - sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags) }; - if (msg->msg_controllen) { - err = sock_cmsg_send(sk, msg, &sockc); - if (unlikely(err)) { - err = -EINVAL; - goto out_err; - } - } + if (!sockc_valid) + goto out_err; /* This should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); @@ -1258,7 +1282,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) goto wait_for_space; } - err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg); + err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg, + binding); if (err == -EMSGSIZE || err
== -EEXIST) { tcp_mark_push(tp, skb); goto new_segment; @@ -1339,6 +1364,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) /* msg->msg_ubuf is pinned by the caller so we don't take extra refs */ if (uarg && !msg->msg_ubuf) net_zcopy_put(uarg); + if (binding) + net_devmem_dmabuf_binding_put(binding); return copied + copied_syn; do_error: @@ -1356,6 +1383,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) sk->sk_write_space(sk); tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED); } + if (binding) + net_devmem_dmabuf_binding_put(binding); + return err; } EXPORT_SYMBOL_GPL(tcp_sendmsg_locked); diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index ef052ccd93800..7bd29a9ff0db8 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1524,7 +1524,8 @@ static int __ip6_append_data(struct sock *sk, uarg = msg->msg_ubuf; } } else if (sock_flag(sk, SOCK_ZEROCOPY)) { - uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb)); + uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), + false); if (!uarg) return -ENOBUFS; extra_uref = !skb_zcopy(skb); /* only ref on new uarg */ diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 7f7de6d880965..6e7b727c781c8 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -87,7 +87,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk, uarg = msg_zerocopy_realloc(sk_vsock(vsk), iter->count, - NULL); + NULL, false); if (!uarg) return -1; @@ -107,8 +107,7 @@ static int virtio_transport_fill_skb(struct sk_buff *skb, { if (zcopy) return __zerocopy_sg_from_iter(info->msg, NULL, skb, - &info->msg->msg_iter, - len); + &info->msg->msg_iter, len, NULL); return memcpy_from_msg(skb_put(skb, len), info->msg, len); } -- 2.49.0.805.g082f7c87e0-goog From nobody Sun Feb 8 04:17:46 2026 Received: from mail-pj1-f74.google.com
(mail-pj1-f74.google.com [209.85.216.74]) Date: Thu, 24 Apr 2025 04:02:57 +0000 In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com> Precedence:
bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250424040301.2480876-1-almasrymina@google.com> X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog Message-ID: <20250424040301.2480876-6-almasrymina@google.com> Subject: [PATCH net-next v11 5/8] net: add devmem TCP TX documentation From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , Jeroen de Borst , Harshitha Ramamurthy , Kuniyuki Iwashima , Willem de Bruijn , Jens Axboe , Pavel Begunkov , David Ahern , Neal Cardwell , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "Eugenio Pérez" , sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Samiullah Khawaja Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add documentation outlining the usage and details of the devmem TCP TX API. Signed-off-by: Mina Almasry Acked-by: Stanislav Fomichev --- v5: - Address comments from Stan and Bagas v4: - Mention SO_BINDTODEVICE is recommended (me/Pavel).
v2: - Update documentation: iov_base is the dmabuf offset (Stan) --- Documentation/networking/devmem.rst | 150 +++++++++++++++++++++++++++- 1 file changed, 146 insertions(+), 4 deletions(-) diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst index eb678ca454968..a6cd7236bfbd2 100644 --- a/Documentation/networking/devmem.rst +++ b/Documentation/networking/devmem.rst @@ -62,15 +62,15 @@ More Info https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/ -Interface -========= +RX Interface +============ Example ------- -tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up -the RX path of this API. +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of +setting up the RX path of this API. NIC Setup @@ -235,6 +235,148 @@ can be less than the tokens provided by the user in case of: (a) an internal kernel leak bug. (b) the user passed more than 1024 frags. +TX Interface +============ + + +Example +------- + +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of +setting up the TX path of this API. + + +NIC Setup +--------- + +The user must bind a TX dmabuf to a given NIC using the netlink API:: + + struct netdev_bind_tx_req *req = NULL; + struct netdev_bind_tx_rsp *rsp = NULL; + struct ynl_error yerr; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_tx_req_alloc(); + netdev_bind_tx_req_set_ifindex(req, ifindex); + netdev_bind_tx_req_set_fd(req, dmabuf_fd); + + rsp = netdev_bind_tx(*ys, req); + + tx_dmabuf_id = rsp->id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf +that has been bound. + +The user can unbind the dmabuf from the netdevice by closing the netlink socket +that established the binding. We do this so that the binding is automatically
We do this so that the binding is automatica= lly +unbound even if the userspace process crashes. + +Note that any reasonably well-behaved dmabuf from any exporter should work= with +devmem TCP, even if the dmabuf is not actually backed by devmem. An exampl= e of +this is udmabuf, which wraps user memory (non-devmem) in a dmabuf. + +Socket Setup +------------ + +The user application must use MSG_ZEROCOPY flag when sending devmem TCP. D= evmem +cannot be copied by the kernel, so the semantics of the devmem TX are simi= lar +to the semantics of MSG_ZEROCOPY:: + + setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + +It is also recommended that the user binds the TX socket to the same inter= face +the dma-buf has been bound to via SO_BINDTODEVICE:: + + setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname)= + 1); + + +Sending data +------------ + +Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg. + +The user should create a msghdr where, + +* iov_base is set to the offset into the dmabuf to start sending from +* iov_len is set to the number of bytes to be sent from the dmabuf + +The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_= id. + +The example below sends 1024 bytes from offset 100 into the dmabuf, and 20= 48 +from offset 2000 into the dmabuf. 
The dmabuf to send from is tx_dmabuf_id:: + + char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))]; + struct dmabuf_tx_cmsg ddmabuf; + struct msghdr msg = {}; + struct cmsghdr *cmsg; + struct iovec iov[2]; + + iov[0].iov_base = (void*)100; + iov[0].iov_len = 1024; + iov[1].iov_base = (void*)2000; + iov[1].iov_len = 2048; + + msg.msg_iov = iov; + msg.msg_iovlen = 2; + + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_DEVMEM_DMABUF; + cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)); + + ddmabuf.dmabuf_id = tx_dmabuf_id; + + *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf; + + sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + + +Reusing TX dmabufs +------------------ + +Similar to MSG_ZEROCOPY with regular memory, the user should not modify the +contents of the dma-buf while a send operation is in progress. This is because +the kernel does not keep a copy of the dmabuf contents. Instead, the kernel +sends the data directly from the dmabuf, which remains accessible to userspace. + +Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions +using MSG_ERRQUEUE:: + + int64_t tstop = gettimeofday_ms() + waittime_ms; + char control[CMSG_SPACE(100)] = {}; + struct sock_extended_err *serr; + struct msghdr msg = {}; + struct cmsghdr *cm; + __u32 hi, lo; + int ret; + + while (gettimeofday_ms() < tstop) { + if (!do_poll(fd)) continue; + + /* recvmsg() overwrites msg_controllen, so reset it before each call */ + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret < 0) continue; + + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + serr = (void *)CMSG_DATA(cm); + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) continue; + + hi = serr->ee_data; + lo = serr->ee_info; + + fprintf(stdout, "tx complete [%u,%u]\n", lo, hi); + } + } + +Once userspace has received the completion notification covering the +associated sendmsg() call, it can reuse the dmabuf region that call sent from.
+ + Implementation & Caveats ======================== -- 2.49.0.805.g082f7c87e0-goog From nobody Sun Feb 8 04:17:46 2026 Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73])
Date: Thu, 24 Apr 2025 04:02:58 +0000 In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250424040301.2480876-1-almasrymina@google.com> X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog Message-ID: <20250424040301.2480876-7-almasrymina@google.com> Subject: [PATCH net-next v11 6/8] net: enable driver support for netmem TX From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , Jeroen de Borst , Harshitha Ramamurthy , Kuniyuki Iwashima , Willem de Bruijn , Jens Axboe , Pavel Begunkov , David Ahern , Neal Cardwell , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "Eugenio Pérez" , sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Samiullah Khawaja Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Drivers need to make sure not to pass netmem dma-addrs to the dma-mapping API in order to support netmem TX. Add netmem_dma_*() helpers that let drivers handle netmem dma-addrs specially. Document in netmem.rst what drivers need to do to support netmem TX. Signed-off-by: Mina Almasry Acked-by: Stanislav Fomichev --- v8: - use spaces instead of tabs (Paolo) v5: - Fix netmem TX documentation (Stan).
v4: - New patch --- .../networking/net_cachelines/net_device.rst | 1 + Documentation/networking/netdev-features.rst | 5 ++++ Documentation/networking/netmem.rst | 23 +++++++++++++++++-- include/linux/netdevice.h | 2 ++ include/net/netmem.h | 20 ++++++++++++++++ 5 files changed, 49 insertions(+), 2 deletions(-) diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst index ca8605eb82ffc..c69cc89c958e0 100644 --- a/Documentation/networking/net_cachelines/net_device.rst +++ b/Documentation/networking/net_cachelines/net_device.rst @@ -10,6 +10,7 @@ Type Name fastpath_tx_acce =================================== =========================== =================== =================== =================================================================================== unsigned_long:32 priv_flags read_mostly __dev_queue_xmit(tx) unsigned_long:1 lltx read_mostly HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx) +unsigned long:1 netmem_tx:1; read_mostly char name[16] struct netdev_name_node* name_node struct dev_ifalias* ifalias diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst index 5014f7cc1398b..02bd7536fc0ca 100644 --- a/Documentation/networking/netdev-features.rst +++ b/Documentation/networking/netdev-features.rst @@ -188,3 +188,8 @@ Redundancy) frames from one port to another in hardware. This should be set for devices which duplicate outgoing HSR (High-availability Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically frames in hardware.
+ +* netmem-tx + +This should be set for devices which support netmem TX. See +Documentation/networking/netmem.rst. diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst index 7de21ddb54129..b63aded463370 100644 --- a/Documentation/networking/netmem.rst +++ b/Documentation/networking/netmem.rst @@ -19,8 +19,8 @@ Benefits of Netmem : * Simplified Development: Drivers interact with a consistent API, regardless of the underlying memory implementation. -Driver Requirements -=================== +Driver RX Requirements +====================== 1. The driver must support page_pool. @@ -77,3 +77,22 @@ Driver Requirements that purpose, but be mindful that some netmem types might have longer circulation times, such as when userspace holds a reference in zerocopy scenarios. + +Driver TX Requirements +====================== + +1. The driver must not pass the netmem dma_addr to any of the dma-mapping APIs + directly. This is because netmem dma_addrs may come from a source like + dma-buf that is not compatible with the dma-mapping APIs. + + Helpers like netmem_dma_unmap_page_attrs() and netmem_dma_unmap_addr_set() + should be used in lieu of dma_unmap_page[_attrs]() and dma_unmap_addr_set(). + The netmem variants will handle netmem dma_addrs correctly regardless of the + source, delegating to the dma-mapping APIs when appropriate. + + Not all dma-mapping APIs have netmem equivalents at the moment. If your + driver relies on a missing netmem API, feel free to add it and propose it to + netdev@, or reach out to the maintainers and/or almasrymina@google.com for + help adding the netmem API. + +2.
The driver should declare support by setting ``netdev->netmem_tx = true``. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0321fd952f708..a661820a26c44 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1772,6 +1772,7 @@ enum netdev_reg_state { * @lltx: device supports lockless Tx. Deprecated for real HW * drivers. Mainly used by logical interfaces, such as * bonding and tunnels + * @netmem_tx: device supports netmem TX. * * @name: This is the first field of the "visible" part of this structure * (i.e. as seen by users in the "Space.c" file). It is the name @@ -2087,6 +2088,7 @@ struct net_device { struct_group(priv_flags_fast, unsigned long priv_flags:32; unsigned long lltx:1; + unsigned long netmem_tx:1; ); const struct net_device_ops *netdev_ops; const struct header_ops *header_ops; diff --git a/include/net/netmem.h b/include/net/netmem.h index 1b047cfb9e4f7..8a9210e2868d3 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -8,6 +8,7 @@ #ifndef _NET_NETMEM_H #define _NET_NETMEM_H +#include #include #include @@ -276,4 +277,23 @@ static inline unsigned long netmem_get_dma_addr(netmem_ref netmem) void get_netmem(netmem_ref netmem); void put_netmem(netmem_ref netmem); +#define netmem_dma_unmap_addr_set(NETMEM, PTR, ADDR_NAME, VAL) \ + do { \ + if (!netmem_is_net_iov(NETMEM)) \ + dma_unmap_addr_set(PTR, ADDR_NAME, VAL); \ + else \ + dma_unmap_addr_set(PTR, ADDR_NAME, 0); \ + } while (0) + +static inline void netmem_dma_unmap_page_attrs(struct device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction dir, + unsigned long attrs) +{ + if (!addr) + return; + + dma_unmap_page_attrs(dev, addr, size, dir, attrs); +} + #endif /* _NET_NETMEM_H */ -- 2.49.0.805.g082f7c87e0-goog From nobody Sun Feb 8 04:17:46 2026 Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate
requested) Date: Thu, 24 Apr 2025 04:02:59 +0000 In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References:
<20250424040301.2480876-1-almasrymina@google.com> X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog Message-ID: <20250424040301.2480876-8-almasrymina@google.com> Subject: [PATCH net-next v11 7/8] gve: add netmem TX support to GVE DQO-RDA mode From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , Jeroen de Borst , Harshitha Ramamurthy , Kuniyuki Iwashima , Willem de Bruijn , Jens Axboe , Pavel Begunkov , David Ahern , Neal Cardwell , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "Eugenio Pérez" , sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Samiullah Khawaja Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Use netmem_dma_*() helpers in gve_tx_dqo.c DQO-RDA paths to enable netmem TX support in that mode. Declare support for netmem TX in GVE DQO-RDA mode.
Signed-off-by: Mina Almasry Acked-by: Harshitha Ramamurthy --- v11: - Fix whitespace (Harshitha) v10: - Move setting dev->netmem_tx to right after priv is initialized (Harshitha) v4: - New patch --- drivers/net/ethernet/google/gve/gve_main.c | 3 +++ drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +++++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c index 446e4b6fd3f17..e1ffbd561fac6 100644 --- a/drivers/net/ethernet/google/gve/gve_main.c +++ b/drivers/net/ethernet/google/gve/gve_main.c @@ -2659,6 +2659,9 @@ static int gve_probe(struct pci_dev *pdev, const struct pci_device_id *ent) if (err) goto abort_with_wq; + if (!gve_is_gqi(priv) && !gve_is_qpl(priv)) + dev->netmem_tx = true; + err = register_netdev(dev); if (err) goto abort_with_gve_init; diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c index 2eba868d80370..a27f1574a7337 100644 --- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c +++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c @@ -660,7 +660,8 @@ static int gve_tx_add_skb_no_copy_dqo(struct gve_tx_ring *tx, goto err; dma_unmap_len_set(pkt, len[pkt->num_bufs], len); - dma_unmap_addr_set(pkt, dma[pkt->num_bufs], addr); + netmem_dma_unmap_addr_set(skb_frag_netmem(frag), pkt, + dma[pkt->num_bufs], addr); ++pkt->num_bufs; gve_tx_fill_pkt_desc_dqo(tx, desc_idx, skb, len, addr, @@ -1038,8 +1039,9 @@ static void gve_unmap_packet(struct device *dev, dma_unmap_single(dev, dma_unmap_addr(pkt, dma[0]), dma_unmap_len(pkt, len[0]), DMA_TO_DEVICE); for (i = 1; i < pkt->num_bufs; i++) { - dma_unmap_page(dev, dma_unmap_addr(pkt, dma[i]), - dma_unmap_len(pkt, len[i]), DMA_TO_DEVICE); + netmem_dma_unmap_page_attrs(dev, dma_unmap_addr(pkt, dma[i]), + dma_unmap_len(pkt, len[i]), + DMA_TO_DEVICE, 0); } pkt->num_bufs = 0; } -- 2.49.0.805.g082f7c87e0-goog From nobody Sun Feb 8 04:17:46 2026
Received: from mail-pf1-f201.google.com (mail-pf1-f201.google.com [209.85.210.201]) Date: Thu, 24 Apr 2025 04:03:00 +0000 In-Reply-To: <20250424040301.2480876-1-almasrymina@google.com> Precedence:
bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250424040301.2480876-1-almasrymina@google.com> X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog Message-ID: <20250424040301.2480876-9-almasrymina@google.com> Subject: [PATCH net-next v11 8/8] net: check for driver support in netmem TX From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, io-uring@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , Jeroen de Borst , Harshitha Ramamurthy , Kuniyuki Iwashima , Willem de Bruijn , Jens Axboe , Pavel Begunkov , David Ahern , Neal Cardwell , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "=?UTF-8?q?Eugenio=20P=C3=A9rez?=" , sdf@fomichev.me, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Samiullah Khawaja Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" We should not enable netmem TX for drivers that don't declare support. Check for driver netmem TX support during devmem TX binding and fail if the driver does not have the functionality. Check for driver support in validate_xmit_skb as well. Signed-off-by: Mina Almasry Acked-by: Stanislav Fomichev --- v8: - Rebase on latest net-next and resolve conflict. 
- Remove likely (Paolo)

v5: https://lore.kernel.org/netdev/20250227041209.2031104-8-almasrymina@google.com/
- Check that the dmabuf mappings belong to the specific device the TX is
  being sent from (Jakub)

v4:
- New patch

---
 net/core/dev.c         | 34 ++++++++++++++++++++++++++++++++--
 net/core/devmem.h      |  6 ++++++
 net/core/netdev-genl.c |  7 +++++++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d1a8cad0c99c4..66f0c122de80e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3896,12 +3896,42 @@ int skb_csum_hwoffload_help(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(skb_csum_hwoffload_help);
 
+static struct sk_buff *validate_xmit_unreadable_skb(struct sk_buff *skb,
+						    struct net_device *dev)
+{
+	struct skb_shared_info *shinfo;
+	struct net_iov *niov;
+
+	if (likely(skb_frags_readable(skb)))
+		goto out;
+
+	if (!dev->netmem_tx)
+		goto out_free;
+
+	shinfo = skb_shinfo(skb);
+
+	if (shinfo->nr_frags > 0) {
+		niov = netmem_to_net_iov(skb_frag_netmem(&shinfo->frags[0]));
+		if (net_is_devmem_iov(niov) &&
+		    net_devmem_iov_binding(niov)->dev != dev)
+			goto out_free;
+	}
+
+out:
+	return skb;
+
+out_free:
+	kfree_skb(skb);
+	return NULL;
+}
+
 static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev, bool *again)
 {
 	netdev_features_t features;
 
-	if (!skb_frags_readable(skb))
-		goto out_kfree_skb;
+	skb = validate_xmit_unreadable_skb(skb, dev);
+	if (unlikely(!skb))
+		goto out_null;
 
 	features = netif_skb_features(skb);
 	skb = validate_xmit_vlan(skb, features);
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 67168aae5e5b3..919e6ed28fdcd 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -229,6 +229,12 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
 {
 	return NULL;
 }
+
+static inline struct net_devmem_dmabuf_binding *
+net_devmem_iov_binding(const struct net_iov *niov)
+{
+	return NULL;
+}
 #endif
 
 #endif /* _NET_DEVMEM_H */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 292606df834de..84c033574eb16 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -979,6 +979,13 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_netdev;
 	}
 
+	if (!netdev->netmem_tx) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Driver does not support netmem TX");
+		goto err_unlock_netdev;
+	}
+
 	binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd,
 					 info->extack);
 	if (IS_ERR(binding)) {
-- 
2.49.0.805.g082f7c87e0-goog
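To trace the two new checks outside the kernel, here is a minimal userspace sketch. It is not kernel code: the `model_*` types and functions are hypothetical, heavily simplified stand-ins for `struct net_device`, the devmem dma-buf binding and `struct sk_buff`. `model_bind_tx_check()` mirrors the `netmem_tx` test added to `netdev_nl_bind_tx_doit()`, and `model_validate_unreadable()` mirrors `validate_xmit_unreadable_skb()`, except that it only returns NULL where the real function also frees the skb with `kfree_skb()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct net_device. */
struct model_net_device {
	const char *name;
	bool netmem_tx;			/* driver declares netmem TX support */
};

/* Hypothetical stand-in for a devmem dma-buf binding. */
struct model_binding {
	struct model_net_device *dev;	/* device the dma-buf was bound to */
};

/* Hypothetical stand-in for one skb frag. */
struct model_frag {
	bool is_devmem;			/* backed by a devmem net_iov */
	struct model_binding *binding;	/* only meaningful when is_devmem */
};

/* Hypothetical stand-in for struct sk_buff plus its shared info. */
struct model_skb {
	bool frags_readable;		/* false when frags are unreadable netmem */
	int nr_frags;
	struct model_frag frags[4];
};

/* Bind-time check: refuse a devmem TX binding if the driver lacks support. */
static int model_bind_tx_check(const struct model_net_device *dev)
{
	assert(dev != NULL);
	if (!dev->netmem_tx)
		return -EOPNOTSUPP;
	return 0;
}

/*
 * TX-path check: return the skb if it may be sent on dev, or NULL if it
 * must be dropped. (The kernel version also frees the dropped skb.)
 */
static struct model_skb *model_validate_unreadable(struct model_skb *skb,
						   const struct model_net_device *dev)
{
	assert(skb != NULL && dev != NULL);

	/* Ordinary, readable skbs are not affected by the new checks. */
	if (skb->frags_readable)
		return skb;

	/* Unreadable frags require a driver that declared netmem TX. */
	if (!dev->netmem_tx)
		return NULL;

	/* As in the patch, only the first frag's binding is compared. */
	if (skb->nr_frags > 0 && skb->frags[0].is_devmem &&
	    skb->frags[0].binding->dev != dev)
		return NULL;

	return skb;
}
```

For example, with a dma-buf bound to a hypothetical netmem-capable `eth0`, a devmem skb passes on `eth0` but is dropped on a second capable device `eth2` (binding mismatch) and on `eth1`, which does not set `netmem_tx`. Readable skbs pass through unchanged on any device, and binding on `eth1` fails with `-EOPNOTSUPP`.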