From nobody Sun Dec 22 03:11:18 2024
AGHT+IHjSak1jAgwlXPNoXP7dkl7+gaTr1FCoaGK/MjoAkKJUhx1FnZaoq8kHAbuzx5bYAxn+DO+wEw8pti8Saz7mQ== X-Received: from plhc14.prod.google.com ([2002:a17:903:234e:b0:216:21cb:2dfe]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:ebc9:b0:212:fa3:f627 with SMTP id d9443c01a7336-219e6e9f9a2mr71116625ad.16.1734742295746; Fri, 20 Dec 2024 16:51:35 -0800 (PST) Date: Sat, 21 Dec 2024 00:42:32 +0000 In-Reply-To: <20241221004236.2629280-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241221004236.2629280-1-almasrymina@google.com> X-Mailer: git-send-email 2.47.1.613.gc27f4b7a9f-goog Message-ID: <20241221004236.2629280-2-almasrymina@google.com> Subject: [PATCH RFC net-next v1 1/5] net: add devmem TCP TX documentation From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "=?UTF-8?q?Eugenio=20P=C3=A9rez?=" , Stefan Hajnoczi , Stefano Garzarella , Shuah Khan , Kaiyuan Zhang , Pavel Begunkov , Willem de Bruijn , Samiullah Khawaja , Stanislav Fomichev , Joe Damato , dw@davidwei.uk Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add documentation outlining the usage and details of the devmem TCP TX API. Signed-off-by: Mina Almasry --- Documentation/networking/devmem.rst | 140 +++++++++++++++++++++++++++- 1 file changed, 136 insertions(+), 4 deletions(-) diff --git a/Documentation/networking/devmem.rst b/Documentation/networking= /devmem.rst index d95363645331..9be01cd96ee2 100644 --- a/Documentation/networking/devmem.rst +++ b/Documentation/networking/devmem.rst @@ -62,15 +62,15 @@ More Info https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@go= ogle.com/ =20 =20 -Interface -=3D=3D=3D=3D=3D=3D=3D=3D=3D +RX Interface +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 =20 Example ------- =20 -tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setti= ng up -the RX path of this API. +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an examp= le of +setting up the RX path of this API. =20 =20 NIC Setup @@ -235,6 +235,138 @@ can be less than the tokens provided by the user in c= ase of: (a) an internal kernel leak bug. (b) the user passed more than 1024 frags. =20 +TX Interface +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + + +Example +------- + +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an examp= le of +setting up the TX path of this API. + + +NIC Setup +--------- + +The user must bind a TX dmabuf to a given NIC using the netlink API:: + + struct netdev_bind_tx_req *req =3D NULL; + struct netdev_bind_tx_rsp *rsp =3D NULL; + struct ynl_error yerr; + + *ys =3D ynl_sock_create(&ynl_netdev_family, &yerr); + + req =3D netdev_bind_tx_req_alloc(); + netdev_bind_tx_req_set_ifindex(req, ifindex); + netdev_bind_tx_req_set_fd(req, dmabuf_fd); + + rsp =3D netdev_bind_tx(*ys, req); + + tx_dmabuf_id =3D rsp->id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf +that has been bound. + +The user can unbind the dmabuf from the netdevice by closing the netlink s= ocket +that established the binding. 
We do this so that the binding is automatica= lly +unbound even if the userspace process crashes. + +Note that any reasonably well-behaved dmabuf from any exporter should work= with +devmem TCP, even if the dmabuf is not actually backed by devmem. An exampl= e of +this is udmabuf, which wraps user memory (non-devmem) in a dmabuf. + +Socket Setup +------------ + +The user application must use MSG_ZEROCOPY flag when sending devmem TCP. D= evmem +cannot be copied by the kernel, so the semantics of the devmem TX are simi= lar +to the semantics of MSG_ZEROCOPY. + + ret =3D setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + +Sending data +-------------- + +Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg. + +The user should create a msghdr with iov_base set to NULL and iov_len set = to the +number of bytes to be sent from the dmabuf. + +The user passes the dma-buf id via the dmabuf_tx_cmsg.dmabuf_id, and passe= s the +offset into the dmabuf from where to start sending using the +dmabuf_tx_cmsg.dmabuf_offset field:: + + char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))]; + struct dmabuf_tx_cmsg ddmabuf; + struct msghdr msg =3D {}; + struct cmsghdr *cmsg; + uint64_t off =3D 100; + struct iovec iov; + + iov.iov_base =3D NULL; + iov.iov_len =3D line_size; + + msg.msg_iov =3D &iov; + msg.msg_iovlen =3D 1; + + msg.msg_control =3D ctrl_data; + msg.msg_controllen =3D sizeof(ctrl_data); + + cmsg =3D CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level =3D SOL_SOCKET; + cmsg->cmsg_type =3D SCM_DEVMEM_DMABUF; + cmsg->cmsg_len =3D CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)); + + ddmabuf.dmabuf_id =3D tx_dmabuf_id; + ddmabuf.dmabuf_offset =3D off; + + *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) =3D ddmabuf; + + ret =3D sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + +Reusing TX dmabufs +------------------ + +Similar to MSG_ZEROCOPY with regular memory, the user should not modify the +contents of the dma-buf while a send operation is in progress. This is bec= ause +the kernel does not keep a copy of the dmabuf contents. Instead, the kernel +will pin and send data from the buffer available to the userspace. + +Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send complet= ions +using MSG_ERRQUEUE:: + + int64_t tstop =3D gettimeofday_ms() + waittime_ms; + char control[CMSG_SPACE(100)] =3D {}; + struct sock_extended_err *serr; + struct msghdr msg =3D {}; + struct cmsghdr *cm; + int retries =3D 10; + __u32 hi, lo; + + msg.msg_control =3D control; + msg.msg_controllen =3D sizeof(control); + + while (gettimeofday_ms() < tstop) { + if (!do_poll(fd)) continue; + + ret =3D recvmsg(fd, &msg, MSG_ERRQUEUE); + + for (cm =3D CMSG_FIRSTHDR(&msg); cm; cm =3D CMSG_NXTHDR(&m= sg, cm)) { + serr =3D (void *)CMSG_DATA(cm); + + hi =3D serr->ee_data; + lo =3D serr->ee_info; + + fprintf(stdout, "tx complete [%d,%d]\n", lo, hi); + } + } + +After the associated sendmsg has been completed, the dmabuf can be reused = by +the userspace. 
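
Testing without device memory
-----------------------------

Any dmabuf exporter can supply the buffer bound above; udmabuf is the most
convenient one on hosts that have no device memory at all. The sketch below
creates a udmabuf from a sealed memfd using only the generic uAPI in
linux/udmabuf.h; create_test_dmabuf is an illustrative name, not part of the
devmem TCP API, error handling is omitted, and size must be a multiple of the
page size::

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <linux/udmabuf.h>

  /* Wrap `size` bytes of ordinary user memory in a dmabuf fd that can be
   * passed to the bind-tx netlink call shown above.
   */
  static int create_test_dmabuf(size_t size, int *out_memfd)
  {
          struct udmabuf_create create = {};
          int devfd, memfd, dmabuf_fd;

          devfd = open("/dev/udmabuf", O_RDWR);
          memfd = memfd_create("devmem-tx-test", MFD_ALLOW_SEALING);
          ftruncate(memfd, size);

          /* udmabuf refuses memfds that are not sealed against shrinking. */
          fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

          create.memfd  = memfd;
          create.offset = 0;
          create.size   = size;
          dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

          *out_memfd = memfd;
          return dmabuf_fd;
  }

The selftest in the following patch wraps this same pattern in its udmabuf
memory provider and brackets CPU writes to the buffer with DMA_BUF_IOCTL_SYNC
before and after each copy.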
+
+
 Implementation & Caveats
 ========================

-- 
2.47.1.613.gc27f4b7a9f-goog


From nobody Sun Dec 22 03:11:18 2024
Date: Sat, 21 Dec 2024 00:42:33 +0000
In-Reply-To: <20241221004236.2629280-1-almasrymina@google.com>
References: <20241221004236.2629280-1-almasrymina@google.com>
Message-ID: <20241221004236.2629280-3-almasrymina@google.com>
Subject: [PATCH RFC net-next v1 2/5] selftests: ncdevmem: Implement devmem TCP TX
From: Mina Almasry
To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: Mina Almasry, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Simon Horman, Donald Hunter, Jonathan Corbet, Andrew Lunn, David Ahern,
 "Michael S. Tsirkin", Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
 Stefano Garzarella, Shuah Khan, Kaiyuan Zhang, Pavel Begunkov,
 Willem de Bruijn, Samiullah Khawaja, Stanislav Fomichev, Joe Damato, dw@davidwei.uk

Add support for devmem TX in ncdevmem.

This is a combination of the ncdevmem from the devmem TCP series RFCv1
which included the TX path, and work by Stan to include the netlink API
and refactored on top of his generic memory_provider support.
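
For context, all buffer handling in the selftest goes through its small
memory_provider ops table (alloc, free and the two memcpy helpers;
memcpy_to_device is the one added here), so the same TX client can later be
pointed at real device memory. A hypothetical exporter only needs to supply
the four callbacks; everything named mydev_* below is a placeholder for
illustration, mirroring the struct memory_provider and the udmabuf provider
in the diff that follows:

  static struct memory_buffer *mydev_alloc(size_t size);
  static void mydev_free(struct memory_buffer *ctx);
  static void mydev_memcpy_to_device(struct memory_buffer *dst, size_t off,
                                     void *src, int n);
  static void mydev_memcpy_from_device(void *dst, struct memory_buffer *src,
                                       size_t off, int n);

  /* Drop-in alternative to udmabuf_memory_provider in ncdevmem.c. */
  static struct memory_provider mydev_memory_provider = {
          .alloc = mydev_alloc,
          .free = mydev_free,
          .memcpy_to_device = mydev_memcpy_to_device,
          .memcpy_from_device = mydev_memcpy_from_device,
  };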
Signed-off-by: Mina Almasry Signed-off-by: Stanislav Fomichev --- .../selftests/drivers/net/hw/ncdevmem.c | 261 +++++++++++++++++- 1 file changed, 259 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/test= ing/selftests/drivers/net/hw/ncdevmem.c index 19a6969643f4..c1cbe2e11230 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -40,15 +40,18 @@ #include #include #include +#include =20 #include #include #include #include #include +#include =20 #include #include +#include #include #include #include @@ -80,6 +83,8 @@ static int num_queues =3D -1; static char *ifname; static unsigned int ifindex; static unsigned int dmabuf_id; +static uint32_t tx_dmabuf_id; +static int waittime_ms =3D 500; =20 struct memory_buffer { int fd; @@ -93,6 +98,8 @@ struct memory_buffer { struct memory_provider { struct memory_buffer *(*alloc)(size_t size); void (*free)(struct memory_buffer *ctx); + void (*memcpy_to_device)(struct memory_buffer *dst, size_t off, + void *src, int n); void (*memcpy_from_device)(void *dst, struct memory_buffer *src, size_t off, int n); }; @@ -153,6 +160,20 @@ static void udmabuf_free(struct memory_buffer *ctx) free(ctx); } =20 +static void udmabuf_memcpy_to_device(struct memory_buffer *dst, size_t off, + void *src, int n) +{ + struct dma_buf_sync sync =3D {}; + + sync.flags =3D DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE; + ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync); + + memcpy(dst->buf_mem + off, src, n); + + sync.flags =3D DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE; + ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync); +} + static void udmabuf_memcpy_from_device(void *dst, struct memory_buffer *sr= c, size_t off, int n) { @@ -170,6 +191,7 @@ static void udmabuf_memcpy_from_device(void *dst, struc= t memory_buffer *src, static struct memory_provider udmabuf_memory_provider =3D { .alloc =3D udmabuf_alloc, .free =3D udmabuf_free, + .memcpy_to_device =3D udmabuf_memcpy_to_device, .memcpy_from_device =3D udmabuf_memcpy_from_device, }; =20 @@ -394,6 +416,49 @@ static int bind_rx_queue(unsigned int ifindex, unsigne= d int dmabuf_fd, return -1; } =20 +static int bind_tx_queue(unsigned int ifindex, unsigned int dmabuf_fd, + struct ynl_sock **ys) +{ + struct netdev_bind_tx_req *req =3D NULL; + struct netdev_bind_tx_rsp *rsp =3D NULL; + struct ynl_error yerr; + + *ys =3D ynl_sock_create(&ynl_netdev_family, &yerr); + if (!*ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return -1; + } + + req =3D netdev_bind_tx_req_alloc(); + netdev_bind_tx_req_set_ifindex(req, ifindex); + netdev_bind_tx_req_set_fd(req, dmabuf_fd); + + rsp =3D netdev_bind_tx(*ys, req); + if (!rsp) { + perror("netdev_bind_tx"); + goto err_close; + } + + if (!rsp->_present.id) { + perror("id not present"); + goto err_close; + } + + fprintf(stderr, "got tx dmabuf id=3D%d\n", rsp->id); + tx_dmabuf_id =3D rsp->id; + + netdev_bind_tx_req_free(req); + netdev_bind_tx_rsp_free(rsp); + + return 0; + +err_close: + fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg); + netdev_bind_tx_req_free(req); + ynl_sock_destroy(*ys); + return -1; +} + static void enable_reuseaddr(int fd) { int opt =3D 1; @@ -432,7 +497,7 @@ static int parse_address(const char *str, int port, str= uct sockaddr_in6 *sin6) return 0; } =20 -int do_server(struct memory_buffer *mem) +static int do_server(struct memory_buffer *mem) { char ctrl_data[sizeof(int) * 20000]; struct netdev_queue_id *queues; @@ -686,6 +751,198 @@ void run_devmem_tests(void) provider->free(mem); } 
=20 +static unsigned long gettimeofday_ms(void) +{ + struct timeval tv; + + gettimeofday(&tv, NULL); + return (tv.tv_sec * 1000) + (tv.tv_usec / 1000); +} + +static int do_poll(int fd) +{ + struct pollfd pfd; + int ret; + + pfd.events =3D POLLERR; + pfd.revents =3D 0; + pfd.fd =3D fd; + + ret =3D poll(&pfd, 1, waittime_ms); + if (ret =3D=3D -1) + error(1, errno, "poll"); + + return ret && (pfd.revents & POLLERR); +} + +static void wait_compl(int fd) +{ + int64_t tstop =3D gettimeofday_ms() + waittime_ms; + char control[CMSG_SPACE(100)] =3D {}; + struct sock_extended_err *serr; + struct msghdr msg =3D {}; + struct cmsghdr *cm; + int retries =3D 10; + __u32 hi, lo; + int ret; + + msg.msg_control =3D control; + msg.msg_controllen =3D sizeof(control); + + while (gettimeofday_ms() < tstop) { + if (!do_poll(fd)) + continue; + + ret =3D recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret < 0) { + if (errno =3D=3D EAGAIN) + continue; + error(1, ret, "recvmsg(MSG_ERRQUEUE)"); + return; + } + if (msg.msg_flags & MSG_CTRUNC) + error(1, 0, "MSG_CTRUNC\n"); + + for (cm =3D CMSG_FIRSTHDR(&msg); cm; cm =3D CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level !=3D SOL_IP && + cm->cmsg_level !=3D SOL_IPV6) + continue; + if (cm->cmsg_level =3D=3D SOL_IP && + cm->cmsg_type !=3D IP_RECVERR) + continue; + if (cm->cmsg_level =3D=3D SOL_IPV6 && + cm->cmsg_type !=3D IPV6_RECVERR) + continue; + + serr =3D (void *)CMSG_DATA(cm); + if (serr->ee_origin !=3D SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "wrong origin %u", serr->ee_origin); + if (serr->ee_errno !=3D 0) + error(1, 0, "wrong errno %d", serr->ee_errno); + + hi =3D serr->ee_data; + lo =3D serr->ee_info; + + fprintf(stderr, "tx complete [%d,%d]\n", lo, hi); + return; + } + } + + error(1, 0, "did not receive tx completion"); +} + +static int do_client(struct memory_buffer *mem) +{ + char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))]; + struct sockaddr_in6 server_sin; + struct sockaddr_in6 client_sin; + struct dmabuf_tx_cmsg ddmabuf; + struct ynl_sock *ys =3D NULL; + struct msghdr msg =3D {}; + ssize_t line_size =3D 0; + struct cmsghdr *cmsg; + uint64_t off =3D 100; + char *line =3D NULL; + struct iovec iov; + size_t len =3D 0; + int socket_fd; + int opt =3D 1; + int ret; + + ret =3D parse_address(server_ip, atoi(port), &server_sin); + if (ret < 0) + error(1, 0, "parse server address"); + + socket_fd =3D socket(AF_INET6, SOCK_STREAM, 0); + if (socket_fd < 0) + error(1, socket_fd, "create socket"); + + enable_reuseaddr(socket_fd); + + ret =3D setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, + strlen(ifname) + 1); + if (ret) + error(1, ret, "bindtodevice"); + + if (bind_tx_queue(ifindex, mem->fd, &ys)) + error(1, 0, "Failed to bind\n"); + + ret =3D parse_address(client_ip, atoi(port), &client_sin); + if (ret < 0) + error(1, 0, "parse client address"); + + ret =3D bind(socket_fd, &client_sin, sizeof(client_sin)); + if (ret) + error(1, ret, "bind"); + + ret =3D setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + if (ret) + error(1, ret, "set sock opt"); + + fprintf(stderr, "Connect to %s %d (via %s)\n", server_ip, + ntohs(server_sin.sin6_port), ifname); + + ret =3D connect(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) + error(1, ret, "connect"); + + while (1) { + free(line); + line =3D NULL; + line_size =3D getline(&line, &len, stdin); + + if (line_size < 0) + break; + + provider->memcpy_to_device(mem, off, line, line_size); + + while (line_size) { + fprintf(stderr, "read line_size=3D%ld off=3D%d\n", + line_size, off); + + iov.iov_base =3D 
NULL; + iov.iov_len =3D line_size; + + msg.msg_iov =3D &iov; + msg.msg_iovlen =3D 1; + + msg.msg_control =3D ctrl_data; + msg.msg_controllen =3D sizeof(ctrl_data); + + cmsg =3D CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level =3D SOL_SOCKET; + cmsg->cmsg_type =3D SCM_DEVMEM_DMABUF; + cmsg->cmsg_len =3D CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)); + + ddmabuf.dmabuf_id =3D tx_dmabuf_id; + ddmabuf.dmabuf_offset =3D off; + + *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) =3D ddmabuf; + + ret =3D sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + if (ret < 0) + error(1, errno, "Failed sendmsg"); + + fprintf(stderr, "sendmsg_ret=3D%d\n", ret); + + off +=3D ret; + line_size -=3D ret; + + wait_compl(socket_fd); + } + } + + fprintf(stderr, "%s: tx ok\n", TEST_PREFIX); + + free(line); + close(socket_fd); + + if (ys) + ynl_sock_destroy(ys); + + return 0; +} + int main(int argc, char *argv[]) { struct memory_buffer *mem; @@ -779,7 +1036,7 @@ int main(int argc, char *argv[]) error(1, 0, "Missing -p argument\n"); =20 mem =3D provider->alloc(getpagesize() * NUM_PAGES); - ret =3D is_server ? do_server(mem) : 1; + ret =3D is_server ? do_server(mem) : do_client(mem); provider->free(mem); =20 return ret; --=20 2.47.1.613.gc27f4b7a9f-goog From nobody Sun Dec 22 03:11:18 2024 Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com [209.85.210.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B0F2E126C03 for ; Sat, 21 Dec 2024 00:51:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742301; cv=none; b=oAj2DUSMmSiKpKOZmWDgnMGc6RBmhkM1nTYX96koKuEK/l/EUAEAMt9MkaQFHMym/J8mMYAS3rcmqZjhxSLtY78ofRA/d51namsjHQItgfvgsM+WJI6SP5WxRJwPQ/hkRs6gHd5l124t6G6Ap7UTfZyZX2kiUWlba6fb/oPp79o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742301; c=relaxed/simple; bh=NJYVQtSnFpPYRXXneblkJlgbJV6+0PWgWGqywbwgTRw=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=n+VrFFkm0QIlJ7CU3UxlzXlRHqi4ERDBjFl7e3SSwHcR4ZfadZnVnYeZBrX/kJ1MFfigoT+PskqeE+NHhrZCFIaLE3XYnkcpAZv5TLmEgaHXRK89P3gts/wGrcZlUHpKhzF9V7ZyGyqTVa8vnr83DscTzVEAxUP3Or0W1pm4wZA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=l3SwcKP+; arc=none smtp.client-ip=209.85.210.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="l3SwcKP+" Received: by mail-pf1-f202.google.com with SMTP id d2e1a72fcca58-728ea538b52so3346001b3a.3 for ; Fri, 20 Dec 2024 16:51:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1734742299; x=1735347099; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6mBSL9yXhBVvFNjspCgCNSWimNpbAGgVhzwIwXbDDa0=; b=l3SwcKP+hraqfWZEcrHjlI/MPxgrJEnxc1+lWJe6hNX2LdyXvAWTM5F6+NYnN4A2kq /xWDD1SZtsV1dAWH0NZR6jqtDHpRSnNVM605p8gtNErcjZvJPb4HtgtsjSSIIPN3vw2b 
szeQ/rLxwSSBAGKI3ISa1o67ec+C1EEEw9mjSRmeZ6M24auv1c0wKgIApyMGr0OpfPAE RIHNdqbIjHBRuu6rSbcY6n6OAKAu97acn7u89Qs9jNJy4AuazD0NRZpNHlNmfC3PqJcc HMF5ARVWE+rWk5+OQCCgSyg8o+H+sXA+VFTpXy6DlDRT1/IzrytpfJgQwGeqoLNJ8hOW kPbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734742299; x=1735347099; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6mBSL9yXhBVvFNjspCgCNSWimNpbAGgVhzwIwXbDDa0=; b=VKunOWLbmdk48o0+IuPw/Vrh1fp0Y00twHZwPSY+kc/e7wLtqfCHmbAj/GNTaCg4o4 gS2KrDtjPHkoXzX4v93bhb9yKhheHMzT4YHmV7iL/CayDKDcsQcHavJc+cBmjv3sAfWO wftPgdf5LtejKqiYCps1wj6w4eY5a2BJwK8TmVlywnjGrvC2jJSU6k60h4TEs+7G5rkX OeudRFQgKfUQpJeLVBoGFJk6aWHOHXaP4Fi/DFNvDmYxN8mA8RHjWfqfMZRd0osp1D1d kz8pmCySMpK3tBTusTuWStV0/ONSP4A5cbp7RGtpZQ2oM7l8THCDBLEaaNowFLFNMB+S BlOg== X-Forwarded-Encrypted: i=1; AJvYcCVDr5tgtFO3tXWu6TMf+wkf7ZKfX63xcaInItvz6KKJruhGz22rJAUUHhQJrbj5+ZUKUwhuajs7WiAKm/E=@vger.kernel.org X-Gm-Message-State: AOJu0Yx1uo52/JegTe0GIqskdh6BkyoZwVSOwoiE+1FNn13RKU/w+wGa BmNePfj2RFowTc7OjH/8vcGOt1p2SYv6Uw3w+wr8SiMR+RaL+6l59Ypp3JZ+Km+jDwg7qYGMGph UNfafUi3Ym2JtPc0le3cEZw== X-Google-Smtp-Source: AGHT+IH+J7RzJDe7NhBx4Bevu5yjGUwknofvvj4Inwv5XKnR+LfxDzqBhWN45vaZIqRemQ7zYlgZCpxoSq+gw07sAg== X-Received: from pfbcj5.prod.google.com ([2002:a05:6a00:2985:b0:725:e2fd:dcf9]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:2c86:b0:725:f4c6:6b81 with SMTP id d2e1a72fcca58-72abdd4e7a8mr8624153b3a.2.1734742298900; Fri, 20 Dec 2024 16:51:38 -0800 (PST) Date: Sat, 21 Dec 2024 00:42:34 +0000 In-Reply-To: <20241221004236.2629280-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241221004236.2629280-1-almasrymina@google.com> X-Mailer: git-send-email 2.47.1.613.gc27f4b7a9f-goog Message-ID: <20241221004236.2629280-4-almasrymina@google.com> Subject: [PATCH RFC net-next v1 3/5] net: add get_netmem/put_netmem support From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "=?UTF-8?q?Eugenio=20P=C3=A9rez?=" , Stefan Hajnoczi , Stefano Garzarella , Shuah Khan , Kaiyuan Zhang , Pavel Begunkov , Willem de Bruijn , Samiullah Khawaja , Stanislav Fomichev , Joe Damato , dw@davidwei.uk Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Currently net_iovs support only pp ref counts, and do not support a page ref equivalent. This is fine for the RX path as net_iovs are used exclusively with the pp and only pp refcounting is needed there. The TX path however does not use pp ref counts, thus, support for get_page/put_page equivalent is needed for netmem. Support get_netmem/put_netmem. Check the type of the netmem before passing it to page or net_iov specific code to obtain a page ref equivalent. For dmabuf net_iovs, we obtain a ref on the underlying binding. This ensures the entire binding doesn't disappear until all the net_iovs have been put_netmem'ed. We do not need to track the refcount of individual dmabuf net_iovs as we don't allocate/free them from a pool similar to what the buddy allocator does for pages. 
This code is written to be extensible by other net_iov implementers. get_netmem/put_netmem will check the type of the netmem and route it to the correct helper: pages -> [get|put]_page() dmabuf net_iovs -> net_devmem_[get|put]_net_iov() new net_iovs -> new helpers Signed-off-by: Mina Almasry --- include/linux/skbuff_ref.h | 4 ++-- include/net/netmem.h | 3 +++ net/core/devmem.c | 10 ++++++++++ net/core/devmem.h | 11 +++++++++++ net/core/skbuff.c | 30 ++++++++++++++++++++++++++++++ 5 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h index 0f3c58007488..9e49372ef1a0 100644 --- a/include/linux/skbuff_ref.h +++ b/include/linux/skbuff_ref.h @@ -17,7 +17,7 @@ */ static inline void __skb_frag_ref(skb_frag_t *frag) { - get_page(skb_frag_page(frag)); + get_netmem(skb_frag_netmem(frag)); } =20 /** @@ -40,7 +40,7 @@ static inline void skb_page_unref(netmem_ref netmem, bool= recycle) if (recycle && napi_pp_put_page(netmem)) return; #endif - put_page(netmem_to_page(netmem)); + put_netmem(netmem); } =20 /** diff --git a/include/net/netmem.h b/include/net/netmem.h index 1b58faa4f20f..d30f31878a09 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -245,4 +245,7 @@ static inline unsigned long netmem_get_dma_addr(netmem_= ref netmem) return __netmem_clear_lsb(netmem)->dma_addr; } =20 +void get_netmem(netmem_ref netmem); +void put_netmem(netmem_ref netmem); + #endif /* _NET_NETMEM_H */ diff --git a/net/core/devmem.c b/net/core/devmem.c index 0b6ed7525b22..f7e06a8cba01 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -322,6 +322,16 @@ void dev_dmabuf_uninstall(struct net_device *dev) } } =20 +void net_devmem_get_net_iov(struct net_iov *niov) +{ + net_devmem_dmabuf_binding_get(niov->owner->binding); +} + +void net_devmem_put_net_iov(struct net_iov *niov) +{ + net_devmem_dmabuf_binding_put(niov->owner->binding); +} + /*** "Dmabuf devmem memory provider" ***/ =20 int mp_dmabuf_devmem_init(struct page_pool *pool) diff --git a/net/core/devmem.h b/net/core/devmem.h index 76099ef9c482..54e30fea80b3 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -119,6 +119,9 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_= binding *binding) __net_devmem_dmabuf_binding_free(binding); } =20 +void net_devmem_get_net_iov(struct net_iov *niov); +void net_devmem_put_net_iov(struct net_iov *niov); + struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); void net_devmem_free_dmabuf(struct net_iov *ppiov); @@ -126,6 +129,14 @@ void net_devmem_free_dmabuf(struct net_iov *ppiov); #else struct net_devmem_dmabuf_binding; =20 +static inline void net_devmem_get_net_iov(struct net_iov *niov) +{ +} + +static inline void net_devmem_put_net_iov(struct net_iov *niov) +{ +} + static inline void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a441613a1e6c..815245d5c36b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -88,6 +88,7 @@ #include =20 #include "dev.h" +#include "devmem.h" #include "netmem_priv.h" #include "sock_destructor.h" =20 @@ -7290,3 +7291,32 @@ bool csum_and_copy_from_iter_full(void *addr, size_t= bytes, return false; } EXPORT_SYMBOL(csum_and_copy_from_iter_full); + +void get_netmem(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) { + /* Assume any net_iov is devmem and route it to + * net_devmem_get_net_iov. As new net_iov types are added they + * need to be checked here. 
+ */ + net_devmem_get_net_iov(netmem_to_net_iov(netmem)); + return; + } + get_page(netmem_to_page(netmem)); +} +EXPORT_SYMBOL(get_netmem); + +void put_netmem(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) { + /* Assume any net_iov is devmem and route it to + * net_devmem_put_net_iov. As new net_iov types are added they + * need to be checked here. + */ + net_devmem_put_net_iov(netmem_to_net_iov(netmem)); + return; + } + + put_page(netmem_to_page(netmem)); +} +EXPORT_SYMBOL(put_netmem); --=20 2.47.1.613.gc27f4b7a9f-goog From nobody Sun Dec 22 03:11:18 2024 Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F41F813A88A for ; Sat, 21 Dec 2024 00:51:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742302; cv=none; b=mf3MM2BBiIFmSf8eGcqAl+DdNq6UhSr+43opEyTNAJ20hoeWvcMTlGQeFpOgJiCSeepwmphe2pcjfWXL93v8nnfVdDoc0aPm5daXBT0gy/ofRFPxj1Nv0rL+B51IiFqowxG4448zvx96a6HvTTJVJFLdpodQ9hv7PFxJ2esaaNQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742302; c=relaxed/simple; bh=QHTemyqKq7eQ6aluLDUFi/rI+VpPUt8hVCo9T/ySwTk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=El84SLCVyqE5TAWKdX/jzcEC3r1yX80rAeVgJ4B2ByQYegJNNgDuAei5zjK7bPEhvzAomRq5PVFyx490/vU1mKoBcG2PDNGUcAONGsZ6fSq7CY3GtkWP0PtikteeC/R+73jz57+FNn1s4xvAGbCspdVqO2ylX2f6QWpVwXxdNuM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=EMmmx2vv; arc=none smtp.client-ip=209.85.216.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="EMmmx2vv" Received: by mail-pj1-f73.google.com with SMTP id 98e67ed59e1d1-2efa74481fdso2207775a91.1 for ; Fri, 20 Dec 2024 16:51:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1734742300; x=1735347100; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=lJ6NozhFSUXmkJr/9AfmzkA62BmshCZz/WJgAAj7Dl0=; b=EMmmx2vvjYQvDwSF0cF57JoZ34xf0ZL5Hew7JTRb8YSyMb076VTSWNiEClF45stpZR S1naaK5n7CaOmTF/h9rw9trRHV2kRb+sIK6ae7PL5ibQpXOpUijNB+IilmP9nLz5tvZw QcUpqptZ/Bl5osdIZCff1txm7DDc3hVzZr7Wzq6lGqhCebxrA2BCnqMAxPKVZxw+RpTC 7nNVnpXhF/yMx6pWVE/tvZCGsBdkkCTsGdItXs4hWky+jmgKKNa7emHugbTPt4XDYC0P XmdciYwgeoM5NonzTeVZxiTNqC+ueg7vYwgR1RZDzLzcm9tmkX2eDd4LINq2YJw+NYeA Nmtg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734742300; x=1735347100; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=lJ6NozhFSUXmkJr/9AfmzkA62BmshCZz/WJgAAj7Dl0=; b=vNLB4KcTFNsILUHBrFX3L6AuR+TvB/cdv1YOJAzJkxIwnwl1N4dXU9yjUOsCMSoOF8 +zC6msl2IsvjTZq0tL1J0X/dEudDz8laUJWyRImOQk26pAyFbDPDYZoaykO89cBjTYH4 
Lb19W9l7cW7FYN5+CVX/MXrFeUZMP/9/Bbj9w/+05yIhh0ZtOuM7OKJwpaXPc4IKWBcV mRTVWCR4xcc8nTzmZIPP6nZPW9bHQxLTcMcw2rHhSPmMH5YQYTNoTblHyvB7T0aXm67O ntt9tg7+BOJtkjiPOUacH/3nl3ObqghfljbNCP3kgUnvqulmpIdxUjQYfPsYVX0JnsdQ Agpw== X-Forwarded-Encrypted: i=1; AJvYcCX/wW436n6QWj5376GBJh/p0p+RacyaUHhTes4TZ/sgKGpeVdel2uRqkMxSrjTghhO+QYPVS6pGOG/proI=@vger.kernel.org X-Gm-Message-State: AOJu0YzSY51tKqNj/fQ9LYytETGtSHhSjydZrp9J2J6h7sjZm719ddH/ CHoyG94I0iN9GF5Xhmq14baYprHwlnTchzx83FTOChphhV9ZQOq/FdLLAgfYtdBDuJ3TR05JFPO WkqljYu85Pa2NTZ/98tVEHw== X-Google-Smtp-Source: AGHT+IECenOfrqMdD4y/BkZI8Rt/o8P2W8JHtAIr7xge1JhM22ZDFK4i9b8sLtz0qSwXRtkNrZAnVUepMo8t46ZczQ== X-Received: from pjbsm15.prod.google.com ([2002:a17:90b:2e4f:b0:2ea:9d23:79a0]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:54cb:b0:2ee:96a5:721e with SMTP id 98e67ed59e1d1-2f452e1cacamr9375996a91.12.1734742300328; Fri, 20 Dec 2024 16:51:40 -0800 (PST) Date: Sat, 21 Dec 2024 00:42:35 +0000 In-Reply-To: <20241221004236.2629280-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241221004236.2629280-1-almasrymina@google.com> X-Mailer: git-send-email 2.47.1.613.gc27f4b7a9f-goog Message-ID: <20241221004236.2629280-5-almasrymina@google.com> Subject: [PATCH RFC net-next v1 4/5] net: devmem TCP tx netlink api From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "=?UTF-8?q?Eugenio=20P=C3=A9rez?=" , Stefan Hajnoczi , Stefano Garzarella , Shuah Khan , Kaiyuan Zhang , Pavel Begunkov , Willem de Bruijn , Samiullah Khawaja , Stanislav Fomichev , Joe Damato , dw@davidwei.uk Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Stanislav Fomichev Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment. 
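
For reference, a userspace consumer of the new op looks like the snippet
below when built against the C helpers that ynl-gen emits from the spec in
this patch. It only consolidates calls already shown in the documentation and
selftest patches; the local names are the only thing assumed here. The
binding lives for as long as the netlink socket, so destroying *ys releases
it:

  /* Bind dmabuf_fd to ifindex for TX; returns the dmabuf id or -1. */
  static int bind_tx_dmabuf(unsigned int ifindex, int dmabuf_fd,
                            struct ynl_sock **ys)
  {
          struct netdev_bind_tx_req *req;
          struct netdev_bind_tx_rsp *rsp;
          struct ynl_error yerr;
          int id = -1;

          *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
          if (!*ys)
                  return -1;

          req = netdev_bind_tx_req_alloc();
          netdev_bind_tx_req_set_ifindex(req, ifindex);
          netdev_bind_tx_req_set_fd(req, dmabuf_fd);

          rsp = netdev_bind_tx(*ys, req);
          if (rsp && rsp->_present.id)
                  id = rsp->id;

          netdev_bind_tx_req_free(req);
          if (rsp)
                  netdev_bind_tx_rsp_free(rsp);

          if (id < 0) {
                  ynl_sock_destroy(*ys);
                  *ys = NULL;
          }
          return id;
  }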
Signed-off-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- Documentation/netlink/specs/netdev.yaml | 12 ++++++++++++ include/uapi/linux/netdev.h | 1 + net/core/netdev-genl-gen.c | 13 +++++++++++++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 6 ++++++ tools/include/uapi/linux/netdev.h | 1 + 6 files changed, 34 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlin= k/specs/netdev.yaml index cbb544bd6c84..93f4333e7bc6 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -711,6 +711,18 @@ operations: - defer-hard-irqs - gro-flush-timeout - irq-suspend-timeout + - + name: bind-tx + doc: Bind dmabuf to netdev for TX + attribute-set: dmabuf + do: + request: + attributes: + - ifindex + - fd + reply: + attributes: + - id =20 kernel-family: headers: [ "linux/list.h"] diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index e4be227d3ad6..04364ef5edbe 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -203,6 +203,7 @@ enum { NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, =20 __NETDEV_CMD_MAX, NETDEV_CMD_MAX =3D (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index a89cbd8d87c3..581b6b9935a5 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -99,6 +99,12 @@ static const struct nla_policy netdev_napi_set_nl_policy= [NETDEV_A_NAPI_IRQ_SUSPE [NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] =3D { .type =3D NLA_UINT, }, }; =20 +/* NETDEV_CMD_BIND_TX - do */ +static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD= + 1] =3D { + [NETDEV_A_DMABUF_IFINDEX] =3D NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_DMABUF_FD] =3D { .type =3D NLA_U32, }, +}; + /* Ops table for netdev */ static const struct genl_split_ops netdev_nl_ops[] =3D { { @@ -190,6 +196,13 @@ static const struct genl_split_ops netdev_nl_ops[] =3D= { .maxattr =3D NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT, .flags =3D GENL_ADMIN_PERM | GENL_CMD_CAP_DO, }, + { + .cmd =3D NETDEV_CMD_BIND_TX, + .doit =3D netdev_nl_bind_tx_doit, + .policy =3D netdev_bind_tx_nl_policy, + .maxattr =3D NETDEV_A_DMABUF_FD, + .flags =3D GENL_CMD_CAP_DO, + }, }; =20 static const struct genl_multicast_group netdev_nl_mcgrps[] =3D { diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index e09dd7539ff2..c1fed66e92b9 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -34,6 +34,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info); +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info); =20 enum { NETDEV_NLGRP_MGMT, diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 2d3ae0cd3ad2..00d3d5851487 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -907,6 +907,12 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct= genl_info *info) return err; } =20 +/* stub */ +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + void netdev_nl_sock_priv_init(struct list_head *priv) { INIT_LIST_HEAD(priv); diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/n= etdev.h index e4be227d3ad6..04364ef5edbe 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -203,6 +203,7 @@ enum { 
NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, =20 __NETDEV_CMD_MAX, NETDEV_CMD_MAX =3D (__NETDEV_CMD_MAX - 1) --=20 2.47.1.613.gc27f4b7a9f-goog From nobody Sun Dec 22 03:11:18 2024 Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com [209.85.210.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7E95114B942 for ; Sat, 21 Dec 2024 00:51:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742305; cv=none; b=i684V2yKivD/28V7fd8fUgT178n6qj7H2Cix5NWeTy4BraWf0m61xeB3bigVjxTmy+gjEMSGXBltFi+T0G+FWOliyWsVvFwHAX3tclksywlyKJjIiDGFsaHZCURNiCVyMhnyHoOFDSQNUBDo/sMoN4c4ILBCxiBNiWC9tXgveV8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734742305; c=relaxed/simple; bh=IINc7iQX/r7LYdrEQhak9aKmKhNQAg2DZGRwc2+9I94=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=tW3WY7vTFThFhs9Qitg35ttl1HUwrKxfrKJwjnDz2ySPlN/AYbmrdiuVLQXL0YrzleKJv5j+HDm8T3NRu81GSPdUhdJlg7qHvVi3G//UydDbiYgwie1jCYcHawgZ5bZPsWxS5Uj3UEDQZwfZfd4sJOc8aPLGs26pcqCpL/pygTE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=xV4ZicyZ; arc=none smtp.client-ip=209.85.210.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="xV4ZicyZ" Received: by mail-pf1-f202.google.com with SMTP id d2e1a72fcca58-727c59ecb9fso2215931b3a.1 for ; Fri, 20 Dec 2024 16:51:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1734742302; x=1735347102; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=NZSE0w2/Y5onWcsN4W4n34gva14f/out2BJyIZxl9Q0=; b=xV4ZicyZwFJdWRL2dv9PUY3tgBZgnIpCS+y150QqyxXxYbeq9NOzpmIowsP745p8Uc 0qSBRUH3VE6OOJJUg53rvjdFQv2pUZaA5yhEeVxcEiOFqx95TgnU29jq6FX1KQYGIFdQ BDcSmEUcKcko5RafqvpKJDstIb3vSzYbS8V+DiX5LLlxxkwIW55s9lhNoiF/zNkPp8E4 PMeWb7En6wbMIjsDbN9BlqZoTo7Ky/UqxQTGyVh+KC8vzwwL2cfZOiO4KUEcvNlmOjmM 5Ty8n6IsDClerws0mkP+568cKV0BI59c1pdr9VLk89CCCvQq1UyBJcUYXH0sQNUZoV3v Cp3Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734742302; x=1735347102; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=NZSE0w2/Y5onWcsN4W4n34gva14f/out2BJyIZxl9Q0=; b=sZEbebvjeTdoKWMXMrWhkN7uR/qrXmMiBDcfQxKtOvDE+kjWFdV3ln/swbqOxrvpSZ wkaGmoxxEoKXDSe4K+VuM9mjUrTkhm/my5LxkbSiDtp8LRYKfFYl4sTiEKQqJaNgckVf qBvEizh7TdcPWueYvgywMNJti7251FTyAzXrqQqm/rC+tEj852nk7CBwKBtH1NIh9B+j ZyQwjF0ZEw0iTuDzlTv5XW+emX3hQpFrJR4+OfkuVj0nG17voYu/2dtMwmsLJm8QSqD6 UwrRKohNC0b6QwvTXil4fSkH77MVorrOe4Xcr1IncP6kef064t9MVwTmxv0ZOJYMxj8J O3Vw== X-Forwarded-Encrypted: i=1; AJvYcCXMC110QxQd4EdPIzVdwK8Vo9uGvFnj26LR8TtDB5vJWxVtcq9KA7ROs6hDgmjrxC5F6Tz42m1r3q7Wppg=@vger.kernel.org X-Gm-Message-State: 
AOJu0Yxy5Oq25YRMFywzEbZQJ3ECUQ5jfUxlVFutsXC/gxSoNx5scFlS /4333JfEAnYv4LNhHYHOxy3lbOeqN+LPHJTHNfPCx/QTEwu2kzygpUIbcGMT3SABlnN0B/a5eFE EAnRzNv6oe1RmPydlgD+tMw== X-Google-Smtp-Source: AGHT+IEVonbxTJJnb/MSXmpgW2Kj8xiEDkigF5HqD2BqY7pxwYAoFbcUNdEZ7bbAazV4BPucfe0/kWlx29dFavSHaw== X-Received: from pgjo21.prod.google.com ([2002:a63:e355:0:b0:826:36c0:d549]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a20:258c:b0:1db:915b:ab11 with SMTP id adf61e73a8af0-1e5e04946b1mr9353594637.24.1734742301979; Fri, 20 Dec 2024 16:51:41 -0800 (PST) Date: Sat, 21 Dec 2024 00:42:36 +0000 In-Reply-To: <20241221004236.2629280-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241221004236.2629280-1-almasrymina@google.com> X-Mailer: git-send-email 2.47.1.613.gc27f4b7a9f-goog Message-ID: <20241221004236.2629280-6-almasrymina@google.com> Subject: [PATCH RFC net-next v1 5/5] net: devmem: Implement TX path From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, virtualization@lists.linux.dev, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , "=?UTF-8?q?Eugenio=20P=C3=A9rez?=" , Stefan Hajnoczi , Stefano Garzarella , Shuah Khan , Kaiyuan Zhang , Pavel Begunkov , Willem de Bruijn , Samiullah Khawaja , Stanislav Fomichev , Joe Damato , dw@davidwei.uk Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec and tx_iter needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from, and the offset into the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally look up the dmabuf to send from by id, then pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. Based on work by Stanislav Fomichev . A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. 
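
One consequence worth spelling out: a driver that already maps TX frags with
skb_frag_dma_map() picks up the right address automatically, since the helper
now returns the dmabuf's premapped DMA address for net_iov frags and maps the
page as before otherwise (unmap paths are a separate concern, not covered
here). A driver-side sketch; map_tx_frags and the address array are
illustrative names, not part of this series:

  #include <linux/dma-mapping.h>
  #include <linux/skbuff.h>

  static int map_tx_frags(struct device *dev, struct sk_buff *skb,
                          dma_addr_t *addrs, int max_frags)
  {
          int i, nr = skb_shinfo(skb)->nr_frags;

          if (nr > max_frags)
                  return -EINVAL;

          for (i = 0; i < nr; i++) {
                  const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

                  /* Premapped dmabuf address for devmem frags, a regular
                   * page mapping for everything else.
                   */
                  addrs[i] = skb_frag_dma_map(dev, frag, 0,
                                              skb_frag_size(frag),
                                              DMA_TO_DEVICE);
                  if (dma_mapping_error(dev, addrs[i]))
                          return -ENOMEM;
          }
          return 0;
  }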
Cc: Stanislav Fomichev Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 13 +++- include/net/sock.h | 2 + include/uapi/linux/uio.h | 5 ++ net/core/datagram.c | 40 ++++++++++- net/core/devmem.c | 91 +++++++++++++++++++++++-- net/core/devmem.h | 40 +++++++++-- net/core/netdev-genl.c | 65 +++++++++++++++++- net/core/skbuff.c | 8 ++- net/core/sock.c | 9 +++ net/ipv4/tcp.c | 36 +++++++--- net/vmw_vsock/virtio_transport_common.c | 4 +- 11 files changed, 281 insertions(+), 32 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bb2b751d274a..e90dc0c4d542 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1711,9 +1711,10 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *= sk, size_t size, =20 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref); =20 +struct net_devmem_dmabuf_binding; int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length); + size_t length, bool is_devmem); =20 int zerocopy_fill_skb_from_iter(struct sk_buff *skb, struct iov_iter *from, size_t length); @@ -1721,12 +1722,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len) { - return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len); + return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len, + false); } =20 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, struct msghdr *msg, int len, - struct ubuf_info *uarg); + struct ubuf_info *uarg, + struct net_devmem_dmabuf_binding *binding); =20 /* Internal */ #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) @@ -3697,6 +3700,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct d= evice *dev, size_t offset, size_t size, enum dma_data_direction dir) { + if (skb_frag_is_net_iov(frag)) { + return netmem_to_net_iov(frag->netmem)->dma_addr + offset + + frag->offset; + } return dma_map_page(dev, skb_frag_page(frag), skb_frag_off(frag) + offset, size, dir); } diff --git a/include/net/sock.h b/include/net/sock.h index d4bdd3286e03..75bd580fe9c6 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1816,6 +1816,8 @@ struct sockcm_cookie { u32 tsflags; u32 ts_opt_id; u32 priority; + u32 dmabuf_id; + u64 dmabuf_offset; }; =20 static inline void sockcm_init(struct sockcm_cookie *sockc, diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 649739e0c404..41490cde95ad 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -38,6 +38,11 @@ struct dmabuf_token { __u32 token_count; }; =20 +struct dmabuf_tx_cmsg { + __u32 dmabuf_id; + __u64 dmabuf_offset; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/datagram.c b/net/core/datagram.c index f0693707aece..3b09995db894 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -63,6 +63,8 @@ #include #include =20 +#include "devmem.h" + /* * Is a socket 'connection oriented' ? 
*/ @@ -692,9 +694,41 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, return 0; } =20 +static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb, + struct msghdr *msg, + struct iov_iter *from, int length) +{ + int i =3D skb_shinfo(skb)->nr_frags; + int orig_length =3D length; + netmem_ref netmem; + size_t size; + + while (length && iov_iter_count(from)) { + if (i =3D=3D MAX_SKB_FRAGS) + return -EMSGSIZE; + + size =3D min_t(size_t, iter_iov_len(from), length); + if (!size) + return -EFAULT; + + netmem =3D net_iov_to_netmem(iter_iov(from)->iov_base); + get_netmem(netmem); + skb_add_rx_frag_netmem(skb, i, netmem, from->iov_offset, size, + PAGE_SIZE); + + iov_iter_advance(from, size); + length -=3D size; + i++; + } + + iov_iter_advance(&msg->msg_iter, orig_length); + + return 0; +} + int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length) + size_t length, bool is_devmem) { unsigned long orig_size =3D skb->truesize; unsigned long truesize; @@ -702,6 +736,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct = sock *sk, =20 if (msg && msg->msg_ubuf && msg->sg_from_iter) ret =3D msg->sg_from_iter(skb, from, length); + else if (unlikely(is_devmem)) + ret =3D zerocopy_fill_skb_from_devmem(skb, msg, from, length); else ret =3D zerocopy_fill_skb_from_iter(skb, from, length); =20 @@ -735,7 +771,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct i= ov_iter *from) if (skb_copy_datagram_from_iter(skb, 0, from, copy)) return -EFAULT; =20 - return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U); + return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL); } EXPORT_SYMBOL(zerocopy_sg_from_iter); =20 diff --git a/net/core/devmem.c b/net/core/devmem.c index f7e06a8cba01..81f1b715cfa6 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -15,6 +15,7 @@ #include #include #include +#include #include =20 #include "devmem.h" @@ -63,8 +64,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_= dmabuf_binding *binding) dma_buf_detach(binding->dmabuf, binding->attachment); dma_buf_put(binding->dmabuf); xa_destroy(&binding->bound_rxqs); + kfree(binding->tx_vec); kfree(binding); } +EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free); =20 struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) @@ -109,6 +112,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf= _binding *binding) unsigned long xa_idx; unsigned int rxq_idx; =20 + xa_erase(&net_devmem_dmabuf_bindings, binding->id); + + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the + * erase. 
+ */ + synchronize_net(); + if (binding->list.next) list_del(&binding->list); =20 @@ -122,8 +132,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_= binding *binding) WARN_ON(netdev_rx_queue_restart(binding->dev, rxq_idx)); } =20 - xa_erase(&net_devmem_dmabuf_bindings, binding->id); - net_devmem_dmabuf_binding_put(binding); } =20 @@ -174,8 +182,9 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *= dev, u32 rxq_idx, } =20 struct net_devmem_dmabuf_binding * -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, - struct netlink_ext_ack *extack) +net_devmem_bind_dmabuf(struct net_device *dev, + enum dma_data_direction direction, + unsigned int dmabuf_fd, struct netlink_ext_ack *extack) { struct net_devmem_dmabuf_binding *binding; static u32 id_alloc_next; @@ -183,6 +192,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned= int dmabuf_fd, struct dma_buf *dmabuf; unsigned int sg_idx, i; unsigned long virtual; + struct iovec *iov; int err; =20 dmabuf =3D dma_buf_get(dmabuf_fd); @@ -218,13 +228,19 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsign= ed int dmabuf_fd, } =20 binding->sgt =3D dma_buf_map_attachment_unlocked(binding->attachment, - DMA_FROM_DEVICE); + direction); if (IS_ERR(binding->sgt)) { err =3D PTR_ERR(binding->sgt); NL_SET_ERR_MSG(extack, "Failed to map dmabuf attachment"); goto err_detach; } =20 + if (!binding->sgt || binding->sgt->nents =3D=3D 0) { + err =3D -EINVAL; + NL_SET_ERR_MSG(extack, "Empty dmabuf attachment"); + goto err_detach; + } + /* For simplicity we expect to make PAGE_SIZE allocations, but the * binding can be much more flexible than that. We may be able to * allocate MTU sized chunks here. Leave that for future work... @@ -236,6 +252,19 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigne= d int dmabuf_fd, goto err_unmap; } =20 + if (direction =3D=3D DMA_TO_DEVICE) { + virtual =3D 0; + for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) + virtual +=3D sg_dma_len(sg); + + binding->tx_vec =3D kcalloc(virtual / PAGE_SIZE + 1, + sizeof(struct iovec), GFP_KERNEL); + if (!binding->tx_vec) { + err =3D -ENOMEM; + goto err_unmap; + } + } + virtual =3D 0; for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) { dma_addr_t dma_addr =3D sg_dma_address(sg); @@ -277,11 +306,21 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsign= ed int dmabuf_fd, niov->owner =3D owner; page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), net_devmem_get_dma_addr(niov)); + + if (direction =3D=3D DMA_TO_DEVICE) { + iov =3D &binding->tx_vec[virtual / PAGE_SIZE + i]; + iov->iov_base =3D niov; + iov->iov_len =3D PAGE_SIZE; + } } =20 virtual +=3D len; } =20 + if (direction =3D=3D DMA_TO_DEVICE) + iov_iter_init(&binding->tx_iter, WRITE, binding->tx_vec, + virtual / PAGE_SIZE + 1, virtual); + return binding; =20 err_free_chunks: @@ -302,6 +341,21 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigne= d int dmabuf_fd, return ERR_PTR(err); } =20 +struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id) +{ + struct net_devmem_dmabuf_binding *binding; + + rcu_read_lock(); + binding =3D xa_load(&net_devmem_dmabuf_bindings, id); + if (binding) { + if (!net_devmem_dmabuf_binding_get(binding)) + binding =3D NULL; + } + rcu_read_unlock(); + + return binding; +} + void dev_dmabuf_uninstall(struct net_device *dev) { struct net_devmem_dmabuf_binding *binding; @@ -332,6 +386,33 @@ void net_devmem_put_net_iov(struct net_iov *niov) net_devmem_dmabuf_binding_put(niov->owner->binding); } =20 +struct net_devmem_dmabuf_binding * 
+net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc)
+{
+	struct net_devmem_dmabuf_binding *binding;
+	int err = 0;
+
+	binding = net_devmem_lookup_dmabuf(sockc->dmabuf_id);
+	if (!binding || !binding->tx_vec) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	if (sock_net(sk) != dev_net(binding->dev)) {
+		err = -ENODEV;
+		goto out_err;
+	}
+
+	iov_iter_advance(&binding->tx_iter, sockc->dmabuf_offset);
+	return binding;
+
+out_err:
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
+
+	return ERR_PTR(err);
+}
+
 /*** "Dmabuf devmem memory provider" ***/
 
 int mp_dmabuf_devmem_init(struct page_pool *pool)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 54e30fea80b3..f923c77d9c45 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -11,6 +11,8 @@
 #define _NET_DEVMEM_H
 
 struct netlink_ext_ack;
+struct sockcm_cookie;
+struct sock;
 
 struct net_devmem_dmabuf_binding {
 	struct dma_buf *dmabuf;
@@ -27,6 +29,10 @@ struct net_devmem_dmabuf_binding {
 	 * The binding undos itself and unmaps the underlying dmabuf once all
 	 * those refs are dropped and the binding is no longer desired or in
 	 * use.
+	 *
+	 * net_devmem_get_net_iov() on dmabuf net_iovs will increment this
+	 * reference, making sure that the binding remains alive until all
+	 * the net_iovs are no longer used.
 	 */
 	refcount_t ref;
 
@@ -42,6 +48,10 @@ struct net_devmem_dmabuf_binding {
 	 * active.
 	 */
 	u32 id;
+
+	/* iov_iter representing all possible net_iov chunks in the dmabuf. */
+	struct iov_iter tx_iter;
+	struct iovec *tx_vec;
 };
 
 #if defined(CONFIG_NET_DEVMEM)
@@ -66,8 +76,10 @@ struct dmabuf_genpool_chunk_owner {
 
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
 struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
-		       struct netlink_ext_ack *extack);
+net_devmem_bind_dmabuf(struct net_device *dev,
+		       enum dma_data_direction direction,
+		       unsigned int dmabuf_fd, struct netlink_ext_ack *extack);
+struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
 int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				    struct net_devmem_dmabuf_binding *binding,
@@ -104,10 +116,10 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov)
 	return net_iov_owner(niov)->binding->id;
 }
 
-static inline void
+static inline bool
 net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
 {
-	refcount_inc(&binding->ref);
+	return refcount_inc_not_zero(&binding->ref);
 }
 
 static inline void
@@ -126,6 +138,9 @@ struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
 void net_devmem_free_dmabuf(struct net_iov *ppiov);
 
+struct net_devmem_dmabuf_binding *
+net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc);
+
 #else
 struct net_devmem_dmabuf_binding;
 
@@ -144,11 +159,17 @@ __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 
 static inline struct net_devmem_dmabuf_binding *
 net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
+		       enum dma_data_direction direction,
 		       struct netlink_ext_ack *extack)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id)
+{
+	return NULL;
+}
+
 static inline void
 net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 {
@@ -186,6 +207,17 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov)
 {
 	return 0;
 }
+
+static inline void
+net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
+{
+}
+
+static inline struct net_devmem_dmabuf_binding *
+net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
 #endif
 
 #endif /* _NET_DEVMEM_H */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 00d3d5851487..b9928bac94da 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -850,7 +850,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock;
 	}
 
-	binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, info->extack);
+	binding = net_devmem_bind_dmabuf(netdev, DMA_FROM_DEVICE, dmabuf_fd,
+					 info->extack);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
 		goto err_unlock;
@@ -907,10 +908,68 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	return err;
 }
 
-/* stub */
 int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	return 0;
+	struct net_devmem_dmabuf_binding *binding;
+	struct list_head *sock_binding_list;
+	struct net_device *netdev;
+	u32 ifindex, dmabuf_fd;
+	struct sk_buff *rsp;
+	int err = 0;
+	void *hdr;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD))
+		return -EINVAL;
+
+	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
+	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
+
+	sock_binding_list =
+		genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
+	if (IS_ERR(sock_binding_list))
+		return PTR_ERR(sock_binding_list);
+
+	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!rsp)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_genlmsg_free;
+	}
+
+	rtnl_lock();
+
+	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
+	if (!netdev || !netif_device_present(netdev)) {
+		err = -ENODEV;
+		goto err_unlock;
+	}
+
+	binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd,
+					 info->extack);
+	if (IS_ERR(binding)) {
+		err = PTR_ERR(binding);
+		goto err_unlock;
+	}
+
+	list_add(&binding->list, sock_binding_list);
+
+	nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id);
+	genlmsg_end(rsp, hdr);
+
+	rtnl_unlock();
+
+	return genlmsg_reply(rsp, info);
+
+	net_devmem_unbind_dmabuf(binding);
+err_unlock:
+	rtnl_unlock();
+err_genlmsg_free:
+	nlmsg_free(rsp);
+	return err;
 }
 
 void netdev_nl_sock_priv_init(struct list_head *priv)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 815245d5c36b..eb6b41a32524 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1882,8 +1882,10 @@ EXPORT_SYMBOL_GPL(msg_zerocopy_ubuf_ops);
 
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
-			     struct ubuf_info *uarg)
+			     struct ubuf_info *uarg,
+			     struct net_devmem_dmabuf_binding *binding)
 {
+	struct iov_iter *from = binding ? &binding->tx_iter : &msg->msg_iter;
 	int err, orig_len = skb->len;
 
 	if (uarg->ops->link_skb) {
@@ -1901,12 +1903,12 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 		return -EEXIST;
 	}
 
-	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(msg, sk, skb, from, len, binding != NULL);
 	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
 		struct sock *save_sk = skb->sk;
 
 		/* Streams do not free skb on error. Reset to prev state. */
-		iov_iter_revert(&msg->msg_iter, skb->len - orig_len);
+		iov_iter_revert(from, skb->len - orig_len);
 		skb->sk = sk;
 		___pskb_trim(skb, orig_len);
 		skb->sk = save_sk;
diff --git a/net/core/sock.c b/net/core/sock.c
index e7bcc8952248..ed7089310f0d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2908,6 +2908,7 @@ EXPORT_SYMBOL(sock_alloc_send_pskb);
 int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
+	struct dmabuf_tx_cmsg dmabuf_tx;
 	u32 tsflags;
 
 	BUILD_BUG_ON(SOF_TIMESTAMPING_LAST == (1 << 31));
@@ -2961,6 +2962,14 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		if (!sk_set_prio_allowed(sk, *(u32 *)CMSG_DATA(cmsg)))
 			return -EPERM;
 		sockc->priority = *(u32 *)CMSG_DATA(cmsg);
+		break;
+	case SCM_DEVMEM_DMABUF:
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)))
+			return -EINVAL;
+		dmabuf_tx = *(struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg);
+		sockc->dmabuf_id = dmabuf_tx.dmabuf_id;
+		sockc->dmabuf_offset = dmabuf_tx.dmabuf_offset;
+		break;
 	default:
 		return -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..406dc2993742 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1051,6 +1051,7 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
 
 int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 {
+	struct net_devmem_dmabuf_binding *binding = NULL;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
@@ -1063,6 +1064,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 	flags = msg->msg_flags;
 
+	sockcm_init(&sockc, sk);
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+
 	if ((flags & MSG_ZEROCOPY) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
@@ -1080,6 +1090,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+
+		if (sockc.dmabuf_id != 0) {
+			binding = net_devmem_get_sockc_binding(sk, &sockc);
+			if (IS_ERR(binding)) {
+				err = PTR_ERR(binding);
+				binding = NULL;
+				goto out_err;
+			}
+		}
 	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
 		if (sk->sk_route_caps & NETIF_F_SG)
 			zc = MSG_SPLICE_PAGES;
@@ -1123,15 +1142,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}
 
-	sockcm_init(&sockc, sk);
-	if (msg->msg_controllen) {
-		err = sock_cmsg_send(sk, msg, &sockc);
-		if (unlikely(err)) {
-			err = -EINVAL;
-			goto out_err;
-		}
-	}
-
 	/* This should be in poll */
 	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
@@ -1248,7 +1258,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			goto wait_for_space;
 		}
 
-		err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
+		err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg,
+					       binding);
 		if (err == -EMSGSIZE || err == -EEXIST) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1329,6 +1340,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
 	if (uarg && !msg->msg_ubuf)
 		net_zcopy_put(uarg);
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
 	return copied + copied_syn;
 
 do_error:
@@ -1346,6 +1359,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		sk->sk_write_space(sk);
 		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
 	}
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(tcp_sendmsg_locked);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 9acc13ab3f82..286e6cd5ad34 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -104,8 +104,8 @@ static int virtio_transport_fill_skb(struct sk_buff *skb,
 {
 	if (zcopy)
 		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
-					       &info->msg->msg_iter,
-					       len);
+					       &info->msg->msg_iter, len,
+					       false);
 
 	return memcpy_from_msg(skb_put(skb, len), info->msg, len);
 }
-- 
2.47.1.613.gc27f4b7a9f-goog
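
For reference, a minimal hypothetical userspace sketch of the send side that the sock.c/tcp.c hunks above enable. The SCM_DEVMEM_DMABUF cmsg type and the dmabuf_id/dmabuf_offset fields are taken from the __sock_cmsg_send() change; the SOL_SOCKET cmsg level, the local struct dmabuf_tx_cmsg layout, the unused iov_base, and the SO_ZEROCOPY/MSG_ZEROCOPY requirement are assumptions inferred from that code path, not part of this patch.

#include <linux/types.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Assumed layout of the uapi cmsg payload; the real definition is added
 * elsewhere in this series and may differ.
 */
struct dmabuf_tx_cmsg {
	__u32 dmabuf_id;	/* id returned by the netlink bind-tx call */
	__u32 dmabuf_offset;	/* byte offset into the bound TX dmabuf */
};

static int send_from_dmabuf(int fd, __u32 dmabuf_id, __u32 offset, size_t len)
{
	char ctrl[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))] = {};
	struct dmabuf_tx_cmsg ddmabuf = {
		.dmabuf_id = dmabuf_id,
		.dmabuf_offset = offset,
	};
	/* The iovec only conveys the length; the payload bytes come from the
	 * dmabuf at dmabuf_offset (zerocopy_fill_skb_from_devmem() pulls from
	 * binding->tx_iter and merely advances msg->msg_iter).  Leaving
	 * iov_base unused is an assumption of this sketch.
	 */
	struct iovec iov = { .iov_base = NULL, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;	/* routed to __sock_cmsg_send() */
	cm->cmsg_type = SCM_DEVMEM_DMABUF;
	cm->cmsg_len = CMSG_LEN(sizeof(ddmabuf));
	memcpy(CMSG_DATA(cm), &ddmabuf, sizeof(ddmabuf));

	/* The binding is only looked up on the zerocopy path, so MSG_ZEROCOPY
	 * (and SO_ZEROCOPY on the socket) is required.
	 */
	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}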