From nobody Thu Oct 2 19:25:12 2025
From: Bobby Eshleman
Date: Thu, 11 Sep 2025 22:28:15 -0700
Subject: [PATCH net-next v2 1/3] net: devmem: rename tx_vec to vec in dmabuf binding
Message-Id: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-1-c80d735bd453@meta.com>
References: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
In-Reply-To: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
X-Mailer: b4 0.13.0

From: Bobby Eshleman

Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'.
This field holds pointers to net_iov structures. The rename prepares
for reusing 'vec' for both the TX and RX directions.

No functional change intended.
Signed-off-by: Bobby Eshleman
---
 net/core/devmem.c | 22 +++++++++++-----------
 net/core/devmem.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index d9de31a6cc7f..b4c570d4f37a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -74,7 +74,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 	dma_buf_detach(binding->dmabuf, binding->attachment);
 	dma_buf_put(binding->dmabuf);
 	xa_destroy(&binding->bound_rxqs);
-	kvfree(binding->tx_vec);
+	kvfree(binding->vec);
 	kfree(binding);
 }
 
@@ -231,10 +231,10 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	}
 
 	if (direction == DMA_TO_DEVICE) {
-		binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-						 sizeof(struct net_iov *),
-						 GFP_KERNEL);
-		if (!binding->tx_vec) {
+		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+					      sizeof(struct net_iov *),
+					      GFP_KERNEL);
+		if (!binding->vec) {
 			err = -ENOMEM;
 			goto err_unmap;
 		}
@@ -248,7 +248,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 					      dev_to_node(&dev->dev));
 	if (!binding->chunk_pool) {
 		err = -ENOMEM;
-		goto err_tx_vec;
+		goto err_vec;
 	}
 
 	virtual = 0;
@@ -294,7 +294,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
 			if (direction == DMA_TO_DEVICE)
-				binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
@@ -314,8 +314,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	gen_pool_for_each_chunk(binding->chunk_pool,
 				net_devmem_dmabuf_free_chunk_owner, NULL);
 	gen_pool_destroy(binding->chunk_pool);
-err_tx_vec:
-	kvfree(binding->tx_vec);
+err_vec:
+	kvfree(binding->vec);
 err_unmap:
 	dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt,
 					  direction);
@@ -361,7 +361,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
 	int err = 0;
 
 	binding = net_devmem_lookup_dmabuf(dmabuf_id);
-	if (!binding || !binding->tx_vec) {
+	if (!binding || !binding->vec) {
 		err = -EINVAL;
 		goto out_err;
 	}
@@ -393,7 +393,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
 	*off = virt_addr % PAGE_SIZE;
 	*size = PAGE_SIZE - *off;
 
-	return binding->tx_vec[virt_addr / PAGE_SIZE];
+	return binding->vec[virt_addr / PAGE_SIZE];
 }
 
 /*** "Dmabuf devmem memory provider" ***/
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 101150d761af..2ada54fb63d7 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding {
 	 * address. This array is convenient to map the virtual addresses to
 	 * net_iovs in the TX path.
 	 */
-	struct net_iov **tx_vec;
+	struct net_iov **vec;
 
 	struct work_struct unbind_w;
 };
-- 
2.47.3

From nobody Thu Oct 2 19:25:12 2025
From: Bobby Eshleman
Date: Thu, 11 Sep 2025 22:28:16 -0700
Subject: [PATCH net-next v2 2/3] net: devmem: use niov array for token management
Message-Id: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-2-c80d735bd453@meta.com>
References: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
In-Reply-To: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
X-Mailer: b4 0.13.0

From: Bobby Eshleman

Improve the CPU performance of devmem token management by using page
offsets as dmabuf tokens and looking them up by direct array indexing
instead of xarray lookups. The xarray can then be removed entirely. The
result is an average 5% reduction in CPU cycles spent by devmem RX user
threads.

This patch changes the meaning of tokens. Tokens previously referred to
unique fragments of pages; they now represent references to whole
pages, not fragments. Because of this, multiple tokens may refer to the
same page and so carry identical values (e.g., two small fragments may
coexist on the same page). The (token, offset) pair that the user
receives still uniquely identifies a fragment when needed. This assumes
that the user is not attempting to sort / uniq the token list using
tokens alone.

A new restriction is added to the implementation: devmem RX sockets
cannot switch dmabuf bindings. In practice, switching bindings is a
symptom of invalid configuration, as a flow would have to be steered to
a different queue or device where there is a different binding, which
is generally bad for TCP flows. This restriction is necessary because
the 32-bit dmabuf token does not have enough bits to represent both the
pages of a large dmabuf and a binding or dmabuf ID. For example, a
system with 8 NICs and 32 queues requires 8 bits for a binding / queue
ID (8 NICs * 32 queues == 256 queues total == 2^8), which leaves only
24 bits for dmabuf pages (2^24 * 4096 / (1<<30) == 64GB). This is
insufficient for the device and queue counts on many current systems,
or for systems that may need larger GPU dmabufs (as for hard limits, my
current H100 has 80GB of GPU memory per device).
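To make the arithmetic above concrete, here is a minimal standalone C
sketch (illustrative only, not code from this patch; PAGE_SHIFT and the
example addresses are assumptions): a token is simply the page index of
a fragment within its dmabuf, and the 24-bit token budget bounds the
addressable dmabuf size.

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_SHIFT 12
	#define PAGE_SIZE  (1UL << PAGE_SHIFT)

	int main(void)
	{
		/* Example: a fragment 5 pages into the dmabuf, 100 bytes in. */
		uint64_t virt_addr = 5 * PAGE_SIZE + 100;

		/* The token handed to userspace is the page index... */
		uint32_t token = virt_addr >> PAGE_SHIFT;
		/* ...and the in-page offset disambiguates fragments that
		 * share a token (i.e., live on the same page).
		 */
		uint32_t off = virt_addr % PAGE_SIZE;

		assert(token == 5 && off == 100);

		/* Bit budget from the commit message: 8 NICs * 32 queues
		 * needs 8 bits for a binding/queue ID, leaving 24 bits of
		 * a 32-bit token for page indices.
		 */
		uint64_t max_bytes = (1ULL << 24) * PAGE_SIZE;

		printf("token=%u off=%u addressable=%lluGB\n",
		       token, off,
		       (unsigned long long)(max_bytes >> 30)); /* 64GB */
		return 0;
	}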
Using kperf[1] with 4 flows and workers, this patch improves receive
worker CPU utilization by ~4.9% with slightly better throughput.

Before, mean CPU utilization for RX workers is ~83.6%:

Average:  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
Average:    4   2.30   0.00  79.43    0.00  0.65   0.21   0.00   0.00   0.00  17.41
Average:    5   2.27   0.00  80.40    0.00  0.45   0.21   0.00   0.00   0.00  16.67
Average:    6   2.28   0.00  80.47    0.00  0.46   0.25   0.00   0.00   0.00  16.54
Average:    7   2.42   0.00  82.05    0.00  0.46   0.21   0.00   0.00   0.00  14.86

After, mean CPU utilization for RX workers is ~78.7%:

Average:  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
Average:    4   2.61   0.00  73.31    0.00  0.76   0.11   0.00   0.00   0.00  23.20
Average:    5   2.95   0.00  74.24    0.00  0.66   0.22   0.00   0.00   0.00  21.94
Average:    6   2.81   0.00  73.38    0.00  0.97   0.11   0.00   0.00   0.00  22.73
Average:    7   3.05   0.00  78.76    0.00  0.76   0.11   0.00   0.00   0.00  17.32

Mean throughput improves, but the difference falls within one standard
deviation (~45GB/s for 4 flows on a 50GB/s NIC, one hop).

This patch adds an array of atomics for counting the tokens returned to
the user for a given page: one 4-byte atomic per dmabuf page, per
socket. For a 2GB dmabuf, this array is 2MB.

[1]: https://github.com/facebookexperimental/kperf

Signed-off-by: Bobby Eshleman
---
Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
 include/net/sock.h       |   5 ++-
 net/core/devmem.c        |  17 +++-----
 net/core/devmem.h        |   2 +-
 net/core/sock.c          |  23 +++++++---
 net/ipv4/tcp.c           | 111 ++++++++++++++++-------------------------
 net/ipv4/tcp_ipv4.c      |  39 ++++++++++++++---
 net/ipv4/tcp_minisocks.c |   2 -
 7 files changed, 99 insertions(+), 100 deletions(-)
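The uref accounting described above can be sketched in isolation. The
following standalone C program is illustrative only and is not from the
patch: it uses C11 atomics as a stand-in for the kernel's atomic_t, and
uref_get()/uref_put() are hypothetical names mirroring the recvmsg()
increment and the SO_DEVMEM_DONTNEED atomic_dec_if_positive() pairing.

	#include <assert.h>
	#include <stdatomic.h>

	#define NUM_PAGES 8	/* stand-in for dmabuf->size / PAGE_SIZE */

	static atomic_int urefs[NUM_PAGES];	/* one counter per page */

	/* recvmsg() path: a fragment on page `token` went to userspace. */
	static void uref_get(unsigned int token)
	{
		atomic_fetch_add(&urefs[token], 1);
	}

	/* DONTNEED path: drop one user reference, but never go negative;
	 * mirrors atomic_dec_if_positive() in the patch.
	 */
	static int uref_put(unsigned int token)
	{
		int old = atomic_load(&urefs[token]);

		while (old > 0) {
			/* On failure, `old` is reloaded and rechecked. */
			if (atomic_compare_exchange_weak(&urefs[token],
							 &old, old - 1))
				return old - 1;	/* page ref may be released */
		}
		return -1;	/* already zero: stale/duplicate token */
	}

	int main(void)
	{
		uref_get(3);
		uref_get(3);	/* two fragments on the same page */
		assert(uref_put(3) == 1);
		assert(uref_put(3) == 0);
		assert(uref_put(3) == -1);	/* extra DONTNEED ignored */
		return 0;
	}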
diff --git a/include/net/sock.h b/include/net/sock.h
index 896bec2d2176..304aad494764 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -575,7 +575,10 @@ struct sock {
 #endif
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
-	struct xarray		sk_user_frags;
+	struct {
+		struct net_devmem_dmabuf_binding *binding;
+		atomic_t *urefs;
+	} sk_user_frags;
 
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module		*sk_owner;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index b4c570d4f37a..1dae43934942 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -230,14 +230,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 		goto err_detach;
 	}
 
-	if (direction == DMA_TO_DEVICE) {
-		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
-					      sizeof(struct net_iov *),
-					      GFP_KERNEL);
-		if (!binding->vec) {
-			err = -ENOMEM;
-			goto err_unmap;
-		}
+	binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+				      sizeof(struct net_iov *),
+				      GFP_KERNEL | __GFP_ZERO);
+	if (!binding->vec) {
+		err = -ENOMEM;
+		goto err_unmap;
 	}
 
 	/* For simplicity we expect to make PAGE_SIZE allocations, but the
@@ -293,8 +291,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 			niov->owner = &owner->area;
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
-			if (direction == DMA_TO_DEVICE)
-				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
+			binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 2ada54fb63d7..d4eb28d079bb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -61,7 +61,7 @@ struct net_devmem_dmabuf_binding {
 
 	/* Array of net_iov pointers for this binding, sorted by virtual
 	 * address. This array is convenient to map the virtual addresses to
-	 * net_iovs in the TX path.
+	 * net_iovs.
 	 */
 	struct net_iov **vec;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 1f8ef4d8bcd9..15e198842b4a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -87,6 +87,7 @@
 
 #include
 #include
+#include
 #include
 #include
 #include
@@ -151,6 +152,7 @@
 #include
 
 #include "dev.h"
+#include "devmem.h"
 
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
@@ -1100,32 +1102,39 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
 		return -EFAULT;
 	}
 
-	xa_lock_bh(&sk->sk_user_frags);
 	for (i = 0; i < num_tokens; i++) {
 		for (j = 0; j < tokens[i].token_count; j++) {
+			struct net_iov *niov;
+			unsigned int token;
+			netmem_ref netmem;
+
+			token = tokens[i].token_start + j;
+			if (token >= sk->sk_user_frags.binding->dmabuf->size / PAGE_SIZE)
+				break;
+
 			if (++num_frags > MAX_DONTNEED_FRAGS)
 				goto frag_limit_reached;
-
-			netmem_ref netmem = (__force netmem_ref)__xa_erase(
-				&sk->sk_user_frags, tokens[i].token_start + j);
+			niov = sk->sk_user_frags.binding->vec[token];
+			netmem = net_iov_to_netmem(niov);
 
 			if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
 				continue;
 
+			if (atomic_dec_if_positive(&sk->sk_user_frags.urefs[token])
+			    < 0)
+				continue;
+
 			netmems[netmem_num++] = netmem;
 			if (netmem_num == ARRAY_SIZE(netmems)) {
-				xa_unlock_bh(&sk->sk_user_frags);
 				for (k = 0; k < netmem_num; k++)
 					WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 				netmem_num = 0;
-				xa_lock_bh(&sk->sk_user_frags);
 			}
 			ret++;
 		}
 	}
 
 frag_limit_reached:
-	xa_unlock_bh(&sk->sk_user_frags);
 	for (k = 0; k < netmem_num; k++)
 		WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7f9c671b1ee0..438b8132ed89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -261,6 +261,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -491,7 +492,8 @@ void tcp_init_sock(struct sock *sk)
 
 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	sk_sockets_allocated_inc(sk);
-	xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
+	sk->sk_user_frags.binding = NULL;
+	sk->sk_user_frags.urefs = NULL;
 }
 EXPORT_IPV6_MOD(tcp_init_sock);
 
@@ -2402,68 +2404,6 @@ static int tcp_inq_hint(struct sock *sk)
 	return inq;
 }
 
-/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */
-struct tcp_xa_pool {
-	u8		max; /* max <= MAX_SKB_FRAGS */
-	u8		idx; /* idx <= max */
-	__u32		tokens[MAX_SKB_FRAGS];
-	netmem_ref	netmems[MAX_SKB_FRAGS];
-};
-
-static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
-{
-	int i;
-
-	/* Commit part that has been copied to user space. */
-	for (i = 0; i < p->idx; i++)
-		__xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
-			     (__force void *)p->netmems[i], GFP_KERNEL);
-	/* Rollback what has been pre-allocated and is no longer needed. */
-	for (; i < p->max; i++)
-		__xa_erase(&sk->sk_user_frags, p->tokens[i]);
-
-	p->max = 0;
-	p->idx = 0;
-}
-
-static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
-{
-	if (!p->max)
-		return;
-
-	xa_lock_bh(&sk->sk_user_frags);
-
-	tcp_xa_pool_commit_locked(sk, p);
-
-	xa_unlock_bh(&sk->sk_user_frags);
-}
-
-static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
-			      unsigned int max_frags)
-{
-	int err, k;
-
-	if (p->idx < p->max)
-		return 0;
-
-	xa_lock_bh(&sk->sk_user_frags);
-
-	tcp_xa_pool_commit_locked(sk, p);
-
-	for (k = 0; k < max_frags; k++) {
-		err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
-				 XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
-		if (err)
-			break;
-	}
-
-	xa_unlock_bh(&sk->sk_user_frags);
-
-	p->max = k;
-	p->idx = 0;
-	return k ? 0 : err;
-}
-
 /* On error, returns the -errno. On success, returns number of bytes sent to the
  * user. May not consume all of @remaining_len.
  */
@@ -2472,14 +2412,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			      int remaining_len)
 {
 	struct dmabuf_cmsg dmabuf_cmsg = { 0 };
-	struct tcp_xa_pool tcp_xa_pool;
 	unsigned int start;
 	int i, copy, n;
 	int sent = 0;
 	int err = 0;
 
-	tcp_xa_pool.max = 0;
-	tcp_xa_pool.idx = 0;
 	do {
 		start = skb_headlen(skb);
 
@@ -2526,8 +2463,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 		 */
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			struct net_devmem_dmabuf_binding *binding;
 			struct net_iov *niov;
 			u64 frag_offset;
+			size_t len;
+			u32 token;
 			int end;
 
 			/* !skb_frags_readable() should indicate that ALL the
@@ -2560,13 +2500,39 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 				    start;
 			dmabuf_cmsg.frag_offset = frag_offset;
 			dmabuf_cmsg.frag_size = copy;
-			err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
-						 skb_shinfo(skb)->nr_frags - i);
-			if (err)
+
+			binding = net_devmem_iov_binding(niov);
+
+			if (!sk->sk_user_frags.binding) {
+				sk->sk_user_frags.binding = binding;
+
+				len = binding->dmabuf->size / PAGE_SIZE;
+				sk->sk_user_frags.urefs = kzalloc(len * sizeof(*sk->sk_user_frags.urefs),
+								  GFP_KERNEL);
+				if (!sk->sk_user_frags.urefs) {
+					sk->sk_user_frags.binding = NULL;
+					err = -ENOMEM;
+					goto out;
+				}
+
+				for (token = 0; token < len; token++)
+					atomic_set(&sk->sk_user_frags.urefs[token],
+						   0);
+
+				spin_lock_bh(&devmem_sockets_lock);
+				list_add(&sk->sk_devmem_list, &devmem_sockets_list);
+				spin_unlock_bh(&devmem_sockets_lock);
+			}
+
+			if (sk->sk_user_frags.binding != binding) {
+				err = -EFAULT;
 				goto out;
+			}
+
+			token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
+			dmabuf_cmsg.frag_token = token;
 
 			/* Will perform the exchange later */
-			dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
 			dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
 
 			offset += copy;
@@ -2579,8 +2545,9 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			if (err)
 				goto out;
 
+			atomic_inc(&sk->sk_user_frags.urefs[token]);
+
 			atomic_long_inc(&niov->pp_ref_count);
-			tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
 
 			sent += copy;
 
@@ -2590,7 +2557,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			start = end;
 		}
 
-		tcp_xa_pool_commit(sk, &tcp_xa_pool);
 		if (!remaining_len)
 			goto out;
 
@@ -2608,7 +2574,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 	}
 
 out:
-	tcp_xa_pool_commit(sk, &tcp_xa_pool);
 	if (!sent)
 		sent = err;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2a0602035729..68ebf96d06f8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -87,6 +87,9 @@
 #include
 #include
 
+#include
+#include "../core/devmem.h"
+
 #include
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -2525,11 +2528,37 @@ static int tcp_v4_init_sock(struct sock *sk)
 static void tcp_release_user_frags(struct sock *sk)
 {
 #ifdef CONFIG_PAGE_POOL
-	unsigned long index;
-	void *netmem;
+	struct net_devmem_dmabuf_binding *binding;
+	struct net_iov *niov;
+	unsigned int token;
+	netmem_ref netmem;
+
+	if (!sk->sk_user_frags.urefs)
+		return;
+
+	binding = sk->sk_user_frags.binding;
+	if (!binding || !binding->vec)
+		return;
+
+	for (token = 0; token < binding->dmabuf->size / PAGE_SIZE; token++) {
+		niov = binding->vec[token];
+
+		/* never used by recvmsg() */
+		if (!niov)
+			continue;
+
+		if (!net_is_devmem_iov(niov))
+			continue;
+
+		netmem = net_iov_to_netmem(niov);
 
-	xa_for_each(&sk->sk_user_frags, index, netmem)
-		WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
+		while (atomic_dec_return(&sk->sk_user_frags.urefs[token]) >= 0)
+			WARN_ON_ONCE(!napi_pp_put_page(netmem));
+	}
+
+	sk->sk_user_frags.binding = NULL;
+	kvfree(sk->sk_user_frags.urefs);
+	sk->sk_user_frags.urefs = NULL;
 #endif
 }
 
@@ -2539,8 +2568,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
 
 	tcp_release_user_frags(sk);
 
-	xa_destroy(&sk->sk_user_frags);
-
 	trace_tcp_destroy_sock(sk);
 
 	tcp_clear_xmit_timers(sk);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 7c2ae07d8d5d..6a44df3074df 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -630,8 +630,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
 
-	xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
-
 	return newsk;
 }
 EXPORT_SYMBOL(tcp_create_openreq_child);
-- 
2.47.3

From nobody Thu Oct 2 19:25:12 2025
From: Bobby Eshleman
Date: Thu, 11 Sep 2025 22:28:17 -0700
Subject: [PATCH net-next v2 3/3] net: ethtool: prevent user from breaking devmem single-binding rule
Message-Id: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-3-c80d735bd453@meta.com>
References: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
In-Reply-To: <20250911-scratch-bobbyeshleman-devmem-tcp-token-upstream-v2-0-c80d735bd453@meta.com>
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, Neal Cardwell, David Ahern
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
X-Mailer: b4 0.13.0

From: Bobby Eshleman
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Kuniyuki Iwashima , Willem de Bruijn , Neal Cardwell , David Ahern Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Stanislav Fomichev , Mina Almasry , Bobby Eshleman X-Mailer: b4 0.13.0 From: Bobby Eshleman Prevent the user from breaking devmem's single-binding rule by rejecting ethtool TCP/IP requests to modify or delete rules that will redirect a devmem socket to a queue with a different dmabuf binding. This is done in a "best effort" approach because not all steering rule types are validated. If an ethtool_rxnfc flow steering rule evaluates true for: 1) matching a devmem socket's ip addr 2) selecting a queue with a different dmabuf binding 3) is TCP/IP (v4 or v6) ... then reject the ethtool_rxnfc request with -EBUSY to indicate a devmem socket is using the current rules that steer it to its dmabuf binding. Non-TCP/IP rules are completely ignored, and if they do match a devmem flow then they can still break devmem sockets. For example, bytes 0 and 1 of L2 headers, etc... it is still unknown to me if these are possible to evaluate at the time of the ethtool call, and so are left to future work (or never, if not possible). FLOW_RSS rules which guide flows to an RSS context are also not evaluated yet. This seems feasible, but the correct path towards retrieving the RSS context and scanning the queues for dmabuf bindings seems unclear and maybe overkill (re-use parts of ethtool_get_rxnfc?). Signed-off-by: Bobby Eshleman --- include/net/sock.h | 1 + net/ethtool/ioctl.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++= ++++ net/ipv4/tcp.c | 9 ++++ net/ipv4/tcp_ipv4.c | 6 +++ 4 files changed, 160 insertions(+) diff --git a/include/net/sock.h b/include/net/sock.h index 304aad494764..73a1ff59dcde 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -579,6 +579,7 @@ struct sock { struct net_devmem_dmabuf_binding *binding; atomic_t *urefs; } sk_user_frags; + struct list_head sk_devmem_list; =20 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) struct module *sk_owner; diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c index 0b2a4d0573b3..99676ac9bbaa 100644 --- a/net/ethtool/ioctl.c +++ b/net/ethtool/ioctl.c @@ -29,11 +29,16 @@ #include #include #include +#include #include #include #include #include #include "common.h" +#include "../core/devmem.h" + +extern struct list_head devmem_sockets_list; +extern spinlock_t devmem_sockets_lock; =20 /* State held across locks and calls for commands which have devlink fallb= ack */ struct ethtool_devlink_compat { @@ -1169,6 +1174,142 @@ ethtool_get_rxfh_fields(struct net_device *dev, u32= cmd, void __user *useraddr) return ethtool_rxnfc_copy_to_user(useraddr, &info, info_size, NULL); } =20 +static bool +__ethtool_rx_flow_spec_breaks_devmem_sk(struct ethtool_rx_flow_spec *fs, + struct net_device *dev, + struct sock *sk) +{ + struct in6_addr saddr6, smask6, daddr6, dmask6; + struct sockaddr_storage saddr, daddr; + struct sockaddr_in6 *src6, *dst6; + struct sockaddr_in *src4, *dst4; + struct netdev_rx_queue *rxq; + __u32 flow_type; + + if (dev !=3D __sk_dst_get(sk)->dev) + return false; + + src6 =3D (struct sockaddr_in6 *)&saddr; + dst6 =3D (struct sockaddr_in6 *)&daddr; + src4 =3D (struct sockaddr_in *)&saddr; + dst4 =3D (struct sockaddr_in *)&daddr; + + if (sk->sk_family =3D=3D AF_INET6) { + src6->sin6_port =3D inet_sk(sk)->inet_sport; + src6->sin6_addr =3D inet6_sk(sk)->saddr; + dst6->sin6_port =3D inet_sk(sk)->inet_dport; + dst6->sin6_addr =3D 
diff --git a/include/net/sock.h b/include/net/sock.h
index 304aad494764..73a1ff59dcde 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -579,6 +579,7 @@ struct sock {
 		struct net_devmem_dmabuf_binding *binding;
 		atomic_t *urefs;
 	} sk_user_frags;
+	struct list_head	sk_devmem_list;
 
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module		*sk_owner;
diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index 0b2a4d0573b3..99676ac9bbaa 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -29,11 +29,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include "common.h"
+#include "../core/devmem.h"
+
+extern struct list_head devmem_sockets_list;
+extern spinlock_t devmem_sockets_lock;
 
 /* State held across locks and calls for commands which have devlink fallback */
 struct ethtool_devlink_compat {
@@ -1169,6 +1174,142 @@ ethtool_get_rxfh_fields(struct net_device *dev, u32 cmd, void __user *useraddr)
 	return ethtool_rxnfc_copy_to_user(useraddr, &info, info_size, NULL);
 }
 
+static bool
+__ethtool_rx_flow_spec_breaks_devmem_sk(struct ethtool_rx_flow_spec *fs,
+					struct net_device *dev,
+					struct sock *sk)
+{
+	struct in6_addr saddr6, smask6, daddr6, dmask6;
+	struct sockaddr_storage saddr, daddr;
+	struct sockaddr_in6 *src6, *dst6;
+	struct sockaddr_in *src4, *dst4;
+	struct netdev_rx_queue *rxq;
+	__u32 flow_type;
+
+	if (dev != __sk_dst_get(sk)->dev)
+		return false;
+
+	src6 = (struct sockaddr_in6 *)&saddr;
+	dst6 = (struct sockaddr_in6 *)&daddr;
+	src4 = (struct sockaddr_in *)&saddr;
+	dst4 = (struct sockaddr_in *)&daddr;
+
+	if (sk->sk_family == AF_INET6) {
+		src6->sin6_port = inet_sk(sk)->inet_sport;
+		src6->sin6_addr = inet6_sk(sk)->saddr;
+		dst6->sin6_port = inet_sk(sk)->inet_dport;
+		dst6->sin6_addr = sk->sk_v6_daddr;
+	} else {
+		src4->sin_port = inet_sk(sk)->inet_sport;
+		src4->sin_addr.s_addr = inet_sk(sk)->inet_saddr;
+		dst4->sin_port = inet_sk(sk)->inet_dport;
+		dst4->sin_addr.s_addr = inet_sk(sk)->inet_daddr;
+	}
+
+	flow_type = fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT | FLOW_RSS);
+
+	rxq = __netif_get_rx_queue(dev, fs->ring_cookie);
+	if (!rxq)
+		return false;
+
+	/* If the requested binding and the sk binding are equal then we know
+	 * this rule can't redirect to a different binding.
+	 */
+	if (rxq->mp_params.mp_priv == sk->sk_user_frags.binding)
+		return false;
+
+	/* Reject rules that redirect RX devmem sockets to a queue with a
+	 * different dmabuf binding. Because these sockets are on the RX side
+	 * (registered in the recvmsg() path), we compare the opposite
+	 * endpoints: the socket source with the rule destination, and the
+	 * socket destination with the rule source.
+	 *
+	 * Only perform checks on the simplest rules to check, that is, IP/TCP
+	 * rules. Flow hash options are not verified, so may still break TCP
+	 * devmem flows in theory (VLAN tag, bytes 0 and 1 of L4 header,
+	 * etc...). The author of this function was simply not sure how
+	 * to validate these at the time of the ethtool call.
+	 */
+	switch (flow_type) {
+	case IPV4_USER_FLOW: {
+		const struct ethtool_usrip4_spec *v4_usr_spec, *v4_usr_m_spec;
+
+		v4_usr_spec = &fs->h_u.usr_ip4_spec;
+		v4_usr_m_spec = &fs->m_u.usr_ip4_spec;
+
+		if (((v4_usr_spec->ip4src ^ dst4->sin_addr.s_addr) & v4_usr_m_spec->ip4src) ||
+		    (v4_usr_spec->ip4dst ^ src4->sin_addr.s_addr) & v4_usr_m_spec->ip4dst) {
+			return true;
+		}
+
+		return false;
+	}
+	case TCP_V4_FLOW: {
+		const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+
+		v4_spec = &fs->h_u.tcp_ip4_spec;
+		v4_m_spec = &fs->m_u.tcp_ip4_spec;
+
+		if (((v4_spec->ip4src ^ dst4->sin_addr.s_addr) & v4_m_spec->ip4src) ||
+		    ((v4_spec->ip4dst ^ src4->sin_addr.s_addr) & v4_m_spec->ip4dst))
+			return true;
+
+		return false;
+	}
+	case IPV6_USER_FLOW: {
+		const struct ethtool_usrip6_spec *v6_usr_spec, *v6_usr_m_spec;
+
+		v6_usr_spec = &fs->h_u.usr_ip6_spec;
+		v6_usr_m_spec = &fs->m_u.usr_ip6_spec;
+
+		memcpy(&daddr6, v6_usr_spec->ip6dst, sizeof(daddr6));
+		memcpy(&dmask6, v6_usr_m_spec->ip6dst, sizeof(dmask6));
+		memcpy(&saddr6, v6_usr_spec->ip6src, sizeof(saddr6));
+		memcpy(&smask6, v6_usr_m_spec->ip6src, sizeof(smask6));
+
+		return !ipv6_masked_addr_cmp(&saddr6, &smask6, &dst6->sin6_addr) &&
+		       !ipv6_masked_addr_cmp(&daddr6, &dmask6, &src6->sin6_addr);
+	}
+	case TCP_V6_FLOW: {
+		const struct ethtool_tcpip6_spec *v6_spec, *v6_m_spec;
+
+		v6_spec = &fs->h_u.tcp_ip6_spec;
+		v6_m_spec = &fs->m_u.tcp_ip6_spec;
+
+		memcpy(&daddr6, v6_spec->ip6dst, sizeof(daddr6));
+		memcpy(&dmask6, v6_m_spec->ip6dst, sizeof(dmask6));
+		memcpy(&saddr6, v6_spec->ip6src, sizeof(saddr6));
+		memcpy(&smask6, v6_m_spec->ip6src, sizeof(smask6));
+
+		return !ipv6_masked_addr_cmp(&daddr6, &dmask6, &src6->sin6_addr) &&
+		       !ipv6_masked_addr_cmp(&saddr6, &smask6, &dst6->sin6_addr);
+	}
+	default:
+		return false;
+	}
+}
+
+static bool
+ethtool_rx_flow_spec_breaks_devmem_sk(struct ethtool_rx_flow_spec *fs,
+				      struct net_device *dev)
+{
+	struct sock *sk;
+	bool ret;
+
+	ret = false;
+
+	spin_lock_bh(&devmem_sockets_lock);
+	list_for_each_entry(sk, &devmem_sockets_list, sk_devmem_list) {
+		if (__ethtool_rx_flow_spec_breaks_devmem_sk(fs, dev, sk)) {
+			ret = true;
+			break;
+		}
+	}
+	spin_unlock_bh(&devmem_sockets_lock);
+
+	return ret;
+}
+
 static noinline_for_stack int ethtool_set_rxnfc(struct net_device *dev,
						 u32 cmd, void __user *useraddr)
 {
@@ -1197,6 +1338,9 @@ static noinline_for_stack int ethtool_set_rxnfc(struct net_device *dev,
 		return -EINVAL;
 	}
 
+	if (ethtool_rx_flow_spec_breaks_devmem_sk(&info.fs, dev))
+		return -EBUSY;
+
 	rc = ops->set_rxnfc(dev, &info);
 	if (rc)
 		return rc;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 438b8132ed89..3f57e658ea80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -311,6 +311,12 @@ DEFINE_STATIC_KEY_FALSE(tcp_have_smc);
 EXPORT_SYMBOL(tcp_have_smc);
 #endif
 
+struct list_head devmem_sockets_list;
+EXPORT_SYMBOL_GPL(devmem_sockets_list);
+
+DEFINE_SPINLOCK(devmem_sockets_lock);
+EXPORT_SYMBOL_GPL(devmem_sockets_lock);
+
 /*
  * Current number of TCP sockets.
  */
@@ -5229,4 +5235,7 @@ void __init tcp_init(void)
 	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
 	tcp_tsq_work_init();
 	mptcp_init();
+
+	spin_lock_init(&devmem_sockets_lock);
+	INIT_LIST_HEAD(&devmem_sockets_list);
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 68ebf96d06f8..a3213c97aed9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -92,6 +92,9 @@
 
 #include
 
+extern struct list_head devmem_sockets_list;
+extern spinlock_t devmem_sockets_lock;
+
 #ifdef CONFIG_TCP_MD5SIG
 static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 			       __be32 daddr, __be32 saddr, const struct tcphdr *th);
@@ -2559,6 +2562,9 @@ static void tcp_release_user_frags(struct sock *sk)
 	sk->sk_user_frags.binding = NULL;
 	kvfree(sk->sk_user_frags.urefs);
 	sk->sk_user_frags.urefs = NULL;
+	spin_lock_bh(&devmem_sockets_lock);
+	list_del(&sk->sk_devmem_list);
+	spin_unlock_bh(&devmem_sockets_lock);
 #endif
 }
 
-- 
2.47.3