From nobody Mon May 25 18:05:09 2026
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 1/9] nvmet-tcp: check return value of set_queue_sock
Date: Sat, 16 May 2026 16:27:49 +0800
Message-ID: <6fe186acd68e1083e32fda1037f710f42539e660.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

The return value of nvmet_tcp_set_queue_sock() is currently ignored in
nvmet_tcp_tls_handshake_done(). If it fails (e.g., due to concurrent port
removal), the socket callbacks are never installed, and the queue and its
socket are leaked.

Fix this by capturing the return value and calling
nvmet_tcp_schedule_release_queue() on failure, so that the queue is
cleaned up properly.

Cc: Hannes Reinecke
Cc: zhenwei pi
Cc: Hui Zhu
Cc: Gang Yan
Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Geliang Tang
---
 drivers/nvme/target/tcp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 164a564ba3b4..8a243d22a511 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1842,10 +1842,11 @@ static void nvmet_tcp_tls_handshake_done(void *data, int status,
 	if (!status)
 		status = nvmet_tcp_tls_key_lookup(queue, peerid);
 
+	if (!status)
+		status = nvmet_tcp_set_queue_sock(queue);
+
 	if (status)
 		nvmet_tcp_schedule_release_queue(queue);
-	else
-		nvmet_tcp_set_queue_sock(queue);
 	kref_put(&queue->kref, nvmet_tcp_release_queue);
 }
 
-- 
2.53.0
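The fixed control flow can be modelled in a few lines of userspace C. This is a hypothetical sketch, not kernel code: struct model_queue, model_set_queue_sock(), model_schedule_release() and model_handshake_done() are stand-ins invented for illustration. It shows the pattern the patch adopts: each step runs only while status is still zero, and any non-zero status, including one returned by the last step, ends in the release path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nvmet_tcp_queue. */
struct model_queue {
	int set_sock_ret;       /* value the fake set_queue_sock() returns */
	bool callbacks_set;     /* socket callbacks were installed */
	bool release_scheduled; /* release work was queued */
};

/* Stand-in for nvmet_tcp_set_queue_sock(): installs callbacks on success. */
static int model_set_queue_sock(struct model_queue *q)
{
	if (q->set_sock_ret == 0)
		q->callbacks_set = true;
	return q->set_sock_ret;
}

/* Stand-in for nvmet_tcp_schedule_release_queue(). */
static void model_schedule_release(struct model_queue *q)
{
	q->release_scheduled = true;
}

/*
 * Same shape as the fixed nvmet_tcp_tls_handshake_done(): the socket is
 * only set up if the handshake succeeded, and a failure of that last step
 * is folded into 'status' so it also reaches the release path.
 */
static void model_handshake_done(struct model_queue *q, int status)
{
	if (!status)
		status = model_set_queue_sock(q);

	if (status)
		model_schedule_release(q);
}
```

Before the fix, the equivalent of model_set_queue_sock() sat in an else branch and its result was dropped, so a failing call left the queue neither running nor released.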
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 2/9] nvmet-tcp: define target tcp_proto struct
Date: Sat, 16 May 2026 16:27:50 +0800
Message-ID: <2848552e4e11e9a4ff5d09bddb56051f33795c72.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

To add MPTCP support to "NVMe over TCP", the target side needs to pass
IPPROTO_MPTCP instead of IPPROTO_TCP to sock_create() so that an MPTCP
socket is created. In addition, the socket option helpers applied to this
socket need to be switched to a set of MPTCP-specific functions.

This patch defines struct nvmet_tcp_proto, which holds the socket
protocol, a set of function pointers for these socket option operations,
and the nvmet_fabrics_ops to use. A "proto" field is also added to struct
nvmet_tcp_port.
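As a rough userspace analogue of struct nvmet_tcp_proto, the sketch below bundles a protocol number with option setters and creates a socket only through that table. All names (struct sock_proto, tcp_proto, proto_socket()) are hypothetical; it uses the userspace setsockopt() API rather than the in-kernel helpers, and covers only two of the five callbacks.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262 /* Linux value; older C libraries may lack it */
#endif

/* Hypothetical analogue of struct nvmet_tcp_proto: protocol + setters. */
struct sock_proto {
	int protocol;
	int (*set_reuseaddr)(int fd);
	int (*set_nodelay)(int fd);
};

static int set_int_opt(int fd, int level, int name, int val)
{
	return setsockopt(fd, level, name, &val, sizeof(val));
}

static int tcp_set_reuseaddr(int fd)
{
	return set_int_opt(fd, SOL_SOCKET, SO_REUSEADDR, 1);
}

static int tcp_set_nodelay(int fd)
{
	return set_int_opt(fd, IPPROTO_TCP, TCP_NODELAY, 1);
}

static const struct sock_proto tcp_proto = {
	.protocol = IPPROTO_TCP,
	.set_reuseaddr = tcp_set_reuseaddr,
	.set_nodelay = tcp_set_nodelay,
};

/*
 * Like nvmet_tcp_add_port() after this patch: the protocol number and the
 * option setters all come from the table, never from hard-coded TCP calls.
 */
static int proto_socket(const struct sock_proto *p)
{
	int fd = socket(AF_INET, SOCK_STREAM, p->protocol);

	if (fd < 0)
		return -1;
	if (p->set_reuseaddr(fd) || p->set_nodelay(fd)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

In userspace, setsockopt() on an IPPROTO_MPTCP socket is handled by the MPTCP layer itself, so an MPTCP entry could likely reuse these setters and differ only in .protocol. Inside the kernel, helpers such as sock_set_reuseaddr() act directly on the struct sock they are given, which is presumably why the series carries separate MPTCP callbacks.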
A TCP-specific instance of struct nvmet_tcp_proto, nvmet_tcp_proto, is
defined. nvmet_tcp_add_port() sets port->proto to it when trtype is TCP
and fails with -EINVAL otherwise. All locations that previously called
the TCP socket option helpers directly now call the corresponding
function pointers in port->proto, and sock_create() uses
port->proto->protocol.

nvmet_tcp_done_recv_pdu() now passes port->proto->ops to
nvmet_req_init() instead of the hard-coded nvmet_tcp_ops, so the
nvmet_fabrics_ops is selected according to the protocol type.

RCU protection is added when accessing queue->port in the I/O path, to
prevent a use-after-free when a port is removed while asynchronous
operations are still pending. nvmet_tcp_destroy_port_queues() clears
queue->port with rcu_assign_pointer() and then calls synchronize_rcu(),
so the port structure is only released after all RCU readers have
completed.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/target/tcp.c | 104 +++++++++++++++++++++++++++++++++-----
 1 file changed, 91 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 8a243d22a511..72cba7e0df7a 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "nvmet.h"
@@ -147,7 +148,7 @@ enum nvmet_tcp_queue_state {
 
 struct nvmet_tcp_queue {
 	struct socket *sock;
-	struct nvmet_tcp_port *port;
+	struct nvmet_tcp_port __rcu *port;
 	struct work_struct io_work;
 	struct nvmet_cq nvme_cq;
 	struct nvmet_sq nvme_sq;
@@ -198,12 +199,23 @@ struct nvmet_tcp_queue {
 	void (*write_space)(struct sock *);
 };
 
+struct nvmet_tcp_proto {
+	int protocol;
+	void (*set_reuseaddr)(struct sock *sk);
+	void (*set_nodelay)(struct sock *sk);
+	void (*set_priority)(struct sock *sk, u32 priority);
+	void (*no_linger)(struct sock *sk);
+	void (*set_tos)(struct sock *sk, int val);
+	const struct nvmet_fabrics_ops *ops;
+};
+
 struct nvmet_tcp_port {
 	struct socket *sock;
 	struct work_struct accept_work;
 	struct nvmet_port *nport;
 	struct sockaddr_storage addr;
 	void (*data_ready)(struct sock *);
+	const struct nvmet_tcp_proto *proto;
 };
 
 static DEFINE_IDA(nvmet_tcp_queue_ida);
@@ -1044,6 +1056,7 @@ static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
 {
 	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
 	struct nvme_command *nvme_cmd = &queue->pdu.cmd.cmd;
+	struct nvmet_tcp_port *port;
 	struct nvmet_req *req;
 	int ret;
 
@@ -1081,7 +1094,14 @@ static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
 	req = &queue->cmd->req;
 	memcpy(req->cmd, nvme_cmd, sizeof(*nvme_cmd));
 
-	if (unlikely(!nvmet_req_init(req, &queue->nvme_sq, &nvmet_tcp_ops))) {
+	rcu_read_lock();
+	port = rcu_dereference(queue->port);
+	if (!port || !port->proto) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	if (unlikely(!nvmet_req_init(req, &queue->nvme_sq, port->proto->ops))) {
+		rcu_read_unlock();
 		pr_err("failed cmd %p id %d opcode %d, data_len: %d, status: %04x\n",
 			req->cmd, req->cmd->common.command_id,
 			req->cmd->common.opcode,
@@ -1090,6 +1110,7 @@ static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
 
 		return nvmet_tcp_handle_req_failure(queue, queue->cmd, req);
 	}
+	rcu_read_unlock();
 
 	ret = nvmet_tcp_map_data(queue->cmd);
 	if (unlikely(ret)) {
@@ -1468,6 +1489,8 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
 	u8 hdgst = nvmet_tcp_hdgst_len(queue);
 
 	c->queue = queue;
+	if (!queue->port || !queue->port->nport)
+		return -EINVAL;
 	c->req.port = queue->port->nport;
 
 	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
@@ -1697,6 +1720,8 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 {
 	struct socket *sock = queue->sock;
 	struct inet_sock *inet = inet_sk(sock->sk);
+	const struct nvmet_tcp_proto *proto;
+	struct nvmet_tcp_port *port;
 	int ret;
 
 	ret = kernel_getsockname(sock,
@@ -1709,19 +1734,29 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 	if (ret < 0)
 		return ret;
 
+	rcu_read_lock();
+	port = rcu_dereference(queue->port);
+	if (!port || !port->proto ||
+	    port->proto->protocol != sock->sk->sk_protocol) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	proto = port->proto;
+	rcu_read_unlock();
+
 	/*
 	 * Cleanup whatever is sitting in the TCP transmit queue on socket
 	 * close. This is done to prevent stale data from being sent should
 	 * the network connection be restored before TCP times out.
 	 */
-	sock_no_linger(sock->sk);
+	proto->no_linger(sock->sk);
 
 	if (so_priority > 0)
-		sock_set_priority(sock->sk, so_priority);
+		proto->set_priority(sock->sk, so_priority);
 
 	/* Set socket type of service */
 	if (inet->rcv_tos > 0)
-		ip_sock_set_tos(sock->sk, inet->rcv_tos);
+		proto->set_tos(sock->sk, inet->rcv_tos);
 
 	ret = 0;
 	write_lock_bh(&sock->sk->sk_callback_lock);
@@ -1752,6 +1787,7 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 static int nvmet_tcp_try_peek_pdu(struct nvmet_tcp_queue *queue)
 {
 	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
+	struct nvmet_tcp_port *port;
 	int len, ret;
 	struct kvec iov = {
 		.iov_base = (u8 *)&queue->pdu + queue->offset,
@@ -1764,8 +1800,18 @@ static int nvmet_tcp_try_peek_pdu(struct nvmet_tcp_queue *queue)
 		.msg_flags = MSG_PEEK,
 	};
 
-	if (nvmet_port_secure_channel_required(queue->port->nport))
+	rcu_read_lock();
+	port = rcu_dereference(queue->port);
+	if (!port || !port->nport) {
+		rcu_read_unlock();
 		return 0;
+	}
+
+	if (nvmet_port_secure_channel_required(port->nport)) {
+		rcu_read_unlock();
+		return 0;
+	}
+	rcu_read_unlock();
 
 	len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
 			iov.iov_len, msg.msg_flags);
@@ -1876,19 +1922,30 @@ static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
 {
 	int ret = -EOPNOTSUPP;
 	struct tls_handshake_args args;
+	struct nvmet_tcp_port *port;
+	key_serial_t keyring;
 
 	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
 		pr_warn("cannot start TLS in state %d\n", queue->state);
 		return -EINVAL;
 	}
 
+	rcu_read_lock();
+	port = rcu_dereference(queue->port);
+	if (!port || !port->nport || !port->nport->keyring) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	keyring = key_serial(port->nport->keyring);
+	rcu_read_unlock();
+
 	kref_get(&queue->kref);
 	pr_debug("queue %d: TLS ServerHello\n", queue->idx);
 	memset(&args, 0, sizeof(args));
 	args.ta_sock = queue->sock;
 	args.ta_done = nvmet_tcp_tls_handshake_done;
 	args.ta_data = queue;
-	args.ta_keyring = key_serial(queue->port->nport->keyring);
+	args.ta_keyring = keyring;
 	args.ta_timeout_ms = tls_handshake_timeout * 1000;
 
 	ret = tls_server_hello_psk(&args, GFP_KERNEL);
@@ -2042,6 +2099,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
 	read_unlock_bh(&sk->sk_callback_lock);
 }
 
+static const struct nvmet_tcp_proto nvmet_tcp_proto = {
+	.protocol = IPPROTO_TCP,
+	.set_reuseaddr = sock_set_reuseaddr,
+	.set_nodelay = tcp_sock_set_nodelay,
+	.set_priority = sock_set_priority,
+	.no_linger = sock_no_linger,
+	.set_tos = ip_sock_set_tos,
+	.ops = &nvmet_tcp_ops,
+};
+
 static int nvmet_tcp_add_port(struct nvmet_port *nport)
 {
 	struct nvmet_tcp_port *port;
@@ -2066,6 +2133,13 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 		goto err_port;
 	}
 
+	if (nport->disc_addr.trtype == NVMF_TRTYPE_TCP) {
+		port->proto = &nvmet_tcp_proto;
+	} else {
+		ret = -EINVAL;
+		goto err_port;
+	}
+
 	ret = inet_pton_with_scope(&init_net, af, nport->disc_addr.traddr,
 			nport->disc_addr.trsvcid, &port->addr);
 	if (ret) {
@@ -2080,7 +2154,7 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 	port->nport->inline_data_size = NVMET_TCP_DEF_INLINE_DATA_SIZE;
 
 	ret = sock_create(port->addr.ss_family, SOCK_STREAM,
-				IPPROTO_TCP, &port->sock);
+				port->proto->protocol, &port->sock);
 	if (ret) {
 		pr_err("failed to create a socket\n");
 		goto err_port;
@@ -2089,10 +2163,10 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 	port->sock->sk->sk_user_data = port;
 	port->data_ready = port->sock->sk->sk_data_ready;
 	port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
-	sock_set_reuseaddr(port->sock->sk);
-	tcp_sock_set_nodelay(port->sock->sk);
+	port->proto->set_reuseaddr(port->sock->sk);
+	port->proto->set_nodelay(port->sock->sk);
 	if (so_priority > 0)
-		sock_set_priority(port->sock->sk, so_priority);
+		port->proto->set_priority(port->sock->sk, so_priority);
 
 	ret = kernel_bind(port->sock, (struct sockaddr_unsized *)&port->addr,
 			sizeof(port->addr));
@@ -2125,10 +2199,14 @@ static void nvmet_tcp_destroy_port_queues(struct nvmet_tcp_port *port)
 	struct nvmet_tcp_queue *queue;
 
 	mutex_lock(&nvmet_tcp_queue_mutex);
-	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list)
-		if (queue->port == port)
+	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list) {
+		if (rcu_access_pointer(queue->port) == port) {
+			rcu_assign_pointer(queue->port, NULL);
 			kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+		}
+	}
 	mutex_unlock(&nvmet_tcp_queue_mutex);
+	synchronize_rcu();
 }
 
 static void nvmet_tcp_remove_port(struct nvmet_port *nport)
-- 
2.53.0
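The lifetime rule this patch applies to queue->port (readers copy what they need inside a read-side section; the remover unpublishes the pointer and waits for those readers before freeing) can be sketched in userspace with C11 atomics. This is a hypothetical model, not RCU: a single global reader counter stands in for synchronize_rcu(), which keeps it short and correct under the default sequentially consistent ordering, but makes every reader contend on one counter. struct port, queue_port, read_protocol() and remove_port() are invented names.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nvmet_tcp_port. */
struct port {
	int protocol;
};

static _Atomic(struct port *) queue_port; /* plays the role of queue->port */
static atomic_int readers;                /* readers inside a read section */

static void read_lock(void)   { atomic_fetch_add(&readers, 1); }
static void read_unlock(void) { atomic_fetch_sub(&readers, 1); }

/*
 * Reader, like the rcu_dereference() sites in the patch: snapshot the
 * pointer, copy the needed field, and never touch the port after
 * read_unlock(). Returns -1 once the port has been unpublished.
 */
static int read_protocol(void)
{
	struct port *p;
	int proto = -1;

	read_lock();
	p = atomic_load(&queue_port);
	if (p)
		proto = p->protocol;
	read_unlock();
	return proto;
}

/*
 * Remover, like rcu_assign_pointer(queue->port, NULL) followed by
 * synchronize_rcu() and the final free: after the exchange no new reader
 * can see 'old', and the wait covers readers that may already hold it.
 */
static void remove_port(void)
{
	struct port *old = atomic_exchange(&queue_port, NULL);

	while (atomic_load(&readers) != 0)
		sched_yield();
	free(old);
}
```

The same reasoning explains the snapshot copies in the patch, such as keyring in nvmet_tcp_tls_handshake() and proto in nvmet_tcp_set_queue_sock(): only values copied out inside the read-side section are used after it ends.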
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 3/9] nvmet-tcp: register target mptcp transport
Date: Sat, 16 May 2026 16:27:51 +0800
Message-ID: <29e02d79546afeeb64794f93c175f661e99997cf.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

This patch adds a new NVMe target transport type, NVMF_TRTYPE_MPTCP (4),
for MPTCP, and registers it in the configfs transport name table as
"mptcp".

It also defines a new nvmet_fabrics_ops, nvmet_mptcp_ops, which is the
same as nvmet_tcp_ops except for .type. nvmet_mptcp_ops is registered in
nvmet_tcp_init() and unregistered in nvmet_tcp_exit(); if its
registration fails, the TCP transport is unregistered again. The MPTCP
transport is only built when CONFIG_MPTCP is enabled. A MODULE_ALIAS for
"nvmet-transport-4" is also added.

v2:
 - use trtype instead of tsas (Hannes).
v3:
 - check mptcp protocol from disc_addr.trtype instead of passing a
   parameter (Hannes).
v4:
 - check CONFIG_MPTCP.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/target/configfs.c |  1 +
 drivers/nvme/target/tcp.c      | 27 +++++++++++++++++++++++++++
 include/linux/nvme.h           |  1 +
 3 files changed, 29 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index b88f897f06e2..51fc0f4d0c32 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -37,6 +37,7 @@ static struct nvmet_type_name_map nvmet_transport[] = {
 	{ NVMF_TRTYPE_RDMA, "rdma" },
 	{ NVMF_TRTYPE_FC, "fc" },
 	{ NVMF_TRTYPE_TCP, "tcp" },
+	{ NVMF_TRTYPE_MPTCP, "mptcp" },
 	{ NVMF_TRTYPE_PCI, "pci" },
 	{ NVMF_TRTYPE_LOOP, "loop" },
 };
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 72cba7e0df7a..9ec64bf0a86f 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -2310,6 +2310,21 @@ static const struct nvmet_fabrics_ops nvmet_tcp_ops = {
 	.host_traddr = nvmet_tcp_host_port_addr,
 };
 
+#ifdef CONFIG_MPTCP
+static const struct nvmet_fabrics_ops nvmet_mptcp_ops = {
+	.owner = THIS_MODULE,
+	.type = NVMF_TRTYPE_MPTCP,
+	.msdbd = 1,
+	.add_port = nvmet_tcp_add_port,
+	.remove_port = nvmet_tcp_remove_port,
+	.queue_response = nvmet_tcp_queue_response,
+	.delete_ctrl = nvmet_tcp_delete_ctrl,
+	.install_queue = nvmet_tcp_install_queue,
+	.disc_traddr = nvmet_tcp_disc_port_addr,
+	.host_traddr = nvmet_tcp_host_port_addr,
+};
+#endif
+
 static int __init nvmet_tcp_init(void)
 {
 	int ret;
@@ -2323,6 +2338,14 @@ static int __init nvmet_tcp_init(void)
 	if (ret)
 		goto err;
 
+#ifdef CONFIG_MPTCP
+	ret = nvmet_register_transport(&nvmet_mptcp_ops);
+	if (ret) {
+		nvmet_unregister_transport(&nvmet_tcp_ops);
+		goto err;
+	}
+#endif
+
 	return 0;
 err:
 	destroy_workqueue(nvmet_tcp_wq);
@@ -2333,6 +2356,9 @@ static void __exit nvmet_tcp_exit(void)
 {
 	struct nvmet_tcp_queue *queue;
 
+#ifdef CONFIG_MPTCP
+	nvmet_unregister_transport(&nvmet_mptcp_ops);
+#endif
 	nvmet_unregister_transport(&nvmet_tcp_ops);
 
 	flush_workqueue(nvmet_wq);
@@ -2352,3 +2378,4 @@ module_exit(nvmet_tcp_exit);
 MODULE_DESCRIPTION("NVMe target TCP transport driver");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("nvmet-transport-3"); /* 3 == NVMF_TRTYPE_TCP */
+MODULE_ALIAS("nvmet-transport-4"); /* 4 == NVMF_TRTYPE_MPTCP */
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 041f30931a90..0eada1e0c652 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -68,6 +68,7 @@ enum {
 	NVMF_TRTYPE_RDMA = 1, /* RDMA */
 	NVMF_TRTYPE_FC = 2, /* Fibre Channel */
 	NVMF_TRTYPE_TCP = 3, /* TCP/IP */
+	NVMF_TRTYPE_MPTCP = 4, /* Multipath TCP */
 	NVMF_TRTYPE_LOOP = 254, /* Reserved for host usage */
 	NVMF_TRTYPE_MAX,
 };
-- 
2.53.0
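The init/exit ordering can be modelled as below. This is a hypothetical sketch: register_transport(), unregister_transport(), the registered[] table and the fail_type knob are stand-ins for nvmet_register_transport() and friends, with an injectable failure so the unwind path can be exercised. transport_alias() only formats the "nvmet-transport-<trtype>" string that the MODULE_ALIAS() lines declare; that the nvmet core requests transport modules by this name is my understanding of how the aliases are used, not something this patch shows.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Values mirror the enum entries in include/linux/nvme.h. */
enum { TRTYPE_TCP = 3, TRTYPE_MPTCP = 4 };

/* Hypothetical registry standing in for the nvmet transport table. */
static bool registered[8];
static int fail_type = -1; /* transport type whose registration should fail */

static int register_transport(int type)
{
	if (type == fail_type || registered[type])
		return -EINVAL;
	registered[type] = true;
	return 0;
}

static void unregister_transport(int type)
{
	registered[type] = false;
}

/* Same shape as the new nvmet_tcp_init(): undo TCP if MPTCP fails. */
static int model_init(void)
{
	int ret = register_transport(TRTYPE_TCP);

	if (ret)
		return ret;
	ret = register_transport(TRTYPE_MPTCP);
	if (ret) {
		unregister_transport(TRTYPE_TCP);
		return ret;
	}
	return 0;
}

/* Reverse order of registration, as in nvmet_tcp_exit(). */
static void model_exit(void)
{
	unregister_transport(TRTYPE_MPTCP);
	unregister_transport(TRTYPE_TCP);
}

/* Formats the alias string declared by MODULE_ALIAS() for a trtype. */
static void transport_alias(int trtype, char *buf, size_t len)
{
	snprintf(buf, len, "nvmet-transport-%d", trtype);
}
```

Keeping the unwind inside nvmet_tcp_init() means the shared err: label only has to destroy the workqueue, whichever registration failed.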
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 4/9] nvmet-tcp: implement target mptcp proto
Date: Sat, 16 May 2026 16:27:52 +0800

From: Geliang Tang

This patch implements MPTCP support for the NVMe target transport type
NVMF_TRTYPE_MPTCP introduced in the previous patch.

An MPTCP-specific instance of struct nvmet_tcp_proto, nvmet_mptcp_proto,
is implemented and assigned to port->proto when the transport type is
MPTCP.

Dedicated MPTCP helpers are introduced for setting socket options. Each
helper sets the option on the MPTCP socket itself and on the first
subflow socket of the MPTCP connection, and bumps the sockopt sequence
number. sync_socket_options() now also copies sk_reuse and sk_priority,
so these values are propagated to subflows created later.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/target/tcp.c |  19 +++++++
 include/net/mptcp.h       |  20 +++++++
 net/mptcp/sockopt.c       | 106 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 9ec64bf0a86f..931f78473506 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -224,6 +224,9 @@ static DEFINE_MUTEX(nvmet_tcp_queue_mutex);
 
 static struct workqueue_struct *nvmet_tcp_wq;
 static const struct nvmet_fabrics_ops nvmet_tcp_ops;
+#ifdef CONFIG_MPTCP
+static const struct nvmet_fabrics_ops nvmet_mptcp_ops;
+#endif
 static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
 static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
 
@@ -2109,6 +2112,18 @@ static const struct nvmet_tcp_proto nvmet_tcp_proto = {
 	.ops = &nvmet_tcp_ops,
 };
 
+#ifdef CONFIG_MPTCP
+static const struct nvmet_tcp_proto nvmet_mptcp_proto = {
+	.protocol = IPPROTO_MPTCP,
+	.set_reuseaddr = mptcp_sock_set_reuseaddr,
+	.set_nodelay = mptcp_sock_set_nodelay,
+	.set_priority = mptcp_sock_set_priority,
+	.no_linger = mptcp_sock_no_linger,
+	.set_tos = mptcp_sock_set_tos,
+	.ops = &nvmet_mptcp_ops,
+};
+#endif
+
 static int nvmet_tcp_add_port(struct nvmet_port *nport)
 {
 	struct nvmet_tcp_port *port;
@@ -2135,6 +2150,10 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 
 	if (nport->disc_addr.trtype == NVMF_TRTYPE_TCP) {
 		port->proto = &nvmet_tcp_proto;
+#ifdef CONFIG_MPTCP
+	} else if (nport->disc_addr.trtype == NVMF_TRTYPE_MPTCP) {
+		port->proto = &nvmet_mptcp_proto;
+#endif
 	} else {
 		ret = -EINVAL;
 		goto err_port;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 4cf59e83c1c5..91ce7b9b639d 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -237,6 +237,16 @@ static inline __be32 mptcp_reset_option(const struct sk_buff *skb)
 }
 
 void mptcp_active_detect_blackhole(struct sock *sk, bool expired);
+
+void mptcp_sock_set_reuseaddr(struct sock *sk);
+
+void mptcp_sock_set_nodelay(struct sock *sk);
+
+void mptcp_sock_set_priority(struct sock *sk, u32 priority);
+
+void mptcp_sock_no_linger(struct sock *sk);
+
+void mptcp_sock_set_tos(struct sock *sk, int val);
 #else
 
 static inline void mptcp_init(void)
@@ -323,6 +333,16 @@ static inline struct request_sock *mptcp_subflow_reqsk_alloc(const struct reques
 static inline __be32 mptcp_reset_option(const struct sk_buff *skb) { return htonl(0u); }
 
 static inline void mptcp_active_detect_blackhole(struct sock *sk, bool expired) { }
+
+static inline void mptcp_sock_set_reuseaddr(struct sock *sk) { }
+
+static inline void mptcp_sock_set_nodelay(struct sock *sk) { }
+
+static inline void mptcp_sock_set_priority(struct sock *sk, u32 priority) { }
+
+static inline void mptcp_sock_no_linger(struct sock *sk) { }
+
+static inline void mptcp_sock_set_tos(struct sock *sk, int val) { }
 #endif /* CONFIG_MPTCP */
 
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 87b5796d0135..062ed4a43e5a 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1547,6 +1547,7 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 	static const unsigned int tx_rx_locks = SOCK_RCVBUF_LOCK | SOCK_SNDBUF_LOCK;
 	struct sock *sk = (struct sock *)msk;
 	bool keep_open;
+	u32 priority;
 
 	keep_open = sock_flag(sk, SOCK_KEEPOPEN);
 	if (ssk->sk_prot->keepalive)
@@ -1596,6 +1597,11 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 	inet_assign_bit(FREEBIND, ssk, inet_test_bit(FREEBIND, sk));
 	inet_assign_bit(BIND_ADDRESS_NO_PORT, ssk, inet_test_bit(BIND_ADDRESS_NO_PORT, sk));
 	WRITE_ONCE(inet_sk(ssk)->local_port_range, READ_ONCE(inet_sk(sk)->local_port_range));
+
+	ssk->sk_reuse = sk->sk_reuse;
+	priority = READ_ONCE(sk->sk_priority);
+	if (priority > 0)
+		sock_set_priority(ssk, priority);
 }
 
 void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
@@ -1662,3 +1668,103 @@ int mptcp_set_rcvlowat(struct sock *sk, int val)
 	}
 	return 0;
 }
+
+void mptcp_sock_set_reuseaddr(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	sk->sk_reuse = SK_CAN_REUSE;
+	ssk = __mptcp_nmpc_sk(msk);
+	if (IS_ERR(ssk))
+		goto unlock;
+	lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+	ssk->sk_reuse = SK_CAN_REUSE;
+	release_sock(ssk);
+unlock:
+	release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_reuseaddr);
+
+void mptcp_sock_set_nodelay(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	msk->nodelay = true;
+	ssk = __mptcp_nmpc_sk(msk);
+	if (IS_ERR(ssk))
+		goto unlock;
+	lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+	__tcp_sock_set_nodelay(ssk, true);
+	release_sock(ssk);
+unlock:
+	release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_nodelay);
+
+void mptcp_sock_set_priority(struct sock *sk, u32 priority)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	sock_set_priority(sk, priority);
+	ssk = READ_ONCE(msk->first);
+	if (ssk) {
+		sock_hold(ssk);
+		lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+		sock_set_priority(ssk, priority);
+		release_sock(ssk);
+		sock_put(ssk);
+	}
+	release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_priority);
+
+void mptcp_sock_no_linger(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	WRITE_ONCE(sk->sk_lingertime, 0);
+	sock_set_flag(sk, SOCK_LINGER);
+	ssk = READ_ONCE(msk->first);
+	if (ssk) {
+		sock_hold(ssk);
+		lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+		WRITE_ONCE(ssk->sk_lingertime, 0);
+		sock_set_flag(ssk, SOCK_LINGER);
+		release_sock(ssk);
+		sock_put(ssk);
+	}
+	release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_no_linger);
+
+void mptcp_sock_set_tos(struct sock *sk, int val)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	__ip_sock_set_tos(sk, val);
+	ssk = READ_ONCE(msk->first);
+	if (ssk) {
+		sock_hold(ssk);
+		lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+		__ip_sock_set_tos(ssk, val);
+		release_sock(ssk);
+		sock_put(ssk);
+	}
+	release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_tos);
-- 
2.53.0
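The two-step propagation (set on the MPTCP socket and its first subflow now, copy to later subflows in sync_socket_options()) can be modelled without locking as below. This is a hypothetical sketch: struct model_msk, subflow[0] standing in for the first subflow, and the model_*() functions are invented for illustration, and only sk_reuse and sk_priority are covered. It also ignores that __mptcp_nmpc_sk() may create the first subflow on demand.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SUBFLOWS 4

/* Only the two options that sync_socket_options() now propagates. */
struct opts {
	int reuse;
	uint32_t priority;
};

struct model_msk {
	struct opts opts;                  /* MPTCP-level socket */
	struct opts subflow[MAX_SUBFLOWS]; /* subflow[0] plays msk->first */
	int nr_subflows;
	unsigned int sockopt_seq;          /* like sockopt_seq_inc() */
};

/* Like mptcp_sock_set_reuseaddr(): MPTCP socket plus first subflow. */
static void model_set_reuseaddr(struct model_msk *msk)
{
	msk->sockopt_seq++;
	msk->opts.reuse = 1;
	if (msk->nr_subflows > 0)
		msk->subflow[0].reuse = 1;
}

/* Like mptcp_sock_set_priority(): MPTCP socket plus first subflow. */
static void model_set_priority(struct model_msk *msk, uint32_t prio)
{
	msk->sockopt_seq++;
	msk->opts.priority = prio;
	if (msk->nr_subflows > 0)
		msk->subflow[0].priority = prio;
}

/*
 * Like the additions to sync_socket_options(): a subflow created later
 * inherits the MPTCP-level values. Returns the new index, or -1 if full.
 */
static int model_add_subflow(struct model_msk *msk)
{
	struct opts *ssk;

	if (msk->nr_subflows == MAX_SUBFLOWS)
		return -1;
	ssk = &msk->subflow[msk->nr_subflows++];
	ssk->reuse = msk->opts.reuse;
	if (msk->opts.priority > 0)
		ssk->priority = msk->opts.priority;
	return msk->nr_subflows - 1;
}
```

Without the sync_socket_options() hunk, the second subflow in this model would keep reuse == 0 and priority == 0, which is the gap the added copies close.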
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 5/9] nvme-tcp: define host tcp_proto struct
Date: Sat, 16 May 2026 16:27:53 +0800
Message-ID: <6560edecfd204e47e3e042f448928a550116dacf.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

To add MPTCP support in "NVMe over TCP", the host side needs to pass
IPPROTO_MPTCP to sock_create_kern() instead of IPPROTO_TCP to create an
MPTCP socket.
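The switch described here, picking IPPROTO_TCP or IPPROTO_MPTCP (and the matching socket-option helpers) from the transport string, can be illustrated with a small userspace sketch of an ops table. All names (`demo_proto`, `demo_sock`, `demo_proto_lookup`, `demo_setup_socket`) are hypothetical stand-ins, not the kernel's `nvme_tcp_proto`; the setters only record what they were asked to do, and the "mptcp" row corresponds to what patch 7/9 of this series adds. It assumes a Linux libc; `IPPROTO_MPTCP` is defined as 262 only if the headers lack it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>	/* IPPROTO_TCP */

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* Linux value; older libc headers lack it */
#endif

/* Hypothetical stand-in for a socket: it only records what was set. */
struct demo_sock {
	int syn_retries;
	int nodelay;
	const char *configured_by;
};

/* Hypothetical ops table in the spirit of struct nvme_tcp_proto. */
struct demo_proto {
	const char *transport;
	int protocol;
	int (*set_syncnt)(struct demo_sock *sk, int val);
	void (*set_nodelay)(struct demo_sock *sk);
};

static int tcp_demo_set_syncnt(struct demo_sock *sk, int val)
{
	sk->syn_retries = val;
	sk->configured_by = "tcp";
	return 0;
}

static void tcp_demo_set_nodelay(struct demo_sock *sk)
{
	sk->nodelay = 1;
	sk->configured_by = "tcp";
}

static int mptcp_demo_set_syncnt(struct demo_sock *sk, int val)
{
	sk->syn_retries = val;
	sk->configured_by = "mptcp";
	return 0;
}

static void mptcp_demo_set_nodelay(struct demo_sock *sk)
{
	sk->nodelay = 1;
	sk->configured_by = "mptcp";
}

/* Static and never freed, so callers can keep a plain pointer to an entry. */
static const struct demo_proto demo_protos[] = {
	{ "tcp",   IPPROTO_TCP,   tcp_demo_set_syncnt,   tcp_demo_set_nodelay },
	{ "mptcp", IPPROTO_MPTCP, mptcp_demo_set_syncnt, mptcp_demo_set_nodelay },
};

/* Map a transport string to its ops table; NULL means unsupported. */
static const struct demo_proto *demo_proto_lookup(const char *transport)
{
	size_t i;

	assert(transport != NULL);
	for (i = 0; i < sizeof(demo_protos) / sizeof(demo_protos[0]); i++)
		if (strcmp(demo_protos[i].transport, transport) == 0)
			return &demo_protos[i];
	return NULL;
}

/* Protocol-agnostic setup: every option goes through the table. */
static int demo_setup_socket(const struct demo_proto *proto,
			     struct demo_sock *sk)
{
	int ret = proto->set_syncnt(sk, 1);	/* single SYN retry */

	if (ret)
		return ret;
	proto->set_nodelay(sk);
	return 0;
}
```

A NULL lookup plays the role of the `-EINVAL` branch in `nvme_tcp_alloc_ctrl()`, and `demo_setup_socket()` mirrors how `nvme_tcp_alloc_queue()` calls `ctrl->proto->set_syncnt()` and friends instead of the TCP helpers directly.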
Similar to the target-side nvmet_tcp_proto, this patch defines the
host-side nvme_tcp_proto structure, which contains the socket protocol,
a set of function pointers for socket operations, and the nvme_ctrl_ops
to use for the controller. The only difference from the target side is
that it defines .set_syncnt instead of .set_reuseaddr.

A TCP-specific instance of this structure is defined, and a proto field
is added to struct nvme_tcp_ctrl. When the transport string is "tcp",
ctrl->proto is set to this TCP instance. All call sites that previously
invoked the TCP setsockopt helpers directly now call the corresponding
function pointers in nvme_tcp_proto.

The proto field points to a statically allocated nvme_tcp_proto
structure that is never freed, so no RCU protection is needed. The
controller's proto pointer is set during initialization and remains
valid throughout the controller's lifetime.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/host/tcp.c | 44 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15d36d6a728e..f54b1eb86940 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -182,6 +183,16 @@ struct nvme_tcp_queue {
 	void (*write_space)(struct sock *);
 };
 
+struct nvme_tcp_proto {
+	int protocol;
+	int (*set_syncnt)(struct sock *sk, int val);
+	void (*set_nodelay)(struct sock *sk);
+	void (*no_linger)(struct sock *sk);
+	void (*set_priority)(struct sock *sk, u32 priority);
+	void (*set_tos)(struct sock *sk, int val);
+	const struct nvme_ctrl_ops *ops;
+};
+
 struct nvme_tcp_ctrl {
 	/* read only in the hot path */
 	struct nvme_tcp_queue *queues;
@@ -198,6 +209,8 @@ struct nvme_tcp_ctrl {
 	struct delayed_work connect_work;
 	struct nvme_tcp_request async_req;
 	u32 io_queues[HCTX_MAX_TYPES];
+
+	const struct nvme_tcp_proto *proto;
 };
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
@@ -1799,7 +1812,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
 
 	ret = sock_create_kern(current->nsproxy->net_ns,
 			ctrl->addr.ss_family, SOCK_STREAM,
-			IPPROTO_TCP, &queue->sock);
+			ctrl->proto->protocol, &queue->sock);
 	if (ret) {
 		dev_err(nctrl->device,
 			"failed to create socket: %d\n", ret);
@@ -1816,24 +1829,24 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
 	nvme_tcp_reclassify_socket(queue->sock);
 
 	/* Single syn retry */
-	tcp_sock_set_syncnt(queue->sock->sk, 1);
+	ctrl->proto->set_syncnt(queue->sock->sk, 1);
 
 	/* Set TCP no delay */
-	tcp_sock_set_nodelay(queue->sock->sk);
+	ctrl->proto->set_nodelay(queue->sock->sk);
 
 	/*
 	 * Cleanup whatever is sitting in the TCP transmit queue on socket
 	 * close. This is done to prevent stale data from being sent should
 	 * the network connection be restored before TCP times out.
 	 */
-	sock_no_linger(queue->sock->sk);
+	ctrl->proto->no_linger(queue->sock->sk);
 
 	if (so_priority > 0)
-		sock_set_priority(queue->sock->sk, so_priority);
+		ctrl->proto->set_priority(queue->sock->sk, so_priority);
 
 	/* Set socket type of service */
 	if (nctrl->opts->tos >= 0)
-		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
+		ctrl->proto->set_tos(queue->sock->sk, nctrl->opts->tos);
 
 	/* Set 10 seconds timeout for icresp recvmsg */
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
@@ -2900,6 +2913,16 @@ nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
 	return found;
 }
 
+static const struct nvme_tcp_proto nvme_tcp_proto = {
+	.protocol = IPPROTO_TCP,
+	.set_syncnt = tcp_sock_set_syncnt,
+	.set_nodelay = tcp_sock_set_nodelay,
+	.no_linger = sock_no_linger,
+	.set_priority = sock_set_priority,
+	.set_tos = ip_sock_set_tos,
+	.ops = &nvme_tcp_ctrl_ops,
+};
+
 static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 		struct nvmf_ctrl_options *opts)
 {
@@ -2964,13 +2987,20 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 		goto out_free_ctrl;
 	}
 
+	if (!strcmp(ctrl->ctrl.opts->transport, "tcp")) {
+		ctrl->proto = &nvme_tcp_proto;
+	} else {
+		ret = -EINVAL;
+		goto out_free_ctrl;
+	}
+
 	ctrl->queues = kzalloc_objs(*ctrl->queues, ctrl->ctrl.queue_count);
 	if (!ctrl->queues) {
 		ret = -ENOMEM;
 		goto out_free_ctrl;
 	}
 
-	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+	ret = nvme_init_ctrl(&ctrl->ctrl, dev, ctrl->proto->ops, 0);
 	if (ret)
 		goto out_kfree_queues;
 
-- 
2.53.0

From nobody Mon May 25 18:05:09 2026
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 6/9] nvme-tcp: register host mptcp transport
Date: Sat, 16 May 2026 16:27:54 +0800

From: Geliang Tang

This patch defines a new nvmf_transport_ops named nvme_mptcp_transport,
which is identical to nvme_tcp_transport except for .name and
.allowed_opts.

MPTCP does not currently support TLS, so the four TLS-related options
(NVMF_OPT_TLS, NVMF_OPT_KEYRING, NVMF_OPT_TLS_KEY, and NVMF_OPT_CONCAT)
are left out of allowed_opts. They will be added back once MPTCP
supports TLS.

The new transport is registered in nvme_tcp_init_module() and
unregistered in nvme_tcp_cleanup_module(). A MODULE_ALIAS("nvme-mptcp")
declaration is added at the end of the file, alongside the other module
metadata.
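Dropping the four TLS options from allowed_opts means a request that carries any of them is rejected for the "mptcp" transport while still being accepted for "tcp". The userspace sketch below shows that bitmask check under stated assumptions: the `DEMO_OPT_*` bit values and the `demo_*` names are hypothetical, and this is not the fabrics core's actual validation code, only the same masking idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bits; the real NVMF_OPT_* values differ. */
#define DEMO_OPT_TRADDR		(1u << 0)
#define DEMO_OPT_TRSVCID	(1u << 1)
#define DEMO_OPT_TOS		(1u << 2)
#define DEMO_OPT_TLS		(1u << 3)
#define DEMO_OPT_KEYRING	(1u << 4)
#define DEMO_OPT_TLS_KEY	(1u << 5)
#define DEMO_OPT_CONCAT		(1u << 6)

/* The four TLS-related options that the mptcp transport leaves out. */
#define DEMO_TLS_OPTS (DEMO_OPT_TLS | DEMO_OPT_KEYRING | \
		       DEMO_OPT_TLS_KEY | DEMO_OPT_CONCAT)

struct demo_transport {
	const char *name;
	unsigned int required_opts;
	unsigned int allowed_opts;
};

static const struct demo_transport demo_tcp = {
	.name = "tcp",
	.required_opts = DEMO_OPT_TRADDR,
	.allowed_opts = DEMO_OPT_TRSVCID | DEMO_OPT_TOS | DEMO_TLS_OPTS,
};

/* Identical to demo_tcp except for the name and the missing TLS bits. */
static const struct demo_transport demo_mptcp = {
	.name = "mptcp",
	.required_opts = DEMO_OPT_TRADDR,
	.allowed_opts = DEMO_OPT_TRSVCID | DEMO_OPT_TOS,
};

/*
 * Return 0 if 'requested' carries every required option and nothing
 * outside required|allowed; -1 otherwise (a kernel would use -EINVAL).
 */
static int demo_check_opts(const struct demo_transport *t,
			   unsigned int requested)
{
	unsigned int permitted;

	assert(t != NULL);
	permitted = t->required_opts | t->allowed_opts;
	if ((requested & t->required_opts) != t->required_opts)
		return -1;	/* a required option is missing */
	if (requested & ~permitted)
		return -1;	/* an option this transport does not accept */
	return 0;
}
```

Re-enabling TLS for MPTCP later would then amount to OR-ing the TLS bits back into the mptcp transport's allowed_opts.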
v2:
 - use 'trtype' instead of '--mptcp' (Hannes)
v3:
 - check mptcp protocol from opts->transport instead of passing a
   parameter (Hannes).
v4:
 - check CONFIG_MPTCP.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/host/tcp.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f54b1eb86940..bad18d7c323e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -3067,6 +3067,20 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 	.create_ctrl = nvme_tcp_create_ctrl,
 };
 
+#ifdef CONFIG_MPTCP
+static struct nvmf_transport_ops nvme_mptcp_transport = {
+	.name = "mptcp",
+	.module = THIS_MODULE,
+	.required_opts = NVMF_OPT_TRADDR,
+	.allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+			NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+			NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
+			NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
+			NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+	.create_ctrl = nvme_tcp_create_ctrl,
+};
+#endif
+
 static int __init nvme_tcp_init_module(void)
 {
 	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
@@ -3092,6 +3106,9 @@ static int __init nvme_tcp_init_module(void)
 		atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
 
 	nvmf_register_transport(&nvme_tcp_transport);
+#ifdef CONFIG_MPTCP
+	nvmf_register_transport(&nvme_mptcp_transport);
+#endif
 	return 0;
 }
 
@@ -3099,6 +3116,9 @@ static void __exit nvme_tcp_cleanup_module(void)
 {
 	struct nvme_tcp_ctrl *ctrl;
 
+#ifdef CONFIG_MPTCP
+	nvmf_unregister_transport(&nvme_mptcp_transport);
+#endif
 	nvmf_unregister_transport(&nvme_tcp_transport);
 
 	mutex_lock(&nvme_tcp_ctrl_mutex);
@@ -3116,3 +3136,4 @@ module_exit(nvme_tcp_cleanup_module);
 MODULE_DESCRIPTION("NVMe host TCP transport driver");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("nvme-tcp");
+MODULE_ALIAS("nvme-mptcp");
-- 
2.53.0

From nobody Mon May 25 18:05:09 2026
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 7/9] nvme-tcp: implement host mptcp proto
Date: Sat, 16 May 2026 16:27:55 +0800

From: Geliang Tang

An MPTCP-specific version of struct nvme_tcp_proto is implemented and
assigned to ctrl->proto when the transport string is "mptcp".

The socket option logic mirrors the target side, except that
mptcp_sock_set_syncnt() is newly defined for the host side. It records
the value in the MPTCP socket (msk->icsk_syn_retries) and applies it to
the first subflow of the MPTCP connection; sync_socket_options() then
propagates it to subflows created later.

A separate nvme_mptcp_ctrl_ops structure with .name = "mptcp" is
defined and used for MPTCP controllers.
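The store-and-replay behaviour of mptcp_sock_set_syncnt() described above can be modelled in a few lines of userspace C: validate the value, remember it on the parent, apply it to the first subflow (creating it if needed), and let every later subflow inherit it. `demo_msk`, `demo_subflow` and the two functions are hypothetical simplifications with no locking; `DEMO_MAX_SYNCNT` is assumed to match the kernel's `MAX_TCP_SYNCNT`.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SYNCNT		127	/* assumed to match MAX_TCP_SYNCNT */
#define DEMO_MAX_SUBFLOWS	8

struct demo_subflow {
	int syn_retries;	/* 0 means "not set, use the default" */
};

/* Hypothetical stand-in for struct mptcp_sock and its subflows. */
struct demo_msk {
	int syn_retries;	/* plays the role of msk->icsk_syn_retries */
	struct demo_subflow subflows[DEMO_MAX_SUBFLOWS];
	size_t nr_subflows;
};

/* New subflow: inherit the remembered value, as sync_socket_options() does. */
static struct demo_subflow *demo_add_subflow(struct demo_msk *msk)
{
	struct demo_subflow *sf;

	assert(msk != NULL);
	if (msk->nr_subflows == DEMO_MAX_SUBFLOWS)
		return NULL;
	sf = &msk->subflows[msk->nr_subflows++];
	sf->syn_retries = msk->syn_retries > 0 ? msk->syn_retries : 0;
	return sf;
}

/* Validate, remember on the parent, then apply to the first subflow. */
static int demo_set_syncnt(struct demo_msk *msk, int val)
{
	struct demo_subflow *first;

	assert(msk != NULL);
	if (val < 1 || val > DEMO_MAX_SYNCNT)
		return -1;	/* the kernel returns -EINVAL */
	msk->syn_retries = val;
	/* Loosely like __mptcp_nmpc_sk(), which may create the first subflow. */
	first = msk->nr_subflows ? &msk->subflows[0] : demo_add_subflow(msk);
	if (first)
		first->syn_retries = val;
	return 0;
}
```

Keeping the value on the parent is what lets subflows joined after the option was set (for example via additional endpoints) pick it up without another setsockopt-style call.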
Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 drivers/nvme/host/tcp.c | 34 ++++++++++++++++++++++++++++++++++
 include/net/mptcp.h     |  7 +++++++
 net/mptcp/protocol.h    |  1 +
 net/mptcp/sockopt.c     | 21 +++++++++++++++++++++
 4 files changed, 63 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bad18d7c323e..22fcfc3b5c2a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2896,6 +2896,24 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
 	.get_virt_boundary = nvmf_get_virt_boundary,
 };
 
+#ifdef CONFIG_MPTCP
+static const struct nvme_ctrl_ops nvme_mptcp_ctrl_ops = {
+	.name = "mptcp",
+	.module = THIS_MODULE,
+	.flags = NVME_F_FABRICS | NVME_F_BLOCKING,
+	.reg_read32 = nvmf_reg_read32,
+	.reg_read64 = nvmf_reg_read64,
+	.reg_write32 = nvmf_reg_write32,
+	.subsystem_reset = nvmf_subsystem_reset,
+	.free_ctrl = nvme_tcp_free_ctrl,
+	.submit_async_event = nvme_tcp_submit_async_event,
+	.delete_ctrl = nvme_tcp_delete_ctrl,
+	.get_address = nvme_tcp_get_address,
+	.stop_ctrl = nvme_tcp_stop_ctrl,
+	.get_virt_boundary = nvmf_get_virt_boundary,
+};
+#endif
+
 static bool
 nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
 {
@@ -2923,6 +2941,18 @@ static const struct nvme_tcp_proto nvme_tcp_proto = {
 	.ops = &nvme_tcp_ctrl_ops,
 };
 
+#ifdef CONFIG_MPTCP
+static const struct nvme_tcp_proto nvme_mptcp_proto = {
+	.protocol = IPPROTO_MPTCP,
+	.set_syncnt = mptcp_sock_set_syncnt,
+	.set_nodelay = mptcp_sock_set_nodelay,
+	.no_linger = mptcp_sock_no_linger,
+	.set_priority = mptcp_sock_set_priority,
+	.set_tos = mptcp_sock_set_tos,
+	.ops = &nvme_mptcp_ctrl_ops,
+};
+#endif
+
 static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 		struct nvmf_ctrl_options *opts)
 {
@@ -2989,6 +3019,10 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 
 	if (!strcmp(ctrl->ctrl.opts->transport, "tcp")) {
 		ctrl->proto = &nvme_tcp_proto;
+#ifdef CONFIG_MPTCP
+	} else if (!strcmp(ctrl->ctrl.opts->transport, "mptcp")) {
+		ctrl->proto = &nvme_mptcp_proto;
+#endif
 	} else {
 		ret = -EINVAL;
 		goto out_free_ctrl;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 91ce7b9b639d..49031a111e69 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -247,6 +247,8 @@ void mptcp_sock_set_priority(struct sock *sk, u32 priority);
 void mptcp_sock_no_linger(struct sock *sk);
 
 void mptcp_sock_set_tos(struct sock *sk, int val);
+
+int mptcp_sock_set_syncnt(struct sock *sk, int val);
 #else
 
 static inline void mptcp_init(void)
@@ -343,6 +345,11 @@ static inline void mptcp_sock_set_priority(struct sock *sk, u32 priority) { }
 static inline void mptcp_sock_no_linger(struct sock *sk) { }
 
 static inline void mptcp_sock_set_tos(struct sock *sk, int val) { }
+
+static inline int mptcp_sock_set_syncnt(struct sock *sk, int val)
+{
+	return 0;
+}
 #endif /* CONFIG_MPTCP */
 
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 661600f8b573..0096cabdccd2 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -336,6 +336,7 @@ struct mptcp_sock {
 	int		keepalive_idle;
 	int		keepalive_intvl;
 	int		maxseg;
+	int		icsk_syn_retries;
 	struct work_struct work;
 	struct sk_buff  *ooo_last_skb;
 	struct rb_root  out_of_order_queue;
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 062ed4a43e5a..afd5a4c511dc 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1602,6 +1602,8 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 	priority = READ_ONCE(sk->sk_priority);
 	if (priority > 0)
 		sock_set_priority(ssk, priority);
+	if (msk->icsk_syn_retries > 0)
+		tcp_sock_set_syncnt(ssk, msk->icsk_syn_retries);
 }
 
 void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
@@ -1768,3 +1770,22 @@ void mptcp_sock_set_tos(struct sock *sk, int val)
 	release_sock(sk);
 }
 EXPORT_SYMBOL(mptcp_sock_set_tos);
+
+int mptcp_sock_set_syncnt(struct sock *sk, int val)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sock *ssk;
+
+	if (val < 1 || val > MAX_TCP_SYNCNT)
+		return -EINVAL;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	msk->icsk_syn_retries = val;
+	ssk = __mptcp_nmpc_sk(msk);
+	if (!IS_ERR(ssk))
+		tcp_sock_set_syncnt(ssk, val);
+	release_sock(sk);
+	return 0;
+}
+EXPORT_SYMBOL(mptcp_sock_set_syncnt);
-- 
2.53.0

From nobody Mon May 25 18:05:09 2026
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 8/9] selftests: mptcp: add nvme over mptcp test
Date: Sat, 16 May 2026 16:27:56 +0800
Message-ID: <22e26dc2ad1ab3c6f349644690165ba1056612a6.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

Add a test case for NVMe over MPTCP. It verifies that the nvme list,
discover, connect, and disconnect commands work, and it measures
read/write performance with fio.

The script accepts two positional parameters:

  trtype - Transport type (mptcp|tcp). Default: mptcp
  path   - Number of NVMe multipath paths (1-4). Default: 1

The test simulates four NICs on both the target and host sides, each
rate-limited with tc netem to 1000 Mbit/s (125 MB/s).
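As a back-of-the-envelope check of these limits (ignoring protocol and header overhead): tc's "mbit" is 10^6 bit/s and fio's "MB/s" is 10^6 byte/s, so one 1000mbit link caps a single TCP connection at 125 MB/s, while four links give an ideal aggregate of 500 MB/s. The fio read figures reported for this test (118 MB/s for tcp, 448 MB/s for mptcp) are then roughly 94% and 90% of their respective ceilings. The helper names below are made up for this sketch.

```c
#include <assert.h>

/* tc "mbit" is 10^6 bit/s; fio's "MB/s" is 10^6 byte/s. */
static double mbit_to_mbyte(double mbit_per_s)
{
	return mbit_per_s / 8.0;
}

/* Ideal ceiling when traffic is spread evenly across 'links' equal links. */
static double aggregate_ceiling(double mbit_per_link, int links)
{
	assert(links > 0);
	return mbit_to_mbyte(mbit_per_link) * links;
}

/* Fraction of a ceiling actually achieved by a measured throughput. */
static double utilisation(double measured_mbyte, double ceiling_mbyte)
{
	assert(ceiling_mbyte > 0.0);
	return measured_mbyte / ceiling_mbyte;
}
```

For example, `utilisation(448.0, aggregate_ceiling(1000.0, 4))` evaluates to 0.896.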
With a single NVMe multipath path, NVMe over MPTCP delivered 3.5 to 3.8
times the bandwidth of plain TCP:

  # ./mptcp_nvme.sh tcp
     READ: bw=112MiB/s (118MB/s), 112MiB/s-112MiB/s (118MB/s-118MB/s), io=1123MiB (1177MB), run=10018-10018msec
    WRITE: bw=112MiB/s (117MB/s), 112MiB/s-112MiB/s (117MB/s-117MB/s), io=1118MiB (1173MB), run=10018-10018msec
  # ./mptcp_nvme.sh mptcp
     READ: bw=427MiB/s (448MB/s), 427MiB/s-427MiB/s (448MB/s-448MB/s), io=4286MiB (4494MB), run=10039-10039msec
    WRITE: bw=387MiB/s (406MB/s), 387MiB/s-387MiB/s (406MB/s-406MB/s), io=3885MiB (4073MB), run=10043-10043msec

This shows that MPTCP provides the same kind of multi-interface
bandwidth aggregation that NVMe multipath does.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 tools/testing/selftests/net/mptcp/Makefile    |   1 +
 tools/testing/selftests/net/mptcp/config      |   7 +
 .../testing/selftests/net/mptcp/mptcp_lib.sh  |  12 +
 .../testing/selftests/net/mptcp/mptcp_nvme.sh | 315 ++++++++++++++++++
 4 files changed, 335 insertions(+)
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_nvme.sh

diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
index 22ba0da2adb8..7b308447a58b 100644
--- a/tools/testing/selftests/net/mptcp/Makefile
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -13,6 +13,7 @@ TEST_PROGS := \
 	mptcp_connect_sendfile.sh \
 	mptcp_connect_splice.sh \
 	mptcp_join.sh \
+	mptcp_nvme.sh \
 	mptcp_sockopt.sh \
 	pm_netlink.sh \
 	simult_flows.sh \
diff --git a/tools/testing/selftests/net/mptcp/config b/tools/testing/selftests/net/mptcp/config
index 59051ee2a986..0eee348eff8b 100644
--- a/tools/testing/selftests/net/mptcp/config
+++ b/tools/testing/selftests/net/mptcp/config
@@ -34,3 +34,10 @@ CONFIG_NFT_SOCKET=m
 CONFIG_NFT_TPROXY=m
 CONFIG_SYN_COOKIES=y
 CONFIG_VETH=y
+CONFIG_CONFIGFS_FS=y
+CONFIG_NVME_CORE=y
+CONFIG_NVME_FABRICS=y
+CONFIG_NVME_TCP=y
+CONFIG_NVME_TARGET=y
+CONFIG_NVME_TARGET_TCP=y
+CONFIG_NVME_MULTIPATH=y
diff --git a/tools/testing/selftests/net/mptcp/mptcp_lib.sh b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
index 5ef6033775c8..e08854ba42bd 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_lib.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
@@ -530,6 +530,18 @@ mptcp_lib_check_tools() {
 				exit ${KSFT_SKIP}
 			fi
 			;;
+		"nvme")
+			if ! nvme --version &> /dev/null; then
+				mptcp_lib_pr_skip "nvme tool not found"
+				exit ${KSFT_SKIP}
+			fi
+			;;
+		"fio")
+			if ! fio -h &> /dev/null; then
+				mptcp_lib_pr_skip "fio tool not found"
+				exit ${KSFT_SKIP}
+			fi
+			;;
 		*)
 			mptcp_lib_pr_fail "Internal error: unsupported tool: ${tool}"
 			exit ${KSFT_FAIL}
diff --git a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
new file mode 100755
index 000000000000..1bd76e245a18
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
@@ -0,0 +1,315 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(dirname "$0")/mptcp_lib.sh"
+
+ret=0
+trtype="${1:-mptcp}"
+path="${2:-1}"
+nqn="nqn.2014-08.org.nvmexpress.${trtype}dev.$$.${RANDOM}"
+ns=1
+port=$((RANDOM % 10000 + 20000))
+trsvcid=$((RANDOM % 64512 + 1024))
+ns1=""
+ns2=""
+temp_file=""
+loop_dev=""
+
+usage()
+{
+	cat << EOF
+
+Usage:
+
+	$(basename "$0") [trtype] [path]
+
+	trtype	Transport type (tcp|mptcp) - default: mptcp
+	path	Number of multipath (1-4) - default: 1
+
+EOF
+exit 0
+}
+
+validate_params()
+{
+	if [[ ! "${trtype}" =~ ^(tcp|mptcp)$ ]]; then
+		echo "Error: Invalid trtype ${trtype}."
+		usage
+	fi
+
+	if [[ ! "${path}" =~ ^[0-9]+$ ]] || [ "${path}" -lt 1 ]; then
+		echo "Error: Invalid path count ${path}."
+		usage
+	fi
+
+	if [ "${path}" -gt 4 ]; then
+		echo "Warning: path count ${path} > 4, limiting to 4"
+		path=4
+	fi
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+ns1_cleanup()
+{
+	pushd /sys/kernel/config/nvmet || exit 1
+
+	for i in $(seq 1 "${path}"); do
+		local portdir=$((port + i))
+
+		rm -rf "ports/${portdir}/subsystems/${nqn}"
+		rmdir "ports/${portdir}"
+	done
+
+	echo 0 > "subsystems/${nqn}/namespaces/${ns}/enable"
+	echo -n 0 > "subsystems/${nqn}/namespaces/${ns}/device_path"
+	rmdir "subsystems/${nqn}/namespaces/${ns}"
+	rmdir "subsystems/${nqn}"
+
+	popd || exit 1
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+ns2_cleanup()
+{
+	nvme disconnect -n "${nqn}" || true
+}
+
+# This function is used in the cleanup trap
+#shellcheck disable=SC2317,SC2329
+cleanup()
+{
+	ip netns exec "$ns2" bash <<- EOF
+		$(declare -f ns2_cleanup)
+		ns2_cleanup
+	EOF
+
+	sleep 1
+
+	ip netns exec "$ns1" unshare -m bash <<- EOF
+		mount -t configfs none /sys/kernel/config
+		$(declare -f ns1_cleanup)
+		ns1_cleanup
+	EOF
+
+	if [ -n "${loop_dev}" ] && [ -b "${loop_dev}" ]; then
+		losetup -d "${loop_dev}" 2>/dev/null || true
+	fi
+	rm -rf "${temp_file}"
+
+	mptcp_lib_ns_exit "$ns1" "$ns2"
+
+	kill "$monitor_pid_ns1" 2>/dev/null
+	wait "$monitor_pid_ns1" 2>/dev/null
+
+	kill "$monitor_pid_ns2" 2>/dev/null
+	wait "$monitor_pid_ns2" 2>/dev/null
+
+	unset -v trtype path nqn ns port trsvcid
+}
+
+init()
+{
+	mptcp_lib_ns_init ns1 ns2
+
+	# ns1		ns2
+	# 10.1.1.1	10.1.1.2
+	# 10.1.2.1	10.1.2.2
+	# 10.1.3.1	10.1.3.2
+	# 10.1.4.1	10.1.4.2
+	for i in {1..4}; do
+		ip link add ns1eth"$i" netns "$ns1" type veth peer \
+			name ns2eth"$i" netns "$ns2"
+		ip -net "$ns1" addr add 10.1."$i".1/24 dev ns1eth"$i"
+		ip -net "$ns1" addr add dead:beef:"$i"::1/64 \
+			dev ns1eth"$i" nodad
+		ip -net "$ns1" link set ns1eth"$i" up
+		ip -net "$ns2" addr add 10.1."$i".2/24 dev ns2eth"$i"
+		ip -net "$ns2" addr add dead:beef:"$i"::2/64 \
+			dev ns2eth"$i" nodad
+		ip -net "$ns2" link set ns2eth"$i" up
+		ip -net "$ns2" route add default via 10.1."$i".1 \
+			dev ns2eth"$i" metric 10"$i"
+		ip -net "$ns2" route add default via dead:beef:"$i"::1 \
+			dev ns2eth"$i" metric 10"$i"
+
+		# Add tc qdisc to both namespaces for bandwidth limiting
+		tc -n "$ns1" qdisc add dev ns1eth"$i" root netem rate 1000mbit
+		tc -n "$ns2" qdisc add dev ns2eth"$i" root netem rate 1000mbit
+
+		tc -n "$ns1" qdisc show dev ns1eth"$i"
+		tc -n "$ns2" qdisc show dev ns2eth"$i"
+	done
+
+	mptcp_lib_pm_nl_set_limits "${ns1}" 8 8
+
+	mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.1.1 flags signal
+	mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.2.1 flags signal
+	mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.3.1 flags signal
+	mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.4.1 flags signal
+
+	mptcp_lib_pm_nl_set_limits "${ns2}" 8 8
+
+	mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.1.2 flags subflow
+	mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.2.2 flags subflow
+	mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.3.2 flags subflow
+	mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.4.2 flags subflow
+
+	ip -n "${ns1}" mptcp monitor &
+	monitor_pid_ns1=$!
+	ip -n "${ns2}" mptcp monitor &
+	monitor_pid_ns2=$!
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+run_target()
+{
+	cd /sys/kernel/config/nvmet/subsystems || exit
+	mkdir -p "${nqn}"
+	cd "${nqn}" || exit
+	echo 1 > attr_allow_any_host
+	mkdir -p namespaces/"${ns}"
+	echo "${loop_dev}" > namespaces/"${ns}"/device_path
+	echo 1 > namespaces/"${ns}"/enable
+
+	# Create one port per path, each on a different IP address
+	for i in $(seq 1 "${path}"); do
+		local portdir=$((port + i))
+
+		cd /sys/kernel/config/nvmet/ports || exit
+		mkdir -p "${portdir}"
+		cd "${portdir}" || exit 1
+		echo "${trtype}" > addr_trtype
+		echo ipv4 > addr_adrfam
+		if [ "${path}" -eq 1 ]; then
+			echo "0.0.0.0" > addr_traddr
+		else
+			echo "10.1.${i}.1" > addr_traddr
+		fi
+		echo "${trsvcid}" > addr_trsvcid
+
+		mkdir -p subsystems
+		ln -sf "../../subsystems/${nqn}" "subsystems/${nqn}"
+		cd - >/dev/null
+	done
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+run_host()
+{
+	local traddr=10.1.1.1
+	local devname
+
+	echo "nvme discover -a ${traddr}"
+	if ! nvme discover -t "${trtype}" -a "${traddr}" -s "${trsvcid}"; then
+		return 1
+	fi
+
+	for i in $(seq 1 "${path}"); do
+		echo "Connecting to 10.1.${i}.1:${trsvcid}"
+
+		if ! nvme connect -t "${trtype}" -a "10.1.${i}.1" \
+			-s "${trsvcid}" -n "${nqn}"; then
+			echo "Failed to connect to 10.1.${i}.1"
+			return 1
+		fi
+	done
+
+	sleep 1
+
+	# Scan all NVMe block devices
+	for dev in /dev/nvme*n1 /dev/nvme*cn1; do
+		if [ -b "$dev" ] 2>/dev/null; then
+			# Check if this device's controller matches our NQN
+			if nvme id-ctrl "$dev" 2>/dev/null |
+				grep -q "${nqn}"; then
+				devname=$(basename "$dev")
+				break
+			fi
+		fi
+	done 2>/dev/null || exit
+
+	if [ -z "$devname" ]; then
+		echo "No block device found for NQN ${nqn}" >&2
+		return 1
+	fi
+
+	sleep 1
+
+	echo "nvme list"
+	nvme list
+
+	sleep 1
+
+	echo "fio randread /dev/${devname}"
+	if ! fio --name=global --direct=1 --norandommap --randrepeat=0 \
+		--ioengine=libaio --thread=1 --blocksize=128k --runtime=10 \
+		--time_based --rw=randread --numjobs=4 --iodepth=256 \
+		--group_reporting --size=100% --name=libaio_4_256_4k_randread \
+		--filename="/dev/${devname}"; then
+		return 1
+	fi
+
+	sleep 1
+
+	echo "fio randwrite /dev/${devname}"
+	if ! fio --name=global --direct=1 --norandommap --randrepeat=0 \
+		--ioengine=libaio --thread=1 --blocksize=128k --runtime=10 \
+		--time_based --rw=randwrite --numjobs=4 --iodepth=256 \
+		--group_reporting --size=100% --name=libaio_4_256_4k_randwrite \
+		--filename="/dev/${devname}"; then
+		return 1
+	fi
+
+	nvme flush "/dev/${devname}"
+}
+
+init
+trap cleanup EXIT
+
+mptcp_lib_check_tools nvme fio
+validate_params
+
+temp_file=$(mktemp /tmp/nvme_test.XXXXXX.raw)
+if [ $? -ne 0 ]; then
+	echo "Failed to create temp file"
+	exit 1
+fi
+
+dd if=/dev/zero of="${temp_file}" bs=1M count=0 seek=512
+loop_dev=$(losetup -f --show "${temp_file}")
+
+run_test()
+{
+	export trtype path nqn ns port trsvcid
+	export loop_dev temp_file
+
+	if ! ip netns exec "$ns1" unshare -m bash <<- EOF
+		mount -t configfs none /sys/kernel/config
+		$(declare -f run_target)
+		run_target
+		exit \$?
+	EOF
+	then
+		ret="${KSFT_FAIL}"
+	fi
+
+	if ! ip netns exec "$ns2" bash <<- EOF
+		$(declare -f run_host)
+		run_host
+		exit \$?
+	EOF
+	then
+		ret="${KSFT_FAIL}"
+	fi
+
+	sleep 1
+}
+
+run_test "$@"
+
+mptcp_lib_result_print_all_tap
+exit "$ret"
-- 
2.53.0

From nobody Mon May 25 18:05:09 2026
From: Geliang Tang
To: mptcp@lists.linux.dev
Cc: Geliang Tang, Hannes Reinecke, zhenwei pi, Hui Zhu, Gang Yan
Subject: [RFC mptcp-next v10 9/9] selftests: mptcp: nvme: add iopolicy tests
Date: Sat, 16 May 2026 16:27:57 +0800
Message-ID: <1d36e5b833250836db9f24e32473bf19431156cd.1778919284.git.tanggeliang@kylinos.cn>

From: Geliang Tang

Add NVMe multipath I/O policy (iopolicy) testing to mptcp_nvme.sh through
a new "iopolicy" argument. It defaults to "numa" and can also be set to
"round-robin" or "queue-depth".

With 4 NVMe multipath paths and the round-robin iopolicy, TCP and MPTCP
achieve similar bandwidth:

  # ./mptcp_nvme.sh tcp 4 round-robin
     READ: bw=455MiB/s (478MB/s), 455MiB/s-455MiB/s (478MB/s-478MB/s), io=4665MiB (4891MB), run=10242-10242msec
    WRITE: bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s), io=4633MiB (4858MB), run=10184-10184msec

  # ./mptcp_nvme.sh mptcp 4 round-robin
     READ: bw=445MiB/s (466MB/s), 445MiB/s-445MiB/s (466MB/s-466MB/s), io=4575MiB (4797MB), run=10287-10287msec
    WRITE: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s), io=4572MiB (4794MB), run=10267-10267msec

A "loss" argument is also added to simulate a lossy network. When loss=1,
the netem qdisc on each veth interface is configured with
"delay 5ms loss 0.5%" on top of the existing 1000mbit rate limit.
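A minimal sketch of how the loss flag turns into netem parameters, assuming bash. The helper netem_args is hypothetical (it is not part of the patch); it mirrors the `$([ "${loss}" -eq 1 ] && echo ...)` expression used in init(). The last line only prints the resulting tc command, with placeholder namespace and interface names, so it runs without root or network namespaces.

```shell
#!/bin/bash
# Hypothetical helper (not in the patch): build the netem arguments that
# init() passes to "tc qdisc add dev <veth> root netem".
netem_args()
{
	local loss="$1"
	local args="rate 1000mbit"

	# Same condition as the patch: add delay and loss only when loss=1
	if [ "${loss}" -eq 1 ]; then
		args="${args} delay 5ms loss 0.5%"
	fi
	echo "${args}"
}

# Dry run: print (do not execute) the command for one placeholder veth.
# The unquoted substitution is intentional so each word is a separate argument.
# shellcheck disable=SC2046
echo tc -n ns1 qdisc add dev ns1eth1 root netem $(netem_args 1)
# prints: tc -n ns1 qdisc add dev ns1eth1 root netem rate 1000mbit delay 5ms loss 0.5%
```

With loss=0 the helper yields only "rate 1000mbit", which matches the qdisc configuration the script used before this patch.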

In this scenario, MPTCP achieves roughly 3x (read) and 4x (write) the
throughput of TCP:

  # ./mptcp_nvme.sh tcp 4 round-robin 1
     READ: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s (151MB/s-151MB/s), io=1909MiB (2001MB), run=13231-13231msec
    WRITE: bw=100.0MiB/s (105MB/s), 100.0MiB/s-100.0MiB/s (105MB/s-105MB/s), io=1397MiB (1465MB), run=13980-13980msec

  # ./mptcp_nvme.sh mptcp 4 round-robin 1
     READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=4524MiB (4743MB), run=10564-10564msec
    WRITE: bw=431MiB/s (452MB/s), 431MiB/s-431MiB/s (452MB/s-452MB/s), io=4513MiB (4732MB), run=10481-10481msec

These results show that MPTCP is more resilient to packet loss than TCP,
as it can leverage multiple subflows to mitigate network degradation.

Cc: Hannes Reinecke
Co-developed-by: zhenwei pi
Signed-off-by: zhenwei pi
Co-developed-by: Hui Zhu
Signed-off-by: Hui Zhu
Co-developed-by: Gang Yan
Signed-off-by: Gang Yan
Signed-off-by: Geliang Tang
---
 .../testing/selftests/net/mptcp/mptcp_nvme.sh | 67 ++++++++++++++++++-
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
index 1bd76e245a18..465a7c9cf4ce 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
@@ -6,6 +6,8 @@ ret=0
 trtype="${1:-mptcp}"
 path="${2:-1}"
+iopolicy=${3:-"numa"} # round-robin, queue-depth
+loss=${4:-0}
 nqn="nqn.2014-08.org.nvmexpress.${trtype}dev.$$.${RANDOM}"
 ns=1
 port=$((RANDOM % 10000 + 20000))
@@ -21,10 +23,12 @@ usage()
 
 Usage:
 
-	$(basename "$0") [trtype] [path]
+	$(basename "$0") [trtype] [path] [iopolicy] [loss]
 
 	trtype		Transport type (tcp|mptcp) - default: mptcp
 	path		Number of multipath (1-4) - default: 1
+	iopolicy	I/O policy (numa|round-robin|queue-depth) - default: numa
+	loss		Enable packet loss (0|1) - default: 0
 
 EOF
 	exit 0
@@ -46,6 +50,16 @@ validate_params()
 		echo "Warning: path count ${path} > 4, limiting to 4"
 		path=4
 	fi
+
+	if [[ ! "${iopolicy}" =~ ^(numa|round-robin|queue-depth)$ ]]; then
+		echo "Error: Invalid iopolicy ${iopolicy}. Must be numa, round-robin or queue-depth"
+		usage
+	fi
+
+	if [[ ! "${loss}" =~ ^[01]$ ]]; then
+		echo "Error: Invalid loss value ${loss}. Must be 0 or 1"
+		usage
+	fi
 }
 
 # This function is invoked indirectly
@@ -107,6 +121,7 @@ cleanup()
 	wait "$monitor_pid_ns2" 2>/dev/null
 
 	unset -v trtype path nqn ns port trsvcid
+	unset -v iopolicy loss
 }
 
 init()
@@ -135,8 +150,10 @@ init()
 			dev ns2eth"$i" metric 10"$i"
 
 		# Add tc qdisc to both namespaces for bandwidth limiting
-		tc -n "$ns1" qdisc add dev ns1eth"$i" root netem rate 1000mbit
-		tc -n "$ns2" qdisc add dev ns2eth"$i" root netem rate 1000mbit
+		tc -n "$ns1" qdisc add dev ns1eth"$i" root netem rate 1000mbit \
+			$([ "${loss}" -eq 1 ] && echo "delay 5ms loss 0.5%")
+		tc -n "$ns2" qdisc add dev ns2eth"$i" root netem rate 1000mbit \
+			$([ "${loss}" -eq 1 ] && echo "delay 5ms loss 0.5%")
 
 		tc -n "$ns1" qdisc show dev ns1eth"$i"
 		tc -n "$ns2" qdisc show dev ns2eth"$i"
@@ -196,6 +213,43 @@ run_target()
 	done
 }
 
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+set_io_policy()
+{
+	local nqn="$1"
+	local iopolicy="$2"
+	local subname
+	local policy
+	local current
+
+	subname=$(nvme list-subsys 2>/dev/null | grep "${nqn}" |
+		grep -o 'nvme-subsys[0-9]*' | head -1)
+	if [ -z "$subname" ]; then
+		return 1
+	fi
+
+	policy="/sys/class/nvme-subsystem/${subname}/iopolicy"
+	if [ ! -w "$policy" ]; then
+		return 1
+	fi
+
+	if ! echo "${iopolicy}" > "$policy" 2>/dev/null; then
+		return 1
+	fi
+
+	current=$(cat "$policy" 2>/dev/null)
+	if [ -z "$current" ]; then
+		return 1
+	fi
+
+	if [[ "$current" != *"${iopolicy}"* ]]; then
+		return 1
+	fi
+
+	return 0
+}
+
 # This function is invoked indirectly
 #shellcheck disable=SC2317,SC2329
 run_host()
@@ -242,6 +296,11 @@ run_host()
 	echo "nvme list"
 	nvme list
 
+	if ! set_io_policy "${nqn}" "${iopolicy}"; then
+		echo "Failed to set I/O policy to ${iopolicy}"
+		return 1
+	fi
+
 	sleep 1
 
 	echo "fio randread /dev/${devname}"
@@ -286,6 +345,7 @@ run_test()
 {
 	export trtype path nqn ns port trsvcid
 	export loop_dev temp_file
+	export iopolicy loss
 
 	if ! ip netns exec "$ns1" unshare -m bash <<- EOF
 		mount -t configfs none /sys/kernel/config
@@ -298,6 +358,7 @@ run_test()
 	fi
 
 	if ! ip netns exec "$ns2" bash <<- EOF
+		$(declare -f set_io_policy)
 		$(declare -f run_host)
 		run_host
 		exit \$?
-- 
2.53.0
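The subsystem lookup in set_io_policy() can be exercised without NVMe hardware by feeding it text instead of live `nvme list-subsys` output. The sketch below is hypothetical: subsys_for_nqn and valid_iopolicy are illustrative names, the sample lines only assume the "nvme-subsysN - NQN=..." layout printed by recent nvme-cli versions (the exact format may differ), and it uses `grep -F` (fixed-string matching, so the dots in the NQN are not regex wildcards), which deviates slightly from the plain `grep` in the patch.

```shell
#!/bin/bash
# Hypothetical, standalone rework of the lookup step in set_io_policy():
# read "nvme list-subsys" style text on stdin and print the first
# nvme-subsysN name found on a line that mentions the given NQN.
subsys_for_nqn()
{
	local nqn="$1"

	# -F: match the NQN literally instead of as a regular expression
	grep -F -- "${nqn}" | grep -o 'nvme-subsys[0-9]*' | head -1
}

# Same pattern validate_params() uses to accept an iopolicy value
valid_iopolicy()
{
	[[ "$1" =~ ^(numa|round-robin|queue-depth)$ ]]
}

# Assumed sample of nvme-cli output; real output carries more lines per subsystem
sample='nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.mptcpdev.1234.5678
nvme-subsys1 - NQN=nqn.2014-08.org.nvmexpress.tcpdev.42.7'

name=$(printf '%s\n' "${sample}" |
	subsys_for_nqn "nqn.2014-08.org.nvmexpress.tcpdev.42.7")
echo "${name}"
echo "/sys/class/nvme-subsystem/${name}/iopolicy"
```

Running it prints nvme-subsys1 followed by /sys/class/nvme-subsystem/nvme-subsys1/iopolicy, the sysfs attribute that set_io_policy() writes and then reads back to confirm the policy took effect.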