From nobody Tue Jun 23 13:09:43 2026
From: Mingbao Sun
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: sunmingbao@tom.com, tyler.sun@dell.com, ping.gan@dell.com, yanxiu.cai@dell.com, libin.zhang@dell.com, ao.sun@dell.com
Subject: [PATCH 1/2] nvmet-tcp: support specifying the congestion-control
Date: Fri, 4 Mar 2022 17:27:53 +0800
Message-Id: <20220304092754.2721-2-sunmingbao@tom.com>
In-Reply-To: <20220304092754.2721-1-sunmingbao@tom.com>
References: <20220304092754.2721-1-sunmingbao@tom.com>

From: Mingbao Sun

Congestion control can have a noticeable impact on the performance of
TCP-based communication, and NVMe over TCP is no exception.

Different congestion-control algorithms (e.g., cubic, dctcp) suit
different scenarios. A well-chosen algorithm improves performance;
a poorly chosen one can severely degrade it.

The congestion control used by NVMe over TCP can be set by writing
'/proc/sys/net/ipv4/tcp_congestion_control', but this also changes the
congestion control of every future TCP socket that has not been
explicitly assigned one, and so may affect the performance of those
sockets as well.

So it makes sense for NVMe over TCP to support specifying the
congestion control directly. This commit addresses the target side.

Implementation approach:
the following new configfs entry was created for the user to specify
the congestion control of each nvmet port:
'/sys/kernel/config/nvmet/ports/X/tcp_congestion'
Then, in nvmet_tcp_add_port, the specified congestion control is
applied to the listening socket of the nvmet port.

Signed-off-by: Mingbao Sun
---
 drivers/nvme/target/configfs.c | 52 ++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h    |  1 +
 drivers/nvme/target/tcp.c      | 27 ++++++++++++++++++
 3 files changed, 80 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 091a0ca16361..fcf01f2b8045 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 #include "nvmet.h"
 
@@ -222,6 +223,55 @@ static ssize_t nvmet_addr_trsvcid_store(struct config_item *item,
 
 CONFIGFS_ATTR(nvmet_, addr_trsvcid);
 
+static ssize_t nvmet_tcp_congestion_show(struct config_item *item,
+		char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	return snprintf(page, PAGE_SIZE, "%s\n",
+			port->tcp_congestion ? port->tcp_congestion : "");
+}
+
+static ssize_t nvmet_tcp_congestion_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	int len;
+	bool ecn_ca;
+	u32 key;
+
+	len = strcspn(page, "\n");
+	if (!len)
+		return -EINVAL;
+
+	if (len >= TCP_CA_NAME_MAX) {
+		pr_err("name of TCP congestion control can not exceed %d bytes.\n",
+		       TCP_CA_NAME_MAX);
+		return -EINVAL;
+	}
+
+	if (nvmet_is_port_enabled(port, __func__))
+		return -EACCES;
+
+	kfree(port->tcp_congestion);
+	port->tcp_congestion = kmemdup_nul(page, len, GFP_KERNEL);
+	if (!port->tcp_congestion)
+		return -ENOMEM;
+
+	key = tcp_ca_get_key_by_name(NULL, port->tcp_congestion, &ecn_ca);
+	if (key == TCP_CA_UNSPEC) {
+		pr_err("congestion control %s not found.\n",
+		       port->tcp_congestion);
+		kfree(port->tcp_congestion);
+		port->tcp_congestion = NULL;
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_, tcp_congestion);
+
 static ssize_t nvmet_param_inline_data_size_show(struct config_item *item,
 		char *page)
 {
@@ -1597,6 +1647,7 @@ static void nvmet_port_release(struct config_item *item)
 	list_del(&port->global_entry);
 
 	kfree(port->ana_state);
+	kfree(port->tcp_congestion);
 	kfree(port);
 }
 
@@ -1605,6 +1656,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
+	&nvmet_attr_tcp_congestion,
 	&nvmet_attr_addr_trtype,
 	&nvmet_attr_param_inline_data_size,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 69637bf8f8e1..76a57c4c3456 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -145,6 +145,7 @@ struct nvmet_port {
 	struct config_group ana_groups_group;
 	struct nvmet_ana_group ana_default_group;
 	enum nvme_ana_state *ana_state;
+	const char *tcp_congestion;
 	void *priv;
 	bool enabled;
 	int inline_data_size;
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 83ca577f72be..3b72e782c901 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1657,8 +1657,10 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
 	struct nvmet_tcp_port *port =
 		container_of(w, struct nvmet_tcp_port, accept_work);
 	struct socket *newsock;
+	struct inet_connection_sock *icsk, *icsk_new;
 	int ret;
 
+	icsk = inet_csk(port->sock->sk);
 	while (true) {
 		ret = kernel_accept(port->sock, &newsock, O_NONBLOCK);
 		if (ret < 0) {
@@ -1666,6 +1668,16 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
 				pr_warn("failed to accept err=%d\n", ret);
 			return;
 		}
+
+		if (port->nport->tcp_congestion) {
+			icsk_new = inet_csk(newsock->sk);
+			if (icsk_new->icsk_ca_ops != icsk->icsk_ca_ops) {
+				pr_warn("congestion abnormal: expected %s, actual %s.\n",
+					icsk->icsk_ca_ops->name,
+					icsk_new->icsk_ca_ops->name);
+			}
+		}
+
 		ret = nvmet_tcp_alloc_queue(port, newsock);
 		if (ret) {
 			pr_err("failed to allocate queue\n");
@@ -1693,6 +1705,8 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 {
 	struct nvmet_tcp_port *port;
 	__kernel_sa_family_t af;
+	char ca_name[TCP_CA_NAME_MAX];
+	sockptr_t optval;
 	int ret;
 
 	port = kzalloc(sizeof(*port), GFP_KERNEL);
@@ -1741,6 +1755,19 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
 	if (so_priority > 0)
 		sock_set_priority(port->sock->sk, so_priority);
 
+	if (nport->tcp_congestion) {
+		strncpy(ca_name, nport->tcp_congestion, TCP_CA_NAME_MAX-1);
+		optval = KERNEL_SOCKPTR(ca_name);
+		ret = sock_common_setsockopt(port->sock, IPPROTO_TCP,
+					     TCP_CONGESTION, optval,
+					     strlen(ca_name));
+		if (ret) {
+			pr_err("failed to set port socket's congestion to %s: %d\n",
+			       ca_name, ret);
+			goto err_sock;
+		}
+	}
+
 	ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
 			sizeof(port->addr));
 	if (ret) {
-- 
2.26.2

From nobody Tue Jun 23 13:09:43 2026
From: Mingbao Sun
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: sunmingbao@tom.com, tyler.sun@dell.com, ping.gan@dell.com, yanxiu.cai@dell.com, libin.zhang@dell.com, ao.sun@dell.com
Subject: [PATCH 2/2] nvme-tcp: support specifying the congestion-control
Date: Fri, 4 Mar 2022 17:27:54 +0800
Message-Id: <20220304092754.2721-3-sunmingbao@tom.com>
In-Reply-To: <20220304092754.2721-1-sunmingbao@tom.com>
References: <20220304092754.2721-1-sunmingbao@tom.com>

From: Mingbao Sun

Congestion control can have a noticeable impact on the performance of
TCP-based communication, and NVMe over TCP is no exception.

Different congestion-control algorithms (e.g., cubic, dctcp) suit
different scenarios. A well-chosen algorithm improves performance;
a poorly chosen one can severely degrade it.

The congestion control used by NVMe over TCP can be set by writing
'/proc/sys/net/ipv4/tcp_congestion_control', but this also changes the
congestion control of every future TCP socket that has not been
explicitly assigned one, and so may affect the performance of those
sockets as well.

So it makes sense for NVMe over TCP to support specifying the
congestion control directly. This commit addresses the host side.

Implementation approach:
a new option called 'tcp_congestion' was added to the fabrics
opt_tokens so that the 'nvme connect' command can pass in the
congestion control specified by the user.
Then, in nvme_tcp_alloc_queue, the specified congestion control is
applied to the relevant sockets on the host side.

Signed-off-by: Mingbao Sun
---
 drivers/nvme/host/fabrics.c | 24 ++++++++++++++++++++++++
 drivers/nvme/host/fabrics.h |  2 ++
 drivers/nvme/host/tcp.c     | 20 +++++++++++++++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ee79a6d639b4..6d946f758372 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 #include "nvme.h"
 #include "fabrics.h"
 
@@ -548,6 +549,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_TOS, "tos=%d" },
 	{ NVMF_OPT_FAIL_FAST_TMO, "fast_io_fail_tmo=%d" },
 	{ NVMF_OPT_DISCOVERY, "discovery" },
+	{ NVMF_OPT_TCP_CONGESTION, "tcp_congestion=%s" },
 	{ NVMF_OPT_ERR, NULL }
 };
 
@@ -560,6 +562,8 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	size_t nqnlen = 0;
 	int ctrl_loss_tmo = NVMF_DEF_CTRL_LOSS_TMO;
 	uuid_t hostid;
+	bool ecn_ca;
+	u32 key;
 
 	/* Set defaults */
 	opts->queue_size = NVMF_DEF_QUEUE_SIZE;
@@ -829,6 +833,25 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_DISCOVERY:
 			opts->discovery_nqn = true;
 			break;
+		case NVMF_OPT_TCP_CONGESTION:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			key = tcp_ca_get_key_by_name(NULL, p, &ecn_ca);
+			if (key == TCP_CA_UNSPEC) {
+				pr_err("congestion control %s not found.\n",
+				       p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+
+			kfree(opts->tcp_congestion);
+			opts->tcp_congestion = p;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
@@ -947,6 +970,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 	kfree(opts->subsysnqn);
 	kfree(opts->host_traddr);
 	kfree(opts->host_iface);
+	kfree(opts->tcp_congestion);
 	kfree(opts);
 }
 EXPORT_SYMBOL_GPL(nvmf_free_options);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index c3203ff1c654..25fdc169949d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -68,6 +68,7 @@ enum {
 	NVMF_OPT_FAIL_FAST_TMO = 1 << 20,
 	NVMF_OPT_HOST_IFACE = 1 << 21,
 	NVMF_OPT_DISCOVERY = 1 << 22,
+	NVMF_OPT_TCP_CONGESTION = 1 << 23,
 };
 
 /**
@@ -117,6 +118,7 @@ struct nvmf_ctrl_options {
 	unsigned int nr_io_queues;
 	unsigned int reconnect_delay;
 	bool discovery_nqn;
+	const char *tcp_congestion;
 	bool duplicate_connect;
 	unsigned int kato;
 	struct nvmf_host *host;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 6cbcc8b4daaf..cb2c7d7371d4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1403,6 +1403,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	char ca_name[TCP_CA_NAME_MAX];
+	sockptr_t optval;
 	int ret, rcv_pdu_size;
 
 	mutex_init(&queue->queue_lock);
@@ -1447,6 +1449,21 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	if (nctrl->opts->tos >= 0)
 		ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
 
+	if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
+		strncpy(ca_name, nctrl->opts->tcp_congestion,
+			TCP_CA_NAME_MAX-1);
+		optval = KERNEL_SOCKPTR(ca_name);
+		ret = sock_common_setsockopt(queue->sock, IPPROTO_TCP,
+					     TCP_CONGESTION, optval,
+					     strlen(ca_name));
+		if (ret) {
+			dev_err(nctrl->device,
+				"failed to set TCP congestion to %s: %d\n",
+				ca_name, ret);
+			goto err_sock;
+		}
+	}
+
 	/* Set 10 seconds timeout for icresp recvmsg */
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
@@ -2611,7 +2628,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
 			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
 			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
-			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE |
+			  NVMF_OPT_TCP_CONGESTION,
 	.create_ctrl = nvme_tcp_create_ctrl,
 };
 
-- 
2.26.2
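
Note on the mechanism: both patches in this series apply the TCP_CONGESTION
socket option to individual kernel sockets rather than changing the
system-wide sysctl. The same per-socket behaviour can be observed from
userspace. The following is only a sketch under stated assumptions, not
code from the series: the helper names set_tcp_congestion() and
get_tcp_congestion() and the local CA_NAME_MAX constant are hypothetical,
and it assumes a Linux system whose libc exposes TCP_CONGESTION in
<netinet/tcp.h>.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: mirrors the kernel-internal TCP_CA_NAME_MAX (16 bytes),
 * which is not exported to userspace headers. */
#define CA_NAME_MAX 16

/* Hypothetical helper: select the congestion-control algorithm `name`
 * for this one socket only. Returns 0 on success, -1 on failure
 * (for example when the algorithm is unknown or not permitted). */
int set_tcp_congestion(int fd, const char *name)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  name, strlen(name));
}

/* Hypothetical helper: read back the algorithm currently attached to
 * the socket into `buf` as a NUL-terminated string. */
int get_tcp_congestion(int fd, char *buf, size_t len)
{
	socklen_t optlen = (socklen_t)len;

	if (len == 0)
		return -1;

	memset(buf, 0, len);
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &optlen) < 0)
		return -1;

	buf[len - 1] = '\0';	/* defensive termination */
	return 0;
}
```

A caller would create a TCP socket, call set_tcp_congestion(fd, "dctcp")
before connect() or listen(), and confirm the result with
get_tcp_congestion(fd, buf, CA_NAME_MAX). Other sockets on the system keep
whatever '/proc/sys/net/ipv4/tcp_congestion_control' specifies, which is
the isolation the series aims for.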