From nobody Tue Dec 16 13:22:06 2025
From: Halil Pasic
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Simon Horman, Jonathan Corbet, "D. Wythe", Dust Li, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Tony Lu, netdev@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-rdma@vger.kernel.org, linux-s390@vger.kernel.org
Cc: Halil Pasic, Wen Gu, Guangguan Wang, Bagas Sanjaya
Subject: [PATCH net-next v6 1/2] net/smc: make wr buffer count configurable
Date: Mon, 27 Oct 2025 23:48:55 +0100
Message-ID: <20251027224856.2970019-2-pasic@linux.ibm.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20251027224856.2970019-1-pasic@linux.ibm.com>
References: <20251027224856.2970019-1-pasic@linux.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT, used in the send context,
and SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT, used in the recv context.
Those get replaced with lgr->max_send_wr and lgr->max_recv_wr
respectively.

Please note that although qp_attr.cap.max_send_wr ==
qp_attr.cap.max_recv_wr still holds with the default sysctl values, this
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.
Signed-off-by: Halil Pasic
Reviewed-by: Sidraya Jayagond
Reviewed-by: Dust Li
Tested-by: Mahanta Jambigi
---
 Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.h                      |  6 +++++
 net/smc/smc_ib.c                        | 10 +++----
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 31 ++++++++++-----------
 net/smc/smc_wr.h                        |  2 --
 9 files changed, 91 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..337ac2be167e 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_max_send_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair)
+	level resources necessary for performing RDMA operations. Since up to
+	255 connections can share a link group, and thus also a link, and the
+	number of work request buffers is decided when the link is allocated,
+	depending on the workload they can become a bottleneck, in the sense
+	that threads have to wait for work request buffers to become
+	available. Before the introduction of this control, the maximum
+	number of work request buffers available on the send path was
+	hard-coded to 16. With this control it becomes configurable. The
+	acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a
+	physically contiguous array, in which each element is a single buffer
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up
+	much like before having this control.
+
+	Default: 16
+
+smcr_max_recv_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair)
+	level resources necessary for performing RDMA operations. Since up to
+	255 connections can share a link group, and thus also a link, and the
+	number of work request buffers is decided when the link is allocated,
+	depending on the workload they can become a bottleneck, in the sense
+	that threads have to wait for work request buffers to become
+	available. Before the introduction of this control, the maximum
+	number of work request buffers available on the receive path was
+	hard-coded to 48. With this control it becomes configurable. The
+	acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a
+	physically contiguous array, in which each element is a single buffer
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up
+	much like before having this control.
+
+	Default: 48
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index fc752a50f91b..6ceb12baec24 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -24,5 +24,7 @@ struct netns_smc {
 	int sysctl_rmem;
 	int sysctl_max_links_per_lgr;
 	int sysctl_max_conns_per_lgr;
+	unsigned int sysctl_smcr_max_send_wr;
+	unsigned int sysctl_smcr_max_recv_wr;
 };
 #endif
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index a5a78cbff341..8d06c8bb14e9 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -34,6 +34,8 @@
  * distributions may modify it to a value between
  * 16-255 as needed.
  */
+#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
+#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
 
 struct smc_lgr_list {			/* list of link group definition */
 	struct list_head	list;
@@ -366,6 +368,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			max_send_wr;
+						/* number of WR buffers on send */
+			u16			max_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..1154907c5c05 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,11 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	/* include unsolicited rdma_writes as well,
+	 * there are max. 2 RDMA_WRITE per 1 WR_SEND
+	 */
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..f5d5eb617526 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
+	lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..7b2471904d04 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+static unsigned int smcr_max_wr_min = 2;
+static unsigned int smcr_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE,
 	},
+	{
+		.procname = "smcr_max_send_wr",
+		.data = &init_net.smc.sysctl_smcr_max_send_wr,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = &smcr_max_wr_min,
+		.extra2 = &smcr_max_wr_max,
+	},
+	{
+		.procname = "smcr_max_recv_wr",
+		.data = &init_net.smc.sysctl_smcr_max_recv_wr,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = &smcr_max_wr_min,
+		.extra2 = &smcr_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
@@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
 	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	/* disable handshake limitation by default */
 	net->smc.limit_smc_hs = 0;
 
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
index eb2465ae1e15..8538915af7af 100644
--- a/net/smc/smc_sysctl.h
+++ b/net/smc/smc_sysctl.h
@@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
 	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	return 0;
 }
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..883fb0f1ce43 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 			       IB_QP_DEST_QPN,
 			       &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1
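As a quick illustration of how the two new knobs interact, consider the
following standalone sketch. It is not part of the series: the main()
harness and the printf output are purely hypothetical, only the default
constants and the factor of 3 come from the patch. smc_llc_lgr_init()
copies the per-netns sysctl values into the link group, and
smc_ib_create_queue_pair() triples the send-side count to leave room for
up to 2 unsolicited RDMA_WRITEs per WR_SEND; with the defaults this
yields 3 * 16 == 48 on both sides, which is the equality the commit
message refers to.

#include <stdio.h>

int main(void)
{
	/* defaults introduced by patch 1/2 (SMCR_MAX_SEND_WR_DEF,
	 * SMCR_MAX_RECV_WR_DEF); in the kernel these are per-netns sysctls
	 */
	unsigned int sysctl_smcr_max_send_wr = 16;
	unsigned int sysctl_smcr_max_recv_wr = 48;

	/* smc_llc_lgr_init(): the link group inherits the sysctl values */
	unsigned short max_send_wr = (unsigned short)sysctl_smcr_max_send_wr;
	unsigned short max_recv_wr = (unsigned short)sysctl_smcr_max_recv_wr;

	/* smc_ib_create_queue_pair(): include unsolicited rdma_writes,
	 * at most 2 RDMA_WRITE per 1 WR_SEND, hence the factor of 3
	 */
	printf("qp_attr.cap.max_send_wr = %u\n", 3u * max_send_wr); /* 48 */
	printf("qp_attr.cap.max_recv_wr = %u\n",
	       (unsigned int)max_recv_wr);                          /* 48 */
	return 0;
}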
From nobody Tue Dec 16 13:22:06 2025
From: Halil Pasic
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Simon Horman, Jonathan Corbet, "D. Wythe", Dust Li, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Tony Lu, netdev@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-rdma@vger.kernel.org, linux-s390@vger.kernel.org
Cc: Halil Pasic, Wen Gu, Guangguan Wang, Bagas Sanjaya
Subject: [PATCH net-next v6 2/2] net/smc: handle -ENOMEM from
 smc_wr_alloc_link_mem gracefully
Date: Mon, 27 Oct 2025 23:48:56 +0100
Message-ID: <20251027224856.2970019-3-pasic@linux.ibm.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20251027224856.2970019-1-pasic@linux.ibm.com>
References: <20251027224856.2970019-1-pasic@linux.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by giving up
and going the way of a TCP fallback. This was reasonable when the sizes
of the allocations there were compile-time constants and reasonably
small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested size,
unless we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we would at best
end up allocating less than 16 send WR buffers or less than 48 recv WR
buffers. This is to avoid regressions due to having fewer buffers
compared to the static values of the past.

Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.

Preserve (modulo truncation) the ratio of send/recv WR buffer counts.

Signed-off-by: Halil Pasic
Reviewed-by: Wenjia Zhang
Reviewed-by: Mahanta Jambigi
Reviewed-by: Sidraya Jayagond
Reviewed-by: Dust Li
Tested-by: Mahanta Jambigi
---
 Documentation/networking/smc-sysctl.rst |  8 ++++--
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 26 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index 337ac2be167e..904a910f198e 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a
 	physically contiguous array, in which each element is a single buffer
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep
+	retrying with half of the buffer count until either we succeed or
+	(unlikely) dip below the old hard-coded value of 16; then we give up
 	much like before having this control.
 
 	Default: 16
 
@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a
 	physically contiguous array, in which each element is a single buffer
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep
+	retrying with half of the buffer count until either we succeed or
+	(unlikely) dip below the old hard-coded value of 48; then we give up
 	much like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index be0c2da83d2b..e4eabc83719e 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->max_send_wr /= 2;
+		lnk->max_recv_wr /= 2;
+		/* ... unless dropping below the old SMC_WR_BUF_CNT values */
+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 8d06c8bb14e9..5c18f08a4c8a 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -175,6 +175,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			max_send_wr;
+	u16			max_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 883fb0f1ce43..5feafa98ab1a 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 			       IB_QP_DEST_QPN,
 			       &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,51 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_pends = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -906,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1
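The retry policy of patch 2/2 can also be read in isolation. The
following is a standalone sketch, not kernel code: fake_alloc() and its
failure threshold are hypothetical stand-ins for smc_wr_alloc_link_mem()
under memory pressure, and the starting counts stand in for large sysctl
settings. On -ENOMEM both counts are halved in lockstep, preserving
their ratio modulo truncation, and we give up once either count would
drop below the old static sizes (16 send, 48 recv).

#include <errno.h>
#include <stdio.h>

/* hypothetical stand-in for smc_wr_alloc_link_mem(): pretend that
 * requests above a certain size always fail with -ENOMEM
 */
static int fake_alloc(unsigned int send_wr, unsigned int recv_wr)
{
	return (send_wr > 256 || recv_wr > 768) ? -ENOMEM : 0;
}

int main(void)
{
	unsigned int send_wr = 2048; /* e.g. sysctls set to the maximum */
	unsigned int recv_wr = 2048;

	for (;;) {
		int rc = fake_alloc(send_wr, recv_wr);

		if (!rc) {
			printf("allocated %u send / %u recv WRs\n",
			       send_wr, recv_wr);
			return 0;
		}
		if (rc != -ENOMEM) /* non-ENOMEM errors: give up at once */
			break;
		/* retry with half, keeping the send/recv ratio ... */
		send_wr /= 2;
		recv_wr /= 2;
		/* ... unless dropping below the old static sizes */
		if (send_wr < 16 || recv_wr < 48)
			break;
	}
	printf("giving up, SMC falls back to TCP\n");
	return 1;
}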