[PATCH net-next 1/2] net/smc: make wr buffer count configurable

Halil Pasic posted 2 patches 4 weeks ago
There is a newer version of this series
[PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Halil Pasic 4 weeks ago
Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr, respectively.

While at it, let us also remove a confusing comment that is either not
about the context in which it resides (the initialization of
qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or no longer
applicable once these values become configurable.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
 net/smc/smc.h                           |  2 ++
 net/smc/smc_core.h                      |  4 +++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_wr.c                        | 32 +++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 8 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..c687092329e3 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_max_send_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair)
+	level resources necessary for performing RDMA operations. Since up to
+	255 connections can share a link group, and thus also a link, and the
+	number of work request buffers is decided when the link is allocated,
+	depending on the workload this can become a bottleneck in the sense
+	that threads have to wait for work request buffers to become available.
+	Before the introduction of this control the maximum number of work
+	request buffers available on the send path used to be hard coded to
+	16. With this control it becomes configurable. The acceptable range
+	is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a
+	physically contiguous array in which each element is a single buffer
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up,
+	much like before this control existed.
+
+	Default: 16
+
+smcr_max_recv_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group, and thus also a link, and the number
+	of work request buffers is decided when the link is allocated, depending
+	on the workload this can become a bottleneck in the sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximum number of work request buffers
+	available on the receive path used to be hard coded to 48. With this
+	control it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a
+	physically contiguous array in which each element is a single buffer
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up,
+	much like before this control existed.
+
+	Default: 48
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 2c9084963739..ffe48253fa1f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -33,6 +33,8 @@
 
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
+extern unsigned int smc_ib_sysctl_max_send_wr;
+extern unsigned int smc_ib_sysctl_max_recv_wr;
 
 extern struct smc_hashinfo smc_v4_hashinfo;
 extern struct smc_hashinfo smc_v6_hashinfo;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 48a1b1dcb576..b883f43fc206 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -361,6 +361,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			max_send_wr;
+						/* number of WR buffers on send */
+			u16			max_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..e8d35c22c525 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..91c936bf7336 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->max_send_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_send_wr));
+	lgr->max_recv_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..01da1297e150 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,10 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+unsigned int smc_ib_sysctl_max_send_wr = 16;
+unsigned int smc_ib_sysctl_max_recv_wr = 48;
+static unsigned int smc_ib_sysctl_max_wr_min = 2;
+static unsigned int smc_ib_sysctl_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname       = "smcr_max_send_wr",
+		.data		= &smc_ib_sysctl_max_send_wr,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1		= &smc_ib_sysctl_max_wr_min,
+		.extra2		= &smc_ib_sysctl_max_wr_max,
+	},
+	{
+		.procname       = "smcr_max_recv_wr",
+		.data		= &smc_ib_sysctl_max_recv_wr,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1		= &smc_ib_sysctl_max_wr_min,
+		.extra2		= &smc_ib_sysctl_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..85ebc65f1546 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -34,6 +34,7 @@
 #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
 
 #define SMC_WR_RX_HASH_BITS 4
+
 static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
@@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1
Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by kernel test robot 3 weeks, 5 days ago
Hi Halil,

kernel test robot noticed the following build errors:

[auto build test ERROR on 5ef04a7b068cbb828eba226aacb42f880f7924d7]

url:    https://github.com/intel-lab-lkp/linux/commits/Halil-Pasic/net-smc-make-wr-buffer-count-configurable/20250905-051510
base:   5ef04a7b068cbb828eba226aacb42f880f7924d7
patch link:    https://lore.kernel.org/r/20250904211254.1057445-2-pasic%40linux.ibm.com
patch subject: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
config: loongarch-randconfig-002-20250906 (https://download.01.org/0day-ci/archive/20250907/202509070225.pVKkaaCr-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250907/202509070225.pVKkaaCr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509070225.pVKkaaCr-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "smc_ib_sysctl_max_send_wr" [net/smc/smc.ko] undefined!
>> ERROR: modpost: "smc_ib_sysctl_max_recv_wr" [net/smc/smc.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Dust Li 4 weeks ago
On 2025-09-04 23:12:52, Halil Pasic wrote:
>Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
>SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
>get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.

Hi Halil,

I think making the WR buffer count configurable helps make SMC more flexible.

However, there are two additional issues we need to consider:

1. What if the two sides have different max_send_wr/max_recv_wr configurations?
IIUC, For example, if the client sets max_send_wr to 64, but the server sets
max_recv_wr to 16, the client might overflow the server's QP receive
queue, potentially causing an RNR (Receiver Not Ready) error.

2. Since WR buffers are configurable, it’d be helpful to add some
monitoring methods so we can keep track of how many WR buffers each QP
is currently using. This should be useful now that you introduced a fallback
retry mechanism, which can cause the number of WR buffers to change
dynamically.
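
To make the first point concrete (illustration only, plugging the example
values from point 1 into the hunks of this patch, and assuming both peers
run it), the two sides would set up their queue pairs with:

	/* client, net.smc.smcr_max_send_wr = 64 */
	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;	/* 3 * 64 */
	/* server, net.smc.smcr_max_recv_wr = 16 */
	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;	/* 16 */

i.e. the client may keep up to 64 CDC/LLC sends in flight while the
server only posts 16 receive buffers.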


Some other minor issues in the patch itself, see below

>+smcr_max_send_wr - INTEGER

Why call it max ? But not something like smcr_send_wr_cnt ?

>@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
> 		.extra1		= SYSCTL_ZERO,
> 		.extra2		= SYSCTL_ONE,
> 	},
>+	{
>+		.procname       = "smcr_max_send_wr",
>+		.data		= &smc_ib_sysctl_max_send_wr,
>+		.maxlen         = sizeof(int),
>+		.mode           = 0644,
>+		.proc_handler   = proc_dointvec_minmax,
>+		.extra1		= &smc_ib_sysctl_max_wr_min,
>+		.extra2		= &smc_ib_sysctl_max_wr_max,
>+	},
>+	{
>+		.procname       = "smcr_max_recv_wr",
>+		.data		= &smc_ib_sysctl_max_recv_wr,
>+		.maxlen         = sizeof(int),
>+		.mode           = 0644,
>+		.proc_handler   = proc_dointvec_minmax,

It's better to use tab instead of space before those '=' here.

Best regards,
Dust

Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Halil Pasic 3 weeks, 6 days ago
On Fri, 5 Sep 2025 11:45:36 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> On 2025-09-04 23:12:52, Halil Pasic wrote:
> >Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> >SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> >get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.  
> 
> Hi Halil,
> 
> I think making the WR buffer count configurable helps make SMC more flexible.

Hi Dust Li,

Thank you for having a look. 

> 
> However, there are two additional issues we need to consider:
> 
> 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
> IIUC, For example, if the client sets max_send_wr to 64, but the server sets
> max_recv_wr to 16, the client might overflow the server's QP receive
> queue, potentially causing an RNR (Receiver Not Ready) error.

I don't think the 16 is spec-ed anywhere and if the client and the server
need to agree on the same value it should either be speced, or a
protocol mechanism for negotiating it needs to exist. So what is your
take on this as an SMC maintainer?

I think, we have tested heterogeneous setups and didn't see any grave
issues. But let me please do a follow up on this. Maybe the other
maintainers can chime in as well.

> 
> 2. Since WR buffers are configurable, it’d be helpful to add some
> monitoring methods so we can keep track of how many WR buffers each QP
> is currently using. This should be useful now that you introduced a fallback
> retry mechanism, which can cause the number of WR buffers to change
> dynamically.
> 

I agree, but I think that can be done in a different scope. I don't think
this needs to be a part of the MVP. Or do you think that it needs to
be part of the series?
> 
> Some other minor issues in the patch itself, see below
> 
> >+smcr_max_send_wr - INTEGER  
> 
> Why call it max ? But not something like smcr_send_wr_cnt ?

Because of the back-off mechanism. You are not guaranteed to get
this many but you are guaranteed to not get more.
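
E.g. independent of the back-off (patch 2/2), the device caps can clamp
the value further, see the smc_wr_remember_qp_attr() hunk in this patch:

	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
			       lnk->qp_attr.cap.max_send_wr);
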
> > static struct ctl_table smc_table[] = {
> > 	{
> >@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
> > 		.extra1		= SYSCTL_ZERO,
> > 		.extra2		= SYSCTL_ONE,
> > 	},
> >+	{
> >+		.procname       = "smcr_max_send_wr",
> >+		.data		= &smc_ib_sysctl_max_send_wr,
> >+		.maxlen         = sizeof(int),
> >+		.mode           = 0644,
> >+		.proc_handler   = proc_dointvec_minmax,
> >+		.extra1		= &smc_ib_sysctl_max_wr_min,
> >+		.extra2		= &smc_ib_sysctl_max_wr_max,
> >+	},
> >+	{
> >+		.procname       = "smcr_max_recv_wr",
> >+		.data		= &smc_ib_sysctl_max_recv_wr,
> >+		.maxlen         = sizeof(int),
> >+		.mode           = 0644,
> >+		.proc_handler   = proc_dointvec_minmax,  
> 
> It's better to use tab instead of space before those '=' here.
> 

I can definitely fix that and do a respin if you like. But I think
we need to sort out the other problems first.

Regards,
Halil

Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Halil Pasic 3 weeks, 6 days ago
On Fri, 5 Sep 2025 11:00:59 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
> > max_recv_wr to 16, the client might overflow the server's QP receive
> > queue, potentially causing an RNR (Receiver Not Ready) error.  
> 
> I don't think the 16 is spec-ed anywhere and if the client and the server
> need to agree on the same value it should either be speced, or a
> protocol mechanism for negotiating it needs to exist. So what is your
> take on this as an SMC maintainer?
> 
> I think, we have tested heterogeneous setups and didn't see any grave
> issues. But let me please do a follow up on this. Maybe the other
> maintainers can chime in as well.

Did some research and some thinking. Are you concerned about a
performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
to my current understanding the RNR must not lead to a catastrophic
failure, but the RDMA/IB stack is supposed to handle that.

I would like to also point out that bumping SMC_WR_BUF_CNT basically has
the same problem, although admittedly to a smaller extent because it is
only between "old" and "new".

Assuming that my understanding is correct, I believe that the problem of
the potential RNR is inherent to the objective of the series, and
probably one that can be lived with. Given this entire EID business, I
think the SMC-R setup is likely to happen in a coordinated fashion for
all potential peers, and I hope whoever tweaks those values has a
sufficient understanding or empirical evidence to justify the tweaks.

Assuming my understanding is not utterly wrong, I would very much like
to know what you would want me to do with this.

Thank you in advance!

Regards,
Halil
Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Dust Li 3 weeks, 6 days ago
On 2025-09-05 14:01:35, Halil Pasic wrote:
>On Fri, 5 Sep 2025 11:00:59 +0200
>Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>> > max_recv_wr to 16, the client might overflow the server's QP receive
>> > queue, potentially causing an RNR (Receiver Not Ready) error.  
>>
>> I don't think the 16 is spec-ed anywhere and if the client and the server
>> need to agree on the same value it should either be speced, or a
>> protocol mechanism for negotiating it needs to exist. So what is your
>> take on this as an SMC maintainer?

Right — I didn't realize that either until I saw this patch today :)
But since the implementation's been set to 16 since day one, bumping it
up might break things.

>>
>> I think, we have tested heterogeneous setups and didn't see any grave
>> issues. But let me please do a follow up on this. Maybe the other
>> maintainers can chime in as well.

I'm glad to hear from others.

>
>Did some research and some thinking. Are you concerned about a
>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
>to my current understanding the RNR must not lead to a catastrophic
>failure, but the RDMA/IB stack is supposed to handle that.

No, it's not just a performance regression.
If we get an RNR when going from 64 -> 16, the whole link group gets
torn down — and all SMC connections inside it break.
So from the user’s point of view, connections will just randomly drop
out of nowhere.

>
>I would like to also point out that bumping SMC_WR_BUF_CNT basically has
>the same problem, although admittedly to a smaller extent because it is
>only between "old" and "new".

Right.

>
>Assuming that my understanding is correct, I believe that the problem of
>the potential RNR is inherent to the objective of the series, and
>probably one that can be lived with. Given this entire EID business, I
>think the SMC-R setup is likely to happen in a coordinated fashion for
>all potential peers, and I hope whoever tweaks those values has a
>sufficient understanding or empirical evidence to justify the tweaks.
>
>Assuming my understanding is not utterly wrong, I would very much like
>to know what you would want me to do with this.

In my opinion, the best way to support different SMC_WR_BUF_CNT values
is: First, define it in the spec - then do a handshake to agree on the
value.
If the peer doesn’t support this feature, just fall back to 16.
If both sides support it, use the smaller (MIN) of the two values.
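
Purely as an illustration (none of the fields or helpers below exist
today, the names are made up), the resulting rule could look roughly
like:

	/* hypothetical: peer value learned via a new CLC handshake field */
	if (!lgr->peer_announced_wr_cnt)	/* legacy peer */
		lgr->max_send_wr = 16;
	else
		lgr->max_send_wr = min_t(u16, lgr->max_send_wr,
					 lgr->peer_max_recv_wr);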

Best regards,
Dust

Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Dust Li 3 weeks, 6 days ago
On 2025-09-05 22:22:48, Dust Li wrote:
>On 2025-09-05 14:01:35, Halil Pasic wrote:
>>On Fri, 5 Sep 2025 11:00:59 +0200
>>Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>>> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>>> > max_recv_wr to 16, the client might overflow the server's QP receive
>>> > queue, potentially causing an RNR (Receiver Not Ready) error.  
>>>
>>> I don't think the 16 is spec-ed anywhere and if the client and the server
>>> need to agree on the same value it should either be speced, or a
>>> protocol mechanism for negotiating it needs to exist. So what is your
>>> take on this as an SMC maintainer?
>
>Right — I didn't realize that either until I saw this patch today :)
>But since the implementation's been set to 16 since day one, bumping it
>up might break things.
>
>>>
>>> I think, we have tested heterogeneous setups and didn't see any grave
>>> issues. But let me please do a follow up on this. Maybe the other
>>> maintainers can chime in as well.
>
>I'm glad to hear from others.
>
>>
>>Did some research and some thinking. Are you concerned about a
>>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
>>to my current understanding the RNR must not lead to a catastrophic
>>failure, but the RDMA/IB stack is supposed to handle that.
>
>No, it's not just a performance regression.
>If we get an RNR when going from 64 -> 16, the whole link group gets
>torn down — and all SMC connections inside it break.
>So from the user’s point of view, connections will just randomly drop
>out of nowhere.

I double-checked the code and noticed we set qp_attr.rnr_retry =
SMC_QP_RNR_RETRY = 7, which means "infinite retries."
So the QP will just keep retrying — we won't actually get an RNR.
That said, yeah, just performance regression.
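
For reference, if I remember the code correctly this is set in
smc_ib_modify_qp_rts() in net/smc/smc_ib.c (existing code, not part of
this patch, abridged here):

	#define SMC_QP_RNR_RETRY	7	/* 7: infinite */
	...
	qp_attr.rnr_retry = SMC_QP_RNR_RETRY;	/* RNR retries, 7=infinite */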

So in this case, I would regard it as acceptable. We can go with this.

Best regards,
Dust

Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Halil Pasic 3 weeks, 6 days ago
On Fri, 5 Sep 2025 22:51:37 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >>Did some research and some thinking. Are you concerned about a
> >>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
> >>to my current understanding the RNR must not lead to a catastrophic
> >>failure, but the RDMA/IB stack is supposed to handle that.  
> >
> >No, it's not just a performance regression.
> >If we get an RNR when going from 64 -> 16, the whole link group gets
> >torn down — and all SMC connections inside it break.
> >So from the user’s point of view, connections will just randomly drop
> >out of nowhere.  
> 
> I double-checked the code and noticed we set qp_attr.rnr_retry =
> SMC_QP_RNR_RETRY = 7, which means "infinite retries."
> So the QP will just keep retrying — we won't actually get an RNR.
> That said, yeah, just performance regression.
> 
> So in this case, I would regard it as acceptable. We can go with this.

Yes, that is consistent with Mahanta's testing, in the sense that he did
not see any catastrophic failure. Regarding the performance regression,
I don't know how bad it is. Mahanta was so kind as to do most of the
testing.

So that leaves us with replacing the spaces with tabs and maybe with
the names, right? If you have a proposal for a better name, let's talk
about it.

BTW are you aware of any generic counters that would help with
figuring out how many RNRs have been sent/received?

Regards,
Halil
Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
Posted by Mahanta Jambigi 3 weeks, 6 days ago

On 05/09/25 5:31 pm, Halil Pasic wrote:
> On Fri, 5 Sep 2025 11:00:59 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>>> 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>>> IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>>> max_recv_wr to 16, the client might overflow the server's QP receive
>>> queue, potentially causing an RNR (Receiver Not Ready) error.  
>>
>> I don't think the 16 is spec-ed anywhere and if the client and the server
>> need to agree on the same value it should either be speced, or a
>> protocol mechanism for negotiating it needs to exist. So what is your
>> take on this as an SMC maintainer?
>>
>> I think, we have tested heterogeneous setups and didn't see any grave
>> issues. But let me please do a follow up on this. Maybe the other
>> maintainers can chime in as well.
> 
> Did some research and some thinking. Are you concerned about a
> performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
> to my current understanding the RNR must not lead to a catastrophic
> failure, but the RDMA/IB stack is supposed to handle that.
> 

Hi Dust,

I configured a client-server setup and did some SMC-R testing with the
values you proposed. I ran iperf3 (via smc_run) with up to 128 parallel
connections and it looks good: no TCP fallback, no obvious errors. As
Halil mentioned, I don't see any catastrophic failure here. Let me know
if I should stress the system with more tests, or if there is a specific
test you think may cause RNR errors. The setup is ready and I can try
it.

*Client* side logs:
[root@client ~]$ sysctl net.smc.smcr_max_send_wr
net.smc.smcr_max_send_wr = 64
[root@client ~]$

[root@client ~]$ smc_run iperf3  -P 128 -t 120 -c 10.25.0.72
Connecting to host 10.25.0.72, port 5201
[  5] local 10.25.0.73 port 52544 connected to 10.25.0.72 port 5201
[  7] local 10.25.0.73 port 52558 connected to 10.25.0.72 port 5201

*Server* side logs:
[root@server ~]$ sysctl net.smc.smcr_max_recv_wr
net.smc.smcr_max_recv_wr = 16
[root@server ~]$

[root@server~]$ smc_run iperf3 -s

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------