[PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

Halil Pasic posted 2 patches 2 days, 21 hours ago
Posted by Halil Pasic 2 days, 21 hours ago
Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by giving
up and going the way of a TCP fallback. This was reasonable when the
sizes of the allocations there were compile-time constants and
reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested size
until we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we would at best
end up allocating fewer than 16 send WR buffers or fewer than 48 recv
WR buffers. This is to avoid regressions due to having fewer buffers
compared to the static values of the past.

Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.

Preserve (modulo truncation) the ratio of send/recv WR buffer counts.
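
To make the numbers concrete (the starting values are made up for the
example, they are not the defaults): starting from 256 send and 768
recv WR buffers, the retry sequence is 128/384, 64/192, 32/96 and
finally 16/48; one more halving would drop below the old minimums, so
there we give up. A minimal user-space sketch of the back-off policy
follows; try_alloc() is a hypothetical stand-in for
smc_wr_alloc_link_mem() and is hard-wired to fail so that the whole
sequence is printed:

	#include <stdbool.h>
	#include <stdio.h>

	/* stand-in for smc_wr_alloc_link_mem(); always "fails" here */
	static bool try_alloc(unsigned int send_wr, unsigned int recv_wr)
	{
		return false;
	}

	int main(void)
	{
		unsigned int send_wr = 256, recv_wr = 768;

		while (!try_alloc(send_wr, recv_wr)) {
			/* halve both counts, preserving their ratio
			 * modulo truncation
			 */
			send_wr /= 2;
			recv_wr /= 2;
			/* give up below the old static 16/48 minimums */
			if (send_wr < 16 || recv_wr < 48) {
				printf("giving up -> TCP fallback\n");
				return 1;
			}
			printf("retrying with %u/%u\n", send_wr, recv_wr);
		}
		return 0;
	}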

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst |  8 ++++--
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 26 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index 5de4893ef3e7..4a5b4c89bc97 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until we either succeed or (unlikely)
+	dip below the old hard-coded value of 16, at which point we give up much
 	like before having this control.
 
 	Default: 16
@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until we either succeed or (unlikely)
+	dip below the old hard-coded value of 48, at which point we give up much
 	like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index be0c2da83d2b..e4eabc83719e 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->max_send_wr /= 2;
+		lnk->max_recv_wr /= 2;
+		/* ... unless dropping below the old hard-coded counts */
+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 8d06c8bb14e9..5c18f08a4c8a 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -175,6 +175,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			max_send_wr;
+	u16			max_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 883fb0f1ce43..5feafa98ab1a 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,51 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_pends = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -906,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1
Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Posted by Dust Li 2 days, 19 hours ago
On 2025-09-29 02:00:01, Halil Pasic wrote:
>Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by giving
>up and going the way of a TCP fallback. This was reasonable when the
>sizes of the allocations there were compile-time constants and
>reasonably small. But now those are actually configurable.
>
>So instead of giving up, keep retrying with half of the requested size
>until we dip below the old static sizes -- then give up! In terms of
>numbers that means we give up when it is certain that we would at best
>end up allocating fewer than 16 send WR buffers or fewer than 48 recv
>WR buffers. This is to avoid regressions due to having fewer buffers
>compared to the static values of the past.
>
>Please note that SMC-R is supposed to be an optimisation over TCP, and
>falling back to TCP is superior to establishing an SMC connection that
>is going to perform worse. If the memory allocation fails (and we
>propagate -ENOMEM), we fall back to TCP.
>
>Preserve (modulo truncation) the ratio of send/recv WR buffer counts.
>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
>Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
>Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
>---
> Documentation/networking/smc-sysctl.rst |  8 ++++--
> net/smc/smc_core.c                      | 34 +++++++++++++++++--------
> net/smc/smc_core.h                      |  2 ++
> net/smc/smc_wr.c                        | 28 ++++++++++----------
> 4 files changed, 46 insertions(+), 26 deletions(-)
>
>diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
>index 5de4893ef3e7..4a5b4c89bc97 100644
>--- a/Documentation/networking/smc-sysctl.rst
>+++ b/Documentation/networking/smc-sysctl.rst
>@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
> 
> 	Please be aware that all the buffers need to be allocated as a physically
> 	continuous array in which each element is a single buffer and has the size
>-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
>+	with half of the buffer count until we either succeed or (unlikely)
>+	dip below the old hard-coded value of 16, at which point we give up much
> 	like before having this control.
> 
> 	Default: 16
>@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
> 
> 	Please be aware that all the buffers need to be allocated as a physically
> 	continuous array in which each element is a single buffer and has the size
>-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
>+	with half of the buffer count until we either succeed or (unlikely)
>+	dip below the old hard-coded value of 48, at which point we give up much
> 	like before having this control.
> 
> 	Default: 48
>diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
>index be0c2da83d2b..e4eabc83719e 100644
>--- a/net/smc/smc_core.c
>+++ b/net/smc/smc_core.c
>@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> 	lnk->clearing = 0;
> 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
> 	lnk->link_id = smcr_next_link_id(lgr);
>+	lnk->max_send_wr = lgr->max_send_wr;
>+	lnk->max_recv_wr = lgr->max_recv_wr;
> 	lnk->lgr = lgr;
> 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
> 	lnk->link_idx = link_idx;
>@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> 	rc = smc_llc_link_init(lnk);
> 	if (rc)
> 		goto out;
>-	rc = smc_wr_alloc_link_mem(lnk);
>-	if (rc)
>-		goto clear_llc_lnk;
> 	rc = smc_ib_create_protection_domain(lnk);
> 	if (rc)
>-		goto free_link_mem;
>-	rc = smc_ib_create_queue_pair(lnk);
>-	if (rc)
>-		goto dealloc_pd;
>+		goto clear_llc_lnk;
>+	do {
>+		rc = smc_ib_create_queue_pair(lnk);
>+		if (rc)
>+			goto dealloc_pd;
>+		rc = smc_wr_alloc_link_mem(lnk);
>+		if (!rc)
>+			break;
>+		else if (rc != -ENOMEM) /* give up */
>+			goto destroy_qp;
>+		/* retry with smaller ... */
>+		lnk->max_send_wr /= 2;
>+		lnk->max_recv_wr /= 2;
>+		/* ... unless dropping below the old hard-coded counts */
>+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
>+			goto destroy_qp;
>+		smc_ib_destroy_queue_pair(lnk);
>+	} while (1);
>+
> 	rc = smc_wr_create_link(lnk);
> 	if (rc)
>-		goto destroy_qp;
>+		goto free_link_mem;
> 	lnk->state = SMC_LNK_ACTIVATING;
> 	return 0;
> 
>+free_link_mem:
>+	smc_wr_free_link_mem(lnk);
> destroy_qp:
> 	smc_ib_destroy_queue_pair(lnk);
> dealloc_pd:
> 	smc_ib_dealloc_protection_domain(lnk);
>-free_link_mem:
>-	smc_wr_free_link_mem(lnk);
> clear_llc_lnk:
> 	smc_llc_link_clear(lnk, false);
> out:
>diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>index 8d06c8bb14e9..5c18f08a4c8a 100644
>--- a/net/smc/smc_core.h
>+++ b/net/smc/smc_core.h
>@@ -175,6 +175,8 @@ struct smc_link {
> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> 	int			llc_testlink_time; /* testlink interval */
> 	atomic_t		conn_cnt; /* connections on this link */
>+	u16			max_send_wr;
>+	u16			max_recv_wr;

Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
This means we can now have different max_send_wr/max_recv_wr values on two
different links within the same link group.

Since at Alibaba we don't use multi-link configurations, we haven't tested
this scenario. Have you tested the link-down handling process in a multi-link
setup?

Otherwise, the patch looks good to me.

Best regards,
Dust

> };
> 
> /* For now we just allow one parallel link per link group. The SMC protocol
>diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
>index 883fb0f1ce43..5feafa98ab1a 100644
>--- a/net/smc/smc_wr.c
>+++ b/net/smc/smc_wr.c
>@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
> 		    IB_QP_DEST_QPN,
> 		    &init_attr);
> 
>-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
>+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
> 			       lnk->qp_attr.cap.max_send_wr);
>-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
>+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
> 			       lnk->qp_attr.cap.max_recv_wr);
> }
> 
>@@ -741,51 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
> int smc_wr_alloc_link_mem(struct smc_link *link)
> {
> 	/* allocate link related memory */
>-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
> 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
> 	if (!link->wr_tx_bufs)
> 		goto no_mem;
>-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
>+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
> 				   GFP_KERNEL);
> 	if (!link->wr_rx_bufs)
> 		goto no_mem_wr_tx_bufs;
>-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
> 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
> 	if (!link->wr_tx_ibs)
> 		goto no_mem_wr_rx_bufs;
>-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
>+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
> 				  sizeof(link->wr_rx_ibs[0]),
> 				  GFP_KERNEL);
> 	if (!link->wr_rx_ibs)
> 		goto no_mem_wr_tx_ibs;
>-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_rdmas[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_rdmas)
> 		goto no_mem_wr_rx_ibs;
>-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
> 					sizeof(link->wr_tx_rdma_sges[0]),
> 					GFP_KERNEL);
> 	if (!link->wr_tx_rdma_sges)
> 		goto no_mem_wr_tx_rdmas;
>-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
>+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
> 				   GFP_KERNEL);
> 	if (!link->wr_tx_sges)
> 		goto no_mem_wr_tx_rdma_sges;
>-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
>+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
> 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
> 				   GFP_KERNEL);
> 	if (!link->wr_rx_sges)
> 		goto no_mem_wr_tx_sges;
>-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
>+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
> 	if (!link->wr_tx_mask)
> 		goto no_mem_wr_rx_sges;
>-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_pends = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_pends[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_pends)
> 		goto no_mem_wr_tx_mask;
>-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_compl = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_compl[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_compl)
>@@ -906,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
> 		goto dma_unmap;
> 	}
> 	smc_wr_init_sge(lnk);
>-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
>+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
> 	init_waitqueue_head(&lnk->wr_tx_wait);
> 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
> 	if (rc)
>-- 
>2.48.1
Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Posted by Halil Pasic 2 days, 12 hours ago
On Mon, 29 Sep 2025 09:50:52 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >@@ -175,6 +175,8 @@ struct smc_link {
> > 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> > 	int			llc_testlink_time; /* testlink interval */
> > 	atomic_t		conn_cnt; /* connections on this link */
> >+	u16			max_send_wr;
> >+	u16			max_recv_wr;  
> 
> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
> This means we can now have different max_send_wr/max_recv_wr values on two
> different links within the same link group.

Only if allocations fail. Please notice that the hunk:

--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;

initializes the link values with the values from the lgr, which are in
turn picked up from the sysctls at lgr creation time. I have made an
effort to keep these values the same for each link, but in case the
allocation fails and we do back off, we can end up with different values
on the links.

The alternative would be to throw in the towel, and not create
a second link if we can't match what worked for the first one.

> 
> Since at Alibaba we don't use multi-link configurations, we haven't tested
> this scenario. Have you tested the link-down handling process in a multi-link
> setup?
> 

Mahanta was kind enough to do most of the testing on this. I don't think
I've tested this myself. @Mahanta: Would you be so kind as to give this a
try if it wasn't covered in the past? The best way is probably to modify
the code to force such a scenario. I don't think it is easy to trigger
in the wild.

BTW I don't expect any problems. I think at worst one link would
end up giving worse performance than the other, but I guess that can
happen for other reasons as well (like different HW for the two links).

But I think getting some sort of query interface down the road, which
would tell us how much we actually ended up with, would be a good idea anyway.

And I hope we can switch to vmalloc down the road as well, which would
make backing off less likely.

> Otherwise, the patch looks good to me.
> 

Thank you very much!

Regards,
Halil
Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Posted by Paolo Abeni 14 hours ago
On 9/29/25 11:22 AM, Halil Pasic wrote:
> On Mon, 29 Sep 2025 09:50:52 +0800
> Dust Li <dust.li@linux.alibaba.com> wrote:
> 
>>> @@ -175,6 +175,8 @@ struct smc_link {
>>> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>>> 	int			llc_testlink_time; /* testlink interval */
>>> 	atomic_t		conn_cnt; /* connections on this link */
>>> +	u16			max_send_wr;
>>> +	u16			max_recv_wr;  
>>
>> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
>> This means we can now have different max_send_wr/max_recv_wr values on two
>> different links within the same link group.
> 
> Only if allocations fail. Please notice that the hunk:
> 
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	lnk->clearing = 0;
>  	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
>  	lnk->link_id = smcr_next_link_id(lgr);
> +	lnk->max_send_wr = lgr->max_send_wr;
> +	lnk->max_recv_wr = lgr->max_recv_wr;
> 
> initializes the link values with the values from the lgr, which are in
> turn picked up from the sysctls at lgr creation time. I have made an
> effort to keep these values the same for each link, but in case the
> allocation fails and we do back off, we can end up with different values
> on the links.
> 
> The alternative would be to throw in the towel, and not create
> a second link if we can't match what worked for the first one.
> 
>>
>> Since at Alibaba we don't use multi-link configurations, we haven't tested
>> this scenario. Have you tested the link-down handling process in a multi-link
>> setup?
>>
> 
> Mahanta was kind enough to do most of the testing on this. I don't think
> I've tested this myself. @Mahanta: Would you be so kind as to give this a
> try if it wasn't covered in the past? The best way is probably to modify
> the code to force such a scenario. I don't think it is easy to trigger
> in the wild.
> 
> BTW I don't expect any problems. I think at worst one link would
> end up giving worse performance than the other, but I guess that can
> happen for other reasons as well (like different HW for the two links).
> 
> But I think getting some sort of query interface down the road, which
> would tell us how much we actually ended up with, would be a good idea anyway.
> 
> And I hope we can switch to vmalloc down the road as well, which would
> make backing off less likely.

Unfortunately we are closing the net-next PR right now, and I would
prefer such testing to be reported explicitly. Let's defer this series
to the next cycle: please re-post when net-next reopens after Oct 12th.

Thanks,

Paolo
Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Posted by Halil Pasic 11 hours ago
On Wed, 1 Oct 2025 09:21:09 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> >>
> >> Since at Alibaba we don't use multi-link configurations, we haven't tested
> >> this scenario. Have you tested the link-down handling process in a multi-link
> >> setup?
> >>  
> > 
> > Mahanta was kind enough to do most of the testing on this. I don't think
> > I've tested this myself. @Mahanta: Would you be so kind as to give this a
> > try if it wasn't covered in the past? The best way is probably to modify
> > the code to force such a scenario. I don't think it is easy to trigger
> > in the wild.
> > 
> > BTW I don't expect any problems. I think at worst one link would
> > end up giving worse performance than the other, but I guess that can
> > happen for other reasons as well (like different HW for the two links).
> > 
> > But I think getting some sort of query interface down the road, which
> > would tell us how much we actually ended up with, would be a good idea anyway.
> > 
> > And I hope we can switch to vmalloc down the road as well, which would
> > make backing off less likely.
> 
> Unfortunately we are closing the net-next PR right now, and I would
> prefer such testing to be reported explicitly. Let's defer this series
> to the next cycle: please re-post when net-next reopens after Oct 12th.

Thank you, Paolo! Will do! I talked to Mahanta about this yesterday,
and he has done some testing in the meantime, but I'm not sure he
covered everything he wanted. And he is out for the week (today and
tomorrow are public holidays in his region).

Regards,
Halil