[PATCH net-next] net/smc: cap allocation order for SMC-R physically contiguous buffers

D. Wythe posted 1 patch 3 weeks, 5 days ago
There is a newer version of this series
net/smc/smc_core.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)
[PATCH net-next] net/smc: cap allocation order for SMC-R physically contiguous buffers
Posted by D. Wythe 3 weeks, 5 days ago
The alloc_page() cannot satisfy requests exceeding MAX_PAGE_ORDER,
and attempting such allocations will lead to guaranteed failures
and potential kernel warnings.

For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER.
This ensures the attempts to allocate the largest possible physically
contiguous chunk succeed, instead of failing with an invalid order.
This also avoids redundant "try-fail-degrade" cycles in
__smc_buf_create().

For SMCR_MIXED_BUFS, if its order exceeds MAX_PAGE_ORDER, skip the
doomed physical allocation attempt and fallback to virtual memory
immediately.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc_core.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index e2d083daeb7e..a18730edb7e0 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2314,6 +2314,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
 	return rc;
 }
 
+/*
+ * smcr_new_buf_create may allocate a buffer smaller than the requested
+ * bufsize. Use buf_desc->len to determine the actual allocated size.
+ */
 static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 						int bufsize)
 {
@@ -2326,18 +2330,22 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 
 	switch (lgr->buf_type) {
 	case SMCR_PHYS_CONT_BUFS:
+		bufsize = min(bufsize, (int)PAGE_SIZE << MAX_PAGE_ORDER);
+		fallthrough;
 	case SMCR_MIXED_BUFS:
 		buf_desc->order = get_order(bufsize);
-		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-					      __GFP_NOMEMALLOC | __GFP_COMP |
-					      __GFP_NORETRY | __GFP_ZERO,
-					      buf_desc->order);
-		if (buf_desc->pages) {
-			buf_desc->cpu_addr =
-				(void *)page_address(buf_desc->pages);
-			buf_desc->len = bufsize;
-			buf_desc->is_vm = false;
-			break;
+		if (buf_desc->order <= MAX_PAGE_ORDER) {
+			buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+						      __GFP_NOMEMALLOC | __GFP_COMP |
+						      __GFP_NORETRY | __GFP_ZERO,
+						      buf_desc->order);
+			if (buf_desc->pages) {
+				buf_desc->cpu_addr =
+					(void *)page_address(buf_desc->pages);
+				buf_desc->len = bufsize;
+				buf_desc->is_vm = false;
+				break;
+			}
 		}
 		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
 			goto out;
@@ -2476,7 +2484,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 		}
 
 		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
-		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, bufsize);
+		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, buf_desc->len);
 		buf_desc->used = 1;
 		down_write(lock);
 		smc_lgr_buf_list_add(lgr, is_rmb, buf_list, buf_desc);
-- 
2.45.0
Re: [net-next] net/smc: cap allocation order for SMC-R physically contiguous buffers
Posted by Jakub Kicinski 3 weeks ago
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net/smc: cap allocation order for SMC-R physically contiguous buffers

The alloc_page() cannot satisfy requests exceeding MAX_PAGE_ORDER, and
attempting such allocations will lead to guaranteed failures and potential
kernel warnings. For SMCR_PHYS_CONT_BUFS, cap the allocation order to
MAX_PAGE_ORDER. This ensures the attempts to allocate the largest possible
physically contiguous chunk succeed. For SMCR_MIXED_BUFS, if its order
exceeds MAX_PAGE_ORDER, skip the physical allocation attempt and fallback to
virtual memory immediately.

> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index e2d083daeb7e..a18730edb7e0 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -2314,6 +2314,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
>  	return rc;
>  }
>
> +/*
> + * smcr_new_buf_create may allocate a buffer smaller than the requested
> + * bufsize. Use buf_desc->len to determine the actual allocated size.
> + */
>  static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
>  						int bufsize)
>  {
> @@ -2326,18 +2330,22 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
>
>  	switch (lgr->buf_type) {
>  	case SMCR_PHYS_CONT_BUFS:
> +		bufsize = min(bufsize, (int)PAGE_SIZE << MAX_PAGE_ORDER);
> +		fallthrough;
>  	case SMCR_MIXED_BUFS:
>  		buf_desc->order = get_order(bufsize);
> -		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> -					      __GFP_NOMEMALLOC | __GFP_COMP |
> -					      __GFP_NORETRY | __GFP_ZERO,
> -					      buf_desc->order);
> -		if (buf_desc->pages) {
> -			buf_desc->cpu_addr =
> -				(void *)page_address(buf_desc->pages);
> -			buf_desc->len = bufsize;
> -			buf_desc->is_vm = false;
> -			break;
> +		if (buf_desc->order <= MAX_PAGE_ORDER) {
> +			buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> +						      __GFP_NOMEMALLOC | __GFP_COMP |
> +						      __GFP_NORETRY | __GFP_ZERO,
> +						      buf_desc->order);
> +			if (buf_desc->pages) {
> +				buf_desc->cpu_addr =
> +					(void *)page_address(buf_desc->pages);
> +				buf_desc->len = bufsize;
> +				buf_desc->is_vm = false;
> +				break;
> +			}
>  		}
>  		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
>  			goto out;

[ ... ]

> @@ -2476,7 +2484,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  		}
>
>  		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
> -		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, bufsize);
> +		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, buf_desc->len);
>  		buf_desc->used = 1;
>  		down_write(lock);
>  		smc_lgr_buf_list_add(lgr, is_rmb, buf_list, buf_desc);
>  		up_write(lock);
>  		break; /* found */
>  	}
>
>  	if (IS_ERR(buf_desc))
>  		return PTR_ERR(buf_desc);
>
>  	if (!is_smcd) {
>  		if (smcr_buf_map_usable_links(lgr, buf_desc, is_rmb)) {
>  			smcr_buf_unuse(buf_desc, is_rmb, lgr);
>  			return -ENOMEM;
>  		}
>  	}
>
>  	if (is_rmb) {
>  		conn->rmb_desc = buf_desc;
>  		conn->rmbe_size_comp = bufsize_comp;
>  		smc->sk.sk_rcvbuf = bufsize * 2;
>  		atomic_set(&conn->bytes_to_rcv, 0);
>  		conn->rmbe_update_limit =
>  			smc_rmb_wnd_update_limit(buf_desc->len);
>  		if (is_smcd)
>  			smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
>  	} else {
>  		conn->sndbuf_desc = buf_desc;
>  		smc->sk.sk_sndbuf = bufsize * 2;
>  		atomic_set(&conn->sndbuf_space, bufsize);
                                                 ^^^^^^^

The comment added at smcr_new_buf_create() says "Use buf_desc->len to
determine the actual allocated size," but __smc_buf_create() still uses the
uncapped bufsize for several critical fields. Is this correct?

For SMCR_PHYS_CONT_BUFS, smcr_new_buf_create() now caps bufsize internally
to PAGE_SIZE << MAX_PAGE_ORDER (typically 4MB), so buf_desc->len can be
smaller than the requested bufsize.

However, __smc_buf_create() continues using the original uncapped bufsize
for:

    if (is_rmb) {
        conn->rmbe_size_comp = bufsize_comp;
        smc->sk.sk_rcvbuf = bufsize * 2;
    } else {
        smc->sk.sk_sndbuf = bufsize * 2;
        atomic_set(&conn->sndbuf_space, bufsize);
    }

The sndbuf_space mismatch is particularly concerning. Looking at
smc_tx.c:274, the comment states: "0 <= sndbuf_space <= sndbuf_desc->len".
When sndbuf_space exceeds sndbuf_desc->len, smc_tx_sendmsg() could calculate
writespace incorrectly:

    int writespace = atomic_read(&conn->sndbuf_space);

and the two-chunk copy loop might write beyond the actual buffer boundary.
This is reachable with default sysctl settings (SMCR_PHYS_CONT_BUFS) and
socket buffer sizes exceeding 4MB.

Should these assignments use buf_desc->len instead of bufsize? For example:

    atomic_set(&conn->sndbuf_space, buf_desc->len);
    smc->sk.sk_sndbuf = buf_desc->len * 2;
    smc->sk.sk_rcvbuf = buf_desc->len * 2;

And should rmbe_size_comp be recalculated from buf_desc->len? The
rmbe_size_comp value is sent to the peer via CLC messages in smc_clc.c:1125,
and an inflated value might cause the peer to believe the RMB is larger than
it actually is.

>  	}
>  	return 0;
>  }
-- 
pw-bot: cr
Re: [net-next] net/smc: cap allocation order for SMC-R physically contiguous buffers
Posted by D. Wythe 4 days, 10 hours ago
On Mon, Mar 16, 2026 at 07:45:34PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net/smc: cap allocation order for SMC-R physically contiguous buffers
> 
> The alloc_page() cannot satisfy requests exceeding MAX_PAGE_ORDER, and
> attempting such allocations will lead to guaranteed failures and potential
> kernel warnings. For SMCR_PHYS_CONT_BUFS, cap the allocation order to
> MAX_PAGE_ORDER. This ensures the attempts to allocate the largest possible
> physically contiguous chunk succeed. For SMCR_MIXED_BUFS, if its order
> exceeds MAX_PAGE_ORDER, skip the physical allocation attempt and fallback to
> virtual memory immediately.
> 
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index e2d083daeb7e..a18730edb7e0 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -2314,6 +2314,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
> >  	return rc;
> >  }
> >
> > +/*
> > + * smcr_new_buf_create may allocate a buffer smaller than the requested
> > + * bufsize. Use buf_desc->len to determine the actual allocated size.
> > + */
> >  static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
> >  						int bufsize)
> >  {
> > @@ -2326,18 +2330,22 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
> >
> >  	switch (lgr->buf_type) {
> >  	case SMCR_PHYS_CONT_BUFS:
> > +		bufsize = min(bufsize, (int)PAGE_SIZE << MAX_PAGE_ORDER);
> > +		fallthrough;
> >  	case SMCR_MIXED_BUFS:
> >  		buf_desc->order = get_order(bufsize);
> > -		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> > -					      __GFP_NOMEMALLOC | __GFP_COMP |
> > -					      __GFP_NORETRY | __GFP_ZERO,
> > -					      buf_desc->order);
> > -		if (buf_desc->pages) {
> > -			buf_desc->cpu_addr =
> > -				(void *)page_address(buf_desc->pages);
> > -			buf_desc->len = bufsize;
> > -			buf_desc->is_vm = false;
> > -			break;
> > +		if (buf_desc->order <= MAX_PAGE_ORDER) {
> > +			buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> > +						      __GFP_NOMEMALLOC | __GFP_COMP |
> > +						      __GFP_NORETRY | __GFP_ZERO,
> > +						      buf_desc->order);
> > +			if (buf_desc->pages) {
> > +				buf_desc->cpu_addr =
> > +					(void *)page_address(buf_desc->pages);
> > +				buf_desc->len = bufsize;
> > +				buf_desc->is_vm = false;
> > +				break;
> > +			}
> >  		}
> >  		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
> >  			goto out;
> 
> [ ... ]
> 
> > @@ -2476,7 +2484,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >  		}
> >
> >  		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
> > -		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, bufsize);
> > +		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, true, buf_desc->len);
> >  		buf_desc->used = 1;
> >  		down_write(lock);
> >  		smc_lgr_buf_list_add(lgr, is_rmb, buf_list, buf_desc);
> >  		up_write(lock);
> >  		break; /* found */
> >  	}
> >
> >  	if (IS_ERR(buf_desc))
> >  		return PTR_ERR(buf_desc);
> >
> >  	if (!is_smcd) {
> >  		if (smcr_buf_map_usable_links(lgr, buf_desc, is_rmb)) {
> >  			smcr_buf_unuse(buf_desc, is_rmb, lgr);
> >  			return -ENOMEM;
> >  		}
> >  	}
> >
> >  	if (is_rmb) {
> >  		conn->rmb_desc = buf_desc;
> >  		conn->rmbe_size_comp = bufsize_comp;
> >  		smc->sk.sk_rcvbuf = bufsize * 2;
> >  		atomic_set(&conn->bytes_to_rcv, 0);
> >  		conn->rmbe_update_limit =
> >  			smc_rmb_wnd_update_limit(buf_desc->len);
> >  		if (is_smcd)
> >  			smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
> >  	} else {
> >  		conn->sndbuf_desc = buf_desc;
> >  		smc->sk.sk_sndbuf = bufsize * 2;
> >  		atomic_set(&conn->sndbuf_space, bufsize);
>                                                  ^^^^^^^
> 
> The comment added at smcr_new_buf_create() says "Use buf_desc->len to
> determine the actual allocated size," but __smc_buf_create() still uses the
> uncapped bufsize for several critical fields. Is this correct?
> 
> For SMCR_PHYS_CONT_BUFS, smcr_new_buf_create() now caps bufsize internally
> to PAGE_SIZE << MAX_PAGE_ORDER (typically 4MB), so buf_desc->len can be
> smaller than the requested bufsize.
> 
> However, __smc_buf_create() continues using the original uncapped bufsize
> for:
> 
>     if (is_rmb) {
>         conn->rmbe_size_comp = bufsize_comp;
>         smc->sk.sk_rcvbuf = bufsize * 2;
>     } else {
>         smc->sk.sk_sndbuf = bufsize * 2;
>         atomic_set(&conn->sndbuf_space, bufsize);
>     }
> 
> The sndbuf_space mismatch is particularly concerning. Looking at
> smc_tx.c:274, the comment states: "0 <= sndbuf_space <= sndbuf_desc->len".
> When sndbuf_space exceeds sndbuf_desc->len, smc_tx_sendmsg() could calculate
> writespace incorrectly:
> 
>     int writespace = atomic_read(&conn->sndbuf_space);
> 
> and the two-chunk copy loop might write beyond the actual buffer boundary.
> This is reachable with default sysctl settings (SMCR_PHYS_CONT_BUFS) and
> socket buffer sizes exceeding 4MB.
> 
> Should these assignments use buf_desc->len instead of bufsize? For example:
> 
>     atomic_set(&conn->sndbuf_space, buf_desc->len);
>     smc->sk.sk_sndbuf = buf_desc->len * 2;
>     smc->sk.sk_rcvbuf = buf_desc->len * 2;
> 
> And should rmbe_size_comp be recalculated from buf_desc->len? The
> rmbe_size_comp value is sent to the peer via CLC messages in smc_clc.c:1125,
> and an inflated value might cause the peer to believe the RMB is larger than
> it actually is.
> 

Thanks for catching this.

After rethinking this, I think we should not change the semantic of
smcr_new_buf_create() to return a potentially smaller buffer than
requested. The MAX_PAGE_ORDER limit is better handled in __smc_buf_create(),
where the effective buffer size is already known and used to initialize
socket/accounting state.

I'll update the patch accordingly.

> >  	}
> >  	return 0;
> >  }
> -- 
> pw-bot: cr