[PATCH] RDMA/siw: work around clang stack size warning

Arnd Bergmann posted 1 patch 3 months, 2 weeks ago
drivers/infiniband/sw/siw/siw_qp_tx.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
[PATCH] RDMA/siw: work around clang stack size warning
Posted by Arnd Bergmann 3 months, 2 weeks ago
From: Arnd Bergmann <arnd@arndb.de>

clang inlines a lot of functions into siw_qp_sq_process(), with the
aggregate stack frame blowing the warning limit in some configurations:

drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than]

The real problem here is the array of kvec structures in siw_tx_hdt that
makes up the majority of the consumed stack space.

Ideally there would be a way to avoid allocating the array on the
stack, but that would require a larger rework. Add a noinline_for_stack
annotation to avoid the warning for now, and make clang behave the same
way as gcc here. The combined stack usage is still similar, but is spread
over multiple functions now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 6432bce7d083..3a08f57d2211 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -277,6 +277,15 @@ static int siw_qp_prepare_tx(struct siw_iwarp_tx *c_tx)
 	return PKT_FRAGMENTED;
 }
 
+static noinline_for_stack int
+siw_sendmsg(struct socket *sock, unsigned int msg_flags,
+	    struct kvec *vec, size_t num, size_t len)
+{
+	struct msghdr msg = { .msg_flags = msg_flags };
+
+	return kernel_sendmsg(sock, &msg, vec, num, len);
+}
+
 /*
  * Send out one complete control type FPDU, or header of FPDU carrying
  * data. Used for fixed sized packets like Read.Requests or zero length
@@ -285,12 +294,11 @@ static int siw_qp_prepare_tx(struct siw_iwarp_tx *c_tx)
 static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 			      int flags)
 {
-	struct msghdr msg = { .msg_flags = flags };
 	struct kvec iov = { .iov_base =
 				    (char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent,
 			    .iov_len = c_tx->ctrl_len - c_tx->ctrl_sent };
 
-	int rv = kernel_sendmsg(s, &msg, &iov, 1, iov.iov_len);
+	int rv = siw_sendmsg(s, flags, &iov, 1, iov.iov_len);
 
 	if (rv >= 0) {
 		c_tx->ctrl_sent += rv;
@@ -427,13 +435,13 @@ static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
  * Write out iov referencing hdr, data and trailer of current FPDU.
  * Update transmit state dependent on write return status
  */
-static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
+static noinline_for_stack int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
+					 struct socket *s)
 {
 	struct siw_wqe *wqe = &c_tx->wqe_active;
 	struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
 	struct kvec iov[MAX_ARRAY];
 	struct page *page_array[MAX_ARRAY];
-	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
 
 	int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
 	unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
@@ -586,14 +594,16 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 		rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
 				  c_tx->sge_off, data_len);
 		if (rv == data_len) {
-			rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
+
+			rv = siw_sendmsg(s, MSG_DONTWAIT | MSG_EOR, &iov[seg],
+					 1, trl_len);
 			if (rv > 0)
 				rv += data_len;
 			else
 				rv = data_len;
 		}
 	} else {
-		rv = kernel_sendmsg(s, &msg, iov, seg + 1,
+		rv = siw_sendmsg(s, MSG_DONTWAIT | MSG_EOR, iov, seg + 1,
 				    hdr_len + data_len + trl_len);
 		siw_unmap_pages(iov, kmap_mask, seg);
 	}
-- 
2.39.5
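
For readers outside the kernel tree, the pattern the patch applies can be sketched in user-space C. This is only an illustration under stated assumptions: `demo_sendmsg()` and `demo_tx_hdt()` are hypothetical names, the local `noinline_for_stack` macro stands in for the kernel's annotation, and `sendmsg()` on a file descriptor stands in for `kernel_sendmsg()`. The point it shows is that the `struct msghdr` lives only in the helper's frame, so it does not add to the frame of the caller that already holds the iovec array.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumption: user-space stand-in for the kernel's noinline_for_stack. */
#define noinline_for_stack __attribute__((noinline))

/*
 * Hypothetical helper mirroring siw_sendmsg(): the msghdr is built in
 * this function's own stack frame and is gone again once it returns.
 */
static noinline_for_stack ssize_t
demo_sendmsg(int fd, int flags, struct iovec *vec, size_t num)
{
	struct msghdr msg = { .msg_iov = vec, .msg_iovlen = num };

	return sendmsg(fd, &msg, flags);
}

/*
 * Hypothetical caller mirroring siw_tx_hdt(): it keeps the iovec array
 * on its stack but no longer carries a msghdr of its own.
 */
static ssize_t demo_tx_hdt(int fd, const char *hdr, const char *data)
{
	struct iovec iov[2] = {
		{ .iov_base = (void *)hdr, .iov_len = strlen(hdr) },
		{ .iov_base = (void *)data, .iov_len = strlen(data) },
	};

	return demo_sendmsg(fd, MSG_DONTWAIT, iov, 2);
}
```

Calling `demo_tx_hdt()` on one end of a connected socket pair sends the header and payload as one gathered write. The total stack consumed along the call chain is roughly the same as before; it is only split across two frames, which matches the description in the commit message.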
Re: [PATCH] RDMA/siw: work around clang stack size warning
Posted by Leon Romanovsky 3 months, 2 weeks ago
On Fri, 20 Jun 2025 13:43:28 +0200, Arnd Bergmann wrote:
> clang inlines a lot of functions into siw_qp_sq_process(), with the
> aggregate stack frame blowing the warning limit in some configurations:
> 
> drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than]
> 
> The real problem here is the array of kvec structures in siw_tx_hdt that
> makes up the majority of the consumed stack space.
> 
> [...]

Applied, thanks!

[1/1] RDMA/siw: work around clang stack size warning
      https://git.kernel.org/rdma/rdma/c/842cf5a6e35656

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>
Re: [PATCH] RDMA/siw: work around clang stack size warning
Posted by Zhu Yanjun 3 months, 2 weeks ago
On 2025/6/20 4:43, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> clang inlines a lot of functions into siw_qp_sq_process(), with the
> aggregate stack frame blowing the warning limit in some configurations:
> 
> drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than]
> 
> The real problem here is the array of kvec structures in siw_tx_hdt that
> makes up the majority of the consumed stack space.

Because the array of kvec structures in siw_tx_hdt consumes the majority 
of the stack space, would it be possible to use kmalloc or a similar 
dynamic memory allocation function instead of allocating this memory on 
the stack?

Would using kmalloc (or an equivalent) also effectively resolve the 
stack usage issue?
Please note that I’m not questioning the value of this commit—I’m simply 
curious whether there might be an alternative solution to the problem.

Thanks,
Yanjun.Zhu

> 
> Ideally there would be a way to avoid allocating the array on the
> stack, but that would require a larger rework. Add a noinline_for_stack
> annotation to avoid the warning for now, and make clang behave the same
> way as gcc here. The combined stack usage is still similar, but is spread
> over multiple functions now.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> [...]

Re: [PATCH] RDMA/siw: work around clang stack size warning
Posted by Arnd Bergmann 3 months, 2 weeks ago
On Sat, Jun 21, 2025, at 06:12, Zhu Yanjun wrote:
> On 2025/6/20 4:43, Arnd Bergmann wrote:
>
> Because the array of kvec structures in siw_tx_hdt consumes the majority 
> of the stack space, would it be possible to use kmalloc or a similar 
> dynamic memory allocation function instead of allocating this memory on 
> the stack?
>
> Would using kmalloc (or an equivalent) also effectively resolve the 
> stack usage issue?

Yes, moving the allocation somewhere else (kmalloc, static variable,
per siw_sge, per siw_wqe) would avoid the high stack usage effectively.
It's a tradeoff, and I picked the solution that made the most sense
to me, but there is a good chance another alternative is better here.

The main differences are:

- kmalloc() adds runtime overhead that may be expensive in a
  fast path

- kmalloc() can fail, which adds complexity from error handling.
  Note that small allocations with GFP_KERNEL do not fail but instead
  wait for memory to become available

- If kmalloc() runs into a low-memory situation, it can go through
  writeback, which in turn can use more stack space than the
  on-stack allocation it was replacing

- static allocations bloat the kernel image and require locking that
  may be expensive

- per-object preallocations can be wasteful if a lot of objects
  are created, and can still require locking if the object is used
  from multiple threads
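
To make the kmalloc() tradeoff concrete, here is a minimal user-space sketch, not driver code. `demo_gather_heap()` and `DEMO_MAX_ARRAY` are hypothetical, and `calloc()`/`free()` stand in for `kmalloc()`/`kfree()`. It shows the two costs listed above: an allocation on every call, and a failure path that the on-stack version does not need.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Assumption: illustrative bound, not the driver's MAX_ARRAY value. */
#define DEMO_MAX_ARRAY 8

/*
 * Hypothetical variant that keeps the scatter array on the heap
 * instead of the stack. Returns 0 on success or a negative errno.
 */
static int demo_gather_heap(void *const *bufs, const size_t *lens,
			    size_t n, size_t *total)
{
	struct iovec *iov;
	size_t i;

	if (n > DEMO_MAX_ARRAY)
		return -EINVAL;

	/* Per-call allocation: the runtime overhead in the fast path. */
	iov = calloc(DEMO_MAX_ARRAY, sizeof(*iov));
	if (!iov)
		return -ENOMEM;	/* the extra error path */

	*total = 0;
	for (i = 0; i < n; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = lens[i];
		*total += lens[i];
	}

	/* A real transmit path would pass iov to the send call here. */

	free(iov);
	return 0;
}
```

Every caller now has to handle `-ENOMEM`, which is the added complexity mentioned in the second bullet.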

As I wrote, I mainly picked the 'noinline_for_stack' approach
here since that is how the code is known to work with gcc, so
there is little risk of my patch causing problems.

Moving both the kvec array and the page array into
the siw_wqe is likely better here, but I'm not familiar enough
with the driver to tell whether that is an overall improvement.
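
As a rough picture of what that per-object preallocation could look like, here is a user-space sketch. It makes no claim about the real siw_wqe layout: `struct demo_wqe`, `demo_wqe_fill()`, `demo_wqe_scratch_bytes()` and `DEMO_MAX_ARRAY` are all hypothetical, and `struct iovec` stands in for `struct kvec`.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Assumption: illustrative bound, not the driver's MAX_ARRAY value. */
#define DEMO_MAX_ARRAY 8

/* Hypothetical work queue element carrying its own scratch arrays. */
struct demo_wqe {
	int opcode;
	struct iovec iov[DEMO_MAX_ARRAY];	/* was on the stack */
	void *page_array[DEMO_MAX_ARRAY];	/* was on the stack */
};

/* Fill header and data segments in the embedded array, return total length. */
static size_t demo_wqe_fill(struct demo_wqe *wqe, void *hdr, size_t hdr_len,
			    void *data, size_t data_len)
{
	wqe->iov[0].iov_base = hdr;
	wqe->iov[0].iov_len = hdr_len;
	wqe->iov[1].iov_base = data;
	wqe->iov[1].iov_len = data_len;

	return hdr_len + data_len;
}

/* Bytes every object now carries, whether or not it is transmitting. */
static size_t demo_wqe_scratch_bytes(void)
{
	return sizeof(((struct demo_wqe *)0)->iov) +
	       sizeof(((struct demo_wqe *)0)->page_array);
}
```

The transmit function needs no large locals in this layout, but each object grows by the value `demo_wqe_scratch_bytes()` reports. That is the "wasteful if a lot of objects are created" cost, and concurrent users of one object would still need to serialize access to the arrays.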

A related change I would like to see is to remove the
kmap_local_page() in this driver and instead make it
depend on 64BIT or !CONFIG_HIGHMEM, to slowly chip away
at the code that is highmem aware throughout the kernel.
I'm not sure if that would also help drop the array
here.

     Arnd
Re: [PATCH] RDMA/siw: work around clang stack size warning
Posted by Zhu Yanjun 3 months, 2 weeks ago
On 2025/6/21 1:43, Arnd Bergmann wrote:
> On Sat, Jun 21, 2025, at 06:12, Zhu Yanjun wrote:
>> On 2025/6/20 4:43, Arnd Bergmann wrote:
>>
>> Because the array of kvec structures in siw_tx_hdt consumes the majority
>> of the stack space, would it be possible to use kmalloc or a similar
>> dynamic memory allocation function instead of allocating this memory on
>> the stack?
>>
>> Would using kmalloc (or an equivalent) also effectively resolve the
>> stack usage issue?
> Yes, moving the allocation somewhere else (kmalloc, static variable,
> per siw_sge, per siw_wqe) would avoid the high stack usage effectively.
> It's a tradeoff, and I picked the solution that made the most sense
> to me, but there is a good chance another alternative is better here.
>
> The main differences are:
>
> - kmalloc() adds runtime overhead that may be expensive in a
>    fast path
>
> - kmalloc() can fail, which adds complexity from error handling.
>    Note that small allocations with GFP_KERNEL do not fail but instead
>    wait for memory to become available
>
> - If kmalloc() runs into a low-memory situation, it can go through
>    writeback, which in turn can use more stack space than the
>    on-stack allocation it was replacing
>
> - static allocations bloat the kernel image and require locking that
>    may be expensive
>
> - per-object preallocations can be wasteful if a lot of objects
>    are created, and can still require locking if the object is used
>    from multiple threads
>
> As I wrote, I mainly picked the 'noinline_for_stack' approach
> here since that is how the code is known to work with gcc, so
> there is little risk of my patch causing problems.
>
> Moving both the kvec array and the page array into
> the siw_wqe is likely better here, but I'm not familiar enough
> with the driver to tell whether that is an overall improvement.

Thank you very much. There are several possible solutions to this issue, 
and the appropriate one depends on the specific scenario. For the 
problem in siw, the noinline_for_stack approach has been selected. In my 
opinion, this appears to be more of a workaround than a true fix. While 
it does mitigate the issue, the underlying problem in siw still remains.

That said, now that we have a clearer understanding of the problem and 
its root cause through discussions and extended analysis, a more robust 
and long-term solution should eventually be proposed.

Thanks,

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>
> A related change I would like to see is to remove the
> kmap_local_page() in this driver and instead make it
> depend on 64BIT or !CONFIG_HIGHMEM, to slowly chip away
> at the code that is highmem aware throughout the kernel.
> I'm not sure if that would also help drop the array
> here.
>
>       Arnd

-- 
Best Regards,
Yanjun.Zhu