[PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message

Håkon Bugge posted 1 patch 2 weeks, 6 days ago
[PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by Håkon Bugge 2 weeks, 6 days ago
When the destroy CM ID timeout kicks in, you typically get a storm of
them which creates a log flooding. Hence, change pr_err() to
pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 drivers/infiniband/core/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 92678e438ff4d..01bede8ba1055 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
 	struct cm_id_private *cm_id_priv;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
-	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
-	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
+	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
+			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
 }
 
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
-- 
2.43.5
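
For readers unfamiliar with the _ratelimited printk variants, here is a
minimal userspace sketch of the interval/burst idea behind them. It is not
the kernel implementation: struct rl_state, rl_allow() and rl_storm() are
hypothetical names made up for illustration, and time is modelled as a plain
tick counter. As far as I know, pr_err_ratelimited() keeps one static
ratelimit state per call site and uses the kernel defaults of a 5 second
interval (DEFAULT_RATELIMIT_INTERVAL) and a burst of 10
(DEFAULT_RATELIMIT_BURST); the sketch assumes the same shape.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; not the kernel's struct ratelimit_state. */
struct rl_state {
	long interval;	/* window length, in arbitrary ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
	bool started;	/* false until the first event opens a window */
};

/* Return true if a message may be emitted at tick 'now'. */
static bool rl_allow(struct rl_state *rs, long now)
{
	/* Open a new window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}

/* Feed 'count' back-to-back events at tick 'now'; return how many pass. */
static int rl_storm(struct rl_state *rs, long now, int count)
{
	int passed = 0;

	for (int i = 0; i < count; i++) {
		if (rl_allow(rs, now))
			passed++;
	}
	return passed;
}
```

With interval = 5 and burst = 10, a storm of 1000 timeouts in the same tick
lets 10 messages through and suppresses the rest until the window expires,
which is the effect the patch is after for cm_destroy_id_wait_timeout().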

Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by Jason Gunthorpe 2 weeks, 2 days ago
On Fri, Sep 12, 2025 at 12:05:20PM +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Did you figure out why you were getting these? IIRC it signals a ULP
bug and is not expected.

Jason
Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by Jacob Moroni 2 weeks, 2 days ago
Does this happen when there is a missing send completion?

Asking because I remember triggering this if a device encounters an
unrecoverable error/VF reset while under heavy RDMA-CM activity (like a
large scale MPI wire-up).

I assumed it was because RDMA-CM was waiting for TX completions that
would never arrive.

Of course, the unrecoverable error/VF reset without generating flush
completions was the real bug in my case.

- Jake
Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by Leon Romanovsky 2 weeks, 3 days ago
On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> 

Applied, thanks!

[1/1] RDMA/cm: Rate limit destroy CM ID timeout error message
      https://git.kernel.org/rdma/rdma/c/2bbe1255fcf19c

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>

Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by Haakon Bugge 2 weeks, 3 days ago

> On 15 Sep 2025, at 09:43, Leon Romanovsky <leon@kernel.org> wrote:
> 
> 
> On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
>> When the destroy CM ID timeout kicks in, you typically get a storm of
>> them which creates a log flooding. Hence, change pr_err() to
>> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
>> 
>> 
> 
> Applied, thanks!

Thanks for the quick turnaround, Leon and Zhu Yanjun!


Thxs, Håkon

Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Posted by yanjun.zhu 2 weeks, 5 days ago
On 9/12/25 3:05 AM, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>   drivers/infiniband/core/cm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 92678e438ff4d..01bede8ba1055 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
>   	struct cm_id_private *cm_id_priv;
>   
>   	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
> -	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> -	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
> +	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> +			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));

When many CM IDs time out, this pr_err() can generate excessive noise.
Using the _ratelimited variant will help alleviate the problem.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>   }
>   
>   static void cm_destroy_id(struct ib_cm_id *cm_id, int err)