When the destroy CM ID timeout kicks in, you typically get a storm of
them, which floods the log. Hence, change pr_err() to
pr_err_ratelimited() in cm_destroy_id_wait_timeout().
Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
drivers/infiniband/core/cm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 92678e438ff4d..01bede8ba1055 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
 	struct cm_id_private *cm_id_priv;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
-	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
-	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
+	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
+			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
 }
 
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
--
2.43.5
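
For readers unfamiliar with the _ratelimited printk variants, the effect of the change can be sketched in userspace. This is a simplified model of interval/burst rate limiting, not the kernel's implementation: struct ratelimit, ratelimit_ok() and simulate_storm() are hypothetical names made up for illustration, and time is passed in as an abstract tick so the behaviour is deterministic. The numbers used in the usage note (a 5-tick window, 10 messages per window) mirror the kernel's default of 10 messages per 5 seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of interval/burst rate limiting. */
struct ratelimit {
	long interval;	/* window length, in abstract ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
};

/* Return true if a message may be emitted at tick 'now'. */
bool ratelimit_ok(struct ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst > 0);

	/* Window expired: start a new one and reset the counters. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}

/*
 * Model a storm of 'n' timeout messages that all arrive at tick 'now'.
 * Returns how many of them were actually printed.
 */
int simulate_storm(struct ratelimit *rl, int n, long now)
{
	int emitted = 0;

	for (int i = 0; i < n; i++) {
		if (ratelimit_ok(rl, now)) {
			printf("cm_id #%d timed out\n", i);
			emitted++;
		}
	}
	return emitted;
}
```

With interval = 5 and burst = 10, a storm of 1000 messages inside one window prints only the first 10 and counts the other 990 as missed; once the window has elapsed, messages are printed again. The kernel helper additionally reports how many messages were suppressed, which this sketch leaves out.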
On Fri, Sep 12, 2025 at 12:05:20PM +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Did you figure out why you were getting these? IIRC it signals a ULP
bug and is not expected.

Jason
Does this happen when there is a missing send completion?

Asking because I remember triggering this if a device encounters an
unrecoverable error/VF reset while under heavy RDMA-CM activity (like a
large scale MPI wire-up). I assumed it was because RDMA-CM was waiting
for TX completions that would never arrive.

Of course, the unrecoverable error/VF reset without generating flush
completions was the real bug in my case.

- Jake
On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
>
>

Applied, thanks!

[1/1] RDMA/cm: Rate limit destroy CM ID timeout error message
      https://git.kernel.org/rdma/rdma/c/2bbe1255fcf19c

Best regards,
--
Leon Romanovsky <leon@kernel.org>
> On 15 Sep 2025, at 09:43, Leon Romanovsky <leon@kernel.org> wrote:
>
>
> On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
>> When the destroy CM ID timeout kicks in, you typically get a storm of
>> them which creates a log flooding. Hence, change pr_err() to
>> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
>>
>>
>
> Applied, thanks!

Thanks for the quick turnaround, Leon and Zhu Yanjun!

Thxs, Håkon
On 9/12/25 3:05 AM, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
>
> Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  drivers/infiniband/core/cm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 92678e438ff4d..01bede8ba1055 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
>  	struct cm_id_private *cm_id_priv;
>
>  	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
> -	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> -	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
> +	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> +			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));

When many CMs time out, this pr_err can generate excessive noise. Using
the _ratelimited variant will help alleviate the problem.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>  }
>
>  static void cm_destroy_id(struct ib_cm_id *cm_id, int err)