public inbox for linux-rdma@vger.kernel.org
* [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
@ 2025-09-12 10:05 Håkon Bugge
  2025-09-12 19:27 ` yanjun.zhu
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Håkon Bugge @ 2025-09-12 10:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Håkon Bugge, Jacob Moroni, Manjunath Patil
  Cc: linux-rdma, linux-kernel

When the destroy CM ID timeout kicks in, you typically get a storm of
them which creates a log flooding. Hence, change pr_err() to
pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 drivers/infiniband/core/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 92678e438ff4d..01bede8ba1055 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
 	struct cm_id_private *cm_id_priv;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
-	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
-	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
+	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
+			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
 }
 
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-12 10:05 [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message Håkon Bugge
@ 2025-09-12 19:27 ` yanjun.zhu
  2025-09-15  7:43 ` Leon Romanovsky
  2025-09-16 14:18 ` Jason Gunthorpe
  2 siblings, 0 replies; 17+ messages in thread
From: yanjun.zhu @ 2025-09-12 19:27 UTC (permalink / raw)
  To: Håkon Bugge, Jason Gunthorpe, Leon Romanovsky, Sean Hefty,
	Vlad Dumitrescu, Or Har-Toov, Jacob Moroni, Manjunath Patil
  Cc: linux-rdma, linux-kernel

On 9/12/25 3:05 AM, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>   drivers/infiniband/core/cm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 92678e438ff4d..01bede8ba1055 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
>   	struct cm_id_private *cm_id_priv;
>   
>   	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
> -	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> -	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
> +	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> +			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));

When many CMs time out, this pr_err can generate excessive noise. Using 
the _ratelimited variant will help alleviate the problem.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>   }
>   
>   static void cm_destroy_id(struct ib_cm_id *cm_id, int err)



* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-12 10:05 [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message Håkon Bugge
  2025-09-12 19:27 ` yanjun.zhu
@ 2025-09-15  7:43 ` Leon Romanovsky
  2025-09-15  9:44   ` Haakon Bugge
  2025-09-16 14:18 ` Jason Gunthorpe
  2 siblings, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2025-09-15  7:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, Håkon Bugge
  Cc: linux-rdma, linux-kernel


On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> 

Applied, thanks!

[1/1] RDMA/cm: Rate limit destroy CM ID timeout error message
      https://git.kernel.org/rdma/rdma/c/2bbe1255fcf19c

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>



* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-15  7:43 ` Leon Romanovsky
@ 2025-09-15  9:44   ` Haakon Bugge
  0 siblings, 0 replies; 17+ messages in thread
From: Haakon Bugge @ 2025-09-15  9:44 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org



> On 15 Sep 2025, at 09:43, Leon Romanovsky <leon@kernel.org> wrote:
> 
> 
> On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
>> When the destroy CM ID timeout kicks in, you typically get a storm of
>> them which creates a log flooding. Hence, change pr_err() to
>> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
>> 
>> 
> 
> Applied, thanks!

Thanks for the quick turnaround, Leon and Zhu Yanjun!


Thxs, Håkon



* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-12 10:05 [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message Håkon Bugge
  2025-09-12 19:27 ` yanjun.zhu
  2025-09-15  7:43 ` Leon Romanovsky
@ 2025-09-16 14:18 ` Jason Gunthorpe
  2025-09-16 14:36   ` Jacob Moroni
  2 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-09-16 14:18 UTC (permalink / raw)
  To: Håkon Bugge
  Cc: Leon Romanovsky, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, linux-rdma, linux-kernel

On Fri, Sep 12, 2025 at 12:05:20PM +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Did you figure out why you were getting these? IIRC it signals a ULP
bug and is not expected.

Jason


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-16 14:18 ` Jason Gunthorpe
@ 2025-09-16 14:36   ` Jacob Moroni
  2025-09-25 11:29     ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Jacob Moroni @ 2025-09-16 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Håkon Bugge, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, linux-rdma, linux-kernel

Does this happen when there is a missing send completion?

Asking because I remember triggering this if a device encounters an
unrecoverable error/VF reset while under heavy RDMA-CM activity (like
a large-scale MPI wire-up).

I assumed it was because RDMA-CM was waiting for TX completions that
would never arrive.

Of course, the unrecoverable error/VF reset without generating flush
completions was the real bug in my case.

- Jake


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-16 14:36   ` Jacob Moroni
@ 2025-09-25 11:29     ` Haakon Bugge
  2025-10-13 14:04       ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Haakon Bugge @ 2025-09-25 11:29 UTC (permalink / raw)
  To: Jacob Moroni
  Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,


> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@google.com> wrote:
> 
> Does this happen when there is a missing send completion?
> 
> Asking because I remember triggering this if a device encounters an
> unrecoverable
> error/VF reset while under heavy RDMA-CM activity (like a large scale
> MPI wire-up).
> 
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
> 
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real
> bug in my case.


I concur. I looked at the logs from before the first incident, but didn't see any obscure mlx5 driver messages. Looking in between, though, I saw:

kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c

Not in close proximity in time, but a six-digit number of messages was suppressed due to the flooding.

My take is that the timeouts should be monotonically increasing from the driver up to RDMA_CM (and on to the ULPs). They are not, as the mlx5e_build_nic_netdev() function sets the netdev's watchdog_timeo to 15 seconds, whereas the timeout leading to cm_destroy_id_wait_timeout() is 10 seconds.

So, the mitigation triggered by netdev detecting a TX timeout has not yet kicked in by the time cm_destroy_id_wait_timeout() is called.


Thxs, Håkon








* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-09-25 11:29     ` Haakon Bugge
@ 2025-10-13 14:04       ` Haakon Bugge
  2025-10-15 11:38         ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Haakon Bugge @ 2025-10-13 14:04 UTC (permalink / raw)
  To: Jacob Moroni
  Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,

> On 25 Sep 2025, at 13:29, Haakon Bugge <haakon.bugge@oracle.com> wrote:
> 
> [snip]

Got another incident this weekend. The first cm_destroy_id_wait_timeout() message appeared at:

Oct 10 12:10:31 lab64 kernel: cm_destroy_id_wait_timeout: cm_id=00000000803feac1 timed out. state 2 -> 0, refcnt=2
[]

Health error in close proximity in time:

# devlink health show ens4f14 reporter tx
auxiliary/mlx5_core.eth.8/524288:
  reporter tx
    state healthy error 256 recover 256 last_dump_date 2025-10-10 last_dump_time 12:08:47 grace_period 500 auto_recover true auto_dump true

And also ndev timeout in close proximity in time:

# fgrep "TX timeout on queue" /var/log/messages-20251012 
Oct 10 12:11:16 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 5, SQ: 0xfb43, CQ: 0x66b, SQ Cons: 0x0 SQ Prod: 0x4, usecs since last trans: 20992000
Oct 10 13:01:05 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 3, SQ: 0x1088a, CQ: 0xcae, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 16390000
[]

4140 cm_ids in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4140    4140   95220

By running ib_send_bw using the suspected bad device:

[1]-  Running                 ib_send_bw -R -d mlx5_7 -i 2 &
[2]+  Running                 ib_send_bw -R -d mlx5_7 -i 1 192.168.64.15 &

Said processes are hanging and we got one more cm_id in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4141    4141   95243

Stacks:

# cat /proc/435477/stack 
[<0>] ucma_get_event+0xf0/0x240 [rdma_ucm]
[<0>] ucma_write+0xf8/0x170 [rdma_ucm]
[<0>] vfs_write+0xd4/0x294
[<0>] ksys_write+0xbb/0xe8
[<0>] do_syscall_64+0x35/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0

# cat /proc/436606/stack 
[<0>] cm_destroy_id+0x16a/0x61c [ib_cm]
[<0>] _destroy_id+0x5a/0x230 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x354/0x370 [rdma_ucm]
[<0>] ucma_close+0x7a/0xb0 [rdma_ucm]
[<0>] __fput+0x96/0x24e
[<0>] task_work_run+0x65/0x9b
[<0>] do_exit+0x215/0x489
[<0>] do_group_exit+0x33/0x96
[<0>] get_signal+0x14a/0x967
[<0>] arch_do_signal_or_restart+0x10d/0x12a
[<0>] exit_to_user_mode_loop+0xc4/0x1b1
[<0>] exit_to_user_mode_prepare+0x124/0x12c
[<0>] syscall_exit_to_user_mode+0x18/0x42
[<0>] do_syscall_64+0x42/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0

Distribution of old_state:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ ++state[$7]; } END {for (s in state) printf("%2d: %d\n", s, state[s]); }'
 0: 192
 2: 10546
 6: 102
11: 513
13: 148



My take is that the VF in question here gets whacked and that the MAD timeout handling does not resonate well with how CMA handles them.


Thxs, Håkon




* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-13 14:04       ` Haakon Bugge
@ 2025-10-15 11:38         ` Haakon Bugge
  2025-10-15 16:49           ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Haakon Bugge @ 2025-10-15 11:38 UTC (permalink / raw)
  To: Jacob Moroni
  Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,

> On 13 Oct 2025, at 16:04, Haakon Bugge <haakon.bugge@oracle.com> wrote:

[snip]

> My take is that the VF in question here gets whacked and that the MAD timeout handling does not resonate well with how CMA handles them.

I was able to simulate a whacked VF by setting the CMA max retries to one and, once in a while, skipping the sending of the MAD:
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9b471548e7ae1..43eff54151830 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -45,7 +45,7 @@ MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
 #define CMA_CM_RESPONSE_TIMEOUT 20
-#define CMA_MAX_CM_RETRIES 15
+#define CMA_MAX_CM_RETRIES 1
 #define CMA_IBOE_PACKET_LIFETIME 16
 #define CMA_PREFERRED_ROCE_GID_TYPE IB_GID_TYPE_ROCE_UDP_ENCAP
 
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 21c8669dd1371..9c19333a507d8 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1057,9 +1057,13 @@ int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 
 	spin_lock_irqsave(&qp_info->send_queue.lock, flags);
 	if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
-		trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
-		ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
-				   NULL);
+		if (!(jiffies % 10000)) {
+			pr_err("Skipping ib_post_send\n");
+		} else {
+			trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
+			ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
+					   NULL);
+		}
 		list = &qp_info->send_queue.list;
 	} else {
 		ret = 0;


With this hack, running cmtime with 10,000 connections in loopback, the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out. state 6 -> 0, refcnt=1" messages are indeed produced. I had to kill cmtime because it was hanging, and it then went defunct with the following stack:

# cat /proc/7977/task/7978/stack 
[<0>] cm_destroy_id+0x23a/0x680 [ib_cm]
[<0>] _destroy_id+0xcf/0x330 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x379/0x390 [rdma_ucm]
[<0>] ucma_close+0x78/0xb0 [rdma_ucm]
[<0>] __fput+0xe3/0x2a0
[<0>] task_work_run+0x5c/0x90
[<0>] do_exit+0x1e3/0x447
[<0>] do_group_exit+0x30/0x80
[<0>] get_signal+0x88d/0x88d
[<0>] arch_do_signal_or_restart+0x34/0x110
[<0>] exit_to_user_mode_loop+0x4a/0x160
[<0>] do_syscall_64+0x1b8/0x940
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e



Thxs, Håkon




* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-15 11:38         ` Haakon Bugge
@ 2025-10-15 16:49           ` Jason Gunthorpe
  2025-10-15 18:34             ` Sean Hefty
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-10-15 16:49 UTC (permalink / raw)
  To: Haakon Bugge
  Cc: Jacob Moroni, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Wed, Oct 15, 2025 at 11:38:10AM +0000, Haakon Bugge wrote:
> With this hack, running cmtime with 10.000 connections in loopback,
> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed
> out. state 6 -> 0, refcnt=1" messages are indeed produced. Had to
> kill cmtime because it was hanging, and then it got defunct with the
> following stack:

Seems like a bug; it should not hang forever if a MAD is lost.

Jason


* RE: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-15 16:49           ` Jason Gunthorpe
@ 2025-10-15 18:34             ` Sean Hefty
  2025-10-15 18:45               ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Hefty @ 2025-10-15 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Haakon Bugge
  Cc: Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu, Or Har-Toov,
	Manjunath Patil, OFED mailing list, linux-kernel@vger.kernel.org

> > With this hack, running cmtime with 10.000 connections in loopback,
> > the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> > state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> > cmtime because it was hanging, and then it got defunct with the
> > following stack:
> 
> Seems like a bug, it should not hang forever if a MAD is lost..

The hack skipped calling ib_post_send.  But the result of that is a completion is never written to the CQ.  The state machine or reference counting is likely waiting for the completion, so it knows that HW is done trying to access the buffer.

- Sean


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-15 18:34             ` Sean Hefty
@ 2025-10-15 18:45               ` Jason Gunthorpe
  2025-10-16 15:25                 ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-10-15 18:45 UTC (permalink / raw)
  To: Sean Hefty
  Cc: Haakon Bugge, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
> > > With this hack, running cmtime with 10.000 connections in loopback,
> > > the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> > > state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> > > cmtime because it was hanging, and then it got defunct with the
> > > following stack:
> > 
> > Seems like a bug, it should not hang forever if a MAD is lost..
> 
> The hack skipped calling ib_post_send.  But the result of that is a
> completion is never written to the CQ.  The state machine or
> reference counting is likely waiting for the completion, so it knows
> that HW is done trying to access the buffer.

That does make sense, it has to immediately trigger the completion to
be accurate. A better test would be to truncate the mad or something
so it can't be rx'd

Jason


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-15 18:45               ` Jason Gunthorpe
@ 2025-10-16 15:25                 ` Haakon Bugge
  2025-10-16 16:12                   ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Haakon Bugge @ 2025-10-16 15:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org



> On 15 Oct 2025, at 20:45, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
>>>> With this hack, running cmtime with 10.000 connections in loopback,
>>>> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
>>>> state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
>>>> cmtime because it was hanging, and then it got defunct with the
>>>> following stack:
>>> 
>>> Seems like a bug, it should not hang forever if a MAD is lost..
>> 
>> The hack skipped calling ib_post_send.  But the result of that is a
>> completion is never written to the CQ.


Which is exactly the behaviour I see when the VF gets "whacked". This is from a system without the reproducer hack. Looking at the netdev detected TX timeout:

mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197

(I get tons of the like)

There are two points here. All of them have "SQ Cons: 0x0", which to me implies that no TX CQE has ever been polled for any of them.

The other point is that we do _not_ see "Recovered %d eqes on EQ 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns zero), which means that either a) no CQE has been generated by the HCA or b) a CQE has been generated but no corresponding EQE has been written to the EQ.

>>  The state machine or
>> reference counting is likely waiting for the completion, so it knows
>> that HW is done trying to access the buffer.
> 
> That does make sense, it has to immediately trigger the completion to
> be accurate. A better test would be to truncate the mad or something
> so it can't be rx'd

As argued above, I think my reproducer hack is sound and to the point.

I will raise an NVIDIA ticket as to why the VF gets into this "whacked" state. But, since Google has encountered this, we have one customer, and I am able to repro this situation (without my hack) in 1-3 days running on a single server, do we expect the RDMA stack to handle this situation?


Thxs, Håkon




* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-16 15:25                 ` Haakon Bugge
@ 2025-10-16 16:12                   ` Jason Gunthorpe
  2025-10-16 16:43                     ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-10-16 16:12 UTC (permalink / raw)
  To: Haakon Bugge
  Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Thu, Oct 16, 2025 at 03:25:15PM +0000, Haakon Bugge wrote:
> 
> 
> > On 15 Oct 2025, at 20:45, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
> >>>> With this hack, running cmtime with 10.000 connections in loopback,
> >>>> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> >>>> state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> >>>> cmtime because it was hanging, and then it got defunct with the
> >>>> following stack:
> >>> 
> >>> Seems like a bug, it should not hang forever if a MAD is lost..
> >> 
> >> The hack skipped calling ib_post_send.  But the result of that is a
> >> completion is never written to the CQ.
> 
> 
> Which is exactly the behaviour I see when the VF gets "whacked". This is from a system without the reproducer hack. Looking at the netdev detected TX timeout:
>
> mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
> mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
> mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197
> 
> (I get tons of the like)
> 
> There are two points here. All of them has "SQ Cons: 0x0", which to me implies that no TX CQE has ever been polled for any of them.
 
> The other point is that we do _not_ see "Recovered %d eqes on EQ
> 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns
> zero), which means that either a) no CQE has been generated by the
> HCA or b) a CQE has been generated but no corresponding EQE has been
> written to the EQ.

Lost interrupts/CQEs are an obnoxiously common bug in virtualization
environments. Be sure you are running the latest NIC firmware, and be
sure you have all the qemu/kvm fixes.

But yes, if you hit these bugs then the QP gets effectively stuck
forever.

We don't have a stuck QP watchdog for the GMP QP, IIRC. Perhaps we
should, but I'd also argue that if you are losing interrupts for GMP
QPs, then your VM platform is so broken it won't succeed in running
normal RDMA applications :\

At the end of the day you must not have these "TX timeout" type
errors; they are very serious. Whatever bugs cause them must be
squashed.

> >>  The state machine or
> >> reference counting is likely waiting for the completion, so it knows
> >> that HW is done trying to access the buffer.
> > 
> > That does make sense, it has to immediately trigger the completion to
> > be accurate. A better test would be to truncate the mad or something
> > so it can't be rx'd
> 
> As argued above, I think my reproducer hack is sound and to the point.

Not quite, you are just losing CQEs. We should never lose a CQE.

Yes, perhaps your QP can become permanently stuck, and that's bad. But
the fix is to detect the stuck QP, push it through to error, and drain
it, generating all the flush-error CQEs without any loss.

To better model what you are seeing, you want to do something like
randomly dropping the GMP QP doorbell ring; that will cause the QP to
get stuck, similar to a lost interrupt/etc.

Jason


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-16 16:12                   ` Jason Gunthorpe
@ 2025-10-16 16:43                     ` Haakon Bugge
  2025-10-16 18:01                       ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Haakon Bugge @ 2025-10-16 16:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org



> On 16 Oct 2025, at 18:12, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Oct 16, 2025 at 03:25:15PM +0000, Haakon Bugge wrote:
>> 
>> 
>>> On 15 Oct 2025, at 20:45, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> 
>>> On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
>>>>>> With this hack, running cmtime with 10.000 connections in loopback,
>>>>>> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
>>>>>> state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
>>>>>> cmtime because it was hanging, and then it got defunct with the
>>>>>> following stack:
>>>>> 
>>>>> Seems like a bug, it should not hang forever if a MAD is lost..
>>>> 
>>>> The hack skipped calling ib_post_send.  But the result of that is a
>>>> completion is never written to the CQ.
>> 
>> 
>> Which is exactly the behaviour I see when the VF gets "whacked". This is from a system without the reproducer hack. Looking at the netdev detected TX timeout:
>> 
>> mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
>> mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
>> mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197
>> 
>> (I get tons of the like)
>> 
>> There are two points here. All of them has "SQ Cons: 0x0", which to me implies that no TX CQE has ever been polled for any of them.
> 
>> The other point is that we do _not_ see "Recovered %d eqes on EQ
>> 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns
>> zero), which means that either a) no CQE has been generated by the
>> HCA or b) a CQE has been generated but no corresponding EQE has been
>> written to the EQ.
> 
> Lost interrupts/cqe are an obnoxiously common bug in virtualization
> environments. Be sure you are running latest NIC firmware. Be sure you
> have all the qemu/kvm fixes.

Sorry, maybe I did not mention it, but I run bare metal with a bunch of VFs instantiated.

> But yes, if you hit these bugs then the QP gets effectively stuck
> forever.
> 
> We don't have a stuck QP watchdog for the GMP QP, IIRC. Perhaps we
> should, but I'd also argue if you are loosing interrupts for GMP QPs
> then your VM platform is so broken it won't succeed to run normal RDMA
> applications :\

Most probably. But the intent of mlx5e_tx_reporter_timeout_recover() is to recover, yet it does not manually poll the CQ. What I mean is: if there are no EQEs to recover, but the CQ is non-empty, should those CQEs somehow be handled?

> At the end of the day you must not have these "TX timeout" type
> errors, they are very very serious. Whatever bugs cause them must be
> squashed.
I can agree to that :-) But, life is not perfect.

> 
>>>> The state machine or
>>>> reference counting is likely waiting for the completion, so it knows
>>>> that HW is done trying to access the buffer.
>>> 
>>> That does make sense, it has to immediately trigger the completion to
>>> be accurate. A better test would be to truncate the mad or something
>>> so it can't be rx'd
>> 
>> As argued above, I think my reproducer hack is sound and to the point.
> 
> Not quite, you are just loosing CQEs. We should never loose a CQE.
> 
> Yes perhaps your QP can become permanently stuck, and that's bad. But
> the fix is to detect the stuck QP, push it through to error and drain
> it generating all the err CQs without any loss.

In my opinion, it is not a QP that is stuck; it is the VF. I gave an example above where I ran into this issue (using RDS), but when creating a connection from user space using ib_send_bw, the newly created cm_id also got stuck, presumably due to the lack of a CQE, EQE, or interrupt.

> To better model what you are seeing you want to do something like
> randomly drop the GMP QP doorbell ring, that will cause the QP to get
> stuck similar to a lost interrupt/etc.


Well, I started off this thread thinking a cm_deref_id() was missing somewhere, but now I am more inclined to think as you do: this is an unrecoverable situation, and I should work with NVIDIA to fix it.


Thxs, Håkon




* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-16 16:43                     ` Haakon Bugge
@ 2025-10-16 18:01                       ` Jason Gunthorpe
  2025-10-21 16:32                         ` Haakon Bugge
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-10-16 18:01 UTC (permalink / raw)
  To: Haakon Bugge
  Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:

> Well, I started off this thread thinking a cm_deref_id() was missing
> somewhere, but now I am more inclined to think as you do, this is an
> unrecoverable situation, and I should work with NVIDIA to fix it.

If the VF is just stuck and not progressing QPs for whatever reason
then yes absolutely.

At best all we can do is detect stuck QPs and try to recover them as I
described.

How hard/costly would it be to have a tx timer watchdog on the mad
layer send q?

At the very least we could log a stuck MAD QP..

Jason


* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
  2025-10-16 18:01                       ` Jason Gunthorpe
@ 2025-10-21 16:32                         ` Haakon Bugge
  0 siblings, 0 replies; 17+ messages in thread
From: Haakon Bugge @ 2025-10-21 16:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org



> On 16 Oct 2025, at 20:01, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:
> 
>> Well, I started off this thread thinking a cm_deref_id() was missing
>> somewhere, but now I am more inclined to think as you do, this is an
>> unrecoverable situation, and I should work with NVIDIA to fix it.
> 
> If the VF is just stuck and not progressing QPs for whatever reason
> then yes absolutely.

We are running with MULTI_PORT_VHCA_EN=1 (i.e., one device, two ports), and I see that only one of the ports in the function gets into this situation. And yes, an mlnx ticket has been filed.

> At best all we can do is detect stuck QPs and try to recover them as I
> described.

It applies to all QPs, not only GSI MADs, and, as reported above, new QPs created from user space run into the same situation. I tried an FLR, but the RDMA stack is not able to recover from it. So, from my perspective, only a reboot helps. In other words, it is unrecoverable from a SW perspective.

> How hard/costly would it be to have a tx timer watchdog on the mad
> layer send q?

Maybe Steve can answer that. But from my perspective, the "destroy CM ID timeout error" message is _the_ signature of the situation. And anyone seeing it would probably read through this thread...

> At the very least we could log a stuck MAD QP..

That won't hurt, but I do not expect all other ULPs and user-space apps to handle the case where a CQE is expected but never comes.


Thxs, Håkon




end of thread, other threads:[~2025-10-21 16:33 UTC | newest]

Thread overview: 17+ messages
2025-09-12 10:05 [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message Håkon Bugge
2025-09-12 19:27 ` yanjun.zhu
2025-09-15  7:43 ` Leon Romanovsky
2025-09-15  9:44   ` Haakon Bugge
2025-09-16 14:18 ` Jason Gunthorpe
2025-09-16 14:36   ` Jacob Moroni
2025-09-25 11:29     ` Haakon Bugge
2025-10-13 14:04       ` Haakon Bugge
2025-10-15 11:38         ` Haakon Bugge
2025-10-15 16:49           ` Jason Gunthorpe
2025-10-15 18:34             ` Sean Hefty
2025-10-15 18:45               ` Jason Gunthorpe
2025-10-16 15:25                 ` Haakon Bugge
2025-10-16 16:12                   ` Jason Gunthorpe
2025-10-16 16:43                     ` Haakon Bugge
2025-10-16 18:01                       ` Jason Gunthorpe
2025-10-21 16:32                         ` Haakon Bugge
