* [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
@ 2025-09-12 10:05 Håkon Bugge
2025-09-12 19:27 ` yanjun.zhu
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Håkon Bugge @ 2025-09-12 10:05 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
Or Har-Toov, Håkon Bugge, Jacob Moroni, Manjunath Patil
Cc: linux-rdma, linux-kernel

When the destroy CM ID timeout kicks in, you typically get a storm of
them which creates a log flooding. Hence, change pr_err() to
pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
---
 drivers/infiniband/core/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 92678e438ff4d..01bede8ba1055 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
 	struct cm_id_private *cm_id_priv;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
-	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
-	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
+	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
+			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
 }
 
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
--
2.43.5
^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: yanjun.zhu @ 2025-09-12 19:27 UTC (permalink / raw)
To: Håkon Bugge, Jason Gunthorpe, Leon Romanovsky, Sean Hefty,
	Vlad Dumitrescu, Or Har-Toov, Jacob Moroni, Manjunath Patil
Cc: linux-rdma, linux-kernel

On 9/12/25 3:05 AM, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  drivers/infiniband/core/cm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 92678e438ff4d..01bede8ba1055 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -1049,8 +1049,8 @@ static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id,
>  	struct cm_id_private *cm_id_priv;
> 
>  	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
> -	pr_err("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> -	       cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));
> +	pr_err_ratelimited("%s: cm_id=%p timed out. state %d -> %d, refcnt=%d\n", __func__,
> +			   cm_id, old_state, cm_id->state, refcount_read(&cm_id_priv->refcount));

When many CMs time out, this pr_err can generate excessive noise. Using
the _ratelimited variant will help alleviate the problem.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

> }
> 
> static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Leon Romanovsky @ 2025-09-15  7:43 UTC (permalink / raw)
To: Jason Gunthorpe, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, Håkon Bugge
Cc: linux-rdma, linux-kernel

On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Applied, thanks!

[1/1] RDMA/cm: Rate limit destroy CM ID timeout error message
      https://git.kernel.org/rdma/rdma/c/2bbe1255fcf19c

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-09-15  9:44 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

> On 15 Sep 2025, at 09:43, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Fri, 12 Sep 2025 12:05:20 +0200, Håkon Bugge wrote:
>> When the destroy CM ID timeout kicks in, you typically get a storm of
>> them which creates a log flooding. Hence, change pr_err() to
>> pr_err_ratelimited() in cm_destroy_id_wait_timeout().
> 
> Applied, thanks!

Thanks for the quick turnaround, Leon and Zhu Yanjun!


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jason Gunthorpe @ 2025-09-16 14:18 UTC (permalink / raw)
To: Håkon Bugge
Cc: Leon Romanovsky, Sean Hefty, Vlad Dumitrescu, Or Har-Toov,
	Jacob Moroni, Manjunath Patil, linux-rdma, linux-kernel

On Fri, Sep 12, 2025 at 12:05:20PM +0200, Håkon Bugge wrote:
> When the destroy CM ID timeout kicks in, you typically get a storm of
> them which creates a log flooding. Hence, change pr_err() to
> pr_err_ratelimited() in cm_destroy_id_wait_timeout().

Did you figure out why you were getting these? IIRC it signals a ULP
bug and is not expected.

Jason
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jacob Moroni @ 2025-09-16 14:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Håkon Bugge, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, linux-rdma, linux-kernel

Does this happen when there is a missing send completion?

Asking because I remember triggering this if a device encounters an
unrecoverable error/VF reset while under heavy RDMA-CM activity (like
a large scale MPI wire-up).

I assumed it was because RDMA-CM was waiting for TX completions that
would never arrive.

Of course, the unrecoverable error/VF reset without generating flush
completions was the real bug in my case.

- Jake
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-09-25 11:29 UTC (permalink / raw)
To: Jacob Moroni
Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,

> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@google.com> wrote:
> 
> Does this happen when there is a missing send completion?
> 
> Asking because I remember triggering this if a device encounters an
> unrecoverable error/VF reset while under heavy RDMA-CM activity
> (like a large scale MPI wire-up).
> 
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
> 
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real bug in my case.

I concur. I looked ahead of the first incident, but didn't see any
obscure mlx5 driver messages. But looking in-between, I saw:

kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c

Not in close proximity in time, but a six-digit number of messages were
suppressed due to the flooding.

My take is that the timeouts should be monotonically increasing from
the driver to RDMA_CM (and on to the ULPs). They are not, as the
mlx5e_build_nic_netdev() function sets the netdev's watchdog_timeo to
15 seconds, whereas the timeout leading to cm_destroy_id_wait_timeout()
is 10 seconds. So, the mitigation triggered by netdev's TX timeout
detection has not kicked in by the time cm_destroy_id_wait_timeout() is
called.


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-10-13 14:04 UTC (permalink / raw)
To: Jacob Moroni
Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,

> On 25 Sep 2025, at 13:29, Haakon Bugge <haakon.bugge@oracle.com> wrote:
> 
> [snip]

Got another incident this weekend. The first
cm_destroy_id_wait_timeout() message at:

Oct 10 12:10:31 lab64 kernel: cm_destroy_id_wait_timeout: cm_id=00000000803feac1 timed out. state 2 -> 0, refcnt=2

[]

Health error in close proximity in time:

# devlink health show ens4f14 reporter tx
auxiliary/mlx5_core.eth.8/524288:
  reporter tx
    state healthy error 256 recover 256 last_dump_date 2025-10-10
    last_dump_time 12:08:47 grace_period 500 auto_recover true auto_dump true

And also ndev timeout in close proximity in time:

# fgrep "TX timeout on queue" /var/log/messages-20251012
Oct 10 12:11:16 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 5, SQ: 0xfb43, CQ: 0x66b, SQ Cons: 0x0 SQ Prod: 0x4, usecs since last trans: 20992000
Oct 10 13:01:05 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 3, SQ: 0x1088a, CQ: 0xcae, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 16390000

[]

4140 cm_ids in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4140    4140   95220

By running ib_send_bw using the suspected bad device:

[1]-  Running    ib_send_bw -R -d mlx5_7 -i 2 &
[2]+  Running    ib_send_bw -R -d mlx5_7 -i 1 192.168.64.15 &

Said processes are hanging and we got one more cm_id in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4141    4141   95243

Stacks:

# cat /proc/435477/stack
[<0>] ucma_get_event+0xf0/0x240 [rdma_ucm]
[<0>] ucma_write+0xf8/0x170 [rdma_ucm]
[<0>] vfs_write+0xd4/0x294
[<0>] ksys_write+0xbb/0xe8
[<0>] do_syscall_64+0x35/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0

# cat /proc/436606/stack
[<0>] cm_destroy_id+0x16a/0x61c [ib_cm]
[<0>] _destroy_id+0x5a/0x230 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x354/0x370 [rdma_ucm]
[<0>] ucma_close+0x7a/0xb0 [rdma_ucm]
[<0>] __fput+0x96/0x24e
[<0>] task_work_run+0x65/0x9b
[<0>] do_exit+0x215/0x489
[<0>] do_group_exit+0x33/0x96
[<0>] get_signal+0x14a/0x967
[<0>] arch_do_signal_or_restart+0x10d/0x12a
[<0>] exit_to_user_mode_loop+0xc4/0x1b1
[<0>] exit_to_user_mode_prepare+0x124/0x12c
[<0>] syscall_exit_to_user_mode+0x18/0x42
[<0>] do_syscall_64+0x42/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0

Distribution of old_state:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ ++state[$7]; } END {for (s in state) printf("%2d: %d\n", s, state[s]); }'
 0: 192
 2: 10546
 6: 102
11: 513
13: 148

My take is that the VF in question here gets whacked and that the MAD
timeout handling does not resonate well with how CMA handles them.


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-10-15 11:38 UTC (permalink / raw)
To: Jacob Moroni
Cc: Jason Gunthorpe, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

Hi Jason and Jake,

> On 13 Oct 2025, at 16:04, Haakon Bugge <haakon.bugge@oracle.com> wrote:

[snip]

> My take is that the VF in question here gets whacked and that the MAD
> timeout handling does not resonate well with how CMA handles them.

I was able to simulate a whacked VF by setting the CMA max retries to
one and, once in a while, skipping the sending of the MAD:

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9b471548e7ae1..43eff54151830 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -45,7 +45,7 @@ MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
 #define CMA_CM_RESPONSE_TIMEOUT 20
-#define CMA_MAX_CM_RETRIES 15
+#define CMA_MAX_CM_RETRIES 1
 #define CMA_IBOE_PACKET_LIFETIME 16
 #define CMA_PREFERRED_ROCE_GID_TYPE IB_GID_TYPE_ROCE_UDP_ENCAP
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 21c8669dd1371..9c19333a507d8 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1057,9 +1057,13 @@ int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 
 	spin_lock_irqsave(&qp_info->send_queue.lock, flags);
 	if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
-		trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
-		ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
-				   NULL);
+		if (!(jiffies % 10000)) {
+			pr_err("Skipping ib_post_send\n");
+		} else {
+			trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
+			ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
+					   NULL);
+		}
 		list = &qp_info->send_queue.list;
 	} else {
 		ret = 0;

With this hack, running cmtime with 10,000 connections in loopback,
the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
cmtime because it was hanging, and then it got defunct with the
following stack:

# cat /proc/7977/task/7978/stack
[<0>] cm_destroy_id+0x23a/0x680 [ib_cm]
[<0>] _destroy_id+0xcf/0x330 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x379/0x390 [rdma_ucm]
[<0>] ucma_close+0x78/0xb0 [rdma_ucm]
[<0>] __fput+0xe3/0x2a0
[<0>] task_work_run+0x5c/0x90
[<0>] do_exit+0x1e3/0x447
[<0>] do_group_exit+0x30/0x80
[<0>] get_signal+0x88d/0x88d
[<0>] arch_do_signal_or_restart+0x34/0x110
[<0>] exit_to_user_mode_loop+0x4a/0x160
[<0>] do_syscall_64+0x1b8/0x940
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jason Gunthorpe @ 2025-10-15 16:49 UTC (permalink / raw)
To: Haakon Bugge
Cc: Jacob Moroni, Leon Romanovsky, Sean Hefty, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Wed, Oct 15, 2025 at 11:38:10AM +0000, Haakon Bugge wrote:
> With this hack, running cmtime with 10,000 connections in loopback,
> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed
> out. state 6 -> 0, refcnt=1" messages are indeed produced. Had to
> kill cmtime because it was hanging, and then it got defunct with the
> following stack:

Seems like a bug, it should not hang forever if a MAD is lost..

Jason
* RE: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Sean Hefty @ 2025-10-15 18:34 UTC (permalink / raw)
To: Jason Gunthorpe, Haakon Bugge
Cc: Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu, Or Har-Toov,
	Manjunath Patil, OFED mailing list, linux-kernel@vger.kernel.org

> > With this hack, running cmtime with 10,000 connections in loopback,
> > the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> > state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> > cmtime because it was hanging, and then it got defunct with the
> > following stack:
> 
> Seems like a bug, it should not hang forever if a MAD is lost..

The hack skipped calling ib_post_send. But the result of that is a
completion is never written to the CQ. The state machine or reference
counting is likely waiting for the completion, so it knows that HW is
done trying to access the buffer.

- Sean
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jason Gunthorpe @ 2025-10-15 18:45 UTC (permalink / raw)
To: Sean Hefty
Cc: Haakon Bugge, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
> > > With this hack, running cmtime with 10,000 connections in loopback,
> > > the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> > > state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> > > cmtime because it was hanging, and then it got defunct with the
> > > following stack:
> > 
> > Seems like a bug, it should not hang forever if a MAD is lost..
> 
> The hack skipped calling ib_post_send. But the result of that is a
> completion is never written to the CQ. The state machine or
> reference counting is likely waiting for the completion, so it knows
> that HW is done trying to access the buffer.

That does make sense, it has to immediately trigger the completion to
be accurate. A better test would be to truncate the mad or something
so it can't be rx'd

Jason
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-10-16 15:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

> On 15 Oct 2025, at 20:45, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
>>>> With this hack, running cmtime with 10,000 connections in loopback,
>>>> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
>>>> state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
>>>> cmtime because it was hanging, and then it got defunct with the
>>>> following stack:
>>> 
>>> Seems like a bug, it should not hang forever if a MAD is lost..
>> 
>> The hack skipped calling ib_post_send. But the result of that is a
>> completion is never written to the CQ.

Which is exactly the behaviour I see when the VF gets "whacked". This
is from a system without the reproducer hack. Looking at the netdev
detected TX timeout:

mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197

(I get tons of the like)

There are two points here. All of them have "SQ Cons: 0x0", which to
me implies that no TX CQE has ever been polled for any of them. The
other point is that we do _not_ see "Recovered %d eqes on EQ 0x%x"
(which is because mlx5_eq_poll_irq_disabled() always returns zero),
which means that either a) no CQE has been generated by the HCA or b)
a CQE has been generated but no corresponding EQE has been written to
the EQ.

>> The state machine or
>> reference counting is likely waiting for the completion, so it knows
>> that HW is done trying to access the buffer.
> 
> That does make sense, it has to immediately trigger the completion to
> be accurate. A better test would be to truncate the mad or something
> so it can't be rx'd

As argued above, I think my reproducer hack is sound and to the point.

I will raise an NVIDIA ticket as to why the VF gets into this
"whacked" state. But, since Google has encountered this, we have one
customer, and I am able to repro this situation (without my hack) in
1-3 days running on a single server, do we expect the RDMA stack to
handle this situation?


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jason Gunthorpe @ 2025-10-16 16:12 UTC (permalink / raw)
To: Haakon Bugge
Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Thu, Oct 16, 2025 at 03:25:15PM +0000, Haakon Bugge wrote:
> Which is exactly the behaviour I see when the VF gets "whacked". This
> is from a system without the reproducer hack. Looking at the netdev
> detected TX timeout:
> 
> mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
> mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
> mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197
> 
> (I get tons of the like)
> 
> There are two points here. All of them have "SQ Cons: 0x0", which to
> me implies that no TX CQE has ever been polled for any of them.
> The other point is that we do _not_ see "Recovered %d eqes on EQ
> 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns
> zero), which means that either a) no CQE has been generated by the
> HCA or b) a CQE has been generated but no corresponding EQE has been
> written to the EQ.

Lost interrupts/cqe are an obnoxiously common bug in virtualization
environments. Be sure you are running latest NIC firmware. Be sure you
have all the qemu/kvm fixes.

But yes, if you hit these bugs then the QP gets effectively stuck
forever.

We don't have a stuck QP watchdog for the GMP QP, IIRC. Perhaps we
should, but I'd also argue if you are losing interrupts for GMP QPs
then your VM platform is so broken it won't succeed to run normal RDMA
applications :\

At the end of the day you must not have these "TX timeout" type
errors, they are very very serious. Whatever bugs cause them must be
squashed.

> >> The state machine or
> >> reference counting is likely waiting for the completion, so it knows
> >> that HW is done trying to access the buffer.
> > 
> > That does make sense, it has to immediately trigger the completion to
> > be accurate. A better test would be to truncate the mad or something
> > so it can't be rx'd
> 
> As argued above, I think my reproducer hack is sound and to the point.

Not quite, you are just losing CQEs. We should never lose a CQE.

Yes perhaps your QP can become permanently stuck, and that's bad. But
the fix is to detect the stuck QP, push it through to error and drain
it, generating all the err CQs without any loss.

To better model what you are seeing you want to do something like
randomly drop the GMP QP doorbell ring, that will cause the QP to get
stuck similar to a lost interrupt/etc.

Jason
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-10-16 16:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

> On 16 Oct 2025, at 18:12, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Oct 16, 2025 at 03:25:15PM +0000, Haakon Bugge wrote:
>> Which is exactly the behaviour I see when the VF gets "whacked".
>> This is from a system without the reproducer hack. Looking at the
>> netdev detected TX timeout:
>> 
>> mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
>> mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
>> mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197
>> 
>> (I get tons of the like)
>> 
>> There are two points here. All of them have "SQ Cons: 0x0", which to
>> me implies that no TX CQE has ever been polled for any of them.
>> The other point is that we do _not_ see "Recovered %d eqes on EQ
>> 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns
>> zero), which means that either a) no CQE has been generated by the
>> HCA or b) a CQE has been generated but no corresponding EQE has been
>> written to the EQ.
> 
> Lost interrupts/cqe are an obnoxiously common bug in virtualization
> environments. Be sure you are running latest NIC firmware. Be sure you
> have all the qemu/kvm fixes.

Sorry, maybe I did not mention it, but I run BM with a bunch of VFs
instantiated.

> But yes, if you hit these bugs then the QP gets effectively stuck
> forever.
> 
> We don't have a stuck QP watchdog for the GMP QP, IIRC. Perhaps we
> should, but I'd also argue if you are losing interrupts for GMP QPs
> then your VM platform is so broken it won't succeed to run normal RDMA
> applications :\

Most probably. But the intent of mlx5e_tx_reporter_timeout_recover()
is to recover, yet it does not manually poll the CQ. What I mean is:
if there are no EQEs to recover, but the CQ is non-empty, should they
somehow be handled?

> At the end of the day you must not have these "TX timeout" type
> errors, they are very very serious. Whatever bugs cause them must be
> squashed.

I can agree to that :-) But, life is not perfect.

> Not quite, you are just losing CQEs. We should never lose a CQE.
> 
> Yes perhaps your QP can become permanently stuck, and that's bad. But
> the fix is to detect the stuck QP, push it through to error and drain
> it, generating all the err CQs without any loss.

In my opinion, it is not a QP that is stuck. It is the VF. I had an
example above where I ran into this issue (using RDS), but creating a
connection from user-space using ib_send_bw, the newly created cm_id
also got stuck, presumably due to the lack of a CQE, EQE, or
interrupt.

> To better model what you are seeing you want to do something like
> randomly drop the GMP QP doorbell ring, that will cause the QP to get
> stuck similar to a lost interrupt/etc.

Well, I started off this thread thinking a cm_deref_id() was missing
somewhere, but now I am more inclined to think as you do: this is an
unrecoverable situation, and I should work with NVIDIA to fix it.


Thxs, Håkon
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Jason Gunthorpe @ 2025-10-16 18:01 UTC (permalink / raw)
To: Haakon Bugge
Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:
> Well, I started off this thread thinking a cm_deref_id() was missing
> somewhere, but now I am more inclined to think as you do, this is an
> unrecoverable situation, and I should work with NVIDIA to fix it.

If the VF is just stuck and not progressing QPs for whatever reason
then yes absolutely.

At best all we can do is detect stuck QPs and try to recover them as I
described.

How hard/costly would it be to have a tx timer watchdog on the mad
layer send q? At the very least we could log a stuck MAD QP..

Jason
* Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge @ 2025-10-21 16:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Sean Hefty, Jacob Moroni, Leon Romanovsky, Vlad Dumitrescu,
	Or Har-Toov, Manjunath Patil, OFED mailing list,
	linux-kernel@vger.kernel.org

> On 16 Oct 2025, at 20:01, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:
>> Well, I started off this thread thinking a cm_deref_id() was missing
>> somewhere, but now I am more inclined to think as you do, this is an
>> unrecoverable situation, and I should work with NVIDIA to fix it.
> 
> If the VF is just stuck and not progressing QPs for whatever reason
> then yes absolutely.

We are running with MULTI_PORT_VHCA_EN=1 (i.e., one device, two
ports), and I see that it is only one of the ports in the function
that gets into this situation. And yes, mlnx ticket filed.

> At best all we can do is detect stuck QPs and try to recover them as I
> described.

It applies to all QPs, not only GSI MADs, and, as reported above, new
QPs created from user-space run into the same situation. I tried an
FLR, but the RDMA stack is not able to recover from it. So, from my
perspective, only a reboot helps. In other words, unrecoverable from a
SW perspective.

> How hard/costly would it be to have a tx timer watchdog on the mad
> layer send q?

Maybe Steve can answer that. But from my perspective, the "destroy CM
ID timeout error" message is _the_ signature of the situation. And,
anyone seeing it would probably read through this thread...

> At the very least we could log a stuck MAD QP..

That won't hurt, but I do not expect all other ULPs and user-space
apps to handle the case where a CQE is expected but never comes.


Thxs, Håkon
end of thread, other threads:[~2025-10-21 16:33 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-12 10:05 [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message Håkon Bugge
2025-09-12 19:27 ` yanjun.zhu
2025-09-15  7:43 ` Leon Romanovsky
2025-09-15  9:44   ` Haakon Bugge
2025-09-16 14:18 ` Jason Gunthorpe
2025-09-16 14:36   ` Jacob Moroni
2025-09-25 11:29     ` Haakon Bugge
2025-10-13 14:04       ` Haakon Bugge
2025-10-15 11:38         ` Haakon Bugge
2025-10-15 16:49           ` Jason Gunthorpe
2025-10-15 18:34             ` Sean Hefty
2025-10-15 18:45               ` Jason Gunthorpe
2025-10-16 15:25                 ` Haakon Bugge
2025-10-16 16:12                   ` Jason Gunthorpe
2025-10-16 16:43                     ` Haakon Bugge
2025-10-16 18:01                       ` Jason Gunthorpe
2025-10-21 16:32                         ` Haakon Bugge