* [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
@ 2024-11-12 13:49 Mohammad Heib
From: Mohammad Heib @ 2024-11-12 13:49 UTC (permalink / raw)
To: linux-rdma, selvin.xavier, kashyap.desai; +Cc: Mohammad Heib
If the bnxt FW misbehaves, because of a FW bug or other unexpected
behavior, it can send completions for old cookies that have already been
handled by the bnxt driver. If such an old cookie was associated with an
old calling context, the driver will access that caller's memory again,
because the driver never clears the is_waiter_alive flag after the
caller successfully completes its wait. This access causes the following
kernel panic:
Call Trace:
<IRQ>
? __die+0x20/0x70
? page_fault_oops+0x75/0x170
? exc_page_fault+0xaa/0x140
? asm_exc_page_fault+0x22/0x30
? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
? srso_return_thunk+0x5/0x5f
? __wake_up_common+0x78/0xa0
? srso_return_thunk+0x5/0x5f
bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
tasklet_action_common+0xac/0x210
handle_softirqs+0xd3/0x2b0
__irq_exit_rcu+0x9b/0xc0
common_interrupt+0x7f/0xa0
</IRQ>
<TASK>
To avoid the above unexpected behavior, clear the is_waiter_alive flag
every time the caller finishes waiting for a completion.
Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
---
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
index f5713e3c39fb..eaf92029862b 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
@@ -511,15 +511,15 @@ static int __bnxt_qplib_rcfw_send_message(struct bnxt_qplib_rcfw *rcfw,
 	else
 		rc = __poll_for_resp(rcfw, cookie);
 
-	if (rc) {
-		spin_lock_irqsave(&rcfw->cmdq.hwq.lock, flags);
-		crsqe = &rcfw->crsqe_tbl[cookie];
-		crsqe->is_waiter_alive = false;
-		if (rc == -ENODEV)
-			set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags);
-		spin_unlock_irqrestore(&rcfw->cmdq.hwq.lock, flags);
+
+	spin_lock_irqsave(&rcfw->cmdq.hwq.lock, flags);
+	crsqe = &rcfw->crsqe_tbl[cookie];
+	crsqe->is_waiter_alive = false;
+	if (rc == -ENODEV)
+		set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags);
+	spin_unlock_irqrestore(&rcfw->cmdq.hwq.lock, flags);
+	if (rc)
 		return -ETIMEDOUT;
-	}
 
 	if (evnt->status) {
 		/* failed with status */
--
2.34.3
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Leon Romanovsky @ 2024-11-14 10:04 UTC (permalink / raw)
To: Mohammad Heib, selvin.xavier; +Cc: linux-rdma, kashyap.desai

On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
> can send completions for old cookies that have already been handled by the
> bnxt driver.
> [...]
>  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)

Selvin?
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Selvin Xavier @ 2024-11-14 10:07 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Mohammad Heib, linux-rdma, kashyap.desai

On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> > [...]
>
> Selvin?

Someone is confirming the fix. Will ack in a day.

Thanks
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Leon Romanovsky @ 2024-11-14 11:45 UTC (permalink / raw)
To: Selvin Xavier; +Cc: Mohammad Heib, linux-rdma, kashyap.desai

On Thu, Nov 14, 2024 at 03:37:30PM +0530, Selvin Xavier wrote:
> [...]
> Someone is confirming the fix. Will ack in a day. Thanks

Thanks
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Selvin Xavier @ 2024-11-16 8:03 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Mohammad Heib, linux-rdma, kashyap.desai

On Thu, Nov 14, 2024 at 5:15 PM Leon Romanovsky <leon@kernel.org> wrote:
> [...]

Mohammad,

We were trying to see how this is possible. The FW shouldn't be giving
out an old cookie. One possibility could be that the FW crashed and we
are in the recovery routine. Adding this check is okay, but it may be
hiding some other error. Is it possible to share your test scripts to
reproduce this problem? Also, can you share the vmcore dmesg?

Thanks
Selvin
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Mohammad Heib @ 2024-11-22 13:24 UTC (permalink / raw)
To: Selvin Xavier; +Cc: Leon Romanovsky, linux-rdma, kashyap.desai

On Sat, Nov 16, 2024 at 01:33:13PM +0530, Selvin Xavier wrote:
> [...]
> Is it possible to share your test scripts to repro this problem? Also,
> can you share the vmcore-demsg also

I have sent you all the needed data in a separate email.

Thanks,
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Kashyap Desai @ 2024-11-22 13:45 UTC (permalink / raw)
To: Mohammad Heib; +Cc: Selvin Xavier, Leon Romanovsky, linux-rdma

Hi All,

We will work with Red Hat on the final go. For now this patch is on
hold and not urgent.

Leon, hold this discussion for now.

Kashyap

On Fri, 22 Nov 2024, 18:54 Mohammad Heib, <mheib@redhat.com> wrote:
> [...]
> I have sent you all the needed data in a separate email.
* Re: [PATCH rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memory
From: Leon Romanovsky @ 2024-11-25 7:22 UTC (permalink / raw)
To: Kashyap Desai; +Cc: Mohammad Heib, Selvin Xavier, linux-rdma

On Fri, Nov 22, 2024 at 07:15:10PM +0530, Kashyap Desai wrote:
> Hi All,
>
> We will work with Redhat for final go.
> For now this patch is on hold and not urgent.
>
> Leon, hold this discussion for now.

I marked this patch as "deferred" in Patchwork for now:
https://patchwork.kernel.org/project/linux-rdma/patch/20241112134956.1415343-1-mheib@redhat.com/

Please resubmit once you get Acks from Selvin.

Thanks
* Re: [PATCH] Fix bnxt_re crash in bnxt_qplib_process_qp_event
From: Sherry Yang @ 2025-03-04 23:31 UTC (permalink / raw)
To: leon, kashyap.desai, mheib, selvin.xavier; +Cc: linux-rdma

Hi All,

I encountered a similar issue with the bnxt_re driver from Linux 6.12 to
6.14, where a KVM host kernel crash occurs in bnxt_qplib_process_qp_event
due to a write access to an invalid memory address (ffff9f058cedbb10)
after performing a few SR-IOV operations on the guest. It doesn't happen
on Linux 6.11. It can't be reproduced consistently; it happens 2 out of
5 times.

System details:
- NIC: Broadcom BCM57417 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller

The crash trace is as follows:

[ 6882.739369] BUG: unable to handle page fault for address: ffff9f058cedbb10
[ 6882.739771] #PF: supervisor write access in kernel mode
[ 6882.740127] #PF: error_code(0x0002) - not-present page
[ 6882.740417] PGD 100000067 P4D 100000067 PUD 1002e3067 PMD 107b10067 PTE 0
[ 6882.740696] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
[ 6882.740971] CPU: 23 UID: 0 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 6.12.0-0.16.14.el9uek.x86_64 #1
[ 6882.741528] RIP: 0010:bnxt_qplib_process_qp_event.isra.0+0xa5/0x323 [bnxt_re]
[ 6882.741827] Code: 74 0d 80 7d 01 00 75 07 f0 ff 8b d0 02 00 00 41 80 7f 11 00 0f 84 87 00 00 00 49 8b 17 48 85 d2 0f 84 0e 02 00 00 48 8b 4d 00 <48> 89 0a 48 8b 4d 08 48 89 4a 08 44 0f bf e0 41 8b 47 08 41 c7 47
[ 6882.742434] RSP: 0018:ffff9f058cf1ce88 EFLAGS: 00010282
[ 6882.742754] RAX: 0000000000000000 RBX: ffff904ceb600c80 RCX: 0000000000000338
[ 6882.743078] RDX: ffff9f058cedbb10 RSI: 0000000000000000 RDI: 0000000000000000
[ 6882.743395] RBP: ffff9044dc5bd660 R08: 0000000000000000 R09: 0000000000000000
[ 6882.743705] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90434b3f8000
[ 6882.743987] R13: ffff9f058cf1cf14 R14: ffff904ceb600c98 R15: ffff90444df40000
[ 6882.744272] FS: 0000000000000000(0000) GS:ffff908180e80000(0000) knlGS:0000000000000000
[ 6882.744556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6882.744839] CR2: ffff9f058cedbb10 CR3: 0000001773e38001 CR4: 00000000007726f0
[ 6882.745125] PKRU: 55555554
[ 6882.745406] Call Trace:
[ 6882.745686] <IRQ>
[ 6882.745964] ? show_trace_log_lvl+0x1b0/0x300
[ 6882.746247] ? show_trace_log_lvl+0x1b0/0x300
[ 6882.746529] ? bnxt_qplib_service_creq+0x16a/0x236 [bnxt_re]
[ 6882.746821] ? __die_body.cold+0x8/0x17
[ 6882.747099] ? page_fault_oops+0x162/0x16d
[ 6882.747397] ? exc_page_fault+0x16d/0x180
[ 6882.747700] ? asm_exc_page_fault+0x26/0x30
[ 6882.747975] ? bnxt_qplib_process_qp_event.isra.0+0xa5/0x323 [bnxt_re]
[ 6882.748250] ? bnxt_qplib_process_qp_event.isra.0+0x43/0x323 [bnxt_re]
[ 6882.748518] bnxt_qplib_service_creq+0x16a/0x236 [bnxt_re]
[ 6882.748785] tasklet_action_common+0xca/0x240
[ 6882.749042] handle_softirqs+0xe1/0x2ac
[ 6882.749295] __irq_exit_rcu+0xab/0xd0
[ 6882.749571] common_interrupt+0x85/0xa0
[ 6882.749835] </IRQ>
[ 6882.750094] <TASK>
[ 6882.750350] asm_common_interrupt+0x26/0x40
[ 6882.750622] RIP: 0010:cpuidle_enter_state+0xc6/0x430
[ 6882.750870] Code: 00 00 e8 dd 82 23 ff e8 38 f1 ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 79 f2 21 ff 45 84 ff 0f 85 b8 01 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 92 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[ 6882.751411] RSP: 0018:ffff9f05807dfe70 EFLAGS: 00000246
[ 6882.751698] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[ 6882.751990] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 6882.752281] RBP: ffff908180ec4f68 R08: 0000000000000000 R09: 0000000000000000
[ 6882.752598] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff89ce0900
[ 6882.752989] R13: 00000642833badfd R14: 0000000000000003 R15: 0000000000000000
[ 6882.753373] cpuidle_enter+0x2d/0x50
[ 6882.753701] cpuidle_idle_call+0xfd/0x170
[ 6882.754049] do_idle+0x7b/0xc0
[ 6882.754333] cpu_startup_entry+0x29/0x30
[ 6882.754597] start_secondary+0x11e/0x140
[ 6882.754856] common_startup_64+0x13e/0x141
[ 6882.755114] </TASK>
[ 6882.755357] Modules linked in: vfio_pci vfio_pci_core vhost_net vhost vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nf_conntrack_tftp tun nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set sunrpc vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel bnxt_re iTCO_wdt ipmi_ssif iTCO_vendor_support ib_uverbs kvm pcspkr acpi_ipmi ib_core ipmi_si i2c_i801 lpc_ich i2c_smbus ipmi_devintf ioatdma intel_pch_thermal wmi ipmi_msghandler fuse xfs qla2xxx sd_mod nvme_fc mgag200 sg drm_shmem_helper nvme_fabrics ahci crct10dif_pclmul crc32_pclmul drm_kms_helper nvme libahci nvme_keyring ghash_clmulni_intel i40e drm sha512_ssse3 sha256_ssse3 nvme_core bnxt_en igb libata megaraid_sas scsi_transport_fc sha1_ssse3 nvme_auth
[ 6882.755442] libie dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod aesni_intel gf128mul crypto_simd cryptd
[ 6882.758069] CR2: ffff9f058cedbb10

I would like to know what's going on with this issue or if there are any
workarounds available. Please let me know if further debugging logs or
tests are needed.

Thanks,
Sherry
* Re: [PATCH] Fix bnxt_re crash in bnxt_qplib_process_qp_event
From: Kashyap Desai @ 2025-03-05 10:59 UTC (permalink / raw)
To: Sherry Yang; +Cc: leon, mheib, selvin.xavier, linux-rdma

On Wed, Mar 5, 2025 at 5:02 AM Sherry Yang <sherry.yang@oracle.com> wrote:
> Hi All,
>
> I encountered a similar issue with the bnxt_re driver from Linux 6.12 to 6.14
> where a KVM host kernel crash occurs in bnxt_qplib_process_qp_event due to a
> write access to an invalid memory address (ffff9f058cedbb10) after performing
> few SRIOV operations on the guest. It doesn't happen on Linux 6.11. It can't be
> reproduced consistently, happens 2 out of 5 times.

The controller below supports experimental-level RDMA functionality.
We recommend using the next-generation controller for RDMA
functionality, not this one:

  Broadcom BCM57417 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller

We have seen panics like the one you mentioned in our testing. Some of
them are firmware issues, and not a straightforward driver fix. If you
do not have any use case for RDMA on this controller, disable the RDMA
feature on it.

Kashyap