Linux RDMA and InfiniBand development
* Infiniband crash
@ 2022-10-14 18:16 rug
  2022-10-14 19:21 ` Jason Gunthorpe
  0 siblings, 1 reply; 9+ messages in thread
From: rug @ 2022-10-14 18:16 UTC (permalink / raw)
  To: linux-rdma

Hi to whom it may concern,

We are hitting the following Mellanox InfiniBand problem on kernel 6.0.0 (and also on 5.10 and later); see the oops below.
Can you please help? This is on a running ia64 cluster.

Regards,

Rudi Gabler

[   31.915749] Unable to handle kernel NULL pointer dereference (address 0000000000000010)
[   31.915749] kworker/u17:0[44]: Oops 11012296146944 [1]
[   31.915749] Modules linked in: af_packet ib_iser libiscsi scsi_transport_iscsi nf_tables nfnetlink rpcrdma sunrpc ib_ipoib tg3 libphy ib_mthca fuse configfs dm_round_robin qla2xxx firmware_class dm_mirror dm_region_hash dm_log dm_multipath efivarfs

[   31.915749] CPU: 0 PID: 44 Comm: kworker/u17:0 Not tainted 6.0.0-gentoo-ia64 #5
[   31.915749] Hardware name: hp server BL860c, BIOS 04.32 05/21/2013
[   31.915749] Workqueue: ib-comp-unb-wq ib_cq_poll_work
[   31.915749] psr : 0000121008522030 ifs : 8000000000000ca1 ip  : [<a00000020036ba21>]    Not tainted (6.0.0-gentoo-ia64)
[   31.915749] ip is at mthca_poll_cq+0xc41/0x1620 [ib_mthca]
[   31.915749] unat: 0000000000000000 pfs : 0000000000000ca1 rsc : 0000000000000003
[   31.915749] rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000015555
[   31.915749] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
[   31.915749] csd : 0000000000000000 ssd : 0000000000000000
[   31.915749] b0  : a00000020036b290 b6  : a00000020036ade0 b7  : a00000010000bce0
[   31.915749] f6  : 1003ee000000106bf1c50 f7  : 1003e61c8864680b583eb
[   31.915749] f8  : 1003e73ad788c017bed70 f9  : 1003e0000000000015ab9
[   31.915749] f10 : 1003e000000000000b76a f11 : 1003e0000000000000000
[   31.915749] r1  : a00000020037b480 r2  : 0000000000000000 r3  : 00000000000000d0
[   31.915749] r8  : e000000107d85100 r9  : 0000000000000000 r10 : 0000000000000000
[   31.915749] r11 : 0000000000000000 r12 : e000000100507d40 r13 : e000000100500000
[   31.915749] r14 : e000000100ce9e00 r15 : 0000000000000000 r16 : 0000000000000010
[   31.915749] r17 : 0000000000040000 r18 : 8080808080808080 r19 : e00000010012cb74
[   31.915749] r20 : 000000000000012c r21 : 73ad788c017bed70 r22 : 0000040000000000
[   31.915749] r23 : e000000106bd4c10 r24 : 0000000000010000 r25 : 000000000000ffff
[   31.915749] r26 : 0000000000000400 r27 : e00000010786b018 r28 : e000000107d85148
[   31.915749] r29 : e000000107d852f0 r30 : 0000000400000000 r31 : e000000107d85314
[   31.915749]
                Call Trace:
[   31.915749]  [<a000000100013170>] show_stack.part.0+0x30/0x50
                                                sp=e000000100507990 bsp=e000000100501430
[   31.915749]  [<a000000100013720>] show_stack+0x30/0xa0
                                                sp=e000000100507990 bsp=e000000100501400
[   31.915749]  [<a000000100014110>] show_regs+0x980/0x990
                                                sp=e000000100507b60 bsp=e0000001005013a8
[   31.915749]  [<a000000100022340>] die+0x180/0x2e0
                                                sp=e000000100507b60 bsp=e000000100501360
[   31.915749]  [<a000000100045a90>] ia64_do_page_fault+0x850/0xa20
                                                sp=e000000100507b60 bsp=e0000001005012d8
[   31.915749]  [<a00000010000c4c0>] ia64_leave_kernel+0x0/0x270
                                                sp=e000000100507b70 bsp=e0000001005012d8
[   31.915749]  [<a00000020036ba20>] mthca_poll_cq+0xc40/0x1620 [ib_mthca]
                                                sp=e000000100507d40 bsp=e0000001005011c8
[   31.915749]  [<a000000100ad0f30>] __ib_process_cq+0xc0/0x210
                                                sp=e000000100507e30 bsp=e000000100501150
[   31.915749]  [<a000000100ad1430>] ib_cq_poll_work+0x40/0x100
                                                sp=e000000100507e30 bsp=e000000100501120
[   31.915749]  [<a000000100081820>] process_one_work+0x3b0/0x4c0
                                                sp=e000000100507e30 bsp=e0000001005010a0
[   31.915749]  [<a000000100081f30>] worker_thread+0x580/0x670
                                                sp=e000000100507e30 bsp=e000000100501008
[   31.915749]  [<a000000100090580>] kthread+0x1d0/0x1f0
                                                sp=e000000100507e30 bsp=e000000100500fb8
[   31.915749]  [<a00000010000c2b0>] call_payload+0x50/0x80
                                                sp=e000000100507e30 bsp=e000000100500fa0
[   31.915749] Disabling lock debugging due to kernel taint
[   31.915749] Disabling lock debugging due to kernel taint


* Re: Infiniband crash
@ 2024-12-16 18:05 tpearson
  2024-12-17  8:10 ` Thomas Bogendoerfer
  0 siblings, 1 reply; 9+ messages in thread
From: tpearson @ 2024-12-16 18:05 UTC (permalink / raw)
  To: rug; +Cc: linux-rdma

Did you ever find a solution for this?  We're running into the same problem on a highly customized aarch64 system (NXP QorIQ platform) with the same InfiniBand adapter and a very similar crash:

[    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on (null)
[    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[    4.558690] ib_mthca: Initializing 0000:01:00.0
[    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).
[    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your HCA FW.
[    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
[    6.399038] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[    6.407865] Mem abort info:
[    6.410662]   ESR = 0x0000000096000004
[    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
[    6.419748]   SET = 0, FnV = 0
[    6.422806]   EA = 0, S1PTW = 0
[    6.425952]   FSC = 0x04: level 0 translation fault
[    6.430842] Data abort info:
[    6.433725]   ISV = 0, ISS = 0x00000004
[    6.437569]   CM = 0, WnR = 0
[    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
[    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
[    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
[    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E) iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
[    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E      6.1.0+ #55
[    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
[    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
[    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]

Since this is apparently hitting two different architectures, I suspect the problem is in the driver rather than in the arch-specific code.  I may recommend upgrading the card to work around this, but given the rarity of the hardware it's not something I want to recommend tinkering with, and the system may or may not even accept a newer card in the first place.

Thoughts?


end of thread, other threads:[~2024-12-17 19:42 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2022-10-14 18:16 Infiniband crash rug
2022-10-14 19:21 ` Jason Gunthorpe
2022-10-17 10:13   ` Christoph Lameter
2022-10-17 11:24     ` Rudolf Gabler
  -- strict thread matches above, loose matches on Subject: below --
2024-12-16 18:05 tpearson
2024-12-17  8:10 ` Thomas Bogendoerfer
2024-12-17 19:42   ` Timothy Pearson
2024-12-16 18:06 tpearson
     [not found] ` <420F7218-5193-44B3-AD7F-ACED38C206AE@usm.lmu.de>
2024-12-16 20:10   ` Timothy Pearson
