Linux RDMA and InfiniBand development
* Infiniband crash
@ 2022-10-14 18:16 rug
  2022-10-14 19:21 ` Jason Gunthorpe
  0 siblings, 1 reply; 9+ messages in thread
From: rug @ 2022-10-14 18:16 UTC (permalink / raw)
  To: linux-rdma

Hi to whom it may concern,

We are seeing the following Mellanox InfiniBand problem (see below) on 6.0.0,
and also on 5.10 and later.
Can you please help? This is on a running ia64 cluster.

Regards,

Rudi Gabler

[   31.915749] Unable to handle kernel NULL pointer dereference (address 0000000000000010)
[   31.915749] kworker/u17:0[44]: Oops 11012296146944 [1]
[   31.915749] Modules linked in: af_packet ib_iser libiscsi scsi_transport_iscsi nf_tables nfnetlink rpcrdma sunrpc ib_ipoib tg3 libphy ib_mthca fuse configfs dm_round_robin qla2xxx firmware_class dm_mirror dm_region_hash dm_log dm_multipath efivarfs

[   31.915749] CPU: 0 PID: 44 Comm: kworker/u17:0 Not tainted 6.0.0-gentoo-ia64 #5
[   31.915749] Hardware name: hp server BL860c, BIOS 04.32 05/21/2013
[   31.915749] Workqueue: ib-comp-unb-wq ib_cq_poll_work
[   31.915749] psr : 0000121008522030 ifs : 8000000000000ca1 ip  : [<a00000020036ba21>]    Not tainted (6.0.0-gentoo-ia64)
[   31.915749] ip is at mthca_poll_cq+0xc41/0x1620 [ib_mthca]
[   31.915749] unat: 0000000000000000 pfs : 0000000000000ca1 rsc : 0000000000000003
[   31.915749] rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000015555
[   31.915749] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
[   31.915749] csd : 0000000000000000 ssd : 0000000000000000
[   31.915749] b0  : a00000020036b290 b6  : a00000020036ade0 b7  : a00000010000bce0
[   31.915749] f6  : 1003ee000000106bf1c50 f7  : 1003e61c8864680b583eb
[   31.915749] f8  : 1003e73ad788c017bed70 f9  : 1003e0000000000015ab9
[   31.915749] f10 : 1003e000000000000b76a f11 : 1003e0000000000000000
[   31.915749] r1  : a00000020037b480 r2  : 0000000000000000 r3  : 00000000000000d0
[   31.915749] r8  : e000000107d85100 r9  : 0000000000000000 r10 : 0000000000000000
[   31.915749] r11 : 0000000000000000 r12 : e000000100507d40 r13 : e000000100500000
[   31.915749] r14 : e000000100ce9e00 r15 : 0000000000000000 r16 : 0000000000000010
[   31.915749] r17 : 0000000000040000 r18 : 8080808080808080 r19 : e00000010012cb74
[   31.915749] r20 : 000000000000012c r21 : 73ad788c017bed70 r22 : 0000040000000000
[   31.915749] r23 : e000000106bd4c10 r24 : 0000000000010000 r25 : 000000000000ffff
[   31.915749] r26 : 0000000000000400 r27 : e00000010786b018 r28 : e000000107d85148
[   31.915749] r29 : e000000107d852f0 r30 : 0000000400000000 r31 : e000000107d85314
[   31.915749] Call Trace:
[   31.915749]  [<a000000100013170>] show_stack.part.0+0x30/0x50
                                sp=e000000100507990 bsp=e000000100501430
[   31.915749]  [<a000000100013720>] show_stack+0x30/0xa0
                                sp=e000000100507990 bsp=e000000100501400
[   31.915749]  [<a000000100014110>] show_regs+0x980/0x990
                                sp=e000000100507b60 bsp=e0000001005013a8
[   31.915749]  [<a000000100022340>] die+0x180/0x2e0
                                sp=e000000100507b60 bsp=e000000100501360
[   31.915749]  [<a000000100045a90>] ia64_do_page_fault+0x850/0xa20
                                sp=e000000100507b60 bsp=e0000001005012d8
[   31.915749]  [<a00000010000c4c0>] ia64_leave_kernel+0x0/0x270
                                sp=e000000100507b70 bsp=e0000001005012d8
[   31.915749]  [<a00000020036ba20>] mthca_poll_cq+0xc40/0x1620 [ib_mthca]
                                sp=e000000100507d40 bsp=e0000001005011c8
[   31.915749]  [<a000000100ad0f30>] __ib_process_cq+0xc0/0x210
                                sp=e000000100507e30 bsp=e000000100501150
[   31.915749]  [<a000000100ad1430>] ib_cq_poll_work+0x40/0x100
                                sp=e000000100507e30 bsp=e000000100501120
[   31.915749]  [<a000000100081820>] process_one_work+0x3b0/0x4c0
                                sp=e000000100507e30 bsp=e0000001005010a0
[   31.915749]  [<a000000100081f30>] worker_thread+0x580/0x670
                                sp=e000000100507e30 bsp=e000000100501008
[   31.915749]  [<a000000100090580>] kthread+0x1d0/0x1f0
                                sp=e000000100507e30 bsp=e000000100500fb8
[   31.915749]  [<a00000010000c2b0>] call_payload+0x50/0x80
                                sp=e000000100507e30 bsp=e000000100500fa0
[   31.915749] Disabling lock debugging due to kernel taint



* Re: Infiniband crash
  2022-10-14 18:16 rug
@ 2022-10-14 19:21 ` Jason Gunthorpe
  2022-10-17 10:13   ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2022-10-14 19:21 UTC (permalink / raw)
  To: rug; +Cc: linux-rdma

On Fri, Oct 14, 2022 at 06:16:51PM +0000, rug@usm.lmu.de wrote:
> Hi to whom it may concern,
> 
> We are getting on a 6.0.0 (and also on 5.10 up) the following Mellanox
> infiniband problem (see below).
> Can you please help (this is on a running ia64 cluster).

The fastest/simplest way to get help on something so obscure would be
to run a bisection search for the problematic commit.

You might be the only user left in the world of this combination :)

Jason


* Re: Infiniband crash
  2022-10-14 19:21 ` Jason Gunthorpe
@ 2022-10-17 10:13   ` Christoph Lameter
  2022-10-17 11:24     ` Rudolf Gabler
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2022-10-17 10:13 UTC (permalink / raw)
  Cc: rug, linux-rdma

On Fri, 14 Oct 2022, Jason Gunthorpe wrote:

> On Fri, Oct 14, 2022 at 06:16:51PM +0000, rug@usm.lmu.de wrote:
> > Hi to whom it may concern,
> >
> > We are getting on a 6.0.0 (and also on 5.10 up) the following Mellanox
> > infiniband problem (see below).
> > Can you please help (this is on a running ia64 cluster).
>
> The fastest/simplest way to get help on something so obscure would be
> to bisection search to the problematic commit
>
> You might be the only user left in the world of this combination :)

And CC the linux-ia64 mailing list? Gentoo on ia64.. Wow.




* Re: Infiniband crash
  2022-10-17 10:13   ` Christoph Lameter
@ 2022-10-17 11:24     ` Rudolf Gabler
  0 siblings, 0 replies; 9+ messages in thread
From: Rudolf Gabler @ 2022-10-17 11:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-rdma


It isn't that obscure: an HPE C7000 rack with 8 x BL860c Itanium servers and an InfiniBand module. Half of the servers are still running a RHEL4-like single-system-image cluster (totally outdated, but at least with 20 Gbit Mellanox HBAs for a fast cluster file system). It is ideal for most kinds of system services (mail, …) because of the intrinsic load balancing of the cluster. There the InfiniBand still works, with OFED 1.4.

The other half is my attempt to replace it with an up-to-date system (Gentoo). I have corosync and pacemaker (and crmsh) to cluster it, and InfiniBand is the only piece still missing.


Regards Rudi

Sent from my iPhone

> On 17.10.2022 at 12:13, Christoph Lameter <cl@gentwo.de> wrote:
> 
> On Fri, 14 Oct 2022, Jason Gunthorpe wrote:
> 
>>> On Fri, Oct 14, 2022 at 06:16:51PM +0000, rug@usm.lmu.de wrote:
>>> Hi to whom it may concern,
>>> 
>>> We are getting on a 6.0.0 (and also on 5.10 up) the following Mellanox
>>> infiniband problem (see below).
>>> Can you please help (this is on a running ia64 cluster).
>> 
>> The fastest/simplest way to get help on something so obscure would be
>> to bisection search to the problematic commit
>> 
>> You might be the only user left in the world of this combination :)
> 
> And CC the linux-ia64 mailing list? Gentoo on ia64.. Wow.
> 
> 



* Re: Infiniband crash
@ 2024-12-16 18:05 tpearson
  2024-12-17  8:10 ` Thomas Bogendoerfer
  0 siblings, 1 reply; 9+ messages in thread
From: tpearson @ 2024-12-16 18:05 UTC (permalink / raw)
  To: rug; +Cc: linux-rdma

Did you ever find a solution for this? We're running into the same problem on a highly customized aarch64 system (NXP QorIQ platform), with the same InfiniBand adapter and a very similar crash:

[    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on (null)
[    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[    4.558690] ib_mthca: Initializing 0000:01:00.0
[    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).
[    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your HCA FW.
[    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
[    6.399038] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[    6.407865] Mem abort info:
[    6.410662]   ESR = 0x0000000096000004
[    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
[    6.419748]   SET = 0, FnV = 0
[    6.422806]   EA = 0, S1PTW = 0
[    6.425952]   FSC = 0x04: level 0 translation fault
[    6.430842] Data abort info:
[    6.433725]   ISV = 0, ISS = 0x00000004
[    6.437569]   CM = 0, WnR = 0
[    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
[    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
[    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
[    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E) iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
[    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E      6.1.0+ #55
[    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
[    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
[    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]

Since this is apparently hitting two different architectures, I suspect the problem is in the driver, not the arch-specific code. I could recommend upgrading the card to work around this, but given the rarity of the hardware it's not something I want to recommend tinkering with, and the system may or may not even accept a newer card in the first place.

Thoughts?
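
For what it's worth, the faulting address itself points at the driver too: 0x0000000000000010 on both ia64 and aarch64 is what you get when a field at offset 0x10 is read through a NULL base pointer, and both traces show the same path (ib_cq_poll_work -> __ib_process_cq -> mthca_poll_cq). A minimal sketch of that pattern -- the struct and its fields below are made up for illustration, not taken from ib_mthca -- looks like this:

/*
 * Minimal illustration of a "NULL pointer dereference at 0x10".
 * Hypothetical layout; it only mirrors the shape of the failure,
 * not the real ib_mthca data structures.
 */
#include <stdio.h>
#include <stddef.h>

struct fake_qp {
	unsigned long qpn;   /* offset 0x00 on a 64-bit machine */
	void *send_wq;       /* offset 0x08 */
	void *recv_wq;       /* offset 0x10 -- the faulting address in both oopses */
};

/* Stand-in for the driver's qpn -> qp lookup; a bad table entry yields NULL. */
static struct fake_qp *lookup_qp(unsigned long qpn)
{
	(void)qpn;
	return NULL;
}

int main(void)
{
	struct fake_qp *qp = lookup_qp(42);

	/* With qp == NULL, "qp->recv_wq" would be a load from address 0x10. */
	printf("recv_wq sits at offset 0x%zx; qp is %p\n",
	       offsetof(struct fake_qp, recv_wq), (void *)qp);
	return 0;
}

The offset is a property of the structure layout, not of the architecture, which is why the very same address shows up on two otherwise unrelated machines.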


* Re: Infiniband crash
@ 2024-12-16 18:06 tpearson
       [not found] ` <420F7218-5193-44B3-AD7F-ACED38C206AE@usm.lmu.de>
  0 siblings, 1 reply; 9+ messages in thread
From: tpearson @ 2024-12-16 18:06 UTC (permalink / raw)
  To: rug; +Cc: linux-rdma

Did you ever find a solution for this? We're running into the same problem on a highly customized aarch64 system (NXP QorIQ platform), with the same InfiniBand adapter and a very similar crash:

[    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on (null)
[    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[    4.558690] ib_mthca: Initializing 0000:01:00.0
[    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).
[    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your HCA FW.
[    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
[    6.399038] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[    6.407865] Mem abort info:
[    6.410662]   ESR = 0x0000000096000004
[    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
[    6.419748]   SET = 0, FnV = 0
[    6.422806]   EA = 0, S1PTW = 0
[    6.425952]   FSC = 0x04: level 0 translation fault
[    6.430842] Data abort info:
[    6.433725]   ISV = 0, ISS = 0x00000004
[    6.437569]   CM = 0, WnR = 0
[    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
[    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
[    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
[    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E) iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
[    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E      6.1.0+ #55
[    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
[    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
[    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]

Since this is apparently hitting two different architectures, I suspect the problem is in the driver, not the arch-specific code. I could recommend upgrading the card to work around this, but given the rarity of the hardware it's not something I want to recommend tinkering with, and the system may or may not even accept a newer card in the first place.

Thoughts?


* Re: Infiniband crash
       [not found] ` <420F7218-5193-44B3-AD7F-ACED38C206AE@usm.lmu.de>
@ 2024-12-16 20:10   ` Timothy Pearson
  0 siblings, 0 replies; 9+ messages in thread
From: Timothy Pearson @ 2024-12-16 20:10 UTC (permalink / raw)
  To: Rudolf Gabler; +Cc: linux-rdma

Ouch.  FWIW kernel 5.4 still works, but I guess it's time to put these mthca cards out to pasture.  aarch64 isn't exactly obsolete, though it still has a ton of problems vs. amd64/ppc64el -- when even tcpdump doesn't work properly, it's hard to ever envision aarch64 as server grade. ;)

----- Original Message -----
> From: "Rudolf Gabler" <rug@usm.lmu.de>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Sent: Monday, December 16, 2024 1:56:14 PM
> Subject: Re: Infiniband crash

> Sorry, but I never found a solution, and in the meantime the ia64 is out of
> any support.
> 
> I tried a firmware upgrade, but it didn't change anything. The cards are OK,
> but driver development has changed so much that only very old kernels still
> work (I have a SLES 11 which runs without problems, apart from being totally
> outdated).
> 
> Regards
> 
> Rudi G.
> 
>> On 16.12.2024 at 19:06, tpearson@raptorengineering.com wrote:
>> 
>> Did you ever find a solution for this?  We're running into the same problem on a
>> highly customized aarch64 system (NXP QorIQ platform), same Infinband adapter
>> and very similar crash:
>> 
>> [    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on
>> (null)
>> [    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> [    4.558690] ib_mthca: Initializing 0000:01:00.0
>> [    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is
>> current).
>> [    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your
>> HCA FW.
>> [    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
>> [    6.399038] Unable to handle kernel NULL pointer dereference at virtual
>> address 0000000000000010
>> [    6.407865] Mem abort info:
>> [    6.410662]   ESR = 0x0000000096000004
>> [    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [    6.419748]   SET = 0, FnV = 0
>> [    6.422806]   EA = 0, S1PTW = 0
>> [    6.425952]   FSC = 0x04: level 0 translation fault
>> [    6.430842] Data abort info:
>> [    6.433725]   ISV = 0, ISS = 0x00000004
>> [    6.437569]   CM = 0, WnR = 0
>> [    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
>> [    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
>> [    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
>> [    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E)
>> iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
>> [    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E
>> 6.1.0+ #55
>> [    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
>> [    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
>> [    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
>> [    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]
>> 
>> Since this is apparently hitting two different architectures, I suspect the
>> problem is in the driver, not the arch-specific code.  I may recommend we
>> upgrade the card to work around this, but given the rarity of the hardware it's
>> not something I want to recommend tinkering with and it may or may not even
>> accept the new card in the first place.
>> 
>> Thoughts?


* Re: Infiniband crash
  2024-12-16 18:05 Infiniband crash tpearson
@ 2024-12-17  8:10 ` Thomas Bogendoerfer
  2024-12-17 19:42   ` Timothy Pearson
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Bogendoerfer @ 2024-12-17  8:10 UTC (permalink / raw)
  To: tpearson; +Cc: rug, linux-rdma

On Mon, 16 Dec 2024 12:05:39 -0600
tpearson@raptorengineering.com wrote:

> Did you ever find a solution for this?  We're running into the same problem on a highly customized aarch64 system (NXP QorIQ platform), same Infinband adapter and very similar crash:
> 
> [    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on (null)
> [    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> [    4.558690] ib_mthca: Initializing 0000:01:00.0
> [    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).
> [    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your HCA FW.
> [    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
> [    6.399038] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
> [    6.407865] Mem abort info:
> [    6.410662]   ESR = 0x0000000096000004
> [    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    6.419748]   SET = 0, FnV = 0
> [    6.422806]   EA = 0, S1PTW = 0
> [    6.425952]   FSC = 0x04: level 0 translation fault
> [    6.430842] Data abort info:
> [    6.433725]   ISV = 0, ISS = 0x00000004
> [    6.437569]   CM = 0, WnR = 0
> [    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
> [    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
> [    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
> [    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E) iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
> [    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E      6.1.0+ #55
> [    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
> [    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
> [    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
> [    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]
> 
> Since this is apparently hitting two different architectures, I suspect the problem is in the driver, not the arch-specific code.  I may recommend we upgrade the card to work around this, but given the rarity of the hardware it's not something I want to recommend tinkering with and it may or may not even accept the new card in the first place.

Which kernel version is this? It looks like the bug fixed by

dc52aadbc184 RDMA/mthca: Fix crash when polling CQ for shared QPs

Thomas.

-- 
SUSE Software Solutions Germany GmbH
HRB 36809 (AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew McDonald, Werner Knoblich
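
Going only by the subject line of the commit referenced above (a crash when polling the CQ for shared QPs), the failure mode it describes is a poll path that looks a QP up by number and gets back a pointer that does not point at the structure it expects, so the first field access faults. A rough, hypothetical sketch of that kind of mismatch and its fix -- simplified types, not the actual mthca code or diff -- could look like this:

/*
 * Hypothetical sketch of a "wrong pointer stored in the QP table" bug and
 * its fix; simplified types only, NOT the real ib_mthca code.
 */
#include <stdio.h>

struct sketch_qp {
	unsigned long qpn;
	void *recv_wq;
};

/* A special/shared QP wrapper that embeds the ordinary QP state. */
struct sketch_sqp {
	void *extra_state;            /* anything that sits before the embedded qp */
	struct sketch_qp qp;
};

static void *qp_table[64];

/* Buggy registration: stores the wrapper, but the poll path expects a qp. */
static void register_sqp_buggy(struct sketch_sqp *sqp, int idx)
{
	qp_table[idx] = sqp;
}

/* Fixed registration: store a pointer to the embedded qp instead. */
static void register_sqp_fixed(struct sketch_sqp *sqp, int idx)
{
	qp_table[idx] = &sqp->qp;
}

/* The poll path treats whatever is in the table as a sketch_qp. */
static void *poll_lookup(int idx)
{
	struct sketch_qp *qp = qp_table[idx];
	return qp->recv_wq;           /* reads the wrong memory if the entry is wrong */
}

int main(void)
{
	struct sketch_sqp sqp = { 0 };

	sqp.qp.qpn = 7;
	sqp.qp.recv_wq = (void *)0x1234;

	register_sqp_buggy(&sqp, 0);
	printf("buggy table entry -> recv_wq = %p (misread)\n", poll_lookup(0));

	register_sqp_fixed(&sqp, 0);
	printf("fixed table entry -> recv_wq = %p\n", poll_lookup(0));
	return 0;
}

In the toy above the buggy lookup merely returns garbage; in the kernel the equivalent misread walked into unmapped memory and oopsed inside mthca_poll_cq.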


* Re: Infiniband crash
  2024-12-17  8:10 ` Thomas Bogendoerfer
@ 2024-12-17 19:42   ` Timothy Pearson
  0 siblings, 0 replies; 9+ messages in thread
From: Timothy Pearson @ 2024-12-17 19:42 UTC (permalink / raw)
  To: Thomas Bogendoerfer; +Cc: Rudolf Gabler, linux-rdma



----- Original Message -----
> From: "Thomas Bogendoerfer" <tbogendoerfer@suse.de>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Rudolf Gabler" <rug@usm.lmu.de>, "linux-rdma" <linux-rdma@vger.kernel.org>
> Sent: Tuesday, December 17, 2024 2:10:42 AM
> Subject: Re: Infiniband crash

> On Mon, 16 Dec 2024 12:05:39 -0600
> tpearson@raptorengineering.com wrote:
> 
>> Did you ever find a solution for this?  We're running into the same problem on a
>> highly customized aarch64 system (NXP QorIQ platform), same Infinband adapter
>> and very similar crash:
>> 
>> [    4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on
>> (null)
>> [    4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> [    4.558690] ib_mthca: Initializing 0000:01:00.0
>> [    6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is
>> current).
>> [    6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your
>> HCA FW.
>> [    6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
>> [    6.399038] Unable to handle kernel NULL pointer dereference at virtual
>> address 0000000000000010
>> [    6.407865] Mem abort info:
>> [    6.410662]   ESR = 0x0000000096000004
>> [    6.414419]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [    6.419748]   SET = 0, FnV = 0
>> [    6.422806]   EA = 0, S1PTW = 0
>> [    6.425952]   FSC = 0x04: level 0 translation fault
>> [    6.430842] Data abort info:
>> [    6.433725]   ISV = 0, ISS = 0x00000004
>> [    6.437569]   CM = 0, WnR = 0
>> [    6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
>> [    6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
>> [    6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
>> [    6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E)
>> iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
>> [    6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G            E
>> 6.1.0+ #55
>> [    6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
>> [    6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
>> [    6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [    6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
>> [    6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]
>> 
>> Since this is apparently hitting two different architectures, I suspect the
>> problem is in the driver, not the arch-specific code.  I may recommend we
>> upgrade the card to work around this, but given the rarity of the hardware it's
>> not something I want to recommend tinkering with and it may or may not even
>> accept the new card in the first place.
> 
> which kernel version is this ? It looks like the bug fixed with
> 
> dc52aadbc184 RDMA/mthca: Fix crash when polling CQ for shared QPs
> 
> Thomas.

Kernel 6.1 -- this is a custom build for the rather odd aarch64 platform in use, and v6.1 was selected due to the use of Debian Bookworm.

I can confirm applying the patch referenced above resolves the crash.  Thanks for the pointer!

