From: Timothy Pearson <tpearson@raptorengineering.com>
To: Rudolf Gabler <rug@usm.lmu.de>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Subject: Re: Infiniband crash
Date: Mon, 16 Dec 2024 14:10:14 -0600 (CST)
Message-ID: <1960002764.45894075.1734379814889.JavaMail.zimbra@raptorengineeringinc.com>
In-Reply-To: <420F7218-5193-44B3-AD7F-ACED38C206AE@usm.lmu.de>
Ouch. FWIW kernel 5.4 still works, but I guess it's time to put these mthca cards out to pasture. aarch64 isn't exactly obsolete, though it still has a ton of problems vs. amd64/ppc64el -- when even tcpdump doesn't work properly, it's hard to ever envision aarch64 as server grade. ;)
----- Original Message -----
> From: "Rudolf Gabler" <rug@usm.lmu.de>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Sent: Monday, December 16, 2024 1:56:14 PM
> Subject: Re: Infiniband crash
> Sorry, but I never found a solution, and in the meantime ia64 has lost all
> support.
>
> I tried a firmware upgrade, but it didn't change anything. The cards are
> OK, but driver development has changed so much that only very old kernels
> work (I have a SLES 11 system that runs without problems, apart from being
> totally outdated).
>
> Regards
>
> Rudi G.
>
>> On 16.12.2024 at 19:06, tpearson@raptorengineering.com wrote:
>>
>> Did you ever find a solution for this? We're running into the same problem on a
>> highly customized aarch64 system (NXP QorIQ platform), same InfiniBand adapter
>> and very similar crash:
>>
>> [ 4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on
>> (null)
>> [ 4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> [ 4.558690] ib_mthca: Initializing 0000:01:00.0
>> [ 6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is
>> current).
>> [ 6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your
>> HCA FW.
>> [ 6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
>> [ 6.399038] Unable to handle kernel NULL pointer dereference at virtual
>> address 0000000000000010
>> [ 6.407865] Mem abort info:
>> [ 6.410662] ESR = 0x0000000096000004
>> [ 6.414419] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 6.419748] SET = 0, FnV = 0
>> [ 6.422806] EA = 0, S1PTW = 0
>> [ 6.425952] FSC = 0x04: level 0 translation fault
>> [ 6.430842] Data abort info:
>> [ 6.433725] ISV = 0, ISS = 0x00000004
>> [ 6.437569] CM = 0, WnR = 0
>> [ 6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
>> [ 6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
>> [ 6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
>> [ 6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E)
>> iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
>> [ 6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G E
>> 6.1.0+ #55
>> [ 6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
>> [ 6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
>> [ 6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
>> [ 6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]
>>
>> Since this is apparently hitting two different architectures, I suspect the
>> problem is in the driver, not the arch-specific code. I may recommend we
>> upgrade the card to work around this, but given the rarity of the hardware it's
>> not something I want to recommend tinkering with and it may or may not even
>> accept the new card in the first place.
>>
>> Thoughts?
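
For reference, the "Mem abort info" lines in the oops quoted above can be reproduced by hand from the raw ESR value. The following is a minimal sketch, assuming the ARMv8-A ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and the data-abort ISS field positions; the helper name decode_esr is hypothetical and is not part of any kernel tool.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields printed for a data abort.

    Bit positions are taken from the ARMv8-A data-abort ISS encoding; this is
    an illustrative helper, not kernel code.
    """
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,       # 1 = trapped instruction was 32 bits wide
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,      # instruction syndrome valid
        "SET": (iss >> 11) & 0x3,      # synchronous error type
        "FnV": (iss >> 10) & 0x1,      # FAR not valid
        "EA": (iss >> 9) & 0x1,        # external abort type
        "CM": (iss >> 8) & 0x1,        # cache maintenance operation
        "S1PTW": (iss >> 7) & 0x1,     # stage-2 fault on a stage-1 table walk
        "WnR": (iss >> 6) & 0x1,       # 1 = write, 0 = read
        "FSC": iss & 0x3F,             # fault status code, bits [5:0]
    }


# ESR value copied from the oops in the quoted message.
fields = decode_esr(0x96000004)
for name, value in fields.items():
    print(f"{name:>5} = {value:#x}")
```

For 0x96000004 this gives EC = 0x25, IL = 1, ISS = 0x4, WnR = 0 and FSC = 0x04, which agrees with the log: a data abort taken at the current exception level, caused by a read that hit a level 0 translation fault. Together with the faulting address 0x10, that is what one would expect from reading a field at a small offset from a NULL pointer, though which pointer in mthca_poll_cq is NULL cannot be determined from the log alone.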