* failed to allocate device WQ
@ 2024-12-20 17:10 Holger Kiehl
2024-12-21 8:37 ` Zhu Yanjun
2024-12-24 9:47 ` Leon Romanovsky
0 siblings, 2 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-20 17:10 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel
Hello,
since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
card sometimes hits this error:
kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
kernel: ib0: failed to allocate device WQ
kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
kernel: mlx5_1: couldn't register ipoib port 1; error -12
The system has two cards:
41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
If that happens one cannot use that card for TCP/IP communication. It does
not always happen, but when it does it always happens with the second
card mlx5_1. Never with mlx5_0. This happens on four different systems.
Any idea what I can do to stop this from happening?
Regards,
Holger
PS: Firmware for both cards is 20.41.1000
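For reference, the numeric codes in the log lines of this message can be mapped to their symbolic errno names. The following is a minimal sketch, not part of the original report: the helper name `describe` is hypothetical, and it assumes the kernel's negative return values correspond to the standard Linux errno numbers exposed by Python's `errno` module.

```python
import errno
import os


def describe(code: int) -> str:
    """Return 'NAME (message)' for a kernel-style negative errno value."""
    number = abs(code)
    # errno.errorcode maps errno numbers to their symbolic names.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


# "ret = -12" and "error -12" from the mlx5_1 lines in the log
print(describe(-12))
# "-EINTR" from the workqueue rescuer line
print(describe(-errno.EINTR))
```

On Linux, -12 resolves to ENOMEM, which matches the "failed to allocate device WQ" message: the allocation is reported as failed because creating the rescuer kthread was interrupted.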
* Re: failed to allocate device WQ
2024-12-20 17:10 failed to allocate device WQ Holger Kiehl
@ 2024-12-21 8:37 ` Zhu Yanjun
2024-12-21 13:16 ` Holger Kiehl
2024-12-25 5:11 ` Joe Klein
2024-12-24 9:47 ` Leon Romanovsky
1 sibling, 2 replies; 6+ messages in thread
From: Zhu Yanjun @ 2024-12-21 8:37 UTC (permalink / raw)
To: Holger Kiehl, Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel
On 2024/12/20 18:10, Holger Kiehl wrote:
> Hello,
>
> since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> card sometimes hits this error:
>
> kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> kernel: ib0: failed to allocate device WQ
> kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> kernel: mlx5_1: couldn't register ipoib port 1; error -12
>
> The system has two cards:
>
> 41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>
> If that happens one cannot use that card for TCP/IP communication. It does
> not always happen, but when it does it always happens with the second
> card mlx5_1. Never with mlx5_0. This happens on four different systems.
>
> Any idea what I can do to stop this from happening?
>
> Regards,
> Holger
>
> PS: Firmware for both cards is 20.41.1000
It is very possible that FW is not compatible with the driver. IMO, you
can make tests with Mellanox OFED.
If the driver is compatible with FW, this problem should disappear.
Zhu Yanjun
* Re: failed to allocate device WQ
2024-12-21 8:37 ` Zhu Yanjun
@ 2024-12-21 13:16 ` Holger Kiehl
2024-12-25 5:11 ` Joe Klein
1 sibling, 0 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-21 13:16 UTC (permalink / raw)
To: Zhu Yanjun; +Cc: Jason Gunthorpe, Leon Romanovsky, linux-rdma, linux-kernel
On Sat, 21 Dec 2024, Zhu Yanjun wrote:
> On 2024/12/20 18:10, Holger Kiehl wrote:
> > Hello,
> >
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> >
> > kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> > kernel: ib0: failed to allocate device WQ
> > kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> > kernel: mlx5_1: couldn't register ipoib port 1; error -12
> >
> > The system has two cards:
> >
> > 41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> >
> > Any idea what I can do to stop this from happening?
> >
> > Regards,
> > Holger
> >
> > PS: Firmware for both cards is 20.41.1000
>
> It is very possible that FW is not compatible with the driver. IMO, you can
> make tests with Mellanox OFED.
>
Thanks, I did not know there was another driver.
I just had a look, but since I build my own kernels via 'make
binrpm-pkg' and then distribute them, this seems too big a hurdle for
me to overcome.
> If the driver is compatible with FW, this problem should disappear.
>
Actually, the firmware had always been 20.39.1002; after the
problems appeared I upgraded to 20.41.1000, hoping it would
solve the problem.
Regards,
Holger
* Re: failed to allocate device WQ
2024-12-20 17:10 failed to allocate device WQ Holger Kiehl
2024-12-21 8:37 ` Zhu Yanjun
@ 2024-12-24 9:47 ` Leon Romanovsky
2024-12-24 14:28 ` Holger Kiehl
1 sibling, 1 reply; 6+ messages in thread
From: Leon Romanovsky @ 2024-12-24 9:47 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel
On Fri, Dec 20, 2024 at 05:10:32PM +0000, Holger Kiehl wrote:
> Hello,
>
> since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> card sometimes hits this error:
>
> kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> kernel: ib0: failed to allocate device WQ
> kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> kernel: mlx5_1: couldn't register ipoib port 1; error -12
>
> The system has two cards:
>
> 41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>
> If that happens one cannot use that card for TCP/IP communication. It does
> not always happen, but when it does it always happens with the second
> card mlx5_1. Never with mlx5_0. This happens on four different systems.
>
> Any idea what I can do to stop this from happening?
It is not related to the FW but to how your system loads kernel modules.
This merged PR in rdma-core probably fixes it.
* Ensure RDMA service loads modules in initrd - https://github.com/linux-rdma/rdma-core/pull/1481
Thanks
>
> Regards,
> Holger
>
> PS: Firmware for both cards is 20.41.1000
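The reply above attributes the failure to how the system loads kernel modules at boot. The following is a minimal sketch, not taken from the rdma-core PR: it assumes Python 3.9+, and it assumes that the modules of interest are `mlx5_core`, `mlx5_ib`, `ib_core` and `ib_ipoib` (the thread itself only shows the `mlx5_1` and `ipoib` names in the log). It lists which of those are not currently loaded by parsing `/proc/modules`. It does not inspect the initrd, and modules built into the kernel never appear in `/proc/modules`, so treat it as a rough post-boot check rather than a diagnosis.

```python
from pathlib import Path

# Assumption: module names relevant to a ConnectX-6 IPoIB setup.
WANTED = ("mlx5_core", "mlx5_ib", "ib_core", "ib_ipoib")


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Extract module names (the first column) from /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text: str, wanted=WANTED) -> list[str]:
    """Return the wanted modules that are absent, in the order given."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in wanted if name not in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        print("not loaded:", missing_modules(path.read_text()))
```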
* Re: failed to allocate device WQ
2024-12-24 9:47 ` Leon Romanovsky
@ 2024-12-24 14:28 ` Holger Kiehl
0 siblings, 0 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-24 14:28 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel
On Tue, 24 Dec 2024, Leon Romanovsky wrote:
> On Fri, Dec 20, 2024 at 05:10:32PM +0000, Holger Kiehl wrote:
> > Hello,
> >
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> >
> > kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> > kernel: ib0: failed to allocate device WQ
> > kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> > kernel: mlx5_1: couldn't register ipoib port 1; error -12
> >
> > The system has two cards:
> >
> > 41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> >
> > Any idea what I can do to stop this from happening?
>
> It is not related to the FW but to how your system loads kernel modules.
>
> This merged PR in rdma-core probably fixes it.
> * Ensure RDMA service loads modules in initrd - https://github.com/linux-rdma/rdma-core/pull/1481
>
Yes, after applying those changes I could no longer reproduce the problem. After
reboot both cards are now always detected.
Thank you very much for the hint!
Regards,
Holger
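To confirm after a reboot that the failure did not recur, the kernel log can be scanned for the signature quoted in this thread. The following is a minimal sketch under stated assumptions: the function name `find_ipoib_failures` is hypothetical, the pattern is derived only from the single "couldn't register ipoib port" line shown above, and the log text is assumed to have been captured beforehand (for example from `dmesg` output).

```python
import re

# Pattern derived from the log line quoted in this thread:
#   "mlx5_1: couldn't register ipoib port 1; error -12"
FAILURE = re.compile(
    r"(?P<dev>mlx5_\d+): couldn't register ipoib port (?P<port>\d+); "
    r"error (?P<err>-?\d+)"
)


def find_ipoib_failures(log_text: str) -> list[tuple[str, int, int]]:
    """Return (device, port, error) for each IPoIB registration failure."""
    return [
        (m["dev"], int(m["port"]), int(m["err"]))
        for m in FAILURE.finditer(log_text)
    ]


# Sample input copied from the report at the top of the thread.
SAMPLE = """\
kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
kernel: ib0: failed to allocate device WQ
kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
kernel: mlx5_1: couldn't register ipoib port 1; error -12
"""

for dev, port, err in find_ipoib_failures(SAMPLE):
    print(f"{dev} port {port} failed to register IPoIB (error {err})")
```

An empty result on a log from the current boot would be consistent with both cards having registered, though it only rules out this one message.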
* Re: failed to allocate device WQ
2024-12-21 8:37 ` Zhu Yanjun
2024-12-21 13:16 ` Holger Kiehl
@ 2024-12-25 5:11 ` Joe Klein
1 sibling, 0 replies; 6+ messages in thread
From: Joe Klein @ 2024-12-25 5:11 UTC (permalink / raw)
To: Zhu Yanjun
Cc: Holger Kiehl, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
linux-kernel
On Sat, Dec 21, 2024 at 9:38 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 2024/12/20 18:10, Holger Kiehl wrote:
> > Hello,
> >
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> >
> > kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> > kernel: ib0: failed to allocate device WQ
> > kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> > kernel: mlx5_1: couldn't register ipoib port 1; error -12
> >
> > The system has two cards:
> >
> > 41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> >
> > Any idea what I can do to stop this from happening?
> >
> > Regards,
> > Holger
> >
> > PS: Firmware for both cards is 20.41.1000
>
> It is very possible that FW is not compatible with the driver. IMO, you
> can make tests with Mellanox OFED.
>
> If the driver is compatible with FW, this problem should disappear.
Thanks, Zhu. We had a similar problem, and it was fixed by your solution.
We were in the same boat. We appreciate your help.
>
> Zhu Yanjun
>