Linux RDMA and InfiniBand development
* failed to allocate device WQ
@ 2024-12-20 17:10 Holger Kiehl
  2024-12-21  8:37 ` Zhu Yanjun
  2024-12-24  9:47 ` Leon Romanovsky
  0 siblings, 2 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-20 17:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel

Hello,

since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
card sometimes hits this error:

   kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
   kernel: ib0: failed to allocate device WQ
   kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
   kernel: mlx5_1: couldn't register ipoib port 1; error -12

The system has two cards:

   41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
   c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

If that happens one cannot use that card for TCP/IP communication. It does
not always happen, but when it does it always happens with the second
card mlx5_1. Never with mlx5_0. This happens on four different systems.

Any idea what I can do to stop this from happening?

Regards,
Holger

PS: Firmware for both cards is 20.41.1000
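
[Editor's note] The numeric return codes in the log above can be decoded with a few lines of Python. This is only a minimal sketch: the helper name decode_errnos and the regular expression are illustrative assumptions, and it relies on Python's errno table matching the kernel's numbering, which holds on Linux. It shows that -12 is ENOMEM, so the "ret = -12" lines are probably just how the failed workqueue allocation is reported rather than evidence of a real memory shortage.

```python
import errno
import re

# Log lines copied from the report above.
LOG = """\
kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
kernel: ib0: failed to allocate device WQ
kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
kernel: mlx5_1: couldn't register ipoib port 1; error -12
"""


def decode_errnos(text):
    """Return (name, number) pairs for every errno found in the text.

    Hypothetical helper: it recognises numeric codes written as
    "ret = -N" or "error -N" and symbolic codes written as "-ENAME".
    """
    found = []
    pattern = r"(?:ret = |error )-(\d+)|-(E[A-Z]+)\b"
    for match in re.finditer(pattern, text):
        if match.group(1):
            number = int(match.group(1))
            name = errno.errorcode.get(number, "UNKNOWN")
        else:
            name = match.group(2)
            number = getattr(errno, name, None)
        found.append((name, number))
    return found


for name, number in decode_errnos(LOG):
    print(name, number)
```

The same function can be fed the output of dmesg or journalctl -k saved to a file.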


* Re: failed to allocate device WQ
  2024-12-20 17:10 failed to allocate device WQ Holger Kiehl
@ 2024-12-21  8:37 ` Zhu Yanjun
  2024-12-21 13:16   ` Holger Kiehl
  2024-12-25  5:11   ` Joe Klein
  2024-12-24  9:47 ` Leon Romanovsky
  1 sibling, 2 replies; 6+ messages in thread
From: Zhu Yanjun @ 2024-12-21  8:37 UTC (permalink / raw)
  To: Holger Kiehl, Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel

On 2024/12/20 18:10, Holger Kiehl wrote:
> Hello,
> 
> since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> card sometimes hits this error:
> 
>     kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
>     kernel: ib0: failed to allocate device WQ
>     kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
>     kernel: mlx5_1: couldn't register ipoib port 1; error -12
> 
> The system has two cards:
> 
>     41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>     c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 
> If that happens one cannot use that card for TCP/IP communication. It does
> not always happen, but when it does it always happens with the second
> card mlx5_1. Never with mlx5_0. This happens on four different systems.
> 
> Any idea what I can do to stop this from happening?
> 
> Regards,
> Holger
> 
> PS: Firmware for both cards is 20.41.1000

It is very possible that FW is not compatible with the driver. IMO, you 
can make tests with Mellanox OFED.

If the driver is compatible with FW, this problem should disappear.

Zhu Yanjun


* Re: failed to allocate device WQ
  2024-12-21  8:37 ` Zhu Yanjun
@ 2024-12-21 13:16   ` Holger Kiehl
  2024-12-25  5:11   ` Joe Klein
  1 sibling, 0 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-21 13:16 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jason Gunthorpe, Leon Romanovsky, linux-rdma, linux-kernel


On Sat, 21 Dec 2024, Zhu Yanjun wrote:

> On 2024/12/20 18:10, Holger Kiehl wrote:
> > Hello,
> > 
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> > 
> >     kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> >     kernel: ib0: failed to allocate device WQ
> >     kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> >     kernel: mlx5_1: couldn't register ipoib port 1; error -12
> > 
> > The system has two cards:
> > 
> >     41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >     c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > 
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> > 
> > Any idea what I can do to stop this from happening?
> > 
> > Regards,
> > Holger
> > 
> > PS: Firmware for both cards is 20.41.1000
> 
> It is very possible that FW is not compatible with the driver. IMO, you can
> make tests with Mellanox OFED.
> 
Thanks, I did not know there was another driver.

I just had a look, but since I build my own kernels via 'make
binrpm-pkg' and then distribute them, this seems too big a hurdle
for me to overcome.

> If the driver is compatible with FW, this problem should disappear.
> 
Actually, the firmware had always been 20.39.1002; after the
problems appeared I upgraded to 20.41.1000, hoping that would
solve the problem.

Regards,
Holger


* Re: failed to allocate device WQ
  2024-12-20 17:10 failed to allocate device WQ Holger Kiehl
  2024-12-21  8:37 ` Zhu Yanjun
@ 2024-12-24  9:47 ` Leon Romanovsky
  2024-12-24 14:28   ` Holger Kiehl
  1 sibling, 1 reply; 6+ messages in thread
From: Leon Romanovsky @ 2024-12-24  9:47 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel

On Fri, Dec 20, 2024 at 05:10:32PM +0000, Holger Kiehl wrote:
> Hello,
> 
> since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> card sometimes hits this error:
> 
>    kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
>    kernel: ib0: failed to allocate device WQ
>    kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
>    kernel: mlx5_1: couldn't register ipoib port 1; error -12
> 
> The system has two cards:
> 
>    41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>    c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 
> If that happens one cannot use that card for TCP/IP communication. It does
> not always happen, but when it does it always happens with the second
> card mlx5_1. Never with mlx5_0. This happens on four different systems.
> 
> Any idea what I can do to stop this from happening?

It is not related to the FW but to how your system loads kernel modules.

This merged PR in rdma-core probably fixes it.
* Ensure RDMA service loads modules in initrd - https://github.com/linux-rdma/rdma-core/pull/1481

Thanks

> 
> Regards,
> Holger
> 
> PS: Firmware for both cards is 20.41.1000
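
[Editor's note] As a diagnostic aid for the module-loading explanation above, the sketch below reports which RDMA-related modules are missing from a /proc/modules listing. It is an assumption-laden illustration, not part of the rdma-core fix: the helper names and the default module list (mlx5_core, mlx5_ib, ib_ipoib) are chosen here for a ConnectX-6 IPoIB setup, and the sample text is made up. It only tells whether a module is loaded now, not whether it was loaded from the initrd or later in boot.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names in /proc/modules-style text.

    Each non-empty line starts with the module name, followed by
    size, reference count, dependants, state and address.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text,
                    wanted=("mlx5_core", "mlx5_ib", "ib_ipoib")):
    """Return the wanted modules that are not loaded, in the given order.

    The default list is an assumption for a ConnectX-6 IPoIB setup.
    """
    loaded = loaded_modules(proc_modules_text)
    return [name for name in wanted if name not in loaded]


# Made-up sample in /proc/modules format, for illustration only.
SAMPLE = """\
mlx5_ib 450560 0 - Live 0x0000000000000000
mlx5_core 2150400 1 mlx5_ib, Live 0x0000000000000000
"""

print(missing_modules(SAMPLE))
```

On a live system the input would be the contents of /proc/modules, for example open("/proc/modules").read().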


* Re: failed to allocate device WQ
  2024-12-24  9:47 ` Leon Romanovsky
@ 2024-12-24 14:28   ` Holger Kiehl
  0 siblings, 0 replies; 6+ messages in thread
From: Holger Kiehl @ 2024-12-24 14:28 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel

On Tue, 24 Dec 2024, Leon Romanovsky wrote:

> On Fri, Dec 20, 2024 at 05:10:32PM +0000, Holger Kiehl wrote:
> > Hello,
> > 
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> > 
> >    kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> >    kernel: ib0: failed to allocate device WQ
> >    kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> >    kernel: mlx5_1: couldn't register ipoib port 1; error -12
> > 
> > The system has two cards:
> > 
> >    41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >    c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > 
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> > 
> > Any idea what I can do to stop this from happening?
> 
> It is not related to the FW but to how your system loads kernel modules.
> 
> This merged PR in rdma-core probably fixes it.
> * Ensure RDMA service loads modules in initrd - https://github.com/linux-rdma/rdma-core/pull/1481
> 
Yes, after applying those changes I could no longer reproduce the
problem. After a reboot both cards are now always detected.

Thank you very much for the hint!

Regards,
Holger


* Re: failed to allocate device WQ
  2024-12-21  8:37 ` Zhu Yanjun
  2024-12-21 13:16   ` Holger Kiehl
@ 2024-12-25  5:11   ` Joe Klein
  1 sibling, 0 replies; 6+ messages in thread
From: Joe Klein @ 2024-12-25  5:11 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Holger Kiehl, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
	linux-kernel

On Sat, Dec 21, 2024 at 9:38 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 2024/12/20 18:10, Holger Kiehl wrote:
> > Hello,
> >
> > since upgrading from kernel 6.10 to 6.11 (also 6.12) one Infiniband
> > card sometimes hits this error:
> >
> >     kernel: workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
> >     kernel: ib0: failed to allocate device WQ
> >     kernel: mlx5_1: failed to initialize device: ib0 port 1 (ret = -12)
> >     kernel: mlx5_1: couldn't register ipoib port 1; error -12
> >
> > The system has two cards:
> >
> >     41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >     c4:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> >
> > If that happens one cannot use that card for TCP/IP communication. It does
> > not always happen, but when it does it always happens with the second
> > card mlx5_1. Never with mlx5_0. This happens on four different systems.
> >
> > Any idea what I can do to stop this from happening?
> >
> > Regards,
> > Holger
> >
> > PS: Firmware for both cards is 20.41.1000
>
> It is very possible that FW is not compatible with the driver. IMO, you
> can make tests with Mellanox OFED.
>
> If the driver is compatible with FW, this problem should disappear.

Thanks, Zhu. We had a similar problem, and your solution fixed it.
We are in the same boat and appreciate your help.

>
> Zhu Yanjun
>


end of thread

Thread overview: 6+ messages
2024-12-20 17:10 failed to allocate device WQ Holger Kiehl
2024-12-21  8:37 ` Zhu Yanjun
2024-12-21 13:16   ` Holger Kiehl
2024-12-25  5:11   ` Joe Klein
2024-12-24  9:47 ` Leon Romanovsky
2024-12-24 14:28   ` Holger Kiehl
