* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
       [not found]       ` <20241118200616.865cb4c869e693b19529df36@paranoici.org>
@ 2024-11-21 10:04         ` Uwe Kleine-König
  2024-11-25 18:54           ` Francesco Poli
  0 siblings, 1 reply; 12+ messages in thread
From: Uwe Kleine-König @ 2024-11-21 10:04 UTC
  To: Francesco Poli, 1086520; +Cc: Mark Zhang, Leon Romanovsky, linux-rdma, netdev

[-- Attachment #1: Type: text/plain, Size: 3191 bytes --]

Hello Francesco,

[for the newcomers: This is about a regression in 6.11. Details are
available at https://bugs.debian.org/1086520. The TL;DR is that opensm
works as expected on 6.10.11, while it fails to start on 6.11.7.]

On Mon, Nov 18, 2024 at 08:06:16PM +0100, Francesco Poli wrote:
> On Mon, 18 Nov 2024 09:58:03 +0100 Uwe Kleine-König wrote:
> 
> [...]
> > On Wed, Nov 13, 2024 at 11:15:03PM +0100, Francesco Poli wrote:
> > > On Mon, 11 Nov 2024 11:22:26 +0100 Uwe Kleine-König wrote:
> [...]
> > > > I guess the kernel provides a directory "/sys/class/infiniband_mad". Do
> > > > its contents look different on 6.10.x and 6.11.x?
> > > 
> > > I will look into this as soon as I can reboot the cluster head node.
> 
> I looked into this, while testing the new Debian Linux kernel that has
> just migrated to testing (which, once again, makes opensm fail to
> start, just like other 6.11.x versions).
> 
> With a working kernel:
> 
>   $ uname -v
>   #1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1 (2024-09-22)
>   $ ls -altrF /sys/class/infiniband_mad/
>   total 0
>   lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
>   lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
>   lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
>   lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
>   drwxr-xr-x  2 root root    0 Nov 11 15:54 ./
>   drwxr-xr-x 72 root root    0 Nov 11 15:54 ../
>   -r--r--r--  1 root root 4096 Nov 11 15:54 abi_version
>   $ cat /sys/class/infiniband_mad/abi_version 
>   5
> 
> With a kernel that makes opensm fail to start:
> 
>   $ uname -v
>   #1 SMP PREEMPT_DYNAMIC Debian 6.11.7-1 (2024-11-09)
>   $ ls -altrF /sys/class/infiniband_mad/
>   total 0
>   drwxr-xr-x 73 root root    0 Nov 18 09:41 ../
>   -r--r--r--  1 root root 4096 Nov 18 09:41 abi_version
>   lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
>   lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
>   drwxr-xr-x  2 root root    0 Nov 18 09:43 ./
>   $ cat /sys/class/infiniband_mad/abi_version
>   5
> 
> As you can see, a couple of files (symlinks) are missing here...
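
The difference between the two listings above can also be extracted mechanically. What follows is only a minimal sketch, assuming each listing was saved to a file with one entry name per line; the function name entries_lost and the file names are hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Print the entry names that appear in the first listing but not in the second.
# Each argument is a file with one directory entry name per line, for example
# saved with: ls /sys/class/infiniband_mad > mad-good.txt
entries_lost() {
    good_sorted=$(mktemp) || return 1
    bad_sorted=$(mktemp) || { rm -f "$good_sorted"; return 1; }
    sort "$1" > "$good_sorted"
    sort "$2" > "$bad_sorted"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    comm -23 "$good_sorted" "$bad_sorted"
    rm -f "$good_sorted" "$bad_sorted"
}

# Demo with the entry names from the two listings quoted above.
demo=$(mktemp -d)
printf '%s\n' abi_version issm0 issm1 umad0 umad1 > "$demo/good"
printf '%s\n' abi_version umad0 umad1 > "$demo/bad"
entries_lost "$demo/good" "$demo/bad"   # prints issm0 and issm1
rm -rf "$demo"
```

With the names from the quoted listings, only the two issm entries are reported as missing on the failing kernel.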

It looks like the commit that is biting you is

https://git.kernel.org/linus/50660c5197f52b8137e223dc3ba8d43661179a1d

So if you bisect, try 50660c5197f52b8137e223dc3ba8d43661179a1d and its
parent 24943dcdc156cf294d97a36bf5c51168bf574c22 first.

I don't know about infiniband, but I'd say: Either your machine doesn't
have these issmX devices and opensm should cope with that, or these
issmX devices are available, in which case
50660c5197f52b8137e223dc3ba8d43661179a1d is buggy.

> Does this ring a bell?

It doesn't for me, but maybe Mark Zhang or someone else among the new
recipients has an idea?

Best regards
Uwe



* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-21 10:04         ` Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start Uwe Kleine-König
@ 2024-11-25 18:54           ` Francesco Poli
  2024-11-25 19:38             ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Francesco Poli @ 2024-11-25 18:54 UTC
  To: Uwe Kleine-König
  Cc: 1086520, Mark Zhang, Leon Romanovsky, linux-rdma, netdev

[-- Attachment #1: Type: text/plain, Size: 4025 bytes --]

On Thu, 21 Nov 2024 11:04:13 +0100 Uwe Kleine-König wrote:

[...]
> It looks like the commit that is biting you is
> 
> https://git.kernel.org/linus/50660c5197f52b8137e223dc3ba8d43661179a1d
> 
> So if you bisect, try 50660c5197f52b8137e223dc3ba8d43661179a1d and its
> parent 24943dcdc156cf294d97a36bf5c51168bf574c22 first.

I started to bisect.

The first surprise is that 50660c5197f52b8137e223dc3ba8d43661179a1d is
good...   :-o

  $ git checkout 50660c5197f52b8137e223dc3ba8d43661179a1d
  $ make -j 12 my_defconfig bindeb-pkg

  [install and reboot with this kernel version]

  # ls /sys/class/infiniband_mad/ -altrF
  total 0
  drwxr-xr-x 70 root root    0 Nov 25 12:05 ../
  -r--r--r--  1 root root 4096 Nov 25 12:05 abi_version
  lrwxrwxrwx  1 root root    0 Nov 25 12:05 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
  lrwxrwxrwx  1 root root    0 Nov 25 12:05 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 25 12:08 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 25 12:08 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 25 12:08 ./

  [InfiniBand works]

  $ git bisect start
  $ git bisect good
  $ git checkout v6.11
  $ make -j 12 my_defconfig bindeb-pkg

  [install and reboot with this kernel version]

  # ls /sys/class/infiniband_mad/ -altrF
  total 0
  drwxr-xr-x 70 root root    0 Nov 25 12:29 ../
  -r--r--r--  1 root root 4096 Nov 25 12:29 abi_version
  lrwxrwxrwx  1 root root    0 Nov 25 12:29 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
  lrwxrwxrwx  1 root root    0 Nov 25 12:29 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 25 12:30 ./

  [InfiniBand fails, because OpenSM fails to start]

  $ git bisect bad
  Bisecting: 7036 revisions left to test after this (roughly 13 steps)
  [b3ce7a30847a54a7f96a35e609303d8afecd460b] Merge tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel
  $ make -j 12 my_defconfig bindeb-pkg


Woooha, 13 steps are a lot...

I went on until 10 steps were left:

  [test b3ce7a30847a54a7f96a35e609303d8afecd460b]
  $ git bisect good
  Bisecting: 3385 revisions left to test after this (roughly 12 steps)
  [fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c] Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
  
  [test fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c]
  $ git bisect bad
  Bisecting: 1763 revisions left to test after this (roughly 11 steps)
  [09ea8089abb5d851ce08a9b1a43706e42ef39db2] Merge tag 'staging-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

  [test 09ea8089abb5d851ce08a9b1a43706e42ef39db2]
  $ git bisect bad
  Bisecting: 910 revisions left to test after this (roughly 10 steps)
  [4305ca0087dd99c3c3e0e2ac8a228b7e53a21c78] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi


Since I could not afford to keep the cluster out of service any longer
(each step takes at least 20 or 25 minutes: build + install + reboot +
check InfiniBand), I decided to return the cluster to service.
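
The cost of a full bisect can be estimated with a little arithmetic. The sketch below is only an approximation: it counts integer halvings, which is not necessarily how git computes its own estimate, although it reproduces the step counts git printed above (13, 12, 11 and 10). The function name bisect_steps is hypothetical.

```shell
#!/bin/sh
# Rough estimate of the remaining bisect steps: count how many times the
# number of remaining revisions can be halved before it reaches zero.
bisect_steps() {
    remaining=$1
    count=0
    while [ "$remaining" -gt 0 ]; do
        remaining=$((remaining / 2))
        count=$((count + 1))
    done
    echo "$count"
}

steps=$(bisect_steps 7036)
echo "steps: $steps"
# Each step takes 20 to 25 minutes (build + install + reboot + check).
echo "minutes: $((steps * 20)) to $((steps * 25))"
```

For the 7036 revisions reported at the start, this prints 13 steps and 260 to 325 minutes, i.e. more than four hours with the head node out of service.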

I will try to continue to bisect by testing the resulting kernels on a
compute node: there's no OpenSM there, and it cannot run anyway if
there's another OpenSM on the same InfiniBand network.
However, I can check whether those issm* symlinks are created in
/sys/class/infiniband_mad/.
I really hope that this is enough to pinpoint the first bad
commit...
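
The symlink check on a compute node could be scripted roughly as follows. This is a sketch under the assumption that the presence of at least one issm* entry is a sufficient good/bad criterion; the function name issm_verdict is hypothetical, and the resulting git bisect command still has to be run by hand in the kernel tree after each reboot.

```shell
#!/bin/sh
# Print "good" when at least one issm* entry exists in the directory,
# "bad" otherwise. The directory defaults to the sysfs path used above.
issm_verdict() {
    dir=${1:-/sys/class/infiniband_mad}
    for entry in "$dir"/issm*; do
        # An unexpanded glob pattern fails both tests and falls through.
        if [ -e "$entry" ] || [ -L "$entry" ]; then
            echo good
            return 0
        fi
    done
    echo bad
    return 1
}

# Suggest the matching bisect command for the currently booted kernel.
echo "git bisect $(issm_verdict)"
```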

Any better ideas?


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-25 18:54           ` Francesco Poli
@ 2024-11-25 19:38             ` Leon Romanovsky
  2024-11-26  1:21               ` Mark Zhang
  2024-11-27 17:48               ` Francesco Poli
  0 siblings, 2 replies; 12+ messages in thread
From: Leon Romanovsky @ 2024-11-25 19:38 UTC
  To: Francesco Poli
  Cc: Uwe Kleine-König, 1086520, Mark Zhang, linux-rdma, netdev

On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> On Thu, 21 Nov 2024 11:04:13 +0100 Uwe Kleine-König wrote:
> 
> [...]
> > It looks like the commit that is biting you is
> > 
> > https://git.kernel.org/linus/50660c5197f52b8137e223dc3ba8d43661179a1d
> > 
> > So if you bisect, try 50660c5197f52b8137e223dc3ba8d43661179a1d and its
> > parent 24943dcdc156cf294d97a36bf5c51168bf574c22 first.
> 
> I started to bisect.
> 
> The first surprise is that 50660c5197f52b8137e223dc3ba8d43661179a1d is
> good...   :-o

That is good news, as I have been looking at it ever since Uwe
reported it.

> 

<...>

> I will try to continue to bisect by testing the resulting kernels on a
> compute node: there's no OpenSM there and it cannot run anyway, if
> there's another OpenSM on the same InfiniBand network.
> However, I can check whether those issm* symlinks are created in
> /sys/class/infiniband_mad/ 
> I really hope that this is enough to pinpoint the first bad
> commit...

Yes, these symlinks should be there. Your test scenario is the correct one.

> 
> Any better ideas?

I think that commit 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
is the one causing the trouble, which leads me to suspect the FW.

Thanks

> 
> 
> -- 
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE




* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-25 19:38             ` Leon Romanovsky
@ 2024-11-26  1:21               ` Mark Zhang
  2024-11-26  7:18                 ` Francesco Poli
  2024-11-27 17:48               ` Francesco Poli
  1 sibling, 1 reply; 12+ messages in thread
From: Mark Zhang @ 2024-11-26  1:21 UTC
  To: Leon Romanovsky, Francesco Poli
  Cc: Uwe Kleine-König, 1086520, linux-rdma, netdev


On 11/26/2024 3:38 AM, Leon Romanovsky wrote:
> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
>> On Thu, 21 Nov 2024 11:04:13 +0100 Uwe Kleine-König wrote:
>>
>> [...]
>>> It looks like the commit that is biting you is
>>>
>>> https://git.kernel.org/linus/50660c5197f52b8137e223dc3ba8d43661179a1d
>>>
>>> So if you bisect, try 50660c5197f52b8137e223dc3ba8d43661179a1d and its
>>> parent 24943dcdc156cf294d97a36bf5c51168bf574c22 first.
>>
>> I started to bisect.
>>
>> The first surprise is that 50660c5197f52b8137e223dc3ba8d43661179a1d is
>> good...   :-o
> 
> It is good news, as I looked on it all that time from the day Uwe
> reported it.
> 
>>
> 
> <...>
> 
>> I will try to continue to bisect by testing the resulting kernels on a
>> compute node: there's no OpenSM there and it cannot run anyway, if
>> there's another OpenSM on the same InfiniBand network.
>> However, I can check whether those issm* symlinks are created in
>> /sys/class/infiniband_mad/
>> I really hope that this is enough to pinpoint the first bad
>> commit...
> 
> Yes, these symlinks should be there. Your test scenario is correct one.
> 
>>
>> Any better ideas?
> 
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
> 

Yes, it looks like the FW reports vport.num_plane > 0. What is your hw type and
FW version ("ethtool -i <netdev_of_the_ibdev>")? I don't think it
supports multiplane.



* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-26  1:21               ` Mark Zhang
@ 2024-11-26  7:18                 ` Francesco Poli
  2024-11-26  8:38                   ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Francesco Poli @ 2024-11-26  7:18 UTC
  To: Mark Zhang
  Cc: Leon Romanovsky, Uwe Kleine-König, 1086520, linux-rdma,
	netdev

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

On Tue, 26 Nov 2024 09:21:37 +0800 Mark Zhang wrote:

[...]
> Yes looks like FW reports vport.num_plane > 0. What is your hw type and 
> FW version ("ethtool -i <netdev_of_the_ibdev>")? I don't think it 
> supports multiplane.

  $ /sbin/ethtool -i ibp129s0f0
  driver: mlx5_core[ib_ipoib]
  version: 6.10.11-amd64
  firmware-version: 20.40.1000 (MT_0000000224)
  expansion-rom-version: 
  bus-info: 0000:81:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: yes

Please note that I determined <netdev_of_the_ibdev> by looking at
the output of 'ibv_devices': I hope this is a correct way to answer
your question.
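
As a cross-check, the netdev can also be looked up through sysfs, and the firmware version can be extracted from the ethtool output. The sketch below assumes the layout <sysfs>/class/infiniband/<ibdev>/device/net/<netdev>, which may not hold on every system; the function names and the device name mlx5_0 in the usage note are hypothetical.

```shell
#!/bin/sh
# List the network interfaces that share a PCI function with an RDMA device.
# Assumed layout: <sysfs>/class/infiniband/<ibdev>/device/net/<netdev>
netdevs_of_ibdev() {
    ibdev=$1
    sysfs=${2:-/sys}
    for path in "$sysfs/class/infiniband/$ibdev/device/net"/*; do
        # Skip the unexpanded pattern when the directory is missing or empty.
        [ -e "$path" ] || continue
        echo "${path##*/}"
    done
}

# Extract the firmware-version value from `ethtool -i` output read on stdin.
firmware_version() {
    sed -n 's/^firmware-version: //p'
}

# Canned input taken from the output above; prints "20.40.1000 (MT_0000000224)".
printf '%s\n' 'driver: mlx5_core[ib_ipoib]' \
    'firmware-version: 20.40.1000 (MT_0000000224)' | firmware_version
```

On a live system this would be used as "netdevs_of_ibdev mlx5_0" followed by "/sbin/ethtool -i <netdev> | firmware_version".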




-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-26  7:18                 ` Francesco Poli
@ 2024-11-26  8:38                   ` Leon Romanovsky
  2024-11-26 10:09                     ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2024-11-26  8:38 UTC
  To: Francesco Poli
  Cc: Mark Zhang, Uwe Kleine-König, 1086520, linux-rdma, netdev

On Tue, Nov 26, 2024 at 08:18:24AM +0100, Francesco Poli wrote:
> On Tue, 26 Nov 2024 09:21:37 +0800 Mark Zhang wrote:
> 
> [...]
> > Yes looks like FW reports vport.num_plane > 0. What is your hw type and 
> > FW version ("ethtool -i <netdev_of_the_ibdev>")? I don't think it 
> > supports multiplane.
> 
>   $ /sbin/ethtool -i ibp129s0f0
>   driver: mlx5_core[ib_ipoib]
>   version: 6.10.11-amd64
>   firmware-version: 20.40.1000 (MT_0000000224)
>   expansion-rom-version: 
>   bus-info: 0000:81:00.0
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: no
>   supports-register-dump: no
>   supports-priv-flags: yes
> 
> Please note that I determined <netdev_of_the_ibdev> by looking at
> the output of 'ibv_devices': I hope this is a correct way to answer
> your question.

We forwarded this information to the FW team and will update you on the
findings.

Thanks

> 
> 
> 
> 
> -- 
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE




* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-26  8:38                   ` Leon Romanovsky
@ 2024-11-26 10:09                     ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2024-11-26 10:09 UTC
  To: Francesco Poli
  Cc: Mark Zhang, Uwe Kleine-König, 1086520, linux-rdma, netdev

On Tue, Nov 26, 2024 at 10:38:59AM +0200, Leon Romanovsky wrote:
> On Tue, Nov 26, 2024 at 08:18:24AM +0100, Francesco Poli wrote:
> > On Tue, 26 Nov 2024 09:21:37 +0800 Mark Zhang wrote:
> > 
> > [...]
> > > Yes looks like FW reports vport.num_plane > 0. What is your hw type and 
> > > FW version ("ethtool -i <netdev_of_the_ibdev>")? I don't think it 
> > > supports multiplane.
> > 
> >   $ /sbin/ethtool -i ibp129s0f0
> >   driver: mlx5_core[ib_ipoib]
> >   version: 6.10.11-amd64
> >   firmware-version: 20.40.1000 (MT_0000000224)
> >   expansion-rom-version: 
> >   bus-info: 0000:81:00.0
> >   supports-statistics: yes
> >   supports-test: yes
> >   supports-eeprom-access: no
> >   supports-register-dump: no
> >   supports-priv-flags: yes
> > 
> > Please note that I determined <netdev_of_the_ibdev> by looking at
> > the output of 'ibv_devices': I hope this is a correct way to answer
> > your question.
> 
> We forwarded this information to FW team and will update you on the
> findings.

Francesco, 

Please update the NICs' FW to the latest version. Your FW version has
a bug which causes it to return vport.num_plane == 1 even if the NIC doesn't
support multiplane mode.

We will continue to work internally to find a solution which doesn't require
a FW upgrade.

Thanks

> 
> Thanks
> 
> > 
> > 
> > 
> > 
> > -- 
> >  http://www.inventati.org/frx/
> >  There's not a second to spare! To the laboratory!
> > ..................................................... Francesco Poli .
> >  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
> 
> 
> 


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-25 19:38             ` Leon Romanovsky
  2024-11-26  1:21               ` Mark Zhang
@ 2024-11-27 17:48               ` Francesco Poli
  2024-11-27 20:04                 ` Leon Romanovsky
  1 sibling, 1 reply; 12+ messages in thread
From: Francesco Poli @ 2024-11-27 17:48 UTC
  To: Leon Romanovsky
  Cc: Uwe Kleine-König, 1086520, Mark Zhang, linux-rdma, netdev

[-- Attachment #1: Type: text/plain, Size: 5079 bytes --]

On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:

> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
[...]
> > I will try to continue to bisect by testing the resulting kernels on a
> > compute node: there's no OpenSM there and it cannot run anyway, if
> > there's another OpenSM on the same InfiniBand network.
> > However, I can check whether those issm* symlinks are created in
> > /sys/class/infiniband_mad/ 
> > I really hope that this is enough to pinpoint the first bad
> > commit...
> 
> Yes, these symlinks should be there. Your test scenario is correct one.

OK, I have completed the bisect on a compute node without OpenSM, by
looking at the issm* symlinks, as I said.

See below.

> 
> > 
> > Any better ideas?
> 
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
[...]

Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:

  $ git checkout 2a5db20fa532
  $ make -j 12 my_defconfig bindeb-pkg
  
  [install this version on a compute node test image and reboot
  one compute node with that image: the InfiniBand network was
  working for that node, that's no surprise, since OpenSM was running
  on the head node, but no issm* symlink was created; please note
  that, surprisingly, the Ethernet network was not working, I mean
  that the Ethernet interfaces were not found by the kernel...]
  
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
  
  $ git bisect bad
  Bisecting: 0 revisions left to test after this (roughly 0 steps)
  [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
  $ make -j 12 my_defconfig bindeb-pkg
  
  [install this version on the compute node test image and reboot
  one compute node with that image: the InfiniBand network was again
  working for that node, issm* symlinks were created;
  the Ethernet network was again not working for that node...]
  
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
  
  $ git bisect good
  2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
  commit 2a5db20fa532198639671713c6213f96ff285b85
  Author: Mark Zhang <markzhang@nvidia.com>
  Date:   Sun Jun 16 19:08:35 2024 +0300
  
      RDMA/mlx5: Add support to multi-plane device and port
  
      When multi-plane is supported, a logical port, which is aggregation of
      multiple physical plane ports, is exposed for data transmission.
      Compared with a normal mlx5 IB port, this logical port supports all
      functionalities except Subnet Management.
  
      Signed-off-by: Mark Zhang <markzhang@nvidia.com>
      Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
  
   drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
   drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
   drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
   include/linux/mlx5/driver.h                     |  1 +
   4 files changed, 55 insertions(+), 9 deletions(-)


In other words, bingo!, your guess looks correct, the first bad commit
is the one you mentioned.


Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
suggested, and check whether this solves the issue with the recent
Linux kernel versions.

Please confirm that the procedure to be followed is the one described in
<https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>

Thanks for your time and patience, and for all the help you are kindly
providing!   :-)


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-27 17:48               ` Francesco Poli
@ 2024-11-27 20:04                 ` Leon Romanovsky
  2024-12-04 16:37                   ` Uwe Kleine-König
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2024-11-27 20:04 UTC
  To: Francesco Poli
  Cc: Uwe Kleine-König, 1086520, Mark Zhang, linux-rdma, netdev

On Wed, Nov 27, 2024 at 06:48:03PM +0100, Francesco Poli wrote:
> On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
> 
> > On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> [...]
> > > I will try to continue to bisect by testing the resulting kernels on a
> > > compute node: there's no OpenSM there and it cannot run anyway, if
> > > there's another OpenSM on the same InfiniBand network.
> > > However, I can check whether those issm* symlinks are created in
> > > /sys/class/infiniband_mad/ 
> > > I really hope that this is enough to pinpoint the first bad
> > > commit...
> > 
> > Yes, these symlinks should be there. Your test scenario is correct one.
> 
> OK, I have completed the bisect on a compute node without OpenSM, by
> looking at the issm* symlinks, as I said.
> 
> See below.
> 
> > 
> > > 
> > > Any better ideas?
> > 
> > I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> > is the one which is causing to troubles, which leads me to suspect FW.
> [...]
> 
> Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
> 
>   $ git checkout 2a5db20fa532
>   $ make -j 12 my_defconfig bindeb-pkg
>   
>   [install this version on a compute node test image and reboot
>   one compute node with that image: the InfiniBand network was
>   working for that node, that's no surprise, since OpenSM was running
>   on the head node, but no issm* symlink was created; please note
>   that, surprisingly, the Ethernet network was not working, I mean
>   that the Ethernet interfaces were not found by the kernel...]
>   
>   root@node # ls -altrF /sys/class/infiniband_mad/
>   total 0
>   drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
>   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
>   -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
>   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
>   drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
>   
>   $ git bisect bad
>   Bisecting: 0 revisions left to test after this (roughly 0 steps)
>   [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
>   $ make -j 12 my_defconfig bindeb-pkg
>   
>   [install this version on the compute node test image and reboot
>   one compute node with that image: the InfiniBand network again
>   working for that node, issm* symlinks were created;
>   Ethernet network again not working for that node...]
>   
>   root@node # ls -altrF /sys/class/infiniband_mad/
>   total 0
>   drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
>   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
>   -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
>   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
>   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
>   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
>   drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
>   
>   $ git bisect good
>   2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
>   commit 2a5db20fa532198639671713c6213f96ff285b85
>   Author: Mark Zhang <markzhang@nvidia.com>
>   Date:   Sun Jun 16 19:08:35 2024 +0300
>   
>       RDMA/mlx5: Add support to multi-plane device and port
>   
>       When multi-plane is supported, a logical port, which is aggregation of
>       multiple physical plane ports, is exposed for data transmission.
>       Compared with a normal mlx5 IB port, this logical port supports all
>       functionalities except Subnet Management.
>   
>       Signed-off-by: Mark Zhang <markzhang@nvidia.com>
>       Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
>       Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>   
>    drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
>    drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
>    drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
>    include/linux/mlx5/driver.h                     |  1 +
>    4 files changed, 55 insertions(+), 9 deletions(-)
> 
> 
> In other words, bingo!, your guess looks correct, the first bad commit
> is the one you mentioned.
> 
> 
> Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
> suggested, and check whether this solves the issue with the recent
> Linux kernel versions.
> 
> Please confirm that the procedure to be followed is the one described in
> <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>

Yes, it looks like the correct procedure.
If you don't upgrade the FW, this diff will achieve the same result for you:

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index c2314797afc9..110ce177c305 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2846,7 +2846,7 @@ static int mlx5_ib_get_plane_num(struct mlx5_core_dev *mdev, u8 *num_plane)
        if (err)
                return err;

-       *num_plane = vport_ctx.num_plane;
+       *num_plane = (vport_ctx.num_plane > 1) ? vport_ctx.num_plane : 0;
        return 0;
 }

The root cause of your issue is that in some FW versions, vport_ctx.num_plane
is 1 and not 0 for devices which don't support that mode, while the driver
treats every value other than 0 as meaning that multi-plane is supported.
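
The effect of the one-line change can be illustrated outside the kernel. The sketch below only mirrors the ternary expression from the diff in shell; it is not driver code, and the function name effective_num_plane is hypothetical.

```shell
#!/bin/sh
# Mirror of the ternary in the diff: only values above 1 count as
# multi-plane, so a firmware-reported 1 is treated like 0.
effective_num_plane() {
    if [ "$1" -gt 1 ]; then
        echo "$1"
    else
        echo 0
    fi
}

for reported in 0 1 2 4; do
    echo "reported=$reported effective=$(effective_num_plane "$reported")"
done
```

A reported value of 1 maps to 0, so a device with the affected firmware is handled like a normal single-plane port, while values of 2 and above pass through unchanged.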

Thanks

> 
> Thanks for your time and patience, and for all the help you are kindly
> providing!   :-)
> 
> 
> -- 
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE




* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-11-27 20:04                 ` Leon Romanovsky
@ 2024-12-04 16:37                   ` Uwe Kleine-König
  2024-12-04 17:13                     ` Francesco Poli
  0 siblings, 1 reply; 12+ messages in thread
From: Uwe Kleine-König @ 2024-12-04 16:37 UTC
  To: Francesco Poli
  Cc: Leon Romanovsky, 1086520@bugs.debian.org, Mark Zhang, linux-rdma,
	netdev

[-- Attachment #1: Type: text/plain, Size: 6406 bytes --]

Hello Francesco,

On Wed, Nov 27, 2024 at 10:04:13PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 27, 2024 at 06:48:03PM +0100, Francesco Poli wrote:
> > On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
> > 
> > > On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> > [...]
> > > > I will try to continue to bisect by testing the resulting kernels on a
> > > > compute node: there's no OpenSM there and it cannot run anyway, if
> > > > there's another OpenSM on the same InfiniBand network.
> > > > However, I can check whether those issm* symlinks are created in
> > > > /sys/class/infiniband_mad/ 
> > > > I really hope that this is enough to pinpoint the first bad
> > > > commit...
> > > 
> > > Yes, these symlinks should be there. Your test scenario is correct one.
> > 
> > OK, I have completed the bisect on a compute node without OpenSM, by
> > looking at the issm* symlinks, as I said.
> > 
> > See below.
> > 
> > > 
> > > > 
> > > > Any better ideas?
> > > 
> > > I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> > > is the one which is causing to troubles, which leads me to suspect FW.
> > [...]
> > 
> > Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
> > 
> >   $ git checkout 2a5db20fa532
> >   $ make -j 12 my_defconfig bindeb-pkg
> >   
> >   [install this version on a compute node test image and reboot
> >   one compute node with that image: the InfiniBand network was
> >   working for that node, that's no surprise, since OpenSM was running
> >   on the head node, but no issm* symlink was created; please note
> >   that, surprisingly, the Ethernet network was not working, I mean
> >   that the Ethernet interfaces were not found by the kernel...]
> >   
> >   root@node # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
> >   
> >   $ git bisect bad
> >   Bisecting: 0 revisions left to test after this (roughly 0 steps)
> >   [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
> >   $ make -j 12 my_defconfig bindeb-pkg
> >   
> > >   [install this version on the compute node test image and reboot
> > >   one compute node with that image: the InfiniBand network was again
> > >   working for that node, issm* symlinks were created;
> > >   the Ethernet network was again not working for that node...]
> >   
> >   root@node # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
> >   
> >   $ git bisect good
> >   2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
> >   commit 2a5db20fa532198639671713c6213f96ff285b85
> >   Author: Mark Zhang <markzhang@nvidia.com>
> >   Date:   Sun Jun 16 19:08:35 2024 +0300
> >   
> >       RDMA/mlx5: Add support to multi-plane device and port
> >   
> >       When multi-plane is supported, a logical port, which is aggregation of
> >       multiple physical plane ports, is exposed for data transmission.
> >       Compared with a normal mlx5 IB port, this logical port supports all
> >       functionalities except Subnet Management.
> >   
> >       Signed-off-by: Mark Zhang <markzhang@nvidia.com>
> >       Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
> >       Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >   
> >    drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
> >    drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
> >    drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
> >    include/linux/mlx5/driver.h                     |  1 +
> >    4 files changed, 55 insertions(+), 9 deletions(-)
> > 
> > 
> > In other words, bingo!, your guess looks correct, the first bad commit
> > is the one you mentioned.
> > 
> > 
> > Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
> > suggested, and check whether this solves the issue with the recent
> > Linux kernel versions.
> > 
> > Please confirm that the procedure to be followed is the one described in
> > <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>
> 
> Yes, it looks like the correct procedure.
> If you don't upgrade the FW, this diff will achieve the same result for you:
> 
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index c2314797afc9..110ce177c305 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -2846,7 +2846,7 @@ static int mlx5_ib_get_plane_num(struct mlx5_core_dev *mdev, u8 *num_plane)
>         if (err)
>                 return err;
> 
> -       *num_plane = vport_ctx.num_plane;
> +       *num_plane = (vport_ctx.num_plane > 1) ? vport_ctx.num_plane : 0;
>         return 0;
>  }
> 
> The culprit of your issue is that in some FW versions, vport_ctx.num_plane
> was 1 and not 0 for devices which don't support that mode, while for the driver
> everything that is not 0 means supported.
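
For readers following along, the effect of the one-line change quoted above can be sketched in plain userspace C. This is only an illustration of the normalization in that diff, not the driver code: the function names normalize_num_plane() and is_multi_plane() are hypothetical, and the assumption is simply that a reported value of 0 or 1 both mean "no multi-plane support".

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression in the quoted diff:
 * a firmware-reported plane count of 0 or 1 is treated as "not
 * multi-plane" and mapped to 0; larger values pass through unchanged.
 */
uint8_t normalize_num_plane(uint8_t reported)
{
	return (reported > 1) ? reported : 0;
}

/*
 * Hypothetical stand-in for the driver-side check described above,
 * where any non-zero plane count means "multi-plane supported".
 */
int is_multi_plane(uint8_t num_plane)
{
	return num_plane != 0;
}
```

With the normalization in place, a firmware that reports 1 is no longer seen as multi-plane: is_multi_plane(normalize_num_plane(1)) is 0, whereas is_multi_plane(1) without it is 1.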

I wonder if you could test a firmware upgrade or the above patch. Would
be nice to know if there are still some things to do for us (= Debian
kernel team) here.

If everything is fine for you, I'd like to close this bug.

Best regards
Uwe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-12-04 16:37                   ` Uwe Kleine-König
@ 2024-12-04 17:13                     ` Francesco Poli
  2024-12-05  9:17                       ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Francesco Poli @ 2024-12-04 17:13 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Leon Romanovsky, 1086520-done, Mark Zhang, linux-rdma, netdev

[-- Attachment #1: Type: text/plain, Size: 2569 bytes --]

On Wed, 4 Dec 2024 17:37:05 +0100 Uwe Kleine-König wrote:

> Hello Francesco,

Hello Uwe,

[...]
> I wonder if you could test a firmware upgrade or the above patch. Would
> be nice to know if there are still some things to do for us (= Debian
> kernel team) here.

Yes, I've finally got around to upgrading the firmware.

And today I had a time window, where I could reboot the cluster head
node.
After the reboot, the InfiniBand network works correctly:

  $ uname -v
  #1 SMP PREEMPT_DYNAMIC Debian 6.11.10-1 (2024-11-23)
  $ ls -altrF /sys/class/infiniband_mad/
  total 0
  lrwxrwxrwx  1 root root    0 Dec  4 10:15 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
  lrwxrwxrwx  1 root root    0 Dec  4 10:15 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Dec  4 10:17 ./
  drwxr-xr-x 73 root root    0 Dec  4 10:17 ../
  -r--r--r--  1 root root 4096 Dec  4 10:17 abi_version
  lrwxrwxrwx  1 root root    0 Dec  4 18:08 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Dec  4 18:08 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
  # ethtool -i ibp129s0f0
  driver: mlx5_core[ib_ipoib]
  version: 6.11.10-amd64
  firmware-version: 20.43.1014 (MT_0000000224)
  expansion-rom-version:
  bus-info: 0000:81:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: yes
  # ethtool -i ibp129s0f1
  driver: mlx5_core[ib_ipoib]
  version: 6.11.10-amd64
  firmware-version: 20.43.1014 (MT_0000000224)
  expansion-rom-version:
  bus-info: 0000:81:00.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: yes
  $ ps aux | grep opens[m]
  root        1150  0.0  0.0 1560776 3636 ?        Ssl  10:15   0:00 /usr/sbin/opensm --guid 0x9c63c00300033240 --log_file /var/log/opensm.0x9c63c00300033240.log
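
The check used throughout this thread (are there issm* entries in /sys/class/infiniband_mad/?) could also be done programmatically. The following is a minimal sketch under that assumption; the function name count_issm_entries() is hypothetical, and it only matches entry names by prefix, without verifying that they are symlinks.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count the entries in dir_path whose names start with "issm".
 * Returns -1 if the directory cannot be opened.
 */
int count_issm_entries(const char *dir_path)
{
	DIR *dir = opendir(dir_path);
	struct dirent *entry;
	int count = 0;

	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		if (strncmp(entry->d_name, "issm", 4) == 0)
			count++;
	}

	closedir(dir);
	return count;
}
```

Called as count_issm_entries("/sys/class/infiniband_mad"), a result of 0 would correspond to the failing listings earlier in the thread, and a positive result to the working ones.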


> 
> If everything is fine for you, I'd like to close this bug.

I am closing the Debian bug report right now.
Thanks to everyone who has been involved for the great and kind help!

> 
> Best regards

Have a nice evening.   :-)

-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]


* Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
  2024-12-04 17:13                     ` Francesco Poli
@ 2024-12-05  9:17                       ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2024-12-05  9:17 UTC (permalink / raw)
  To: Francesco Poli
  Cc: Uwe Kleine-König, 1086520-done, Mark Zhang, linux-rdma,
	netdev

On Wed, Dec 04, 2024 at 06:13:56PM +0100, Francesco Poli wrote:
> On Wed, 4 Dec 2024 17:37:05 +0100 Uwe Kleine-König wrote:
> 
> > Hello Francesco,
> 
> Hello Uwe,
> 
> [...]
> > I wonder if you could test a firmware upgrade or the above patch. Would
> > be nice to know if there are still some things to do for us (= Debian
> > kernel team) here.
> 
> Yes, I've finally got around to upgrading the firmware.
> 
> And today I had a time window, where I could reboot the cluster head
> node.
> After the reboot, the InfiniBand network works correctly:
> 
>   $ uname -v
>   #1 SMP PREEMPT_DYNAMIC Debian 6.11.10-1 (2024-11-23)
>   $ ls -altrF /sys/class/infiniband_mad/
>   total 0
>   lrwxrwxrwx  1 root root    0 Dec  4 10:15 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
>   lrwxrwxrwx  1 root root    0 Dec  4 10:15 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
>   drwxr-xr-x  2 root root    0 Dec  4 10:17 ./
>   drwxr-xr-x 73 root root    0 Dec  4 10:17 ../
>   -r--r--r--  1 root root 4096 Dec  4 10:17 abi_version
>   lrwxrwxrwx  1 root root    0 Dec  4 18:08 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
>   lrwxrwxrwx  1 root root    0 Dec  4 18:08 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
>   # ethtool -i ibp129s0f0
>   driver: mlx5_core[ib_ipoib]
>   version: 6.11.10-amd64
>   firmware-version: 20.43.1014 (MT_0000000224)
>   expansion-rom-version:
>   bus-info: 0000:81:00.0
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: no
>   supports-register-dump: no
>   supports-priv-flags: yes
>   # ethtool -i ibp129s0f1
>   driver: mlx5_core[ib_ipoib]
>   version: 6.11.10-amd64
>   firmware-version: 20.43.1014 (MT_0000000224)
>   expansion-rom-version:
>   bus-info: 0000:81:00.1
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: no
>   supports-register-dump: no
>   supports-priv-flags: yes
>   $ ps aux | grep opens[m]
>   root        1150  0.0  0.0 1560776 3636 ?        Ssl  10:15   0:00 /usr/sbin/opensm --guid 0x9c63c00300033240 --log_file /var/log/opensm.0x9c63c00300033240.log
> 
> 
> > 
> > If everything is fine for you, I'd like to close this bug.
> 
> I am closing the Debian bug report right now.
> Thanks to everyone who has been involved for the great and kind help!

Thanks a lot for your help. You helped a lot.

BTW, we have an official fix [1], but it hasn't been sent yet, as we want to
finish all the various tests first (E2E, QA, etc.).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=rdma-next&id=09754c1e5d0d204747928290cc8c6f4371fd4c6a

> 
> > 
> > Best regards
> 
> Have a nice evening.   :-)
> 
> -- 
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE



