From: Leon Romanovsky <leonro@nvidia.com>
To: Francesco Poli <invernomuto@paranoici.org>
Cc: "Uwe Kleine-König" <ukleinek@debian.org>,
1086520@bugs.debian.org, "Mark Zhang" <markzhang@nvidia.com>,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
Date: Wed, 27 Nov 2024 22:04:13 +0200
Message-ID: <20241127200413.GE1245331@unreal>
In-Reply-To: <20241127184803.75086499e71c6b1588a4fb5a@paranoici.org>
On Wed, Nov 27, 2024 at 06:48:03PM +0100, Francesco Poli wrote:
> On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
>
> > On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> [...]
> > > I will try to continue to bisect by testing the resulting kernels on a
> > > compute node: there's no OpenSM there and it cannot run anyway, if
> > > there's another OpenSM on the same InfiniBand network.
> > > However, I can check whether those issm* symlinks are created in
> > > /sys/class/infiniband_mad/
> > > I really hope that this is enough to pinpoint the first bad
> > > commit...
> >
> > Yes, these symlinks should be there. Your test scenario is correct one.
>
> OK, I have completed the bisect on a compute node without OpenSM, by
> looking at the issm* symlinks, as I said.
>
> See below.
>
> >
> > >
> > > Any better ideas?
> >
> > I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> > is the one which is causing to troubles, which leads me to suspect FW.
> [...]
>
> Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
>
> $ git checkout 2a5db20fa532
> $ make -j 12 my_defconfig bindeb-pkg
>
> [install this version on a compute node test image and reboot
> one compute node with that image: the InfiniBand network was
> working for that node, that's no surprise, since OpenSM was running
> on the head node, but no issm* symlink was created; please note
> that, surprisingly, the Ethernet network was not working, I mean
> that the Ethernet interfaces were not found by the kernel...]
>
> root@node # ls -altrF /sys/class/infiniband_mad/
> total 0
> drwxr-xr-x 60 root root 0 Nov 26 17:06 ../
> lrwxrwxrwx 1 root root 0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> -r--r--r-- 1 root root 4096 Nov 26 17:06 abi_version
> lrwxrwxrwx 1 root root 0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> drwxr-xr-x 2 root root 0 Nov 26 17:08 ./
>
> $ git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
> $ make -j 12 my_defconfig bindeb-pkg
>
> [install this version on the compute node test image and reboot
> one compute node with that image: the InfiniBand network again
> working for that node, issm* symlinks were created;
> Ethernet network again not working for that node...]
>
> root@node # ls -altrF /sys/class/infiniband_mad/
> total 0
> drwxr-xr-x 60 root root 0 Nov 26 17:31 ../
> lrwxrwxrwx 1 root root 0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> -r--r--r-- 1 root root 4096 Nov 26 17:31 abi_version
> lrwxrwxrwx 1 root root 0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> lrwxrwxrwx 1 root root 0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
> lrwxrwxrwx 1 root root 0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
> drwxr-xr-x 2 root root 0 Nov 26 17:36 ./
>
> $ git bisect good
> 2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
> commit 2a5db20fa532198639671713c6213f96ff285b85
> Author: Mark Zhang <markzhang@nvidia.com>
> Date: Sun Jun 16 19:08:35 2024 +0300
>
> RDMA/mlx5: Add support to multi-plane device and port
>
> When multi-plane is supported, a logical port, which is aggregation of
> multiple physical plane ports, is exposed for data transmission.
> Compared with a normal mlx5 IB port, this logical port supports all
> functionalities except Subnet Management.
>
> Signed-off-by: Mark Zhang <markzhang@nvidia.com>
> Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>
> drivers/infiniband/hw/mlx5/main.c | 60 +++++++++++++++++++++----
> drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +
> drivers/net/ethernet/mellanox/mlx5/core/vport.c | 1 +
> include/linux/mlx5/driver.h | 1 +
> 4 files changed, 55 insertions(+), 9 deletions(-)
>
>
> In other words, bingo!, your guess looks correct, the first bad commit
> is the one you mentioned.
>
>
> Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
> suggested, and check whether this solves the issue with the recent
> Linux kernel versions.
>
> Please confirm that the procedure to be followed is the one described in
> <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>
Yes, it looks like the correct procedure.
If you don't upgrade the FW, this diff will achieve the same result for you:
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index c2314797afc9..110ce177c305 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2846,7 +2846,7 @@ static int mlx5_ib_get_plane_num(struct mlx5_core_dev *mdev, u8 *num_plane)
 	if (err)
 		return err;
-	*num_plane = vport_ctx.num_plane;
+	*num_plane = (vport_ctx.num_plane > 1) ? vport_ctx.num_plane : 0;
 	return 0;
 }
The culprit of your issue is that in some FW versions, vport_ctx.num_plane
was 1 and not 0 for devices which don't support that mode, while for the driver
everything that is not 0 means supported.
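To make the logic concrete, here is a minimal standalone sketch of that
normalization. It is not driver code: normalize_num_plane() and
multiplane_enabled() are hypothetical names used only for illustration, and
the only assumption is the one stated here, that a reported value of 1 should
be treated the same as 0.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the mlx5 driver: map the raw num_plane
 * value reported by firmware to the value the driver should act on.
 * Some firmware reports 1 on devices without multi-plane support, so only
 * values above 1 are kept; everything else becomes 0.
 */
static uint8_t normalize_num_plane(uint8_t fw_num_plane)
{
	return fw_num_plane > 1 ? fw_num_plane : 0;
}

/* Mirrors the driver-side rule: any non-zero plane count means "supported". */
static bool multiplane_enabled(uint8_t num_plane)
{
	return num_plane != 0;
}
```

With the raw value, multiplane_enabled(1) is true, which is the misdetection
described here; with multiplane_enabled(normalize_num_plane(1)) it is false.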
Thanks
>
> Thanks for your time and patience, and for all the help you are kindly
> providing! :-)
>
>
> --
> http://www.inventati.org/frx/
> There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
> GnuPG key fpr == CA01 1147 9CD2 EFDF FB82 3925 3E1C 27E1 1F69 BFFE