Linux PCI subsystem development
From: <Erin_Tsao@wistron.com>
To: <helgaas@kernel.org>
Cc: <linux-pci@vger.kernel.org>, <mj@ucw.cz>
Subject: RE: Issue about PCI physical slot fetch incorrect number
Date: Fri, 6 Sep 2024 02:04:28 +0000
Message-ID: <9b2154e86ab240dcba530b6620100139@wistron.com>
In-Reply-To: <20240829163518.GA39705@bhelgaas>

[-- Attachment #1: Type: text/plain, Size: 8527 bytes --]

Hi Bjorn,
Sorry for the late response. We needed some time to collect the information from the other team, which is why it took almost a week to get back to you.

[Bjorn] I don't know off the top of my head why lspci doesn't report a "Physical Slot:" for 21:00.0.  I suppose the kernel didn't provide something in /sys for it. All the other "Physical Slot:" reports from lspci match the "Physical Slot Number" from the PCIe Capability of the bridge leading to the slot, *except* for 33:00.0.  In that case, the "Physical Slot Number" from the bridge PCIe Capability is not unique.  Both 20:01.1 and 32:00.0 advertise Slot #0 there, so the kernel makes the sysfs slot unique, e.g., "0-6".

->I think we are on the same page about this. I am wondering why 33:00.0, which is the GPU device, is the exception that does not get the correct physical slot number. As you said, the GPU device at 33:00.0 shows physical slot number 0 because its bridge does not have a proper slot number.

[Bjorn] From lspci_vvv_team2.txt: This seems strange to me. For 39:01.0, lspci reports "Physical Slot: 24", but 39:01.0 is a Downstream Port that *leads* to a slot; it's not a slot itself.  3b:00.0 is the device in that slot, and I think it should have a slot number, but it doesn't. Similarly, lspci reports "Physical Slot: 39" for 39:02.0, when it should show 3e:00.0 being in slot 39. I guess this team2 situation is what you're trying to understand?

->Understood. Yes, this situation is also something I want to clarify.

To be brief, there are two points I would like to clarify:
1. On team1, every device except the GPU gets the correct physical slot number. Because the GPU's bridge does not have a slot number, the end device downstream of it gets slot number "0", which I believe is not the slot number it should display.
2. On team2, the slot number is shown on the Downstream Port instead of on the end device itself. Is this the reason why their GPU device doesn't have a slot number, because the slot number is shown on the Downstream Port?

[Bjorn] Can you collect the complete dmesg log and output of "grep -r . /sys/bus/pci/slots" for both team1 and team2?  We should be able to puzzle out what's going on.  The dmesg logging will show which hotplug drivers are in use and should have hints about slot numbering, and if it doesn't, we may need to add some.

->As you requested, we collected the dmesg logs from team1 and team2. Please take a look and let us know what we can do next to fix this issue.
If you need any further information, please feel free to tell me. I will do my best to get back to you as soon as possible.

I am sending both the dmesg logs and the slot files under /sys/bus/pci from team1 and team2. Please see the attachments for details. I really appreciate your help and assistance.
Hope to hear from you soon.

BR,
Erin
-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org> 
Sent: Friday, August 30, 2024 12:35 AM
To: Erin Tsao/WHQ/Wistron <Erin_Tsao@wistron.com>
Cc: linux-pci@vger.kernel.org; mj@ucw.cz
Subject: Re: Issue about PCI physical slot fetch incorrect number

On Mon, Aug 26, 2024 at 08:27:09AM +0000, Erin_Tsao@wistron.com wrote:
> Hi Bjorn,
> Sorry for the late response. And thanks for responding to my question.
> There are a few things I would like to clarify with you.
> 1. Is the physical slot number associated with the configuration of 
> the device itself or with the configuration of the device's parent?

A PCIe device doesn't know its own slot number.  The bridge leading to a slot (either a Root Port or a Switch Downstream Port) has the Slot Capability/Status/Control registers that manage the slot.  The Slot Capabilities register contains a "Physical Slot Number".  This is HwInit, which means it's set by hardware or firmware, and it's supposed to be a number that's unique within the chassis.
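
To make the register layout concrete, here is a minimal sketch of extracting the "Physical Slot Number" from a raw 32-bit Slot Capabilities value. The field position (bits 31:19) matches the PCI_EXP_SLTCAP_PSN mask in the Linux pci_regs.h header; the function name and the example register value are made up for illustration and are not taken from the team1 or team2 dumps.

```python
# Physical Slot Number occupies bits 31:19 of the Slot Capabilities register.
PCI_EXP_SLTCAP_PSN_MASK = 0xFFF80000
PCI_EXP_SLTCAP_PSN_SHIFT = 19


def physical_slot_number(sltcap: int) -> int:
    """Return the 13-bit Physical Slot Number from a 32-bit Slot Capabilities value."""
    return (sltcap & PCI_EXP_SLTCAP_PSN_MASK) >> PCI_EXP_SLTCAP_PSN_SHIFT


# Hypothetical register value: slot number 308 plus one unrelated low bit set.
example_sltcap = (308 << 19) | 0x40
print(physical_slot_number(example_sltcap))  # prints 308
```

The low 19 bits carry other slot capability flags, which is why the value is masked before shifting.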

The "Physical Slot" reported by lspci for Endpoints comes from sysfs, not from the device itself.  See https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/lib/sysfs.c?id=v3.13.0#n277
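
As a rough illustration of that sysfs lookup, the sketch below builds a map from slot address to slot name and then looks up a device. It assumes each directory under /sys/bus/pci/slots contains an "address" file holding "domain:bus:device" without a function number; the function names are hypothetical, and this is not a copy of what pciutils does.

```python
import os


def read_slot_addresses(slots_dir="/sys/bus/pci/slots"):
    """Return {"dddd:bb:dd": slot_name} for every slot directory found."""
    slots = {}
    if not os.path.isdir(slots_dir):
        return slots
    for name in sorted(os.listdir(slots_dir)):
        address_file = os.path.join(slots_dir, name, "address")
        try:
            with open(address_file) as f:
                address = f.read().strip()
        except OSError:
            continue  # skip entries without a readable address file
        # If two slots report the same address, the later name wins here.
        slots[address] = name
    return slots


def slot_for_device(slots, bdf):
    """Look up a device such as "0000:23:00.0" by dropping the function number."""
    return slots.get(bdf.rsplit(".", 1)[0])


if __name__ == "__main__":
    slots = read_slot_addresses()
    print(slot_for_device(slots, "0000:23:00.0"))
```

A device whose domain, bus and device number match no slot's address gets no slot name, which would be consistent with lspci printing no "Physical Slot:" line for it.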

From team1 lspci-vvv:

  20:01.1 Root Port to [bus 21-34] Slot #0
    21:00.0 Broadcom Switch Upstream Port to [bus 22-34]
      22:00.0 Broadcom Switch Downstream Port to [bus 23] Slot #308
        23:00.0 Broadcom Endpoint 02b2 "Physical Slot: 308"
      22:01.0 Broadcom Switch Downstream Port to [bus 24] Slot #306
        24:00.0 Broadcom Endpoint 02b2 "Physical Slot: 306"
      22:02.0 Broadcom Switch Downstream Port to [bus 25-2a] Slot #213
        25:00.0 Mellanox Endpoint MT2910 "Physical Slot: 213"
      22:03.0 Broadcom Switch Downstream Port to [bus 2b-30] Slot #203
        2b:00.0 Broadcom Endpoint 02b2 "Physical Slot: 203"
      22:04.0 Broadcom Switch Downstream Port to [bus 31-33] Slot #101
        31:00.0 AMD Switch Upstream Port to [bus 32-33] "Physical Slot: 101"
          32:00.0 AMD Switch Downstream Port to [bus 33] Slot #0
            33:00.0 AMD Endpoint 74a1 "Physical Slot: 0-6"

I don't know off the top of my head why lspci doesn't report a "Physical Slot:" for 21:00.0.  I suppose the kernel didn't provide something in /sys for it.

All the other "Physical Slot:" reports from lspci match the "Physical Slot Number" from the PCIe Capability of the bridge leading to the slot, *except* for 33:00.0.  In that case, the "Physical Slot Number" from the bridge PCIe Capability is not unique.  Both 20:01.1 and 32:00.0 advertise Slot #0 there, so the kernel makes the sysfs slot unique, e.g., "0-6".
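
The renaming idea can be sketched as follows: when a requested slot name is already taken, append "-1", "-2", and so on until a free name is found. This is only an illustration of the scheme, not the kernel's code, and the set of names already taken is invented; the thread does not show which names exist on the team1 system.

```python
def make_unique_slot_name(requested, existing):
    """Return requested if it is free, else the first free "requested-N" (N = 1, 2, ...)."""
    if requested not in existing:
        return requested
    suffix = 1
    while f"{requested}-{suffix}" in existing:
        suffix += 1
    return f"{requested}-{suffix}"


# Hypothetical set of names already registered before the GPU's bridge is seen.
taken = {"0", "0-1", "0-2", "0-3", "0-4", "0-5"}
print(make_unique_slot_name("0", taken))  # prints 0-6
```

Under this scheme a suffixed name such as "0-6" only says that several bridges advertised the same number; it carries no information about the physical location.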

From lspci_vvv_team2.txt:

  39:00.0 Broadcom Switch Downstream Port to [bus 3a] Slot #166
    3a:00.0 Samsung Endpoint NVMe "Physical Slot: 166"
  39:01.0 Broadcom Switch Downstream Port to [bus 3b-3d] Slot #24 "Physical Slot: 24"
    3b:00.0 AMD Switch Upstream Port to [bus 3c-3d]
      3c:00.0 AMD Switch Downstream Port to [bus 3d] Slot #0
        3d:00.0 AMD Endpoint MI300X
  39:02.0 Broadcom Switch Downstream Port to [bus 3e] Slot #39 "Physical Slot: 39"
    3e:00.0 Mellanox Endpoint

This seems strange to me.  For 39:01.0, lspci reports "Physical Slot: 24", but 39:01.0 is a Downstream Port that *leads* to a slot; it's not a slot itself.  3b:00.0 is the device in that slot, and I think it should have a slot number, but it doesn't.

Similarly, lspci reports "Physical Slot: 39" for 39:02.0, when it should show 3e:00.0 being in slot 39.

I guess this team2 situation is what you're trying to understand?

Can you collect the complete dmesg log and output of "grep -r . /sys/bus/pci/slots" for both team1 and team2?  We should be able to puzzle out what's going on.  The dmesg logging will show which hotplug drivers are in use and should have hints about slot numbering, and if it doesn't, we may need to add some.

> 2. As I understand it, another team is also using the AMD GPU MI300. 
> I have discovered that the lspci -xxx output differs between our 
> team (team 1) and their team (team 2). When we dump lspci -xxx, the 
> content is listed only up to 0xff; the other team's dump is listed 
> up to 0xfff, which means they have additional content from 0x100 to 
> 0xfff.

>   ->Is there any OS setting that we can enable in order to see
>   the whole content?

I think "lspci -xxx" will only show you 0-0xff unless lspci is run as root.

>   ->Is this additional content related to the physical slot
>   number? Or does it have any impact on showing the physical slot number?

I don't think so.  The Slot Capability/Status/Control registers are in the PCIe Capability, which should be below 0xff.
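
To illustrate why the first 256 bytes are enough, here is a minimal sketch that walks the standard capability list in a config-space dump, finds the PCIe Capability (ID 0x10), and reads the Slot Capabilities register at offset 0x14 inside it. The register offsets follow the Linux pci_regs.h definitions; the function names are hypothetical, and the sketch does not check whether the port actually implements a slot.

```python
PCI_STATUS = 0x06            # Status register (16 bits)
PCI_STATUS_CAP_LIST = 0x10   # Status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_SLTCAP = 0x14        # Slot Capabilities, relative to the PCIe Capability


def find_capability(config: bytes, cap_id: int):
    """Return the offset of capability cap_id in a 256-byte config dump, or None."""
    status = int.from_bytes(config[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping capability list
    while pos >= 0x40 and pos not in seen:
        seen.add(pos)
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC  # next-capability pointer
    return None


def read_sltcap(config: bytes):
    """Return the raw 32-bit Slot Capabilities value, or None without a PCIe Capability."""
    pos = find_capability(config, PCI_CAP_ID_EXP)
    if pos is None:
        return None
    start = pos + PCI_EXP_SLTCAP
    return int.from_bytes(config[start:start + 4], "little")
```

Here config is expected to be the 256 bytes of standard configuration space, for example parsed from an "lspci -xxx" hex dump; the walk never needs the extended range from 0x100 to 0xfff.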

Bjorn


[-- Attachment #2: dmesg_and_slot_team1.zip --]
[-- Type: application/x-zip-compressed, Size: 127413 bytes --]

[-- Attachment #3: dmesg_and_slot_team2.zip --]
[-- Type: application/x-zip-compressed, Size: 73987 bytes --]

Thread overview: 7+ messages
     [not found] <a600fc09c06d4ca28b045668ad1e63cb@wistron.com>
2024-08-23 18:51 ` Issue about PCI physical slot fetch incorrect number Martin Mareš
2024-08-23 21:03   ` Bjorn Helgaas
2024-08-26  8:27     ` Erin_Tsao
2024-08-29 16:35       ` Bjorn Helgaas
2024-09-06  2:04         ` Erin_Tsao [this message]
2024-09-18 14:09           ` Bjorn Helgaas
2024-08-26  9:05   ` Erin_Tsao
