Linux PCI subsystem development
* Re: Issue about PCI physical slot fetch incorrect number
       [not found] <a600fc09c06d4ca28b045668ad1e63cb@wistron.com>
@ 2024-08-23 18:51 ` Martin Mareš
  2024-08-23 21:03   ` Bjorn Helgaas
  2024-08-26  9:05   ` Erin_Tsao
  0 siblings, 2 replies; 7+ messages in thread
From: Martin Mareš @ 2024-08-23 18:51 UTC (permalink / raw)
  To: Erin_Tsao; +Cc: Linux-PCI Mailing List

Hi!

> This is Erin from Taiwan. I have a question about the physical slot number.
> We are currently working on PCIe slot number assignment behind a PCIe switch. In the PCIe slot assignment process, slot numbers are assigned to bridges first, and the end devices then fetch the slot ID from the bridge in the layer above.
> 
> I have observed that under our PCIe switch, each GPU presents a bridge before the end device is reached. If GPUs also fetch the slot ID from the bridge above them, they may retrieve incorrect values.
> 
> Our GPUs get physical slot number "0" and are shown with slot numbers "0", "0-1", etc.
> May I ask:
> 
>   1.  Why does the GPU fetch slot number "0"? Is the slot number assigned to the GPU related to any register? Or can we set any bit to fetch the right number?
>   2.  Is there any way for us not to show the physical slot number of the GPU?
> 
> I have checked the code in git, but unfortunately I didn't find an answer.
> It would be really helpful to get a response from you.
> Hope to hear from you soon, thanks in advance.

It's a long long time since I was the maintainer of the PCI layer in the
kernel. Now I maintain just the PCI utilities which display whatever the
kernel tells them.

I am forwarding your question to the linux-pci mailing list where it
hopefully finds a better audience.

				Martin


* Re: Issue about PCI physical slot fetch incorrect number
  2024-08-23 18:51 ` Issue about PCI physical slot fetch incorrect number Martin Mareš
@ 2024-08-23 21:03   ` Bjorn Helgaas
  2024-08-26  8:27     ` Erin_Tsao
  2024-08-26  9:05   ` Erin_Tsao
  1 sibling, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2024-08-23 21:03 UTC (permalink / raw)
  To: Erin_Tsao; +Cc: Linux-PCI Mailing List, Martin Mareš

Hi Erin, thanks for your question.

On Fri, Aug 23, 2024 at 08:51:58PM +0200, Martin Mareš wrote:
> Hi!
> 
> > This is Erin from Taiwan. I have a question about the physical slot
> > number.  We are currently working on PCIe slot number assignment
> > behind a PCIe switch. In the PCIe slot assignment process, slot
> > numbers are assigned to bridges first, and the end devices then
> > fetch the slot ID from the bridge in the layer above.
> > 
> > I have observed that under our PCIe switch, each GPU presents a
> > bridge before the end device is reached. If GPUs also fetch the
> > slot ID from the bridge above them, they may retrieve incorrect
> > values.
> > 
> > Our GPUs get physical slot number "0" and are shown with slot
> > numbers "0", "0-1", etc.
> > May I ask:
> > 
> >   1.  Why does the GPU fetch slot number "0"? Is the slot number
> >   assigned to the GPU related to any register? Or can we set any
> >   bit to fetch the right number?
> >
> >   2.  Is there any way for us not to show the physical slot
> >   number of the GPU?

Can you supply logs showing what you see and what's incorrect?

For example, if lspci is showing the wrong thing, can you provide the
complete output of "sudo lspci -vv" and indicate which things are
wrong?

If the kernel dmesg log is wrong, can you supply that output and point
out what's wrong?

Also, I think slots are exposed in /sys, so please include the output
of "grep . /sys/bus/pci/slots/*/address".

Slot numbering is messy because there are several sources of
information, e.g., the Physical Slot Number in the Slot Capabilities
register, SMBIOS table, ACPI _DSM methods, etc., and they are not all
coordinated.  So the kernel goes to some trouble to come up with a
unique "slot number" for each slot.
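
For anyone who would rather script the sysfs check than run grep, here is
a minimal Python sketch of the same listing.  It assumes each slot is a
directory under /sys/bus/pci/slots containing an "address" file; the
function name read_slot_addresses is made up for illustration and is not
part of any tool.

```python
from pathlib import Path


def read_slot_addresses(root="/sys/bus/pci/slots"):
    """Return {slot_name: address} for every slot directory under root.

    Assumption: each slot directory holds a text file named "address".
    Directories without that file are skipped.
    """
    slots = {}
    base = Path(root)
    if not base.is_dir():
        # No slots directory at all (e.g. not running on Linux).
        return slots
    for entry in sorted(base.iterdir()):
        address_file = entry / "address"
        if address_file.is_file():
            slots[entry.name] = address_file.read_text().strip()
    return slots


if __name__ == "__main__":
    for name, address in read_slot_addresses().items():
        print(f"{name}: {address}")
```

The directory name is the slot name the kernel settled on, so this is
where duplicate-resolved names would be expected to show up.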

Bjorn


* RE: Issue about PCI physical slot fetch incorrect number
  2024-08-23 21:03   ` Bjorn Helgaas
@ 2024-08-26  8:27     ` Erin_Tsao
  2024-08-29 16:35       ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Erin_Tsao @ 2024-08-26  8:27 UTC (permalink / raw)
  To: helgaas; +Cc: linux-pci, mj

[-- Attachment #1: Type: text/plain, Size: 5118 bytes --]

Hi Bjorn,
Sorry for the late response, and thanks for responding to my question.
There are a few things I would like to clarify with you.
1. Is the physical slot number associated with the configuration of the device itself or with the configuration of the device's parent?
2. Another team is also using the AMD MI300 GPU, and I have found that the "lspci -xxx" output differs between our team (team 1)
  and their team (team 2). Our dump only lists config space up to 0xff, whereas theirs lists it up to
  0xfff, which means they have additional content from 0x100 to 0xfff.
  -> Is there any OS setting we can enable in order to see the whole content?
  -> Is this additional content related to the physical slot number, or does it have any impact on how the physical slot number is shown?
3. Based on the response you gave:
 Slot numbering is messy because there are several sources of information, e.g., the Physical Slot Number in the Slot Capabilities register, SMBIOS table, ACPI _DSM
 methods, etc., and they are not all coordinated.  So the kernel goes to some trouble to come up with a unique "slot number" for each slot.
 -> These are all gathered under the path /sys/bus/pci/slots/. May I know how they are organized? Is there specific code in lspci that we can trace?
4. The attached files are partial lspci -tv, lspci -vvv, and lspci -xxx output from our team (team 1) and from the other team (team 2). Please find more details
  in these attachments. In particular, I would like to focus on the GPU region. As you can see in team 2's -vvv output for 3d:00.0, the MI300 GPU doesn't show a physical
  slot number; in our team's (team 1) -vvv output for 33:00.0, however, the GPU shows physical slot number "0".
5. A screenshot of the slots under /sys/bus/pci/slots is also attached. I can't find the path you gave, "/sys/bus/pci/slots/*/address".
  -> Judging from the screenshot, I think the slot number list under /sys/bus/pci/slots is already incorrect. Could you explain how these entries are constructed,
  so that we can make some modifications?

Really appreciate your help and hope to hear from you soon.
Thanks in advance.

BR,
Erin

---------------------------------------------------------------------------------------------------------------------------------------------------------------
This email contains confidential or legally privileged information and is for the sole use of its intended recipient.
Any unauthorized review, use, copying or distribution of this email or the content of this email is strictly prohibited.
If you are not the intended recipient, you may reply to the sender and should delete this e-mail immediately.
---------------------------------------------------------------------------------------------------------------------------------------------------------------

[-- Attachment #2: lspci_info_team1.zip --]
[-- Type: application/x-zip-compressed, Size: 10631 bytes --]

[-- Attachment #3: lspci_info_team2.zip --]
[-- Type: application/x-zip-compressed, Size: 18007 bytes --]

[-- Attachment #4: slot_num.png --]
[-- Type: image/png, Size: 65191 bytes --]


* RE: Issue about PCI physical slot fetch incorrect number
  2024-08-23 18:51 ` Issue about PCI physical slot fetch incorrect number Martin Mareš
  2024-08-23 21:03   ` Bjorn Helgaas
@ 2024-08-26  9:05   ` Erin_Tsao
  1 sibling, 0 replies; 7+ messages in thread
From: Erin_Tsao @ 2024-08-26  9:05 UTC (permalink / raw)
  To: mj; +Cc: linux-pci

Hi Martin,
Thanks for forwarding the mail.
I hope to get good news from them.

Have a good day!

BR,
Erin



* Re: Issue about PCI physical slot fetch incorrect number
  2024-08-26  8:27     ` Erin_Tsao
@ 2024-08-29 16:35       ` Bjorn Helgaas
  2024-09-06  2:04         ` Erin_Tsao
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2024-08-29 16:35 UTC (permalink / raw)
  To: Erin_Tsao; +Cc: linux-pci, mj

On Mon, Aug 26, 2024 at 08:27:09AM +0000, Erin_Tsao@wistron.com wrote:
> Hi Bjorn,
> Sorry for the late response, and thanks for responding to my question.
> There are a few things I would like to clarify with you.
> 1. Is the physical slot number associated with the configuration of
> the device itself or with the configuration of the device's parent?

A PCIe device doesn't know its own slot number.  The bridge leading to
a slot (either a Root Port or a Switch Downstream Port) has the Slot
Capability/Status/Control registers that manage the slot.  The Slot
Capabilities register contains a "Physical Slot Number".  This is
HwInit, which means it's set by hardware or firmware, and it's
supposed to be a number that's unique within the chassis.

The "Physical Slot" reported by lspci for Endpoints comes from sysfs,
not from the device itself.  See
https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/lib/sysfs.c?id=v3.13.0#n277
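
To illustrate the idea (this is only a rough Python sketch, not the
pciutils code): a device can be matched to a sysfs slot by comparing the
slot's address with the device's address minus its function number.  The
sketch assumes slot addresses have the form "domain:bus:device" and takes
the slot table as a plain dict; slots_for_device is a hypothetical name,
and the 0000 domain in the usage lines below is an assumption as well.

```python
def slots_for_device(slot_addresses, bdf):
    """Return the names of slots whose address matches the device.

    slot_addresses: {slot_name: "dddd:bb:dd"}, e.g. as read from the
                    "address" files under /sys/bus/pci/slots.
    bdf:            full device address such as "0000:33:00.0".
    The function number (the part after the last ".") is ignored,
    because a slot covers every function of the device plugged into it.
    """
    device_address = bdf.rsplit(".", 1)[0].lower()
    return sorted(
        name
        for name, address in slot_addresses.items()
        if address.strip().lower() == device_address
    )


# Example values loosely modelled on the team1 listing in this mail.
example = {"308": "0000:23:00", "0-6": "0000:33:00"}
print(slots_for_device(example, "0000:33:00.0"))
print(slots_for_device(example, "0000:21:00.0"))
```

A device with no matching slot entry simply gets no "Physical Slot:"
line under this scheme.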

From team1 lspci-vvv:

  20:01.1 Root Port to [bus 21-34] Slot #0
    21:00.0 Broadcom Switch Upstream Port to [bus 22-34]
      22:00.0 Broadcom Switch Downstream Port to [bus 23] Slot #308
        23:00.0 Broadcom Endpoint 02b2 "Physical Slot: 308"
      22:01.0 Broadcom Switch Downstream Port to [bus 24] Slot #306
        24:00.0 Broadcom Endpoint 02b2 "Physical Slot: 306"
      22:02.0 Broadcom Switch Downstream Port to [bus 25-2a] Slot #213
        25:00.0 Mellanox Endpoint MT2910 "Physical Slot: 213"
      22:03.0 Broadcom Switch Downstream Port to [bus 2b-30] Slot #203
        2b:00.0 Broadcom Endpoint 02b2 "Physical Slot: 203"
      22:04.0 Broadcom Switch Downstream Port to [bus 31-33] Slot #101
        31:00.0 AMD Switch Upstream Port to [bus 32-33] "Physical Slot: 101"
          32:00.0 AMD Switch Downstream Port to [bus 33] Slot #0
            33:00.0 AMD Endpoint 74a1 "Physical Slot: 0-6"

I don't know off the top of my head why lspci doesn't report a
"Physical Slot:" for 21:00.0.  I suppose the kernel didn't provide
something in /sys for it.

All the other "Physical Slot:" reports from lspci match the "Physical
Slot Number" from the PCIe Capability of the bridge leading to the
slot, *except* for 33:00.0.  In that case, the "Physical Slot Number"
from the bridge PCIe Capability is not unique.  Both 20:01.1 and
32:00.0 advertise Slot #0 there, so the kernel makes the sysfs slot
unique, e.g., "0-6".
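
As a sketch of how such names can arise (written from memory of the
kernel's slot code, so treat the exact scheme as an assumption): when the
requested name is already taken, a numeric suffix is appended and
incremented until the name is free.  In Python, with the hypothetical
helper make_unique_slot_name:

```python
def make_unique_slot_name(requested, existing):
    """Return requested, or requested-N for the first N that is unused.

    existing is the set of slot names already registered.  The "-N"
    suffix scheme is an assumption meant to illustrate names such as
    "0-1" or "0-6"; it is not copied from the kernel source.
    """
    candidate = requested
    suffix = 1
    while candidate in existing:
        candidate = f"{requested}-{suffix}"
        suffix += 1
    return candidate


registered = set()
for _ in range(3):
    name = make_unique_slot_name("0", registered)
    registered.add(name)
    print(name)
```

Under that assumption, a name like "0-6" would mean several other
bridges had already claimed slot number 0 before this one registered.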

From lspci_vvv_team2.txt:

  39:00.0 Broadcom Switch Downstream Port to [bus 3a] Slot #166
    3a:00.0 Samsung Endpoint NVMe "Physical Slot: 166"
  39:01.0 Broadcom Switch Downstream Port to [bus 3b-3d] Slot #24 "Physical Slot: 24"
    3b:00.0 AMD Switch Upstream Port to [bus 3c-3d]
      3c:00.0 AMD Switch Downstream Port to [bus 3d] Slot #0
        3d:00.0 AMD Endpoint MI300X
  39:02.0 Broadcom Switch Downstream Port to [bus 3e] Slot #39 "Physical Slot: 39"
    3e:00.0 Mellanox Endpoint

This seems strange to me.  For 39:01.0, lspci reports "Physical Slot:
24", but 39:01.0 is a Downstream Port that *leads* to a slot; it's not
a slot itself.  3b:00.0 is the device in that slot, and I think it
should have a slot number, but it doesn't.

Similarly, lspci reports "Physical Slot: 39" for 39:02.0, when it
should show 3e:00.0 being in slot 39.

I guess this team2 situation is what you're trying to understand?

Can you collect the complete dmesg log and output of "grep -r .
/sys/bus/pci/slots" for both team1 and team2?  We should be able to
puzzle out what's going on.  The dmesg logging will show which hotplug
drivers are in use and should have hints about slot numbering, and if
it doesn't, we may need to add some.

> 2. Another team is also using the AMD MI300 GPU, and I have found
> that the "lspci -xxx" output differs between our team (team 1) and
> their team (team 2). Our dump only lists config space up to 0xff,
> whereas theirs lists it up to 0xfff, which means they have
> additional content from 0x100 to 0xfff.

>   -> Is there any OS setting we can enable in order to see the
>   whole content?

I think "lspci -xxx" will only show you 0-0xff unless lspci is run as
root.

>   -> Is this additional content related to the physical slot
>   number, or does it have any impact on how it is shown?

I don't think so.  The Slot Capability/Status/Control registers are in
the PCIe Capability, which should be below 0xff.
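
If you want to check this from a raw config space dump, the following
Python sketch walks the standard capability list in the first 256 bytes,
finds the PCI Express Capability (ID 0x10), and extracts the Physical
Slot Number from bits 31:19 of the Slot Capabilities register at offset
0x14 within that capability.  The function names are made up for
illustration, and the input is assumed to be the raw bytes of config
space (for example, the contents of a device's sysfs "config" file).

```python
import struct

PCI_STATUS = 0x06            # low byte of the Status register
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express Capability ID
PCI_EXP_FLAGS = 0x02         # PCI Express Capabilities register
PCI_EXP_FLAGS_SLOT = 0x0100  # Slot Implemented bit
PCI_EXP_SLTCAP = 0x14        # Slot Capabilities register


def find_capability(cfg, cap_id):
    """Return the offset of capability cap_id in cfg, or None."""
    if len(cfg) < 256 or not cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    # Capabilities live at 0x40 or above; 'seen' guards against loops.
    while pos >= 0x40 and pos not in seen:
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def physical_slot_number(cfg):
    """Return the Physical Slot Number, or None if no slot is implemented."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None:
        return None
    (flags,) = struct.unpack_from("<H", cfg, cap + PCI_EXP_FLAGS)
    if not flags & PCI_EXP_FLAGS_SLOT:
        return None
    (sltcap,) = struct.unpack_from("<I", cfg, cap + PCI_EXP_SLTCAP)
    return (sltcap >> 19) & 0x1FFF
```

Run against a bridge's config space, this yields the same number lspci
prints as "Slot #..." in the bridge's PCIe Capability; an Endpoint
normally has no slot implemented, so the function returns None for it.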

Bjorn


* RE: Issue about PCI physical slot fetch incorrect number
  2024-08-29 16:35       ` Bjorn Helgaas
@ 2024-09-06  2:04         ` Erin_Tsao
  2024-09-18 14:09           ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Erin_Tsao @ 2024-09-06  2:04 UTC (permalink / raw)
  To: helgaas; +Cc: linux-pci, mj

[-- Attachment #1: Type: text/plain, Size: 8527 bytes --]

Hi Bjorn,
Sorry for the late response. We needed some time to collect the information with the other team, which is why I am getting back to you after almost a week.

[Bjorn] I don't know off the top of my head why lspci doesn't report a "Physical Slot:" for 21:00.0.  I suppose the kernel didn't provide something in /sys for it. All the other "Physical Slot:" reports from lspci match the "Physical Slot Number" from the PCIe Capability of the bridge leading to the slot, *except* for 33:00.0.  In that case, the "Physical Slot Number" from the bridge PCIe Capability is not unique.  Both 20:01.1 and 32:00.0 advertise Slot #0 there, so the kernel makes the sysfs slot unique, e.g., "0-6".

-> I think we are on the same page about this. I am wondering why 33:00.0, the GPU device, is the exception that does not get the correct physical slot number. As you said, the GPU device 33:00.0 displays physical slot number 0 because its bridge doesn't have a proper slot number.

[Bjorn] From lspci_vvv_team2.txt: This seems strange to me. For 39:01.0, lspci reports "Physical Slot: 24", but 39:01.0 is a Downstream Port that *leads* to a slot; it's not a slot itself.  3b:00.0 is the device in that slot, and I think it should have a slot number, but it doesn't. Similarly, lspci reports "Physical Slot: 39" for 39:02.0, when it should show 3e:00.0 being in slot 39. I guess this team2 situation is what you're trying to understand?

-> I understand your point. Yes, this situation is also something I want to clarify.

To be brief, there are two questions I would like answered:
1. For team1, all devices except the GPU fetch the correct physical slot number. Because the GPU's bridge doesn't have a slot number, the slot number becomes "0" by the time it reaches the end device, which I believe is not the slot number it should display.
2. For team2, the slot number shows on the downstream port instead of on the end device itself. Is this the reason why their GPU device doesn't have a slot number, because the slot number shows on the downstream port?

[Bjorn] Can you collect the complete dmesg log and output of "grep -r . /sys/bus/pci/slots" for both team1 and team2?  We should be able to puzzle out what's going on.  The dmesg logging will show which hotplug drivers are in use and should have hints about slot numbering, and if it doesn't, we may need to add some.

-> As you requested, we collected the dmesg logs from team1 and team2. Please help us look into them and advise what action we can take next to fix this issue.
If you need any further information, please feel free to tell me. I will do my best to get back to you as soon as possible.

I am sending both the dmesg logs and the slot files under /sys/bus/pci from team1 and team2. Please find the attachments for more details. I really appreciate your help and assistance.
Hope to hear from you soon.

BR,
Erin

[-- Attachment #2: dmesg_and_slot_team1.zip --]
[-- Type: application/x-zip-compressed, Size: 127413 bytes --]

[-- Attachment #3: dmesg_and_slot_team2.zip --]
[-- Type: application/x-zip-compressed, Size: 73987 bytes --]


* Re: Issue about PCI physical slot fetch incorrect number
  2024-09-06  2:04         ` Erin_Tsao
@ 2024-09-18 14:09           ` Bjorn Helgaas
  0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2024-09-18 14:09 UTC (permalink / raw)
  To: Erin_Tsao; +Cc: linux-pci, mj

On Fri, Sep 06, 2024 at 02:04:28AM +0000, Erin_Tsao@wistron.com wrote:
> ...

> I will send both dmesg and slot file under path /sys/bus/pci from
> team1 and team2 to you. Please find attachment for more details,
> really appreciate your help and assistance.

Evidently these don't match the lspci information you collected at
https://lore.kernel.org/r/dcd9ff2b16c44efca61189850fb0fe02@wistron.com

It's too complicated to try to put this together without a consistent
set of info.  Can you collect dmesg, lspci, and /sys/.../Slots info
that all match, please?

And point out the device that you think has the wrong slot name?

Bjorn

