* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVU4vv7=WjVUhuP3PJHdpnYqrgMPCmz-HnijEbhyxk54eQ@mail.gmail.com>
@ 2025-02-19 17:06 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-02-19 17:06 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
[+cc linux-acpi]
On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> Hi all,
>
> I am writing to seek assistance with an issue we are experiencing with
> a PCIe device (PLDA Device 5555) connected through PCI Express Root
> Port 1 to the host bridge.
>
> We have observed that after booting the system, the Base Address
> Register (BAR0) memory of this device gets reset to 0x0 after
> approximately one hour or more (the timing is inconsistent). This was
> verified using the lspci output and the setpci -s 01:00.0
> BASE_ADDRESS_0 command.
>
> To diagnose the issue, we checked the dmesg log, but it did not
> provide any relevant information. I then enabled dynamic debugging for
> the PCI subsystem (drivers/pci/*) and noticed the following messages
> related to ACPI hotplug in the dmesg log:
>
> [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> ...
> [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
> kernel.perf_event_max_sample_rate to 49000
> [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering
> kernel.perf_event_max_sample_rate to 37000
> [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> After these messages appear, reading the device BAR memory results in
> 0x0 instead of the expected value.
>
> I would like to understand the following:
>
> 1. What could be causing these hotplug_event debug messages?
This is an ACPI Notify event. Basically the platform is telling us to
re-enumerate the hierarchy below RP01 because a device might have been
added or removed.
Unfortunately the only real information we get is the ACPI device
(RP01) and the notification value (ACPI_NOTIFY_BUS_CHECK).
You could instrument acpiphp_check_bridge() to see what path we take.
The main paths look like enable_slot() or disable_slot(), but those
both include a pr_debug() that you apparently don't see.
A remove followed by add would definitely reset the device, including
its BARs. But you would normally see some messages related to
enumerating a new device.
If this doesn't help, try to reproduce the problem with a recent
kernel, e.g., v6.13, and post the complete dmesg log.
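
To see how irregular the Bus Check notifications are, the timestamps can
be pulled out of a saved dmesg log. The following is only a minimal
sketch in Python; the sample text is copied from the excerpt quoted
above, and the helper names bus_check_events() and intervals() are made
up for illustration.

```python
import re

# Sample lines copied from the dmesg excerpt quoted in this message.
DMESG = r"""
[ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
[ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
kernel.perf_event_max_sample_rate to 49000
[11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
[11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
"""

# Matches only the acpiphp "Bus check" debug line and captures the
# timestamp and the ACPI path of the notified device.
BUS_CHECK = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+ACPI: (\S+): acpiphp_glue: "
    r"Bus check in hotplug_event\(\)"
)


def bus_check_events(text):
    """Return a list of (timestamp_seconds, acpi_path) tuples."""
    events = []
    for line in text.splitlines():
        m = BUS_CHECK.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


def intervals(events):
    """Seconds elapsed between consecutive Bus Check events."""
    times = [t for t, _ in events]
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    found = bus_check_events(DMESG)
    for stamp, path in found:
        print(f"{stamp:14.6f}  {path}")
    print("gaps (s):", intervals(found))
```

Feeding it the complete log instead of the sample would show whether the
events cluster around anything else in dmesg, such as GPE activity.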
> 2. Why does this result in the BAR memory being reset?
> 3. How can we resolve this issue?
>
> I have verified that the issue occurs even without loading the driver
> for the PLDA Device 5555, so it does not appear to be related to the
> device driver.
>
> Any help or guidance on debugging this issue would be greatly appreciated.
>
> Thank you for your assistance.
>
> Best regards,
> Naveen
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVXDS_n7-XzHevMmAOhb-qCNsCBbE1Pym-zWybnOyjZWmw@mail.gmail.com>
@ 2025-02-24 17:33 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-02-24 17:33 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > Hi all,
> > >
> > > I am writing to seek assistance with an issue we are experiencing with
> > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > Port 1 to the host bridge.
> > >
> > > We have observed that after booting the system, the Base Address
> > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > approximately one hour or more (the timing is inconsistent). This was
> > > verified using the lspci output and the setpci -s 01:00.0
> > > BASE_ADDRESS_0 command.
> > >
> > > To diagnose the issue, we checked the dmesg log, but it did not
> > > provide any relevant information. I then enabled dynamic debugging for
> > > the PCI subsystem (drivers/pci/*) and noticed the following messages
> > > related to ACPI hotplug in the dmesg log:
> > >
> > > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> > > ...
> > > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
> > > kernel.perf_event_max_sample_rate to 49000
> > > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering
> > > kernel.perf_event_max_sample_rate to 37000
> > > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > After these messages appear, reading the device BAR memory results in
> > > 0x0 instead of the expected value.
> > >
> > > I would like to understand the following:
> > >
> > > 1. What could be causing these hotplug_event debug messages?
> >
> > This is an ACPI Notify event. Basically the platform is telling us to
> > re-enumerate the hierarchy below RP01 because a device might have been
> > added or removed.
>
> Thank you for your response regarding the PCI BAR reset issue we are
> experiencing with the PLDA Device 5555. I have a few follow-up
> questions and additional information to share.
>
> 1. Clarification on "Platform":
>
> Does the term "platform" refer to the BIOS/ACPI subsystem in this context?
Yes, "platform" refers to the BIOS/ACPI subsystem.
> Can the platform signal to re-enumerate the hierarchy below RP01
> without an actual device being removed or added? In our case, the PCI
> PLDA device is neither physically removed nor connected to the bus on
> the fly.
Yes, I think a Bus Check notification is just a request for the OS to
re-enumerate starting at the point in the device tree where it is
notified. It's possible that no add or remove has occurred. ACPI
r6.5, sec 5.6.6, includes the example of hardware that can't detect
device changes during a system sleep state, so it issues a Bus Check
on wake.
> 2. System Configuration:
>
> We are currently using an x86_64 system with Ubuntu 20.04.6 LTS
> (kernel version: 5.4.0-148-generic).
> I have enabled dynamic debug logs for all files in the PCI and ACPI
> subsystems and rebooted the system with the following parameters:
> $ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro
> quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on
> "dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p"
>
>
> 3. Observations:
>
> After rebooting with more debug logs, I noticed the issue after 1 day,
> 11:48 hours.
> A snippet of the dmesg log is mentioned below (complete dmesg log is
> attached to this email):
>
> [128845.248503] ACPI: GPE event 0x01
> [128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event
> [128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in
> hotplug_event()
If you could add more debug in hotplug_event() and the things it
calls, we might get more clues about what's happening.
> 4. BAR Reset Issue:
>
> I filtered the lspci output to show the contents of the configuration
> space starting at offset 0x10 for getting BASE_ADDRESS_0 by running
> sudo lspci -xxx -s 01:00.0 | grep "10:".
> Prior to the BAR reset issue, the lspci output was:
> $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> 10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00
>
> During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> showed all FF's, and then the next run of the same command showed
> BASE_ADDRESS_0 reset to zero:
> $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Looks like the device isn't responding at all here. Could happen if
the device is reset or powered down.
What is this device? What driver is bound to it? I don't see
anything in dmesg that identifies a driver.
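
The three states seen in the "10:" row (assigned, all ones, zero) can be
told apart mechanically. A minimal sketch in Python, assuming the row has
the "lspci -x" layout shown in this thread; parse_hex_row() and
bar0_state() are hypothetical helper names, not part of any tool.

```python
def parse_hex_row(row):
    """Parse one 'lspci -x' row such as '10: 00 00 40 b0 ...'.

    Returns (offset, data) where data is a bytes object.
    """
    offset_text, _, data_text = row.partition(":")
    return int(offset_text, 16), bytes(int(b, 16) for b in data_text.split())


def bar0_state(row):
    """Classify BAR0 from the config-space row that starts at offset 0x10."""
    offset, data = parse_hex_row(row)
    if offset != 0x10:
        raise ValueError("expected the row starting at offset 0x10")
    # Config space is little-endian; BAR0 is the first dword of this row.
    raw = int.from_bytes(data[0:4], "little")
    if raw == 0xFFFFFFFF:
        # A config read that gets no response typically returns all ones.
        return "no response (all ones)"
    if raw == 0:
        return "BAR0 is zero (unassigned)"
    if raw & 0x1:
        # Bit 0 set means an I/O BAR; the low two bits are flags.
        return f"I/O BAR at {raw & ~0x3:#x}"
    # Memory BAR: the low four bits are type and prefetch flags.
    return f"memory BAR at {raw & ~0xF:#010x}"


if __name__ == "__main__":
    for sample in (
        "10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00",
        "10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff",
        "10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    ):
        print(bar0_state(sample))
```

The first sample decodes to 0xb0400000, which agrees with the
"reg 0x10: [mem 0xb0400000-0xb07fffff]" line from boot.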
> $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> I am not sure why lspci initially showed all FF's and then the next
> run showed BAR0 reset.
> Complete sudo lspci -xxx -s 01:00.0 output is captured in the attached
> dmesg_log_pci_bar_reset.txt file.
>
> /sys/firmware/acpi/interrupts/gpe01: 1 EN enabled unmasked
> /sys/firmware/acpi/interrupts/gpe02: 1 EN enabled unmasked
>
>
> 5. Debugging Steps:
>
> Instrumenting acpiphp_check_bridge() will indicate whether we are
> enabling or disabling a slot (enable_slot() or disable_slot()). Based
> on the dmesg log, there is only one ACPI_NOTIFY_BUS_CHECK event, and
> it is most likely for disable_slot(). However, will instrumenting
> acpiphp_check_bridge() explain why this is happening without the
> PCI PLDA device actually being removed?
No, it won't explain that. But if there was no add/remove event,
re-enumeration should be harmless. The objective of instrumentation
would be to figure out why it isn't harmless in this case.
> 6. Reproduction and Additional Information:
>
> We do not see any clear pattern or procedure to reproduce this issue.
> Once the issue occurs, rebooting the machine resolves it, but it
> reoccurs after an unpredictable time.
> We have another identical hardware setup with an older kernel (Ubuntu
> 16.04.4 LTS, kernel version: 4.4.0-66-generic), and this issue has not
> been observed so far on that machine.
> Any additional pointers or suggestions on how to proceed to the root
> cause of this issue would be greatly appreciated.
You're seeing the problem on v5.4 (Nov 2019), which is much newer than
v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of
time on unless the problem still happens on a current kernel.
Bjorn
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVVV9tHH1M2bOnwqCJCQ8OjNFGjuQB7R-fY7JHHD5tQHoA@mail.gmail.com>
@ 2025-02-24 19:54 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-02-24 19:54 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > Hi all,
> > > > >
> > > > > I am writing to seek assistance with an issue we are experiencing with
> > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > > Port 1 to the host bridge.
> > > > >
> > > > > We have observed that after booting the system, the Base Address
> > > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > > approximately one hour or more (the timing is inconsistent). This was
> > > > > verified using the lspci output and the setpci -s 01:00.0
> > > > > BASE_ADDRESS_0 command.
> ...
> I booted with the pcie_aspm=off kernel parameter, which means that
> PCIe Active State Power Management (ASPM) is disabled. Given this
> context, should I consider removing this setting to see if it affects
> the occurrence of the Bus Check notifications and the BAR0 reset
> issue?
Doesn't seem likely to be related. Once configured, ASPM operates
without any software intervention. But note that "pcie_aspm=off"
means the kernel doesn't touch ASPM configuration at all, and any
configuration done by firmware remains in effect.
You can tell whether ASPM has been enabled by firmware with "sudo
lspci -vv" before the problem occurs.
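
For reference, the ASPM state and the +/- flags on the LnkCtl line can be
picked out of saved "lspci -vv" text. This is only a sketch in Python:
the sample is the LnkCtl text reported for this device elsewhere in the
thread, parse_lnkctl() is a hypothetical helper, and the input is assumed
to contain just the LnkCtl line and its continuation.

```python
import re

# LnkCtl text as reported for the device in this thread.
SAMPLE = """\
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
"""


def parse_lnkctl(text):
    """Return (aspm_setting, flags) from LnkCtl text.

    flags maps each 'Name+' / 'Name-' token to True / False.
    """
    m = re.search(r"LnkCtl:\s*ASPM ([^;]+);(.*)", text, re.S)
    if m is None:
        raise ValueError("no LnkCtl line found")
    aspm = m.group(1).strip()
    flags = {
        name: sign == "+"
        for name, sign in re.findall(r"(\w+)([+-])", m.group(2))
    }
    return aspm, flags


if __name__ == "__main__":
    aspm, flags = parse_lnkctl(SAMPLE)
    print("ASPM:", aspm)
    for name, enabled in flags.items():
        print(f"  {name}: {'set' if enabled else 'clear'}")
```

Anything other than "Disabled" in the ASPM field would mean firmware left
some ASPM state enabled on that link.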
> > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > > showed all FF's, and then the next run of the same command showed
> > > BASE_ADDRESS_0 reset to zero:
> > > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >
> > Looks like the device isn't responding at all here. Could happen if
> > the device is reset or powered down.
>
> From the kernel driver or user space tools, is it possible to
> determine whether the device has been reset or powered down? Are
> there any power management settings or configurations that could be
> causing the device to reset or power down unexpectedly?
Not really. By "powered down", I meant D3cold, where the main power
is removed. Config space is readable in all other power states.
> > What is this device? What driver is bound to it? I don't see
> > anything in dmesg that identifies a driver.
>
> The PCIe device in question is a Xilinx FPGA endpoint, which is
> flashed with RTL code to expose several host interfaces to the system
> via the PCIe link.
>
> We have an out-of-tree driver for this device, but to eliminate the
> driver's role in this issue, I renamed the driver to prevent it from
> loading automatically after rebooting the machine. Despite not using
> the driver, the issue still occurred.
Oh, right, I forgot that you mentioned this before.
> > You're seeing the problem on v5.4 (Nov 2019), which is much newer than
> > v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of
> > time on unless the problem still happens on a current kernel.
This part is important. We don't want to spend a lot of time
debugging an issue that may have already been fixed upstream.
Bjorn
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVX3X=DxLU0tfj4rG5WPaS5BCUDcMp2MYWBitT0ecEH+ig@mail.gmail.com>
@ 2025-02-25 20:38 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-02-25 20:38 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Tue, Feb 25, 2025 at 06:46:02PM +0530, Naveen Kumar P wrote:
> On Tue, Feb 25, 2025 at 1:24 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> > > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I am writing to seek assistance with an issue we are experiencing with
> > > > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > > > > Port 1 to the host bridge.
> > > > > > >
> > > > > > > We have observed that after booting the system, the Base Address
> > > > > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > > > > approximately one hour or more (the timing is inconsistent). This was
> > > > > > > verified using the lspci output and the setpci -s 01:00.0
> > > > > > > BASE_ADDRESS_0 command.
> >
> > > ...
> > > I booted with the pcie_aspm=off kernel parameter, which means that
> > > PCIe Active State Power Management (ASPM) is disabled. Given this
> > > context, should I consider removing this setting to see if it affects
> > > the occurrence of the Bus Check notifications and the BAR0 reset
> > > issue?
> >
> > Doesn't seem likely to be related. Once configured, ASPM operates
> > without any software intervention. But note that "pcie_aspm=off"
> > means the kernel doesn't touch ASPM configuration at all, and any
> > configuration done by firmware remains in effect.
> >
> > You can tell whether ASPM has been enabled by firmware with "sudo
> > lspci -vv" before the problem occurs.
> >
> > > > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > > > > showed all FF's, and then the next run of the same command showed
> > > > > BASE_ADDRESS_0 reset to zero:
> > > > > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > > > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > >
> > > > Looks like the device isn't responding at all here. Could happen if
> > > > the device is reset or powered down.
> > >
> > > From the kernel driver or user space tools, is it possible to
> > > determine whether the device has been reset or powered down? Are
> > > there any power management settings or configurations that could be
> > > causing the device to reset or power down unexpectedly?
> >
> > Not really. By "powered down", I meant D3cold, where the main power
> > is removed. Config space is readable in all other power states.
> >
> > > > What is this device? What driver is bound to it? I don't see
> > > > anything in dmesg that identifies a driver.
> > >
> > > The PCIe device in question is a Xilinx FPGA endpoint, which is
> > > flashed with RTL code to expose several host interfaces to the system
> > > via the PCIe link.
> > >
> > > We have an out-of-tree driver for this device, but to eliminate the
> > > driver's role in this issue, I renamed the driver to prevent it from
> > > loading automatically after rebooting the machine. Despite not using
> > > the driver, the issue still occurred.
> >
> > Oh, right, I forgot that you mentioned this before.
> >
> > > > You're seeing the problem on v5.4 (Nov 2019), which is much newer than
> > > > v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of
> > > > time on unless the problem still happens on a current kernel.
> >
> > This part is important. We don't want to spend a lot of time
> > debugging an issue that may have already been fixed upstream.
>
> Sure, I started building the 6.13 kernel and will post more
> information if I notice the issue on the 6.13 kernel.
>
> Regarding the CommClk- (Common Clock Configuration) bit, it indicates
> whether the common clock configuration is enabled or disabled. When it
> is set to CommClk-, it means that the common clock configuration is
> disabled.
>
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>
> For my device, I noticed that the common clock configuration is
> disabled. Could this be causing the BAR reset issue?
Not to my knowledge.
> How is the CommClk bit determined (set or cleared)? And is it okay to
> enable this bit after booting the kernel?
It is somewhere in drivers/pci/pcie/aspm.c, i.e.,
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/aspm.c?id=v6.13#n383
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVU2Xeh+3KsFK33GGLK7h59n9A_1RANdFV+ghGv39qcxPw@mail.gmail.com>
@ 2025-03-04 20:45 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-03-04 20:45 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Tue, Mar 04, 2025 at 01:35:14PM +0530, Naveen Kumar P wrote:
> On Fri, Feb 28, 2025 at 9:31 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Feb 26, 2025 at 06:28:33PM +0530, Naveen Kumar P wrote:
> > > On Wed, Feb 26, 2025 at 2:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Tue, Feb 25, 2025 at 06:46:02PM +0530, Naveen Kumar P wrote:
> > > > > On Tue, Feb 25, 2025 at 1:24 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> > > > > > > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > > > > > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > I am writing to seek assistance with an issue we are
> > > > > > > > > > > experiencing with a PCIe device (PLDA Device 5555)
> > > > > > > > > > > connected through PCI Express Root Port 1 to the
> > > > > > > > > > > host bridge.
> > > > > > > > > > >
> > > > > > > > > > > We have observed that after booting the system, the
> > > > > > > > > > > Base Address Register (BAR0) memory of this device
> > > > > > > > > > > gets reset to 0x0 after approximately one hour or
> > > > > > > > > > > more (the timing is inconsistent). This was verified
> > > > > > > > > > > using the lspci output and the setpci -s 01:00.0
> > > > > > > > > > > BASE_ADDRESS_0 command.
> > > > > > > ...
> >
> > > I have downloaded the 6.13 kernel source and added additional debug
> > > logs in hotplug_event(), then built the kernel. After that rebooted
> > > with the new kernel using the following parameters:
> > > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > > libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on "dyndbg=file
> > > drivers/pci/* +p; file drivers/acpi/* +p"
> >
> > Why "pci=nomsi"? I don't think that should make a difference. Also,
> > it contributes to the fact that Linux doesn't request OS control of
> > several features that it ordinarily does, so you end up in a somewhat
> > unusual state (which *should* still work, of course):
> >
> > acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments HPX-Type3]
> > acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
> >
> > Same for "pcie_aspm=off".
>
> I initially suspected that the PCI BAR reset was happening due to the
> device entering a low-power state, so I set pcie_aspm=off to prevent
> it.
ASPM never makes a device lose its state. It's completely invisible
from a software point of view.
> As per your suggestion, I instrumented the PCI configuration
> accessors to log all reads and writes to my device (01:00.0). The
> corresponding patch
> (0002-instrumented-the-PCI-config-accessors-to-log-all-the.patch) is
> attached to this email. After applying the patch and rebooting with
> the same boot parameters, the issue reproduced after 193890 seconds.
>
> The complete dmesg log (dmesg_march3rd_log.txt) is also attached.
> Could you check if this new log provides any useful clues?
> [193890.407810] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> [193890.407973] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in hotplug_event()
> [193890.408010] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [193890.408030] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in hotplug_event()
> [193890.408052] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4 data=0x55551556
> [193890.408095] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4 data=0x55551556
Looks perfectly fine. This is reading the Vendor and Device IDs.
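
The dword at config offset 0x00 holds the Vendor ID in the low 16 bits
and the Device ID in the high 16 bits. A small Python sketch of that
split, using the value from the log above; split_id_dword() and
device_responded() are names chosen here for illustration.

```python
def split_id_dword(dword):
    """Split the dword at config offset 0x00 into (vendor_id, device_id)."""
    vendor_id = dword & 0xFFFF           # bits 15:0
    device_id = (dword >> 16) & 0xFFFF   # bits 31:16
    return vendor_id, device_id


def device_responded(dword):
    """All ones is what a read returns when nothing answers."""
    return dword != 0xFFFFFFFF


if __name__ == "__main__":
    value = 0x55551556  # data= field from the "PCI READ" lines above
    vendor, device = split_id_dword(value)
    print(f"vendor={vendor:#06x} device={device:#06x} "
          f"responded={device_responded(value)}")
```

For 0x55551556 this gives vendor 0x1556 and device 0x5555, i.e. the
"PLDA Device 5555" from the original report.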
> [193890.408122] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in acpiphp_check_bridge()
> [193890.408184] ACPI: Device [PXSX] status [0000000f]
> [193890.408236] ACPI: Device [D015] status [0000000f]
> [193890.408305] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge in hotplug_event()
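
The "status [0000000f]" values can be decoded bit by bit, assuming they
are the _STA return values for PXSX and D015 (that is an assumption about
what the debug message prints). A minimal Python sketch using the _STA
bit definitions from the ACPI specification; decode_sta() is a
hypothetical helper.

```python
# _STA result bits as defined by the ACPI specification.
STA_BITS = {
    0: "present",
    1: "enabled",
    2: "shown in UI",
    3: "functioning",
    4: "battery present",
}


def decode_sta(status):
    """Return the names of the _STA bits that are set in status."""
    return [name for bit, name in STA_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    print(decode_sta(0x0000000F))
```

Under that assumption, 0x0000000f means the firmware reports both devices
as present, enabled, and functioning at the time of the Bus Check.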
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVVhdRjfVYZGg+0Yo6EV4P80No3kLxCL8+LyVjwywiWxYg@mail.gmail.com>
@ 2025-03-04 21:01 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-03-04 21:01 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Tue, Mar 04, 2025 at 10:19:07PM +0530, Naveen Kumar P wrote:
> On Tue, Mar 4, 2025 at 1:35 PM Naveen Kumar P
> <naveenkumar.parna@gmail.com> wrote:
> ...
> For this test run, I removed all three parameters (pcie_aspm=off,
> pci=nomsi, and pcie_ports=on) and booted with the following kernel
> command line arguments:
>
> cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> "dyndbg=file drivers/pci/* +p; file drivers/acpi/bus.c +p; file
> drivers/acpi/osl.c +p"
>
> This time, the issue occurred earlier, at 22998 seconds. Below is the
> relevant dmesg log during the ACPI_NOTIFY_BUS_CHECK event. The
> complete log is attached (dmesg_march4th_log.txt).
>
> [22998.536705] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> [22998.536753] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> event 0 for deferred handling
> [22998.536934] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in
> hotplug_event()
> [22998.536972] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> [22998.537002] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in
> hotplug_event()
> [22998.537024] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> data=0x55551556
> [22998.537066] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> data=0x55551556
Fine again.
> [22998.537094] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> acpiphp_check_bridge()
> [22998.537155] ACPI: Device [PXSX] status [0000000f]
> [22998.537206] ACPI: Device [D015] status [0000000f]
> [22998.537276] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge
> in hotplug_event()
>
> sudo lspci -xxx -s 01:00.0 | grep 10:
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Obviously a problem. Can you start including the whole
"lspci -x -s 01:00.0" output? Obviously the Vendor ID reads above
worked fine. I *assume* it's still fine here, and only the BARs are
zeroed out?
I assume you saw no new dmesg logs about config accesses to the device
before the lspci. If you instrumented the user config accessors
(pci_user_read_config_*(), also in access.c), you should see those
accesses.
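
Once the accessors are instrumented, the resulting log can be filtered
for accesses that land on the BARs. The sketch below is Python and rests
on assumptions: it expects the "PCI READ: ..." format produced by the
out-of-tree debug patch in this thread, and it assumes writes and user
accesses would be logged in the same layout with "WRITE" in place of
"READ". parse_accesses() and touches_bars() are hypothetical names.

```python
import re

# Sample lines in the format produced by the debug patch in this thread.
LOG = """\
[193890.408052] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4 data=0x55551556
[193890.408095] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4 data=0x55551556
"""

# "WRITE" is an assumed counterpart; only READ lines appear in the thread.
ACCESS = re.compile(
    r"PCI (READ|WRITE): res=(-?\d+), bus=([0-9a-f]{2}) dev=([0-9a-f]{2}) "
    r"func=(\d) pos=0x([0-9a-f]+) len=(\d+) data=0x([0-9a-f]+)"
)

# BAR0..BAR5 occupy offsets 0x10..0x27 in a type 0 config header.
BAR_START, BAR_END = 0x10, 0x28


def parse_accesses(text):
    """Return one dict per logged config access found in text."""
    accesses = []
    for m in ACCESS.finditer(text):
        kind, res, bus, dev, func, pos, length, data = m.groups()
        accesses.append({
            "kind": kind,
            "res": int(res),
            "bdf": f"{bus}:{dev}.{func}",
            "pos": int(pos, 16),
            "len": int(length),
            "data": int(data, 16),
        })
    return accesses


def touches_bars(access):
    """True if the access overlaps any byte in the BAR range."""
    return access["pos"] < BAR_END and access["pos"] + access["len"] > BAR_START


if __name__ == "__main__":
    hits = [a for a in parse_accesses(LOG) if touches_bars(a)]
    print(f"{len(hits)} access(es) to the BAR range")
```

With the two sample lines this reports no BAR accesses, since both reads
are at offset 0x00.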
You could sprinkle some calls to early_dump_pci_device() through the
acpiphp path. Turn off the kernel config access tracing when you do
this so it doesn't clutter things up.
What is this device? Is it a shipping product? Do you have good
confidence that the hardware is working correctly? I guess you said
it works correctly on a different machine with an older kernel. I
would swap the cards between machines in case one card is broken.
You could try bisecting between the working kernel and the broken one.
It's kind of painful since it takes so long to reproduce the problem.
Bjorn
* Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
[not found] <CAMciSVVu6qL6QV7KqLem2ZoRoW2T5a3s13EyKE-4SFGHDFfR4g@mail.gmail.com>
@ 2025-03-19 21:41 ` Bjorn Helgaas
0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2025-03-19 21:41 UTC (permalink / raw)
To: Naveen Kumar P; +Cc: linux-pci, linux-acpi, linux-kernel, kernelnewbies
On Wed, Mar 19, 2025 at 08:07:55PM +0530, Naveen Kumar P wrote:
> ...
> I am reaching out to follow up on the PCI BAR0 reset issue and its
> potential connection to the ACPI errors observed in my system running
> Linux kernel 6.13.0+.
> ...
Trying to finish up the last bits for the upcoming v6.15 merge window,
will come back to this later.
Bjorn