* Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in get_port_device_capability()
From: Bjorn Helgaas @ 2023-03-31 22:06 UTC
To: Ben Greear
Cc: Pali Rohár, Greg Kroah-Hartman, bjorn, LKML, stable,
Stefan Roese, Bjorn Helgaas, Rafael J. Wysocki,
Bharat Kumar Gogada, Michal Simek, Yao Hongbo, Naveen Naidu,
Sasha Levin, linux-pci, Gregory Greenman, Kalle Valo,
linux-wireless, netdev

[+cc iwlwifi folks]

Re: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in
get_port_device_capability()")

On Wed, Mar 29, 2023 at 04:17:29PM -0700, Ben Greear wrote:
> On 8/30/22 3:16 PM, Ben Greear wrote:
> ...
> I notice this patch appears to be in 6.2.6 kernel, and my kernel logs are
> full of spam and system is unstable. Possibly the unstable part is related
> to something else, but the log spam is definitely extreme.
>
> These systems are fairly stable on 5.19-ish kernels without the patch in
> question.

Hmmm, I was going to thank you for the report, but looking closer, I
see that you reported this last August [1] and we *should* have
pursued it with the iwlwifi folks or figured out what the PCI core is
doing wrong, but I totally dropped the ball. Sorry about that.

To make sure we're all on the same page, we're talking about
8795e182b02d ("PCI/portdrv: Don't disable AER reporting in
get_port_device_capability()") [2],
which is present in v6.0 and later [3] but not v5.19.16 [4].

> Here is sample of the spam:
>
> [ 1675.547023] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 1675.556851] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
> [ 1675.563904] pcieport 0000:03:02.0: [20] UnsupReq (First)
> [ 1675.569398] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
> [ 1675.576296] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
The TLP header says this is an LTR message from 05:00.0. Apparently
the bridge above 05:00.0 is 03:02.0, which logged an Unsupported
Request error for the message, probably because 03:02.0 doesn't have
LTR enabled.
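
For readers who want to check that reading of the log, here is a rough
sketch (not kernel code) that decodes the logged header. The field
positions follow the standard PCIe TLP header layout; the function and
dictionary key names are made up for illustration.

```python
# Sketch: decode the first two DWORDs of the TLP header from the AER log.
# Helper and key names are hypothetical; field offsets follow the PCIe
# TLP header layout for Message requests.

MSG_CODE_LTR = 0x10      # Message Code for Latency Tolerance Reporting
AER_UNSUP_REQ_BIT = 20   # "[20] UnsupReq" in the uncorrectable status


def decode_tlp_header(dwords):
    """Return the fields needed to identify a Message TLP and its sender."""
    dw0, dw1 = dwords[0], dwords[1]
    fmt = (dw0 >> 29) & 0x7         # 0b001 = 4 DW header, no data
    tlp_type = (dw0 >> 24) & 0x1F   # 0b10rrr = Message, rrr = routing
    req_id = (dw1 >> 16) & 0xFFFF   # bus[15:8] device[7:3] function[2:0]
    return {
        "fmt": fmt,
        "is_msg": (tlp_type >> 3) == 0b10,
        "routing": tlp_type & 0x7,
        "requester": "%02x:%02x.%x"
        % (req_id >> 8, (req_id >> 3) & 0x1F, req_id & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
        "msg_code": dw1 & 0xFF,
    }


# Values copied from the "TLP Header" and "error status" lines above.
hdr = decode_tlp_header([0x34000000, 0x05001F10, 0x00000000, 0x88C888C8])
status = 0x00100000

print(hdr["requester"], hex(hdr["msg_code"]), hdr["msg_code"] == MSG_CODE_LTR)
print("UnsupReq set:", bool(status & (1 << AER_UNSUP_REQ_BIT)))
```

The Requester ID in the second DWORD is what ties the message to
05:00.0, and the low byte of that DWORD is the message code.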

Can you collect the output of "sudo lspci -vv"? Does this happen even
before loading the iwlwifi driver? I assume there are no hotplug
events before this happens?

The PCI core enables LTR during enumeration for every device for which
LTR is supported and enabled along the entire path up to a Root Port.
If it does that wrong, you might see errors even before loading
iwlwifi.

I see that iwlwifi *reads* PCI_EXP_DEVCTL2_LTR_EN in
iwl_pcie_apm_config(), which should be safe. I don't see any writes,
but the iwlwifi experts should know more about this. There are a
couple of paths that send an LTR_CONFIG command when trans->ltr_enabled
is set, which look somehow related:

  __iwl_mvm_mac_start
    iwl_mvm_up
      iwl_mvm_config_ltr
        if (trans->ltr_enabled)
          iwl_mvm_send_cmd_pdu(mvm, LTR_CONFIG, ...)

Bjorn
[1] https://lore.kernel.org/all/47b775c5-57fa-5edf-b59e-8a9041ffbee7@candelatech.com/#t
[2] https://git.kernel.org/linus/8795e182b02d
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/portdrv_core.c?id=v6.0#n223
[4] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/pcie/portdrv_core.c?id=v5.19.16#n223
* Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in get_port_device_capability()
From: Bjorn Helgaas @ 2023-04-04 17:09 UTC
To: Ben Greear
Cc: Pali Rohár, Greg Kroah-Hartman, bjorn, LKML, stable,
Stefan Roese, Bjorn Helgaas, Rafael J. Wysocki,
Bharat Kumar Gogada, Michal Simek, Yao Hongbo, Naveen Naidu,
Sasha Levin, linux-pci, Gregory Greenman, Kalle Valo,
linux-wireless, netdev
On Fri, Mar 31, 2023 at 03:31:40PM -0700, Ben Greear wrote:
> On 3/31/23 15:06, Bjorn Helgaas wrote:
> > [+cc iwlwifi folks]
> >
> > Re: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in
> > get_port_device_capability()")
> >
> > On Wed, Mar 29, 2023 at 04:17:29PM -0700, Ben Greear wrote:
> > > On 8/30/22 3:16 PM, Ben Greear wrote:
> > > ...
> >
> > > I notice this patch appears to be in 6.2.6 kernel, and my kernel logs are
> > > full of spam and system is unstable. Possibly the unstable part is related
> > > to something else, but the log spam is definitely extreme.
> > >
> > > These systems are fairly stable on 5.19-ish kernels without the patch in
> > > question.
> >
> > Hmmm, I was going to thank you for the report, but looking closer, I
> > see that you reported this last August [1] and we *should* have
> > pursued it with the iwlwifi folks or figured out what the PCI core is
> > doing wrong, but I totally dropped the ball. Sorry about that.
> >
> > To make sure we're all on the same page, we're talking about
> > 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in
> > get_port_device_capability()") [2],
> > which is present in v6.0 and later [3] but not v5.19.16 [4].
>
> Yes, though I manually tried reverting that patch, and problem
> persisted, so maybe some secondary patch still enables whatever
> causes the issue.
>
> Booting with pci=noaer 'fixes' the problem for me, that is what I am
> running currently.
>
> > > Here is sample of the spam:
> > >
> > > [ 1675.547023] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > [ 1675.556851] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
> > > [ 1675.563904] pcieport 0000:03:02.0: [20] UnsupReq (First)
> > > [ 1675.569398] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
> > > [ 1675.576296] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
> >
> > The TLP header says this is an LTR message from 05:00.0. Apparently
> > the bridge above 05:00.0 is 03:02.0, which logged an Unsupported
> > Request error for the message, probably because 03:02.0 doesn't have
> > LTR enabled.
> Here is lspci, and please note that I am using a pcie -> 12x m.2
> adapter board, which is not common in the world. Possibly it is
> causing some of the problems with the AER logic (though, it is
> stable in 5.19 and lower. And a similar system with 2 of these
> adapter boards filled with 24 mtk7922 radios does not show the AER
> warnings or instability problems so far.)
>
> The lspci below is from a system with 12 ax210 radios, I have
> another with 24, it shows similar problems.
Interesting config. Somebody is definitely doing something wrong.
LTR is enabled at 00:1c.0 (which is fine), not supported and disabled
at 02:00.0 and 03:02.0 (also fine), but *enabled* at 05:00.0, which is
absolutely not fine because 03:02.0 won't know what to do with the LTR
messages and would log the AER errors you're seeing.
> 00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1) (prog-if 00 [Normal decode])
> Bus: primary=00, secondary=02, subordinate=0f, sec-latency=0
> DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd+
> AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
> 02:00.0 PCI bridge: PLX Technology, Inc. PEX 8619 16-lane, 16-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba) (prog-if 00 [Normal decode])
> Bus: primary=02, secondary=03, subordinate=0f, sec-latency=0
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> 03:02.0 PCI bridge: PLX Technology, Inc. PEX 8619 16-lane, 16-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba) (prog-if 00 [Normal decode])
> Bus: primary=03, secondary=05, subordinate=05, sec-latency=0
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported ARIFwd+
> AtomicOpsCap: Routing-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> 05:00.0 Network controller: Intel Corporation Device 2725 (rev 1a)
> DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR+, OBFF Via WAKE#
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis-, LTR+, OBFF Disabled
> AtomicOpsCtl: ReqEn-
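
The LTR bits in the lspci output quoted above can be pulled out
mechanically. The following is only a sketch: it assumes the usual
"lspci -vv" layout (an unindented "bus:dev.fn" line per device, then
indented DevCap2/DevCtl2 lines), and parse_ltr() is a made-up helper
name, not an existing tool.

```python
import re

# DevCap2/DevCtl2 lines copied from the lspci output quoted above.
LSPCI = """\
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1) (prog-if 00 [Normal decode])
\tDevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd+
\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8619 16-lane, 16-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba) (prog-if 00 [Normal decode])
\tDevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
03:02.0 PCI bridge: PLX Technology, Inc. PEX 8619 16-lane, 16-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba) (prog-if 00 [Normal decode])
\tDevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported ARIFwd+
\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
05:00.0 Network controller: Intel Corporation Device 2725 (rev 1a)
\tDevCap2: Completion Timeout: Range B, TimeoutDis+, LTR+, OBFF Via WAKE#
\tDevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis-, LTR+, OBFF Disabled
"""

DEV_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
LTR_RE = re.compile(r"\bLTR([+-])")


def parse_ltr(text):
    """Map each device to {'cap': LTR supported, 'ctl': LTR enabled}."""
    flags = {}
    dev = None
    for line in text.splitlines():
        m = DEV_RE.match(line)
        if m:
            dev = m.group(1)
            flags[dev] = {"cap": None, "ctl": None}
            continue
        line = line.strip()
        for key, prefix in (("cap", "DevCap2:"), ("ctl", "DevCtl2:")):
            if dev is not None and line.startswith(prefix):
                m = LTR_RE.search(line)
                if m:
                    flags[dev][key] = m.group(1) == "+"
    return flags


flags = parse_ltr(LSPCI)
for dev, f in flags.items():
    print(dev, "LTR supported:", f["cap"], "LTR enabled:", f["ctl"])
```

This only extracts the flags; deciding whether a given combination is
legal still needs the bridge topology from the "Bus:" lines.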
For 02:00.0 and 03:02.0, pci_configure_ltr() should bail out as soon
as it sees they don't support PCI_EXP_DEVCAP2_LTR, so they should
never have dev->ltr_path set. And pci_configure_ltr() should not set
PCI_EXP_DEVCTL2_LTR_EN for 05:00.0 since bridge->ltr_path is not set
for 03:02.0.
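
A toy model of that rule, for illustration only: this is not the
kernel's pci_configure_ltr(), it ignores anything firmware may have set
before the OS runs, and the Dev class and configure_ltr() function are
hypothetical names. It walks the path from the Root Port down and
enables LTR only where the device supports it and the bridge above it
is on an LTR-capable path.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    name: str
    ltr_cap: bool               # DevCap2 LTR+ / LTR-
    is_root_port: bool = False
    ltr_path: bool = False      # stands in for the idea of dev->ltr_path


def configure_ltr(path):
    """path is ordered from the Root Port down to the endpoint.

    Returns {device name: whether LTR should end up enabled}.
    """
    expected = {}
    upstream = None
    for dev in path:
        if not dev.ltr_cap:
            dev.ltr_path = False      # bail out: LTR not supported
        elif dev.is_root_port:
            dev.ltr_path = True       # top of the path
        else:
            # Only enable if the bridge above is on an LTR path.
            dev.ltr_path = upstream is not None and upstream.ltr_path
        expected[dev.name] = dev.ltr_path
        upstream = dev
    return expected


# Capabilities and observed DevCtl2 state taken from the lspci output.
path = [
    Dev("00:1c.0", ltr_cap=True, is_root_port=True),
    Dev("02:00.0", ltr_cap=False),
    Dev("03:02.0", ltr_cap=False),
    Dev("05:00.0", ltr_cap=True),
]
observed = {"00:1c.0": True, "02:00.0": False, "03:02.0": False, "05:00.0": True}

expected = configure_ltr(path)
mismatch = [name for name in expected if expected[name] != observed[name]]
print("expected:", expected)
print("mismatch:", mismatch)
```

Under this model the only device whose observed state disagrees with
the rule is 05:00.0, which is why something other than this path in the
PCI core (for example firmware) is a suspect.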
Can you collect the dmesg log when booted with "pci=earlydump"? I
wonder if BIOS could be enabling LTR on 05:00.0.
Bjorn
* Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in get_port_device_capability()
From: Ben Greear @ 2023-04-18 18:18 UTC
To: Bjorn Helgaas
Cc: Pali Rohár, Greg Kroah-Hartman, bjorn, LKML, stable,
Stefan Roese, Bjorn Helgaas, Rafael J. Wysocki,
Bharat Kumar Gogada, Michal Simek, Yao Hongbo, Naveen Naidu,
Sasha Levin, linux-pci, Gregory Greenman, Kalle Valo,
linux-wireless, netdev
The CC list in this email is huge, and the dmesg is also large. I'm going to send the file directly to Bjorn.
Please let me know if anyone wants to see it, or if I should just reply-all and paste it in...
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in get_port_device_capability()
From: Bjorn Helgaas @ 2023-04-18 20:26 UTC
To: Ben Greear
Cc: Pali Rohár, Greg Kroah-Hartman, bjorn, LKML, stable,
Stefan Roese, Bjorn Helgaas, Rafael J. Wysocki,
Bharat Kumar Gogada, Michal Simek, Yao Hongbo, Naveen Naidu,
Sasha Levin, linux-pci, Gregory Greenman, Kalle Valo,
linux-wireless, netdev
On Tue, Apr 18, 2023 at 11:18:58AM -0700, Ben Greear wrote:
> The CC list in this email is huge, and the dmesg is also large. I'm going to send the file directly to Bjorn.
> Please let me know if anyone wants to see it, or if I should just reply-all and paste it in...

Thanks, I got the dmesg log and attached it to this bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=217352

I tried to match the earlydump up with the lspci from
https://lore.kernel.org/r/4ff1397e-1d78-bc59-f577-e69024c4c4f3@candelatech.com
but it doesn't seem to match. Could they be from different systems or
different configs?

Could I trouble you to collect the "sudo lspci -vvxxx" output to match
the pci=earlydump log? (Or just collect both from the same system.)

Bjorn