[BUG] pci: nwl: Unhandled AER correctable error

* [BUG] pci: nwl: Unhandled AER correctable error
@ 2025-08-01 17:43 Sean Anderson
  2025-08-04 20:20 ` Sean Anderson
  2025-08-04 20:57 ` Bjorn Helgaas
  0 siblings, 2 replies; 7+ messages in thread
From: Sean Anderson @ 2025-08-01 17:43 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci
  Cc: Rob Herring, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, Michal Simek, linux-arm-kernel, linux-kernel

Hi,

AER correctable errors are pretty rare. I only saw one once before and
came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
interrupt messages") in response. I saw another today and,
unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
not sufficient to handle the IRQ. It gets immediately re-raised,
preventing the system from making any other progress. I suspect that it
needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
gets delivered to aer_irq, those registers never get tickled.
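
A minimal userspace sketch of that suspicion, assuming the misc status
bit is only a summary that hardware re-latches while the root error
status is still pending. The struct, the helper names and the summary bit
position are hypothetical and only model the behaviour described above;
this is not the NWL register map and has not been checked against the
hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; names and the summary bit position are assumptions. */
#define MISC_CORR_AER (UINT32_C(1) << 18) /* summary bit in the misc status */
#define ROOT_COR_RCV  (UINT32_C(1) << 0)  /* "ERR_COR received" in root status */

struct model {
	uint32_t root_status; /* stands in for PCI_ERR_ROOT_STATUS */
	uint32_t misc_status; /* stands in for MSGF_MISC_STATUS */
};

/* Assumed hardware behaviour: the summary bit follows the pending source. */
static void model_propagate(struct model *m)
{
	if (m->root_status & ROOT_COR_RCV)
		m->misc_status |= MISC_CORR_AER;
}

/* A correctable error arrives and latches both levels. */
static void model_raise_correctable(struct model *m)
{
	m->root_status |= ROOT_COR_RCV;
	model_propagate(m);
}

/* Write-1-to-clear on the summary register only; the source may re-latch it. */
static void model_clear_misc(struct model *m, uint32_t bits)
{
	m->misc_status &= ~bits;
	model_propagate(m);
}

/* Write-1-to-clear on the root status, which is what aer_irq would do. */
static void model_clear_root(struct model *m, uint32_t bits)
{
	m->root_status &= ~bits;
}

/* The level-triggered line stays high while any summary bit is set. */
static bool model_irq_asserted(const struct model *m)
{
	return m->misc_status != 0;
}
```

In this model, clearing only the summary bit leaves the line asserted;
it drops once the root status is cleared first.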

The underlying problem is that pcieport thinks that the IRQ is going to
be one of the MSIs or a legacy interrupt, but it's actually a native
interrupt:

           CPU0       CPU1       CPU2       CPU3       
 42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
 45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
 46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
 47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
 48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
 49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
 50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4

In the above example, AER errors will trigger interrupt 42, not 45.
Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
so maybe nwl_pcie_misc_handler should be an interrupt controller
instead? But even then pcie_port_enable_irq_vec() won't figure out the
correct IRQ. Any ideas on how to fix this?
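
To make the "interrupt controller" idea concrete, here is a rough
userspace model of demultiplexing a status word into one handler per
bit. All names (misc_domain, misc_demux, count_hit) are made up for
illustration. In the kernel the equivalent would presumably be an
irq_domain with one generic_handle_domain_irq() call per set bit, and,
as noted above, that alone would still not make pcieport pick the right
IRQ.

```c
#include <stdint.h>

#define MISC_NR_BITS 32

/* Hypothetical per-bit handler type; not a kernel API. */
typedef void (*misc_handler_t)(unsigned int hwirq, void *data);

struct misc_domain {
	misc_handler_t handler[MISC_NR_BITS];
	void *data[MISC_NR_BITS];
};

/*
 * Call the handler registered for every set bit in status.
 * Returns the bits that no handler claimed, so the caller can log them.
 */
static uint32_t misc_demux(const struct misc_domain *d, uint32_t status)
{
	uint32_t unhandled = 0;

	for (unsigned int bit = 0; bit < MISC_NR_BITS; bit++) {
		if (!(status & (UINT32_C(1) << bit)))
			continue;
		if (d->handler[bit])
			d->handler[bit](bit, d->data[bit]);
		else
			unhandled |= UINT32_C(1) << bit;
	}
	return unhandled;
}

/* Example consumer: counts how often it was invoked. */
static void count_hit(unsigned int hwirq, void *data)
{
	(void)hwirq;
	*(unsigned int *)data += 1;
}
```

Registering count_hit on one bit and passing a status word with that bit
plus an unclaimed one runs the handler once and returns the unclaimed
bit.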

Additionally, any tips on actually triggering AER/PME stuff in a
consistent way? Are there any off-the-shelf cards for sending weird PCIe
stuff over a link for testing? Right now all I have 

--Sean

# lspci -vv
00:00.0 PCI bridge: Xilinx Corporation Device d011 (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 45
	Bus: primary=00, secondary=01, subordinate=0c, sec-latency=0
	I/O behind bridge: 00000000-00000fff [size=4K]
	Memory behind bridge: e0000000-e00fffff [size=1M]
	Prefetchable memory behind bridge: [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed 5GT/s, Width x2, ASPM not supported
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s (ok), Width x2 (ok)
			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [10c v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Vendor Specific Information: ID=1234 Rev=1 Len=018 <?>
	Capabilities: [140 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Kernel driver in use: pcieport



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-01 17:43 [BUG] pci: nwl: Unhandled AER correctable error Sean Anderson
@ 2025-08-04 20:20 ` Sean Anderson
  2025-08-04 20:57 ` Bjorn Helgaas
  1 sibling, 0 replies; 7+ messages in thread
From: Sean Anderson @ 2025-08-04 20:20 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Thomas Gleixner
  Cc: Rob Herring, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, Michal Simek, linux-arm-kernel, linux-kernel

On 8/1/25 13:43, Sean Anderson wrote:
> Hi,
> 
> AER correctable errors are pretty rare. I only saw one once before and
> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> interrupt messages") in response. I saw another today and,
> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> not sufficient to handle the IRQ. It gets immediately re-raised,
> preventing the system from making any other progress. I suspect that it
> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> gets delivered to aer_irq, those registers never get tickled.
> 
> The underlying problem is that pcieport thinks that the IRQ is going to
> be one of the MSIs or a legacy interrupt, but it's actually a native
> interrupt:
> 
>            CPU0       CPU1       CPU2       CPU3       
>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
> 
> In the above example, AER errors will trigger interrupt 42, not 45.
> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> so maybe nwl_pcie_misc_handler should be an interrupt controller
> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> correct IRQ. Any ideas on how to fix this?

OK, so as a first pass, maybe something like

	if (misc_stat & (MSGF_MISC_SR_FATAL_AER | MSGF_MISC_SR_NON_FATAL_AER |
			 MSGF_MISC_SR_CORR_AER))
		generic_handle_domain_irq(pcie->legacy_irq_domain, 0);

to simulate the correct IRQ. I have no idea whether it's safe to call
generic_handle_domain_irq in this context. It wasn't OK for AER (see
commit 9ae052253785 ("PCI/AER: Fix the broken interrupt injection")),
but maybe it's OK for us since the legacy irqchip doesn't support
affinity? I CC'd Thomas and maybe he can comment.
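
The mask test in that first pass can be tried outside the kernel. The
sketch below only covers the decision "does this status word contain any
AER bit"; the bit positions are assumptions for illustration (check
pcie-xilinx-nwl.c for the real definitions), and misc_stat_has_aer() is
a hypothetical helper, not something the driver provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define MSGF_MISC_SR_FATAL_AER     (UINT32_C(1) << 16)
#define MSGF_MISC_SR_NON_FATAL_AER (UINT32_C(1) << 17)
#define MSGF_MISC_SR_CORR_AER      (UINT32_C(1) << 18)

#define MISC_AER_MASK (MSGF_MISC_SR_FATAL_AER | \
		       MSGF_MISC_SR_NON_FATAL_AER | \
		       MSGF_MISC_SR_CORR_AER)

/* True if any of the three AER summary bits is set in misc_stat. */
static bool misc_stat_has_aer(uint32_t misc_stat)
{
	return (misc_stat & MISC_AER_MASK) != 0;
}
```

A status word with only an unrelated bit set is not forwarded, while any
of the three AER bits, alone or combined with other bits, is.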

Otherwise, maybe the best thing is to just add an API to manually trigger AER.

> Additionally, any tips on actually triggering AER/PME stuff in a
> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> stuff over a link for testing? Right now all I have 

But I still don't know how to test this. I can inject a misc interrupt
since the GIC supports irq_set_irqchip_state, but that won't really
simulate an AER interrupt since MSGF_MISC_STATUS won't have the right
bit set. Maybe I can wiggle a card around in its slot? Maybe PME or link
bandwidth notification could trigger this as well?

--Sean



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-01 17:43 [BUG] pci: nwl: Unhandled AER correctable error Sean Anderson
  2025-08-04 20:20 ` Sean Anderson
@ 2025-08-04 20:57 ` Bjorn Helgaas
  2025-08-04 22:10   ` Sean Anderson
  1 sibling, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2025-08-04 20:57 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Rob Herring,
	Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Michal Simek, Brian Norris, Minghuan Lian, Mingkai Hu, Roy Zang,
	Frank Li, Hou Zhiqiang, linux-arm-kernel, linux-kernel

[+cc more folks who might be interested in AER with non-standard
interrupts]

On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
> Hi,
> 
> AER correctable errors are pretty rare. I only saw one once before and
> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> interrupt messages") in response. I saw another today and,
> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> not sufficient to handle the IRQ. It gets immediately re-raised,
> preventing the system from making any other progress. I suspect that it
> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> gets delivered to aer_irq, those registers never get tickled.
> 
> The underlying problem is that pcieport thinks that the IRQ is going to
> be one of the MSIs or a legacy interrupt, but it's actually a native
> interrupt:
> 
>            CPU0       CPU1       CPU2       CPU3       
>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
> 
> In the above example, AER errors will trigger interrupt 42, not 45.
> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> so maybe nwl_pcie_misc_handler should be an interrupt controller
> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> correct IRQ. Any ideas on how to fix this?
> 
> Additionally, any tips on actually triggering AER/PME stuff in a
> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> stuff over a link for testing? Right now all I have 

This is definitely a problem.  We have had some discussion about this
in the past, but haven't quite achieved critical mass to solve this in
a generic way.  Here are some links:

  https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
  https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-04 20:57 ` Bjorn Helgaas
@ 2025-08-04 22:10   ` Sean Anderson
  2025-08-05 10:42     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Anderson @ 2025-08-04 22:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Rob Herring,
	Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Michal Simek, Brian Norris, Minghuan Lian, Mingkai Hu, Roy Zang,
	Frank Li, Hou Zhiqiang, linux-arm-kernel, linux-kernel

On 8/4/25 16:57, Bjorn Helgaas wrote:
> [+cc more folks who might be interested in AER with non-standard
> interrupts]
> 
> On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
>> Hi,
>> 
>> AER correctable errors are pretty rare. I only saw one once before and
>> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
>> interrupt messages") in response. I saw another today and,
>> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
>> not sufficient to handle the IRQ. It gets immediately re-raised,
>> preventing the system from making any other progress. I suspect that it
>> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
>> gets delivered to aer_irq, those registers never get tickled.
>> 
>> The underlying problem is that pcieport thinks that the IRQ is going to
>> be one of the MSIs or a legacy interrupt, but it's actually a native
>> interrupt:
>> 
>>            CPU0       CPU1       CPU2       CPU3       
>>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
>>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
>>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
>>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
>>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
>>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
>>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
>> 
>> In the above example, AER errors will trigger interrupt 42, not 45.
>> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
>> so maybe nwl_pcie_misc_handler should be an interrupt controller
>> instead? But even then pcie_port_enable_irq_vec() won't figure out the
>> correct IRQ. Any ideas on how to fix this?
>> 
>> Additionally, any tips on actually triggering AER/PME stuff in a
>> consistent way? Are there any off-the-shelf cards for sending weird PCIe
>> stuff over a link for testing? Right now all I have 
> 
> This is definitely a problem.  We have had some discussion about this
> in the past, but haven't quite achieved critical mass to solve this in
> a generic way.  Here are some links:
> 
>   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
>   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/

Thanks for the links. Toggling PERST does seem to reliably cause
correctable errors (however "correctable" they may actually be in
practice). With the patch I posted on the other branch of this chain I
now get

[   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
[   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
[   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr                 

Whether or not that's the right fix, at least I can test things :)
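
For reading the "error status/mask" pair in the log, a small decoder
helps. The bit positions follow the AER Correctable Error Status layout
in the PCIe spec; the short labels are in the style of lspci, and the
table and function names are invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cor_bit {
	uint32_t mask;
	const char *name;
};

/* AER Correctable Error Status bits. */
static const struct cor_bit cor_bits[] = {
	{ 0x00000001, "RxErr" },          /* Receiver Error */
	{ 0x00000040, "BadTLP" },
	{ 0x00000080, "BadDLLP" },
	{ 0x00000100, "Rollover" },       /* REPLAY_NUM Rollover */
	{ 0x00001000, "Timeout" },        /* Replay Timer Timeout */
	{ 0x00002000, "AdvNonFatalErr" },
	{ 0x00004000, "CorrIntErr" },
	{ 0x00008000, "HeaderOF" },       /* Header Log Overflow */
};

/* Errors that are set in status and not suppressed by mask. */
static uint32_t cor_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/*
 * Write the space-separated names of the set bits into buf.
 * Returns the number of names written; stops early if buf is full.
 */
static int cor_describe(uint32_t bits, char *buf, size_t len)
{
	size_t used = 0;
	int n = 0;

	if (len)
		buf[0] = '\0';
	for (size_t i = 0; i < sizeof(cor_bits) / sizeof(cor_bits[0]); i++) {
		int w;

		if (!(bits & cor_bits[i].mask))
			continue;
		w = snprintf(buf + used, len - used, "%s%s",
			     n ? " " : "", cor_bits[i].name);
		if (w < 0 || (size_t)w >= len - used)
			break;
		used += (size_t)w;
		n++;
	}
	return n;
}
```

With the values from the log, status 0x00000001 and mask 0x0000e000,
only bit 0 (RxErr) is set and unmasked; the mask covers bits 13 to 15.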

--Sean



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-04 22:10   ` Sean Anderson
@ 2025-08-05 10:42     ` Manivannan Sadhasivam
  2025-08-05 14:02       ` Sean Anderson
  0 siblings, 1 reply; 7+ messages in thread
From: Manivannan Sadhasivam @ 2025-08-05 10:42 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Rob Herring,
	Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Michal Simek, Brian Norris, Minghuan Lian, Mingkai Hu, Roy Zang,
	Frank Li, Hou Zhiqiang, linux-arm-kernel, linux-kernel

On Mon, Aug 04, 2025 at 06:10:48PM GMT, Sean Anderson wrote:
> On 8/4/25 16:57, Bjorn Helgaas wrote:
> > [+cc more folks who might be interested in AER with non-standard
> > interrupts]
> > 
> > On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
> >> Hi,
> >> 
> >> AER correctable errors are pretty rare. I only saw one once before and
> >> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> >> interrupt messages") in response. I saw another today and,
> >> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> >> not sufficient to handle the IRQ. It gets immediately re-raised,
> >> preventing the system from making any other progress. I suspect that it
> >> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> >> gets delivered to aer_irq, those registers never get tickled.
> >> 
> >> The underlying problem is that pcieport thinks that the IRQ is going to
> >> be one of the MSIs or a legacy interrupt, but it's actually a native
> >> interrupt:
> >> 
> >>            CPU0       CPU1       CPU2       CPU3       
> >>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
> >>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
> >>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
> >>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
> >>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
> >>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
> >>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
> >> 
> >> In the above example, AER errors will trigger interrupt 42, not 45.
> >> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> >> so maybe nwl_pcie_misc_handler should be an interrupt controller
> >> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> >> correct IRQ. Any ideas on how to fix this?
> >> 
> >> Additionally, any tips on actually triggering AER/PME stuff in a
> >> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> >> stuff over a link for testing? Right now all I have 
> > 
> > This is definitely a problem.  We have had some discussion about this
> > in the past, but haven't quite achieved critical mass to solve this in
> > a generic way.  Here are some links:
> > 
> >   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
> >   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/
> 
> Thanks for the links. Toggling PERST does seem to reliably cause
> correctable errors (however "correctable" they may actually be in
> practice). With the patch I posted on the other branch of this chain I
> now get
> 
> [   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
> [   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
> [   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr                 
> 
> Whether or not that's the right fix, at least I can test things :)

Could you please check if INTx is working for AER? You can just pass the cmdline
parameter, "pcie_pme=nomsi" and observe if the IRQ is getting triggered.

We have a desire to add platform IRQs for AER, but before doing that we need to
make sure that the platform doesn't support both MSI and INTx.

- Mani

-- 
மணிவண்ணன் சதாசிவம்



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-05 10:42     ` Manivannan Sadhasivam
@ 2025-08-05 14:02       ` Sean Anderson
  2025-08-05 17:30         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Anderson @ 2025-08-05 14:02 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Rob Herring,
	Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Michal Simek, Brian Norris, Minghuan Lian, Mingkai Hu, Roy Zang,
	Frank Li, Hou Zhiqiang, linux-arm-kernel, linux-kernel

On 8/5/25 06:42, Manivannan Sadhasivam wrote:
> On Mon, Aug 04, 2025 at 06:10:48PM GMT, Sean Anderson wrote:
>> On 8/4/25 16:57, Bjorn Helgaas wrote:
>> > [+cc more folks who might be interested in AER with non-standard
>> > interrupts]
>> > 
>> > On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
>> >> Hi,
>> >> 
>> >> AER correctable errors are pretty rare. I only saw one once before and
>> >> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
>> >> interrupt messages") in response. I saw another today and,
>> >> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
>> >> not sufficient to handle the IRQ. It gets immediately re-raised,
>> >> preventing the system from making any other progress. I suspect that it
>> >> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
>> >> gets delivered to aer_irq, those registers never get tickled.
>> >> 
>> >> The underlying problem is that pcieport thinks that the IRQ is going to
>> >> be one of the MSIs or a legacy interrupt, but it's actually a native
>> >> interrupt:
>> >> 
>> >>            CPU0       CPU1       CPU2       CPU3       
>> >>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
>> >>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
>> >>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
>> >>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
>> >>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
>> >>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
>> >>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
>> >> 
>> >> In the above example, AER errors will trigger interrupt 42, not 45.
>> >> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
>> >> so maybe nwl_pcie_misc_handler should be an interrupt controller
>> >> instead? But even then pcie_port_enable_irq_vec() won't figure out the
>> >> correct IRQ. Any ideas on how to fix this?
>> >> 
>> >> Additionally, any tips on actually triggering AER/PME stuff in a
>> >> consistent way? Are there any off-the-shelf cards for sending weird PCIe
>> >> stuff over a link for testing? Right now all I have 
>> > 
>> > This is definitely a problem.  We have had some discussion about this
>> > in the past, but haven't quite achieved critical mass to solve this in
>> > a generic way.  Here are some links:
>> > 
>> >   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
>> >   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/
>> 
>> Thanks for the links. Toggling PERST does seem to reliably cause
>> correctable errors (however "correctable" they may actually be in
>> practice). With the patch I posted on the other branch of this chain I
>> now get
>> 
>> [   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
>> [   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
>> [   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
>> [   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr                 
>> 
>> Whether or not that's the right fix, at least I can test things :)
> 
> Could you please check if INTx is working for AER? You can just pass the cmdline
> parameter, "pcie_pme=nomsi" and observe if the IRQ is getting triggered.

I don't really understand what you want me to check. As shown above, pme
and aer are already assigned to INTA, not an MSI. This of course never
gets triggered.

Figure 30-5 in UG1085 [1] shows the interrupt architecture, and I think
it's clear from that diagram that there's no pathway for root port
errors to trigger an MSI or a legacy interrupt.

--Sean

[1] https://docs.amd.com/api/khub/documents/xzMsp_c5sG9J6A3u7NkJYQ/content?Ft-Calling-App=ft%2Fturnkey-portal&Ft-Calling-App-Version=5.1.38#G32.381770

> We have a desire to add platform IRQs for AER, but before doing that we need to
> make sure that the platform doesn't support both MSI and INTx.
> 
> - Mani
> 



* Re: [BUG] pci: nwl: Unhandled AER correctable error
  2025-08-05 14:02       ` Sean Anderson
@ 2025-08-05 17:30         ` Manivannan Sadhasivam
  0 siblings, 0 replies; 7+ messages in thread
From: Manivannan Sadhasivam @ 2025-08-05 17:30 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, linux-pci, Rob Herring,
	Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
	Michal Simek, Brian Norris, Minghuan Lian, Mingkai Hu, Roy Zang,
	Frank Li, Hou Zhiqiang, linux-arm-kernel, linux-kernel

On Tue, Aug 05, 2025 at 10:02:39AM GMT, Sean Anderson wrote:
> On 8/5/25 06:42, Manivannan Sadhasivam wrote:
> > On Mon, Aug 04, 2025 at 06:10:48PM GMT, Sean Anderson wrote:
> >> On 8/4/25 16:57, Bjorn Helgaas wrote:
> >> > [+cc more folks who might be interested in AER with non-standard
> >> > interrupts]
> >> > 
> >> > On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
> >> >> Hi,
> >> >> 
> >> >> AER correctable errors are pretty rare. I only saw one once before and
> >> >> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> >> >> interrupt messages") in response. I saw another today and,
> >> >> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> >> >> not sufficient to handle the IRQ. It gets immediately re-raised,
> >> >> preventing the system from making any other progress. I suspect that it
> >> >> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> >> >> gets delivered to aer_irq, those registers never get tickled.
> >> >> 
> >> >> The underlying problem is that pcieport thinks that the IRQ is going to
> >> >> be one of the MSIs or a legacy interrupt, but it's actually a native
> >> >> interrupt:
> >> >> 
> >> >>            CPU0       CPU1       CPU2       CPU3       
> >> >>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
> >> >>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
> >> >>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
> >> >>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
> >> >>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
> >> >>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
> >> >>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
> >> >> 
> >> >> In the above example, AER errors will trigger interrupt 42, not 45.
> >> >> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> >> >> so maybe nwl_pcie_misc_handler should be an interrupt controller
> >> >> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> >> >> correct IRQ. Any ideas on how to fix this?
> >> >> 
> >> >> Additionally, any tips on actually triggering AER/PME stuff in a
> >> >> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> >> >> stuff over a link for testing? Right now all I have 
> >> > 
> >> > This is definitely a problem.  We have had some discussion about this
> >> > in the past, but haven't quite achieved critical mass to solve this in
> >> > a generic way.  Here are some links:
> >> > 
> >> >   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
> >> >   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/
> >> 
> >> Thanks for the links. Toggling PERST does seem to reliably cause
> >> correctable errors (however "correctable" they may actually be in
> >> practice). With the patch I posted on the other branch of this chain I
> >> now get
> >> 
> >> [   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
> >> [   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> >> [   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
> >> [   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr                 
> >> 
> >> Whether or not that's the right fix, at least I can test things :)
> > 
> > Could you please check if INTx is working for AER? You can just pass the cmdline
> > parameter, "pcie_pme=nomsi" and observe if the IRQ is getting triggered.
> 
> I don't really understand what you want me to check. As shown above, pme
> and aer are already assigned to INTA, not an MSI. This of course never
> gets triggered.
> 

Sorry, my bad. I misread the MSI interrupts assigned to NVMe queues as AER.

> Figure 30-5 in UG1085 [1] shows the interrupt architecture, and I think
> it's clear from that diagram that there's no pathway for root port
> errors to trigger an MSI or a legacy interrupt.
> 

Then we really need to hook aer_irq up to the platform interrupt with the help of
the controller driver. It is not at the top of my priority list, so someone with
the bandwidth and motivation should look into it.

- Mani

-- 
மணிவண்ணன் சதாசிவம்


