All of lore.kernel.org
 help / color / mirror / Atom feed
* PCIe device issue since v6.1.16
@ 2023-09-19 14:17 Schroeder, Chad
  2023-09-19 14:59 ` Lukas Wunner
  0 siblings, 1 reply; 5+ messages in thread
From: Schroeder, Chad @ 2023-09-19 14:17 UTC (permalink / raw)
  To: bhelgaas@google.com; +Cc: linux-pci@vger.kernel.org, Schroeder, Chad

Recently, I discovered an issue with a PCIe device and recent kernels.

Background:

	Using a Linux host system, the PCIe device, a Torrent QN16e PCIe 16-128 Channel QAM Modulator, is passed through to a FreeBSD guest.

	Until recently, this was working as expected.

Issue:

	Starting with kernel v6.1.16, when the guest domain was started, the kernel would immediately report errors via vfio-pci.

	kernel: vfio-pci 0000:65:00.0: vfio_bar_restore: reset recovery - restoring BARs

	The guest would boot and its driver would load.  Then, when guest user-space, would access the device, a PCI system error (PERR) was raised,
	as reported by the impi event log, and the hardware itself would suffer a catastrophic event, cycling the server.

Discovery:

	After researching the issue, I found the commit that lead system error:

	https://lore.kernel.org/all/da77c92796b99ec568bd070cbe4725074a117038.1673769517.git.lukas@wunner.de/

	Specifically, this removal:

	- Drop an unnecessary 1 sec delay from pci_reset_secondary_bus() which
	is now performed by pci_bridge_wait_for_secondary_bus().  A static
	delay this long is only necessary for Conventional PCI, so modern
	PCIe systems benefit from shorter reset times as a side effect.

Resolution:

	I reintroduced the 1 second delay to pci_reset_secondary_bus, recompiled and installed and the system is now working as expected.

	void pci_reset_secondary_bus(struct pci_dev *dev)
	{
  	 u16 ctrl;

	   pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
	   ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
	   pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);

	   /*
	    * PCI spec v3.0 7.6.4.2 requires minimum Trst of 1ms.  Double
	    * this to 2ms to ensure that we meet the minimum requirement.
	    */
	   msleep(2);

	   ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
 	  pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);

	   ssleep(1);

	}

PCIe Device:

	https://www.videopropulsion.com/content/torrent-qn16e-pcie-16-128-channel-qam-modulator

$ lspci

0000:65:00.0 Multimedia controller: Genroco, Inc Device 0004 (rev 01)
	Subsystem: Genroco, Inc Device 0004
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 349
	NUMA node: 0
	IOMMU group: 1
	Region 0: Memory at e0e00000 (32-bit, non-prefetchable) [size=32K]
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75W
		DevCtl:	CorrErr- NonFatalErr- FatalErr+ UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #1, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
	Kernel driver in use: vfio-pci

Regards,

Chad Schroeder

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PCIe device issue since v6.1.16
  2023-09-19 14:17 PCIe device issue since v6.1.16 Schroeder, Chad
@ 2023-09-19 14:59 ` Lukas Wunner
  2023-09-19 15:08   ` Schroeder, Chad
  0 siblings, 1 reply; 5+ messages in thread
From: Lukas Wunner @ 2023-09-19 14:59 UTC (permalink / raw)
  To: Schroeder, Chad; +Cc: bhelgaas@google.com, linux-pci@vger.kernel.org

Hi Chad,

On Tue, Sep 19, 2023 at 02:17:29PM +0000, Schroeder, Chad wrote:
> 	After researching the issue, I found the commit that lead system error:
> 
> 	https://lore.kernel.org/all/da77c92796b99ec568bd070cbe4725074a117038.1673769517.git.lukas@wunner.de/
> 
> 	Specifically, this removal:
> 
> 	- Drop an unnecessary 1 sec delay from pci_reset_secondary_bus() which
> 	is now performed by pci_bridge_wait_for_secondary_bus().  A static
> 	delay this long is only necessary for Conventional PCI, so modern
> 	PCIe systems benefit from shorter reset times as a side effect.

Thanks for the report and sorry for the breakage.

This endpoint device only supports Gen1 speed (2.5GT/s) and does not support
Data Link Layer Link Active Reporting.  I have a suspicion that I neglected
to take this case into account in pci_bridge_wait_for_secondary_bus().

To better understand what's going on, could you also provide "lspci -vvv"
output of the parent bridge above 0000:65:00.0 (i.e. of the bridge whose
secondary bus is 65)?

Thanks!

Lukas

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: PCIe device issue since v6.1.16
  2023-09-19 14:59 ` Lukas Wunner
@ 2023-09-19 15:08   ` Schroeder, Chad
  2023-09-19 18:41     ` Lukas Wunner
  0 siblings, 1 reply; 5+ messages in thread
From: Schroeder, Chad @ 2023-09-19 15:08 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, Schroeder, Chad

Hi Lukas,

Thanks for the reply.  Here you go:

$ lspci

0000:64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 07) (prog-if 00 [Normal decode])
	Subsystem: Super Micro Computer Inc Device 0953
	Physical Slot: 2
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 30
	NUMA node: 0
	IOMMU group: 0
	Bus: primary=64, secondary=65, subordinate=65, sec-latency=0
	I/O behind bridge: f000-0fff [disabled] [16-bit]
	Memory behind bridge: e0e00000-e0efffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: Super Micro Computer Inc Device 0953
	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
		Address: fee00038  Data: 0000
		Masking: 00000002  Pending: 00000000
	Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	CorrErr- NonFatalErr- FatalErr+ UnsupReq-
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #5, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <16us
			ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #2, PowerLimit 75W; Interlock- NoCompl-
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Off, PwrInd On, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet+ LinkState+
		RootCap: CRSVisible+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal+ PMEIntEna+ CRSVisible+
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range BCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd+
			 AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [e0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
	Capabilities: [110 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	Capabilities: [148 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn- NFERptEn- FERptEn-
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
	Capabilities: [250 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: LaneErr at lane: 15
	Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
	Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
	Capabilities: [300 v1] Vendor Specific Information: ID=0008 Rev=0 Len=038 <?>
	Kernel driver in use: pcieport

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: PCIe device issue since v6.1.16
  2023-09-19 15:08   ` Schroeder, Chad
@ 2023-09-19 18:41     ` Lukas Wunner
  2023-09-20 15:36       ` Schroeder, Chad
  0 siblings, 1 reply; 5+ messages in thread
From: Lukas Wunner @ 2023-09-19 18:41 UTC (permalink / raw)
  To: Schroeder, Chad; +Cc: bhelgaas@google.com, linux-pci@vger.kernel.org

Hi Chad,

On Tue, Sep 19, 2023 at 03:08:08PM +0000, Schroeder, Chad wrote:
> 0000:64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 07) (prog-if 00 [Normal decode])
[...]
> 	Bus: primary=64, secondary=65, subordinate=65, sec-latency=0
[...]
> 	Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
[...]
> 		LnkCap:	Port #5, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <16us
> 			ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
> 		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 		LnkSta:	Speed 2.5GT/s, Width x1
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-

pci_bridge_secondary_bus_reset() calls pcibios_reset_secondary_bus()
to perform the SBR, then calls pci_bridge_wait_for_secondary_bus()
to perform the delays prescribed by the spec.

At the bottom of pci_bridge_wait_for_secondary_bus(), there's an if-else
clause and based on the lspci output quoted above, I'd expect that control
flow enters the else branch (because the SkyLake Root Port supports more
than 5 GT/s, namely 8 GT/s.

The function then calls pcie_wait_for_link_delay() to wait for the link
to come up, then waits 100 msec per PCIe r6.1 sec 6.6.1.

Afterwards it calls pci_dev_wait() to poll the Vendor ID register of the
VideoPropulsion card.

My guess is that the VideoPropulsion card requires a longer delay than
100 msec before its config space is first accessed.  The PCI system errors
that you mention are probably raised by the card because it is not ready
yet to handle those config space accesses.

Since this is a PCIe r1.0 card, I've checked whether PCIe r1.0 has
longer delays after reset than contemporary revisions of the PCIe
Base Spec.  But that's not the case.  PCIe r1.0 sec 7.6 says:

   "To allow components to perform internal initialization, system
    software must wait for at least 100 ms from the end of a reset
    (cold/warm/hot) before it is permitted to issue Configuration
    Requests to PCI Express devices
    [...]
    The Root Complex and/or system software must allow 1.0s (+50% / -0%)
    after a reset (hot/warm/cold), before it may determine that a device
    which fails to return a Successful Completion status for a valid
    Configuration Request is a broken device"

Those timing requirements are essentially identical to what contemporary
PCIe revisions prescribe.  It's also what the code in the kernel follows.

Which leads me to believe that the longer delay before the first config
space access required by this particular card is a quirk.  So I'm
proposing the following patch:


diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 20ac67d..3cbff71 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5993,3 +5993,9 @@ static void dpc_log_size(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f, dpc_log_size);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31, dpc_log_size);
 #endif
+
+static void pci_fixup_d3cold_delay_1sec(struct pci_dev *pdev)
+{
+	pdev->d3cold_delay = 1000;
+}
+DECLARE_PCI_FIXUP_FINAL(0x5555, 0x0004, pci_fixup_d3cold_delay_1sec);


Could you test if applying this on top of v6.1.16 fixes the issue?
(Apply with "patch -p1 < filename.patch" at the top of the kernel
source tree.)

Thanks,

Lukas

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: PCIe device issue since v6.1.16
  2023-09-19 18:41     ` Lukas Wunner
@ 2023-09-20 15:36       ` Schroeder, Chad
  0 siblings, 0 replies; 5+ messages in thread
From: Schroeder, Chad @ 2023-09-20 15:36 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, Schroeder, Chad

> Which leads me to believe that the longer delay before the first config
> space access required by this particular card is a quirk.  So I'm
> proposing the following patch:
> 
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 20ac67d..3cbff71 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5993,3 +5993,9 @@ static void dpc_log_size(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f, dpc_log_size);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31,
> dpc_log_size);
>  #endif
> +
> +static void pci_fixup_d3cold_delay_1sec(struct pci_dev *pdev)
> +{
> +       pdev->d3cold_delay = 1000;
> +}
> +DECLARE_PCI_FIXUP_FINAL(0x5555, 0x0004, pci_fixup_d3cold_delay_1sec);
> 
> 
> Could you test if applying this on top of v6.1.16 fixes the issue?
> (Apply with "patch -p1 < filename.patch" at the top of the kernel
> source tree.)
> 
> Thanks,
> 
> Lukas

I've applied the quirk patch to v6.5.4 and it does indeed elicit the same successful result as the original ssleep delay.

The device and server are functioning as expected.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-09-20 15:36 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-19 14:17 PCIe device issue since v6.1.16 Schroeder, Chad
2023-09-19 14:59 ` Lukas Wunner
2023-09-19 15:08   ` Schroeder, Chad
2023-09-19 18:41     ` Lukas Wunner
2023-09-20 15:36       ` Schroeder, Chad

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.