* [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
@ 2011-09-06 17:36 Simon Kirby
From: Simon Kirby @ 2011-09-06 17:36 UTC
To: linux-kernel, Jon Mason, Jordan_Hargrave
Hello!
Since trying 3.1-rc4 on a few Dell servers, all of them have booted up
with the amber error LED lit. "ipmitool sel list" shows:
1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
I bisected this to:
b03e7495a862b028294f59fc87286d6d78ee7fa1 is the first bad commit
commit b03e7495a862b028294f59fc87286d6d78ee7fa1
Author: Jon Mason <mason@myri.com>
Date: Wed Jul 20 15:20:54 2011 -0500
PCI: Set PCI-E Max Payload Size on fabric
It sounds like this has caused other problems as well: http://www.spinics.net/lists/linux-scsi/msg54464.html
In this case, the 6 or so boxes I've seen the issue on are all PowerEdge 2950 servers.
Simon-
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Sven Schnelle @ 2011-09-07 16:22 UTC
To: Simon Kirby
Cc: linux-kernel, Jon Mason, Jordan_Hargrave
Simon Kirby <sim@hostway.ca> writes:
> Hello!
>
> Since trying 3.1-rc4 on a few Dell servers, all of them have booted up
> with the amber error LED lit. "ipmitool sel list" shows:
>
> 1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
> 2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
> 3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
> 4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
I'm seeing exactly the same issue on a Dell 1950 server. If anyone wants
me to try additional debugging or patches, feel free to send them.
Unfortunately I don't have the time or knowledge to debug this myself.
Sven
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Josh Boyer @ 2011-09-07 16:52 UTC
To: Sven Schnelle
Cc: Simon Kirby, linux-kernel, Jon Mason, Jordan_Hargrave, Jesse Barnes
On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org> wrote:
> I'm seeing exactly the same issue on a Dell 1950 server. If anyone wants
> me to try additional debugging or patches, feel free to send them.
> Unfortunately I don't have the time or knowledge to debug this myself.
I thought Jesse or Jon had a revert or partial fix queued up to send to
Linus, but I don't see anything in or post -rc5 yet. That was indicated in
https://bugzilla.kernel.org/show_bug.cgi?id=42162
Jesse, Jon?
josh
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Jesse Barnes @ 2011-09-07 17:44 UTC
To: Josh Boyer
Cc: Sven Schnelle, Simon Kirby, linux-kernel, Jon Mason, Jordan_Hargrave
On Wed, 7 Sep 2011 12:52:25 -0400 Josh Boyer <jwboyer@gmail.com> wrote:
> I thought Jesse or Jon had a revert or partial fix queued up to send
> to Linus, but I don't see anything in or post -rc5 yet. That was
> indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>
> Jesse, Jon?
kernel.org is still down and I haven't pushed anything to github. I
asked Jon to send his patch directly to Linus today instead.
Thanks,
Jesse
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Simon Kirby @ 2011-09-07 19:18 UTC
To: Jesse Barnes, Jon Mason
Cc: Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave
On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
> kernel.org is still down and I haven't pushed anything to github. I
> asked Jon to send his patch directly to Linus today instead.
FWIW, this patch didn't seem to fix it:
https://bugzilla.kernel.org/attachment.cgi?id=71222
dmesg used to say:

pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pci 0000:07:01.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:01.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.3: Dev MPS 128 MPSS 256 MRRS 256
pci 0000:06:00.3: Dev MPS 256 MPSS 256 MRRS 256
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci_bus 0000:00: on NUMA node 0

with the patch, I see:

pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:07:01.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:01.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:06:00.3: Dev MPS 128 MPSS 256 MRRS 256
pci 0000:06:00.3: Dev MPS 256 MPSS 256 MRRS 256
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:00:04.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci_bus 0000:00: on NUMA node 0

...later on...

PCI: max bus depth: 4 pci_try_num: 5
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pci 0000:08:00.0: PCI bridge to [bus 09-09]
pci 0000:08:00.0: bridge window [mem 0xf4000000-0xf7ffffff]
pci 0000:07:00.0: PCI bridge to [bus 08-09]
pci 0000:07:00.0: bridge window [mem 0xf4000000-0xf7ffffff]

...and the error still shows up in the IPMI SEL.
If I also add "pci=pcie_bus_safe", I _still_ get the same output and bus
error. Maybe this is two issues?

# lspci
00:00.0 Host bridge: Intel Corporation 5000X Chipset Memory Controller Hub (rev 12)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 12)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 12)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev 12)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 12)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev 12)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 12)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge
01:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge
02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 5
04:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 11)
06:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
06:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
07:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
07:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2)
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 11)
10:0d.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

# lspci -v -s 08:00.0
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=08, secondary=09, subordinate=09, sec-latency=64
    Memory behind bridge: f4000000-f7ffffff
    Capabilities: [60] Express PCI/PCI-X Bridge, MSI 00
    Capabilities: [90] PCI-X bridge device
    Capabilities: [b0] Power Management version 2

# lspci -v -v -s 08:00.0
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=08, secondary=09, subordinate=09, sec-latency=64
    Memory behind bridge: f4000000-f7ffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [60] Express (v1) PCI/PCI-X Bridge, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <16us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- BrConfRtry-
            MaxPayload 256 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 <4us, L1 <4us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [90] PCI-X bridge device
        Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- Freq=133MHz
        Status: Dev=08:00.0 64bit- 133MHz- SCD- USC- SCO- SRD-
        Upstream: Capacity=0 CommitmentLimit=0
        Downstream: Capacity=0 CommitmentLimit=0
    Capabilities: [b0] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Simon-
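
The lspci dump for 08:00.0 shows the mismatch directly: DevCap advertises MaxPayload 128 bytes while DevCtl is programmed to 256. As a rough sketch of how those fields decode, assuming the standard PCI Express layout (DevCap bits 2:0 for the supported payload, DevCtl bits 7:5 for the configured payload and bits 14:12 for the read request size, each encoded as 128 << field), something like the following would flag the condition. The macro and function names are local to this example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions follow the PCI Express Device Capabilities and
 * Device Control registers. Names are made up for this sketch. */
#define DEVCAP_MPSS_MASK   0x0007u  /* DevCap bits 2:0: payload size supported */
#define DEVCTL_MPS_MASK    0x00e0u  /* DevCtl bits 7:5: configured payload size */
#define DEVCTL_MPS_SHIFT   5
#define DEVCTL_MRRS_MASK   0x7000u  /* DevCtl bits 14:12: max read request size */
#define DEVCTL_MRRS_SHIFT  12

/* All three fields encode a byte count as 128 << field. */
static unsigned int decode_size(unsigned int field)
{
	return 128u << field;
}

unsigned int devcap_mpss_bytes(uint32_t devcap)
{
	return decode_size(devcap & DEVCAP_MPSS_MASK);
}

unsigned int devctl_mps_bytes(uint16_t devctl)
{
	return decode_size((devctl & DEVCTL_MPS_MASK) >> DEVCTL_MPS_SHIFT);
}

unsigned int devctl_mrrs_bytes(uint16_t devctl)
{
	return decode_size((devctl & DEVCTL_MRRS_MASK) >> DEVCTL_MRRS_SHIFT);
}

/* Returns 1 when the configured MPS is larger than the device advertises. */
int mps_exceeds_capability(uint32_t devcap, uint16_t devctl)
{
	return devctl_mps_bytes(devctl) > devcap_mpss_bytes(devcap);
}
```

Feeding it a DevCap of 0x0 and a DevCtl of 0x0020 (made-up raw values that decode to the 128/256/128 sizes lspci printed above) makes mps_exceeds_capability() return 1, which is the state the "MPS configured higher than maximum supported by the device" warning describes.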
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Simon Kirby @ 2011-09-07 20:47 UTC
To: Jesse Barnes, Jon Mason
Cc: Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave
On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
> FWIW, this patch didn't seem to fix it:
> https://bugzilla.kernel.org/attachment.cgi?id=71222
>
> dmesg used to say:
>
> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
> pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
stopped, but this made me realize that the pci=pcie_bus_safe option must
have been missing. It turns out I had hacked a custom grub entry to load
the newest kernel into grub instead of the one with the highest version
number (grumble), so the default kopt didn't apply there.
So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
MRRS-disabling patch makes no difference in this case.
Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
or make it not change where it would otherwise warn, or does that
basically make the thing useless?
Simon-
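
For readers following along, here is a minimal sketch of why pci=pcie_bus_safe avoids the warning above, assuming "safe" means "program every device with the largest MPS that all devices in the hierarchy support", i.e. the minimum MPSS along the path. The struct and function names are hypothetical, not kernel code; the sample values are the MPSS figures from the dmesg output for the 00:02.0 -> 06:00.0 -> 07:00.0 -> 08:00.0 chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device record for this sketch; not a kernel structure. */
struct dev_mps {
	const char *bdf;    /* bus:device.function, for reference only */
	unsigned int mpss;  /* largest payload the device supports, in bytes */
};

/* Assumed "safe" policy: the largest MPS that every device can accept
 * is the smallest MPSS in the list. Returns 0 for an empty list. */
unsigned int safe_mps(const struct dev_mps *devs, size_t n)
{
	unsigned int mps = 0;

	for (size_t i = 0; i < n; i++)
		if (mps == 0 || devs[i].mpss < mps)
			mps = devs[i].mpss;
	return mps;
}
```

With the four devices above (MPSS 256, 256, 256, 128) this yields 128, so the Broadcom bridge at 08:00.0 would never be asked to accept a 256-byte payload.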
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Jon Mason @ 2011-09-07 20:57 UTC
To: Simon Kirby
Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave
On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
> Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> stopped, but this made me realize that the pci=pcie_bus_safe option must
> have been missing. It turns out I had hacked a custom grub entry to load
> the newest kernel into grub instead of the one with the highest version
> number (grumble), so the default kopt didn't apply there.
>
> So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> MRRS-disabling patch makes no difference in this case.
>
> Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> or make it not change where it would otherwise warn, or does that
> basically make the thing useless?
I have a patch that makes pcie_bus_safe the default behavior and does not
modify the MRRS. Would you be willing to test this patch for me?
Thanks,
Jon
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
From: Simon Kirby @ 2011-09-07 20:58 UTC
To: Jon Mason
Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave
On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
> I have a patch that makes pcie_bus_safe the default behavior and does not
> modify the MRRS. Would you be willing to test this patch for me?
Sure, of course. (It compiles, ship it. :))
Simon-
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric" 2011-09-07 20:58 ` Simon Kirby @ 2011-09-07 21:10 ` Jon Mason 2011-09-07 21:33 ` Simon Kirby 2011-09-08 6:42 ` Sven Schnelle 0 siblings, 2 replies; 11+ messages in thread From: Jon Mason @ 2011-09-07 21:10 UTC (permalink / raw) To: Simon Kirby Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave [-- Attachment #1: Type: text/plain, Size: 3798 bytes --] On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote: > On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote: > >> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote: >> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote: >> > >> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote: >> >> >> >> > On Wed, 7 Sep 2011 12:52:25 -0400 >> >> > Josh Boyer <jwboyer@gmail.com> wrote: >> >> > >> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org> >> >> > > wrote: >> >> > > > Simon Kirby <sim@hostway.ca> writes: >> >> > > > >> >> > > >> Hello! >> >> > > >> >> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have >> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows: >> >> > > >> >> >> > > >> ?? ??1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log >> >> > > >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical >> >> > > >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 | >> >> > > >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown >> >> > > >> #0x1a | >> >> > > > >> >> > > > I'm seeing exact the same issue on a Dell 1950 Server. If anyone >> >> > > > wants me to try additional debugging/patches, feel free to do >> >> > > > so. Unfortunately i don't have the time/knowledge to debug that by >> >> > > > myself. 
>> >> > >
>> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>> >> > > to Linus, but I don't see anything in or post -rc5 yet. That was
>> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> >> > >
>> >> > > Jesse, Jon?
>> >> >
>> >> > kernel.org is still down and I haven't pushed anything to github. I
>> >> > asked Jon to send his patch directly to Linus today instead.
>> >>
>> >> FWIW, this patch didn't seem to fix it:
>> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
>> >>
>> >> dmesg used to say:
>> >>
>> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
>> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> >> Do you have a strange power saving mode enabled?
>> >> Dazed and confused, but trying to continue
>> >
>> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
>> > stopped, but this made me realize that the pci=pcie_bus_safe option must
>> > have been missing. It turns out I had hacked a custom grub entry to load
>> > the newest kernel into grub instead of the one with the highest version
>> > number (grumble), so the default kopt didn't apply there.
>> >
>> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
>> > MRRS-dissabling patch makes no difference in this case.
>> >
>> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
>> > or make it not change where it would otherwise warn, or does that
>> > basically make the thing useless?
>>
>> I have a patch that does does pcie_bus_safe as the default behavior
>> and does not modify the MRRS. Would you be willing to test this patch
>> for me?
>
> Sure, of course. (It compiles, ship it. :))

Great, thanks! I've attached a patch file to this e-mail.

Thanks,
Jon
>
> Simon-
>

[-- Attachment #2: test.patch --]
[-- Type: text/x-patch, Size: 3448 bytes --]

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;

-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;

 /*
  * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..847ee5d 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)

 static void pcie_write_mrrs(struct pci_dev *dev, int mps)
 {
-	int rc, mrrs;
+	int rc, mrrs, dev_mpss;

-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
+	/* In the "safe" case, do not configure the MRRS. There appear to be
+	 * issues with setting MRRS to 0 on a number of devices.
+	 */

-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value. However, it cannot be configured larger
-		 * than the MPS the device or the bus can support. This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
+	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+		return;
+
+	dev_mpss = 128 << dev->pcie_mpss;

+	/* For Max performance, the MRRS must be set to the largest supported
+	 * value. However, it cannot be configured larger than the MPS the
+	 * device or the bus can support. This assumes that the largest MRRS
+	 * available on the device cannot be smaller than the device MPSS.
+	 */
+	mrrs = min(mps, dev_mpss);

 	/* MRRS is a R/W register. Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
-	 * If the MRRS value provided is not acceptable (e.g., too large),
-	 * shrink the value until it is acceptable to the HW.
+	 * subsiquent read will verify if the value is acceptable or not. If
+	 * the MRRS value provided is not acceptable (eg, too large), shrink the
+	 * value until it is acceptable to the HW.
 	 */
 	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+			 " to %d. If any issues are encountered, please try "
+			 "running with pci=pcie_bus_safe\n", mrrs);
 		rc = pcie_set_readrq(dev, mrrs);
 		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
+			dev_err(&dev->dev,
+				"Failed attempting to set the MRRS\n");

 		mrrs /= 2;
 	}
@@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 	if (!pci_is_pcie(dev))
 		return 0;

-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));

 	pcie_write_mps(dev, mps);
 	pcie_write_mrrs(dev, mps);

-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));

 	return 0;

^ permalink raw reply related [flat|nested] 11+ messages in thread
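The size arithmetic in the patch above can be tried out in userspace. The sketch below is only an illustration, not kernel code: mpss_bytes() and mrrs_target() are hypothetical helper names, and it models just the "128 << pcie_mpss" decoding and the initial "min(mps, dev_mpss)" choice made in performance mode, not the register write-and-read-back loop.

```c
#include <assert.h>

/*
 * Decode an encoded MPSS field into bytes, as the patch does with
 * "128 << dev->pcie_mpss". Encodings 0..5 map to 128..4096 bytes.
 * (Hypothetical helper, not a kernel function.)
 */
static int mpss_bytes(int pcie_mpss)
{
	assert(pcie_mpss >= 0 && pcie_mpss <= 5);
	return 128 << pcie_mpss;
}

/*
 * Initial MRRS value the patched code aims for in performance mode:
 * the smaller of the MPS chosen for the bus and the device's own MPSS.
 * (Hypothetical helper; the kernel then tries to write this value and
 * checks it by reading the register back.)
 */
static int mrrs_target(int mps, int pcie_mpss)
{
	int dev_mpss = mpss_bytes(pcie_mpss);

	return mps < dev_mpss ? mps : dev_mpss;
}
```

For example, a device reporting MPSS 128 (encoding 0) on a bus configured for MPS 256 would get a target of 128, while a device reporting MPSS 256 (encoding 1) would get 256. With the patch applied and the default left at PCIE_BUS_SAFE, none of this runs, because pcie_write_mrrs() returns early.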
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
2011-09-07 21:10 ` Jon Mason
@ 2011-09-07 21:33 ` Simon Kirby
2011-09-08 6:42 ` Sven Schnelle
1 sibling, 0 replies; 11+ messages in thread
From: Simon Kirby @ 2011-09-07 21:33 UTC (permalink / raw)
To: Jon Mason
Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave

On Wed, Sep 07, 2011 at 02:10:59PM -0700, Jon Mason wrote:

> On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote:
> > On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
> >
> >> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
> >> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
> >> >
> >> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
> >> >>
> >> >> > On Wed, 7 Sep 2011 12:52:25 -0400
> >> >> > Josh Boyer <jwboyer@gmail.com> wrote:
> >> >> >
> >> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> >> >> > > wrote:
> >> >> > > > Simon Kirby <sim@hostway.ca> writes:
> >> >> > > >
> >> >> > > >> Hello!
> >> >> > > >>
> >> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> >> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> >> >> > > >>
> >> >> > > >>     1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log
> >> >> > > >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical
> >> >> > > >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 |
> >> >> > > >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown
> >> >> > > >> #0x1a |
> >> >> > > >
> >> >> > > > I'm seeing exact the same issue on a Dell 1950 Server. If anyone
> >> >> > > > wants me to try additional debugging/patches, feel free to do
> >> >> > > > so. Unfortunately i don't have the time/knowledge to debug that by
> >> >> > > > myself.
> >> >> > >
> >> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
> >> >> > > to Linus, but I don't see anything in or post -rc5 yet. That was
> >> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> >> >> > >
> >> >> > > Jesse, Jon?
> >> >> >
> >> >> > kernel.org is still down and I haven't pushed anything to github. I
> >> >> > asked Jon to send his patch directly to Linus today instead.
> >> >>
> >> >> FWIW, this patch didn't seem to fix it:
> >> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
> >> >>
> >> >> dmesg used to say:
> >> >>
> >> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
> >> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
> >> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
> >> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> >> >> Do you have a strange power saving mode enabled?
> >> >> Dazed and confused, but trying to continue
> >> >
> >> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> >> > stopped, but this made me realize that the pci=pcie_bus_safe option must
> >> > have been missing. It turns out I had hacked a custom grub entry to load
> >> > the newest kernel into grub instead of the one with the highest version
> >> > number (grumble), so the default kopt didn't apply there.
> >> >
> >> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> >> > MRRS-dissabling patch makes no difference in this case.
> >> >
> >> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> >> > or make it not change where it would otherwise warn, or does that
> >> > basically make the thing useless?
> >>
> >> I have a patch that does does pcie_bus_safe as the default behavior
> >> and does not modify the MRRS. Would you be willing to test this patch
> >> for me?
> >
> > Sure, of course. (It compiles, ship it. :))
>
> Great, thanks! I've attached a patch file to this e-mail.
>
> Thanks,
> Jon
> >
> > Simon-
> >
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 0ce6742..4e84fd4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
>  unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE;
>  unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
>
> -enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
> +enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
>
>  /*
>   * The default CLS is used if arch didn't set CLS explicitly and not
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8473727..847ee5d 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
>
>  static void pcie_write_mrrs(struct pci_dev *dev, int mps)
>  {
> -	int rc, mrrs;
> +	int rc, mrrs, dev_mpss;
>
> -	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
> -		int dev_mpss = 128 << dev->pcie_mpss;
> +	/* In the "safe" case, do not configure the MRRS. There appear to be
> +	 * issues with setting MRRS to 0 on a number of devices.
> +	 */
>
> -		/* For Max performance, the MRRS must be set to the largest
> -		 * supported value. However, it cannot be configured larger
> -		 * than the MPS the device or the bus can support. This assumes
> -		 * that the largest MRRS available on the device cannot be
> -		 * smaller than the device MPSS.
> -		 */
> -		mrrs = mps < dev_mpss ? mps : dev_mpss;
> -	} else
> -		/* In the "safe" case, configure the MRRS for fairness on the
> -		 * bus by making all devices have the same size
> -		 */
> -		mrrs = mps;
> +	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
> +		return;
> +
> +	dev_mpss = 128 << dev->pcie_mpss;
>
> +	/* For Max performance, the MRRS must be set to the largest supported
> +	 * value. However, it cannot be configured larger than the MPS the
> +	 * device or the bus can support. This assumes that the largest MRRS
> +	 * available on the device cannot be smaller than the device MPSS.
> +	 */
> +	mrrs = min(mps, dev_mpss);
>
>  	/* MRRS is a R/W register. Invalid values can be written, but a
> -	 * subsiquent read will verify if the value is acceptable or not.
> -	 * If the MRRS value provided is not acceptable (e.g., too large),
> -	 * shrink the value until it is acceptable to the HW.
> +	 * subsiquent read will verify if the value is acceptable or not. If
> +	 * the MRRS value provided is not acceptable (eg, too large), shrink the
> +	 * value until it is acceptable to the HW.
>  	 */
>  	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
> +		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
> +			 " to %d. If any issues are encountered, please try "
> +			 "running with pci=pcie_bus_safe\n", mrrs);
>  		rc = pcie_set_readrq(dev, mrrs);
>  		if (rc)
> -			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
> +			dev_err(&dev->dev,
> +				"Failed attempting to set the MRRS\n");
>
>  		mrrs /= 2;
>  	}
> @@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>  	if (!pci_is_pcie(dev))
>  		return 0;
>
> -	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
> +	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>
>  	pcie_write_mps(dev, mps);
>  	pcie_write_mrrs(dev, mps);
>
> -	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
> +	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>
>  	return 0;

Works for me (with no special cmdline).

Suggest spelling fix: s/subsiquent/subsequent/

Tested-by: Simon Kirby <sim@hostway.ca>

Simon-

^ permalink raw reply [flat|nested] 11+ messages in thread
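As a rough illustration of what the safe setting is meant to do, the dmesg lines quoted in this thread show device 0000:08:00.0 reporting MPSS 128 while being configured for MPS 256. The sketch below picks the largest MPS that every device can accept, which is simply the smallest MPSS in the list. It is a userspace toy under stated assumptions: safe_mps() is a hypothetical name, and treating the four quoted devices as one hierarchy is an assumption made only for the example, since the thread does not show the topology.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest MPS (in bytes) supported by every device in the list,
 * i.e. the minimum of the devices' MPSS values.
 * (Hypothetical helper, not the kernel's implementation.)
 */
static int safe_mps(const int *mpss, size_t n)
{
	int mps;
	size_t i;

	assert(mpss != NULL && n > 0);

	mps = mpss[0];
	for (i = 1; i < n; i++)
		if (mpss[i] < mps)
			mps = mpss[i];

	return mps;
}
```

Called with the "before" MPSS values from the quoted log, {256, 256, 256, 128}, this yields 128, a value the 0000:08:00.0 device also supports.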
* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
2011-09-07 21:10 ` Jon Mason
2011-09-07 21:33 ` Simon Kirby
@ 2011-09-08 6:42 ` Sven Schnelle
1 sibling, 0 replies; 11+ messages in thread
From: Sven Schnelle @ 2011-09-08 6:42 UTC (permalink / raw)
To: Jon Mason
Cc: Simon Kirby, Jesse Barnes, Josh Boyer, linux-kernel, Jordan_Hargrave

Jon Mason <mason@myri.com> writes:

> On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote:
>> On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
>>
>>> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
>>> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>>> >
>>> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>>> >>
>>> >> > On Wed, 7 Sep 2011 12:52:25 -0400
>>> >> > Josh Boyer <jwboyer@gmail.com> wrote:
>>> >> >
>>> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
>>> >> > > wrote:
>>> >> > > > Simon Kirby <sim@hostway.ca> writes:
>>> >> > > >
>>> >> > > >> Hello!
>>> >> > > >>
>>> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>>> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>>> >> > > >>
>>> >> > > >>     1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log
>>> >> > > >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical
>>> >> > > >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 |
>>> >> > > >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown
>>> >> > > >> #0x1a |
>>> >> > > >
>>> >> > > > I'm seeing exact the same issue on a Dell 1950 Server. If anyone
>>> >> > > > wants me to try additional debugging/patches, feel free to do
>>> >> > > > so. Unfortunately i don't have the time/knowledge to debug that by
>>> >> > > > myself.
>>> >> > >
>>> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>>> >> > > to Linus, but I don't see anything in or post -rc5 yet. That was
>>> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>> >> > >
>>> >> > > Jesse, Jon?
>>> >> >
>>> >> > kernel.org is still down and I haven't pushed anything to github. I
>>> >> > asked Jon to send his patch directly to Linus today instead.
>>> >>
>>> >> FWIW, this patch didn't seem to fix it:
>>> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
>>> >>
>>> >> dmesg used to say:
>>> >>
>>> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>>> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>>> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
>>> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>>> >> Do you have a strange power saving mode enabled?
>>> >> Dazed and confused, but trying to continue
>>> >
>>> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
>>> > stopped, but this made me realize that the pci=pcie_bus_safe option must
>>> > have been missing. It turns out I had hacked a custom grub entry to load
>>> > the newest kernel into grub instead of the one with the highest version
>>> > number (grumble), so the default kopt didn't apply there.
>>> >
>>> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
>>> > MRRS-dissabling patch makes no difference in this case.
>>> >
>>> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
>>> > or make it not change where it would otherwise warn, or does that
>>> > basically make the thing useless?
>>>
>>> I have a patch that does does pcie_bus_safe as the default behavior
>>> and does not modify the MRRS. Would you be willing to test this patch
>>> for me?
>>
>> Sure, of course. (It compiles, ship it. :))
>
> Great, thanks! I've attached a patch file to this e-mail.

Thanks, Jon. Works on my system (Dell 1950).

Tested-by: Sven Schnelle <svens@stackframe.org>

Regards

Sven

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2011-09-08 6:42 UTC | newest]

Thread overview: 11+ messages
2011-09-06 17:36 [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric" Simon Kirby
2011-09-07 16:22 ` Sven Schnelle
2011-09-07 16:52   ` Josh Boyer
2011-09-07 17:44     ` Jesse Barnes
2011-09-07 19:18       ` Simon Kirby
2011-09-07 20:47         ` Simon Kirby
2011-09-07 20:57           ` Jon Mason
2011-09-07 20:58             ` Simon Kirby
2011-09-07 21:10               ` Jon Mason
2011-09-07 21:33                 ` Simon Kirby
2011-09-08 6:42                 ` Sven Schnelle