[3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"

public inbox for linux-kernel@vger.kernel.org

* [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
@ 2011-09-06 17:36 Simon Kirby
  2011-09-07 16:22 ` Sven Schnelle
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Kirby @ 2011-09-06 17:36 UTC (permalink / raw)
  To: linux-kernel, Jon Mason, Jordan_Hargrave

Hello!

Since trying 3.1-rc4 on a few Dell servers, all of them have booted up
with the amber error LED lit. "ipmitool sel list" shows:

   1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
   3 | 09/06/2011 | 17:25:38 | Unknown #0x1a | 
   4 | 09/06/2011 | 17:25:38 | Unknown #0x1a | 

I bisected this to:

b03e7495a862b028294f59fc87286d6d78ee7fa1 is the first bad commit
commit b03e7495a862b028294f59fc87286d6d78ee7fa1
Author: Jon Mason <mason@myri.com>
Date:   Wed Jul 20 15:20:54 2011 -0500

    PCI: Set PCI-E Max Payload Size on fabric

It sounds like this has caused other problems as well: http://www.spinics.net/lists/linux-scsi/msg54464.html    

In this case, the 6 or so boxes I've seen the issue on are all PowerEdge 2950 servers.

Simon-         

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-06 17:36 [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric" Simon Kirby
@ 2011-09-07 16:22 ` Sven Schnelle
  2011-09-07 16:52   ` Josh Boyer
  0 siblings, 1 reply; 11+ messages in thread
From: Sven Schnelle @ 2011-09-07 16:22 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-kernel, Jon Mason, Jordan_Hargrave

Simon Kirby <sim@hostway.ca> writes:

> Hello!
>
> Since trying 3.1-rc4 on a few Dell servers, all of them have booted up
> with the amber error LED lit. "ipmitool sel list" shows:
>
>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a | 
>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a | 

I'm seeing exactly the same issue on a Dell 1950 server. If anyone wants
me to try additional debugging or patches, feel free to send them.
Unfortunately I don't have the time or knowledge to debug this myself.

Sven    

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 16:22 ` Sven Schnelle
@ 2011-09-07 16:52   ` Josh Boyer
  2011-09-07 17:44     ` Jesse Barnes
  0 siblings, 1 reply; 11+ messages in thread
From: Josh Boyer @ 2011-09-07 16:52 UTC (permalink / raw)
  To: Sven Schnelle
  Cc: Simon Kirby, linux-kernel, Jon Mason, Jordan_Hargrave,
	Jesse Barnes

On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org> wrote:
> Simon Kirby <sim@hostway.ca> writes:
>
>> Hello!
>>
>> Since trying 3.1-rc4 on a few Dell servers, all of them have booted up
>> with the amber error LED lit. "ipmitool sel list" shows:
>>
>>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>
> I'm seeing exactly the same issue on a Dell 1950 server. If anyone wants
> me to try additional debugging or patches, feel free to send them.
> Unfortunately I don't have the time or knowledge to debug this myself.

I thought Jesse or Jon had a revert or partial fix queued up to send
to Linus, but I don't see anything in or post -rc5 yet.  That was
indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162

Jesse, Jon?

josh

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 16:52   ` Josh Boyer
@ 2011-09-07 17:44     ` Jesse Barnes
  2011-09-07 19:18       ` Simon Kirby
  0 siblings, 1 reply; 11+ messages in thread
From: Jesse Barnes @ 2011-09-07 17:44 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Sven Schnelle, Simon Kirby, linux-kernel, Jon Mason,
	Jordan_Hargrave

On Wed, 7 Sep 2011 12:52:25 -0400
Josh Boyer <jwboyer@gmail.com> wrote:

> On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> wrote:
> > Simon Kirby <sim@hostway.ca> writes:
> >
> >> Hello!
> >>
> >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> >>
> >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log
> >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical
> >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 |
> >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown
> >> #0x1a |
> >
> > I'm seeing exact the same issue on a Dell 1950 Server. If anyone
> > wants me to try additional debugging/patches, feel free to do
> > so. Unfortunately i don't have the time/knowledge to debug that by
> > myself.
> 
> I thought Jesse or Jon had a revert or partial fix queued up to send
> to Linus, but I don't see anything in or post -rc5 yet.  That was
> indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> 
> Jesse, Jon?

kernel.org is still down and I haven't pushed anything to github.  I
asked Jon to send his patch directly to Linus today instead.

Thanks,
Jesse

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 17:44     ` Jesse Barnes
@ 2011-09-07 19:18       ` Simon Kirby
  2011-09-07 20:47         ` Simon Kirby
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Kirby @ 2011-09-07 19:18 UTC (permalink / raw)
  To: Jesse Barnes, Jon Mason
  Cc: Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave

On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:

> On Wed, 7 Sep 2011 12:52:25 -0400
> Josh Boyer <jwboyer@gmail.com> wrote:
> 
> > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> > wrote:
> > > Simon Kirby <sim@hostway.ca> writes:
> > >
> > >> Hello!
> > >>
> > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> > >>
> > >> ?? ??1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log
> > >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical
> > >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 |
> > >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown
> > >> #0x1a |
> > >
> > > I'm seeing exact the same issue on a Dell 1950 Server. If anyone
> > > wants me to try additional debugging/patches, feel free to do
> > > so. Unfortunately i don't have the time/knowledge to debug that by
> > > myself.
> > 
> > I thought Jesse or Jon had a revert or partial fix queued up to send
> > to Linus, but I don't see anything in or post -rc5 yet.  That was
> > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> > 
> > Jesse, Jon?
> 
> kernel.org is still down and I haven't pushed anything to github.  I
> asked Jon to send his patch directly to Linus today instead.

FWIW, this patch didn't seem to fix it:
https://bugzilla.kernel.org/attachment.cgi?id=71222

dmesg used to say:

pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
Uhhuh. NMI received for unknown reason 21 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pci 0000:07:01.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:01.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.3: Dev MPS 128 MPSS 256 MRRS 256
pci 0000:06:00.3: Dev MPS 256 MPSS 256 MRRS 256
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci_bus 0000:00: on NUMA node 0

with the patch, I see:

pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:07:01.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:01.0: Dev MPS 256 MPSS 256 MRRS 4096
pci 0000:06:00.3: Dev MPS 128 MPSS 256 MRRS 256
pci 0000:06:00.3: Dev MPS 256 MPSS 256 MRRS 256
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:03.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.0: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:01:00.2: Dev MPS 256 MPSS 256 MRRS 512
pci 0000:00:04.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:04.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:05.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:06.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:07.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:00:1c.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:04:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci_bus 0000:00: on NUMA node 0
...later on...
PCI: max bus depth: 4 pci_try_num: 5
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
pci 0000:08:00.0: PCI bridge to [bus 09-09]
pci 0000:08:00.0:   bridge window [mem 0xf4000000-0xf7ffffff]
pci 0000:07:00.0: PCI bridge to [bus 08-09]
pci 0000:07:00.0:   bridge window [mem 0xf4000000-0xf7ffffff]

...and the error still shows up in the IPMI SEL. 
If I also add "pci=pcie_bus_safe", I _still_ get the same output and bus
error. Maybe this is two issues?

# lspci
00:00.0 Host bridge: Intel Corporation 5000X Chipset Memory Controller Hub (rev 12)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 12)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 12)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev 12)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 12)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev 12)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 12)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge
01:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge
02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 5
04:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 11)
06:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
06:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
07:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
07:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2)
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 11)
10:0d.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

# lspci -v -s 08:00.0
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=08, secondary=09, subordinate=09, sec-latency=64
        Memory behind bridge: f4000000-f7ffffff
        Capabilities: [60] Express PCI/PCI-X Bridge, MSI 00
        Capabilities: [90] PCI-X bridge device
        Capabilities: [b0] Power Management version 2

# lspci -v  -v -s 08:00.0
08:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c2) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Bus: primary=08, secondary=09, subordinate=09, sec-latency=64
        Memory behind bridge: f4000000-f7ffffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [60] Express (v1) PCI/PCI-X Bridge, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <16us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- BrConfRtry-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 <4us, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [90] PCI-X bridge device
                Secondary Status: 64bit+ 133MHz+ SCD- USC- SCO- SRD- Freq=133MHz
                Status: Dev=08:00.0 64bit- 133MHz- SCD- USC- SCO- SRD-
                Upstream: Capacity=0 CommitmentLimit=0
                Downstream: Capacity=0 CommitmentLimit=0
        Capabilities: [b0] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

Simon-
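
For readers following the lspci output above: the DevCap and DevCtl lines
for 08:00.0 show the mismatch directly. Below is a minimal C sketch of how
those fields decode, assuming the register layout from the PCI Express
specification (MPSS in Device Capabilities bits 2:0, MPS in Device Control
bits 7:5, MRRS in Device Control bits 14:12, each encoding 128 << value
bytes). The macro and function names are made up for illustration and are
not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; field positions assumed from the PCIe spec. */
#define EX_DEVCAP_MPSS_MASK   0x0007u  /* Device Capabilities bits 2:0 */
#define EX_DEVCTL_MPS_MASK    0x00E0u  /* Device Control bits 7:5      */
#define EX_DEVCTL_MPS_SHIFT   5
#define EX_DEVCTL_MRRS_MASK   0x7000u  /* Device Control bits 14:12    */
#define EX_DEVCTL_MRRS_SHIFT  12

/* All three fields encode a size in bytes as 128 << field. */
static unsigned int ex_field_to_bytes(unsigned int field)
{
	return 128u << field;
}

/* Largest payload the device claims it can accept (MPSS). */
unsigned int devcap_mpss_bytes(uint32_t devcap)
{
	return ex_field_to_bytes(devcap & EX_DEVCAP_MPSS_MASK);
}

/* Payload size currently programmed into the device (MPS). */
unsigned int devctl_mps_bytes(uint16_t devctl)
{
	return ex_field_to_bytes((devctl & EX_DEVCTL_MPS_MASK) >> EX_DEVCTL_MPS_SHIFT);
}

/* Maximum read request size currently programmed (MRRS). */
unsigned int devctl_mrrs_bytes(uint16_t devctl)
{
	return ex_field_to_bytes((devctl & EX_DEVCTL_MRRS_MASK) >> EX_DEVCTL_MRRS_SHIFT);
}

/* True when the programmed MPS is larger than the device supports. */
bool mps_exceeds_mpss(uint32_t devcap, uint16_t devctl)
{
	return devctl_mps_bytes(devctl) > devcap_mpss_bytes(devcap);
}
```

With an MPSS field of 0 (128 bytes) and a Device Control value of 0x0020
(MPS field 1, 256 bytes; MRRS field 0, 128 bytes), mps_exceeds_mpss()
returns true, which matches what lspci reports for 08:00.0.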

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 19:18       ` Simon Kirby
@ 2011-09-07 20:47         ` Simon Kirby
  2011-09-07 20:57           ` Jon Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Kirby @ 2011-09-07 20:47 UTC (permalink / raw)
  To: Jesse Barnes, Jon Mason
  Cc: Josh Boyer, Sven Schnelle, linux-kernel, Jordan_Hargrave

On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:

> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
> 
> > On Wed, 7 Sep 2011 12:52:25 -0400
> > Josh Boyer <jwboyer@gmail.com> wrote:
> > 
> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> > > wrote:
> > > > Simon Kirby <sim@hostway.ca> writes:
> > > >
> > > >> Hello!
> > > >>
> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> > > >>
> > > >> ?? ??1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log
> > > >> area reset/cleared | Asserted 2 | 09/06/2011 | 17:25:38 | Critical
> > > >> Interrupt #0x18 | Bus Fatal Error | Asserted 3 | 09/06/2011 |
> > > >> 17:25:38 | Unknown #0x1a | 4 | 09/06/2011 | 17:25:38 | Unknown
> > > >> #0x1a |
> > > >
> > > > I'm seeing exact the same issue on a Dell 1950 Server. If anyone
> > > > wants me to try additional debugging/patches, feel free to do
> > > > so. Unfortunately i don't have the time/knowledge to debug that by
> > > > myself.
> > > 
> > > I thought Jesse or Jon had a revert or partial fix queued up to send
> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> > > 
> > > Jesse, Jon?
> > 
> > kernel.org is still down and I haven't pushed anything to github.  I
> > asked Jon to send his patch directly to Linus today instead.
> 
> FWIW, this patch didn't seem to fix it:
> https://bugzilla.kernel.org/attachment.cgi?id=71222
> 
> dmesg used to say:
> 
> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue

Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
stopped, but this made me realize that the pci=pcie_bus_safe option must
have been missing. It turns out I had hacked a custom grub entry to load
the newest kernel into grub instead of the one with the highest version
number (grumble), so the default kopt didn't apply there.

So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
MRRS-disabling patch makes no difference in this case.

Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
or make it not change where it would otherwise warn, or does that
basically make the thing useless?

Simon-
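
The effect of the "safe" setting can be sketched in a few lines of C. This
assumes the safe policy means giving every device the largest MPS that all
devices on the path support, i.e. the minimum MPSS. The function name and
the array-of-bytes interface are hypothetical, not kernel code.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the largest payload size (in bytes) that
 * every device in mpss_bytes[] can accept, i.e. the minimum MPSS.
 * Callers are expected to pass at least one device; with n == 0 the
 * function returns 4096, the largest size the MPS field can encode.
 */
unsigned int safe_mps_bytes(const unsigned int *mpss_bytes, size_t n)
{
	unsigned int mps = 4096;

	for (size_t i = 0; i < n; i++) {
		if (mpss_bytes[i] < mps)
			mps = mpss_bytes[i];
	}
	return mps;
}
```

Fed the MPSS values dmesg printed for 00:02.0, 06:00.0, 07:00.0 and
08:00.0 (256, 256, 256, 128; assuming these sit on one path), this yields
128, so 08:00.0 would never be asked to accept a 256-byte payload.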

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 20:47         ` Simon Kirby
@ 2011-09-07 20:57           ` Jon Mason
  2011-09-07 20:58             ` Simon Kirby
  0 siblings, 1 reply; 11+ messages in thread
From: Jon Mason @ 2011-09-07 20:57 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel,
	Jordan_Hargrave

On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
> On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>
>> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>>
>> > On Wed, 7 Sep 2011 12:52:25 -0400
>> > Josh Boyer <jwboyer@gmail.com> wrote:
>> >
>> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
>> > > wrote:
>> > > > Simon Kirby <sim@hostway.ca> writes:
>> > > >
>> > > >> Hello!
>> > > >>
>> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>> > > >>
>> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> > > >
>> > > > I'm seeing exactly the same issue on a Dell 1950 server. If anyone
>> > > > wants me to try additional debugging or patches, feel free to send
>> > > > them. Unfortunately I don't have the time or knowledge to debug this
>> > > > myself.
>> > >
>> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
>> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> > >
>> > > Jesse, Jon?
>> >
>> > kernel.org is still down and I haven't pushed anything to github.  I
>> > asked Jon to send his patch directly to Linus today instead.
>>
>> FWIW, this patch didn't seem to fix it:
>> https://bugzilla.kernel.org/attachment.cgi?id=71222
>>
>> dmesg used to say:
>>
>> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
>> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> Do you have a strange power saving mode enabled?
>> Dazed and confused, but trying to continue
>
> Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> stopped, but this made me realize that the pci=pcie_bus_safe option must
> have been missing. It turns out I had hacked a custom grub entry to load
> the newest kernel into grub instead of the one with the highest version
> number (grumble), so the default kopt didn't apply there.
>
> So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> MRRS-disabling patch makes no difference in this case.
>
> Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> or make it not change where it would otherwise warn, or does that
> basically make the thing useless?

I have a patch that makes pcie_bus_safe the default behavior
and does not modify the MRRS.  Would you be willing to test this patch
for me?

Thanks,
Jon

>
> Simon-
>

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 20:57           ` Jon Mason
@ 2011-09-07 20:58             ` Simon Kirby
  2011-09-07 21:10               ` Jon Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Kirby @ 2011-09-07 20:58 UTC (permalink / raw)
  To: Jon Mason
  Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel,
	Jordan_Hargrave

On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:

> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
> >
> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
> >>
> >> > On Wed, 7 Sep 2011 12:52:25 -0400
> >> > Josh Boyer <jwboyer@gmail.com> wrote:
> >> >
> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> >> > > wrote:
> >> > > > Simon Kirby <sim@hostway.ca> writes:
> >> > > >
> >> > > >> Hello!
> >> > > >>
> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> >> > > >>
> >> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
> >> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
> >> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
> >> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
> >> > > >
> >> > > > I'm seeing exactly the same issue on a Dell 1950 server. If anyone
> >> > > > wants me to try additional debugging or patches, feel free to send
> >> > > > them. Unfortunately I don't have the time or knowledge to debug this
> >> > > > myself.
> >> > >
> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
> >> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> >> > >
> >> > > Jesse, Jon?
> >> >
> >> > kernel.org is still down and I haven't pushed anything to github.  I
> >> > asked Jon to send his patch directly to Linus today instead.
> >>
> >> FWIW, this patch didn't seem to fix it:
> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
> >>
> >> dmesg used to say:
> >>
> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> >> Do you have a strange power saving mode enabled?
> >> Dazed and confused, but trying to continue
> >
> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> > stopped, but this made me realize that the pci=pcie_bus_safe option must
> > have been missing. It turns out I had hacked a custom grub entry to load
> > the newest kernel into grub instead of the one with the highest version
> > number (grumble), so the default kopt didn't apply there.
> >
> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> > MRRS-disabling patch makes no difference in this case.
> >
> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> > or make it not change where it would otherwise warn, or does that
> > basically make the thing useless?
> 
> I have a patch that makes pcie_bus_safe the default behavior
> and does not modify the MRRS.  Would you be willing to test this patch
> for me?

Sure, of course. (It compiles, ship it. :))

Simon-

* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 20:58             ` Simon Kirby
@ 2011-09-07 21:10               ` Jon Mason
  2011-09-07 21:33                 ` Simon Kirby
  2011-09-08  6:42                 ` Sven Schnelle
  0 siblings, 2 replies; 11+ messages in thread
From: Jon Mason @ 2011-09-07 21:10 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel,
	Jordan_Hargrave

[-- Attachment #1: Type: text/plain, Size: 3798 bytes --]

On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote:
> On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
>
>> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
>> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>> >
>> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>> >>
>> >> > On Wed, 7 Sep 2011 12:52:25 -0400
>> >> > Josh Boyer <jwboyer@gmail.com> wrote:
>> >> >
>> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
>> >> > > wrote:
>> >> > > > Simon Kirby <sim@hostway.ca> writes:
>> >> > > >
>> >> > > >> Hello!
>> >> > > >>
>> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>> >> > > >>
>> >> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>> >> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>> >> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> >> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> >> > > >
>> >> > > > I'm seeing exactly the same issue on a Dell 1950 server. If anyone
>> >> > > > wants me to try additional debugging or patches, feel free to send
>> >> > > > them. Unfortunately I don't have the time or knowledge to debug this
>> >> > > > myself.
>> >> > >
>> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>> >> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
>> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> >> > >
>> >> > > Jesse, Jon?
>> >> >
>> >> > kernel.org is still down and I haven't pushed anything to github.  I
>> >> > asked Jon to send his patch directly to Linus today instead.
>> >>
>> >> FWIW, this patch didn't seem to fix it:
>> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
>> >>
>> >> dmesg used to say:
>> >>
>> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
>> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> >> Do you have a strange power saving mode enabled?
>> >> Dazed and confused, but trying to continue
>> >
>> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
>> > stopped, but this made me realize that the pci=pcie_bus_safe option must
>> > have been missing. It turns out I had hacked a custom grub entry to load
>> > the newest kernel into grub instead of the one with the highest version
>> > number (grumble), so the default kopt didn't apply there.
>> >
>> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
>> > MRRS-disabling patch makes no difference in this case.
>> >
>> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
>> > or make it not change where it would otherwise warn, or does that
>> > basically make the thing useless?
>>
>> I have a patch that does pcie_bus_safe as the default behavior
>> and does not modify the MRRS.  Would you be willing to test this patch
>> for me?
>
> Sure, of course. (It compiles, ship it. :))

Great, thanks!  I've attached a patch file to this e-mail.

Thanks,
Jon

>
> Simon-
>

[-- Attachment #2: test.patch --]
[-- Type: text/x-patch, Size: 3448 bytes --]

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
 
-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
 
 /*
  * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..847ee5d 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
 
 static void pcie_write_mrrs(struct pci_dev *dev, int mps)
 {
-	int rc, mrrs;
+	int rc, mrrs, dev_mpss;
 
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
+	/* In the "safe" case, do not configure the MRRS.  There appear to be
+	 * issues with setting MRRS to 0 on a number of devices.
+	 */
 
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value.  However, it cannot be configured larger
-		 * than the MPS the device or the bus can support.  This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
+	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+		return;
+
+	dev_mpss = 128 << dev->pcie_mpss;
 
+	/* For Max performance, the MRRS must be set to the largest supported
+	 * value.  However, it cannot be configured larger than the MPS the
+	 * device or the bus can support.  This assumes that the largest MRRS
+	 * available on the device cannot be smaller than the device MPSS.
+	 */
+	mrrs = min(mps, dev_mpss);
 
 	/* MRRS is a R/W register.  Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
-	 * If the MRRS value provided is not acceptable (e.g., too large),
-	 * shrink the value until it is acceptable to the HW.
+	 * subsiquent read will verify if the value is acceptable or not.  If
+	 * the MRRS value provided is not acceptable (eg, too large), shrink the
+	 * value until it is acceptable to the HW.
  	 */
 	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+			 " to %d.  If any issues are encountered, please try "
+			 "running with pci=pcie_bus_safe\n", mrrs);
 		rc = pcie_set_readrq(dev, mrrs);
 		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
+			dev_err(&dev->dev,
+				"Failed attempting to set the MRRS\n");
 
 		mrrs /= 2;
 	}
@@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 	if (!pci_is_pcie(dev))
 		return 0;
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	pcie_write_mps(dev, mps);
 	pcie_write_mrrs(dev, mps);
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	return 0;
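
The control flow of the reworked pcie_write_mrrs() above can be traced
in userspace with a small model. This is only a sketch under stated
assumptions: struct fake_dev, fake_get_readrq(), fake_set_readrq() and
model_write_mrrs() are hypothetical names, not kernel code, and the
model device simply accepts any size up to max_mrrs.

```c
#include <assert.h>

enum bus_config { BUS_SAFE, BUS_PERFORMANCE };

/* Hypothetical stand-in for a PCIe function; not a kernel structure. */
struct fake_dev {
	int mrrs;     /* current Max Read Request Size, in bytes */
	int max_mrrs; /* largest MRRS this model accepts (assumption) */
};

int fake_get_readrq(const struct fake_dev *dev)
{
	return dev->mrrs;
}

/* Accept the write only if the size is within what the model supports. */
void fake_set_readrq(struct fake_dev *dev, int mrrs)
{
	if (mrrs <= dev->max_mrrs)
		dev->mrrs = mrrs;
}

/*
 * Mirrors the control flow of the patched function and returns the MRRS
 * the model device ends up with.
 */
int model_write_mrrs(struct fake_dev *dev, enum bus_config cfg,
		     int mps, int mpss_code)
{
	int dev_mpss, mrrs;

	assert(mpss_code >= 0 && mpss_code <= 5); /* 128 .. 4096 bytes */

	/* Anything other than "performance" leaves the MRRS untouched. */
	if (cfg != BUS_PERFORMANCE)
		return fake_get_readrq(dev);

	dev_mpss = 128 << mpss_code;
	mrrs = mps < dev_mpss ? mps : dev_mpss;

	/* Same loop shape as the patch: write, then halve and re-check. */
	while (mrrs != fake_get_readrq(dev) && mrrs >= 128) {
		fake_set_readrq(dev, mrrs);
		mrrs /= 2;
	}
	return fake_get_readrq(dev);
}
```

With BUS_SAFE the model returns the device's MRRS unchanged. With
BUS_PERFORMANCE, mps = 256 and a device starting at 4096, the model
ends at 128, because mrrs is halved after every write, including one
that sticks. Whether real hardware behaves the same way is outside
what this model shows.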


* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 21:10               ` Jon Mason
@ 2011-09-07 21:33                 ` Simon Kirby
  2011-09-08  6:42                 ` Sven Schnelle
  1 sibling, 0 replies; 11+ messages in thread
From: Simon Kirby @ 2011-09-07 21:33 UTC (permalink / raw)
  To: Jon Mason
  Cc: Jesse Barnes, Josh Boyer, Sven Schnelle, linux-kernel,
	Jordan_Hargrave

On Wed, Sep 07, 2011 at 02:10:59PM -0700, Jon Mason wrote:

> On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote:
> > On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
> >
> >> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
> >> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
> >> >
> >> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
> >> >>
> >> >> > On Wed, 7 Sep 2011 12:52:25 -0400
> >> >> > Josh Boyer <jwboyer@gmail.com> wrote:
> >> >> >
> >> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
> >> >> > > wrote:
> >> >> > > > Simon Kirby <sim@hostway.ca> writes:
> >> >> > > >
> >> >> > > >> Hello!
> >> >> > > >>
> >> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
> >> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
> >> >> > > >>
> >> >> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
> >> >> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
> >> >> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
> >> >> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
> >> >> > > >
> >> >> > > > I'm seeing exactly the same issue on a Dell 1950 Server. If anyone
> >> >> > > > wants me to try additional debugging/patches, feel free to do
> >> >> > > > so. Unfortunately I don't have the time/knowledge to debug that by
> >> >> > > > myself.
> >> >> > >
> >> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
> >> >> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
> >> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
> >> >> > >
> >> >> > > Jesse, Jon?
> >> >> >
> >> >> > kernel.org is still down and I haven't pushed anything to github.  I
> >> >> > asked Jon to send his patch directly to Linus today instead.
> >> >>
> >> >> FWIW, this patch didn't seem to fix it:
> >> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
> >> >>
> >> >> dmesg used to say:
> >> >>
> >> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
> >> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
> >> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
> >> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
> >> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
> >> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
> >> >> Do you have a strange power saving mode enabled?
> >> >> Dazed and confused, but trying to continue
> >> >
> >> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> >> > stopped, but this made me realize that the pci=pcie_bus_safe option must
> >> > have been missing. It turns out I had hacked a custom grub entry to load
> >> > the newest kernel into grub instead of the one with the highest version
> >> > number (grumble), so the default kopt didn't apply there.
> >> >
> >> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> >> > MRRS-disabling patch makes no difference in this case.
> >> >
> >> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> >> > or make it not change where it would otherwise warn, or does that
> >> > basically make the thing useless?
> >>
> >> I have a patch that does pcie_bus_safe as the default behavior
> >> and does not modify the MRRS.  Would you be willing to test this patch
> >> for me?
> >
> > Sure, of course. (It compiles, ship it. :))
> 
> Great, thanks!  I've attached a patch file to this e-mail.
> 
> Thanks,
> Jon
> 
> >
> > Simon-
> >

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 0ce6742..4e84fd4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
>  unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
>  unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
>  
> -enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
> +enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
>  
>  /*
>   * The default CLS is used if arch didn't set CLS explicitly and not
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8473727..847ee5d 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
>  
>  static void pcie_write_mrrs(struct pci_dev *dev, int mps)
>  {
> -	int rc, mrrs;
> +	int rc, mrrs, dev_mpss;
>  
> -	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
> -		int dev_mpss = 128 << dev->pcie_mpss;
> +	/* In the "safe" case, do not configure the MRRS.  There appear to be
> +	 * issues with setting MRRS to 0 on a number of devices.
> +	 */
>  
> -		/* For Max performance, the MRRS must be set to the largest
> -		 * supported value.  However, it cannot be configured larger
> -		 * than the MPS the device or the bus can support.  This assumes
> -		 * that the largest MRRS available on the device cannot be
> -		 * smaller than the device MPSS.
> -		 */
> -		mrrs = mps < dev_mpss ? mps : dev_mpss;
> -	} else
> -		/* In the "safe" case, configure the MRRS for fairness on the
> -		 * bus by making all devices have the same size
> -		 */
> -		mrrs = mps;
> +	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
> +		return;
> +
> +	dev_mpss = 128 << dev->pcie_mpss;
>  
> +	/* For Max performance, the MRRS must be set to the largest supported
> +	 * value.  However, it cannot be configured larger than the MPS the
> +	 * device or the bus can support.  This assumes that the largest MRRS
> +	 * available on the device cannot be smaller than the device MPSS.
> +	 */
> +	mrrs = min(mps, dev_mpss);
>  
>  	/* MRRS is a R/W register.  Invalid values can be written, but a
> -	 * subsiquent read will verify if the value is acceptable or not.
> -	 * If the MRRS value provided is not acceptable (e.g., too large),
> -	 * shrink the value until it is acceptable to the HW.
> +	 * subsiquent read will verify if the value is acceptable or not.  If
> +	 * the MRRS value provided is not acceptable (eg, too large), shrink the
> +	 * value until it is acceptable to the HW.
>   	 */
>  	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
> +		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
> +			 " to %d.  If any issues are encountered, please try "
> +			 "running with pci=pcie_bus_safe\n", mrrs);
>  		rc = pcie_set_readrq(dev, mrrs);
>  		if (rc)
> -			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
> +			dev_err(&dev->dev,
> +				"Failed attempting to set the MRRS\n");
>  
>  		mrrs /= 2;
>  	}
> @@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>  	if (!pci_is_pcie(dev))
>  		return 0;
>  
> -	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
> +	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>  
>  	pcie_write_mps(dev, mps);
>  	pcie_write_mrrs(dev, mps);
>  
> -	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
> +	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>  
>  	return 0;

Works for me (with no special cmdline).

Suggest spelling fix: s/subsiquent/subsequent/

Tested-by: Simon Kirby <sim@hostway.ca>

Simon-
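
The "Dev MPS ... MPSS ..." lines quoted above show why the old default
tripped on this box: 0000:08:00.0 reported MPSS 128 before being
configured, yet ended up with MPS 256. The sketch below redoes that
arithmetic. It assumes that "safe" means picking the smallest MPSS among
the devices involved; struct dev_caps, smallest_mpss() and
mps_exceeds_mpss() are hypothetical helpers, and the table holds only
the "before" MPSS values from the quoted dmesg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one device and the MPSS it advertises, in bytes. */
struct dev_caps {
	const char *name;
	int mpss;
};

/* "Before" MPSS values copied from the dmesg lines quoted in this thread. */
const struct dev_caps thread_devs[] = {
	{ "0000:00:02.0", 256 },
	{ "0000:06:00.0", 256 },
	{ "0000:07:00.0", 256 },
	{ "0000:08:00.0", 128 },
};

/* Smallest MPSS in the set: a payload size every listed device supports. */
int smallest_mpss(const struct dev_caps *devs, size_t n)
{
	int smallest;
	size_t i;

	assert(n > 0);
	smallest = devs[0].mpss;
	for (i = 1; i < n; i++)
		if (devs[i].mpss < smallest)
			smallest = devs[i].mpss;
	return smallest;
}

/* Non-zero when the configured MPS is larger than the device supports. */
int mps_exceeds_mpss(int mps, const struct dev_caps *dev)
{
	return mps > dev->mpss;
}
```

For the four devices listed, the smallest MPSS is 128, and an MPS of 256
exceeds what 0000:08:00.0 advertises, which is the condition the quoted
"MPS configured higher than maximum supported by the device" warning
describes.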


* Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"
  2011-09-07 21:10               ` Jon Mason
  2011-09-07 21:33                 ` Simon Kirby
@ 2011-09-08  6:42                 ` Sven Schnelle
  1 sibling, 0 replies; 11+ messages in thread
From: Sven Schnelle @ 2011-09-08  6:42 UTC (permalink / raw)
  To: Jon Mason
  Cc: Simon Kirby, Jesse Barnes, Josh Boyer, linux-kernel,
	Jordan_Hargrave

Jon Mason <mason@myri.com> writes:

> On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@hostway.ca> wrote:
>> On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
>>
>>> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@hostway.ca> wrote:
>>> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>>> >
>>> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>>> >>
>>> >> > On Wed, 7 Sep 2011 12:52:25 -0400
>>> >> > Josh Boyer <jwboyer@gmail.com> wrote:
>>> >> >
>>> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@stackframe.org>
>>> >> > > wrote:
>>> >> > > > Simon Kirby <sim@hostway.ca> writes:
>>> >> > > >
>>> >> > > >> Hello!
>>> >> > > >>
>>> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>>> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>>> >> > > >>
>>> >> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>>> >> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>>> >> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>>> >> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>>> >> > > >
>>> >> > > > I'm seeing exactly the same issue on a Dell 1950 Server. If anyone
>>> >> > > > wants me to try additional debugging/patches, feel free to do
>>> >> > > > so. Unfortunately I don't have the time/knowledge to debug that by
>>> >> > > > myself.
>>> >> > >
>>> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>>> >> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
>>> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>> >> > >
>>> >> > > Jesse, Jon?
>>> >> >
>>> >> > kernel.org is still down and I haven't pushed anything to github.  I
>>> >> > asked Jon to send his patch directly to Linus today instead.
>>> >>
>>> >> FWIW, this patch didn't seem to fix it:
>>> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
>>> >>
>>> >> dmesg used to say:
>>> >>
>>> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>>> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>>> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
>>> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>>> >> Do you have a strange power saving mode enabled?
>>> >> Dazed and confused, but trying to continue
>>> >
>>> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
>>> > stopped, but this made me realize that the pci=pcie_bus_safe option must
>>> > have been missing. It turns out I had hacked a custom grub entry to load
>>> > the newest kernel into grub instead of the one with the highest version
>>> > number (grumble), so the default kopt didn't apply there.
>>> >
>>> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
>>> > MRRS-disabling patch makes no difference in this case.
>>> >
>>> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
>>> > or make it not change where it would otherwise warn, or does that
>>> > basically make the thing useless?
>>>
>>> I have a patch that does pcie_bus_safe as the default behavior
>>> and does not modify the MRRS.  Would you be willing to test this patch
>>> for me?
>>
>> Sure, of course. (It compiles, ship it. :))
>
> Great, thanks!  I've attached a patch file to this e-mail.

Thanks, Jon. Works on my system (Dell 1950).

Tested-by: Sven Schnelle <svens@stackframe.org>

Regards

Sven
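
As the quoted exchange shows, one test run was misleading because
pci=pcie_bus_safe never reached the booted kernel. A quick way to rule
that out is to inspect the running command line (on Linux, the contents
of /proc/cmdline). The helper below is a minimal sketch of such a check;
pci_param_has() is a hypothetical name, and it assumes options inside
"pci=" are comma-separated.

```c
#include <string.h>

/*
 * Return 1 if cmdline contains a "pci=" parameter whose comma-separated
 * option list includes opt, 0 otherwise.
 */
int pci_param_has(const char *cmdline, const char *opt)
{
	size_t optlen = strlen(opt);
	const char *p = cmdline;

	while (*p) {
		size_t toklen;

		/* Skip whitespace between parameters. */
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;

		/* Length of the next whitespace-delimited parameter. */
		toklen = strcspn(p, " \t\n");

		if (toklen > 4 && strncmp(p, "pci=", 4) == 0) {
			const char *v = p + 4;
			const char *end = p + toklen;

			/* Walk the comma-separated options after "pci=". */
			while (v < end) {
				size_t len = strcspn(v, ", \t\n");

				if (len == optlen && strncmp(v, opt, len) == 0)
					return 1;
				v += len;
				if (v < end)
					v++; /* step over the comma */
			}
		}
		p += toklen;
	}
	return 0;
}
```

Feeding it the text read from /proc/cmdline and the string
"pcie_bus_safe" tells you whether the option was actually in effect for
the boot being tested.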


Thread overview: 11+ messages
2011-09-06 17:36 [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric" Simon Kirby
2011-09-07 16:22 ` Sven Schnelle
2011-09-07 16:52   ` Josh Boyer
2011-09-07 17:44     ` Jesse Barnes
2011-09-07 19:18       ` Simon Kirby
2011-09-07 20:47         ` Simon Kirby
2011-09-07 20:57           ` Jon Mason
2011-09-07 20:58             ` Simon Kirby
2011-09-07 21:10               ` Jon Mason
2011-09-07 21:33                 ` Simon Kirby
2011-09-08  6:42                 ` Sven Schnelle
