AMD 8132 parity issue causes interrupt storms

linux-kernel.vger.kernel.org archive mirror

* AMD 8132 parity issue causes interrupt storms
@ 2009-02-20 20:48 Mr. Berkley Shands
  2009-02-21  0:44 ` Robert Hancock
  0 siblings, 1 reply; 4+ messages in thread
From: Mr. Berkley Shands @ 2009-02-20 20:48 UTC
  To: linux-kernel

It seems that the 8132 should be blacklisted :-)

INT-A will be asserted forever if any channel sees a parity error.
This can be blocked by several means:

1) setpci -s <bus address of 8132> 5.b=05   /* disable interrupts from the bridge */
This is the "I don't see you" method.
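
For reference, config offset 0x05 is the upper byte of the standard 16-bit PCI Command register, so a byte write of 0x05 there sets bit 8 (SERR# enable) and bit 10 (INTx disable) and clears the remaining bits of that byte. The following is only a userspace sketch of that arithmetic in C; the helper names are made up for illustration and nothing here touches real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers in the standard 16-bit PCI Command register (offset 0x04). */
#define CMD_BIT_SERR_ENABLE   8
#define CMD_BIT_FAST_B2B      9
#define CMD_BIT_INTX_DISABLE 10

/*
 * Command register value after a byte write to config offset 0x05,
 * i.e. to the upper byte only. The lower byte (offset 0x04) is kept.
 * Hypothetical helper, for illustration.
 */
static uint16_t command_after_high_byte_write(uint16_t old_cmd, uint8_t byte5)
{
    return (uint16_t)((old_cmd & 0x00ffu) | ((unsigned)byte5 << 8));
}

/* Return the value (0 or 1) of one Command register bit. */
static int cmd_bit(uint16_t cmd, unsigned bit)
{
    assert(bit < 16);
    return (cmd >> bit) & 1;
}
```

Because setpci writes the whole byte, whatever was previously in bits 8 to 15 is overwritten rather than merged; a read-modify-write that only sets bit 10 would be the gentler variant.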

Shouldn't the interrupt handler (is there one?) trap and clear this?
Shouldn't the kernel at least report this error and reset those bits?

All,

OK, here's what I know so far.  The interrupt storm is coming from the 
parity error detector in the 8132.  The parity error is reported in two 
locations using sticky bits:

0x1c bits 31 and 24
   Here there seems to be some differentiation between which party 
detected the parity error.  The 8132 spec is pretty vague here (see page 
75) but it looks like the 8132 is detecting a parity error from the HBA 
not the other way around.
0x80 bit 0
   Here it just states that someone asserted the PERR_L signal, no 
distinction on who did it.

All these bits are write-one-to-clear.  If 0x80 bit 0 is cleared, the 
storm stops.  Clearly the OS does not know how to handle these 
conditions and the error flag is left on while the interrupt is 
continuously handled.
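
To make the write-one-to-clear behaviour concrete, here is a small C model of such a register. It is a sketch under assumptions: only the bit positions (0x1c bits 31 and 24, 0x80 bit 0) come from the description above, and which other bits in those registers are also write-one-to-clear is not stated here, so the mask is a parameter. In a standard PCI-to-PCI bridge header the upper half of the dword at 0x1c is the Secondary Status register, where bits 15 and 8 are Detected Parity Error and Master Data Parity Error, which lines up with dword bits 31 and 24.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions quoted in the message above. */
#define REG_1C_PERR_BITS ((1u << 31) | (1u << 24))  /* offset 0x1c */
#define REG_80_PERR_BIT  (1u << 0)                  /* offset 0x80 */

/*
 * Model of a register whose bits in w1c_mask are write-one-to-clear and
 * whose remaining bits are ordinary read/write. Returns the register
 * contents after software writes `written`.
 */
static uint32_t w1c_write(uint32_t current, uint32_t written, uint32_t w1c_mask)
{
    uint32_t rw_part  = written & ~w1c_mask;            /* plain bits take the new value */
    uint32_t w1c_part = current & w1c_mask & ~written;  /* sticky bits stay set unless a 1 is written */
    return rw_part | w1c_part;
}

/*
 * Value to write in order to clear only `to_clear`, keeping the
 * read/write bits as they are and leaving other sticky bits pending.
 */
static uint32_t w1c_clear_value(uint32_t current, uint32_t to_clear, uint32_t w1c_mask)
{
    assert((to_clear & ~w1c_mask) == 0);  /* may only name W1C bits */
    return (current & ~w1c_mask) | to_clear;
}
```

Two consequences follow from the model: writing zeros to the sticky bits never clears them, and writing back the value just read clears every sticky bit that was set, whether or not it was looked at.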

One way to handle this is to set 0x48 bit 19 to 0.  This prevents the 
8132 from interrupting when 0x80 bit 0 is set.

A much better way to handle this is to have the interrupt handler 
actually check the error bits on the 8132 when it is called.  This would 
slow down the interrupt handler, but it would give us much better 
visibility into the problem (when, where and how often it happens).  
The irritating thing is that this is chipset dependent: the 
interrupt handler would have to know which PCI-X chipset it was talking 
through to know how to handle this (way to go AMD).
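
A rough sketch of that check-report-clear sequence, written in C against a simulated register rather than real config space. Everything named here (struct fake_bridge, fake_write_reg80, handle_bridge_irq) is hypothetical; only offset 0x80 bit 0 and its write-one-to-clear behaviour are taken from the description above, and a real handler would go through the kernel's PCI config accessors instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PERR_STATUS_BIT (1u << 0)   /* offset 0x80 bit 0, per the message above */

/* Hypothetical stand-in for the bridge. */
struct fake_bridge {
    uint32_t reg80;          /* simulated contents of config offset 0x80 */
    unsigned long perr_seen; /* running count of parity events handled */
};

/* Simulated config write: bit 0 is write-one-to-clear, other bits plain R/W. */
static void fake_write_reg80(struct fake_bridge *b, uint32_t val)
{
    uint32_t sticky = b->reg80 & PERR_STATUS_BIT & ~val;
    b->reg80 = (val & ~PERR_STATUS_BIT) | sticky;
}

/*
 * Returns 1 if the bridge had a parity error latched (and it has now
 * been reported and cleared), 0 if the interrupt came from elsewhere.
 */
static int handle_bridge_irq(struct fake_bridge *b)
{
    assert(b != NULL);
    if (!(b->reg80 & PERR_STATUS_BIT))
        return 0;

    b->perr_seen++;
    fprintf(stderr, "bridge: PERR latched (event %lu), clearing\n", b->perr_seen);

    /* Keep the R/W bits unchanged and write 1 to the status bit to clear it. */
    fake_write_reg80(b, (b->reg80 & ~PERR_STATUS_BIT) | PERR_STATUS_BIT);
    return 1;
}
```

The counter and the log line are what would provide the "when and how often" visibility; the return value lets a shared-interrupt handler tell the caller whether this device was the source.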

The really odd thing is that, according to the spec, the parity error 
should be reported through INTB on the 8132.  The spec claims that fatal 
errors (the category they put PERR in) go to INTB while hot plug 
conditions trigger INTA.  Masking off fatal errors in the IOAPIC turns 
off the storm too.  I have no idea why this is showing up on INTA.

Berkley

-- 

E. F. Berkley Shands, MSc
Exegy Inc.



* Re: AMD 8132 parity issue causes interrupt storms
  2009-02-20 20:48 AMD 8132 parity issue causes interrupt storms Mr. Berkley Shands
@ 2009-02-21  0:44 ` Robert Hancock
       [not found]   ` <499F6468.3080907@exegy.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Hancock @ 2009-02-21  0:44 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel

Mr. Berkley Shands wrote:
> It seems that the 8132 should be blacklisted :-)
> 
> INT-A will be asserted forever if any channel sees a parity error.
> This can be blocked by several means;
> 
> 1) setpci -s <bus address of 8132> 5.b=05   /* disable interrupts from 
> the bridge */
> This is the I don't see you method.
> 
> Shouldn't the interrupt handler (is there one?) trap and clear this?
> Shouldn't the kernel at least report this error and reset those bits?

What's enabling this interrupt generation? Interrupting on parity errors 
is not part of the PCI spec. Unless there's some driver that's set up to 
handle these interrupts, whoever's enabling them shouldn't be.




* Re: AMD 8132 parity issue causes interrupt storms
       [not found]   ` <499F6468.3080907@exegy.com>
@ 2009-02-21 19:06     ` Robert Hancock
  2009-02-21 19:08       ` Robert Hancock
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Hancock @ 2009-02-21 19:06 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel, linux-pci

Mr. Berkley Shands wrote:
> I am certainly not doing that :-)
> Some supermicro H8QME-2 motherboards (about 40%) show up with that enabled.
> Something generates a parity error, and the machine is instantly on its 
> knees until it gets power cycled.
> 
> My thought was to look and report that parity was being enabled (bios bug?)

That would be a BIOS bug then, if it sets the parity interrupts enabled 
by default. If the OS installs a driver to handle those interrupts, the 
driver can enable them, otherwise they should stay off.

We could probably create a PCI quirk for this chip that would disable 
the parity interrupts on bootup if it found them enabled. CCing linux-pci.
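
As a rough illustration of what such a quirk would do, here is a userspace C sketch that operates on a fake config-space image. The offset and bit (0x48 bit 19) are the ones reported earlier in this thread; struct fake_pci_dev and the cfg_read32/cfg_write32 helpers are invented stand-ins. An actual kernel quirk would instead use pci_read_config_dword() and pci_write_config_dword() on a struct pci_dev and be registered through one of the DECLARE_PCI_FIXUP_* macros, and whether other bits in that register need special care is an open assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REG_INT_CTRL     0x48u       /* offset named in the thread */
#define PERR_INT_ENABLE  (1u << 19)  /* bit named in the thread */

/* Hypothetical 256-byte config space image standing in for a real device. */
struct fake_pci_dev {
    uint8_t cfg[256];
};

/* Read an aligned dword from the fake config space. */
static uint32_t cfg_read32(const struct fake_pci_dev *d, unsigned off)
{
    uint32_t v;
    assert(off % 4 == 0 && off + 4 <= sizeof d->cfg);
    memcpy(&v, &d->cfg[off], sizeof v);
    return v;
}

/* Write an aligned dword to the fake config space. */
static void cfg_write32(struct fake_pci_dev *d, unsigned off, uint32_t v)
{
    assert(off % 4 == 0 && off + 4 <= sizeof d->cfg);
    memcpy(&d->cfg[off], &v, sizeof v);
}

/*
 * If the parity-interrupt enable bit is set, clear it and leave all
 * other bits alone. Returns 1 if a change was made, 0 otherwise.
 */
static int quirk_disable_perr_irq(struct fake_pci_dev *d)
{
    uint32_t v = cfg_read32(d, REG_INT_CTRL);

    if (!(v & PERR_INT_ENABLE))
        return 0;

    cfg_write32(d, REG_INT_CTRL, v & ~PERR_INT_ENABLE);
    return 1;
}
```

Returning whether a change was made gives the caller a natural place to log that the firmware left the bit enabled, which is the reporting suggested above.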

> 
> I can fix it in a number of ways with setpci. It has taken a year to 
> find the cause of my troubles.
> And a $15K scope, ...
> 
> Berkley
> 



* Re: AMD 8132 parity issue causes interrupt storms
  2009-02-21 19:06     ` Robert Hancock
@ 2009-02-21 19:08       ` Robert Hancock
  0 siblings, 0 replies; 4+ messages in thread
From: Robert Hancock @ 2009-02-21 19:08 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel, linux-pci@vger.kernel.org

Robert Hancock wrote:
> Mr. Berkley Shands wrote:
>> I am certainly not doing that :-)
>> Some supermicro H8QME-2 motherboards (about 40%) show up with that 
>> enabled.
>> Something generates a parity error, and the machine is instantly on 
>> its knees until it gets power cycled.
>>
>> My thought was to look and report that parity was being enabled (bios 
>> bug?)
> 
> That would be a BIOS bug then, if it sets the parity interrupts enabled 
> by default. If the OS installs a driver to handle those interrupts, the 
> driver can enable them, otherwise they should stay off.
> 
> We could probably create a PCI quirk for this chip that would disable 
> the parity interrupts on bootup if it found them enabled.. CCing linux-pci.

Really CCing linux-pci this time.


