PPC host with a PCI root-complex

LinuxPPC-Dev Archive on lore.kernel.org

* PPC host with a PCI root-complex
From: Srinivas Murthy @ 2006-05-18 21:56 UTC
  To: linuxppc-dev

Hi,

We have a ppc host with a PCI root-complex across which there are multiple
PCI end points.

An application running on the ppc host reading one of the device memory
regions (not DMA access but direct CPU read) causes a parity error on the
PCI interface controller.

We think that the error should be propagated up as a machine-check which is
considered a non-recoverable system-wide error. However with multiple PCI
devices present we think that this is too generic and could be reduced to be
a critical-error which could be recovered from.

Are there any other approaches/thoughts keeping in mind that this is PPC
host and we're running a PCI-rootcomplex interface?

Thanks,
_Srinivas

* Re: PPC host with a PCI root-complex
From: Segher Boessenkool @ 2006-05-18 23:38 UTC
  To: Srinivas Murthy; +Cc: linuxppc-dev

> We have a ppc host with a PCI root-complex across which there are  
> multiple PCI end points.
>
> An application running on the ppc host reading one of the device  
> memory regions (not DMA access but direct CPU read) causes a parity  
> error on the PCI interface controller.
>
> We think that the error should be propagated up as a machine-check  
> which is considered a non-recoverable system-wide error. However  
> with multiple PCI devices present we think that this is too generic  
> and could be reduced to be a critical-error which could be  
> recovered from.
>
> Are there any other approaches/thoughts keeping in mind that this  
> is PPC host and we're running a PCI-rootcomplex interface?

You can handle the machine check in a platform-specific way --
see ppc_md.machine_check_exception().
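
As a rough illustration of that hook, here is a minimal userspace sketch,
not kernel code. It assumes the hook returns nonzero when the platform has
handled the machine check. The names struct pt_regs_model,
struct machdep_calls_model, bridge_status, BRIDGE_STAT_PARITY_ERR,
board_machine_check and handle_machine_check are all hypothetical stand-ins,
and skipping the faulting load is just one possible recovery policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct pt_regs (assumption). */
struct pt_regs_model {
    uint32_t nip;   /* address of the faulting instruction */
};

/* Stand-in for the single ppc_md hook discussed here (assumption). */
struct machdep_calls_model {
    /* Returns nonzero if the platform handled the machine check. */
    int (*machine_check_exception)(struct pt_regs_model *regs);
};

/* HYPOTHETICAL status bit; real bridges define their own register layout. */
#define BRIDGE_STAT_PARITY_ERR 0x1u

/* HYPOTHETICAL software copy of the bridge's error status register. */
static uint32_t bridge_status;

/* Platform hook: claim the machine check only for a PCI parity error. */
static int board_machine_check(struct pt_regs_model *regs)
{
    assert(regs != NULL);
    if (!(bridge_status & BRIDGE_STAT_PARITY_ERR))
        return 0;                              /* not ours: stays fatal */
    bridge_status &= ~BRIDGE_STAT_PARITY_ERR;  /* acknowledge the error */
    regs->nip += 4;                            /* skip the faulting 4-byte load */
    return 1;
}

static struct machdep_calls_model ppc_md_model = {
    .machine_check_exception = board_machine_check,
};

/* Generic path: ask the platform first; 0 means "treat as fatal". */
static int handle_machine_check(struct pt_regs_model *regs)
{
    if (ppc_md_model.machine_check_exception &&
        ppc_md_model.machine_check_exception(regs))
        return 1;
    return 0;
}
```

The design point is that only the platform code knows whether the bridge
caused the exception, so anything it does not recognise stays fatal.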


Segher

* Re: PPC host with a PCI root-complex
From: Linas Vepstas @ 2006-05-19 16:23 UTC
  To: Srinivas Murthy; +Cc: linuxppc-dev

On Thu, May 18, 2006 at 02:56:31PM -0700, Srinivas Murthy wrote:
> Hi,
> 
> We have a ppc host with a PCI root-complex across which there are multiple
> PCI end points.
> 
> An application running on the ppc host reading one of the device memory
> regions (not DMA access but direct CPU read) causes a parity error on the
> PCI interface controller.
> 
> We think that the error should be propagated up as a machine-check which is
> considered a non-recoverable system-wide error. However with multiple PCI
> devices present we think that this is too generic and could be reduced to be
> a critical-error which could be recovered from.

The "PCI Error Recovery" API was created to deal with this kind of
situation. See Documentation/pci-error-recovery.txt

In brief: if something like a PCI parity error is detected by the
hardware, then some arch-specific code runs; for example,
arch/powerpc/platforms/pseries/eeh.c.

This code notifies the PCI device driver (via generic callbacks in
include/linux/pci.h) about the error. The device driver may ask the
arch to reset the PCI device, bus, or link, or may decline a reset.
If/when the PCI bus/link is back to normal, the PCI device driver is
notified via callback and resumes normal operation.
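
The notify / reset / resume sequence described above can be sketched as a
small userspace model. This is not the kernel API: the callback names
error_detected, slot_reset and resume echo the ones in the error recovery
document, but every type and function here (struct pci_dev_model,
struct error_handlers_model, arch_recover, the drv_* callbacks and the enum
values) is a simplified, hypothetical stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; they only echo the kernel's channel states and results. */
enum channel_state { IO_NORMAL, IO_FROZEN, IO_PERM_FAILURE };
enum ers_result { ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

struct pci_dev_model {
    int io_enabled;   /* driver is allowed to touch the device */
    int resets;       /* number of slot resets done by the arch code */
};

/* Callbacks a driver registers so the arch code can notify it. */
struct error_handlers_model {
    enum ers_result (*error_detected)(struct pci_dev_model *dev,
                                      enum channel_state state);
    enum ers_result (*slot_reset)(struct pci_dev_model *dev);
    void (*resume)(struct pci_dev_model *dev);
};

/* Driver side: stop I/O, then say whether a reset could help. */
static enum ers_result drv_error_detected(struct pci_dev_model *dev,
                                          enum channel_state state)
{
    dev->io_enabled = 0;
    return state == IO_PERM_FAILURE ? ERS_DISCONNECT : ERS_NEED_RESET;
}

/* Driver side: re-initialise after the reset (nothing to do in this model). */
static enum ers_result drv_slot_reset(struct pci_dev_model *dev)
{
    (void)dev;
    return ERS_RECOVERED;
}

/* Driver side: the bus is healthy again, so restart normal I/O. */
static void drv_resume(struct pci_dev_model *dev)
{
    dev->io_enabled = 1;
}

static const struct error_handlers_model drv_handlers = {
    .error_detected = drv_error_detected,
    .slot_reset     = drv_slot_reset,
    .resume         = drv_resume,
};

/* Arch side: notify, reset if asked, then resume. Returns 1 on recovery. */
static int arch_recover(struct pci_dev_model *dev,
                        const struct error_handlers_model *h,
                        enum channel_state state)
{
    assert(dev != NULL && h != NULL);
    if (h->error_detected(dev, state) != ERS_NEED_RESET)
        return 0;               /* driver gave up on the device */
    dev->resets++;              /* stands in for the real slot reset */
    if (h->slot_reset(dev) != ERS_RECOVERED)
        return 0;
    h->resume(dev);
    return 1;
}
```

Calling arch_recover(&dev, &drv_handlers, IO_FROZEN) walks all three steps;
with IO_PERM_FAILURE the driver asks to disconnect and no reset is attempted.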

If you have questions/suggestions, let me know. I've been maintaining
this code, and am interested in seeing how well it can be adapted
to a broader range of hardware.

--linas

* Re: PPC host with a PCI root-complex
From: Srinivas Murthy @ 2006-05-19 21:28 UTC
  To: Linas Vepstas; +Cc: linuxppc-dev

Thanks for the reply.

I have a couple of concerns here. I would appreciate it if you could provide
your thoughts.

On a PPC (44x) platform, following an error such as a parity error detected by
the PCI root complex, should we cause a bus error (causing a machine-check
exception) or complete the bus transaction normally but trigger a critical
interrupt? Note that these are two different types of interrupt as seen by the
CPU, with the machine check having the highest NMI priority.

If the parity error was detected as a result of, say, a memory read by
the core from a PCI device, there might be a gap of several cycles between the
read and the CPU being interrupted (with the critical interrupt handler).
This may result in data corruption, etc. Is this a valid concern to have?
What is the normal approach to deal with this issue in an "enterprise" or
high-end environment?

* Re: PPC host with a PCI root-complex
From: Linas Vepstas @ 2006-05-19 23:00 UTC
  To: Srinivas Murthy; +Cc: linuxppc-dev

On Fri, May 19, 2006 at 02:28:29PM -0700, Srinivas Murthy wrote:
> 
> On a PPC (44x) platform, following an error such as a parity error detected by
> the PCI root complex, should we cause a bus error (causing a machine-check
> exception) or complete the bus transaction normally but trigger a critical
> interrupt? Note that these are two different types of interrupt as seen by the
> CPU, with the machine check having the highest NMI priority.

I can't answer that question; I'd say that's a platform implementation
question that each platform has to decide on its own.  If you have a
recoverable machine check, and can take it and recover from it, then
I suppose that's a reasonable choice. But I've never dealt with that.

> If the parity error was detected as a result of, say, a memory read by
> the core from a PCI device, there might be a gap of several cycles between the
> read and the CPU being interrupted (with the critical interrupt handler).
> This may result in data corruption, etc. Is this a valid concern to have?

Yes. Sort of. Maybe. In the early days of getting PCI error handling
to work, it became clear that there were lots of PCI devices with weak
firmware or buggy hardware that were DMA'ing to all sorts of wild addresses,
and/or doing other bad things (mangled PCI split transactions, etc.).
We became painfully aware of this because our PCI bridges flagged any DMA
to any page that hadn't been expressly mapped (as well as a bunch of
other PCI errors).  It seems that these devices had been busy corrupting
memory and whatever else for years, and no one had noticed, because
no one had a stringent, error-checking PCI bridge.  Is data corruption
important? Yes. Have you been living with it for years, and not noticing
it? Yes.

(Most of those I dealt with have been fixed, either in the Linux device
driver, in the device firmware, or, in one case, by a hardware change; I
assume some gate array fixup.)

> What is the normal approach to deal with this issue in an "enterprise" or
> high-end environment?

On IBM pSeries, the PCI-Host bridge stops the transaction; I don't know
the details at the hardware level; I presume it's some abort or termination.
In principle, the corrupted data never makes it to system memory or the CPU.
If the CPU is reading, 0xffffffff is returned, as it is for all future reads.
All writes are dropped on the floor.  DMAs are also cut off.

On pSeries, there's no interrupt generated on error. Rather, if the
device driver gets an unexpected 0xffffffff on read, it can query the firmware
for the PCI bridge state, and proceed from there.  A typical scenario
is that an error occurred during DMA, long ago; the device driver gets
a DMA-complete interrupt from the PCI device, and discovers upon reading
that the device interrupt status register is all 0xffff... During
recovery, that hunk of incomplete DMA data is discarded.
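
The "all ones on read, then ask about the bridge" check can be sketched in
userspace as follows. This is a model under stated assumptions, not the
pSeries code: struct bridge_model, mmio_read32, mmio_write32,
bridge_is_frozen (standing in for the firmware query) and checked_read32
are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one device register sitting behind a host bridge. */
struct bridge_model {
    int frozen;      /* bridge has stopped traffic after an error */
    uint32_t reg;    /* the device register being accessed */
};

/* A frozen bridge answers every read with all ones. */
static uint32_t mmio_read32(const struct bridge_model *b)
{
    return b->frozen ? 0xffffffffu : b->reg;
}

/* A frozen bridge silently drops writes. */
static void mmio_write32(struct bridge_model *b, uint32_t value)
{
    if (!b->frozen)
        b->reg = value;
}

/* Stand-in for asking the firmware about the bridge state. */
static int bridge_is_frozen(const struct bridge_model *b)
{
    return b->frozen;
}

enum read_status { READ_OK, READ_BRIDGE_ERROR };

/*
 * All ones can be a legitimate register value, so the (assumed slow)
 * state query only runs when the read looks suspicious.
 */
static enum read_status checked_read32(const struct bridge_model *b,
                                       uint32_t *out)
{
    uint32_t value;

    assert(b != NULL && out != NULL);
    value = mmio_read32(b);
    if (value == 0xffffffffu && bridge_is_frozen(b))
        return READ_BRIDGE_ERROR;   /* caller should start recovery */
    *out = value;
    return READ_OK;
}
```

A driver would call checked_read32() where it reads its interrupt status
register and begin recovery on READ_BRIDGE_ERROR, discarding any DMA data
that was in flight.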

The point here is that it's potentially OK to allow corrupted data into
system memory, as long as it's at the right address, and as long as the
device driver can nuke it before it has gotten to other consumers.
If you are getting parity errors on DMA addresses, then ... :-)

PCI Express does something else, and I don't quite understand what.
Part of their mechanism involves an interrupt, although it's somehow
wired so that on MMIO read, the interrupt gets to the CPU before the
MMIO read completes. Some chunk of it seems x86/Itanium-centric.
The Intel guys have been trying to figure out how to implement this,
but haven't done so yet.

--linas
