* PPC host with a PCI root-complex
@ 2006-05-18 21:56 Srinivas Murthy
  2006-05-18 23:38 ` Segher Boessenkool
  2006-05-19 16:23 ` Linas Vepstas
  0 siblings, 2 replies; 5+ messages in thread
From: Srinivas Murthy @ 2006-05-18 21:56 UTC
To: linuxppc-dev

Hi,

We have a ppc host with a PCI root-complex across which there are
multiple PCI end points.

An application running on the ppc host reading one of the device
memory regions (not DMA access but direct CPU read) causes a parity
error on the PCI interface controller.

We think that the error should be propagated up as a machine-check
which is considered a non-recoverable system-wide error. However
with multiple PCI devices present we think that this is too generic
and could be reduced to be a critical-error which could be
recovered from.

Are there any other approaches/thoughts keeping in mind that this
is a PPC host and we're running a PCI-rootcomplex interface?

Thanks,
_Srinivas

* Re: PPC host with a PCI root-complex
From: Segher Boessenkool @ 2006-05-18 23:38 UTC
To: Srinivas Murthy; +Cc: linuxppc-dev

> We have a ppc host with a PCI root-complex across which there are
> multiple PCI end points.
>
> An application running on the ppc host reading one of the device
> memory regions (not DMA access but direct CPU read) causes a parity
> error on the PCI interface controller.
>
> We think that the error should be propagated up as a machine-check
> which is considered a non-recoverable system-wide error. However
> with multiple PCI devices present we think that this is too generic
> and could be reduced to be a critical-error which could be
> recovered from.
>
> Are there any other approaches/thoughts keeping in mind that this
> is a PPC host and we're running a PCI-rootcomplex interface?

You can handle the machine check in a platform-specific way -- see
ppc_md.machine_check_exception().

Segher
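
The idea behind a platform-specific machine-check hook can be sketched
as a toy user-space model. This is Python, not kernel code. The only
name taken from the message above is machine_check_exception; the
MachDep class, the regs dictionary with its "nip" and "fault_addr"
keys, the address window, and the convention that a true return value
means "recovered" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MachDep:
    """Toy stand-in for a table of platform hooks such as ppc_md."""
    # Assumed convention: the hook returns True if the platform recovered.
    machine_check_exception: Optional[Callable[[dict], bool]] = None


class SystemPanic(Exception):
    """Models the unrecoverable path taken when nobody can recover."""


def generic_machine_check(ppc_md: MachDep, regs: dict) -> str:
    """Generic trap path: give the platform hook the first chance to recover."""
    hook = ppc_md.machine_check_exception
    if hook is not None and hook(regs):
        return "recovered"
    raise SystemPanic("machine check at 0x%x" % regs["nip"])


# Made-up PCI MMIO window for the hypothetical board below.
PCI_MMIO_START, PCI_MMIO_END = 0x8000_0000, 0x9000_0000


def board_machine_check(regs: dict) -> bool:
    """Hypothetical hook: recover only if the fault hit the PCI MMIO window."""
    return PCI_MMIO_START <= regs["fault_addr"] < PCI_MMIO_END
```

The point of the split is that the generic path stays ignorant of the
bus topology: only the board hook decides whether a given fault can be
narrowed down to one PCI device instead of being treated as fatal for
the whole system.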

* Re: PPC host with a PCI root-complex
From: Linas Vepstas @ 2006-05-19 16:23 UTC
To: Srinivas Murthy; +Cc: linuxppc-dev

On Thu, May 18, 2006 at 02:56:31PM -0700, Srinivas Murthy wrote:
> Hi,
>
> We have a ppc host with a PCI root-complex across which there are multiple
> PCI end points.
>
> An application running on the ppc host reading one of the device memory
> regions (not DMA access but direct CPU read) causes a parity error on the
> PCI interface controller.
>
> We think that the error should be propagated up as a machine-check which is
> considered a non-recoverable system-wide error. However with multiple PCI
> devices present we think that this is too generic and could be reduced to be
> a critical-error which could be recovered from.

The "PCI Error Recovery" API was created to deal with this kind of a
situation. See Documentation/pci-error-recovery.txt

In brief: if something like a PCI parity error is detected by the
hardware, then some arch-specific code runs; for example,
arch/powerpc/platforms/pseries/eeh.c.

This code notifies the PCI device driver (via generic callbacks in
include/linux/pci.h) about the error. The device driver may ask the
arch to have the PCI device/bus/link/etc. reset, or not. If/when
the PCI bus/link is back to normal, the PCI device driver is notified
via callback, and resumes normal operation.

If you have questions/suggestions, let me know. I've been maintaining
this code, and am interested in seeing how well it can be adapted
to a broader range of hardware.

--linas
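
The notify / reset / resume sequence described above can be modeled in
a few lines. This is a toy user-space model in Python, not the kernel
API. The callback names (error_detected, slot_reset, resume) and the
result names are chosen to resemble the kernel's PCI error handler
callbacks, but the classes, the recover() function and the reset_slot
parameter are hypothetical, and the model covers only the reset and
disconnect paths.

```python
from enum import Enum


class Result(Enum):
    """Subset of possible driver verdicts; a model only."""
    NEED_RESET = "need_reset"
    DISCONNECT = "disconnect"
    RECOVERED = "recovered"


class ToyDriver:
    """Hypothetical device driver that always asks for a slot reset."""

    def __init__(self):
        self.log = []

    def error_detected(self):
        self.log.append("error_detected")  # stop touching the device
        return Result.NEED_RESET

    def slot_reset(self):
        self.log.append("slot_reset")      # re-initialize after the reset
        return Result.RECOVERED

    def resume(self):
        self.log.append("resume")          # restart normal I/O


def recover(driver, reset_slot):
    """Arch-side sequence: notify the driver, reset if asked, then resume."""
    verdict = driver.error_detected()
    if verdict is Result.NEED_RESET:
        reset_slot()                       # platform-specific bus/link reset
        verdict = driver.slot_reset()
    if verdict is not Result.RECOVERED:
        return False                       # driver gave up on the device
    driver.resume()
    return True
```

Calling recover(ToyDriver(), some_reset_function) walks the driver
through error_detected, slot_reset and resume in that order. A driver
that returns Result.DISCONNECT from error_detected is never reset or
resumed, which is how a single failing device can be dropped without
taking the rest of the system down.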

* Re: PPC host with a PCI root-complex
From: Srinivas Murthy @ 2006-05-19 21:28 UTC
To: Linas Vepstas; +Cc: linuxppc-dev

Thanks for the reply.

I have a couple of concerns here. I would appreciate it if you could
provide your thoughts.

On a PPC (44x) platform, following an error such as a parity error
detected by the PCI root complex, should we cause a bus error (causing
a machine-check exception) or complete the bus transaction normally but
trigger a critical interrupt? Note that these are two different types
of interrupts as seen by the CPU, with the machine check having the
highest NMI priority.

If the parity error detection was a result of, say, a memory read
operation by the core to a PCI device, there might be a several-cycle
difference between the read and the CPU being interrupted (with the
critical interrupt handler). This may result in data corruption, etc.
Is this a valid concern to have?

What is the normal approach to deal with this issue in an "enterprise"
or high-end environment?

* Re: PPC host with a PCI root-complex
From: Linas Vepstas @ 2006-05-19 23:00 UTC
To: Srinivas Murthy; +Cc: linuxppc-dev

On Fri, May 19, 2006 at 02:28:29PM -0700, Srinivas Murthy wrote:
>
> On a PPC (44x) platform, following an error such as a parity error
> detected by the PCI root complex, should we cause a bus error (causing
> a machine-check exception) or complete the bus transaction normally but
> trigger a critical interrupt? Note that these are two different types
> of interrupts as seen by the CPU, with the machine check having the
> highest NMI priority.

I can't answer that question; I'd say that's a platform implementation
question that each platform has to decide on its own. If you have
a recoverable machine check, and can take it and recover from it,
then I suppose that's a reasonable choice. But I've never dealt with
that.

> If the parity error detection was a result of, say, a memory read
> operation by the core to a PCI device, there might be a several-cycle
> difference between the read and the CPU being interrupted (with the
> critical interrupt handler). This may result in data corruption, etc.
> Is this a valid concern to have?

Yes. Sort of. Maybe.

In the early days of getting PCI error handling to work, it became
clear that there were lots of PCI devices with weak firmware or buggy
hardware that were DMA'ing to all sorts of wild addresses, and/or doing
other bad things (mangled PCI split transactions, etc.). We became
painfully aware of this, because our PCI bridges flagged any DMA to
any page that hadn't been expressly mapped (as well as a bunch of
other PCI errors). It seems that these devices had been busy corrupting
memory and whatever else for years, and no one noticed before, because
no one had a stringent, error-checking PCI bridge.

Is data corruption important? Yes. Have you been living with it
for years, and not noticing it? Yes. (Most of those I dealt with
have been fixed, either in the Linux device driver, the device
firmware, or in one case, a hardware change (I assume some gate
array fixup).)

> What is the normal approach to deal with this issue in an "enterprise" or
> high-end environment?

On IBM pSeries, the PCI-Host bridge stops the transaction; I don't know
the details at the hardware level; I presume it's some abort or
termination. In principle, the corrupted data never makes it to system
memory or CPU. If the CPU is reading, 0xffffffff is returned, both for
that read and for all future reads. All writes are dropped on the
floor. DMAs are also cut off.

On pSeries, there's no interrupt generated on error. Rather, if the
device driver gets an unexpected 0xffffffff on a read, it can query the
firmware for the PCI bridge state, and proceed from there.

A typical scenario is that an error occurred during DMA, long ago; the
device driver gets a DMA-complete interrupt from the PCI device, and
discovers upon reading that the device interrupt status register is
all 0xffff... During recovery, that hunk of incomplete DMA data is
discarded.

The point here is that it's potentially OK to allow corrupted data
into system memory, as long as it's at the right address, and as long
as the device driver can nuke it before it has gotten to other
consumers. If you are getting parity errors on DMA addresses,
then ... :-)

PCI Express does something else, and I don't quite understand what.
Part of their mechanism involves an interrupt, although it's somehow
wired so that on an MMIO read, the interrupt gets to the CPU before
the MMIO read completes. Some chunk of it seems x86/Itanium-centric.
The Intel guys have been trying to figure out how to implement this,
but haven't done so yet.

--linas
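
The pSeries behaviour described above (reads return all ones, writes
are dropped, and the driver asks about the bridge state when it sees
an unexpected 0xffffffff) can be modeled as follows. This is a toy
user-space model in Python, not real driver or firmware code; ToyBridge,
query_state(), read_status() and SlotIsolated are hypothetical names
invented for the sketch.

```python
ALL_ONES = 0xFFFFFFFF


class SlotIsolated(Exception):
    """Raised when a read hits a slot that the bridge has cut off."""


class ToyBridge:
    """Models a host bridge that isolates a device after a PCI error."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.frozen = False

    def mmio_read32(self, offset):
        # Once frozen, every read returns all ones.
        return ALL_ONES if self.frozen else self.registers[offset]

    def mmio_write32(self, offset, value):
        # Once frozen, writes are silently dropped.
        if not self.frozen:
            self.registers[offset] = value

    def query_state(self):
        # Stand-in for asking firmware about the bridge state.
        return "frozen" if self.frozen else "ok"


def read_status(bridge, offset):
    """Driver-side read: all ones is only a hint, so confirm with the bridge."""
    value = bridge.mmio_read32(offset)
    if value == ALL_ONES and bridge.query_state() == "frozen":
        raise SlotIsolated("discard in-flight DMA data and start recovery")
    return value
```

The second check matters because a register can legitimately hold
0xffffffff. Only when the bridge confirms that the slot is frozen does
the driver treat the value as an error and throw away any DMA data
that may be incomplete.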