LinuxPPC-Dev Archive on lore.kernel.org
From: linas@austin.ibm.com (Linas Vepstas)
To: Srinivas Murthy <codevana@gmail.com>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
Subject: Re: PPC host with a PCI root-complex
Date: Fri, 19 May 2006 18:00:47 -0500
Message-ID: <20060519230047.GM12135@austin.ibm.com>
In-Reply-To: <7cb1293c0605191428n57c18a60h5b86863d729cd9b9@mail.gmail.com>

On Fri, May 19, 2006 at 02:28:29PM -0700, Srinivas Murthy wrote:
> 
> On a PPC (44x) platform, following an error such as a parity error detected by
> the PCI root complex, should we cause a bus error (causing a machine-check
> exception) or complete the bus transaction normally but trigger a critical
> interrupt? Note that these are two different types of interrupts as seen by the
> CPU, with the machine check having the highest NMI priority.

I can't answer that question; I'd say that's a platform implementation
question that each platform has to decide on its own.  If you have a 
recoverable machine check, and can take it and recover from it, then
I suppose that's a reasonable choice. But I've never dealt with that.

> If the parity error detection was a result of, say, a memory read operation by
> the core to a PCI device, there might be a delay of several cycles between the
> read and the CPU being interrupted (with the critical interrupt handler).
> This may result in data corruption, etc. Is this a valid concern to have?

Yes. Sort of. Maybe. In the early days of getting pci error handling 
to work, it became clear that there were lots of pci devices with weak 
firmware or buggy hardware that were dma'ing to all sorts of wild addresses, 
and/or doing other bad things (mangled PCI split transactions, etc). 
We became painfully aware of this, because our pci bridges flagged any DMA 
to any page that hadn't been expressly mapped (as well as a bunch of
other PCI errors).  Seems that these devices had been busy corrupting
memory and whatever else for years, and no one noticed before, because 
no one had a stringent, error checking PCI bridge.  Is data corruption
important? Yes. Have you been living with it for years, and not noticing
it? Yes.

(Most of those I dealt with have been fixed, either in the Linux device
driver, in the device firmware, or, in one case, by a hardware change
(I assume some gate array fixup).)
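
To make the "flag any DMA to a page that hasn't been expressly mapped" idea concrete, here is a toy model in C. Everything in it is hypothetical: the 64-page window, the 4 KiB page size, and the names iommu_map() and iommu_check_dma() are made up for illustration and are not the pSeries bridge or any kernel API. The point is only that the bridge keeps a table of mapped pages, and a DMA touching any page outside that table gets flagged instead of silently landing in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA window: 64 pages of 4 KiB each. */
#define PAGE_SHIFT 12
#define NUM_PAGES  64

struct iommu_model {
    bool mapped[NUM_PAGES];  /* true if the driver expressly mapped this page */
    bool error_flagged;      /* latched once an illegal DMA is seen */
};

static void iommu_init(struct iommu_model *m)
{
    memset(m, 0, sizeof(*m));
}

/* Driver side: map the page containing bus_addr. */
static void iommu_map(struct iommu_model *m, uint64_t bus_addr)
{
    uint64_t page = bus_addr >> PAGE_SHIFT;
    assert(page < NUM_PAGES);  /* caller must stay inside the window */
    m->mapped[page] = true;
}

/* Bridge side: returns true if every page touched by [bus_addr, bus_addr+len)
 * is mapped; otherwise latches the error and refuses the DMA. */
static bool iommu_check_dma(struct iommu_model *m, uint64_t bus_addr, uint64_t len)
{
    if (len == 0)
        return true;
    uint64_t first = bus_addr >> PAGE_SHIFT;
    uint64_t last = (bus_addr + len - 1) >> PAGE_SHIFT;
    for (uint64_t p = first; p <= last; p++) {
        if (p >= NUM_PAGES || !m->mapped[p]) {
            m->error_flagged = true;
            return false;
        }
    }
    return true;
}
```

Note that a DMA starting in a mapped page but running off its end is caught too, which is exactly the kind of sloppy device behaviour that went unnoticed on bridges without this check.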

> What is the normal approach to deal with this issue in an "enterprise" or
> high-end environment?

On IBM pSeries, the PCI-Host bridge stops the transaction. I don't know 
the details at the hardware level; I presume it's some abort or termination.
In principle, the corrupted data never makes it to system memory or the CPU.
If the CPU is reading, 0xffffffff is returned, and so is every later read. 
All writes are dropped on the floor.  DMAs are also cut off.  
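
As seen from software, that isolation behaviour looks roughly like the sketch below. It is a model under assumptions, not the actual hardware or firmware: struct bridge_slot, slot_report_error() and the other names are hypothetical, and a real bridge does this at the hardware level rather than in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 8

/* Hypothetical model of a PCI host bridge slot that gets "frozen"
 * after an error.  Not a real kernel or firmware interface. */
struct bridge_slot {
    uint32_t regs[NUM_REGS];  /* device MMIO registers behind the bridge */
    bool frozen;              /* set once the bridge detects an error */
};

static void slot_report_error(struct bridge_slot *s)
{
    s->frozen = true;  /* stop forwarding traffic to and from the device */
}

static uint32_t slot_mmio_read(const struct bridge_slot *s, unsigned reg)
{
    assert(reg < NUM_REGS);
    if (s->frozen)
        return 0xffffffffu;  /* every read after the error returns all-ones */
    return s->regs[reg];
}

static void slot_mmio_write(struct bridge_slot *s, unsigned reg, uint32_t val)
{
    assert(reg < NUM_REGS);
    if (s->frozen)
        return;  /* writes are silently dropped */
    s->regs[reg] = val;
}

/* Returns the number of bytes actually delivered to system memory. */
static uint32_t slot_dma_to_memory(const struct bridge_slot *s, uint8_t *dst,
                                   const uint8_t *src, uint32_t len)
{
    if (s->frozen)
        return 0;  /* DMA is cut off: nothing reaches memory */
    for (uint32_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}
```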

On pSeries, there's no interrupt generated on error. Rather, if the 
device driver gets an unexpected 0xffffffff on a read, it can query the
firmware for the PCI bridge state, and proceed from there.  A typical
scenario is that an error occurred during DMA, long ago; the device driver
gets a dma-complete interrupt from the PCI device, and discovers upon reading
that the device interrupt status register is all 0xffff...  During 
recovery, that hunk of incomplete DMA data is discarded. 
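
A sketch of the driver side of that scenario (on pSeries this is part of EEH, Enhanced Error Handling). The firmware query is stubbed out: query_firmware(), struct fake_device and the result codes are hypothetical stand-ins, not the real kernel or firmware interfaces. The driver can't treat 0xffffffff alone as proof of an error, since a register may legitimately read as all-ones, so it asks the firmware first and only then throws away the partial DMA data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALL_ONES 0xffffffffu

/* Hypothetical stand-ins: a real driver would do an MMIO read and a
 * firmware call; here both are plain fields the caller sets. */
enum bridge_state { BRIDGE_OK, BRIDGE_FROZEN };

struct fake_device {
    uint32_t irq_status;         /* what the status-register read returns */
    enum bridge_state fw_state;  /* what a firmware query would report */
};

enum dma_result { DMA_GOOD, DMA_SPURIOUS_ONES, DMA_DISCARDED };

static enum bridge_state query_firmware(const struct fake_device *dev)
{
    return dev->fw_state;
}

/* Body of a hypothetical dma-complete interrupt handler. */
static enum dma_result handle_dma_complete(const struct fake_device *dev,
                                           uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint32_t status = dev->irq_status;  /* stands in for the MMIO read */

    if (status != ALL_ONES)
        return DMA_GOOD;  /* normal path: buffer may be handed on */

    /* All-ones could be a legitimate value; ask firmware before
     * deciding that the bridge has isolated the device. */
    if (query_firmware(dev) != BRIDGE_FROZEN)
        return DMA_SPURIOUS_ONES;

    /* The DMA may have stopped part-way: nuke the partial data before
     * any other consumer sees it.  Slot recovery is not shown. */
    memset(buf, 0, len);
    return DMA_DISCARDED;
}
```

Zeroing the buffer is just one way of nuking it; a real driver might simply never pass it up the stack, and then go on to reset the slot.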

The point here is that it's potentially OK to allow corrupted data into
system memory, as long as it's at the right address, and as long as the 
device driver can nuke it before it has gotten to other consumers. 
If you are getting parity errors on DMA addresses, then ... :-) 

PCI Express does something else, and I don't quite understand what.
Part of their mechanism involves an interrupt, although it's somehow 
wired so that on an MMIO read, the interrupt gets to the CPU before the
MMIO read completes. Some chunk of it seems x86/Itanium-centric.
The Intel guys have been trying to figure out how to implement this, 
but haven't done so yet.

--linas

Thread overview: 5+ messages
2006-05-18 21:56 PPC host with a PCI root-complex Srinivas Murthy
2006-05-18 23:38 ` Segher Boessenkool
2006-05-19 16:23 ` Linas Vepstas
2006-05-19 21:28   ` Srinivas Murthy
2006-05-19 23:00     ` Linas Vepstas [this message]
