[RFC/PATCH, 4/4] readX_check() performance evaluation

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Hironobu Ishii <ishii.hironobu@jp.fujitsu.com>
To: linux-kernel <linux-kernel@vger.kernel.org>,
	linux-ia64 <linux-ia64@vger.kernel.org>
Subject: [RFC/PATCH, 4/4] readX_check() performance evaluation
Date: Wed, 28 Jan 2004 10:54:46 +0900	[thread overview]
Message-ID: <00a501c3e541$c214e4d0$2987110a@lsd.css.fujitsu.com> (raw)

This is our goal and plan.

Our goal:
    - We'd like to recover from PCI-X intermittent errors
      as possible as we can.
      We'd like to recvoer from following errors
         a) PIO read data parity errors
         b) PIO write data parity erros
         c) DMA read data parity errors
         d) DMA write data parity erros
      (PIO means that the initiator is the CPU,
      DMA means that the initiator is the device.)
    - If a fatal bus error like SERR occurs, we'd like to shutdown 
      the failed bus(, remove driver instances) and continue 
      system operation.
    - If there are readX_check() I/F, we can safely perform 
      PHP(PCI Hotplug) operation.
      Currently there is no reliable I/F to check the newly 
      installed card, so to speak, PHP is the same as the gambling.

Our Plan:
    - Recover from Data parity error
        Currently, we are focused on ia64 architecture.

        Device setup assumption:
          PCI command register/ Parity error response =1
          PCI-X command register/ Data parity error recovery=1

        a) PIO read data parity error recovery
          (Memory mapped I/O read error recovery)

          - readX_check() read the memory mapped hardware
            and consume the data by additional instruction.
          - If there were data parity error, machine check will
            occurs.  Local machine check handler check whether 
            the exception was caused by readX_check() 
            and increment a per-CPU MCA counter variable.
            Then handler returns.
          - readX_check check the per-CPU MCA counter variable.
            When the counter is incremented, return an error.
          - If readX_check() returns an error and if reading 
            the register has no side-effect, re-read the register
            a few times to recover from the error.
            If reading the register has a side-effect, driver may
            reset the HW and restart operation.

        b) PIO write data parity error recovery
          (Memory mapped I/O write error recovery)
          
          - The device will find a parity error.
            When the device find a parity error, 
            it reports the error in the PCI status 
            register as "Detected Parity Error == 1".
            After the driver did some PIO writes,
            the driver can notice the error by reading
            the PCI staus register.
          - If writing to the register has no side-effect,
            the driver can re-writeto the register.
            If writing the register has a side-effect,
            driver may reset the HW and restart operation.

        c) DMA read data parity errors
           (DMA from memory to device)

          - The device will find a parity error.
            When the device detected a data parity error,
            it set the "Detected Parity Error == 1".

            After a while, the driver check the PCI status
            register, and knows that there was a data parity error.
          - Because the driver can't decide which I/O failed
            precisely, the recovery method will be reseting the HW
            and restarting the I/O if possible.

        d) DMA write data parity erros
           (DMA from device to memory)

           - Host bridge detect the data parity error,
             and assert PERR signal on the bus.
             Then the device will notice the error and 
             report it in PCI status register as 
             "Master Data Parity error == 1".
          - Because the driver can't decide which I/O failed
            precisely, the recovery method will be reseting the HW
            and restarting the I/O if possible.

    - Shutdown a bus(, remove driver instances) and continue operation. 
        We have no good idea currently.
        Does someone have a idea?

Thanks,
Hironobu Ishii

                 reply	other threads:[~2004-01-28  1:55 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='00a501c3e541$c214e4d0$2987110a@lsd.css.fujitsu.com' \
    --to=ishii.hironobu@jp.fujitsu.com \
    --cc=linux-ia64@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox