All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Linux Kernel list <linux-kernel@vger.kernel.org>,
	linux-pci@atrey.karlin.mff.cuni.cz, linux-ia64@vger.kernel.org
Cc: Linus Torvalds <torvalds@osdl.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Linas Vepstas <linas@austin.ibm.com>,
	"Luck, Tony" <tony.luck@intel.com>
Subject: Re: [PATCH/RFC] I/O-check interface for driver's error handling
Date: Fri, 04 Mar 2005 12:40:58 +0000	[thread overview]
Message-ID: <4228575A.8070708@jp.fujitsu.com> (raw)
In-Reply-To: <422428EC.3090905@jp.fujitsu.com>

Thanks for all comments!

OK, I'd like to sort our situation:

################

$ Here are 2 features:
   - iochk_clear/read() interface for error "detection"
       by Seto ... me :-)
   - callback, thread, and event notification for error "recovery"
       by Linas ... expert in PPC64

$ What will "detection" interface provides?
   - allow drivers to get error information
      - device/bus was isolated/going-reset/re-enabled/etc.
      - error status which hardware and PCI subsystem provides
   - allow drivers to do "simple retry" easily
      - major soft errors(*1) would be recovered by a simple retry
      - in cases that device/bus was re-enabled but a retry is required

$ What will "recovery" infrastructure provides?
   - allow drivers to help OS's recovery
      - usually OS cannot re-enable affected devices by itself
   - allow drivers to respond asynchronous error event
      - allow drivers to implement "device specific recovery"

$ Difference of stance
   - "detection"
      - Assume that the number of soft error is far more than that of
        hard error. (PCI-Express has ECC, but traditional PCI does not.)
      - Assume that it isn't too late that attempt of device isolation
        and/or recovery comes after a simple retry(*2), and that a retry
        would be required even if the recovery had go well.
      - It isn't matter whether device isolation is actually possible or
        not for the arch. The fundamental intention of this interface is
        prevent user applications from data pollution.
      - Currently DMA and asynchronous I/O is not target.
   - "recovery"
      - (I'd appreciate it if Linas could fill here by his suitable words.)
      - (Maybe,) it is based on assuming that erroneous device should be
        isolated immediately irrespective of type of the error.
      - (I guess that) once a device was isolated, it become harder to
        re-enable it. It seems like a kind of hotplug feature.
      - Currently there are few platform which can isolate devices and
        attempt to recover from the I/O error.

$ How to use
   - "detection" ... easy.
      - clip I/Os by iochk_clear() and iochk_read()
      - if iochk_read() returns non-0, retry once and/or notify the error
        to user application.
   - "recovery" ... rather hard.
      - (I'd appreciate it if Linas could fill here by his suitable words.)
      - write callback function for each event(*3)

-----

*1:
   Traditionally, there are 2 types of error:
    - soft error:
        data was broken (ex. due to low voltage, natural radiation etc.)
        temporary error
    - hard error:
        device or bus was physically broken (i.e. uncorrectable)
        permanent error

*2:
   it's difficult to distinguish hard errors from soft errors, without
   any retry.

*3:
   Linas, how many stages/events would you prefer to be there? is 3 enough?

   ex. IMHO:

   IOERR_DETECTED
     - An error was detected, so error logging or device isolation would be
       major request. On PPC64, isolation would be already done by hardware.
   IOERR_PREPARE_RECOVERY
     - Require preparation before attempting error recovery by OS.
   IOERR_DO_RECOVERY
     - Require device specific recovery and result of the recovery.
       OS will gather all results and will decide recovered or not.
   IOERR_RECOVERED
     - OS recovery was succeeded.
   IOERR_DEAD
     - OS recovery was failed.

   And as Ben said and as you already proposed, I also think only one callback
   is enough and better, like:
     int pci_emergency_callback(pci_dev *dev, err_event event, void *extra)

   It allows us to add new event if desired.

################

Thanks,
H.Seto


WARNING: multiple messages have this Message-ID (diff)
From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Linux Kernel list <linux-kernel@vger.kernel.org>,
	linux-pci@atrey.karlin.mff.cuni.cz, linux-ia64@vger.kernel.org
Cc: Linus Torvalds <torvalds@osdl.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Linas Vepstas <linas@austin.ibm.com>,
	"Luck, Tony" <tony.luck@intel.com>
Subject: Re: [PATCH/RFC] I/O-check interface for driver's error handling
Date: Fri, 04 Mar 2005 21:40:58 +0900	[thread overview]
Message-ID: <4228575A.8070708@jp.fujitsu.com> (raw)
In-Reply-To: <422428EC.3090905@jp.fujitsu.com>

Thanks for all comments!

OK, I'd like to sort our situation:

################

$ Here are 2 features:
   - iochk_clear/read() interface for error "detection"
       by Seto ... me :-)
   - callback, thread, and event notification for error "recovery"
       by Linas ... expert in PPC64

$ What will "detection" interface provides?
   - allow drivers to get error information
      - device/bus was isolated/going-reset/re-enabled/etc.
      - error status which hardware and PCI subsystem provides
   - allow drivers to do "simple retry" easily
      - major soft errors(*1) would be recovered by a simple retry
      - in cases that device/bus was re-enabled but a retry is required

$ What will "recovery" infrastructure provides?
   - allow drivers to help OS's recovery
      - usually OS cannot re-enable affected devices by itself
   - allow drivers to respond asynchronous error event
      - allow drivers to implement "device specific recovery"

$ Difference of stance
   - "detection"
      - Assume that the number of soft error is far more than that of
        hard error. (PCI-Express has ECC, but traditional PCI does not.)
      - Assume that it isn't too late that attempt of device isolation
        and/or recovery comes after a simple retry(*2), and that a retry
        would be required even if the recovery had go well.
      - It isn't matter whether device isolation is actually possible or
        not for the arch. The fundamental intention of this interface is
        prevent user applications from data pollution.
      - Currently DMA and asynchronous I/O is not target.
   - "recovery"
      - (I'd appreciate it if Linas could fill here by his suitable words.)
      - (Maybe,) it is based on assuming that erroneous device should be
        isolated immediately irrespective of type of the error.
      - (I guess that) once a device was isolated, it become harder to
        re-enable it. It seems like a kind of hotplug feature.
      - Currently there are few platform which can isolate devices and
        attempt to recover from the I/O error.

$ How to use
   - "detection" ... easy.
      - clip I/Os by iochk_clear() and iochk_read()
      - if iochk_read() returns non-0, retry once and/or notify the error
        to user application.
   - "recovery" ... rather hard.
      - (I'd appreciate it if Linas could fill here by his suitable words.)
      - write callback function for each event(*3)

-----

*1:
   Traditionally, there are 2 types of error:
    - soft error:
        data was broken (ex. due to low voltage, natural radiation etc.)
        temporary error
    - hard error:
        device or bus was physically broken (i.e. uncorrectable)
        permanent error

*2:
   it's difficult to distinguish hard errors from soft errors, without
   any retry.

*3:
   Linas, how many stages/events would you prefer to be there? is 3 enough?

   ex. IMHO:

   IOERR_DETECTED
     - An error was detected, so error logging or device isolation would be
       major request. On PPC64, isolation would be already done by hardware.
   IOERR_PREPARE_RECOVERY
     - Require preparation before attempting error recovery by OS.
   IOERR_DO_RECOVERY
     - Require device specific recovery and result of the recovery.
       OS will gather all results and will decide recovered or not.
   IOERR_RECOVERED
     - OS recovery was succeeded.
   IOERR_DEAD
     - OS recovery was failed.

   And as Ben said and as you already proposed, I also think only one callback
   is enough and better, like:
     int pci_emergency_callback(pci_dev *dev, err_event event, void *extra)

   It allows us to add new event if desired.

################

Thanks,
H.Seto


  parent reply	other threads:[~2005-03-04 12:40 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-03-01  8:33 [PATCH/RFC] I/O-check interface for driver's error handling Hidetoshi Seto
2005-03-01  8:33 ` Hidetoshi Seto
2005-03-01 14:42 ` Matthew Wilcox
2005-03-01 14:42   ` Matthew Wilcox
2005-03-01 19:27   ` Linas Vepstas
2005-03-01 19:27     ` Linas Vepstas
2005-03-01 19:37     ` Linus Torvalds
2005-03-01 19:37       ` Linus Torvalds
2005-03-02  6:13     ` Hidetoshi Seto
2005-03-02  6:13       ` Hidetoshi Seto
2005-03-02 19:20       ` Linas Vepstas
2005-03-02 19:20         ` Linas Vepstas
2005-03-04  2:03         ` Hidetoshi Seto
2005-03-04  2:03           ` Hidetoshi Seto
2005-03-04 16:46           ` Linas Vepstas
2005-03-04 16:46             ` Linas Vepstas
2005-03-01 16:37 ` Jeff Garzik
2005-03-01 16:37   ` Jeff Garzik
2005-03-01 16:49   ` Linus Torvalds
2005-03-01 16:49     ` Linus Torvalds
2005-03-01 16:59     ` Matthew Wilcox
2005-03-01 16:59       ` Matthew Wilcox
2005-03-01 17:10       ` Jesse Barnes
2005-03-01 17:10         ` Jesse Barnes
2005-03-01 18:33         ` Linas Vepstas
2005-03-01 18:33           ` Linas Vepstas
2005-03-01 22:27           ` Benjamin Herrenschmidt
2005-03-01 22:27             ` Benjamin Herrenschmidt
2005-03-02 20:02             ` Linas Vepstas
2005-03-02 20:02               ` Linas Vepstas
2005-03-02 22:46               ` Benjamin Herrenschmidt
2005-03-02 22:46                 ` Benjamin Herrenschmidt
2005-03-02 23:37                 ` Linas Vepstas
2005-03-02 23:37                   ` Linas Vepstas
2005-03-01 22:23         ` Benjamin Herrenschmidt
2005-03-01 22:23           ` Benjamin Herrenschmidt
2005-03-02  3:13         ` Hidetoshi Seto
2005-03-02  3:13           ` Hidetoshi Seto
2005-03-04 13:54         ` Pavel Machek
2005-03-04 13:54           ` Pavel Machek
2005-03-04 17:50           ` Jesse Barnes
2005-03-04 17:50             ` Jesse Barnes
2005-03-04 22:37           ` Benjamin Herrenschmidt
2005-03-04 22:37             ` Benjamin Herrenschmidt
2005-03-04 22:57             ` Pavel Machek
2005-03-04 22:57               ` Pavel Machek
2005-03-04 23:03               ` Benjamin Herrenschmidt
2005-03-04 23:03                 ` Benjamin Herrenschmidt
2005-03-04 23:18                 ` Pavel Machek
2005-03-04 23:18                   ` Pavel Machek
2005-03-04 23:27                   ` Benjamin Herrenschmidt
2005-03-04 23:27                     ` Benjamin Herrenschmidt
2005-03-02  2:28       ` Hidetoshi Seto
2005-03-02  2:28         ` Hidetoshi Seto
2005-03-02 17:44         ` Linas Vepstas
2005-03-02 17:44           ` Linas Vepstas
2005-03-02 18:03           ` linux-os
2005-03-02 18:03             ` linux-os
2005-03-02 22:40             ` Benjamin Herrenschmidt
2005-03-02 22:40               ` Benjamin Herrenschmidt
2005-03-04  2:21           ` Hidetoshi Seto
2005-03-04  2:21             ` Hidetoshi Seto
2005-03-01 22:20     ` Benjamin Herrenschmidt
2005-03-01 22:20       ` Benjamin Herrenschmidt
2005-03-02 18:22     ` Linas Vepstas
2005-03-02 18:22       ` Linas Vepstas
2005-03-02 18:41       ` Jesse Barnes
2005-03-02 18:41         ` Jesse Barnes
2005-03-02 19:46         ` Linas Vepstas
2005-03-02 19:46           ` Linas Vepstas
2005-03-02 22:43         ` Benjamin Herrenschmidt
2005-03-02 22:43           ` Benjamin Herrenschmidt
2005-03-02 22:41       ` Benjamin Herrenschmidt
2005-03-02 22:41         ` Benjamin Herrenschmidt
2005-03-02 23:30         ` Linas Vepstas
2005-03-02 23:30           ` Linas Vepstas
2005-03-02 23:40           ` Jesse Barnes
2005-03-02 23:40             ` Jesse Barnes
2005-03-01 19:17   ` Linas Vepstas
2005-03-01 19:17     ` Linas Vepstas
2005-03-01 22:15   ` Benjamin Herrenschmidt
2005-03-01 22:15     ` Benjamin Herrenschmidt
2005-03-01 17:19 ` Andi Kleen
2005-03-01 18:08   ` Linus Torvalds
2005-03-01 18:45     ` Andi Kleen
2005-03-01 18:59     ` Linas Vepstas
2005-03-01 22:26     ` Benjamin Herrenschmidt
2005-03-01 22:24   ` Benjamin Herrenschmidt
2005-03-04 12:40 ` Hidetoshi Seto [this message]
2005-03-04 12:40   ` Hidetoshi Seto

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4228575A.8070708@jp.fujitsu.com \
    --to=seto.hidetoshi@jp.fujitsu.com \
    --cc=benh@kernel.crashing.org \
    --cc=linas@austin.ibm.com \
    --cc=linux-ia64@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@atrey.karlin.mff.cuni.cz \
    --cc=tony.luck@intel.com \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.