public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Peterson <dsp@llnl.gov>
To: Arjan van de Ven <arjan@infradead.org>
Cc: Greg KH <greg@kroah.com>, "Randy.Dunlap" <rdunlap@xenotime.net>,
	linux-kernel@vger.kernel.org, torvalds@osdl.org, alan@redhat.com,
	gregkh@kroah.com, Doug Thompson <dthompson@lnxi.com>,
	bluesmoke-devel@lists.sourceforge.net
Subject: Re: [PATCH] EDAC: core EDAC support code
Date: Fri, 10 Mar 2006 17:57:00 -0800	[thread overview]
Message-ID: <200603101757.00060.dsp@llnl.gov> (raw)
In-Reply-To: <1142025821.2876.106.camel@laptopd505.fenrus.org>

On Friday 10 March 2006 13:23, Arjan van de Ven wrote:
> hmm ok so I want a function that takes a device as parameter, and checks
> the state of that device for errors. Internally that probably has to go
> via a function pointer somewhere to a device specific check method.
>
> Or maybe a per test-type (pci parity / ECC / etc) check
>
> int pci_check_parity_errors(struct pci_dev *dev, int flags);
>
> something like that, or pci_check_and_clear_parity_errors()
> (although that gets too long :)
>
> drivers can call that, say, after firmware init or something to validate
> their device is sanely connected. Maybe pci_enable_device() could call
> it too.
>
> This also needs a pci_suspend_parity_check() ... _resume_ ... so that
> the driver can temporarily disable any checks, for example during device
> reset/init. And then just before resume, it manually clears a check.

ok, perhaps things might look something like this?

    - A check_error() function checks a device for errors.  As you
      mention above, this may be a device-specific check method or a
      function like pci_check_parity_errors(dev, flags).  In either
      case, if check_error() finds an error, it clears the error from
      the device's registers and returns the error.  If check_error()
      finds multiple errors, here are couple of options:

          - Return a list of all errors detected.
          - Return a single error along with a boolean value that says
            whether more errors are present.  If so, check_error() may
            be called repeatedly until no errors remain.

    - A handle_error() function handles errors returned by
      check_error().  Here are a few options: Each device may have a
      handle_error() method that takes an error as a parameter.  Or,
      each type of error may have its own handle_me() method.  A third
      option is something like pci_handle_parity_error(dev, error).
      In any case, depending on the error type, the handler may choose
      to feed the error to EDAC.

    - Each hardware subsystem or device driver may determine its own
      error checking/handling strategy.  Some may want to poll for
      errors.  Others may want error detection to be asynchronous
      (driven by interrupts or exceptions).  Or a subsystem may poll
      for some kinds of errors and detect others asynchronously.  As
      you suggest, errors may also be checked for in other situations,
      such as after firmware init.

      For polling, the poll function calls check_error(), and then
      calls handle_error() if an error is found.  For interrupt-driven
      error checking, the interrupt handler calls check_error().  If
      an error is found, the interrupt handler may either call
      handle_error() directly or defer invocation of handle_error()
      outside interrupt context.  If the interrupt is an NMI, deferred
      invocation of handle_error() allows it to execute without
      introducing deadlocks or race conditions.

    - For some types of hardware, at boot time the device's registers
      contain spurious error info, which should be discarded.  This
      may be done by calling check_error() and discarding the results.

> > > 2) per bus code that calls the do check functions and whatever is
> > > needed for bus checks
> > >
> > > 3) "EDAC" central command, which basically gathers all failure reports
> > > and does something with them (push them to userspace or implement the
> > > userspace chosen policy (panic/reboot/etc))
> >
> > Are you suggesting something like the following?
> >
> >     - The controls that determine how the error checking is done are
> >       located within the various hardware subsystems.  For instance,
> >       with PCI parity checking, this would include stuff like setting
> >       the polling frequency and determining which devices to check.
>
> yes. I would NOT make it overly tunable btw. For many things a sane
> default can be chosen, and the effect of picking a different tuning
> isn't all that big. Maybe think of it this way: a tuneable to a large
> degree is an excuse for not determining the right value / heuristic in
> the first place.

Sounds good.

> >     - When an error is actually detected, the subsystem that detected
> >       the error (for instance, PCI) would feed the error information
> >       to EDAC.  Then EDAC would determine how to respond to the error
> >       (for instance, push it to userspace or implement the
> >       userspace-chosen policy (panic/reboot/etc))
>
> yup.

Cool!  I think this also coincides with what Doug is saying.  Doug, how
does this sound?


  reply	other threads:[~2006-03-11  1:57 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <200601190414.k0J4EZCV021775@hera.kernel.org>
2006-03-05 10:18 ` [PATCH] EDAC: core EDAC support code Arjan van de Ven
2006-03-05 10:30   ` Arjan van de Ven
2006-03-06 18:14     ` Dave Peterson
2006-03-06 18:22       ` Randy.Dunlap
2006-03-07 17:03         ` Dave Peterson
2006-03-07 17:20           ` Greg KH
2006-03-08  0:29             ` Dave Peterson
2006-03-07 19:03           ` Arjan van de Ven
2006-03-07 19:05             ` Greg KH
2006-03-09 23:51             ` Dave Peterson
2006-03-10  0:02               ` Greg KH
2006-03-10  1:46                 ` Dave Peterson
2006-03-10  7:36                   ` Arjan van de Ven
2006-03-10 11:06                     ` Tim Small
2006-03-10 11:40                       ` Arjan van de Ven
2006-03-10 17:46                     ` Dave Peterson
2006-03-10 17:58                       ` Arjan van de Ven
2006-03-10 19:07                         ` Dave Peterson
2006-03-10 19:33                           ` Arjan van de Ven
2006-03-10 21:13                             ` Dave Peterson
2006-03-10 21:23                               ` Arjan van de Ven
2006-03-11  1:57                                 ` Dave Peterson [this message]
2006-03-11  7:18                                   ` Arjan van de Ven
2006-03-13 19:31                                     ` Dave Peterson
2006-03-06 19:52       ` Greg KH
2006-03-05 15:55   ` Greg KH
2006-03-05 16:24     ` Arjan van de Ven
2006-03-06 18:52     ` Dave Peterson
2006-03-06 19:53       ` Greg KH
2006-03-06 21:01         ` Dave Peterson
2006-03-06 21:07           ` Arjan van de Ven
2006-03-09  3:19             ` Dave Peterson
2006-03-09  3:44               ` Al Viro
2006-03-09  5:51               ` Greg KH
2006-03-06 21:32           ` Al Viro
2006-03-06 21:53             ` Greg KH
2006-03-06 22:24               ` Al Viro
2006-03-06 22:55                 ` Greg KH
2006-03-07 10:41                 ` Andrew Morton
2006-03-07 11:08                   ` Al Viro
2006-03-08  2:46                   ` Rusty Russell
2006-03-07  1:45               ` Dmitry Torokhov
2006-03-07  1:57                 ` Greg KH
2006-03-07  2:10                   ` Dmitry Torokhov
2006-03-07 16:47             ` Dave Peterson
2006-03-07 17:04               ` Greg KH
2006-03-07 17:06                 ` Dave Peterson
2006-03-08  1:03                 ` Dmitry Torokhov
2006-03-08  1:33                   ` Greg KH
2006-03-10  0:44 Doug Thompson
  -- strict thread matches above, loose matches on Subject: below --
2006-03-10 16:56 Doug Thompson
2006-03-10 17:11 ` Arjan van de Ven
2006-03-10 17:28 Doug Thompson
2006-03-10 20:37 Doug Thompson
2006-03-11 17:04 Doug Thompson
2006-03-13 19:35 ` Dave Peterson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200603101757.00060.dsp@llnl.gov \
    --to=dsp@llnl.gov \
    --cc=alan@redhat.com \
    --cc=arjan@infradead.org \
    --cc=bluesmoke-devel@lists.sourceforge.net \
    --cc=dthompson@lnxi.com \
    --cc=greg@kroah.com \
    --cc=gregkh@kroah.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rdunlap@xenotime.net \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox