Re: Problems with EDAC coexisting with BIOS

From: David Peterson <peterson66@llnl.gov>
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tim Small <tim@buttersideup.com>,
	"Ong, Soo Keong" <soo.keong.ong@intel.com>,
	"Gross, Mark" <mark.gross@intel.com>,
	bluesmoke-devel@lists.sourceforge.net,
	LKML <linux-kernel@vger.kernel.org>,
	"Carbonari, Steven" <steven.carbonari@intel.com>,
	"Wang, Zhenyu Z" <zhenyu.z.wang@intel.com>
Subject: Re: Problems with EDAC coexisting with BIOS
Date: Wed, 03 May 2006 16:06:30 -0700
Message-ID: <64140659c4.659c464140@llnl.gov>

> On Wed, 2006-05-03, Alan Cox wrote:
> > something with NMI-signalled errors, I was wondering what the
> > problems with using NMI-signalled ECC errors were?
> 
> The big problem with NMI is that it can occur *during* a PCI
> configuration sequence (ie during pci_config_* functions). That
> means we can't safely do some I/O, especially configuration space
> I/O in an NMI handler. At best we could set a flag and catch it
> afterwards.

This is roughly what I did in the NMI handling code I wrote for
bluesmoke.  If my memory is correct, I believe there's a spinlock
that pci_read_config_dword() and friends acquire.  Basically
I did the following type of thing:

    /* Using spin_trylock() below avoids deadlock in the case where
     * the code interrupted by the NMI is holding the lock.
     */
    if (likely(spin_trylock(&lock))) {
            /* We got the lock.  Go ahead and access PCI config space
             * (and then drop the lock)...
             */
    } else {
            /* This case should be rare in practice.  Defer the access
             * to PCI config space until we are outside NMI context (I
             * wrote a little API that facilitates doing this kind of
             * thing in a manner that avoids the deadlocks and race
             * conditions associated with NMI handlers).
             */
    }
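
For illustration only, here is a minimal userspace sketch of the
try-lock-or-defer pattern described above.  It is not the bluesmoke
code.  Every name in it (cfg_lock, check_deferred, nmi_handler(),
run_deferred_work() and so on) is hypothetical, a C11 atomic_flag
stands in for the kernel spinlock, and a counter stands in for the
actual PCI config space access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from bluesmoke or the kernel. */
static atomic_flag cfg_lock = ATOMIC_FLAG_INIT; /* models the PCI config lock */
static atomic_bool check_deferred;              /* set when the handler backed off */
static int errors_checked;                      /* counts simulated config reads */

/* Simulated config-space access; the caller must hold cfg_lock. */
static void check_hw_errors(void)
{
        errors_checked++;
}

static bool cfg_trylock(void)
{
        /* test_and_set returns the old value: false means we took the lock */
        return !atomic_flag_test_and_set_explicit(&cfg_lock, memory_order_acquire);
}

static void cfg_unlock(void)
{
        atomic_flag_clear_explicit(&cfg_lock, memory_order_release);
}

/* Runs in (simulated) NMI context, so it must never spin on cfg_lock. */
static void nmi_handler(void)
{
        if (cfg_trylock()) {
                check_hw_errors();
                cfg_unlock();
        } else {
                /* The interrupted code holds the lock: just record the request. */
                atomic_store(&check_deferred, true);
        }
}

/* Runs outside NMI context, where waiting for the lock is safe. */
static void run_deferred_work(void)
{
        if (atomic_exchange(&check_deferred, false)) {
                while (!cfg_trylock())
                        ;
                check_hw_errors();
                cfg_unlock();
        }
}

/* Walks through both paths once and returns the number of checks performed. */
static int example_run(void)
{
        bool held;

        nmi_handler();          /* lock is free: the check happens at once */
        assert(errors_checked == 1);

        held = cfg_trylock();   /* pretend the interrupted code holds the lock */
        assert(held);
        nmi_handler();          /* handler backs off instead of spinning */
        assert(errors_checked == 1 && atomic_load(&check_deferred));
        cfg_unlock();

        run_deferred_work();    /* the deferred check runs now */
        return errors_checked;
}
```

The point of the sketch is only that the NMI path never waits: it
either gets the lock on the first try or leaves a flag behind for
ordinary context to act on.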

For anyone who is interested, the code may be obtained by going to
http://sourceforge.net/projects/bluesmoke/ and downloading the latest
version of the 'bluesmoke' package.  It's experimental stuff that
hasn't seen much testing.  Also it's not quite functional as-is
because a little piece of code still needs to be added somewhere that
enables SERR#.

I'm not necessarily advocating that NMI-driven error handling should
go into the mainstream kernel.  The expected benefits would have to
be weighed against the extra complexity that the code introduces.
However the parts of the code that handle the basic NMI-related
synchronization issues are abstracted into a relatively clean 
architecture-independent API that may be useful in other places where
NMIs (or similar types of exceptions/interrupts on other platforms)
are used.  I posted this code to LKML a while ago but it has since
been improved.  Perhaps having this type of code (or something
similar) in just one location would be an improvement over reinventing
the wheel in a number of places.

There is an issue regarding NMI-driven error handling that may be a
substantial pain to deal with on x86: when an NMI occurs, we can't be
sure whether it came from the watchdog or from a hardware error.
Therefore we must check the hardware for errors on each NMI.  This is
no better than polling (at whatever frequency the watchdog runs).
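
To make the cost concrete, here is a small simulation, again only a
sketch with made-up names (hw_error_latched, shared_nmi_handler(),
simulate_nmis()).  It assumes the handler gets no indication of why
the NMI fired, so it has to read the error status on every call.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; this only simulates the situation. */
static bool hw_error_latched;   /* models an error status bit in the chipset */
static int hw_status_reads;     /* how often the handler had to query hardware */
static int errors_handled;

/* Models reading and clearing the error status register. */
static bool read_and_clear_hw_error(void)
{
        bool seen = hw_error_latched;

        hw_status_reads++;
        hw_error_latched = false;
        return seen;
}

/* The handler is not told why the NMI fired, so it must always check. */
static void shared_nmi_handler(void)
{
        if (read_and_clear_hw_error())
                errors_handled++;
        /* otherwise treat it as a watchdog tick and carry on */
}

/* Deliver watchdog NMIs, plus one error NMI just before tick error_tick. */
static void simulate_nmis(int watchdog_ticks, int error_tick)
{
        for (int t = 0; t < watchdog_ticks; t++) {
                if (t == error_tick) {
                        hw_error_latched = true;
                        shared_nmi_handler();   /* NMI raised by the error */
                }
                shared_nmi_handler();           /* NMI raised by the watchdog */
        }
}

/* Example run: 1000 watchdog ticks and a single real error. */
static void example_run(void)
{
        simulate_nmis(1000, 500);
        assert(errors_handled == 1);
        assert(hw_status_reads == 1001);        /* one status read per NMI */
}
```

One real error costs 1001 status reads here, which is the polling
overhead described above.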

The above problem can be worked around as follows (although I'm
not advocating that this be done in Linux): Modify local_irq_disable()
and local_irq_enable() so that instead of using cli/sti machine
instructions they adjust the interrupt priority level (controlled by
the local APIC on x86 processors) as follows:

    local_irq_disable()
    {
            /* Set interrupt priority level to (MAX_PRIORITY - 1).
             * In other words, all interrupts are masked out except
             * those whose priority is MAX_PRIORITY.
             */
    }

    local_irq_enable()
    {
            /* Set interrupt priority level to 0 (i.e. all interrupts
             * are enabled).
             */
    }

Then the watchdog may be implemented as a normal interrupt with
priority MAX_PRIORITY, and all other interrupts may be given lower
priorities.  NMI would then only be asserted for genuine hardware
errors (PCI parity errors, ECC memory errors, etc.).  This is
probably more trouble than it's worth.  However I think it's doable
in principle.
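
The masking rule can be modelled in a few lines of plain C.  This is a
simulation under stated assumptions, not APIC programming: the value
of MAX_PRIORITY, the sim_* function names and the example device
priority are all invented for illustration, and interrupt priorities
are assumed to start at 1 so that level 0 masks nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values and names, chosen only for this simulation. */
enum {
        MAX_PRIORITY      = 15,            /* assumed highest priority level */
        WATCHDOG_PRIORITY = MAX_PRIORITY,  /* watchdog as a normal interrupt */
        DEVICE_PRIORITY   = 3              /* some ordinary device interrupt */
};

static int cur_priority;  /* models the CPU's current interrupt priority level */

/* An interrupt is delivered only if its priority exceeds the current level. */
static bool irq_deliverable(int irq_priority)
{
        return irq_priority > cur_priority;
}

/* Replacement for cli: mask everything except MAX_PRIORITY. */
static void sim_local_irq_disable(void)
{
        cur_priority = MAX_PRIORITY - 1;
}

/* Replacement for sti: priority level 0, nothing is masked. */
static void sim_local_irq_enable(void)
{
        cur_priority = 0;
}

/* Checks that the watchdog still gets through while "interrupts are off". */
static void example_run(void)
{
        sim_local_irq_disable();
        assert(irq_deliverable(WATCHDOG_PRIORITY));
        assert(!irq_deliverable(DEVICE_PRIORITY));

        sim_local_irq_enable();
        assert(irq_deliverable(WATCHDOG_PRIORITY));
        assert(irq_deliverable(DEVICE_PRIORITY));
}
```

With the watchdog delivered this way, an NMI no longer has to be
shared with it, which is the property the scheme is after.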
