public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 23:25 Gross, Mark
  2006-04-26  2:19 ` Corey Minyard
  2006-04-26  2:34 ` Randy.Dunlap
  0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 23:25 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
	Ong, Soo Keong, Wang, Zhenyu Z

[-- Attachment #1: Type: text/plain, Size: 765 bytes --]



>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 3:40 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>>
>>>
>>>
>>
>>Done.
>>Signed-off-by: Mark Gross
>>
>>
>>--mgross
>>
>>
>Yes, this is what I was talking about.  I believe the mode of
>module_param should be 444, since modifying this after the module is
>loaded won't do anything.  Also, it might be nice to print something
>different in the "force" case.

How about printing nothing like it used too?

See attached.  

Signed-off-by: Mark Gross

--mgross

[-- Attachment #2: e752x_edac.patch --]
[-- Type: application/octet-stream, Size: 1739 bytes --]

diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c	2006-04-25 16:19:25.000000000 -0700
@@ -29,6 +29,7 @@
 
 #include "edac_mc.h"
 
+static int force_function_unhide = 0;
 
 #ifndef PCI_DEVICE_ID_INTEL_7520_0
 #define PCI_DEVICE_ID_INTEL_7520_0      0x3590
@@ -755,10 +756,18 @@
 	debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
 	debugf0("Starting Probe1\n");
 
-	/* enable device 0 function 1 */
+	/* check to see if device 0 function 1 is enbaled if it isn't we
+	 * assume the BIOS has reserved it for a reason and is expecting
+	 * exclusive access, we take care to not violate that assumption and
+	 * fail the probe. */
 	pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
-	stat8 |= (1 << 5);
-	pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+	if (!force_function_unhide && !(stat8 & (1 << 5))) {
+		printk(KERN_INFO "contact your bios vendor to see if the " 
+		"E752x error registers can be safely un-hidden\n");
+			goto fail;
+	}
+        stat8 |= (1 << 5);
+        pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
 
 	/* need to find out the number of channels */
 	pci_read_config_dword(pdev, E752X_DRC, &drc);
@@ -1069,3 +1078,8 @@
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
 MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0444);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-04 16:44 David Peterson
  0 siblings, 0 replies; 44+ messages in thread
From: David Peterson @ 2006-05-04 16:44 UTC (permalink / raw)
  To: Tim Small
  Cc: Alan Cox, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
	Carbonari, Steven, Wang, Zhenyu Z

> My first thought was to schedule a tasklet as part of the ECC-
specific 
> NMI handling, or are there any gotchas with doing this from within 
> an NMI handler?

Unfortunately yes.  __tasklet_schedule() uses interrupt disabling
as a synchronization mechanism.  This presents a problem since by
definition, NMIs can occur even when interrupts are disabled. 
However the NMI handling code in bluesmoke has a mechanism similar to
tasklets that is intended specifically for use by NMI handlers.



^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-03 23:06 David Peterson
  0 siblings, 0 replies; 44+ messages in thread
From: David Peterson @ 2006-05-03 23:06 UTC (permalink / raw)
  To: Alan Cox
  Cc: Tim Small, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
	Carbonari, Steven, Wang, Zhenyu Z

> On Wed, 2006-05-03, Alan Cox wrote:
> > something with NMI-signalled errors, I was wondering what the 
problems 
> > with using NMI-signalled ECC errors were?
> 
> The big problem with NMI is that it can occur *during* a PCI
> configuration sequence (ie during pci_config_* functions). That 
> means we can't safely do some I/O, especially configuration space 
I/O in an NMI
> handler. At best we could set a flag and catch it afterwards.

This is roughly what I did in the NMI handling code I wrote for
bluesmoke.  If my memory is correct, I believe there's a spinlock
that pci_read_config_dword() and friends acquire.  Basically
I did the following type of thing:

    /* Using spin_trylock() below avoids deadlock in the case where
     * the code interrupted by the NMI is holding the lock.
     */
    if (likely(spin_trylock(&lock))) {
            We got the lock.  Go ahead and access PCI config space
            (and then drop the lock)...
    } else {
            This case should be rare in practice.  Defer the access to
            PCI config space outside NMI context (I wrote a little API
            that facilitates doing this kind of stuff in a manner that
            avoids deadlocks and race conditions associated with NMI
            handlers).
    }

For anyone who is interested, the code may be obtained by going to
http://sourceforge.net/projects/bluesmoke/ and downloading the latest
version of the 'bluesmoke' package.  It's experimental stuff that
hasn't seen much testing.  Also it's not quite functional as-is
because a little piece of code still needs to be added somewhere that
enables SERR#.

I'm not necessarily advocating that NMI-driven error handling should
go into the mainstream kernel.  The expected benefits would have to
be weighed against the extra complexity that the code introduces.
However the parts of the code that handle the basic NMI-related
synchronization issues are abstracted into a relatively clean 
architecture-independent API that may be useful in other places where
NMIs (or similar types of exceptions/interrupts on other platforms)
are used.  I posted this code to LKML a while ago but it has since
been improved.  Perhaps having this type of code (or something
similar) in just one location would be an improvement over reinventing
the wheel in a number of places.

There is an issue regarding NMI-driven error handling that may be a
substantial pain to deal with on x86: When NMI occurs, we can't be
sure whether the NMI is from the watchdog, or due to a hardware error.
Therefore we must check the hardware for errors on each NMI.  This is
no better than polling (at whatever frequency the watchdog runs at).

The above problem can be worked around as follows (although I'm
not advocating that this be done in Linux): Modify local_irq_disable()
and local_irq_enable() so that instead of using cli/sti machine
instructions they adjust the interrupt priority level (controlled by
the local APIC on x86 processors) as follows:

    local_irq_disable()
    {
            Set interrupt priority level to (MAX_PRIORITY - 1).
            In other words, all interrupts are masked out except
            those whose priority is MAX_PRIORITY.
    }

    local_irq_enable()
    {
            Set interrupt priority level to 0 (i.e. all interrupts
            are enabled).
    }

Then the watchdog may be implemented as a normal interrupt with
priority MAX_PRIORITY, and all other interrupts may be given lower
priorities.  NMI would then only be asserted for genuine hardware
errors (PCI parity errors, ECC memory errors, etc.).  This is
probably more trouble than it's worth.  However I think it's doable
in principle.





^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-03 22:22 Doug Thompson
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-05-03 22:22 UTC (permalink / raw)
  To: tim, alan
  Cc: mark.gross, soo.keong.ong, steven.carbonari, zhenyu.z.wang,
	bluesmoke-devel, linux-kernel

>>> Alan Cox <alan@lxorguk.ukuu.org.uk> 05/03/06 3:44 PM >>>
On Mer, 2006-05-03 at 21:25 +0100, Tim Small wrote:
>> something with NMI-signalled errors, I was wondering what the
problems 
>> with using NMI-signalled ECC errors were?

>The big problem with NMI is that it can occur *during* a PCI
>configuration sequence (ie during pci_config_* functions). That means
we
>can't safely do some I/O, especially configuration space I/O in an NMI
>handler. At best we could set a flag and catch it afterwards.


Dave Peterson did submit a set of patches to provide a deferred calling
mechanism triggered by the NMI handler. I am in the process of reading
up on that patch. It currently is in the bluesmoke project.

doug t



^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-05-03 21:39 Doug Thompson
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-05-03 21:39 UTC (permalink / raw)
  To: mark.gross
  Cc: tim, soo.keong.ong, steven.carbonari, zhenyu.z.wang,
	bluesmoke-devel, alan, linux-kernel

On Wed, 2006-05-03 at 20:49 +0000, "Gross, Mark"  wrote:
> 
> >-----Original Message-----
> >From: Tim Small [mailto:tim@buttersideup.com]
> >Sent: Wednesday, May 03, 2006 1:25 PM
> >To: Alan Cox
> >Cc: Ong, Soo Keong; Gross, Mark;
bluesmoke-devel@lists.sourceforge.net;
> >LKML; Carbonari, Steven; Wang, Zhenyu Z
> >Subject: Re: Problems with EDAC coexisting with BIOS
> >
> >Alan Cox wrote:
> >
> >>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
> >>
> >>
> >>>To me, periodical is not a good design for error handling, it
wastes
> >>>transaction bandwidth that should be used for other more productive
> >>>purposes.
> >>>
> >>>
> >>
> >>The periodical choice is mostly down to the brain damaged choice of
> NMI
> >>as the viable alternative, which is as good as 'not usable'
> >>
> >>
> >Hi,
> >
> >As I believe that the majority of the bluesmoke/EDAC developers are
> >(were) operating under the assumption that it would be possible to do
> >something with NMI-signalled errors, I was wondering what the
problems
> >with using NMI-signalled ECC errors were?

Dave Peterson created the NMI patch for bluesmoke that is currently
operational in the bluesmoke project.  When bluesmoke was submitted, the
advice we received was to defer until later to apply that NMI patch
after we had gotten EDAC into the kernel and more stable.

Since Dave has stepped down, that leaves me to refactor that patch for
EDAC release. It is on my TODO list, but not yet risen to the top.

Once concern I had yet to research is:  Do all motherboards route the
NMI signal properly? Is that a function of the BIOS? etc, etc


> 
> Soo Keong or Steve, can you answer this one?
> 
> >
> >Are there some systems states in which an incoming NMI throws a
spanner
> >in to the works in an unrecoverable way?  If this is the case, is it
so
> >on all x86/x86-64 systems, or just a subset, and is there no way to
> >implement some sort of top half / bottom half style NMI handler
> >cleanly?  As I am certainly not an x86 architecture expert, I would
> >appreciate any input from the resident gurus ;o).
> 
> I don't think NMI's are so much the problem as those blasted SMI's. 
And
> SMI's are not for sharing :(
> 
> These problems are BIOS specific, Your Mileage Will Vary from one bios
> to the next.
> 
> >
> >Quickly returning to the original problem - I know this isn't a
proper
> >API by any stretch of the imagination, and would require changes to
> >existing BIOSs, but the EDAC module could reprogram the chipset
> >error-signalling registers, so that an ECC error no longer triggers
an
> >SMI.  The BIOS SMI handler could then read the signalling registers,
> and
> >leave the ECC registers well alone if ECC errors are not set to
> generate
> >an SMI.
> 
> It's not the SMI from ECC events that are the problem.  It's the BIOS
> assuming no one else would want to use those registers on Dev0:Fun1,
and
> the fact that you still end up with a race between the OS and the BIOS
> SMI to handle the events.

exactly and for that reason we might have a tar baby to wash and clean
up somehow

doug t



^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-05-03 20:49 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-05-03 20:49 UTC (permalink / raw)
  To: Tim Small, Alan Cox
  Cc: Ong, Soo Keong, bluesmoke-devel, LKML, Carbonari, Steven,
	Wang, Zhenyu Z



>-----Original Message-----
>From: Tim Small [mailto:tim@buttersideup.com]
>Sent: Wednesday, May 03, 2006 1:25 PM
>To: Alan Cox
>Cc: Ong, Soo Keong; Gross, Mark; bluesmoke-devel@lists.sourceforge.net;
>LKML; Carbonari, Steven; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Alan Cox wrote:
>
>>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
>>
>>
>>>To me, periodical is not a good design for error handling, it wastes
>>>transaction bandwidth that should be used for other more productive
>>>purposes.
>>>
>>>
>>
>>The periodical choice is mostly down to the brain damaged choice of
NMI
>>as the viable alternative, which is as good as 'not usable'
>>
>>
>Hi,
>
>As I believe that the majority of the bluesmoke/EDAC developers are
>(were) operating under the assumption that it would be possible to do
>something with NMI-signalled errors, I was wondering what the problems
>with using NMI-signalled ECC errors were?

Soo Keong or Steve, can you answer this one?

>
>Are there some systems states in which an incoming NMI throws a spanner
>in to the works in an unrecoverable way?  If this is the case, is it so
>on all x86/x86-64 systems, or just a subset, and is there no way to
>implement some sort of top half / bottom half style NMI handler
>cleanly?  As I am certainly not an x86 architecture expert, I would
>appreciate any input from the resident gurus ;o).

I don't think NMI's are so much the problem as those blasted SMI's.  And
SMI's are not for sharing :(

These problems are BIOS specific, Your Mileage Will Vary from one bios
to the next.

>
>Quickly returning to the original problem - I know this isn't a proper
>API by any stretch of the imagination, and would require changes to
>existing BIOSs, but the EDAC module could reprogram the chipset
>error-signalling registers, so that an ECC error no longer triggers an
>SMI.  The BIOS SMI handler could then read the signalling registers,
and
>leave the ECC registers well alone if ECC errors are not set to
generate
>an SMI.

It's not the SMI from ECC events that are the problem.  It's the BIOS
assuming no one else would want to use those registers on Dev0:Fun1, and
the fact that you still end up with a race between the OS and the BIOS
SMI to handle the events.

>
>Cheers,
>
>Tim.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-26  3:24 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-26  3:24 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: minyard, alan, bluesmoke-devel, linux-kernel, Carbonari, Steven,
	Ong, Soo Keong, Wang, Zhenyu Z



>-----Original Message-----
>From: Randy.Dunlap [mailto:rdunlap@xenotime.net]
>Sent: Tuesday, April 25, 2006 7:34 PM
>To: Gross, Mark
>Cc: minyard@acm.org; alan@lxorguk.ukuu.org.uk; bluesmoke-
>devel@lists.sourceforge.net; linux-kernel@vger.kernel.org; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>On Tue, 25 Apr 2006 16:25:59 -0700 Gross, Mark wrote:
>
>>
>> How about printing nothing like it used too?
>>
>> See attached.
>>
>> Signed-off-by: Mark Gross
>
>Hi Mark,
>
>This small patch will need some cleaning up:
>
>1.  Signed-off-by: Mark Gross <mark.gross@intel.com>
>
>2.  Try not to use attachments.  If you must, then make the attachment
>	type be text instead of octet-stream.
>
>3.  No need to init to 0:
>+static int force_function_unhide = 0;
>
>4.  Typos:
>
>+	/* check to see if device 0 function 1 is enbaled if it isn't we
>                                                  enabled; it it isn't,
we
>+	 * assume the BIOS has reserved it for a reason and is expecting
>+	 * exclusive access, we take care to not violate that assumption
and
>                                          not to violate
>+	 * fail the probe. */
>
>5.  indentation, typos, and at least one trailing space:
>
>+	if (!force_function_unhide && !(stat8 & (1 << 5))) {
>+		printk(KERN_INFO "contact your bios vendor to see if the
"
>                                  Contact your BIOS
>+		"E752x error registers can be safely un-hidden\n");
>		^indent one more tab
>
>---

Thanks,
I'll get these tomorrow AM.

--mgross

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-26  3:19 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-26  3:19 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
	Ong, Soo Keong, Wang, Zhenyu Z



>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 7:19 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>>Yes, this is what I was talking about.  I believe the mode of
>>>module_param should be 444, since modifying this after the module is
>>>loaded won't do anything.  Also, it might be nice to print something
>>>different in the "force" case.
>>>
>>>
>>
>>How about printing nothing like it used too?
>>
>>See attached.
>>
>>
>This is fine with me.  I debated between printing nothing and a "You
>have chosen to override ..." print but didn't come out with any
>opinion.  I'm easy either way.
>
>The indenting in the "if (!force_function_unhide && !(stat8 & (1 <<
5)))
>{" clause is kind of strange 

I know I'm going to regret this but:  What's strange about it?

> and you have some spaces instead of tabs
>right after that (where stat8 is set).

Damn.  I pasted using the x-windows middle mouse button :(  I'll redo
the patch when I get back to my workstation tomorrow.

--mgross

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 21:24 Gross, Mark
  2006-04-25 22:39 ` Corey Minyard
  0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 21:24 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
	Ong, Soo Keong, Wang, Zhenyu Z

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]



>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 12:55 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Shouldn't you provide a way (kernel command line) to override this
check
>if the function can be safely unhidden?
>

Done. 
Signed-off-by: Mark Gross


--mgross

[-- Attachment #2: e752x_edac.patch --]
[-- Type: application/octet-stream, Size: 1744 bytes --]

diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c	2006-04-25 13:59:45.000000000 -0700
@@ -29,6 +29,7 @@
 
 #include "edac_mc.h"
 
+static int force_function_unhide = 0;
 
 #ifndef PCI_DEVICE_ID_INTEL_7520_0
 #define PCI_DEVICE_ID_INTEL_7520_0      0x3590
@@ -755,10 +756,19 @@
 	debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
 	debugf0("Starting Probe1\n");
 
-	/* enable device 0 function 1 */
+	/* check to see if device 0 function 1 is enbaled if it isn't we
+	 * assume the BIOS has reserved it for a reason and is expecting
+	 * exclusive access, we take care to not violate that assumption and
+	 * fail the probe. */
 	pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
-	stat8 |= (1 << 5);
-	pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+	if (!(stat8 & (1 << 5))) {
+		printk(KERN_INFO "contact your bios vendor to see if the " 
+		"E752x error registers can be safely un-hidden\n");
+		if (!force_function_unhide)
+			goto fail;
+	}
+        stat8 |= (1 << 5);
+        pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
 
 	/* need to find out the number of channels */
 	pci_read_config_dword(pdev, E752X_DRC, &drc);
@@ -1069,3 +1079,8 @@
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
 MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0644);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 20:22 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 20:22 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
	Ong, Soo Keong, Wang, Zhenyu Z

Something like a force-unhide parameter?

That makes sense, I'll take a stab at it now.

--mgross

>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 12:55 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Shouldn't you provide a way (kernel command line) to override this
check
>if the function can be safely unhidden?
>
>-Corey
>
>Gross, Mark wrote:
>
>>
>>
>>Patch to work around this problem is attached.
>>
>>Signed-off-by: Mark Gross
>>
>>--mgross
>>
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 18:19 Gross, Mark
  2006-04-25 19:55 ` Corey Minyard
  0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 18:19 UTC (permalink / raw)
  To: Gross, Mark, Alan Cox
  Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
	Wang, Zhenyu Z

[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]



>-----Original Message-----
>From: Gross, Mark
>Sent: Monday, April 24, 2006 11:14 AM
>To: 'Alan Cox'
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>
>
>>-----Original Message-----
>>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>>Sent: Monday, April 24, 2006 10:50 AM
>>To: Gross, Mark
>>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>>Soo Keong; Wang, Zhenyu Z
>>Subject: RE: Problems with EDAC coexisting with BIOS
>>
>>On Llu, 2006-04-24 at 08:57 -0700, Gross, Mark wrote:
>>> I think what I'm saying is pretty clear and I don't think it is
related
>>> to whatever workarounds where done earlier.
>>
>>Ok. I was concerned as I seem to remember an earlier errata fix
enabled
>>the memory controller temporarily to do a workaround on one bridge. We
>>hit this because it unconditionally disabled it afterwards and Intel
>>sent fixes for RHEL4. I don't believe the workaround in question is in
>>the current tree as it was fixed elsewhere.
>>
>>Just worried that if that is the case an SMI the wrong moment might
fail
>>to apply the workaround.
>>
>>
>>> >Why did Intel bother implementing this functionality and then
screwing
>>> >it up so that OS vendors can't use it ? It seems so bogus.
>>> >
>>>
>>> It was just a screw up not to have identified this issue sooner.
>>
>>Ok. So the intention was that the OS should also be able to access
this
>>material.
>>
>
>The E752x Si is made to allow access to the device / Function.
However;
>when it's integrated onto a MoBo with BIOS there can be implementations
>where we get into this coordination issue.
>
>>> >At the very least we should print a warning advising the user that
the
>>> >BIOS is incompatible and to ask the BIOS vendor for an update so
that
>>> >they can enable error detection and management support.
>>>
>>> I would place the warning in the probe or init code.
>>
>>Agreed, and then bale out. Customer pressure should do the rest if the
>>BIOS needs updating, or ACPI or similar need to grow a 'shared' API
for
>>this so the BIOS and OS can co-operate.
>>
>
>Yes and yes.
>
>I'm having trouble getting the dev0:fun1 hidden by bios test into the
>e752x_init code.  It seems to be a shame having to fail the probe1 and
>leave the driver loaded in memory.  Are there any recommendations on a
good
>way to do this?
>


Patch to work around this problem is attached.

Signed-off-by: Mark Gross

--mgross

[-- Attachment #2: edac.patch --]
[-- Type: application/octet-stream, Size: 1097 bytes --]

===================================================================
RCS file: /opt/CVS/linux/drivers/edac/e752x_edac.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 e752x_edac.c
--- drivers/edac/e752x_edac.c	21 Apr 2006 02:58:32 -0000	1.1.1.1
+++ drivers/edac/e752x_edac.c	25 Apr 2006 13:26:33 -0000
@@ -755,10 +755,16 @@
 	debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
 	debugf0("Starting Probe1\n");
 
-	/* enable device 0 function 1 */
+	/* check to see if device 0 function 1 is enbaled if it isn't we assume
+	 * the BIOS has reserved it for a reason and is expecting exclusive
+	 * access, we take care to not violate that assumption and fail the
+	 * probe. */
 	pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
-	stat8 |= (1 << 5);
-	pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+	if (!(stat8 & (1 << 5))) {
+		printk(KERN_WARNING "contact your bios vendor to see if the " 
+		"E752x error registers can be safely un-hidden\n");
+		goto fail;
+	}
 
 	/* need to find out the number of channels */
 	pci_read_config_dword(pdev, E752X_DRC, &drc);

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 18:14 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-24 18:14 UTC (permalink / raw)
  To: Alan Cox
  Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
	Wang, Zhenyu Z



>-----Original Message-----
>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>Sent: Monday, April 24, 2006 10:50 AM
>To: Gross, Mark
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>On Llu, 2006-04-24 at 08:57 -0700, Gross, Mark wrote:
>> I think what I'm saying is pretty clear and I don't think it is
related
>> to whatever workarounds where done earlier.
>
>Ok. I was concerned as I seem to remember an earlier errata fix enabled
>the memory controller temporarily to do a workaround on one bridge. We
>hit this because it unconditionally disabled it afterwards and Intel
>sent fixes for RHEL4. I don't believe the workaround in question is in
>the current tree as it was fixed elsewhere.
>
>Just worried that if that is the case an SMI the wrong moment might
fail
>to apply the workaround.
>
>
>> >Why did Intel bother implementing this functionality and then
screwing
>> >it up so that OS vendors can't use it ? It seems so bogus.
>> >
>>
>> It was just a screw up not to have identified this issue sooner.
>
>Ok. So the intention was that the OS should also be able to access this
>material.
>

The E752x Si is made to allow access to the device / Function.  However;
when it's integrated onto a MoBo with BIOS there can be implementations
where we get into this coordination issue.

>> >At the very least we should print a warning advising the user that
the
>> >BIOS is incompatible and to ask the BIOS vendor for an update so
that
>> >they can enable error detection and management support.
>>
>> I would place the warning in the probe or init code.
>
>Agreed, and then bale out. Customer pressure should do the rest if the
>BIOS needs updating, or ACPI or similar need to grow a 'shared' API for
>this so the BIOS and OS can co-operate.
>

Yes and yes. 

I'm having trouble getting the dev0:fun1 hidden by bios test into the
e752x_init code.  It seems to be a shame having to fail the probe1 and
leave the driver loaded in memory.  Are there any recommendations on a
good way to do this?


--mgross

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 15:57 Gross, Mark
  2006-04-24 17:08 ` Eric W. Biederman
  2006-04-24 17:49 ` Alan Cox
  0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-24 15:57 UTC (permalink / raw)
  To: Alan Cox
  Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
	Wang, Zhenyu Z

[-- Attachment #1: Type: text/plain, Size: 3595 bytes --]



>-----Original Message-----
>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>Sent: Monday, April 24, 2006 6:19 AM
>To: Gross, Mark
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>On Gwe, 2006-04-21 at 09:01 -0700, Gross, Mark wrote:
>> 1) The default AMI BIOS behavior on SMI is to check the chipset error
>> registers (Dev0:Fun1) and re-hide them.
>
>The words "bad design" come to mind (followed by a large number of more
>accurate phrases that are inappropriate for a public list)
>

Yes, EDAC should not have existed without first working more closely
with the BIOS folks.  It's too bad we (Intel) didn't catch this.  The
BIOS folks where not in the loop with the driver folks making
contributions to EDAC previously.

>> Basically if device 0 : function 1 is hidden by the platform at boot
>> time un-hiding and using the device and function is a risky thing to
do,
>
>Intel provided patches that do exactly this for some of the chip
>workarounds. Are you saying the Intel chip work around also needs
>fixing ?
>

I think what I'm saying is pretty clear and I don't think it is related
to whatever workarounds where done earlier.

>> The driver should never get loaded by default or automatically.  If
the
>> user knows enough about there BIOS to trust that the SMI behavior
will
>> coexist with the driver then its OK to load otherwise using this
driver
>> is not a safe thing to do.
>
>So Intel and/or the BIOS vendors also forgot to put in any kind of
>indicator ? How do they expect end users to know this, or OS vendors ?
>Is there a technote that covers this mess ?
>

I don't know of any technote.  It took me working with Soo Keong for a
few weeks to chase this issue down to the level I have.  The short
answer is that the BIOS assumes the payload OS would not be fighting it
for hidden device access and the EDAC driver violates this assumption.  

>> I think the best thing to do is to have the driver error out in its
init
>> or probe code if the dev0:fun1 is hidden at boot time.
>>
>> Comments?
>
>Why did Intel bother implementing this functionality and then screwing
>it up so that OS vendors can't use it ? It seems so bogus.
>

It was just a screw up not to have identified this issue sooner.  

>At the very least we should print a warning advising the user that the
>BIOS is incompatible and to ask the BIOS vendor for an update so that
>they can enable error detection and management support.

I would place the warning in the probe or init code.

Attached is a test patch I'm testing now.  I don't like it, but it seems
to be working so far.  It basically fails the probe call leaving the
driver loaded.  I'm going to move the test to e752x_int so the driver
fails at init this am at restart my tests.


>
>Is only the AMI BIOS this braindamaged, should we just blacklist AMI
>bioses in EDAC or is this shared Intel supplied code that may be found
>in other vendors systems.
>

Unknown.  Also the BOIS teams for various platforms can modify the base
AMI functionality.  I know that at least one Intel e7520 based system
with AMI based bios seems to not expose this issue.  The point is that
without working out a handshake between the OS and the platform / BIOS
for this type of thing, loading EDAC without a patch like mine is
equivalent to playing Russian roulette.  You can't know which platform /
bios will blow up on you if you load the thing.

--mgross


[-- Attachment #2: edac_2_6_16.patch --]
[-- Type: application/octet-stream, Size: 1065 bytes --]

? scripts/kconfig/lxdialog/lxdialog
Index: drivers/edac/e752x_edac.c
===================================================================
RCS file: /opt/CVS/linux/drivers/edac/e752x_edac.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 e752x_edac.c
--- drivers/edac/e752x_edac.c	21 Apr 2006 02:58:32 -0000	1.1.1.1
+++ drivers/edac/e752x_edac.c	23 Apr 2006 23:42:19 -0000
@@ -755,10 +755,13 @@
 	debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
 	debugf0("Starting Probe1\n");
 
-	/* enable device 0 function 1 */
+	/* check to see if device 0 function 1 is enbaled if it isn't we assume
+	 * the BIOS has reserved it for a reason like, it has SMI's setup
+	 * expecting exclusive access, and we sould take care to not violate
+	 * that assumption and fail the probe. */
 	pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
-	stat8 |= (1 << 5);
-	pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+	if (!(stat8 & (1 << 5))
+		goto fail;
 
 	/* need to find out the number of channels */
 	pci_read_config_dword(pdev, E752X_DRC, &drc);

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 14:32 Ong, Soo Keong
  0 siblings, 0 replies; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 14:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
	Wang, Zhenyu Z

There are 4 occasions (that I aware of) during the OS times that could
possibly trigger SMI

1. Before OS USB driver disconnect SMI from USB controller
2. ACPI driver call software SMI once
3. SpeedStep using ACPI interface
4. Error (connected to SMI) happens

I know there are always ways to improve BIOS. Allow me to look at the OS
first so that OS can be robust enough to handle different
implementations.

1 and 2 will be gone early in booting. 3 could be handled appropriately
by OS because OS knows when SpeedStep ACPI interface is called and is
done. 4 will be gone after error interrupt re-connection done by OS
after phase 1 and 2.

I am not the one who prefer error handling stay in BIOS, but many people
have different opinion from me.

I logout now.

-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk] 
Sent: Monday, April 24, 2006 10:30 PM
To: Ong, Soo Keong
Cc: Gross, Mark; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
Steven; Wang, Zhenyu Z
Subject: RE: Problems with EDAC coexisting with BIOS

On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
> To me, periodical is not a good design for error handling, it wastes
> transaction bandwidth that should be used for other more productive
> purposes.

The periodical choice is mostly down to the brain damaged choice of NMI
as the viable alternative, which is as good as 'not usable'

> It is more appropriate to have single handler, either OS or BIOS.

Agreed but then the BIOS must provide that service to the OS reliably
and efficiently so that users can build that service into their system
wide error management and control processes.

> In general, the errors handler connect the errors to the interrupt or
> interrutps. The handler should undhide (if it s hideable) the error
> controller and read its registers upon interrupt, then carry out
> appropriate actions to handle the erros.

Actually I am dubious that the error handler can do that. If the OS
kernel just issued the first half of a config cycle what occurs when the
SMI tries to play with PCI config space ? 

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 14:15 Ong, Soo Keong
  2006-04-24 14:29 ` Alan Cox
  0 siblings, 1 reply; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 14:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
	Wang, Zhenyu Z

To me, periodical is not a good design for error handling, it wastes
transaction bandwidth that should be used for other more productive
purposes.

It is more appropriate to have single handler, either OS or BIOS.

In general, the errors handler connect the errors to the interrupt or
interrutps. The handler should undhide (if it s hideable) the error
controller and read its registers upon interrupt, then carry out
appropriate actions to handle the erros.


-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk] 
Sent: Monday, April 24, 2006 10:14 PM
To: Ong, Soo Keong
Cc: Gross, Mark; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
Steven; Wang, Zhenyu Z
Subject: RE: Problems with EDAC coexisting with BIOS

On Llu, 2006-04-24 at 21:59 +0800, Ong, Soo Keong wrote:
> Alan,
> 
> Have you understood how the errors are connected to the interrupts
> (either SMI, NMI, SCI)?

I believe so

> When does EDAC read the error status? Periodical or upon interrpt by
> errors?

Periodically currently. The sf development tree has some code for
handling the NMI case but this isn't actually useful because an NMI can
occur half way through a PCI config transaction.

Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 13:59 Ong, Soo Keong
  2006-04-24 14:13 ` Alan Cox
  0 siblings, 1 reply; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 13:59 UTC (permalink / raw)
  To: Alan Cox, Gross, Mark
  Cc: bluesmoke-devel, LKML, Carbonari, Steven, Wang, Zhenyu Z

Alan,

Have you understood how the errors are connected to the interrupts
(either SMI, NMI, SCI)?

When does EDAC read the error status? Periodical or upon interrpt by
errors?

Soo Keong


-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk] 
Sent: Monday, April 24, 2006 9:19 PM
To: Gross, Mark
Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven; Ong,
Soo Keong; Wang, Zhenyu Z
Subject: Re: Problems with EDAC coexisting with BIOS

On Gwe, 2006-04-21 at 09:01 -0700, Gross, Mark wrote:
> 1) The default AMI BIOS behavior on SMI is to check the chipset error
> registers (Dev0:Fun1) and re-hide them.

The words "bad design" come to mind (followed by a large number of more
accurate phrases that are inappropriate for a public list)

> Basically if device 0 : function 1 is hidden by the platform at boot
> time un-hiding and using the device and function is a risky thing to
do,

Intel provided patches that do exactly this for some of the chip
workarounds. Are you saying the Intel chip work around also needs
fixing ?

> The driver should never get loaded by default or automatically.  If
the
> user knows enough about there BIOS to trust that the SMI behavior will
> coexist with the driver then its OK to load otherwise using this
driver
> is not a safe thing to do.

So Intel and/or the BIOS vendors also forgot to put in any kind of
indicator ? How do they expect end users to know this, or OS vendors ?
Is there a technote that covers this mess ?

> I think the best thing to do is to have the driver error out in its
init
> or probe code if the dev0:fun1 is hidden at boot time.
> 
> Comments?

Why did Intel bother implementing this functionality and then screwing
it up so that OS vendors can't use it ? It seems so bogus.

At the very least we should print a warning advising the user that the
BIOS is incompatible and to ask the BIOS vendor for an update so that
they can enable error detection and management support. 

Is only the AMI BIOS this braindamaged, should we just blacklist AMI
bioses in EDAC or is this shared Intel supplied code that may be found
in other vendors systems.

Alan

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-23  1:44 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-23  1:44 UTC (permalink / raw)
  To: Tim Small
  Cc: Doug Thompson, Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
	bluesmoke-devel, linux-kernel



>-----Original Message-----
>From: Tim Small [mailto:tim@buttersideup.com]
>Sent: Saturday, April 22, 2006 11:32 AM
>To: Gross, Mark
>Cc: Doug Thompson; Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z;
>bluesmoke-devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>You can never predict when a SMI will bubble through the system.  Even
>>if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
>>do you deal with the race between the BIOS SMI based handling and the
>>driver?  Who will end up reading (and clearing) the error registers
>>first?  There is no good way to share today.
>>
>>
>You could (at least from memory, on certain chipsets) modify the error
>reporting registers so that an SMI is no longer generated as a result
of
>MC ECC errors.  True, this doesn't fix many of the other problems
>related to this issue, but would be useful in a "modprobe xyz_edac
>force_unhide_MC_PCI=1" case.

You can get SMI's for more than just ECC events, suppressing the ECC
SMI's won't save you from the other SMI events that could happen.  

We need to work something out with the bios guys to do this right.
Today we don't have a good way of coordinating between the payload OS
and the BIOS for this type of platform level stuff.

>
>Closed-source BIOSes eh?  Who needs em ;-p.
No comment.

--mgross

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 22:36 Doug Thompson
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 22:36 UTC (permalink / raw)
  To: mark.gross
  Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
	Doug Thompson, linux-kernel

> >
> >Does that fix cover the issue as you have discovered it? Or is there
> >more, or another fix you see?
> 
> No, you should return an error from e752x_probe1 if bit5 is cleared
> already.  To do anything else you are assuming that you are privileged
> to have more knowledge about what the BIOS is doing than you should.
> 
> I'll send the patch to e752x_edac.c this weekend.
> 
> --mgross
> 

great. I don't have immediate access to a jarrell mobo at the moment and
definately no access to a manged one.

thanks

doug t




^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 22:20 Gross, Mark
  2006-04-22 18:31 ` Tim Small
  0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 22:20 UTC (permalink / raw)
  To: Doug Thompson
  Cc: Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
	bluesmoke-devel, linux-kernel



>-----Original Message-----
>From: Doug Thompson [mailto:dthompson@lnxi.com]
>Sent: Friday, April 21, 2006 2:43 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>devel@lists.sourceforge.net; Doug Thompson;
linux-kernel@vger.kernel.org
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark"  wrote:
>>
>> >-----Original Message-----
>> >From: Doug Thompson [mailto:dthompson@lnxi.com]
>> >Sent: Friday, April 21, 2006 1:57 PM
>> >To: Gross, Mark
>> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>> >devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>> >Subject: Re: Problems with EDAC coexisting with BIOS
>> >
>> >Mark thanks for the informaton on this.
>> >
>> >Now the e752x_edac.c driver contains no direct calls to panic within
>> >itself. The edac_mc.c 'core' piece does, but only calls that if an
UE
>> >error is found and panic on UE is enabled. (The other is a PCI
parity
>> >panic, but doesn't effect this path). It might be possible that
since
>> >the hidden register was now hidden, the retrieval function returns
some
>> >garbage which falsely triggers the panic by the core.
>>
>> The problem is that once the BIOS SMI re-hides the device reading
from
>> the chipset error registers returns 0xFFFFFFFF.
>
>Ok so the panic is caused by the driver because the UE bit is a one
>(somewhere in the 32-bit field), and then informs the CORE driver of
the
>false positive UE event, which does the panic trigger. I follow the
>logic now.
>
>The question then comes up: how to determine if the system has the
>hidden register feature.

For the E752x chipsets, if the Dev0:Fun1 is hidden after the BIOS hands
control to the OS, I think the driver should simply honor this and not
un-hide it.

The test is to check if bit 5 of the DEVPRES1 resister is set.  If set
the device is available, otherwise its hidden.  (see section 3.5.41 of
e7520 specification
ftp://download.intel.com/design/chipsets/datashts/30300602.pdf )

>
>You mentioned that the probe should be refactored to look for this all
>ones return value as such an indicator, is that correct?

Yes, basically we should consider removing the un-hide code in
e752x_probe1, and handle the case where the dev0:fun1 is left hidden by
the BIOS as an error condition at driver probe time.

>
>But there is a window that the register is unhidden, the probe runs,
>detects the hardware it is looking at and registers itself. Then later
>the window closes, the register is hidden again and the panic can
>reoccurr as above. Thus, a check needs to be made in the poll operation
>for the all ones return value, notify the log of this situtation,
>shutdown future polling of this hardware and basicly become a no-op
poll
>operation.

You can never predict when a SMI will bubble through the system.  Even
if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
do you deal with the race between the BIOS SMI based handling and the
driver?  Who will end up reading (and clearing) the error registers
first?  There is no good way to share today.


>
>Does that fix cover the issue as you have discovered it? Or is there
>more, or another fix you see?

No, you should return an error from e752x_probe1 if bit5 is cleared
already.  To do anything else you are assuming that you are privileged
to have more knowledge about what the BIOS is doing than you should.

I'll send the patch to e752x_edac.c this weekend.

--mgross

>
>doug t
>
>
>
>>
>> >
>> >I would like to see the panic output and stack trace to see where
that
>> >panic came from. It might have come from the PCI device access
subsytem
>> >when it was trying to access the now hidden register.
>> >
>> >can you post that panice information?
>>
>> Yes, but the edac_mc.c call to panic doesn't drop any data to the
>> console.
>> The following is the output from an instrumentation we did to the
EDAC
>> that went into RHEL4-U3 :
>>
>> INIT: version 2.85 booting
>>                 Welcome to Red Hat Enterprise Linux AS
>>                 Press 'I' to enter interactive startup.
>> Starting udev:  [  OK  ]
>> Initializing hardware...  storage network audioIn probe1, before
write
>> E752X_DEVPRES1 = 0x10
>> In probe1, after write
>> E752X_DEVPRES1 = 0x30
>>
>> Snip .....
>>
>> Configuring kernel parameters:  [  OK  ]
>> Setting clock  (localtime): Sun Mar 12 00:52:17 PST 2006 [  OK  ]
>> Setting hostname localhost.localdomain:  [  OK  ]
>> Checking root filesystem
>> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
>> /: clean, 365954/8830976 files, 2357630/17657443 blocks
>> [  OK  ]
>> Remounting root filesystem in read-write mode:  [  OK  ]
>> No devices found
>> No Software RAID disks
>> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
>> ment: E752X_DEVPRES1 = 0x10
>> Kernel panic - not syncing: MC0: Uncorrected Error
>>
>> that's all you get, no stack trace.
>>
>> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call
to
>> the following debug function we instrumented into the EDAC driver
code.
>>
>> static void print_error_info(struct pci_dev *pdev)
>> {
>>   u8 stat8;
>>   pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
>>   printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
>> }
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 21:42 Doug Thompson
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 21:42 UTC (permalink / raw)
  To: mark.gross
  Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
	Doug Thompson, linux-kernel

On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark"  wrote:
> 
> >-----Original Message-----
> >From: Doug Thompson [mailto:dthompson@lnxi.com]
> >Sent: Friday, April 21, 2006 1:57 PM
> >To: Gross, Mark
> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
> >devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
> >Subject: Re: Problems with EDAC coexisting with BIOS
> >
> >Mark thanks for the informaton on this.
> >
> >Now the e752x_edac.c driver contains no direct calls to panic within
> >itself. The edac_mc.c 'core' piece does, but only calls that if an UE
> >error is found and panic on UE is enabled. (The other is a PCI parity
> >panic, but doesn't effect this path). It might be possible that since
> >the hidden register was now hidden, the retrieval function returns some
> >garbage which falsely triggers the panic by the core.
> 
> The problem is that once the BIOS SMI re-hides the device reading from
> the chipset error registers returns 0xFFFFFFFF.

Ok so the panic is caused by the driver because the UE bit is a one
(somewhere in the 32-bit field), and then informs the CORE driver of the
false positive UE event, which does the panic trigger. I follow the
logic now.

The question then comes up: how to determine if the system has the
hidden register feature.

You mentioned that the probe should be refactored to look for this all
ones return value as such an indicator, is that correct?

But there is a window that the register is unhidden, the probe runs,
detects the hardware it is looking at and registers itself. Then later
the window closes, the register is hidden again and the panic can
reoccurr as above. Thus, a check needs to be made in the poll operation
for the all ones return value, notify the log of this situtation,
shutdown future polling of this hardware and basicly become a no-op poll
operation.

Does that fix cover the issue as you have discovered it? Or is there
more, or another fix you see?

doug t



> 
> >
> >I would like to see the panic output and stack trace to see where that
> >panic came from. It might have come from the PCI device access subsytem
> >when it was trying to access the now hidden register.
> >
> >can you post that panice information?
> 
> Yes, but the edac_mc.c call to panic doesn't drop any data to the
> console.
> The following is the output from an instrumentation we did to the EDAC
> that went into RHEL4-U3 :
> 
> INIT: version 2.85 booting
>                 Welcome to Red Hat Enterprise Linux AS
>                 Press 'I' to enter interactive startup.
> Starting udev:  [  OK  ]
> Initializing hardware...  storage network audioIn probe1, before write
> E752X_DEVPRES1 = 0x10
> In probe1, after write
> E752X_DEVPRES1 = 0x30
> 
> Snip .....
> 
> Configuring kernel parameters:  [  OK  ]
> Setting clock  (localtime): Sun Mar 12 00:52:17 PST 2006 [  OK  ]
> Setting hostname localhost.localdomain:  [  OK  ]
> Checking root filesystem
> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
> /: clean, 365954/8830976 files, 2357630/17657443 blocks
> [  OK  ]
> Remounting root filesystem in read-write mode:  [  OK  ]
> No devices found
> No Software RAID disks
> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
> ment: E752X_DEVPRES1 = 0x10
> Kernel panic - not syncing: MC0: Uncorrected Error
> 
> that's all you get, no stack trace.
> 
> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call to
> the following debug function we instrumented into the EDAC driver code.
> 
> static void print_error_info(struct pci_dev *pdev) 
> {
>   u8 stat8;
>   pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
>   printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
> }
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 21:32 Gross, Mark
  0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 21:32 UTC (permalink / raw)
  To: Doug Thompson
  Cc: Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
	bluesmoke-devel, linux-kernel



>-----Original Message-----
>From: Doug Thompson [mailto:dthompson@lnxi.com]
>Sent: Friday, April 21, 2006 1:57 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Mark thanks for the informaton on this.
>
>Now the e752x_edac.c driver contains no direct calls to panic within
>itself. The edac_mc.c 'core' piece does, but only calls that if an UE
>error is found and panic on UE is enabled. (The other is a PCI parity
>panic, but doesn't effect this path). It might be possible that since
>the hidden register was now hidden, the retrieval function returns some
>garbage which falsely triggers the panic by the core.

The problem is that once the BIOS SMI re-hides the device reading from
the chipset error registers returns 0xFFFFFFFF.

>
>I would like to see the panic output and stack trace to see where that
>panic came from. It might have come from the PCI device access subsytem
>when it was trying to access the now hidden register.
>
>can you post that panice information?

Yes, but the edac_mc.c call to panic doesn't drop any data to the
console.
The following is the output from an instrumentation we did to the EDAC
that went into RHEL4-U3 :

INIT: version 2.85 booting
                Welcome to Red Hat Enterprise Linux AS
                Press 'I' to enter interactive startup.
Starting udev:  [  OK  ]
Initializing hardware...  storage network audioIn probe1, before write
E752X_DEVPRES1 = 0x10
In probe1, after write
E752X_DEVPRES1 = 0x30

Snip .....

Configuring kernel parameters:  [  OK  ]
Setting clock  (localtime): Sun Mar 12 00:52:17 PST 2006 [  OK  ]
Setting hostname localhost.localdomain:  [  OK  ]
Checking root filesystem
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
/: clean, 365954/8830976 files, 2357630/17657443 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
No devices found
No Software RAID disks
Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
ment: E752X_DEVPRES1 = 0x10
Kernel panic - not syncing: MC0: Uncorrected Error

that's all you get, no stack trace.

The EX752C_DEVPRES1 = ... debug statements came from sprinkling call to
the following debug function we instrumented into the EDAC driver code.

static void print_error_info(struct pci_dev *pdev) 
{
  u8 stat8;
  pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
  printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
}

--mgross
>
>thanks
>
>doug thompson
>
>
>On Fri, 2006-04-21 at 16:01 +0000, "Gross, Mark"  wrote:
>> I'm sorry to have to bring up these issues after a fare amount of
good
>> work, and I don't know how this problem managed to get by for as long
as
>> it has, but there are some issues with the EDAC and the BIOS for
managed
>> computer systems.
>>
>> Managed computers are systems with automatic ECC logging to a System
>> Event Log or SEL.  They typically have an out of band Board
Management
>> Controller aka BMC or IPMC that runs out of band WRT the OS payload.
>>
>> The issues found with the EDAC driver are:
>> 1) The default AMI BIOS behavior on SMI is to check the chipset error
>> registers (Dev0:Fun1) and re-hide them.
>> 2) If you are lucky enough to have BIOS code that doesn't re-hide
>> Dev0:Fun1; then when EDAC is loaded there is a race condition between
>> the platform BIOS and the driver to gain access to these registers.
>> 3) If the platform BIOS does the ECC logging out of band WRT the
payload
>> OS, there is no good way for the driver to know at load time.
>>
>> We discovered these problems when testing with one of the later
RHEL4-U3
>> RC's.  The EDAC driver called panic when the device 0 Function 1 of
the
>> E7250 was re-hidden by the legacy USB SMI that when off between the
load
>> of the EDAC driver and the USB host driver.  Loading the EDAC driver
for
>> many AMI bios's is a panic land mine waiting go off.  Unless the OS
>> knows that it can trust the BIOS to not re-hide those chipset
registers
>> using this driver is not a safe thing to do.
>>
>> Basically if device 0 : function 1 is hidden by the platform at boot
>> time un-hiding and using the device and function is a risky thing to
do,
>> as there is likely a good reason for it to have been hidden in the
first
>> place.  If the BIOS thinks that it owns some registers then the OS
>> should not use them without great care.
>>
>> It is possible that the driver could be modified to check for
re-hiding
>> of the DEV0:FUN1, but this will be racey WRT SMI processing.  At
least
>> it shouldn't panic.
>>
>> The driver should never get loaded by default or automatically.  If
the
>> user knows enough about there BIOS to trust that the SMI behavior
will
>> coexist with the driver then its OK to load otherwise using this
driver
>> is not a safe thing to do.
>>
>> I think the best thing to do is to have the driver error out in its
init
>> or probe code if the dev0:fun1 is hidden at boot time.
>>
>> Comments?
>>
>> Next steps?
>>
>> Do you want me to send a patch implementing graceful error handling
at
>> driver init time so it doesn't load if DEV0:FUN1 is hidden?
>>
>> --mgross
>> Intel Open Source Technology Center
>> (503) 677-4628
>> (503)-712-6227
>> ms: JF1-235
>> 2111 NW 25th Ave
>> Hillsboro, OR 97124
>>
>>
>>
>> -------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services,
security?
>> Get stuff done quickly with pre-integrated technology to make your
job
>easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache
>Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642
>> _______________________________________________
>> bluesmoke-devel mailing list
>> bluesmoke-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/bluesmoke-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-04-21 20:57 Doug Thompson
  0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 20:57 UTC (permalink / raw)
  To: mark.gross
  Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
	linux-kernel

Mark thanks for the informaton on this.

Now the e752x_edac.c driver contains no direct calls to panic within
itself. The edac_mc.c 'core' piece does, but only calls that if an UE
error is found and panic on UE is enabled. (The other is a PCI parity
panic, but doesn't effect this path). It might be possible that since
the hidden register was now hidden, the retrieval function returns some
garbage which falsely triggers the panic by the core.

I would like to see the panic output and stack trace to see where that
panic came from. It might have come from the PCI device access subsytem
when it was trying to access the now hidden register.

can you post that panice information?

thanks

doug thompson


On Fri, 2006-04-21 at 16:01 +0000, "Gross, Mark"  wrote:
> I'm sorry to have to bring up these issues after a fare amount of good
> work, and I don't know how this problem managed to get by for as long as
> it has, but there are some issues with the EDAC and the BIOS for managed
> computer systems.
> 
> Managed computers are systems with automatic ECC logging to a System
> Event Log or SEL.  They typically have an out of band Board Management
> Controller aka BMC or IPMC that runs out of band WRT the OS payload.
> 
> The issues found with the EDAC driver are:
> 1) The default AMI BIOS behavior on SMI is to check the chipset error
> registers (Dev0:Fun1) and re-hide them.
> 2) If you are lucky enough to have BIOS code that doesn't re-hide
> Dev0:Fun1; then when EDAC is loaded there is a race condition between
> the platform BIOS and the driver to gain access to these registers. 
> 3) If the platform BIOS does the ECC logging out of band WRT the payload
> OS, there is no good way for the driver to know at load time.  
> 
> We discovered these problems when testing with one of the later RHEL4-U3
> RC's.  The EDAC driver called panic when the device 0 Function 1 of the
> E7250 was re-hidden by the legacy USB SMI that when off between the load
> of the EDAC driver and the USB host driver.  Loading the EDAC driver for
> many AMI bios's is a panic land mine waiting go off.  Unless the OS
> knows that it can trust the BIOS to not re-hide those chipset registers
> using this driver is not a safe thing to do.
> 
> Basically if device 0 : function 1 is hidden by the platform at boot
> time un-hiding and using the device and function is a risky thing to do,
> as there is likely a good reason for it to have been hidden in the first
> place.  If the BIOS thinks that it owns some registers then the OS
> should not use them without great care.
> 
> It is possible that the driver could be modified to check for re-hiding
> of the DEV0:FUN1, but this will be racey WRT SMI processing.  At least
> it shouldn't panic.  
> 
> The driver should never get loaded by default or automatically.  If the
> user knows enough about there BIOS to trust that the SMI behavior will
> coexist with the driver then its OK to load otherwise using this driver
> is not a safe thing to do.
> 
> I think the best thing to do is to have the driver error out in its init
> or probe code if the dev0:fun1 is hidden at boot time.
> 
> Comments?
> 
> Next steps?
> 
> Do you want me to send a patch implementing graceful error handling at
> driver init time so it doesn't load if DEV0:FUN1 is hidden?
> 
> --mgross
> Intel Open Source Technology Center
> (503) 677-4628
> (503)-712-6227
> ms: JF1-235
> 2111 NW 25th Ave
> Hillsboro, OR 97124
>  
> 
> 
> -------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642
> _______________________________________________
> bluesmoke-devel mailing list
> bluesmoke-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bluesmoke-devel


^ permalink raw reply	[flat|nested] 44+ messages in thread
* Problems with EDAC coexisting with BIOS
@ 2006-04-21 16:01 Gross, Mark
  2006-04-21 21:13 ` Ingo Oeser
  2006-04-24 13:19 ` Alan Cox
  0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 16:01 UTC (permalink / raw)
  To: bluesmoke-devel, LKML
  Cc: Carbonari, Steven, Ong, Soo Keong, Wang, Zhenyu Z, Gross, Mark

I'm sorry to have to bring up these issues after a fare amount of good
work, and I don't know how this problem managed to get by for as long as
it has, but there are some issues with the EDAC and the BIOS for managed
computer systems.

Managed computers are systems with automatic ECC logging to a System
Event Log or SEL.  They typically have an out of band Board Management
Controller aka BMC or IPMC that runs out of band WRT the OS payload.

The issues found with the EDAC driver are:
1) The default AMI BIOS behavior on SMI is to check the chipset error
registers (Dev0:Fun1) and re-hide them.
2) If you are lucky enough to have BIOS code that doesn't re-hide
Dev0:Fun1; then when EDAC is loaded there is a race condition between
the platform BIOS and the driver to gain access to these registers. 
3) If the platform BIOS does the ECC logging out of band WRT the payload
OS, there is no good way for the driver to know at load time.  

We discovered these problems when testing with one of the later RHEL4-U3
RC's.  The EDAC driver called panic when the device 0 Function 1 of the
E7250 was re-hidden by the legacy USB SMI that when off between the load
of the EDAC driver and the USB host driver.  Loading the EDAC driver for
many AMI bios's is a panic land mine waiting go off.  Unless the OS
knows that it can trust the BIOS to not re-hide those chipset registers
using this driver is not a safe thing to do.

Basically if device 0 : function 1 is hidden by the platform at boot
time un-hiding and using the device and function is a risky thing to do,
as there is likely a good reason for it to have been hidden in the first
place.  If the BIOS thinks that it owns some registers then the OS
should not use them without great care.

It is possible that the driver could be modified to check for re-hiding
of the DEV0:FUN1, but this will be racey WRT SMI processing.  At least
it shouldn't panic.  

The driver should never get loaded by default or automatically.  If the
user knows enough about there BIOS to trust that the SMI behavior will
coexist with the driver then its OK to load otherwise using this driver
is not a safe thing to do.

I think the best thing to do is to have the driver error out in its init
or probe code if the dev0:fun1 is hidden at boot time.

Comments?

Next steps?

Do you want me to send a patch implementing graceful error handling at
driver init time so it doesn't load if DEV0:FUN1 is hidden?

--mgross
Intel Open Source Technology Center
(503) 677-4628
(503)-712-6227
ms: JF1-235
2111 NW 25th Ave
Hillsboro, OR 97124
 

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2006-05-04 16:44 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-25 23:25 Problems with EDAC coexisting with BIOS Gross, Mark
2006-04-26  2:19 ` Corey Minyard
2006-04-26  2:34 ` Randy.Dunlap
2006-04-26 18:26   ` mark gross
2006-04-26 18:38     ` Randy.Dunlap
2006-04-26 19:39       ` mark gross
2006-04-26 20:17         ` Randy.Dunlap
  -- strict thread matches above, loose matches on Subject: below --
2006-05-04 16:44 David Peterson
2006-05-03 23:06 David Peterson
2006-05-03 22:22 Doug Thompson
2006-05-03 21:39 Doug Thompson
2006-05-03 20:49 Gross, Mark
2006-04-26  3:24 Gross, Mark
2006-04-26  3:19 Gross, Mark
2006-04-25 21:24 Gross, Mark
2006-04-25 22:39 ` Corey Minyard
2006-04-25 20:22 Gross, Mark
2006-04-25 18:19 Gross, Mark
2006-04-25 19:55 ` Corey Minyard
2006-04-24 18:14 Gross, Mark
2006-04-24 15:57 Gross, Mark
2006-04-24 17:08 ` Eric W. Biederman
2006-04-24 17:49 ` Alan Cox
2006-04-24 14:32 Ong, Soo Keong
2006-04-24 14:15 Ong, Soo Keong
2006-04-24 14:29 ` Alan Cox
2006-05-03 20:25   ` Tim Small
2006-05-03 20:37     ` thockin
2006-05-04  9:45       ` Tim Small
2006-05-03 21:44     ` Alan Cox
2006-05-04  9:02       ` Tim Small
2006-04-24 13:59 Ong, Soo Keong
2006-04-24 14:13 ` Alan Cox
2006-04-23  1:44 Gross, Mark
2006-04-21 22:36 Doug Thompson
2006-04-21 22:20 Gross, Mark
2006-04-22 18:31 ` Tim Small
2006-04-21 21:42 Doug Thompson
2006-04-21 21:32 Gross, Mark
2006-04-21 20:57 Doug Thompson
2006-04-21 16:01 Gross, Mark
2006-04-21 21:13 ` Ingo Oeser
2006-04-24 13:19 ` Alan Cox
2006-04-24 17:38   ` Doug Thompson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox