* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 23:25 Gross, Mark
2006-04-26 2:19 ` Corey Minyard
2006-04-26 2:34 ` Randy.Dunlap
0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 23:25 UTC (permalink / raw)
To: Corey Minyard
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
[-- Attachment #1: Type: text/plain, Size: 765 bytes --]
>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 3:40 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>>
>>>
>>>
>>
>>Done.
>>Signed-off-by: Mark Gross
>>
>>
>>--mgross
>>
>>
>Yes, this is what I was talking about. I believe the mode of
>module_param should be 444, since modifying this after the module is
>loaded won't do anything. Also, it might be nice to print something
>different in the "force" case.
How about printing nothing like it used too?
See attached.
Signed-off-by: Mark Gross
--mgross
[-- Attachment #2: e752x_edac.patch --]
[-- Type: application/octet-stream, Size: 1739 bytes --]
diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-25 16:19:25.000000000 -0700
@@ -29,6 +29,7 @@
#include "edac_mc.h"
+static int force_function_unhide = 0;
#ifndef PCI_DEVICE_ID_INTEL_7520_0
#define PCI_DEVICE_ID_INTEL_7520_0 0x3590
@@ -755,10 +756,18 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enbaled if it isn't we
+ * assume the BIOS has reserved it for a reason and is expecting
+ * exclusive access, we take care to not violate that assumption and
+ * fail the probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
- stat8 |= (1 << 5);
- pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+ if (!force_function_unhide && !(stat8 & (1 << 5))) {
+ printk(KERN_INFO "contact your bios vendor to see if the "
+ "E752x error registers can be safely un-hidden\n");
+ goto fail;
+ }
+ stat8 |= (1 << 5);
+ pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
/* need to find out the number of channels */
pci_read_config_dword(pdev, E752X_DRC, &drc);
@@ -1069,3 +1078,8 @@
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0444);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-25 23:25 Problems with EDAC coexisting with BIOS Gross, Mark
@ 2006-04-26 2:19 ` Corey Minyard
2006-04-26 2:34 ` Randy.Dunlap
1 sibling, 0 replies; 44+ messages in thread
From: Corey Minyard @ 2006-04-26 2:19 UTC (permalink / raw)
To: Gross, Mark
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
Gross, Mark wrote:
>>Yes, this is what I was talking about. I believe the mode of
>>module_param should be 444, since modifying this after the module is
>>loaded won't do anything. Also, it might be nice to print something
>>different in the "force" case.
>>
>>
>
>How about printing nothing like it used too?
>
>See attached.
>
>
This is fine with me. I debated between printing nothing and a "You
have chosen to override ..." print but didn't come out with any
opinion. I'm easy either way.
The indenting in the "if (!force_function_unhide && !(stat8 & (1 << 5)))
{" clause is kind of strange and you have some spaces instead of tabs
right after that (where stat8 is set).
Thanks again,
-Corey
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-25 23:25 Problems with EDAC coexisting with BIOS Gross, Mark
2006-04-26 2:19 ` Corey Minyard
@ 2006-04-26 2:34 ` Randy.Dunlap
2006-04-26 18:26 ` mark gross
1 sibling, 1 reply; 44+ messages in thread
From: Randy.Dunlap @ 2006-04-26 2:34 UTC (permalink / raw)
To: Gross, Mark
Cc: minyard, alan, bluesmoke-devel, linux-kernel, steven.carbonari,
soo.keong.ong, zhenyu.z.wang
On Tue, 25 Apr 2006 16:25:59 -0700 Gross, Mark wrote:
>
> How about printing nothing like it used too?
>
> See attached.
>
> Signed-off-by: Mark Gross
Hi Mark,
This small patch will need some cleaning up:
1. Signed-off-by: Mark Gross <mark.gross@intel.com>
2. Try not to use attachments. If you must, then make the attachment
type be text instead of octet-stream.
3. No need to init to 0:
+static int force_function_unhide = 0;
4. Typos:
+ /* check to see if device 0 function 1 is enbaled if it isn't we
enabled; it it isn't, we
+ * assume the BIOS has reserved it for a reason and is expecting
+ * exclusive access, we take care to not violate that assumption and
not to violate
+ * fail the probe. */
5. indentation, typos, and at least one trailing space:
+ if (!force_function_unhide && !(stat8 & (1 << 5))) {
+ printk(KERN_INFO "contact your bios vendor to see if the "
Contact your BIOS
+ "E752x error registers can be safely un-hidden\n");
^indent one more tab
---
~Randy
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-26 2:34 ` Randy.Dunlap
@ 2006-04-26 18:26 ` mark gross
2006-04-26 18:38 ` Randy.Dunlap
0 siblings, 1 reply; 44+ messages in thread
From: mark gross @ 2006-04-26 18:26 UTC (permalink / raw)
To: Randy.Dunlap
Cc: Gross, Mark, minyard, alan, bluesmoke-devel, linux-kernel,
steven.carbonari, soo.keong.ong, zhenyu.z.wang
On Tue, Apr 25, 2006 at 07:34:05PM -0700, Randy.Dunlap wrote:
> On Tue, 25 Apr 2006 16:25:59 -0700 Gross, Mark wrote:
>
> >
> > How about printing nothing like it used too?
> >
> > See attached.
> >
> > Signed-off-by: Mark Gross
>
> Hi Mark,
>
> This small patch will need some cleaning up:
>
> 1. Signed-off-by: Mark Gross <mark.gross@intel.com>
>
> 2. Try not to use attachments. If you must, then make the attachment
> type be text instead of octet-stream.
>
> 3. No need to init to 0:
> +static int force_function_unhide = 0;
>
> 4. Typos:
>
> + /* check to see if device 0 function 1 is enbaled if it isn't we
> enabled; it it isn't, we
> + * assume the BIOS has reserved it for a reason and is expecting
> + * exclusive access, we take care to not violate that assumption and
> not to violate
> + * fail the probe. */
>
> 5. indentation, typos, and at least one trailing space:
>
> + if (!force_function_unhide && !(stat8 & (1 << 5))) {
> + printk(KERN_INFO "contact your bios vendor to see if the "
> Contact your BIOS
> + "E752x error registers can be safely un-hidden\n");
> ^indent one more tab
>
Signed-off-by: Mark Gross <mark.gross@intel.com>
Trying mutt.
thanks,
diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-26 08:36:25.000000000 -0700
@@ -29,6 +29,7 @@
#include "edac_mc.h"
+static int force_function_unhide;
#ifndef PCI_DEVICE_ID_INTEL_7520_0
#define PCI_DEVICE_ID_INTEL_7520_0 0x3590
@@ -755,8 +756,16 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enbaled; if it isn't, we
+ * assume the BIOS has reserved it for a reason and is expecting
+ * exclusive access, we take care not to violate that assumption and
+ * fail the probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
+ if (!force_function_unhide && !(stat8 & (1 << 5))) {
+ printk(KERN_INFO "Contact your BIOS vendor to see if the "
+ "E752x error registers can be safely un-hidden\n");
+ goto fail;
+ }
stat8 |= (1 << 5);
pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
@@ -1069,3 +1078,8 @@
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0444);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-26 18:26 ` mark gross
@ 2006-04-26 18:38 ` Randy.Dunlap
2006-04-26 19:39 ` mark gross
0 siblings, 1 reply; 44+ messages in thread
From: Randy.Dunlap @ 2006-04-26 18:38 UTC (permalink / raw)
To: mgross
Cc: mark.gross, minyard, alan, bluesmoke-devel, linux-kernel,
steven.carbonari, soo.keong.ong, zhenyu.z.wang
On Wed, 26 Apr 2006 11:26:38 -0700 mark gross wrote:
> Signed-off-by: Mark Gross <mark.gross@intel.com>
>
>
> Trying mutt.
Yes, much better. Almost there.
> diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
> --- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
> +++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-26 08:36:25.000000000 -0700
> @@ -755,8 +756,16 @@
> debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
> debugf0("Starting Probe1\n");
>
> - /* enable device 0 function 1 */
> + /* check to see if device 0 function 1 is enbaled; if it isn't, we
"enabled"
> + * assume the BIOS has reserved it for a reason and is expecting
> + * exclusive access, we take care not to violate that assumption and
> + * fail the probe. */
> pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
> + if (!force_function_unhide && !(stat8 & (1 << 5))) {
> + printk(KERN_INFO "Contact your BIOS vendor to see if the "
> + "E752x error registers can be safely un-hidden\n");
> + goto fail;
The goto is indented too much...
> + }
Thanks.
---
~Randy
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-26 18:38 ` Randy.Dunlap
@ 2006-04-26 19:39 ` mark gross
2006-04-26 20:17 ` Randy.Dunlap
0 siblings, 1 reply; 44+ messages in thread
From: mark gross @ 2006-04-26 19:39 UTC (permalink / raw)
To: Randy.Dunlap
Cc: mark.gross, minyard, alan, bluesmoke-devel, linux-kernel,
steven.carbonari, soo.keong.ong, zhenyu.z.wang
On Wed, Apr 26, 2006 at 11:38:35AM -0700, Randy.Dunlap wrote:
> On Wed, 26 Apr 2006 11:26:38 -0700 mark gross wrote:
>
> > Signed-off-by: Mark Gross <mark.gross@intel.com>
> >
> >
> > Trying mutt.
>
> Yes, much better. Almost there.
>
>
> > diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
> > --- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
> > +++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-26 08:36:25.000000000 -0700
> > @@ -755,8 +756,16 @@
> > debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
> > debugf0("Starting Probe1\n");
> >
> > - /* enable device 0 function 1 */
> > + /* check to see if device 0 function 1 is enbaled; if it isn't, we
>
> "enabled"
>
> > + * assume the BIOS has reserved it for a reason and is expecting
> > + * exclusive access, we take care not to violate that assumption and
> > + * fail the probe. */
> > pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
> > + if (!force_function_unhide && !(stat8 & (1 << 5))) {
> > + printk(KERN_INFO "Contact your BIOS vendor to see if the "
> > + "E752x error registers can be safely un-hidden\n");
> > + goto fail;
>
> The goto is indented too much...
>
> > + }
>
Sorry about that. trying again.
Signed-off-by: Mark Gross <mark.gross@intel.com>
diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-26 12:31:51.000000000 -0700
@@ -29,6 +29,7 @@
#include "edac_mc.h"
+static int force_function_unhide;
#ifndef PCI_DEVICE_ID_INTEL_7520_0
#define PCI_DEVICE_ID_INTEL_7520_0 0x3590
@@ -755,8 +756,16 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enabled; if it isn't, we
+ * assume the BIOS has reserved it for a reason and is expecting
+ * exclusive access, we take care not to violate that assumption and
+ * fail the probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
+ if (!force_function_unhide && !(stat8 & (1 << 5))) {
+ printk(KERN_INFO "Contact your BIOS vendor to see if the "
+ "E752x error registers can be safely un-hidden\n");
+ goto fail;
+ }
stat8 |= (1 << 5);
pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
@@ -1069,3 +1078,8 @@
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0444);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-04 16:44 David Peterson
0 siblings, 0 replies; 44+ messages in thread
From: David Peterson @ 2006-05-04 16:44 UTC (permalink / raw)
To: Tim Small
Cc: Alan Cox, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
> My first thought was to schedule a tasklet as part of the ECC-
specific
> NMI handling, or are there any gotchas with doing this from within
> an NMI handler?
Unfortunately yes. __tasklet_schedule() uses interrupt disabling
as a synchronization mechanism. This presents a problem since by
definition, NMIs can occur even when interrupts are disabled.
However the NMI handling code in bluesmoke has a mechanism similar to
tasklets that is intended specifically for use by NMI handlers.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-03 23:06 David Peterson
0 siblings, 0 replies; 44+ messages in thread
From: David Peterson @ 2006-05-03 23:06 UTC (permalink / raw)
To: Alan Cox
Cc: Tim Small, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
> On Wed, 2006-05-03, Alan Cox wrote:
> > something with NMI-signalled errors, I was wondering what the
problems
> > with using NMI-signalled ECC errors were?
>
> The big problem with NMI is that it can occur *during* a PCI
> configuration sequence (ie during pci_config_* functions). That
> means we can't safely do some I/O, especially configuration space
I/O in an NMI
> handler. At best we could set a flag and catch it afterwards.
This is roughly what I did in the NMI handling code I wrote for
bluesmoke. If my memory is correct, I believe there's a spinlock
that pci_read_config_dword() and friends acquire. Basically
I did the following type of thing:
/* Using spin_trylock() below avoids deadlock in the case where
* the code interrupted by the NMI is holding the lock.
*/
if (likely(spin_trylock(&lock))) {
We got the lock. Go ahead and access PCI config space
(and then drop the lock)...
} else {
This case should be rare in practice. Defer the access to
PCI config space outside NMI context (I wrote a little API
that facilitates doing this kind of stuff in a manner that
avoids deadlocks and race conditions associated with NMI
handlers).
}
For anyone who is interested, the code may be obtained by going to
http://sourceforge.net/projects/bluesmoke/ and downloading the latest
version of the 'bluesmoke' package. It's experimental stuff that
hasn't seen much testing. Also it's not quite functional as-is
because a little piece of code still needs to be added somewhere that
enables SERR#.
I'm not necessarily advocating that NMI-driven error handling should
go into the mainstream kernel. The expected benefits would have to
be weighed against the extra complexity that the code introduces.
However the parts of the code that handle the basic NMI-related
synchronization issues are abstracted into a relatively clean
architecture-independent API that may be useful in other places where
NMIs (or similar types of exceptions/interrupts on other platforms)
are used. I posted this code to LKML a while ago but it has since
been improved. Perhaps having this type of code (or something
similar) in just one location would be an improvement over reinventing
the wheel in a number of places.
There is an issue regarding NMI-driven error handling that may be a
substantial pain to deal with on x86: When NMI occurs, we can't be
sure whether the NMI is from the watchdog, or due to a hardware error.
Therefore we must check the hardware for errors on each NMI. This is
no better than polling (at whatever frequency the watchdog runs at).
The above problem can be worked around as follows (although I'm
not advocating that this be done in Linux): Modify local_irq_disable()
and local_irq_enable() so that instead of using cli/sti machine
instructions they adjust the interrupt priority level (controlled by
the local APIC on x86 processors) as follows:
local_irq_disable()
{
Set interrupt priority level to (MAX_PRIORITY - 1).
In other words, all interrupts are masked out except
those whose priority is MAX_PRIORITY.
}
local_irq_enable()
{
Set interrupt priority level to 0 (i.e. all interrupts
are enabled).
}
Then the watchdog may be implemented as a normal interrupt with
priority MAX_PRIORITY, and all other interrupts may be given lower
priorities. NMI would then only be asserted for genuine hardware
errors (PCI parity errors, ECC memory errors, etc.). This is
probably more trouble than it's worth. However I think it's doable
in principle.
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
@ 2006-05-03 22:22 Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-05-03 22:22 UTC (permalink / raw)
To: tim, alan
Cc: mark.gross, soo.keong.ong, steven.carbonari, zhenyu.z.wang,
bluesmoke-devel, linux-kernel
>>> Alan Cox <alan@lxorguk.ukuu.org.uk> 05/03/06 3:44 PM >>>
On Mer, 2006-05-03 at 21:25 +0100, Tim Small wrote:
>> something with NMI-signalled errors, I was wondering what the
problems
>> with using NMI-signalled ECC errors were?
>The big problem with NMI is that it can occur *during* a PCI
>configuration sequence (ie during pci_config_* functions). That means
we
>can't safely do some I/O, especially configuration space I/O in an NMI
>handler. At best we could set a flag and catch it afterwards.
Dave Peterson did submit a set of patches to provide a deferred calling
mechanism triggered by the NMI handler. I am in the process of reading
up on that patch. It currently is in the bluesmoke project.
doug t
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-05-03 21:39 Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-05-03 21:39 UTC (permalink / raw)
To: mark.gross
Cc: tim, soo.keong.ong, steven.carbonari, zhenyu.z.wang,
bluesmoke-devel, alan, linux-kernel
On Wed, 2006-05-03 at 20:49 +0000, "Gross, Mark" wrote:
>
> >-----Original Message-----
> >From: Tim Small [mailto:tim@buttersideup.com]
> >Sent: Wednesday, May 03, 2006 1:25 PM
> >To: Alan Cox
> >Cc: Ong, Soo Keong; Gross, Mark;
bluesmoke-devel@lists.sourceforge.net;
> >LKML; Carbonari, Steven; Wang, Zhenyu Z
> >Subject: Re: Problems with EDAC coexisting with BIOS
> >
> >Alan Cox wrote:
> >
> >>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
> >>
> >>
> >>>To me, periodical is not a good design for error handling, it
wastes
> >>>transaction bandwidth that should be used for other more productive
> >>>purposes.
> >>>
> >>>
> >>
> >>The periodical choice is mostly down to the brain damaged choice of
> NMI
> >>as the viable alternative, which is as good as 'not usable'
> >>
> >>
> >Hi,
> >
> >As I believe that the majority of the bluesmoke/EDAC developers are
> >(were) operating under the assumption that it would be possible to do
> >something with NMI-signalled errors, I was wondering what the
problems
> >with using NMI-signalled ECC errors were?
Dave Peterson created the NMI patch for bluesmoke that is currently
operational in the bluesmoke project. When bluesmoke was submitted, the
advice we received was to defer until later to apply that NMI patch
after we had gotten EDAC into the kernel and more stable.
Since Dave has stepped down, that leaves me to refactor that patch for
EDAC release. It is on my TODO list, but not yet risen to the top.
Once concern I had yet to research is: Do all motherboards route the
NMI signal properly? Is that a function of the BIOS? etc, etc
>
> Soo Keong or Steve, can you answer this one?
>
> >
> >Are there some systems states in which an incoming NMI throws a
spanner
> >in to the works in an unrecoverable way? If this is the case, is it
so
> >on all x86/x86-64 systems, or just a subset, and is there no way to
> >implement some sort of top half / bottom half style NMI handler
> >cleanly? As I am certainly not an x86 architecture expert, I would
> >appreciate any input from the resident gurus ;o).
>
> I don't think NMI's are so much the problem as those blasted SMI's.
And
> SMI's are not for sharing :(
>
> These problems are BIOS specific, Your Mileage Will Vary from one bios
> to the next.
>
> >
> >Quickly returning to the original problem - I know this isn't a
proper
> >API by any stretch of the imagination, and would require changes to
> >existing BIOSs, but the EDAC module could reprogram the chipset
> >error-signalling registers, so that an ECC error no longer triggers
an
> >SMI. The BIOS SMI handler could then read the signalling registers,
> and
> >leave the ECC registers well alone if ECC errors are not set to
> generate
> >an SMI.
>
> It's not the SMI from ECC events that are the problem. It's the BIOS
> assuming no one else would want to use those registers on Dev0:Fun1,
and
> the fact that you still end up with a race between the OS and the BIOS
> SMI to handle the events.
exactly and for that reason we might have a tar baby to wash and clean
up somehow
doug t
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-05-03 20:49 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-05-03 20:49 UTC (permalink / raw)
To: Tim Small, Alan Cox
Cc: Ong, Soo Keong, bluesmoke-devel, LKML, Carbonari, Steven,
Wang, Zhenyu Z
>-----Original Message-----
>From: Tim Small [mailto:tim@buttersideup.com]
>Sent: Wednesday, May 03, 2006 1:25 PM
>To: Alan Cox
>Cc: Ong, Soo Keong; Gross, Mark; bluesmoke-devel@lists.sourceforge.net;
>LKML; Carbonari, Steven; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Alan Cox wrote:
>
>>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
>>
>>
>>>To me, periodical is not a good design for error handling, it wastes
>>>transaction bandwidth that should be used for other more productive
>>>purposes.
>>>
>>>
>>
>>The periodical choice is mostly down to the brain damaged choice of
NMI
>>as the viable alternative, which is as good as 'not usable'
>>
>>
>Hi,
>
>As I believe that the majority of the bluesmoke/EDAC developers are
>(were) operating under the assumption that it would be possible to do
>something with NMI-signalled errors, I was wondering what the problems
>with using NMI-signalled ECC errors were?
Soo Keong or Steve, can you answer this one?
>
>Are there some systems states in which an incoming NMI throws a spanner
>in to the works in an unrecoverable way? If this is the case, is it so
>on all x86/x86-64 systems, or just a subset, and is there no way to
>implement some sort of top half / bottom half style NMI handler
>cleanly? As I am certainly not an x86 architecture expert, I would
>appreciate any input from the resident gurus ;o).
I don't think NMI's are so much the problem as those blasted SMI's. And
SMI's are not for sharing :(
These problems are BIOS specific, Your Mileage Will Vary from one bios
to the next.
>
>Quickly returning to the original problem - I know this isn't a proper
>API by any stretch of the imagination, and would require changes to
>existing BIOSs, but the EDAC module could reprogram the chipset
>error-signalling registers, so that an ECC error no longer triggers an
>SMI. The BIOS SMI handler could then read the signalling registers,
and
>leave the ECC registers well alone if ECC errors are not set to
generate
>an SMI.
It's not the SMI from ECC events that are the problem. It's the BIOS
assuming no one else would want to use those registers on Dev0:Fun1, and
the fact that you still end up with a race between the OS and the BIOS
SMI to handle the events.
>
>Cheers,
>
>Tim.
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-26 3:24 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-26 3:24 UTC (permalink / raw)
To: Randy.Dunlap
Cc: minyard, alan, bluesmoke-devel, linux-kernel, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
>-----Original Message-----
>From: Randy.Dunlap [mailto:rdunlap@xenotime.net]
>Sent: Tuesday, April 25, 2006 7:34 PM
>To: Gross, Mark
>Cc: minyard@acm.org; alan@lxorguk.ukuu.org.uk; bluesmoke-
>devel@lists.sourceforge.net; linux-kernel@vger.kernel.org; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>On Tue, 25 Apr 2006 16:25:59 -0700 Gross, Mark wrote:
>
>>
>> How about printing nothing like it used too?
>>
>> See attached.
>>
>> Signed-off-by: Mark Gross
>
>Hi Mark,
>
>This small patch will need some cleaning up:
>
>1. Signed-off-by: Mark Gross <mark.gross@intel.com>
>
>2. Try not to use attachments. If you must, then make the attachment
> type be text instead of octet-stream.
>
>3. No need to init to 0:
>+static int force_function_unhide = 0;
>
>4. Typos:
>
>+ /* check to see if device 0 function 1 is enbaled if it isn't we
> enabled; it it isn't,
we
>+ * assume the BIOS has reserved it for a reason and is expecting
>+ * exclusive access, we take care to not violate that assumption
and
> not to violate
>+ * fail the probe. */
>
>5. indentation, typos, and at least one trailing space:
>
>+ if (!force_function_unhide && !(stat8 & (1 << 5))) {
>+ printk(KERN_INFO "contact your bios vendor to see if the
"
> Contact your BIOS
>+ "E752x error registers can be safely un-hidden\n");
> ^indent one more tab
>
>---
Thanks,
I'll get these tomorrow AM.
--mgross
^ permalink raw reply [flat|nested] 44+ messages in thread* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-26 3:19 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-26 3:19 UTC (permalink / raw)
To: Corey Minyard
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 7:19 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>>Yes, this is what I was talking about. I believe the mode of
>>>module_param should be 444, since modifying this after the module is
>>>loaded won't do anything. Also, it might be nice to print something
>>>different in the "force" case.
>>>
>>>
>>
>>How about printing nothing like it used too?
>>
>>See attached.
>>
>>
>This is fine with me. I debated between printing nothing and a "You
>have chosen to override ..." print but didn't come out with any
>opinion. I'm easy either way.
>
>The indenting in the "if (!force_function_unhide && !(stat8 & (1 <<
5)))
>{" clause is kind of strange
I know I'm going to regret this but: What's strange about it?
> and you have some spaces instead of tabs
>right after that (where stat8 is set).
Damn. I pasted using the x-windows middle mouse button :( I'll redo
the patch when I get back to my workstation tomorrow.
--mgross
^ permalink raw reply [flat|nested] 44+ messages in thread* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 21:24 Gross, Mark
2006-04-25 22:39 ` Corey Minyard
0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 21:24 UTC (permalink / raw)
To: Corey Minyard
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
[-- Attachment #1: Type: text/plain, Size: 482 bytes --]
>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 12:55 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Shouldn't you provide a way (kernel command line) to override this
check
>if the function can be safely unhidden?
>
Done.
Signed-off-by: Mark Gross
--mgross
[-- Attachment #2: e752x_edac.patch --]
[-- Type: application/octet-stream, Size: 1744 bytes --]
diff -urN -X linux-2.6.16/Documentation/dontdiff linux-2.6.16/drivers/edac/e752x_edac.c linux-2.6.16_edac/drivers/edac/e752x_edac.c
--- linux-2.6.16/drivers/edac/e752x_edac.c 2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16_edac/drivers/edac/e752x_edac.c 2006-04-25 13:59:45.000000000 -0700
@@ -29,6 +29,7 @@
#include "edac_mc.h"
+static int force_function_unhide = 0;
#ifndef PCI_DEVICE_ID_INTEL_7520_0
#define PCI_DEVICE_ID_INTEL_7520_0 0x3590
@@ -755,10 +756,19 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enbaled if it isn't we
+ * assume the BIOS has reserved it for a reason and is expecting
+ * exclusive access, we take care to not violate that assumption and
+ * fail the probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
- stat8 |= (1 << 5);
- pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+ if (!(stat8 & (1 << 5))) {
+ printk(KERN_INFO "contact your bios vendor to see if the "
+ "E752x error registers can be safely un-hidden\n");
+ if (!force_function_unhide)
+ goto fail;
+ }
+ stat8 |= (1 << 5);
+ pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
/* need to find out the number of channels */
pci_read_config_dword(pdev, E752X_DRC, &drc);
@@ -1069,3 +1079,8 @@
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Linux Networx (http://lnxi.com) Tom Zimmerman\n");
MODULE_DESCRIPTION("MC support for Intel e752x memory controllers");
+
+module_param(force_function_unhide, int, 0644);
+MODULE_PARM_DESC(force_function_unhide, "if BIOS sets Dev0:Fun1 up as hidden:"
+" 1=force unhide and hope BIOS doesn't fight driver for Dev0:Fun1 access");
+
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-25 21:24 Gross, Mark
@ 2006-04-25 22:39 ` Corey Minyard
0 siblings, 0 replies; 44+ messages in thread
From: Corey Minyard @ 2006-04-25 22:39 UTC (permalink / raw)
To: Gross, Mark
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
Gross, Mark wrote:
>>
>>
>>
>
>Done.
>Signed-off-by: Mark Gross
>
>
>--mgross
>
>
Yes, this is what I was talking about. I believe the mode of
module_param should be 444, since modifying this after the module is
loaded won't do anything. Also, it might be nice to print something
different in the "force" case.
Thanks,
-Corey
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 20:22 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 20:22 UTC (permalink / raw)
To: Corey Minyard
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
Something like a force-unhide parameter?
That makes sense, I'll take a stab at it now.
--mgross
>-----Original Message-----
>From: Corey Minyard [mailto:minyard@acm.org]
>Sent: Tuesday, April 25, 2006 12:55 PM
>To: Gross, Mark
>Cc: Alan Cox; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
>Steven; Ong, Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Shouldn't you provide a way (kernel command line) to override this
check
>if the function can be safely unhidden?
>
>-Corey
>
>Gross, Mark wrote:
>
>>
>>
>>Patch to work around this problem is attached.
>>
>>Signed-off-by: Mark Gross
>>
>>--mgross
>>
>>
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-25 18:19 Gross, Mark
2006-04-25 19:55 ` Corey Minyard
0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-25 18:19 UTC (permalink / raw)
To: Gross, Mark, Alan Cox
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]
>-----Original Message-----
>From: Gross, Mark
>Sent: Monday, April 24, 2006 11:14 AM
>To: 'Alan Cox'
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>
>
>>-----Original Message-----
>>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>>Sent: Monday, April 24, 2006 10:50 AM
>>To: Gross, Mark
>>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>>Soo Keong; Wang, Zhenyu Z
>>Subject: RE: Problems with EDAC coexisting with BIOS
>>
>>On Llu, 2006-04-24 at 08:57 -0700, Gross, Mark wrote:
>>> I think what I'm saying is pretty clear and I don't think it is
related
>>> to whatever workarounds where done earlier.
>>
>>Ok. I was concerned as I seem to remember an earlier errata fix
enabled
>>the memory controller temporarily to do a workaround on one bridge. We
>>hit this because it unconditionally disabled it afterwards and Intel
>>sent fixes for RHEL4. I don't believe the workaround in question is in
>>the current tree as it was fixed elsewhere.
>>
>>Just worried that if that is the case an SMI the wrong moment might
fail
>>to apply the workaround.
>>
>>
>>> >Why did Intel bother implementing this functionality and then
screwing
>>> >it up so that OS vendors can't use it ? It seems so bogus.
>>> >
>>>
>>> It was just a screw up not to have identified this issue sooner.
>>
>>Ok. So the intention was that the OS should also be able to access
this
>>material.
>>
>
>The E752x Si is made to allow access to the device / Function.
However;
>when it's integrated onto a MoBo with BIOS there can be implementations
>where we get into this coordination issue.
>
>>> >At the very least we should print a warning advising the user that
the
>>> >BIOS is incompatible and to ask the BIOS vendor for an update so
that
>>> >they can enable error detection and management support.
>>>
>>> I would place the warning in the probe or init code.
>>
>>Agreed, and then bale out. Customer pressure should do the rest if the
>>BIOS needs updating, or ACPI or similar need to grow a 'shared' API
for
>>this so the BIOS and OS can co-operate.
>>
>
>Yes and yes.
>
>I'm having trouble getting the dev0:fun1 hidden by bios test into the
>e752x_init code. It seems to be a shame having to fail the probe1 and
>leave the driver loaded in memory. Are there any recommendations on a
good
>way to do this?
>
Patch to work around this problem is attached.
Signed-off-by: Mark Gross
--mgross
[-- Attachment #2: edac.patch --]
[-- Type: application/octet-stream, Size: 1097 bytes --]
===================================================================
RCS file: /opt/CVS/linux/drivers/edac/e752x_edac.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 e752x_edac.c
--- drivers/edac/e752x_edac.c 21 Apr 2006 02:58:32 -0000 1.1.1.1
+++ drivers/edac/e752x_edac.c 25 Apr 2006 13:26:33 -0000
@@ -755,10 +755,16 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enbaled if it isn't we assume
+ * the BIOS has reserved it for a reason and is expecting exclusive
+ * access, we take care to not violate that assumption and fail the
+ * probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
- stat8 |= (1 << 5);
- pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+ if (!(stat8 & (1 << 5))) {
+ printk(KERN_WARNING "contact your bios vendor to see if the "
+ "E752x error registers can be safely un-hidden\n");
+ goto fail;
+ }
/* need to find out the number of channels */
pci_read_config_dword(pdev, E752X_DRC, &drc);
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-25 18:19 Gross, Mark
@ 2006-04-25 19:55 ` Corey Minyard
0 siblings, 0 replies; 44+ messages in thread
From: Corey Minyard @ 2006-04-25 19:55 UTC (permalink / raw)
To: Gross, Mark
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
Shouldn't you provide a way (kernel command line) to override this check
if the function can be safely unhidden?
-Corey
Gross, Mark wrote:
>
>
>Patch to work around this problem is attached.
>
>Signed-off-by: Mark Gross
>
>--mgross
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 18:14 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-24 18:14 UTC (permalink / raw)
To: Alan Cox
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
>-----Original Message-----
>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>Sent: Monday, April 24, 2006 10:50 AM
>To: Gross, Mark
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>On Llu, 2006-04-24 at 08:57 -0700, Gross, Mark wrote:
>> I think what I'm saying is pretty clear and I don't think it is
related
>> to whatever workarounds where done earlier.
>
>Ok. I was concerned as I seem to remember an earlier errata fix enabled
>the memory controller temporarily to do a workaround on one bridge. We
>hit this because it unconditionally disabled it afterwards and Intel
>sent fixes for RHEL4. I don't believe the workaround in question is in
>the current tree as it was fixed elsewhere.
>
>Just worried that if that is the case an SMI the wrong moment might
fail
>to apply the workaround.
>
>
>> >Why did Intel bother implementing this functionality and then
screwing
>> >it up so that OS vendors can't use it ? It seems so bogus.
>> >
>>
>> It was just a screw up not to have identified this issue sooner.
>
>Ok. So the intention was that the OS should also be able to access this
>material.
>
The E752x Si is made to allow access to the device / Function. However;
when it's integrated onto a MoBo with BIOS there can be implementations
where we get into this coordination issue.
>> >At the very least we should print a warning advising the user that
the
>> >BIOS is incompatible and to ask the BIOS vendor for an update so
that
>> >they can enable error detection and management support.
>>
>> I would place the warning in the probe or init code.
>
>Agreed, and then bale out. Customer pressure should do the rest if the
>BIOS needs updating, or ACPI or similar need to grow a 'shared' API for
>this so the BIOS and OS can co-operate.
>
Yes and yes.
I'm having trouble getting the dev0:fun1 hidden by bios test into the
e752x_init code. It seems to be a shame having to fail the probe1 and
leave the driver loaded in memory. Are there any recommendations on a
good way to do this?
--mgross
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 15:57 Gross, Mark
2006-04-24 17:08 ` Eric W. Biederman
2006-04-24 17:49 ` Alan Cox
0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-24 15:57 UTC (permalink / raw)
To: Alan Cox
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
[-- Attachment #1: Type: text/plain, Size: 3595 bytes --]
>-----Original Message-----
>From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
>Sent: Monday, April 24, 2006 6:19 AM
>To: Gross, Mark
>Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven;
Ong,
>Soo Keong; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>On Gwe, 2006-04-21 at 09:01 -0700, Gross, Mark wrote:
>> 1) The default AMI BIOS behavior on SMI is to check the chipset error
>> registers (Dev0:Fun1) and re-hide them.
>
>The words "bad design" come to mind (followed by a large number of more
>accurate phrases that are inappropriate for a public list)
>
Yes, EDAC should not have existed without first working more closely
with the BIOS folks. It's too bad we (Intel) didn't catch this. The
BIOS folks where not in the loop with the driver folks making
contributions to EDAC previously.
>> Basically if device 0 : function 1 is hidden by the platform at boot
>> time un-hiding and using the device and function is a risky thing to
do,
>
>Intel provided patches that do exactly this for some of the chip
>workarounds. Are you saying the Intel chip work around also needs
>fixing ?
>
I think what I'm saying is pretty clear and I don't think it is related
to whatever workarounds where done earlier.
>> The driver should never get loaded by default or automatically. If
the
>> user knows enough about there BIOS to trust that the SMI behavior
will
>> coexist with the driver then its OK to load otherwise using this
driver
>> is not a safe thing to do.
>
>So Intel and/or the BIOS vendors also forgot to put in any kind of
>indicator ? How do they expect end users to know this, or OS vendors ?
>Is there a technote that covers this mess ?
>
I don't know of any technote. It took me working with Soo Keong for a
few weeks to chase this issue down to the level I have. The short
answer is that the BIOS assumes the payload OS would not be fighting it
for hidden device access and the EDAC driver violates this assumption.
>> I think the best thing to do is to have the driver error out in its
init
>> or probe code if the dev0:fun1 is hidden at boot time.
>>
>> Comments?
>
>Why did Intel bother implementing this functionality and then screwing
>it up so that OS vendors can't use it ? It seems so bogus.
>
It was just a screw up not to have identified this issue sooner.
>At the very least we should print a warning advising the user that the
>BIOS is incompatible and to ask the BIOS vendor for an update so that
>they can enable error detection and management support.
I would place the warning in the probe or init code.
Attached is a test patch I'm testing now. I don't like it, but it seems
to be working so far. It basically fails the probe call leaving the
driver loaded. I'm going to move the test to e752x_int so the driver
fails at init this am at restart my tests.
>
>Is only the AMI BIOS this braindamaged, should we just blacklist AMI
>bioses in EDAC or is this shared Intel supplied code that may be found
>in other vendors systems.
>
Unknown. Also the BOIS teams for various platforms can modify the base
AMI functionality. I know that at least one Intel e7520 based system
with AMI based bios seems to not expose this issue. The point is that
without working out a handshake between the OS and the platform / BIOS
for this type of thing, loading EDAC without a patch like mine is
equivalent to playing Russian roulette. You can't know which platform /
bios will blow up on you if you load the thing.
--mgross
[-- Attachment #2: edac_2_6_16.patch --]
[-- Type: application/octet-stream, Size: 1065 bytes --]
? scripts/kconfig/lxdialog/lxdialog
Index: drivers/edac/e752x_edac.c
===================================================================
RCS file: /opt/CVS/linux/drivers/edac/e752x_edac.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 e752x_edac.c
--- drivers/edac/e752x_edac.c 21 Apr 2006 02:58:32 -0000 1.1.1.1
+++ drivers/edac/e752x_edac.c 23 Apr 2006 23:42:19 -0000
@@ -755,10 +755,13 @@
debugf0("MC: " __FILE__ ": %s(): mci\n", __func__);
debugf0("Starting Probe1\n");
- /* enable device 0 function 1 */
+ /* check to see if device 0 function 1 is enbaled if it isn't we assume
+ * the BIOS has reserved it for a reason like, it has SMI's setup
+ * expecting exclusive access, and we sould take care to not violate
+ * that assumption and fail the probe. */
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
- stat8 |= (1 << 5);
- pci_write_config_byte(pdev, E752X_DEVPRES1, stat8);
+ if (!(stat8 & (1 << 5))
+ goto fail;
/* need to find out the number of channels */
pci_read_config_dword(pdev, E752X_DRC, &drc);
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-24 15:57 Gross, Mark
@ 2006-04-24 17:08 ` Eric W. Biederman
2006-04-24 17:49 ` Alan Cox
1 sibling, 0 replies; 44+ messages in thread
From: Eric W. Biederman @ 2006-04-24 17:08 UTC (permalink / raw)
To: Gross, Mark
Cc: Alan Cox, bluesmoke-devel, LKML, Carbonari, Steven,
Ong, Soo Keong, Wang, Zhenyu Z
"Gross, Mark" <mark.gross@intel.com> writes:
> Yes, EDAC should not have existed without first working more closely
> with the BIOS folks. It's too bad we (Intel) didn't catch this. The
> BIOS folks where not in the loop with the driver folks making
> contributions to EDAC previously.
Isn't the real world a great place where people do things because
they work and not because they were designed to be that way?
There were BIOS folk involved but not the AMI BIOS folk. Beyond
that Intel is not known for having conversations with technical people
who want to do creative and useful things.
> I think what I'm saying is pretty clear and I don't think it is related
> to whatever workarounds where done earlier.
Alan appears to be asking because we seem to be breaking the same
set of rules.
>
> I don't know of any technote. It took me working with Soo Keong for a
> few weeks to chase this issue down to the level I have. The short
> answer is that the BIOS assumes the payload OS would not be fighting it
> for hidden device access and the EDAC driver violates this
> assumption.
If this is supposed to be an SMM mode only device why can the OS
enable it? If the BIOS prevents the OS from enabling that device
that probe routing will fail. But instead the BIOS in SMM mode is
allowing the device to be seen and probed, and then it is disabling
it. So there is not an SMM trap firing when the enable bit is set.
>>> I think the best thing to do is to have the driver error out in its
> init
>>> or probe code if the dev0:fun1 is hidden at boot time.
>>>
>>> Comments?
>>
>>Why did Intel bother implementing this functionality and then screwing
>>it up so that OS vendors can't use it ? It seems so bogus.
>>
>
> It was just a screw up not to have identified this issue sooner.
Now that the issue is identified hopefully we can work together
to successfully get useful information out of the hardware, about
it's health.
>>At the very least we should print a warning advising the user that the
>>BIOS is incompatible and to ask the BIOS vendor for an update so that
>>they can enable error detection and management support.
>
> I would place the warning in the probe or init code.
Placing a warning when we enable the device in the probe code sounds
reasonable.
> Attached is a test patch I'm testing now. I don't like it, but it seems
> to be working so far. It basically fails the probe call leaving the
> driver loaded. I'm going to move the test to e752x_int so the driver
> fails at init this am at restart my tests.
The probe will already fail if we can't actually find this device.
Simply ignoring the device if it isn't there is not enough, as there
are BIOS's that simply disable the device on Intel's recommendation
but don't seem to be doing anything interesting with it.
A check to see if we have actually enabled the device would be
reasonable. Of course the code already copes gracefully with
the situation where we enable the device and it doesn't show
up.
It is quite reasonable to make the code more careful in this
case but there is already a chance for the firmware to take
an SMI trap and not allow the OS to do something.
>>Is only the AMI BIOS this braindamaged, should we just blacklist AMI
>>bioses in EDAC or is this shared Intel supplied code that may be found
>>in other vendors systems.
>>
>
> Unknown. Also the BOIS teams for various platforms can modify the base
> AMI functionality. I know that at least one Intel e7520 based system
> with AMI based bios seems to not expose this issue. The point is that
> without working out a handshake between the OS and the platform / BIOS
> for this type of thing, loading EDAC without a patch like mine is
> equivalent to playing Russian roulette. You can't know which platform /
> bios will blow up on you if you load the thing.
Several pieces here.
- SMM mode is deeply deprecated. There is no reason to expect it
to be used on a server platform. Isn't that what them AML in ACPI is for?
Having the firmware doing anything while the system is running is
deeply troubling. Since there is not an OS/firmware handshake by
your own line of reasoning running an OS on the platform is playing
Russian roulette with the hardware.
- A check in the probe routine to see if we fail to enable the device
is reasonable. Warning if we need to enable the device is also reasonable.
Not enabling the device if we can enable the device is not reasonable.
- This code represents a huge amount of pent up demand, Intel has
had years to propose something better that people can actually use.
- The question was asked why these devices were hidden and at the time
the only answer that could be obtained was that Intel recommended
it, but no reason was given. So there was no reason to expect a
conflict with the firmware.
If the system event log actually captures all of the error events
that are reported by the hardware, so we can write an equivalent
driver by reading the SEL there may be a reasonable alternative
route. Otherwise as it appears likely the SEL filters the data
it appears to be yet another case of reducing the value of the
hardware by putting an unreliable firmware interface in front
of it.
Eric
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
2006-04-24 15:57 Gross, Mark
2006-04-24 17:08 ` Eric W. Biederman
@ 2006-04-24 17:49 ` Alan Cox
1 sibling, 0 replies; 44+ messages in thread
From: Alan Cox @ 2006-04-24 17:49 UTC (permalink / raw)
To: Gross, Mark
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
On Llu, 2006-04-24 at 08:57 -0700, Gross, Mark wrote:
> I think what I'm saying is pretty clear and I don't think it is related
> to whatever workarounds where done earlier.
Ok. I was concerned as I seem to remember an earlier errata fix enabled
the memory controller temporarily to do a workaround on one bridge. We
hit this because it unconditionally disabled it afterwards and Intel
sent fixes for RHEL4. I don't believe the workaround in question is in
the current tree as it was fixed elsewhere.
Just worried that if that is the case an SMI the wrong moment might fail
to apply the workaround.
> >Why did Intel bother implementing this functionality and then screwing
> >it up so that OS vendors can't use it ? It seems so bogus.
> >
>
> It was just a screw up not to have identified this issue sooner.
Ok. So the intention was that the OS should also be able to access this
material.
> >At the very least we should print a warning advising the user that the
> >BIOS is incompatible and to ask the BIOS vendor for an update so that
> >they can enable error detection and management support.
>
> I would place the warning in the probe or init code.
Agreed, and then bale out. Customer pressure should do the rest if the
BIOS needs updating, or ACPI or similar need to grow a 'shared' API for
this so the BIOS and OS can co-operate.
Alan
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 14:32 Ong, Soo Keong
0 siblings, 0 replies; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 14:32 UTC (permalink / raw)
To: Alan Cox
Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
Wang, Zhenyu Z
There are 4 occasions (that I aware of) during the OS times that could
possibly trigger SMI
1. Before OS USB driver disconnect SMI from USB controller
2. ACPI driver call software SMI once
3. SpeedStep using ACPI interface
4. Error (connected to SMI) happens
I know there are always ways to improve BIOS. Allow me to look at the OS
first so that OS can be robust enough to handle different
implementations.
1 and 2 will be gone early in booting. 3 could be handled appropriately
by OS because OS knows when SpeedStep ACPI interface is called and is
done. 4 will be gone after error interrupt re-connection done by OS
after phase 1 and 2.
I am not the one who prefer error handling stay in BIOS, but many people
have different opinion from me.
I logout now.
-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Monday, April 24, 2006 10:30 PM
To: Ong, Soo Keong
Cc: Gross, Mark; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
Steven; Wang, Zhenyu Z
Subject: RE: Problems with EDAC coexisting with BIOS
On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
> To me, periodical is not a good design for error handling, it wastes
> transaction bandwidth that should be used for other more productive
> purposes.
The periodical choice is mostly down to the brain damaged choice of NMI
as the viable alternative, which is as good as 'not usable'
> It is more appropriate to have single handler, either OS or BIOS.
Agreed but then the BIOS must provide that service to the OS reliably
and efficiently so that users can build that service into their system
wide error management and control processes.
> In general, the errors handler connect the errors to the interrupt or
> interrutps. The handler should undhide (if it s hideable) the error
> controller and read its registers upon interrupt, then carry out
> appropriate actions to handle the erros.
Actually I am dubious that the error handler can do that. If the OS
kernel just issued the first half of a config cycle what occurs when the
SMI tries to play with PCI config space ?
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 14:15 Ong, Soo Keong
2006-04-24 14:29 ` Alan Cox
0 siblings, 1 reply; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 14:15 UTC (permalink / raw)
To: Alan Cox
Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
Wang, Zhenyu Z
To me, periodical is not a good design for error handling, it wastes
transaction bandwidth that should be used for other more productive
purposes.
It is more appropriate to have single handler, either OS or BIOS.
In general, the errors handler connect the errors to the interrupt or
interrutps. The handler should undhide (if it s hideable) the error
controller and read its registers upon interrupt, then carry out
appropriate actions to handle the erros.
-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Monday, April 24, 2006 10:14 PM
To: Ong, Soo Keong
Cc: Gross, Mark; bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari,
Steven; Wang, Zhenyu Z
Subject: RE: Problems with EDAC coexisting with BIOS
On Llu, 2006-04-24 at 21:59 +0800, Ong, Soo Keong wrote:
> Alan,
>
> Have you understood how the errors are connected to the interrupts
> (either SMI, NMI, SCI)?
I believe so
> When does EDAC read the error status? Periodical or upon interrpt by
> errors?
Periodically currently. The sf development tree has some code for
handling the NMI case but this isn't actually useful because an NMI can
occur half way through a PCI config transaction.
Alan
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
2006-04-24 14:15 Ong, Soo Keong
@ 2006-04-24 14:29 ` Alan Cox
2006-05-03 20:25 ` Tim Small
0 siblings, 1 reply; 44+ messages in thread
From: Alan Cox @ 2006-04-24 14:29 UTC (permalink / raw)
To: Ong, Soo Keong
Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
Wang, Zhenyu Z
On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
> To me, periodical is not a good design for error handling, it wastes
> transaction bandwidth that should be used for other more productive
> purposes.
The periodical choice is mostly down to the brain damaged choice of NMI
as the viable alternative, which is as good as 'not usable'
> It is more appropriate to have single handler, either OS or BIOS.
Agreed but then the BIOS must provide that service to the OS reliably
and efficiently so that users can build that service into their system
wide error management and control processes.
> In general, the errors handler connect the errors to the interrupt or
> interrutps. The handler should undhide (if it s hideable) the error
> controller and read its registers upon interrupt, then carry out
> appropriate actions to handle the erros.
Actually I am dubious that the error handler can do that. If the OS
kernel just issued the first half of a config cycle what occurs when the
SMI tries to play with PCI config space ?
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-04-24 14:29 ` Alan Cox
@ 2006-05-03 20:25 ` Tim Small
2006-05-03 20:37 ` thockin
2006-05-03 21:44 ` Alan Cox
0 siblings, 2 replies; 44+ messages in thread
From: Tim Small @ 2006-05-03 20:25 UTC (permalink / raw)
To: Alan Cox
Cc: Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
Alan Cox wrote:
>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
>
>
>>To me, periodical is not a good design for error handling, it wastes
>>transaction bandwidth that should be used for other more productive
>>purposes.
>>
>>
>
>The periodical choice is mostly down to the brain damaged choice of NMI
>as the viable alternative, which is as good as 'not usable'
>
>
Hi,
As I believe that the majority of the bluesmoke/EDAC developers are
(were) operating under the assumption that it would be possible to do
something with NMI-signalled errors, I was wondering what the problems
with using NMI-signalled ECC errors were?
Are there some systems states in which an incoming NMI throws a spanner
in to the works in an unrecoverable way? If this is the case, is it so
on all x86/x86-64 systems, or just a subset, and is there no way to
implement some sort of top half / bottom half style NMI handler
cleanly? As I am certainly not an x86 architecture expert, I would
appreciate any input from the resident gurus ;o).
Quickly returning to the original problem - I know this isn't a proper
API by any stretch of the imagination, and would require changes to
existing BIOSs, but the EDAC module could reprogram the chipset
error-signalling registers, so that an ECC error no longer triggers an
SMI. The BIOS SMI handler could then read the signalling registers, and
leave the ECC registers well alone if ECC errors are not set to generate
an SMI.
Cheers,
Tim.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-05-03 20:25 ` Tim Small
@ 2006-05-03 20:37 ` thockin
2006-05-04 9:45 ` Tim Small
2006-05-03 21:44 ` Alan Cox
1 sibling, 1 reply; 44+ messages in thread
From: thockin @ 2006-05-03 20:37 UTC (permalink / raw)
To: Tim Small
Cc: Alan Cox, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
On Wed, May 03, 2006 at 09:25:01PM +0100, Tim Small wrote:
> existing BIOSs, but the EDAC module could reprogram the chipset
> error-signalling registers, so that an ECC error no longer triggers an
This is key, I think.
> SMI. The BIOS SMI handler could then read the signalling registers, and
> leave the ECC registers well alone if ECC errors are not set to generate
> an SMI.
The fundamental problem with SMI is that we CAN'T know what it is doing.
I've seen systems which trigger SMI from a GPIO toggled by a clock. I've
seen systems trigger SMI from a chipset-internal periodic timer. I've
seen chipsets route NMI->SMI or even MCE->SMI. If the BIOS is polling the
error status registers from a periodic SMI, we're GOING to lose data.
The big hammer - turn off SMI - is probably OK on some systems, but is not
a general solution. More and more hardware workarounds and features are
SMI based. There are some rather interesting things that can be done in
SMM, *iff* we could get the BIOS out of the way.
Tim (watching EDAC from time to time, quietly)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-05-03 20:37 ` thockin
@ 2006-05-04 9:45 ` Tim Small
0 siblings, 0 replies; 44+ messages in thread
From: Tim Small @ 2006-05-04 9:45 UTC (permalink / raw)
To: thockin
Cc: Alan Cox, Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
thockin@hockin.org wrote:
>On Wed, May 03, 2006 at 09:25:01PM +0100, Tim Small wrote:
>
>
>>existing BIOSs, but the EDAC module could reprogram the chipset
>>error-signalling registers, so that an ECC error no longer triggers an
>>
>>
>
>This is key, I think.
>
>
>
>>SMI. The BIOS SMI handler could then read the signalling registers, and
>>leave the ECC registers well alone if ECC errors are not set to generate
>>an SMI.
>>
>>
>
>The fundamental problem with SMI is that we CAN'T know what it is doing.
>I've seen systems which trigger SMI from a GPIO toggled by a clock. I've
>seen systems trigger SMI from a chipset-internal periodic timer. I've
>seen chipsets route NMI->SMI or even MCE->SMI. If the BIOS is polling the
>error status registers from a periodic SMI, we're GOING to lose data.
>
>The big hammer - turn off SMI - is probably OK on some systems, but is not
>a general solution. More and more hardware workarounds and features are
>SMI based. There are some rather interesting things that can be done in
>SMM, *iff* we could get the BIOS out of the way.
>
>
Agreed - I have had experience of a system (Intel 855GME chipset based,
AMI BIOS) which emulates the i8042 in the BIOS at SMI time. Mmm nice.
When the Linux i8042 driver can polled the (pretend) i8042, the system
spent ages in the BIOS, and general interrupt latency on the system fell
apart... Oh what a mess.
A limited solution is probably to modify the existing EDAC drivers so
that they ensure SMI generation is disabled (for the specific errors
that the EDAC drivers are designed to handle). The OS is then at least
doing the right thing, even if the BIOS isn't... This should improve
upon the current behaviour on some systems, and shouldn't (as far as I
can see) break any others. The EDAC code also probably needs to be
toughened up (at least on some chipsets) so that it doesn't fall over
when the BIOS steps on its toes.
Tim.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-05-03 20:25 ` Tim Small
2006-05-03 20:37 ` thockin
@ 2006-05-03 21:44 ` Alan Cox
2006-05-04 9:02 ` Tim Small
1 sibling, 1 reply; 44+ messages in thread
From: Alan Cox @ 2006-05-03 21:44 UTC (permalink / raw)
To: Tim Small
Cc: Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
On Mer, 2006-05-03 at 21:25 +0100, Tim Small wrote:
> something with NMI-signalled errors, I was wondering what the problems
> with using NMI-signalled ECC errors were?
The big problem with NMI is that it can occur *during* a PCI
configuration sequence (ie during pci_config_* functions). That means we
can't safely do some I/O, especially configuration space I/O in an NMI
handler. At best we could set a flag and catch it afterwards.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-05-03 21:44 ` Alan Cox
@ 2006-05-04 9:02 ` Tim Small
0 siblings, 0 replies; 44+ messages in thread
From: Tim Small @ 2006-05-04 9:02 UTC (permalink / raw)
To: Alan Cox
Cc: Ong, Soo Keong, Gross, Mark, bluesmoke-devel, LKML,
Carbonari, Steven, Wang, Zhenyu Z
Alan Cox wrote:
>On Mer, 2006-05-03 at 21:25 +0100, Tim Small wrote:
>
>
>>something with NMI-signalled errors, I was wondering what the problems
>>with using NMI-signalled ECC errors were?
>>
>>
>
>The big problem with NMI is that it can occur *during* a PCI
>configuration sequence (ie during pci_config_* functions). That means we
>can't safely do some I/O, especially configuration space I/O in an NMI
>handler. At best we could set a flag and catch it afterwards.
>
>
I was assuming this was the case - but I don't think that deferring the
work until after the NMI handler has returned is necessarily a big
disadvantage - at least as far as ECC register-status checking is
concerned - since none of the hardware that I've looked at makes any
sort of guarantee about the timeliness of ECC-error-triggered NMI
delivery anyway - so any of the really smart (and urgent) stuff that you
could potentially do as part of the ECC error handling (e.g. terminating
a process if one of their physical pages was mangled) is not possible to
do in a reliable manner anyway.
About the best thing it is possible to do is to try and arrange to take
the page(s) in which an uncorrectable error occurred out of further use
(maybe do the same for correctable errors, if the same physical page
sees repeated correctable errors), plus maybe give the option of
panicing if an uncorrectable page was in use by the kernel?
My first thought was to schedule a tasklet as part of the ECC-specific
NMI handling, or are there any gotchas with doing this from within an
NMI handler?
Cheers,
Tim.
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-24 13:59 Ong, Soo Keong
2006-04-24 14:13 ` Alan Cox
0 siblings, 1 reply; 44+ messages in thread
From: Ong, Soo Keong @ 2006-04-24 13:59 UTC (permalink / raw)
To: Alan Cox, Gross, Mark
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Wang, Zhenyu Z
Alan,
Have you understood how the errors are connected to the interrupts
(either SMI, NMI, SCI)?
When does EDAC read the error status? Periodical or upon interrpt by
errors?
Soo Keong
-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Monday, April 24, 2006 9:19 PM
To: Gross, Mark
Cc: bluesmoke-devel@lists.sourceforge.net; LKML; Carbonari, Steven; Ong,
Soo Keong; Wang, Zhenyu Z
Subject: Re: Problems with EDAC coexisting with BIOS
On Gwe, 2006-04-21 at 09:01 -0700, Gross, Mark wrote:
> 1) The default AMI BIOS behavior on SMI is to check the chipset error
> registers (Dev0:Fun1) and re-hide them.
The words "bad design" come to mind (followed by a large number of more
accurate phrases that are inappropriate for a public list)
> Basically if device 0 : function 1 is hidden by the platform at boot
> time un-hiding and using the device and function is a risky thing to
do,
Intel provided patches that do exactly this for some of the chip
workarounds. Are you saying the Intel chip work around also needs
fixing ?
> The driver should never get loaded by default or automatically. If
the
> user knows enough about there BIOS to trust that the SMI behavior will
> coexist with the driver then its OK to load otherwise using this
driver
> is not a safe thing to do.
So Intel and/or the BIOS vendors also forgot to put in any kind of
indicator ? How do they expect end users to know this, or OS vendors ?
Is there a technote that covers this mess ?
> I think the best thing to do is to have the driver error out in its
init
> or probe code if the dev0:fun1 is hidden at boot time.
>
> Comments?
Why did Intel bother implementing this functionality and then screwing
it up so that OS vendors can't use it ? It seems so bogus.
At the very least we should print a warning advising the user that the
BIOS is incompatible and to ask the BIOS vendor for an update so that
they can enable error detection and management support.
Is only the AMI BIOS this braindamaged, should we just blacklist AMI
bioses in EDAC or is this shared Intel supplied code that may be found
in other vendors systems.
Alan
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
2006-04-24 13:59 Ong, Soo Keong
@ 2006-04-24 14:13 ` Alan Cox
0 siblings, 0 replies; 44+ messages in thread
From: Alan Cox @ 2006-04-24 14:13 UTC (permalink / raw)
To: Ong, Soo Keong
Cc: Gross, Mark, bluesmoke-devel, LKML, Carbonari, Steven,
Wang, Zhenyu Z
On Llu, 2006-04-24 at 21:59 +0800, Ong, Soo Keong wrote:
> Alan,
>
> Have you understood how the errors are connected to the interrupts
> (either SMI, NMI, SCI)?
I believe so
> When does EDAC read the error status? Periodical or upon interrpt by
> errors?
Periodically currently. The sf development tree has some code for
handling the NMI case but this isn't actually useful because an NMI can
occur half way through a PCI config transaction.
Alan
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-23 1:44 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-23 1:44 UTC (permalink / raw)
To: Tim Small
Cc: Doug Thompson, Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
bluesmoke-devel, linux-kernel
>-----Original Message-----
>From: Tim Small [mailto:tim@buttersideup.com]
>Sent: Saturday, April 22, 2006 11:32 AM
>To: Gross, Mark
>Cc: Doug Thompson; Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z;
>bluesmoke-devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Gross, Mark wrote:
>
>>You can never predict when a SMI will bubble through the system. Even
>>if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
>>do you deal with the race between the BIOS SMI based handling and the
>>driver? Who will end up reading (and clearing) the error registers
>>first? There is no good way to share today.
>>
>>
>You could (at least from memory, on certain chipsets) modify the error
>reporting registers so that an SMI is no longer generated as a result
of
>MC ECC errors. True, this doesn't fix many of the other problems
>related to this issue, but would be useful in a "modprobe xyz_edac
>force_unhide_MC_PCI=1" case.
You can get SMI's for more than just ECC events, suppressing the ECC
SMI's won't save you from the other SMI events that could happen.
We need to work something out with the bios guys to do this right.
Today we don't have a good way of coordinating between the payload OS
and the BIOS for this type of platform level stuff.
>
>Closed-source BIOSes eh? Who needs em ;-p.
No comment.
--mgross
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 22:36 Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 22:36 UTC (permalink / raw)
To: mark.gross
Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
Doug Thompson, linux-kernel
> >
> >Does that fix cover the issue as you have discovered it? Or is there
> >more, or another fix you see?
>
> No, you should return an error from e752x_probe1 if bit5 is cleared
> already. To do anything else you are assuming that you are privileged
> to have more knowledge about what the BIOS is doing than you should.
>
> I'll send the patch to e752x_edac.c this weekend.
>
> --mgross
>
great. I don't have immediate access to a jarrell mobo at the moment and
definately no access to a manged one.
thanks
doug t
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 22:20 Gross, Mark
2006-04-22 18:31 ` Tim Small
0 siblings, 1 reply; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 22:20 UTC (permalink / raw)
To: Doug Thompson
Cc: Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
bluesmoke-devel, linux-kernel
>-----Original Message-----
>From: Doug Thompson [mailto:dthompson@lnxi.com]
>Sent: Friday, April 21, 2006 2:43 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>devel@lists.sourceforge.net; Doug Thompson;
linux-kernel@vger.kernel.org
>Subject: RE: Problems with EDAC coexisting with BIOS
>
>On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark" wrote:
>>
>> >-----Original Message-----
>> >From: Doug Thompson [mailto:dthompson@lnxi.com]
>> >Sent: Friday, April 21, 2006 1:57 PM
>> >To: Gross, Mark
>> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>> >devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>> >Subject: Re: Problems with EDAC coexisting with BIOS
>> >
>> >Mark thanks for the informaton on this.
>> >
>> >Now the e752x_edac.c driver contains no direct calls to panic within
>> >itself. The edac_mc.c 'core' piece does, but only calls that if an
UE
>> >error is found and panic on UE is enabled. (The other is a PCI
parity
>> >panic, but doesn't effect this path). It might be possible that
since
>> >the hidden register was now hidden, the retrieval function returns
some
>> >garbage which falsely triggers the panic by the core.
>>
>> The problem is that once the BIOS SMI re-hides the device reading
from
>> the chipset error registers returns 0xFFFFFFFF.
>
>Ok so the panic is caused by the driver because the UE bit is a one
>(somewhere in the 32-bit field), and then informs the CORE driver of
the
>false positive UE event, which does the panic trigger. I follow the
>logic now.
>
>The question then comes up: how to determine if the system has the
>hidden register feature.
For the E752x chipsets, if the Dev0:Fun1 is hidden after the BIOS hands
control to the OS, I think the driver should simply honor this and not
un-hide it.
The test is to check if bit 5 of the DEVPRES1 resister is set. If set
the device is available, otherwise its hidden. (see section 3.5.41 of
e7520 specification
ftp://download.intel.com/design/chipsets/datashts/30300602.pdf )
>
>You mentioned that the probe should be refactored to look for this all
>ones return value as such an indicator, is that correct?
Yes, basically we should consider removing the un-hide code in
e752x_probe1, and handle the case where the dev0:fun1 is left hidden by
the BIOS as an error condition at driver probe time.
>
>But there is a window that the register is unhidden, the probe runs,
>detects the hardware it is looking at and registers itself. Then later
>the window closes, the register is hidden again and the panic can
>reoccurr as above. Thus, a check needs to be made in the poll operation
>for the all ones return value, notify the log of this situtation,
>shutdown future polling of this hardware and basicly become a no-op
poll
>operation.
You can never predict when a SMI will bubble through the system. Even
if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
do you deal with the race between the BIOS SMI based handling and the
driver? Who will end up reading (and clearing) the error registers
first? There is no good way to share today.
>
>Does that fix cover the issue as you have discovered it? Or is there
>more, or another fix you see?
No, you should return an error from e752x_probe1 if bit5 is cleared
already. To do anything else you are assuming that you are privileged
to have more knowledge about what the BIOS is doing than you should.
I'll send the patch to e752x_edac.c this weekend.
--mgross
>
>doug t
>
>
>
>>
>> >
>> >I would like to see the panic output and stack trace to see where
that
>> >panic came from. It might have come from the PCI device access
subsytem
>> >when it was trying to access the now hidden register.
>> >
>> >can you post that panice information?
>>
>> Yes, but the edac_mc.c call to panic doesn't drop any data to the
>> console.
>> The following is the output from an instrumentation we did to the
EDAC
>> that went into RHEL4-U3 :
>>
>> INIT: version 2.85 booting
>> Welcome to Red Hat Enterprise Linux AS
>> Press 'I' to enter interactive startup.
>> Starting udev: [ OK ]
>> Initializing hardware... storage network audioIn probe1, before
write
>> E752X_DEVPRES1 = 0x10
>> In probe1, after write
>> E752X_DEVPRES1 = 0x30
>>
>> Snip .....
>>
>> Configuring kernel parameters: [ OK ]
>> Setting clock (localtime): Sun Mar 12 00:52:17 PST 2006 [ OK ]
>> Setting hostname localhost.localdomain: [ OK ]
>> Checking root filesystem
>> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
>> /: clean, 365954/8830976 files, 2357630/17657443 blocks
>> [ OK ]
>> Remounting root filesystem in read-write mode: [ OK ]
>> No devices found
>> No Software RAID disks
>> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
>> ment: E752X_DEVPRES1 = 0x10
>> Kernel panic - not syncing: MC0: Uncorrected Error
>>
>> that's all you get, no stack trace.
>>
>> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call
to
>> the following debug function we instrumented into the EDAC driver
code.
>>
>> static void print_error_info(struct pci_dev *pdev)
>> {
>> u8 stat8;
>> pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
>> printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
>> }
>>
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
2006-04-21 22:20 Gross, Mark
@ 2006-04-22 18:31 ` Tim Small
0 siblings, 0 replies; 44+ messages in thread
From: Tim Small @ 2006-04-22 18:31 UTC (permalink / raw)
To: Gross, Mark
Cc: Doug Thompson, Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
bluesmoke-devel, linux-kernel
Gross, Mark wrote:
>You can never predict when a SMI will bubble through the system. Even
>if you handle case where the BIOS re-hides Dev0:Fun1 and not panic how
>do you deal with the race between the BIOS SMI based handling and the
>driver? Who will end up reading (and clearing) the error registers
>first? There is no good way to share today.
>
>
You could (at least from memory, on certain chipsets) modify the error
reporting registers so that an SMI is no longer generated as a result of
MC ECC errors. True, this doesn't fix many of the other problems
related to this issue, but would be useful in a "modprobe xyz_edac
force_unhide_MC_PCI=1" case.
Closed-source BIOSes eh? Who needs em ;-p.
Tim.
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 21:42 Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 21:42 UTC (permalink / raw)
To: mark.gross
Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
Doug Thompson, linux-kernel
On Fri, 2006-04-21 at 21:32 +0000, "Gross, Mark" wrote:
>
> >-----Original Message-----
> >From: Doug Thompson [mailto:dthompson@lnxi.com]
> >Sent: Friday, April 21, 2006 1:57 PM
> >To: Gross, Mark
> >Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
> >devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
> >Subject: Re: Problems with EDAC coexisting with BIOS
> >
> >Mark thanks for the informaton on this.
> >
> >Now the e752x_edac.c driver contains no direct calls to panic within
> >itself. The edac_mc.c 'core' piece does, but only calls that if an UE
> >error is found and panic on UE is enabled. (The other is a PCI parity
> >panic, but doesn't effect this path). It might be possible that since
> >the hidden register was now hidden, the retrieval function returns some
> >garbage which falsely triggers the panic by the core.
>
> The problem is that once the BIOS SMI re-hides the device reading from
> the chipset error registers returns 0xFFFFFFFF.
Ok so the panic is caused by the driver because the UE bit is a one
(somewhere in the 32-bit field), and then informs the CORE driver of the
false positive UE event, which does the panic trigger. I follow the
logic now.
The question then comes up: how to determine if the system has the
hidden register feature.
You mentioned that the probe should be refactored to look for this all
ones return value as such an indicator, is that correct?
But there is a window that the register is unhidden, the probe runs,
detects the hardware it is looking at and registers itself. Then later
the window closes, the register is hidden again and the panic can
reoccurr as above. Thus, a check needs to be made in the poll operation
for the all ones return value, notify the log of this situtation,
shutdown future polling of this hardware and basicly become a no-op poll
operation.
Does that fix cover the issue as you have discovered it? Or is there
more, or another fix you see?
doug t
>
> >
> >I would like to see the panic output and stack trace to see where that
> >panic came from. It might have come from the PCI device access subsytem
> >when it was trying to access the now hidden register.
> >
> >can you post that panice information?
>
> Yes, but the edac_mc.c call to panic doesn't drop any data to the
> console.
> The following is the output from an instrumentation we did to the EDAC
> that went into RHEL4-U3 :
>
> INIT: version 2.85 booting
> Welcome to Red Hat Enterprise Linux AS
> Press 'I' to enter interactive startup.
> Starting udev: [ OK ]
> Initializing hardware... storage network audioIn probe1, before write
> E752X_DEVPRES1 = 0x10
> In probe1, after write
> E752X_DEVPRES1 = 0x30
>
> Snip .....
>
> Configuring kernel parameters: [ OK ]
> Setting clock (localtime): Sun Mar 12 00:52:17 PST 2006 [ OK ]
> Setting hostname localhost.localdomain: [ OK ]
> Checking root filesystem
> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
> /: clean, 365954/8830976 files, 2357630/17657443 blocks
> [ OK ]
> Remounting root filesystem in read-write mode: [ OK ]
> No devices found
> No Software RAID disks
> Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
> ment: E752X_DEVPRES1 = 0x10
> Kernel panic - not syncing: MC0: Uncorrected Error
>
> that's all you get, no stack trace.
>
> The EX752C_DEVPRES1 = ... debug statements came from sprinkling call to
> the following debug function we instrumented into the EDAC driver code.
>
> static void print_error_info(struct pci_dev *pdev)
> {
> u8 stat8;
> pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
> printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
> }
>
^ permalink raw reply [flat|nested] 44+ messages in thread* RE: Problems with EDAC coexisting with BIOS
@ 2006-04-21 21:32 Gross, Mark
0 siblings, 0 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 21:32 UTC (permalink / raw)
To: Doug Thompson
Cc: Ong, Soo Keong, Carbonari, Steven, Wang, Zhenyu Z,
bluesmoke-devel, linux-kernel
>-----Original Message-----
>From: Doug Thompson [mailto:dthompson@lnxi.com]
>Sent: Friday, April 21, 2006 1:57 PM
>To: Gross, Mark
>Cc: Ong, Soo Keong; Carbonari, Steven; Wang, Zhenyu Z; bluesmoke-
>devel@lists.sourceforge.net; linux-kernel@vger.kernel.org
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Mark thanks for the informaton on this.
>
>Now the e752x_edac.c driver contains no direct calls to panic within
>itself. The edac_mc.c 'core' piece does, but only calls that if an UE
>error is found and panic on UE is enabled. (The other is a PCI parity
>panic, but doesn't effect this path). It might be possible that since
>the hidden register was now hidden, the retrieval function returns some
>garbage which falsely triggers the panic by the core.
The problem is that once the BIOS SMI re-hides the device reading from
the chipset error registers returns 0xFFFFFFFF.
>
>I would like to see the panic output and stack trace to see where that
>panic came from. It might have come from the PCI device access subsytem
>when it was trying to access the now hidden register.
>
>can you post that panice information?
Yes, but the edac_mc.c call to panic doesn't drop any data to the
console.
The following is the output from an instrumentation we did to the EDAC
that went into RHEL4-U3 :
INIT: version 2.85 booting
Welcome to Red Hat Enterprise Linux AS
Press 'I' to enter interactive startup.
Starting udev: [ OK ]
Initializing hardware... storage network audioIn probe1, before write
E752X_DEVPRES1 = 0x10
In probe1, after write
E752X_DEVPRES1 = 0x30
Snip .....
Configuring kernel parameters: [ OK ]
Setting clock (localtime): Sun Mar 12 00:52:17 PST 2006 [ OK ]
Setting hostname localhost.localdomain: [ OK ]
Checking root filesystem
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
/: clean, 365954/8830976 files, 2357630/17657443 blocks
[ OK ]
Remounting root filesystem in read-write mode: [ OK ]
No devices found
No Software RAID disks
Setting up Logical Volume ManageE752X_DEVPRES1 = 0x10
ment: E752X_DEVPRES1 = 0x10
Kernel panic - not syncing: MC0: Uncorrected Error
that's all you get, no stack trace.
The EX752C_DEVPRES1 = ... debug statements came from sprinkling call to
the following debug function we instrumented into the EDAC driver code.
static void print_error_info(struct pci_dev *pdev)
{
u8 stat8;
pci_read_config_byte(pdev, E752X_DEVPRES1, &stat8);
printk(KERN_EMERG "E752X_DEVPRES1 = 0x%02X\n", stat8);
}
--mgross
>
>thanks
>
>doug thompson
>
>
>On Fri, 2006-04-21 at 16:01 +0000, "Gross, Mark" wrote:
>> I'm sorry to have to bring up these issues after a fare amount of
good
>> work, and I don't know how this problem managed to get by for as long
as
>> it has, but there are some issues with the EDAC and the BIOS for
managed
>> computer systems.
>>
>> Managed computers are systems with automatic ECC logging to a System
>> Event Log or SEL. They typically have an out of band Board
Management
>> Controller aka BMC or IPMC that runs out of band WRT the OS payload.
>>
>> The issues found with the EDAC driver are:
>> 1) The default AMI BIOS behavior on SMI is to check the chipset error
>> registers (Dev0:Fun1) and re-hide them.
>> 2) If you are lucky enough to have BIOS code that doesn't re-hide
>> Dev0:Fun1; then when EDAC is loaded there is a race condition between
>> the platform BIOS and the driver to gain access to these registers.
>> 3) If the platform BIOS does the ECC logging out of band WRT the
payload
>> OS, there is no good way for the driver to know at load time.
>>
>> We discovered these problems when testing with one of the later
RHEL4-U3
>> RC's. The EDAC driver called panic when the device 0 Function 1 of
the
>> E7250 was re-hidden by the legacy USB SMI that when off between the
load
>> of the EDAC driver and the USB host driver. Loading the EDAC driver
for
>> many AMI bios's is a panic land mine waiting go off. Unless the OS
>> knows that it can trust the BIOS to not re-hide those chipset
registers
>> using this driver is not a safe thing to do.
>>
>> Basically if device 0 : function 1 is hidden by the platform at boot
>> time un-hiding and using the device and function is a risky thing to
do,
>> as there is likely a good reason for it to have been hidden in the
first
>> place. If the BIOS thinks that it owns some registers then the OS
>> should not use them without great care.
>>
>> It is possible that the driver could be modified to check for
re-hiding
>> of the DEV0:FUN1, but this will be racey WRT SMI processing. At
least
>> it shouldn't panic.
>>
>> The driver should never get loaded by default or automatically. If
the
>> user knows enough about there BIOS to trust that the SMI behavior
will
>> coexist with the driver then its OK to load otherwise using this
driver
>> is not a safe thing to do.
>>
>> I think the best thing to do is to have the driver error out in its
init
>> or probe code if the dev0:fun1 is hidden at boot time.
>>
>> Comments?
>>
>> Next steps?
>>
>> Do you want me to send a patch implementing graceful error handling
at
>> driver init time so it doesn't load if DEV0:FUN1 is hidden?
>>
>> --mgross
>> Intel Open Source Technology Center
>> (503) 677-4628
>> (503)-712-6227
>> ms: JF1-235
>> 2111 NW 25th Ave
>> Hillsboro, OR 97124
>>
>>
>>
>> -------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services,
security?
>> Get stuff done quickly with pre-integrated technology to make your
job
>easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache
>Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642
>> _______________________________________________
>> bluesmoke-devel mailing list
>> bluesmoke-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/bluesmoke-devel
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: Problems with EDAC coexisting with BIOS
@ 2006-04-21 20:57 Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-21 20:57 UTC (permalink / raw)
To: mark.gross
Cc: soo.keong.ong, steven.carbonari, zhenyu.z.wang, bluesmoke-devel,
linux-kernel
Mark thanks for the informaton on this.
Now the e752x_edac.c driver contains no direct calls to panic within
itself. The edac_mc.c 'core' piece does, but only calls that if an UE
error is found and panic on UE is enabled. (The other is a PCI parity
panic, but doesn't effect this path). It might be possible that since
the hidden register was now hidden, the retrieval function returns some
garbage which falsely triggers the panic by the core.
I would like to see the panic output and stack trace to see where that
panic came from. It might have come from the PCI device access subsytem
when it was trying to access the now hidden register.
can you post that panice information?
thanks
doug thompson
On Fri, 2006-04-21 at 16:01 +0000, "Gross, Mark" wrote:
> I'm sorry to have to bring up these issues after a fare amount of good
> work, and I don't know how this problem managed to get by for as long as
> it has, but there are some issues with the EDAC and the BIOS for managed
> computer systems.
>
> Managed computers are systems with automatic ECC logging to a System
> Event Log or SEL. They typically have an out of band Board Management
> Controller aka BMC or IPMC that runs out of band WRT the OS payload.
>
> The issues found with the EDAC driver are:
> 1) The default AMI BIOS behavior on SMI is to check the chipset error
> registers (Dev0:Fun1) and re-hide them.
> 2) If you are lucky enough to have BIOS code that doesn't re-hide
> Dev0:Fun1; then when EDAC is loaded there is a race condition between
> the platform BIOS and the driver to gain access to these registers.
> 3) If the platform BIOS does the ECC logging out of band WRT the payload
> OS, there is no good way for the driver to know at load time.
>
> We discovered these problems when testing with one of the later RHEL4-U3
> RC's. The EDAC driver called panic when the device 0 Function 1 of the
> E7250 was re-hidden by the legacy USB SMI that when off between the load
> of the EDAC driver and the USB host driver. Loading the EDAC driver for
> many AMI bios's is a panic land mine waiting go off. Unless the OS
> knows that it can trust the BIOS to not re-hide those chipset registers
> using this driver is not a safe thing to do.
>
> Basically if device 0 : function 1 is hidden by the platform at boot
> time un-hiding and using the device and function is a risky thing to do,
> as there is likely a good reason for it to have been hidden in the first
> place. If the BIOS thinks that it owns some registers then the OS
> should not use them without great care.
>
> It is possible that the driver could be modified to check for re-hiding
> of the DEV0:FUN1, but this will be racey WRT SMI processing. At least
> it shouldn't panic.
>
> The driver should never get loaded by default or automatically. If the
> user knows enough about there BIOS to trust that the SMI behavior will
> coexist with the driver then its OK to load otherwise using this driver
> is not a safe thing to do.
>
> I think the best thing to do is to have the driver error out in its init
> or probe code if the dev0:fun1 is hidden at boot time.
>
> Comments?
>
> Next steps?
>
> Do you want me to send a patch implementing graceful error handling at
> driver init time so it doesn't load if DEV0:FUN1 is hidden?
>
> --mgross
> Intel Open Source Technology Center
> (503) 677-4628
> (503)-712-6227
> ms: JF1-235
> 2111 NW 25th Ave
> Hillsboro, OR 97124
>
>
>
> -------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid0709&bid&3057&dat1642
> _______________________________________________
> bluesmoke-devel mailing list
> bluesmoke-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bluesmoke-devel
^ permalink raw reply [flat|nested] 44+ messages in thread
* Problems with EDAC coexisting with BIOS
@ 2006-04-21 16:01 Gross, Mark
2006-04-21 21:13 ` Ingo Oeser
2006-04-24 13:19 ` Alan Cox
0 siblings, 2 replies; 44+ messages in thread
From: Gross, Mark @ 2006-04-21 16:01 UTC (permalink / raw)
To: bluesmoke-devel, LKML
Cc: Carbonari, Steven, Ong, Soo Keong, Wang, Zhenyu Z, Gross, Mark
I'm sorry to have to bring up these issues after a fare amount of good
work, and I don't know how this problem managed to get by for as long as
it has, but there are some issues with the EDAC and the BIOS for managed
computer systems.
Managed computers are systems with automatic ECC logging to a System
Event Log or SEL. They typically have an out of band Board Management
Controller aka BMC or IPMC that runs out of band WRT the OS payload.
The issues found with the EDAC driver are:
1) The default AMI BIOS behavior on SMI is to check the chipset error
registers (Dev0:Fun1) and re-hide them.
2) If you are lucky enough to have BIOS code that doesn't re-hide
Dev0:Fun1; then when EDAC is loaded there is a race condition between
the platform BIOS and the driver to gain access to these registers.
3) If the platform BIOS does the ECC logging out of band WRT the payload
OS, there is no good way for the driver to know at load time.
We discovered these problems when testing with one of the later RHEL4-U3
RC's. The EDAC driver called panic when the device 0 Function 1 of the
E7250 was re-hidden by the legacy USB SMI that when off between the load
of the EDAC driver and the USB host driver. Loading the EDAC driver for
many AMI bios's is a panic land mine waiting go off. Unless the OS
knows that it can trust the BIOS to not re-hide those chipset registers
using this driver is not a safe thing to do.
Basically if device 0 : function 1 is hidden by the platform at boot
time un-hiding and using the device and function is a risky thing to do,
as there is likely a good reason for it to have been hidden in the first
place. If the BIOS thinks that it owns some registers then the OS
should not use them without great care.
It is possible that the driver could be modified to check for re-hiding
of the DEV0:FUN1, but this will be racey WRT SMI processing. At least
it shouldn't panic.
The driver should never get loaded by default or automatically. If the
user knows enough about there BIOS to trust that the SMI behavior will
coexist with the driver then its OK to load otherwise using this driver
is not a safe thing to do.
I think the best thing to do is to have the driver error out in its init
or probe code if the dev0:fun1 is hidden at boot time.
Comments?
Next steps?
Do you want me to send a patch implementing graceful error handling at
driver init time so it doesn't load if DEV0:FUN1 is hidden?
--mgross
Intel Open Source Technology Center
(503) 677-4628
(503)-712-6227
ms: JF1-235
2111 NW 25th Ave
Hillsboro, OR 97124
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-04-21 16:01 Gross, Mark
@ 2006-04-21 21:13 ` Ingo Oeser
2006-04-24 13:19 ` Alan Cox
1 sibling, 0 replies; 44+ messages in thread
From: Ingo Oeser @ 2006-04-21 21:13 UTC (permalink / raw)
To: Gross, Mark
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
Hi Mark,
On Friday, 21. April 2006 18:01, Gross, Mark wrote:
> I'm sorry to have to bring up these issues after a fare amount of good
> work, and I don't know how this problem managed to get by for as long as
> it has, but there are some issues with the EDAC and the BIOS for managed
> computer systems.
Can this condition be detected via DMI?
Then you could implement a whitelist or blacklist
and a module parameter to override it.
Regards
Ingo Oeser
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-04-21 16:01 Gross, Mark
2006-04-21 21:13 ` Ingo Oeser
@ 2006-04-24 13:19 ` Alan Cox
2006-04-24 17:38 ` Doug Thompson
1 sibling, 1 reply; 44+ messages in thread
From: Alan Cox @ 2006-04-24 13:19 UTC (permalink / raw)
To: Gross, Mark
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
On Gwe, 2006-04-21 at 09:01 -0700, Gross, Mark wrote:
> 1) The default AMI BIOS behavior on SMI is to check the chipset error
> registers (Dev0:Fun1) and re-hide them.
The words "bad design" come to mind (followed by a large number of more
accurate phrases that are inappropriate for a public list)
> Basically if device 0 : function 1 is hidden by the platform at boot
> time un-hiding and using the device and function is a risky thing to do,
Intel provided patches that do exactly this for some of the chip
workarounds. Are you saying the Intel chip work around also needs
fixing ?
> The driver should never get loaded by default or automatically. If the
> user knows enough about there BIOS to trust that the SMI behavior will
> coexist with the driver then its OK to load otherwise using this driver
> is not a safe thing to do.
So Intel and/or the BIOS vendors also forgot to put in any kind of
indicator ? How do they expect end users to know this, or OS vendors ?
Is there a technote that covers this mess ?
> I think the best thing to do is to have the driver error out in its init
> or probe code if the dev0:fun1 is hidden at boot time.
>
> Comments?
Why did Intel bother implementing this functionality and then screwing
it up so that OS vendors can't use it ? It seems so bogus.
At the very least we should print a warning advising the user that the
BIOS is incompatible and to ask the BIOS vendor for an update so that
they can enable error detection and management support.
Is only the AMI BIOS this braindamaged, should we just blacklist AMI
bioses in EDAC or is this shared Intel supplied code that may be found
in other vendors systems.
Alan
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Problems with EDAC coexisting with BIOS
2006-04-24 13:19 ` Alan Cox
@ 2006-04-24 17:38 ` Doug Thompson
0 siblings, 0 replies; 44+ messages in thread
From: Doug Thompson @ 2006-04-24 17:38 UTC (permalink / raw)
To: Alan Cox, Gross, Mark
Cc: bluesmoke-devel, LKML, Carbonari, Steven, Ong, Soo Keong,
Wang, Zhenyu Z
--- Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> At the very least we should print a warning advising the user that the
> BIOS is incompatible and to ask the BIOS vendor for an update so that
> they can enable error detection and management support.
Asking for the update might prompt for an update to actually be generated, but I
doubt it.
>
> Is only the AMI BIOS this braindamaged, should we just blacklist AMI
> bioses in EDAC or is this shared Intel supplied code that may be found
> in other vendors systems.
The engineer who wrote the E752X driver (originally from bluesmoke) was a LinuxBIOS
engineer. We do make some systems with factory BIOS, but not as many as with
LinuxBIOS. We might see these issues on those factory BIOS system, but none that I
know of.
doug thompson
>
> Alan
>
"If you think Education is expensive, just try Ignorance"
"Don't tell people HOW to do things, tell them WHAT you
want and they will surprise you with their ingenuity."
Gen George Patton
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2006-05-04 16:44 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-25 23:25 Problems with EDAC coexisting with BIOS Gross, Mark
2006-04-26 2:19 ` Corey Minyard
2006-04-26 2:34 ` Randy.Dunlap
2006-04-26 18:26 ` mark gross
2006-04-26 18:38 ` Randy.Dunlap
2006-04-26 19:39 ` mark gross
2006-04-26 20:17 ` Randy.Dunlap
-- strict thread matches above, loose matches on Subject: below --
2006-05-04 16:44 David Peterson
2006-05-03 23:06 David Peterson
2006-05-03 22:22 Doug Thompson
2006-05-03 21:39 Doug Thompson
2006-05-03 20:49 Gross, Mark
2006-04-26 3:24 Gross, Mark
2006-04-26 3:19 Gross, Mark
2006-04-25 21:24 Gross, Mark
2006-04-25 22:39 ` Corey Minyard
2006-04-25 20:22 Gross, Mark
2006-04-25 18:19 Gross, Mark
2006-04-25 19:55 ` Corey Minyard
2006-04-24 18:14 Gross, Mark
2006-04-24 15:57 Gross, Mark
2006-04-24 17:08 ` Eric W. Biederman
2006-04-24 17:49 ` Alan Cox
2006-04-24 14:32 Ong, Soo Keong
2006-04-24 14:15 Ong, Soo Keong
2006-04-24 14:29 ` Alan Cox
2006-05-03 20:25 ` Tim Small
2006-05-03 20:37 ` thockin
2006-05-04 9:45 ` Tim Small
2006-05-03 21:44 ` Alan Cox
2006-05-04 9:02 ` Tim Small
2006-04-24 13:59 Ong, Soo Keong
2006-04-24 14:13 ` Alan Cox
2006-04-23 1:44 Gross, Mark
2006-04-21 22:36 Doug Thompson
2006-04-21 22:20 Gross, Mark
2006-04-22 18:31 ` Tim Small
2006-04-21 21:42 Doug Thompson
2006-04-21 21:32 Gross, Mark
2006-04-21 20:57 Doug Thompson
2006-04-21 16:01 Gross, Mark
2006-04-21 21:13 ` Ingo Oeser
2006-04-24 13:19 ` Alan Cox
2006-04-24 17:38 ` Doug Thompson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox