* Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard
From: Andy Chittenden @ 2008-08-18 14:17 UTC
To: linux-kernel

I've just installed the linux-image-2.6.26-1-amd64 debian package on
three of our ASUS P5W DH Deluxe based machines and they've all started
spewing out messages:

Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0, grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0, grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0, grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0, grain 128, row 2, labels ":": i82975x UE

every second.

I've removed that kernel package and they're running previous versions
of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
on one of them with no problems. So, anyone got any ideas what's causing
this? (FWIW the machines have all got ECC memory in them).

--
Andy, BlueArc Engineering

* Re: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard
From: Doug Thompson @ 2008-08-18 19:45 UTC
To: Andy Chittenden, linux-kernel

--- Andy Chittenden <andyc@bluearc.com> wrote:

> I've just installed the linux-image-2.6.26-1-amd64 debian package on
> three of our ASUS P5W DH Deluxe based machines and they've all started
> spewing out messages:
>
> Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
> savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
> savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
> savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
> savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> every second.
>
> I've removed that kernel package and they're running previous versions
> of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
> on one of them with no problems. So, anyone got any ideas what's causing
> this? (FWIW the machines have all got ECC memory in them).
>
> --
> Andy, BlueArc Engineering

I don't know which version of the source code was used in the 25 or
the 26 versions of the debian package, but it might be that the later
one is really finding errors, as I remember there were some patches
against the i82975x module.

The reports printed above are consistent.
They are ALL in Chip Select Row 2, yet all 3 of the machines are
outputting messages.

Are they ALL the same row, or are they different rows? If different,
they could be legit. If they are all the same row, there might be an
issue.

Reading the manual for the mobo
(http://support.asus.com/download/download.aspx?SLanguage=en-us)
I see that there are 4 slots for memory:

DIMM_A1
DIMM_A2
DIMM_B1
DIMM_B2

In the output above, you can see the following:

labels ":"

When properly set by edac-utils (http://sourceforge.net/projects/edac-utils/)
user space support package (IF the target motherboard is set in its
database), the labels field will name the offending DIMM, like
"DIMM_A2" or such. This aids in identifying the problem DIMM.

If you already have this installed, you might need to add your
motherboard's DIMM labels to the motherboard database to see it.

Since I don't have one of these chipsets, is it possible I could get
access to one or more of these machines to take a look around?

doug t
W1DUG

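To answer the "same row or different rows" question across several machines, the UE reports can be tallied per csrow. The following is only a minimal sketch: the regular expression is inferred from the log lines quoted in this thread, not from the kernel source, and `tally_rows` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the UE lines quoted in this thread (an assumption,
# not taken from the i82975x driver source).
UE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): UE page 0x(?P<page>[0-9a-fA-F]+),\s+"
    r"offset 0x[0-9a-fA-F]+,\s+grain \d+,\s+row (?P<row>\d+)"
)


def tally_rows(lines):
    """Count UE reports per (memory controller, csrow) pair."""
    counts = Counter()
    for line in lines:
        match = UE_RE.search(line)
        if match:
            key = (int(match.group("mc")), int(match.group("row")))
            counts[key] += 1
    return counts


# Two of the lines reported from the machine "savage", plus one unrelated line.
SAMPLE = [
    'savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0, '
    'grain 128, row 2, labels ":": i82975x UE',
    'savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0, '
    'grain 128, row 2, labels ":": i82975x UE',
    'savage kernel: [ 80.000000] some unrelated message',
]
```

Running `tally_rows` over each machine's kernel log and comparing the resulting keys shows whether every machine reports the same csrow or different ones.
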
* RE: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard
From: Andy Chittenden @ 2008-08-19 8:17 UTC
To: Doug Thompson, linux-kernel

Hi Doug

> I don't know which version of the source code was used in the 25 or
> the 26 versions of the debian package, but it might be that the later
> one is really finding errors as I remember there were some patches
> against the i82975x module.

I've done a diff between the 2.6.25 and 2.6.26 source code of the
i82975x_edac module. As you can see, there's not much difference:

# diff -u linux-2.6.2[56]/drivers/edac/i82975x_edac.c
--- linux-2.6.25/drivers/edac/i82975x_edac.c 2008-04-17 03:49:44.000000000 +0100
+++ linux-2.6.26/drivers/edac/i82975x_edac.c 2008-07-13 22:51:29.000000000 +0100
@@ -14,7 +14,7 @@
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/slab.h>
-
+#include <linux/edac.h>
 #include "edac_core.h"
 
 #define I82975X_REVISION " Ver: 1.0.0 " __DATE__
@@ -611,6 +611,9 @@
 
 	debugf3("%s()\n", __func__);
 
+	/* Ensure that the OPSTATE is set correctly for POLL or NMI */
+	opstate_init();
+
 	pci_rc = pci_register_driver(&i82975x_driver);
 	if (pci_rc < 0)
 		goto fail0;
@@ -664,3 +667,6 @@
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Arvind R. <arvind@acarlab.com>");
 MODULE_DESCRIPTION("MC support for Intel 82975 memory hub controllers");
+
+module_param(edac_op_state, int, 0444);
+MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll,1=NMI");

> Are they ALL the same row, or are they different rows? If different,
> they could be legit. The same row there might be an issue.

Hmm, they're different.
On another machine, I've managed to find the logged info from when it
booted up 2.6.26:

/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 9.079151] EDAC MC0: UE page 0x7fe0b, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 10.104762] EDAC MC0: UE page 0x7e451, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 11.110256] EDAC MC0: UE page 0x7e7ae, offset 0x0, grain 128, row 1, labels ":": i82975x UE
...
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 11.636753] EDAC MC0: UE page 0x60000, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 12.641616] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 13.734052] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 14.743449] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE

> When properly set by edac-utils (http://sourceforge.net/projects/edac-utils/) ...

Thanks for the pointer. I've now installed edac-utils on the offending
motherboards.
It seems that the motherboard is half known about:

# edac-ctl --mainboard
edac-ctl: mainboard: ASUSTEK COMPUTER INC P5W DH Deluxe
# edac-ctl --print-labels
No dimm labels for ASUSTEK COMPUTER INC P5W DH Deluxe

dmidecode gives some memory module info:

Handle 0x0009, DMI type 6, 12 bytes
Memory Module Information
	Socket Designation: DIMM0
	Bank Connections: 9 11
	Current Speed: 30 ns
	Type: Unknown FPM Parity ECC SDRAM
	Installed Size: 2048 MB (Double-bank Connection)
	Enabled Size: 2048 MB (Double-bank Connection)
	Error Status: OK

Handle 0x000A, DMI type 6, 12 bytes
Memory Module Information
	Socket Designation: DIMM1
	Bank Connections: 9 11
	Current Speed: 30 ns
	Type: Unknown FPM Parity ECC SDRAM
	Installed Size: 2048 MB (Double-bank Connection)
	Enabled Size: 2048 MB (Double-bank Connection)
	Error Status: OK

Handle 0x000B, DMI type 6, 12 bytes
Memory Module Information
	Socket Designation: DIMM2
	Bank Connections: 9 11
	Current Speed: 30 ns
	Type: Unknown FPM Parity ECC SDRAM
	Installed Size: 2048 MB (Double-bank Connection)
	Enabled Size: 2048 MB (Double-bank Connection)
	Error Status: OK

Handle 0x000C, DMI type 6, 12 bytes
Memory Module Information
	Socket Designation: DIMM3
	Bank Connections: 9 11
	Current Speed: 30 ns
	Type: Unknown FPM Parity ECC SDRAM
	Installed Size: 2048 MB (Double-bank Connection)
	Enabled Size: 2048 MB (Double-bank Connection)
	Error Status: OK

> Since I don't have one of these chipsets, is it possible I could get
> access to one or more of these machines to take a look around?

Unfortunately not. If there are any commands you'd like me to run, then
please let me know. If you could let me know what I need to put in
/etc/edac/labels.db, that would be appreciated too.

--
Andy, BlueArc Engineering

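The dmidecode output above names the sockets DIMM0 to DIMM3, while the board manual cited earlier in the thread uses DIMM_A1, DIMM_A2, DIMM_B1 and DIMM_B2. As a rough sketch of how the BIOS-reported names could be pulled out for comparison against the silkscreen, the hypothetical helper `parse_memory_modules` below reads text laid out like the "Memory Module Information" records shown here. It assumes only the two field names visible in that output and makes no claim about other dmidecode record types.

```python
import re


def parse_memory_modules(text):
    """Return [(socket designation, installed size in MB or None), ...]."""
    modules = []
    socket = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Socket Designation:"):
            # Remember the socket name until its size line is seen.
            socket = line.split(":", 1)[1].strip()
        elif line.startswith("Installed Size:") and socket is not None:
            size = re.search(r"(\d+)\s*MB", line)
            modules.append((socket, int(size.group(1)) if size else None))
            socket = None
    return modules


# Abbreviated copy of two records from the output quoted in this message.
SAMPLE = """\
Handle 0x0009, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: DIMM0
    Installed Size: 2048 MB (Double-bank Connection)
Handle 0x000A, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: DIMM1
    Installed Size: 2048 MB (Double-bank Connection)
"""
```

Feeding it the saved output of dmidecode gives the list of BIOS socket names, which on this board are generic and still have to be matched to the physical slots by hand.
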
* RE: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard
From: Doug Thompson @ 2008-08-19 17:41 UTC
To: Andy Chittenden, linux-kernel

--- Andy Chittenden <andyc@bluearc.com> wrote:

>
> If you could let me know what I need to put in /etc/edac/labels.db, that
> would be appreciated too.
>

Discovering the mapping of DIMMs to the silkscreen is a manual,
one-time event.

One command is 'dmidecode', which is run as root and dumps the BIOS
DMI tables. Unfortunately, many BIOSes do not set these tables to the
correct DIMM silkscreen labels. Because of this lack, EDAC and
edac-utils were created to provide a mechanism for end users.

If your system does provide correct DIMM labels, you can create/correct
the entry for your motherboard in the database file for edac-utils.

If your system provides simple generic labels, then you will need to
physically move DIMMs from slot to slot, watching as the error "moves"
with the DIMM. This will take a few iterations and a state table.

Usually, a DIMM will have 2 Chip-Select Rows (csrows).

The first set of DIMMs forms a 128-bit data path (called dual channel
operation) and has csrows 0 and 1.

The second set of DIMMs will have csrows 2 and 3.

Therefore, you need to examine which csrow and which channel the error
is being reported in.

doug t
W1DUG

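The csrow rule of thumb above can be written down as a small calculation. This is a sketch under stated assumptions only: it takes the "usually 2 csrows per DIMM" layout from this message at face value, treats the EDAC "page" value as a page frame number with 4 KiB pages, and uses hypothetical helper names. Which physical slots make up each DIMM set is not stated in the thread and is not encoded here.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def dimm_set_for_csrow(csrow, rows_per_dimm=2):
    """Return the 0-based DIMM set owning a csrow (csrows 0-1 -> 0, 2-3 -> 1)."""
    if csrow < 0:
        raise ValueError("csrow must be non-negative")
    return csrow // rows_per_dimm


def page_to_phys(page, offset=0):
    """Convert an EDAC page/offset pair to a physical address.

    Assumes 'page' is a page frame number, which is how the reports in
    this thread are read here.
    """
    return (page << PAGE_SHIFT) + offset


# The reports from "savage" were all in csrow 2; "diesel" showed csrows 1 and 3.
savage_set = dimm_set_for_csrow(2)
diesel_sets = sorted({dimm_set_for_csrow(r) for r in (1, 3)})
first_savage_addr = hex(page_to_phys(0x7fe03))
```

Under these assumptions the reports from the two machines fall in different DIMM sets, which is the kind of bookkeeping the state table for the DIMM-swapping procedure needs.
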
* Re: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard
From: Bernd Schubert @ 2008-08-18 19:49 UTC
To: linux-kernel

Andy Chittenden wrote:

> I've just installed the linux-image-2.6.26-1-amd64 debian package on
> three of our ASUS P5W DH Deluxe based machines and they've all started
> spewing out messages:
>
> Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
> savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
> savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
> savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
> savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> every second.
>
> I've removed that kernel package and they're running previous versions
> of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
> on one of them with no problems. So, anyone got any ideas what's causing
> this? (FWIW the machines have all got ECC memory in them).
>

Do you have an IPMI card installed in these systems? This is a known
issue here with Asus boards + IPMI; you then need to disable a few IPMI
sensors.

Cheers,
Bernd

Thread overview: 5+ messages (newest: 2008-08-19 17:48 UTC)

2008-08-18 14:17 Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard Andy Chittenden
2008-08-18 19:45 ` Doug Thompson
2008-08-19  8:17   ` Andy Chittenden
2008-08-19 17:41     ` Doug Thompson
2008-08-18 19:49 ` Bernd Schubert