* Re: Can we ignore errors in mcelog if the server is running fine
[not found] <fa.2RkKSvRvPsGNSGCsUHQ9gQ8qlrg@ifi.uio.no>
@ 2006-07-27 19:34 ` Robert Hancock
2006-07-28 5:28 ` Handle X
0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2006-07-27 19:34 UTC (permalink / raw)
To: Vikas Kedia; +Cc: linux-kernel
Vikas Kedia wrote:
> The server seems to be running fine. A. Can I ignore the following
> mcelog errors? B. If not, what should I do to stop the server from
> reporting mcelog errors?
Looks like data cache ECC errors, meaning CPU 0 is faulty.
If it's not replaced, there will likely be some uncorrectable
errors eventually, and the system will likely crash.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Can we ignore errors in mcelog if the server is running fine
2006-07-27 19:34 ` Can we ignore errors in mcelog if the server is running fine Robert Hancock
@ 2006-07-28 5:28 ` Handle X
0 siblings, 0 replies; 6+ messages in thread
From: Handle X @ 2006-07-28 5:28 UTC (permalink / raw)
To: Robert Hancock; +Cc: Vikas Kedia, linux-kernel
On 7/27/06, Robert Hancock <hancockr@shaw.ca> wrote:
> Vikas Kedia wrote:
> > The server seems to be running fine. A. Can I ignore the following
> > mcelog errors? B. If not, what should I do to stop the server from
> > reporting mcelog errors?
>
> Looks like data cache ECC errors, meaning CPU 0 is faulty.
> If it's not replaced, there will likely be some uncorrectable
> errors eventually, and the system will likely crash.
I am seeing similar, though not identical, errors.
[root@turyxsrv ~]# mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00120080813 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC afe4eba238a
ADDR 1d8049698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC cc945738d0a
ADDR 194c4b670
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit40 = error found by scrub
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c10020080a13 MCGSTATUS 0
This repeats whenever I do any kind of operation.
How severe are Chipkill errors? Should I consider throwing away CPU 1
and getting another one?
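A rough way to see how often each kind of record repeats is to tally the mcelog text by CPU, bank name and syndrome. The sketch below is only an illustration: the sample text is trimmed from the output above, the function name tally is made up here, and the reading of the "CPU 1 4 northbridge" line as CPU number, bank number, bank name is an assumption based on how the log looks.

```python
import re
from collections import Counter

# Trimmed copy of two records from the output above (illustration only).
SAMPLE = """\
MCE 0
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
Chipkill ECC syndrome = 2021
bit32 = err cpu0
bit46 = corrected ecc error
STATUS 9410c00120080813 MCGSTATUS 0
"""

def tally(text):
    """Count records per (cpu, bank name, syndrome); hypothetical helper."""
    counts = Counter()
    cpu = bank = syndrome = None
    for line in text.splitlines():
        # Assumed layout: "CPU <cpu> <bank number> <bank name> TSC <tsc>"
        m = re.match(r"CPU (\d+) (\d+) (.+?) TSC", line)
        if m:
            cpu, bank, syndrome = int(m.group(1)), m.group(3), None
            continue
        # Matches both "syndrome = 2021" and "(syndrome e3)"
        m = re.search(r"syndrome\s*=?\s*([0-9a-f]+)", line)
        if m:
            syndrome = m.group(1)
            continue
        # The STATUS line closes a record in the output shown above
        if line.startswith("STATUS") and cpu is not None:
            counts[(cpu, bank, syndrome)] += 1
    return counts

for key, n in tally(SAMPLE).items():
    print(key, n)
```

If the same CPU, bank and syndrome keep coming up, that may hint at a single failing part, but mapping a syndrome to a specific device needs the hardware vendor's documentation.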
Regards,
Om.
[parent not found: <fa.5uWgnVpIOBN4Pb1aWwNzF8P2OA0@ifi.uio.no>]
[parent not found: <6Dc4C-1tt-47@gated-at.bofh.it>]
* Re: Can we ignore errors in mcelog if the server is running fine
[not found] <6Dc4C-1tt-47@gated-at.bofh.it>
@ 2006-07-27 12:34 ` Bodo Eggert
0 siblings, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2006-07-27 12:34 UTC (permalink / raw)
To: Vikas Kedia, linux-kernel
Vikas Kedia <kedia.vikas@gmail.com> wrote:
> The server seems to be running fine.
Since these errors were corrected, you should expect that.
> A. Can I ignore the following
> mcelog errors?
Obviously, but I doubt you want to.
> B. If not, what should I do to stop the server from
> reporting mcelog errors?
Fix the problem.
> CPU 0 0 data cache TSC 997fa760e9
> ADDR 2c13340
> Data cache ECC error (syndrome e3)
> bit46 = corrected ecc error
It's reported to be the data cache on CPU 0. You'll need to replace that
part of the cache (and the rest of the CPU, since you can buy neither
spare cache lines nor soldering irons that small ;-). The old CPU will be
fine for unimportant machines.
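The STATUS words in the log can be checked by hand against the bit lines mcelog printed. Below is a small sketch of that check. The names for bits 46, 62, 40 and 32 are copied from the mcelog output in this thread (not every one applies to every bank), bits 63 and 61 follow the generic x86 machine-check status layout, and the syndrome position (bits 54:47) is an assumption that happens to reproduce the "syndrome e3" shown in the log. The function names are made up for illustration.

```python
# Bit names: 46, 62, 40, 32 as printed by mcelog in this thread;
# 63 and 61 from the generic x86 machine-check status layout.
FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "uncorrected error",
    46: "corrected ecc error",
    40: "error found by scrub",
    32: "err cpu0",
}

def decode_status(status_hex):
    """Return the names of the known flag bits set in a STATUS value."""
    status = int(status_hex, 16)
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]

def syndrome_low(status_hex):
    """Low syndrome byte, assuming it sits in bits 54:47 (see note above)."""
    return (int(status_hex, 16) >> 47) & 0xFF

# The two STATUS values from the original report
for s in ("9471c00000000833", "d471c00000000833"):
    print(s, decode_status(s), "syndrome %x" % syndrome_low(s))
```

Neither value has the uncorrected bit set, which fits the "corrected ecc error" lines; the second value differs from the first only in bit 62, the overflow flag.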
--
My thanks to GMX for sabotaging the use of my addresses by means of
lies spread via SPF.
http://david.woodhou.se/why-not-spf.html
* Can we ignore errors in mcelog if the server is running fine
@ 2006-07-27 11:11 Vikas Kedia
0 siblings, 0 replies; 6+ messages in thread
From: Vikas Kedia @ 2006-07-27 11:11 UTC (permalink / raw)
To: linux-kernel
The server seems to be running fine. A. Can I ignore the following
mcelog errors? B. If not, what should I do to stop the server from
reporting mcelog errors?
root@srv3:~# less /var/log/mcelog
MCE 0
CPU 0 0 data cache TSC 997fa760e9
ADDR 2c13340
Data cache ECC error (syndrome e3)
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS 9471c00000000833 MCGSTATUS 0
MCE 0
CPU 0 0 data cache TSC 1afa02913ab
ADDR 2c13380
Data cache ECC error (syndrome e3)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d471c00000000833 MCGSTATUS 0
root@srv3:~# pwd
/root
root@srv3:~# less /var/log/mcelog
MCE 0
CPU 0 0 data cache TSC 997fa760e9
ADDR 2c13340
Data cache ECC error (syndrome e3)
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS 9471c00000000833 MCGSTATUS 0
MCE 0
CPU 0 0 data cache TSC 1afa02913ab
ADDR 2c13380
Data cache ECC error (syndrome e3)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d471c00000000833 MCGSTATUS 0
Best regards,
v
end of thread, other threads:[~2006-07-28 18:13 UTC | newest]
Thread overview: 6+ messages
[not found] <fa.2RkKSvRvPsGNSGCsUHQ9gQ8qlrg@ifi.uio.no>
2006-07-27 19:34 ` Can we ignore errors in mcelog if the server is running fine Robert Hancock
2006-07-28 5:28 ` Handle X
[not found] <fa.5uWgnVpIOBN4Pb1aWwNzF8P2OA0@ifi.uio.no>
[not found] ` <fa.9M8mPetEI5HZ8L2RMGPhKPm3gJA@ifi.uio.no>
2006-07-28 8:34 ` Robert Hancock
2006-07-28 18:13 ` Handle X
[not found] <6Dc4C-1tt-47@gated-at.bofh.it>
2006-07-27 12:34 ` Bodo Eggert
2006-07-27 11:11 Vikas Kedia