public inbox for linux-kernel@vger.kernel.org
* Can we ignore errors in mcelog if the server is running fine
@ 2006-07-27 11:11 Vikas Kedia
  0 siblings, 0 replies; 6+ messages in thread
From: Vikas Kedia @ 2006-07-27 11:11 UTC (permalink / raw)
  To: linux-kernel

The server seems to be running fine. A. Can I ignore the following
mcelog errors? B. If not, what should I do to stop the server from
reporting mcelog errors?

root@srv3:~# less /var/log/mcelog
MCE 0
CPU 0 0 data cache TSC 997fa760e9
ADDR 2c13340
  Data cache ECC error (syndrome e3)
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS 9471c00000000833 MCGSTATUS 0
MCE 0
CPU 0 0 data cache TSC 1afa02913ab
ADDR 2c13380
  Data cache ECC error (syndrome e3)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS d471c00000000833 MCGSTATUS 0
root@srv3:~# pwd
/root


Best regards,

v


* Re: Can we ignore errors in mcelog if the server is running fine
       [not found] <6Dc4C-1tt-47@gated-at.bofh.it>
@ 2006-07-27 12:34 ` Bodo Eggert
  0 siblings, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2006-07-27 12:34 UTC (permalink / raw)
  To: Vikas Kedia, linux-kernel

Vikas Kedia <kedia.vikas@gmail.com> wrote:

> The server seems to be running fine.

Since these errors were corrected, you should expect that.

> A. can I ignore the following
> mcelog errors ?

Obviously you can, but I doubt you want to.

> B. If not what should i do to stop the server from
> reporting mcelog errors.

Fix the problem.

> CPU 0 0 data cache TSC 997fa760e9
> ADDR 2c13340
>   Data cache ECC error (syndrome e3)
>        bit46 = corrected ecc error

It's reported to be the data cache on CPU 0. You'll need to replace that
part of the cache (and the rest of the CPU, since you can't buy spare
cache lines or soldering irons that small ;-). The old CPU will be fine for
unimportant machines.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

http://david.woodhou.se/why-not-spf.html


* Re: Can we ignore errors in mcelog if the server is running fine
       [not found] <fa.2RkKSvRvPsGNSGCsUHQ9gQ8qlrg@ifi.uio.no>
@ 2006-07-27 19:34 ` Robert Hancock
  2006-07-28  5:28   ` Handle X
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2006-07-27 19:34 UTC (permalink / raw)
  To: Vikas Kedia; +Cc: linux-kernel

Vikas Kedia wrote:
> The server seems to be running fine. A. can I ignore the following
> mcelog errors ? B. If not what should i do to stop the server from
> reporting mcelog errors.

These look like data cache ECC errors, which means CPU 0 is faulty.
If it's not replaced, uncorrectable errors will likely show up
eventually and the system will likely crash.
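
The distinction between corrected and uncorrected errors can be read directly from the STATUS values in the log. Below is a minimal sketch, not mcelog's own decoder: the bit positions 62, 46, 40 and 32 are taken from the mcelog output quoted in this thread, bits 63 (valid) and 61 (uncorrected) are assumed from the architectural machine check status layout, and the function names are hypothetical. Bits 46, 40 and 32 are model- and bank-specific, so the table is only meant for logs like the ones shown here.

```python
# Flag bits of a machine check STATUS value.
# 62, 46, 40, 32: as printed by mcelog in this thread (model/bank specific).
# 63, 61: assumed from the architectural status register layout.
STATUS_BITS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "uncorrected error",
    46: "corrected ecc error",
    40: "error found by scrub",
    32: "err cpu0",
}


def decode_status(status_hex):
    """Return descriptions of the known flag bits set in a STATUS value."""
    value = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (value >> bit) & 1
    ]


def is_corrected(status_hex):
    """True if the corrected-ECC bit is set and the uncorrected bit is clear."""
    value = int(status_hex, 16)
    return bool((value >> 46) & 1) and not (value >> 61) & 1


if __name__ == "__main__":
    for status in ("9471c00000000833", "d471c00000000833"):
        print(status, decode_status(status), is_corrected(status))
```

Both STATUS values from the original report decode as corrected errors; the second one additionally has the overflow bit set, meaning more errors arrived than could be logged.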

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: Can we ignore errors in mcelog if the server is running fine
  2006-07-27 19:34 ` Can we ignore errors in mcelog if the server is running fine Robert Hancock
@ 2006-07-28  5:28   ` Handle X
  0 siblings, 0 replies; 6+ messages in thread
From: Handle X @ 2006-07-28  5:28 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Vikas Kedia, linux-kernel

On 7/27/06, Robert Hancock <hancockr@shaw.ca> wrote:
> Vikas Kedia wrote:
> > The server seems to be running fine. A. can I ignore the following
> > mcelog errors ? B. If not what should i do to stop the server from
> > reporting mcelog errors.
>
> Looks like data cache ECC errors, meaning the CPU 0 is faulty.
> Eventually if it's not replaced there will likely be some uncorrectable
> errors and the system will likely crash.

I am seeing similar, but not identical, errors.

[root@turyxsrv ~]# mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 2021
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 2021
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9410c00120080813 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC afe4eba238a
ADDR 1d8049698
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 2021
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC cc945738d0a
ADDR 194c4b670
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 2021
       bit40 = error found by scrub
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9410c10020080a13 MCGSTATUS 0

These repeat whenever I do any kind of operation.
How severe are Chipkill errors? Should I consider throwing away CPU 1
and getting another one?

Regards,
Om.


* Re: Can we ignore errors in mcelog if the server is running fine
       [not found] ` <fa.9M8mPetEI5HZ8L2RMGPhKPm3gJA@ifi.uio.no>
@ 2006-07-28  8:34   ` Robert Hancock
  2006-07-28 18:13     ` Handle X
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2006-07-28  8:34 UTC (permalink / raw)
  To: Handle X; +Cc: Vikas Kedia, linux-kernel

Handle X wrote:
> On 7/27/06, Robert Hancock <hancockr@shaw.ca> wrote:
>> Vikas Kedia wrote:
>> > The server seems to be running fine. A. can I ignore the following
>> > mcelog errors ? B. If not what should i do to stop the server from
>> > reporting mcelog errors.
>>
>> Looks like data cache ECC errors, meaning the CPU 0 is faulty.
>> Eventually if it's not replaced there will likely be some uncorrectable
>> errors and the system will likely crash.
> 
> I am facing similar, but different errors.
> 
> [root@turyxsrv ~]# mcelog
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 4 northbridge TSC 89a560bb249
> ADDR 1dfa49690
>  Northbridge Chipkill ECC error
>  Chipkill ECC syndrome = 2021
>       bit46 = corrected ecc error
>  bus error 'local node response, request didn't time out
>      generic read mem transaction
>      memory access, level generic'
> STATUS 9410c00020080a13 MCGSTATUS 0

> Repeats whenever I do any kind of operations...
> How severe is ChipKill errors? Should I consider throwing away CPU 1
> and get another one.

That sounds to me more like some of the RAM attached to CPU1 is bad.
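
The component named on each "CPU ..." header line is what separates the two cases in this thread (data cache versus northbridge). Here is a rough sketch for tallying a log by CPU and component. It assumes the header layout seen in the output above, read as CPU number, bank number, component name, then TSC; that reading of the second number, the regular expression, and the function name are assumptions for illustration, not part of mcelog.

```python
import re
from collections import Counter

# Matches header lines such as "CPU 1 4 northbridge TSC 89a560bb249".
# Assumed fields: CPU number, bank number, component name, TSC value.
HEADER = re.compile(r"^CPU (\d+) (\d+) (.+?) TSC ([0-9a-f]+)$")


def count_by_component(log_text):
    """Count logged events per (cpu, component) pair in mcelog-style text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            cpu, _bank, component, _tsc = match.groups()
            counts[(int(cpu), component)] += 1
    return counts


if __name__ == "__main__":
    sample = """\
CPU 0 0 data cache TSC 997fa760e9
CPU 1 4 northbridge TSC 89a560bb249
CPU 1 4 northbridge TSC a6550f2d4de
"""
    for key, count in sorted(count_by_component(sample).items()):
        print(key, count)
```

A tally dominated by northbridge entries on one CPU points at the memory attached to that CPU rather than at the CPU's own caches.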

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: Can we ignore errors in mcelog if the server is running fine
  2006-07-28  8:34   ` Robert Hancock
@ 2006-07-28 18:13     ` Handle X
  0 siblings, 0 replies; 6+ messages in thread
From: Handle X @ 2006-07-28 18:13 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Vikas Kedia, linux-kernel

> > [root@turyxsrv ~]# mcelog
> > MCE 0
> > HARDWARE ERROR. This is *NOT* a software problem!
> > Please contact your hardware vendor
> > CPU 1 4 northbridge TSC 89a560bb249
> > ADDR 1dfa49690
> >  Northbridge Chipkill ECC error
> >  Chipkill ECC syndrome = 2021
> >       bit46 = corrected ecc error
> >  bus error 'local node response, request didn't time out
> >      generic read mem transaction
> >      memory access, level generic'
> > STATUS 9410c00020080a13 MCGSTATUS 0
>
> > Repeats whenever I do any kind of operations...
> > How severe is ChipKill errors? Should I consider throwing away CPU 1
> > and get another one.
>
> That sounds to me more like some of the RAM attached to CPU1 is bad..
I took out CPU1 and the errors went away, but so did half of the RAM
(which is accessible only to CPU1).
Okay, I will move the RAM from CPU0 over to CPU1 and test again. If
I get the messages again, I will replace the RAM.

Thanks.
Om.

