* Machine Check Exception on Opteron 265
@ 2007-04-14 14:58 Espen Fjellvær Olsen
2007-04-17 13:56 ` Alan Cox
0 siblings, 1 reply; 6+ messages in thread
From: Espen Fjellvær Olsen @ 2007-04-14 14:58 UTC
To: linux-kernel; +Cc: Drift@tihlde
Hi!
Today our Opteron 265, 2x2, panicked after many months of uptime, giving
only this error message:
HARDWARE ERROR
CPU 2: Machine Check Exception: 4 Bank 4: b60a200100000813
TSC 6bb9fd0142921a ADDR a891e9b8
This is not a software problem!
mcelog --ascii gives this on the above error:
HARDWARE ERROR
CPU 2 4 northbridge TSC 6bb9fd0142921a
Northbridge ECC error
ECC syndrome = 14
bit32 = err cpu0
bit45 = uncorrected ecc error
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS b60a200100000813 MCGSTATUS 4
This is not a software problem!
As far as we know there wasn't any unusual activity on the server at the
time.
We updated glibc yesterday, but that shouldn't really cause such a problem.
So now we wonder whether this might be an MCE bug or really a HW problem,
and, if it is hardware, whether one of the CPUs or the RAM is faulty.
We are running 2.6.18.
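
The STATUS and MCGSTATUS values in the report above can also be decoded by hand as a cross-check on the mcelog output. The following is a minimal sketch, not mcelog's own code: the function names are made up for illustration, the upper flag bits follow the generic x86 MCi_STATUS layout, and the positions of the ECC bits and the syndrome field (assumed here to be bits 54:47) are taken from what mcelog printed for this northbridge bank.

```python
# Values copied from the report above.
STATUS = 0xB60A200100000813
MCGSTATUS = 0x4

# Generic MCi_STATUS flags plus the two northbridge bits that mcelog
# printed for this error (bit 45 and bit 32).
STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
    45: "uncorrected ECC error",
    32: "err cpu0",
}

# MCG_STATUS has three architectural bits.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if (value >> bit) & 1]


def decode_status(status):
    """Split a bank status word into flags, ECC syndrome and error code."""
    code = status & 0xFFFF
    result = {
        "flags": set_flags(status, STATUS_FLAGS),
        # Assumption: syndrome in bits 54:47, matching mcelog's "14".
        "syndrome": (status >> 47) & 0xFF,
        "error_code": code,
    }
    # Bus errors use the bit pattern 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        result["bus"] = {
            "participation": (code >> 9) & 0x3,  # 0: local node origin
            "timeout": (code >> 8) & 0x1,        # 0: did not time out
            "request": (code >> 4) & 0xF,        # 1: generic read
            "io_mem": (code >> 2) & 0x3,         # 0: memory access
            "level": code & 0x3,                 # 3: generic level
        }
    return result


if __name__ == "__main__":
    decoded = decode_status(STATUS)
    for name in decoded["flags"]:
        print("set:", name)
    print("ECC syndrome = %x" % decoded["syndrome"])
    print("bus fields:", decoded.get("bus"))
    print("MCGSTATUS:", set_flags(MCGSTATUS, MCG_FLAGS))
```

Decoded this way, the bus fields line up with the "local node origin, request didn't time out, generic read, memory access, level generic" text from mcelog, and MCGSTATUS 4 has only the "machine check in progress" bit set.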
--
Best regards
Espen Fjellvær Olsen
Drift@Tihlde
espen@mrfjo.org
* Re: Machine Check Exception on Opteron 265
[not found] <fa.bATEBhtZlyN0K1NKutzIYJJSRC0@ifi.uio.no>
@ 2007-04-14 15:39 ` Robert Hancock
2007-04-16 6:31 ` Joachim Deguara
0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2007-04-14 15:39 UTC
To: Drift@tihlde; +Cc: linux-kernel
Espen Fjellvær Olsen wrote:
> Hi!
> Today our Opteron 265, 2x2, panicked after many months of uptime, giving
> only this error message:
>
> HARDWARE ERROR
> CPU 2: Machine Check Exception: 4 Bank 4: b60a200100000813
> TSC 6bb9fd0142921a ADDR a891e9b8
> This is not a software problem!
>
> mcelog --ascii gives this on the above error:
>
> HARDWARE ERROR
> CPU 2 4 northbridge TSC 6bb9fd0142921a
> Northbridge ECC error
> ECC syndrome = 14
> bit32 = err cpu0
> bit45 = uncorrected ecc error
> bit57 = processor context corrupt
> bit61 = error uncorrected
> bus error 'local node origin, request didn't time out
> generic read mem transaction
> memory access, level generic'
> STATUS b60a200100000813 MCGSTATUS 4
> This is not a software problem!
>
>
> As far as we know there wasn't any unusual activity on the server at the
> time.
> We updated glibc yesterday, but that shouldn't really cause such a problem.
> So now we wonder whether this might be an MCE bug or really a HW problem,
> and, if it is hardware, whether one of the CPUs or the RAM is faulty.
> We are running 2.6.18.
Sounds like some bad RAM..
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
* Re: Machine Check Exception on Opteron 265
2007-04-14 15:39 ` Robert Hancock
@ 2007-04-16 6:31 ` Joachim Deguara
0 siblings, 0 replies; 6+ messages in thread
From: Joachim Deguara @ 2007-04-16 6:31 UTC
To: Robert Hancock; +Cc: Drift@tihlde, linux-kernel
On Saturday 14 April 2007 17:39:28 Robert Hancock wrote:
> Espen Fjellvær Olsen wrote:
> > As far as we know there wasn't any unusual activity on the server at the
> > time.
> > We updated glibc yesterday, but that shouldn't really cause such a
> > problem. So now we wonder whether this might be an MCE bug or really a HW
> > problem, and, if it is hardware, whether one of the CPUs or the RAM is
> > faulty.
> > We are running 2.6.18.
>
> Sounds like some bad RAM..
Clearly. I would run Memtest86+ [1] overnight, and you should see the bad
DIMM. Version 1.70 of that tool adds an option to display the DMI name of
the failing DIMM, which may help you locate the exact module. The option is
under the Error Reporting menu.
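
As a rough cross-check on whatever memtest reports, the ADDR value from the original report can be converted into the same terms memtest uses for failing locations. This is only a sketch under two assumptions: that ADDR is a physical address, and that pages are 4 KiB. The helper name is made up for illustration, and which DIMM backs this address depends on the board's memory layout, which cannot be derived from the report alone.

```python
# From the "ADDR a891e9b8" field of the original report.
ADDR = 0xA891E9B8

# Assumption: 4 KiB pages, the usual x86-64 page size.
PAGE_SHIFT = 12


def describe_address(addr):
    """Break an address into MiB position, page frame and page offset."""
    return {
        "mib": addr >> 20,
        "page_frame": addr >> PAGE_SHIFT,
        "page_offset": addr & ((1 << PAGE_SHIFT) - 1),
    }


if __name__ == "__main__":
    info = describe_address(ADDR)
    print("address      : %#x" % ADDR)
    print("MiB position : %d" % info["mib"])
    print("page frame   : %#x" % info["page_frame"])
    print("page offset  : %#x" % info["page_offset"])
```

If memtest flags errors close to this position, that supports the bad-DIMM theory; errors elsewhere or none at all would not rule it out, since the machine check was a single sample.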
-Joachim
[1] http://memtest.org/
* Re: Machine Check Exception on Opteron 265
2007-04-14 14:58 Machine Check Exception on Opteron 265 Espen Fjellvær Olsen
@ 2007-04-17 13:56 ` Alan Cox
2007-04-17 13:57 ` Matthew Garrett
2007-04-17 14:18 ` Espen Fjellvær Olsen
0 siblings, 2 replies; 6+ messages in thread
From: Alan Cox @ 2007-04-17 13:56 UTC
To: Drift@tihlde; +Cc: espen, linux-kernel
On Sat, 14 Apr 2007 16:58:43 +0200
Espen Fjellvær Olsen <espen@oxygen.tihlde.org> wrote:
> Hi!
> Today our Opteron 265, 2x2, panicked after many months of uptime, giving
> only this error message:
>
> HARDWARE ERROR
> CPU 2: Machine Check Exception: 4 Bank 4: b60a200100000813
> TSC 6bb9fd0142921a ADDR a891e9b8
> This is not a software problem!
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is there for a good reason.
> We updated glibc yesterday, but that shouldn't really cause such a problem.
> So now we wonder whether this might be an MCE bug or really a HW problem,
> and, if it is hardware, whether one of the CPUs or the RAM is faulty.
Consult your hardware vendor, but if it's a single event in a year it might
be anything, even cosmic rays.
Alan
* Re: Machine Check Exception on Opteron 265
2007-04-17 13:56 ` Alan Cox
@ 2007-04-17 13:57 ` Matthew Garrett
2007-04-17 14:18 ` Espen Fjellvær Olsen
1 sibling, 0 replies; 6+ messages in thread
From: Matthew Garrett @ 2007-04-17 13:57 UTC
To: Alan Cox; +Cc: Drift@tihlde, espen, linux-kernel
On Tue, Apr 17, 2007 at 02:56:18PM +0100, Alan Cox wrote:
> On Sat, 14 Apr 2007 16:58:43 +0200
> Espen Fjellvær Olsen <espen@oxygen.tihlde.org> wrote:
> > Hi!
> > Today our Opteron 265, 2x2, panicked after many months of uptime, giving
> > only this error message:
> >
> > HARDWARE ERROR
> > CPU 2: Machine Check Exception: 4 Bank 4: b60a200100000813
> > TSC 6bb9fd0142921a ADDR a891e9b8
> > This is not a software problem!
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This is there for a good reason.
Though we saw MCEs being generated by running the HPA code on sata_nv
with awdma, so it's not always true. I agree that in this case, it
probably is.
--
Matthew Garrett | mjg59@srcf.ucam.org
* Re: Machine Check Exception on Opteron 265
2007-04-17 13:56 ` Alan Cox
2007-04-17 13:57 ` Matthew Garrett
@ 2007-04-17 14:18 ` Espen Fjellvær Olsen
1 sibling, 0 replies; 6+ messages in thread
From: Espen Fjellvær Olsen @ 2007-04-17 14:18 UTC
To: Alan Cox; +Cc: Drift@tihlde, espen, linux-kernel
Alan Cox wrote:
> On Sat, 14 Apr 2007 16:58:43 +0200
> Espen Fjellvær Olsen <espen@oxygen.tihlde.org> wrote:
>
>
>> Hi!
>> Today our Opteron 265, 2x2, panicked after many months of uptime, giving
>> only this error message:
>>
>> HARDWARE ERROR
>> CPU 2: Machine Check Exception: 4 Bank 4: b60a200100000813
>> TSC 6bb9fd0142921a ADDR a891e9b8
>> This is not a software problem!
>>
*snip*
> Consult your hardware vendor, but if it's a single event in a year it might
> be anything, even cosmic rays.
>
Yes, we have had more crashes since then, and have removed some of our DIMMs
in the hope of getting a stable system again.
We are of course running memtest on those DIMMs. Hopefully it is one of
them, and not one of the CPUs =)
--
Best regards
Espen Fjellvær Olsen
Drift @ Tihlde
espenfo@tihlde.org