public inbox for linux-kernel@vger.kernel.org
* CPU failures ... or something else ?
@ 2002-12-26  1:53 Josh Brooks
  2002-12-26  1:41 ` Felipe W Damasio
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Josh Brooks @ 2002-12-26  1:53 UTC (permalink / raw)
  To: linux-kernel


Hello,

I have a dual p3 866 running 2.4 kernel that is crashing once every few
days leaving this on the console:


Message from syslogd@localhost at Tue Dec 24 11:30:31 2002 ...
localhost kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
localhost kernel: Bank 4: b200000000040151

Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
localhost kernel: Kernel panic: CPU context corrupt
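
The two hex values in the console output can be split into named fields. The following is a minimal sketch, assuming the bit positions of the architectural IA32_MCG_STATUS and IA32_MCi_STATUS registers as laid out in Intel's architecture manuals. The function and table names are illustrative and are not taken from the kernel source.

```python
# Illustrative decoder only; bit positions are assumed from the architectural
# IA32_MCi_STATUS / IA32_MCG_STATUS layout, not copied from kernel code.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds the failing address
    57: "PCC",    # processor context corrupt
}
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names in `table` whose bit is set in `value`, high bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a 64-bit bank status word into flags and its two error-code fields."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# Values taken from the console messages above.
bank4 = decode_bank_status(0xB200000000040151)
mcg = set_flags(0x0000000000000004, MCG_STATUS_FLAGS)
print(bank4["flags"], hex(bank4["mca_error_code"]), mcg)
```

Under these assumptions the bank 4 word has VAL, UC, EN and PCC set, which lines up with the "CPU context corrupt" panic text, and the exception value 4 has only MCIP set. The meaning of the low 16-bit error code is model-dependent and is left undecoded here.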



Word on the street is that this indicates hardware failure of some kind
(cpu, bus, or memory).  My main question is, is that very surely the
culprit, or is it also possible that all of the hardware is perfect and
that a bug in the kernel code or some outside influence (remote exploit)
is causing this crash ?

Basically, I am ordering all new hardware to swap out, and I just want to
know if there is some remote possibility that my hardware is actually just
fine and this is some kind of software error ?

ALSO, I have not been physically at the console when this has happened,
and have not tried this yet, but whatever that thing is where you press
ctrl-alt-printscreen and get to enter those post-crash commands - do you
think that would work in this situation, or does the above error hard lock
the system so you can't do those emergency measures ?

thanks!



* Re: CPU failures ... or something else ?
@ 2002-12-26  3:13 Ro0tSiEgE
  2002-12-26  3:22 ` Josh Brooks
  0 siblings, 1 reply; 27+ messages in thread
From: Ro0tSiEgE @ 2002-12-26  3:13 UTC (permalink / raw)
  To: linux-kernel

I never said that. A bad CPU would be my last guess. My first two are buggy
board (use nomce) or bad addresses in your ram. try running Memtest86
(http://www.memtest86.com) for a few minutes and see if you get any errors.
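
After adding "nomce" to the boot loader configuration, one way to confirm the running kernel actually received it is to look at /proc/cmdline. This is a small sketch; the helper names and the sample command line are made up for illustration.

```python
def has_boot_param(cmdline, name):
    """True if `name` appears as a whole word on a kernel command line."""
    return name in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, for illustration only.
sample = "ro root=/dev/sda1 nomce"
print(has_boot_param(sample, "nomce"))
```

On the live machine, `has_boot_param(running_cmdline(), "nomce")` performs the same check against the real boot parameters.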

On Wednesday 25 December 2002 21:04, you wrote:
> So you are saying, that yes, it _is_ possible that my equipment is not
> faulty in any way ?
>
> thanks!
>
> On Wed, 25 Dec 2002, Bubba wrote:
> > try turning off the Machine Check Exception in the kernel as it is just
> > buggy on some machines, not necessarily a bug in the kernel, or without
> > recompiling, use the kernel param "nomce"
> >
> > On Wednesday 25 December 2002 19:53, Josh Brooks wrote:
> > > Hello,
> > >
> > > I have a dual p3 866 running 2.4 kernel that is crashing once every few
> > > days leaving this on the console:
> > >
> > >
> > > Message from syslogd@localhost at Tue Dec 24 11:30:31 2002 ...
> > > localhost kernel: CPU 1: Machine Check Exception: 0000000000000004
> > >
> > > Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> > > localhost kernel: Bank 4: b200000000040151
> > >
> > > Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> > > localhost kernel: Kernel panic: CPU context corrupt
> > >
> > >
> > >
> > > Word on the street is that this indicates hardware failure of some kind
> > > (cpu, bus, or memory).  My main question is, is that very surely the
> > > culprit, or is it also possible that all of the hardware is perfect and
> > > that a bug in the kernel code or some outside influence (remote
> > > exploit) is causing this crash ?
> > >
> > > Basically, I am ordering all new hardware to swap out, and I just want
> > > to know if there is some remote possibility that my hardware is
> > > actually just fine and this is some kind of software error ?
> > >
> > > ALSO, I have not been physically at the console when this has happened,
> > > and have not tried this yet, but whatever that thing is where you press
> > > ctrl-alt-printscreen and get to enter those post-crash commands - do
> > > you think that would work in this situation, or does the above error
> > > hard lock the system so you can't do those emergency measures ?
> > >
> > > thanks!
> > >
> > >




* Re: CPU failures ... or something else ?
@ 2002-12-26  3:31 Billy Rose
  2002-12-26  3:38 ` Josh Brooks
  0 siblings, 1 reply; 27+ messages in thread
From: Billy Rose @ 2002-12-26  3:31 UTC (permalink / raw)
  To: user; +Cc: bp, linux-kernel

> Oh and by the way, this is a dell poweredge 2450, dual 866 p3 cpus,
> 2gigs ram, and using a PERC 3/D.  I have a 2.4.1 system running on
> _identical_ hardware with no problems, and this system that is
> MCE'ing is a 2.4.16.

try reseating the cpus and vrms. if that doesn't work, remove cpu #2
and #2 vrm. run it and see if the error occurs. if no error, #2 cpu or
#2 vrm is bad. if the error still occurs, swap out cpu #1 and #1 vrm
with cpu #2 and #2 vrm, then run again. if the error still occurs,
you're SOL.
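
The elimination steps above can be written down as a small decision function. This is only a sketch of the procedure as described; the function name and its arguments are hypothetical, and the "CPU #1 or VRM #1" branch is an inference (the case where the error disappears after the swap is not spelled out in the message).

```python
def isolate_fault(error_after_reseat, error_without_cpu2, error_after_swap):
    """Map the outcome of each test run to a suspect component.

    Each argument is True if the machine check still occurred in that step.
    Later arguments are ignored once an earlier step settles the question.
    """
    if not error_after_reseat:
        return "fixed by reseating"
    if not error_without_cpu2:
        return "CPU #2 or VRM #2 is bad"
    if not error_after_swap:
        # Inferred: the pair moved out of slot #1 was the faulty one.
        return "CPU #1 or VRM #1 is bad"
    return "neither CPU/VRM pair; fault is elsewhere"


print(isolate_fault(True, False, False))
```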

billy

=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

* Re: CPU failures ... or something else ?
@ 2002-12-26  3:50 Billy Rose
  2002-12-26  3:54 ` Josh Brooks
  0 siblings, 1 reply; 27+ messages in thread
From: Billy Rose @ 2002-12-26  3:50 UTC (permalink / raw)
  To: user; +Cc: bp, linux-kernel

> Well actually I ordered a complete replacement system - identical in
> every way.  So I am getting that on saturday, and presumably that
> will just be the big hammer that makes every problem go away.
>
> I am just posting to get a head start on the issue if, for some crazy
> reason I replace all hardware and the problem continues.  Sounds
> like that is a slim to none chance, since I am dealing with good
> hardware (dell) and it looks like this is a faulty component at work.
>
> Basically I am just moving the disks from one machine to another on
> saturday, and I suspect the problems just disappear when I do that.
>
>
> Comments on the possibility that the problems continue after moving
> the disks to different (but identical) hardware ?
>
>
> thanks!

does this machine have a DRAC card by any chance?


"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

* Re: CPU failures ... or something else ?
@ 2002-12-26  4:03 Billy Rose
  2002-12-26  4:04 ` Josh Brooks
  0 siblings, 1 reply; 27+ messages in thread
From: Billy Rose @ 2002-12-26  4:03 UTC (permalink / raw)
  To: user; +Cc: bp, linux-kernel, felipewd

i agree with felipe, sounds like either a stick of ram is bad, or proc
#1 is fried (possibly its vrm though).

a DRAC is the dell remote assistant card. it sits in a pci slot, has
an intel i860 proc on it, and has a 10/100 for a net cable. if you
have no cards, then it is obviously ruled out.

billy
=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

* Re: CPU failures ... or something else ?
@ 2002-12-26  4:21 Billy Rose
  2002-12-26  4:48 ` Josh Brooks
  0 siblings, 1 reply; 27+ messages in thread
From: Billy Rose @ 2002-12-26  4:21 UTC (permalink / raw)
  To: user; +Cc: bp, linux-kernel, felipewd

> Understood.  Thank you for that diagnosis.
>
>
> usually it says proc #1 in the error, but the first time it said proc
> #0 - is that interesting ?

you're welcome :)

if you're hanging on to that box, remove the memory from banks 3 and 4
and it should be ok. if my memory serves me right, you can't have only 3
banks of memory (hence removing bank 3 also), the motherboard is
configured to handle 1, 2, or 4 populated banks. if you leave bank 3
in while removing bank 4, it will beep at you when you power it on and
do nothing. with a gig of ram, it should still be plenty useful.
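
The bank rule described above can be checked with a few lines. This sketch takes the "1, 2, or 4 populated banks" rule from the message as given (it is recalled from memory there, so treat it as unverified) and assumes the 2 GB is spread evenly across four banks; the names are illustrative.

```python
VALID_BANK_COUNTS = {1, 2, 4}   # rule as recalled in the message above
MB_PER_BANK = 2048 // 4         # assumption: 2 GB split evenly over 4 banks


def will_post(populated_banks):
    """True if the number of populated banks is one the board accepts."""
    return len(populated_banks) in VALID_BANK_COUNTS


def usable_mb(populated_banks):
    """Total memory in MB for the given set of populated banks."""
    return len(populated_banks) * MB_PER_BANK


print(will_post({1, 2, 3}), will_post({1, 2}), usable_mb({1, 2}))
```

Pulling only bank 4 leaves three banks, which the rule rejects; pulling banks 3 and 4 leaves two banks and, under the even-split assumption, 1024 MB.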

billy
=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby


end of thread, other threads:[~2003-01-01 20:56 UTC | newest]

Thread overview: 27+ messages
2002-12-26  1:53 CPU failures ... or something else ? Josh Brooks
2002-12-26  1:41 ` Felipe W Damasio
2002-12-26  2:02 ` Bubba
2002-12-26  3:04   ` Josh Brooks
2002-12-26 22:37     ` Pavel Machek
2002-12-26  3:09   ` Josh Brooks
2002-12-26  3:36     ` J Sloan
2002-12-26  3:39       ` Josh Brooks
2002-12-26  3:20   ` Josh Brooks
2002-12-26  6:03 ` Joseph D. Wagner
2002-12-27 23:30   ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2002-12-26  3:13 Ro0tSiEgE
2002-12-26  3:22 ` Josh Brooks
2002-12-26  6:08   ` Joseph D. Wagner
2002-12-26  3:31 Billy Rose
2002-12-26  3:38 ` Josh Brooks
2002-12-26  3:50 Billy Rose
2002-12-26  3:54 ` Josh Brooks
2002-12-26  4:03 Billy Rose
2002-12-26  4:04 ` Josh Brooks
2002-12-26  2:05   ` Felipe W Damasio
2002-12-26  6:13   ` Joseph D. Wagner
2002-12-26  6:35     ` Josh Brooks
2002-12-26  6:49       ` Joseph D. Wagner
2002-12-26  7:08       ` Felipe W Damasio
2002-12-26  4:21 Billy Rose
2002-12-26  4:48 ` Josh Brooks
