public inbox for linux-kernel@vger.kernel.org
* 3ware driver errors
@ 2003-03-24 21:28 Steven Pritchard
  2003-03-25  1:01 ` Jeff V. Merkey
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Pritchard @ 2003-03-24 21:28 UTC (permalink / raw)
  To: linux-kernel

(Apparently 3w-xxxx in the Subject gets caught as spam.  Somebody
might want to adjust that regular expression.  :-)

I have a server that is locking up every day or two with a console
full of this error:

    3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.

This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
8.0 with kernel-smp 2.4.18-27.8.0.

I plan to report this to Red Hat's bugzilla, but I'm hoping for some
ideas or big red flags to jump out at somebody here...  I use this box
for a UML hosting server, so all this downtime is affecting *way* too
many people.

This box has been having other stability problems, so I'm guessing
this might not be directly related to the 3ware card/driver.  It did
survive a memtest86 pass.
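
A minimal sketch, assuming the console output has been captured to text, of how the "Command failed" lines could be tallied per unit. The function names are hypothetical, and the status and flags values are only extracted, not interpreted, since the thread does not say what 0xcb or 0x37 mean.

```python
import re
from collections import Counter

# Pattern for the "Command failed" line quoted in the message above.
# Only the numeric fields are extracted; their meaning is not decoded.
LINE_RE = re.compile(
    r"3w-xxxx: scsi(?P<host>\d+): Command failed: "
    r"status = (?P<status>0x[0-9a-fA-F]+), "
    r"flags = (?P<flags>0x[0-9a-fA-F]+), "
    r"unit #(?P<unit>\d+)"
)


def parse_failures(lines):
    """Return (host, unit, status, flags) tuples, one per matching line."""
    found = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            found.append((int(m.group("host")), int(m.group("unit")),
                          int(m.group("status"), 16), int(m.group("flags"), 16)))
    return found


def count_by_unit(failures):
    """Count failures per (host, unit) pair."""
    return Counter((host, unit) for host, unit, _status, _flags in failures)


sample = [
    "3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.",
    "unrelated kernel message",
    "3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.",
]
failures = parse_failures(sample)
print(failures[0])
print(count_by_unit(failures))
```

Counting per unit may help show whether the errors follow one drive or are spread across the array.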

Steve
-- 
steve@silug.org           | Southern Illinois Linux Users Group
(618)398-7360             | See web site for meeting details.
Steven Pritchard          | http://www.silug.org/


* Re: 3ware driver errors
  2003-03-25  1:01 ` Jeff V. Merkey
@ 2003-03-24 23:44   ` Larry McVoy
  2003-03-25  0:07     ` Mark Hahn
  2003-03-25  1:25     ` Jeff V. Merkey
  0 siblings, 2 replies; 23+ messages in thread
From: Larry McVoy @ 2003-03-24 23:44 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Steven Pritchard, linux-kernel

On Mon, Mar 24, 2003 at 06:01:07PM -0700, Jeff V. Merkey wrote:
> There is a firmware upgrade you need to obtain from WD if you are using their 
> drives with a 3Ware controller.  The WD drives were optimized for desktop use
> and they go into a "powersave" mode of sorts which will cause them to disappear
> and reappear mysteriously with all sorts of strange errors.  WD is aware of 
> this problem and so is 3Ware.

Is this for all WD drives or just some?  I've got some wd400 drives that 
I've been using for a long time behind a 3ware in jbod mode.  I have seen
some errors but they seem to have settled down.  Is there any way to know?
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: 3ware driver errors
  2003-03-24 23:44   ` Larry McVoy
@ 2003-03-25  0:07     ` Mark Hahn
  2003-03-25  1:25     ` Jeff V. Merkey
  1 sibling, 0 replies; 23+ messages in thread
From: Mark Hahn @ 2003-03-25  0:07 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Jeff V. Merkey, Steven Pritchard, linux-kernel

> > and reappear mysteriously with all sorts of strange errors.  WD is aware of 
> > this problem and so is 3Ware.
> 
> Is this for all WD drives or just some?  I've got some wd400 drives that 
> I've been using for a long time behind a 3ware in jbod mode.  I have seen
> some errors but they seem to have settled down.  Is there any way to know?

I haven't seen the problem myself, but:
http://support.wdc.com/download/index.asp#raid3ware



* Re: 3ware driver errors
  2003-03-24 21:28 3ware driver errors Steven Pritchard
@ 2003-03-25  1:01 ` Jeff V. Merkey
  2003-03-24 23:44   ` Larry McVoy
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff V. Merkey @ 2003-03-25  1:01 UTC (permalink / raw)
  To: Steven Pritchard; +Cc: linux-kernel



There is a firmware upgrade you need to obtain from WD if you are using their 
drives with a 3Ware controller.  The WD drives were optimized for desktop use
and they go into a "powersave" mode of sorts which will cause them to disappear
and reappear mysteriously with all sorts of strange errors.  WD is aware of 
this problem and so is 3Ware.

Jeff

On Mon, Mar 24, 2003 at 03:28:13PM -0600, Steven Pritchard wrote:
> (Apparently 3w-xxxx in the Subject gets caught as spam.  Somebody
> might want to adjust that regular expression.  :-)
> 
> I have a server that is locking up every day or two with a console
> full of this error:
> 
>     3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.
> 
> This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
> with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
> 8.0 with kernel-smp 2.4.18-27.8.0.
> 
> I plan to report this to Red Hat's bugzilla, but I'm hoping for some
> ideas or big red flags to jump out at somebody here...  I use this box
> for a UML hosting server, so all this downtime is affecting *way* too
> many people.
> 
> This box has been having other stability problems, so I'm guessing
> this might not be directly related to the 3ware card/driver.  It did
> survive a memtest86 pass.
> 
> Steve
> -- 
> steve@silug.org           | Southern Illinois Linux Users Group
> (618)398-7360             | See web site for meeting details.
> Steven Pritchard          | http://www.silug.org/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: 3ware driver errors
  2003-03-24 23:44   ` Larry McVoy
  2003-03-25  0:07     ` Mark Hahn
@ 2003-03-25  1:25     ` Jeff V. Merkey
  2003-03-25  3:12       ` Steven Pritchard
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
  1 sibling, 2 replies; 23+ messages in thread
From: Jeff V. Merkey @ 2003-03-25  1:25 UTC (permalink / raw)
  To: Larry McVoy, Steven Pritchard, linux-kernel


The person at WD to contact with specifics is listed below.  We have seen it 
on the 180GB drives, but the 200GB are also affected.

Suresh.Chekuri@wdc.com

Jeff

On Mon, Mar 24, 2003 at 03:44:10PM -0800, Larry McVoy wrote:
> On Mon, Mar 24, 2003 at 06:01:07PM -0700, Jeff V. Merkey wrote:
> > There is a firmware upgrade you need to obtain from WD if you are using their 
> > drives with a 3Ware controller.  The WD drives were optimized for desktop use
> > and they go into a "powersave" mode of sorts which will cause them to disappear
> > and reappear mysteriously with all sorts of strange errors.  WD is aware of 
> > this problem and so is 3Ware.
> 
> Is this for all WD drives or just some?  I've got some wd400 drives that 
> I've been using for a long time behind a 3ware in jbod mode.  I have seen
> some errors but they seem to have settled down.  Is there any way to know?
> -- 
> ---
> Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: 3ware driver errors
  2003-03-25  3:12       ` Steven Pritchard
@ 2003-03-25  3:11         ` Kevin P. Fleming
  2003-03-25 15:25         ` Ezra Nugroho
  1 sibling, 0 replies; 23+ messages in thread
From: Kevin P. Fleming @ 2003-03-25  3:11 UTC (permalink / raw)
  To: linux-kernel

Steven Pritchard wrote:
> I don't suppose you've heard if the 160GB drives are affected, have
> you?  The page on support.wdc.com that someone else referred to
> specifically mentions the 200s and the 180s, but I see no mention of
> the 160s.
> 

I'd like to know this too; I've got a pair of WD1600JB drives (less than 
two weeks old) attached to a 3Ware 7000-2. They've been working fine so 
far, but I'm not keen on finding a problem later...



* Re: 3ware driver errors
  2003-03-25  1:25     ` Jeff V. Merkey
@ 2003-03-25  3:12       ` Steven Pritchard
  2003-03-25  3:11         ` Kevin P. Fleming
  2003-03-25 15:25         ` Ezra Nugroho
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
  1 sibling, 2 replies; 23+ messages in thread
From: Steven Pritchard @ 2003-03-25  3:12 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: linux-kernel

On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> The person at WD to contact with specifics is listed below.

Thanks for the pointer.  I have a lot of these WD drives...

> We have seen it on the 180GB drives, but the 200GB are also affected.

I don't suppose you've heard if the 160GB drives are affected, have
you?  The page on support.wdc.com that someone else referred to
specifically mentions the 200s and the 180s, but I see no mention of
the 160s.

Steve
-- 
steve@silug.org           | Southern Illinois Linux Users Group
(618)398-7360             | See web site for meeting details.
Steven Pritchard          | http://www.silug.org/


* Re: 3ware driver errors
  2003-03-25  3:12       ` Steven Pritchard
  2003-03-25  3:11         ` Kevin P. Fleming
@ 2003-03-25 15:25         ` Ezra Nugroho
  2003-03-25 15:26           ` Roy Sigurd Karlsbakk
  1 sibling, 1 reply; 23+ messages in thread
From: Ezra Nugroho @ 2003-03-25 15:25 UTC (permalink / raw)
  To: linux-kernel

I have eight 120GB drives in a RAID 5.
Although the site doesn't say that the 120s are affected, my array has become
degraded because one drive disappeared.
I got the same error message.

I am not sure I want to upgrade the firmware; however, I am not sure
my array is stable either...

On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > The person at WD to contact with specifics is listed below.
> 
> Thanks for the pointer.  I have a lot of these WD drives...
> 
> > We have seen it on the 180GB drives, but the 200GB are also affected.
> 
> I don't suppose you've heard if the 160GB drives are affected, have
> you?  The page on support.wdc.com that someone else referred to
> specifically mentions the 200s and the 180s, but I see no mention of
> the 160s.
> 
> Steve
> -- 
> steve@silug.org           | Southern Illinois Linux Users Group
> (618)398-7360             | See web site for meeting details.
> Steven Pritchard          | http://www.silug.org/

* Re: 3ware driver errors
  2003-03-25 15:25         ` Ezra Nugroho
@ 2003-03-25 15:26           ` Roy Sigurd Karlsbakk
  2003-03-25 16:26             ` Ezra Nugroho
  0 siblings, 1 reply; 23+ messages in thread
From: Roy Sigurd Karlsbakk @ 2003-03-25 15:26 UTC (permalink / raw)
  To: Ezra Nugroho, linux-kernel

I'm running 2x8-port 3ware with 8 IBM 120GB disks on each controller in RAID 
5. This has been running stably for half a year (that is, since I installed 
it).

On Tuesday 25 March 2003 16:25, Ezra Nugroho wrote:
> I have 8 120GB in a raid 5.
> Although the site doesn't say that the 120s are affected, I have gotten
> my raid to be degraded because one drive disappeared.
> I got the same error message.
>
> I am not sure if I want to upgrade the firmware, however, I am not sure
> my array is stable either...
>
> On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> > On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > > The person at WD to contact with specifics is listed below.
> >
> > Thanks for the pointer.  I have a lot of these WD drives...
> >
> > > We have seen it on the 180GB drives, but the 200GB are also affected.
> >
> > I don't suppose you've heard if the 160GB drives are affected, have
> > you?  The page on support.wdc.com that someone else referred to
> > specifically mentions the 200s and the 180s, but I see no mention of
> > the 160s.
> >
> > Steve
> > --
> > steve@silug.org           | Southern Illinois Linux Users Group
> > (618)398-7360             | See web site for meeting details.
> > Steven Pritchard          | http://www.silug.org/

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: 3ware driver errors
  2003-03-25 15:26           ` Roy Sigurd Karlsbakk
@ 2003-03-25 16:26             ` Ezra Nugroho
  0 siblings, 0 replies; 23+ messages in thread
From: Ezra Nugroho @ 2003-03-25 16:26 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: linux-kernel

Yeah, but those are IBM drives. We were talking about the WDC -JB/BB.

I would be interested to hear from other WD users with sub-180GB drives who
have the same problem.
Anyone?



On Tue, 2003-03-25 at 10:26, Roy Sigurd Karlsbakk wrote:
> I'm running 2x8-port 3ware with 8 IBM 120gig disks on each controller in raid 
> 5. this has been running stably for half a year (that is - since I installed 
> it).
> 
> On Tuesday 25 March 2003 16:25, Ezra Nugroho wrote:
> > I have 8 120GB in a raid 5.
> > Although the site doesn't say that the 120s are affected, I have gotten
> > my raid to be degraded because one drive disappeared.
> > I got the same error message.
> >
> > I am not sure if I want to upgrade the firmware, however, I am not sure
> > my array is stable either...
> >
> > On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> > > On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > > > The person at WD to contact with specifics is listed below.
> > >
> > > Thanks for the pointer.  I have a lot of these WD drives...
> > >
> > > > We have seen it on the 180GB drives, but the 200GB are also affected.
> > >
> > > I don't suppose you've heard if the 160GB drives are affected, have
> > > you?  The page on support.wdc.com that someone else referred to
> > > specifically mentions the 200s and the 180s, but I see no mention of
> > > the 160s.
> > >
> > > Steve
> > > --
> > > steve@silug.org           | Southern Illinois Linux Users Group
> > > (618)398-7360             | See web site for meeting details.
> > > Steven Pritchard          | http://www.silug.org/
> 
> -- 
> Roy Sigurd Karlsbakk, Datavaktmester
> ProntoTV AS - http://www.pronto.tv/
> Tel: +47 9801 3356
> 
> Computers are like air conditioners.
> They stop working when you open Windows.

* ECC error in 2.5.64 + some patches
  2003-03-25  1:25     ` Jeff V. Merkey
  2003-03-25  3:12       ` Steven Pritchard
@ 2003-03-27 16:02       ` Larry McVoy
  2003-03-27 16:17         ` Tim Schmielau
                           ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Larry McVoy @ 2003-03-27 16:02 UTC (permalink / raw)
  To: linux-kernel

I'm getting these on the machine we use to do the BK->CVS conversions.
My guess is that this means there was a memory error and ECC fixed it.
The only problem is that I'm reasonably sure that there isn't ECC on
these DIMMs.  Does anyone have the table of error codes to explanations?
Google didn't find anything for this one.

Thanks.

Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
slovax kernel: Bank 1: 9000000000000151

-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
@ 2003-03-27 16:17         ` Tim Schmielau
  2003-03-27 16:27           ` Larry McVoy
  2003-03-27 16:22         ` Randy.Dunlap
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Tim Schmielau @ 2003-03-27 16:17 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

On Thu, 27 Mar 2003, Larry McVoy wrote:

> I'm getting these on the machine we use to do the BK->CVS conversions.
> My guess is that this means there was a memory error and ECC fixed it.
> The only problem is that I'm reasonably sure that there isn't ECC on
> these DIMMs.  Does anyone have the table of error codes to explanations?
> Google didn't find anything for this one.

No, I don't have a table of error codes either, but it's probably the
on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.

Tim



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
  2003-03-27 16:17         ` Tim Schmielau
@ 2003-03-27 16:22         ` Randy.Dunlap
  2003-03-27 16:31           ` Larry McVoy
  2003-03-27 16:31         ` Dave Jones
  2003-03-27 17:00         ` Chris Wedgwood
  3 siblings, 1 reply; 23+ messages in thread
From: Randy.Dunlap @ 2003-03-27 16:22 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

On Thu, 27 Mar 2003 08:02:20 -0800 Larry McVoy <lm@bitmover.com> wrote:

| I'm getting these on the machine we use to do the BK->CVS conversions.
| My guess is that this means there was a memory error and ECC fixed it.
| The only problem is that I'm reasonably sure that there isn't ECC on
| these DIMMs.  Does anyone have the table of error codes to explanations?
| Google didn't find anything for this one.
| 
| Thanks.
| 
| Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
| slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
| 
| Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
| slovax kernel: Bank 1: 9000000000000151

You can try the Dave Jones "parsemce" tool on it, from
  http://www.codemonkey.org.uk/cruft/parsemce.c/

--
~Randy


* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:17         ` Tim Schmielau
@ 2003-03-27 16:27           ` Larry McVoy
  2003-03-27 16:39             ` Tim Schmielau
  0 siblings, 1 reply; 23+ messages in thread
From: Larry McVoy @ 2003-03-27 16:27 UTC (permalink / raw)
  To: Tim Schmielau; +Cc: Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 05:17:25PM +0100, Tim Schmielau wrote:
> On Thu, 27 Mar 2003, Larry McVoy wrote:
> 
> > I'm getting these on the machine we use to do the BK->CVS conversions.
> > My guess is that this means there was a memory error and ECC fixed it.
> > The only problem is that I'm reasonably sure that there isn't ECC on
> > these DIMMs.  Does anyone have the table of error codes to explanations?
> > Google didn't find anything for this one.
> 
> No, I don't have a table of error codes either, but it's probably the
> on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.

This is a 2.16GHz Athlon, not a Pentium, if that makes a difference.

slovax ~ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2700+
stepping        : 1
cpu MHz         : 2162.466
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4276.22
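
The flags line above includes both mce and mca, the CPUID feature flags for machine-check exception and machine-check architecture support. A minimal sketch, assuming the /proc/cpuinfo text is available as a string, of checking for them (cpu_flags is a hypothetical helper name):

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Abbreviated copy of the output quoted above.
sample = """\
processor       : 0
vendor_id       : AuthenticAMD
model name      : AMD Athlon(tm) XP 2700+
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
"""
flags = cpu_flags(sample)
print("mce" in flags, "mca" in flags)
```

On a live system the same function could be fed the contents of /proc/cpuinfo read with open("/proc/cpuinfo").read().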

-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
  2003-03-27 16:17         ` Tim Schmielau
  2003-03-27 16:22         ` Randy.Dunlap
@ 2003-03-27 16:31         ` Dave Jones
  2003-03-27 17:00         ` Chris Wedgwood
  3 siblings, 0 replies; 23+ messages in thread
From: Dave Jones @ 2003-03-27 16:31 UTC (permalink / raw)
  To: Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 08:02:20AM -0800, Larry McVoy wrote:

 > Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
 > slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
 > 
 > Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
 > slovax kernel: Bank 1: 9000000000000151

An MCE (Machine Check Exception) could be triggered by any number of
things from bad cooling, underrated power supply, to flaky RAM.
Give things a going over with memtest86 for the latter.
The former just means you pull everything apart and double check
it looks ok.

		Dave



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:22         ` Randy.Dunlap
@ 2003-03-27 16:31           ` Larry McVoy
  2003-03-28  0:10             ` Dave Jones
  0 siblings, 1 reply; 23+ messages in thread
From: Larry McVoy @ 2003-03-27 16:31 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Larry McVoy, linux-kernel

> | Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> | slovax kernel: Bank 1: 9000000000000151
> 
> You can try the Dave Jones "parsemce" tool on it, from
>   http://www.codemonkey.org.uk/cruft/parsemce.c/

slovax /tmp a.out -b 1 -e 9000000000000151
Status: (-8070450532247928495) Restart IP valid.

What does that mean?
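
The negative number in the Status line is consistent with the same 64-bit value being printed as a signed decimal. A small sketch of the arithmetic, assuming two's-complement interpretation (the helper names are hypothetical):

```python
MASK64 = (1 << 64) - 1


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def as_unsigned64(value):
    """Reinterpret a signed 64-bit integer as unsigned."""
    return value & MASK64


status = int("9000000000000151", 16)
print(as_signed64(status))                         # -8070450532247928495
print(hex(as_unsigned64(-8070450532247928495)))    # 0x9000000000000151
```

Because bit 63 of the status word is set, any signed 64-bit rendering of it comes out negative.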
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:27           ` Larry McVoy
@ 2003-03-27 16:39             ` Tim Schmielau
  0 siblings, 0 replies; 23+ messages in thread
From: Tim Schmielau @ 2003-03-27 16:39 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

On Thu, 27 Mar 2003, Larry McVoy wrote:

> On Thu, Mar 27, 2003 at 05:17:25PM +0100, Tim Schmielau wrote:
> > On Thu, 27 Mar 2003, Larry McVoy wrote:
> >
> > > I'm getting these on the machine we use to do the BK->CVS conversions.
> > > My guess is that this means there was a memory error and ECC fixed it.
> > > The only problem is that I'm reasonably sure that there isn't ECC on
> > > these DIMMs.  Does anyone have the table of error codes to explanations?
> > > Google didn't find anything for this one.
> >
> > No, I don't have a table of error codes either, but it's probably the
> > on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.
>
> This is a 2.16GHz Athlon, not a Pentium, if that makes a difference.

The on-die second-level cache of all Athlons also has ECC. But I can't
find a document of the error codes on AMD's website either.



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
                           ` (2 preceding siblings ...)
  2003-03-27 16:31         ` Dave Jones
@ 2003-03-27 17:00         ` Chris Wedgwood
  2003-03-27 17:19           ` Dominik Kubla
  3 siblings, 1 reply; 23+ messages in thread
From: Chris Wedgwood @ 2003-03-27 17:00 UTC (permalink / raw)
  To: Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 08:02:20AM -0800, Larry McVoy wrote:

> My guess is that this means there was a memory error and ECC fixed
> it.

Nope.

There is an ecc driver for RAM and you'll be able to detect these
using that.  RAM ECC errors in my experience don't cause MCEs, usually
the CPU never notices.

> The only problem is that I'm reasonably sure that there isn't ECC on
> these DIMMs.

Dump the SPD and you can check...  usually the BIOS will tell you too.

> Does anyone have the table of error codes to explanations?  Google
> didn't find anything for this one.

as someone else pointed out, parsemce is what you want

> Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> slovax kernel: Bank 1: 9000000000000151

Status: (9000000000000151) Restart IP valid.

*Exactly* what this means I don't know --- but I'm guessing the CPU is
overheating.  Check fans, air-flow, etc. and see if that helps.  So
far whenever I've seen the above problem it's *ALWAYS* been related to
the CPU getting too hot.


  --cw


* Re: ECC error in 2.5.64 + some patches
  2003-03-27 17:00         ` Chris Wedgwood
@ 2003-03-27 17:19           ` Dominik Kubla
  2003-03-27 17:25             ` Chris Wedgwood
  2003-04-15 13:24             ` Larry McVoy
  0 siblings, 2 replies; 23+ messages in thread
From: Dominik Kubla @ 2003-03-27 17:19 UTC (permalink / raw)
  To: Chris Wedgwood, Larry McVoy, linux-kernel

On Thursday, 27 March 2003 18:00, Chris Wedgwood wrote:

> > Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> > slovax kernel: Bank 1: 9000000000000151
>
> Status: (9000000000000151) Restart IP valid.
>
> *Exactly* what this means I don't know --- but I'm guessing the CPU is
> overheating.  Check fans, air-flow, etc. and see if that helps.  So
> far whenever I've seen the above problem it's *ALWAYS* been related to
> the CPU getting too hot.

Well the internal busses and buffers of modern CPUs and in many cases also 
the on-die caches have ECC logic.  And if i should hazard a guess: "Restart 
IP valid" => Restarted Instruction Pre-Fetch resulted in a valid state of the 
pre-fetch queue.

In Larry's case i'd remove the cpu cooler, clean everything and reassemble, 
since i would assume that there is a hot-spot on the die.

Regards,
  Dominik
-- 
Be at war with your voices, at peace with your neighbors, and let every new
year find you a better man. (Benjamin Franklin, 1706-1790)



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 17:19           ` Dominik Kubla
@ 2003-03-27 17:25             ` Chris Wedgwood
  2003-04-15  4:34               ` Bill Davidsen
  2003-04-15 13:24             ` Larry McVoy
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Wedgwood @ 2003-03-27 17:25 UTC (permalink / raw)
  To: Dominik Kubla; +Cc: Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 06:19:41PM +0100, Dominik Kubla wrote:

> Well the internal busses and buffers of modern CPUs and in many
> cases also the on-die caches have ECC logic.

his email said "DIMMs"

> And if i should hazard a guess: "Restart IP valid" => Restarted
> Instruction Pre-Fetch resulted in a valid state of the pre-fetch
> queue.

could be ... i've not checked the AMD docs

> In Larry's case i'd remove the cpu cooler, clean everything and
> reassemble, since i would assume that there is a hot-spot on the
> die.

or simply remove the side of the case or increase air-conditioning and
see if that goes away or becomes less apparent, IME if you get these
sporadically rather than often it's 'just' overheating...


  --cw



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 16:31           ` Larry McVoy
@ 2003-03-28  0:10             ` Dave Jones
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Jones @ 2003-03-28  0:10 UTC (permalink / raw)
  To: Larry McVoy, Randy.Dunlap, Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 08:31:20AM -0800, Larry McVoy wrote:
 > > | Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
 > > | slovax kernel: Bank 1: 9000000000000151
 > > You can try the Dave Jones "parsemce" tool on it, from
 > >   http://www.codemonkey.org.uk/cruft/parsemce.c/
 > 
 > slovax /tmp a.out -b 1 -e 9000000000000151
 > Status: (-8070450532247928495) Restart IP valid.
 > 
 > What does that mean?

It means Dave sucks and hasn't done a good enough job on the parser.
parsemce is really really unintuitive to use.

There's some bits missing from your dump. Usually, MCEs look like..

 Sep  4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception: 0000000000000004
 Sep  4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152
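
A rough sketch of turning log lines like the two above into a parsemce-style argument list. It uses only the flags that appear in this thread (-b, -e, -s, -a); treating the value after "at" as the -a address is an assumption, and the placeholder values for missing fields are hypothetical defaults chosen to match the faked command in this message.

```python
import re

MCE_RE = re.compile(r"CPU (\d+): Machine Check Exception: ([0-9a-fA-F]+)")
BANK_RE = re.compile(r"Bank (\d+): ([0-9a-fA-F]+)(?: at ([0-9a-fA-F]+))?")


def parsemce_args(log_lines):
    """Build an argument list from kernel log lines.

    Missing fields fall back to placeholders: "1" for -e and "0" for -a.
    """
    exception, bank, status, address = "1", None, None, "0"
    for line in log_lines:
        m = MCE_RE.search(line)
        if m:
            exception = m.group(2)
            continue
        m = BANK_RE.search(line)
        if m:
            bank, status = m.group(1), m.group(2)
            if m.group(3):
                address = m.group(3)
    if bank is None:
        raise ValueError("no 'Bank N: ...' line found")
    return ["-b", bank, "-e", exception, "-s", status, "-a", address]


full = [
    "Sep  4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception: 0000000000000004",
    "Sep  4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152",
]
partial = ["slovax kernel: Bank 1: 9000000000000151"]
print(" ".join(parsemce_args(full)))
print(" ".join(parsemce_args(partial)))
```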

All we have to go on in your example is the bank status code.
(which is -s, not -e. -e would be the 0000000000000004 in the example above. [*])

So, without the missing bits, we have to fake it..

(davej@deviant:davej)$ ./a.out -b 1 -e 1 -s 9000000000000151 -a 0
Status: (1) Restart IP valid.
parsebank(1): 9000000000000151 @ 0
	External tag parity error
	Error enabled in control register
	Memory heirarchy error
	Request: Generic error
	Transaction type : Instruction
	Memory/IO : Reserved

Ignore the Status: line, that's decoded from the (faked) -e 1.

Any the wiser ? 8-)  [*]

		Dave

[*] See, unintuitive, evil and nasty.
    Given the time, I'd start over from scratch.
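
For reference, a minimal sketch of splitting a bank status word into its fields, assuming the generic x86 machine-check status layout (flag bits at the top, a model-specific code in bits 31:16, and an MCA error code in bits 15:0). This is not parsemce's logic, and its labels for the low bits may differ from the tool's output; the function and key names are hypothetical.

```python
# Flag bits in the upper part of an x86 machine-check bank status register.
FLAG_BITS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit bank status value into flags and error-code fields."""
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": code,
    }
    # Codes of the form 0000 0001 RRRR TTLL are memory-hierarchy errors.
    if (code >> 8) == 0x01:
        result["memory_hierarchy"] = {
            "rrrr": (code >> 4) & 0xF,  # request type
            "tt": (code >> 2) & 0x3,    # transaction type (0 = instruction)
            "ll": code & 0x3,           # cache level
        }
    return result


print(decode_status(0x9000000000000151))
print(decode_status(0xF600200000000152)["flags"])
```

For 9000000000000151 only VAL and EN are set and UC is clear, which matches the kernel's description of a non-fatal, correctable event.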



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 17:25             ` Chris Wedgwood
@ 2003-04-15  4:34               ` Bill Davidsen
  0 siblings, 0 replies; 23+ messages in thread
From: Bill Davidsen @ 2003-04-15  4:34 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Dominik Kubla, Larry McVoy, linux-kernel

On Thu, 27 Mar 2003, Chris Wedgwood wrote:


> > In Larry's case i'd remove the cpu cooler, clean everything and
> > reassemble, since i would assume that there is a hot-spot on the
> > die.
> 
> or simply remove the side of the case or increase air-conditioning and
> see if that goes away or becomes less apparent, IME if you get these
> sporadically rather than often it's 'just' overheating...

Generally in any modern case the air flow works with the case closed, and
opening the case will not improve things. If you are set up to use the
fans to pull air out and have cold air come in by pressure differential it
*really* won't help. Of course a bad case might work better that way, but
there aren't a lot of them out there any more, sort of went out with the
AT form factor.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: ECC error in 2.5.64 + some patches
  2003-03-27 17:19           ` Dominik Kubla
  2003-03-27 17:25             ` Chris Wedgwood
@ 2003-04-15 13:24             ` Larry McVoy
  1 sibling, 0 replies; 23+ messages in thread
From: Larry McVoy @ 2003-04-15 13:24 UTC (permalink / raw)
  To: Dominik Kubla; +Cc: Chris Wedgwood, Larry McVoy, linux-kernel

On Thu, Mar 27, 2003 at 06:19:41PM +0100, Dominik Kubla wrote:
> On Thursday, 27 March 2003 18:00, Chris Wedgwood wrote:
> 
> > > Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> > > slovax kernel: Bank 1: 9000000000000151
> >
> > Status: (9000000000000151) Restart IP valid.
> >
> > *Exactly* what this means I don't know --- but I'm guessing the CPU is
> > overheating.  Check fans, air-flow, etc. and see if that helps.  So
> > far whenever I've seen the above problem it's *ALWAYS* been related to
> > the CPU getting too hot.

FYI - it was a too small case with the power supply sitting more or less
on top of the CPU.  Moving everything to a bigger case fixed it.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


end of thread

Thread overview: 23+ messages
2003-03-24 21:28 3ware driver errors Steven Pritchard
2003-03-25  1:01 ` Jeff V. Merkey
2003-03-24 23:44   ` Larry McVoy
2003-03-25  0:07     ` Mark Hahn
2003-03-25  1:25     ` Jeff V. Merkey
2003-03-25  3:12       ` Steven Pritchard
2003-03-25  3:11         ` Kevin P. Fleming
2003-03-25 15:25         ` Ezra Nugroho
2003-03-25 15:26           ` Roy Sigurd Karlsbakk
2003-03-25 16:26             ` Ezra Nugroho
2003-03-27 16:02       ` ECC error in 2.5.64 + some patches Larry McVoy
2003-03-27 16:17         ` Tim Schmielau
2003-03-27 16:27           ` Larry McVoy
2003-03-27 16:39             ` Tim Schmielau
2003-03-27 16:22         ` Randy.Dunlap
2003-03-27 16:31           ` Larry McVoy
2003-03-28  0:10             ` Dave Jones
2003-03-27 16:31         ` Dave Jones
2003-03-27 17:00         ` Chris Wedgwood
2003-03-27 17:19           ` Dominik Kubla
2003-03-27 17:25             ` Chris Wedgwood
2003-04-15  4:34               ` Bill Davidsen
2003-04-15 13:24             ` Larry McVoy
