* libata badness
@ 2004-10-04 12:12 William Knop
  2004-10-04 13:59 ` Jon Lewis
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 12:12 UTC (permalink / raw)
  To: linux-kernel, linux-raid, linux-ide

Hi all,

I'm running a raid5 array atop a few sata drives via a promise tx4 
controller. The kernel is the official fedora lk 2.6.8-1, although I had 
run a few different kernels (never entirely successfully) with this array 
in the past.

In fact, this past weekend, I was getting oopses and panics (on lk 
2.6.8.1, 2.6.9-rc3, 2.6.9-rc3-mm1, and 2.6.9-rc3 w/ Jeff Garzik's recent 
libata patches) all of which happened when rebuilding a spare drive in the 
array. Unfortunately, somehow my root filesystem (ext3) got blown away-- 
it was on a reliable scsi drive (no bad blocks; I checked afterwards) 
attached to an adaptec aic7xxx host. The ram was good; I ran memtest86 on 
it. I'm 
assuming this was caused by some major kernel corruption, originating from 
libata.

I have since rebuilt my computer using an AMD Sempron (basically a Duron) 
rather than a P4. Other than that (cpu + m/b), it's the same hardware.

The errors I got over the weekend are similar to the one I just captured 
on my fresh fc2/lk2.6.8-1 install (at the same point; the spare disk had 
begun rebuilding). It's attached below.

Anyway, I haven't been able to find any other reports of this, so I'm at a 
loss about what to do. I hesitate to bring my array up at all now, for 
fear of blowing it away. Any assistance would be greatly appreciated.

Thanks much,
Will


---------- SNIP ----------
Unable to handle kernel paging request at virtual address 01000004
  printing eip:
229e4d8c
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: raid5 xor sata_promise md5 ipv6 parport_pc lp parport 
autofs4 sunrpc sk98lin sg joydev dm_mod uhci_hcd ehci_hcd button battery 
asus_acpi ac ext3 jbd sata_via libata aic7xxx sd_mod scsi_mod
CPU:    0
EIP:    0060:[<229e4d8c>]    Not tainted
EFLAGS: 00010206   (2.6.8-1.521)
EIP is at handle_stripe+0x29a/0x1407 [raid5]
eax: 00000001   ebx: 00000000   ecx: 00915cb8   edx: 21f7e1c0
esi: 1ccbd118   edi: 21f7e1c0   ebp: 01000000   esp: 1d300f28
ds: 007b   es: 007b   ss: 0068
Process md0_raid5 (pid: 2626, threadinfo=1d300000 task=1d317970)
Stack: 2283eb57 20db8000 21f7e1c0 21c30288 1ccbd204 20db8000 00000001 1ccbd158
       00000002 00000000 00000000 00000001 00000000 00000000 00000001 00000000
       00000001 00000001 00000000 00000003 1ccbd0ac 21f7e1c0 1ccbd0ac 21f76c00
Call Trace:
  [<2283eb57>] ata_scsi_queuecmd+0xbe/0xc7 [libata]
  [<229e6b1c>] raid5d+0x1ce/0x2f8 [raid5]
  [<0228f5d2>] md_thread+0x227/0x256
  [<0211be05>] autoremove_wake_function+0x0/0x2d
  [<0211be05>] autoremove_wake_function+0x0/0x2d
  [<0228f3ab>] md_thread+0x0/0x256
  [<021041d9>] kernel_thread_helper+0x5/0xb
Code: 8b 55 04 83 c1 08 8b 45 00 83 d3 00 39 da 72 0e 0f 87 e0 01
---------- SNIP ----------



* Re: libata badness
  2004-10-04 12:12 libata badness William Knop
@ 2004-10-04 13:59 ` Jon Lewis
  2004-10-04 15:50   ` William Knop
  2004-10-04 16:30 ` Jeff Garzik
  2004-10-04 22:47 ` Neil Brown
  2 siblings, 1 reply; 18+ messages in thread
From: Jon Lewis @ 2004-10-04 13:59 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

On Mon, 4 Oct 2004, William Knop wrote:

> Hi all,
>
> I'm running a raid5 array atop a few sata drives via a promise tx4
> controller. The kernel is the official fedora lk 2.6.8-1, although I had
> run a few different kernels (never entirely successfully) with this array
> in the past.

What kind of sata drives?  It's not quite the same end result, but there
have been several posts on linux-raid about defective Maxtor sata drives
causing system freezes.  If your drives are Maxtor, download their
powermax utility and test your drives.  You may find that you have one or
more marginal drives that appear to work most of the time but that
powermax will determine are bad.  Replacing one like that fixed my problems.

----------------------------------------------------------------------
 Jon Lewis                   |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________


* Re: libata badness
  2004-10-04 13:59 ` Jon Lewis
@ 2004-10-04 15:50   ` William Knop
  2004-10-04 16:06     ` Mark Lord
  2004-10-04 16:09     ` Jon Lewis
  0 siblings, 2 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 15:50 UTC (permalink / raw)
  To: Jon Lewis; +Cc: linux-kernel, linux-raid, linux-ide


> What kind of sata drives?  It's not quite the same end result, but there
> have been several posts on linux-raid about defective Maxtor sata drives
> causing system freezes.  If your drives are Maxtor, download their
> powermax utility and test your drives.  You may find that you have one or
> more marginal drives that appear to work most of the time but that
> powermax will determine are bad.  Replacing one like that fixed my problems.

Ah, well all of them are Maxtor drives... One 6y250m0 and three 7y250m0 
drives. I'm using powermax on them right now. They all passed the quick 
test, and the full test results are forthcoming.

Actually, I was backing up the array (cp from the array - 2 of 3 drives 
running - to a normal drive) when I read your response. Shortly 
thereafter, during the cp (perhaps after copying 100GB-120GB), I got a 
double fault. I've never gotten a double fault before, but I'm guessing 
it's quite a serious error. It totally locked up the machine, and it 
outputted two lines each with a double fault message, followed by a 
register dump.

The saga continues...

Will


* Re: libata badness
  2004-10-04 15:50   ` William Knop
@ 2004-10-04 16:06     ` Mark Lord
  2004-10-04 16:24       ` William Knop
  2004-10-04 16:30       ` Jeff Garzik
  2004-10-04 16:09     ` Jon Lewis
  1 sibling, 2 replies; 18+ messages in thread
From: Mark Lord @ 2004-10-04 16:06 UTC (permalink / raw)
  To: William Knop; +Cc: Jon Lewis, linux-kernel, linux-raid, linux-ide

I have used Maxtor "SATA" drives that require
the O/S to do a "SET FEATURES :: UDMA_MODE" command
on them before they will operate reliably.
This despite the SATA spec stating clearly that
such a command should/will have no effect.

I suppose libata does this already, but just in case not..
Something simple to check up on.
-- 
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")

William Knop wrote:
>
> Ah, well all of them are Maxtor drives... One 6y250m0 and three 7y250m0 
> drives. I'm using powermax on them right now. They all passed the quick 
> test, and the full test results are forthcoming.


* Re: libata badness
  2004-10-04 15:50   ` William Knop
  2004-10-04 16:06     ` Mark Lord
@ 2004-10-04 16:09     ` Jon Lewis
  2004-10-04 16:34       ` William Knop
  1 sibling, 1 reply; 18+ messages in thread
From: Jon Lewis @ 2004-10-04 16:09 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

On Mon, 4 Oct 2004, William Knop wrote:

> Ah, well all of them are Maxtor drives... One 6y250m0 and three 7y250m0
> drives. I'm using powermax on them right now. They all passed the quick
> test, and the full test results are forthcoming.

I'm pretty sure all the bad ones we had (at least the one I found at my
location) "failed" the quick test in that after the quick test it asked me
to run the full test, after which it spit out the magic fault code to give
Maxtor's RMA form.  Another possibility that comes to mind is that your
power supply could be inadequate to run the system and all the drives.

----------------------------------------------------------------------
 Jon Lewis                   |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________


* Re: libata badness
  2004-10-04 16:06     ` Mark Lord
@ 2004-10-04 16:24       ` William Knop
  2004-10-04 16:30       ` Jeff Garzik
  1 sibling, 0 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 16:24 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jon Lewis, linux-kernel, linux-raid, linux-ide

Great. I'll give that a shot after my drive checker utility finishes.

However, it seems like the kernel shouldn't be oopsing, panicking, or 
double faulting if the drive is questionable. It apparently blew away my 
root fs last time. A peripheral drive failure shouldn't cause such 
destruction across the system, no?

On Mon, 4 Oct 2004, Mark Lord wrote:

> I have used Maxtor "SATA" drives that require
> the O/S to do a "SET FEATURES :: UDMA_MODE" command
> on them before they will operate reliably.
> This despite the SATA spec stating clearly that
> such a command should/will have no effect.
>
> I suppose libata does this already, but just in case not..
> Something simple to check up on.
> -- 
> Mark Lord
> (hdparm keeper & the original "Linux IDE Guy")
>
> William Knop wrote:
>> 
>> Ah, well all of them are Maxtor drives... One 6y250m0 and three 7y250m0 
>> drives. I'm using powermax on them right now. They all passed the quick 
>> test, and the full test results are forthcoming.
>
>


* Re: libata badness
  2004-10-04 12:12 libata badness William Knop
  2004-10-04 13:59 ` Jon Lewis
@ 2004-10-04 16:30 ` Jeff Garzik
  2004-10-04 16:55   ` William Knop
  2004-10-04 17:42   ` William Knop
  2004-10-04 22:47 ` Neil Brown
  2 siblings, 2 replies; 18+ messages in thread
From: Jeff Garzik @ 2004-10-04 16:30 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

William Knop wrote:
> Hi all,
> 
> I'm running a raid5 array atop a few sata drives via a promise tx4 
> controller. The kernel is the official fedora lk 2.6.8-1, although I had 
> run a few different kernels (never entirely successfully) with this 
> array in the past.
> 
> In fact, this past weekend, I was getting oopses and panics (on lk 
> 2.6.8.1, 2.6.9-rc3, 2.6.9-rc3-mm1, and 2.6.9-rc3 w/ Jeff Garzik's recent 
> libata patches) all of which happened when rebuilding a spare drive in 
> the array. Unfortunately, somehow my root filesystem (ext3) got blown 
> away-- it was on a reliable scsi drive (no bad blocks; I checked 
> afterwards) attached to an adaptec aic7xxx host. The ram was good; I ran 
> memtest86 on it. I'm assuming this was caused by some major kernel 
> corruption, originating from libata.
> 
> I have since rebuilt my computer using an AMD Sempron (basically a 
> Duron) rather than a P4. Other than that (cpu + m/b), it's the same 
> hardware.
> 
> The errors I got over the weekend are similar to the one I just captured 
> on my fresh fc2/lk2.6.8-1 install (at the same point; the spare disk had 
> begun rebuilding). It's attached below.
> 
> Anyway, I haven't been able to find any other reports of this, so I'm at 
> a loss about what to do. I hesitate to bring my array up at all now, for 
> fear of blowing it away. Any assistance would be greatly appreciated.
> 
> Thanks much,
> Will
> 
> 
> ---------- SNIP ----------
> Unable to handle kernel paging request at virtual address 01000004
>  printing eip:
> 229e4d8c
> *pde = 00000000
> Oops: 0000 [#1]
> Modules linked in: raid5 xor sata_promise md5 ipv6 parport_pc lp parport 
> autofs4 sunrpc sk98lin sg joydev dm_mod uhci_hcd ehci_hcd button battery 
> asus_acpi ac ext3 jbd sata_via libata aic7xxx sd_mod scsi_mod
> CPU:    0
> EIP:    0060:[<229e4d8c>]    Not tainted
> EFLAGS: 00010206   (2.6.8-1.521)
> EIP is at handle_stripe+0x29a/0x1407 [raid5]
> eax: 00000001   ebx: 00000000   ecx: 00915cb8   edx: 21f7e1c0
> esi: 1ccbd118   edi: 21f7e1c0   ebp: 01000000   esp: 1d300f28
> ds: 007b   es: 007b   ss: 0068
> Process md0_raid5 (pid: 2626, threadinfo=1d300000 task=1d317970)
> Stack: 2283eb57 20db8000 21f7e1c0 21c30288 1ccbd204 20db8000 00000001 1ccbd158
>        00000002 00000000 00000000 00000001 00000000 00000000 00000001 00000000
>        00000001 00000001 00000000 00000003 1ccbd0ac 21f7e1c0 1ccbd0ac 21f76c00
> Call Trace:
>  [<2283eb57>] ata_scsi_queuecmd+0xbe/0xc7 [libata]
>  [<229e6b1c>] raid5d+0x1ce/0x2f8 [raid5]
>  [<0228f5d2>] md_thread+0x227/0x256
>  [<0211be05>] autoremove_wake_function+0x0/0x2d
>  [<0211be05>] autoremove_wake_function+0x0/0x2d
>  [<0228f3ab>] md_thread+0x0/0x256
>  [<021041d9>] kernel_thread_helper+0x5/0xb
> Code: 8b 55 04 83 c1 08 8b 45 00 83 d3 00 39 da 72 0e 0f 87 e0 01

It either smells like a hardware problem or a raid problem.  The oops 
you list here is in raid5 not libata.

	Jeff





* Re: libata badness
  2004-10-04 16:06     ` Mark Lord
  2004-10-04 16:24       ` William Knop
@ 2004-10-04 16:30       ` Jeff Garzik
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Garzik @ 2004-10-04 16:30 UTC (permalink / raw)
  To: Mark Lord; +Cc: William Knop, Jon Lewis, linux-kernel, linux-raid, linux-ide

Mark Lord wrote:
> I have used Maxtor "SATA" drives that require
> the O/S to do a "SET FEATURES :: UDMA_MODE" command
> on them before they will operate reliably.
> This despite the SATA spec stating clearly that
> such a command should/will have no effect.
> 
> I suppose libata does this already, but just in case not..
> Something simple to check up on.

libata always issues SET FEATURES - XFER MODE.
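
For reference, here is a minimal sketch of that command at the taskfile 
level. The field values come from the ATA spec (SET FEATURES is command 
0xEF, subcommand 0x03 selects the transfer mode held in the sector-count 
register, and UDMA mode n is encoded as 0x40 | n); the struct name and 
layout are purely illustrative, not libata's actual types:

/* Illustrative taskfile for SET FEATURES - XFER MODE.
 * These are not libata's real data structures. */
struct xfermode_taskfile {
	unsigned char command;	/* 0xEF: SET FEATURES */
	unsigned char feature;	/* 0x03: set transfer mode */
	unsigned char nsect;	/* mode value, e.g. 0x45 = UDMA/100 (0x40 | 5) */
};

static const struct xfermode_taskfile set_udma5 = { 0xEF, 0x03, 0x45 };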

	Jeff




* Re: libata badness
  2004-10-04 16:09     ` Jon Lewis
@ 2004-10-04 16:34       ` William Knop
  0 siblings, 0 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 16:34 UTC (permalink / raw)
  To: Jon Lewis; +Cc: linux-kernel, linux-raid, linux-ide


> Maxtor's RMA form.  Another possibility that comes to mind is that your
> power supply could be inadequate to run the system and all the drives.

The power supply on this box is something upwards of 450W. Right now I 
only have four 7200rpm sata drives along with my Athlon mb/cpu and a 
couple of fans. I had six 7200rpm drives connected w/o raid and a P4 
before and I had no problems. I'm pretty sure this isn't due to an 
insufficient power supply.
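
For a rough peak-load sanity check, with per-device figures that are 
typical estimates for hardware of this era rather than measurements:

    4 x 7200rpm drives at ~30W spin-up surge    ~120W
    cpu + motherboard + fans                    ~100-150W
    ----------------------------------------------------
    estimated worst case                        ~220-270W

That leaves a 450W supply plenty of headroom, assuming the 12V rail 
itself is healthy.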



* Re: libata badness
  2004-10-04 16:30 ` Jeff Garzik
@ 2004-10-04 16:55   ` William Knop
  2004-10-04 17:42   ` William Knop
  1 sibling, 0 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 16:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, linux-raid, linux-ide


> It either smells like a hardware problem or a raid problem.  The oops you 
> list here is in raid5 not libata.

I'm inclined to agree. I should have titled the thread "libata/md badness" 
since it appears to be a raid atop sata issue. Raid5 apparently works over 
scsi, though.

This is really beyond my realm of knowledge, though. After I check my 
drives for errors, I'm going to back up my array and then experiment a 
bit. I'll post any oopses I find in hopes that someone will be able to 
properly interpret them.

Will


* Re: libata badness
  2004-10-04 16:30 ` Jeff Garzik
  2004-10-04 16:55   ` William Knop
@ 2004-10-04 17:42   ` William Knop
  2004-10-04 17:50     ` Jim Paris
  2004-10-04 18:01     ` Jeff Garzik
  1 sibling, 2 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 17:42 UTC (permalink / raw)
  To: linux-kernel, linux-raid, linux-ide


I just got another oops while trying to cp from my md/raid5 array (2 of 3 
sata drives) to another sata drive on the same controller. This time, 
though, it said there's a bug in timer.c, line 405, and that the 
stack's garbage. I'm thinking it has nothing to do with timer.c, and 
something in md or libata is chomping all over the kernel.

It's odd that no one else is experiencing problems. I read a post where 
someone was successfully using a Mandrake 2.6.8.1 kernel, with the same 
controller (promise tx4) and raid5. I suppose my controller card could be 
faulty, but it's odd that drives on the controller w/o raid seem to be 
doing alright.

I'm going to try a 2.4 kernel so I might be able to back up my array 
before things get really hairy. I should probably also ping Neil Brown or 
whoever the current md maintainer is. I'll continue to post findings.

Will


* Re: libata badness
  2004-10-04 17:42   ` William Knop
@ 2004-10-04 17:50     ` Jim Paris
  2004-10-04 18:03       ` William Knop
  2004-10-04 18:01     ` Jeff Garzik
  1 sibling, 1 reply; 18+ messages in thread
From: Jim Paris @ 2004-10-04 17:50 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

> I just got another oops while trying to cp from my md/raid5 array (2 of 3 
> sata drives) to another sata drive on the same controller. This time, 
> though, it said there's a bug in timer.c, line 405, and that the 
> stack's garbage. I'm thinking it has nothing to do with timer.c, and 
> something in md or libata is chomping all over the kernel.

Or else something else on your system is bad.  Like your CPU or RAM.
Run memtest for a while.

-jim


* Re: libata badness
  2004-10-04 17:42   ` William Knop
  2004-10-04 17:50     ` Jim Paris
@ 2004-10-04 18:01     ` Jeff Garzik
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Garzik @ 2004-10-04 18:01 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

William Knop wrote:
> 
> I just got another oops while trying to cp from my md/raid5 array (2 of 
> 3 sata drives) to another sata drive on the same controller. This time, 
> though, it said there's a bug in timer.c, line 405, and that the stack's 
> garbage. I'm thinking it has nothing to do with timer.c, and something 
> in md or libata is chomping all over the kernel.

If you are getting random oopses all over the place, I would suspect 
hardware before I suspect buggy code.

Jim's and others' suggestions were good:  check power connectors (not 
just overall power consumption), test CPU, RAM, temperature, ...

	Jeff





* Re: libata badness
  2004-10-04 17:50     ` Jim Paris
@ 2004-10-04 18:03       ` William Knop
  0 siblings, 0 replies; 18+ messages in thread
From: William Knop @ 2004-10-04 18:03 UTC (permalink / raw)
  To: Jim Paris; +Cc: linux-kernel, linux-raid, linux-ide


> Or else something else on your system is bad.  Like your CPU or RAM.
> Run memtest for a while.

I ran memtest for a few hours. It all checks out. Also, CPU-intensive 
stuff doesn't cause weirdness.

The only things which cause corruption, so far as I can tell, are 
operations on my raid device, md0. Copying from the array takes a while 
for a crash, while rebuilding a drive on the array is very prompt (<1m) at 
causing a crash. The crashes, as likely as not, cause widespread kernel 
corruption. It's only a matter of time before my root fs is blown away 
again.



* Re: libata badness
  2004-10-04 12:12 libata badness William Knop
  2004-10-04 13:59 ` Jon Lewis
  2004-10-04 16:30 ` Jeff Garzik
@ 2004-10-04 22:47 ` Neil Brown
  2004-10-05  3:11   ` William Knop
  2 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2004-10-04 22:47 UTC (permalink / raw)
  To: William Knop; +Cc: linux-kernel, linux-raid, linux-ide

On Monday October 4, wknop@andrew.cmu.edu wrote:
> CPU:    0
> EIP:    0060:[<229e4d8c>]    Not tainted
> EFLAGS: 00010206   (2.6.8-1.521)
> EIP is at handle_stripe+0x29a/0x1407 [raid5]
> eax: 00000001   ebx: 00000000   ecx: 00915cb8   edx: 21f7e1c0
> esi: 1ccbd118   edi: 21f7e1c0   ebp: 01000000   esp: 1d300f28
> ds: 007b   es: 007b   ss: 0068
> Process md0_raid5 (pid: 2626, threadinfo=1d300000 task=1d317970)
> Stack: 2283eb57 20db8000 21f7e1c0 21c30288 1ccbd204 20db8000 00000001 1ccbd158
>        00000002 00000000 00000000 00000001 00000000 00000000 00000001 00000000
>        00000001 00000001 00000000 00000003 1ccbd0ac 21f7e1c0 1ccbd0ac 21f76c00
> Call Trace:
>   [<2283eb57>] ata_scsi_queuecmd+0xbe/0xc7 [libata]
>   [<229e6b1c>] raid5d+0x1ce/0x2f8 [raid5]
>   [<0228f5d2>] md_thread+0x227/0x256
>   [<0211be05>] autoremove_wake_function+0x0/0x2d
>   [<0211be05>] autoremove_wake_function+0x0/0x2d
>   [<0228f3ab>] md_thread+0x0/0x256
>   [<021041d9>] kernel_thread_helper+0x5/0xb
> Code: 8b 55 04 83 c1 08 8b 45 00 83 d3 00 39 da 72 0e 0f 87 e0 01

This code starts:

   0:   8b 55 04                  mov    0x4(%ebp),%edx
   3:   83 c1 08                  add    $0x8,%ecx

and as %ebp is 01000000, this oopses.  
It looks very much like a single-bit memory error (as has already been
suggested as a possibility).
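
A quick way to see the single-bit argument: the bogus %ebp has exactly 
one bit set, and adding the 0x4 displacement from the decoded mov 
reproduces the faulting address. A minimal userspace sketch, just 
re-checking the numbers quoted in the oops:

#include <stdio.h>

/* Count set bits by repeatedly clearing the lowest one. */
static int popcount32(unsigned int x)
{
	int n = 0;
	for (; x; x &= x - 1)
		n++;
	return n;
}

int main(void)
{
	unsigned int ebp = 0x01000000;	/* %ebp from the oops */

	printf("bits set in ebp: %d\n", popcount32(ebp));	/* prints 1 */
	printf("ebp + 0x4 = 0x%08x\n", ebp + 4);	/* the fault address */
	return 0;
}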

NeilBrown


* Re: libata badness
  2004-10-04 22:47 ` Neil Brown
@ 2004-10-05  3:11   ` William Knop
  2004-10-05  4:49     ` Brad Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: William Knop @ 2004-10-05  3:11 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel, linux-raid, linux-ide


> This code starts:
>
>   0:   8b 55 04                  mov    0x4(%ebp),%edx
>   3:   83 c1 08                  add    $0x8,%ecx
>
> and as %ebp is 01000000, this oopses.
> It looks very much like a single-bit memory error (as has already been
> suggested as a possibility).

Oh my. So, I ran memtest86 again for a few hours, and it checked out fine. 
Just in case, though, I bought a replacement stick of ram. Well, the 
oopses went away, so it must have been the ram.

Sigh. Sorry about that, everyone. I suppose the raid operations are 
particularly memory intensive. Anyway, thanks a ton for all the help.

Will


* Re: libata badness
  2004-10-05  3:11   ` William Knop
@ 2004-10-05  4:49     ` Brad Campbell
  2004-10-05  5:27       ` Norman Schmidt
  0 siblings, 1 reply; 18+ messages in thread
From: Brad Campbell @ 2004-10-05  4:49 UTC (permalink / raw)
  To: William Knop; +Cc: Neil Brown, linux-kernel, linux-raid, linux-ide

William Knop wrote:
> 
>> This code starts:
>>
>>   0:   8b 55 04                  mov    0x4(%ebp),%edx
>>   3:   83 c1 08                  add    $0x8,%ecx
>>
>> and as %ebp is 01000000, this oopses.
>> It looks very much like a single-bit memory error (as has already been
>> suggested as a possibility).
> 
> 
> Oh my. So, I ran memcheck again for a few hours, and it checked out 
> fine. Just in case, though, I bought a replacement stick of ram. Well, 
> the oopses went away, so it must have been the ram.

For future reference, I have had errors show up after 24-36 hours of memtest86. I usually find that 
if it passes 48 hours of testing then things are looking pretty reliable. A couple of hours is 
usually too small a sample to rely on.

Brad


* Re: libata badness
  2004-10-05  4:49     ` Brad Campbell
@ 2004-10-05  5:27       ` Norman Schmidt
  0 siblings, 0 replies; 18+ messages in thread
From: Norman Schmidt @ 2004-10-05  5:27 UTC (permalink / raw)
  To: linux-raid

Brad Campbell wrote:

> For future reference, I have had errors show up after 24-36 hours of 
> memtest86. I usually find that if it passes 48 hours of testing then 
> things are looking pretty reliable. A couple of hours is usually too 
> small a sample to rely on.
> 
> Brad

I can confirm that finding. What I usually do in addition to that is to
start enough burnmmx/burnp6 processes to make the system swap, run
bonnie at the same time and let the machine chew on this for two days.
If none of the processes stops, it can go into production.
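
If burnmmx/burnp6 aren't handy, a crude single-file soak in the same 
spirit is easy to improvise: allocate a large buffer, fill it with a 
pattern, and verify it in a loop. A sketch follows; it is no substitute 
for memtest86, and the 256MB buffer size is an arbitrary example (raise 
it, or run several copies at once, to force swapping):

#include <stdio.h>
#include <stdlib.h>

/* Number of long words in a 256MB buffer (arbitrary example size). */
#define NWORDS (256UL * 1024 * 1024 / sizeof(unsigned long))

int main(void)
{
	unsigned long *buf = malloc(NWORDS * sizeof(unsigned long));
	unsigned long i, pass;

	if (!buf) {
		perror("malloc");
		return 1;
	}
	for (pass = 0; ; pass++) {	/* runs until interrupted */
		unsigned long pat = (pass & 1) ? 0xaaaaaaaaUL : 0x55555555UL;

		for (i = 0; i < NWORDS; i++)	/* write phase */
			buf[i] = pat ^ i;
		for (i = 0; i < NWORDS; i++)	/* verify phase */
			if (buf[i] != (pat ^ i)) {
				printf("mismatch at word %lu: got %08lx\n",
				       i, buf[i]);
				return 1;
			}
		printf("pass %lu ok\n", pass);
	}
}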

However, I have now had problems with a faulty power supply (bad
capacitors) for the second time - the symptoms were erratic behaviour,
the software raid fell apart quite often (because some of the disks
sometimes went down), and so on. One learns all the time.

Bye Norman.

-- 
Norman Schmidt          Institut fuer Physikal. u. Theoret. Chemie
Dipl.-Chem. Univ.       Friedrich-Alexander-Universitaet
schmidt@naa.net         Erlangen-Nuernberg
                         IT-Systembetreuer Physikalische Chemie



