Daily crashes, incorrect RAID behaviour

* Daily crashes, incorrect RAID behaviour
@ 2006-08-15 11:36 Carsten Otto
  2006-08-15 12:33 ` Michael Tokarev
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Carsten Otto @ 2006-08-15 11:36 UTC (permalink / raw)
  To: linux-kernel

Hello!

System specs below (iCH7R, software raid 5)

My problems continue, even with a new and good power supply.
1) The system loses a disk about every week, only a hard reboot solves that
2) In the last three nights the system lost all disk access and
trashed the file systems

Regarding 1)
The system works normally and suddenly one disk does not respond.
After a soft reboot the BIOS does not recognize the disk, here a hard
reboot helps. Whenever I start my normal system in this situation, my
file systems get trashed. I think the software raid thinks the failed
disk (which missed several hours of writes) is OK and then
merges the data. When I delete the disk (or create another raid on the
partitions) I can add the disk without problems. This might be a bug,
at least it is _very_ annoying.

Regarding 2)
The system works as usual, but stops whenever disk access is needed
(some cached webpages work, but ssh login does not). On the screen I
see some scrolling messages telling me:
DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
many ext3 errors ("Something % 4 != 0, inode ..., something ...")
After a reboot with the failed disk removed (to avoid problem 1),
the system's file system is totally corrupt; fsck.ext3 finds a lot
of errors.
In my opinion this should not happen in a raid 5, am I correct?

Sidenotes:
The hard disks are all OK, I checked them.
The system chooses the failing disks at random. I do not see a pattern here.
After I reported similar problems here I got the hint to get a better
power supply. I did that (600W now) but that does not help.
However, after the upgrade to the new power supply the system worked
fine for almost two weeks (then the weekly crashes started).

System specs:
Kernel 2.6.17.8 and newer
Software raid 5
Asus P5LB2 with iCH7R
Pentium D 805 (Dual Core)
2 GB PC533
4x Maxtor 300GB (Sata2)
1x Samsung 200GB (Pata)
Intel PCIe network

Thanks for _every_ hint, I am desperate. The system is quite new and
only causes problems.
-- 
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 11:36 Daily crashes, incorrect RAID behaviour Carsten Otto
@ 2006-08-15 12:33 ` Michael Tokarev
  2006-08-15 12:57 ` Alan Cox
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Michael Tokarev @ 2006-08-15 12:33 UTC (permalink / raw)
  To: Carsten Otto; +Cc: linux-kernel

Carsten Otto wrote:
> Hello!
> 
> System specs below (iCH7R, software raid 5)
> 
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot solves that

We've seen this in a lot of cases in the past.  The issue was in a single
batch of Seagate 9gig drives (yes, old) - from time to time, one disk
just disappears from the system completely, and only a power-off-on cycle
forces it to reappear.  This happens without any pattern, i.e. randomly -
sometimes a disk can disappear several minutes after a power-on,
without any system load; and sometimes it works just fine for several
months.

We tried to replace (RMA) the bad drives one by one, with the same
scenario all the time: they test the drive for a day, and call us back
saying everything's ok; we grab the drive, and return it back the next
day (because we *know* it's NOT Ok), and they send it for replacement.
The replacement drives (even refurbished ones) all work ok (we replaced
about 20 drives in total, all from the same batch).

I talked with Seagate techs about this issue, but there was no conclusion
(one said it's "typical mishandling", like static electricity etc, but
that does not match the behaviour at all).  And since the drives are very
old now (though quite a few of them are still in production ;), and were
already quite old when the problem started happening (about 6 years ago),
it's simpler to trash them and replace them with more modern drives.

That was only one batch of drives.  And the drives were excellent (for their
age anyway): not a single disk failure in many years, not even a single bad
block on about 50 drives!  Not counting those sporadic disappearances, of
course ;)  And the Seagate guys say this is something they've never heard of
before, too.

All that to say: sometimes disk drives do strange things.  Rarely, very rarely,
but it happens... ;)

/mjt

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 11:36 Daily crashes, incorrect RAID behaviour Carsten Otto
  2006-08-15 12:33 ` Michael Tokarev
@ 2006-08-15 12:57 ` Alan Cox
  2006-08-15 12:42   ` Carsten Otto
  2006-08-15 13:45 ` Ralf Müller
  2006-08-15 15:31 ` Carsten Otto
  3 siblings, 1 reply; 12+ messages in thread
From: Alan Cox @ 2006-08-15 12:57 UTC (permalink / raw)
  To: Carsten Otto; +Cc: linux-kernel

Ar Maw, 2006-08-15 am 13:36 +0200, ysgrifennodd Carsten Otto:
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my

Rule of thumb (and a good one). If the soft reboot and BIOS cannot
recover the disk then the disk is the problem. There isn't really
anything we can tell the drive to do which should make it take a hike
and ignore a reset sequence.  (Should.. however..)

> DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk

Pity, the exact text is essential.

> However, after the upgrade to the new power supply the system worked
> fine for almost two weeks (then the weekly crashes started).

I assume you've run memtest86 and also checked temperatures look good
around all the disks.

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 12:57 ` Alan Cox
@ 2006-08-15 12:42   ` Carsten Otto
  2006-08-15 13:08     ` Jan Engelhardt
  0 siblings, 1 reply; 12+ messages in thread
From: Carsten Otto @ 2006-08-15 12:42 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

> Rule of thumb (and a good one). If the soft reboot and BIOS cannot
> recover the disk then the disk is the problem. There isn't really
> anything we can tell the drive to do which should make it take a hike
> and ignore a reset sequence.  (Should.. however..)

Makes sense. I will focus my attention on the disks now (which makes
sense not only because of your information).

> > DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
> Pity the exact text is essential.

Here is the exact message I saw a few weeks ago (posted in here):

ata4: handling error/timeout
ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key=0x0
        ASC=0x0 ASCQ=0x0
Info fid=0x0

To my knowledge this time it did not look different at all.
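
For anyone reading the status line above: the words in braces are just
the bits of the ATA status byte spelled out. The following is a minimal
sketch of that decoding in Python, assuming the standard ATA status
register layout; the function name decode_ata_status is made up for
illustration, and the labels only mimic what the kernel printed here.

```python
# Bit masks follow the ATA status register layout. The labels mimic the
# words seen in the kernel message; the function name is hypothetical.
ATA_BUSY = 0x80
ATA_STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of the flags set in an ATA status byte."""
    if status & ATA_BUSY:
        # While BSY is set the remaining bits are not meaningful,
        # so only "Busy" is reported.
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0x50))  # ['DriveReady', 'SeekComplete']
print(decode_ata_status(0xFF))  # ['Busy']
```

Read this way, 0x50 has neither the Error nor the DeviceFault bit set,
so the drive itself is not reporting a failed command in that line.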

> I assume you've run memtest86 and also checked temperatures look good
> around all the disks.

Of course. I even replaced the mainboard (screwdriver accident..) and
power supply (too weak). And I now know that the sata cables I used at
first did not cause the problems :)

Thanks,
-- 
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 12:42   ` Carsten Otto
@ 2006-08-15 13:08     ` Jan Engelhardt
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Engelhardt @ 2006-08-15 13:08 UTC (permalink / raw)
  To: Carsten Otto; +Cc: Alan Cox, linux-kernel

>> > DriveReadySeekComplete (I do not recall the exact words, sorry) for
>> > one disk
>> Pity the exact text is essential.
>
> Here is the exact message I saw a few weeks ago (posted in here):
>
> ata4: handling error/timeout
> ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
> ata4: status=0x50 { DriveReady SeekComplete }
> sdd: Current: sense key=0x0
>       ASC=0x0 ASCQ=0x0
> Info fid=0x0


Although I do not want to accuse libata, is it possible that a libata bug 
is around?


Jan Engelhardt
-- 

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 11:36 Daily crashes, incorrect RAID behaviour Carsten Otto
  2006-08-15 12:33 ` Michael Tokarev
  2006-08-15 12:57 ` Alan Cox
@ 2006-08-15 13:45 ` Ralf Müller
  2006-08-19 11:37   ` Andrew Baker
  2006-08-15 15:31 ` Carsten Otto
  3 siblings, 1 reply; 12+ messages in thread
From: Ralf Müller @ 2006-08-15 13:45 UTC (permalink / raw)
  To: linux-kernel

On Tuesday 15 August 2006 13:36, you wrote:
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot
> solves that
> 2) In the last three nights the system lost all disk
> access and trashed the file systems
>
> Regarding 1)
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my
> file systems get trashed. I think the software raid thinks the failed
> disk (which missed several hours of writes) is OK and then
> merges the data. When I delete the disk (or create another raid on
> the partitions) I can add the disk without problems. This might be a
> bug, at least it is _very_ annoying.

> 4x Maxtor 300GB (Sata2)

I have a similar problem with maybe the same type of disk. My analysis 
of the problem is still incomplete, so I have not asked on the 
kernel mailing list yet.

The disk type we use here in a ten disk RAID6 is:
Maxtor 7V300F0 300GB Sata2
on Promise Sata 300 TX4 controllers

Once every week or two a random disk stops responding and needs 
a complete power off/on cycle to recover. After the power cycle the disk 
works without problems and doesn't report any SMART problems ...
I'm quite sure it is not a problem with the power supply, motherboard, 
backplane or controllers. Still open are cabling and disks. Nearly the 
same hardware setup - just with different disks - has been running 
smoothly for about 8 months in a different system.

If this is the same disk type, we should maybe return the disks to our 
hardware vendors, as it may be a disk problem.

The only messages I get are like that:
Aug 13 17:25:10 backup-core kernel: ata7: command timeout
Aug 13 17:25:10 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:10 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:25:42 backup-core kernel: ata7: command timeout
Aug 13 17:25:42 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:42 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: ata7: command timeout
Aug 13 17:26:43 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:26:43 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: end_request: I/O error, dev sdg, sector 2104383
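
To see whether the dropouts really hit random ports, messages like
these can be tallied per port. This is a rough sketch, assuming syslog
lines in exactly the "kernel: ataN: command timeout" form shown above;
the helper name count_timeouts is hypothetical and other kernel
versions may word the message differently.

```python
import re
from collections import Counter

# Matches lines such as "... kernel: ata7: command timeout".
# The pattern is an assumption based on the messages in this mail.
TIMEOUT_RE = re.compile(r"kernel: (ata\d+): command timeout")


def count_timeouts(log_text):
    """Count 'command timeout' messages per ATA port in syslog text."""
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))


sample = """\
Aug 13 17:25:10 backup-core kernel: ata7: command timeout
Aug 13 17:25:10 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:25:42 backup-core kernel: ata7: command timeout
Aug 13 17:26:43 backup-core kernel: ata7: command timeout
"""
print(count_timeouts(sample))  # Counter({'ata7': 3})
```

Fed a few weeks of logs, the counter should show whether one port
dominates (pointing at a cable or a single drive) or whether the
timeouts spread across all ports.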

Regards
Ralf

-- 
Van Roy's Law: -------------------------------------------------------
       An unbreakable toy is useful for breaking other toys.

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 13:45 ` Ralf Müller
@ 2006-08-19 11:37   ` Andrew Baker
  2006-08-19 11:47     ` Justin Piszcz
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Baker @ 2006-08-19 11:37 UTC (permalink / raw)
  To: linux-kernel

We too are having the same problem and the only obviously common factor is
Maxtor SATA HDD.

We have two identical systems - 64 bit - 2 x Dual Opterons, 8 GB RAM running
Novell/SUSE SLES10. Both systems are showing the problem.

In our case the RAID controller is 

3ware Escalade 9550SX - 8LP

And the HDD are:

Maxtor MaxLine III (7V250F0) 250GB SATA II 

The symptoms here are almost exactly as you describe.  A disc "drops out" once
every week or two and the only way to clear the problem is a power cycle - or
remove and replace the HDD (our system is hot-swap).

Regards

Andrew

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-19 11:37   ` Andrew Baker
@ 2006-08-19 11:47     ` Justin Piszcz
  2006-08-19 18:53       ` Andrew Baker
  0 siblings, 1 reply; 12+ messages in thread
From: Justin Piszcz @ 2006-08-19 11:47 UTC (permalink / raw)
  To: Andrew Baker; +Cc: linux-kernel



On Sat, 19 Aug 2006, Andrew Baker wrote:

> We too are having the same problem and the only obviously common factor is
> Maxtor SATA HDD.
>
> We have two identical systems - 64 bit - 2 x Dual Opterons, 8Gb Ram running
> Novell/SUSE SLES10. Both systems are showing the problem.
>
> In our case the RAID controller is
>
> 3ware Escalade 9550SX - 8LP
>
> And the HDD are:
>
> Maxtor MaxLine III (7V250F0) 250GB SATA II
>
> The symptoms here are almost exactly as you describe.  A disc "drops out" once
> every week or two and the only way to clear the problem is a power cycle - or
> remove and replace the HDD (our system is hot-swap).
>
> Regards
>
> Andrew
>

I had the same problem with a 3ware 2 port IDE raid controller, 7006-2. 
One drive would always drop out under heavy I/O. Made me sick.  Moved to 
SW raid, all problems went away.

Justin.


* Re: Daily crashes, incorrect RAID behaviour
  2006-08-19 11:47     ` Justin Piszcz
@ 2006-08-19 18:53       ` Andrew Baker
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Baker @ 2006-08-19 18:53 UTC (permalink / raw)
  To: linux-kernel

For various complex reasons,
Software RAID is not a viable option on these systems.


* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 11:36 Daily crashes, incorrect RAID behaviour Carsten Otto
                   ` (2 preceding siblings ...)
  2006-08-15 13:45 ` Ralf Müller
@ 2006-08-15 15:31 ` Carsten Otto
  2006-08-15 18:28   ` Mike Dresser
  3 siblings, 1 reply; 12+ messages in thread
From: Carsten Otto @ 2006-08-15 15:31 UTC (permalink / raw)
  To: linux-kernel

Okay, after Ralf's message I found this newsgroup post:
http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&

> You should be aware that currently
> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> bug.  The current version shipping is VA111630, an update is available to
> VA111670 which merely reduces the frequency of timeouts that get the drive
> kicked out from the array.

I got a new firmware from Maxtor today. My disks now have firmware
VA111900, before that I had VA111630. Let's see what happens...
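
For checking several disks at once, something like the sketch below
might help. It assumes the firmware string is taken from a
"Firmware Version:" line as printed by smartctl -i; the helper names
and the set of flagged revisions are my own illustration based on the
quoted post, not a list from Maxtor.

```python
import re

# Revisions mentioned in the quoted post: VA111630 is said to be buggy,
# and VA111670 is said to only reduce the frequency of timeouts.
# Treating both as "flagged" is an assumption, not vendor information.
FLAGGED_REVISIONS = {"VA111630", "VA111670"}


def firmware_from_smartctl(info_text):
    """Extract the firmware string from `smartctl -i` style output."""
    match = re.search(r"^Firmware Version:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def needs_update(info_text):
    """True if the reported firmware is one of the flagged revisions."""
    return firmware_from_smartctl(info_text) in FLAGGED_REVISIONS


# Hypothetical sample text; real smartctl output contains more lines.
sample = "Device Model:     Maxtor 7V300F0\nFirmware Version: VA111630\n"
print(firmware_from_smartctl(sample), needs_update(sample))  # VA111630 True
```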

PS: Maxtor's hotline guy had no record about firmware related
problems. I'd like to report those (with the two additional
references), but now the hotline has technical difficulties...

Bye,
-- 
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de

* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 15:31 ` Carsten Otto
@ 2006-08-15 18:28   ` Mike Dresser
  2006-08-15 19:27     ` Alistair John Strachan
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Dresser @ 2006-08-15 18:28 UTC (permalink / raw)
  To: Carsten Otto; +Cc: linux-kernel

On Tue, 15 Aug 2006, Carsten Otto wrote:

> Okay, after Ralf's message I found this newsgroup post:
> http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
>
>> You should be aware that currently
>> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
>> bug.  The current version shipping is VA111630, an update is available to
>> VA111670 which merely reduces the frequency of timeouts that get the drive
>> kicked out from the array.

I'm running 680 now, and the 15 drives have been up for something like two 
months or so without issues at all.. Seems like the firmware fixes the 
problem.

Mike


* Re: Daily crashes, incorrect RAID behaviour
  2006-08-15 18:28   ` Mike Dresser
@ 2006-08-15 19:27     ` Alistair John Strachan
  0 siblings, 0 replies; 12+ messages in thread
From: Alistair John Strachan @ 2006-08-15 19:27 UTC (permalink / raw)
  To: Mike Dresser; +Cc: Carsten Otto, linux-kernel

On Tuesday 15 August 2006 19:28, Mike Dresser wrote:
> On Tue, 15 Aug 2006, Carsten Otto wrote:
> > Okay, after Ralf's message I found this newsgroup post:
> > http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
> >
> >> You should be aware that currently
> >> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> >> bug.  The current version shipping is VA111630, an update is available
> >> to VA111670 which merely reduces the frequency of timeouts that get the
> >> drive kicked out from the array.
>
> I'm running 680 now, and the 15 drives have been up for something like two
> months or so without issues at all.. Seems like the firmware fixes the
> problem.

Still, the RAID rebuild problem is worrying. I might have to deliberately 
fault my RAID5 partitions and see if they rebuild correctly..
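
One way to watch such a test is to read /proc/mdstat before and after
faulting a member. A minimal sketch, assuming the usual "[4/3] [U_UU]"
status format and arrays named mdN; the function name degraded_arrays
and the sample text are made up for illustration.

```python
import re

# Matches the "[4/3] [U_UU]" status that /proc/mdstat prints per array:
# configured member count, active member count, then one flag per slot.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: missing_member_count} for degraded md arrays."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = expected - active
    return degraded


# Hypothetical sample of a four-disk RAID5 with one member missing.
sample = """\
Personalities : [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sda1[0]
      879148800 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
unused devices: <none>
"""
print(degraded_arrays(sample))  # {'md0': 1}
```

On a live system the text would come from open("/proc/mdstat").read().
An empty result after the re-added disk has finished syncing means all
members are active again. It does not prove the data is consistent, so
an fsck or checksum comparison afterwards is still worthwhile.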

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
