* Daily crashes, incorrect RAID behaviour
From: Carsten Otto @ 2006-08-15 11:36 UTC
To: linux-kernel
Hello!
System specs below (iCH7R, software raid 5)
My problems continue, even with a new and good power supply.
1) The system loses a disk about every week, only a hard reboot solves that
2) In the last three nights the system lost all disk access and
trashed the file systems
Regarding 1)
The system works normally and suddenly one disk does not respond.
After a soft reboot the BIOS does not recognize the disk, here a hard
reboot helps. Whenever I start my normal system in this situation, my
file systems get trashed. I think the software raid treats the failed
disk (which missed several hours of writes) as OK and then
merges the data. When I delete the disk (or create another raid on the
partitions) I can add the disk without problems. This might be a bug;
at least it is _very_ annoying.
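For reference, md is meant to guard against exactly this: each member's superblock carries an event counter, and a member whose counter lags behind the others should be treated as stale and resynced rather than trusted. Below is a minimal Python sketch of that comparison. It assumes the counts were copied by hand from the "Events" line of `mdadm --examine` for each member; the function name and the sample numbers are made up for illustration and are not taken from this system.

```python
def find_stale_members(events):
    """Return member devices whose event count lags the newest one.

    `events` maps a device name to the "Events" value reported for it
    (assumed here to be read from `mdadm --examine <device>`).
    """
    if not events:
        return []
    newest = max(events.values())
    return sorted(dev for dev, count in events.items() if count < newest)


# Hypothetical counts: sdd1 dropped out hours before the array was stopped.
example = {"sda1": 4812, "sdb1": 4812, "sdc1": 4812, "sdd1": 4790}
print(find_stale_members(example))
```

If the returning disk shows a lower count than the rest and is still accepted as in sync, that would support the suspicion of a bug; if the counts are equal, the problem lies elsewhere.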
Regarding 2)
The system works as usual, but stops whenever disk access is needed
(some cached webpages work, but ssh login does not). On the screen I
see some scrolling messages telling me:
DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
many ext3 errors ("Something % 4 != 0, inode ..., something ...")
After a reboot with the failed disk removed (to avoid problem 1),
the system's file system is totally corrupt; fsck.ext3 finds a lot
of errors.
In my opinion this should not happen in a raid 5, am I correct?
Sidenotes:
The hard disks are all OK; I checked them.
The system chooses the failing disks at random. I do not see a pattern here.
After I reported similar problems here I got the hint to get a better
power supply. I did that (600W now) but that does not help.
However, after the upgrade to the new power supply the system worked
fine for almost two weeks (then the weekly crashes started).
System specs:
Kernel 2.6.17.8 and newer
Software raid 5
Asus P5LB2 with iCH7R
Pentium D 805 (Dual Core)
2 GB PC533
4x Maxtor 300GB (Sata2)
1x Samsung 200GB (Pata)
Intel PCIe network
Thanks for _every_ hint, I am desperate. The system is quite new and
causes nothing but problems.
--
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de
* Re: Daily crashes, incorrect RAID behaviour
From: Michael Tokarev @ 2006-08-15 12:33 UTC
To: Carsten Otto; +Cc: linux-kernel
Carsten Otto wrote:
> Hello!
>
> System specs below (iCH7R, software raid 5)
>
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot solves that
We've seen this in a lot of cases in the past. The issue was in a single
batch of Seagate 9 GB drives (yes, old): from time to time, one disk
just disappears from the system completely, and only a power-off/on cycle
forces it to reappear. This happens without any pattern, i.e. randomly:
sometimes a disk disappears several minutes after power-on,
without any system load; and sometimes it works just fine for several
months.
We tried to replace (RMA) the bad drives one by one, with the same
scenario all the time: they test the drive for a day, and call us back
saying everything's OK; we grab the drive, and return it the next
day (because we *know* it's NOT OK), and they send it for replacement.
The replacement drives (even refurbished ones) all work OK (we replaced
about 20 drives in total, all from the same batch).
I talked with a Seagate tech about this issue, but there was no conclusion
(he said it's "typical mishandling", like static electricity etc., but
that does not match the behaviour at all). And since the drives are very
old now (though quite a few of them are still in production ;), and were already
quite old when the problem started happening (about 6 years ago), it's
simpler to trash them and replace them with more modern drives.
That was only one batch of drives. And the drives were excellent (for their
age anyway): not a single disk failure in many years, not even a single bad block
on about 50 drives! Not counting those sporadic disappearances, of course ;)
And the Seagate guys say this is something they've never heard of before, too.
All that to say: sometimes disk drives do strange things. Rare, very rare,
but it happens... ;)
/mjt
* Re: Daily crashes, incorrect RAID behaviour
From: Alan Cox @ 2006-08-15 12:57 UTC
To: Carsten Otto; +Cc: linux-kernel
On Tue, 2006-08-15 at 13:36 +0200, Carsten Otto wrote:
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my
Rule of thumb (and a good one). If the soft reboot and BIOS cannot
recover the disk then the disk is the problem. There isn't really
anything we can tell the drive to do which should make it take a hike
and ignore a reset sequence. (Should.. however..)
> DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
Pity, the exact text is essential.
> However, after the upgrade to the new power supply the system worked
> fine for almost two weeks (then the weekly crashes started).
I assume you've run memtest86 and also checked temperatures look good
around all the disks.
* Re: Daily crashes, incorrect RAID behaviour
From: Carsten Otto @ 2006-08-15 12:42 UTC
To: Alan Cox; +Cc: linux-kernel
> Rule of thumb (and a good one). If the soft reboot and BIOS cannot
> recover the disk then the disk is the problem. There isn't really
> anything we can tell the drive to do which should make it take a hike
> and ignore a reset sequence. (Should.. however..)
Makes sense. I will focus my attention on the disks now (which makes
sense not only because of your information).
> > DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
> Pity the exact text is essential.
Here is the exact message I saw a few weeks ago (and posted here):
ata4: handling error/timeout
ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key=0x0
ASC=0x0 ASCQ=0x0
Info fid=0x0
As far as I can tell, this time it looked no different.
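As a reading aid for the `status=0x50 { DriveReady SeekComplete }` line: the value is the ATA status register, and the names in braces are its set bits. The Python sketch below decodes such a byte using the standard ATA bit positions and the names the kernel log prints; `decode_status` itself is a hypothetical helper, not a kernel or library function.

```python
# ATA status register bits, named the way the kernel log prints them.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    # While Busy is set the other bits are not meaningful, so report it alone.
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in STATUS_BITS if status & mask]


print(decode_status(0x50))  # ['DriveReady', 'SeekComplete']
print(decode_status(0xFF))  # ['Busy']
```

Note that 0x50 has neither the Error nor the DeviceFault bit set, so the status byte on its own describes an idle, ready drive; the interesting part of the log is the timeout that preceded it.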
> I assume you've run memtest86 and also checked temperatures look good
> around all the disks.
Of course. I even replaced the mainboard (screwdriver accident..) and
power supply (too weak). And I now know that the sata cables I used at
first did not cause the problems :)
Thanks,
--
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de
* Re: Daily crashes, incorrect RAID behaviour
From: Jan Engelhardt @ 2006-08-15 13:08 UTC
To: Carsten Otto; +Cc: Alan Cox, linux-kernel
>> > DriveReadySeekComplete (I do not recall the exact words, sorry) for
>> > one disk
>> Pity the exact text is essential.
>
> Here is the exact message I saw a few weeks ago (posted in here):
>
> ata4: handling error/timeout
> ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
> ata4: status=0x50 { DriveReady SeekComplete }
> sdd: Current: sense key=0x0
> ASC=0x0 ASCQ=0x0
> Info fid=0x0
Although I do not want to accuse libata, is it possible that a libata bug
is involved?
Jan Engelhardt
--
* Re: Daily crashes, incorrect RAID behaviour
From: Ralf Müller @ 2006-08-15 13:45 UTC
To: linux-kernel
On Tuesday 15 August 2006 13:36, you wrote:
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot
> solves that
> 2) In the last three nights the system lost all disk
> access and trashed the file systems
>
> Regarding 1)
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my
> file systems get trashed. I think the software raid thinks the failed
> disks (which lost several hours of write accesses) is OK and then
> merges the data. When I delete the disk (or create another raid on
> the partitions) I can add the disk without problems. This might be a
> bug, at least it is _very_ annoying.
> 4x Maxtor 300GB (Sata2)
I have a similar problem with maybe the same type of disk. My analysis
of the problem is not complete yet, so I have not asked on the
kernel mailing list so far.
The disk type we use here in a ten disk RAID6 is:
Maxtor 7V300F0 300GB Sata2
on Promise Sata 300 TX4 controllers
Once every week or two a random disk stops responding and needs
a complete power off/on cycle to recover. After the power cycle the disk
works without problems, doesn't report any SMART problems ...
I'm quite sure it is not a problem with the power supply, motherboard,
backplane or controllers. Still open are cabling and disks. Nearly the
same hardware setup, just with different disks, has been running smoothly
for about 8 months in a different system.
If this is the same disk type we maybe should return the disks to our
hardware vendors as it may be a disk problem.
The only messages I get are like that:
Aug 13 17:25:10 backup-core kernel: ata7: command timeout
Aug 13 17:25:10 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:10 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:25:42 backup-core kernel: ata7: command timeout
Aug 13 17:25:42 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:42 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: ata7: command timeout
Aug 13 17:26:43 backup-core kernel: ata7: translated ATA stat/err 0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:26:43 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: end_request: I/O error, dev sdg, sector 2104383
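To compare reports like this across machines, it may help to tally the "command timeout" lines per ATA port over a longer stretch of syslog. The Python sketch below does that for log text in the format shown above; the function name is hypothetical and the sample is cut down from the excerpt, so treat it as a starting point rather than a finished tool.

```python
import re
from collections import Counter

# Matches e.g. "... kernel: ata7: command timeout"
TIMEOUT_RE = re.compile(r"kernel: (ata\d+): command timeout")


def count_timeouts(log_text):
    """Count "command timeout" lines per ATA port in syslog text."""
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))


sample = """\
Aug 13 17:25:10 backup-core kernel: ata7: command timeout
Aug 13 17:25:42 backup-core kernel: ata7: command timeout
Aug 13 17:26:43 backup-core kernel: ata7: command timeout
"""
print(count_timeouts(sample))
```

A tally spread evenly over all ports would fit the "random disk" observation, while one port dominating would point at a cable, slot or single drive instead.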
Regards
Ralf
--
Van Roy's Law: -------------------------------------------------------
An unbreakable toy is useful for breaking other toys.
* Re: Daily crashes, incorrect RAID behaviour
From: Andrew Baker @ 2006-08-19 11:37 UTC
To: linux-kernel
We too are having the same problem and the only obviously common factor is
Maxtor SATA HDD.
We have two identical systems - 64 bit - 2 x Dual Opterons, 8Gb Ram running
Novell/SUSE SLES10. Both systems are showing the problem.
In our case the RAID controller is
3ware Escalade 9550SX - 8LP
And the HDD are:
Maxtor MaxLine III (7V250F0) 250GB SATA II
The symptoms here are almost exactly as you describe. A disc "drops out" once
every week or two and the only way to clear the problem is a power cycle - or
remove and replace the HDD (our system is hot-swap).
Regards
Andrew
* Re: Daily crashes, incorrect RAID behaviour
From: Justin Piszcz @ 2006-08-19 11:47 UTC
To: Andrew Baker; +Cc: linux-kernel
On Sat, 19 Aug 2006, Andrew Baker wrote:
> We too are having the same problem and the only obviously common factor is
> Maxtor SATA HDD.
>
> We have two identical systems - 64 bit - 2 x Dual Opterons, 8Gb Ram running
> Novell/SUSE SLES10. Both systems are showing the problem.
>
> In our case the RAID controller is
>
> 3ware Escalade 9550SX - 8LP
>
> And the HDD are:
>
> Maxtor MaxLine III (7V250F0) 250GB SATA II
>
> The symptoms here are almost exactly as you describe. A disc "drops out" once
> every week or two and the only way to clear the problem is a power cycle - or
> remove and replace the HDD (our system is hot-swap).
>
> Regards
>
> Andrew
I had the same problem with a 3ware 2 port IDE raid controller, 7006-2.
One drive would always drop out under heavy I/O. Made me sick. Moved to
SW raid, all problems went away.
Justin.
* Re: Daily crashes, incorrect RAID behaviour
From: Carsten Otto @ 2006-08-15 15:31 UTC
To: linux-kernel
Okay, after Ralf's message I found this newsgroup post:
http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
> You should be aware that currently
> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> bug. The current version shipping is VA111630, an update is available to
> VA111670 which merely reduces the frequency of timeouts that get the drive
> kicked out from the array.
I got a new firmware from Maxtor today. My disks now have firmware
VA111900, before that I had VA111630. Let's see what happens...
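With several identical drives, a quick script can show which ones are still on an older firmware. The Python sketch below compares version strings of the form seen in this thread (letters followed by digits). It assumes that a higher number means a newer release, which Maxtor has not confirmed, and that the strings were collected by hand, for example from the "Firmware Version" line of `smartctl -i`. The drive list and the helper name are made up.

```python
import re

# Firmware strings in this thread look like "VA111630": a prefix plus digits.
FIRMWARE_RE = re.compile(r"^([A-Z]+)(\d+)$")


def is_older_than(firmware, minimum):
    """True if `firmware` has the same prefix as `minimum` but a lower number."""
    fw = FIRMWARE_RE.match(firmware)
    ref = FIRMWARE_RE.match(minimum)
    if not fw or not ref or fw.group(1) != ref.group(1):
        raise ValueError(f"cannot compare {firmware!r} with {minimum!r}")
    return int(fw.group(2)) < int(ref.group(2))


# Hypothetical inventory. The threshold is simply the version just installed
# here, not a version known to be free of the problem.
drives = {"sda": "VA111630", "sdb": "VA111900"}
needs_update = [dev for dev, fw in drives.items() if is_older_than(fw, "VA111900")]
print(needs_update)
```

The threshold is a parameter on purpose: which version actually cures the dropouts is still an open question at this point in the thread.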
PS: Maxtor's hotline guy had no record about firmware related
problems. I'd like to report those (with the two additional
references), but now the hotline has technical difficulties...
Bye,
--
Carsten Otto
carsten.otto@gmail.com
www.c-otto.de
* Re: Daily crashes, incorrect RAID behaviour
From: Mike Dresser @ 2006-08-15 18:28 UTC
To: Carsten Otto; +Cc: linux-kernel
On Tue, 15 Aug 2006, Carsten Otto wrote:
> Okay, after Ralf's message I found this newsgroup post:
> http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
>
>> You should be aware that currently
>> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
>> bug. The current version shipping is VA111630, an update is available to
>> VA111670 which merely reduces the frequency of timeouts that get the drive
>> kicked out from the array.
I'm running 680 now, and the 15 drives have been up for something like two
months or so without issues at all.. Seems like the firmware fixes the
problem.
Mike
* Re: Daily crashes, incorrect RAID behaviour
From: Alistair John Strachan @ 2006-08-15 19:27 UTC
On Tuesday 15 August 2006 19:28, Mike Dresser wrote:
> On Tue, 15 Aug 2006, Carsten Otto wrote:
> > Okay, after Ralf's message I found this newsgroup post:
> > http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
> >
> >> You should be aware that currently
> >> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> >> bug. The current version shipping is VA111630, an update is available
> >> to VA111670 which merely reduces the frequency of timeouts that get the
> >> drive kicked out from the array.
>
> I'm running 680 now, and the 15 drives have been up for something like two
> months or so without issues at all.. Seems like the firmware fixes the
> problem.
Still, the RAID rebuild problem is worrying. I might have to deliberately
fault my RAID5 partitions and see if they rebuild correctly..
--
Cheers,
Alistair.
Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Thread overview: 12+ messages
2006-08-15 11:36 Daily crashes, incorrect RAID behaviour Carsten Otto
2006-08-15 12:33 ` Michael Tokarev
2006-08-15 12:57 ` Alan Cox
2006-08-15 12:42 ` Carsten Otto
2006-08-15 13:08 ` Jan Engelhardt
2006-08-15 13:45 ` Ralf Müller
2006-08-19 11:37 ` Andrew Baker
2006-08-19 11:47 ` Justin Piszcz
2006-08-19 18:53 ` Andrew Baker
2006-08-15 15:31 ` Carsten Otto
2006-08-15 18:28 ` Mike Dresser
2006-08-15 19:27 ` Alistair John Strachan