* RAID-5 design bug (or misfeature)
@ 2005-05-29 22:53 Mikulas Patocka
0 siblings, 2 replies; 12+ messages in thread
From: Mikulas Patocka @ 2005-05-29 22:53 UTC (permalink / raw)
To: linux-kernel
Hi
RAID-5 has a rather serious design bug --- when two disks become temporarily
inaccessible (as happened to me because of high temperature in the server
room), Linux writes information about these errors to the remaining disks,
and when the failed disks are online again, the RAID-5 array won't ever be
accessible. The RAID-HOWTO lists some actions that can be taken in this case,
but none of them can be done if the root filesystem is on RAID --- the
machine just won't boot.
I think Linux should stop accessing all disks in a RAID-5 array if two disks
fail, and not write "this array is dead" in the superblocks on the remaining
disks, effectively destroying the whole array.
Mikulas
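[Editor's note: the unrecoverability described above follows from the parity arithmetic itself. A minimal sketch, not part of the original thread, of why md can rebuild one missing disk per stripe but never two:]

```python
from functools import reduce

def parity(blocks):
    # RAID-5 parity is the bytewise XOR of every data block in a stripe.
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # a stripe spread over three data disks
p = parity(data)                     # parity block, stored on a fourth disk

# One disk missing: XOR the survivors with the parity to get it back.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]

# Two disks missing: one XOR equation, two unknowns -- there is nothing
# left to solve with, which is why md declares the array failed.
```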
* Re: RAID-5 design bug (or misfeature)
From: Wakko Warner @ 2005-05-29 23:01 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: linux-kernel
Mikulas Patocka wrote:
> RAID-5 has a rather serious design bug --- when two disks become temporarily
> inaccessible (as happened to me because of high temperature in the server
> room), Linux writes information about these errors to the remaining disks
> and when the failed disks are online again, RAID-5 won't ever be accessible.
I ran into this myself; however, I had 10 disks (5 per channel) and one
channel went down. OK, my array was dead at that point and I had to reboot.
What luck, the array wasn't usable anymore. My /usr was on that array, but
my / was not. I did not want to go through the initrd/initramfs thing at
the time to set up my / with RAID-5, plus the fact that you truly cannot boot
from it (thus partitioning and setting aside a slice wasn't viable for me).
> RAID-HOWTO lists some actions that can be done in this case, but none of
> them can be done if root filesystem is on RAID --- the machine just won't
> boot.
I had to reconstruct the array by hand with mdadm; evms wouldn't touch it.
Fortunately, I had a copy of each disk's information and the RAID-5's
information in files, so it was quite easy to rebuild. I did have backups,
but that wasn't really what I wanted to do. (It did take over 2 hours
before I could return to normal. evms can't handle a RAID-5 that was in
reconstruction. I think newer versions have this fixed.)
> I think Linux should stop accessing all disks in RAID-5 array if two disks
> fail and not write "this array is dead" in superblocks on remaining disks,
> effectively destroying the whole array.
That'd be nice =)
--
Lab tests show that use of micro$oft causes cancer in lab animals
* Re: RAID-5 design bug (or misfeature)
From: Bernd Eckenfels @ 2005-05-29 23:58 UTC (permalink / raw)
To: linux-kernel
In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
> I think Linux should stop accessing all disks in RAID-5 array if two disks
> fail and not write "this array is dead" in superblocks on remaining disks,
> effectively destroying the whole array.
I agree with you; however, it is a pretty damned stupid idea to use RAID-5
for a root disk (I was about to say it is not a good idea to use RAID-5 on
Linux at all :)
Gruss
Bernd
* Re: RAID-5 design bug (or misfeature)
From: Mikulas Patocka @ 2005-05-30 2:47 UTC (permalink / raw)
To: Bernd Eckenfels; +Cc: linux-kernel
> In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
> > I think Linux should stop accessing all disks in RAID-5 array if two disks
> > fail and not write "this array is dead" in superblocks on remaining disks,
> > effectively destroying the whole array.
>
> I agree with you, however it is a pretty damned stupid idea to use raid-5
> for a root disk (I was about to say it is not a good idea to use raid-5 on
> linux at all :)
But the root disk might fail too... This way, the system can't be taken down
by any single disk crash.
Mikulas
* Re: RAID-5 design bug (or misfeature)
From: Bernd Eckenfels @ 2005-05-30 3:00 UTC (permalink / raw)
To: linux-kernel
On Mon, May 30, 2005 at 04:47:58AM +0200, Mikulas Patocka wrote:
> But root disk might fail too... This way, the system can't be taken down
> by any single disk crash.
Yes, mirroring has good properties here: most boot loaders work with it, it
is less susceptible to silent corruption, and you can use a 1+0 configuration
for additional protection against multi-disk failures.
Greetings
Bernd
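[Editor's note: Bernd's 1+0 point can be made concrete with a small enumeration, added here as an illustration and not part of the thread. With four disks in two mirrored pairs, most two-disk failures are survivable, whereas a four-disk RAID-5 survives none of them:]

```python
from itertools import combinations

# Four disks arranged as two mirrored pairs, striped together (RAID-1+0).
pairs = [("d0", "d1"), ("d2", "d3")]

def survives(failed):
    # The array lives as long as no mirror pair loses both of its members.
    return all(not set(pair) <= set(failed) for pair in pairs)

failures = list(combinations(["d0", "d1", "d2", "d3"], 2))
survivable = [f for f in failures if survives(f)]
# 4 of the 6 possible two-disk failures leave the array running;
# a RAID-5 over the same four disks survives 0 of them.
```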
* Re: RAID-5 design bug (or misfeature)
From: Alan Cox @ 2005-05-30 11:55 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Bernd Eckenfels, Linux Kernel Mailing List
On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
> > > I think Linux should stop accessing all disks in RAID-5 array if two disks
> > > fail and not write "this array is dead" in superblocks on remaining disks,
> > > effectively destroying the whole array.
It discovered the disks had failed because they had outstanding I/O that
failed to complete and errored. At that point your stripes *are*
inconsistent. If it didn't mark them as failed then you wouldn't know it
was corrupted after a power restore. You can then clean it, fsck it,
restore it, and use mdadm as appropriate to restore the volume and check it.
> But root disk might fail too... This way, the system can't be taken down
> by any single disk crash.
It only takes one disk in an array shorting 12V and 5V due to a component
failure to total the entire disk array, and with both IDE and SCSI a
drive failure can hang the entire bus anyway.
Alan
* Re: RAID-5 design bug (or misfeature)
From: Stephen Frost @ 2005-05-30 13:23 UTC (permalink / raw)
To: Alan Cox; +Cc: Mikulas Patocka, Bernd Eckenfels, Linux Kernel Mailing List
* Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > > In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
> > > > I think Linux should stop accessing all disks in RAID-5 array if two disks
> > > > fail and not write "this array is dead" in superblocks on remaining disks,
> > > > effectively destroying the whole array.
>
> It discovered the disks had failed because they had outstanding I/O that
> failed to complete and errored. At that point your stripes *are*
> inconsistent. If it didn't mark them as failed then you wouldn't know it
> was corrupted after a power restore. You can then clean it, fsck it,
> restore it, use mdadm as appropriate to restore the volume and check it.
Could that I/O be backed out when it's discovered that there are too many
dead disks for the array to be kept online anymore?
Just a thought,
Stephen
* Re: RAID-5 design bug (or misfeature)
From: Mikulas Patocka @ 2005-05-30 16:09 UTC (permalink / raw)
To: Alan Cox; +Cc: Bernd Eckenfels, Linux Kernel Mailing List
On Mon, 30 May 2005, Alan Cox wrote:
> On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > > In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
> > > > I think Linux should stop accessing all disks in RAID-5 array if two disks
> > > > fail and not write "this array is dead" in superblocks on remaining disks,
> > > > effectively destroying the whole array.
>
> It discovered the disks had failed because they had outstanding I/O that
> failed to complete and errored.
I think that's another problem --- when RAID-5 is operating in degraded
mode, the machine must not crash or the volume will be damaged (sectors
that were not written may be damaged this way). Did anybody develop some
method to deal with this (i.e. something like journaling on RAID)? What
do hardware RAID controllers do in this situation?
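[Editor's note: the hazard Mikulas describes is commonly called the RAID-5 "write hole": a crash between a data write and the matching parity update leaves the stripe inconsistent, and a later degraded-mode rebuild then fabricates wrong data silently. A toy illustration, added here with made-up block contents:]

```python
from functools import reduce

def xor(blocks):
    # Bytewise XOR, the same operation RAID-5 uses for its parity.
    return bytes(reduce(lambda a, b: a ^ b, c) for c in zip(*blocks))

d0, d1 = b"old0", b"old1"     # a consistent two-disk stripe
p = xor([d0, d1])             # plus its parity block

# A write to d0 reaches the platter, but the machine crashes before the
# matching parity update -- p is now stale (the "write hole").
d0 = b"new0"

# After restart, the disk holding d1 fails; md rebuilds d1 from d0 and p.
d1_rebuilt = xor([d0, p])
assert d1_rebuilt != b"old1"  # silently wrong data, with no I/O error
```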
> At that point your stripes *are*
> inconsistent. If it didn't mark them as failed then you wouldn't know it
> was corrupted after a power restore. You can then clean it, fsck it,
> restore it,
> use mdadm as appropriate to restore the volume and check it.
I can't, because mdadm is on that volume... I solved it by booting from a
floppy and editing the RAID superblocks with a disk hex editor, but not every
user wants to do that; there should at least be a kernel boot parameter for it.
> > But root disk might fail too... This way, the system can't be taken down
> > by any single disk crash.
>
> It only takes one disk in an array to short 12v and 5v due to a component
> failure to total the entire disk array, and with both IDE and SCSI a
> drive fail can hang the entire bus anyway.
I meant mechanical failure, which is more common. Of course --- everything
can happen in case of an electrical failure in the disk/controller/bus/
mainboard...
Mikulas
> Alan
>
* Re: RAID-5 design bug (or misfeature)
From: Helge Hafting @ 2005-05-31 8:05 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Linux Kernel Mailing List
Mikulas Patocka wrote:
>
>I think that's another problem --- when RAID-5 is operating in degraded
>mode, the machine must not crash or volume will be damaged (sectors
>that were not written may be damaged this way). Did anybody develop some
>method to care about this (i.e. something like journaling on raid)? What
>do hardware RAID controllers do in this situation?
>
>
Hot spares can keep the degraded time to a minimum. If you want to
keep the risk to a minimum, unmount the RAID fs until it is
resynchronized. If you need more safety, there are options like RAID-6
or mirrors of the entire RAID-5 set.
Some hw controllers have a battery-backed cache. Even a power loss
won't ruin the RAID --- the I/O will simply sit in that cache until the
disks become available again. The I/O operation that was in effect when
power was lost can then be retried. Not that this saves you from everything:
the fs could be inconsistent anyway due to the OS being killed in the
middle of its updates. A journalled fs can help with that, though.
Helge Hafting
* Re: RAID-5 design bug (or misfeature)
From: Pavel Machek @ 2005-05-31 21:39 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List
Hi!
> > At that point your stripes *are*
> > inconsistent. If it didn't mark them as failed then you wouldn't know it
> > was corrupted after a power restore. You can then clean it, fsck it,
> > restore it,
> > use mdadm as appropriate to restore the volume and check it.
>
> I can't because mdadm is on that volume ... I solved it by booting from
> floppy and editing raid superblocks with disk hexeditor but not every user
> wants to do it; there should be at least kernel boot parameter for
> it.
Well, you should not use hexedit... just boot from a rescue CD and run
mdadm from it. No need to pollute the kernel with that one.
Pavel
* Re: RAID-5 design bug (or misfeature)
From: Mikulas Patocka @ 2005-06-01 1:43 UTC (permalink / raw)
To: Pavel Machek; +Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List
On Tue, 31 May 2005, Pavel Machek wrote:
> Hi!
>
> > > At that point your stripes *are*
> > > inconsistent. If it didn't mark them as failed then you wouldn't know it
> > > was corrupted after a power restore. You can then clean it, fsck it,
> > > restore it,
> > > use mdadm as appropriate to restore the volume and check it.
> >
> > I can't because mdadm is on that volume ... I solved it by booting from
> > floppy and editing raid superblocks with disk hexeditor but not every user
> > wants to do it; there should be at least kernel boot parameter for
> > it.
>
> Well, you should not use hexedit... just boot from a rescue CD and run
> mdadm from it. No need to pollute the kernel with that one.
Hi!
I think editing the superblock with hexedit is less dangerous than using
raid-tools --- with an editor I know what changes I have made and I can
revert them. With raid-tools, if you create a wrong /etc/raidtab (the
original was on the failed volume too), it will trash the superblocks
completely.
I still think it's stupid that Linux modifies RAID superblocks into an
irreversible state.
BTW, that server doesn't have a CD drive. It was installed from the network.
Mikulas
> Pavel
>
* Re: RAID-5 design bug (or misfeature)
From: Bill Davidsen @ 2005-06-01 18:18 UTC (permalink / raw)
To: Alan Cox; +Cc: Bernd Eckenfels, Linux Kernel Mailing List
Alan Cox wrote:
> On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
>
>>>In article <Pine.LNX.4.58.0505300043540.5305@artax.karlin.mff.cuni.cz> you wrote:
>>>
>>>>I think Linux should stop accessing all disks in RAID-5 array if two disks
>>>>fail and not write "this array is dead" in superblocks on remaining disks,
>>>>effectively destroying the whole array.
>
>
> It discovered the disks had failed because they had outstanding I/O that
> failed to complete and errored. At that point your stripes *are*
> inconsistent. If it didn't mark them as failed then you wouldn't know it
> was corrupted after a power restore. You can then clean it, fsck it,
> restore it, use mdadm as appropriate to restore the volume and check it.
>
>
>>But root disk might fail too... This way, the system can't be taken down
>>by any single disk crash.
>
>
> It only takes one disk in an array to short 12v and 5v due to a component
> failure to total the entire disk array, and with both IDE and SCSI a
> drive fail can hang the entire bus anyway.
Having something called "the entire bus" is more common on SCSI than IDE
(at least well-configured IDE), unless you mean the PCI bus. I regularly
used to see failures of one drive which made the SCSI controller decide
that one other drive was bad. Fortunately some change in either the
drive or controller (IBM ServeRAID) has made that a non-problem.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979