Software RAID 5 and crashes

public inbox for linux-kernel@vger.kernel.org

* Software RAID 5 and crashes
@ 2004-08-01 10:19 Xuân Baldauf
  2004-08-01 11:54 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Xuân Baldauf @ 2004-08-01 10:19 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,

I have searched the documentation and mailing lists extensively, 
but have not yet been able to answer this question:

Does Linux software RAID 5 (or RAID 4) do ordered writes? (data stripes 
first, then parity stripes)

Because if the writes are not ordered, parity stripes could be written 
before data stripes. If the system crashes at this time, reconstruction 
will try to reconstruct the parity stripes by using the wrong (old) 
data stripes.

If the writes are ordered, a crash after the write of the data stripe 
but before the write of the parity stripe does no harm.

Thanks,

Xuân Baldauf.

* Re: Software RAID 5 and crashes
  2004-08-01 10:19 Software RAID 5 and crashes Xuân Baldauf
@ 2004-08-01 11:54 ` Neil Brown
  2004-08-01 14:01   ` Xuân Baldauf
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2004-08-01 11:54 UTC (permalink / raw)
  To: Xuân Baldauf; +Cc: Linux Kernel Mailing List

On Sunday August 1, xuan--2004.08.01--linux-kernel--vger.kernel.org@baldauf.org wrote:
> Hello,
> 
> I have searched the documentation and mailing lists extensively, 
> but have not yet been able to answer this question:
> 
> Does Linux software RAID 5 (or RAID 4) do ordered writes? (data stripes 
> first, then parity stripes)

No, it doesn't impose any ordering between writes of parity and data
in the same stripe, and it would not have any material effect on any
outcomes if it did.

> 
> Because if the writes are not ordered, parity stripes could be written 
> before data stripes. If the system crashes at this time, reconstruction 
> will try to reconstruct the parity stripes by using the wrong (old) 
> data stripes.
> 
> If the writes are ordered, a crash after the write of the data stripe 
> but before the write of the parity stripe does no harm.

When the system crashes, the RAID5 manager assumes that all data blocks
are correct and all parity blocks are suspect.  It checks all parity
blocks against the corresponding data and corrects those that don't
match.
If a write is "in progress" - i.e. it has started but not all data and
parity have been written - then the "old" data and the "new" data
are equally correct.  The only thing that needs to be guaranteed after
a crash, and the only thing that can be guaranteed, is that any data
that has been reported as "safe-in-storage" really is safe.  That is
all that journalling filesystems, or anything else, assume.
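
As an illustration only (a minimal Python sketch, not the md driver's
actual code), the check described above could look like this.  The
helper names xor_blocks and resync_stripe and the sample bytes are
assumptions made for the example:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def resync_stripe(data_blocks, parity):
    """Trust the data blocks and recompute parity from them.

    Returns the correct parity and a flag saying whether the stored
    parity had to be rewritten.
    """
    expected = xor_blocks(data_blocks)
    return expected, expected != parity

# Hypothetical stripe: three data blocks and a stale parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stale_parity = b"\x00\x00"
parity, rewritten = resync_stripe(data, stale_parity)
```

The data blocks are never modified here; only the parity is brought
back in line with whatever data reached the disks.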

Hope that helps.

NeilBrown

* Re: Software RAID 5 and crashes
  2004-08-01 11:54 ` Neil Brown
@ 2004-08-01 14:01   ` Xuân Baldauf
  2004-08-01 15:08     ` Bernd Eckenfels
  2004-08-02  6:13     ` Neil Brown
  0 siblings, 2 replies; 5+ messages in thread
From: Xuân Baldauf @ 2004-08-01 14:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linux Kernel Mailing List

Neil Brown wrote:

>On Sunday August 1, xuan--2004.08.01--linux-kernel--vger.kernel.org@baldauf.org wrote:
>  
>
>>Hello,
>>
>>I have searched the documentation and mailing lists extensively, 
>>but have not yet been able to answer this question:
>>
>>Does Linux software RAID 5 (or RAID 4) do ordered writes? (data stripes 
>>first, then parity stripes)
>>    
>>
>
>No, it doesn't impose any ordering between writes of parity and data
>in the same stripe, and it would not have any material effect on any
>outcomes if it did.
>  
>
I disagree. :-)

>  
>
>>Because if the writes are not ordered, parity stripes could be written 
>>before data stripes. If the system crashes at this time, reconstruction 
>>will try to reconstruct the parity stripes by using the wrong (old) 
>>data stripes.
>>
>>If the writes are ordered, a crash after the write of the data stripe 
>>but before the write of the parity stripe does no harm.
>>    
>>
>
>When the system crashes, the RAID5 manager assumes that all data blocks
>are correct and all parity blocks are suspect.  It checks all parity
>blocks against the corresponding data and corrects those that don't
>match.
>If a write is "in progress" - i.e. it has started but not all data and
>parity have been written - then the "old" data and the "new" data
>are equally correct.  The only thing that needs to be guaranteed after
>a crash, and the only thing that can be guaranteed, is that any data
>that has been reported as "safe-in-storage" really is safe.  That is
>all that journalling filesystems, or anything else, assume.
>
>Hope that helps.
>  
>
Yes, thank you, the clear statement that writes are not ordered helps. 
:-) It is also a relief to read that data blocks are always preferred 
over parity blocks, so that data blocks can never become scrambled by 
mismatched other data blocks and parity blocks (at least in non-degraded 
mode).

Unfortunately, I am still not satisfied, because the 
asymmetry of "all data blocks are correct, all parity blocks are 
suspect" should be exploited.
Consider 4 disks joined as RAID 5. There are 4 stripes (s0, s1, s2, s3), 
where s3 is the parity stripe.

    * Example 0: s3,s2,s1 are written to disk, while s0 is not written
      to disk for some reason. The system crashes. What happens at
      reconstruction? s3 gets replaced by s0 XOR s1 XOR s2. s0 contains
      old (read: wrong) data.
    * Example 1: s0,s1,s2 are written to disk, while s3 is not written
      to disk for some reason. The system crashes. What happens at
      reconstruction? s3 gets replaced by s0 XOR s1 XOR s2. s0 does not
      contain old data. The stripe which contains the old data (s3) is
      replaced anyway during reconstruction.
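
The two examples can be played through in a small Python sketch. This 
is only a toy model under stated assumptions: one byte per block, the 
hypothetical helpers xor_blocks and crash_then_resync, and a resync 
that always recomputes s3 from s0, s1 and s2 as described earlier in 
the thread.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def crash_then_resync(old, new, written):
    """Return the on-disk blocks after a crash followed by a resync.

    old, new: dicts mapping "s0".."s2" to their old and new contents.
    written:  names of the blocks that reached the disk before the crash
              ("s3" is the parity block; its state does not matter,
              because resync rewrites it from the data blocks).
    """
    names = ["s0", "s1", "s2"]
    on_disk = {n: (new[n] if n in written else old[n]) for n in names}
    on_disk["s3"] = xor_blocks([on_disk[n] for n in names])
    return on_disk

old = {"s0": b"A", "s1": b"B", "s2": b"C"}
new = {"s0": b"a", "s1": b"b", "s2": b"c"}

example0 = crash_then_resync(old, new, {"s1", "s2", "s3"})  # s0 not written
example1 = crash_then_resync(old, new, {"s0", "s1", "s2"})  # s3 not written
```

In both cases the stripe is parity-consistent after resync; the only 
difference is that example0 keeps the old s0 while example1 holds all 
of the new data.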

Now consider that each disk (as a member of a RAID 5) holds both 
parity stripes and data stripes. Doesn't it make sense to 
prefer data blocks over parity blocks when writing, just to get more 
cases of "example 1" and fewer of "example 0" than without this preference?

One could even imagine intentionally postponing the parity block 
writes for some time in favour of peak throughput. The RAID 5 device 
loses its redundancy for some bounded time over a bounded region of its 
space, but this may be acceptable for certain applications, I think.

>NeilBrown
>
>  
>
Xuân Baldauf.



* Re: Software RAID 5 and crashes
  2004-08-01 14:01   ` Xuân Baldauf
@ 2004-08-01 15:08     ` Bernd Eckenfels
  2004-08-02  6:13     ` Neil Brown
  1 sibling, 0 replies; 5+ messages in thread
From: Bernd Eckenfels @ 2004-08-01 15:08 UTC (permalink / raw)
  To: linux-kernel

In article <410CF7AA.2020604@baldauf.org> you wrote:
> Unfortunately, I am still not satisfied, because the 

IMHO the current RAID5 implementation could be better in terms of crash
recovery, BUT one should not forget that RAID5 is simply pretty bad even if
ideally coded with ordered writes and transaction sequence numbers. Therefore
it is perhaps better to spend more time on alternatives, or at least on
communicating the inherent problems of RAID5 to users who want more
(as you seem to need?)

>    * Example 0: s3,s2,s1 are written to disk, while s0 is not written
>    * Example 1: s0,s1,s2 are written to disk, while s3 is not written

One has to state clearly that a failure to commit RAID stripes in a
crash results in guaranteed data loss even when the data is written redundantly.
This is one of the RAID5 problems, and it is even worse in a degraded
scenario.
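
To make the degraded case concrete, here is a small Python sketch (a toy
model with one byte per block; xor_blocks and the sample values are
assumptions for illustration, not md code). With one disk missing, its
block has to be reconstructed from the surviving blocks and parity, so
an interrupted write damages a block that was never written:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A consistent stripe: three data blocks and their parity.
s0, s1, s2 = b"A", b"B", b"C"
parity = xor_blocks([s0, s1, s2])

# The disk holding s2 fails, so s2 only exists implicitly as
# s0 XOR s1 XOR parity.
assert xor_blocks([s0, s1, parity]) == s2

# A new s0 reaches the disk, but the system crashes before the
# matching parity update is written.
s0_new = b"x"
reconstructed_s2 = xor_blocks([s0_new, s1, parity])

# The reconstruction no longer yields the original s2.
s2_lost = reconstructed_s2 != s2
```

In non-degraded mode a resync would simply recompute the parity; here
there is no copy of s2 left to recompute it from.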

> Now consider that each disk (as a member of a RAID 5) holds both 
> parity stripes and data stripes. Doesn't it make sense to 
> prefer data blocks over parity blocks when writing, just to get more 
> cases of "example 1" and fewer of "example 0" than without this preference?

I guess it depends on what "prefer" means. Are you thinking of write ordering
with a performance impact, or of some minor tweaking that possibly has no effect
(i.e. because the sorted requests will be reordered in the controller and
device anyway)?

> One could even imagine intentionally postponing the parity block 
> writes for some time in favour of peak throughput.

That would only help if you also defer declaring the data stable until
that delayed time, which is for sure a performance killer.

> The RAID 5 device 
> loses its redundancy for some bounded time over a bounded region of its 
> space, but this may be acceptable for certain applications, I think.

Well, I can't imagine applications which are sensitive to losing random file
appends without a transaction protocol, but are not sensitive to degraded
redundancy?

BTW: I don't think hardware RAID5 from any vendor performs much better.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

* Re: Software RAID 5 and crashes
  2004-08-01 14:01   ` Xuân Baldauf
  2004-08-01 15:08     ` Bernd Eckenfels
@ 2004-08-02  6:13     ` Neil Brown
  1 sibling, 0 replies; 5+ messages in thread
From: Neil Brown @ 2004-08-02  6:13 UTC (permalink / raw)
  To: Xuân Baldauf; +Cc: Linux Kernel Mailing List

On Sunday August 1, xuan--2004.08.01--linux-kernel--vger.kernel.org@baldauf.org wrote:
> 
> Unfortunately, I am still not satisfied, because the 
> asymmetry of "all data blocks are correct, all parity blocks are 
> suspect" should be exploited.
> Consider 4 disks joined as RAID 5. There are 4 stripes (s0, s1, s2, s3), 
> where s3 is the parity stripe.
> 
>     * Example 0: s3,s2,s1 are written to disk, while s0 is not written
>       to disk for some reason. The system crashes. What happens at
>       reconstruction? s3 gets replaced by s0 XOR s1 XOR s2. s0 contains
>       old (read: wrong) data.

Define "wrong"....

Suppose it had actually crashed a couple of milliseconds earlier,
when none of the blocks had been written.  Is that any more wrong? Or
less?

Whenever a machine crashes, it is wrong.  Whenever a machine crashes
you lose data.  A little more or a little less data being "lost"
is neither here nor there.  The only thing that it really makes sense
to worry about is consistency.  RAID5 provides all the consistency it
can, and leaves the rest up to higher layers.


>     * Example 1: s0,s1,s2 are written to disk, while s3 is not written
>       to disk for some reason. The system crashes. What happens at
>       reconstruction? s3 gets replaced by s0 XOR s1 XOR s2. s0 does not
>       contain old data. The stripe which contains the old data (s3) is
>       replaced anyway during reconstruction.
> 
> Now consider that each disk (as a member of a RAID 5) holds both 
> parity stripes and data stripes. Doesn't it make sense to 
> prefer data blocks over parity blocks when writing, just to get more 
> cases of "example 1" and fewer of "example 0" than without this
> preference?

No, and definitely not at the cost of delaying any writes or
complicating the code at all.

NeilBrown
