* RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-23 23:46 UTC
To: linux-raid

Okay, it's well overdue to do this...

There is currently a data-corruption bug in the RAID-6 md layer
(raid6main.c).  I have so far not been successful in locating it,
although it is easily reproducible, thanks to a set of scripts by Jim
Paris.  I suspect it is a race condition between raid6d and the rest
of the kernel.

The problem is that I already have way too many projects in flight,
and I *really* would like to focus the time I have on getting
klibc/early userspace integrated.  Thus, I don't expect to have any
time at chasing the RAID-6 corruption bug any time soon.

What can be ruled out is errors in the RAID-6 algorithm codes,
however.  Those are easily testable in isolation and have a clean
bill of health.

	-hpa

* Re: RAID-6: help wanted
From: Brad Campbell @ 2004-10-24 5:26 UTC
To: H. Peter Anvin; +Cc: linux-raid

H. Peter Anvin wrote:
> Okay, it's well overdue to do this...
>
> There is currently a data-corruption bug in the RAID-6 md layer
> (raid6main.c). I have so far not been successful in locating it,
> although it is easily reproducible, thanks to a set of scripts by Jim
> Paris. I suspect it is a race condition between raid6d and the rest
> of the kernel.
>
> The problem is that I already have way too many projects in flight,
> and I *really* would like to focus the time I have on getting
> klibc/early userspace integrated. Thus, I don't expect to have any
> time at chasing the RAID-6 corruption bug any time soon.
>
> What can be ruled out is errors in the RAID-6 algorithm codes,
> however. Those are easily testable in isolation and have a clean bill
> of health.

Great, can we get a copy of the scripts to try and assist?

Brad

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-24 6:46 UTC
To: Brad Campbell; +Cc: linux-raid

> > There is currently a data-corruption bug in the RAID-6 md layer
> > (raid6main.c). I have so far not been successful in locating it,
> > although it is easily reproducible, thanks to a set of scripts by Jim
> > Paris. I suspect it is a race condition between raid6d and the rest
> > of the kernel.
>
> Great, can we get a copy of the scripts to try and assist?

Sure:

---
Date: Fri, 6 Aug 2004 00:04:39 -0400
From: Jim Paris <jim@jtan.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Kernel panic, FS corruption Was: Re: Call for RAID-6 users

> If you can reproduce it with ext2/3 it would make debugging simpler,
> because I understand the ext code and data structures a lot better.

This demonstrates it on ext2.  I can't seem to reproduce it with just
simple use of 'dd', but it shows up if I untar a ton of data.

This script:
 - creates five 100MB "disks" through loopback
 - puts them in a six-disk RAID-6 array (resulting size=400MB, degraded)
 - untars about 350MB of data to the array
 - runs e2fsck, which shows filesystem errors

Usage:
 - put r6ext.sh and big.tar.bz2 in a directory
 - run r6ext.sh as root

Sorry for the huge files, but e2fsck didn't show any problems when I
scaled everything down by a factor of 10.  You could probably make
your own big.tar.bz2 and see the same problem, as there's nothing
special about this data.

http://stonewall.mit.edu/~jim/r6ext.sh
http://stonewall.mit.edu/~jim/big.tar.bz2  (77MB)

-jim

[parent not found: <417B546C.3050706@zytor.com>]
* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-24 7:09 UTC
To: Brad Campbell; +Cc: linux-raid

H. Peter Anvin wrote:
> Brad Campbell wrote:
>>
>> Great, can we get a copy of the scripts to try and assist?
>>
>> Brad
>
> In the process of being uploaded to:
>
> 	http://www.zytor.com/~hpa/raid6-bug/
>
> It'll be about 40 minutes until it's uploaded, because of the 80 MB data
> file.
>
> This will create a RAID-6 volume on a loopback device, fail a unit, and
> then write a bunch of stuff, which will corrupt the filesystem.  The
> corruption only happens when writing to a volume that is running in
> degraded mode; it seems to happen both with 1 and 2 disks missing.

Changed to:

	http://userweb.kernel.org/~hpa/raid6-bug/

(It's a faster server, but it'll still take about 40 minutes from now
until it's uploaded, but then it can at least be a quicker download.)

	-hpa

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-24 7:17 UTC
To: H. Peter Anvin; +Cc: Brad Campbell, linux-raid

H. Peter Anvin wrote:
>>
>> It'll be about 40 minutes until it's uploaded, because of the 80 MB
>> data file.
>>
>> This will create a RAID-6 volume on a loopback device, fail a unit,
>> and then write a bunch of stuff, which will corrupt the filesystem.
>> The corruption only happens when writing to a volume that is running
>> in degraded mode; it seems to happen both with 1 and 2 disks missing.

For all I know, too, the data file can be any large collection of
small-to-medium size files (I think it currently contains a Linux
kernel build tree), so downloading it all might not be necessary.

	-hpa

* Re: RAID-6: help wanted
From: Brad Campbell @ 2004-10-24 7:18 UTC
To: H. Peter Anvin; +Cc: RAID Linux

H. Peter Anvin wrote:
> Changed to:
>
> 	http://userweb.kernel.org/~hpa/raid6-bug/
>
> (It's a faster server, but it'll still take about 40 minutes from now
> until it's uploaded, but then it can at least be a quicker download.)

It will make no difference to me.  I'm in the desert in the clutches
of a monopolistic telecommunications carrier that seems to use
brine-soaked string as its backbone :p)

Ta for the heads-up.

Regards,
Brad

* Re: RAID-6: help wanted
From: Neil Brown @ 2004-10-25 5:41 UTC
To: H. Peter Anvin, Jim Paris; +Cc: linux-raid

On Saturday October 23, hpa@zytor.com wrote:
> Okay, it's well overdue to do this...
>
> There is currently a data-corruption bug in the RAID-6 md layer
> (raid6main.c). I have so far not been successful in locating it,
> although it is easily reproducible, thanks to a set of scripts by Jim
> Paris. I suspect it is a race condition between raid6d and the rest
> of the kernel.

Does this fix the problem?
It is definitely wrong as it is.

NeilBrown

----------- Diffstat output ------------
 ./drivers/md/raid6main.c |    1 -
 1 files changed, 1 deletion(-)

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2004-10-22 16:14:11.000000000 +1000
+++ ./drivers/md/raid6main.c	2004-10-25 15:39:36.000000000 +1000
@@ -734,7 +734,6 @@ static void compute_parity(struct stripe
 	case READ_MODIFY_WRITE:
 		BUG();		/* READ_MODIFY_WRITE N/A for RAID-6 */
 	case RECONSTRUCT_WRITE:
-	case UPDATE_PARITY:	/* Is this right? */
 		for (i= disks; i-- ;)
 			if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) {
 				chosen = sh->dev[i].towrite;

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-25 6:20 UTC
To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

> Does this fix the problem?
> It is definitely wrong as it is.
> ..
> 	case RECONSTRUCT_WRITE:
> -	case UPDATE_PARITY:	/* Is this right? */

No, it does not.  Same sort of corruption with that line removed.

-jim

* Re: RAID-6: help wanted
From: Neil Brown @ 2004-10-25 6:24 UTC
To: Jim Paris; +Cc: H. Peter Anvin, linux-raid

On Monday October 25, jim@jtan.com wrote:
> > Does this fix the problem?
> > It is definitely wrong as it is.
> >
> ..
> > 	case RECONSTRUCT_WRITE:
> > -	case UPDATE_PARITY:	/* Is this right? */
>
> No, it does not.  Same sort of corruption with that line removed.
>

Thanks.  I'll try harder...

Looking at the script, the corruption happens while resync is still
going on.  If you wait for resync to finish, do you still get
corruption?

NeilBrown

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-25 6:33 UTC
To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

> > > -	case UPDATE_PARITY:	/* Is this right? */
> >
> > No, it does not.  Same sort of corruption with that line removed.
>
> Thanks.  I'll try harder...
>
> Looking at the script, the corruption happens while resync is still
> going on.  If you wait for resync to finish, do you still get
> corruption?

The problem shows up when the array is degraded (either 1 or 2 disks
missing), regardless of resync.  If the array is not degraded,
everything works fine, again regardless of resync.

-jim

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-25 14:23 UTC
To: linux-raid

Followup to:  <16764.39990.547866.885761@cse.unsw.edu.au>
By author:    Neil Brown <neilb@cse.unsw.edu.au>
In newsgroup: linux.dev.raid
>
> On Monday October 25, jim@jtan.com wrote:
> > > Does this fix the problem?
> > > It is definitely wrong as it is.
> > >
> > ..
> > > 	case RECONSTRUCT_WRITE:
> > > -	case UPDATE_PARITY:	/* Is this right? */
> >
> > No, it does not.  Same sort of corruption with that line removed.
> >
>
> Thanks.  I'll try harder...
>
> Looking at the script, the corruption happens while resync is still
> going on.  If you wait for resync to finish, do you still get
> corruption?
>

The version of the script I posted does wait for the resync to finish
before writing; it still exhibits the same problem.

	-hpa

* Re: RAID-6: help wanted
From: Neil Brown @ 2004-10-27 3:38 UTC
To: Jim Paris; +Cc: H. Peter Anvin, linux-raid

On Monday October 25, jim@jtan.com wrote:
> > Does this fix the problem?
> > It is definitely wrong as it is.
> >
> ..
> > 	case RECONSTRUCT_WRITE:
> > -	case UPDATE_PARITY:	/* Is this right? */
>
> No, it does not.  Same sort of corruption with that line removed.

Ok, take-2.  I've tested this one myself and it does seem to fix it.

The problem is that it is sometimes using parity to reconstruct a
block, when not all of the blocks have been read in.

In raid5, there are two choices for write - reconstruct-write or
read-modify-write.

If there are any failed drives, it always chooses read-modify-write
and so only has to read data from good drives.

raid6 only allows for reconstruct-write, so if it ever writes to an
array with a failed drive, it must read all blocks and reconstruct the
missing blocks before allowing the write.
As this is something that raid5 didn't have to care about, and as the
raid6 code was based on the raid5 code, it is easy to see how this
case was missed.

The following patch added a bit of tracing to track other cases
(hopefully non-existent) where calculations are done using
non-existent data, and make sure the required blocks are pre-read.
Possible this code (in handle_stripe) needs a substantial clean up...

I'll wait for comments and further testing before I forward it to
Andrew.

NeilBrown

========================================================
Fix raid6 problem

Sometimes it didn't read all (working) drives before a parity
calculation.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid6main.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2004-10-22 16:14:11.000000000 +1000
+++ ./drivers/md/raid6main.c	2004-10-27 13:11:26.000000000 +1000
@@ -734,7 +734,6 @@ static void compute_parity(struct stripe
 	case READ_MODIFY_WRITE:
 		BUG();		/* READ_MODIFY_WRITE N/A for RAID-6 */
 	case RECONSTRUCT_WRITE:
-	case UPDATE_PARITY:	/* Is this right? */
 		for (i= disks; i-- ;)
 			if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) {
 				chosen = sh->dev[i].towrite;
@@ -770,7 +769,8 @@ static void compute_parity(struct stripe
 	i = d0_idx;
 	do {
 		ptrs[count++] = page_address(sh->dev[i].page);
-
+		if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			printk("block %d/%d not uptodate on parity calc\n", i,count);
 		i = raid6_next_disk(i, disks);
 	} while ( i != d0_idx );
 //	break;
@@ -818,7 +818,7 @@ static void compute_block_1(struct strip
 		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
 			ptr[count++] = p;
 		else
-			PRINTK("compute_block() %d, stripe %llu, %d"
+			printk("compute_block() %d, stripe %llu, %d"
 			       " not present\n", dd_idx,
 			       (unsigned long long)sh->sector, i);
@@ -875,6 +875,9 @@ static void compute_block_2(struct strip
 	do {
 		ptrs[count++] = page_address(sh->dev[i].page);
 		i = raid6_next_disk(i, disks);
+		if (i != dd_idx1 && i != dd_idx2 &&
+		    !test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			printk("compute_2 with missing block %d/%d\n", count, i);
 	} while ( i != d0_idx );

 	if ( failb == disks-2 ) {
@@ -1157,17 +1160,15 @@ static void handle_stripe(struct stripe_
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks))) {
+	if (to_read || non_overwrite || (to_write && failed) || (syncing && (uptodate < disks))) {
 		for (i=disks; i--;) {
 			dev = &sh->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 			    (dev->toread ||
 			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
 			     syncing ||
-			     (failed >= 1 && (sh->dev[failed_num[0]].toread ||
-					      (sh->dev[failed_num[0]].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num[0]].flags)))) ||
-			     (failed >= 2 && (sh->dev[failed_num[1]].toread ||
-					      (sh->dev[failed_num[1]].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num[1]].flags))))
+			     (failed >= 1 && (sh->dev[failed_num[0]].toread || to_write)) ||
+			     (failed >= 2 && (sh->dev[failed_num[1]].toread || to_write))
 			    )
 				) {
 				/* we would like to get this block, possibly

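The dependency Neil describes can be seen in a small stand-alone
sketch (illustrative C only: toy_stripe, NDATA and reconstruct_write()
are made-up names for this example, not the md driver's structures;
the multiply-by-2 uses the GF(2^8) polynomial 0x11d from the RAID-6
paper).  Because P and Q are recomputed from every data block in the
stripe, a write to a degraded array has to read all surviving blocks
and rebuild any missing one first:

	#include <stdint.h>
	#include <stdbool.h>
	#include <assert.h>

	#define NDATA 4                 /* data disks in the toy stripe (assumption) */

	struct toy_stripe {
		uint8_t data[NDATA];    /* one byte per data disk, for illustration */
		bool    uptodate[NDATA];
		uint8_t p, q;
	};

	/* multiply by 2 in GF(2^8), polynomial 0x11d */
	static uint8_t gf2_mul2(uint8_t v)
	{
		return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
	}

	/*
	 * Reconstruct-write: P and Q are recomputed from *all* data blocks
	 * (Q via Horner's rule, high disk to low).  If any block is not up
	 * to date, e.g. it lives on a failed disk and has not been rebuilt
	 * yet, the result is garbage.
	 */
	static void reconstruct_write(struct toy_stripe *sh)
	{
		uint8_t p = 0, q = 0;
		for (int i = NDATA - 1; i >= 0; i--) {
			assert(sh->uptodate[i]);   /* every block must be read or rebuilt first */
			q = gf2_mul2(q) ^ sh->data[i];
			p ^= sh->data[i];
		}
		sh->p = p;
		sh->q = q;
	}

The assert marks the invariant the patch enforces by extending the
pre-read condition in handle_stripe() to cover "to_write && failed".
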
* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-27 5:23 UTC
To: Neil Brown; +Cc: Jim Paris, linux-raid

Neil Brown wrote:
>
> Ok, take-2.  I've tested this one myself and it does seem to fix it.
>
> The problem is that it is sometimes using parity to reconstruct a
> block, when not all of the blocks have been read in.
>
> In raid5, there are two choices for write - reconstruct-write or
> read-modify-write.
>
> If there are any failed drives, it always chooses read-modify-write
> and so only has to read data from good drives.
>
> raid6 only allows for reconstruct-write, so if it ever writes to an
> array with a failed drive, it must read all blocks and reconstruct the
> missing blocks before allowing the write.
> As this is something that raid5 didn't have to care about, and as the
> raid6 code was based on the raid5 code, it is easy to see how this
> case was missed.
>
> The following patch added a bit of tracing to track other cases
> (hopefully non-existent) where calculations are done using
> non-existent data, and make sure the required blocks are pre-read.
> Possible this code (in handle_stripe) needs a substantial clean up...
>
> I'll wait for comments and further testing before I forward it to
> Andrew.

That makes sense (and definitely explains why I didn't find the
problem.)

I tried it out, and it seems much better now.  It does, however, still
seem to have a problem:

+ e2fsck -nf /dev/md6
e2fsck 1.35 (28-Feb-2004)
Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s).  Clear? no

Illegal block #-1 (33619968) in inode 7.  IGNORED.
Error while iterating over blocks in inode 7: Illegal indirect block found
e2fsck: aborted

Inode 7 is a special-use inode:

#define EXT3_RESIZE_INO		 7	/* Reserved group descriptors inode */

This is running the version of the r6ext.sh script that I posted, with
the same datafile, on a PowerMac.

	-hpa

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-27 6:00 UTC
To: H. Peter Anvin; +Cc: Neil Brown, linux-raid

> That makes sense (and definitely explains why I didn't find the problem.)
>
> I tried it out, and it seems much better now.  It does, however, still
> seem to have a problem:
>
> + e2fsck -nf /dev/md6
> e2fsck 1.35 (28-Feb-2004)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 7 has illegal block(s).  Clear? no
>
> Illegal block #-1 (33619968) in inode 7.  IGNORED.
> Error while iterating over blocks in inode 7: Illegal indirect block found
> e2fsck: aborted

The patch (thanks, Neil!) seems to work fine for me with both the
ReiserFS and ext2 test scripts, on an x86, both with and without
waiting for resync.

-jim

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-27 6:03 UTC
To: Jim Paris; +Cc: Neil Brown, linux-raid

Jim Paris wrote:
>> That makes sense (and definitely explains why I didn't find the problem.)
>>
>> I tried it out, and it seems much better now.  It does, however, still
>> seem to have a problem:
>>
>> + e2fsck -nf /dev/md6
>> e2fsck 1.35 (28-Feb-2004)
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 7 has illegal block(s).  Clear? no
>>
>> Illegal block #-1 (33619968) in inode 7.  IGNORED.
>> Error while iterating over blocks in inode 7: Illegal indirect block found
>> e2fsck: aborted
>
> The patch (thanks, Neil!) seems to work fine for me with both the
> ReiserFS and ext2 test scripts, on an x86, both with and without
> waiting for resync.

Right, see previous; it seems to be an unrelated ppc64 problem that
happens even without RAID of any kind.  I'm building an i386 kernel
with the patch now to try it out.

FWIW, I also hacked up Altivec support for ppc/ppc64; it took all of a
whopping half-hour to make work, since gcc can generate Altivec code
and it's actually quite good at it.  The resulting code runs at a
whopping 6.1 GB/s on a 2.5 GHz 970.

	-hpa

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-28 1:17 UTC
To: Jim Paris; +Cc: Neil Brown, linux-raid

Jim Paris wrote:
>
> The patch (thanks, Neil!) seems to work fine for me with both the
> ReiserFS and ext2 test scripts, on an x86, both with and without
> waiting for resync.

Works here too, once I tried it on i386.  I'll look separately into why
ppc64 seems to have problems, but that's not an issue for RAID-6.

	-hpa

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-28 16:00 UTC
To: H. Peter Anvin; +Cc: Neil Brown, linux-raid

> Works here too, once I tried it on i386.  I'll look separately into why
> ppc64 seems to have problems, but that's not an issue for RAID-6.

Great.  I'll continue testing as I start using it on one of my file
servers.

Another issue: If I create a 6-disk RAID-6 array ...

  ... with 2 missing, no resync happens.
  ... with 1 missing, no resync happens.  (???)
  ... with 0 missing, resync happens.
  ... with 2 missing, then add 1, recovery happens.
  ... with 0 missing, then fail 1, resync continues.

Shouldn't resync happen in the created-with-1-disk-missing case?

-jim

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-28 17:48 UTC
To: Jim Paris; +Cc: Neil Brown, linux-raid

Jim Paris wrote:
>> Works here too, once I tried it on i386.  I'll look separately into why
>> ppc64 seems to have problems, but that's not an issue for RAID-6.
>
> Great.  I'll continue testing as I start using it on one of my file
> servers.
>
> Another issue: If I create a 6-disk RAID-6 array ...
>
>   ... with 2 missing, no resync happens.
>   ... with 1 missing, no resync happens.  (???)
>   ... with 0 missing, resync happens.
>   ... with 2 missing, then add 1, recovery happens.
>   ... with 0 missing, then fail 1, resync continues.
>
> Shouldn't resync happen in the created-with-1-disk-missing case?
>
> -jim

No, why would it?  You don't have anything to resync to.

	-hpa

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-28 17:49 UTC
To: Jim Paris; +Cc: Neil Brown, linux-raid

Jim Paris wrote:
>
> Another issue: If I create a 6-disk RAID-6 array ...
>
>   ... with 2 missing, no resync happens.
>   ... with 1 missing, no resync happens.  (???)
>   ... with 0 missing, resync happens.
>   ... with 2 missing, then add 1, recovery happens.
>   ... with 0 missing, then fail 1, resync continues.
>
> Shouldn't resync happen in the created-with-1-disk-missing case?

Nevermind, I guess it probably should, since there is still redundancy
and therefore it can be inconsistent.

	-hpa

* Re: RAID-6: help wanted
From: Neil Brown @ 2004-10-29 0:43 UTC
To: H. Peter Anvin; +Cc: Jim Paris, linux-raid

On Thursday October 28, hpa@zytor.com wrote:
> Jim Paris wrote:
> >
> > Another issue: If I create a 6-disk RAID-6 array ...
> >
> >   ... with 2 missing, no resync happens.
> >   ... with 1 missing, no resync happens.  (???)
> >   ... with 0 missing, resync happens.
> >   ... with 2 missing, then add 1, recovery happens.
> >   ... with 0 missing, then fail 1, resync continues.
> >
> > Shouldn't resync happen in the created-with-1-disk-missing case?
>
> Nevermind, I guess it probably should, since there is still redundancy
> and therefore it can be inconsistent.
>
> 	-hpa

I have a patch to mdadm to make it resync when there is one failure,
but I'm no longer convinced that it is needed.

In fact, the initial resync isn't really needed for raid6 (or raid1)
at all.  The first write to any stripe will make the redundancy for
that stripe correct regardless of what it was, and before the first
write, the content of the array is meaningless anyway.

Note that this is different to raid5 which, if using a
read-modify-write cycle, depends on the parity block being correct.

There would be an issue once we start doing background scans of the
arrays as the first scan could find lots of errors.  But maybe that
isn't a problem....

I'll probably include the patch in the next mdadm release, and revisit
the whole idea when (if) I implement background array scans.

NeilBrown

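The distinction Neil draws can be written out in the thread's own
notation ("+" is xor), as a worked contrast under the assumption of a
single changed data block D_1:

  reconstruct-write:    P' = D_1' + D_2 + ... + D_n     (old P is never read,
                                                          so any garbage parity
                                                          is simply overwritten)
  read-modify-write:    P' = P + D_1 + D_1'             (only correct if the
                                                          old P already matched
                                                          the data)

That is why a raid5 small write on an unsynced array can produce wrong
parity, while raid1 and raid6, which always rewrite the whole redundancy
for the stripe, become consistent on the first write.
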
* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-29 11:48 UTC
To: Neil Brown; +Cc: H. Peter Anvin, linux-raid

> I have a patch to mdadm to make it resync when there is one failure,
> but I'm no longer convinced that it is needed.
> In fact, the initial resync isn't really needed for raid6 (or raid1)
> at all.

I see.  I figured raid6 should just behave the same as raid5, but
didn't realize that raid5 only did it because of the read-modify-write.

However, I do still think that the data should always be synchronized.
Just because it's holding completely meaningless values at the moment
doesn't mean it should change each time you read it (which can easily
happen if the data gets read from different disks).  Other things
might eventually depend on the "meaningless" value.

Consider running raid5 on top of unsynced raid1 devices /dev/md[012]:

1. Do an initial resync of the raid5 so that D1 + D2 = P.

2. Write a real value to D1, and update P.  At this point, D1 and P
   are synced between disks of their raid1s, since they have been
   written.

3. Now lose the disk holding D1, so you need to reconstruct it from D2
   and P.  But you can't do that, because D2's value changes every
   time you read it!

-jim

* RE: RAID-6: help wanted
From: Guy @ 2004-10-29 12:56 UTC
To: 'Jim Paris', 'Neil Brown'; +Cc: 'H. Peter Anvin', linux-raid

I should read the messages in reverse order!  I just posted a similar
email, but I could not think of a good example, or a bad example. :)

As you said (in my words): RAID5 on top of RAID1 arrays would be bad if
the RAID1 arrays were not synced and one RAID1 array was to fail.

You also said:
"3. Now lose the disk holding D1, so you need to reconstruct it from D2
and P."

But losing D1 would require 2 disk failures, since a single disk
failure would not take a RAID1 array offline.  But other things can
fail that could cause the array to go offline.  So your point is still
valid.

But it does not require any failures to corrupt data.  Using your
example, once the RAID5 array is synced, if you were to modify a block,
the old block and the parity would be read, then the new parity would
be computed based on the old block and the new block, then both blocks
written.  You never know which disk a RAID1 array will read from.
Since the RAID1 was never synced, you can get different data on the
read of the old data, which will cause the new parity to be wrong.

1. Do an initial resync of the raid5 so that D1 + D2 = P.
2. Write a real value to D1, and update P.
   Since this requires a read from D1, you don't know which disk will
   be used; if a different disk is used than the one used during the
   initial sync, it will corrupt P.

Guy

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-29 18:15 UTC
To: Guy; +Cc: 'Neil Brown', 'H. Peter Anvin', linux-raid

> I just posted a similar email, but I could not think of a good example, or a
> bad example. :)

It's hard coming up with specific examples.  The RAID5-on-RAID1 is
probably the best one I can think of.  But there are other cases:
let's say I have a small RAID1 partition used for booting my system,
always mounted read-only, and to back it up, I do a "dd" of the entire
/dev/md0 (since exact block positions matter to boot-loaders).  If
uninitialized areas of the disk change every time I read them, my
scripts might conclude that the backups have changed and need to get
burned to CD when nothing actually changed on the array.

> But it does not require any failures to corrupt data.

Right.  Having uninitialized portions appear to randomly change
violates assumptions that other drivers (like raid5) and applications
(like my fictional backup script) make about a block device.

> If you insist to add this feature, please make it an option that
> defaults to sync everything.

For now, to force RAID6 to sync, start it with n-2 disks and add 1,
rather than starting with n-1 disks:

  mdadm --create /dev/md1 -l 6 -n 6 missing missing /dev/hd[gikm]2
  mdadm --add /dev/md1 /dev/hdo2

> You say RAID6 requires 100% of the stripe to be read to modify the strip.
> Is this due to the math of RAID6, or was it done this way because it was
> easier?

I think they can both be updated with read-modify-write:

  P' = P + D_n + D_n'
  Q' = Q + g^n * D_n + g^n * D_n'

However, the multiplications by g^n for computing Q' could be killer
on your CPU, so it's a tradeoff.  Since we're updating e.g. 128k at
once for a single value of n, it's possible that it could be done
in such a way that it's not too intensive (or cache-thrashing).
Or perhaps you have so many disks that it really is worth the time
to do the computation rather than read from all of them.

-jim

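A small, self-contained sketch of the byte-level arithmetic behind
those two update equations, assuming the usual Linux RAID-6 field:
GF(2^8) with the 0x11d polynomial and generator g = 2, as described in
hpa's paper.  gf_mul(), gf_exp2() and raid6_rmw_byte() are illustrative
helpers for this example, not functions from the md driver; since
addition in GF(2^8) is xor, the two g^n terms fold into
g^n * (D_n xor D_n'):

	#include <stdint.h>

	/* GF(2^8) multiply, polynomial 0x11d as used by Linux RAID-6 */
	static uint8_t gf_mul(uint8_t a, uint8_t b)
	{
		uint8_t p = 0;
		for (int i = 0; i < 8; i++) {
			if (b & 1)
				p ^= a;
			uint8_t hi = (uint8_t)(a & 0x80);
			a = (uint8_t)(a << 1);
			if (hi)
				a ^= 0x1d;
			b >>= 1;
		}
		return p;
	}

	/* g^n with g = 2, the RAID-6 generator */
	static uint8_t gf_exp2(unsigned n)
	{
		uint8_t v = 1;
		while (n--)
			v = gf_mul(v, 2);
		return v;
	}

	/*
	 * Byte-wise read-modify-write update when data disk n changes:
	 *   P' = P ^ (D_n ^ D_n')
	 *   Q' = Q ^ g^n * (D_n ^ D_n')
	 */
	static void raid6_rmw_byte(uint8_t *p, uint8_t *q, unsigned n,
				   uint8_t d_old, uint8_t d_new)
	{
		uint8_t delta = d_old ^ d_new;
		*p ^= delta;
		*q ^= gf_mul(gf_exp2(n), delta);
	}

In practice the g^n factor is constant across the whole chunk being
updated, so it would be applied through a precomputed 256-entry
multiplication table rather than a call per byte; that per-byte table
lookup is the cost discussed in the next message.
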
* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-29 19:04 UTC
To: Jim Paris; +Cc: Guy, 'Neil Brown', linux-raid

Jim Paris wrote:
>
> I think they can both be updated with read-modify-write:
>
>   P' = P + D_n + D_n'
>   Q' = Q + g^n * D_n + g^n * D_n'
>
> However, the multiplications by g^n for computing Q' could be killer
> on your CPU, so it's a tradeoff.  Since we're updating e.g. 128k at
> once for a single value of n, it's possible that it could be done
> in such a way that it's not too intensive (or cache-thrashing).
> Or perhaps you have so many disks that it really is worth the time
> to do the computation rather than read from all of them.

It's not a matter of cache-trashing, it's a matter of the fact that
very few CPUs have any form of parallel table lookup.  It could be done
with dynamic code generation, but that's a whole ball of wax on its
own.

	-hpa

* Re: RAID-6: help wanted
From: Jim Paris @ 2004-10-29 19:21 UTC
To: H. Peter Anvin; +Cc: Guy, 'Neil Brown', linux-raid

> > I think they can both be updated with read-modify-write:
> >
> >   P' = P + D_n + D_n'
> >   Q' = Q + g^n * D_n + g^n * D_n'
>
> It's not a matter of cache-trashing, it's a matter of the fact that very
> few CPUs have any form of parallel table lookup.  It could be done with
> dynamic code generation, but that's a whole ball of wax on its own.

Why dynamic?  Are there problems with just pregenerating all 256
optimized codepaths?

-jim

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-29 19:33 UTC
To: Jim Paris; +Cc: Guy, 'Neil Brown', linux-raid

Jim Paris wrote:
>
> Why dynamic?  Are there problems with just pregenerating all 256
> optimized codepaths?
>

Right, you could do that, of course, but it still amounts to 256 code
paths that have to be generated, a framework to do them with (since gcc
doesn't generate SSE-2 code properly, for example), and then figure out
if it was all worth it at the end.  Given that the RAID-6 module is
already quite large, I'm not so sure it's a win.

	-hpa

* RE: RAID-6: help wanted
From: Guy @ 2004-10-29 20:28 UTC
To: 'Jim Paris', 'H. Peter Anvin'; +Cc: 'Neil Brown', linux-raid

Could you translate this into English?

  P' = P + D_n + D_n'
  Q' = Q + g^n * D_n + g^n * D_n'

I understand this much:
  P    = old parity
  D_n  = old data block(s)
  P'   = new parity
  D_n' = new data block(s)

But is "+" = xor?

I am lost on this one:
  Q' = Q + g^n * D_n + g^n * D_n'

With the parity (xor) it can be done by the bit, so an example is easy.
Can Q be done by the bit, and if so, could you give an example?

If it takes more than 10 minutes, just tell me it is magic! :)

Also, on a related subject....
I have a 14 disk RAID5 with 1 spare.
Once RAID6 seems safe and stable I had hoped to convert to a 15 disk RAID6.
Is 15 disks too much for RAID6?
Any idea what a reasonable limit would be?

Thanks,
Guy

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-29 20:32 UTC
To: Guy; +Cc: 'Jim Paris', 'Neil Brown', linux-raid

Guy wrote:
> Could you translate this into English?
>   P' = P + D_n + D_n'
>   Q' = Q + g^n * D_n + g^n * D_n'
>
> I understand this much:
>   P    = old parity
>   D_n  = old data block(s)
>   P'   = new parity
>   D_n' = new data block(s)
>
> But is "+" = xor?
>
> I am lost on this one:
>   Q' = Q + g^n * D_n + g^n * D_n'
>
> With the parity (xor) it can be done by the bit, so an example is easy.
> Can Q be done by the bit, and if so, could you give an example?
>
> If it takes more than 10 minutes, just tell me it is magic! :)

See my paper on the subject:

	http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

> Also, on a related subject....
> I have a 14 disk RAID5 with 1 spare.
> Once RAID6 seems safe and stable I had hoped to convert to a 15 disk RAID6.
> Is 15 disks too much for RAID6?
> Any idea what a reasonable limit would be?

The limit is 27 disks total; it's imposed by the md system rather than
RAID-6.  RAID-6's inherent limit is 254+2 (see, again, the paper.)

	-hpa

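For readers who do not want to open the paper, the definitions being
referenced are, written in the thread's own notation:

  P = D_0 + D_1 + ... + D_{n-1}
  Q = g^0 * D_0 + g^1 * D_1 + ... + g^{n-1} * D_{n-1}

where "+" is indeed xor (addition in GF(2^8)) and "*" is multiplication
in GF(2^8) with generator g = 2.  So P works bit-by-bit exactly like
raid5 parity, but Q is computed byte-by-byte: each data byte is
multiplied by a per-disk constant before being xored in, which is why
it cannot be described as a simple per-bit rule.
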
* RE: RAID-6: help wanted
From: Guy @ 2004-10-29 21:21 UTC
To: 'H. Peter Anvin'; +Cc: 'Jim Paris', 'Neil Brown', linux-raid

I took a look at your paper.  It's magic! :)
My head hurts. :(

Ok, I understand that 27 disks is the limit.  But is that usable?
Would it be too slow?  The disk IO seems unreasonable with 27 disks.

Guy

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-29 21:33 UTC
To: Guy; +Cc: 'Jim Paris', 'Neil Brown', linux-raid

Guy wrote:
> I took a look at your paper.  It's magic! :)
> My head hurts. :(
>
> Ok, I understand that 27 disks is the limit.  But is that usable?
> Would it be too slow?  The disk IO seems unreasonable with 27 disks.

Shouldn't be too different from RAID-5, at least as long as you have
reasonably large I/O transactions.  However, unless someone can
actually test it out, it's hard to say.

Note that supporting RMW on RAID-6 is definitely a possibility; I tried
it once on a 6-disk configuration (the only one I have) and it was
slower.

	-hpa

* RE: RAID-6: help wanted
From: Guy @ 2004-10-29 12:29 UTC
To: 'Neil Brown', 'H. Peter Anvin'; +Cc: 'Jim Paris', linux-raid

Neil,

You said: "the initial resync isn't really needed for raid6 (or raid1)
at all"

I understand your logic, but I would prefer the data to be synced.
I can't think of any examples of how it could make a difference, but
if I read block x, then ten minutes later I read the same block again,
I want it to be the same unless I changed it.  With RAID1 you never
know which disk will be read; RAID6 would only change if a disk
failed.

If you insist to add this feature, please make it an option that
defaults to sync everything.  This way someone who knows what they are
doing can use the option; others will get the safer (IMHO) default.

I also want an integrity checker that does not require the array to be
stopped. :)

I know you did not write the RAID6 code, but:
You say RAID6 requires 100% of the stripe to be read to modify the
strip.  Is this due to the math of RAID6, or was it done this way
because it was easier?

When doing random disk writes, any idea how this affects the
performance of RAID6 compared to RAID5?  Does the performance of RAID6
get worse as the number of disks is increased?

Until now, I assumed a disk write would require read, read, read,
modify, write, write, write.  Compared to RAID5 with read, read,
modify, write, write (for small updates).

Thanks for your time,
Guy

* Re: RAID-6: help wanted
From: H. Peter Anvin @ 2004-10-27 5:56 UTC
To: Neil Brown; +Cc: Jim Paris, linux-raid

Okay, I think the "inode 7" problem is unrelated, because it happens
without any disk failures; it happens on RAID-0, RAID-5, and even on a
plain loop device.  I suspect there is something wrong in the ppc64
kernel; I'll try it again on i386 and report back.

	-hpa