linux-raid.vger.kernel.org archive mirror
* md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-11 19:10 Alexander Lyakas
  2012-09-11 22:29 ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-11 19:10 UTC (permalink / raw)
  To: linux-raid

Hi Neil,
I have been investigating resync on raid5, and I induced a repairable
read error on a drive that contains a data block for a particular
stripe_head. I saw that resync corrects the read error by recomputing
the missing data block, rewriting it (twice) and then re-reading it
back.

Then I dug into code and eventually saw that fetch_block():

if ((s->uptodate == disks - 1) &&
    (s->failed && (disk_idx == s->failed_num[0] ||
		   disk_idx == s->failed_num[1]))) {
	/* have disk failed, and we're requested to fetch it;
	 * do compute it
	 */

doesn't care whether disk_idx holds data or parity for this
stripe_head. It only checks that stripe_head has enough redundancy to
recompute the block.

My question: is this behavior correct? If we are doing a resync, it
means we are not sure that the parity is correct (e.g., after an
unclean shutdown). But we still use this possibly-incorrect parity to
recompute the missing data block. So is this a potential bug?
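
To make the concern concrete, here is a toy, one-byte-per-block model
(not the actual md code) of what the recompute does when parity is
stale:

#include <assert.h>
#include <stdint.h>

int main(void)
{
        /* Toy model of a 4-drive raid5 stripe: data blocks d0, d1, d2
         * and parity p, one byte each. */
        uint8_t d0 = 0x11, d1 = 0x22, d2 = 0x44;
        uint8_t p_good  = d0 ^ d1 ^ d2; /* parity that matches the data */
        uint8_t p_stale = 0x00;         /* parity left over from before a crash */

        /* Recomputing d0 after a read error simply XORs the other data
         * blocks with whatever parity is on disk. */
        assert((d1 ^ d2 ^ p_good) == d0);   /* correct only if parity is in sync */
        assert((d1 ^ d2 ^ p_stale) != d0);  /* stale parity silently yields garbage */
        return 0;
}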

Thanks,
Alex.


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-11 19:10 md: raid5 resync corrects read errors on data block - is this correct? Alexander Lyakas
@ 2012-09-11 22:29 ` NeilBrown
  2012-09-12  7:15   ` Alexander Lyakas
  2012-09-12 16:49   ` Alexander Lyakas
  0 siblings, 2 replies; 12+ messages in thread
From: NeilBrown @ 2012-09-11 22:29 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Tue, 11 Sep 2012 22:10:01 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> I have been doing some investigation about resync on raid5, and I
> induced a repairable read error on a drive that contains a data block
> for a particular stripe_head. I saw that resync corrects the read
> error by recomputing the missing data block, rewriting it (twice) and
> then re-reading back.
> 
> Then I dug into code and eventually saw that fetch_block():
> 
> if ((s->uptodate == disks - 1) &&
>     (s->failed && (disk_idx == s->failed_num[0] ||
> 		   disk_idx == s->failed_num[1]))) {
> 	/* have disk failed, and we're requested to fetch it;
> 	 * do compute it
> 	 */
> 
> doesn't care whether disk_idx holds data or parity for this
> stripe_head. It only checks that stripe_head has enough redundancy to
> recompute the block.
> 
> My question: is this behavior correct? I mean that if we are doing
> resync, it means that we are not sure that parity is correct (e.g.,
> after an unclean shutdown). But we still use this possibly-incorrect
> parity to recompute the missing data block. So is this a potential
> bug?
>

Maybe.
I guess that if bad-block recording is enabled, then recording a bad block
there would be the "right" thing to do.
However if bad-block recording is not enabled then there are two options:
 1/ kick the device from the array
 2/ re-write the failing block based on the parity block which might be
    incorrect, but very probably is correct.

I suspect that the second option (which is currently implemented) will cause
less data loss than the first.

So I think the current code is the best option (unless we want to add some
bad-block-recording support).

NeilBrown



* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-11 22:29 ` NeilBrown
@ 2012-09-12  7:15   ` Alexander Lyakas
  2012-09-12 16:49   ` Alexander Lyakas
  1 sibling, 0 replies; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-12  7:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Thanks for clarifying, Neil.

On Wed, Sep 12, 2012 at 1:29 AM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 11 Sep 2012 22:10:01 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> I have been doing some investigation about resync on raid5, and I
>> induced a repairable read error on a drive that contains a data block
>> for a particular stripe_head. I saw that resync corrects the read
>> error by recomputing the missing data block, rewriting it (twice) and
>> then re-reading back.
>>
>> Then I dug into code and eventually saw that fetch_block():
>>
>> if ((s->uptodate == disks - 1) &&
>>     (s->failed && (disk_idx == s->failed_num[0] ||
>>                  disk_idx == s->failed_num[1]))) {
>>       /* have disk failed, and we're requested to fetch it;
>>        * do compute it
>>        */
>>
>> doesn't care whether disk_idx holds data or parity for this
>> stripe_head. It only checks that stripe_head has enough redundancy to
>> recompute the block.
>>
>> My question: is this behavior correct? I mean that if we are doing
>> resync, it means that we are not sure that parity is correct (e.g.,
>> after an unclean shutdown). But we still use this possibly-incorrect
>> parity to recompute the missing data block. So is this a potential
>> bug?
>>
>
> Maybe.
> I guess that if bad-block recording is enabled, then recording a bad block
> there would be the "right" thing to do.
> However if bad-block recording is not enabled then there are two options:
>  1/ kick the device from the array
>  2/ re-write the failing block based on the parity block which might be
>     incorrect, but very probably is correct.
>
> I suspect that the second option (which is currently implemented) will cause
> less data loss than the first.
>
> So I think the current code is the best option (unless we want to add some
> bad-block-recording support).
>
> NeilBrown


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-11 22:29 ` NeilBrown
  2012-09-12  7:15   ` Alexander Lyakas
@ 2012-09-12 16:49   ` Alexander Lyakas
  2012-09-13  0:19     ` NeilBrown
  1 sibling, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-12 16:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
I have done some more investigation on that.

I see that according to handle_stripe_dirtying(), raid6 always does
reconstruct-write, while raid5 checks what will be cheaper in terms of
IOs - read-modify-write or reconstruct-write. For example, for a
3-drive raid5, both are the same, so because of:

if (rmw < rcw && rmw > 0)
... /* this is not picked, because in this case rmw==rcw==1 */

reconstruct-write is always performed for such a 3-drive raid5. Is this correct?

The issue with doing read-modify-writes is that later we have no
reliable way to know whether the parity block is correct - when we
later do reconstruct-write because of a read error, for example. For
read requests we could have perhaps checked the bitmap, and do
reconstruct-write if the relevant bit is not set, but for write
requests the relevant bit will always be set, because it is set when
the write is started.

I tried the following scenario, which showed a data corruption:
# Create 4-drive raid5 in "--force" mode, so resync starts
# Write one sector on a stripe that resync has not handled yet. RMW is
performed, but the parity is incorrect because two other data blocks
were not taken into account (they contain garbage).
# Induce a read-error on the sector that I just wrote to
# Let resync handle this stripe

As a result, resync corrects my sector using the other two data blocks
plus the parity block, which is out of sync. When I read back the
sector, the data is incorrect.
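
In XOR terms the corruption works out like this (again a toy,
one-byte-per-block model, not md code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
        /* Stripe state right after "--force" creation: d1 and d2 hold
         * pre-existing garbage, p was never synced to d0^d1^d2. */
        uint8_t d0 = 0x00, d1 = 0x5a, d2 = 0xa5;
        uint8_t p = 0x00;

        /* RMW write of new data to d0: p_new = p_old ^ d0_old ^ d0_new.
         * The parity stays out of sync with d1 and d2. */
        uint8_t d0_new = 0x11;
        p ^= d0 ^ d0_new;
        d0 = d0_new;

        /* Later, the read error on d0 is "corrected" from parity: the
         * recomputed value is not what was written. */
        assert((d1 ^ d2 ^ p) != d0);    /* data corruption */
        return 0;
}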

I see that I can easily force raid5 to always do reconstruct-write,
the same way you do for raid6. However, I realize that for performance
reasons it is better to do RMW when possible.
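
With reconstruct-write the same toy sequence stays consistent (one
byte per block again, not md code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint8_t d0 = 0x00, d1 = 0x5a, d2 = 0xa5;  /* d1, d2: garbage */
        uint8_t p = 0x00;                         /* stale parity */

        /* Reconstruct-write: write d0 and recompute parity from all
         * data blocks, so this stripe becomes internally consistent
         * even though resync has not reached it yet. */
        d0 = 0x11;
        p = d0 ^ d1 ^ d2;

        /* A later read error on d0 is corrected from parity and the
         * written value comes back. */
        assert((d1 ^ d2 ^ p) == d0);
        return 0;
}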

What do you think about the following rough suggestion: in
handle_stripe_dirtying(), check whether a resync is ongoing or should
be started - using MD_RECOVERY_SYNC, for example. If there is an
ongoing resync, there is a good reason for it: parity on some stripes
is probably out of date. So in that case, always force
reconstruct-write. Otherwise, count what is cheaper, as you do now.
(Can RCW really be cheaper than RMW?)

So during resync, array performance will be lower, but we will ensure
that all stripe-blocks are consistent. What do you think?

Thanks!
Alex.


On Wed, Sep 12, 2012 at 1:29 AM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 11 Sep 2012 22:10:01 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> I have been doing some investigation about resync on raid5, and I
>> induced a repairable read error on a drive that contains a data block
>> for a particular stripe_head. I saw that resync corrects the read
>> error by recomputing the missing data block, rewriting it (twice) and
>> then re-reading back.
>>
>> Then I dug into code and eventually saw that fetch_block():
>>
>> if ((s->uptodate == disks - 1) &&
>>     (s->failed && (disk_idx == s->failed_num[0] ||
>>                  disk_idx == s->failed_num[1]))) {
>>       /* have disk failed, and we're requested to fetch it;
>>        * do compute it
>>        */
>>
>> doesn't care whether disk_idx holds data or parity for this
>> stripe_head. It only checks that stripe_head has enough redundancy to
>> recompute the block.
>>
>> My question: is this behavior correct? I mean that if we are doing
>> resync, it means that we are not sure that parity is correct (e.g.,
>> after an unclean shutdown). But we still use this possibly-incorrect
>> parity to recompute the missing data block. So is this a potential
>> bug?
>>
>
> Maybe.
> I guess that if bad-block recording is enabled, then recording a bad block
> there would be the "right" thing to do.
> However if bad-block recording is not enabled then there are two options:
>  1/ kick the device from the array
>  2/ re-write the failing block based on the parity block which might be
>     incorrect, but very probably is correct.
>
> I suspect that the second option (which is currently implemented) will cause
> less data loss than the first.
>
> So I think the current code is the best option (unless we want to add some
> bad-block-recording support).
>
> NeilBrown


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-12 16:49   ` Alexander Lyakas
@ 2012-09-13  0:19     ` NeilBrown
  2012-09-13 16:05       ` Alexander Lyakas
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2012-09-13  0:19 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Wed, 12 Sep 2012 19:49:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> I have done some more investigation on that.
> 
> I see that according to handle_stripe_dirtying(), raid6 always does
> reconstruct-write, while raid5 checks what will be cheaper in terms of
> IOs - read-modify-write or reconstruct-write. For example, for a
> 3-drive raid5, both are the same, so because of:
> 
> if (rmw < rcw && rmw > 0)
> ... /* this is not picked, because in this case rmw==rcw==1 */
> 
> reconstruct-write is always performed for such a 3-drive raid5. Is this correct?

Yes. 

> 
> The issue with doing read-modify-writes is that later we have no
> reliable way to know whether the parity block is correct - when we
> later do reconstruct-write because of a read error, for example. For
> read requests we could have perhaps checked the bitmap, and do
> reconstruct-write if the relevant bit is not set, but for write
> requests the relevant bit will always be set, because it is set when
> the write is started.
> 
> I tried the following scenario, which showed a data corruption:
> # Create 4-drive raid5 in "--force" mode, so resync starts
> # Write one sector on a stripe that resync has not handled yet. RMW is
> performed, but the parity is incorrect because two other data blocks
> were not taken into account (they contain garbage).
> # Induce a read-error on the sector that I just wrote to
> # Let resync handle this stripe
> 
> As a result, resync corrects my sector using other two data blocks +
> parity block, which is out of sync. When I read back the sector, data
> is incorrect.
> 
> I see that I can easily enforce raid5 to always do reconstruct-write,
> the same way like you do for raid6. However, I realize that for
> performance reasons, it is better to do RMW if possible.
> 
> What do you think about the following rough suggestion: in
> handle_stripe_dirtying() check whether resync is ongoing or should be
> started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
> resync, there is a good reason for that, probably parity on some
> stripes is out of date. So in that case, always force
> reconstruct-write. Otherwise, count what is cheaper like you do now.
> (Can RCW be really cheaper than RMW?)
> 
> So during resync, array performance will be lower, but we will ensure
> that all stripe-blocks are consistent. What do you think?

I'm fairly sure we used to do that - long long ago. (hunts through git
history...)  No.  The code-fragment was there but it was commented out.

I think it would be good to avoid 'rmw' if the sector offset is less than
recovery_cp.

Care to write a patch?

Thanks,
NeilBrown



* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-13  0:19     ` NeilBrown
@ 2012-09-13 16:05       ` Alexander Lyakas
  2012-09-13 16:11         ` Alexander Lyakas
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-13 16:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
below please find the patch. It got pretty ugly, so even if you don't
apply it, please comment. Did I miss anything?
As we discussed some time ago, the usage of mddev->recovery_cp is
overloaded in MD - it indicates both the resync progress and also
whether the array is dirty due to ongoing writes. So I tried to
somehow differentiate between the two. I have not yet managed to
reproduce the "needed" and "transitional" parts. I also don't force
reconstruct on "check" and "repair". I tested the patch on kernel 3.2.

Finally, I asked you some time ago how MD treats stripes when it is
rebuilding a drive. I have now dug into the code and found that
absolutely everything you told me was correct. Amazing!

Thanks,
Alex.


---------------------------
From e876cddae5768bc7f6bdcbd617f36ea3a12445aa Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..fa03410 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2561,26 +2561,54 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 		 * look like rcw is cheaper
 		 */
 		rcw = 1; rmw = 2;
-	} else for (i = disks; i--; ) {
-		/* would I have to read this buffer for read_modify_write */
-		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx) &&
-		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		      test_bit(R5_Wantcompute, &dev->flags))) {
-			if (test_bit(R5_Insync, &dev->flags))
-				rmw++;
-			else
-				rmw += 2*disks;  /* cannot read it */
-		}
-		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		    test_bit(R5_Wantcompute, &dev->flags))) {
-			if (test_bit(R5_Insync, &dev->flags)) rcw++;
-			else
-				rcw += 2*disks;
+	} else {
+		/* Attempt to check whether resync is now happening.
+		 * If yes, the array is dirty (after unclean shutdown or
+		 * initial creation), so parity in some stripes might be inconsistent.
+		 * In this case, we need to always do reconstruct-write, to ensure
+		 * that in case of drive failure or read-error correction, we
+		 * generate correct data from the parity.
+		 */
+		sector_t recovery_cp = conf->mddev->recovery_cp;
+		unsigned long recovery = conf->mddev->recovery;
+		int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
+		int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
+			            !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
+			            !test_bit(MD_RECOVERY_CHECK, &recovery);
+		int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
+			               !test_bit(MD_RECOVERY_SYNC, &recovery) &&
+			               !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
+			               !test_bit(MD_RECOVERY_DONE, &recovery) &&
+			               !test_bit(MD_RECOVERY_RESHAPE, &recovery);
+		
+		/* If array is dirty and should start resync or already resyncing */
+		if (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
+			(needed || resyncing || transitional)) {
+			/* Force reconstruct-write */
+			rcw = 1; rmw = 2;
+			pr_debug("dirty, force RCW recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
+				     recovery_cp, sh->sector, recovery);
+		} else for (i = disks; i--; ) {
+			/* would I have to read this buffer for read_modify_write */
+			struct r5dev *dev = &sh->dev[i];
+			if ((dev->towrite || i == sh->pd_idx) &&
+			    !test_bit(R5_LOCKED, &dev->flags) &&
+			    !(test_bit(R5_UPTODATE, &dev->flags) ||
+			      test_bit(R5_Wantcompute, &dev->flags))) {
+				if (test_bit(R5_Insync, &dev->flags))
+					rmw++;
+				else
+					rmw += 2*disks;  /* cannot read it */
+			}
+			/* Would I have to read this buffer for reconstruct_write */
+			if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+			    !test_bit(R5_LOCKED, &dev->flags) &&
+			    !(test_bit(R5_UPTODATE, &dev->flags) ||
+			    test_bit(R5_Wantcompute, &dev->flags))) {
+				if (test_bit(R5_Insync, &dev->flags)) rcw++;
+				else
+					rcw += 2*disks;
+			}
 		}
 	}
 	pr_debug("for sector %llu, rmw=%d rcw=%d\n",





On Thu, Sep 13, 2012 at 3:19 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 12 Sep 2012 19:49:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> I have done some more investigation on that.
>>
>> I see that according to handle_stripe_dirtying(), raid6 always does
>> reconstruct-write, while raid5 checks what will be cheaper in terms of
>> IOs - read-modify-write or reconstruct-write. For example, for a
>> 3-drive raid5, both are the same, so because of:
>>
>> if (rmw < rcw && rmw > 0)
>> ... /* this is not picked, because in this case rmw==rcw==1 */
>>
>> reconstruct-write is always performed for such a 3-drive raid5. Is this correct?
>
> Yes.
>
>>
>> The issue with doing read-modify-writes is that later we have no
>> reliable way to know whether the parity block is correct - when we
>> later do reconstruct-write because of a read error, for example. For
>> read requests we could have perhaps checked the bitmap, and do
>> reconstruct-write if the relevant bit is not set, but for write
>> requests the relevant bit will always be set, because it is set when
>> the write is started.
>>
>> I tried the following scenario, which showed a data corruption:
>> # Create 4-drive raid5 in "--force" mode, so resync starts
>> # Write one sector on a stripe that resync has not handled yet. RMW is
>> performed, but the parity is incorrect because two other data blocks
>> were not taken into account (they contain garbage).
>> # Induce a read-error on the sector that I just wrote to
>> # Let resync handle this stripe
>>
>> As a result, resync corrects my sector using other two data blocks +
>> parity block, which is out of sync. When I read back the sector, data
>> is incorrect.
>>
>> I see that I can easily enforce raid5 to always do reconstruct-write,
>> the same way like you do for raid6. However, I realize that for
>> performance reasons, it is better to do RMW if possible.
>>
>> What do you think about the following rough suggestion: in
>> handle_stripe_dirtying() check whether resync is ongoing or should be
>> started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
>> resync, there is a good reason for that, probably parity on some
>> stripes is out of date. So in that case, always force
>> reconstruct-write. Otherwise, count what is cheaper like you do now.
>> (Can RCW be really cheaper than RMW?)
>>
>> So during resync, array performance will be lower, but we will ensure
>> that all stripe-blocks are consistent. What do you think?
>
> I'm fairly sure we used to do that - long long ago. (hunts through git
> history...)  No.  The code-fragment was there but it was commented out.
>
> I think it would be good to avoid 'rmw' if the sector offset is less than
> recovery_cp.
>
> Care to write a patch?
>
> Thanks,
> NeilBrown


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-13 16:05       ` Alexander Lyakas
@ 2012-09-13 16:11         ` Alexander Lyakas
  2012-09-17 11:15           ` Alexander Lyakas
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-13 16:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

I forgot to mention that there is a flaw in my patch: if the user does
echo "frozen" > /sys/block/mdX/md/sync_action
then the resync is aborted and the NEEDED flag is also cleared, so I
cannot distinguish between "needs resync" and "has ongoing writes". In
that case RMW is performed.

If there is a drive failure during resync, then eventually recovery_cp
becomes MaxSector (after an additional "empty resync", as we discussed
long ago). So then it is ok to do RMW, because when we re-add the
drive, we will reconstruct its missing blocks.

Thanks,
Alex.


On Thu, Sep 13, 2012 at 7:05 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hi Neil,
> below please find the patch. It got pretty ugly, so even if you don't
> apply it, please comment. Did I miss anything?
> As we discussed some time ago, the usage of mddev->recovery_cp is
> overloaded in MD - it indicates both the resync progress and also
> whether array is dirty due to ongoing writes. So I tried to somehow
> differentiate between these two. I did not manage to repro yet the
> "needed" and "transitional" parts. I also don't force reconstruct on
> "check" and "repair". I tested the patch on kernel 3.2.
>
> Finally, I asked you some time ago about how MD treats stripes when it
> is rebuilding a drive. So now I dug in the code, and found out that
> absolutely everything you told me was correct. Amazing!
>
> Thanks,
> Alex.
>
>
> ---------------------------
> From e876cddae5768bc7f6bdcbd617f36ea3a12445aa Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@zadarastorage.com>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
>
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
>
> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> index 5332202..fa03410 100644
> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> @@ -2561,26 +2561,54 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                  * look like rcw is cheaper
>                  */
>                 rcw = 1; rmw = 2;
> -       } else for (i = disks; i--; ) {
> -               /* would I have to read this buffer for read_modify_write */
> -               struct r5dev *dev = &sh->dev[i];
> -               if ((dev->towrite || i == sh->pd_idx) &&
> -                   !test_bit(R5_LOCKED, &dev->flags) &&
> -                   !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                     test_bit(R5_Wantcompute, &dev->flags))) {
> -                       if (test_bit(R5_Insync, &dev->flags))
> -                               rmw++;
> -                       else
> -                               rmw += 2*disks;  /* cannot read it */
> -               }
> -               /* Would I have to read this buffer for reconstruct_write */
> -               if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> -                   !test_bit(R5_LOCKED, &dev->flags) &&
> -                   !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                   test_bit(R5_Wantcompute, &dev->flags))) {
> -                       if (test_bit(R5_Insync, &dev->flags)) rcw++;
> -                       else
> -                               rcw += 2*disks;
> +       } else {
> +               /* Attempt to check whether resync is now happening.
> +                * If yes, the array is dirty (after unclean shutdown or
> +                * initial creation), so parity in some stripes might be inconsistent.
> +                * In this case, we need to always do reconstruct-write, to ensure
> +                * that in case of drive failure or read-error correction, we
> +                * generate correct data from the parity.
> +                */
> +               sector_t recovery_cp = conf->mddev->recovery_cp;
> +               unsigned long recovery = conf->mddev->recovery;
> +               int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
> +               int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_CHECK, &recovery);
> +               int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
> +                                      !test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                                      !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
> +                                      !test_bit(MD_RECOVERY_DONE, &recovery) &&
> +                                      !test_bit(MD_RECOVERY_RESHAPE, &recovery);
> +
> +               /* If array is dirty and should start resync or already resyncing */
> +               if (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> +                       (needed || resyncing || transitional)) {
> +                       /* Force reconstruct-write */
> +                       rcw = 1; rmw = 2;
> +                       pr_debug("dirty, force RCW recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
> +                                    recovery_cp, sh->sector, recovery);
> +               } else for (i = disks; i--; ) {
> +                       /* would I have to read this buffer for read_modify_write */
> +                       struct r5dev *dev = &sh->dev[i];
> +                       if ((dev->towrite || i == sh->pd_idx) &&
> +                           !test_bit(R5_LOCKED, &dev->flags) &&
> +                           !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                             test_bit(R5_Wantcompute, &dev->flags))) {
> +                               if (test_bit(R5_Insync, &dev->flags))
> +                                       rmw++;
> +                               else
> +                                       rmw += 2*disks;  /* cannot read it */
> +                       }
> +                       /* Would I have to read this buffer for reconstruct_write */
> +                       if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> +                           !test_bit(R5_LOCKED, &dev->flags) &&
> +                           !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                           test_bit(R5_Wantcompute, &dev->flags))) {
> +                               if (test_bit(R5_Insync, &dev->flags)) rcw++;
> +                               else
> +                                       rcw += 2*disks;
> +                       }
>                 }
>         }
>         pr_debug("for sector %llu, rmw=%d rcw=%d\n",
>
>
>
>
>
> On Thu, Sep 13, 2012 at 3:19 AM, NeilBrown <neilb@suse.de> wrote:
>> On Wed, 12 Sep 2012 19:49:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Hi Neil,
>>> I have done some more investigation on that.
>>>
>>> I see that according to handle_stripe_dirtying(), raid6 always does
>>> reconstruct-write, while raid5 checks what will be cheaper in terms of
>>> IOs - read-modify-write or reconstruct-write. For example, for a
>>> 3-drive raid5, both are the same, so because of:
>>>
>>> if (rmw < rcw && rmw > 0)
>>> ... /* this is not picked, because in this case rmw==rcw==1 */
>>>
>>> reconstruct-write is always performed for such a 3-drive raid5. Is this correct?
>>
>> Yes.
>>
>>>
>>> The issue with doing read-modify-writes is that later we have no
>>> reliable way to know whether the parity block is correct - when we
>>> later do reconstruct-write because of a read error, for example. For
>>> read requests we could have perhaps checked the bitmap, and do
>>> reconstruct-write if the relevant bit is not set, but for write
>>> requests the relevant bit will always be set, because it is set when
>>> the write is started.
>>>
>>> I tried the following scenario, which showed a data corruption:
>>> # Create 4-drive raid5 in "--force" mode, so resync starts
>>> # Write one sector on a stripe that resync has not handled yet. RMW is
>>> performed, but the parity is incorrect because two other data blocks
>>> were not taken into account (they contain garbage).
>>> # Induce a read-error on the sector that I just wrote to
>>> # Let resync handle this stripe
>>>
>>> As a result, resync corrects my sector using other two data blocks +
>>> parity block, which is out of sync. When I read back the sector, data
>>> is incorrect.
>>>
>>> I see that I can easily enforce raid5 to always do reconstruct-write,
>>> the same way like you do for raid6. However, I realize that for
>>> performance reasons, it is better to do RMW if possible.
>>>
>>> What do you think about the following rough suggestion: in
>>> handle_stripe_dirtying() check whether resync is ongoing or should be
>>> started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
>>> resync, there is a good reason for that, probably parity on some
>>> stripes is out of date. So in that case, always force
>>> reconstruct-write. Otherwise, count what is cheaper like you do now.
>>> (Can RCW be really cheaper than RMW?)
>>>
>>> So during resync, array performance will be lower, but we will ensure
>>> that all stripe-blocks are consistent. What do you think?
>>
>> I'm fairly sure we used to do that - long long ago. (hunts through git
>> history...)  No.  The code-fragment was there but it was commented out.
>>
>> I think it would be good to avoid 'rmw' if the sector offset is less than
>> recovery_cp.
>>
>> Care to write a patch?
>>
>> Thanks,
>> NeilBrown


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-13 16:11         ` Alexander Lyakas
@ 2012-09-17 11:15           ` Alexander Lyakas
  2012-09-19  5:59             ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-17 11:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
below is a bit less-ugly version of the patch.
Thanks,
Alex.

From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..0702785 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
                                   int disks)
 {
        int rmw = 0, rcw = 0, i;
-       if (conf->max_degraded == 2) {
-               /* RAID6 requires 'rcw' in current implementation
-                * Calculate the real rcw later - for now fake it
+       sector_t recovery_cp = conf->mddev->recovery_cp;
+       unsigned long recovery = conf->mddev->recovery;
+       int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
+       int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                       !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
+                       !test_bit(MD_RECOVERY_CHECK, &recovery);
+       int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
+                          !test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                          !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
+                          !test_bit(MD_RECOVERY_DONE, &recovery) &&
+                          !test_bit(MD_RECOVERY_RESHAPE, &recovery);
+
+       /* RAID6 requires 'rcw' in current implementation.
+        * Otherwise, attempt to check whether resync is now happening
+        * or should start.
+         * If yes, then the array is dirty (after unclean shutdown or
+         * initial creation), so parity in some stripes might be inconsistent.
+         * In this case, we need to always do reconstruct-write, to ensure
+         * that in case of drive failure or read-error correction, we
+         * generate correct data from the parity.
+         */
+       if (conf->max_degraded == 2 ||
+           (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
+            (needed || resyncing || transitional))) {
+               /* Calculate the real rcw later - for now fake it
                 * look like rcw is cheaper
                 */
                rcw = 1; rmw = 2;
+               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
+                        conf->max_degraded, recovery_cp, sh->sector, recovery);
        } else for (i = disks; i--; ) {
                /* would I have to read this buffer for read_modify_write */
                struct r5dev *dev = &sh->dev[i];


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-17 11:15           ` Alexander Lyakas
@ 2012-09-19  5:59             ` NeilBrown
  2012-09-20  8:26               ` Alexander Lyakas
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2012-09-19  5:59 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Mon, 17 Sep 2012 14:15:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> below is a bit less-ugly version of the patch.
> Thanks,
> Alex.
> 
> >From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@zadarastorage.com>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
> 
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
> 
> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> index 5332202..0702785 100644
> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> @@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                                    int disks)
>  {
>         int rmw = 0, rcw = 0, i;
> -       if (conf->max_degraded == 2) {
> -               /* RAID6 requires 'rcw' in current implementation
> -                * Calculate the real rcw later - for now fake it
> +       sector_t recovery_cp = conf->mddev->recovery_cp;
> +       unsigned long recovery = conf->mddev->recovery;
> +       int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
> +       int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                       !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
> +                       !test_bit(MD_RECOVERY_CHECK, &recovery);
> +       int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
> +                          !test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                          !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
> +                          !test_bit(MD_RECOVERY_DONE, &recovery) &&
> +                          !test_bit(MD_RECOVERY_RESHAPE, &recovery);

Thanks Alex,
 however I don't understand why you want to test all of these bits.
Isn't it enough just to check ->recovery_cp ??

> +
> +       /* RAID6 requires 'rcw' in current implementation.
> +        * Otherwise, attempt to check whether resync is now happening
> +        * or should start.
> +         * If yes, then the array is dirty (after unclean shutdown or
> +         * initial creation), so parity in some stripes might be inconsistent.
> +         * In this case, we need to always do reconstruct-write, to ensure
> +         * that in case of drive failure or read-error correction, we
> +         * generate correct data from the parity.
> +         */
> +       if (conf->max_degraded == 2 ||
> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> +            (needed || resyncing || transitional))) {
> +               /* Calculate the real rcw later - for now fake it
>                  * look like rcw is cheaper

Also, we should probably fix this comment.  s/fake/make/

Thanks,
NeilBrown



>                  */
>                 rcw = 1; rmw = 2;
> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
> sh->sector=%lu recovery=0x%lx\n",
> +                        conf->max_degraded, recovery_cp, sh->sector, recovery);
>         } else for (i = disks; i--; ) {
>                 /* would I have to read this buffer for read_modify_write */
>                 struct r5dev *dev = &sh->dev[i];




* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-19  5:59             ` NeilBrown
@ 2012-09-20  8:26               ` Alexander Lyakas
  2012-09-25  6:57                 ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-20  8:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
you are completely right. I got confused between mddev->recovery_cp
and sb->resync_offset; the latter may become 0 due to in-flight WRITEs
and not due to resync. Looking at the code again, I see that
recovery_cp is totally one-way from sb->resync_offset to MaxSector
(except for explicit loading via sysfs). Also recovery_cp is not
relevant to "check" and "repair". So recovery_cp is pretty simple
after all.

Below is the V2 patch. (I also have to credit it to somebody else,
because he was the one who said: just do rcw while you are resyncing.)

Thanks,
Alex.


-----------------
From cc3e2bfcf2fd2c69180577949425d69de88706bb Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..9fdd5e3 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2555,12 +2555,24 @@ static void handle_stripe_dirtying(struct r5conf *conf,
                                   int disks)
 {
        int rmw = 0, rcw = 0, i;
-       if (conf->max_degraded == 2) {
-               /* RAID6 requires 'rcw' in current implementation
-                * Calculate the real rcw later - for now fake it
+       sector_t recovery_cp = conf->mddev->recovery_cp;
+
+       /* RAID6 requires 'rcw' in current implementation.
+        * Otherwise, check whether resync is now happening or should start.
+        * If yes, then the array is dirty (after unclean shutdown or
+        * initial creation), so parity in some stripes might be inconsistent.
+        * In this case, we need to always do reconstruct-write, to ensure
+        * that in case of drive failure or read-error correction, we
+        * generate correct data from the parity.
+        */
+       if (conf->max_degraded == 2 ||
+           (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
+               /* Calculate the real rcw later - for now make it
                 * look like rcw is cheaper
                 */
                rcw = 1; rmw = 2;
+               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu sh->sector=%lu\n",
+                        conf->max_degraded, recovery_cp, sh->sector);
        } else for (i = disks; i--; ) {
                /* would I have to read this buffer for read_modify_write */
                struct r5dev *dev = &sh->dev[i];






On Wed, Sep 19, 2012 at 8:59 AM, NeilBrown <neilb@suse.de> wrote:
> On Mon, 17 Sep 2012 14:15:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> below is a bit less-ugly version of the patch.
>> Thanks,
>> Alex.
>>
>> >From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
>> From: Alex Lyakas <alex@zadarastorage.com>
>> Date: Thu, 13 Sep 2012 18:55:00 +0300
>> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>>  read-modify-write.
>>
>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
>>
>> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> index 5332202..0702785 100644
>> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> @@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>>                                    int disks)
>>  {
>>         int rmw = 0, rcw = 0, i;
>> -       if (conf->max_degraded == 2) {
>> -               /* RAID6 requires 'rcw' in current implementation
>> -                * Calculate the real rcw later - for now fake it
>> +       sector_t recovery_cp = conf->mddev->recovery_cp;
>> +       unsigned long recovery = conf->mddev->recovery;
>> +       int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
>> +       int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
>> +                       !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
>> +                       !test_bit(MD_RECOVERY_CHECK, &recovery);
>> +       int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
>> +                          !test_bit(MD_RECOVERY_SYNC, &recovery) &&
>> +                          !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
>> +                          !test_bit(MD_RECOVERY_DONE, &recovery) &&
>> +                          !test_bit(MD_RECOVERY_RESHAPE, &recovery);
>
> Thanks Alex,
>  however I don't understand why you want to test all of these bits.
> Isn't it enough just to check ->recovery_cp ??
>
>> +
>> +       /* RAID6 requires 'rcw' in current implementation.
>> +        * Otherwise, attempt to check whether resync is now happening
>> +        * or should start.
>> +         * If yes, then the array is dirty (after unclean shutdown or
>> +         * initial creation), so parity in some stripes might be inconsistent.
>> +         * In this case, we need to always do reconstruct-write, to ensure
>> +         * that in case of drive failure or read-error correction, we
>> +         * generate correct data from the parity.
>> +         */
>> +       if (conf->max_degraded == 2 ||
>> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
>> +            (needed || resyncing || transitional))) {
>> +               /* Calculate the real rcw later - for now fake it
>>                  * look like rcw is cheaper
>
> Also, we should probably fix this comment.  s/fake/make/
>
> Thanks,
> NeilBrown
>
>
>
>>                  */
>>                 rcw = 1; rmw = 2;
>> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
>> sh->sector=%lu recovery=0x%lx\n",
>> +                        conf->max_degraded, recovery_cp, sh->sector, recovery);
>>         } else for (i = disks; i--; ) {
>>                 /* would I have to read this buffer for read_modify_write */
>>                 struct r5dev *dev = &sh->dev[i];
>


* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-20  8:26               ` Alexander Lyakas
@ 2012-09-25  6:57                 ` NeilBrown
  2012-09-25  7:50                   ` Alexander Lyakas
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2012-09-25  6:57 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Thu, 20 Sep 2012 11:26:50 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> you are completely right. I got confused between mddev->recovery_cp
> and sb->resync_offset; the latter may become 0 due to in-flight WRITEs
> and not due to resync. Looking at the code again, I see that
> recovery_cp is totally one-way from sb->resync_offset to MaxSector
> (except for explicit loading via sysfs). Also recovery_cp is not
> relevant to "check" and "repair". So recovery_cp is pretty simple
> after all.
> 
> Below is V2 patch. (I have also to credit it to somebody else, because
> he was the one that said - just do rcw while you are resyncing).
> 
> Thanks,
> Alex.
> 
> 
> -----------------
> >From cc3e2bfcf2fd2c69180577949425d69de88706bb Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@zadarastorage.com>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
> 
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> Signed-off-by: Yair Hershko <yair@zadarastorage.com>

Signed-off-by has a very specific meaning - it isn't just a way of giving
credit.
If Yair wrote some of the code, this is fine.
If not, then something like "Suggested-by:" might be more appropriate.
Should I change it to that?

applied, thanks.

NeilBrown


> 
> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> index 5332202..9fdd5e3 100644
> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> @@ -2555,12 +2555,24 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                                    int disks)
>  {
>         int rmw = 0, rcw = 0, i;
> -       if (conf->max_degraded == 2) {
> -               /* RAID6 requires 'rcw' in current implementation
> -                * Calculate the real rcw later - for now fake it
> +       sector_t recovery_cp = conf->mddev->recovery_cp;
> +
> +       /* RAID6 requires 'rcw' in current implementation.
> +        * Otherwise, check whether resync is now happening or should start.
> +        * If yes, then the array is dirty (after unclean shutdown or
> +        * initial creation), so parity in some stripes might be inconsistent.
> +        * In this case, we need to always do reconstruct-write, to ensure
> +        * that in case of drive failure or read-error correction, we
> +        * generate correct data from the parity.
> +        */
> +       if (conf->max_degraded == 2 ||
> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
> +               /* Calculate the real rcw later - for now make it
>                  * look like rcw is cheaper
>                  */
>                 rcw = 1; rmw = 2;
> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
> sh->sector=%lu\n",
> +                        conf->max_degraded, recovery_cp, sh->sector);
>         } else for (i = disks; i--; ) {
>                 /* would I have to read this buffer for read_modify_write */
>                 struct r5dev *dev = &sh->dev[i];
> 
> 
> 
> 
> 
> 
> On Wed, Sep 19, 2012 at 8:59 AM, NeilBrown <neilb@suse.de> wrote:
> > On Mon, 17 Sep 2012 14:15:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi Neil,
> >> below is a bit less-ugly version of the patch.
> >> Thanks,
> >> Alex.
> >>
> >> >From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
> >> From: Alex Lyakas <alex@zadarastorage.com>
> >> Date: Thu, 13 Sep 2012 18:55:00 +0300
> >> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
> >>  read-modify-write.
> >>
> >> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> >> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
> >>
> >> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> >> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> >> index 5332202..0702785 100644
> >> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> >> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> >> @@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> >>                                    int disks)
> >>  {
> >>         int rmw = 0, rcw = 0, i;
> >> -       if (conf->max_degraded == 2) {
> >> -               /* RAID6 requires 'rcw' in current implementation
> >> -                * Calculate the real rcw later - for now fake it
> >> +       sector_t recovery_cp = conf->mddev->recovery_cp;
> >> +       unsigned long recovery = conf->mddev->recovery;
> >> +       int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
> >> +       int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
> >> +                       !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
> >> +                       !test_bit(MD_RECOVERY_CHECK, &recovery);
> >> +       int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
> >> +                          !test_bit(MD_RECOVERY_SYNC, &recovery) &&
> >> +                          !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
> >> +                          !test_bit(MD_RECOVERY_DONE, &recovery) &&
> >> +                          !test_bit(MD_RECOVERY_RESHAPE, &recovery);
> >
> > Thanks Alex,
> >  however I don't understand why you want to test all of these bits.
> > Isn't it enough just to check ->recovery_cp ??
> >
> >> +
> >> +       /* RAID6 requires 'rcw' in current implementation.
> >> +        * Otherwise, attempt to check whether resync is now happening
> >> +        * or should start.
> >> +         * If yes, then the array is dirty (after unclean shutdown or
> >> +         * initial creation), so parity in some stripes might be inconsistent.
> >> +         * In this case, we need to always do reconstruct-write, to ensure
> >> +         * that in case of drive failure or read-error correction, we
> >> +         * generate correct data from the parity.
> >> +         */
> >> +       if (conf->max_degraded == 2 ||
> >> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> >> +            (needed || resyncing || transitional))) {
> >> +               /* Calculate the real rcw later - for now fake it
> >>                  * look like rcw is cheaper
> >
> > Also, we should probably fix this comment.  s/fake/make/
> >
> > Thanks,
> > NeilBrown
> >
> >
> >
> >>                  */
> >>                 rcw = 1; rmw = 2;
> >> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
> >> sh->sector=%lu recovery=0x%lx\n",
> >> +                        conf->max_degraded, recovery_cp, sh->sector, recovery);
> >>         } else for (i = disks; i--; ) {
> >>                 /* would I have to read this buffer for read_modify_write */
> >>                 struct r5dev *dev = &sh->dev[i];
> >




* Re: md: raid5 resync corrects read errors on data block - is this correct?
  2012-09-25  6:57                 ` NeilBrown
@ 2012-09-25  7:50                   ` Alexander Lyakas
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander Lyakas @ 2012-09-25  7:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Yes, Neil, please change it then to "Suggested-By".

Thanks!
Alex.


On Tue, Sep 25, 2012 at 8:57 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 20 Sep 2012 11:26:50 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> you are completely right. I got confused between mddev->recovery_cp
>> and sb->resync_offset; the latter may become 0 due to in-flight WRITEs
>> and not due to resync. Looking at the code again, I see that
>> recovery_cp is totally one-way from sb->resync_offset to MaxSector
>> (except for explicit loading via sysfs). Also recovery_cp is not
>> relevant to "check" and "repair". So recovery_cp is pretty simple
>> after all.
>>
>> Below is V2 patch. (I have also to credit it to somebody else, because
>> he was the one that said - just do rcw while you are resyncing).
>>
>> Thanks,
>> Alex.
>>
>>
>> -----------------
>> >From cc3e2bfcf2fd2c69180577949425d69de88706bb Mon Sep 17 00:00:00 2001
>> From: Alex Lyakas <alex@zadarastorage.com>
>> Date: Thu, 13 Sep 2012 18:55:00 +0300
>> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>>  read-modify-write.
>>
>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
>
> Signed-off-by has a very specific meaning - it isn't just a way of giving
> credit.
> If Yair wrote some of the code, this is fine.
> If not, then something like "Suggested-by:" might be more appropriate.
> Should I change it to that?
>
> applied, thanks.
>
> NeilBrown
>
>
>>
>> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> index 5332202..9fdd5e3 100644
>> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> @@ -2555,12 +2555,24 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>>                                    int disks)
>>  {
>>         int rmw = 0, rcw = 0, i;
>> -       if (conf->max_degraded == 2) {
>> -               /* RAID6 requires 'rcw' in current implementation
>> -                * Calculate the real rcw later - for now fake it
>> +       sector_t recovery_cp = conf->mddev->recovery_cp;
>> +
>> +       /* RAID6 requires 'rcw' in current implementation.
>> +        * Otherwise, check whether resync is now happening or should start.
>> +        * If yes, then the array is dirty (after unclean shutdown or
>> +        * initial creation), so parity in some stripes might be inconsistent.
>> +        * In this case, we need to always do reconstruct-write, to ensure
>> +        * that in case of drive failure or read-error correction, we
>> +        * generate correct data from the parity.
>> +        */
>> +       if (conf->max_degraded == 2 ||
>> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
>> +               /* Calculate the real rcw later - for now make it
>>                  * look like rcw is cheaper
>>                  */
>>                 rcw = 1; rmw = 2;
>> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
>> sh->sector=%lu\n",
>> +                        conf->max_degraded, recovery_cp, sh->sector);
>>         } else for (i = disks; i--; ) {
>>                 /* would I have to read this buffer for read_modify_write */
>>                 struct r5dev *dev = &sh->dev[i];
>>
>>
>>
>>
>>
>>
>> On Wed, Sep 19, 2012 at 8:59 AM, NeilBrown <neilb@suse.de> wrote:
>> > On Mon, 17 Sep 2012 14:15:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hi Neil,
>> >> below is a bit less-ugly version of the patch.
>> >> Thanks,
>> >> Alex.
>> >>
>> >> >From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
>> >> From: Alex Lyakas <alex@zadarastorage.com>
>> >> Date: Thu, 13 Sep 2012 18:55:00 +0300
>> >> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>> >>  read-modify-write.
>> >>
>> >> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> >> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
>> >>
>> >> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> >> b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> >> index 5332202..0702785 100644
>> >> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> >> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
>> >> @@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>> >>                                    int disks)
>> >>  {
>> >>         int rmw = 0, rcw = 0, i;
>> >> -       if (conf->max_degraded == 2) {
>> >> -               /* RAID6 requires 'rcw' in current implementation
>> >> -                * Calculate the real rcw later - for now fake it
>> >> +       sector_t recovery_cp = conf->mddev->recovery_cp;
>> >> +       unsigned long recovery = conf->mddev->recovery;
>> >> +       int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
>> >> +       int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
>> >> +                       !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
>> >> +                       !test_bit(MD_RECOVERY_CHECK, &recovery);
>> >> +       int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
>> >> +                          !test_bit(MD_RECOVERY_SYNC, &recovery) &&
>> >> +                          !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
>> >> +                          !test_bit(MD_RECOVERY_DONE, &recovery) &&
>> >> +                          !test_bit(MD_RECOVERY_RESHAPE, &recovery);
>> >
>> > Thanks Alex,
>> >  however I don't understand why you want to test all of these bits.
>> > Isn't it enough just to check ->recovery_cp ??
>> >
>> >> +
>> >> +       /* RAID6 requires 'rcw' in current implementation.
>> >> +        * Otherwise, attempt to check whether resync is now happening
>> >> +        * or should start.
>> >> +         * If yes, then the array is dirty (after unclean shutdown or
>> >> +         * initial creation), so parity in some stripes might be inconsistent.
>> >> +         * In this case, we need to always do reconstruct-write, to ensure
>> >> +         * that in case of drive failure or read-error correction, we
>> >> +         * generate correct data from the parity.
>> >> +         */
>> >> +       if (conf->max_degraded == 2 ||
>> >> +           (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
>> >> +            (needed || resyncing || transitional))) {
>> >> +               /* Calculate the real rcw later - for now fake it
>> >>                  * look like rcw is cheaper
>> >
>> > Also, we should probably fix this comment.  s/fake/make/
>> >
>> > Thanks,
>> > NeilBrown
>> >
>> >
>> >
>> >>                  */
>> >>                 rcw = 1; rmw = 2;
>> >> +               pr_debug("force RCW max_degraded=%u, recovery_cp=%lu
>> >> sh->sector=%lu recovery=0x%lx\n",
>> >> +                        conf->max_degraded, recovery_cp, sh->sector, recovery);
>> >>         } else for (i = disks; i--; ) {
>> >>                 /* would I have to read this buffer for read_modify_write */
>> >>                 struct r5dev *dev = &sh->dev[i];
>> >
>

