* md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-11 19:10 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: linux-raid

Hi Neil,
I have been doing some investigation into resync on raid5: I induced a
repairable read error on a drive that holds a data block of a particular
stripe_head, and saw that resync corrects the read error by recomputing
the missing data block, rewriting it (twice) and then re-reading it back.

Then I dug into the code and eventually saw that fetch_block():

        if ((s->uptodate == disks - 1) &&
            (s->failed && (disk_idx == s->failed_num[0] ||
                           disk_idx == s->failed_num[1]))) {
                /* have disk failed, and we're requested to fetch it;
                 * do compute it
                 */

doesn't care whether disk_idx holds data or parity for this stripe_head;
it only checks that the stripe_head has enough redundancy to recompute
the block.

My question: is this behavior correct? If we are doing a resync, it means
we are not sure that the parity is correct (e.g., after an unclean
shutdown), yet we still use this possibly-incorrect parity to recompute
the missing data block. So is this a potential bug?

Thanks,
Alex.
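[For context, a minimal userspace sketch of the XOR reconstruction being
discussed - illustrative only, not md's actual code; the function and
buffer names are invented. A missing RAID5 data block is recomputed as
the XOR of the parity block and all surviving data blocks, which is
exactly why a stale parity block silently yields wrong data:]

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Illustrative sketch: rebuild a missing data block as
	 *   missing = parity ^ d0 ^ d1 ^ ... ^ d(n-1)
	 * If "parity" was never synced against the current data blocks,
	 * the recomputed block is wrong, and nothing here can detect it.
	 */
	static void rebuild_data_block(uint8_t *missing, size_t len,
				       const uint8_t *parity,
				       const uint8_t * const *surviving,
				       int nsurviving)
	{
		for (size_t i = 0; i < len; i++) {
			uint8_t v = parity[i];

			for (int d = 0; d < nsurviving; d++)
				v ^= surviving[d][i];
			missing[i] = v;
		}
	}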
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-11 22:29 NeilBrown
  From: NeilBrown <neilb@suse.de>
  To: Alexander Lyakas; +Cc: linux-raid

On Tue, 11 Sep 2012 22:10:01 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> My question: is this behavior correct? If we are doing a resync, it means
> we are not sure that the parity is correct (e.g., after an unclean
> shutdown), yet we still use this possibly-incorrect parity to recompute
> the missing data block. So is this a potential bug?

Maybe.
I guess that if bad-block recording is enabled, then recording a bad block
there would be the "right" thing to do.
However, if bad-block recording is not enabled, then there are two options:
 1/ kick the device from the array
 2/ re-write the failing block based on the parity block, which might be
    incorrect but very probably is correct.

I suspect that the second option (which is currently implemented) will cause
less data loss than the first.

So I think the current code is the best option (unless we want to add some
bad-block-recording support).

NeilBrown
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-12  7:15 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Thanks for clarifying, Neil.
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-12 16:49 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
I have done some more investigation on that.

I see that according to handle_stripe_dirtying(), raid6 always does
reconstruct-write, while raid5 checks which will be cheaper in terms of
IOs - read-modify-write or reconstruct-write. For example, for a 3-drive
raid5 both cost the same, so because of:

        if (rmw < rcw && rmw > 0)
        ... /* this is not picked, because in this case rmw==rcw==1 */

reconstruct-write is always performed for such a 3-drive raid5. Is this
correct?

The issue with doing read-modify-write is that later we have no reliable
way to know whether the parity block is correct - when we later do a
reconstruct-write because of a read error, for example. For read requests
we could perhaps have checked the bitmap and done reconstruct-write if
the relevant bit is not set, but for write requests the relevant bit will
always be set, because it is set when the write is started.

I tried the following scenario, which showed a data corruption:
# Create a 4-drive raid5 in "--force" mode, so resync starts
# Write one sector on a stripe that resync has not handled yet. RMW is
  performed, but the parity is incorrect because the two other data
  blocks were not taken into account (they contain garbage).
# Induce a read error on the sector that I just wrote to
# Let resync handle this stripe

As a result, resync "corrects" my sector using the other two data blocks
+ the parity block, which is out of sync. When I read back the sector,
the data is incorrect.

I see that I can easily force raid5 to always do reconstruct-write, the
same way you do for raid6. However, I realize that for performance
reasons it is better to do RMW when possible.

What do you think about the following rough suggestion: in
handle_stripe_dirtying(), check whether a resync is ongoing or should be
started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
resync, there is a good reason for it: parity on some stripes is probably
out of date. So in that case, always force reconstruct-write; otherwise,
count what is cheaper, as the code does now. (Can RCW really be cheaper
than RMW?)

So during resync the array performance will be lower, but we will ensure
that all stripe blocks are consistent. What do you think?

Thanks!
Alex.
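[To make the failure mode concrete, a small illustrative sketch - not from
the thread and not md code; the function and buffer names are invented -
contrasting the two parity-update paths: read-modify-write folds the new
data into the existing parity and therefore carries any pre-existing parity
inconsistency forward, while reconstruct-write recomputes parity from all
current data blocks and so makes the stripe consistent:]

	#include <stddef.h>
	#include <stdint.h>

	/* RMW: new_parity = old_parity ^ old_data ^ new_data.
	 * If old_parity was already inconsistent with the other data
	 * blocks, that inconsistency is preserved unchanged.
	 */
	static void rmw_update_parity(uint8_t *parity, size_t len,
				      const uint8_t *old_data,
				      const uint8_t *new_data)
	{
		for (size_t i = 0; i < len; i++)
			parity[i] ^= old_data[i] ^ new_data[i];
	}

	/* RCW: parity = XOR of every current data block in the stripe.
	 * Whatever was in the parity block before is discarded, so a
	 * stripe whose parity was never synced becomes consistent after
	 * the write completes.
	 */
	static void rcw_compute_parity(uint8_t *parity, size_t len,
				       const uint8_t * const *data, int ndata)
	{
		for (size_t i = 0; i < len; i++) {
			uint8_t v = 0;

			for (int d = 0; d < ndata; d++)
				v ^= data[d][i];
			parity[i] = v;
		}
	}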
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-13  0:19 NeilBrown
  From: NeilBrown <neilb@suse.de>
  To: Alexander Lyakas; +Cc: linux-raid

On Wed, 12 Sep 2012 19:49:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> I have done some more investigation on that.
>
> I see that according to handle_stripe_dirtying(), raid6 always does
> reconstruct-write, while raid5 checks which will be cheaper in terms of
> IOs - read-modify-write or reconstruct-write. For example, for a 3-drive
> raid5 both cost the same, so because of:
>
>         if (rmw < rcw && rmw > 0)
>         ... /* this is not picked, because in this case rmw==rcw==1 */
>
> reconstruct-write is always performed for such a 3-drive raid5. Is this
> correct?

Yes.

> The issue with doing read-modify-write is that later we have no reliable
> way to know whether the parity block is correct - when we later do a
> reconstruct-write because of a read error, for example. For read requests
> we could perhaps have checked the bitmap and done reconstruct-write if
> the relevant bit is not set, but for write requests the relevant bit will
> always be set, because it is set when the write is started.
>
> I tried the following scenario, which showed a data corruption:
> # Create a 4-drive raid5 in "--force" mode, so resync starts
> # Write one sector on a stripe that resync has not handled yet. RMW is
>   performed, but the parity is incorrect because the two other data
>   blocks were not taken into account (they contain garbage).
> # Induce a read error on the sector that I just wrote to
> # Let resync handle this stripe
>
> As a result, resync "corrects" my sector using the other two data blocks
> + the parity block, which is out of sync. When I read back the sector,
> the data is incorrect.
>
> I see that I can easily force raid5 to always do reconstruct-write, the
> same way you do for raid6. However, I realize that for performance
> reasons it is better to do RMW when possible.
>
> What do you think about the following rough suggestion: in
> handle_stripe_dirtying(), check whether a resync is ongoing or should be
> started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
> resync, there is a good reason for it: parity on some stripes is probably
> out of date. So in that case, always force reconstruct-write; otherwise,
> count what is cheaper, as the code does now. (Can RCW really be cheaper
> than RMW?)
>
> So during resync the array performance will be lower, but we will ensure
> that all stripe blocks are consistent. What do you think?

I'm fairly sure we used to do that - long long ago. (hunts through git
history...) No. The code fragment was there, but it was commented out.

I think it would be good to avoid 'rmw' if the sector offset is less than
recovery_cp.

Care to write a patch?

Thanks,
NeilBrown
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-13 16:05 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
below please find the patch. It got pretty ugly, so even if you don't
apply it, please comment. Did I miss anything?

As we discussed some time ago, the usage of mddev->recovery_cp is
overloaded in MD - it indicates both the resync progress and also whether
the array is dirty due to ongoing writes. So I tried to somehow
differentiate between these two. I did not manage to reproduce the
"needed" and "transitional" parts yet. I also don't force reconstruct on
"check" and "repair". I tested the patch on kernel 3.2.

Finally, I asked you some time ago about how MD treats stripes when it is
rebuilding a drive. So now I dug into the code, and found out that
absolutely everything you told me was correct. Amazing!

Thanks,
Alex.

---------------------------
From e876cddae5768bc7f6bdcbd617f36ea3a12445aa Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..fa03410 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2561,26 +2561,54 @@ static void handle_stripe_dirtying(struct r5conf *conf,
                  * look like rcw is cheaper
                  */
                 rcw = 1; rmw = 2;
-        } else for (i = disks; i--; ) {
-                /* would I have to read this buffer for read_modify_write */
-                struct r5dev *dev = &sh->dev[i];
-                if ((dev->towrite || i == sh->pd_idx) &&
-                    !test_bit(R5_LOCKED, &dev->flags) &&
-                    !(test_bit(R5_UPTODATE, &dev->flags) ||
-                      test_bit(R5_Wantcompute, &dev->flags))) {
-                        if (test_bit(R5_Insync, &dev->flags))
-                                rmw++;
-                        else
-                                rmw += 2*disks; /* cannot read it */
-                }
-                /* Would I have to read this buffer for reconstruct_write */
-                if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-                    !test_bit(R5_LOCKED, &dev->flags) &&
-                    !(test_bit(R5_UPTODATE, &dev->flags) ||
-                      test_bit(R5_Wantcompute, &dev->flags))) {
-                        if (test_bit(R5_Insync, &dev->flags)) rcw++;
-                        else
-                                rcw += 2*disks;
+        } else {
+                /* Attempt to check whether resync is now happening.
+                 * If yes, the array is dirty (after unclean shutdown or
+                 * initial creation), so parity in some stripes might be inconsistent.
+                 * In this case, we need to always do reconstruct-write, to ensure
+                 * that in case of drive failure or read-error correction, we
+                 * generate correct data from the parity.
+                 */
+                sector_t recovery_cp = conf->mddev->recovery_cp;
+                unsigned long recovery = conf->mddev->recovery;
+                int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
+                int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                                !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
+                                !test_bit(MD_RECOVERY_CHECK, &recovery);
+                int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
+                                   !test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                                   !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
+                                   !test_bit(MD_RECOVERY_DONE, &recovery) &&
+                                   !test_bit(MD_RECOVERY_RESHAPE, &recovery);
+
+                /* If array is dirty and should start resync or already resyncing */
+                if (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
+                    (needed || resyncing || transitional)) {
+                        /* Force reconstruct-write */
+                        rcw = 1; rmw = 2;
+                        pr_debug("dirty, force RCW recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
+                                 recovery_cp, sh->sector, recovery);
+                } else for (i = disks; i--; ) {
+                        /* would I have to read this buffer for read_modify_write */
+                        struct r5dev *dev = &sh->dev[i];
+                        if ((dev->towrite || i == sh->pd_idx) &&
+                            !test_bit(R5_LOCKED, &dev->flags) &&
+                            !(test_bit(R5_UPTODATE, &dev->flags) ||
+                              test_bit(R5_Wantcompute, &dev->flags))) {
+                                if (test_bit(R5_Insync, &dev->flags))
+                                        rmw++;
+                                else
+                                        rmw += 2*disks; /* cannot read it */
+                        }
+                        /* Would I have to read this buffer for reconstruct_write */
+                        if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+                            !test_bit(R5_LOCKED, &dev->flags) &&
+                            !(test_bit(R5_UPTODATE, &dev->flags) ||
+                              test_bit(R5_Wantcompute, &dev->flags))) {
+                                if (test_bit(R5_Insync, &dev->flags)) rcw++;
+                                else
+                                        rcw += 2*disks;
+                        }
                 }
         }
         pr_debug("for sector %llu, rmw=%d rcw=%d\n",
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-13 16:11 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

I forgot to mention that there is a flaw in my patch: if the user does

        echo "frozen" > /sys/block/mdX/md/sync_action

then resync is aborted and the NEEDED flag is also cleared, so I cannot
distinguish between "needs resync" and "has ongoing writes". In that case
RMW is performed.

If there is a drive failure during resync, then eventually recovery_cp
becomes MaxSector (after an additional "empty resync", as we discussed
long ago). So then it's OK to do RMW, because when we re-add the drive,
we will reconstruct its missing blocks.

Thanks,
Alex.
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-17 11:15 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
below is a bit less-ugly version of the patch.
Thanks,
Alex.

From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..0702785 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
                                    int disks)
 {
         int rmw = 0, rcw = 0, i;
-        if (conf->max_degraded == 2) {
-                /* RAID6 requires 'rcw' in current implementation
-                 * Calculate the real rcw later - for now fake it
+        sector_t recovery_cp = conf->mddev->recovery_cp;
+        unsigned long recovery = conf->mddev->recovery;
+        int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
+        int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                        !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
+                        !test_bit(MD_RECOVERY_CHECK, &recovery);
+        int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
+                           !test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                           !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
+                           !test_bit(MD_RECOVERY_DONE, &recovery) &&
+                           !test_bit(MD_RECOVERY_RESHAPE, &recovery);
+
+        /* RAID6 requires 'rcw' in current implementation.
+         * Otherwise, attempt to check whether resync is now happening
+         * or should start.
+         * If yes, then the array is dirty (after unclean shutdown or
+         * initial creation), so parity in some stripes might be inconsistent.
+         * In this case, we need to always do reconstruct-write, to ensure
+         * that in case of drive failure or read-error correction, we
+         * generate correct data from the parity.
+         */
+        if (conf->max_degraded == 2 ||
+            (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
+             (needed || resyncing || transitional))) {
+                /* Calculate the real rcw later - for now fake it
                  * look like rcw is cheaper
                  */
                 rcw = 1; rmw = 2;
+                pr_debug("force RCW max_degraded=%u, recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
+                         conf->max_degraded, recovery_cp, sh->sector, recovery);
         } else for (i = disks; i--; ) {
                 /* would I have to read this buffer for read_modify_write */
                 struct r5dev *dev = &sh->dev[i];
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-19  5:59 NeilBrown
  From: NeilBrown <neilb@suse.de>
  To: Alexander Lyakas; +Cc: linux-raid

On Mon, 17 Sep 2012 14:15:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> below is a bit less-ugly version of the patch.
> Thanks,
> Alex.
>
> From 05cf800d623bf558c99d542cf8bf083c85b7e5d5 Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@zadarastorage.com>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
>
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> Signed-off-by: Yair Hershko <yair@zadarastorage.com>
>
> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> index 5332202..0702785 100644
> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> @@ -2555,12 +2555,36 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                                     int disks)
>  {
>          int rmw = 0, rcw = 0, i;
> -        if (conf->max_degraded == 2) {
> -                /* RAID6 requires 'rcw' in current implementation
> -                 * Calculate the real rcw later - for now fake it
> +        sector_t recovery_cp = conf->mddev->recovery_cp;
> +        unsigned long recovery = conf->mddev->recovery;
> +        int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
> +        int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                        !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
> +                        !test_bit(MD_RECOVERY_CHECK, &recovery);
> +        int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
> +                           !test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                           !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
> +                           !test_bit(MD_RECOVERY_DONE, &recovery) &&
> +                           !test_bit(MD_RECOVERY_RESHAPE, &recovery);

Thanks Alex,
however I don't understand why you want to test all of these bits.
Isn't it enough just to check ->recovery_cp ??

> +
> +        /* RAID6 requires 'rcw' in current implementation.
> +         * Otherwise, attempt to check whether resync is now happening
> +         * or should start.
> +         * If yes, then the array is dirty (after unclean shutdown or
> +         * initial creation), so parity in some stripes might be inconsistent.
> +         * In this case, we need to always do reconstruct-write, to ensure
> +         * that in case of drive failure or read-error correction, we
> +         * generate correct data from the parity.
> +         */
> +        if (conf->max_degraded == 2 ||
> +            (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> +             (needed || resyncing || transitional))) {
> +                /* Calculate the real rcw later - for now fake it
>                   * look like rcw is cheaper

Also, we should probably fix this comment. s/fake/make/

Thanks,
NeilBrown

>                   */
>                  rcw = 1; rmw = 2;
> +                pr_debug("force RCW max_degraded=%u, recovery_cp=%lu sh->sector=%lu recovery=0x%lx\n",
> +                         conf->max_degraded, recovery_cp, sh->sector, recovery);
>          } else for (i = disks; i--; ) {
>                  /* would I have to read this buffer for read_modify_write */
>                  struct r5dev *dev = &sh->dev[i];
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-20  8:26 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
you are completely right. I got confused between mddev->recovery_cp and
sb->resync_offset; the latter may become 0 due to in-flight WRITEs and
not due to resync. Looking at the code again, I see that recovery_cp only
moves one way, from sb->resync_offset towards MaxSector (except for
explicit loading via sysfs). Also, recovery_cp is not relevant to "check"
and "repair". So recovery_cp is pretty simple after all.

Below is the V2 patch. (I also have to credit it to somebody else,
because he was the one who said - just do rcw while you are resyncing.)

Thanks,
Alex.

-----------------
From cc3e2bfcf2fd2c69180577949425d69de88706bb Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Thu, 13 Sep 2012 18:55:00 +0300
Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
 read-modify-write.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Yair Hershko <yair@zadarastorage.com>

diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
index 5332202..9fdd5e3 100644
--- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
+++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
@@ -2555,12 +2555,24 @@ static void handle_stripe_dirtying(struct r5conf *conf,
                                    int disks)
 {
         int rmw = 0, rcw = 0, i;
-        if (conf->max_degraded == 2) {
-                /* RAID6 requires 'rcw' in current implementation
-                 * Calculate the real rcw later - for now fake it
+        sector_t recovery_cp = conf->mddev->recovery_cp;
+
+        /* RAID6 requires 'rcw' in current implementation.
+         * Otherwise, check whether resync is now happening or should start.
+         * If yes, then the array is dirty (after unclean shutdown or
+         * initial creation), so parity in some stripes might be inconsistent.
+         * In this case, we need to always do reconstruct-write, to ensure
+         * that in case of drive failure or read-error correction, we
+         * generate correct data from the parity.
+         */
+        if (conf->max_degraded == 2 ||
+            (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
+                /* Calculate the real rcw later - for now make it
                  * look like rcw is cheaper
                  */
                 rcw = 1; rmw = 2;
+                pr_debug("force RCW max_degraded=%u, recovery_cp=%lu sh->sector=%lu\n",
+                         conf->max_degraded, recovery_cp, sh->sector);
         } else for (i = disks; i--; ) {
                 /* would I have to read this buffer for read_modify_write */
                 struct r5dev *dev = &sh->dev[i];
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-25  6:57 NeilBrown
  From: NeilBrown <neilb@suse.de>
  To: Alexander Lyakas; +Cc: linux-raid

On Thu, 20 Sep 2012 11:26:50 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Below is the V2 patch. (I also have to credit it to somebody else,
> because he was the one who said - just do rcw while you are resyncing.)
>
> From cc3e2bfcf2fd2c69180577949425d69de88706bb Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@zadarastorage.com>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
>
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> Signed-off-by: Yair Hershko <yair@zadarastorage.com>

Signed-off-by has a very specific meaning - it isn't just a way of giving
credit.
If Yair wrote some of the code, this is fine.
If not, then something like "Suggested-by:" might be more appropriate.
Should I change it to that?

Applied, thanks.

NeilBrown
* Re: md: raid5 resync corrects read errors on data block - is this correct?
@ 2012-09-25  7:50 Alexander Lyakas
  From: Alexander Lyakas <alex.bolshoy@gmail.com>
  To: NeilBrown; +Cc: linux-raid

Yes, Neil, please change it then to "Suggested-by".
Thanks!
Alex.