* RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-05-28 12:46 UTC
To: linux-raid; +Cc: NeilBrown

Hello Neil,
we continue testing last-drive RAID1 failure cases. We see the following issue:

# RAID1 with drives A and B; drive B was freshly added and is rebuilding.
# Drive A fails.
# A WRITE request arrives at the array. It is failed by drive A, so the
  r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
  succeeds in writing it, so the same r1_bio is also marked as
  R1BIO_Uptodate.
# The r1_bio arrives at handle_write_finished; badblocks are disabled,
  and md_error()->error() does nothing, because we don't fail the last
  drive of a raid1.
# raid_end_bio_io() calls call_bio_endio().
# As a result, in call_bio_endio():

        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);

  this code doesn't clear the BIO_UPTODATE flag, and the whole master
  WRITE succeeds, back to the upper layer.
# This keeps happening until the rebuild aborts and drive B is ejected
  from the array[1]. After that, there is only one drive (A), so after
  it fails a WRITE, the master WRITE also fails.

It should be noted that I test a WRITE that is way ahead of the
recovery_offset of drive B. So after such a WRITE fails, a subsequent
READ of the same place would also fail: drive A will fail it, and drive
B cannot be attempted for the READ (the rebuild has not reached that
point yet).

My concrete suggestion is that this behavior is not reasonable, and we
should only count a successful WRITE to a drive that is marked In_sync.
Please let me know what you think.

Thanks,
Alex.

[1] Sometimes it takes up to 2 minutes to eject drive B, because the
rebuild stops and keeps restarting constantly. I have yet to debug why
the rebuild aborts and then immediately restarts, and why this keeps
happening for a long time.

May 27 20:44:09 vc kernel: [ 6470.446899] md: recovery of RAID array md4
May 27 20:44:09 vc kernel: [ 6470.446903] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:09 vc kernel: [ 6470.446905] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:09 vc kernel: [ 6470.446908] md: using 128k window, over a total of 477044736k.
May 27 20:44:09 vc kernel: [ 6470.446910] md: resuming recovery of md4 from checkpoint.
May 27 20:44:09 vc kernel: [ 6470.543922] md: md4: recovery done.
May 27 20:44:10 vc kernel: [ 6470.727096] md: recovery of RAID array md4
May 27 20:44:10 vc kernel: [ 6470.727100] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:10 vc kernel: [ 6470.727102] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:10 vc kernel: [ 6470.727105] md: using 128k window, over a total of 477044736k.
May 27 20:44:10 vc kernel: [ 6470.727108] md: resuming recovery of md4 from checkpoint.
May 27 20:44:10 vc kernel: [ 6470.797421] md: md4: recovery done.
May 27 20:44:10 vc kernel: [ 6470.983361] md: recovery of RAID array md4
May 27 20:44:10 vc kernel: [ 6470.983365] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:10 vc kernel: [ 6470.983367] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:10 vc kernel: [ 6470.983370] md: using 128k window, over a total of 477044736k.
May 27 20:44:10 vc kernel: [ 6470.983372] md: resuming recovery of md4 from checkpoint.
May 27 20:44:10 vc kernel: [ 6471.109254] md: md4: recovery done.
...

Up to now, I see that md_do_sync() is triggered by raid1d() calling
md_check_recovery() as the first thing it does. Then
md_check_recovery()->md_register_thread(md_do_sync).
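The flag interplay described in the report can be sketched as a small stand-alone C model. The names mirror the kernel's R1BIO flags, but this is an illustrative simulation of the described logic, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

enum { R1BIO_Uptodate = 1 << 0, R1BIO_WriteError = 1 << 1 };

/* Minimal stand-in for a mirror leg of the array. */
struct mirror { bool in_sync; bool faulty; bool write_ok; };

/* Current behaviour: any mirror that completes the write without error
 * sets R1BIO_Uptodate, even if that mirror is still rebuilding. */
static int end_write_current(const struct mirror *m, int n)
{
    int state = 0;
    for (int i = 0; i < n; i++) {
        if (m[i].write_ok)
            state |= R1BIO_Uptodate;
        else
            state |= R1BIO_WriteError;
    }
    return state;
}

/* Suggested rule: only an In_sync, non-Faulty mirror counts. */
static int end_write_fixed(const struct mirror *m, int n)
{
    int state = 0;
    for (int i = 0; i < n; i++) {
        if (m[i].write_ok) {
            if (m[i].in_sync && !m[i].faulty)
                state |= R1BIO_Uptodate;
        } else {
            state |= R1BIO_WriteError;
        }
    }
    return state;
}

/* As in call_bio_endio(): the master WRITE succeeds iff Uptodate is set. */
static bool master_write_ok(int state) { return state & R1BIO_Uptodate; }
```

In the failure scenario (drive A In_sync but failing the write, drive B rebuilding and completing it), the current rule reports success to the upper layer while the suggested rule correctly fails the master WRITE.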
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-05-28 20:12 UTC
To: linux-raid; +Cc: NeilBrown

Hi Neil,
I have found why recovery is triggered again and again.

After the initial recovery aborts, raid1d() calls md_check_recovery(),
which calls remove_and_add_spares(). remove_and_add_spares() wants to
eject the rebuilding drive from the array, but cannot, because
rdev->nr_pending > 0:

        if (rdev->raid_disk >= 0 &&
            !test_bit(Blocked, &rdev->flags) &&
            (test_bit(Faulty, &rdev->flags) ||
             ! test_bit(In_sync, &rdev->flags)) &&
            atomic_read(&rdev->nr_pending)==0) {
                if (mddev->pers->hot_remove_disk(
                ...

So instead it goes ahead and triggers another recovery, by incrementing
"spares" in the following condition:

        rdev_for_each(rdev, mddev) {
                if (rdev->raid_disk >= 0 &&
                    !test_bit(In_sync, &rdev->flags) &&
                    !test_bit(Faulty, &rdev->flags))
                        spares++;

So another recovery starts, aborts, is triggered again, aborts, and so
forth, because the two conditions

        test_bit(Faulty, &rdev->flags) || ! test_bit(In_sync, &rdev->flags)

and

        !test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags)

can both be true at the same time.

The problem is that IOs keep coming in, because MD does not fail them,
as I discussed in the previous email. So nr_pending never gets to zero,
or takes a long time to get there (it is totally racy), and we keep
triggering and aborting recovery.

Can we fix remove_and_add_spares() the following way: if a drive
qualifies for being removed from the array, but cannot be removed
because of nr_pending, don't trigger another recovery on it by doing
spares++?

This issue, however, is less critical than the issue in the previous
email, because the previous issue causes real data corruption: we WRITE
only to the rebuilding drive, but we will not be reading from this
drive, because it is still rebuilding. So the WRITEs are lost.

Thanks,
Alex.
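The overlap between the two conditions above can be checked with a small stand-alone C model. The struct fields are simplified stand-ins for the rdev flags; this is an illustration of the described livelock, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct md_rdev: only the fields the two
 * conditions in remove_and_add_spares() look at. */
struct rdev_model {
    int  raid_disk;
    int  nr_pending;
    bool in_sync, faulty, blocked;
};

/* First condition: the device qualifies for removal... */
static bool removal_candidate(const struct rdev_model *r)
{
    return r->raid_disk >= 0 && !r->blocked &&
           (r->faulty || !r->in_sync);
}

/* ...but removal only proceeds when no IO is pending on it. */
static bool removable_now(const struct rdev_model *r)
{
    return removal_candidate(r) && r->nr_pending == 0;
}

/* Second condition: the same device also counts as a spare, which
 * schedules yet another recovery attempt. */
static bool counted_as_spare(const struct rdev_model *r)
{
    return r->raid_disk >= 0 && !r->in_sync && !r->faulty;
}
```

A rebuilding drive (not In_sync, not Faulty) with IO still pending satisfies the first condition but fails the nr_pending gate, while simultaneously satisfying the spare-counting condition, so recovery keeps being retriggered.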
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-04 0:04 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Tue, 28 May 2013 23:12:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> Can we fix remove_and_add_spares() the following way: if a drive
> qualifies for being removed from the array, but cannot be removed
> because of nr_pending, don't trigger another recovery on it by doing
> spares++?

I'm not sure... While that approach would address this particular
problem, I'm not convinced that it would always be correct.

The 'recovery_disabled' field is supposed to stop this sort of thing,
but it doesn't get activated here, because it only stops the adding of
devices to the array, and we never add a device.

I would probably remove the nr_pending test from remove_and_add_spares()
and leave it to the personality routine to do that test (I think they
already do). Then get ->hot_remove_disk to have a particular error
status which means "don't try any recovery", which is returned if the
->recovery_disabled fields match. Then remove_and_add_spares() needs to
check for this return value and act accordingly.

Something like that. Do you think that would work?

Thanks,
NeilBrown
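Neil's proposal could look roughly like the following stand-alone sketch. The sentinel value, names, and structures here are invented for illustration; an eventual kernel change would choose its own:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes; the values are arbitrary in this model. */
#define MD_NO_RECOVERY  (-2)  /* removal refused AND don't retry recovery */
#define MD_BUSY         (-1)  /* removal refused, retrying is fine */

struct model_rdev  { int nr_pending; int recovery_disabled; };
struct model_mddev { int recovery_disabled; };

/* Personality-side removal: owns the nr_pending test, and reports the
 * sentinel when recovery has been disabled for this slot. */
static int hot_remove_disk(struct model_mddev *mddev, struct model_rdev *rdev)
{
    if (rdev->nr_pending > 0) {
        if (rdev->recovery_disabled == mddev->recovery_disabled)
            return MD_NO_RECOVERY;
        return MD_BUSY;
    }
    return 0; /* removed */
}

/* Caller side (remove_and_add_spares): only schedule another recovery
 * when the personality did not veto it. */
static bool may_retry_recovery(int status)
{
    return status != MD_NO_RECOVERY;
}
```

With this shape, the busy-but-recovery-disabled drive no longer feeds the spares count, breaking the abort/restart loop without a blanket nr_pending check in the generic code.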
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-03 23:52 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Tue, 28 May 2013 15:46:33 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> My concrete suggestion is that this behavior is not reasonable, and we
> should only count a successful WRITE to a drive that is marked
> In_sync. Please let me know what you think.

Sounds reasonable. Could you make and test a patch? Then I'll apply it.

Thanks,
NeilBrown
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-06-04 17:54 UTC
To: NeilBrown; +Cc: linux-raid

Hello Neil,

On Tue, Jun 4, 2013 at 2:52 AM, NeilBrown <neilb@suse.de> wrote:
> Sounds reasonable. Could you make and test a patch? Then I'll apply it.

I read all the code of raid1.c, and it seems that we need to fix only
one place. With this fix, I don't see corruption after I fail and then
revive the last drive of the raid1. I also attach the patch, in case
Gmail wraps it.

Also, with this fix, the rebuilding drive is ejected almost immediately,
so the second part of the problem doesn't happen now, although I will
consider your suggestion for it later.

From 8d25ada35b5e1ec6ea4c6da6e571ec691b34086a Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Tue, 4 Jun 2013 20:42:21 +0300
Subject: [PATCH] md/raid1: consider WRITE as successful only if at least one
 non-Faulty and non-rebuilding drive completed it.

Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
  r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
  succeeds in writing it, so the same r1_bio is marked as
  R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
  md_error()->error() does nothing because we don't fail the last drive
  of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);
  this code doesn't clear the BIO_UPTODATE flag, and the whole master
  WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
---
 drivers/md/raid1.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6af167f..d8d159e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -427,7 +427,17 @@ static void raid1_end_write_request(struct bio *bio, int error)
 
 		r1_bio->bios[mirror] = NULL;
 		to_put = bio;
-		set_bit(R1BIO_Uptodate, &r1_bio->state);
+		/*
+		 * Do not set R1BIO_Uptodate if the current device is
+		 * rebuilding or Faulty. This is because we cannot use
+		 * such device for properly reading the data back (we could
+		 * potentially use it, if the current write would have fallen
+		 * before rdev->recovery_offset, but for simplicity we don't
+		 * check this here).
+		 */
+		if (test_bit(In_sync, &conf->mirrors[mirror].rdev->flags) &&
+		    !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags))
+			set_bit(R1BIO_Uptodate, &r1_bio->state);
 
 		/* Maybe we can clear some bad blocks. */
 		if (is_badblock(conf->mirrors[mirror].rdev,
--
1.7.9.5
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-12 0:26 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Tue, 4 Jun 2013 20:54:28 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> I read all the code of raid1.c, and it seems that we need to fix only
> one place. With this fix, I don't see corruption after I fail and then
> revive the last drive of the raid1.

hi Alex,
thanks for the excellent analysis and elegant patch. I've applied it
and created a similar patch for raid10, which has the same problem.
I'll send them upstream sometime this week.

Thanks,
NeilBrown