* RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-05-28 12:46 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Hello Neil,
we continue testing last-drive RAID1 failure cases.
We see the following issue:
# RAID1 with drives A and B; drive B was freshly-added and is rebuilding
# Drive A fails
# A WRITE request arrives at the array. It is failed by drive A, so the
r1_bio is marked R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is also marked
R1BIO_Uptodate.
# The r1_bio arrives at handle_write_finished; badblocks are disabled,
and md_error()->error() does nothing, because we don't fail the last
drive of a raid1
# raid_end_bio_io() calls call_bio_endio()
# As a result, in call_bio_endio():
    if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
            clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE is reported back to the upper layer as successful.
# This keeps happening until the rebuild aborts and drive B is ejected
from the array[1]. After that there is only one drive (A), so after
it fails a WRITE, the master WRITE also fails.
It should be noted that I test a WRITE that is way ahead of the
recovery_offset of drive B. So after such a WRITE fails, a subsequent
READ of the same place would also fail: drive A will fail it, and
drive B cannot be used for the READ (the rebuild has not reached that
point yet).
My concrete suggestion is that this behavior is not reasonable, and
that we should only count a successful WRITE to a drive that is marked
In_sync. Please let me know what you think.
Thanks,
Alex.
[1] Sometimes it takes up to 2 minutes to eject drive B, because the
rebuild stops and keeps restarting constantly. I still need to debug
why the rebuild aborts and then immediately restarts, and why this
keeps happening for so long.
May 27 20:44:09 vc kernel: [ 6470.446899] md: recovery of RAID array md4
May 27 20:44:09 vc kernel: [ 6470.446903] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:09 vc kernel: [ 6470.446905] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:09 vc kernel: [ 6470.446908] md: using 128k window, over a total of 477044736k.
May 27 20:44:09 vc kernel: [ 6470.446910] md: resuming recovery of md4 from checkpoint.
May 27 20:44:09 vc kernel: [ 6470.543922] md: md4: recovery done.
May 27 20:44:10 vc kernel: [ 6470.727096] md: recovery of RAID array md4
May 27 20:44:10 vc kernel: [ 6470.727100] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:10 vc kernel: [ 6470.727102] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:10 vc kernel: [ 6470.727105] md: using 128k window, over a total of 477044736k.
May 27 20:44:10 vc kernel: [ 6470.727108] md: resuming recovery of md4 from checkpoint.
May 27 20:44:10 vc kernel: [ 6470.797421] md: md4: recovery done.
May 27 20:44:10 vc kernel: [ 6470.983361] md: recovery of RAID array md4
May 27 20:44:10 vc kernel: [ 6470.983365] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
May 27 20:44:10 vc kernel: [ 6470.983367] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 27 20:44:10 vc kernel: [ 6470.983370] md: using 128k window, over a total of 477044736k.
May 27 20:44:10 vc kernel: [ 6470.983372] md: resuming recovery of md4 from checkpoint.
May 27 20:44:10 vc kernel: [ 6471.109254] md: md4: recovery done.
...
So far, I see that md_do_sync() is triggered by raid1d() calling
md_check_recovery() as the first thing it does; then
md_check_recovery()->md_register_thread(md_do_sync).
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-05-28 20:12 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Hi Neil,
I have found why recovery is triggered again and again.
After the initial recovery aborts, raid1d() calls md_check_recovery(),
which calls remove_and_add_spares().
remove_and_add_spares() wants to eject the rebuilding drive from the
array, but cannot, because rdev->nr_pending > 0:
    if (rdev->raid_disk >= 0 &&
        !test_bit(Blocked, &rdev->flags) &&
        (test_bit(Faulty, &rdev->flags) ||
         !test_bit(In_sync, &rdev->flags)) &&
        atomic_read(&rdev->nr_pending) == 0) {
            if (mddev->pers->hot_remove_disk(
            ...
So instead it goes ahead and triggers another recovery in the
following condition (by incrementing "spares"):
    rdev_for_each(rdev, mddev) {
        if (rdev->raid_disk >= 0 &&
            !test_bit(In_sync, &rdev->flags) &&
            !test_bit(Faulty, &rdev->flags))
                spares++;
So another recovery starts, aborts, is triggered again, aborts, and so
forth. This happens because the two conditions
    test_bit(Faulty, &rdev->flags) || !test_bit(In_sync, &rdev->flags)
and
    !test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags)
can both be true at the same time.
The problem is that IOs keep coming in, because MD does not fail them,
as I discussed in the previous email. So nr_pending never reaches zero,
or takes a long time to do so (it's totally racy), and we keep
triggering and aborting recovery.
Can we fix remove_and_add_spares() the following way: if a drive
qualifies for removal from the array but cannot be removed because of
nr_pending, don't trigger another recovery on it by doing spares++?
This issue, however, is less critical than the one in the previous
email, because that one causes real data corruption: we WRITE only to
the rebuilding drive, but we will not read from that drive, because it
is still rebuilding. So the WRITEs are lost.
Thanks,
Alex.
On Tue, May 28, 2013 at 3:46 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> [...]
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-03 23:52 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Tue, 28 May 2013 15:46:33 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> [...]
> My concrete suggestion is that this behavior is not reasonable, and we
> should only count a successful WRITE to a drive that is marked as
> InSync. Please let me know what do you think?
Sounds reasonable. Could you make and test a patch? Then I'll apply it.
Thanks,
NeilBrown
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-04 0:04 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Tue, 28 May 2013 23:12:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> [...]
> Can we fix remove_and_add_spares() the following way:
>
> "if a drive qualifies for being removed from array, but cannot be
> removed because of nr_pending, don't try to trigger another recovery
> on it by returning doing spares++".
I'm not sure... While that approach would address this particular
problem, I'm not convinced that it would always be correct.
The 'recovery_disabled' field is supposed to stop this sort of thing, but it
doesn't get activated here because it only stops the adding of devices to the
array, and we never add a device.
I would probably remove the nr_pending test from remove_and_add_spares()
and leave that test to the personality routine (I think they already do
it). Then have ->hot_remove_disk() return a particular error status that
means "don't try any recovery", which is returned when the
->recovery_disabled fields match.
remove_and_add_spares() would then need to check for this return value
and act accordingly.
Something like that.
Do you think that would work?
Thanks,
NeilBrown
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: Alexander Lyakas @ 2013-06-04 17:54 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hello Neil,
On Tue, Jun 4, 2013 at 2:52 AM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 28 May 2013 15:46:33 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> [...]
>
> Sounds reasonable. Could you make and test a patch? Then I'll apply it.
I read all the code of raid1.c, and it seems that we need to fix only
one place. With this fix, I don't see corruption after I fail and then
revive the last drive of the raid1. I also attach the patch, in case
Gmail wraps it.
Also, with this fix, the rebuilding drive is ejected almost
immediately, so the second part of the problem doesn't happen now,
although I will still consider your suggestion for it later.
From 8d25ada35b5e1ec6ea4c6da6e571ec691b34086a Mon Sep 17 00:00:00 2001
From: Alex Lyakas <alex@zadarastorage.com>
Date: Tue, 4 Jun 2013 20:42:21 +0300
Subject: [PATCH] md/raid1: consider WRITE as successful only if at least one
non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
---
drivers/md/raid1.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6af167f..d8d159e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -427,7 +427,17 @@ static void raid1_end_write_request(struct bio *bio, int error)

 		r1_bio->bios[mirror] = NULL;
 		to_put = bio;
-		set_bit(R1BIO_Uptodate, &r1_bio->state);
+		/*
+		 * Do not set R1BIO_Uptodate if the current device is
+		 * rebuilding or Faulty. This is because we cannot use
+		 * such a device for properly reading the data back (we could
+		 * potentially use it, if the current write would have fallen
+		 * before rdev->recovery_offset, but for simplicity we don't
+		 * check this here).
+		 */
+		if (test_bit(In_sync, &conf->mirrors[mirror].rdev->flags) &&
+		    !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags))
+			set_bit(R1BIO_Uptodate, &r1_bio->state);

 		/* Maybe we can clear some bad blocks. */
 		if (is_badblock(conf->mirrors[mirror].rdev,
--
1.7.9.5
* Re: RAID1 doesn't fail WRITE that was written only on a rebuilding drive
From: NeilBrown @ 2013-06-12 0:26 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Tue, 4 Jun 2013 20:54:28 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> [...]
hi Alex,
thanks for the excellent analysis and elegant patch.
I've applied it and created a similar patch for raid10, which has the
same problem.
I'll send them upstream sometime this week.
Thanks,
NeilBrown