* [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
@ 2015-09-18 17:20 Shaohua Li
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
  0 siblings, 2 replies; 10+ messages in thread

From: Shaohua Li @ 2015-09-18 17:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If an array has more faulty disks than its allowed degraded number, the
array enters error handling. It is marked read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
clear the CHANGE_PENDING bit for a read-only array. When
MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
a list until the bit is cleared. Since recovery never clears this bit,
the IO stays in the pending state forever and never finishes. This has
bad effects: the upper layer can't get an IO error and the array can't
be stopped.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 95824fb..c596b73 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
 			md_reap_sync_thread(mddev);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
 			goto unlock;
 		}
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
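[Editor's note: to make the failure mode above concrete, here is a toy C model — all names and structures are hypothetical simplifications, not the actual md code — of how completed IO parked behind a never-cleared MD_CHANGE_PENDING bit can only drain once recovery clears the bit:]

```c
#include <assert.h>

/* Toy model of the md flag handling described in the patch.
 * MD_CHANGE_PENDING, toy_mddev, toy_return_io and toy_check_recovery
 * are hypothetical stand-ins for the real kernel structures. */
#define MD_CHANGE_PENDING (1u << 0)

struct toy_mddev {
	unsigned flags;
	int held_ios;      /* bios parked on the return list */
	int completed_ios; /* bios actually returned to the upper layer */
};

/* raid5 end-of-IO path: while CHANGE_PENDING is set, completed bios are
 * parked instead of being returned, so the caller never sees an error. */
static void toy_return_io(struct toy_mddev *m)
{
	if (m->flags & MD_CHANGE_PENDING)
		m->held_ios++;
	else
		m->completed_ios++;
}

/* the fix: recovery clears CHANGE_PENDING even for a read-only array,
 * which lets the held bios drain */
static void toy_check_recovery(struct toy_mddev *m)
{
	m->flags &= ~MD_CHANGE_PENDING;
	m->completed_ios += m->held_ios;
	m->held_ios = 0;
}
```

Without the clear_bit() in md_check_recovery(), the toy model never moves an IO from held_ios to completed_ios, which mirrors the hang the patch fixes.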
* [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
@ 2015-09-18 17:20 ` Shaohua Li
  2015-09-23  6:21   ` Neil Brown
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
  1 sibling, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-18 17:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

handle_failed_stripe() fails the stripe, i.e. all IO will return with a
failure, but it doesn't update stripe_head_state. Later handle_stripe()
has special raid6 handling before handle_stripe_fill(). That check
doesn't skip the failed stripe, and we get a kernel crash in
need_this_block(). This patch clears the analysis state to make sure no
functions are wrongly called after handle_failed_stripe().

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 394cdf8..8e4fb89a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		spin_unlock_irq(&sh->stripe_lock);
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 			wake_up(&conf->wait_for_overlap);
+		if (bi)
+			s->to_read--;
 		while (bi && bi->bi_iter.bi_sector <
 		       sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *nextbi =
@@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		 */
 		clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	}
+	s->to_write = 0;
+	s->written = 0;
 
 	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
 		if (atomic_dec_and_test(&conf->pending_full_writes))
-- 
1.8.1
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
@ 2015-09-23  6:21   ` Neil Brown
  2015-09-23  6:34     ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-23  6:21 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> handle_failed_stripe() fails the stripe, i.e. all IO will return with a
> failure, but it doesn't update stripe_head_state. Later handle_stripe()
> has special raid6 handling before handle_stripe_fill(). That check
> doesn't skip the failed stripe, and we get a kernel crash in
> need_this_block(). This patch clears the analysis state to make sure no
> functions are wrongly called after handle_failed_stripe().
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 394cdf8..8e4fb89a 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  		spin_unlock_irq(&sh->stripe_lock);
>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>  			wake_up(&conf->wait_for_overlap);
> +		if (bi)
> +			s->to_read--;
>  		while (bi && bi->bi_iter.bi_sector <
>  		       sh->dev[i].sector + STRIPE_SECTORS) {
>  			struct bio *nextbi =
> @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  		 */
>  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
>  	}
> +	s->to_write = 0;
> +	s->written = 0;
>
>  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
>  		if (atomic_dec_and_test(&conf->pending_full_writes))
> -- 
> 1.8.1

Again, this probably is a sensible fix, but I would like to be certain.
Where exactly in need_this_block does the kernel crash?  I cannot see
anything that could cause an invalid address....

Thanks,
NeilBrown
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-23  6:21   ` Neil Brown
@ 2015-09-23  6:34     ` Shaohua Li
  2015-09-24  5:26       ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-23  6:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, Sep 23, 2015 at 04:21:58PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > handle_failed_stripe() fails the stripe, i.e. all IO will return with a
> > failure, but it doesn't update stripe_head_state. Later handle_stripe()
> > has special raid6 handling before handle_stripe_fill(). That check
> > doesn't skip the failed stripe, and we get a kernel crash in
> > need_this_block(). This patch clears the analysis state to make sure no
> > functions are wrongly called after handle_failed_stripe().
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/raid5.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 394cdf8..8e4fb89a 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >  		spin_unlock_irq(&sh->stripe_lock);
> >  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> >  			wake_up(&conf->wait_for_overlap);
> > +		if (bi)
> > +			s->to_read--;
> >  		while (bi && bi->bi_iter.bi_sector <
> >  		       sh->dev[i].sector + STRIPE_SECTORS) {
> >  			struct bio *nextbi =
> > @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >  		 */
> >  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
> >  	}
> > +	s->to_write = 0;
> > +	s->written = 0;
> >
> >  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
> >  		if (atomic_dec_and_test(&conf->pending_full_writes))
> > -- 
> > 1.8.1
> 
> Again, this probably is a sensible fix, but I would like to be certain.
> Where exactly in need_this_block does the kernel crash?  I cannot see
> anything that could cause an invalid address....

>>for (i = 0; i < s->failed; i++) {
>>	if (fdev[i]->towrite &&

It crashes on the fdev[i]->towrite access: s->failed >= 2 (it's 3 in my
case), while the array has only 2 entries.

Thanks,
Shaohua
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-23  6:34     ` Shaohua Li
@ 2015-09-24  5:26       ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread

From: Neil Brown @ 2015-09-24  5:26 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Wed, Sep 23, 2015 at 04:21:58PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > handle_failed_stripe() fails the stripe, i.e. all IO will return with a
>> > failure, but it doesn't update stripe_head_state. Later handle_stripe()
>> > has special raid6 handling before handle_stripe_fill(). That check
>> > doesn't skip the failed stripe, and we get a kernel crash in
>> > need_this_block(). This patch clears the analysis state to make sure no
>> > functions are wrongly called after handle_failed_stripe().
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/raid5.c | 4 ++++
>> >  1 file changed, 4 insertions(+)
>> >
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index 394cdf8..8e4fb89a 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>> >  		spin_unlock_irq(&sh->stripe_lock);
>> >  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>> >  			wake_up(&conf->wait_for_overlap);
>> > +		if (bi)
>> > +			s->to_read--;
>> >  		while (bi && bi->bi_iter.bi_sector <
>> >  		       sh->dev[i].sector + STRIPE_SECTORS) {
>> >  			struct bio *nextbi =
>> > @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>> >  		 */
>> >  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
>> >  	}
>> > +	s->to_write = 0;
>> > +	s->written = 0;
>> >
>> >  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
>> >  		if (atomic_dec_and_test(&conf->pending_full_writes))
>> > -- 
>> > 1.8.1
>>
>> Again, this probably is a sensible fix, but I would like to be certain.
>> Where exactly in need_this_block does the kernel crash?  I cannot see
>> anything that could cause an invalid address....
>
>>>for (i = 0; i < s->failed; i++) {
>>>	if (fdev[i]->towrite &&
> It crashes on the fdev[i]->towrite access: s->failed >= 2 (it's 3 in my
> case), while the array has only 2 entries.
>
> Thanks,
> Shaohua

Ahh, of course.  In that case I think I'd like to limit the for loop as
well.  So I've applied your patch and this one as well.

Thanks,
NeilBrown

From 76e308d70b204ff0af0028458caabfeacac4541a Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Thu, 24 Sep 2015 15:25:36 +1000
Subject: [PATCH] md/raid5: don't index beyond end of array in
 need_this_block().

While need_this_block probably shouldn't be called when there are more
than 2 failed devices, we really don't want it to try indexing beyond
the end of the failed_num[] or fdev[] arrays.  So limit the loops to
at most 2 iterations.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 903d8a2b7b07..0f49ce411c9a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3304,7 +3304,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 		 */
 		return 0;
 
-	for (i = 0; i < s->failed; i++) {
+	for (i = 0; i < s->failed && i < 2; i++) {
 		if (fdev[i]->towrite &&
 		    !test_bit(R5_UPTODATE, &fdev[i]->flags) &&
 		    !test_bit(R5_OVERWRITE, &fdev[i]->flags))
@@ -3328,7 +3328,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 	    sh->sector < sh->raid_conf->mddev->recovery_cp)
 		/* reconstruct-write isn't being forced */
 		return 0;
-	for (i = 0; i < s->failed; i++) {
+	for (i = 0; i < s->failed && i < 2; i++) {
 		if (s->failed_num[i] != sh->pd_idx &&
 		    s->failed_num[i] != sh->qd_idx &&
 		    !test_bit(R5_UPTODATE, &fdev[i]->flags) &&
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
@ 2015-09-23  6:05 ` Neil Brown
  2015-09-23  6:23   ` Shaohua Li
  1 sibling, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-23  6:05 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> If an array has more faulty disks than its allowed degraded number, the
> array enters error handling. It is marked read-only with
> MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
> clear the CHANGE_PENDING bit for a read-only array. When
> MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
> a list until the bit is cleared. Since recovery never clears this bit,
> the IO stays in the pending state forever and never finishes. This has
> bad effects: the upper layer can't get an IO error and the array can't
> be stopped.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 95824fb..c596b73 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>  			md_reap_sync_thread(mddev);
>  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>  			goto unlock;
>  		}
>
> -- 
> 1.8.1

Hi,
I can see that clearing MD_CHANGE_PENDING there is probably correct -
a bug introduced by
Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")

However I don't understand your reasoning.  You say that the array is
marked as read-only, but I don't see how that would happen.  What
causes the array to be marked "read-only"?

Thanks,
NeilBrown
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
@ 2015-09-23  6:23   ` Shaohua Li
  2015-09-24  4:03     ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-23  6:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > If an array has more faulty disks than its allowed degraded number, the
> > array enters error handling. It is marked read-only with
> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
> > clear the CHANGE_PENDING bit for a read-only array. When
> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
> > a list until the bit is cleared. Since recovery never clears this bit,
> > the IO stays in the pending state forever and never finishes. This has
> > bad effects: the upper layer can't get an IO error and the array can't
> > be stopped.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/md.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 95824fb..c596b73 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
> >  			md_reap_sync_thread(mddev);
> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> >  			goto unlock;
> >  		}
> >
> > -- 
> > 1.8.1
> 
> Hi,
> I can see that clearing MD_CHANGE_PENDING there is probably correct -
> a bug introduced by
> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
> 
> However I don't understand your reasoning.  You say that the array is
> marked as read-only, but I don't see how that would happen.  What
> causes the array to be marked "read-only"?

It's set read-only by mdadm. I didn't look carefully, but it looks like
there is a disk failure event and mdadm is invoked automatically by some
background daemon. It's an Ubuntu distribution.

Thanks,
Shaohua
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-23  6:23   ` Shaohua Li
@ 2015-09-24  4:03     ` Neil Brown
  2015-09-24 16:47       ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-24  4:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > If an array has more faulty disks than its allowed degraded number, the
>> > array enters error handling. It is marked read-only with
>> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
>> > clear the CHANGE_PENDING bit for a read-only array. When
>> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
>> > a list until the bit is cleared. Since recovery never clears this bit,
>> > the IO stays in the pending state forever and never finishes. This has
>> > bad effects: the upper layer can't get an IO error and the array can't
>> > be stopped.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/md.c | 1 +
>> >  1 file changed, 1 insertion(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 95824fb..c596b73 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>> >  			md_reap_sync_thread(mddev);
>> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> >  			goto unlock;
>> >  		}
>> >
>> > -- 
>> > 1.8.1
>>
>> Hi,
>> I can see that clearing MD_CHANGE_PENDING there is probably correct -
>> a bug introduced by
>> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
>>
>> However I don't understand your reasoning.  You say that the array is
>> marked as read-only, but I don't see how that would happen.  What
>> causes the array to be marked "read-only"?
>
> It's set read-only by mdadm. I didn't look carefully, but it looks like
> there is a disk failure event and mdadm is invoked automatically by some
> background daemon. It's an Ubuntu distribution.

Thanks.
This raises a couple of questions.

1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
   set?
   Maybe it should wait for md_check_recovery to get run, which should
   clear the bit, probably after writing out the metadata.

2/ Why didn't md_check_recovery already do that before mdadm had a
   chance to set the array read-only?
   I guess that is just a timing thing.  md_check_recovery could be
   delayed, and mdadm could get called by udev rather quickly.

I think I'll get md_set_readonly to

	wait_event(mddev->sb_wait,
		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));

because I think that is the right thing to do.  But if the array is
already read-only that won't help, so I'll still need your patch.

Would you be able to test that the following patch (without your patch)
also fixes the symptom?

Thanks.
NeilBrown

From a0a21b57523c41e87e0a50333b15f6c7ca02d714 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Thu, 24 Sep 2015 14:00:51 +1000
Subject: [PATCH] md: wait for pending superblock updates before switching to
 read-only

If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 23651bf8dd80..9882f70f45e1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5432,6 +5432,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		mddev_unlock(mddev);
 		wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING,
 						  &mddev->recovery));
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		mddev_lock_nointr(mddev);
 
 	mutex_lock(&mddev->open_mutex);
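[Editor's note: a toy model — all names hypothetical, not the actual md code — of the ordering this patch enforces. If the array flips read-only while a metadata update is still pending, the failure record is never written; flushing the pending update before switching preserves it:]

```c
#include <assert.h>

/* Hypothetical simplification: toy_array, toy_write_super and
 * toy_set_readonly stand in for mddev, the superblock writer and
 * md_set_readonly respectively. */
struct toy_array {
	int change_pending;   /* metadata update queued but not yet on disk */
	int metadata_written; /* device-failure record reached disk */
	int read_only;
};

/* superblock writer: a read-only array never writes metadata */
static void toy_write_super(struct toy_array *a)
{
	if (a->change_pending && !a->read_only) {
		a->metadata_written = 1;
		a->change_pending = 0;
	}
}

/* the fix: drain the pending update first, then flip to read-only,
 * modelling the added wait_event() on sb_wait */
static void toy_set_readonly(struct toy_array *a)
{
	toy_write_super(a);
	a->read_only = 1;
}
```

If toy_set_readonly() set read_only before draining, toy_write_super() would never run again for this array and the failure record would be lost, which is the situation the wait_event() guards against.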
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-24  4:03     ` Neil Brown
@ 2015-09-24 16:47       ` Shaohua Li
  2015-09-30  6:59         ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-24 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Sep 24, 2015 at 02:03:53PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
> >> Shaohua Li <shli@fb.com> writes:
> >>
> >> > If an array has more faulty disks than its allowed degraded number,
> >> > the array enters error handling. It is marked read-only with
> >> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently
> >> > doesn't clear the CHANGE_PENDING bit for a read-only array. When
> >> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held
> >> > on a list until the bit is cleared. Since recovery never clears this
> >> > bit, the IO stays in the pending state forever and never finishes.
> >> > This has bad effects: the upper layer can't get an IO error and the
> >> > array can't be stopped.
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fb.com>
> >> > ---
> >> >  drivers/md/md.c | 1 +
> >> >  1 file changed, 1 insertion(+)
> >> >
> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > index 95824fb..c596b73 100644
> >> > --- a/drivers/md/md.c
> >> > +++ b/drivers/md/md.c
> >> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
> >> >  			md_reap_sync_thread(mddev);
> >> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> >> >  			goto unlock;
> >> >  		}
> >> >
> >> > -- 
> >> > 1.8.1
> >>
> >> Hi,
> >> I can see that clearing MD_CHANGE_PENDING there is probably correct -
> >> a bug introduced by
> >> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
> >>
> >> However I don't understand your reasoning.  You say that the array is
> >> marked as read-only, but I don't see how that would happen.  What
> >> causes the array to be marked "read-only"?
> >
> > It's set read-only by mdadm. I didn't look carefully, but it looks like
> > there is a disk failure event and mdadm is invoked automatically by some
> > background daemon. It's an Ubuntu distribution.
> 
> Thanks.
> This raises a couple of questions.
> 
> 1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
>    set?
>    Maybe it should wait for md_check_recovery to get run, which should
>    clear the bit, probably after writing out the metadata.
> 
> 2/ Why didn't md_check_recovery already do that before mdadm had a
>    chance to set the array read-only?
>    I guess that is just a timing thing.  md_check_recovery could be
>    delayed, and mdadm could get called by udev rather quickly.
> 
> I think I'll get md_set_readonly to
> 
> 	wait_event(mddev->sb_wait,
> 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> 
> because I think that is the right thing to do.  But if the array is
> already read-only that won't help, so I'll still need your patch.
> 
> Would you be able to test that the following patch (without your patch)
> also fixes the symptom?

Yes, the wait_event patch fixes the issue (without my patch).

Thanks,
Shaohua
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-24 16:47       ` Shaohua Li
@ 2015-09-30  6:59         ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread

From: Neil Brown @ 2015-09-30  6:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Thu, Sep 24, 2015 at 02:03:53PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
>> >> Shaohua Li <shli@fb.com> writes:
>> >>
>> >> > If an array has more faulty disks than its allowed degraded number,
>> >> > the array enters error handling. It is marked read-only with
>> >> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently
>> >> > doesn't clear the CHANGE_PENDING bit for a read-only array. When
>> >> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held
>> >> > on a list until the bit is cleared. Since recovery never clears this
>> >> > bit, the IO stays in the pending state forever and never finishes.
>> >> > This has bad effects: the upper layer can't get an IO error and the
>> >> > array can't be stopped.
>> >> >
>> >> > Signed-off-by: Shaohua Li <shli@fb.com>
>> >> > ---
>> >> >  drivers/md/md.c | 1 +
>> >> >  1 file changed, 1 insertion(+)
>> >> >
>> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> > index 95824fb..c596b73 100644
>> >> > --- a/drivers/md/md.c
>> >> > +++ b/drivers/md/md.c
>> >> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>> >> >  			md_reap_sync_thread(mddev);
>> >> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>> >> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> >> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> >> >  			goto unlock;
>> >> >  		}
>> >> >
>> >> > -- 
>> >> > 1.8.1
>> >>
>> >> Hi,
>> >> I can see that clearing MD_CHANGE_PENDING there is probably correct -
>> >> a bug introduced by
>> >> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
>> >>
>> >> However I don't understand your reasoning.  You say that the array is
>> >> marked as read-only, but I don't see how that would happen.  What
>> >> causes the array to be marked "read-only"?
>> >
>> > It's set read-only by mdadm. I didn't look carefully, but it looks like
>> > there is a disk failure event and mdadm is invoked automatically by
>> > some background daemon. It's an Ubuntu distribution.
>>
>> Thanks.
>> This raises a couple of questions.
>>
>> 1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
>>    set?
>>    Maybe it should wait for md_check_recovery to get run, which should
>>    clear the bit, probably after writing out the metadata.
>>
>> 2/ Why didn't md_check_recovery already do that before mdadm had a
>>    chance to set the array read-only?
>>    I guess that is just a timing thing.  md_check_recovery could be
>>    delayed, and mdadm could get called by udev rather quickly.
>>
>> I think I'll get md_set_readonly to
>>
>> 	wait_event(mddev->sb_wait,
>> 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>>
>> because I think that is the right thing to do.  But if the array is
>> already read-only that won't help, so I'll still need your patch.
>>
>> Would you be able to test that the following patch (without your patch)
>> also fixes the symptom?
>
> Yes, the wait_event patch fixes the issue (without my patch).
>
> Thanks,
> Shaohua

Thanks for testing.  I've changed "Reported-by:" to
"Reported-and-tested-by:" :-)

NeilBrown
end of thread, other threads:[~2015-09-30  6:59 UTC | newest]

Thread overview: 10+ messages
2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
2015-09-23  6:21   ` Neil Brown
2015-09-23  6:34     ` Shaohua Li
2015-09-24  5:26       ` Neil Brown
2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
2015-09-23  6:23   ` Shaohua Li
2015-09-24  4:03     ` Neil Brown
2015-09-24 16:47       ` Shaohua Li
2015-09-30  6:59         ` Neil Brown