* [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
@ 2015-09-18 17:20 Shaohua Li
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
  0 siblings, 2 replies; 10+ messages in thread

From: Shaohua Li @ 2015-09-18 17:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If an array has more faulty disks than its allowed degraded number, the
array enters error handling. It is marked read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
clear the CHANGE_PENDING bit for a read-only array. When
MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
a list until the bit is cleared. Since recovery never clears this bit,
the IO stays in the pending state forever and never finishes. This has
bad effects: the upper layer can't get an IO error and the array can't
be stopped.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 95824fb..c596b73 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
 			md_reap_sync_thread(mddev);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
 			goto unlock;
 		}
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
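[Editor's note: to make the failure mode above concrete, here is a toy C model — all names and structures are hypothetical simplifications, not the actual md code — of how completed IO parked behind a never-cleared MD_CHANGE_PENDING bit can only drain once recovery clears the bit:]

```c
#include <assert.h>

/* Toy model of the md flag handling described in the patch.
 * MD_CHANGE_PENDING, toy_mddev, toy_return_io and toy_check_recovery
 * are hypothetical stand-ins for the real kernel structures. */
#define MD_CHANGE_PENDING (1u << 0)

struct toy_mddev {
	unsigned flags;
	int held_ios;      /* bios parked on the return list */
	int completed_ios; /* bios actually returned to the upper layer */
};

/* raid5 end-of-IO path: while CHANGE_PENDING is set, completed bios are
 * parked instead of being returned, so the caller never sees an error. */
static void toy_return_io(struct toy_mddev *m)
{
	if (m->flags & MD_CHANGE_PENDING)
		m->held_ios++;
	else
		m->completed_ios++;
}

/* the fix: recovery clears CHANGE_PENDING even for a read-only array,
 * which lets the held bios drain */
static void toy_check_recovery(struct toy_mddev *m)
{
	m->flags &= ~MD_CHANGE_PENDING;
	m->completed_ios += m->held_ios;
	m->held_ios = 0;
}
```

Without the clear_bit() in md_check_recovery(), the toy model never moves an IO from held_ios to completed_ios, which mirrors the hang the patch fixes.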
* [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
@ 2015-09-18 17:20 ` Shaohua Li
  2015-09-23  6:21   ` Neil Brown
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
  1 sibling, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-18 17:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

handle_failed_stripe() fails the stripe, i.e. all IO will return with a
failure, but it doesn't update stripe_head_state. Later handle_stripe()
has special raid6 handling before handle_stripe_fill(). That check
doesn't skip the failed stripe, and we get a kernel crash in
need_this_block(). This patch clears the analysis state to make sure no
functions are wrongly called after handle_failed_stripe().

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 394cdf8..8e4fb89a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		spin_unlock_irq(&sh->stripe_lock);
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 			wake_up(&conf->wait_for_overlap);
+		if (bi)
+			s->to_read--;
 		while (bi && bi->bi_iter.bi_sector <
 		       sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *nextbi =
@@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		 */
 		clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	}
+	s->to_write = 0;
+	s->written = 0;
 
 	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
 		if (atomic_dec_and_test(&conf->pending_full_writes))
-- 
1.8.1
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
@ 2015-09-23  6:21   ` Neil Brown
  2015-09-23  6:34     ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-23  6:21 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> handle_failed_stripe() fails the stripe, i.e. all IO will return with a
> failure, but it doesn't update stripe_head_state. Later handle_stripe()
> has special raid6 handling before handle_stripe_fill(). That check
> doesn't skip the failed stripe, and we get a kernel crash in
> need_this_block(). This patch clears the analysis state to make sure no
> functions are wrongly called after handle_failed_stripe().
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 394cdf8..8e4fb89a 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  		spin_unlock_irq(&sh->stripe_lock);
>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>  			wake_up(&conf->wait_for_overlap);
> +		if (bi)
> +			s->to_read--;
>  		while (bi && bi->bi_iter.bi_sector <
>  		       sh->dev[i].sector + STRIPE_SECTORS) {
>  			struct bio *nextbi =
> @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  		 */
>  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
>  	}
> +	s->to_write = 0;
> +	s->written = 0;
>
>  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
>  		if (atomic_dec_and_test(&conf->pending_full_writes))
> -- 
> 1.8.1

Again, this probably is a sensible fix, but I would like to be certain.
Where exactly in need_this_block does the kernel crash?  I cannot see
anything that could cause an invalid address....

Thanks,
NeilBrown
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-23  6:21   ` Neil Brown
@ 2015-09-23  6:34     ` Shaohua Li
  2015-09-24  5:26       ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-23  6:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, Sep 23, 2015 at 04:21:58PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > handle_failed_stripe() fails the stripe, i.e. all IO will return with a
> > failure, but it doesn't update stripe_head_state. Later handle_stripe()
> > has special raid6 handling before handle_stripe_fill(). That check
> > doesn't skip the failed stripe, and we get a kernel crash in
> > need_this_block(). This patch clears the analysis state to make sure no
> > functions are wrongly called after handle_failed_stripe().
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/raid5.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 394cdf8..8e4fb89a 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >  		spin_unlock_irq(&sh->stripe_lock);
> >  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> >  			wake_up(&conf->wait_for_overlap);
> > +		if (bi)
> > +			s->to_read--;
> >  		while (bi && bi->bi_iter.bi_sector <
> >  		       sh->dev[i].sector + STRIPE_SECTORS) {
> >  			struct bio *nextbi =
> > @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
> >  		 */
> >  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
> >  	}
> > +	s->to_write = 0;
> > +	s->written = 0;
> >
> >  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
> >  		if (atomic_dec_and_test(&conf->pending_full_writes))
> > -- 
> > 1.8.1
> 
> Again, this probably is a sensible fix, but I would like to be certain.
> Where exactly in need_this_block does the kernel crash?  I cannot see
> anything that could cause an invalid address....

>>for (i = 0; i < s->failed; i++) {
>>	if (fdev[i]->towrite &&

It crashes on the fdev[i]->towrite access: s->failed >= 2 (it's 3 in my
case), while the array has only 2 entries.

Thanks,
Shaohua
* Re: [PATCH 2/2] raid5: update analysis state for failed stripe
  2015-09-23  6:34     ` Shaohua Li
@ 2015-09-24  5:26       ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread

From: Neil Brown @ 2015-09-24  5:26 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Wed, Sep 23, 2015 at 04:21:58PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > handle_failed_stripe() fails the stripe, i.e. all IO will return with a
>> > failure, but it doesn't update stripe_head_state. Later handle_stripe()
>> > has special raid6 handling before handle_stripe_fill(). That check
>> > doesn't skip the failed stripe, and we get a kernel crash in
>> > need_this_block(). This patch clears the analysis state to make sure no
>> > functions are wrongly called after handle_failed_stripe().
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/raid5.c | 4 ++++
>> >  1 file changed, 4 insertions(+)
>> >
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index 394cdf8..8e4fb89a 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -3155,6 +3155,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>> >  		spin_unlock_irq(&sh->stripe_lock);
>> >  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>> >  			wake_up(&conf->wait_for_overlap);
>> > +		if (bi)
>> > +			s->to_read--;
>> >  		while (bi && bi->bi_iter.bi_sector <
>> >  		       sh->dev[i].sector + STRIPE_SECTORS) {
>> >  			struct bio *nextbi =
>> > @@ -3173,6 +3175,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>> >  		 */
>> >  		clear_bit(R5_LOCKED, &sh->dev[i].flags);
>> >  	}
>> > +	s->to_write = 0;
>> > +	s->written = 0;
>> >
>> >  	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
>> >  		if (atomic_dec_and_test(&conf->pending_full_writes))
>> > -- 
>> > 1.8.1
>>
>> Again, this probably is a sensible fix, but I would like to be certain.
>> Where exactly in need_this_block does the kernel crash?  I cannot see
>> anything that could cause an invalid address....
>
>>>for (i = 0; i < s->failed; i++) {
>>>	if (fdev[i]->towrite &&
> It crashes on the fdev[i]->towrite access: s->failed >= 2 (it's 3 in my
> case), while the array has only 2 entries.
>
> Thanks,
> Shaohua

Ahh, of course.  In that case I think I'd like to limit the for loop as
well.  So I've applied your patch and this one as well.

Thanks,
NeilBrown

From 76e308d70b204ff0af0028458caabfeacac4541a Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Thu, 24 Sep 2015 15:25:36 +1000
Subject: [PATCH] md/raid5: don't index beyond end of array in
 need_this_block().

While need_this_block probably shouldn't be called when there are more
than 2 failed devices, we really don't want it to try indexing beyond
the end of the failed_num[] or fdev[] arrays.  So limit the loops to
at most 2 iterations.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 903d8a2b7b07..0f49ce411c9a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3304,7 +3304,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 		 */
 		return 0;
 
-	for (i = 0; i < s->failed; i++) {
+	for (i = 0; i < s->failed && i < 2; i++) {
 		if (fdev[i]->towrite &&
 		    !test_bit(R5_UPTODATE, &fdev[i]->flags) &&
 		    !test_bit(R5_OVERWRITE, &fdev[i]->flags))
@@ -3328,7 +3328,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 	    sh->sector < sh->raid_conf->mddev->recovery_cp)
 		/* reconstruct-write isn't being forced */
 		return 0;
-	for (i = 0; i < s->failed; i++) {
+	for (i = 0; i < s->failed && i < 2; i++) {
 		if (s->failed_num[i] != sh->pd_idx &&
 		    s->failed_num[i] != sh->qd_idx &&
 		    !test_bit(R5_UPTODATE, &fdev[i]->flags) &&
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
  2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
@ 2015-09-23  6:05 ` Neil Brown
  2015-09-23  6:23   ` Shaohua Li
  1 sibling, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-23  6:05 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> If an array has more faulty disks than its allowed degraded number, the
> array enters error handling. It is marked read-only with
> MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
> clear the CHANGE_PENDING bit for a read-only array. When
> MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
> a list until the bit is cleared. Since recovery never clears this bit,
> the IO stays in the pending state forever and never finishes. This has
> bad effects: the upper layer can't get an IO error and the array can't
> be stopped.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 95824fb..c596b73 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>  			md_reap_sync_thread(mddev);
>  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>  			goto unlock;
>  		}
>
> -- 
> 1.8.1

Hi,
I can see that clearing MD_CHANGE_PENDING there is probably correct -
a bug introduced by
Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")

However I don't understand your reasoning.  You say that the array is
marked as read-only, but I don't see how that would happen.  What
causes the array to be marked "read-only"?

Thanks,
NeilBrown
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
@ 2015-09-23  6:23   ` Shaohua Li
  2015-09-24  4:03     ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-23  6:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > If an array has more faulty disks than its allowed degraded number, the
> > array enters error handling. It is marked read-only with
> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
> > clear the CHANGE_PENDING bit for a read-only array. When
> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
> > a list until the bit is cleared. Since recovery never clears this bit,
> > the IO stays in the pending state forever and never finishes. This has
> > bad effects: the upper layer can't get an IO error and the array can't
> > be stopped.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/md.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 95824fb..c596b73 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
> >  			md_reap_sync_thread(mddev);
> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> >  			goto unlock;
> >  		}
> >
> > -- 
> > 1.8.1
> 
> Hi,
> I can see that clearing MD_CHANGE_PENDING there is probably correct -
> a bug introduced by
> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
> 
> However I don't understand your reasoning.  You say that the array is
> marked as read-only, but I don't see how that would happen.  What
> causes the array to be marked "read-only"?

It's set read-only by mdadm. I didn't look carefully, but it looks like
there is a disk failure event and mdadm is invoked automatically by some
background daemon. It's an Ubuntu distribution.

Thanks,
Shaohua
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-23  6:23   ` Shaohua Li
@ 2015-09-24  4:03     ` Neil Brown
  2015-09-24 16:47       ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread

From: Neil Brown @ 2015-09-24  4:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > If an array has more faulty disks than its allowed degraded number, the
>> > array enters error handling. It is marked read-only with
>> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently doesn't
>> > clear the CHANGE_PENDING bit for a read-only array. When
>> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held on
>> > a list until the bit is cleared. Since recovery never clears this bit,
>> > the IO stays in the pending state forever and never finishes. This has
>> > bad effects: the upper layer can't get an IO error and the array can't
>> > be stopped.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/md.c | 1 +
>> >  1 file changed, 1 insertion(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 95824fb..c596b73 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>> >  			md_reap_sync_thread(mddev);
>> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> >  			goto unlock;
>> >  		}
>> >
>> > -- 
>> > 1.8.1
>>
>> Hi,
>> I can see that clearing MD_CHANGE_PENDING there is probably correct -
>> a bug introduced by
>> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
>>
>> However I don't understand your reasoning.  You say that the array is
>> marked as read-only, but I don't see how that would happen.  What
>> causes the array to be marked "read-only"?
>
> It's set read-only by mdadm. I didn't look carefully, but it looks like
> there is a disk failure event and mdadm is invoked automatically by some
> background daemon. It's an Ubuntu distribution.

Thanks.
This raises a couple of questions.

1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
   set?
   Maybe it should wait for md_check_recovery to get run, which should
   clear the bit, probably after writing out the metadata.

2/ Why didn't md_check_recovery already do that before mdadm had a
   chance to set the array read-only?
   I guess that is just a timing thing.  md_check_recovery could be
   delayed, and mdadm could get called by udev rather quickly.

I think I'll get md_set_readonly to

	wait_event(mddev->sb_wait,
		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));

because I think that is the right thing to do.  But if the array is
already read-only that won't help, so I'll still need your patch.

Would you be able to test that the following patch (without your patch)
also fixes the symptom?

Thanks.
NeilBrown

From a0a21b57523c41e87e0a50333b15f6c7ca02d714 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Thu, 24 Sep 2015 14:00:51 +1000
Subject: [PATCH] md: wait for pending superblock updates before switching to
 read-only

If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 23651bf8dd80..9882f70f45e1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5432,6 +5432,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		mddev_unlock(mddev);
 		wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING,
 						  &mddev->recovery));
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		mddev_lock_nointr(mddev);
 
 	mutex_lock(&mddev->open_mutex);
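[Editor's note: a toy model — all names hypothetical, not the actual md code — of the ordering this patch enforces. If the array flips read-only while a metadata update is still pending, the failure record is never written; flushing the pending update before switching preserves it:]

```c
#include <assert.h>

/* Hypothetical simplification: toy_array, toy_write_super and
 * toy_set_readonly stand in for mddev, the superblock writer and
 * md_set_readonly respectively. */
struct toy_array {
	int change_pending;   /* metadata update queued but not yet on disk */
	int metadata_written; /* device-failure record reached disk */
	int read_only;
};

/* superblock writer: a read-only array never writes metadata */
static void toy_write_super(struct toy_array *a)
{
	if (a->change_pending && !a->read_only) {
		a->metadata_written = 1;
		a->change_pending = 0;
	}
}

/* the fix: drain the pending update first, then flip to read-only,
 * modelling the added wait_event() on sb_wait */
static void toy_set_readonly(struct toy_array *a)
{
	toy_write_super(a);
	a->read_only = 1;
}
```

If toy_set_readonly() set read_only before draining, toy_write_super() would never run again for this array and the failure record would be lost, which is the situation the wait_event() guards against.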
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-24  4:03     ` Neil Brown
@ 2015-09-24 16:47       ` Shaohua Li
  2015-09-30  6:59         ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread

From: Shaohua Li @ 2015-09-24 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Sep 24, 2015 at 02:03:53PM +1000, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
> >> Shaohua Li <shli@fb.com> writes:
> >>
> >> > If an array has more faulty disks than its allowed degraded number,
> >> > the array enters error handling. It is marked read-only with
> >> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently
> >> > doesn't clear the CHANGE_PENDING bit for a read-only array. When
> >> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held
> >> > on a list until the bit is cleared. Since recovery never clears this
> >> > bit, the IO stays in the pending state forever and never finishes.
> >> > This has bad effects: the upper layer can't get an IO error and the
> >> > array can't be stopped.
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fb.com>
> >> > ---
> >> >  drivers/md/md.c | 1 +
> >> >  1 file changed, 1 insertion(+)
> >> >
> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > index 95824fb..c596b73 100644
> >> > --- a/drivers/md/md.c
> >> > +++ b/drivers/md/md.c
> >> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
> >> >  			md_reap_sync_thread(mddev);
> >> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> >> >  			goto unlock;
> >> >  		}
> >> >
> >> > -- 
> >> > 1.8.1
> >>
> >> Hi,
> >> I can see that clearing MD_CHANGE_PENDING there is probably correct -
> >> a bug introduced by
> >> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
> >>
> >> However I don't understand your reasoning.  You say that the array is
> >> marked as read-only, but I don't see how that would happen.  What
> >> causes the array to be marked "read-only"?
> >
> > It's set read-only by mdadm. I didn't look carefully, but it looks like
> > there is a disk failure event and mdadm is invoked automatically by some
> > background daemon. It's an Ubuntu distribution.
> 
> Thanks.
> This raises a couple of questions.
> 
> 1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
>    set?
>    Maybe it should wait for md_check_recovery to get run, which should
>    clear the bit, probably after writing out the metadata.
> 
> 2/ Why didn't md_check_recovery already do that before mdadm had a
>    chance to set the array read-only?
>    I guess that is just a timing thing.  md_check_recovery could be
>    delayed, and mdadm could get called by udev rather quickly.
> 
> I think I'll get md_set_readonly to
> 
> 	wait_event(mddev->sb_wait,
> 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> 
> because I think that is the right thing to do.  But if the array is
> already read-only that won't help, so I'll still need your patch.
> 
> Would you be able to test that the following patch (without your patch)
> also fixes the symptom?

Yes, the wait_event patch fixes the issue (without my patch).

Thanks,
Shaohua
* Re: [PATCH 1/2] md: clear CHANGE_PENDING in readonly array
  2015-09-24 16:47       ` Shaohua Li
@ 2015-09-30  6:59         ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread

From: Neil Brown @ 2015-09-30  6:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

Shaohua Li <shli@fb.com> writes:

> On Thu, Sep 24, 2015 at 02:03:53PM +1000, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>> > On Wed, Sep 23, 2015 at 04:05:33PM +1000, Neil Brown wrote:
>> >> Shaohua Li <shli@fb.com> writes:
>> >>
>> >> > If an array has more faulty disks than its allowed degraded number,
>> >> > the array enters error handling. It is marked read-only with
>> >> > MD_CHANGE_PENDING/RECOVERY_NEEDED set. But recovery currently
>> >> > doesn't clear the CHANGE_PENDING bit for a read-only array. When
>> >> > MD_CHANGE_PENDING is set for a raid5 array, all completed IO is held
>> >> > on a list until the bit is cleared. Since recovery never clears this
>> >> > bit, the IO stays in the pending state forever and never finishes.
>> >> > This has bad effects: the upper layer can't get an IO error and the
>> >> > array can't be stopped.
>> >> >
>> >> > Signed-off-by: Shaohua Li <shli@fb.com>
>> >> > ---
>> >> >  drivers/md/md.c | 1 +
>> >> >  1 file changed, 1 insertion(+)
>> >> >
>> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> > index 95824fb..c596b73 100644
>> >> > --- a/drivers/md/md.c
>> >> > +++ b/drivers/md/md.c
>> >> > @@ -8209,6 +8209,7 @@ void md_check_recovery(struct mddev *mddev)
>> >> >  			md_reap_sync_thread(mddev);
>> >> >  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>> >> >  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> >> > +			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> >> >  			goto unlock;
>> >> >  		}
>> >> >
>> >> > -- 
>> >> > 1.8.1
>> >>
>> >> Hi,
>> >> I can see that clearing MD_CHANGE_PENDING there is probably correct -
>> >> a bug introduced by
>> >> Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
>> >>
>> >> However I don't understand your reasoning.  You say that the array is
>> >> marked as read-only, but I don't see how that would happen.  What
>> >> causes the array to be marked "read-only"?
>> >
>> > It's set read-only by mdadm. I didn't look carefully, but it looks like
>> > there is a disk failure event and mdadm is invoked automatically by
>> > some background daemon. It's an Ubuntu distribution.
>>
>> Thanks.
>> This raises a couple of questions.
>>
>> 1/ What should md_set_readonly do if it finds that MD_CHANGE_PENDING is
>>    set?
>>    Maybe it should wait for md_check_recovery to get run, which should
>>    clear the bit, probably after writing out the metadata.
>>
>> 2/ Why didn't md_check_recovery already do that before mdadm had a
>>    chance to set the array read-only?
>>    I guess that is just a timing thing.  md_check_recovery could be
>>    delayed, and mdadm could get called by udev rather quickly.
>>
>> I think I'll get md_set_readonly to
>>
>> 	wait_event(mddev->sb_wait,
>> 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>>
>> because I think that is the right thing to do.  But if the array is
>> already read-only that won't help, so I'll still need your patch.
>>
>> Would you be able to test that the following patch (without your patch)
>> also fixes the symptom?
>
> Yes, the wait_event patch fixes the issue (without my patch).
>
> Thanks,
> Shaohua

Thanks for testing.  I've changed "Reported-by:" to
"Reported-and-tested-by:" :-)

NeilBrown
end of thread, other threads:[~2015-09-30  6:59 UTC | newest]

Thread overview: 10+ messages
2015-09-18 17:20 [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Shaohua Li
2015-09-18 17:20 ` [PATCH 2/2] raid5: update analysis state for failed stripe Shaohua Li
2015-09-23  6:21   ` Neil Brown
2015-09-23  6:34     ` Shaohua Li
2015-09-24  5:26       ` Neil Brown
2015-09-23  6:05 ` [PATCH 1/2] md: clear CHANGE_PENDING in readonly array Neil Brown
2015-09-23  6:23   ` Shaohua Li
2015-09-24  4:03     ` Neil Brown
2015-09-24 16:47       ` Shaohua Li
2015-09-30  6:59         ` Neil Brown