* raid5: R5_WriteError is not cleared after rdev is marked as Faulty
From: Alexander Lyakas @ 2014-01-02 15:58 UTC
To: NeilBrown, linux-raid
Hi Neil,
I see the following issue: if more than one stripe-head experiences an
error in raid5_end_write_request, then handling the first stripe-head
eventually fails the disk; but by the time the second stripe-head is
handled, the disk is already marked as Faulty, so R5_WriteError is not
cleared from the second stripe-head, even though it remains in the
stripe cache.
Below are more details on the flow. I added some prints to display the
flag values, etc.
# a raid6 array is rebuilding a disk
# a disk fails; as a result, raid5_end_write_request is called on two
stripe-heads:
[23369]md48066*[raid5_end_write_request:2002] sh[4560].flags=0x101A
dev[0].flags=0x20003 rdev[dm-6].flags=0x40
bio=ffff880028744fd8[4592(4560):8] WRITE ERROR
[13527.467363] [3842]md48066*[raid5_end_write_request:2002]
sh[4568].flags=0x101B dev[0].flags=0x20003 rdev[dm-6].flags=0x240
bio=ffff880028742bc0[4600(4568):8] WRITE ERROR
As a result, the appropriate r5devs are marked with R5_WriteError on
these two stripe-heads.
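For reference, the error path in raid5_end_write_request that marks the
r5dev looks roughly like this in the 3.8-era code (a trimmed sketch,
not a verbatim quote):

    if (!uptodate) {
        set_bit(WriteErrorSeen, &rdev->flags);
        set_bit(R5_WriteError, &sh->dev[i].flags);
    }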
# raid5 calls analyse_stripe on behalf of stripe-head 4560 (through
handle_active_stripes->handle_stripe->analyse_stripe)
# analyse_stripe performs this code:
    if (rdev && test_bit(R5_WriteError, &dev->flags)) {
        /* This flag does not apply to '.replacement'
         * only to .rdev, so make sure to check that */
        struct md_rdev *rdev2 = rcu_dereference(
            conf->disks[i].rdev);
        if (rdev2 == rdev)
            clear_bit(R5_Insync, &dev->flags);
        if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
            s->handle_bad_blocks = 1;
            atomic_inc(&rdev2->nr_pending);
        } else
            clear_bit(R5_WriteError, &dev->flags);
    }
For stripe-head 4560 the rdev is not yet Faulty, so this sets
handle_bad_blocks on the stripe-head (R5_WriteError itself is cleared a
little later, in handle_stripe).
# handle_stripe performs:
    if (s.handle_bad_blocks)
        for (i = disks; i--; ) {
            struct md_rdev *rdev;
            struct r5dev *dev = &sh->dev[i];
            if (test_and_clear_bit(R5_WriteError, &dev->flags)) {
                /* We own a safe reference to the rdev */
                rdev = conf->disks[i].rdev;
                if (!rdev_set_badblocks(rdev, sh->sector,
                                        STRIPE_SECTORS, 0))
                    md_error(conf->mddev, rdev);
Since bad blocks are disabled in this setup, rdev_set_badblocks fails,
so md_error marks the rdev as Faulty and fails it.
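The reason the failure path is taken here: with bad blocks disabled,
md_set_badblocks refuses to record the range, so rdev_set_badblocks
returns 0 and handle_stripe falls through to md_error. A sketch of that
check, assuming the 3.8-era md.c layout:

    /* At the top of md_set_badblocks(), roughly: */
    if (bb->shift < 0)
        /* bad blocks are disabled */
        return 0;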
# raid5d calls analyse_stripe on stripe-head 4568. However, the rdev is
already marked as Faulty, so it does:
    if (rdev && test_bit(Faulty, &rdev->flags))
        rdev = NULL;
and later this condition does not hold:
    if (rdev && test_bit(R5_WriteError, &dev->flags)) {
so R5_WriteError is not cleared, but this stripe-head is still valid
in the stripe cache.
Dumping the state of this stripe-head in the stripe cache (through a
custom sysfs entry):
sh[4568].state=0x1010 disks=5 pdidx=2 qdidx=3
dev[0].flags=0x20001 rdev[???].flags=0x0 (sector=13688)
dev[1].flags=0x11 rdev[dm-7].flags=0x2 (sector=13720)
dev[2].flags=0x11 rdev[dm-8].flags=0x2 (sector=0) *P*
dev[3].flags=0x11 rdev[dm-9].flags=0x2 (sector=0) *Q*
dev[4].flags=0x11 rdev[dm-10].flags=0x2 (sector=13656)
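To make the mechanism concrete, here is a minimal user-space model of
the race (an illustration only — flags collapsed into plain bitmasks,
and the disk-failing step folded into the same function; not kernel
code):

    #include <stdio.h>

    #define R5_WriteError  (1u << 0)
    #define Faulty         (1u << 1)

    struct rdev { unsigned flags; };
    struct sh   { unsigned dev_flags; };

    /* Mirrors the guard in analyse_stripe(): once the rdev is Faulty,
     * the local pointer is NULLed and the clearing block is skipped. */
    static void analyse_stripe(struct sh *sh, struct rdev *rdev)
    {
        struct rdev *r = rdev;

        if (r && (r->flags & Faulty))
            r = NULL;
        if (r && (sh->dev_flags & R5_WriteError)) {
            sh->dev_flags &= ~R5_WriteError;  /* only cleared here */
            r->flags |= Faulty;               /* first sh fails the disk */
        }
    }

    int main(void)
    {
        struct rdev d = { 0 };
        struct sh sh1 = { R5_WriteError };
        struct sh sh2 = { R5_WriteError };

        analyse_stripe(&sh1, &d);  /* fails the disk, clears sh1's flag */
        analyse_stripe(&sh2, &d);  /* rdev already Faulty: flag stays set */

        printf("sh2 stale R5_WriteError: %s\n",
               (sh2.dev_flags & R5_WriteError) ? "yes" : "no");
        return 0;
    }

Compiled with gcc, this prints "yes": the second stripe-head keeps the
stale flag.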
# Now the user recovers the disk and re-adds it to the array. Rebuild
starts, and eventually md_do_sync calls sync_request, which reaches
this stripe-head through handle_stripe->analyse_stripe. Since the
re-added rdev is no longer Faulty, analyse_stripe sees the stale
R5_WriteError flag and fails the disk again, which aborts the rebuild.
Here is the stack trace of the path that detects the stale
R5_WriteError and fails the disk:
[<ffffffffa02d4543>] error+0x153/0x1d0 [raid456]
[<ffffffffa0445449>] md_error.part.39+0x19/0xa0 [md_mod]
[<ffffffffa0445503>] md_error+0x33/0x60 [md_mod]
[<ffffffffa02da07c>] handle_stripe+0x57c/0x2250 [raid456]
[<ffffffffa02dcacb>] sync_request+0x18b/0x390 [raid456]
[<ffffffffa044320c>] md_do_sync+0x76c/0xda0 [md_mod]
[<ffffffffa044084d>] md_thread+0x10d/0x140 [md_mod]
Does this analysis make sense? I don't know how to fix the issue,
though. I am looking at kernel 3.8.13 (as usual), but from your
for-linus branch it seems the same would happen there.
Thanks,
Alex.
* Re: raid5: R5_WriteError is not cleared after rdev is marked as Faulty
From: NeilBrown @ 2014-01-06  2:15 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Thu, 2 Jan 2014 17:58:33 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> I see the following issue: if more than one stripe-head experiences an
> error in raid5_end_write_request, then handling the first stripe-head
> eventually fails the disk; but by the time the second stripe-head is
> handled, the disk is already marked as Faulty, so R5_WriteError is not
> cleared from the second stripe-head, even though it remains in the
> stripe cache.
> [...]
> Does this analysis make sense? I don't know how to fix the issue,
> though. I am looking at kernel 3.8.13 (as usual), but from your
> for-linus branch it seems the same would happen there.

Looks like:

commit 5d8c71f9e5fbdd95650be00294d238e27a363b5c
Author: Adam Kwolek <adam.kwolek@intel.com>
Date:   Fri Dec 9 14:26:11 2011 +1100

    md: raid5 crash during degradation

introduced this problem (in linux 3.3) with an over-simple fix for a
problem. I think we cannot simply revert that patch, as

commit 14a75d3e07c784c004b4b44b34af996b8e4ac453
Author: NeilBrown <neilb@suse.de>
Date:   Fri Dec 23 10:17:52 2011 +1100

    md/raid5: preferentially read from replacement device if possible.

has effected a better fix.
So the patch below should work. Does that make sense to you?

Thanks,
NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cc055da02e2a..9168173deaf3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3608,7 +3608,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
              */
             set_bit(R5_Insync, &dev->flags);
 
-        if (rdev && test_bit(R5_WriteError, &dev->flags)) {
+        if (test_bit(R5_WriteError, &dev->flags)) {
             /* This flag does not apply to '.replacement'
              * only to .rdev, so make sure to check that*/
             struct md_rdev *rdev2 = rcu_dereference(
@@ -3621,7 +3621,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
             } else
                 clear_bit(R5_WriteError, &dev->flags);
         }
-        if (rdev && test_bit(R5_MadeGood, &dev->flags)) {
+        if (test_bit(R5_MadeGood, &dev->flags)) {
             /* This flag does not apply to '.replacement'
              * only to .rdev, so make sure to check that*/
             struct md_rdev *rdev2 = rcu_dereference(
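For clarity, here is how the first hunk reads with the patch applied (a
reconstruction from the diff context above, not a verbatim quote).
Because rdev2 is re-fetched under RCU and tested for Faulty directly,
the else branch now clears the stale flag even when the local rdev has
been NULLed for a Faulty device:

    if (test_bit(R5_WriteError, &dev->flags)) {
        /* This flag does not apply to '.replacement'
         * only to .rdev, so make sure to check that */
        struct md_rdev *rdev2 = rcu_dereference(
            conf->disks[i].rdev);
        if (rdev2 == rdev)
            clear_bit(R5_Insync, &dev->flags);
        if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
            s->handle_bad_blocks = 1;
            atomic_inc(&rdev2->nr_pending);
        } else
            clear_bit(R5_WriteError, &dev->flags);
    }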
* Re: raid5: R5_WriteError is not cleared after rdev is marked as Faulty
From: Alexander Lyakas @ 2014-01-06 15:12 UTC
To: NeilBrown; +Cc: linux-raid

Thanks, Neil!
The test that was failing now passes.

Thanks for the patch,
Alex.

On Mon, Jan 6, 2014 at 4:15 AM, NeilBrown <neilb@suse.de> wrote:
> [...]
> So the patch below should work. Does that make sense to you?
* Re: raid5: R5_WriteError is not cleared after rdev is marked as Faulty
From: NeilBrown @ 2014-01-06 22:46 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Mon, 6 Jan 2014 17:12:10 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:

> Thanks, Neil!
> The test that was failing now passes.
>
> Thanks for the patch,
> Alex.

Thanks for testing. I'll send the fix upstream shortly.

NeilBrown