* [PATCH 000 of 2] md: Two more bugfixes. @ 2007-05-10 6:22 NeilBrown 2007-05-10 6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown 2007-05-10 6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown 0 siblings, 2 replies; 8+ messages in thread From: NeilBrown @ 2007-05-10 6:22 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown, stable Following are two bugfixes for md in current kernels. The first is suitable for -stable is it can allow drive errors through to the filesystem wrongly. Both are suitable for 2.6.22. Thanks, NeilBrown [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem. [PATCH 002 of 2] md: Improve the is_mddev_idle test ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem. 2007-05-10 6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown @ 2007-05-10 6:22 ` NeilBrown 2007-05-10 6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown 1 sibling, 0 replies; 8+ messages in thread From: NeilBrown @ 2007-05-10 6:22 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown, stable When a raid1 has only one working drive, we want read error to propagate up to the filesystem as there is no point failing the last drive in an array. Currently the code perform this check is racy. If a write and a read a both submitted to a device on a 2-drive raid1, and the write fails followed by the read failing, the read will see that there is only one working drive and will pass the failure up, even though the one working drive is actually the *other* one. So, tighten up the locking. Cc: stable@kernel.org Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/raid1.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c --- .prev/drivers/md/raid1.c 2007-05-10 15:51:54.000000000 +1000 +++ ./drivers/md/raid1.c 2007-05-10 15:51:58.000000000 +1000 @@ -271,21 +271,25 @@ static int raid1_end_read_request(struct */ update_head_pos(mirror, r1_bio); - if (uptodate || (conf->raid_disks - conf->mddev->degraded) <= 1) { - /* - * Set R1BIO_Uptodate in our master bio, so that - * we will return a good error code for to the higher - * levels even if IO on some other mirrored buffer fails. - * - * The 'master' represents the composite IO operation to - * user-side. So if something waits for IO, then it will - * wait for the 'master' bio. + if (uptodate) + set_bit(R1BIO_Uptodate, &r1_bio->state); + else { + /* If all other devices have failed, we want to return + * the error upwards rather than fail the last device. + * Here we redefine "uptodate" to mean "Don't want to retry" */ - if (uptodate) - set_bit(R1BIO_Uptodate, &r1_bio->state); + unsigned long flags; + spin_lock_irqsave(&conf->device_lock, flags); + if (r1_bio->mddev->degraded == conf->raid_disks || + (r1_bio->mddev->degraded == conf->raid_disks-1 && + !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags))) + uptodate = 1; + spin_unlock_irqrestore(&conf->device_lock, flags); + } + if (uptodate) raid_end_bio_io(r1_bio); - } else { + else { /* * oops, read error: */ @@ -992,13 +996,14 @@ static void error(mddev_t *mddev, mdk_rd unsigned long flags; spin_lock_irqsave(&conf->device_lock, flags); mddev->degraded++; + set_bit(Faulty, &rdev->flags); spin_unlock_irqrestore(&conf->device_lock, flags); /* * if recovery is running, make sure it aborts. */ set_bit(MD_RECOVERY_ERR, &mddev->recovery); - } - set_bit(Faulty, &rdev->flags); + } else + set_bit(Faulty, &rdev->flags); set_bit(MD_CHANGE_DEVS, &mddev->flags); printk(KERN_ALERT "raid1: Disk failure on %s, disabling device. \n" " Operation continuing on %d devices\n", ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown 2007-05-10 6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown @ 2007-05-10 6:22 ` NeilBrown 2007-05-10 7:33 ` Andrew Morton 2007-05-10 9:48 ` Jan Engelhardt 1 sibling, 2 replies; 8+ messages in thread From: NeilBrown @ 2007-05-10 6:22 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown During a 'resync' or similar activity, md checks if the devices in the array are otherwise active and winds back resync activity when they are. This test in done in is_mddev_idle, and it is somewhat fragile - it sometimes thinks there is non-sync io when there isn't. The test compares the total sectors of io (disk_stat_read) with the sectors of resync io (disk->sync_io). This has problems because total sectors gets updated when a request completes, while resync io gets updated when the request is submitted. The time difference can cause large differenced between the two which do not actually imply non-resync activity. The test currently allows for some fuzz (+/- 4096) but there are some cases when it is not enough. The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive differences. If we do this then the amount of in-flight sync io will never cause the appearance of non-sync IO. Once enough non-sync IO to worry about starts happening, resync will be slowed down and the measurements will thus be more precise (as there is less in-flight) and control of resync will still be suitably responsive. Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff .prev/drivers/md/md.c ./drivers/md/md.c --- .prev/drivers/md/md.c 2007-05-10 15:51:54.000000000 +1000 +++ ./drivers/md/md.c 2007-05-10 16:05:10.000000000 +1000 @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev) * * Note: the following is an unsigned comparison. */ - if ((curr_events - rdev->last_events + 4096) > 8192) { + if ((long)curr_events - (long)rdev->last_events > 4096) { rdev->last_events = curr_events; idle = 0; } ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown @ 2007-05-10 7:33 ` Andrew Morton 2007-05-10 9:46 ` Neil Brown 2007-05-10 9:48 ` Jan Engelhardt 1 sibling, 1 reply; 8+ messages in thread From: Andrew Morton @ 2007-05-10 7:33 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, linux-kernel On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <neilb@suse.de> wrote: > The test currently looks for any (non-fuzz) difference, either > positive or negative. This clearly is not needed. Any non-sync > activity will cause the total sectors to grow faster than the sync_io > count (never slower) so we only need to look for a positive differences. > > ... > > --- .prev/drivers/md/md.c 2007-05-10 15:51:54.000000000 +1000 > +++ ./drivers/md/md.c 2007-05-10 16:05:10.000000000 +1000 > @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev) > * > * Note: the following is an unsigned comparison. > */ > - if ((curr_events - rdev->last_events + 4096) > 8192) { > + if ((long)curr_events - (long)rdev->last_events > 4096) { > rdev->last_events = curr_events; > idle = 0; In which case would unsigned counters be more appropriate? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 7:33 ` Andrew Morton @ 2007-05-10 9:46 ` Neil Brown 0 siblings, 0 replies; 8+ messages in thread From: Neil Brown @ 2007-05-10 9:46 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-raid, linux-kernel On Thursday May 10, akpm@linux-foundation.org wrote: > On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <neilb@suse.de> wrote: > > > The test currently looks for any (non-fuzz) difference, either > > positive or negative. This clearly is not needed. Any non-sync > > activity will cause the total sectors to grow faster than the sync_io > > count (never slower) so we only need to look for a positive differences. > > > > ... > > > > --- .prev/drivers/md/md.c 2007-05-10 15:51:54.000000000 +1000 > > +++ ./drivers/md/md.c 2007-05-10 16:05:10.000000000 +1000 > > @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev) > > * > > * Note: the following is an unsigned comparison. > > */ > > - if ((curr_events - rdev->last_events + 4096) > 8192) { > > + if ((long)curr_events - (long)rdev->last_events > 4096) { > > rdev->last_events = curr_events; > > idle = 0; > > In which case would unsigned counters be more appropriate? I guess..... It is really the comparison that I want to be signed, I don't much care about the counted - they are expected to wrap (though they might not). So maybe I really want if ((signed long)(curr_events - rdev->last_events) > 4096) { to make it clear... But people expect number to be signed by default, so that probably isn't necessary. Yeah, I'll make them signed one day. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown 2007-05-10 7:33 ` Andrew Morton @ 2007-05-10 9:48 ` Jan Engelhardt 2007-05-10 10:04 ` Neil Brown 1 sibling, 1 reply; 8+ messages in thread From: Jan Engelhardt @ 2007-05-10 9:48 UTC (permalink / raw) To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel On May 10 2007 16:22, NeilBrown wrote: > >diff .prev/drivers/md/md.c ./drivers/md/md.c >--- .prev/drivers/md/md.c 2007-05-10 15:51:54.000000000 +1000 >+++ ./drivers/md/md.c 2007-05-10 16:05:10.000000000 +1000 >@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev) > * > * Note: the following is an unsigned comparison. > */ >- if ((curr_events - rdev->last_events + 4096) > 8192) { >+ if ((long)curr_events - (long)rdev->last_events > 4096) { > rdev->last_events = curr_events; > idle = 0; > } What did really change? Unless I am seriously mistaken, curr_events - last_evens + 4096 > 8192 is mathematically equivalent to curr_events - last_evens > 4096 The casting to (long) may however force a signed comparison which turns things quite upside down, and the comment does not apply anymore. Jan -- ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 9:48 ` Jan Engelhardt @ 2007-05-10 10:04 ` Neil Brown 2007-05-10 12:52 ` Jan Engelhardt 0 siblings, 1 reply; 8+ messages in thread From: Neil Brown @ 2007-05-10 10:04 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Andrew Morton, linux-raid, linux-kernel On Thursday May 10, jengelh@linux01.gwdg.de wrote: > > On May 10 2007 16:22, NeilBrown wrote: > > > >diff .prev/drivers/md/md.c ./drivers/md/md.c > >--- .prev/drivers/md/md.c 2007-05-10 15:51:54.000000000 +1000 > >+++ ./drivers/md/md.c 2007-05-10 16:05:10.000000000 +1000 > >@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev) > > * > > * Note: the following is an unsigned comparison. > > */ > >- if ((curr_events - rdev->last_events + 4096) > 8192) { > >+ if ((long)curr_events - (long)rdev->last_events > 4096) { > > rdev->last_events = curr_events; > > idle = 0; > > } > > What did really change? Unless I am seriously mistaken, > > curr_events - last_evens + 4096 > 8192 > > is mathematically equivalent to > > curr_events - last_evens > 4096 > > The casting to (long) may however force a signed comparison which turns > things quite upside down, and the comment does not apply anymore. Yes, the use of a signed comparison is the significant difference. And yes, the comment becomes wrong. I'm in the process of redrafting that. It currently stands at: /* sync IO will cause sync_io to increase before the disk_stats * as sync_io is counted when a request starts, and * disk_stats is counted when it completes. * So resync activity will cause curr_events to be smaller than * when there was no such activity. * non-sync IO will cause disk_stat to increase without * increasing sync_io so curr_events will (eventually) * be larger than it was before. Once it becomes * substantially larger, the test below will cause * the array to appear non-idle, and resync will slow * down. * If there is a lot of outstanding resync activity when * we set last_event to curr_events, then all that activity * completing might cause the array to appear non-idle * and resync will be slowed down even though there might * not have been non-resync activity. This will only * happen once though. 'last_events' will soon reflect * the state where there is little or no outstanding * resync requests, and further resync activity will * always make curr_events less than last_events. * */ Does that read at all well? NeilBrown ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test 2007-05-10 10:04 ` Neil Brown @ 2007-05-10 12:52 ` Jan Engelhardt 0 siblings, 0 replies; 8+ messages in thread From: Jan Engelhardt @ 2007-05-10 12:52 UTC (permalink / raw) To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel On May 10 2007 20:04, Neil Brown wrote: >> >- if ((curr_events - rdev->last_events + 4096) > 8192) { >> >+ if ((long)curr_events - (long)rdev->last_events > 4096) { >> > rdev->last_events = curr_events; >> > idle = 0; >> > } >> >/* sync IO will cause sync_io to increase before the disk_stats > * as sync_io is counted when a request starts, and > * disk_stats is counted when it completes. > * So resync activity will cause curr_events to be smaller than > * when there was no such activity. > * non-sync IO will cause disk_stat to increase without > * increasing sync_io so curr_events will (eventually) > * be larger than it was before. Once it becomes > * substantially larger, the test below will cause > * the array to appear non-idle, and resync will slow > * down. > * If there is a lot of outstanding resync activity when > * we set last_event to curr_events, then all that activity > * completing might cause the array to appear non-idle > * and resync will be slowed down even though there might > * not have been non-resync activity. This will only > * happen once though. 'last_events' will soon reflect > * the state where there is little or no outstanding > * resync requests, and further resync activity will > * always make curr_events less than last_events. > * > */ > >Does that read at all well? It is a more verbose explanation of your patch description, yes. Jan -- ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2007-05-10 12:52 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-10 6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown 2007-05-10 6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown 2007-05-10 6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown 2007-05-10 7:33 ` Andrew Morton 2007-05-10 9:46 ` Neil Brown 2007-05-10 9:48 ` Jan Engelhardt 2007-05-10 10:04 ` Neil Brown 2007-05-10 12:52 ` Jan Engelhardt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).