[PATCH 000 of 2] md: Two more bugfixes.

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 000 of 2] md: Two more bugfixes.
@ 2007-05-10  6:22 NeilBrown
  2007-05-10  6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown
  2007-05-10  6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: NeilBrown @ 2007-05-10  6:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown, stable

Following are two bugfixes for md in current kernels.
The first is suitable for -stable is it can allow drive errors
through to the filesystem wrongly.
Both are suitable for 2.6.22.

Thanks,
NeilBrown

 [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.
 [PATCH 002 of 2] md: Improve the is_mddev_idle test

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.
  2007-05-10  6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown
@ 2007-05-10  6:22 ` NeilBrown
  2007-05-10  6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: NeilBrown @ 2007-05-10  6:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown, stable


When a raid1 has only one working drive, we want read error to
propagate up to the filesystem as there is no point failing the last
drive in an array.

Currently the code perform this check is racy.  If a write and a read
a both submitted to a device on a 2-drive raid1, and the write fails
followed by the read failing, the read will see that there is only one
working drive and will pass the failure up, even though the one
working drive is actually the *other* one.

So, tighten up the locking.

Cc: stable@kernel.org
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |   33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2007-05-10 15:51:54.000000000 +1000
+++ ./drivers/md/raid1.c	2007-05-10 15:51:58.000000000 +1000
@@ -271,21 +271,25 @@ static int raid1_end_read_request(struct
 	 */
 	update_head_pos(mirror, r1_bio);
 
-	if (uptodate || (conf->raid_disks - conf->mddev->degraded) <= 1) {
-		/*
-		 * Set R1BIO_Uptodate in our master bio, so that
-		 * we will return a good error code for to the higher
-		 * levels even if IO on some other mirrored buffer fails.
-		 *
-		 * The 'master' represents the composite IO operation to
-		 * user-side. So if something waits for IO, then it will
-		 * wait for the 'master' bio.
+	if (uptodate)
+		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	else {
+		/* If all other devices have failed, we want to return
+		 * the error upwards rather than fail the last device.
+		 * Here we redefine "uptodate" to mean "Don't want to retry"
 		 */
-		if (uptodate)
-			set_bit(R1BIO_Uptodate, &r1_bio->state);
+		unsigned long flags;
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (r1_bio->mddev->degraded == conf->raid_disks ||
+		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
+		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+			uptodate = 1;
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+	}
 
+	if (uptodate)
 		raid_end_bio_io(r1_bio);
-	} else {
+	else {
 		/*
 		 * oops, read error:
 		 */
@@ -992,13 +996,14 @@ static void error(mddev_t *mddev, mdk_rd
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
 		mddev->degraded++;
+		set_bit(Faulty, &rdev->flags);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		/*
 		 * if recovery is running, make sure it aborts.
 		 */
 		set_bit(MD_RECOVERY_ERR, &mddev->recovery);
-	}
-	set_bit(Faulty, &rdev->flags);
+	} else
+		set_bit(Faulty, &rdev->flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	printk(KERN_ALERT "raid1: Disk failure on %s, disabling device. \n"
 		"	Operation continuing on %d devices\n",

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10  6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown
  2007-05-10  6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown
@ 2007-05-10  6:22 ` NeilBrown
  2007-05-10  7:33   ` Andrew Morton
  2007-05-10  9:48   ` Jan Engelhardt
  1 sibling, 2 replies; 8+ messages in thread
From: NeilBrown @ 2007-05-10  6:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Neil Brown

During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they
are.  This test in done in is_mddev_idle, and it is somewhat fragile -
it sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io).
This has problems because total sectors gets updated when a request completes,
while resync io gets updated when the request is submitted.  The time difference
can cause large differenced between the two which do not actually imply 
non-resync activity.  The test currently allows for some fuzz (+/- 4096)
but there are some cases when it is not enough.

The test currently looks for any (non-fuzz) difference, either
positive or negative.  This clearly is not needed.  Any non-sync
activity will cause the total sectors to grow faster than the sync_io
count (never slower) so we only need to look for a positive differences.

If we do this then the amount of in-flight sync io will never cause
the appearance of non-sync IO.  Once enough non-sync IO to worry about
starts happening, resync will be slowed down and the measurements will
thus be more precise (as there is less in-flight) and control of resync
will still be suitably responsive.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10  6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown
@ 2007-05-10  7:33   ` Andrew Morton
  2007-05-10  9:46     ` Neil Brown
  2007-05-10  9:48   ` Jan Engelhardt
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2007-05-10  7:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <neilb@suse.de> wrote:

> The test currently looks for any (non-fuzz) difference, either
> positive or negative.  This clearly is not needed.  Any non-sync
> activity will cause the total sectors to grow faster than the sync_io
> count (never slower) so we only need to look for a positive differences.
> 
> ...
>
> --- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
> +++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
> @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
>  		 *
>  		 * Note: the following is an unsigned comparison.
>  		 */
> -		if ((curr_events - rdev->last_events + 4096) > 8192) {
> +		if ((long)curr_events - (long)rdev->last_events > 4096) {
>  			rdev->last_events = curr_events;
>  			idle = 0;

In which case would unsigned counters be more appropriate?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10  7:33   ` Andrew Morton
@ 2007-05-10  9:46     ` Neil Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2007-05-10  9:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

On Thursday May 10, akpm@linux-foundation.org wrote:
> On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <neilb@suse.de> wrote:
> 
> > The test currently looks for any (non-fuzz) difference, either
> > positive or negative.  This clearly is not needed.  Any non-sync
> > activity will cause the total sectors to grow faster than the sync_io
> > count (never slower) so we only need to look for a positive differences.
> > 
> > ...
> >
> > --- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
> > +++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
> > @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
> >  		 *
> >  		 * Note: the following is an unsigned comparison.
> >  		 */
> > -		if ((curr_events - rdev->last_events + 4096) > 8192) {
> > +		if ((long)curr_events - (long)rdev->last_events > 4096) {
> >  			rdev->last_events = curr_events;
> >  			idle = 0;
> 
> In which case would unsigned counters be more appropriate?

I guess.....

It is really the comparison that I want to be signed, I don't much
care about the counted - they are expected to wrap (though they might
not).
So maybe I really want

     if ((signed long)(curr_events - rdev->last_events) > 4096) {

to make it clear...
But people expect number to be signed by default, so that probably
isn't necessary.

Yeah, I'll make them signed one day.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10  6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown
  2007-05-10  7:33   ` Andrew Morton
@ 2007-05-10  9:48   ` Jan Engelhardt
  2007-05-10 10:04     ` Neil Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Engelhardt @ 2007-05-10  9:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel


On May 10 2007 16:22, NeilBrown wrote:
>
>diff .prev/drivers/md/md.c ./drivers/md/md.c
>--- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
>+++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
>@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
> 		 *
> 		 * Note: the following is an unsigned comparison.
> 		 */
>-		if ((curr_events - rdev->last_events + 4096) > 8192) {
>+		if ((long)curr_events - (long)rdev->last_events > 4096) {
> 			rdev->last_events = curr_events;
> 			idle = 0;
> 		}

What did really change? Unless I am seriously mistaken,

    curr_events - last_evens + 4096 > 8192

is mathematically equivalent to

    curr_events - last_evens        > 4096

The casting to (long) may however force a signed comparison which turns
things quite upside down, and the comment does not apply anymore.


	Jan
-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10  9:48   ` Jan Engelhardt
@ 2007-05-10 10:04     ` Neil Brown
  2007-05-10 12:52       ` Jan Engelhardt
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2007-05-10 10:04 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Andrew Morton, linux-raid, linux-kernel

On Thursday May 10, jengelh@linux01.gwdg.de wrote:
> 
> On May 10 2007 16:22, NeilBrown wrote:
> >
> >diff .prev/drivers/md/md.c ./drivers/md/md.c
> >--- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
> >+++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
> >@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
> > 		 *
> > 		 * Note: the following is an unsigned comparison.
> > 		 */
> >-		if ((curr_events - rdev->last_events + 4096) > 8192) {
> >+		if ((long)curr_events - (long)rdev->last_events > 4096) {
> > 			rdev->last_events = curr_events;
> > 			idle = 0;
> > 		}
> 
> What did really change? Unless I am seriously mistaken,
> 
>     curr_events - last_evens + 4096 > 8192
> 
> is mathematically equivalent to
> 
>     curr_events - last_evens        > 4096
> 
> The casting to (long) may however force a signed comparison which turns
> things quite upside down, and the comment does not apply anymore.

Yes, the use of a signed comparison is the significant difference.
And yes, the comment becomes wrong.  I'm in the process of redrafting
that.  It currently stands at:

		/* sync IO will cause sync_io to increase before the disk_stats
		 * as sync_io is counted when a request starts, and 
		 * disk_stats is counted when it completes.
		 * So resync activity will cause curr_events to be smaller than
		 * when there was no such activity.
		 * non-sync IO will cause disk_stat to increase without
		 * increasing sync_io so curr_events will (eventually)
		 * be larger than it was before.  Once it becomes
		 * substantially larger, the test below will cause
		 * the array to appear non-idle, and resync will slow
		 * down.
		 * If there is a lot of outstanding resync activity when
		 * we set last_event to curr_events, then all that activity
		 * completing might cause the array to appear non-idle
		 * and resync will be slowed down even though there might
		 * not have been non-resync activity.  This will only
		 * happen once though.  'last_events' will soon reflect
		 * the state where there is little or no outstanding
		 * resync requests, and further resync activity will
		 * always make curr_events less than last_events.
		 *
		 */


Does that read at all well?

NeilBrown

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
  2007-05-10 10:04     ` Neil Brown
@ 2007-05-10 12:52       ` Jan Engelhardt
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Engelhardt @ 2007-05-10 12:52 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel


On May 10 2007 20:04, Neil Brown wrote:
>> >-		if ((curr_events - rdev->last_events + 4096) > 8192) {
>> >+		if ((long)curr_events - (long)rdev->last_events > 4096) {
>> > 			rdev->last_events = curr_events;
>> > 			idle = 0;
>> > 		}
>> 
>/* sync IO will cause sync_io to increase before the disk_stats
> * as sync_io is counted when a request starts, and 
> * disk_stats is counted when it completes.
> * So resync activity will cause curr_events to be smaller than
> * when there was no such activity.
> * non-sync IO will cause disk_stat to increase without
> * increasing sync_io so curr_events will (eventually)
> * be larger than it was before.  Once it becomes
> * substantially larger, the test below will cause
> * the array to appear non-idle, and resync will slow
> * down.
> * If there is a lot of outstanding resync activity when
> * we set last_event to curr_events, then all that activity
> * completing might cause the array to appear non-idle
> * and resync will be slowed down even though there might
> * not have been non-resync activity.  This will only
> * happen once though.  'last_events' will soon reflect
> * the state where there is little or no outstanding
> * resync requests, and further resync activity will
> * always make curr_events less than last_events.
> *
> */
>
>Does that read at all well?

It is a more verbose explanation of your patch description, yes.


	Jan
-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2007-05-10 12:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-10  6:22 [PATCH 000 of 2] md: Two more bugfixes NeilBrown
2007-05-10  6:22 ` [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem NeilBrown
2007-05-10  6:22 ` [PATCH 002 of 2] md: Improve the is_mddev_idle test NeilBrown
2007-05-10  7:33   ` Andrew Morton
2007-05-10  9:46     ` Neil Brown
2007-05-10  9:48   ` Jan Engelhardt
2007-05-10 10:04     ` Neil Brown
2007-05-10 12:52       ` Jan Engelhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).