* [RFC] MD: Restart an interrupted incremental recovery.
@ 2011-10-12 19:41 Andrei Warkentin
From: Andrei Warkentin @ 2011-10-12 19:41 UTC
To: linux-raid; +Cc: Andrei Warkentin, Neil Brown, Andrei Warkentin
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible.
This patch handles 99% of possible interruptions of an
incremental recovery (it doesn't handle I/O failure
that occurs immediately after the hot_add_disk(), which
causes the added rdev to become a spare).
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
---
drivers/md/md.c | 28 +++++++++++++++++++++++++---
include/linux/raid/md_p.h | 6 +++++-
2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5404b22..8002e02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1701,10 +1701,27 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 			break;
 		default:
 			if ((le32_to_cpu(sb->feature_map) &
-			     MD_FEATURE_RECOVERY_OFFSET))
-				rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
-			else
+			     MD_FEATURE_IN_INCREMENTAL)) {
+				/* This rdev had been in an incremental
+				 * recovery, which was interrupted, and
+				 * can be restarted. Note that we specifically
+				 * ignore recovery_offset, as I/O to the degraded
+				 * array could have modified data in sectors
+				 * below the recovery offset.
+				 */
+				char b[BDEVNAME_SIZE];
 				set_bit(In_sync, &rdev->flags);
+				printk(KERN_NOTICE
+				       "md: restarting interrupted incremental resync on %s\n",
+				       bdevname(rdev->bdev, b));
+			} else {
+				if ((le32_to_cpu(sb->feature_map) &
+				     MD_FEATURE_RECOVERY_OFFSET))
+					rdev->recovery_offset =
+						le64_to_cpu(sb->recovery_offset);
+				else
+					set_bit(In_sync, &rdev->flags);
+			}
 			rdev->raid_disk = role;
 			break;
 		}
@@ -1738,6 +1755,11 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 	else
 		sb->resync_offset = cpu_to_le64(0);
 
+	if (mddev->bitmap && rdev->saved_raid_disk != -1)
+		sb->feature_map |= cpu_to_le32(MD_FEATURE_IN_INCREMENTAL);
+	else
+		sb->feature_map &= ~cpu_to_le32(MD_FEATURE_IN_INCREMENTAL);
+
 	sb->cnt_corrected_read = cpu_to_le32(atomic_read(&rdev->corrected_errors));
 
 	sb->raid_disks = cpu_to_le32(mddev->raid_disks);
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 9e65d9e..8a17446 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -277,7 +277,11 @@ struct mdp_superblock_1 {
 					 */
 #define	MD_FEATURE_RESHAPE_ACTIVE	4
 #define	MD_FEATURE_BAD_BLOCKS		8 /* badblock list is not empty */
+#define	MD_FEATURE_IN_INCREMENTAL	16 /* set on disks undergoing incremental
+					    * recovery, so that an interrupted
+					    * incremental recovery may be restarted
+					    */
 
-#define	MD_FEATURE_ALL			(1|2|4|8)
+#define	MD_FEATURE_ALL			(1|2|4|8|16)
 
 #endif
--
1.7.4.1
* Re: [RFC] MD: Restart an interrupted incremental recovery.
From: Andrei E. Warkentin @ 2011-10-12 23:04 UTC
To: Andrei Warkentin; +Cc: linux-raid, Neil Brown
2011/10/12 Andrei Warkentin <andreiw@vmware.com>:
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible.
>
> This patch handles 99% of possible interruptions of an
> incremental recovery (it doesn't handle I/O failure
> that occurs immediately after the hot_add_disk(), which
> causes the added rdev to become a spare).
>
> Cc: Neil Brown <neilb@suse.de>
> Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
> ---
Here is an alternative idea that doesn't involve messing with the
superblocks - don't write out the SB for the recovering device (undergoing
incremental recovery) until the array is no longer degraded.
A
* [RFC] MD: Allow restarting an interrupted incremental recovery.
From: Andrei Warkentin @ 2011-10-12 23:05 UTC
To: linux-raid; +Cc: Andrei Warkentin, Neil Brown
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).
Solve this problem by not updating the superblock on the
recovering device until the array is no longer degraded.
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
---
drivers/md/md.c | 19 +++++++++++++------
drivers/md/md.h | 6 ++++++
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5404b22..153b3c6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2444,9 +2444,12 @@ repeat:
 			continue; /* no noise on spare devices */
 		if (test_bit(Faulty, &rdev->flags))
 			dprintk("(skipping faulty ");
+		else if (test_bit(InIncremental, &rdev->flags))
+			dprintk("(skipping incremental s/r ");
 
 		dprintk("%s ", bdevname(rdev->bdev,b));
-		if (!test_bit(Faulty, &rdev->flags)) {
+		if (!test_bit(Faulty, &rdev->flags) &&
+		    !test_bit(InIncremental, &rdev->flags)) {
 			md_super_write(mddev,rdev,
 				       rdev->sb_start, rdev->sb_size,
 				       rdev->sb_page);
@@ -5490,9 +5493,10 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 			return -EINVAL;
 		}
 
-		if (test_bit(In_sync, &rdev->flags))
+		if (test_bit(In_sync, &rdev->flags)) {
 			rdev->saved_raid_disk = rdev->raid_disk;
-		else
+			set_bit(InIncremental, &rdev->flags);
+		} else
 			rdev->saved_raid_disk = -1;
 
 		clear_bit(In_sync, &rdev->flags); /* just to be sure */
@@ -7353,15 +7357,18 @@ static void reap_sync_thread(mddev_t *mddev)
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 	    mddev->pers->finish_reshape)
 		mddev->pers->finish_reshape(mddev);
-	md_update_sb(mddev, 1);
 
 	/* if array is no-longer degraded, then any saved_raid_disk
-	 * information must be scrapped
+	 * information must be scrapped, and superblock for
+	 * incrementally recovered device written out.
 	 */
 	if (!mddev->degraded)
-		list_for_each_entry(rdev, &mddev->disks, same_set)
+		list_for_each_entry(rdev, &mddev->disks, same_set) {
 			rdev->saved_raid_disk = -1;
+			clear_bit(InIncremental, &rdev->flags);
+		}
 
+	md_update_sb(mddev, 1);
 	clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1e586bb..5e5399a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -104,6 +104,12 @@ struct mdk_rdev_s
 					 * accurately as possible is good, but
 					 * not absolutely critical.
 					 */
+#define	InIncremental	12		/* Device is undergoing incremental
+					 * recovery, hence its superblock
+					 * is not written out until the recovery
+					 * ends, allowing the latter to be
+					 * restarted if interrupted.
+					 */
 	wait_queue_head_t blocked_wait;
 
 	int desc_nr;			/* descriptor index in the superblock */
--
1.7.4.1
* Re: [RFC] MD: Restart an interrupted incremental recovery.
From: Andrei E. Warkentin @ 2011-10-14 1:07 UTC
To: Andrei Warkentin; +Cc: linux-raid, Neil Brown
2011/10/12 Andrei Warkentin <andreiw@vmware.com>:
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible.
>
> This patch handles 99% of possible interruptions of an
> incremental recovery (it doesn't handle I/O failure
> that occurs immediately after the hot_add_disk(), which
> causes the added rdev to become a spare).
>
> Cc: Neil Brown <neilb@suse.de>
> Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Abandoning since there is a better solution (the other patch, which doesn't
touch the recovering drive's sb while an incremental recovery is in progress).
A
* Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
From: Andrei E. Warkentin @ 2011-10-14 1:18 UTC
To: Andrei Warkentin; +Cc: linux-raid, Neil Brown
2011/10/12 Andrei Warkentin <andreiw@vmware.com>:
> If an incremental recovery was interrupted, a subsequent
> re-add will result in a full recovery, even though an
> incremental should be possible (seen with raid1).
>
> Solve this problem by not updating the superblock on the
> recovering device until the array is no longer degraded.
>
> Cc: Neil Brown <neilb@suse.de>
> Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
FWIW, this idea appears to work well, for the following reasons:
1) The recovering sb is not touched until the array is no longer degraded
(only for incremental sync).
2) The events_cleared count isn't updated in the active bitmap sb
until the array is no longer degraded. This implies that if the incremental
was interrupted, recovering_sb->events is NOT less than
active_bitmap->events_cleared.
3) The bitmaps (and sb) are updated on all drives at all times, as before.
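The gate that reason 2 relies on sits in super_1_validate(). For reference,
it reads approximately as follows in md.c of this era (a simplified excerpt
quoted from memory, details may differ; ev1 is the event count read from the
candidate device's superblock):

	if (mddev->bitmap) {
		/* If adding to array with a bitmap, then we can accept an
		 * older device, but not too old.
		 */
		if (ev1 < mddev->bitmap->events_cleared)
			return 0;	/* accepted only as a spare: full resync */
	}

Because the recovering device's sb is frozen for the whole incremental, its
event count can never fall behind events_cleared, so this gate keeps passing
across interruptions.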
How I tested it:
1) Create RAID1 array with bitmap.
2) Degrade array by removing a drive.
3) Write a bunch of data (Gigs...)
4) Re-add removed drive - an incremental recovery is started.
5) Interrupt the incremental.
6) Write some more data.
7) MD5sum the data.
8) Re-add removed drive - an incremental recovery is restarted (I
verified it starts at sector 0, just like you mentioned it should, to
avoid consistency issues). Verified that, indeed, only changed blocks
(as noted by the write-intent bitmap) are synced.
9) Remove the other half.
10) MD5sum the data - hashes match.
Without this fix, you would of course have to deal with a full resync
after the interrupted incremental.
Is there anything you think I'm missing here?
A
* Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
From: NeilBrown @ 2011-10-17 3:20 UTC
To: Andrei E. Warkentin; +Cc: Andrei Warkentin, linux-raid
On Thu, 13 Oct 2011 21:18:43 -0400 "Andrei E. Warkentin"
<andrey.warkentin@gmail.com> wrote:
> 2011/10/12 Andrei Warkentin <andreiw@vmware.com>:
> > If an incremental recovery was interrupted, a subsequent
> > re-add will result in a full recovery, even though an
> > incremental should be possible (seen with raid1).
> >
> > Solve this problem by not updating the superblock on the
> > recovering device until the array is no longer degraded.
> >
> > Cc: Neil Brown <neilb@suse.de>
> > Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
>
> FWIW, this idea appears to work well, for the
> following reasons:
>
> 1) The recovering sb is not touched until the array is no longer
> degraded (only for incremental sync).
> 2) The events_cleared count isn't updated in the active bitmap sb
> until the array is no longer degraded. This implies that if the
> incremental was interrupted, recovering_sb->events is NOT less than
> active_bitmap->events_cleared.
> 3) The bitmaps (and sb) are updated on all drives at all times, as
> before.
>
> How I tested it:
> 1) Create RAID1 array with bitmap.
> 2) Degrade array by removing a drive.
> 3) Write a bunch of data (Gigs...)
> 4) Re-add removed drive - an incremental recovery is started.
> 5) Interrupt the incremental.
> 6) Write some more data.
> 7) MD5sum the data.
> 8) Re-add removed drive - an incremental recovery is restarted (I
> verified it starts at sector 0, just like you mentioned it should, to
> avoid consistency issues). Verified that, indeed, only changed blocks
> (as noted by the write-intent bitmap) are synced.
> 9) Remove the other half.
> 10) MD5sum the data - hashes match.
>
> Without this fix, you would of course have to deal with a full resync
> after the interrupted incremental.
>
> Is there anything you think I'm missing here?
>
> A
Not much, it looks good, and your testing is of course a good sign.
My only thought is whether we really need the new InIncremental flag.
You set it exactly when saved_raid_disk is set, and clear it exactly when
saved_raid_disk is cleared (set to -1).
So maybe we can just use saved_raid_disk.
If you look at it that way, you might notice that saved_raid_disk is also set
in slot_store, so probably InIncremental should be set there. So that might
be the one thing you missed.
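For reference, the slot_store() path in question looks approximately like
this when activating a device in a running array (excerpt from memory,
details may differ):

	rdev->raid_disk = slot;
	if (test_bit(In_sync, &rdev->flags))
		rdev->saved_raid_disk = slot;
	else
		rdev->saved_raid_disk = -1;
	err = rdev->mddev->pers->hot_add_disk(rdev->mddev, rdev);

Note that, unlike add_new_disk(), nothing here clears In_sync afterwards -
which is the suspected bug discussed further down the thread.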
Could you respin the patch without adding InIncremental, and testing
rdev->saved_raid_disk >= 0
instead, check if you agree that should work, and perform a similar test?
(Is that asking too much?).
If you agree that works I would like to go with that version.
Thanks,
NeilBrown
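Concretely, the respin being asked for would key the write-out skip in
md_update_sb() off saved_raid_disk instead of the new flag - roughly as
follows (an illustrative sketch, not the actual respun patch):

	/* Skip the superblock of a device still undergoing incremental
	 * recovery; such a device is exactly one with saved_raid_disk >= 0.
	 */
	if (!test_bit(Faulty, &rdev->flags) &&
	    rdev->saved_raid_disk < 0) {
		md_super_write(mddev, rdev,
			       rdev->sb_start, rdev->sb_size,
			       rdev->sb_page);
		/* ... */
	}

The InIncremental bit in md.h, its setting in add_new_disk(), and its
clearing in reap_sync_thread() would then all drop out, since
rdev->saved_raid_disk already changes at exactly those points.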
* Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
From: Andrei Warkentin @ 2011-10-17 17:26 UTC
To: NeilBrown; +Cc: Andrei Warkentin, linux-raid, Andrei E. Warkentin
Hi Neil,
----- Original Message -----
> From: "NeilBrown" <neilb@suse.de>
> To: "Andrei E. Warkentin" <andrey.warkentin@gmail.com>
> Cc: "Andrei Warkentin" <andreiw@vmware.com>, linux-raid@vger.kernel.org
> Sent: Sunday, October 16, 2011 11:20:39 PM
> Subject: Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
>
> On Thu, 13 Oct 2011 21:18:43 -0400 "Andrei E. Warkentin"
> <andrey.warkentin@gmail.com> wrote:
>
> > 2011/10/12 Andrei Warkentin <andreiw@vmware.com>:
> > > If an incremental recovery was interrupted, a subsequent
> > > re-add will result in a full recovery, even though an
> > > incremental should be possible (seen with raid1).
> > >
> > > Solve this problem by not updating the superblock on the
> > > recovering device until the array is no longer degraded.
> > >
> > > Cc: Neil Brown <neilb@suse.de>
> > > Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
> >
> > FWIW, this idea appears to work well, for the
> > following reasons:
> >
> > 1) The recovering sb is not touched until the array is no longer
> > degraded (only for incremental sync).
> > 2) The events_cleared count isn't updated in the active bitmap sb
> > until the array is no longer degraded. This implies that if the
> > incremental was interrupted, recovering_sb->events is NOT less than
> > active_bitmap->events_cleared.
> > 3) The bitmaps (and sb) are updated on all drives at all times, as
> > before.
> >
> > How I tested it:
> > 1) Create RAID1 array with bitmap.
> > 2) Degrade array by removing a drive.
> > 3) Write a bunch of data (Gigs...)
> > 4) Re-add removed drive - an incremental recovery is started.
> > 5) Interrupt the incremental.
> > 6) Write some more data.
> > 7) MD5sum the data.
> > 8) Re-add removed drive - an incremental recovery is restarted (I
> > verified it starts at sector 0, just like you mentioned it should, to
> > avoid consistency issues). Verified that, indeed, only changed blocks
> > (as noted by the write-intent bitmap) are synced.
> > 9) Remove the other half.
> > 10) MD5sum the data - hashes match.
> >
> > Without this fix, you would of course have to deal with a full
> > resync
> > after the interrupted incremental.
> >
> > Is there anything you think I'm missing here?
> >
> > A
>
> Not much, it looks good, and your testing is of course a good sign.
>
> My only thought is whether we really need the new InIncremental flag.
> You set it exactly when saved_raid_disk is set, and clear it exactly
> when
> saved_raid_disk is cleared (set to -1).
> So maybe we can just use saved_raid_disk.
You're right - I looked at the code paths again (for non-bitmapped disks), and there is
no reason not to leverage saved_raid_disk. It should be safe (i.e. not missing a
necessary sb flush) for non-bitmapped disks - a disk resuming recovery (from rec_offset)
will never have In_sync set, and saved_raid_disk will not be set. For multipath this is
irrelevant as well.
> If you look at it that way, you might notice that saved_raid_disk is
> also set
> in slot_store, so probably InIncremental should be set there. So
> that might
> be the one thing you missed.
>
Actually, maybe you can clarify something a bit about that code? The part where
slot != -1 and mddev->pers != NULL looks a lot like the add_new_disk path - except:
After pers->hot_add_disk:
1) rdev In_sync isn't cleared
2) Array isn't checked for degraded state and MD_RECOVERY_RECOVER isn't conditionally set.
3) MD_RECOVERY_NEEDED isn't set.
4) mddev->thread isn't woken up.
Is this because if an array was degraded AND there were extra spares, they would already be
assigned to the array?
> Could you respin the patch without adding InIncremental, and testing
> rdev->saved_raid_disk >= 0
> instead, check if you agree that should work, and perform a similar
> test?
> (Is that asking too much?).
>
Absolutely! Thanks again!
A
* Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
From: NeilBrown @ 2011-10-17 20:58 UTC
To: Andrei Warkentin; +Cc: Andrei Warkentin, linux-raid, Andrei E. Warkentin
On Mon, 17 Oct 2011 10:26:44 -0700 (PDT) Andrei Warkentin
<awarkentin@vmware.com> wrote:
> Hi Neil,
> > If you look at it that way, you might notice that saved_raid_disk is
> > also set
> > in slot_store, so probably InIncremental should be set there. So
> > that might
> > be the one thing you missed.
> >
>
> Actually, maybe you can clarify something a bit about that code? The part where
> slot != -1 and mddev->pers != NULL looks a lot like the add_new_disk path - except:
>
> After pers->hot_add_disk:
> 1) rdev In_sync isn't cleared
That looks like a bug. I don't think I've ever tested re-adding a device by
setting 'slot' - only adding. And for adding, In_sync is already clear.
Thanks for spotting that.
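The eventual fix would presumably mirror add_new_disk() with something like
this in slot_store(), just before the hot_add_disk() call (sketch, untested):

	clear_bit(In_sync, &rdev->flags); /* just to be sure, as add_new_disk does */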
> 2) Array isn't checked for degraded state and MD_RECOVERY_RECOVER isn't conditionally set.
I think that isn't really needed in add_new_disk.
md_check_recovery will set it before recovery starts if there are spares in
the array.
So I'm inclined to just remove it from add_new_disk, but I feel that I need
to read the code a bit more carefully first.
In any case, the next point makes the omission harmless even if
MD_RECOVERY_RECOVER is needed there for some reason.
> 3) MD_RECOVERY_NEEDED isn't set.
That might be deliberate. It allows a few drives to be added, then recovery
started by writing "recover" to sync_action.
The same could be achieved by writing "frozen" to sync_action first but I
probably wrote this code before I added "frozen".
> 4) mddev->thread isn't woken up.
That goes with 3 - we only wake up the thread so that it notices
MD_RECOVERY_NEEDED.
>
> Is this because if an array was degraded AND there were extra spares, they would already be
> assigned to the array?
Probably, yes.
Thinking a bit more, you would only get to set 'slot' on a spare in an active
array if you had first 'frozen' sync_action .. so some of my comments above
might be a bit wrong.
And given that you have frozen sync_action, you really don't want to set
MD_RECOVERY_NEEDED or start the thread.
Thanks,
NeilBrown
>
> > Could you respin the patch without adding InIncremental, and testing
> > rdev->saved_raid_disk >= 0
> > instead, check if you agree that should work, and perform a similar
> > test?
> > (Is that asking too much?).
> >
>
> Absolutely! Thanks again!
>
> A