linux-raid.vger.kernel.org archive mirror
* [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
@ 2009-12-13  4:17 Dan Williams
  2009-12-13  4:17 ` [PATCH 1/2] md: rcu_read_lock() walk of mddev->disks in md_do_sync() Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dan Williams @ 2009-12-13  4:17 UTC (permalink / raw)
  To: neilb; +Cc: ed.ciechanowski, marcin.labun, linux-raid

Hi Neil,

A bit late, but hopefully acceptable.  This allows mdmon to restart
recovery operations.

     git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md.git for-neil

The modifications to mdadm/mdmon need a bit more testing but you can see
the current results on my 'scratch' [1] branch, and I have copied the
meat of the mdmon implementation below:  'mdmon: add recovery checkpoint
support'.

Thanks,
Dan

---

Dan Williams (2):
      md: rcu_read_lock() walk of mddev->disks in md_do_sync()
      md: add 'recovery_start' sysfs attribute

 Documentation/md.txt |   15 ++++++++---
 drivers/md/md.c      |   71 ++++++++++++++++++++++++++++++++++++++++++++------
 drivers/md/md.h      |    1 +
 3 files changed, 75 insertions(+), 12 deletions(-)

---

[1]: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/mdadm.git scratch

mdmon: add recovery checkpoint support

From: Dan Williams <dan.j.williams@intel.com>

With the md/recovery_start attribute userspace can specify the starting
position of a spare recovery operation.  The manager thread of mdmon is
modified to activate spares prior to notifying the monitor thread of new
arrays.  This allows the monitor thread to process the 'activate spare'
metadata update message before the array is marked active.  If
->process_update() finds that the metadata update is a no-op (i.e. the
spare was already in the process of being rebuilt) then the checkpoint
can be assumed valid; otherwise the checkpoint is reset to zero.  Before
the array is marked active, resume_recovery() is called to update
md/recovery_start.

Recording the recovery checkpoint works the same way as the
md/resync_start case: whenever sync_action transitions to idle the
kernel updates md/recovery_start.
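
As a rough sketch of the flow described above (a temp directory stands in
for the real /sys/block/mdX/md tree, so the paths are illustrative only),
mdmon's side reduces to writing the validated checkpoint and then asking
the kernel to restart recovery:

```shell
# Sketch only: /tmp/md-demo stands in for /sys/block/mdX/md.
md=/tmp/md-demo
mkdir -p "$md"

# resume_recovery(): write the validated checkpoint (in sectors),
# then request that recovery (re)start from that point.
echo 2097152 > "$md/recovery_start"
echo recover > "$md/sync_action"

cat "$md/recovery_start"
```

Against a live array the same two writes would go to the real sysfs
files, and only while the checkpoint is still known to be valid.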

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 managemon.c   |   57 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 mdadm.h       |    2 ++
 monitor.c     |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 super-ddf.c   |    1 +
 super-intel.c |   31 ++++++++++++++++++++++++-------
 5 files changed, 117 insertions(+), 19 deletions(-)


diff --git a/managemon.c b/managemon.c
index 9d73521..7e09d97 100644
--- a/managemon.c
+++ b/managemon.c
@@ -360,8 +360,8 @@ static void manage_container(struct mdstat_ent *mdstat,
 	}
 }
 
-static void manage_member(struct mdstat_ent *mdstat,
-			  struct active_array *a)
+static int manage_member(struct mdstat_ent *mdstat, struct active_array *a,
+			 int init)
 {
 	/* Compare mdstat info with known state of member array.
 	 * We do not need to look for device state changes here, that
@@ -373,9 +373,15 @@ static void manage_member(struct mdstat_ent *mdstat,
 	 * mdstat until the reshape completes FIXME.
 	 *
 	 * Actually, we also want to handle degraded arrays here by
-	 * trying to find and assign a spare.
-	 * We do that whenever the monitor tells us too.
+	 * trying to find and assign a spare.  We do that whenever the
+	 * monitor tells us to, and at init time as a special case.  At
+	 * init time we want to continue any in-progress recovery
+	 * operations before the array becomes writable, and before the
+	 * monitor has a chance to mark the spare disk(s) as missing in
+	 * the metadata.
 	 */
+	int activated = 0;
+
 	// FIXME
 	a->info.array.raid_disks = mdstat->raid_disks;
 	a->info.array.chunk_size = mdstat->chunk_size;
@@ -394,7 +400,7 @@ static void manage_member(struct mdstat_ent *mdstat,
 		 */
 		newdev = a->container->ss->activate_spare(a, &updates);
 		if (!newdev)
-			return;
+			return 0;
 
 		newa = duplicate_aa(a);
 		if (!newa)
@@ -424,8 +430,25 @@ static void manage_member(struct mdstat_ent *mdstat,
 		}
 		queue_metadata_update(updates);
 		updates = NULL;
-		replace_array(a->container, a, newa);
-		sysfs_set_str(&a->info, NULL, "sync_action", "recover");
+		if (init) {
+			/* monitor never saw 'a' so we can free it
+			 * directly, and let the monitor handle the
+			 * start of recovery
+			 */
+			free_aa(a);
+
+			/* monitor needs to see the metadata update
+			 * before the array to verify the recovery
+			 * checkpoint
+			 */
+			check_update_queue(newa->container);
+			barrier();
+			replace_array(newa->container, NULL, newa);
+		} else {
+			replace_array(a->container, a, newa);
+			sysfs_set_str(&a->info, NULL, "sync_action", "recover");
+		}
+		activated = 1;
  out:
 		while (newdev) {
 			d = newdev->next;
@@ -434,6 +457,8 @@ static void manage_member(struct mdstat_ent *mdstat,
 		}
 		free_updates(&updates);
 	}
+
+	return activated;
 }
 
 static int aa_ready(struct active_array *aa)
@@ -557,10 +582,20 @@ static void manage_new(struct mdstat_ent *mdstat,
 		new->container = NULL;
 		free_aa(new);
 	} else {
-		replace_array(container, victim, new);
-		if (failed) {
+		/* try to assign spares before the monitor thread
+		 * activates this array, if that attempt fails, or if
+		 * this array was previously active, just hot add spares
+		 * as normal
+		 */
+		int active = 0;
+
+		if (failed)
 			new->check_degraded = 1;
-			manage_member(mdstat, new);
+		if (victim == NULL)
+			active = manage_member(mdstat, new, 1);
+		if (!active) {
+			replace_array(container, victim, new);
+			manage_member(mdstat, new, 0);
 		}
 	}
 }
@@ -585,7 +620,7 @@ void manage(struct mdstat_ent *mdstat, struct supertype *container)
 		for (a = container->arrays; a; a = a->next) {
 			if (mdstat->devnum == a->devnum) {
 				if (a->container)
-					manage_member(mdstat, a);
+					manage_member(mdstat, a, 0);
 				break;
 			}
 		}
diff --git a/mdadm.h b/mdadm.h
index 9766cdf..d0fc360 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -128,6 +128,8 @@ extern __off64_t lseek64 __P ((int __fd, __off64_t __offset, int __whence));
 #endif
 #endif /* __KLIBC__ */
 
+/* compiler optimization barrier */
+#define barrier() __asm__ __volatile__("": : :"memory")
 
 /*
  * min()/max()/clamp() macros that also do
diff --git a/monitor.c b/monitor.c
index d02e735..c91c0b7 100644
--- a/monitor.c
+++ b/monitor.c
@@ -157,6 +157,46 @@ static void signal_manager(void)
 	syscall(SYS_tgkill, pid, mgr_tid, SIGUSR1);
 }
 
+static enum sync_action resume_recovery(struct active_array *a)
+{
+	/* FIXME currently only honors recovery_start requests when
+	 * there is only one spare in the array.  We cannot guarantee
+	 * the slot assignment selected by the kernel in the multi-spare
+	 * case.
+	 *
+	 * We rely on several events having happened before this point:
+	 * 1/ manage_member() reactivates a previously active spare
+	 * 2/ ->process_update() validates the update record
+	 *    a) if the update is a no-op then the metadata's checkpoint
+	 *       is valid
+	 *    b) if the update modifies metadata the checkpoint is reset
+	 * 3/ ->set_array_state() upon initial activation sets
+	 *    recovery_start to the stored/reset checkpoint
+	 */
+	enum sync_action action = bad_action;
+	struct mdinfo *mdi;
+	int spares = 0;
+
+	for (mdi = a->info.devs; mdi ; mdi = mdi->next)
+		if (mdi->curr_state & DS_SPARE)
+			spares++;
+
+	if (spares == 1) {
+		char buf[30];
+		int rc;
+
+		sprintf(buf, "%llu", a->recovery_start);
+		rc = write_attr(buf, a->recovery_start_fd);
+		if (rc < 0)
+			dprintf("%s: failed to write %llu to recovery_start (%d)\n",
+				__func__, a->recovery_start, errno);
+		action = recover;
+	}
+
+	return action;
+}
+
+
 /* Monitor a set of active md arrays - all of which share the
  * same metadata - and respond to events that require
  * metadata update.
@@ -273,7 +313,9 @@ static int read_and_act(struct active_array *a)
 			if (a->container->ss->set_array_state(a, 2))
 				a->next_state = read_auto; /* array is clean */
 			else {
-				a->next_state = active; /* Now active for recovery etc */
+				/* Now active for recovery etc */
+				a->next_state = active;
+				a->next_action = resume_recovery(a);
 				dirty = 1;
 			}
 		}
@@ -297,6 +339,7 @@ static int read_and_act(struct active_array *a)
 		/* A recovery has finished.  Some disks may be in sync now,
 		 * and the array may no longer be degraded
 		 */
+		a->container->ss->set_array_state(a, a->curr_state <= clean);
 		for (mdi = a->info.devs ; mdi ; mdi = mdi->next) {
 			a->container->ss->set_disk(a, mdi->disk.raid_disk,
 						   mdi->curr_state);
diff --git a/super-ddf.c b/super-ddf.c
index fe83642..c74d7cf 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -3068,6 +3068,7 @@ static int ddf_set_array_state(struct active_array *a, int consistent)
 		consistent = 1;
 		if (!is_resync_complete(a))
 			consistent = 0;
+		a->recovery_start = 0;
 	}
 	if (consistent)
 		ddf->virt->entries[inst].state &= ~DDF_state_inconsistent;
diff --git a/super-intel.c b/super-intel.c
index 3a05ae5..3bc47dd 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -4310,8 +4310,14 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
 	if (consistent == 2 &&
 	    (!is_resync_complete(a) ||
 	     map_state != IMSM_T_STATE_NORMAL ||
-	     dev->vol.migr_state))
+	     dev->vol.migr_state)) {
 		consistent = 0;
+		if (is_rebuilding(dev)) {
+			__u32 units = __le32_to_cpu(dev->vol.curr_migr_unit);
+
+			a->recovery_start = units * blocks_per_migr_unit(dev);
+		}
+	}
 
 	if (is_resync_complete(a)) {
 		/* complete intialization / resync,
@@ -4358,8 +4364,7 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
 
 	/* mark dirty / clean */
 	if (dev->vol.dirty != !consistent) {
-		dprintf("imsm: mark '%s' (%llu)\n",
-			consistent ? "clean" : "dirty", a->resync_start);
+		dprintf("imsm: mark '%s'\n", consistent ? "clean" : "dirty");
 		if (consistent)
 			dev->vol.dirty = 0;
 		else
@@ -4815,8 +4820,6 @@ static void imsm_process_update(struct supertype *st,
 			return;
 		}
 
-		super->updates_pending++;
-
 		/* count failures (excluding rebuilds and the victim)
 		 * to determine map[0] state
 		 */
@@ -4829,12 +4832,26 @@ static void imsm_process_update(struct supertype *st,
 				failed++;
 		}
 
-		/* adding a pristine spare, assign a new index */
+		disk = &dl->disk;
 		if (dl->index < 0) {
+			/* adding a pristine spare, assign a new index */
 			dl->index = super->anchor->num_disks;
 			super->anchor->num_disks++;
+		} else if (dl->index == victim && failed == 0 &&
+			   is_rebuilding(dev) &&
+			   is_configured(disk) && !is_spare(disk)) {
+			/* we don't need to touch metadata to process
+			 * this update record which is our indication
+			 * that a recovery operation can be resumed
+			 */
+			dprintf("%s: resuming recovery of slot%d from %d\n",
+				__func__, victim,
+				__le32_to_cpu(dev->vol.curr_migr_unit));
+			break;
 		}
-		disk = &dl->disk;
+
+		super->updates_pending++;
+
 		disk->status |= CONFIGURED_DISK;
 		disk->status &= ~SPARE_DISK;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 1/2] md: rcu_read_lock() walk of mddev->disks in md_do_sync()
  2009-12-13  4:17 [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Dan Williams
@ 2009-12-13  4:17 ` Dan Williams
  2009-12-13  4:17 ` [PATCH 2/2] md: add 'recovery_start' sysfs attribute Dan Williams
  2009-12-14  4:07 ` [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Neil Brown
  2 siblings, 0 replies; 11+ messages in thread
From: Dan Williams @ 2009-12-13  4:17 UTC (permalink / raw)
  To: neilb; +Cc: ed.ciechanowski, marcin.labun, linux-raid

Other walks of this list are either under rcu_read_lock() or the list
mutation lock (mddev_lock()).  This protects against the improbable case of a
disk being removed from the array at the start of md_do_sync().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/md.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b182f86..3e8fb67 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6342,12 +6342,14 @@ void md_do_sync(mddev_t *mddev)
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->dev_sectors;
 		j = MaxSector;
-		list_for_each_entry(rdev, &mddev->disks, same_set)
+		rcu_read_lock();
+		list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
 			if (rdev->raid_disk >= 0 &&
 			    !test_bit(Faulty, &rdev->flags) &&
 			    !test_bit(In_sync, &rdev->flags) &&
 			    rdev->recovery_offset < j)
 				j = rdev->recovery_offset;
+		rcu_read_unlock();
 	}
 
 	printk(KERN_INFO "md: %s of RAID array %s\n", desc, mdname(mddev));
@@ -6516,12 +6518,14 @@ void md_do_sync(mddev_t *mddev)
 		} else {
 			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 				mddev->curr_resync = MaxSector;
-			list_for_each_entry(rdev, &mddev->disks, same_set)
+			rcu_read_lock();
+			list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
 				if (rdev->raid_disk >= 0 &&
 				    !test_bit(Faulty, &rdev->flags) &&
 				    !test_bit(In_sync, &rdev->flags) &&
 				    rdev->recovery_offset < mddev->curr_resync)
 					rdev->recovery_offset = mddev->curr_resync;
+			rcu_read_unlock();
 		}
 	}
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);



* [PATCH 2/2] md: add 'recovery_start' sysfs attribute
  2009-12-13  4:17 [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Dan Williams
  2009-12-13  4:17 ` [PATCH 1/2] md: rcu_read_lock() walk of mddev->disks in md_do_sync() Dan Williams
@ 2009-12-13  4:17 ` Dan Williams
  2009-12-14  4:07 ` [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Neil Brown
  2 siblings, 0 replies; 11+ messages in thread
From: Dan Williams @ 2009-12-13  4:17 UTC (permalink / raw)
  To: neilb; +Cc: ed.ciechanowski, marcin.labun, linux-raid

Enable external metadata arrays to manage rebuild checkpointing via a
md/recovery_start attribute that overrides rdev->recovery_offset.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 Documentation/md.txt |   15 +++++++++--
 drivers/md/md.c      |   69 +++++++++++++++++++++++++++++++++++++++++++-------
 drivers/md/md.h      |    1 +
 3 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..2b03814 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -233,9 +233,18 @@ All md devices contain:
 
   resync_start
      The point at which resync should start.  If no resync is needed,
-     this will be a very large number.  At array creation it will
-     default to 0, though starting the array as 'clean' will
-     set it much larger.
+     this will be a very large number (or 'none' since 2.6.30-rc1).  At
+     array creation it will default to 0, though starting the array as
+     'clean' will set it much larger.
+
+  recovery_start
+     The point at which recovery should start when rebuilding a degraded
+     array member.  This value overrides the 'recovery_offset' read from
+     the metadata.  Setting this value to zero tells md to use/report
+     the default recovery_offset read from the metadata.  This value
+     auto-resets itself to zero (default recovery_offset) after it has
+     been consumed by the recovery process.  This value cannot be
+     changed while a recovery is in-flight.
 
    new_dev
      This file can be written but not read.  The value written should
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3e8fb67..5f09d40 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2983,6 +2983,56 @@ resync_start_store(mddev_t *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry md_resync_start =
 __ATTR(resync_start, S_IRUGO|S_IWUSR, resync_start_show, resync_start_store);
 
+static sector_t md_recovery_offset(mddev_t *mddev)
+{
+	/* this is sometimes called outside mddev_lock() hence the
+	 * rcu_read_lock()
+	 */
+	sector_t recovery_offset = MaxSector;
+	mdk_rdev_t *rdev;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= 0 &&
+		    !test_bit(Faulty, &rdev->flags) &&
+		    !test_bit(In_sync, &rdev->flags) &&
+		    rdev->recovery_offset < recovery_offset)
+			recovery_offset = rdev->recovery_offset;
+	rcu_read_unlock();
+
+	return recovery_offset;
+}
+
+static ssize_t recovery_start_show(mddev_t *mddev, char *page)
+{
+	unsigned long long recovery_start = mddev->recovery_start;
+
+	if (recovery_start == 0)
+		recovery_start = md_recovery_offset(mddev);
+
+	if (recovery_start == MaxSector)
+		return sprintf(page, "none\n");
+
+	return sprintf(page, "%llu\n", recovery_start);
+}
+
+static ssize_t recovery_start_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	unsigned long long recovery_start;
+
+	if (strict_strtoull(buf, 10, &recovery_start))
+		return -EINVAL;
+
+	if (!mddev->ro || !mddev->degraded || md_recovery_offset(mddev) > 0)
+		return -EBUSY;
+
+	mddev->recovery_start = recovery_start;
+	return len;
+}
+
+static struct md_sysfs_entry md_recovery_start =
+__ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
+
 /*
  * The array state can be:
  *
@@ -3788,6 +3838,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_chunk_size.attr,
 	&md_size.attr,
 	&md_resync_start.attr,
+	&md_recovery_start.attr,
 	&md_metadata.attr,
 	&md_new_device.attr,
 	&md_safe_delay.attr,
@@ -4426,6 +4477,7 @@ out:
 		mddev->dev_sectors = 0;
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
+		mddev->recovery_start = 0;
 		mddev->resync_min = 0;
 		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
@@ -6338,18 +6390,15 @@ void md_do_sync(mddev_t *mddev)
 
 	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sectors = mddev->dev_sectors;
-	else {
+	else if (mddev->recovery_start) {
+		/* userspace requested override of rdev->recovery_offset */
+		max_sectors = mddev->dev_sectors;
+		j = mddev->recovery_start;
+		mddev->recovery_start = 0;
+	} else {
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->dev_sectors;
-		j = MaxSector;
-		rcu_read_lock();
-		list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
-			if (rdev->raid_disk >= 0 &&
-			    !test_bit(Faulty, &rdev->flags) &&
-			    !test_bit(In_sync, &rdev->flags) &&
-			    rdev->recovery_offset < j)
-				j = rdev->recovery_offset;
-		rcu_read_unlock();
+		j = md_recovery_offset(mddev);
 	}
 
 	printk(KERN_INFO "md: %s of RAID array %s\n", desc, mdname(mddev));
diff --git a/drivers/md/md.h b/drivers/md/md.h
index f184b69..03a18b4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -252,6 +252,7 @@ struct mddev_s
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
+	sector_t			recovery_start;	/* override rdev->recovery_offset */
 	sector_t			resync_min;	/* user requested sync
 							 * starts here */
 	sector_t			resync_max;	/* resync should pause



* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-13  4:17 [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Dan Williams
  2009-12-13  4:17 ` [PATCH 1/2] md: rcu_read_lock() walk of mddev->disks in md_do_sync() Dan Williams
  2009-12-13  4:17 ` [PATCH 2/2] md: add 'recovery_start' sysfs attribute Dan Williams
@ 2009-12-14  4:07 ` Neil Brown
  2009-12-14  4:49   ` Dan Williams
  2009-12-15  0:37   ` Dan Williams
  2 siblings, 2 replies; 11+ messages in thread
From: Neil Brown @ 2009-12-14  4:07 UTC (permalink / raw)
  To: Dan Williams; +Cc: ed.ciechanowski, marcin.labun, linux-raid

On Sat, 12 Dec 2009 21:17:01 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Hi Neil,
> 
> A bit late, but hopefully acceptable.  This allows mdmon to restart
> recovery operations.
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md.git for-neil
> 
> The modifications to mdadm/mdmon need a bit more testing but you can see
> the current results on my 'scratch' [1] branch, and I have copied the
> meat of the mdmon implementation below:  'mdmon: add recovery checkpoint
> support'.
> 

Thanks Dan.

> Dan Williams (2):
>       md: rcu_read_lock() walk of mddev->disks in md_do_sync()

This one is fine.

>       md: add 'recovery_start' sysfs attribute

This one I don't like so much.
recovery_start should be a per-device value, not a per-array value,
and each device can theoretically have been recovered to a different place.
We don't make much use of that fact, but maybe we could one day.

So I have changed the code to simply expose rdev->recovery_offset through
sysfs, which should provide all the functionality you need.
Patch below.
It still says "From: Dan Williams" because I started with your patch
and then changed almost all of it (but not quite all).  That seems a
bit odd but doesn't bother me - tell me if it bothers you.

NeilBrown

From d8fe7d6fbbd73a89f7be356791699cb2e9b95f78 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Sat, 12 Dec 2009 21:17:12 -0700
Subject: [PATCH] md: add 'recovery_start' per-device sysfs attribute

Enable external metadata arrays to manage rebuild checkpointing via a
md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/md.txt |   27 +++++++++++++++++++++++----
 drivers/md/md.c      |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 21d26fb..188f476 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -233,9 +233,9 @@ All md devices contain:
 
   resync_start
      The point at which resync should start.  If no resync is needed,
-     this will be a very large number.  At array creation it will
-     default to 0, though starting the array as 'clean' will
-     set it much larger.
+     this will be a very large number (or 'none' since 2.6.30-rc1).  At
+     array creation it will default to 0, though starting the array as
+     'clean' will set it much larger.
 
    new_dev
      This file can be written but not read.  The value written should
@@ -379,8 +379,9 @@ Each directory contains:
 	Writing "writemostly" sets the writemostly flag.
 	Writing "-writemostly" clears the writemostly flag.
 	Writing "blocked" sets the "blocked" flag.
-	Writing "-blocked" clear the "blocked" flag and allows writes
+	Writing "-blocked" clears the "blocked" flag and allows writes
 		to complete.
+	Writing "in_sync" sets the in_sync flag.
 
 	This file responds to select/poll. Any change to 'faulty'
 	or 'blocked' causes an event.
@@ -417,6 +418,24 @@ Each directory contains:
         array.  If a value less than the current component_size is
         written, it will be rejected.
 
+      recovery_start
+
+        When the device is not 'in_sync', this records the number of
+	sectors from the start of the device which are known to be
+	correct.  This is normally zero, but during a recovery
+	operation it will steadily increase, and if the recovery is
+	interrupted, restoring this value can cause recovery to
+	avoid repeating the earlier blocks.  With v1.x metadata, this
+	value is saved and restored automatically.
+
+	This can be set whenever the device is not an active member of
+	the array, either before the array is activated, or before
+	the 'slot' is set.
+
+	Setting this to 'none' is equivalent to setting 'in_sync'.
+	Setting to any other value also clears the 'in_sync' flag.
+	
+
 
 An active md device will also contain and entry for each active device
 in the array.  These are named
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ea64a68..6bf2f5c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2551,12 +2551,49 @@ rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 static struct rdev_sysfs_entry rdev_size =
 __ATTR(size, S_IRUGO|S_IWUSR, rdev_size_show, rdev_size_store);
 
+
+static ssize_t recovery_start_show(mdk_rdev_t *rdev, char *page)
+{
+	unsigned long long recovery_start = rdev->recovery_offset;
+
+	if (test_bit(In_sync, &rdev->flags) ||
+	    recovery_start == MaxSector)
+		return sprintf(page, "none\n");
+
+	return sprintf(page, "%llu\n", recovery_start);
+}
+
+static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	unsigned long long recovery_start;
+
+	if (cmd_match(buf, "none"))
+		recovery_start = MaxSector;
+	else if (strict_strtoull(buf, 10, &recovery_start))
+		return -EINVAL;
+
+	if (rdev->mddev->pers &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	rdev->recovery_offset = recovery_start;
+	if (recovery_start == MaxSector)
+		set_bit(In_sync, &rdev->flags);
+	else
+		clear_bit(In_sync, &rdev->flags);
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_recovery_start =
+__ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
+
 static struct attribute *rdev_default_attrs[] = {
 	&rdev_state.attr,
 	&rdev_errors.attr,
 	&rdev_slot.attr,
 	&rdev_offset.attr,
 	&rdev_size.attr,
+	&rdev_recovery_start.attr,
 	NULL,
 };
 static ssize_t
-- 
1.6.5.4



* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-14  4:07 ` [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Neil Brown
@ 2009-12-14  4:49   ` Dan Williams
  2009-12-14  5:35     ` Neil Brown
  2009-12-15  0:37   ` Dan Williams
  1 sibling, 1 reply; 11+ messages in thread
From: Dan Williams @ 2009-12-14  4:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: ed.ciechanowski, marcin.labun, linux-raid

On Sun, Dec 13, 2009 at 9:07 PM, Neil Brown <neilb@suse.de> wrote:
> On Sat, 12 Dec 2009 21:17:01 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>>       md: add 'recovery_start' sysfs attribute
>
> This one I don't like so much.
> recovery_start should be a per-device value, not a per-array value,
> and each device can theoretically have been recovered to a different place.
> We don't make much use of that fact, but maybe we could one day.
>
> So I have changed the code to simply expose rdev->recovery_offset through
> sysfs, which should provide all the functionality you need.

Yes, and more flexible in the long run, ack.

> Patch below.
> It still says "From: Dan Williams" because I started with your patch
> and then changed almost all of it (but not quite all).  That seems a
> bit odd but doesn't bother me - tell me if it bothers you.

I've typically added an akpm-inspired:
[foo@bar.com: made some mods]
...note when I have touched other people's patches, but it's not a big
deal to me.

> +static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
> +{
> +       unsigned long long recovery_start;
> +
> +       if (cmd_match(buf, "none"))
> +               recovery_start = MaxSector;

Should probably do the same for resync_start to be consistent?

Thanks,
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-14  4:49   ` Dan Williams
@ 2009-12-14  5:35     ` Neil Brown
  0 siblings, 0 replies; 11+ messages in thread
From: Neil Brown @ 2009-12-14  5:35 UTC (permalink / raw)
  To: Dan Williams; +Cc: ed.ciechanowski, marcin.labun, linux-raid

On Sun, 13 Dec 2009 21:49:07 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> > +static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
> > +{
> > +       unsigned long long recovery_start;
> > +
> > +       if (cmd_match(buf, "none"))
> > +               recovery_start = MaxSector;  
> 
> Should probably do the same for resync_start to be consistent?

Good idea, thanks.

I've resubmitted the pull request for all the 2.6.33 patches.

Thanks,
NeilBrown


* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-14  4:07 ` [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33 Neil Brown
  2009-12-14  4:49   ` Dan Williams
@ 2009-12-15  0:37   ` Dan Williams
  2009-12-15  4:19     ` Dan Williams
  1 sibling, 1 reply; 11+ messages in thread
From: Dan Williams @ 2009-12-15  0:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ciechanowski, Ed, Labun, Marcin, linux-raid@vger.kernel.org

On Sun, 2009-12-13 at 21:07 -0700, Neil Brown wrote:
> +static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
> +{
> +	unsigned long long recovery_start;
> +
> +	if (cmd_match(buf, "none"))
> +		recovery_start = MaxSector;
> +	else if (strict_strtoull(buf, 10, &recovery_start))
> +		return -EINVAL;
> +
> +	if (rdev->mddev->pers &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;

Ok, I had a chance to test this out and have a question about how you
envisioned mdmon handling this restriction, which is a bit tighter than
what I had before.  The prior version allowed updates as long as the
array was read-only.  This version forces recovery_start to be written
at sysfs_add_disk() time (before 'slot' is written).  The conceptual
problem I ran into was a race between ->activate_spare() determining the
last valid checkpoint and the monitor thread starting up the array:

->activate_spare(): read recovery checkpoint
( array becomes read/write )
( array becomes dirty, checkpoint invalidated )
sysfs_add_disk(): write invalid recovery checkpoint
( recovery starts from the wrong location )

The scheme I came up with was to not touch recovery_start in the manager
thread and let the monitor thread have the last word on the recovery
checkpoint.  It would only write to md/rdX/recovery_start at the initial
readonly->active transition; otherwise recovery starts from the default
of 0.  Is the patch below off base?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1cc5f2d..bd24e20 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2467,7 +2467,8 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
 	else if (strict_strtoull(buf, 10, &recovery_start))
 		return -EINVAL;
 
-	if (rdev->mddev->pers &&
+	if (rdev->mddev->ro != 1 &&
+	    rdev->mddev->pers &&
 	    rdev->raid_disk >= 0)
 		return -EBUSY;
 




* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-15  0:37   ` Dan Williams
@ 2009-12-15  4:19     ` Dan Williams
  2009-12-15 18:03       ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2009-12-15  4:19 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ciechanowski, Ed, Labun, Marcin, linux-raid@vger.kernel.org

On Mon, Dec 14, 2009 at 5:37 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, 2009-12-13 at 21:07 -0700, Neil Brown wrote:
>> +static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
>> +{
>> +     unsigned long long recovery_start;
>> +
>> +     if (cmd_match(buf, "none"))
>> +             recovery_start = MaxSector;
>> +     else if (strict_strtoull(buf, 10, &recovery_start))
>> +             return -EINVAL;
>> +
>> +     if (rdev->mddev->pers &&
>> +         rdev->raid_disk >= 0)
>> +             return -EBUSY;
>
> Ok, I had a chance to test this out and have a question about how you
> envisioned mdmon handling this restriction which is a bit tighter than
> what I had before.  The prior version allowed updates as long as the
> array was read-only.  This version forces recovery_start to be written
> at sysfs_add_disk() time (before 'slot' is written). The conceptual
> problem I ran into was a race between ->activate_spare() determining the
> last valid checkpoint and the monitor thread starting up the array:
>
> ->activate_spare(): read recovery checkpoint
> ( array becomes read/write )
> ( array becomes dirty, checkpoint invalidated )
> sysfs_add_disk(): write invalid recovery checkpoint
> ( recovery starts from the wrong location )

On second thought, if we get to activate_spare() it's already too
late.  Moving this to mdadm at assembly time (prior to setting
readonly) is a better approach.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-15  4:19     ` Dan Williams
@ 2009-12-15 18:03       ` Dan Williams
  2009-12-16  5:16         ` Neil Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2009-12-15 18:03 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ciechanowski, Ed, Labun, Marcin, linux-raid@vger.kernel.org

On Mon, Dec 14, 2009 at 9:19 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On second thought, if we get to activate_spare() it's already too
> late.  Moving this to mdadm at assembly time (prior to setting
> readonly) is a better approach.
>

Problem.  slot_store() in the array inactive case currently does:

                /* assume it is working */
                clear_bit(Faulty, &rdev->flags);
                clear_bit(WriteMostly, &rdev->flags);
                set_bit(In_sync, &rdev->flags);
                sysfs_notify_dirent(rdev->sysfs_state);

i.e. it marks the disk In_sync even if we specified a recovery_start <
MaxSector.  I see three ways out:

1/ If userspace can guarantee that the array stays inactive, it can
   write 'recovery_start' after 'slot', which also catches attempts to
   cold_add() out-of-sync disks on pre-2.6.33 kernels, but that leaves
   a window of invalid configuration.

2/ Remove the set_bit(In_sync); then for the pre-2.6.33 case userspace
   would need to disallow adding out-of-sync disks and force them
   through the hot_add() path.  This is how mdadm/mdmon currently
   operates, but it is a surprising ABI quirk when switching to/from
   2.6.33.

3/ Allow writes to 'recovery_start' while the array is read-only.  Not
   my favorite, because it requires tricky mdmon logic to catch
   activate_spare() attempts before the monitor thread starts touching
   the array, but it has the benefit of changing no old behavior and
   leaving no window of invalid configuration.

Thoughts?

--
Dan

* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-15 18:03       ` Dan Williams
@ 2009-12-16  5:16         ` Neil Brown
  2009-12-16  6:24           ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2009-12-16  5:16 UTC (permalink / raw)
  To: Dan Williams; +Cc: Ciechanowski, Ed, Labun, Marcin, linux-raid@vger.kernel.org

On Tue, 15 Dec 2009 11:03:06 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Mon, Dec 14, 2009 at 9:19 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On second thought, if we get to activate_spare() it's already too
> > late.  Moving this to mdadm at assembly time (prior to setting
> > readonly) is a better approach.
> >
> 
> Problem.  slot_store() in the array inactive case currently does:
> 
>                 /* assume it is working */
>                 clear_bit(Faulty, &rdev->flags);
>                 clear_bit(WriteMostly, &rdev->flags);
>                 set_bit(In_sync, &rdev->flags);
>                 sysfs_notify_dirent(rdev->sysfs_state);
> 
> i.e. sets the disk insync even if we specified a recovery_start <
> MaxSector.  If userspace can guarantee that the array stays inactive
> then it can write to 'recovery_start' after 'slot' and catch attempts
> to cold_add() out-of-sync disks on pre-2.6.33 kernels, but that gives
> a window of invalid configuration.  The other fix is to remove the
> set_bit(In_sync), and then for the pre-2.6.33 case userspace would
> need to disallow adding out-of-sync disks and force them through the
> hot_add() case.  This is how mdadm/mdmon currently operates, but that
> is a surprising ABI quirk when switching to/from 2.6.33.  A third
> option is to allow recovery_start_store to be modified while the array
> is read only. Although not my favorite, because it requires tricky
> mdmon logic to catch activate_spare() attempts before the monitor
> thread starts touching the array, it has the benefit of not changing
> any old behavior and no window of invalid configuration.  Thoughts??

I'm tempted to wait a bit longer and see if you find a solution,
as you seem to be progressing quite well :-)  But I won't.

I imagine there are two cases:
 1/ assembling an array from devices some of which might be partially
    recovered,
 2/ re-adding a device to an array which is already active.

In the first case, mdadm would:
   - add the disk (write to new_dev)
   - set the slot  - this sets 'In_sync'
   - set the recovery_start - this clears 'In_sync' as required.

In the second case either mdadm or mdmon would:
   - write 'frozen' to sync_action, which would inhibit any call
             to remove_and_add_spares
   - add the disk
   - set recovery_start
   - set the slot
   - write 'recover' to sync_action

It is unfortunate that the setting of 'slot' and 'recovery_start'
must be in different orders in the different cases, but maybe that
isn't a tragedy.

Possibly I could change slot_store in the pers==NULL case to not
set In_sync if recovery_offset were not MaxSector, but
I'm not sure it is worth the effort.

Does that answer your concerns?

NeilBrown


* Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33
  2009-12-16  5:16         ` Neil Brown
@ 2009-12-16  6:24           ` Dan Williams
  0 siblings, 0 replies; 11+ messages in thread
From: Dan Williams @ 2009-12-16  6:24 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ciechanowski, Ed, Labun, Marcin, linux-raid@vger.kernel.org

On Tue, Dec 15, 2009 at 10:16 PM, Neil Brown <neilb@suse.de> wrote:
> I'm tempted to wait a bit longer and see if you find a solution,
> as you seem to be progressing quite well :-)  But I won't.
>
> I imagine there are two cases:
>  1/ assembling an array from devices some of which might be partially
>    recovered,
>  2/ re-adding a device to an array which is already active.
>
> In the first case, mdadm would:
>   - add the disk (write to new_dev)
>   - set the slot  - this sets 'In_sync'
>   - set the recovery_start - this clears 'In_sync' as required.
>
> In the second case either mdadm or mdmon would:
>   - write 'frozen' to sync_action, which would inhibit any call
>             to remove_and_add_spares
>   - add the disk
>   - set recovery_start
>   - set the slot
>   - write 'recover' to sync_action
>
> It is unfortunate that the setting of 'slot' and 'recovery_start'
> must be in different orders in the different cases, but maybe that
> isn't a tragedy.
>
> Possibly I could change slot_store in the pers==NULL case to not
> set In_sync if recovery_offset were not MaxSector, but
> I'm not sure it is worth the effort.
>
> Does that answer your concerns?

I had my mind set on a unified sysfs_add_disk() implementation that
would fall back to recovery_start=0 on older kernels.  But
context-sensitive 'add' routines free me from this mental rat hole.
Undoing the In_sync bit after the slot write is palatable given the
alternatives.  Thanks for the nudge.

--
Dan
