* [md PATCH 00/10] md patches for 3.5: RAID10 reshape
@ 2012-04-03  5:53 NeilBrown
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

The following patches are now in my queue for 3.5.
(I may add some more, but to avoid patch-bombing I'm sending these
 now as I think they are ready.)

This is all about reshaping arrays and particularly RAID10.

As you probably know, a RAID5 reshape often needs a backup file.
That is a pain for a number of reasons, and it is slow.
So RAID10 reshape works without a backup file.

This is achieved by modifying the data_offset during the reshape,
so there is never any need to write data over data that is still
live.
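
A rough sketch of the idea (mine, not code from these patches - the
real selection logic lives in the raid5/raid10 personalities): blocks
that have already been reshaped are addressed through the new offset,
everything else through the old one, so the source of a copy is never
its own destination.

/* Illustrative only: pick the device address for a stripe depending
 * on whether it has already been moved to the new layout.
 */
static sector_t dev_sector(sector_t stripe_sector, int reshaped,
			   sector_t data_offset, sector_t new_data_offset)
{
	return stripe_sector + (reshaped ? new_data_offset : data_offset);
}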

Eventually I'll make RAID5 reshape work this way too (optionally);
however, as 0.90 metadata doesn't allow data_offset to be changed,
this is a v1.x feature only.

I have initial patches to support some of this with mdadm, but I don't
know when they'll make it into a release.  I have a very strong
feeling that mdadm needs some major refactoring work before I can feel
happy about it again, and that might have to come first, or at least
it will not come last.
(I'll probably put the patches in my git tree sometime soon for
experimentation).

Review/comments/etc always welcome.

You can find all of these in my 'for-next' branch (and so in
linux-next in a day or so) if you would prefer to look at them that
way.

Thanks,
NeilBrown


---

NeilBrown (10):
      md/raid10: add reshape support
      md/raid10: split out interpretation of layout to separate function.
      md/raid10: Introduce 'prev' geometry to support reshape.
      md: use resync_max_sectors for reshape as well as resync.
      md: teach sync_page_io about new_data_offset.
      md/raid10: collect some geometry fields into a dedicated structure.
      md/raid5: allow for change in data_offset while managing a reshape.
      md/raid5:  Use correct data_offset for all IO.
      md: add possibility to change data-offset for devices.
      md: allow a reshape operation to be reversed.


 drivers/md/md.c           |  299 ++++++++++-
 drivers/md/md.h           |    8 
 drivers/md/raid1.c        |    4 
 drivers/md/raid10.c       | 1236 ++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid10.h       |   35 +
 drivers/md/raid5.c        |  206 +++++---
 drivers/md/raid5.h        |    6 
 include/linux/raid/md_p.h |   15 -
 8 files changed, 1565 insertions(+), 244 deletions(-)

-- 
Signature



* [md PATCH 01/10] md: allow a reshape operation to be reversed.
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progresses in the opposite direction.

To reverse a partial reshape that changes the number of devices,
you can stop the array and re-assemble it with the raid-disks change
reversed, and the reshape will be undone.

However, for a reshape that does not change the number of devices,
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute which is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape.  Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.
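
For illustration, a minimal userspace sketch of driving the new
attribute (my example, not part of the patch; it assumes an array
named md0):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Only accepted while delta_disks is zero, per the code below. */
	int fd = open("/sys/block/md0/md/reshape_direction", O_WRONLY);
	if (fd < 0)
		return 1;
	if (write(fd, "backwards", strlen("backwards")) < 0) {
		close(fd);
		return 1;
	}
	return close(fd) != 0;
}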

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c           |   67 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/md.h           |    1 +
 drivers/md/raid5.c        |   23 ++++++++-------
 include/linux/raid/md_p.h |    7 ++++-
 4 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b572e1e..1ea2e26 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -605,6 +605,7 @@ void mddev_init(struct mddev *mddev)
 	init_waitqueue_head(&mddev->sb_wait);
 	init_waitqueue_head(&mddev->recovery_wait);
 	mddev->reshape_position = MaxSector;
+	mddev->reshape_backwards = 0;
 	mddev->resync_min = 0;
 	mddev->resync_max = MaxSector;
 	mddev->level = LEVEL_NONE;
@@ -1183,6 +1184,7 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 		mddev->events = ev1;
 		mddev->bitmap_info.offset = 0;
 		mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
+		mddev->reshape_backwards = 0;
 
 		if (mddev->minor_version >= 91) {
 			mddev->reshape_position = sb->reshape_position;
@@ -1190,6 +1192,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_level = sb->new_level;
 			mddev->new_layout = sb->new_layout;
 			mddev->new_chunk_sectors = sb->new_chunk >> 9;
+			if (mddev->delta_disks < 0)
+				mddev->reshape_backwards = 1;
 		} else {
 			mddev->reshape_position = MaxSector;
 			mddev->delta_disks = 0;
@@ -1643,7 +1647,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 		mddev->events = ev1;
 		mddev->bitmap_info.offset = 0;
 		mddev->bitmap_info.default_offset = 1024 >> 9;
-		
+		mddev->reshape_backwards = 0;
+
 		mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
 		memcpy(mddev->uuid, sb->set_uuid, 16);
 
@@ -1660,6 +1665,11 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_level = le32_to_cpu(sb->new_level);
 			mddev->new_layout = le32_to_cpu(sb->new_layout);
 			mddev->new_chunk_sectors = le32_to_cpu(sb->new_chunk);
+			if (mddev->delta_disks < 0 ||
+			    (mddev->delta_disks == 0 &&
+			     (le32_to_cpu(sb->feature_map)
+			      & MD_FEATURE_RESHAPE_BACKWARDS)))
+				mddev->reshape_backwards = 1;
 		} else {
 			mddev->reshape_position = MaxSector;
 			mddev->delta_disks = 0;
@@ -1779,6 +1789,10 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 		sb->delta_disks = cpu_to_le32(mddev->delta_disks);
 		sb->new_level = cpu_to_le32(mddev->new_level);
 		sb->new_chunk = cpu_to_le32(mddev->new_chunk_sectors);
+		if (mddev->delta_disks == 0 &&
+		    mddev->reshape_backwards)
+			sb->feature_map
+				|= cpu_to_le32(MD_FEATURE_RESHAPE_BACKWARDS);
 	}
 
 	if (rdev->badblocks.count == 0)
@@ -3417,6 +3431,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		mddev->new_chunk_sectors = mddev->chunk_sectors;
 		mddev->raid_disks -= mddev->delta_disks;
 		mddev->delta_disks = 0;
+		mddev->reshape_backwards = 0;
 		module_put(pers->owner);
 		printk(KERN_WARNING "md: %s: %s would not accept array\n",
 		       mdname(mddev), clevel);
@@ -3490,6 +3505,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev->layout = mddev->new_layout;
 	mddev->chunk_sectors = mddev->new_chunk_sectors;
 	mddev->delta_disks = 0;
+	mddev->reshape_backwards = 0;
 	mddev->degraded = 0;
 	if (mddev->pers->sync_request == NULL) {
 		/* this is now an array without redundancy, so
@@ -3583,6 +3599,7 @@ raid_disks_store(struct mddev *mddev, const char *buf, size_t len)
 		int olddisks = mddev->raid_disks - mddev->delta_disks;
 		mddev->delta_disks = n - olddisks;
 		mddev->raid_disks = n;
+		mddev->reshape_backwards = (mddev->delta_disks < 0);
 	} else
 		mddev->raid_disks = n;
 	return rv ? rv : len;
@@ -4434,6 +4451,7 @@ reshape_position_store(struct mddev *mddev, const char *buf, size_t len)
 		return -EINVAL;
 	mddev->reshape_position = new;
 	mddev->delta_disks = 0;
+	mddev->reshape_backwards = 0;
 	mddev->new_level = mddev->level;
 	mddev->new_layout = mddev->layout;
 	mddev->new_chunk_sectors = mddev->chunk_sectors;
@@ -4445,6 +4463,42 @@ __ATTR(reshape_position, S_IRUGO|S_IWUSR, reshape_position_show,
        reshape_position_store);
 
 static ssize_t
+reshape_direction_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%s\n",
+		       mddev->reshape_backwards ? "backwards" : "forwards");
+}
+
+static ssize_t
+reshape_direction_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	int backwards = 0;
+	if (cmd_match(buf, "forwards"))
+		backwards = 0;
+	else if (cmd_match(buf, "backwards"))
+		backwards = 1;
+	else
+		return -EINVAL;
+	if (mddev->reshape_backwards == backwards)
+		return len;
+
+	/* check if we are allowed to change */
+	if (mddev->delta_disks)
+		return -EBUSY;
+
+	if (mddev->persistent &&
+	    mddev->major_version == 0)
+		return -EINVAL;
+
+	mddev->reshape_backwards = backwards;
+	return len;
+}
+
+static struct md_sysfs_entry md_reshape_direction =
+__ATTR(reshape_direction, S_IRUGO|S_IWUSR, reshape_direction_show,
+       reshape_direction_store);
+
+static ssize_t
 array_size_show(struct mddev *mddev, char *page)
 {
 	if (mddev->external_size)
@@ -4499,6 +4553,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_safe_delay.attr,
 	&md_array_state.attr,
 	&md_reshape_position.attr,
+	&md_reshape_direction.attr,
 	&md_array_size.attr,
 	&max_corr_read_errors.attr,
 	NULL,
@@ -5062,6 +5117,7 @@ static void md_clean(struct mddev *mddev)
 	mddev->events = 0;
 	mddev->can_decrease_events = 0;
 	mddev->delta_disks = 0;
+	mddev->reshape_backwards = 0;
 	mddev->new_level = LEVEL_NONE;
 	mddev->new_layout = 0;
 	mddev->new_chunk_sectors = 0;
@@ -5886,6 +5942,7 @@ static int set_array_info(struct mddev * mddev, mdu_array_info_t *info)
 	mddev->new_chunk_sectors = mddev->chunk_sectors;
 	mddev->new_layout = mddev->layout;
 	mddev->delta_disks = 0;
+	mddev->reshape_backwards = 0;
 
 	return 0;
 }
@@ -5951,10 +6008,16 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 	if (mddev->sync_thread || mddev->reshape_position != MaxSector)
 		return -EBUSY;
 	mddev->delta_disks = raid_disks - mddev->raid_disks;
+	if (mddev->delta_disks < 0)
+		mddev->reshape_backwards = 1;
+	else if (mddev->delta_disks > 0)
+		mddev->reshape_backwards = 0;
 
 	rv = mddev->pers->check_reshape(mddev);
-	if (rv < 0)
+	if (rv < 0) {
 		mddev->delta_disks = 0;
+		mddev->reshape_backwards = 0;
+	}
 	return rv;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1c2063c..d51c0ca 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -262,6 +262,7 @@ struct mddev {
 	sector_t			reshape_position;
 	int				delta_disks, new_level, new_layout;
 	int				new_chunk_sectors;
+	int				reshape_backwards;
 
 	atomic_t			plug_cnt;	/* If device is expecting
 							 * more bios soon.
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f351422..0abbd34 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3970,13 +3970,13 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 			 * to check again.
 			 */
 			spin_lock_irq(&conf->device_lock);
-			if (mddev->delta_disks < 0
+			if (mddev->reshape_backwards
 			    ? logical_sector < conf->reshape_progress
 			    : logical_sector >= conf->reshape_progress) {
 				disks = conf->previous_raid_disks;
 				previous = 1;
 			} else {
-				if (mddev->delta_disks < 0
+				if (mddev->reshape_backwards
 				    ? logical_sector < conf->reshape_safe
 				    : logical_sector >= conf->reshape_safe) {
 					spin_unlock_irq(&conf->device_lock);
@@ -4009,7 +4009,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 				 */
 				int must_retry = 0;
 				spin_lock_irq(&conf->device_lock);
-				if (mddev->delta_disks < 0
+				if (mddev->reshape_backwards
 				    ? logical_sector >= conf->reshape_progress
 				    : logical_sector < conf->reshape_progress)
 					/* mismatch, need to try again */
@@ -4108,11 +4108,11 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 
 	if (sector_nr == 0) {
 		/* If restarting in the middle, skip the initial sectors */
-		if (mddev->delta_disks < 0 &&
+		if (mddev->reshape_backwards &&
 		    conf->reshape_progress < raid5_size(mddev, 0, 0)) {
 			sector_nr = raid5_size(mddev, 0, 0)
 				- conf->reshape_progress;
-		} else if (mddev->delta_disks >= 0 &&
+		} else if (!mddev->reshape_backwards &&
 			   conf->reshape_progress > 0)
 			sector_nr = conf->reshape_progress;
 		sector_div(sector_nr, new_data_disks);
@@ -4147,7 +4147,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	sector_div(readpos, data_disks);
 	safepos = conf->reshape_safe;
 	sector_div(safepos, data_disks);
-	if (mddev->delta_disks < 0) {
+	if (mddev->reshape_backwards) {
 		writepos -= min_t(sector_t, reshape_sectors, writepos);
 		readpos += reshape_sectors;
 		safepos += reshape_sectors;
@@ -4174,7 +4174,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	 * Maybe that number should be configurable, but I'm not sure it is
 	 * worth it.... maybe it could be a multiple of safemode_delay???
 	 */
-	if ((mddev->delta_disks < 0
+	if ((mddev->reshape_backwards
 	     ? (safepos > writepos && readpos < writepos)
 	     : (safepos < writepos && readpos > writepos)) ||
 	    time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
@@ -4195,7 +4195,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 
-	if (mddev->delta_disks < 0) {
+	if (mddev->reshape_backwards) {
 		BUG_ON(conf->reshape_progress == 0);
 		stripe_addr = writepos;
 		BUG_ON((mddev->dev_sectors &
@@ -4239,7 +4239,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		list_add(&sh->lru, &stripes);
 	}
 	spin_lock_irq(&conf->device_lock);
-	if (mddev->delta_disks < 0)
+	if (mddev->reshape_backwards)
 		conf->reshape_progress -= reshape_sectors * new_data_disks;
 	else
 		conf->reshape_progress += reshape_sectors * new_data_disks;
@@ -5008,7 +5008,7 @@ static int run(struct mddev *mddev)
 				       mdname(mddev));
 				return -EINVAL;
 			}
-		} else if (mddev->delta_disks < 0
+		} else if (mddev->reshape_backwards
 		    ? (here_new * mddev->new_chunk_sectors <=
 		       here_old * mddev->chunk_sectors)
 		    : (here_new * mddev->new_chunk_sectors >=
@@ -5535,7 +5535,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 	conf->chunk_sectors = mddev->new_chunk_sectors;
 	conf->prev_algo = conf->algorithm;
 	conf->algorithm = mddev->new_layout;
-	if (mddev->delta_disks < 0)
+	if (mddev->reshape_backwards)
 		conf->reshape_progress = raid5_size(mddev, 0, 0);
 	else
 		conf->reshape_progress = 0;
@@ -5663,6 +5663,7 @@ static void raid5_finish_reshape(struct mddev *mddev)
 		mddev->chunk_sectors = conf->chunk_sectors;
 		mddev->reshape_position = MaxSector;
 		mddev->delta_disks = 0;
+		mddev->reshape_backwards = 0;
 	}
 }
 
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 8c0a3ad..07e05f9 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -281,10 +281,15 @@ struct mdp_superblock_1 {
 					    * active device with same 'role'.
 					    * 'recovery_offset' is also set.
 					    */
+#define	MD_FEATURE_RESHAPE_BACKWARDS	32 /* Reshape doesn't change number
+					    * of devices, but is going
+					    * backwards anyway.
+					    */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
 					|MD_FEATURE_BAD_BLOCKS		\
-					|MD_FEATURE_REPLACEMENT)
+					|MD_FEATURE_REPLACEMENT		\
+					|MD_FEATURE_RESHAPE_BACKWARDS)
 
 #endif 




* [md PATCH 02/10] md: add possibility to change data-offset for devices.
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

When reshaping, we can avoid a costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previously check that all 'pad' fields were zero,
 we need a new FEATURE flag for this.
 We now (belatedly) check that all remaining 'pad' fields are
 zero, to avoid a repeat of this.)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.
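
A usage sketch (mine, not from the patch; it assumes an md0 member
whose rdev directory is dev-sda1, and a hypothetical target offset of
2048 sectors):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* The value is an absolute data offset in 512-byte sectors;
	 * each member may be given a different one.
	 */
	int fd = open("/sys/block/md0/md/dev-sda1/new_offset", O_WRONLY);
	if (fd < 0)
		return 1;
	if (write(fd, "2048", strlen("2048")) < 0) {
		close(fd);
		return 1;
	}
	return close(fd) != 0;
}

Per new_offset_store() below, increasing the offset is only permitted
once reshape_direction has been set to 'backwards'.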

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be.  At the same time we add a new
arg ('is_new') which is currently always zero but will be used
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors.  So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c           |  220 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h           |    7 +
 drivers/md/raid1.c        |    4 -
 drivers/md/raid10.c       |    8 +-
 drivers/md/raid5.c        |   10 +-
 include/linux/raid/md_p.h |   10 ++
 6 files changed, 225 insertions(+), 34 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1ea2e26..0db613d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1033,12 +1033,17 @@ static unsigned int calc_sb_csum(mdp_super_t * sb)
 struct super_type  {
 	char		    *name;
 	struct module	    *owner;
-	int		    (*load_super)(struct md_rdev *rdev, struct md_rdev *refdev,
+	int		    (*load_super)(struct md_rdev *rdev,
+					  struct md_rdev *refdev,
 					  int minor_version);
-	int		    (*validate_super)(struct mddev *mddev, struct md_rdev *rdev);
-	void		    (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
+	int		    (*validate_super)(struct mddev *mddev,
+					      struct md_rdev *rdev);
+	void		    (*sync_super)(struct mddev *mddev,
+					  struct md_rdev *rdev);
 	unsigned long long  (*rdev_size_change)(struct md_rdev *rdev,
 						sector_t num_sectors);
+	int		    (*allow_new_offset)(struct md_rdev *rdev,
+						unsigned long long new_offset);
 };
 
 /*
@@ -1110,6 +1115,7 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
 
 	rdev->preferred_minor = sb->md_minor;
 	rdev->data_offset = 0;
+	rdev->new_data_offset = 0;
 	rdev->sb_size = MD_SB_BYTES;
 	rdev->badblocks.shift = -1;
 
@@ -1436,6 +1442,12 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	return num_sectors;
 }
 
+static int
+super_90_allow_new_offset(struct md_rdev *rdev, unsigned long long new_offset)
+{
+	/* non-zero offset changes not possible with v0.90 */
+	return new_offset == 0;
+}
 
 /*
  * version 1 superblock
@@ -1471,6 +1483,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	struct mdp_superblock_1 *sb;
 	int ret;
 	sector_t sb_start;
+	sector_t sectors;
 	char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE];
 	int bmask;
 
@@ -1525,9 +1538,18 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 		       bdevname(rdev->bdev,b));
 		return -EINVAL;
 	}
+	if (sb->pad0 ||
+	    sb->pad3[0] ||
+	    memcmp(sb->pad3, sb->pad3+1, sizeof(sb->pad3) - sizeof(sb->pad3[1])))
+		/* Some padding is non-zero, might be a new feature */
+		return -EINVAL;
 
 	rdev->preferred_minor = 0xffff;
 	rdev->data_offset = le64_to_cpu(sb->data_offset);
+	rdev->new_data_offset = rdev->data_offset;
+	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE) &&
+	    (le32_to_cpu(sb->feature_map) & MD_FEATURE_NEW_OFFSET))
+		rdev->new_data_offset += (s32)le32_to_cpu(sb->new_offset);
 	atomic_set(&rdev->corrected_errors, le32_to_cpu(sb->cnt_corrected_read));
 
 	rdev->sb_size = le32_to_cpu(sb->max_dev) * 2 + 256;
@@ -1538,6 +1560,9 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	if (minor_version
 	    && rdev->data_offset < sb_start + (rdev->sb_size/512))
 		return -EINVAL;
+	if (minor_version
+	    && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
+		return -EINVAL;
 
 	if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
 		rdev->desc_nr = -1;
@@ -1609,16 +1634,17 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 		else
 			ret = 0;
 	}
-	if (minor_version)
-		rdev->sectors = (i_size_read(rdev->bdev->bd_inode) >> 9) -
-			le64_to_cpu(sb->data_offset);
-	else
-		rdev->sectors = rdev->sb_start;
-	if (rdev->sectors < le64_to_cpu(sb->data_size))
+	if (minor_version) {
+		sectors = (i_size_read(rdev->bdev->bd_inode) >> 9);
+		if (rdev->data_offset > rdev->new_data_offset)
+			sectors -= rdev->data_offset;
+		else
+			sectors -= rdev->new_data_offset;
+	} else
+		sectors = rdev->sb_start;
+	if (sectors < le64_to_cpu(sb->data_size))
 		return -EINVAL;
 	rdev->sectors = le64_to_cpu(sb->data_size);
-	if (le64_to_cpu(sb->size) > rdev->sectors)
-		return -EINVAL;
 	return ret;
 }
 
@@ -1743,7 +1769,6 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	sb->feature_map = 0;
 	sb->pad0 = 0;
 	sb->recovery_offset = cpu_to_le64(0);
-	memset(sb->pad1, 0, sizeof(sb->pad1));
 	memset(sb->pad3, 0, sizeof(sb->pad3));
 
 	sb->utime = cpu_to_le64((__u64)mddev->utime);
@@ -1765,6 +1790,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 		sb->devflags |= WriteMostly1;
 	else
 		sb->devflags &= ~WriteMostly1;
+	sb->data_offset = cpu_to_le64(rdev->data_offset);
+	sb->data_size = cpu_to_le64(rdev->sectors);
 
 	if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
 		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
@@ -1793,6 +1820,12 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 		    mddev->reshape_backwards)
 			sb->feature_map
 				|= cpu_to_le32(MD_FEATURE_RESHAPE_BACKWARDS);
+		if (rdev->new_data_offset != rdev->data_offset) {
+			sb->feature_map
+				|= cpu_to_le32(MD_FEATURE_NEW_OFFSET);
+			sb->new_offset = cpu_to_le32((__u32)(rdev->new_data_offset
+							     - rdev->data_offset));
+		}
 	}
 
 	if (rdev->badblocks.count == 0)
@@ -1869,6 +1902,8 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	sector_t max_sectors;
 	if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
 		return 0; /* component must fit device */
+	if (rdev->data_offset != rdev->new_data_offset)
+		return 0; /* too confusing */
 	if (rdev->sb_start < rdev->data_offset) {
 		/* minor versions 1 and 2; superblock before data */
 		max_sectors = i_size_read(rdev->bdev->bd_inode) >> 9;
@@ -1896,6 +1931,40 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 		       rdev->sb_page);
 	md_super_wait(rdev->mddev);
 	return num_sectors;
+
+}
+
+static int
+super_1_allow_new_offset(struct md_rdev *rdev,
+			 unsigned long long new_offset)
+{
+	/* All necessary checks on new >= old have been done */
+	struct bitmap *bitmap;
+	if (new_offset >= rdev->data_offset)
+		return 1;
+
+	/* with 1.0 metadata, there is no metadata to tread on
+	 * so we can always move back */
+	if (rdev->mddev->minor_version == 0)
+		return 1;
+
+	/* otherwise we must be sure not to step on
+	 * any metadata, so stay:
+	 * 36K beyond start of superblock
+	 * beyond end of badblocks
+	 * beyond write-intent bitmap
+	 */
+	if (rdev->sb_start + (32+4)*2 > new_offset)
+		return 0;
+	bitmap = rdev->mddev->bitmap;
+	if (bitmap && !rdev->mddev->bitmap_info.file &&
+	    rdev->sb_start + rdev->mddev->bitmap_info.offset +
+	    bitmap->file_pages * (PAGE_SIZE>>9) > new_offset)
+		return 0;
+	if (rdev->badblocks.sector + rdev->badblocks.size > new_offset)
+		return 0;
+
+	return 1;
 }
 
 static struct super_type super_types[] = {
@@ -1906,6 +1975,7 @@ static struct super_type super_types[] = {
 		.validate_super	    = super_90_validate,
 		.sync_super	    = super_90_sync,
 		.rdev_size_change   = super_90_rdev_size_change,
+		.allow_new_offset   = super_90_allow_new_offset,
 	},
 	[1] = {
 		.name	= "md-1",
@@ -1914,6 +1984,7 @@ static struct super_type super_types[] = {
 		.validate_super	    = super_1_validate,
 		.sync_super	    = super_1_sync,
 		.rdev_size_change   = super_1_rdev_size_change,
+		.allow_new_offset   = super_1_allow_new_offset,
 	},
 };
 
@@ -2821,9 +2892,8 @@ offset_show(struct md_rdev *rdev, char *page)
 static ssize_t
 offset_store(struct md_rdev *rdev, const char *buf, size_t len)
 {
-	char *e;
-	unsigned long long offset = simple_strtoull(buf, &e, 10);
-	if (e==buf || (*e && *e != '\n'))
+	unsigned long long offset;
+	if (strict_strtoull(buf, 10, &offset) < 0)
 		return -EINVAL;
 	if (rdev->mddev->pers && rdev->raid_disk >= 0)
 		return -EBUSY;
@@ -2838,6 +2908,63 @@ offset_store(struct md_rdev *rdev, const char *buf, size_t len)
 static struct rdev_sysfs_entry rdev_offset =
 __ATTR(offset, S_IRUGO|S_IWUSR, offset_show, offset_store);
 
+static ssize_t new_offset_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%llu\n",
+		       (unsigned long long)rdev->new_data_offset);
+}
+
+static ssize_t new_offset_store(struct md_rdev *rdev,
+				const char *buf, size_t len)
+{
+	unsigned long long new_offset;
+	struct mddev *mddev = rdev->mddev;
+
+	if (strict_strtoull(buf, 10, &new_offset) < 0)
+		return -EINVAL;
+
+	if (mddev->sync_thread)
+		return -EBUSY;
+	if (new_offset == rdev->data_offset)
+		/* reset is always permitted */
+		;
+	else if (new_offset > rdev->data_offset) {
+		/* must not push array size beyond rdev_sectors */
+		if (new_offset - rdev->data_offset
+		    + mddev->dev_sectors > rdev->sectors)
+				return -E2BIG;
+	}
+	/* Metadata worries about other space details. */
+
+	/* decreasing the offset is inconsistent with a backwards
+	 * reshape.
+	 */
+	if (new_offset < rdev->data_offset &&
+	    mddev->reshape_backwards)
+		return -EINVAL;
+	/* Increasing offset is inconsistent with forwards
+	 * reshape.  reshape_direction should be set to
+	 * 'backwards' first.
+	 */
+	if (new_offset > rdev->data_offset &&
+	    !mddev->reshape_backwards)
+		return -EINVAL;
+
+	if (mddev->pers && mddev->persistent &&
+	    !super_types[mddev->major_version]
+	    .allow_new_offset(rdev, new_offset))
+		return -E2BIG;
+	rdev->new_data_offset = new_offset;
+	if (new_offset > rdev->data_offset)
+		mddev->reshape_backwards = 1;
+	else if (new_offset < rdev->data_offset)
+		mddev->reshape_backwards = 0;
+
+	return len;
+}
+static struct rdev_sysfs_entry rdev_new_offset =
+__ATTR(new_offset, S_IRUGO|S_IWUSR, new_offset_show, new_offset_store);
+
 static ssize_t
 rdev_size_show(struct md_rdev *rdev, char *page)
 {
@@ -2882,6 +3009,8 @@ rdev_size_store(struct md_rdev *rdev, const char *buf, size_t len)
 
 	if (strict_blocks_to_sectors(buf, &sectors) < 0)
 		return -EINVAL;
+	if (rdev->data_offset != rdev->new_data_offset)
+		return -EINVAL; /* too confusing */
 	if (my_mddev->pers && rdev->raid_disk >= 0) {
 		if (my_mddev->persistent) {
 			sectors = super_types[my_mddev->major_version].
@@ -3018,6 +3147,7 @@ static struct attribute *rdev_default_attrs[] = {
 	&rdev_errors.attr,
 	&rdev_slot.attr,
 	&rdev_offset.attr,
+	&rdev_new_offset.attr,
 	&rdev_size.attr,
 	&rdev_recovery_start.attr,
 	&rdev_bad_blocks.attr,
@@ -3092,6 +3222,7 @@ int md_rdev_init(struct md_rdev *rdev)
 	rdev->raid_disk = -1;
 	rdev->flags = 0;
 	rdev->data_offset = 0;
+	rdev->new_data_offset = 0;
 	rdev->sb_events = 0;
 	rdev->last_read_error.tv_sec  = 0;
 	rdev->last_read_error.tv_nsec = 0;
@@ -3596,7 +3727,17 @@ raid_disks_store(struct mddev *mddev, const char *buf, size_t len)
 	if (mddev->pers)
 		rv = update_raid_disks(mddev, n);
 	else if (mddev->reshape_position != MaxSector) {
+		struct md_rdev *rdev;
 		int olddisks = mddev->raid_disks - mddev->delta_disks;
+
+		rdev_for_each(rdev, mddev) {
+			if (olddisks < n &&
+			    rdev->data_offset < rdev->new_data_offset)
+				return -EINVAL;
+			if (olddisks > n &&
+			    rdev->data_offset > rdev->new_data_offset)
+				return -EINVAL;
+		}
 		mddev->delta_disks = n - olddisks;
 		mddev->raid_disks = n;
 		mddev->reshape_backwards = (mddev->delta_disks < 0);
@@ -4443,6 +4584,7 @@ reshape_position_show(struct mddev *mddev, char *page)
 static ssize_t
 reshape_position_store(struct mddev *mddev, const char *buf, size_t len)
 {
+	struct md_rdev *rdev;
 	char *e;
 	unsigned long long new = simple_strtoull(buf, &e, 10);
 	if (mddev->pers)
@@ -4455,6 +4597,8 @@ reshape_position_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev->new_level = mddev->level;
 	mddev->new_layout = mddev->layout;
 	mddev->new_chunk_sectors = mddev->chunk_sectors;
+	rdev_for_each(rdev, mddev)
+		rdev->new_data_offset = rdev->data_offset;
 	return len;
 }
 
@@ -5999,6 +6143,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 static int update_raid_disks(struct mddev *mddev, int raid_disks)
 {
 	int rv;
+	struct md_rdev *rdev;
 	/* change the number of raid disks */
 	if (mddev->pers->check_reshape == NULL)
 		return -EINVAL;
@@ -6007,6 +6152,16 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread || mddev->reshape_position != MaxSector)
 		return -EBUSY;
+
+	rdev_for_each(rdev, mddev) {
+		if (mddev->raid_disks < raid_disks &&
+		    rdev->data_offset < rdev->new_data_offset)
+			return -EINVAL;
+		if (mddev->raid_disks > raid_disks &&
+		    rdev->data_offset > rdev->new_data_offset)
+			return -EINVAL;
+	}
+
 	mddev->delta_disks = raid_disks - mddev->raid_disks;
 	if (mddev->delta_disks < 0)
 		mddev->reshape_backwards = 1;
@@ -7707,6 +7862,20 @@ void md_wait_for_blocked_rdev(struct md_rdev *rdev, struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_wait_for_blocked_rdev);
 
+void md_finish_reshape(struct mddev *mddev)
+{
+	/* called by the personality module when reshape completes. */
+	struct md_rdev *rdev;
+
+	rdev_for_each(rdev, mddev) {
+		if (rdev->data_offset > rdev->new_data_offset)
+			rdev->sectors += rdev->data_offset - rdev->new_data_offset;
+		else
+			rdev->sectors -= rdev->new_data_offset - rdev->data_offset;
+		rdev->data_offset = rdev->new_data_offset;
+	}
+}
+EXPORT_SYMBOL(md_finish_reshape);
 
 /* Bad block management.
  * We can record which blocks on each device are 'bad' and so just
@@ -7955,10 +8124,15 @@ static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
 }
 
 int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
-		       int acknowledged)
+		       int is_new)
 {
-	int rv = md_set_badblocks(&rdev->badblocks,
-				  s + rdev->data_offset, sectors, acknowledged);
+	int rv;
+	if (is_new)
+		s += rdev->new_data_offset;
+	else
+		s += rdev->data_offset;
+	rv = md_set_badblocks(&rdev->badblocks,
+			      s, sectors, 0);
 	if (rv) {
 		/* Make sure they get written out promptly */
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -8064,11 +8238,15 @@ out:
 	return rv;
 }
 
-int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors)
+int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+			 int is_new)
 {
+	if (is_new)
+		s += rdev->new_data_offset;
+	else
+		s += rdev->data_offset;
 	return md_clear_badblocks(&rdev->badblocks,
-				  s + rdev->data_offset,
-				  sectors);
+				  s, sectors);
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d51c0ca..98913e8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -55,6 +55,7 @@ struct md_rdev {
 	int		sb_loaded;
 	__u64		sb_events;
 	sector_t	data_offset;	/* start of data in array */
+	sector_t	new_data_offset;/* only relevant while reshaping */
 	sector_t 	sb_start;	/* offset of the super block (in 512byte sectors) */
 	int		sb_size;	/* bytes in the superblock */
 	int		preferred_minor;	/* autorun support */
@@ -193,8 +194,9 @@ static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
 	return 0;
 }
 extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
-			      int acknowledged);
-extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors);
+			      int is_new);
+extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+				int is_new);
 extern void md_ack_all_badblocks(struct badblocks *bb);
 
 struct mddev {
@@ -592,6 +594,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern void md_finish_reshape(struct mddev *mddev);
 
 extern int mddev_congested(struct mddev *mddev, int bits);
 extern void md_flush_request(struct mddev *mddev, struct bio *bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d35e4c9..90bba78 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2023,7 +2023,7 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 			continue;
 		if (test_bit(BIO_UPTODATE, &bio->bi_flags) &&
 		    test_bit(R1BIO_MadeGood, &r1_bio->state)) {
-			rdev_clear_badblocks(rdev, r1_bio->sector, s);
+			rdev_clear_badblocks(rdev, r1_bio->sector, s, 0);
 		}
 		if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
 		    test_bit(R1BIO_WriteError, &r1_bio->state)) {
@@ -2043,7 +2043,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
 			rdev_clear_badblocks(rdev,
 					     r1_bio->sector,
-					     r1_bio->sectors);
+					     r1_bio->sectors, 0);
 			rdev_dec_pending(rdev, conf->mddev);
 		} else if (r1_bio->bios[m] != NULL) {
 			/* This drive got a write error.  We need to
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index fff7821..b642841 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2480,7 +2480,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 				rdev_clear_badblocks(
 					rdev,
 					r10_bio->devs[m].addr,
-					r10_bio->sectors);
+					r10_bio->sectors, 0);
 			} else {
 				if (!rdev_set_badblocks(
 					    rdev,
@@ -2496,7 +2496,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 				rdev_clear_badblocks(
 					rdev,
 					r10_bio->devs[m].addr,
-					r10_bio->sectors);
+					r10_bio->sectors, 0);
 			} else {
 				if (!rdev_set_badblocks(
 					    rdev,
@@ -2515,7 +2515,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 				rdev_clear_badblocks(
 					rdev,
 					r10_bio->devs[m].addr,
-					r10_bio->sectors);
+					r10_bio->sectors, 0);
 				rdev_dec_pending(rdev, conf->mddev);
 			} else if (bio != NULL &&
 				   !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
@@ -2532,7 +2532,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 				rdev_clear_badblocks(
 					rdev,
 					r10_bio->devs[m].addr,
-					r10_bio->sectors);
+					r10_bio->sectors, 0);
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0abbd34..3705585 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3561,7 +3561,7 @@ finish:
 			if (test_and_clear_bit(R5_MadeGood, &dev->flags)) {
 				rdev = conf->disks[i].rdev;
 				rdev_clear_badblocks(rdev, sh->sector,
-						     STRIPE_SECTORS);
+						     STRIPE_SECTORS, 0);
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 			if (test_and_clear_bit(R5_MadeGoodRepl, &dev->flags)) {
@@ -3570,7 +3570,7 @@ finish:
 					/* rdev have been moved down */
 					rdev = conf->disks[i].rdev;
 				rdev_clear_badblocks(rdev, sh->sector,
-						     STRIPE_SECTORS);
+						     STRIPE_SECTORS, 0);
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 		}
@@ -5505,10 +5505,14 @@ static int raid5_start_reshape(struct mddev *mddev)
 	if (!check_stripe_cache(mddev))
 		return -ENOSPC;
 
-	rdev_for_each(rdev, mddev)
+	rdev_for_each(rdev, mddev) {
+		/* Don't support changing data_offset yet */
+		if (rdev->new_data_offset != rdev->data_offset)
+			return -EINVAL;
 		if (!test_bit(In_sync, &rdev->flags)
 		    && !test_bit(Faulty, &rdev->flags))
 			spares++;
+	}
 
 	if (spares - mddev->degraded < mddev->delta_disks - conf->max_degraded)
 		/* Not enough devices even to make a degraded array
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 07e05f9..ee75353 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -233,7 +233,10 @@ struct mdp_superblock_1 {
 	__le32	delta_disks;	/* change in number of raid_disks		*/
 	__le32	new_layout;	/* new layout					*/
 	__le32	new_chunk;	/* new chunk size (512byte sectors)		*/
-	__u8	pad1[128-124];	/* set to 0 when written */
+	__le32  new_offset;	/* signed number to add to data_offset in new
+				 * layout.  0 == no-change.  This can be
+				 * different on each device in the array.
+				 */
 
 	/* constant this-device information - 64 bytes */
 	__le64	data_offset;	/* sector start of data, often 0 */
@@ -285,11 +288,14 @@ struct mdp_superblock_1 {
 					    * of devices, but is going
 					    * backwards anyway.
 					    */
+#define	MD_FEATURE_NEW_OFFSET		64 /* new_offset must be honoured */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
 					|MD_FEATURE_BAD_BLOCKS		\
 					|MD_FEATURE_REPLACEMENT		\
-					|MD_FEATURE_RESHAPE_BACKWARDS)
+					|MD_FEATURE_RESHAPE_BACKWARDS	\
+					|MD_FEATURE_NEW_OFFSET		\
+					)
 
 #endif 




* [md PATCH 03/10] md/raid5:  Use correct data_offset for all IO.
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.
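
The rule the patch implements in use_new_offset() can be restated as
a one-line sketch (my paraphrase, not extra code from the patch):

/* Use new_data_offset only for a stripe that belongs to the current
 * reshape generation while a reshape is in progress; everything else
 * stays at data_offset.
 */
static int wants_new_offset(int reshape_active, int prev_generation)
{
	return reshape_active && !prev_generation;
}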

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |   72 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 59 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3705585..71d1de9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -488,6 +488,27 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 	return sh;
 }
 
+/* Determine if 'data_offset' or 'new_data_offset' should be used
+ * in this stripe_head.
+ */
+static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
+{
+	sector_t progress = conf->reshape_progress;
+	/* Need a memory barrier to make sure we see the value
+	 * of conf->generation, or ->data_offset that was set before
+	 * reshape_progress was updated.
+	 */
+	smp_rmb();
+	if (progress == MaxSector)
+		return 0;
+	if (sh->generation == conf->generation - 1)
+		return 0;
+	/* We are in a reshape, and this is a new-generation stripe,
+	 * so use new_data_offset.
+	 */
+	return 1;
+}
+
 static void
 raid5_end_read_request(struct bio *bi, int error);
 static void
@@ -603,7 +624,12 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				__func__, (unsigned long long)sh->sector,
 				bi->bi_rw, i);
 			atomic_inc(&sh->count);
-			bi->bi_sector = sh->sector + rdev->data_offset;
+			if (use_new_offset(conf, sh))
+				bi->bi_sector = (sh->sector
+						 + rdev->new_data_offset);
+			else
+				bi->bi_sector = (sh->sector
+						 + rdev->data_offset);
 			bi->bi_flags = 1 << BIO_UPTODATE;
 			bi->bi_idx = 0;
 			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
@@ -627,7 +653,12 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				__func__, (unsigned long long)sh->sector,
 				rbi->bi_rw, i);
 			atomic_inc(&sh->count);
-			rbi->bi_sector = sh->sector + rrdev->data_offset;
+			if (use_new_offset(conf, sh))
+				rbi->bi_sector = (sh->sector
+						  + rrdev->new_data_offset);
+			else
+				rbi->bi_sector = (sh->sector
+						  + rrdev->data_offset);
 			rbi->bi_flags = 1 << BIO_UPTODATE;
 			rbi->bi_idx = 0;
 			rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
@@ -1648,7 +1679,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 	char b[BDEVNAME_SIZE];
 	struct md_rdev *rdev = NULL;
-
+	sector_t s;
 
 	for (i=0 ; i<disks; i++)
 		if (bi == &sh->dev[i].req)
@@ -1671,6 +1702,10 @@ static void raid5_end_read_request(struct bio * bi, int error)
 	if (!rdev)
 		rdev = conf->disks[i].rdev;
 
+	if (use_new_offset(conf, sh))
+		s = sh->sector + rdev->new_data_offset;
+	else
+		s = sh->sector + rdev->data_offset;
 	if (uptodate) {
 		set_bit(R5_UPTODATE, &sh->dev[i].flags);
 		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
@@ -1683,8 +1718,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 				"md/raid:%s: read error corrected"
 				" (%lu sectors at %llu on %s)\n",
 				mdname(conf->mddev), STRIPE_SECTORS,
-				(unsigned long long)(sh->sector
-						     + rdev->data_offset),
+				(unsigned long long)s,
 				bdevname(rdev->bdev, b));
 			atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
 			clear_bit(R5_ReadError, &sh->dev[i].flags);
@@ -1704,8 +1738,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 				"md/raid:%s: read error on replacement device "
 				"(sector %llu on %s).\n",
 				mdname(conf->mddev),
-				(unsigned long long)(sh->sector
-						     + rdev->data_offset),
+				(unsigned long long)s,
 				bdn);
 		else if (conf->mddev->degraded >= conf->max_degraded)
 			printk_ratelimited(
@@ -1713,8 +1746,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 				"md/raid:%s: read error not correctable "
 				"(sector %llu on %s).\n",
 				mdname(conf->mddev),
-				(unsigned long long)(sh->sector
-						     + rdev->data_offset),
+				(unsigned long long)s,
 				bdn);
 		else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
 			/* Oh, no!!! */
@@ -1723,8 +1755,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 				"md/raid:%s: read error NOT corrected!! "
 				"(sector %llu on %s).\n",
 				mdname(conf->mddev),
-				(unsigned long long)(sh->sector
-						     + rdev->data_offset),
+				(unsigned long long)s,
 				bdn);
 		else if (atomic_read(&rdev->read_errors)
 			 > conf->max_nr_stripes)
@@ -3842,6 +3873,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 		raid_bio->bi_next = (void*)rdev;
 		align_bi->bi_bdev =  rdev->bdev;
 		align_bi->bi_flags &= ~(1 << BIO_SEG_VALID);
+		/* No reshape active, so we can trust rdev->data_offset */
 		align_bi->bi_sector += rdev->data_offset;
 
 		if (!bio_fits_rdev(align_bi) ||
@@ -5182,9 +5214,12 @@ static int run(struct mddev *mddev)
 		blk_queue_io_opt(mddev->queue, chunk_size *
 				 (conf->raid_disks - conf->max_degraded));
 
-		rdev_for_each(rdev, mddev)
+		rdev_for_each(rdev, mddev) {
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
 					  rdev->data_offset << 9);
+			disk_stack_limits(mddev->gendisk, rdev->bdev,
+					  rdev->new_data_offset << 9);
+		}
 	}
 
 	return 0;
@@ -5539,12 +5574,16 @@ static int raid5_start_reshape(struct mddev *mddev)
 	conf->chunk_sectors = mddev->new_chunk_sectors;
 	conf->prev_algo = conf->algorithm;
 	conf->algorithm = mddev->new_layout;
+	conf->generation++;
+	/* Code that selects data_offset needs to see the generation update
+	 * if reshape_progress has been set - so a memory barrier needed.
+	 */
+	smp_mb();
 	if (mddev->reshape_backwards)
 		conf->reshape_progress = raid5_size(mddev, 0, 0);
 	else
 		conf->reshape_progress = 0;
 	conf->reshape_safe = conf->reshape_progress;
-	conf->generation++;
 	spin_unlock_irq(&conf->device_lock);
 
 	/* Add some new drives, as many as will fit.
@@ -5596,6 +5635,9 @@ static int raid5_start_reshape(struct mddev *mddev)
 		mddev->recovery = 0;
 		spin_lock_irq(&conf->device_lock);
 		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
+		rdev_for_each(rdev, mddev)
+			rdev->new_data_offset = rdev->data_offset;
+		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		mddev->reshape_position = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
@@ -5614,9 +5656,13 @@ static void end_reshape(struct r5conf *conf)
 {
 
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
+		struct md_rdev *rdev;
 
 		spin_lock_irq(&conf->device_lock);
 		conf->previous_raid_disks = conf->raid_disks;
+		rdev_for_each(rdev, conf->mddev)
+			rdev->data_offset = rdev->new_data_offset;
+		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
 		wake_up(&conf->wait_for_overlap);




* [md PATCH 04/10] md/raid5: allow for change in data_offset while managing a reshape.
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

The important issue here is incorporating the difference in data_offset
into the calculations concerning when we might need to over-write data
that is still thought to be valid.

To this end we find the minimum offset difference across all devices
and add that where appropriate.
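
A sketch of that computation (mine; the patch does the equivalent in
run() with rdev_for_each).  Keeping the difference closest to zero
makes it a margin that holds for every device:

/* diff[i] = new_data_offset - data_offset for member i.
 * Backwards reshape: diffs are >= 0, keep the smallest.
 * Forwards reshape: diffs are <= 0, keep the largest.
 * Either way the result is the diff closest to zero.
 */
static long long min_offset_diff(const long long *diff, int n, int backwards)
{
	long long m = diff[0];
	int i;

	for (i = 1; i < n; i++)
		if (backwards ? diff[i] < m : diff[i] > m)
			m = diff[i];
	return m;
}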

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |  109 ++++++++++++++++++++++++++++++++++++----------------
 drivers/md/raid5.h |    6 +++
 2 files changed, 82 insertions(+), 33 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 71d1de9..0172bdd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4165,13 +4165,11 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	else
 		reshape_sectors = mddev->chunk_sectors;
 
-	/* we update the metadata when there is more than 3Meg
-	 * in the block range (that is rather arbitrary, should
-	 * probably be time based) or when the data about to be
-	 * copied would over-write the source of the data at
-	 * the front of the range.
-	 * i.e. one new_stripe along from reshape_progress new_maps
-	 * to after where reshape_safe old_maps to
+	/* We update the metadata at least every 10 seconds, or when
+	 * the data about to be copied would over-write the source of
+	 * the data at the front of the range.  i.e. one new_stripe
+	 * along from reshape_progress new_maps to after where
+	 * reshape_safe old_maps to
 	 */
 	writepos = conf->reshape_progress;
 	sector_div(writepos, new_data_disks);
@@ -4189,11 +4187,29 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		safepos -= min_t(sector_t, reshape_sectors, safepos);
 	}
 
+	/* Having calculated the 'writepos' possibly use it
+	 * to set 'stripe_addr' which is where we will write to.
+	 */
+	if (mddev->reshape_backwards) {
+		BUG_ON(conf->reshape_progress == 0);
+		stripe_addr = writepos;
+		BUG_ON((mddev->dev_sectors &
+			~((sector_t)reshape_sectors - 1))
+		       - reshape_sectors - stripe_addr
+		       != sector_nr);
+	} else {
+		BUG_ON(writepos != sector_nr + reshape_sectors);
+		stripe_addr = sector_nr;
+	}
+
 	/* 'writepos' is the most advanced device address we might write.
 	 * 'readpos' is the least advanced device address we might read.
 	 * 'safepos' is the least address recorded in the metadata as having
 	 *     been reshaped.
-	 * If 'readpos' is behind 'writepos', then there is no way that we can
+	 * If there is a min_offset_diff, these are adjusted either by
+	 * increasing the safepos/readpos if diff is negative, or
+	 * increasing writepos if diff is positive.
+	 * If 'readpos' is then behind 'writepos', there is no way that we can
 	 * ensure safety in the face of a crash - that must be done by userspace
 	 * making a backup of the data.  So in that case there is no particular
 	 * rush to update metadata.
@@ -4206,6 +4222,12 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	 * Maybe that number should be configurable, but I'm not sure it is
 	 * worth it.... maybe it could be a multiple of safemode_delay???
 	 */
+	if (conf->min_offset_diff < 0) {
+		safepos += -conf->min_offset_diff;
+		readpos += -conf->min_offset_diff;
+	} else
+		writepos += conf->min_offset_diff;
+
 	if ((mddev->reshape_backwards
 	     ? (safepos > writepos && readpos < writepos)
 	     : (safepos < writepos && readpos > writepos)) ||
@@ -4227,17 +4249,6 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 
-	if (mddev->reshape_backwards) {
-		BUG_ON(conf->reshape_progress == 0);
-		stripe_addr = writepos;
-		BUG_ON((mddev->dev_sectors &
-			~((sector_t)reshape_sectors - 1))
-		       - reshape_sectors - stripe_addr
-		       != sector_nr);
-	} else {
-		BUG_ON(writepos != sector_nr + reshape_sectors);
-		stripe_addr = sector_nr;
-	}
 	INIT_LIST_HEAD(&stripes);
 	for (i = 0; i < reshape_sectors; i += STRIPE_SECTORS) {
 		int j;
@@ -4984,16 +4995,42 @@ static int run(struct mddev *mddev)
 	struct md_rdev *rdev;
 	sector_t reshape_offset = 0;
 	int i;
+	long long min_offset_diff = 0;
+	int first = 1;
 
 	if (mddev->recovery_cp != MaxSector)
 		printk(KERN_NOTICE "md/raid:%s: not clean"
 		       " -- starting background reconstruction\n",
 		       mdname(mddev));
+
+	rdev_for_each(rdev, mddev) {
+		long long diff;
+		if (rdev->raid_disk < 0)
+			continue;
+		diff = (rdev->new_data_offset - rdev->data_offset);
+		if (first) {
+			min_offset_diff = diff;
+			first = 0;
+		} else if (mddev->reshape_backwards &&
+			 diff < min_offset_diff)
+			min_offset_diff = diff;
+		else if (!mddev->reshape_backwards &&
+			 diff > min_offset_diff)
+			min_offset_diff = diff;
+	}
+
 	if (mddev->reshape_position != MaxSector) {
 		/* Check that we can continue the reshape.
-		 * Currently only disks can change, it must
-		 * increase, and we must be past the point where
-		 * a stripe over-writes itself
+		 * Difficulties arise if the stripe we would write to
+		 * next is at or after the stripe we would read from next.
+		 * For a reshape that changes the number of devices, this
+		 * is only possible for a very short time, and mdadm makes
+		 * sure that time appears to have passed before assembling
+		 * the array.  So we fail if that time hasn't passed.
+		 * For a reshape that keeps the number of devices the same
+		 * mdadm must be monitoring the reshape and keeping the
+		 * critical areas read-only and backed up.  It will start
+		 * the array in read-only mode, so we check for that.
 		 */
 		sector_t here_new, here_old;
 		int old_disks;
@@ -5025,26 +5062,34 @@ static int run(struct mddev *mddev)
 		/* here_old is the first stripe that we might need to read
 		 * from */
 		if (mddev->delta_disks == 0) {
+			if ((here_new * mddev->new_chunk_sectors !=
+			     here_old * mddev->chunk_sectors)) {
+				printk(KERN_ERR "md/raid:%s: reshape position is"
+				       " confused - aborting\n", mdname(mddev));
+				return -EINVAL;
+			}
 			/* We cannot be sure it is safe to start an in-place
-			 * reshape.  It is only safe if user-space if monitoring
+			 * reshape.  It is only safe if user-space is monitoring
 			 * and taking constant backups.
 			 * mdadm always starts a situation like this in
 			 * readonly mode so it can take control before
 			 * allowing any writes.  So just check for that.
 			 */
-			if ((here_new * mddev->new_chunk_sectors != 
-			     here_old * mddev->chunk_sectors) ||
-			    mddev->ro == 0) {
-				printk(KERN_ERR "md/raid:%s: in-place reshape must be started"
-				       " in read-only mode - aborting\n",
+			if (abs(min_offset_diff) >= mddev->chunk_sectors &&
+			    abs(min_offset_diff) >= mddev->new_chunk_sectors)
+				/* not really in-place - so OK */;
+			else if (mddev->ro == 0) {
+				printk(KERN_ERR "md/raid:%s: in-place reshape "
+				       "must be started in read-only mode "
+				       "- aborting\n",
 				       mdname(mddev));
 				return -EINVAL;
 			}
 		} else if (mddev->reshape_backwards
-		    ? (here_new * mddev->new_chunk_sectors <=
+		    ? (here_new * mddev->new_chunk_sectors + min_offset_diff <=
 		       here_old * mddev->chunk_sectors)
 		    : (here_new * mddev->new_chunk_sectors >=
-		       here_old * mddev->chunk_sectors)) {
+		       here_old * mddev->chunk_sectors + (-min_offset_diff))) {
 			/* Reading from the same stripe as writing to - bad */
 			printk(KERN_ERR "md/raid:%s: reshape_position too early for "
 			       "auto-recovery - aborting.\n",
@@ -5069,6 +5114,7 @@ static int run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
+	conf->min_offset_diff = min_offset_diff;
 	mddev->thread = conf->thread;
 	conf->thread = NULL;
 	mddev->private = conf;
@@ -5541,9 +5587,6 @@ static int raid5_start_reshape(struct mddev *mddev)
 		return -ENOSPC;
 
 	rdev_for_each(rdev, mddev) {
-		/* Don't support changing data_offset yet */
-		if (rdev->new_data_offset != rdev->data_offset)
-			return -EINVAL;
 		if (!test_bit(In_sync, &rdev->flags)
 		    && !test_bit(Faulty, &rdev->flags))
 			spares++;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8d8e139..c6bdfa0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -385,6 +385,12 @@ struct r5conf {
 	short			generation; /* increments with every reshape */
 	unsigned long		reshape_checkpoint; /* Time we last updated
 						     * metadata */
+	long long		min_offset_diff; /* minimum difference between
+						  * data_offset and
+						  * new_data_offset across all
+						  * devices.  May be negative,
+						  * but is closest to zero.
+						  */
 
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	hold_list; /* preread ready stripes */




* [md PATCH 05/10] md/raid10: collect some geometry fields into a dedicated structure.
From: NeilBrown @ 2012-04-03  5:53 UTC
  To: linux-raid

We will shortly be adding reshape support for RAID10, which will
require it to have two concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there.  Then we will more easily be able to add
a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.
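
For reference, a sketch of the structure this implies (inferred from
the accesses below; the authoritative definition is in the raid10.h
hunk of this patch):

/* Geometry of one RAID10 layout, reached via conf->geo below. */
struct geom {
	int		raid_disks;
	int		near_copies;	/* copies adjacent on nearby devices */
	int		far_copies;	/* copies far away on each device */
	int		far_offset;	/* far copies are offset, not far */
	sector_t	stride;		/* distance between far copies */
	sector_t	chunk_mask;
	int		chunk_shift;
};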

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid10.c |  192 ++++++++++++++++++++++++++-------------------------
 drivers/md/raid10.h |   23 +++---
 2 files changed, 111 insertions(+), 104 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b642841..39f5239 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -511,42 +511,43 @@ static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
 	sector_t chunk;
 	sector_t stripe;
 	int dev;
+	struct geom *geo = &conf->geo;
 
 	int slot = 0;
 
 	/* now calculate first sector/dev */
-	chunk = r10bio->sector >> conf->chunk_shift;
-	sector = r10bio->sector & conf->chunk_mask;
+	chunk = r10bio->sector >> geo->chunk_shift;
+	sector = r10bio->sector & geo->chunk_mask;
 
-	chunk *= conf->near_copies;
+	chunk *= geo->near_copies;
 	stripe = chunk;
-	dev = sector_div(stripe, conf->raid_disks);
-	if (conf->far_offset)
-		stripe *= conf->far_copies;
+	dev = sector_div(stripe, geo->raid_disks);
+	if (geo->far_offset)
+		stripe *= geo->far_copies;
 
-	sector += stripe << conf->chunk_shift;
+	sector += stripe << geo->chunk_shift;
 
 	/* and calculate all the others */
-	for (n=0; n < conf->near_copies; n++) {
+	for (n = 0; n < geo->near_copies; n++) {
 		int d = dev;
 		sector_t s = sector;
 		r10bio->devs[slot].addr = sector;
 		r10bio->devs[slot].devnum = d;
 		slot++;
 
-		for (f = 1; f < conf->far_copies; f++) {
-			d += conf->near_copies;
-			if (d >= conf->raid_disks)
-				d -= conf->raid_disks;
-			s += conf->stride;
+		for (f = 1; f < geo->far_copies; f++) {
+			d += geo->near_copies;
+			if (d >= geo->raid_disks)
+				d -= geo->raid_disks;
+			s += geo->stride;
 			r10bio->devs[slot].devnum = d;
 			r10bio->devs[slot].addr = s;
 			slot++;
 		}
 		dev++;
-		if (dev >= conf->raid_disks) {
+		if (dev >= geo->raid_disks) {
 			dev = 0;
-			sector += (conf->chunk_mask + 1);
+			sector += (geo->chunk_mask + 1);
 		}
 	}
 	BUG_ON(slot != conf->copies);
@@ -555,28 +556,29 @@ static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
 static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
 {
 	sector_t offset, chunk, vchunk;
+	struct geom *geo = &conf->geo;
 
-	offset = sector & conf->chunk_mask;
-	if (conf->far_offset) {
+	offset = sector & geo->chunk_mask;
+	if (geo->far_offset) {
 		int fc;
-		chunk = sector >> conf->chunk_shift;
-		fc = sector_div(chunk, conf->far_copies);
-		dev -= fc * conf->near_copies;
+		chunk = sector >> geo->chunk_shift;
+		fc = sector_div(chunk, geo->far_copies);
+		dev -= fc * geo->near_copies;
 		if (dev < 0)
-			dev += conf->raid_disks;
+			dev += geo->raid_disks;
 	} else {
-		while (sector >= conf->stride) {
-			sector -= conf->stride;
-			if (dev < conf->near_copies)
-				dev += conf->raid_disks - conf->near_copies;
+		while (sector >= geo->stride) {
+			sector -= geo->stride;
+			if (dev < geo->near_copies)
+				dev += geo->raid_disks - geo->near_copies;
 			else
-				dev -= conf->near_copies;
+				dev -= geo->near_copies;
 		}
-		chunk = sector >> conf->chunk_shift;
+		chunk = sector >> geo->chunk_shift;
 	}
-	vchunk = chunk * conf->raid_disks + dev;
-	sector_div(vchunk, conf->near_copies);
-	return (vchunk << conf->chunk_shift) + offset;
+	vchunk = chunk * geo->raid_disks + dev;
+	sector_div(vchunk, geo->near_copies);
+	return (vchunk << geo->chunk_shift) + offset;
 }
 
 /**
@@ -599,8 +601,9 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 	int max;
 	unsigned int chunk_sectors = mddev->chunk_sectors;
 	unsigned int bio_sectors = bvm->bi_size >> 9;
+	struct geom *geo = &conf->geo;
 
-	if (conf->near_copies < conf->raid_disks) {
+	if (geo->near_copies < geo->raid_disks) {
 		max = (chunk_sectors - ((sector & (chunk_sectors - 1))
 					+ bio_sectors)) << 9;
 		if (max < 0)
@@ -681,6 +684,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 	struct md_rdev *rdev, *best_rdev;
 	int do_balance;
 	int best_slot;
+	struct geom *geo = &conf->geo;
 
 	raid10_find_phys(conf, r10_bio);
 	rcu_read_lock();
@@ -761,11 +765,11 @@ retry:
 		 * sequential read speed for 'far copies' arrays.  So only
 		 * keep it for 'near' arrays, and review those later.
 		 */
-		if (conf->near_copies > 1 && !atomic_read(&rdev->nr_pending))
+		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
 			break;
 
 		/* for far > 1 always use the lowest address */
-		if (conf->far_copies > 1)
+		if (geo->far_copies > 1)
 			new_distance = r10_bio->devs[slot].addr;
 		else
 			new_distance = abs(r10_bio->devs[slot].addr -
@@ -812,7 +816,7 @@ static int raid10_congested(void *data, int bits)
 	if (mddev_congested(mddev, bits))
 		return 1;
 	rcu_read_lock();
-	for (i = 0; i < conf->raid_disks && ret == 0; i++) {
+	for (i = 0; i < conf->geo.raid_disks && ret == 0; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
@@ -979,7 +983,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	struct r10bio *r10_bio;
 	struct bio *read_bio;
 	int i;
-	int chunk_sects = conf->chunk_mask + 1;
+	sector_t chunk_mask = conf->geo.chunk_mask;
+	int chunk_sects = chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
 	const unsigned long do_fua = (bio->bi_rw & REQ_FUA);
@@ -997,9 +1002,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	/* If this request crosses a chunk boundary, we need to
 	 * split it.  This will only happen for 1 PAGE (or less) requests.
 	 */
-	if (unlikely( (bio->bi_sector & conf->chunk_mask) + (bio->bi_size >> 9)
-		      > chunk_sects &&
-		    conf->near_copies < conf->raid_disks)) {
+	if (unlikely((bio->bi_sector & chunk_mask) + (bio->bi_size >> 9)
+		     > chunk_sects
+		     && conf->geo.near_copies < conf->geo.raid_disks)) {
 		struct bio_pair *bp;
 		/* Sanity check -- queue functions should prevent this happening */
 		if (bio->bi_vcnt != 1 ||
@@ -1368,19 +1373,19 @@ static void status(struct seq_file *seq, struct mddev *mddev)
 	struct r10conf *conf = mddev->private;
 	int i;
 
-	if (conf->near_copies < conf->raid_disks)
+	if (conf->geo.near_copies < conf->geo.raid_disks)
 		seq_printf(seq, " %dK chunks", mddev->chunk_sectors / 2);
-	if (conf->near_copies > 1)
-		seq_printf(seq, " %d near-copies", conf->near_copies);
-	if (conf->far_copies > 1) {
-		if (conf->far_offset)
-			seq_printf(seq, " %d offset-copies", conf->far_copies);
+	if (conf->geo.near_copies > 1)
+		seq_printf(seq, " %d near-copies", conf->geo.near_copies);
+	if (conf->geo.far_copies > 1) {
+		if (conf->geo.far_offset)
+			seq_printf(seq, " %d offset-copies", conf->geo.far_copies);
 		else
-			seq_printf(seq, " %d far-copies", conf->far_copies);
+			seq_printf(seq, " %d far-copies", conf->geo.far_copies);
 	}
-	seq_printf(seq, " [%d/%d] [", conf->raid_disks,
-					conf->raid_disks - mddev->degraded);
-	for (i = 0; i < conf->raid_disks; i++)
+	seq_printf(seq, " [%d/%d] [", conf->geo.raid_disks,
+					conf->geo.raid_disks - mddev->degraded);
+	for (i = 0; i < conf->geo.raid_disks; i++)
 		seq_printf(seq, "%s",
 			      conf->mirrors[i].rdev &&
 			      test_bit(In_sync, &conf->mirrors[i].rdev->flags) ? "U" : "_");
@@ -1403,7 +1408,7 @@ static int enough(struct r10conf *conf, int ignore)
 			if (conf->mirrors[first].rdev &&
 			    first != ignore)
 				cnt++;
-			first = (first+1) % conf->raid_disks;
+			first = (first+1) % conf->geo.raid_disks;
 		}
 		if (cnt == 0)
 			return 0;
@@ -1445,7 +1450,7 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
 	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid10:%s: Operation continuing on %d devices.\n",
 	       mdname(mddev), bdevname(rdev->bdev, b),
-	       mdname(mddev), conf->raid_disks - mddev->degraded);
+	       mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r10conf *conf)
@@ -1458,10 +1463,10 @@ static void print_conf(struct r10conf *conf)
 		printk(KERN_DEBUG "(!conf)\n");
 		return;
 	}
-	printk(KERN_DEBUG " --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded,
-		conf->raid_disks);
+	printk(KERN_DEBUG " --- wd:%d rd:%d\n", conf->geo.raid_disks - conf->mddev->degraded,
+		conf->geo.raid_disks);
 
-	for (i = 0; i < conf->raid_disks; i++) {
+	for (i = 0; i < conf->geo.raid_disks; i++) {
 		char b[BDEVNAME_SIZE];
 		tmp = conf->mirrors + i;
 		if (tmp->rdev)
@@ -1493,7 +1498,7 @@ static int raid10_spare_active(struct mddev *mddev)
 	 * Find all non-in_sync disks within the RAID10 configuration
 	 * and mark them in_sync
 	 */
-	for (i = 0; i < conf->raid_disks; i++) {
+	for (i = 0; i < conf->geo.raid_disks; i++) {
 		tmp = conf->mirrors + i;
 		if (tmp->replacement
 		    && tmp->replacement->recovery_offset == MaxSector
@@ -1535,7 +1540,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	int err = -EEXIST;
 	int mirror;
 	int first = 0;
-	int last = conf->raid_disks - 1;
+	int last = conf->geo.raid_disks - 1;
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
 
 	if (mddev->recovery_cp < MaxSector)
@@ -2603,7 +2608,7 @@ static int init_resync(struct r10conf *conf)
 	buffs = RESYNC_WINDOW / RESYNC_BLOCK_SIZE;
 	BUG_ON(conf->r10buf_pool);
 	conf->have_replacement = 0;
-	for (i = 0; i < conf->raid_disks; i++)
+	for (i = 0; i < conf->geo.raid_disks; i++)
 		if (conf->mirrors[i].replacement)
 			conf->have_replacement = 1;
 	conf->r10buf_pool = mempool_create(buffs, r10buf_pool_alloc, r10buf_pool_free, conf);
@@ -2657,6 +2662,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 	sector_t sync_blocks;
 	sector_t sectors_skipped = 0;
 	int chunks_skipped = 0;
+	sector_t chunk_mask = conf->geo.chunk_mask;
 
 	if (!conf->r10buf_pool)
 		if (init_resync(conf))
@@ -2680,7 +2686,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 			if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 				bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
 						&sync_blocks, 1);
-			else for (i=0; i<conf->raid_disks; i++) {
+			else for (i = 0; i < conf->geo.raid_disks; i++) {
 				sector_t sect =
 					raid10_find_virt(conf, mddev->curr_resync, i);
 				bitmap_end_sync(mddev->bitmap, sect,
@@ -2694,7 +2700,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 				/* Completed a full sync so the replacements
 				 * are now fully recovered.
 				 */
-				for (i = 0; i < conf->raid_disks; i++)
+				for (i = 0; i < conf->geo.raid_disks; i++)
 					if (conf->mirrors[i].replacement)
 						conf->mirrors[i].replacement
 							->recovery_offset
@@ -2707,7 +2713,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		*skipped = 1;
 		return sectors_skipped;
 	}
-	if (chunks_skipped >= conf->raid_disks) {
+	if (chunks_skipped >= conf->geo.raid_disks) {
 		/* if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
 		 */
@@ -2721,9 +2727,9 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 	/* make sure whole request will fit in a chunk - if chunks
 	 * are meaningful
 	 */
-	if (conf->near_copies < conf->raid_disks &&
-	    max_sector > (sector_nr | conf->chunk_mask))
-		max_sector = (sector_nr | conf->chunk_mask) + 1;
+	if (conf->geo.near_copies < conf->geo.raid_disks &&
+	    max_sector > (sector_nr | chunk_mask))
+		max_sector = (sector_nr | chunk_mask) + 1;
 	/*
 	 * If there is non-resync activity waiting for us then
 	 * put in a delay to throttle resync.
@@ -2752,7 +2758,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		int j;
 		r10_bio = NULL;
 
-		for (i=0 ; i<conf->raid_disks; i++) {
+		for (i = 0 ; i < conf->geo.raid_disks; i++) {
 			int still_degraded;
 			struct r10bio *rb2;
 			sector_t sect;
@@ -2806,7 +2812,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 			/* Need to check if the array will still be
 			 * degraded
 			 */
-			for (j=0; j<conf->raid_disks; j++)
+			for (j = 0; j < conf->geo.raid_disks; j++)
 				if (conf->mirrors[j].rdev == NULL ||
 				    test_bit(Faulty, &conf->mirrors[j].rdev->flags)) {
 					still_degraded = 1;
@@ -2984,9 +2990,9 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		r10_bio->sector = sector_nr;
 		set_bit(R10BIO_IsSync, &r10_bio->state);
 		raid10_find_phys(conf, r10_bio);
-		r10_bio->sectors = (sector_nr | conf->chunk_mask) - sector_nr +1;
+		r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;
 
-		for (i=0; i<conf->copies; i++) {
+		for (i = 0; i < conf->copies; i++) {
 			int d = r10_bio->devs[i].devnum;
 			sector_t first_bad, sector;
 			int bad_sectors;
@@ -3152,16 +3158,16 @@ raid10_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 	struct r10conf *conf = mddev->private;
 
 	if (!raid_disks)
-		raid_disks = conf->raid_disks;
+		raid_disks = conf->geo.raid_disks;
 	if (!sectors)
 		sectors = conf->dev_sectors;
 
-	size = sectors >> conf->chunk_shift;
-	sector_div(size, conf->far_copies);
+	size = sectors >> conf->geo.chunk_shift;
+	sector_div(size, conf->geo.far_copies);
 	size = size * raid_disks;
-	sector_div(size, conf->near_copies);
+	sector_div(size, conf->geo.near_copies);
 
-	return size << conf->chunk_shift;
+	return size << conf->geo.chunk_shift;
 }
 
 
@@ -3206,22 +3212,22 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 		goto out;
 
 
-	conf->raid_disks = mddev->raid_disks;
-	conf->near_copies = nc;
-	conf->far_copies = fc;
+	conf->geo.raid_disks = mddev->raid_disks;
+	conf->geo.near_copies = nc;
+	conf->geo.far_copies = fc;
 	conf->copies = nc*fc;
-	conf->far_offset = fo;
-	conf->chunk_mask = mddev->new_chunk_sectors - 1;
-	conf->chunk_shift = ffz(~mddev->new_chunk_sectors);
+	conf->geo.far_offset = fo;
+	conf->geo.chunk_mask = mddev->new_chunk_sectors - 1;
+	conf->geo.chunk_shift = ffz(~mddev->new_chunk_sectors);
 
 	conf->r10bio_pool = mempool_create(NR_RAID10_BIOS, r10bio_pool_alloc,
 					   r10bio_pool_free, conf);
 	if (!conf->r10bio_pool)
 		goto out;
 
-	size = mddev->dev_sectors >> conf->chunk_shift;
+	size = mddev->dev_sectors >> conf->geo.chunk_shift;
 	sector_div(size, fc);
-	size = size * conf->raid_disks;
+	size = size * conf->geo.raid_disks;
 	sector_div(size, nc);
 	/* 'size' is now the number of chunks in the array */
 	/* calculate "used chunks per device" in 'stride' */
@@ -3230,16 +3236,16 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	/* We need to round up when dividing by raid_disks to
 	 * get the stride size.
 	 */
-	stride += conf->raid_disks - 1;
-	sector_div(stride, conf->raid_disks);
+	stride += conf->geo.raid_disks - 1;
+	sector_div(stride, conf->geo.raid_disks);
 
-	conf->dev_sectors = stride << conf->chunk_shift;
+	conf->dev_sectors = stride << conf->geo.chunk_shift;
 
 	if (fo)
 		stride = 1;
 	else
 		sector_div(stride, fc);
-	conf->stride = stride << conf->chunk_shift;
+	conf->geo.stride = stride << conf->geo.chunk_shift;
 
 
 	spin_lock_init(&conf->device_lock);
@@ -3297,16 +3303,16 @@ static int run(struct mddev *mddev)
 
 	chunk_size = mddev->chunk_sectors << 9;
 	blk_queue_io_min(mddev->queue, chunk_size);
-	if (conf->raid_disks % conf->near_copies)
-		blk_queue_io_opt(mddev->queue, chunk_size * conf->raid_disks);
+	if (conf->geo.raid_disks % conf->geo.near_copies)
+		blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
 	else
 		blk_queue_io_opt(mddev->queue, chunk_size *
-				 (conf->raid_disks / conf->near_copies));
+				 (conf->geo.raid_disks / conf->geo.near_copies));
 
 	rdev_for_each(rdev, mddev) {
 
 		disk_idx = rdev->raid_disk;
-		if (disk_idx >= conf->raid_disks
+		if (disk_idx >= conf->geo.raid_disks
 		    || disk_idx < 0)
 			continue;
 		disk = conf->mirrors + disk_idx;
@@ -3334,7 +3340,7 @@ static int run(struct mddev *mddev)
 	}
 
 	mddev->degraded = 0;
-	for (i = 0; i < conf->raid_disks; i++) {
+	for (i = 0; i < conf->geo.raid_disks; i++) {
 
 		disk = conf->mirrors + i;
 
@@ -3361,8 +3367,8 @@ static int run(struct mddev *mddev)
 		       mdname(mddev));
 	printk(KERN_INFO
 		"md/raid10:%s: active with %d out of %d devices\n",
-		mdname(mddev), conf->raid_disks - mddev->degraded,
-		conf->raid_disks);
+		mdname(mddev), conf->geo.raid_disks - mddev->degraded,
+		conf->geo.raid_disks);
 	/*
 	 * Ok, everything is just fine now
 	 */
@@ -3379,9 +3385,9 @@ static int run(struct mddev *mddev)
 	 * maybe...
 	 */
 	{
-		int stripe = conf->raid_disks *
+		int stripe = conf->geo.raid_disks *
 			((mddev->chunk_sectors << 9) / PAGE_SIZE);
-		stripe /= conf->near_copies;
+		stripe /= conf->geo.near_copies;
 		if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
 			mddev->queue->backing_dev_info.ra_pages = 2* stripe;
 	}
@@ -3453,7 +3459,7 @@ static int raid10_resize(struct mddev *mddev, sector_t sectors)
 	struct r10conf *conf = mddev->private;
 	sector_t oldsize, size;
 
-	if (conf->far_copies > 1 && !conf->far_offset)
+	if (conf->geo.far_copies > 1 && !conf->geo.far_offset)
 		return -EINVAL;
 
 	oldsize = raid10_size(mddev, 0, 0);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 7c615613..4c4942a 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -14,33 +14,34 @@ struct mirror_info {
 struct r10conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;
-	int			raid_disks;
 	spinlock_t		device_lock;
 
 	/* geometry */
-	int			near_copies;  /* number of copies laid out
+	struct geom {
+		int		raid_disks;
+		int		near_copies;  /* number of copies laid out
 					       * raid0 style */
-	int 			far_copies;   /* number of copies laid out
+		int		far_copies;   /* number of copies laid out
 					       * at large strides across drives
 					       */
-	int			far_offset;   /* far_copies are offset by 1
+		int		far_offset;   /* far_copies are offset by 1
 					       * stripe instead of many
 					       */
-	int			copies;	      /* near_copies * far_copies.
-					       * must be <= raid_disks
-					       */
-	sector_t		stride;	      /* distance between far copies.
+		sector_t	stride;	      /* distance between far copies.
 					       * This is size / far_copies unless
 					       * far_offset, in which case it is
 					       * 1 stripe.
 					       */
+		int		chunk_shift; /* shift from chunks to sectors */
+		sector_t	chunk_mask;
+	} geo;
+	int			copies;	      /* near_copies * far_copies.
+					       * must be <= raid_disks
+					       */
 
 	sector_t		dev_sectors;  /* temp copy of
 					       * mddev->dev_sectors */
 
-	int			chunk_shift; /* shift from chunks to sectors */
-	sector_t		chunk_mask;
-
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
 	struct bio_list		pending_bio_list;



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 06/10] md: teach sync_page_io about new_data_offset.
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (2 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 05/10] md/raid10: collect some geometry fields into a dedicated structure NeilBrown
@ 2012-04-03  5:53 ` NeilBrown
  2012-04-03  5:53 ` [md PATCH 03/10] md/raid5: Use correct data_offset for all IO NeilBrown
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-04-03  5:53 UTC (permalink / raw)
  To: linux-raid

Some code in raid1 and raid10 uses sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.

So add that understanding.
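
The condition is easier to read in isolation.  A simplified sketch
(the helper name is made up; the logic is meant to match the hunk
below):

    /* Sectors the reshape has already passed live at new_data_offset;
     * everything else still lives at data_offset.  For a forwards
     * reshape (reshape_backwards == 0) the already-reshaped region is
     * below reshape_position; for a backwards reshape it is at or
     * above it.
     */
    static sector_t io_offset(struct md_rdev *rdev, sector_t sector)
    {
            struct mddev *mddev = rdev->mddev;

            if (mddev->reshape_position != MaxSector &&
                (mddev->reshape_backwards ==
                 (sector >= mddev->reshape_position)))
                    return rdev->new_data_offset;
            return rdev->data_offset;
    }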

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0db613d..8da63d1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -886,6 +886,10 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		rdev->meta_bdev : rdev->bdev;
 	if (metadata_op)
 		bio->bi_sector = sector + rdev->sb_start;
+	else if (rdev->mddev->reshape_position != MaxSector &&
+		 (rdev->mddev->reshape_backwards ==
+		  (sector >= rdev->mddev->reshape_position)))
+		bio->bi_sector = sector + rdev->new_data_offset;
 	else
 		bio->bi_sector = sector + rdev->data_offset;
 	bio_add_page(bio, page, size, 0);



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 07/10] md: use resync_max_sectors for reshape as well as resync.
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (8 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 08/10] md/raid10: Introduce 'prev' geometry to support reshape NeilBrown
@ 2012-04-03  5:53 ` NeilBrown
  2014-03-01  5:08 ` [md PATCH 00/10] md patches for 3.5: RAID10 reshape Phillip Susi
  10 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-04-03  5:53 UTC (permalink / raw)
  To: linux-raid

Some resync-type operations need to act on the address space of the
device, others on the address space of the array.

Only RAID10 is affected: it sets resync_max_sectors to the array
size (the default is the device size), and that value is currently
used for resync only.  However a reshape of a RAID10 array must also
be done against the array size, not the device size, so change the
code to use resync_max_sectors for both the resync and the reshape
cases.
This does not affect RAID5 or RAID1, just RAID10.
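
As a toy illustration (all sizes made up) of why the two address
spaces differ for RAID10:

    #include <stdio.h>

    int main(void)
    {
            /* near-2 RAID10 over 4 devices of ~1TB each */
            unsigned long long dev_sectors = 1953525168ULL;
            int raid_disks = 4, copies = 2;
            unsigned long long array_sectors =
                    dev_sectors * raid_disks / copies;

            /* recovery of one device walks the device address space */
            printf("recovery max_sectors:       %llu\n", dev_sectors);
            /* resync, and now reshape, walk the array address space */
            printf("resync/reshape max_sectors: %llu\n", array_sectors);
            return 0;
    }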

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8da63d1..a672f0d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4426,7 +4426,8 @@ sync_completed_show(struct mddev *mddev, char *page)
 	if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return sprintf(page, "none\n");
 
-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
+	    test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sectors = mddev->resync_max_sectors;
 	else
 		max_sectors = mddev->dev_sectors;
@@ -6804,7 +6805,8 @@ static void status_resync(struct seq_file *seq, struct mddev * mddev)
 
 	resync = mddev->curr_resync - atomic_read(&mddev->recovery_active);
 
-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
+	    test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sectors = mddev->resync_max_sectors;
 	else
 		max_sectors = mddev->dev_sectors;
@@ -7367,7 +7369,7 @@ void md_do_sync(struct mddev *mddev)
 			j = mddev->recovery_cp;
 
 	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
-		max_sectors = mddev->dev_sectors;
+		max_sectors = mddev->resync_max_sectors;
 	else {
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->dev_sectors;



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 08/10] md/raid10: Introduce 'prev' geometry to support reshape.
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (7 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 10/10] md/raid10: add reshape support NeilBrown
@ 2012-04-03  5:53 ` NeilBrown
  2012-04-03  5:53 ` [md PATCH 07/10] md: use resync_max_sectors for reshape as well as resync NeilBrown
  2014-03-01  5:08 ` [md PATCH 00/10] md patches for 3.5: RAID10 reshape Phillip Susi
  10 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-04-03  5:53 UTC (permalink / raw)
  To: linux-raid

When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it.  At other times, use both as
appropriate.

For now, both are identical (and reshape_position is never set).

When we use the 'prev' geometry, we must use the old data_offset.
When we use the current geometry (and a reshape is happening) we must
use the new_data_offset.
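
A simplified sketch of that selection (the helper name is made up;
the real raid10_find_phys() also records the choice in the
R10BIO_Previous flag):

    static struct geom *pick_geo(struct r10conf *conf, sector_t sector)
    {
            /* Addresses the reshape has not reached yet are still laid
             * out by the previous geometry (and use the old data_offset).
             */
            if (conf->reshape_progress != MaxSector &&
                ((sector >= conf->reshape_progress) !=
                 conf->mddev->reshape_backwards))
                    return &conf->prev;
            return &conf->geo;
    }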

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid10.c |  106 ++++++++++++++++++++++++++++++++++++++++-----------
 drivers/md/raid10.h |    8 +++-
 2 files changed, 91 insertions(+), 23 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 39f5239..0a6746f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -504,15 +504,13 @@ static void raid10_end_write_request(struct bio *bio, int error)
  * sector offset to a virtual address
  */
 
-static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
+static void __raid10_find_phys(struct geom *geo, struct r10bio *r10bio)
 {
 	int n,f;
 	sector_t sector;
 	sector_t chunk;
 	sector_t stripe;
 	int dev;
-	struct geom *geo = &conf->geo;
-
 	int slot = 0;
 
 	/* now calculate first sector/dev */
@@ -550,12 +548,29 @@ static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
 			sector += (geo->chunk_mask + 1);
 		}
 	}
-	BUG_ON(slot != conf->copies);
+}
+
+static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
+{
+	struct geom *geo = &conf->geo;
+
+	if (conf->reshape_progress != MaxSector &&
+	    ((r10bio->sector >= conf->reshape_progress) !=
+	     conf->mddev->reshape_backwards)) {
+		set_bit(R10BIO_Previous, &r10bio->state);
+		geo = &conf->prev;
+	} else
+		clear_bit(R10BIO_Previous, &r10bio->state);
+
+	__raid10_find_phys(geo, r10bio);
 }
 
 static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
 {
 	sector_t offset, chunk, vchunk;
+	/* Never use conf->prev as this is only called during resync
+	 * or recovery, so reshape isn't happening
+	 */
 	struct geom *geo = &conf->geo;
 
 	offset = sector & geo->chunk_mask;
@@ -603,6 +618,11 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 	unsigned int bio_sectors = bvm->bi_size >> 9;
 	struct geom *geo = &conf->geo;
 
+	if (conf->reshape_progress != MaxSector &&
+	    ((sector >= conf->reshape_progress) !=
+	     conf->mddev->reshape_backwards))
+		geo = &conf->prev;
+
 	if (geo->near_copies < geo->raid_disks) {
 		max = (chunk_sectors - ((sector & (chunk_sectors - 1))
 					+ bio_sectors)) << 9;
@@ -617,6 +637,12 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 	if (mddev->merge_check_needed) {
 		struct r10bio r10_bio;
 		int s;
+		if (conf->reshape_progress != MaxSector) {
+			/* Cannot give any guidance during reshape */
+			if (max <= biovec->bv_len && bio_sectors == 0)
+				return biovec->bv_len;
+			return 0;
+		}
 		r10_bio.sector = sector;
 		raid10_find_phys(conf, &r10_bio);
 		rcu_read_lock();
@@ -816,7 +842,10 @@ static int raid10_congested(void *data, int bits)
 	if (mddev_congested(mddev, bits))
 		return 1;
 	rcu_read_lock();
-	for (i = 0; i < conf->geo.raid_disks && ret == 0; i++) {
+	for (i = 0;
+	     (i < conf->geo.raid_disks || i < conf->prev.raid_disks)
+		     && ret == 0;
+	     i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
@@ -977,13 +1006,23 @@ static void unfreeze_array(struct r10conf *conf)
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+static sector_t choose_data_offset(struct r10bio *r10_bio,
+				   struct md_rdev *rdev)
+{
+	if (!test_bit(MD_RECOVERY_RESHAPE, &rdev->mddev->recovery) ||
+	    test_bit(R10BIO_Previous, &r10_bio->state))
+		return rdev->data_offset;
+	else
+		return rdev->new_data_offset;
+}
+
 static void make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct r10conf *conf = mddev->private;
 	struct r10bio *r10_bio;
 	struct bio *read_bio;
 	int i;
-	sector_t chunk_mask = conf->geo.chunk_mask;
+	sector_t chunk_mask = (conf->geo.chunk_mask & conf->prev.chunk_mask);
 	int chunk_sects = chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
@@ -1004,7 +1043,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	 */
 	if (unlikely((bio->bi_sector & chunk_mask) + (bio->bi_size >> 9)
 		     > chunk_sects
-		     && conf->geo.near_copies < conf->geo.raid_disks)) {
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies < conf->prev.raid_disks))) {
 		struct bio_pair *bp;
 		/* Sanity check -- queue functions should prevent this happening */
 		if (bio->bi_vcnt != 1 ||
@@ -1098,7 +1138,7 @@ read_again:
 		r10_bio->devs[slot].rdev = rdev;
 
 		read_bio->bi_sector = r10_bio->devs[slot].addr +
-			rdev->data_offset;
+			choose_data_offset(r10_bio, rdev);
 		read_bio->bi_bdev = rdev->bdev;
 		read_bio->bi_end_io = raid10_end_read_request;
 		read_bio->bi_rw = READ | do_sync;
@@ -1302,7 +1342,8 @@ retry_write:
 		r10_bio->devs[i].bio = mbio;
 
 		mbio->bi_sector	= (r10_bio->devs[i].addr+
-				   conf->mirrors[d].rdev->data_offset);
+				   choose_data_offset(r10_bio,
+						      conf->mirrors[d].rdev));
 		mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
 		mbio->bi_end_io	= raid10_end_write_request;
 		mbio->bi_rw = WRITE | do_sync | do_fua;
@@ -1326,8 +1367,10 @@ retry_write:
 		 * so it cannot disappear, so the replacement cannot
 		 * become NULL here
 		 */
-		mbio->bi_sector	= (r10_bio->devs[i].addr+
-				   conf->mirrors[d].replacement->data_offset);
+		mbio->bi_sector	= (r10_bio->devs[i].addr +
+				   choose_data_offset(
+					   r10_bio,
+					   conf->mirrors[d].replacement));
 		mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
 		mbio->bi_end_io	= raid10_end_write_request;
 		mbio->bi_rw = WRITE | do_sync | do_fua;
@@ -1397,7 +1440,7 @@ static void status(struct seq_file *seq, struct mddev *mddev)
  * Don't consider the device numbered 'ignore'
  * as we might be about to remove it.
  */
-static int enough(struct r10conf *conf, int ignore)
+static int _enough(struct r10conf *conf, struct geom *geo, int ignore)
 {
 	int first = 0;
 
@@ -1408,7 +1451,7 @@ static int enough(struct r10conf *conf, int ignore)
 			if (conf->mirrors[first].rdev &&
 			    first != ignore)
 				cnt++;
-			first = (first+1) % conf->geo.raid_disks;
+			first = (first+1) % geo->raid_disks;
 		}
 		if (cnt == 0)
 			return 0;
@@ -1416,6 +1459,12 @@ static int enough(struct r10conf *conf, int ignore)
 	return 1;
 }
 
+static int enough(struct r10conf *conf, int ignore)
+{
+	return _enough(conf, &conf->geo, ignore) &&
+		_enough(conf, &conf->prev, ignore);
+}
+
 static void error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	char b[BDEVNAME_SIZE];
@@ -2223,7 +2272,9 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 				       " (%d sectors at %llu on %s)\n",
 				       mdname(mddev), s,
 				       (unsigned long long)(
-					       sect + rdev->data_offset),
+					       sect +
+					       choose_data_offset(r10_bio,
+								  rdev)),
 				       bdevname(rdev->bdev, b));
 				printk(KERN_NOTICE "md/raid10:%s: %s: failing "
 				       "drive\n",
@@ -2261,7 +2312,8 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 				       " (%d sectors at %llu on %s)\n",
 				       mdname(mddev), s,
 				       (unsigned long long)(
-					       sect + rdev->data_offset),
+					       sect +
+					       choose_data_offset(r10_bio, rdev)),
 				       bdevname(rdev->bdev, b));
 				printk(KERN_NOTICE "md/raid10:%s: %s: failing "
 				       "drive\n",
@@ -2274,7 +2326,8 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 				       " (%d sectors at %llu on %s)\n",
 				       mdname(mddev), s,
 				       (unsigned long long)(
-					       sect + rdev->data_offset),
+					       sect +
+					       choose_data_offset(r10_bio, rdev)),
 				       bdevname(rdev->bdev, b));
 				atomic_add(s, &rdev->corrected_errors);
 			}
@@ -2348,7 +2401,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
 		wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		md_trim_bio(wbio, sector - bio->bi_sector, sectors);
 		wbio->bi_sector = (r10_bio->devs[i].addr+
-				   rdev->data_offset+
+				   choose_data_offset(r10_bio, rdev) +
 				   (sector - r10_bio->sector));
 		wbio->bi_bdev = rdev->bdev;
 		if (submit_bio_wait(WRITE, wbio) == 0)
@@ -2425,7 +2478,7 @@ read_more:
 	r10_bio->devs[slot].bio = bio;
 	r10_bio->devs[slot].rdev = rdev;
 	bio->bi_sector = r10_bio->devs[slot].addr
-		+ rdev->data_offset;
+		+ choose_data_offset(r10_bio, rdev);
 	bio->bi_bdev = rdev->bdev;
 	bio->bi_rw = READ | do_sync;
 	bio->bi_private = r10_bio;
@@ -3246,7 +3299,8 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	else
 		sector_div(stride, fc);
 	conf->geo.stride = stride << conf->geo.chunk_shift;
-
+	conf->prev = conf->geo;
+	conf->reshape_progress = MaxSector;
 
 	spin_lock_init(&conf->device_lock);
 	INIT_LIST_HEAD(&conf->retry_list);
@@ -3312,8 +3366,10 @@ static int run(struct mddev *mddev)
 	rdev_for_each(rdev, mddev) {
 
 		disk_idx = rdev->raid_disk;
-		if (disk_idx >= conf->geo.raid_disks
-		    || disk_idx < 0)
+		if (disk_idx < 0)
+			continue;
+		if (disk_idx >= conf->geo.raid_disks &&
+		    disk_idx >= conf->prev.raid_disks)
 			continue;
 		disk = conf->mirrors + disk_idx;
 
@@ -3340,7 +3396,10 @@ static int run(struct mddev *mddev)
 	}
 
 	mddev->degraded = 0;
-	for (i = 0; i < conf->geo.raid_disks; i++) {
+	for (i = 0;
+	     i < conf->geo.raid_disks
+		     || i < conf->prev.raid_disks;
+	     i++) {
 
 		disk = conf->mirrors + i;
 
@@ -3459,6 +3518,9 @@ static int raid10_resize(struct mddev *mddev, sector_t sectors)
 	struct r10conf *conf = mddev->private;
 	sector_t oldsize, size;
 
+	if (mddev->reshape_position != MaxSector)
+		return -EBUSY;
+
 	if (conf->geo.far_copies > 1 && !conf->geo.far_offset)
 		return -EINVAL;
 
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 4c4942a..37509d7 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -34,13 +34,14 @@ struct r10conf {
 					       */
 		int		chunk_shift; /* shift from chunks to sectors */
 		sector_t	chunk_mask;
-	} geo;
+	} prev, geo;
 	int			copies;	      /* near_copies * far_copies.
 					       * must be <= raid_disks
 					       */
 
 	sector_t		dev_sectors;  /* temp copy of
 					       * mddev->dev_sectors */
+	sector_t		reshape_progress;
 
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
@@ -147,5 +148,10 @@ enum r10bio_state {
  */
 	R10BIO_MadeGood,
 	R10BIO_WriteError,
+/* During a reshape we might be performing IO on the
+ * 'previous' part of the array, in which case this
+ * flag is set
+ */
+	R10BIO_Previous,
 };
 #endif



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 09/10] md/raid10: split out interpretation of layout to separate function.
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (5 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 02/10] md: add possibility to change data-offset for devices NeilBrown
@ 2012-04-03  5:53 ` NeilBrown
  2012-04-03  5:53 ` [md PATCH 10/10] md/raid10: add reshape support NeilBrown
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-04-03  5:53 UTC (permalink / raw)
  To: linux-raid

We will soon be interpreting the layout (and chunk size etc.) from
multiple places to support reshape.  So split it out into a separate
function.
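
For reference, the layout word that setup_geo() decodes packs
near_copies in bits 0-7, far_copies in bits 8-15 and the 'offset'
flag in bit 16.  A stand-alone sketch of the decoding:

    #include <stdio.h>

    int main(void)
    {
            int layout = 0x102;     /* the common 'n2' layout */
            int nc = layout & 255;
            int fc = (layout >> 8) & 255;
            int fo = layout & (1 << 16);

            /* prints: near=2 far=1 offset=0 copies=2 */
            printf("near=%d far=%d offset=%d copies=%d\n",
                   nc, fc, fo ? 1 : 0, nc * fc);
            return 0;
    }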

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid10.c |   74 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0a6746f..1f84b8e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3223,28 +3223,65 @@ raid10_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 	return size << conf->geo.chunk_shift;
 }
 
+enum geo_type {geo_new, geo_old, geo_start};
+static int setup_geo(struct geom *geo, struct mddev *mddev, enum geo_type new)
+{
+	int nc, fc, fo;
+	int layout, chunk, disks;
+	switch (new) {
+	case geo_old:
+		layout = mddev->layout;
+		chunk = mddev->chunk_sectors;
+		disks = mddev->raid_disks - mddev->delta_disks;
+		break;
+	case geo_new:
+		layout = mddev->new_layout;
+		chunk = mddev->new_chunk_sectors;
+		disks = mddev->raid_disks;
+		break;
+	default: /* avoid 'may be unused' warnings */
+	case geo_start: /* new when starting reshape - raid_disks not
+			 * updated yet. */
+		layout = mddev->new_layout;
+		chunk = mddev->new_chunk_sectors;
+		disks = mddev->raid_disks + mddev->delta_disks;
+		break;
+	}
+	if (layout >> 17)
+		return -1;
+	if (chunk < (PAGE_SIZE >> 9) ||
+	    !is_power_of_2(chunk))
+		return -2;
+	nc = layout & 255;
+	fc = (layout >> 8) & 255;
+	fo = layout & (1<<16);
+	geo->raid_disks = disks;
+	geo->near_copies = nc;
+	geo->far_copies = fc;
+	geo->far_offset = fo;
+	geo->chunk_mask = chunk - 1;
+	geo->chunk_shift = ffz(~chunk);
+	return nc*fc;
+}
 
 static struct r10conf *setup_conf(struct mddev *mddev)
 {
 	struct r10conf *conf = NULL;
-	int nc, fc, fo;
 	sector_t stride, size;
 	int err = -EINVAL;
+	struct geom geo;
+	int copies;
+
+	copies = setup_geo(&geo, mddev, geo_new);
 
-	if (mddev->new_chunk_sectors < (PAGE_SIZE >> 9) ||
-	    !is_power_of_2(mddev->new_chunk_sectors)) {
+	if (copies == -2) {
 		printk(KERN_ERR "md/raid10:%s: chunk size must be "
 		       "at least PAGE_SIZE(%ld) and be a power of 2.\n",
 		       mdname(mddev), PAGE_SIZE);
 		goto out;
 	}
 
-	nc = mddev->new_layout & 255;
-	fc = (mddev->new_layout >> 8) & 255;
-	fo = mddev->new_layout & (1<<16);
-
-	if ((nc*fc) <2 || (nc*fc) > mddev->raid_disks ||
-	    (mddev->new_layout >> 17)) {
+	if (copies < 2 || copies > mddev->raid_disks) {
 		printk(KERN_ERR "md/raid10:%s: unsupported raid10 layout: 0x%8x\n",
 		       mdname(mddev), mddev->new_layout);
 		goto out;
@@ -3264,24 +3301,17 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	if (!conf->tmppage)
 		goto out;
 
-
-	conf->geo.raid_disks = mddev->raid_disks;
-	conf->geo.near_copies = nc;
-	conf->geo.far_copies = fc;
-	conf->copies = nc*fc;
-	conf->geo.far_offset = fo;
-	conf->geo.chunk_mask = mddev->new_chunk_sectors - 1;
-	conf->geo.chunk_shift = ffz(~mddev->new_chunk_sectors);
-
+	conf->geo = geo;
+	conf->copies = copies;
 	conf->r10bio_pool = mempool_create(NR_RAID10_BIOS, r10bio_pool_alloc,
 					   r10bio_pool_free, conf);
 	if (!conf->r10bio_pool)
 		goto out;
 
 	size = mddev->dev_sectors >> conf->geo.chunk_shift;
-	sector_div(size, fc);
+	sector_div(size, geo.far_copies);
 	size = size * conf->geo.raid_disks;
-	sector_div(size, nc);
+	sector_div(size, geo.near_copies);
 	/* 'size' is now the number of chunks in the array */
 	/* calculate "used chunks per device" in 'stride' */
 	stride = size * conf->copies;
@@ -3294,10 +3324,10 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 
 	conf->dev_sectors = stride << conf->geo.chunk_shift;
 
-	if (fo)
+	if (geo.far_offset)
 		stride = 1;
 	else
-		sector_div(stride, fc);
+		sector_div(stride, geo.far_copies);
 	conf->geo.stride = stride << conf->geo.chunk_shift;
 	conf->prev = conf->geo;
 	conf->reshape_progress = MaxSector;



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 10/10] md/raid10: add reshape support
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (6 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 09/10] md/raid10: split out interpretation of layout to separate function NeilBrown
@ 2012-04-03  5:53 ` NeilBrown
  2012-04-03  5:53 ` [md PATCH 08/10] md/raid10: Introduce 'prev' geometry to support reshape NeilBrown
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-04-03  5:53 UTC (permalink / raw)
  To: linux-raid

A 'near' or 'offset' layout RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.

Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'.  Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.

This means that RAID10 reshape is not supportable on v0.90 metadata.

The difference between the old data_offset and the new data_offset
must be at least the larger of the chunksize multiplied by offset
copies of each of the old and new layouts (for 'near' mode,
offset_copies == 1).
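
For example (illustrative numbers only): reshaping a 512K-chunk
'near' array (offset_copies == 1) to a 512K-chunk 'offset' layout
with far_copies == 2 needs an offset difference of at least
max(512K * 1, 512K * 2) = 1M.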

A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer-grade devices at least).

Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.
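
(For a forwards reshape that means any write landing between
reshape_safe, the position recorded in the on-disk metadata, and
reshape_progress: make_request() below writes the superblock out and
advances reshape_safe before letting such a write proceed.)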

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid10.c |  888 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid10.h |    6 
 2 files changed, 874 insertions(+), 20 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1f84b8e..409dcc4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -24,6 +24,7 @@
 #include <linux/module.h>
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
 #include "md.h"
 #include "raid10.h"
 #include "raid0.h"
@@ -68,6 +69,11 @@ static int max_queued_requests = 1024;
 static void allow_barrier(struct r10conf *conf);
 static void lower_barrier(struct r10conf *conf);
 static int enough(struct r10conf *conf, int ignore);
+static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
+				int *skipped);
+static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
+static void end_reshape_write(struct bio *bio, int error);
+static void end_reshape(struct r10conf *conf);
 
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -112,7 +118,8 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	if (!r10_bio)
 		return NULL;
 
-	if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery))
+	if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery) ||
+	    test_bit(MD_RECOVERY_RESHAPE, &conf->mddev->recovery))
 		nalloc = conf->copies; /* resync */
 	else
 		nalloc = 2; /* recovery */
@@ -140,9 +147,10 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 		struct bio *rbio = r10_bio->devs[j].repl_bio;
 		bio = r10_bio->devs[j].bio;
 		for (i = 0; i < RESYNC_PAGES; i++) {
-			if (j == 1 && !test_bit(MD_RECOVERY_SYNC,
-						&conf->mddev->recovery)) {
-				/* we can share bv_page's during recovery */
+			if (j > 0 && !test_bit(MD_RECOVERY_SYNC,
+					       &conf->mddev->recovery)) {
+				/* we can share bv_page's during recovery
+				 * and reshape */
 				struct bio *rbio = r10_bio->devs[0].bio;
 				page = rbio->bi_io_vec[i].bv_page;
 				get_page(page);
@@ -309,6 +317,10 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 			break;
 		}
 	}
+	/* Handle reshape reads */
+	if (slot == conf->copies &&
+	    bio == r10_bio->master_bio)
+		slot = r10_bio->read_slot;
 
 	BUG_ON(slot == conf->copies);
 	update_head_pos(slot, r10_bio);
@@ -614,10 +626,11 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 	struct r10conf *conf = mddev->private;
 	sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
 	int max;
-	unsigned int chunk_sectors = mddev->chunk_sectors;
+	unsigned int chunk_sectors;
 	unsigned int bio_sectors = bvm->bi_size >> 9;
 	struct geom *geo = &conf->geo;
 
+	chunk_sectors = (conf->geo.chunk_mask & conf->prev.chunk_mask) + 1;
 	if (conf->reshape_progress != MaxSector &&
 	    ((sector >= conf->reshape_progress) !=
 	     conf->mddev->reshape_backwards))
@@ -1032,6 +1045,7 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	int plugged;
 	int sectors_handled;
 	int max_sectors;
+	int sectors;
 
 	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
 		md_flush_request(mddev, bio);
@@ -1096,10 +1110,41 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	 */
 	wait_barrier(conf);
 
+	sectors = bio->bi_size >> 9;
+	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+	    bio->bi_sector < conf->reshape_progress &&
+	    bio->bi_sector + sectors > conf->reshape_progress) {
+		/* IO spans the reshape position.  Need to wait for
+		 * reshape to pass
+		 */
+		allow_barrier(conf);
+		wait_event(conf->wait_barrier,
+			   conf->reshape_progress <= bio->bi_sector ||
+			   conf->reshape_progress >= bio->bi_sector + sectors);
+		wait_barrier(conf);
+	}
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+	    bio_data_dir(bio) == WRITE &&
+	    (mddev->reshape_backwards
+	     ? (bio->bi_sector < conf->reshape_safe &&
+		bio->bi_sector + sectors > conf->reshape_progress)
+	     : (bio->bi_sector + sectors > conf->reshape_safe &&
+		bio->bi_sector < conf->reshape_progress))) {
+		/* Need to update reshape_position in metadata */
+		mddev->reshape_position = conf->reshape_progress;
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+
+		conf->reshape_safe = mddev->reshape_position;
+	}
+
 	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
 	r10_bio->master_bio = bio;
-	r10_bio->sectors = bio->bi_size >> 9;
+	r10_bio->sectors = sectors;
 
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_sector;
@@ -2631,6 +2676,8 @@ static void raid10d(struct mddev *mddev)
 		if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
 		    test_bit(R10BIO_WriteError, &r10_bio->state))
 			handle_write_completed(conf, r10_bio);
+		else if (test_bit(R10BIO_IsReshape, &r10_bio->state))
+			reshape_request_write(mddev, r10_bio);
 		else if (test_bit(R10BIO_IsSync, &r10_bio->state))
 			sync_request_write(mddev, r10_bio);
 		else if (test_bit(R10BIO_IsRecover, &r10_bio->state))
@@ -2723,7 +2770,8 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 
  skipped:
 	max_sector = mddev->dev_sectors;
-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
+	    test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sector = mddev->resync_max_sectors;
 	if (sector_nr >= max_sector) {
 		/* If we aborted, we need to abort the
@@ -2735,6 +2783,11 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		 * we need to convert that to several
 		 * virtual addresses.
 		 */
+		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+			end_reshape(conf);
+			return 0;
+		}
+
 		if (mddev->curr_resync < max_sector) { /* aborted */
 			if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 				bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
@@ -2766,6 +2819,10 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		*skipped = 1;
 		return sectors_skipped;
 	}
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+		return reshape_request(mddev, sector_nr, skipped);
+
 	if (chunks_skipped >= conf->geo.raid_disks) {
 		/* if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
@@ -3211,7 +3268,8 @@ raid10_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 	struct r10conf *conf = mddev->private;
 
 	if (!raid_disks)
-		raid_disks = conf->geo.raid_disks;
+		raid_disks = min(conf->geo.raid_disks,
+				 conf->prev.raid_disks);
 	if (!sectors)
 		sectors = conf->dev_sectors;
 
@@ -3223,6 +3281,34 @@ raid10_size(struct mddev *mddev, sector_t sectors, int raid_disks)
 	return size << conf->geo.chunk_shift;
 }
 
+static int LCM(int a, int b)
+{
+	int prod = a * b;
+
+	while (a != b) {
+		if (a < b)
+			b -= a;
+		else
+			a -= b;
+	}
+	return prod / a;
+}
+
+static unsigned long stripe_blocks(struct geom *geo)
+{
+	/* Given a 'geom', calculate the 'stripe' size being the
+	 * number of stored blocks (sectors) before the stripe pattern
+	 * repeats.  This is not well-defined for 'far' layouts, so we
+	 * assume 'offset' or 'near'.
+	 *
+	 * The answer is the chunk size times the LCM of raid_disks
+	 * and near_copies, times far_copies.
+	 */
+	unsigned long size = geo->far_copies * LCM(geo->raid_disks,
+						   geo->near_copies);
+	return size << geo->chunk_shift;
+}
+
 enum geo_type {geo_new, geo_old, geo_start};
 static int setup_geo(struct geom *geo, struct mddev *mddev, enum geo_type new)
 {
@@ -3292,7 +3378,9 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	if (!conf)
 		goto out;
 
-	conf->mirrors = kzalloc(sizeof(struct mirror_info)*mddev->raid_disks,
+	/* FIXME calc properly */
+	conf->mirrors = kzalloc(sizeof(struct mirror_info)*(mddev->raid_disks +
+							    max(0,mddev->delta_disks)),
 				GFP_KERNEL);
 	if (!conf->mirrors)
 		goto out;
@@ -3329,9 +3417,17 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	else
 		sector_div(stride, geo.far_copies);
 	conf->geo.stride = stride << conf->geo.chunk_shift;
-	conf->prev = conf->geo;
-	conf->reshape_progress = MaxSector;
-
+	if (mddev->reshape_position == MaxSector) {
+		conf->prev = conf->geo;
+		conf->reshape_progress = MaxSector;
+	} else {
+		if (setup_geo(&conf->prev, mddev, geo_old) != conf->copies) {
+			err = -EINVAL;
+			goto out;
+		}
+		conf->reshape_progress = mddev->reshape_position;
+		conf->prev.stride = 1 << conf->prev.chunk_shift;
+	}
 	spin_lock_init(&conf->device_lock);
 	INIT_LIST_HEAD(&conf->retry_list);
 
@@ -3346,8 +3442,9 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	return conf;
 
  out:
-	printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
-	       mdname(mddev));
+	if (err == -ENOMEM)
+		printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
+		       mdname(mddev));
 	if (conf) {
 		if (conf->r10bio_pool)
 			mempool_destroy(conf->r10bio_pool);
@@ -3365,12 +3462,8 @@ static int run(struct mddev *mddev)
 	struct mirror_info *disk;
 	struct md_rdev *rdev;
 	sector_t size;
-
-	/*
-	 * copy the already verified devices into our private RAID10
-	 * bookkeeping area. [whatever we allocate in run(),
-	 * should be freed in stop()]
-	 */
+	sector_t min_offset_diff = 0;
+	int first = 1;
 
 	if (mddev->private == NULL) {
 		conf = setup_conf(mddev);
@@ -3394,6 +3487,7 @@ static int run(struct mddev *mddev)
 				 (conf->geo.raid_disks / conf->geo.near_copies));
 
 	rdev_for_each(rdev, mddev) {
+		long long diff;
 
 		disk_idx = rdev->raid_disk;
 		if (disk_idx < 0)
@@ -3412,12 +3506,20 @@ static int run(struct mddev *mddev)
 				goto out_free_conf;
 			disk->rdev = rdev;
 		}
+		diff = (rdev->new_data_offset - rdev->data_offset);
+		if (!mddev->reshape_backwards)
+			diff = -diff;
+		if (diff < 0)
+			diff = 0;
+		if (first || diff < min_offset_diff)
+			min_offset_diff = diff;
 
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
 
 		disk->head_position = 0;
 	}
+
 	/* need to check that every block has at least one working mirror */
 	if (!enough(conf, -1)) {
 		printk(KERN_ERR "md/raid10:%s: not enough operational mirrors.\n",
@@ -3425,6 +3527,16 @@ static int run(struct mddev *mddev)
 		goto out_free_conf;
 	}
 
+	if (conf->reshape_progress != MaxSector) {
+		/* must ensure that shape change is supported */
+		if (conf->geo.far_copies != 1 &&
+		    conf->geo.far_offset == 0)
+			goto out_free_conf;
+		if (conf->prev.far_copies != 1 &&
+		    conf->geo.far_offset == 0)
+			goto out_free_conf;
+	}
+
 	mddev->degraded = 0;
 	for (i = 0;
 	     i < conf->geo.raid_disks
@@ -3486,6 +3598,40 @@ static int run(struct mddev *mddev)
 	if (md_integrity_register(mddev))
 		goto out_free_conf;
 
+	if (conf->reshape_progress != MaxSector) {
+		unsigned long before_size, before_length;
+		unsigned long after_size, after_length;
+		unsigned long blocks;
+
+		blocks = stripe_blocks(&conf->prev);
+		before_size = blocks / conf->copies;
+		before_length = blocks / conf->prev.raid_disks;
+
+		blocks = stripe_blocks(&conf->geo);
+		after_size = blocks / conf->copies;
+		after_length = blocks / conf->geo.raid_disks;
+
+		if (max(before_length, after_length) > min_offset_diff) {
+			/* This cannot work */
+			printk("md/raid10: offset difference not enough to continue reshape\n");
+			goto out_free_conf;
+		}
+		conf->offset_diff = min_offset_diff;
+
+		conf->geo.stripe_size = after_size;
+		conf->geo.stripe_length = after_length;
+		conf->prev.stripe_size = before_size;
+		conf->prev.stripe_length = before_length;
+
+		conf->reshape_safe = conf->reshape_progress;
+		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+		mddev->sync_thread = md_register_thread(md_do_sync, mddev,
+							"reshape");
+	}
+
 	return 0;
 
 out_free_conf:
@@ -3624,6 +3770,705 @@ static void *raid10_takeover(struct mddev *mddev)
 	return ERR_PTR(-EINVAL);
 }
 
+static int raid10_check_reshape(struct mddev *mddev)
+{
+	/* Called when there is a request to change
+	 * - layout (to ->new_layout)
+	 * - chunk size (to ->new_chunk_sectors)
+	 * - raid_disks (by delta_disks)
+	 * or when trying to restart a reshape that was ongoing.
+	 *
+	 * We need to validate the request and possibly allocate
+	 * space if that might be an issue later.
+	 *
+	 * Currently we reject any reshape of a 'far' mode array,
+	 * allow chunk size to change if new is generally acceptable,
+	 * allow raid_disks to increase, and allow
+	 * a switch between 'near' mode and 'offset' mode.
+	 */
+	struct r10conf *conf = mddev->private;
+	struct geom geo;
+
+	if (conf->geo.far_copies != 1 && !conf->geo.far_offset)
+		return -EINVAL;
+
+	if (setup_geo(&geo, mddev, geo_start) != conf->copies)
+		/* mustn't change number of copies */
+		return -EINVAL;
+	if (geo.far_copies > 1 && !geo.far_offset)
+		/* Cannot switch to 'far' mode */
+		return -EINVAL;
+
+	if (mddev->array_sectors & geo.chunk_mask)
+			/* not factor of array size */
+			return -EINVAL;
+
+	if (mddev->delta_disks < 0)
+		return -EINVAL;
+
+	if (mddev->bitmap)
+		return -EBUSY;
+	if (!enough(conf, -1))
+		return -EINVAL;
+
+	kfree(conf->mirrors_new);
+	conf->mirrors_new = NULL;
+	if (mddev->delta_disks > 0) {
+		/* allocate new 'mirrors' list */
+		conf->mirrors_new = kzalloc(
+			sizeof(struct mirror_info)
+			*(mddev->raid_disks +
+			  mddev->delta_disks),
+			GFP_KERNEL);
+		if (!conf->mirrors_new)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+/*
+ * Need to check if array has failed when deciding whether to:
+ *  - start an array
+ *  - remove non-faulty devices
+ *  - add a spare
+ *  - allow a reshape
+ * This determination is simple when no reshape is happening.
+ * However if there is a reshape, we need to carefully check
+ * both the before and after sections.
+ * This is because some failed devices may only affect one
+ * of the two sections, and some non-in_sync devices may
+ * be insync in the section most affected by failed devices.
+ */
+static int calc_degraded(struct r10conf *conf)
+{
+	int degraded, degraded2;
+	int i;
+
+	rcu_read_lock();
+	degraded = 0;
+	/* 'prev' section first */
+	for (i = 0; i < conf->prev.raid_disks; i++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+		if (!rdev || test_bit(Faulty, &rdev->flags))
+			degraded++;
+		else if (!test_bit(In_sync, &rdev->flags))
+			/* When we can reduce the number of devices in
+			 * an array, this might not contribute to
+			 * 'degraded'.  It does now.
+			 */
+			degraded++;
+	}
+	rcu_read_unlock();
+	if (conf->geo.raid_disks == conf->prev.raid_disks)
+		return degraded;
+	rcu_read_lock();
+	degraded2 = 0;
+	for (i = 0; i < conf->geo.raid_disks; i++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+		if (!rdev || test_bit(Faulty, &rdev->flags))
+			degraded2++;
+		else if (!test_bit(In_sync, &rdev->flags)) {
+			/* If reshape is increasing the number of devices,
+			 * this section has already been recovered, so
+			 * it doesn't contribute to degraded.
+			 * else it does.
+			 */
+			if (conf->geo.raid_disks <= conf->prev.raid_disks)
+				degraded2++;
+		}
+	}
+	rcu_read_unlock();
+	if (degraded2 > degraded)
+		return degraded2;
+	return degraded;
+}
+
+static int raid10_start_reshape(struct mddev *mddev)
+{
+	/* A 'reshape' has been requested. This commits
+	 * the various 'new' fields and sets MD_RECOVER_RESHAPE
+	 * This also checks if there are enough spares and adds them
+	 * to the array.
+	 * We currently require enough spares to make the final
+	 * array non-degraded.  We also require that the difference
+	 * between old and new data_offset - on each device - is
+	 * enough that we never risk over-writing.
+	 */
+
+	unsigned long before_size, before_length;
+	unsigned long after_size, after_length;
+	unsigned long blocks;
+	sector_t min_offset_diff = 0;
+	int first = 1;
+	struct geom new;
+	struct r10conf *conf = mddev->private;
+	struct md_rdev *rdev;
+	int spares = 0;
+
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+
+	blocks = stripe_blocks(&conf->geo);
+	before_size = blocks / conf->copies;
+	before_length = blocks / conf->geo.raid_disks;
+
+	if (setup_geo(&new, mddev, geo_start) != conf->copies)
+		return -EINVAL;
+	blocks = stripe_blocks(&new);
+	after_size = blocks / conf->copies;
+	after_length = blocks / new.raid_disks;
+
+	conf->geo.stripe_size = after_size;
+	conf->geo.stripe_length = after_length;
+	conf->prev.stripe_size = before_size;
+	conf->prev.stripe_length = before_length;
+
+	rdev_for_each(rdev, mddev) {
+		if (!test_bit(In_sync, &rdev->flags)
+		    && !test_bit(Faulty, &rdev->flags))
+			spares++;
+		if (rdev->raid_disk >= 0) {
+			long long diff = (rdev->new_data_offset
+					  - rdev->data_offset);
+			if (!mddev->reshape_backwards)
+				diff = -diff;
+			if (diff < 0)
+				diff = 0;
+			if (first || diff < min_offset_diff)
+				min_offset_diff = diff;
+		}
+	}
+	printk("Required offset diff %llu\n",(unsigned long long)max(before_length, after_length));
+	printk("Have: %llu\n",(unsigned long long)min_offset_diff);
+
+	if (max(before_length, after_length) > min_offset_diff)
+		return -EINVAL;
+
+	if (spares < mddev->delta_disks)
+		return -EINVAL;
+
+	conf->offset_diff = min_offset_diff;
+	spin_lock_irq(&conf->device_lock);
+	if (conf->mirrors_new) {
+		memcpy(conf->mirrors_new, conf->mirrors,
+		       sizeof(struct mirror_info)*conf->prev.raid_disks);
+		smp_mb();
+		kfree(conf->mirrors_old); /* FIXME and elsewhere */
+		conf->mirrors_old = conf->mirrors;
+		conf->mirrors = conf->mirrors_new;
+		conf->mirrors_new = NULL;
+	}
+	setup_geo(&conf->geo, mddev, geo_start);
+	smp_mb();
+	if (mddev->reshape_backwards)
+		conf->reshape_progress = raid10_size(mddev, 0, 0);
+	else
+		conf->reshape_progress = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	if (mddev->delta_disks > 0) {
+		rdev_for_each(rdev, mddev)
+			if (rdev->raid_disk < 0 &&
+			    !test_bit(Faulty, &rdev->flags)) {
+				if (raid10_add_disk(mddev, rdev) == 0) {
+					if (rdev->raid_disk >=
+					    conf->prev.raid_disks)
+						set_bit(In_sync, &rdev->flags);
+					else
+						rdev->recovery_offset = 0;
+
+					if (sysfs_link_rdev(mddev, rdev))
+					/* Failure here is OK */;
+				}
+			} else if (rdev->raid_disk >= conf->prev.raid_disks
+				   && !test_bit(Faulty, &rdev->flags)) {
+				/* This is a spare that was manually added */
+				set_bit(In_sync, &rdev->flags);
+			}
+
+		/* When a reshape changes the number of devices,
+		 * ->degraded is measured against the larger of the
+		 * pre and post numbers.
+		 */
+		spin_lock_irq(&conf->device_lock);
+		mddev->degraded = calc_degraded(conf);
+		spin_unlock_irq(&conf->device_lock);
+	}
+	mddev->raid_disks = conf->geo.raid_disks;
+	mddev->reshape_position = conf->reshape_progress;
+	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+
+	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+
+	mddev->sync_thread = md_register_thread(md_do_sync, mddev,
+						"reshape");
+	if (!mddev->sync_thread) {
+		mddev->recovery = 0;
+		spin_lock_irq(&conf->device_lock);
+		conf->geo = conf->prev;
+		mddev->raid_disks = conf->geo.raid_disks;
+		rdev_for_each(rdev, mddev)
+			rdev->new_data_offset = rdev->data_offset;
+		smp_wmb();
+		conf->reshape_progress = MaxSector;
+		mddev->reshape_position = MaxSector;
+		spin_unlock_irq(&conf->device_lock);
+		return -EAGAIN;
+	}
+	conf->reshape_checkpoint = jiffies;
+	md_wakeup_thread(mddev->sync_thread);
+	md_new_event(mddev);
+	return 0;
+}
+
+static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
+				int *skipped)
+{
+	/* We simply copy at most one chunk (the smaller of the old and
+	 * new chunk sizes) at a time, possibly less if that exceeds
+	 * RESYNC_PAGES, or we hit a bad block or something.
+	 * This might mean we pause for normal IO in the middle of
+	 * a chunk, but that is not a problem as mddev->reshape_position
+	 * can record any location.
+	 *
+	 * If we will want to write to a location that isn't
+	 * yet recorded as 'safe' (i.e. in metadata on disk) then
+	 * we need to flush all reshape requests and update the metadata.
+	 *
+	 * When reshaping forwards (e.g. to more devices), we interpret
+	 * 'safe' as the earliest block which might not have been copied
+	 * down yet.  We divide this by previous stripe size and multiply
+	 * by previous stripe length to get lowest device offset that we
+	 * cannot write to yet.
+	 * We interpret 'sector_nr' as an address that we want to write to.
+	 * We divide by new stripe size and multiply by stripe_length to
+	 * get device offset that we might want to write to.  We round
+	 * up to be sure we cover the whole stripe.
+	 * If this 'next' write position is after the 'safe' position,
+	 * we must update the metadata to increase the 'safe' position.
+	 *
+	 * When reshaping backwards, we round in the opposite direction
+	 * and perform the reverse test:  next write position must not be
+	 * less than current safe position.
+	 *
+	 * In all this the minimum difference in data offsets
+	 * (conf->offset_diff - always positive) allows a bit of slack,
+	 * so next can be after 'safe', but not by more than offset_diff.
+	 *
+	 * We need to prepare all the bios here before we start any IO
+	 * to ensure the size we choose is acceptable to all devices.
+	 * That means one for each copy for write-out and an extra one for
+	 * read-in.
+	 * We store the read-in bio in ->master_bio and the others in
+	 * ->devs[x].bio and ->devs[x].repl_bio.
+	 */
+	struct r10conf *conf = mddev->private;
+	struct r10bio *r10_bio;
+	sector_t next, safe, last;
+	int max_sectors;
+	int nr_sectors;
+	int s;
+	struct md_rdev *rdev;
+	int need_flush = 0;
+	struct bio *blist;
+	struct bio *bio, *read_bio;
+	int sectors_done = 0;
+
+	if (sector_nr == 0) {
+		/* If restarting in the middle, skip the initial sectors */
+		if (mddev->reshape_backwards &&
+		    conf->reshape_progress < raid10_size(mddev, 0, 0)) {
+			sector_nr = (raid10_size(mddev, 0, 0)
+				     - conf->reshape_progress);
+		} else if (!mddev->reshape_backwards &&
+			   conf->reshape_progress > 0)
+			sector_nr = conf->reshape_progress;
+		if (sector_nr) {
+			mddev->curr_resync_completed = sector_nr;
+			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+			*skipped = 1;
+			return sector_nr;
+		}
+	}
+
+	/* We don't use sector_nr to track where we are up to
+	 * as that doesn't work well for ->reshape_backwards.
+	 * So just use ->reshape_progress.
+	 */
+	if (mddev->reshape_backwards) {
+		/* 'next' is the earliest device address that we might
+		 * write to for this chunk in the new layout
+		 */
+		next = conf->reshape_progress - 1;
+		sector_div(next, conf->geo.stripe_size);
+		next *= conf->geo.stripe_length;
+
+		/* 'safe' is the last device address that we might read from
+		 * in the old layout after a restart
+		 */
+		safe = (conf->reshape_safe | conf->prev.chunk_mask) + 1;
+		safe = DIV_ROUND_UP_SECTOR_T(safe, conf->prev.stripe_size);
+		safe *= conf->prev.stripe_length;
+
+		if (next + conf->offset_diff < safe)
+			need_flush = 1;
+
+		last = conf->reshape_progress - 1;
+		sector_nr = last & ~(sector_t)(conf->geo.chunk_mask
+					       & conf->prev.chunk_mask);
+		if (sector_nr + RESYNC_BLOCK_SIZE/512 < last)
+			sector_nr = last + 1 - RESYNC_BLOCK_SIZE/512;
+	} else {
+		/* 'next' is the last device address that we might write to
+		 * for this chunk in the new layout
+		 */
+		next = (conf->reshape_progress | conf->geo.chunk_mask) + 1;
+		next = DIV_ROUND_UP_SECTOR_T(next, conf->geo.stripe_size);
+		next *= conf->geo.stripe_length;
+
+		/* 'safe' is the earliest device address that we might read from
+		 * in the old layout after a restart
+		 */
+		safe = conf->reshape_safe;
+		sector_div(safe, conf->prev.stripe_size);
+		safe *= conf->prev.stripe_length;
+
+		/* Need to update metadata if 'next' might be beyond 'safe'
+		 * as that would possibly corrupt data
+		 */
+		if (next >= safe + conf->offset_diff)
+			need_flush = 1;
+
+		sector_nr = conf->reshape_progress;
+		last  = sector_nr | (conf->geo.chunk_mask
+				     & conf->prev.chunk_mask);
+
+		if (sector_nr + RESYNC_BLOCK_SIZE/512 <= last)
+			last = sector_nr + RESYNC_BLOCK_SIZE/512 - 1;
+	}
+
+	if (need_flush ||
+	    time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
+		/* Need to update reshape_position in metadata */
+		wait_barrier(conf);
+		mddev->reshape_position = conf->reshape_progress;
+		mddev->curr_resync_completed = sector_nr;
+		conf->reshape_checkpoint = jiffies;
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait, mddev->flags == 0 ||
+			   kthread_should_stop());
+		conf->reshape_safe = mddev->reshape_position;
+		allow_barrier(conf);
+	}
+
+read_more:
+	/* Now schedule reads for blocks from sector_nr to last */
+	r10_bio = mempool_alloc(conf->r10buf_pool, GFP_NOIO);
+	raise_barrier(conf, sectors_done != 0);
+	atomic_set(&r10_bio->remaining, 0);
+	r10_bio->mddev = mddev;
+	r10_bio->sector = sector_nr;
+	set_bit(R10BIO_IsReshape, &r10_bio->state);
+	r10_bio->sectors = last - sector_nr + 1;
+	rdev = read_balance(conf, r10_bio, &max_sectors);
+	BUG_ON(!test_bit(R10BIO_Previous, &r10_bio->state));
+
+	if (!rdev) {
+		/* Cannot read from here, so need to record bad blocks
+		 * on all the target devices.
+		 */
+		/* FIXME */
+		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+		return sectors_done;
+	}
+
+	read_bio = bio_alloc_mddev(GFP_KERNEL, RESYNC_PAGES, mddev);
+
+	read_bio->bi_bdev = rdev->bdev;
+	read_bio->bi_sector = (r10_bio->devs[r10_bio->read_slot].addr
+			       + rdev->data_offset);
+	read_bio->bi_private = r10_bio;
+	read_bio->bi_end_io = end_sync_read;
+	read_bio->bi_rw = READ;
+	read_bio->bi_flags &= ~(BIO_POOL_MASK - 1);
+	read_bio->bi_flags |= 1 << BIO_UPTODATE;
+	read_bio->bi_vcnt = 0;
+	read_bio->bi_idx = 0;
+	read_bio->bi_size = 0;
+	r10_bio->master_bio = read_bio;
+
+	/* Now find the locations in the new layout */
+	__raid10_find_phys(&conf->geo, r10_bio);
+
+	blist = read_bio;
+	read_bio->bi_next = NULL;
+
+	for (s = 0; s < conf->copies*2; s++) {
+		struct bio *b;
+		int d = r10_bio->devs[s/2].devnum;
+		struct md_rdev *rdev2;
+		if (s&1) {
+			rdev2 = conf->mirrors[d].replacement;
+			b = r10_bio->devs[s/2].repl_bio;
+		} else {
+			rdev2 = conf->mirrors[d].rdev;
+			b = r10_bio->devs[s/2].bio;
+		}
+		if (!rdev2 || test_bit(Faulty, &rdev2->flags))
+			continue;
+		b->bi_bdev = rdev2->bdev;
+		b->bi_sector = r10_bio->devs[s/2].addr + rdev2->new_data_offset;
+		b->bi_private = r10_bio;
+		b->bi_end_io = end_reshape_write;
+		b->bi_rw = WRITE;
+		b->bi_flags &= ~(BIO_POOL_MASK - 1);
+		b->bi_flags |= 1 << BIO_UPTODATE;
+		b->bi_next = blist;
+		b->bi_vcnt = 0;
+		b->bi_idx = 0;
+		b->bi_size = 0;
+		blist = b;
+	}
+
+	/* Now add as many pages as possible to all of these bios. */
+
+	nr_sectors = 0;
+	for (s = 0 ; s < max_sectors; s += PAGE_SIZE >> 9) {
+		struct page *page = r10_bio->devs[0].bio->bi_io_vec[s/(PAGE_SIZE>>9)].bv_page;
+		int len = (max_sectors - s) << 9;
+		if (len > PAGE_SIZE)
+			len = PAGE_SIZE;
+		for (bio = blist; bio ; bio = bio->bi_next) {
+			struct bio *bio2;
+			if (bio_add_page(bio, page, len, 0))
+				continue;
+
+			/* Didn't fit, must stop */
+			for (bio2 = blist;
+			     bio2 && bio2 != bio;
+			     bio2 = bio2->bi_next) {
+				/* Remove last page from this bio */
+				bio2->bi_vcnt--;
+				bio2->bi_size -= len;
+				bio2->bi_flags &= ~(1<<BIO_SEG_VALID);
+			}
+			goto bio_full;
+		}
+		sector_nr += len >> 9;
+		nr_sectors += len >> 9;
+	}
+bio_full:
+	r10_bio->sectors = nr_sectors;
+
+	/* Now submit the read */
+	md_sync_acct(read_bio->bi_bdev, r10_bio->sectors);
+	atomic_inc(&r10_bio->remaining);
+	read_bio->bi_next = NULL;
+	generic_make_request(read_bio);
+	sector_nr += nr_sectors;
+	sectors_done += nr_sectors;
+	if (sector_nr <= last)
+		goto read_more;
+
+	/* Now that we have done the whole section we can
+	 * update reshape_progress
+	 */
+	if (mddev->reshape_backwards)
+		conf->reshape_progress -= sectors_done;
+	else
+		conf->reshape_progress += sectors_done;
+
+	return sectors_done;
+}
+
+static void end_reshape_request(struct r10bio *r10_bio);
+static int handle_reshape_read_error(struct mddev *mddev,
+				     struct r10bio *r10_bio);
+static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio)
+{
+	/* Reshape read completed.  Hopefully we have a block
+	 * to write out.
+	 * If we got a read error then we do sync 1-page reads from
+	 * elsewhere until we find the data - or give up.
+	 */
+	struct r10conf *conf = mddev->private;
+	int s;
+
+	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+		if (handle_reshape_read_error(mddev, r10_bio) < 0) {
+			/* Reshape has been aborted */
+			md_done_sync(mddev, r10_bio->sectors, 0);
+			return;
+		}
+
+	/* We definitely have the data in the pages, schedule the
+	 * writes.
+	 */
+	atomic_set(&r10_bio->remaining, 1);
+	for (s = 0; s < conf->copies*2; s++) {
+		struct bio *b;
+		int d = r10_bio->devs[s/2].devnum;
+		struct md_rdev *rdev;
+		if (s&1) {
+			rdev = conf->mirrors[d].replacement;
+			b = r10_bio->devs[s/2].repl_bio;
+		} else {
+			rdev = conf->mirrors[d].rdev;
+			b = r10_bio->devs[s/2].bio;
+		}
+		if (!rdev || test_bit(Faulty, &rdev->flags))
+			continue;
+		atomic_inc(&rdev->nr_pending);
+		md_sync_acct(b->bi_bdev, r10_bio->sectors);
+		atomic_inc(&r10_bio->remaining);
+		b->bi_next = NULL;
+		generic_make_request(b);
+	}
+	end_reshape_request(r10_bio);
+}
+
+static void end_reshape(struct r10conf *conf)
+{
+	if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
+		return;
+
+	spin_lock_irq(&conf->device_lock);
+	conf->prev = conf->geo;
+	md_finish_reshape(conf->mddev);
+	smp_wmb();
+	conf->reshape_progress = MaxSector;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* read-ahead size must cover two whole stripes, which is
+	 * 2 * (number of data disks) * chunk size
+	 */
+	if (conf->mddev->queue) {
+		int stripe = conf->geo.raid_disks *
+			((conf->mddev->chunk_sectors << 9) / PAGE_SIZE);
+		stripe /= conf->geo.near_copies;
+		if (conf->mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
+			conf->mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
+	}
+}
+
+static int handle_reshape_read_error(struct mddev *mddev,
+				     struct r10bio *r10_bio)
+{
+	/* Use sync reads to get the blocks from somewhere else */
+	int sectors = r10_bio->sectors;
+	struct r10bio r10b;
+	struct r10conf *conf = mddev->private;
+	int slot = 0;
+	int idx = 0;
+	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
+
+	/* need the source sector set before mapping to the old layout */
+	r10b.sector = r10_bio->sector;
+	__raid10_find_phys(&conf->prev, &r10b);
+
+	while (sectors) {
+		int s = sectors;
+		int success = 0;
+		int first_slot = slot;
+
+		if (s > (PAGE_SIZE >> 9))
+			s = PAGE_SIZE >> 9;
+
+		while (!success) {
+			int d = r10b.devs[slot].devnum;
+			struct md_rdev *rdev = conf->mirrors[d].rdev;
+			sector_t addr;
+			if (rdev == NULL ||
+			    test_bit(Faulty, &rdev->flags) ||
+			    !test_bit(In_sync, &rdev->flags))
+				goto failed;
+
+			addr = r10b.devs[slot].addr + idx * PAGE_SIZE;
+			success = sync_page_io(rdev,
+					       addr,
+					       s << 9,
+					       bvec[idx].bv_page,
+					       READ, false);
+			if (success)
+				break;
+		failed:
+			slot++;
+			if (slot >= conf->copies)
+				slot = 0;
+			if (slot == first_slot)
+				break;
+		}
+		if (!success) {
+			/* couldn't read this block, must give up */
+			set_bit(MD_RECOVERY_INTR,
+				&mddev->recovery);
+			return -EIO;
+		}
+		sectors -= s;
+		idx++;
+	}
+	return 0;
+}
+
+static void end_reshape_write(struct bio *bio, int error)
+{
+	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct r10bio *r10_bio = bio->bi_private;
+	struct mddev *mddev = r10_bio->mddev;
+	struct r10conf *conf = mddev->private;
+	int d;
+	int slot;
+	int repl;
+	struct md_rdev *rdev = NULL;
+
+	d = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
+	if (repl)
+		rdev = conf->mirrors[d].replacement;
+	if (!rdev) {
+		smp_mb();
+		rdev = conf->mirrors[d].rdev;
+	}
+
+	if (!uptodate) {
+		/* FIXME should record badblock */
+		md_error(mddev, rdev);
+	}
+
+	rdev_dec_pending(rdev, mddev);
+	end_reshape_request(r10_bio);
+}
+
+static void end_reshape_request(struct r10bio *r10_bio)
+{
+	if (!atomic_dec_and_test(&r10_bio->remaining))
+		return;
+	md_done_sync(r10_bio->mddev, r10_bio->sectors, 1);
+	bio_put(r10_bio->master_bio);
+	put_buf(r10_bio);
+}
+
+static void raid10_finish_reshape(struct mddev *mddev)
+{
+	struct r10conf *conf = mddev->private;
+
+	if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		return;
+
+	if (mddev->delta_disks > 0) {
+		md_set_array_sectors(mddev, raid10_size(mddev, 0, 0));
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		revalidate_disk(mddev->gendisk);
+	}
+	mddev->layout = mddev->new_layout;
+	mddev->chunk_sectors = 1 << conf->geo.chunk_shift;
+	mddev->reshape_position = MaxSector;
+	mddev->delta_disks = 0;
+	mddev->reshape_backwards = 0;
+}
+
 static struct md_personality raid10_personality =
 {
 	.name		= "raid10",
@@ -3642,6 +4487,9 @@ static struct md_personality raid10_personality =
 	.size		= raid10_size,
 	.resize		= raid10_resize,
 	.takeover	= raid10_takeover,
+	.check_reshape	= raid10_check_reshape,
+	.start_reshape	= raid10_start_reshape,
+	.finish_reshape	= raid10_finish_reshape,
 };
 
 static int __init raid_init(void)
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 37509d7..630d0f2 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -14,6 +14,7 @@ struct mirror_info {
 struct r10conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;
+	struct mirror_info	*mirrors_new, *mirrors_old;
 	spinlock_t		device_lock;
 
 	/* geometry */
@@ -34,6 +35,7 @@ struct r10conf {
 					       */
 		int		chunk_shift; /* shift from chunks to sectors */
 		sector_t	chunk_mask;
+		long		stripe_size, stripe_length;
 	} prev, geo;
 	int			copies;	      /* near_copies * far_copies.
 					       * must be <= raid_disks
@@ -42,6 +44,9 @@ struct r10conf {
 	sector_t		dev_sectors;  /* temp copy of
 					       * mddev->dev_sectors */
 	sector_t		reshape_progress;
+	sector_t		reshape_safe;
+	unsigned long		reshape_checkpoint;
+	sector_t		offset_diff;
 
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
@@ -138,6 +143,7 @@ enum r10bio_state {
 	R10BIO_Uptodate,
 	R10BIO_IsSync,
 	R10BIO_IsRecover,
+	R10BIO_IsReshape,
 	R10BIO_Degraded,
 /* Set ReadError on bios that experience a read error
  * so that raid10d knows what to do with them.




* Re: [md PATCH 01/10] md: allow a reshape operation to be reversed.
  2012-04-03  5:53 ` [md PATCH 01/10] md: allow a reshape operation to be reversed NeilBrown
@ 2012-10-01  9:11   ` Roman Mamedov
  2012-10-01 10:15   ` Roman Mamedov
  1 sibling, 0 replies; 17+ messages in thread
From: Roman Mamedov @ 2012-10-01  9:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Tue, 03 Apr 2012 15:53:01 +1000
NeilBrown <neilb@suse.de> wrote:

> Currently a reshape operation always progresses from the start
> of the array to the end unless the number of devices is being
> reduced, in which case it progressed in the opposite direction.
> 
> To reverse a partial reshape which changes the number of devices
> you can stop the array and re-assemble with the raid-disks numbers
> reversed and it will undo.
> 
> However for a reshape that does not change the number of devices
> it is not possible to reverse the reshape in the middle - you have to
> wait until it completes.
> 
> So add a 'reshape_direction' attribute which is either 'forwards' or
> 'backwards' and can be explicitly set when delta_disks is zero.
> 
> This will become more important when we allow the data_offset to
> change in a reshape.  Then the explicit statement of what direction is
> being used will be more useful.
> 
> This can be enabled in raid5 trivially as it already supports
> reverse reshape and just needs to use a different trigger to request it.

Hello,

Is it possible to use this to cancel a RAID5 reshape that is reducing the
number of devices?

I understand that reshape goes backwards by design in this case, so I need to
switch it to "forwards".

I stopped the reshaping array and then started it with "mdadm --assemble
--freeze-reshape", listing the backup file and all the devices on the command line.

But even then:

# echo forwards > /sys/block/md0/md/reshape_direction 
-bash: echo: write error: Device or resource busy

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."


* Re: [md PATCH 01/10] md: allow a reshape operation to be reversed.
  2012-04-03  5:53 ` [md PATCH 01/10] md: allow a reshape operation to be reversed NeilBrown
  2012-10-01  9:11   ` Roman Mamedov
@ 2012-10-01 10:15   ` Roman Mamedov
  2012-10-02  2:41     ` NeilBrown
  1 sibling, 1 reply; 17+ messages in thread
From: Roman Mamedov @ 2012-10-01 10:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Tue, 03 Apr 2012 15:53:01 +1000
NeilBrown <neilb@suse.de> wrote:

> To reverse a partial reshape which changes the number of devices
> you can stop the array and re-assemble with the raid-disks numbers
> reversed and it will undo.

Unfortunately this does not seem to work:

# mdadm --assemble --raid-disks=5 --backup-file=/mnt/scratch/backup-file /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sde1 /dev/sdd1 /dev/sdf1
mdadm:option --raid-disks not valid in assemble mode

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."


* Re: [md PATCH 01/10] md: allow a reshape operation to be reversed.
  2012-10-01 10:15   ` Roman Mamedov
@ 2012-10-02  2:41     ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-10-02  2:41 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid


On Mon, 1 Oct 2012 16:15:54 +0600 Roman Mamedov <rm@romanrm.ru> wrote:

> On Tue, 03 Apr 2012 15:53:01 +1000
> NeilBrown <neilb@suse.de> wrote:
> 
> > To reverse a partial reshape which changes the number of devices
> > you can stop the array and re-assemble with the raid-disks numbers
> > reversed and it will undo.
> 
> Unfortunately this does not seem to work:
> 
> # mdadm --assemble --raid-disks=5 --backup-file=/mnt/scratch/backup-file /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sde1 /dev/sdd1 /dev/sdf1
> mdadm:option --raid-disks not valid in assemble mode
> 

No, mdadm doesn't support this yet.  The patch you identified is part of
preparing the way for this functionality, but it isn't ready yet.

For now, you'll just need to let the reshape complete.
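
Later mdadm releases grew an assemble-time revert for exactly this
case.  A minimal sketch, assuming an mdadm new enough to support
--update=revert-reshape (check your version's man page), and reusing
the devices and backup file named earlier in this thread:

# mdadm --stop /dev/md0
# mdadm --assemble /dev/md0 --update=revert-reshape \
    --backup-file=/mnt/scratch/backup-file \
    /dev/sdb1 /dev/sdc1 /dev/sde1 /dev/sdd1 /dev/sdf1

If that assembles, the reshape should then run in the opposite
direction, back towards the original shape.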

NeilBrown



* Re: [md PATCH 00/10] md patches for 3.5: RAID10 reshape
  2012-04-03  5:53 [md PATCH 00/10] md patches for 3.5: RAID10 reshape NeilBrown
                   ` (9 preceding siblings ...)
  2012-04-03  5:53 ` [md PATCH 07/10] md: use resync_max_sectors for reshape as well as resync NeilBrown
@ 2014-03-01  5:08 ` Phillip Susi
  2014-03-01  7:03   ` NeilBrown
  10 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2014-03-01  5:08 UTC (permalink / raw)
  To: NeilBrown, linux-raid


On 04/03/2012 01:53 AM, NeilBrown wrote:
> The following patches are now in my queue for 3.5. (I may add some
> more, but to avoid patch-bombing I sending these now as I think
> they are ready).
> 
> This is all about reshaping arrays and particularly RAID10.

I was wondering what the current status of this patch set is?  It
looks like this patch did get merged into the kernel, but mdadm does
not yet support reshaping raid10?  Does that mean that adding a drive
to a raid10 can only be done by manually fiddling with sysfs?




* Re: [md PATCH 00/10] md patches for 3.5: RAID10 reshape
  2014-03-01  5:08 ` [md PATCH 00/10] md patches for 3.5: RAID10 reshape Phillip Susi
@ 2014-03-01  7:03   ` NeilBrown
  2014-03-01 19:33     ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2014-03-01  7:03 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid


On Sat, 01 Mar 2014 00:08:30 -0500 Phillip Susi <psusi@ubuntu.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
> 
> On 04/03/2012 01:53 AM, NeilBrown wrote:
> > The following patches are now in my queue for 3.5. (I may add some
> > more, but to avoid patch-bombing I sending these now as I think
> > they are ready).
> > 
> > This is all about reshaping arrays and particularly RAID10.
> 
> I was wondering what the current status of this patch set is?  It
> looks like this patch did get merged into the kernel, but mdadm does
> not yet support reshaping raid10?  Does that mean that adding a drive
> to a raid10 can be done by manually fiddling with sysfs?
> 

Why do you think mdadm doesn't yet support reshaping raid10?

commit 19ceb16dafb7df98ff90298008d4488dc93b370a
Author: NeilBrown <neilb@suse.de>
Date:   Thu Oct 4 16:34:21 2012 +1000

    Grow: add raid10 reshape.
 

$ git describe --contains 19ceb16dafb7df98ff90298008d4488dc93b370a
mdadm-3.3-rc1~192


So mdadm-3.3 and later should do what you want.
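
Concretely, growing a RAID10 by one disk then reduces to a two-step
operation.  A minimal sketch (the new device name and the target disk
count are illustrative assumptions, not taken from this thread):

# mdadm /dev/md0 --add /dev/sdg1
# mdadm --grow /dev/md0 --raid-devices=5

The first command adds the new disk as a spare; the second starts the
reshape, whose progress shows up in /proc/mdstat.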

NeilBrown

* Re: [md PATCH 00/10] md patches for 3.5: RAID10 reshape
  2014-03-01  7:03   ` NeilBrown
@ 2014-03-01 19:33     ` Phillip Susi
  0 siblings, 0 replies; 17+ messages in thread
From: Phillip Susi @ 2014-03-01 19:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On 03/01/2014 02:03 AM, NeilBrown wrote:
> Why do you think mdadm doesn't yet support reshaping raid10?
> 
> commit 19ceb16dafb7df98ff90298008d4488dc93b370a Author: NeilBrown
> <neilb@suse.de> Date:   Thu Oct 4 16:34:21 2012 +1000
> 
> Grow: add raid10 reshape.
> 
> 
> $ git describe --contains 19ceb16dafb7df98ff90298008d4488dc93b370a 
> mdadm-3.3-rc1~192
> 
> 
> So mdadm-3.3 and later should do what you want.

Ahh, excellent... I had searched but couldn't find that commit in the
git repo.  I guess it's time to upgrade.



