[PATCH 000 of 13] md: Introduction

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 000 of 13] md: Introduction
@ 2006-03-17  4:47 NeilBrown
  2006-03-17  4:47 ` [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported NeilBrown
                   ` (12 more replies)
  0 siblings, 13 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

Following are 13 patches for md in 2.6.last (created against 2.6.16-rc6-mm1).
They are NOT appropriate for 2.6.16 but should be OK for .17.  Please include
them in -mm.

The first three are assorted bug fixes, none really serious

The remainder implement raid5 reshaping.  Currently the only shape change
that is supported is added a device, but it is envisioned that 
changing the chunksize and layout will also be supported, as well
as changing the level (e.g. 1->5, 5->6).

The reshape process naturally has to move all of the data in the
array, and so should be used with caution.  It is believed to work,
and some testing does support this, but wider testing would be great
for increasing my confidence.

You will need a version of mdadm newer than 2.3.1 to make use of raid5
growth.  This is because mdadm need to take a copy of a 'critical
section' at the start of the array incase there is a crash at an
awkward moment.  On restart, mdadm will restore the critical section
and allow reshape to continue.

I hope to release a 2.4-pre by early next week - it still needs a
little more polishing.

NeilBrown

 [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported.
 [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks.
 [PATCH 003 of 13] md: Update status_resync to handle LARGE devices.

 [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow.
 [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
 [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.
 [PATCH 007 of 13] md: Core of raid5 resize process
 [PATCH 008 of 13] md: Final stages of raid5 expand code.
 [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape
 [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.
 [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape.
 [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action.
 [PATCH 013 of 13] md: Support suspending of IO to regions of an md array.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  4:47 ` [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks NeilBrown
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


I really should make this a function of the personality....


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:08.000000000 +1100
@@ -763,7 +763,8 @@ static int super_90_validate(mddev_t *md
 
 		if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
 		    mddev->bitmap_file == NULL) {
-			if (mddev->level != 1 && mddev->level != 5 && mddev->level != 6
+			if (mddev->level != 1 && mddev->level != 4
+			    && mddev->level != 5 && mddev->level != 6
 			    && mddev->level != 10) {
 				/* FIXME use a better test */
 				printk(KERN_WARNING "md: bitmaps not supported for this level.\n");

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
  2006-03-17  4:47 ` [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  4:47 ` [PATCH 003 of 13] md: Update status_resync to handle LARGE devices NeilBrown
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


We are counting failed devices twice, once of the device that is
failed, and once for the hole that has been left in the array.  Remove
the former so 'failed' matches 'missing'.  Storing these counts in the
superblock is a bit silly anyway....

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:08.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:08.000000000 +1100
@@ -893,10 +893,9 @@ static void super_90_sync(mddev_t *mddev
 			d->raid_disk = rdev2->raid_disk;
 		else
 			d->raid_disk = rdev2->desc_nr; /* compatibility */
-		if (test_bit(Faulty, &rdev2->flags)) {
+		if (test_bit(Faulty, &rdev2->flags))
 			d->state = (1<<MD_DISK_FAULTY);
-			failed++;
-		} else if (test_bit(In_sync, &rdev2->flags)) {
+		else if (test_bit(In_sync, &rdev2->flags)) {
 			d->state = (1<<MD_DISK_ACTIVE);
 			d->state |= (1<<MD_DISK_SYNC);
 			active++;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 003 of 13] md: Update status_resync to handle LARGE devices.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
  2006-03-17  4:47 ` [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported NeilBrown
  2006-03-17  4:47 ` [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  4:47 ` [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


status_resync - used by /proc/mdstat to report the status of a resync,
assumes that device sizes will always fit into an 'unsigned long' This
is no longer the case...

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:08.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:30.000000000 +1100
@@ -4040,7 +4040,10 @@ static void status_unused(struct seq_fil
 
 static void status_resync(struct seq_file *seq, mddev_t * mddev)
 {
-	unsigned long max_blocks, resync, res, dt, db, rt;
+	sector_t max_blocks, resync, res;
+	unsigned long dt, db, rt;
+	int scale;
+	unsigned int per_milli;
 
 	resync = (mddev->curr_resync - atomic_read(&mddev->recovery_active))/2;
 
@@ -4056,9 +4059,22 @@ static void status_resync(struct seq_fil
 		MD_BUG();
 		return;
 	}
-	res = (resync/1024)*1000/(max_blocks/1024 + 1);
+	/* Pick 'scale' such that (resync>>scale)*1000 will fit
+	 * in a sector_t, and (max_blocks>>scale) will fit in a
+	 * u32, as those are the requirements for sector_div.
+	 * Thus 'scale' must be at least 10
+	 */
+	scale = 10;
+	if (sizeof(sector_t) > sizeof(unsigned long)) {
+		while ( max_blocks/2 > (1ULL<<(scale+32)))
+			scale++;
+	}
+	res = (resync>>scale)*1000;
+	sector_div(res, (u32)((max_blocks>>scale)+1));
+
+	per_milli = res;
 	{
-		int i, x = res/50, y = 20-x;
+		int i, x = per_milli/50, y = 20-x;
 		seq_printf(seq, "[");
 		for (i = 0; i < x; i++)
 			seq_printf(seq, "=");
@@ -4067,10 +4083,12 @@ static void status_resync(struct seq_fil
 			seq_printf(seq, ".");
 		seq_printf(seq, "] ");
 	}
-	seq_printf(seq, " %s =%3lu.%lu%% (%lu/%lu)",
+	seq_printf(seq, " %s =%3u.%u%% (%llu/%llu)",
 		      (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?
 		       "resync" : "recovery"),
-		      res/10, res % 10, resync, max_blocks);
+		      per_milli/10, per_milli % 10,
+		   (unsigned long long) resync,
+		   (unsigned long long) max_blocks);
 
 	/*
 	 * We do not want to overflow, so the order of operands and
@@ -4084,7 +4102,7 @@ static void status_resync(struct seq_fil
 	dt = ((jiffies - mddev->resync_mark) / HZ);
 	if (!dt) dt++;
 	db = resync - (mddev->resync_mark_cnt/2);
-	rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;
+	rt = (dt * ((unsigned long)(max_blocks-resync) / (db/100+1)))/100;
 
 	seq_printf(seq, " finish=%lu.%lumin", rt / 60, (rt % 60)/6);
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (2 preceding siblings ...)
  2006-03-17  4:47 ` [PATCH 003 of 13] md: Update status_resync to handle LARGE devices NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Previously the array of disk information was included in the
raid5 'conf' structure which was allocated to an appropriate size.
This makes it awkward to change the size of that array.
So we split it off into a separate kmalloced array which will
require a little extra indexing, but is much easier to grow.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   10 +++++++---
 ./drivers/md/raid6main.c     |   10 +++++++---
 ./include/linux/raid/raid5.h |    2 +-
 3 files changed, 15 insertions(+), 7 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:55.000000000 +1100
@@ -1822,11 +1822,13 @@ static int run(mddev_t *mddev)
 		return -EIO;
 	}
 
-	mddev->private = kzalloc(sizeof (raid5_conf_t)
-				 + mddev->raid_disks * sizeof(struct disk_info),
-				 GFP_KERNEL);
+	mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
+	conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+			      GFP_KERNEL);
+	if (!conf->disks)
+		goto abort;
 
 	conf->mddev = mddev;
 
@@ -1966,6 +1968,7 @@ static int run(mddev_t *mddev)
 abort:
 	if (conf) {
 		print_raid5_conf(conf);
+		kfree(conf->disks);
 		kfree(conf->stripe_hashtbl);
 		kfree(conf);
 	}
@@ -1986,6 +1989,7 @@ static int stop(mddev_t *mddev)
 	kfree(conf->stripe_hashtbl);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
+	kfree(conf->disks);
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./drivers/md/raid6main.c	2006-03-17 11:48:55.000000000 +1100
@@ -2006,11 +2006,14 @@ static int run(mddev_t *mddev)
 		return -EIO;
 	}
 
-	mddev->private = kzalloc(sizeof (raid6_conf_t)
-				 + mddev->raid_disks * sizeof(struct disk_info),
-				 GFP_KERNEL);
+	mddev->private = kzalloc(sizeof (raid6_conf_t), GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
+	conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+				 GFP_KERNEL);
+	if (!conf->disks)
+		goto abort;
+
 	conf->mddev = mddev;
 
 	if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
@@ -2158,6 +2161,7 @@ abort:
 		print_raid6_conf(conf);
 		safe_put_page(conf->spare_page);
 		kfree(conf->stripe_hashtbl);
+		kfree(conf->disks);
 		kfree(conf);
 	}
 	mddev->private = NULL;

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:55.000000000 +1100
@@ -240,7 +240,7 @@ struct raid5_private_data {
 							 * waiting for 25% to be free
 							 */        
 	spinlock_t		device_lock;
-	struct disk_info	disks[0];
+	struct disk_info	*disks;
 };
 
 typedef struct raid5_private_data raid5_conf_t;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (3 preceding siblings ...)
  2006-03-17  4:47 ` [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  5:50   ` Andrew Morton
                     ` (2 more replies)
  2006-03-17  4:47 ` [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Before a RAID-5 can be expanded, we need to be able to expand the
stripe-cache data structure.  
This requires allocating new stripes in a new kmem_cache.
If this succeeds, we copy cache pages over and release the old
stripes and kmem_cache.
We then allocate new pages.  If that fails, we leave the stripe
cache at it's new size.  It isn't worth the effort to shrink 
it back again.

Unfortuanately this means we need two kmem_cache names as we, for a
short period of time, we have two kmem_caches.  So they are
raid5/%s and raid5/%s-alt


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |  118 +++++++++++++++++++++++++++++++++++++++++--
 ./drivers/md/raid6main.c     |    4 -
 ./include/linux/raid/raid5.h |    9 ++-
 3 files changed, 123 insertions(+), 8 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:55.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:56.000000000 +1100
@@ -313,20 +313,130 @@ static int grow_stripes(raid5_conf_t *co
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
-
-	sc = kmem_cache_create(conf->cache_name, 
+	sprintf(conf->cache_name[0], "raid5/%s", mdname(conf->mddev));
+	sprintf(conf->cache_name[1], "raid5/%s-alt", mdname(conf->mddev));
+	conf->active_name = 0;
+	sc = kmem_cache_create(conf->cache_name[conf->active_name],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
 	conf->slab_cache = sc;
+	conf->pool_size = devs;
 	while (num--) {
 		if (!grow_one_stripe(conf))
 			return 1;
 	}
 	return 0;
 }
+static int resize_stripes(raid5_conf_t *conf, int newsize)
+{
+	/* make all the stripes able to hold 'newsize' devices.
+	 * New slots in each stripe get 'page' set to a new page.
+	 * We allocate all the new stripes first, then if that succeeds,
+	 * copy everything across.
+	 * Finally we add new pages.  This could fail, but we leave
+	 * the stripe cache at it's new size, just with some pages empty.
+	 *
+	 * We use GFP_NOIO allocations as IO to the raid5 is blocked
+	 * at some points in this operation.
+	 */
+	struct stripe_head *osh, *nsh;
+	struct list_head newstripes, oldstripes;
+	struct disk_info *ndisks;
+	int err = 0;
+	kmem_cache_t *sc;
+	int i;
+
+	if (newsize <= conf->pool_size)
+		return 0; /* never bother to shrink */
+
+	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
+			       0, 0, NULL, NULL);
+	if (!sc)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&newstripes);
+	for (i = conf->max_nr_stripes; i; i--) {
+		nsh = kmem_cache_alloc(sc, GFP_NOIO);
+		if (!nsh)
+			break;
+
+		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
+
+		nsh->raid_conf = conf;
+		spin_lock_init(&nsh->lock);
+
+		list_add(&nsh->lru, &newstripes);
+	}
+	if (i) {
+		/* didn't get enough, give up */
+		while (!list_empty(&newstripes)) {
+			nsh = list_entry(newstripes.next, struct stripe_head, lru);
+			list_del(&nsh->lru);
+			kmem_cache_free(sc, nsh);
+		}
+		kmem_cache_destroy(sc);
+		return -ENOMEM;
+	}
+	/* OK, we have enough stripes, start collecting inactive
+	 * stripes and copying them over
+	 */
+	INIT_LIST_HEAD(&oldstripes);
+	list_for_each_entry(nsh, &newstripes, lru) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_stripe,
+				    !list_empty(&conf->inactive_list),
+				    conf->device_lock,
+				    unplug_slaves(conf->mddev);
+			);
+		osh = get_free_stripe(conf);
+		spin_unlock_irq(&conf->device_lock);
+		atomic_set(&nsh->count, 1);
+		for(i=0; i<conf->pool_size; i++)
+			nsh->dev[i].page = osh->dev[i].page;
+		for( ; i<newsize; i++)
+			nsh->dev[i].page = NULL;
+		list_add(&osh->lru, &oldstripes);
+	}
+	/* Got them all.
+	 * Return the new ones and free the old ones.
+	 * At this point, we are holding all the stripes so the array
+	 * is completely stalled, so now is a good time to resize
+	 * conf->disks.
+	 */
+	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
+	if (ndisks) {
+		for (i=0; i<conf->raid_disks; i++)
+			ndisks[i] = conf->disks[i];
+		kfree(conf->disks);
+		conf->disks = ndisks;
+	} else
+		err = -ENOMEM;
+	while(!list_empty(&newstripes)) {
+		nsh = list_entry(newstripes.next, struct stripe_head, lru);
+		list_del_init(&nsh->lru);
+		for (i=conf->raid_disks; i < newsize; i++)
+			if (nsh->dev[i].page == NULL) {
+				struct page *p = alloc_page(GFP_NOIO);
+				nsh->dev[i].page = p;
+				if (!p)
+					err = -ENOMEM;
+			}
+		release_stripe(nsh);
+	}
+	while(!list_empty(&oldstripes)) {
+		osh = list_entry(oldstripes.next, struct stripe_head, lru);
+		list_del(&osh->lru);
+		kmem_cache_free(conf->slab_cache, osh);
+	}
+	kmem_cache_destroy(conf->slab_cache);
+	conf->slab_cache = sc;
+	conf->active_name = 1-conf->active_name;
+	conf->pool_size = newsize;
+	return err;
+}
+
 
 static int drop_one_stripe(raid5_conf_t *conf)
 {
@@ -339,7 +449,7 @@ static int drop_one_stripe(raid5_conf_t 
 		return 0;
 	if (atomic_read(&sh->count))
 		BUG();
-	shrink_buffers(sh, conf->raid_disks);
+	shrink_buffers(sh, conf->pool_size);
 	kmem_cache_free(conf->slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2006-03-17 11:48:55.000000000 +1100
+++ ./drivers/md/raid6main.c	2006-03-17 11:48:56.000000000 +1100
@@ -331,9 +331,9 @@ static int grow_stripes(raid6_conf_t *co
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid6/%s", mdname(conf->mddev));
+	sprintf(conf->cache_name[0], "raid6/%s", mdname(conf->mddev));
 
-	sc = kmem_cache_create(conf->cache_name,
+	sc = kmem_cache_create(conf->cache_name[0],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:48:55.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:56.000000000 +1100
@@ -216,7 +216,11 @@ struct raid5_private_data {
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
-	char			cache_name[20];
+	/* unfortunately we need two cache names as we temporarily have
+	 * two caches.
+	 */
+	int			active_name;
+	char			cache_name[2][20];
 	kmem_cache_t		*slab_cache; /* for allocating stripes */
 
 	int			seq_flush, seq_write;
@@ -238,7 +242,8 @@ struct raid5_private_data {
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
-							 */        
+							 */
+	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
 };

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (4 preceding siblings ...)
  2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  6:01   ` Andrew Morton
  2006-03-17  4:47 ` [PATCH 007 of 13] md: Core of raid5 resize process NeilBrown
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


We need to allow that different stripes are of different effective sizes,
and use the appropriate size.
Also, when a stripe is being expanded, we must block any IO attempts
until the stripe is stable again.

Key elements in this change are:
 - each stripe_head gets a 'disk' field which is part of the key,
   thus there can sometimes be two stripe heads of the same area of
   the array, but covering different numbers of devices.  One of these
   will be marked STRIPE_EXPANDING and so won't accept new requests.
 - conf->expand_progress tracks how the expansion is progressing and
   is used to determine whether the target part of the array has been
   expanded yet or not.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   88 ++++++++++++++++++++++++++++---------------
 ./include/linux/raid/raid5.h |    6 ++
 2 files changed, 64 insertions(+), 30 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:56.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:56.000000000 +1100
@@ -178,10 +178,10 @@ static int grow_buffers(struct stripe_he
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
+static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -198,7 +198,9 @@ static void init_stripe(struct stripe_he
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
 
-	for (i=disks; i--; ) {
+	sh->disks = disks;
+
+	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -215,7 +217,7 @@ static void init_stripe(struct stripe_he
 	insert_hash(conf, sh);
 }
 
-static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector)
+static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, int disks)
 {
 	struct stripe_head *sh;
 	struct hlist_node *hn;
@@ -223,7 +225,7 @@ static struct stripe_head *__find_stripe
 	CHECK_DEVLOCK();
 	PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-		if (sh->sector == sector)
+		if (sh->sector == sector && sh->disks == disks)
 			return sh;
 	PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
@@ -232,8 +234,8 @@ static struct stripe_head *__find_stripe
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector,
-					     int pd_idx, int noblock) 
+static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
+					     int pd_idx, int noblock)
 {
 	struct stripe_head *sh;
 
@@ -245,7 +247,7 @@ static struct stripe_head *get_active_st
 		wait_event_lock_irq(conf->wait_for_stripe,
 				    conf->quiesce == 0,
 				    conf->device_lock, /* nothing */);
-		sh = __find_stripe(conf, sector);
+		sh = __find_stripe(conf, sector, disks);
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
@@ -263,7 +265,7 @@ static struct stripe_head *get_active_st
 					);
 				conf->inactive_blocked = 0;
 			} else
-				init_stripe(sh, sector, pd_idx);
+				init_stripe(sh, sector, pd_idx, disks);
 		} else {
 			if (atomic_read(&sh->count)) {
 				if (!list_empty(&sh->lru))
@@ -300,6 +302,7 @@ static int grow_one_stripe(raid5_conf_t 
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
+	sh->disks = conf->raid_disks;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
@@ -470,7 +473,7 @@ static int raid5_end_read_request(struct
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -568,7 +571,7 @@ static int raid5_end_write_request (stru
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -722,7 +725,7 @@ static sector_t raid5_compute_sector(sec
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -823,8 +826,7 @@ static void copy_data(int frombio, struc
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -854,7 +856,7 @@ static void compute_block(struct stripe_
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -1042,7 +1044,7 @@ static int add_stripe_bio(struct stripe_
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -1636,12 +1638,10 @@ static inline void raid5_plug_device(rai
 	spin_unlock_irq(&conf->device_lock);
 }
 
-static int make_request (request_queue_t *q, struct bio * bi)
+static int make_request(request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - 1;
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
@@ -1665,20 +1665,48 @@ static int make_request (request_queue_t
 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
-		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-
+	retry:
+		if (likely(conf->expand_progress == MaxSector))
+			disks = conf->raid_disks;
+		else {
+			spin_lock_irq(&conf->device_lock);
+			disks = conf->raid_disks;
+			if (logical_sector >= conf->expand_progress)
+				disks = conf->previous_raid_disks;
+			spin_unlock_irq(&conf->device_lock);
+		}
+ 		new_sector = raid5_compute_sector(logical_sector, disks, disks - 1,
+						  &dd_idx, &pd_idx, conf);
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
+		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
 		if (sh) {
-			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
-				/* Add failed due to overlap.  Flush everything
+			if (unlikely(conf->expand_progress != MaxSector)) {
+				/* expansion might have moved on while waiting for a
+				 * stripe, so we much do the range check again.
+				 */
+				int must_retry = 0;
+				spin_lock_irq(&conf->device_lock);
+				if (logical_sector <  conf->expand_progress &&
+				    disks == conf->previous_raid_disks)
+					/* mismatch, need to try again */
+					must_retry = 1;
+				spin_unlock_irq(&conf->device_lock);
+				if (must_retry) {
+					release_stripe(sh);
+					goto retry;
+				}
+			}
+
+			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+			    !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+				/* Stripe is busy expanding or
+				 * add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
@@ -1690,7 +1718,6 @@ static int make_request (request_queue_t
 			raid5_plug_device(conf);
 			handle_stripe(sh);
 			release_stripe(sh);
-
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -1766,9 +1793,9 @@ static sector_t sync_request(mddev_t *md
 
 	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
 		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-	sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
+	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
 	if (sh == NULL) {
-		sh = get_active_stripe(conf, sector_nr, pd_idx, 0);
+		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access 
 		 */
@@ -1985,6 +2012,7 @@ static int run(mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_progress = MaxSector;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -2115,7 +2143,7 @@ static void print_sh (struct stripe_head
 	printk("sh %llu,  count %d.\n",
 		(unsigned long long)sh->sector, atomic_read(&sh->count));
 	printk("sh %llu, ", (unsigned long long)sh->sector);
-	for (i = 0; i < sh->raid_conf->raid_disks; i++) {
+	for (i = 0; i < sh->disks; i++) {
 		printk("(cache%d: %p %ld) ", 
 			i, sh->dev[i].page, sh->dev[i].flags);
 	}

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:48:56.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:56.000000000 +1100
@@ -135,6 +135,7 @@ struct stripe_head {
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -174,6 +175,7 @@ struct stripe_head {
 #define	STRIPE_DELAYED		6
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
+#define	STRIPE_EXPANDING	9
 
 /*
  * Plugging:
@@ -211,6 +213,10 @@ struct raid5_private_data {
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	sector_t		expand_progress;	/* MaxSector when no expand happening */
+	int			previous_raid_disks;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 007 of 13] md: Core of raid5 resize process
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (5 preceding siblings ...)
  2006-03-17  4:47 ` [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
@ 2006-03-17  4:47 ` NeilBrown
  2006-03-17  6:03   ` Andrew Morton
  2006-03-17  4:48 ` [PATCH 008 of 13] md: Final stages of raid5 expand code NeilBrown
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocated new stripe_heads for the next chunk-wide-stripe in the
target geometry, marking them STRIPE_EXPANDING.
Then it finds which stripe heads in the old geometry can provide data
needed by these and marks them STRIPE_EXPAND_SOURCE.  This causes
stripe_handle to read all blocks on those stripes.
Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that
are needed are copied into the corresponding STRIPE_EXPANDING stripe_head.
Once a STRIPE_EXPANDING stripe_head is full, it is marks STRIPE_EXPAND_READY
and then is written out and released.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c            |   14 ++-
 ./drivers/md/raid5.c         |  185 +++++++++++++++++++++++++++++++++++++------
 ./include/linux/raid/md_k.h  |    4 
 ./include/linux/raid/raid5.h |    4 
 4 files changed, 181 insertions(+), 26 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:30.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:57.000000000 +1100
@@ -2161,7 +2161,9 @@ action_show(mddev_t *mddev, char *page)
 	char *type = "idle";
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
 	    test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) {
-		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+			type = "reshape";
+		else if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 			if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
 				type = "resync";
 			else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
@@ -4084,8 +4086,10 @@ static void status_resync(struct seq_fil
 		seq_printf(seq, "] ");
 	}
 	seq_printf(seq, " %s =%3u.%u%% (%llu/%llu)",
+		   (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)?
+		    "reshape" :
 		      (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?
-		       "resync" : "recovery"),
+		       "resync" : "recovery")),
 		      per_milli/10, per_milli % 10,
 		   (unsigned long long) resync,
 		   (unsigned long long) max_blocks);
@@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
 		 */
 		max_sectors = mddev->resync_max_sectors;
 		mddev->resync_mismatches = 0;
-	} else
+	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+		max_sectors = mddev->size << 1;
+	else
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->size << 1;
 
@@ -4675,6 +4681,8 @@ static void md_do_sync(mddev_t *mddev)
 	mddev->pers->sync_request(mddev, max_sectors, &skipped, 1);
 
 	if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
+	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
+	    !test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
 	    mddev->curr_resync > 2 &&
 	    mddev->curr_resync >= mddev->recovery_cp) {
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:56.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:57.000000000 +1100
@@ -93,11 +93,11 @@ static void __release_stripe(raid5_conf_
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
 					md_wakeup_thread(conf->mddev->thread);
 			}
-			list_add_tail(&sh->lru, &conf->inactive_list);
 			atomic_dec(&conf->active_stripes);
-			if (!conf->inactive_blocked ||
-			    atomic_read(&conf->active_stripes) < (conf->max_nr_stripes*3/4))
+			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+				list_add_tail(&sh->lru, &conf->inactive_list);
 				wake_up(&conf->wait_for_stripe);
+			}
 		}
 	}
 }
@@ -273,9 +273,8 @@ static struct stripe_head *get_active_st
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru))
-					BUG();
-				list_del_init(&sh->lru);
+				if (!list_empty(&sh->lru))
+					list_del_init(&sh->lru);
 			}
 		}
 	} while (sh == NULL);
@@ -1022,6 +1021,18 @@ static int add_stripe_bio(struct stripe_
 	return 0;
 }
 
+int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
+{
+	int sectors_per_chunk = conf->chunk_size >> 9;
+	sector_t x = stripe;
+	int pd_idx, dd_idx;
+	int chunk_offset = sector_div(x, sectors_per_chunk);
+	stripe = x;
+	raid5_compute_sector(stripe*(disks-1)*sectors_per_chunk
+			     + chunk_offset, disks, disks-1, &dd_idx, &pd_idx, conf);
+	return pd_idx;
+}
+
 
 /*
  * handle_stripe - do things to a stripe.
@@ -1048,7 +1059,7 @@ static void handle_stripe(struct stripe_
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
-	int syncing;
+	int syncing, expanding, expanded;
 	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
 	int non_overwrite = 0;
 	int failed_num=0;
@@ -1063,6 +1074,8 @@ static void handle_stripe(struct stripe_
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
+	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
 	rcu_read_lock();
@@ -1255,13 +1268,14 @@ static void handle_stripe(struct stripe_
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks))) {
+	if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) {
 		for (i=disks; i--;) {
 			dev = &sh->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 			    (dev->toread ||
 			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
 			     syncing ||
+			     expanding ||
 			     (failed && (sh->dev[failed_num].toread ||
 					 (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
 				    )
@@ -1451,13 +1465,76 @@ static void handle_stripe(struct stripe_
 			set_bit(R5_Wantwrite, &dev->flags);
 			set_bit(R5_ReWrite, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
+			locked++;
 		} else {
 			/* let's read it back */
 			set_bit(R5_Wantread, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
+			locked++;
 		}
 	}
 
+	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+		/* Need to write out all blocks after computing parity */
+		sh->disks = conf->raid_disks;
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
+		compute_parity(sh, RECONSTRUCT_WRITE);
+		for (i= conf->raid_disks; i--;) {
+			set_bit(R5_LOCKED, &sh->dev[i].flags);
+			locked++;
+			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+		}
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+	} else if (expanded) {
+		clear_bit(STRIPE_EXPAND_READY, &sh->state);
+		wake_up(&conf->wait_for_overlap);
+		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+	}
+
+	if (expanding && locked == 0) {
+		/* We have read all the blocks in this stripe and now we need to
+		 * copy some of them into a target stripe for expand.
+		 */
+		clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+		for (i=0; i< sh->disks; i++)
+			if (i != sh->pd_idx) {
+				int dd_idx, pd_idx, j;
+				struct stripe_head *sh2;
+
+				sector_t bn = compute_blocknr(sh, i);
+				sector_t s = raid5_compute_sector(bn, conf->raid_disks,
+								  conf->raid_disks-1,
+								  &dd_idx, &pd_idx, conf);
+				sh2 = get_active_stripe(conf, s, conf->raid_disks, pd_idx, 1);
+				if (sh2 == NULL)
+					/* so far only the early blocks of this stripe
+					 * have been requested.  When later blocks
+					 * get requested, we will try again
+					 */
+					continue;
+				if(!test_bit(STRIPE_EXPANDING, &sh2->state) ||
+				   test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
+					/* must have already done this block */
+					release_stripe(sh2);
+					continue;
+				}
+				memcpy(page_address(sh2->dev[dd_idx].page),
+				       page_address(sh->dev[i].page),
+				       STRIPE_SIZE);
+				set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
+				set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
+				for (j=0; j<conf->raid_disks; j++)
+					if (j != sh2->pd_idx &&
+					    !test_bit(R5_Expanded, &sh2->dev[j].flags))
+						break;
+				if (j == conf->raid_disks) {
+					set_bit(STRIPE_EXPAND_READY, &sh2->state);
+					set_bit(STRIPE_HANDLE, &sh2->state);
+				}
+				release_stripe(sh2);
+			}
+	}
+
 	spin_unlock(&sh->lock);
 
 	while ((bi=return_bi)) {
@@ -1496,7 +1573,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (syncing)
+			if (syncing || expanding || expanded)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1744,12 +1821,8 @@ static sector_t sync_request(mddev_t *md
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
-	int sectors_per_chunk = conf->chunk_size >> 9;
-	sector_t x;
-	unsigned long stripe;
-	int chunk_offset;
-	int dd_idx, pd_idx;
-	sector_t first_sector;
+	int pd_idx;
+	sector_t first_sector, last_sector;
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
 	sector_t max_sector = mddev->size << 1;
@@ -1768,6 +1841,80 @@ static sector_t sync_request(mddev_t *md
 
 		return 0;
 	}
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+		/* reshaping is quite different to recovery/resync so it is
+		 * handled quite separately ... here.
+		 *
+		 * On each call to sync_request, we gather one chunk worth of
+		 * destination stripes and flag them as expanding.
+		 * Then we find all the source stripes and request reads.
+		 * As the reads complete, handle_stripe will copy the data
+		 * into the destination stripe and release that stripe.
+		 */
+		int i;
+		int dd_idx;
+		for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
+			int j;
+			int skipped = 0;
+			pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
+			sh = get_active_stripe(conf, sector_nr+i,
+					       conf->raid_disks, pd_idx, 0);
+			set_bit(STRIPE_EXPANDING, &sh->state);
+			/* If any of this stripe is beyond the end of the old
+			 * array, then we need to zero those blocks
+			 */
+			for (j=sh->disks; j--;) {
+				sector_t s;
+				if (j == sh->pd_idx)
+					continue;
+				s = compute_blocknr(sh, j);
+				if (s < (mddev->array_size<<1)) {
+					skipped = 1;
+					continue;
+				}
+				memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
+				set_bit(R5_Expanded, &sh->dev[j].flags);
+				set_bit(R5_UPTODATE, &sh->dev[j].flags);
+			}
+			if (!skipped) {
+				set_bit(STRIPE_EXPAND_READY, &sh->state);
+				set_bit(STRIPE_HANDLE, &sh->state);
+			}
+			release_stripe(sh);
+		}
+		spin_lock_irq(&conf->device_lock);
+		conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
+		spin_unlock_irq(&conf->device_lock);
+		/* Ok, those stripe are ready. We can start scheduling
+		 * reads on the source stripes.
+		 * The source stripes are determined by mapping the first and last
+		 * block on the destination stripes.
+		 */
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks - 1;
+		first_sector =
+			raid5_compute_sector(sector_nr*(conf->raid_disks-1),
+					     raid_disks, data_disks,
+					     &dd_idx, &pd_idx, conf);
+		last_sector =
+			raid5_compute_sector((sector_nr+conf->chunk_size/512)
+					       *(conf->raid_disks-1) -1,
+					     raid_disks, data_disks,
+					     &dd_idx, &pd_idx, conf);
+		if (last_sector >= (mddev->size<<1))
+			last_sector = (mddev->size<<1)-1;
+		while (first_sector <= last_sector) {
+			pd_idx = stripe_to_pdidx(first_sector, conf, conf->previous_raid_disks);
+			sh = get_active_stripe(conf, first_sector,
+					       conf->previous_raid_disks, pd_idx, 0);
+			set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+			set_bit(STRIPE_HANDLE, &sh->state);
+			release_stripe(sh);
+			first_sector += STRIPE_SECTORS;
+		}
+		return conf->chunk_size>>9;
+	}
 	/* if there is 1 or more failed drives and we are trying
 	 * to resync, then assert that we are finished, because there is
 	 * nothing we can do.
@@ -1786,13 +1933,7 @@ static sector_t sync_request(mddev_t *md
 		return sync_blocks * STRIPE_SECTORS; /* keep things rounded to whole stripes */
 	}
 
-	x = sector_nr;
-	chunk_offset = sector_div(x, sectors_per_chunk);
-	stripe = x;
-	BUG_ON(x != stripe);
-
-	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
-		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
 	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
 	if (sh == NULL) {
 		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-03-17 11:48:57.000000000 +1100
@@ -157,6 +157,9 @@ struct mddev_s
 	 * DONE:     thread is done and is waiting to be reaped
 	 * REQUEST:  user-space has requested a sync (used with SYNC)
 	 * CHECK:    user-space request for for check-only, no repair
+	 * RESHAPE:  A reshape is happening
+	 *
+	 * If neither SYNC or RESHAPE are set, then it is a recovery.
 	 */
 #define	MD_RECOVERY_RUNNING	0
 #define	MD_RECOVERY_SYNC	1
@@ -166,6 +169,7 @@ struct mddev_s
 #define	MD_RECOVERY_NEEDED	5
 #define	MD_RECOVERY_REQUESTED	6
 #define	MD_RECOVERY_CHECK	7
+#define MD_RECOVERY_RESHAPE	8
 	unsigned long			recovery;
 
 	int				in_sync;	/* know to not need resync */

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:48:56.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:57.000000000 +1100
@@ -157,6 +157,7 @@ struct stripe_head {
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
 
+#define	R5_Expanded	10	/* This block now has post-expand data */
 /*
  * Write method
  */
@@ -176,7 +177,8 @@ struct stripe_head {
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
 #define	STRIPE_EXPANDING	9
-
+#define	STRIPE_EXPAND_SOURCE	10
+#define	STRIPE_EXPAND_READY	11
 /*
  * Plugging:
  *

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 008 of 13] md: Final stages of raid5 expand code.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (6 preceding siblings ...)
  2006-03-17  4:47 ` [PATCH 007 of 13] md: Core of raid5 resize process NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  2006-03-17  4:48 ` [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape NeilBrown
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


This patch adds raid5_reshape and end_reshape which will
start and finish the reshape processes.

raid5_reshape is only enabled in CONFIG_MD_RAID5_RESHAPE is set,
to discourage accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.

and Make sure that you have backups, just in case.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/Kconfig      |   26 ++++++++++
 ./drivers/md/md.c         |    7 +-
 ./drivers/md/raid5.c      |  117 ++++++++++++++++++++++++++++++++++++++++++++++
 ./include/linux/raid/md.h |    3 -
 4 files changed, 149 insertions(+), 4 deletions(-)

diff ./drivers/md/Kconfig~current~ ./drivers/md/Kconfig
--- ./drivers/md/Kconfig~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./drivers/md/Kconfig	2006-03-17 11:48:57.000000000 +1100
@@ -127,6 +127,32 @@ config MD_RAID5
 
 	  If unsure, say Y.
 
+config MD_RAID5_RESHAPE
+	bool "Support adding drives to a raid-5 array (experimental)"
+	depends on MD_RAID5 && EXPERIMENTAL
+	---help---
+	  A RAID-5 set can be expanded by adding extra drives. This
+	  requires "restriping" the array which means (almost) every
+	  block must be written to a different place.
+
+          This option allows such restriping to be done while the array
+	  is online.  However it is still EXPERIMENTAL code.  It should
+	  work, but please be sure that you have backups.
+
+	  You will need a version of mdadm newer than 2.3.1.   During the
+	  early stage of reshape there is a critical section where live data
+	  is being over-written.  A crash during this time needs extra care
+	  for recovery.  The newer mdadm takes a copy of the data in the
+	  critical section and will restore it, if necessary, after a crash.
+
+	  The mdadm usage is e.g.
+	       mdadm --grow /dev/md1 --raid-disks=6
+	  to grow '/dev/md1' to having 6 disks.
+
+	  Note: The array can only be expanded, not contracted.
+	  There should be enough spares already present to make the new
+	  array workable.
+
 config MD_RAID6
 	tristate "RAID-6 mode"
 	depends on BLK_DEV_MD

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:57.000000000 +1100
@@ -159,12 +159,12 @@ static int start_readonly;
  */
 static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
 static atomic_t md_event_count;
-static void md_new_event(mddev_t *mddev)
+void md_new_event(mddev_t *mddev)
 {
 	atomic_inc(&md_event_count);
 	wake_up(&md_event_waiters);
 }
-
+EXPORT_SYMBOL_GPL(md_new_event);
 /*
  * Enables to iterate over all existing md arrays
  * all_mddevs_lock protects this list.
@@ -4463,7 +4463,7 @@ static DECLARE_WAIT_QUEUE_HEAD(resync_wa
 
 #define SYNC_MARKS	10
 #define	SYNC_MARK_STEP	(3*HZ)
-static void md_do_sync(mddev_t *mddev)
+void md_do_sync(mddev_t *mddev)
 {
 	mddev_t *mddev2;
 	unsigned int currspeed = 0,
@@ -4700,6 +4700,7 @@ static void md_do_sync(mddev_t *mddev)
 	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
 }
+EXPORT_SYMBOL_GPL(md_do_sync);
 
 
 /*

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:57.000000000 +1100
@@ -1021,6 +1021,8 @@ static int add_stripe_bio(struct stripe_
 	return 0;
 }
 
+static void end_reshape(raid5_conf_t *conf);
+
 int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 {
 	int sectors_per_chunk = conf->chunk_size >> 9;
@@ -1831,6 +1833,10 @@ static sector_t sync_request(mddev_t *md
 	if (sector_nr >= max_sector) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
+		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+			end_reshape(conf);
+			return 0;
+		}
 
 		if (mddev->curr_resync < max_sector) /* aborted */
 			bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
@@ -2451,6 +2457,114 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	int err;
+	mdk_rdev_t *rdev;
+	struct list_head *rtmp;
+	int spares = 0;
+	int added_devices = 0;
+
+	if (mddev->degraded ||
+	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+	if (conf->raid_disks > raid_disks)
+		return -EINVAL; /* Cannot shrink array yet */
+	if (conf->raid_disks == raid_disks)
+		return 0; /* nothing to do */
+
+	/* Can only proceed if there are plenty of stripe_heads.
+	 * We need a minimum of one full stripe,, and for sensible progress
+	 * it is best to have about 4 times that.
+	 * If we require 4 times, then the default 256 4K stripe_heads will
+	 * allow for chunk sizes up to 256K, which is probably OK.
+	 * If the chunk size is greater, user-space should request more
+	 * stripe_heads first.
+	 */
+	if ((mddev->chunk_size / STRIPE_SIZE) * 4 > conf->max_nr_stripes) {
+		printk(KERN_WARNING "raid5: reshape: not enough stripes.  Needed %lu\n",
+		       (mddev->chunk_size / STRIPE_SIZE)*4);
+		return -ENOSPC;
+	}
+
+	ITERATE_RDEV(mddev, rdev, rtmp)
+		if (rdev->raid_disk < 0 &&
+		    !test_bit(Faulty, &rdev->flags))
+			spares++;
+	if (conf->raid_disks + spares < raid_disks-1)
+		/* Not enough devices even to make a degraded array
+		 * of that size
+		 */
+		return -EINVAL;
+
+	err = resize_stripes(conf, raid_disks);
+	if (err)
+		return err;
+
+	spin_lock_irq(&conf->device_lock);
+	conf->previous_raid_disks = conf->raid_disks;
+	mddev->raid_disks = conf->raid_disks = raid_disks;
+	conf->expand_progress = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* Add some new drives, as many as will fit.
+	 * We know there are enough to make the newly sized array work.
+	 */
+	ITERATE_RDEV(mddev, rdev, rtmp)
+		if (rdev->raid_disk < 0 &&
+		    !test_bit(Faulty, &rdev->flags)) {
+			if (raid5_add_disk(mddev, rdev)) {
+				char nm[20];
+				set_bit(In_sync, &rdev->flags);
+				conf->working_disks++;
+				added_devices++;
+				sprintf(nm, "rd%d", rdev->raid_disk);
+				sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+			} else
+				break;
+		}
+
+	mddev->degraded = (raid_disks - conf->previous_raid_disks) - added_devices;
+	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+	mddev->sync_thread = md_register_thread(md_do_sync, mddev,
+						"%s_reshape");
+	if (!mddev->sync_thread) {
+		mddev->recovery = 0;
+		spin_lock_irq(&conf->device_lock);
+		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
+		conf->expand_progress = MaxSector;
+		spin_unlock_irq(&conf->device_lock);
+		return -EAGAIN;
+	}
+	md_wakeup_thread(mddev->sync_thread);
+	md_new_event(mddev);
+	return 0;
+}
+
+static void end_reshape(raid5_conf_t *conf)
+{
+	struct block_device *bdev;
+
+	conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+	set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+	conf->mddev->changed = 1;
+
+	bdev = bdget_disk(conf->mddev->gendisk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_inode->i_mutex);
+		i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+		mutex_unlock(&bdev->bd_inode->i_mutex);
+		bdput(bdev);
+	}
+	spin_lock_irq(&conf->device_lock);
+	conf->expand_progress = MaxSector;
+	spin_unlock_irq(&conf->device_lock);
+}
+
 static void raid5_quiesce(mddev_t *mddev, int state)
 {
 	raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -2489,6 +2603,9 @@ static struct mdk_personality raid5_pers
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+#if CONFIG_MD_RAID5_RESHAPE
+	.reshape	= raid5_reshape,
+#endif
 	.quiesce	= raid5_quiesce,
 };
 

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./include/linux/raid/md.h	2006-03-17 11:48:57.000000000 +1100
@@ -92,7 +92,8 @@ extern void md_super_write(mddev_t *mdde
 extern void md_super_wait(mddev_t *mddev);
 extern int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 			struct page *page, int rw);
-
+extern void md_do_sync(mddev_t *mddev);
+extern void md_new_event(mddev_t *mddev);
 
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (7 preceding siblings ...)
  2006-03-17  4:48 ` [PATCH 008 of 13] md: Final stages of raid5 expand code NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  2006-03-17  4:48 ` [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally NeilBrown
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


We allow the superblock to record an 'old' and a 'new'
geometry, and a position where any conversion is up to.
The geometry allows for changing chunksize, layout and
level as well as number of devices.

When using verion-0.90 superblock, we convert the version
to 0.91 while the conversion is happening so that an old
kernel will refuse the assemble the array.
For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape
and restart the reshape process if needed.
If the reshape stopped at an awkward time (like when updating
the first stripe) we refuse to assemble the array, and
let user-space worry about it.




Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c            |   69 ++++++++++++++++++++-
 ./drivers/md/raid1.c         |    5 +
 ./drivers/md/raid5.c         |  140 ++++++++++++++++++++++++++++++++++++-------
 ./include/linux/raid/md.h    |    2 
 ./include/linux/raid/md_k.h  |    8 ++
 ./include/linux/raid/md_p.h  |   32 ++++++++-
 ./include/linux/raid/raid5.h |    1 
 7 files changed, 231 insertions(+), 26 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:58.000000000 +1100
@@ -659,7 +659,8 @@ static int super_90_load(mdk_rdev_t *rde
 	}
 
 	if (sb->major_version != 0 ||
-	    sb->minor_version != 90) {
+	    sb->minor_version < 90 ||
+	    sb->minor_version > 91) {
 		printk(KERN_WARNING "Bad version number %d.%d on %s\n",
 			sb->major_version, sb->minor_version,
 			b);
@@ -744,6 +745,20 @@ static int super_90_validate(mddev_t *md
 		mddev->bitmap_offset = 0;
 		mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
 
+		if (mddev->minor_version >= 91) {
+			mddev->reshape_position = sb->reshape_position;
+			mddev->delta_disks = sb->delta_disks;
+			mddev->new_level = sb->new_level;
+			mddev->new_layout = sb->new_layout;
+			mddev->new_chunk = sb->new_chunk;
+		} else {
+			mddev->reshape_position = MaxSector;
+			mddev->delta_disks = 0;
+			mddev->new_level = mddev->level;
+			mddev->new_layout = mddev->layout;
+			mddev->new_chunk = mddev->chunk_size;
+		}
+
 		if (sb->state & (1<<MD_SB_CLEAN))
 			mddev->recovery_cp = MaxSector;
 		else {
@@ -838,7 +853,6 @@ static void super_90_sync(mddev_t *mddev
 
 	sb->md_magic = MD_SB_MAGIC;
 	sb->major_version = mddev->major_version;
-	sb->minor_version = mddev->minor_version;
 	sb->patch_version = mddev->patch_version;
 	sb->gvalid_words  = 0; /* ignored */
 	memcpy(&sb->set_uuid0, mddev->uuid+0, 4);
@@ -857,6 +871,17 @@ static void super_90_sync(mddev_t *mddev
 	sb->events_hi = (mddev->events>>32);
 	sb->events_lo = (u32)mddev->events;
 
+	if (mddev->reshape_position == MaxSector)
+		sb->minor_version = 90;
+	else {
+		sb->minor_version = 91;
+		sb->reshape_position = mddev->reshape_position;
+		sb->new_level = mddev->new_level;
+		sb->delta_disks = mddev->delta_disks;
+		sb->new_layout = mddev->new_layout;
+		sb->new_chunk = mddev->new_chunk;
+	}
+	mddev->minor_version = sb->minor_version;
 	if (mddev->in_sync)
 	{
 		sb->recovery_cp = mddev->recovery_cp;
@@ -1101,6 +1126,20 @@ static int super_1_validate(mddev_t *mdd
 			}
 			mddev->bitmap_offset = (__s32)le32_to_cpu(sb->bitmap_offset);
 		}
+		if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
+			mddev->reshape_position = le64_to_cpu(sb->reshape_position);
+			mddev->delta_disks = le32_to_cpu(sb->delta_disks);
+			mddev->new_level = le32_to_cpu(sb->new_level);
+			mddev->new_layout = le32_to_cpu(sb->new_layout);
+			mddev->new_chunk = le32_to_cpu(sb->new_chunk)<<9;
+		} else {
+			mddev->reshape_position = MaxSector;
+			mddev->delta_disks = 0;
+			mddev->new_level = mddev->level;
+			mddev->new_layout = mddev->layout;
+			mddev->new_chunk = mddev->chunk_size;
+		}
+
 	} else if (mddev->pers == NULL) {
 		/* Insist of good event counter while assembling */
 		__u64 ev1 = le64_to_cpu(sb->events);
@@ -1172,6 +1211,14 @@ static void super_1_sync(mddev_t *mddev,
 		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_offset);
 		sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
 	}
+	if (mddev->reshape_position != MaxSector) {
+		sb->feature_map |= cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE);
+		sb->reshape_position = cpu_to_le64(mddev->reshape_position);
+		sb->new_layout = cpu_to_le32(mddev->new_layout);
+		sb->delta_disks = cpu_to_le32(mddev->delta_disks);
+		sb->new_level = cpu_to_le32(mddev->new_level);
+		sb->new_chunk = cpu_to_le32(mddev->new_chunk>>9);
+	}
 
 	max_dev = 0;
 	ITERATE_RDEV(mddev,rdev2,tmp)
@@ -1492,7 +1539,7 @@ static void sync_sbs(mddev_t * mddev)
 	}
 }
 
-static void md_update_sb(mddev_t * mddev)
+void md_update_sb(mddev_t * mddev)
 {
 	int err;
 	struct list_head *tmp;
@@ -1569,6 +1616,7 @@ repeat:
 	wake_up(&mddev->sb_wait);
 
 }
+EXPORT_SYMBOL_GPL(md_update_sb);
 
 /* words written to sysfs files may, or my not, be \n terminated.
  * We want to accept with case. For this we use cmd_match.
@@ -2540,6 +2588,14 @@ static int do_md_run(mddev_t * mddev)
 	mddev->level = pers->level;
 	strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
 
+	if (mddev->reshape_position != MaxSector &&
+	    pers->reshape == NULL) {
+		/* This personality cannot handle reshaping... */
+		mddev->pers = NULL;
+		module_put(pers->owner);
+		return -EINVAL;
+	}
+
 	mddev->recovery = 0;
 	mddev->resync_max_sectors = mddev->size << 1; /* may be over-ridden by personality */
 	mddev->barriers_work = 1;
@@ -3428,11 +3484,18 @@ static int set_array_info(mddev_t * mdde
 	mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
 	mddev->bitmap_offset = 0;
 
+	mddev->reshape_position = MaxSector;
+
 	/*
 	 * Generate a 128 bit UUID
 	 */
 	get_random_bytes(mddev->uuid, 16);
 
+	mddev->new_level = mddev->level;
+	mddev->new_chunk = mddev->chunk_size;
+	mddev->new_layout = mddev->layout;
+	mddev->delta_disks = 0;
+
 	return 0;
 }
 

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./drivers/md/raid1.c	2006-03-17 11:48:58.000000000 +1100
@@ -1789,6 +1789,11 @@ static int run(mddev_t *mddev)
 		       mdname(mddev), mddev->level);
 		goto out;
 	}
+	if (mddev->reshape_position != MaxSector) {
+		printk("raid1: %s: reshape_position set but not supported\n",
+		       mdname(mddev));
+		goto out;
+	}
 	/*
 	 * copy the already verified devices into our private RAID1
 	 * bookkeeping area. [whatever we allocate in run(),

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:58.000000000 +1100
@@ -22,6 +22,7 @@
 #include <linux/raid/raid5.h>
 #include <linux/highmem.h>
 #include <linux/bitops.h>
+#include <linux/kthread.h>
 #include <asm/atomic.h>
 
 #include <linux/raid/bitmap.h>
@@ -1489,6 +1490,7 @@ static void handle_stripe(struct stripe_
 		clear_bit(STRIPE_EXPANDING, &sh->state);
 	} else if (expanded) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
+		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);
 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
@@ -1860,6 +1862,26 @@ static sector_t sync_request(mddev_t *md
 		 */
 		int i;
 		int dd_idx;
+
+		if (sector_nr == 0 &&
+		    conf->expand_progress != 0) {
+			/* restarting in the middle, skip the initial sectors */
+			sector_nr = conf->expand_progress;
+			sector_div(sector_nr, conf->raid_disks-1);
+			*skipped = 1;
+			return sector_nr;
+		}
+
+		/* Cannot proceed until we've updated the superblock... */
+		wait_event(conf->wait_for_overlap,
+			   atomic_read(&conf->reshape_stripes)==0);
+		mddev->reshape_position = conf->expand_progress;
+
+		mddev->sb_dirty = 1;
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait, mddev->sb_dirty == 0 ||
+			kthread_should_stop());
+
 		for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
 			int j;
 			int skipped = 0;
@@ -1867,6 +1889,7 @@ static sector_t sync_request(mddev_t *md
 			sh = get_active_stripe(conf, sector_nr+i,
 					       conf->raid_disks, pd_idx, 0);
 			set_bit(STRIPE_EXPANDING, &sh->state);
+			atomic_inc(&conf->reshape_stripes);
 			/* If any of this stripe is beyond the end of the old
 			 * array, then we need to zero those blocks
 			 */
@@ -2106,10 +2129,61 @@ static int run(mddev_t *mddev)
 		return -EIO;
 	}
 
+	if (mddev->reshape_position != MaxSector) {
+		/* Check that we can continue the reshape.
+		 * Currently only disks can change, it must
+		 * increase, and we must be past the point where
+		 * a stripe over-writes itself
+		 */
+		sector_t here_new, here_old, there_new;
+		int old_disks;
+
+		if (mddev->new_level != mddev->level ||
+		    mddev->new_layout != mddev->layout ||
+		    mddev->new_chunk != mddev->chunk_size) {
+			printk(KERN_ERR "raid5: %s: unsupported reshape required - aborting.\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+		if (mddev->delta_disks <= 0) {
+			printk(KERN_ERR "raid5: %s: unsupported reshape (reduce disks) required - aborting.\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+		old_disks = mddev->raid_disks - mddev->delta_disks;
+		/* reshape_position must be on a new-stripe boundary, and one
+		 * further up in new geometry must map after here in old geometry.
+		 */
+		here_new = mddev->reshape_position;
+		if (sector_div(here_new, (mddev->chunk_size>>9)*(mddev->raid_disks-1))) {
+			printk(KERN_ERR "raid5: reshape_position not on a stripe boundary\n");
+			return -EINVAL;
+		}
+		/* here_new is the stripe we will write to */
+		here_old = mddev->reshape_position;
+		sector_div(here_old, (mddev->chunk_size>>9)*(old_disks-1));
+		/* here_old is the first stripe that we might need to read from */
+		if (here_new >= here_old) {
+			/* Reading from the same stripe as writing to - bad */
+			printk(KERN_ERR "raid5: reshape_position too early for auto-recovery - aborting.\n");
+			return -EINVAL;
+		}
+		printk(KERN_INFO "raid5: reshape will continue\n");
+		/* OK, we should be able to continue; */
+	}
+
+
 	mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
-	conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+	if (mddev->reshape_position == MaxSector) {
+		conf->previous_raid_disks = conf->raid_disks = mddev->raid_disks;
+	} else {
+		conf->raid_disks = mddev->raid_disks;
+		conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks;
+	}
+
+	conf->disks = kzalloc(conf->raid_disks * sizeof(struct disk_info),
 			      GFP_KERNEL);
 	if (!conf->disks)
 		goto abort;
@@ -2133,7 +2207,7 @@ static int run(mddev_t *mddev)
 
 	ITERATE_RDEV(mddev,rdev,tmp) {
 		raid_disk = rdev->raid_disk;
-		if (raid_disk >= mddev->raid_disks
+		if (raid_disk >= conf->raid_disks
 		    || raid_disk < 0)
 			continue;
 		disk = conf->disks + raid_disk;
@@ -2149,7 +2223,6 @@ static int run(mddev_t *mddev)
 		}
 	}
 
-	conf->raid_disks = mddev->raid_disks;
 	/*
 	 * 0 for a fully functional array, 1 for a degraded array.
 	 */
@@ -2159,7 +2232,7 @@ static int run(mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
-	conf->expand_progress = MaxSector;
+	conf->expand_progress = mddev->reshape_position;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -2232,6 +2305,20 @@ static int run(mddev_t *mddev)
 
 	print_raid5_conf(conf);
 
+	if (conf->expand_progress != MaxSector) {
+		printk("...ok start reshape thread\n");
+		atomic_set(&conf->reshape_stripes, 0);
+		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+		mddev->sync_thread = md_register_thread(md_do_sync, mddev,
+							"%s_reshape");
+		/* FIXME if md_register_thread fails?? */
+		md_wakeup_thread(mddev->sync_thread);
+
+	}
+
 	/* read-ahead size must cover two whole stripes, which is
 	 * 2 * (n-1) * chunksize where 'n' is the number of raid devices
 	 */
@@ -2247,8 +2334,8 @@ static int run(mddev_t *mddev)
 
 	mddev->queue->unplug_fn = raid5_unplug_device;
 	mddev->queue->issue_flush_fn = raid5_issue_flush;
+	mddev->array_size =  mddev->size * (conf->previous_raid_disks - 1);
 
-	mddev->array_size =  mddev->size * (mddev->raid_disks - 1);
 	return 0;
 abort:
 	if (conf) {
@@ -2421,7 +2508,7 @@ static int raid5_add_disk(mddev_t *mddev
 	/*
 	 * find the disk ...
 	 */
-	for (disk=0; disk < mddev->raid_disks; disk++)
+	for (disk=0; disk < conf->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
 			clear_bit(In_sync, &rdev->flags);
 			rdev->raid_disk = disk;
@@ -2502,9 +2589,10 @@ static int raid5_reshape(mddev_t *mddev,
 	if (err)
 		return err;
 
+	atomic_set(&conf->reshape_stripes, 0);
 	spin_lock_irq(&conf->device_lock);
 	conf->previous_raid_disks = conf->raid_disks;
-	mddev->raid_disks = conf->raid_disks = raid_disks;
+	conf->raid_disks = raid_disks;
 	conf->expand_progress = 0;
 	spin_unlock_irq(&conf->device_lock);
 
@@ -2526,6 +2614,14 @@ static int raid5_reshape(mddev_t *mddev,
 		}
 
 	mddev->degraded = (raid_disks - conf->previous_raid_disks) - added_devices;
+	mddev->new_chunk = mddev->chunk_size;
+	mddev->new_layout = mddev->layout;
+	mddev->new_level = mddev->level;
+	mddev->raid_disks = raid_disks;
+	mddev->delta_disks = raid_disks - conf->previous_raid_disks;
+	mddev->reshape_position = 0;
+	mddev->sb_dirty = 1;
+
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
@@ -2536,6 +2632,7 @@ static int raid5_reshape(mddev_t *mddev,
 		mddev->recovery = 0;
 		spin_lock_irq(&conf->device_lock);
 		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
+		mddev->delta_disks = 0;
 		conf->expand_progress = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
 		return -EAGAIN;
@@ -2549,20 +2646,23 @@ static void end_reshape(raid5_conf_t *co
 {
 	struct block_device *bdev;
 
-	conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
-	set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
-	conf->mddev->changed = 1;
-
-	bdev = bdget_disk(conf->mddev->gendisk, 0);
-	if (bdev) {
-		mutex_lock(&bdev->bd_inode->i_mutex);
-		i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
-		mutex_unlock(&bdev->bd_inode->i_mutex);
-		bdput(bdev);
+	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
+		conf->mddev->array_size = conf->mddev->size * (conf->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			mutex_lock(&bdev->bd_inode->i_mutex);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			mutex_unlock(&bdev->bd_inode->i_mutex);
+			bdput(bdev);
+		}
+		spin_lock_irq(&conf->device_lock);
+		conf->expand_progress = MaxSector;
+		spin_unlock_irq(&conf->device_lock);
+		conf->mddev->reshape_position = MaxSector;
 	}
-	spin_lock_irq(&conf->device_lock);
-	conf->expand_progress = MaxSector;
-	spin_unlock_irq(&conf->device_lock);
 }
 
 static void raid5_quiesce(mddev_t *mddev, int state)

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./include/linux/raid/md.h	2006-03-17 11:48:58.000000000 +1100
@@ -95,6 +95,8 @@ extern int sync_page_io(struct block_dev
 extern void md_do_sync(mddev_t *mddev);
 extern void md_new_event(mddev_t *mddev);
 
+extern void md_update_sb(mddev_t * mddev);
+
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 
 #endif 

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-03-17 11:48:58.000000000 +1100
@@ -132,6 +132,14 @@ struct mddev_s
 
 	char				uuid[16];
 
+	/* If the array is being reshaped, we need to record the
+	 * new shape and an indication of where we are up to.
+	 * This is written to the superblock.
+	 * If reshape_position is MaxSector, then no reshape is happening (yet).
+	 */
+	sector_t			reshape_position;
+	int				delta_disks, new_level, new_layout, new_chunk;
+
 	struct mdk_thread_s		*thread;	/* management thread */
 	struct mdk_thread_s		*sync_thread;	/* doing resync or reconstruct */
 	sector_t			curr_resync;	/* blocks scheduled */

diff ./include/linux/raid/md_p.h~current~ ./include/linux/raid/md_p.h
--- ./include/linux/raid/md_p.h~current~	2006-03-17 11:45:43.000000000 +1100
+++ ./include/linux/raid/md_p.h	2006-03-17 11:48:58.000000000 +1100
@@ -102,6 +102,18 @@ typedef struct mdp_device_descriptor_s {
 #define MD_SB_ERRORS		1
 
 #define	MD_SB_BITMAP_PRESENT	8 /* bitmap may be present nearby */
+
+/*
+ * Notes:
+ * - if an array is being reshaped (restriped) in order to change the
+ *   the number of active devices in the array, 'raid_disks' will be
+ *   the larger of the old and new numbers.  'delta_disks' will
+ *   be the "new - old".  So if +ve, raid_disks is the new value, and
+ *   "raid_disks-delta_disks" is the old.  If -ve, raid_disks is the
+ *   old value and "raid_disks+delta_disks" is the new (smaller) value.
+ */
+
+
 typedef struct mdp_superblock_s {
 	/*
 	 * Constant generic information
@@ -146,7 +158,13 @@ typedef struct mdp_superblock_s {
 	__u32 cp_events_hi;	/* 10 high-order of checkpoint update count   */
 #endif
 	__u32 recovery_cp;	/* 11 recovery checkpoint sector count	      */
-	__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 12];
+	/* There are only valid for minor_version > 90 */
+	__u64 reshape_position;	/* 12,13 next address in array-space for reshape */
+	__u32 new_level;	/* 14 new level we are reshaping to	      */
+	__u32 delta_disks;	/* 15 change in number of raid_disks	      */
+	__u32 new_layout;	/* 16 new layout			      */
+	__u32 new_chunk;	/* 17 new chunk size (bytes)		      */
+	__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 18];
 
 	/*
 	 * Personality information
@@ -207,7 +225,14 @@ struct mdp_superblock_1 {
 				 * NOTE: signed, so bitmap can be before superblock
 				 * only meaningful of feature_map[0] is set.
 				 */
-	__u8	pad1[128-100];	/* set to 0 when written */
+
+	/* These are only valid with feature bit '4' */
+	__u64	reshape_position;	/* next address in array-space for reshape */
+	__u32	new_level;	/* new level we are reshaping to		*/
+	__u32	delta_disks;	/* change in number of raid_disks		*/
+	__u32	new_layout;	/* new layout					*/
+	__u32	new_chunk;	/* new chunk size (bytes)			*/
+	__u8	pad1[128-124];	/* set to 0 when written */
 
 	/* constant this-device information - 64 bytes */
 	__u64	data_offset;	/* sector start of data, often 0 */
@@ -240,8 +265,9 @@ struct mdp_superblock_1 {
 
 /* feature_map bits */
 #define MD_FEATURE_BITMAP_OFFSET	1
+#define	MD_FEATURE_RESHAPE_ACTIVE	4
 
-#define	MD_FEATURE_ALL			1
+#define	MD_FEATURE_ALL			5
 
 #endif 
 

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:48:57.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:58.000000000 +1100
@@ -224,6 +224,7 @@ struct raid5_private_data {
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
+	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.
 	 */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (8 preceding siblings ...)
  2006-03-17  4:48 ` [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  2006-03-17  6:17   ` Andrew Morton
  2006-03-17  4:48 ` [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape NeilBrown
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Instead of checkpointing at each stripe, only checkpoint
when a new write would overwrite uncheckpointed data.
Block any write to the uncheckpointed area.
Arbitrarily checkpoint at least every 3Meg.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   53 ++++++++++++++++++++++++++++++++++---------
 ./include/linux/raid/raid5.h |    3 ++
 2 files changed, 45 insertions(+), 11 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:58.000000000 +1100
@@ -1747,8 +1747,9 @@ static int make_request(request_queue_t 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int disks;
-		
+
 	retry:
+		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		if (likely(conf->expand_progress == MaxSector))
 			disks = conf->raid_disks;
 		else {
@@ -1756,6 +1757,13 @@ static int make_request(request_queue_t 
 			disks = conf->raid_disks;
 			if (logical_sector >= conf->expand_progress)
 				disks = conf->previous_raid_disks;
+			else {
+				if (logical_sector >= conf->expand_lo) {
+					spin_unlock_irq(&conf->device_lock);
+					schedule();
+					goto retry;
+				}
+			}
 			spin_unlock_irq(&conf->device_lock);
 		}
  		new_sector = raid5_compute_sector(logical_sector, disks, disks - 1,
@@ -1764,7 +1772,6 @@ static int make_request(request_queue_t 
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
 		if (sh) {
 			if (unlikely(conf->expand_progress != MaxSector)) {
@@ -1862,6 +1869,7 @@ static sector_t sync_request(mddev_t *md
 		 */
 		int i;
 		int dd_idx;
+		sector_t writepos, safepos, gap;
 
 		if (sector_nr == 0 &&
 		    conf->expand_progress != 0) {
@@ -1872,15 +1880,36 @@ static sector_t sync_request(mddev_t *md
 			return sector_nr;
 		}
 
-		/* Cannot proceed until we've updated the superblock... */
-		wait_event(conf->wait_for_overlap,
-			   atomic_read(&conf->reshape_stripes)==0);
-		mddev->reshape_position = conf->expand_progress;
-
-		mddev->sb_dirty = 1;
-		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait, mddev->sb_dirty == 0 ||
-			kthread_should_stop());
+		/* we update the metadata when there is more than 3Meg
+		 * in the block range (that is rather arbitrary, should
+		 * probably be time based) or when the data about to be
+		 * copied would over-write the source of the data at
+		 * the front of the range.
+		 * i.e. one new_stripe forward from expand_progress new_maps
+		 * to after where expand_lo old_maps to
+		 */
+		writepos = conf->expand_progress +
+			conf->chunk_size/512*(conf->raid_disks-1);
+		sector_div(writepos, conf->raid_disks-1);
+		safepos = conf->expand_lo;
+		sector_div(safepos, conf->previous_raid_disks-1);
+		gap = conf->expand_progress - conf->expand_lo;
+
+		if (writepos >= safepos ||
+		    gap > (conf->raid_disks-1)*3000*2 /*3Meg*/) {
+			/* Cannot proceed until we've updated the superblock... */
+			wait_event(conf->wait_for_overlap,
+				   atomic_read(&conf->reshape_stripes)==0);
+			mddev->reshape_position = conf->expand_progress;
+			mddev->sb_dirty = 1;
+			md_wakeup_thread(mddev->thread);
+			wait_event(mddev->sb_wait, mddev->sb_dirty == 0 ||
+				   kthread_should_stop());
+			spin_lock_irq(&conf->device_lock);
+			conf->expand_lo = mddev->reshape_position;
+			spin_unlock_irq(&conf->device_lock);
+			wake_up(&conf->wait_for_overlap);
+		}
 
 		for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
 			int j;
@@ -2307,6 +2336,7 @@ static int run(mddev_t *mddev)
 
 	if (conf->expand_progress != MaxSector) {
 		printk("...ok start reshape thread\n");
+		conf->expand_lo = conf->expand_progress;
 		atomic_set(&conf->reshape_stripes, 0);
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -2594,6 +2624,7 @@ static int raid5_reshape(mddev_t *mddev,
 	conf->previous_raid_disks = conf->raid_disks;
 	conf->raid_disks = raid_disks;
 	conf->expand_progress = 0;
+	conf->expand_lo = 0;
 	spin_unlock_irq(&conf->device_lock);
 
 	/* Add some new drives, as many as will fit.

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-03-17 11:48:58.000000000 +1100
@@ -217,6 +217,9 @@ struct raid5_private_data {
 
 	/* used during an expand */
 	sector_t		expand_progress;	/* MaxSector when no expand happening */
+	sector_t		expand_lo; /* from here up to expand_progress it out-of-bounds
+					    * as we haven't flushed the metadata yet
+					    */
 	int			previous_raid_disks;
 
 	struct list_head	handle_list; /* stripes needing handling */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (9 preceding siblings ...)
  2006-03-17  4:48 ` [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  2006-03-17  4:48 ` [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action NeilBrown
  2006-03-17  4:48 ` [PATCH 013 of 13] md: Support suspending of IO to regions of an md array NeilBrown
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


check_reshape checks validity and does things that can be done
instantly - like adding devices to raid1.
start_reshape initiates a restriping process to convert the whole array.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   10 ++++---
 ./drivers/md/raid1.c        |   19 +++++++++++--
 ./drivers/md/raid5.c        |   60 ++++++++++++++++++++++++--------------------
 ./include/linux/raid/md_k.h |    3 +-
 4 files changed, 58 insertions(+), 34 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:59.000000000 +1100
@@ -2589,7 +2589,7 @@ static int do_md_run(mddev_t * mddev)
 	strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
 
 	if (mddev->reshape_position != MaxSector &&
-	    pers->reshape == NULL) {
+	    pers->start_reshape == NULL) {
 		/* This personality cannot handle reshaping... */
 		mddev->pers = NULL;
 		module_put(pers->owner);
@@ -3551,14 +3551,16 @@ static int update_raid_disks(mddev_t *md
 {
 	int rv;
 	/* change the number of raid disks */
-	if (mddev->pers->reshape == NULL)
+	if (mddev->pers->check_reshape == NULL)
 		return -EINVAL;
 	if (raid_disks <= 0 ||
 	    raid_disks >= mddev->max_disks)
 		return -EINVAL;
-	if (mddev->sync_thread)
+	if (mddev->sync_thread || mddev->reshape_position != MaxSector)
 		return -EBUSY;
-	rv = mddev->pers->reshape(mddev, raid_disks);
+	mddev->delta_disks = raid_disks - mddev->raid_disks;
+
+	rv = mddev->pers->check_reshape(mddev);
 	return rv;
 }
 

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./drivers/md/raid1.c	2006-03-17 11:48:59.000000000 +1100
@@ -1976,7 +1976,7 @@ static int raid1_resize(mddev_t *mddev, 
 	return 0;
 }
 
-static int raid1_reshape(mddev_t *mddev, int raid_disks)
+static int raid1_reshape(mddev_t *mddev)
 {
 	/* We need to:
 	 * 1/ resize the r1bio_pool
@@ -1993,10 +1993,22 @@ static int raid1_reshape(mddev_t *mddev,
 	struct pool_info *newpoolinfo;
 	mirror_info_t *newmirrors;
 	conf_t *conf = mddev_to_conf(mddev);
-	int cnt;
+	int cnt, raid_disks;
 
 	int d, d2;
 
+	/* Cannot change chunk_size, layout, or level */
+	if (mddev->chunk_size != mddev->new_chunk ||
+	    mddev->layout != mddev->new_layout ||
+	    mddev->level != mddev->new_level) {
+		mddev->new_chunk = mddev->chunk_size;
+		mddev->new_layout = mddev->layout;
+		mddev->new_level = mddev->level;
+		return -EINVAL;
+	}
+
+	raid_disks = mddev->raid_disks + mddev->delta_disks;
+
 	if (raid_disks < conf->raid_disks) {
 		cnt=0;
 		for (d= 0; d < conf->raid_disks; d++)
@@ -2043,6 +2055,7 @@ static int raid1_reshape(mddev_t *mddev,
 
 	mddev->degraded += (raid_disks - conf->raid_disks);
 	conf->raid_disks = mddev->raid_disks = raid_disks;
+	mddev->delta_disks = 0;
 
 	conf->last_used = 0; /* just make sure it is in-range */
 	lower_barrier(conf);
@@ -2084,7 +2097,7 @@ static struct mdk_personality raid1_pers
 	.spare_active	= raid1_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid1_resize,
-	.reshape	= raid1_reshape,
+	.check_reshape	= raid1_reshape,
 	.quiesce	= raid1_quiesce,
 };
 

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:59.000000000 +1100
@@ -2574,21 +2574,15 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
-static int raid5_reshape(mddev_t *mddev, int raid_disks)
+static int raid5_check_reshape(mddev_t *mddev)
 {
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	int err;
-	mdk_rdev_t *rdev;
-	struct list_head *rtmp;
-	int spares = 0;
-	int added_devices = 0;
 
-	if (mddev->degraded ||
-	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-		return -EBUSY;
-	if (conf->raid_disks > raid_disks)
-		return -EINVAL; /* Cannot shrink array yet */
-	if (conf->raid_disks == raid_disks)
+	if (mddev->delta_disks < 0 ||
+	    mddev->new_level != mddev->level)
+		return -EINVAL; /* Cannot shrink array or change level yet */
+	if (mddev->delta_disks == 0)
 		return 0; /* nothing to do */
 
 	/* Can only proceed if there are plenty of stripe_heads.
@@ -2599,30 +2593,48 @@ static int raid5_reshape(mddev_t *mddev,
 	 * If the chunk size is greater, user-space should request more
 	 * stripe_heads first.
 	 */
-	if ((mddev->chunk_size / STRIPE_SIZE) * 4 > conf->max_nr_stripes) {
+	if ((mddev->chunk_size / STRIPE_SIZE) * 4 > conf->max_nr_stripes ||
+	    (mddev->new_chunk / STRIPE_SIZE) * 4 > conf->max_nr_stripes) {
 		printk(KERN_WARNING "raid5: reshape: not enough stripes.  Needed %lu\n",
 		       (mddev->chunk_size / STRIPE_SIZE)*4);
 		return -ENOSPC;
 	}
 
+	err = resize_stripes(conf, conf->raid_disks + mddev->delta_disks);
+	if (err)
+		return err;
+
+	/* looks like we might be able to manage this */
+	return 0;
+}
+
+static int raid5_start_reshape(mddev_t *mddev)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	mdk_rdev_t *rdev;
+	struct list_head *rtmp;
+	int spares = 0;
+	int added_devices = 0;
+
+	if (mddev->degraded ||
+	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+
 	ITERATE_RDEV(mddev, rdev, rtmp)
 		if (rdev->raid_disk < 0 &&
 		    !test_bit(Faulty, &rdev->flags))
 			spares++;
-	if (conf->raid_disks + spares < raid_disks-1)
+
+	if (spares < mddev->delta_disks-1)
 		/* Not enough devices even to make a degraded array
 		 * of that size
 		 */
 		return -EINVAL;
 
-	err = resize_stripes(conf, raid_disks);
-	if (err)
-		return err;
-
 	atomic_set(&conf->reshape_stripes, 0);
 	spin_lock_irq(&conf->device_lock);
 	conf->previous_raid_disks = conf->raid_disks;
-	conf->raid_disks = raid_disks;
+	conf->raid_disks += mddev->delta_disks;
 	conf->expand_progress = 0;
 	conf->expand_lo = 0;
 	spin_unlock_irq(&conf->device_lock);
@@ -2644,12 +2656,8 @@ static int raid5_reshape(mddev_t *mddev,
 				break;
 		}
 
-	mddev->degraded = (raid_disks - conf->previous_raid_disks) - added_devices;
-	mddev->new_chunk = mddev->chunk_size;
-	mddev->new_layout = mddev->layout;
-	mddev->new_level = mddev->level;
-	mddev->raid_disks = raid_disks;
-	mddev->delta_disks = raid_disks - conf->previous_raid_disks;
+	mddev->degraded = (conf->raid_disks - conf->previous_raid_disks) - added_devices;
+	mddev->raid_disks = conf->raid_disks;
 	mddev->reshape_position = 0;
 	mddev->sb_dirty = 1;
 
@@ -2663,7 +2671,6 @@ static int raid5_reshape(mddev_t *mddev,
 		mddev->recovery = 0;
 		spin_lock_irq(&conf->device_lock);
 		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
-		mddev->delta_disks = 0;
 		conf->expand_progress = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
 		return -EAGAIN;
@@ -2735,7 +2742,8 @@ static struct mdk_personality raid5_pers
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
 #if CONFIG_MD_RAID5_RESHAPE
-	.reshape	= raid5_reshape,
+	.check_reshape	= raid5_check_reshape,
+	.start_reshape	= raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
 };

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2006-03-17 11:48:58.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-03-17 11:48:59.000000000 +1100
@@ -261,7 +261,8 @@ struct mdk_personality
 	int (*spare_active) (mddev_t *mddev);
 	sector_t (*sync_request)(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster);
 	int (*resize) (mddev_t *mddev, sector_t sectors);
-	int (*reshape) (mddev_t *mddev, int raid_disks);
+	int (*check_reshape) (mddev_t *mddev);
+	int (*start_reshape) (mddev_t *mddev);
 	int (*reconfig) (mddev_t *mddev, int layout, int chunk_size);
 	/* quiesce moves between quiescence states
 	 * 0 - fully active

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (10 preceding siblings ...)
  2006-03-17  4:48 ` [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  2006-03-17  4:48 ` [PATCH 013 of 13] md: Support suspending of IO to regions of an md array NeilBrown
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


This allows reshape to be triggerred via sysfs (which is the
only way to start it happening).

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:59.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:59.000000000 +1100
@@ -2242,7 +2242,14 @@ action_store(mddev_t *mddev, const char 
 		return -EBUSY;
 	else if (cmd_match(page, "resync") || cmd_match(page, "recover"))
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-	else {
+	else if (cmd_match(page, "reshape")) {
+		int err;
+		if (mddev->pers->start_reshape == NULL)
+			return -EINVAL;
+		err = mddev->pers->start_reshape(mddev);
+		if (err)
+			return err;
+	} else {
 		if (cmd_match(page, "check"))
 			set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		else if (cmd_match(page, "repair"))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 013 of 13] md: Support suspending of IO to regions of an md array.
  2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
                   ` (11 preceding siblings ...)
  2006-03-17  4:48 ` [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action NeilBrown
@ 2006-03-17  4:48 ` NeilBrown
  12 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2006-03-17  4:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


This allows user-space to access data safely.
This is needed for raid5 reshape as user-space needs to take
a backup of the first few stripes before allowing reshape
to commense.
It will also be useful in cluster-aware raid1 configurations
so that all cluster members can leave a section of the array untouched
while a resync/recovery happens.

A 'start' and 'end' of the suspended range are written to 2 sysfs
attributes.  Note that only one range can be suspended at a time.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   59 ++++++++++++++++++++++++++++++++++++++++++++
 ./drivers/md/raid5.c        |   14 ++++++++++
 ./include/linux/raid/md_k.h |    4 ++
 3 files changed, 77 insertions(+)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-17 11:48:59.000000000 +1100
+++ ./drivers/md/md.c	2006-03-17 11:48:59.000000000 +1100
@@ -2360,6 +2360,63 @@ sync_completed_show(mddev_t *mddev, char
 static struct md_sysfs_entry
 md_sync_completed = __ATTR_RO(sync_completed);
 
+static ssize_t
+suspend_lo_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo);
+}
+
+static ssize_t
+suspend_lo_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long long new = simple_strtoull(buf, &e, 10);
+
+	if (mddev->pers->quiesce == NULL)
+		return -EINVAL;
+	if (buf == e || (*e && *e != '\n'))
+		return -EINVAL;
+	if (new >= mddev->suspend_hi ||
+	    (new > mddev->suspend_lo && new < mddev->suspend_hi)) {
+		mddev->suspend_lo = new;
+		mddev->pers->quiesce(mddev, 2);
+		return len;
+	} else
+		return -EINVAL;
+}
+static struct md_sysfs_entry md_suspend_lo =
+__ATTR(suspend_lo, S_IRUGO|S_IWUSR, suspend_lo_show, suspend_lo_store);
+
+
+static ssize_t
+suspend_hi_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_hi);
+}
+
+static ssize_t
+suspend_hi_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long long new = simple_strtoull(buf, &e, 10);
+
+	if (mddev->pers->quiesce == NULL)
+		return -EINVAL;
+	if (buf == e || (*e && *e != '\n'))
+		return -EINVAL;
+	if ((new <= mddev->suspend_lo && mddev->suspend_lo >= mddev->suspend_hi) ||
+	    (new > mddev->suspend_lo && new > mddev->suspend_hi)) {
+		mddev->suspend_hi = new;
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
+		return len;
+	} else
+		return -EINVAL;
+}
+static struct md_sysfs_entry md_suspend_hi =
+__ATTR(suspend_hi, S_IRUGO|S_IWUSR, suspend_hi_show, suspend_hi_store);
+
+
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_raid_disks.attr,
@@ -2377,6 +2434,8 @@ static struct attribute *md_redundancy_a
 	&md_sync_max.attr,
 	&md_sync_speed.attr,
 	&md_sync_completed.attr,
+	&md_suspend_lo.attr,
+	&md_suspend_hi.attr,
 	NULL,
 };
 static struct attribute_group md_redundancy_group = {

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:59.000000000 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:59.000000000 +1100
@@ -1790,6 +1790,15 @@ static int make_request(request_queue_t 
 					goto retry;
 				}
 			}
+			/* FIXME what if we get a false positive because these
+			 * are being updated.
+			 */
+			if (logical_sector >= mddev->suspend_lo &&
+			    logical_sector < mddev->suspend_hi) {
+				release_stripe(sh);
+				schedule();
+				goto retry;
+			}
 
 			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
 			    !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
@@ -2708,6 +2717,10 @@ static void raid5_quiesce(mddev_t *mddev
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 
 	switch(state) {
+	case 2: /* resume for a suspend */
+		wake_up(&conf->wait_for_overlap);
+		break;
+
 	case 1: /* stop all writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 1;
@@ -2721,6 +2734,7 @@ static void raid5_quiesce(mddev_t *mddev
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 0;
 		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_overlap);
 		spin_unlock_irq(&conf->device_lock);
 		break;
 	}

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2006-03-17 11:48:59.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-03-17 11:48:59.000000000 +1100
@@ -151,6 +151,10 @@ struct mddev_s
 	sector_t			resync_mismatches; /* count of sectors where
 							    * parity/replica mismatch found
 							    */
+
+	/* allow user-space to request suspension of IO to regions of the array */
+	sector_t			suspend_lo;
+	sector_t			suspend_hi;
 	/* if zero, use the system-wide default */
 	int				sync_speed_min;
 	int				sync_speed_max;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
@ 2006-03-17  5:50   ` Andrew Morton
  2006-03-17  6:04     ` Neil Brown
  2006-03-17  5:53   ` Andrew Morton
  2006-03-17  5:57   ` Andrew Morton
  2 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  5:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
> +static int resize_stripes(raid5_conf_t *conf, int newsize)
>  +{
>  +	/* make all the stripes able to hold 'newsize' devices.
>  +	 * New slots in each stripe get 'page' set to a new page.
>  +	 * We allocate all the new stripes first, then if that succeeds,
>  +	 * copy everything across.
>  +	 * Finally we add new pages.  This could fail, but we leave
>  +	 * the stripe cache at it's new size, just with some pages empty.
>  +	 *
>  +	 * We use GFP_NOIO allocations as IO to the raid5 is blocked
>  +	 * at some points in this operation.
>  +	 */
>  +	struct stripe_head *osh, *nsh;
>  +	struct list_head newstripes, oldstripes;

You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().


>  +	struct disk_info *ndisks;
>  +	int err = 0;
>  +	kmem_cache_t *sc;
>  +	int i;
>  +
>  +	if (newsize <= conf->pool_size)
>  +		return 0; /* never bother to shrink */
>  +
>  +	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
>  +			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
>  +			       0, 0, NULL, NULL);

kmem_cache_create() internally does a GFP_KERNEL allocation.

>  +	if (!sc)
>  +		return -ENOMEM;
>  +	INIT_LIST_HEAD(&newstripes);
>  +	for (i = conf->max_nr_stripes; i; i--) {
>  +		nsh = kmem_cache_alloc(sc, GFP_NOIO);

So either this can use GFP_KERNEL, or we have a problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
  2006-03-17  5:50   ` Andrew Morton
@ 2006-03-17  5:53   ` Andrew Morton
  2006-03-17  5:57   ` Andrew Morton
  2 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  5:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
> +		wait_event_lock_irq(conf->wait_for_stripe,
>  +				    !list_empty(&conf->inactive_list),
>  +				    conf->device_lock,
>  +				    unplug_slaves(conf->mddev);
>  +			);

Boy, that's an ugly-looking thing, isn't it?

__wait_event_lock_irq() already puts a semicolon after `cmd' so I think the
one here isn't needed, which would make it a bit less of a surprise to look
at.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
  2006-03-17  5:50   ` Andrew Morton
  2006-03-17  5:53   ` Andrew Morton
@ 2006-03-17  5:57   ` Andrew Morton
  2006-03-17  6:24     ` Neil Brown
  2 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  5:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
> +	/* Got them all.
>  +	 * Return the new ones and free the old ones.
>  +	 * At this point, we are holding all the stripes so the array
>  +	 * is completely stalled, so now is a good time to resize
>  +	 * conf->disks.
>  +	 */
>  +	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
>  +	if (ndisks) {
>  +		for (i=0; i<conf->raid_disks; i++)
>  +			ndisks[i] = conf->disks[i];
>  +		kfree(conf->disks);
>  +		conf->disks = ndisks;
>  +	} else
>  +		err = -ENOMEM;
>  +	while(!list_empty(&newstripes)) {
>  +		nsh = list_entry(newstripes.next, struct stripe_head, lru);
>  +		list_del_init(&nsh->lru);
>  +		for (i=conf->raid_disks; i < newsize; i++)
>  +			if (nsh->dev[i].page == NULL) {
>  +				struct page *p = alloc_page(GFP_NOIO);
>  +				nsh->dev[i].page = p;
>  +				if (!p)
>  +					err = -ENOMEM;
>  +			}
>  +		release_stripe(nsh);
>  +	}
>  +	while(!list_empty(&oldstripes)) {
>  +		osh = list_entry(oldstripes.next, struct stripe_head, lru);
>  +		list_del(&osh->lru);
>  +		kmem_cache_free(conf->slab_cache, osh);
>  +	}
>  +	kmem_cache_destroy(conf->slab_cache);
>  +	conf->slab_cache = sc;
>  +	conf->active_name = 1-conf->active_name;
>  +	conf->pool_size = newsize;
>  +	return err;
>  +}

Are you sure the -ENOMEM handling here is solid?  It looks.... strange.

There are a few more GFP_NOIOs in this function, which can possibly become
GFP_KERNEL.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.
  2006-03-17  4:47 ` [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
@ 2006-03-17  6:01   ` Andrew Morton
  2006-03-17  6:17     ` Neil Brown
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  6:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
>  -	retry:
>   		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
>  -		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
>  +		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
>   		if (sh) {
>  -			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
>  -				/* Add failed due to overlap.  Flush everything
>  +			if (unlikely(conf->expand_progress != MaxSector)) {
>  +				/* expansion might have moved on while waiting for a
>  +				 * stripe, so we much do the range check again.
>  +				 */
>  +				int must_retry = 0;
>  +				spin_lock_irq(&conf->device_lock);
>  +				if (logical_sector <  conf->expand_progress &&
>  +				    disks == conf->previous_raid_disks)
>  +					/* mismatch, need to try again */
>  +					must_retry = 1;
>  +				spin_unlock_irq(&conf->device_lock);
>  +				if (must_retry) {
>  +					release_stripe(sh);
>  +					goto retry;
>  +				}
>  +			}

The locking in here looks strange.  We take the lock, do some arithmetic
and some tests and then drop the lock again.  Is it not possible that the
result of those tests now becomes invalid?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 007 of 13] md: Core of raid5 resize process
  2006-03-17  4:47 ` [PATCH 007 of 13] md: Core of raid5 resize process NeilBrown
@ 2006-03-17  6:03   ` Andrew Morton
  2006-03-17  7:10     ` Neil Brown
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  6:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
> @@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
>   		 */
>   		max_sectors = mddev->resync_max_sectors;
>   		mddev->resync_mismatches = 0;
>  -	} else
>  +	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
>  +		max_sectors = mddev->size << 1;
>  +	else
>   		/* recovery follows the physical size of devices */
>   		max_sectors = mddev->size << 1;
>   

This change is a no-op.   Intentional?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  5:50   ` Andrew Morton
@ 2006-03-17  6:04     ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2006-03-17  6:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

On Thursday March 16, akpm@osdl.org wrote:
> NeilBrown <neilb@suse.de> wrote:
> >
> > +static int resize_stripes(raid5_conf_t *conf, int newsize)
> >  +{
> >  +	/* make all the stripes able to hold 'newsize' devices.
> >  +	 * New slots in each stripe get 'page' set to a new page.
> >  +	 * We allocate all the new stripes first, then if that succeeds,
> >  +	 * copy everything across.
> >  +	 * Finally we add new pages.  This could fail, but we leave
> >  +	 * the stripe cache at it's new size, just with some pages empty.
> >  +	 *
> >  +	 * We use GFP_NOIO allocations as IO to the raid5 is blocked
> >  +	 * at some points in this operation.
> >  +	 */
> >  +	struct stripe_head *osh, *nsh;
> >  +	struct list_head newstripes, oldstripes;
> 
> You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().
> 

I guess.....
I have to have the declaration "miles" from where I use the variable.
Do I have to have the initialisation equally far?  Ok, I'll do that..


> 
> >  +	struct disk_info *ndisks;
> >  +	int err = 0;
> >  +	kmem_cache_t *sc;
> >  +	int i;
> >  +
> >  +	if (newsize <= conf->pool_size)
> >  +		return 0; /* never bother to shrink */
> >  +
> >  +	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
> >  +			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
> >  +			       0, 0, NULL, NULL);
> 
> kmem_cache_create() internally does a GFP_KERNEL allocation.
> 
> >  +	if (!sc)
> >  +		return -ENOMEM;
> >  +	INIT_LIST_HEAD(&newstripes);
> >  +	for (i = conf->max_nr_stripes; i; i--) {
> >  +		nsh = kmem_cache_alloc(sc, GFP_NOIO);
> 
> So either this can use GFP_KERNEL, or we have a problem.

Good point....  Maybe the comment about GFP_NOIO just needs to be
improved. 
We cannot risk waiting on IO after the 
	/* OK, we have enough stripes, start collecting inactive
	 * stripes and copying them over
	 */

comment, up to the second last while loop, that starts
	while(!list_empty(&newstripes)) {

Before the comment, which where the kmem_cache_create is, is OK.

Thanks!

NeilBrown


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.
  2006-03-17  6:01   ` Andrew Morton
@ 2006-03-17  6:17     ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2006-03-17  6:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

On Thursday March 16, akpm@osdl.org wrote:
> NeilBrown <neilb@suse.de> wrote:
> >
> >  -	retry:
> >   		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> >  -		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
> >  +		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
> >   		if (sh) {
> >  -			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
> >  -				/* Add failed due to overlap.  Flush everything
> >  +			if (unlikely(conf->expand_progress != MaxSector)) {
> >  +				/* expansion might have moved on while waiting for a
> >  +				 * stripe, so we much do the range check again.
> >  +				 */
> >  +				int must_retry = 0;
> >  +				spin_lock_irq(&conf->device_lock);
> >  +				if (logical_sector <  conf->expand_progress &&
> >  +				    disks == conf->previous_raid_disks)
> >  +					/* mismatch, need to try again */
> >  +					must_retry = 1;
> >  +				spin_unlock_irq(&conf->device_lock);
> >  +				if (must_retry) {
> >  +					release_stripe(sh);
> >  +					goto retry;
> >  +				}
> >  +			}
> 
> The locking in here looks strange.  We take the lock, do some arithmetic
> and some tests and then drop the lock again.  Is it not possible that the
> result of those tests now becomes invalid?

Obviously another comment missing.
 conf->expand_progress is sector_t and so could be 64bits on a 32 bit
 platform, and so I cannot be sure it is updated atomically.  So I
 always access it within a lock (unless I am comparing for equality with ~0).

 Yes, the result can become invalid, but only in one direction:  As
 expand_progress always increases, it is possible that it will pass
 logical_sector.  When that happens, STRIPE_EXPANDING gets set on the
 stripe_head at logical_sector.
 So because we took a reference to logical_sector *before* this test,
 and check for stripe_expanding *after* the test, we can easily catch
 that transition.

 Putting it another way, this test is to catch cases where
 logical_sector is a long way from expand_progress, the subsequent
 test of STRIPE_EXPANDING catches cases where they are close together,
 and the ordering wrt get_stripe_active ensures there are no holes.

Now to put that into a few short-but-clear comments.

Thanks again!

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.
  2006-03-17  4:48 ` [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally NeilBrown
@ 2006-03-17  6:17   ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2006-03-17  6:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

NeilBrown <neilb@suse.de> wrote:
>
> diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
>  --- ./drivers/md/raid5.c~current~	2006-03-17 11:48:58.000000000 +1100
>  +++ ./drivers/md/raid5.c	2006-03-17 11:48:58.000000000 +1100
>  @@ -1747,8 +1747,9 @@ static int make_request(request_queue_t 

That's a fairly complex function..

I wonder about this:

	spin_lock_irq(&conf->device_lock);
	if (--bi->bi_phys_segments == 0) {
		int bytes = bi->bi_size;

		if ( bio_data_dir(bi) == WRITE )
			md_write_end(mddev);
		bi->bi_size = 0;
		bi->bi_end_io(bi, bytes, 0);
	}
	spin_unlock_irq(&conf->device_lock);

bi_end_io() can be somewhat expensive.  Does it need to happen under the lock?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-03-17  5:57   ` Andrew Morton
@ 2006-03-17  6:24     ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2006-03-17  6:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

On Thursday March 16, akpm@osdl.org wrote:
> NeilBrown <neilb@suse.de> wrote:
> >
> > +	/* Got them all.
> >  +	 * Return the new ones and free the old ones.
> >  +	 * At this point, we are holding all the stripes so the array
> >  +	 * is completely stalled, so now is a good time to resize
> >  +	 * conf->disks.
> >  +	 */
> >  +	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
> >  +	if (ndisks) {
> >  +		for (i=0; i<conf->raid_disks; i++)
> >  +			ndisks[i] = conf->disks[i];
> >  +		kfree(conf->disks);
> >  +		conf->disks = ndisks;
> >  +	} else
> >  +		err = -ENOMEM;
> >  +	while(!list_empty(&newstripes)) {
> >  +		nsh = list_entry(newstripes.next, struct stripe_head, lru);
> >  +		list_del_init(&nsh->lru);
> >  +		for (i=conf->raid_disks; i < newsize; i++)
> >  +			if (nsh->dev[i].page == NULL) {
> >  +				struct page *p = alloc_page(GFP_NOIO);
> >  +				nsh->dev[i].page = p;
> >  +				if (!p)
> >  +					err = -ENOMEM;
> >  +			}
> >  +		release_stripe(nsh);
> >  +	}
> >  +	while(!list_empty(&oldstripes)) {
> >  +		osh = list_entry(oldstripes.next, struct stripe_head, lru);
> >  +		list_del(&osh->lru);
> >  +		kmem_cache_free(conf->slab_cache, osh);
> >  +	}
> >  +	kmem_cache_destroy(conf->slab_cache);
> >  +	conf->slab_cache = sc;
> >  +	conf->active_name = 1-conf->active_name;
> >  +	conf->pool_size = newsize;
> >  +	return err;
> >  +}
> 
> Are you sure the -ENOMEM handling here is solid?  It
> looks.... strange.

The philosophy of the -ENOMEM handling is (awkwardly?) embodied in the
comment
	 * Finally we add new pages.  This could fail, but we leave
	 * the stripe cache at it's new size, just with some pages empty.

at the top of the function.  The core function here is making some
data structures bigger.  In each case, having a bigger data structure
than required is no big deal.  So we try to increase the size of each
of them (the stripe_head cache, the 'disks' array, and the pages
allocated to each stripe.
If any of there fail we return -ENOMEM, but may allow others to
succeed.

Does that help?

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 007 of 13] md: Core of raid5 resize process
  2006-03-17  6:03   ` Andrew Morton
@ 2006-03-17  7:10     ` Neil Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Brown @ 2006-03-17  7:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel

On Thursday March 16, akpm@osdl.org wrote:
> NeilBrown <neilb@suse.de> wrote:
> >
> > @@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
> >   		 */
> >   		max_sectors = mddev->resync_max_sectors;
> >   		mddev->resync_mismatches = 0;
> >  -	} else
> >  +	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
> >  +		max_sectors = mddev->size << 1;
> >  +	else
> >   		/* recovery follows the physical size of devices */
> >   		max_sectors = mddev->size << 1;
> >   
> 
> This change is a no-op.   Intentional?

Uhmm... sort of.
A later patch adds stuff to the later branch but not the middle one.
This comes from creating a patch to fix a bug, then merging it back
into the wrong original patch...

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-03-17  7:10 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-17  4:47 [PATCH 000 of 13] md: Introduction NeilBrown
2006-03-17  4:47 ` [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are supported NeilBrown
2006-03-17  4:47 ` [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks NeilBrown
2006-03-17  4:47 ` [PATCH 003 of 13] md: Update status_resync to handle LARGE devices NeilBrown
2006-03-17  4:47 ` [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
2006-03-17  4:47 ` [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
2006-03-17  5:50   ` Andrew Morton
2006-03-17  6:04     ` Neil Brown
2006-03-17  5:53   ` Andrew Morton
2006-03-17  5:57   ` Andrew Morton
2006-03-17  6:24     ` Neil Brown
2006-03-17  4:47 ` [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
2006-03-17  6:01   ` Andrew Morton
2006-03-17  6:17     ` Neil Brown
2006-03-17  4:47 ` [PATCH 007 of 13] md: Core of raid5 resize process NeilBrown
2006-03-17  6:03   ` Andrew Morton
2006-03-17  7:10     ` Neil Brown
2006-03-17  4:48 ` [PATCH 008 of 13] md: Final stages of raid5 expand code NeilBrown
2006-03-17  4:48 ` [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape NeilBrown
2006-03-17  4:48 ` [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally NeilBrown
2006-03-17  6:17   ` Andrew Morton
2006-03-17  4:48 ` [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape NeilBrown
2006-03-17  4:48 ` [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action NeilBrown
2006-03-17  4:48 ` [PATCH 013 of 13] md: Support suspending of IO to regions of an md array NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).