linux-raid.vger.kernel.org archive mirror
* [PATCH] Online RAID-5 resizing
@ 2005-09-20 14:33 Steinar H. Gunderson
  2005-09-20 15:01 ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-20 14:33 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]

(Please Cc me on any replies, I'm not subscribed)

Hi,

Attached is a patch (against 2.6.12) for adding online RAID-5 resize
capabilities to Linux' RAID code. It needs no changes to mdadm (I've only
tested with mdadm 1.12.0, though); you can just do

  mdadm --add /dev/md1 /dev/hd[eg]1
  mdadm --grow /dev/md1 -n 4

and it will restripe /dev/md1; you can still use the volume just fine
during the expand process. (cat /proc/mdstat to get the progress; it will
look like a regular sync, and when the restripe is done the volume will
suddenly get larger and do a regular sync of the new parts.)
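
For example, to follow the restripe as it runs (standard tools only,
nothing specific to this patch), something like

  watch -n1 cat /proc/mdstat

works fine.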

The patch is quite rough -- it's my first trip ever into the md code, the
block layer, or really kernel code in general, so expect subtle race
conditions and problems here and there. :-) That being said, it seems to be
quite stable on my (SMP) test system now -- I would really take backups
before testing it, though! You have been warned :-)

Things still to do, off the top of my head:

- It's RAID-5 only; I don't really use RAID-0, and RAID-6 would probably be
  more complex.
- It supports only growing, not shrinking. (Not sure if I really care about
  fixing this one.)
- It leaks memory; it doesn't properly free up the old stripes etc. at the
  end of the resize. (This also makes it impossible to do a grow and then
  another grow without stopping and starting the volumes.)
- There is absolutely no crash recovery -- this shouldn't be so hard to do
  (just update the superblock every time, with some progress meter, and
  restart from that spot in case of a crash; see the sketch below this
  list), but I have no knowledge of the on-disk superblock format at all,
  so some help would be appreciated here. Also, I'm not really sure what
  happens if it encounters a bad block during the restripe.
- It's quite slow; on my test system with old IDE disks, it achieves about
  1MB/sec. One could probably make a speed/memory tradeoff here, and move
  more chunks at a time instead of just one by one; I'm a bit concerned
  about the implications of the kernel allocating something like 64MB in one
  go, though :-)
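
A very rough sketch of the checkpointing idea mentioned above (emphatically
not part of the patch: expand_checkpoint() and sb_expand_progress are
made-up names, and I'm only assuming that setting sb_dirty and waking the
md thread gets the superblocks rewritten -- someone who actually knows the
format should check):

  /* Hypothetical sketch: remember how far the restripe has come, so a
   * crash can resume instead of restarting.  sb_expand_progress is an
   * imaginary superblock field; the real on-disk format would need a
   * spare field for it. */
  static void expand_checkpoint(raid5_conf_t *conf)
  {
  	mddev_t *mddev = conf->mddev;
  	unsigned long flags;

  	spin_lock_irqsave(&conf->device_lock, flags);
  	/* sectors already restriped into the new layout */
  	mddev->sb_expand_progress = conf->expand_progress;
  	spin_unlock_irqrestore(&conf->device_lock, flags);

  	/* ask md to rewrite the superblocks on all members; on the next
  	 * assembly, a nonzero value would mean "resume from here" */
  	mddev->sb_dirty = 1;
  	md_wakeup_thread(mddev->thread);
  }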
  
Comments, patches, fixes etc. would be greatly appreciated. (Again, remember
to Cc me; I'm not on the list.)

/* Steinar */
-- 
Homepage: http://www.sesse.net/

[-- Attachment #2: raid5-resize.patch --]
[-- Type: text/plain, Size: 55631 bytes --]

diff -ur linux-2.6-2.6.12/drivers/md/raid5.c ../linux-2.6-2.6.12/drivers/md/raid5.c
--- linux-2.6-2.6.12/drivers/md/raid5.c	2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6-2.6.12.patch/drivers/md/raid5.c	2005-09-20 00:13:55.000000000 +0200
@@ -68,19 +68,40 @@
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static int sync_request (mddev_t *mddev, sector_t sector_nr, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+			unsigned int data_disks, unsigned int * dd_idx,
+			unsigned int * pd_idx, raid5_conf_t *conf);
 
 static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	PRINTK("__release_stripe, conf=%p\n", conf);
+	BUG_ON(atomic_read(&sh->count) == 0);
 	if (atomic_dec_and_test(&sh->count)) {
 		if (!list_empty(&sh->lru))
 			BUG();
-		if (atomic_read(&conf->active_stripes)==0)
-			BUG();
+		if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+			if (atomic_read(&conf->active_stripes_expand)==0)
+				BUG();
+		} else {
+			if (atomic_read(&conf->active_stripes)==0)
+				BUG();
+		}
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAY_EXPAND, &sh->state)) {
+				list_add_tail(&sh->lru, &conf->wait_for_expand_list);
+				printk("delaying stripe with sector %llu (expprog=%llu, active=%d)\n", sh->sector,
+					conf->expand_progress, atomic_read(&conf->active_stripes_expand));
+			} else if (test_bit(STRIPE_DELAYED, &sh->state)) {
+//				printk("real-delay\n");
 				list_add_tail(&sh->lru, &conf->delayed_list);
-			else
+			} else {
 				list_add_tail(&sh->lru, &conf->handle_list);
+			}
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
 			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -88,11 +109,34 @@
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
 					md_wakeup_thread(conf->mddev->thread);
 			}
-			list_add_tail(&sh->lru, &conf->inactive_list);
-			atomic_dec(&conf->active_stripes);
-			if (!conf->inactive_blocked ||
-			    atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4))
-				wake_up(&conf->wait_for_stripe);
+			if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+				list_add_tail(&sh->lru, &conf->inactive_list_expand);
+				atomic_dec(&conf->active_stripes_expand);
+			} else {
+				list_add_tail(&sh->lru, &conf->inactive_list);
+				if (conf->expand_in_progress == 2) {
+					// we are in the process of finishing up an expand, see
+					// if we have no active stripes left
+					if (atomic_dec_and_test(&conf->active_stripes)) {
+						printk("Finishing up expand\n");
+						raid5_finish_expand(conf);
+						printk("Expand done.\n");
+					}
+				} else {
+					atomic_dec(&conf->active_stripes);
+				}
+			}
+			if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+				if (!conf->inactive_blocked_expand ||
+				    atomic_read(&conf->active_stripes_expand) < (NR_STRIPES*3/4)) {
+					wake_up(&conf->wait_for_stripe_expand);
+				}
+			} else {
+				if (!conf->inactive_blocked ||
+				    atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4)) {
+					wake_up(&conf->wait_for_stripe);
+				}
+			}
 		}
 	}
 }
@@ -133,20 +177,44 @@
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
 
 	CHECK_DEVLOCK();
-	if (list_empty(&conf->inactive_list))
-		goto out;
-	first = conf->inactive_list.next;
-	sh = list_entry(first, struct stripe_head, lru);
-	list_del_init(first);
-	remove_hash(sh);
-	atomic_inc(&conf->active_stripes);
+
+	if (expand) {
+		if (list_empty(&conf->inactive_list_expand))
+			goto out;
+		first = conf->inactive_list_expand.next;
+		sh = list_entry(first, struct stripe_head, lru);
+		list_del_init(first);
+		remove_hash(sh);
+		atomic_inc(&conf->active_stripes_expand);
+	} else {
+		if (list_empty(&conf->inactive_list))
+			goto out;
+		first = conf->inactive_list.next;
+		sh = list_entry(first, struct stripe_head, lru);
+		list_del_init(first);
+		remove_hash(sh);
+		atomic_inc(&conf->active_stripes);
+	}
 out:
+
+	if (sh) {
+		if (conf->expand_in_progress) {
+			if (expand)
+				BUG_ON(sh->disks != conf->raid_disks);
+			else
+				BUG_ON(sh->disks != conf->previous_raid_disks);
+		} else {
+			BUG_ON(expand);
+			BUG_ON(sh->disks != conf->raid_disks);
+		}
+	}
+
 	return sh;
 }
 
@@ -184,7 +252,7 @@
 static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -245,21 +313,59 @@
 
 	do {
 		sh = __find_stripe(conf, sector);
+
+		// make sure this is of the right size; if not, remove it from the hash
+		if (sh) {
+			int correct_disks = conf->raid_disks;
+			if (conf->expand_in_progress && sector >= conf->expand_progress) {
+				correct_disks = conf->previous_raid_disks;
+			}
+
+			if (sh->disks != correct_disks) {
+				BUG_ON(atomic_read(&sh->count) != 0);
+
+				remove_hash(sh);
+				sh = NULL;
+			}
+		}
+		
 		if (!sh) {
-			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+				if (!conf->inactive_blocked_expand) {
+					sh = get_free_stripe(conf, 1);
+				}
+			} else {
+				if (!conf->inactive_blocked) {
+					sh = get_free_stripe(conf, 0);
+				}
+			}
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
-				conf->inactive_blocked = 1;
-				wait_event_lock_irq(conf->wait_for_stripe,
-						    !list_empty(&conf->inactive_list) &&
-						    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
-						     || !conf->inactive_blocked),
-						    conf->device_lock,
-						    unplug_slaves(conf->mddev);
-					);
-				conf->inactive_blocked = 0;
+				if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+//					printk("WAITING FOR AN EXPAND STRIPE\n");
+					conf->inactive_blocked_expand = 1;
+					wait_event_lock_irq(conf->wait_for_stripe_expand,
+							    !list_empty(&conf->inactive_list_expand) &&
+							    (atomic_read(&conf->active_stripes_expand) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked_expand),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked_expand = 0;
+				} else {
+//					printk("WAITING FOR A NON-EXPAND STRIPE, sector=%llu\n", sector);
+					conf->inactive_blocked = 1;
+					wait_event_lock_irq(conf->wait_for_stripe,
+							    !list_empty(&conf->inactive_list) &&
+							    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked = 0;
+				}
+//				printk("INACTIVITY DONE\n");
 			} else
 				init_stripe(sh, sector, pd_idx);
 		} else {
@@ -267,8 +373,13 @@
 				if (!list_empty(&sh->lru))
 					BUG();
 			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
+				if (!test_bit(STRIPE_HANDLE, &sh->state)) {
+					if (conf->expand_in_progress && sector < conf->expand_progress) {
+						atomic_inc(&conf->active_stripes_expand);
+					} else {
+						atomic_inc(&conf->active_stripes);
+					}
+				}
 				if (list_empty(&sh->lru))
 					BUG();
 				list_del_init(&sh->lru);
@@ -283,26 +394,34 @@
 	return sh;
 }
 
-static int grow_stripes(raid5_conf_t *conf, int num)
+static int grow_stripes(raid5_conf_t *conf, int num, int expand)
 {
 	struct stripe_head *sh;
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
+	if (expand)
+		sprintf(conf->cache_name, "raid5e/%s", mdname(conf->mddev));
+	else
+		sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
 
 	sc = kmem_cache_create(conf->cache_name, 
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
-	conf->slab_cache = sc;
+	if (expand)
+		conf->slab_cache_expand = sc;
+	else
+		conf->slab_cache = sc;
 	while (num--) {
 		sh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!sh)
 			return 1;
+		printk("alloc stripe: %p\n", sh);
 		memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
 		sh->raid_conf = conf;
+		sh->disks = conf->raid_disks;
 		spin_lock_init(&sh->lock);
 
 		if (grow_buffers(sh, conf->raid_disks)) {
@@ -312,10 +431,15 @@
 		}
 		/* we just created an active stripe so... */
 		atomic_set(&sh->count, 1);
-		atomic_inc(&conf->active_stripes);
+		if (expand) {
+			atomic_inc(&conf->active_stripes_expand);
+		} else {
+			atomic_inc(&conf->active_stripes);
+		}
 		INIT_LIST_HEAD(&sh->lru);
 		release_stripe(sh);
 	}
+	printk("done growing\n");
 	return 0;
 }
 
@@ -325,7 +449,7 @@
 
 	while (1) {
 		spin_lock_irq(&conf->device_lock);
-		sh = get_free_stripe(conf);
+		sh = get_free_stripe(conf, 0);
 		spin_unlock_irq(&conf->device_lock);
 		if (!sh)
 			break;
@@ -344,7 +468,7 @@
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -393,6 +517,8 @@
 		set_bit(R5_UPTODATE, &sh->dev[i].flags);
 #endif		
 	} else {
+		printk("received non-up-to-date information for disk %u, sector %llu!\n",
+			i, sh->sector);
 		md_error(conf->mddev, conf->disks[i].rdev);
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
 	}
@@ -411,12 +537,93 @@
 	return 0;
 }
 
+							
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+	int i;
+	struct disk_info *tmp;
+//	shrink_stripes(conf);
+	
+	conf->expand_in_progress = 0;
+	conf->active_stripes = conf->active_stripes_expand;
+	conf->inactive_list = conf->inactive_list_expand;
+	conf->wait_for_stripe = conf->wait_for_stripe_expand;
+	conf->slab_cache = conf->slab_cache_expand;
+	conf->inactive_blocked = conf->inactive_blocked_expand;
+
+	// fix up linked list
+	conf->inactive_list.next->prev = &conf->inactive_list;
+	{
+		struct list_head *first = &conf->inactive_list;
+		while (1) {
+			if (first->next == &conf->inactive_list_expand) {
+				first->next = &conf->inactive_list;
+				break;
+			}
+
+			first = first->next;
+		}
+	}
+
+	conf->wait_for_stripe.task_list.next->prev = &conf->wait_for_stripe.task_list;
+	{
+		struct list_head *first = &conf->wait_for_stripe.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_stripe_expand.task_list) {
+				first->next = &conf->wait_for_stripe.task_list;
+				break;
+			}
+
+			first = first->next;
+		}
+	}
+
+	for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+		tmp = conf->disks + i;
+		if (tmp->rdev
+		    && !tmp->rdev->faulty
+		    && !tmp->rdev->in_sync) {
+			conf->mddev->degraded--;
+			conf->failed_disks--;
+			conf->working_disks++;
+			tmp->rdev->in_sync = 1;
+		}
+	}
+
+	// hey, mr. md code: we have more space now!
+ 	{	
+		struct block_device *bdev;
+		sector_t sync_sector;
+		unsigned dummy1, dummy2;
+
+		conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+			conf->raid_disks - 1, &dummy1, &dummy2, conf);
+		
+		conf->mddev->recovery_cp = sync_sector << 1;    // FIXME: hum, hum
+		set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			down(&bdev->bd_inode->i_sem);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			up(&bdev->bd_inode->i_sem);
+			bdput(bdev);
+		}
+	}
+	
+	/* FIXME: free old stuff here! (what are we missing?) */
+}
+
 static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -436,8 +643,11 @@
 	}
 
 	spin_lock_irqsave(&conf->device_lock, flags);
-	if (!uptodate)
+	if (!uptodate) {
+		printk("end_write_request ends with error, for disk %u sector %llu\n",
+			i, sh->sector);
 		md_error(conf->mddev, conf->disks[i].rdev);
+	}
 
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 	
@@ -512,12 +722,14 @@
 	int sectors_per_chunk = conf->chunk_size >> 9;
 
 	/* First compute the information on this sector */
+	PRINTK("r_sector_inp=%llu\n", r_sector);
 
 	/*
 	 * Compute the chunk number and the sector offset inside the chunk
 	 */
 	chunk_offset = sector_div(r_sector, sectors_per_chunk);
 	chunk_number = r_sector;
+	PRINTK("r_sector=%llu, chunk_number=%lu\n", r_sector, chunk_number);
 	BUG_ON(r_sector != chunk_number);
 
 	/*
@@ -556,7 +768,7 @@
 			break;
 		default:
 			printk("raid5: unsupported algorithm %d\n",
-				conf->algorithm);
+					conf->algorithm);
 	}
 
 	/*
@@ -570,7 +782,7 @@
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -582,7 +794,7 @@
 	stripe = new_sector;
 	BUG_ON(new_sector != stripe);
 
-	
+
 	switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
@@ -597,7 +809,7 @@
 			break;
 		default:
 			printk("raid5: unsupported algorithm %d\n",
-				conf->algorithm);
+					conf->algorithm);
 	}
 
 	chunk_number = stripe * data_disks + i;
@@ -605,7 +817,8 @@
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+				check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
 		return 0;
 	}
 	return r_sector;
@@ -620,8 +833,8 @@
  * All iovecs in the bio must be considered.
  */
 static void copy_data(int frombio, struct bio *bio,
-		     struct page *page,
-		     sector_t sector)
+		struct page *page,
+		sector_t sector)
 {
 	char *pa = page_address(page);
 	struct bio_vec *bvl;
@@ -646,7 +859,7 @@
 		if (len > 0 && page_offset + len > STRIPE_SIZE)
 			clen = STRIPE_SIZE - page_offset;
 		else clen = len;
-			
+
 		if (clen > 0) {
 			char *ba = __bio_kmap_atomic(bio, i, KM_USER0);
 			if (frombio)
@@ -662,21 +875,21 @@
 }
 
 #define check_xor() 	do { 						\
-			   if (count == MAX_XOR_BLOCKS) {		\
-				xor_block(count, STRIPE_SIZE, ptr);	\
-				count = 1;				\
-			   }						\
-			} while(0)
+	if (count == MAX_XOR_BLOCKS) {		\
+		xor_block(count, STRIPE_SIZE, ptr);	\
+		count = 1;				\
+	}						\
+} while(0)
 
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	//	raid5_conf_t *conf = sh->raid_conf;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
-		(unsigned long long)sh->sector, dd_idx);
+			(unsigned long long)sh->sector, dd_idx);
 
 	ptr[0] = page_address(sh->dev[dd_idx].page);
 	memset(ptr[0], 0, STRIPE_SIZE);
@@ -689,8 +902,8 @@
 			ptr[count++] = p;
 		else
 			printk("compute_block() %d, stripe %llu, %d"
-				" not present\n", dd_idx,
-				(unsigned long long)sh->sector, i);
+					" not present\n", dd_idx,
+					(unsigned long long)sh->sector, i);
 
 		check_xor();
 	}
@@ -702,59 +915,59 @@
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
 	PRINTK("compute_parity, stripe %llu, method %d\n",
-		(unsigned long long)sh->sector, method);
+			(unsigned long long)sh->sector, method);
 
 	count = 1;
 	ptr[0] = page_address(sh->dev[pd_idx].page);
 	switch(method) {
-	case READ_MODIFY_WRITE:
-		if (!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags))
-			BUG();
-		for (i=disks ; i-- ;) {
-			if (i==pd_idx)
-				continue;
-			if (sh->dev[i].towrite &&
-			    test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				if (sh->dev[i].written) BUG();
-				sh->dev[i].written = chosen;
-				check_xor();
+		case READ_MODIFY_WRITE:
+			if (!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags))
+				BUG();
+			for (i=disks ; i-- ;) {
+				if (i==pd_idx)
+					continue;
+				if (sh->dev[i].towrite &&
+						test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					ptr[count++] = page_address(sh->dev[i].page);
+					chosen = sh->dev[i].towrite;
+					sh->dev[i].towrite = NULL;
+
+					if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+						wake_up(&conf->wait_for_overlap);
+
+					if (sh->dev[i].written) BUG();
+					sh->dev[i].written = chosen;
+					check_xor();
+				}
 			}
-		}
-		break;
-	case RECONSTRUCT_WRITE:
-		memset(ptr[0], 0, STRIPE_SIZE);
-		for (i= disks; i-- ;)
-			if (i!=pd_idx && sh->dev[i].towrite) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
+			break;
+		case RECONSTRUCT_WRITE:
+			memset(ptr[0], 0, STRIPE_SIZE);
+			for (i= disks; i-- ;)
+				if (i!=pd_idx && sh->dev[i].towrite) {
+					chosen = sh->dev[i].towrite;
+					sh->dev[i].towrite = NULL;
 
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
+					if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+						wake_up(&conf->wait_for_overlap);
 
-				if (sh->dev[i].written) BUG();
-				sh->dev[i].written = chosen;
-			}
-		break;
-	case CHECK_PARITY:
-		break;
+					if (sh->dev[i].written) BUG();
+					sh->dev[i].written = chosen;
+				}
+			break;
+		case CHECK_PARITY:
+			break;
 	}
 	if (count>1) {
 		xor_block(count, STRIPE_SIZE, ptr);
 		count = 1;
 	}
-	
+
 	for (i = disks; i--;)
 		if (sh->dev[i].written) {
 			sector_t sector = sh->dev[i].sector;
@@ -769,24 +982,24 @@
 		}
 
 	switch(method) {
-	case RECONSTRUCT_WRITE:
-	case CHECK_PARITY:
-		for (i=disks; i--;)
-			if (i != pd_idx) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-		break;
-	case READ_MODIFY_WRITE:
-		for (i = disks; i--;)
-			if (sh->dev[i].written) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
+		case RECONSTRUCT_WRITE:
+		case CHECK_PARITY:
+			for (i=disks; i--;)
+				if (i != pd_idx) {
+					ptr[count++] = page_address(sh->dev[i].page);
+					check_xor();
+				}
+			break;
+		case READ_MODIFY_WRITE:
+			for (i = disks; i--;)
+				if (sh->dev[i].written) {
+					ptr[count++] = page_address(sh->dev[i].page);
+					check_xor();
+				}
 	}
 	if (count != 1)
 		xor_block(count, STRIPE_SIZE, ptr);
-	
+
 	if (method != CHECK_PARITY) {
 		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
 		set_bit(R5_LOCKED,   &sh->dev[pd_idx].flags);
@@ -805,16 +1018,18 @@
 	raid5_conf_t *conf = sh->raid_conf;
 
 	PRINTK("adding bh b#%llu to stripe s#%llu\n",
-		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector);
+			(unsigned long long)bi->bi_sector,
+			(unsigned long long)sh->sector);
 
 
 	spin_lock(&sh->lock);
 	spin_lock_irq(&conf->device_lock);
+	PRINTK("lock, DISKS: %u\n", sh->disks);
 	if (forwrite)
 		bip = &sh->dev[dd_idx].towrite;
 	else
 		bip = &sh->dev[dd_idx].toread;
+	PRINTK("pip, disk=%u, bip=%p, num_disks=%u\n", dd_idx, bip, sh->disks);
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -833,16 +1048,16 @@
 	spin_unlock(&sh->lock);
 
 	PRINTK("added bi b#%llu to stripe s#%llu, disk %d.\n",
-		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+			(unsigned long long)bi->bi_sector,
+			(unsigned long long)sh->sector, dd_idx);
 
 	if (forwrite) {
 		/* check if page is covered */
 		sector_t sector = sh->dev[dd_idx].sector;
 		for (bi=sh->dev[dd_idx].towrite;
-		     sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
-			     bi && bi->bi_sector <= sector;
-		     bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
+				sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
+				bi && bi->bi_sector <= sector;
+				bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
 			if (bi->bi_sector + (bi->bi_size>>9) >= sector)
 				sector = bi->bi_sector + (bi->bi_size>>9);
 		}
@@ -851,7 +1066,9 @@
 	}
 	return 1;
 
- overlap:
+overlap:
+	printk("overlap\n");
+
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
 	spin_unlock_irq(&conf->device_lock);
 	spin_unlock(&sh->lock);
@@ -876,11 +1093,11 @@
  * get BH_Lock set before the stripe lock is released.
  *
  */
- 
+
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -891,12 +1108,13 @@
 	struct r5dev *dev;
 
 	PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count),
-		sh->pd_idx);
+			(unsigned long long)sh->sector, atomic_read(&sh->count),
+			sh->pd_idx);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
+	clear_bit(STRIPE_DELAY_EXPAND, &sh->state);
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
@@ -908,7 +1126,7 @@
 		clear_bit(R5_Syncio, &dev->flags);
 
 		PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
+				i, dev->flags, dev->toread, dev->towrite, dev->written);
 		/* maybe we can reply to a read */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
 			struct bio *rbi, *rbi2;
@@ -936,7 +1154,7 @@
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
 
-		
+
 		if (dev->toread) to_read++;
 		if (dev->towrite) {
 			to_write++;
@@ -945,19 +1163,21 @@
 		}
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
-		if (!rdev || !rdev->in_sync) {
+		if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
 			failed++;
 			failed_num = i;
+			printk("failing disk %u (%p)!\n", i, rdev);
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
-	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
 	if (failed > 1 && to_read+to_write+written) {
+		printk("Need to fail requests!\n");
+		printk("locked=%d uptodate=%d to_read=%d"
+			" to_write=%d failed=%d failed_num=%d disks=%d\n",
+			locked, uptodate, to_read, to_write, failed, failed_num, disks);
 		spin_lock_irq(&conf->device_lock);
 		for (i=disks; i--; ) {
 			/* fail all writes first */
@@ -1012,7 +1232,7 @@
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	if (failed > 1 && syncing) {
+	if (failed > 1 && syncing && !conf->expand_in_progress) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1023,37 +1243,37 @@
 	 */
 	dev = &sh->dev[sh->pd_idx];
 	if ( written &&
-	     ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
-		test_bit(R5_UPTODATE, &dev->flags))
-	       || (failed == 1 && failed_num == sh->pd_idx))
-	    ) {
-	    /* any written block on an uptodate or failed drive can be returned.
-	     * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but 
-	     * never LOCKED, so we don't need to test 'failed' directly.
-	     */
-	    for (i=disks; i--; )
-		if (sh->dev[i].written) {
-		    dev = &sh->dev[i];
-		    if (!test_bit(R5_LOCKED, &dev->flags) &&
-			 test_bit(R5_UPTODATE, &dev->flags) ) {
-			/* We can return any write requests */
-			    struct bio *wbi, *wbi2;
-			    PRINTK("Return write for disc %d\n", i);
-			    spin_lock_irq(&conf->device_lock);
-			    wbi = dev->written;
-			    dev->written = NULL;
-			    while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				    wbi2 = r5_next_bio(wbi, dev->sector);
-				    if (--wbi->bi_phys_segments == 0) {
-					    md_write_end(conf->mddev);
-					    wbi->bi_next = return_bi;
-					    return_bi = wbi;
-				    }
-				    wbi = wbi2;
-			    }
-			    spin_unlock_irq(&conf->device_lock);
-		    }
-		}
+			( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
+			   test_bit(R5_UPTODATE, &dev->flags))
+			  || (failed == 1 && failed_num == sh->pd_idx))
+	   ) {
+		/* any written block on an uptodate or failed drive can be returned.
+		 * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but 
+		 * never LOCKED, so we don't need to test 'failed' directly.
+		 */
+		for (i=disks; i--; )
+			if (sh->dev[i].written) {
+				dev = &sh->dev[i];
+				if (!test_bit(R5_LOCKED, &dev->flags) &&
+						test_bit(R5_UPTODATE, &dev->flags) ) {
+					/* We can return any write requests */
+					struct bio *wbi, *wbi2;
+					PRINTK("Return write for disc %d\n", i);
+					spin_lock_irq(&conf->device_lock);
+					wbi = dev->written;
+					dev->written = NULL;
+					while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+						wbi2 = r5_next_bio(wbi, dev->sector);
+						if (--wbi->bi_phys_segments == 0) {
+							md_write_end(conf->mddev);
+							wbi->bi_next = return_bi;
+							return_bi = wbi;
+						}
+						wbi = wbi2;
+					}
+					spin_unlock_irq(&conf->device_lock);
+				}
+			}
 	}
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -1064,13 +1284,13 @@
 		for (i=disks; i--;) {
 			dev = &sh->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
-			    (dev->toread ||
-			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-			     syncing ||
-			     (failed && (sh->dev[failed_num].toread ||
-					 (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
-				    )
-				) {
+					(dev->toread ||
+					 (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+					 syncing ||
+					 (failed && (sh->dev[failed_num].toread ||
+						     (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
+					)
+			   ) {
 				/* we would like to get this block, possibly
 				 * by computing it, but we might not be able to
 				 */
@@ -1085,23 +1305,303 @@
 					/* if I am just reading this block and we don't have
 					   a failed drive, or any pending writes then sidestep the cache */
 					if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
-					    ! syncing && !failed && !to_write) {
+							! syncing && !failed && !to_write) {
 						sh->bh_cache[i]->b_page =  sh->bh_read[i]->b_page;
 						sh->bh_cache[i]->b_data =  sh->bh_read[i]->b_data;
 					}
 #endif
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
-						i, syncing);
-					if (syncing)
+							i, syncing);
+					if (syncing && !conf->expand_in_progress)
 						md_sync_acct(conf->disks[i].rdev->bdev,
-							     STRIPE_SECTORS);
+								STRIPE_SECTORS);
 				}
 			}
 		}
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
+	// see if we have the data we need to expand by another block
+	if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+		int uptodate = 0, delay_to_future=0, d = 0, count = 0, needed_uptodate = 0;
+		for (i=0; i<disks; ++i) {
+			sector_t start_sector, dest_sector;
+			unsigned int dd_idx, pd_idx;
+
+			if (i == sh->pd_idx)
+				continue;
+
+			start_sector = sh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+			++d;
+
+			// see what sector this block would land in the new layout
+			dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+				conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+			if (dd_idx > pd_idx)
+				--dd_idx;
+
+/*			printk("start_sector = %llu (base=%llu, i=%u, d=%u) || dest_stripe = %llu\n", start_sector, sh->sector,
+				i, d, dest_stripe); */
+			
+			if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ 			    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+/*				printk("UPDATING CHUNK %u FROM DISK %u (sec=%llu, dest_sector=%llu, uptodate=%u)\n",
+					dd_idx, i, start_sector, dest_sector, test_bit(R5_UPTODATE, &sh->dev[i].flags)); */
+				unsigned int buf_sector;
+				sector_t base = conf->expand_progress;
+				sector_div(base, conf->raid_disks - 1);
+
+				buf_sector = dd_idx * (conf->chunk_size / STRIPE_SIZE) + (dest_sector - base) / STRIPE_SECTORS;
+				
+				if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					conf->expand_buffer[buf_sector].up_to_date = 1;
+//					printk("memcpy device %u/%u: %p <- %p\n", i, sh->disks,
+//						page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page));
+					memcpy(page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+//					printk("memcpy done\n");
+					count = 1;
+					PRINTK("Updating %u\n", buf_sector);
+				} else {
+					conf->expand_buffer[buf_sector].up_to_date = 0;
+				}
+			} else if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) &&
+				   dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) * 2 &&
+				   syncing) {
+				delay_to_future = 1;
+			}
+		}
+
+		for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+			uptodate += conf->expand_buffer[i].up_to_date;
+		}
+		if (count) 
+			PRINTK("%u/%lu is up to date\n", uptodate, (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE));
+	
+		/*
+		 * Figure out how many stripes we need for this chunk to be complete.
+		 * In almost all cases, this will be a full destination stripe, but our
+		 * original volume might not be big enough for that at the very end --
+		 * so use the rest of the volume then.
+	         */
+		needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+		if (((conf->mddev->size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+			needed_uptodate = ((conf->mddev->size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+//			printk("reading partial block at the end: %u\n", needed_uptodate);
+		}
+		if (needed_uptodate > 0 && uptodate == needed_uptodate) {
+			// we can do an expand!
+			struct stripe_head *newsh[256];   // FIXME: dynamic allocation somewhere instead?
+			sector_t dest_sector, advance;
+			unsigned i;
+			unsigned int dummy1, dummy2, pd_idx;
+
+			if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+			} else {
+				advance = (conf->mddev->size << 1) - conf->expand_progress;
+			}
+
+//			sector_div(new_sector, (conf->raid_disks - 1));
+//			printk("EXPANDING ONTO SECTOR %llu\n", conf->expand_progress);
+//			printk("EXPAND => %llu/%llu\n", conf->expand_progress, conf->mddev->size << 1);
+			
+			// find the parity disk and starting sector
+			dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+				conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+			printk("Expanding onto %llu\n", dest_sector);
+		
+			spin_lock_irq(&conf->device_lock);
+			
+			/*
+			 * Check that we won't try to expand over an area where there's
+			 * still active stripes; if we do, we'll risk inconsistency since we
+			 * suddenly have two different sets of stripes referring to the
+			 * same logical sector.
+			 */
+			{
+				struct stripe_head *ash;
+				int activity = 0, i;
+				sector_t first_touched_sector, last_touched_sector;
+				
+				first_touched_sector = raid5_compute_sector(conf->expand_progress,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+				last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->previous_raid_disks - 1)) >> 9) - 1,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+				for (i = 0; i < NR_HASH; i++) {
+					ash = conf->stripe_hashtbl[i];
+					for (; ash; ash = ash->hash_next) {						
+						if (sh == ash && atomic_read(&ash->count) == 1 && !to_write)
+							continue;   // we'll release it shortly, so it's OK (?)
+
+						// is this stripe active, and within the region we're expanding?
+						if (atomic_read(&ash->count) > 0 &&
+						    ash->disks == conf->previous_raid_disks &&
+						    ash->sector >= first_touched_sector &&
+						    ash->sector <= last_touched_sector) {
+							activity = 1;
+							break;
+						}
+					}
+				}
+				
+				if (activity) {
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+
+			/*
+			 * Check that we have enough free stripes to write out our
+			 * entire chunk in the new layout. If not, we'll have to wait
+			 * until some writes have been retired. We can't just do
+			 * as in get_active_stripe() and sleep here until enough are
+			 * free, since all busy stripes might have STRIPE_HANDLE set
+			 * and thus won't be retired until somebody (our thread!) takes
+			 * care of them.
+			 */	
+			
+			{
+				int not_enough_free = 0;
+				
+				for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+					newsh[i] = get_free_stripe(conf, 1);
+					if (newsh[i] == NULL) {
+						not_enough_free = 1;
+						break;
+					}
+					init_stripe(newsh[i], dest_sector + i * STRIPE_SECTORS, pd_idx);		
+				}
+
+				if (not_enough_free) {
+					// release all the stripes we allocated
+					for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+						if (newsh[i] == NULL)
+							break;
+						atomic_inc(&newsh[i]->count);
+						__release_stripe(conf, newsh[i]);
+					}
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				for (d = 0; d < conf->raid_disks; ++d) {
+					unsigned dd_idx = d;
+					
+					if (d != pd_idx) {
+						if (dd_idx > pd_idx)
+							--dd_idx;
+
+						memcpy(page_address(newsh[i]->dev[d].page), page_address(conf->expand_buffer[dd_idx * conf->chunk_size / STRIPE_SIZE + i].page), STRIPE_SIZE);
+					}
+					set_bit(R5_Wantwrite, &newsh[i]->dev[d].flags);
+					set_bit(R5_Syncio, &newsh[i]->dev[d].flags);
+				}
+			}
+			
+			for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+				conf->expand_buffer[i].up_to_date = 0;
+			}
+
+			conf->expand_progress += advance;
+			
+			spin_unlock_irq(&conf->device_lock);
+			
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				compute_parity(newsh[i], RECONSTRUCT_WRITE);
+					
+				atomic_inc(&newsh[i]->count);
+				set_bit(STRIPE_INSYNC, &newsh[i]->state);
+				set_bit(STRIPE_HANDLE, &newsh[i]->state);
+				release_stripe(newsh[i]);
+			}
+
+			spin_lock_irq(&conf->device_lock);
+			md_done_sync(conf->mddev, advance, 1);
+			wake_up(&conf->wait_for_expand_progress);
+			spin_unlock_irq(&conf->device_lock);
+
+//			md_sync_acct(conf->disks[0].rdev->bdev, STRIPE_SECTORS * (conf->raid_disks - 1));
+
+			// see if we have delayed data that we can process now
+			{			
+				struct list_head *l, *next;
+				
+				spin_lock_irq(&conf->device_lock);
+				l = conf->wait_for_expand_list.next;
+
+//				printk("printing delay list:\n");
+				while (l != &conf->wait_for_expand_list) {
+					int i, d = 0;
+					int do_process = 0;
+					
+					struct stripe_head *dsh;
+					dsh = list_entry(l, struct stripe_head, lru);
+//					printk("sector: %llu\n", dsh->sector);
+					
+					for (i=0; i<disks; ++i) {
+						sector_t start_sector, dest_sector;
+						unsigned int dd_idx, pd_idx;
+
+						if (i == dsh->pd_idx)
+							continue;
+
+						start_sector = dsh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+
+						// see what sector this block would land in in the new layout
+						dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+								conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+						if (/*dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&*/
+						    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->raid_disks - 1) * (conf->chunk_size >> 9)) {
+							do_process = 1;
+						}
+
+						++d;
+					}
+					
+					next = l->next;
+					
+					if (do_process) {
+						list_del_init(l);
+
+						set_bit(STRIPE_HANDLE, &dsh->state);
+						clear_bit(STRIPE_DELAYED, &dsh->state);
+						clear_bit(STRIPE_DELAY_EXPAND, &dsh->state);
+						atomic_inc(&dsh->count);
+						atomic_inc(&dsh->count);
+						printk("pulling in stuff from delayed, sector=%llu\n",
+							dsh->sector);
+						__release_stripe(conf, dsh);
+					} else {
+						printk("still there\n");
+					}
+
+					l = next;
+				}
+
+				spin_unlock_irq(&conf->device_lock);
+			}
+
+			// see if we are done
+			if (conf->expand_progress >= conf->mddev->array_size << 1) {
+				printk("expand done, waiting for last activity to settle...\n");
+//				conf->mddev->raid_disks = conf->raid_disks;
+//				raid5_resize(conf->mddev, conf->mddev->size << 1);
+				conf->expand_in_progress = 2;
+			}
+
+please_wait:			
+			1;
+		}
+
+		if (delay_to_future) { // && atomic_dec_and_test(&sh->count)) {
+			set_bit(STRIPE_DELAY_EXPAND, &sh->state);
+		}
+	}
+
 	/* now to consider writing and what else, if anything should be read */
 	if (to_write) {
 		int rmw=0, rcw=0;
@@ -1237,7 +1737,9 @@
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		if (!conf->expand_in_progress) {
+			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		}
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 	
@@ -1279,7 +1781,7 @@
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1308,6 +1810,7 @@
 
 static inline void raid5_activate_delayed(raid5_conf_t *conf)
 {
+	PRINTK("raid5_activate_delayed\n");
 	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
 		while (!list_empty(&conf->delayed_list)) {
 			struct list_head *l = conf->delayed_list.next;
@@ -1428,8 +1931,15 @@
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		
-		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+		if (conf->expand_in_progress && logical_sector >= conf->expand_progress) {
+			PRINTK("GEOM: old\n");
+			new_sector = raid5_compute_sector(logical_sector,
+				conf->previous_raid_disks, conf->previous_raid_disks - 1, &dd_idx, &pd_idx, conf);
+		} else {
+			PRINTK("GEOM: new\n");
+			new_sector = raid5_compute_sector(logical_sector,
+				 raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+		}
 
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
@@ -1488,6 +1998,13 @@
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
 
+	if (conf->expand_in_progress) {
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks-1;
+	}
+
+	BUG_ON(data_disks == 0 || raid_disks == 0);
+	
 	if (sector_nr >= mddev->size <<1) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
@@ -1499,17 +2016,41 @@
 	 */
 	if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 		int rv = (mddev->size << 1) - sector_nr;
+		printk("md_done_sync()\n");
 		md_done_sync(mddev, rv, 1);
 		return rv;
 	}
+	
+	/* if we're in an expand, we can't allow the process
+	 * to keep reading in stripes; we might not have enough buffer
+	 * space to keep it all in RAM.
+	 */
+	if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+		//printk("DELAY\n");
+		//printall(conf);
+		//printk("progress = %llu\n", conf->expand_progress);
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_expand_progress,
+			    sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+		);
+		spin_unlock_irq(&conf->device_lock);
+		//printk("DELAY DONE\n");
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
 	stripe = x;
 	BUG_ON(x != stripe);
-
+	
+	PRINTK("sync_request:%llu/%llu, %u+%u active, pr=%llu v. %llu\n", sector_nr, mddev->size<<1,
+		atomic_read(&conf->active_stripes), atomic_read(&conf->active_stripes_expand),
+		sector_nr,
+		conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)); 
+ 
 	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
-		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+			+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
 	sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
 	if (sh == NULL) {
 		sh = get_active_stripe(conf, sector_nr, pd_idx, 0);
@@ -1553,18 +2094,29 @@
 	while (1) {
 		struct list_head *first;
 
+		conf = mddev_to_conf(mddev);
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
-		    !list_empty(&conf->delayed_list))
+		    !list_empty(&conf->delayed_list)) {
+			PRINTK("activate delayed\n");
 			raid5_activate_delayed(conf);
+		}
 
 		if (list_empty(&conf->handle_list))
 			break;
 
 		first = conf->handle_list.next;
+		PRINTK("first: %p\n", first);
+		
 		sh = list_entry(first, struct stripe_head, lru);
 
+#if RAID5_DEBUG
+		PRINTK("sh: %p\n", sh);
+		print_sh(sh);
+#endif
+
 		list_del_init(first);
 		atomic_inc(&sh->count);
 		if (atomic_read(&sh->count)!= 1)
@@ -1577,7 +2129,7 @@
 
 		spin_lock_irq(&conf->device_lock);
 	}
-	PRINTK("%d stripes handled\n", handled);
+//	PRINTK("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
 
@@ -1594,6 +2146,8 @@
 	struct disk_info *disk;
 	struct list_head *tmp;
 
+	printk("run()!\n");
+	
 	if (mddev->level != 5 && mddev->level != 4) {
 		printk("raid5: %s: raid level not set to 4/5 (%d)\n", mdname(mddev), mddev->level);
 		return -EIO;
@@ -1650,6 +2204,7 @@
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_in_progress = 0;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1691,7 +2246,7 @@
 	}
 memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
 		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
-	if (grow_stripes(conf, conf->max_nr_stripes)) {
+	if (grow_stripes(conf, conf->max_nr_stripes, 0)) {
 		printk(KERN_ERR 
 			"raid5: couldn't allocate %dkB for buffers\n", memory);
 		shrink_stripes(conf);
@@ -1767,8 +2322,8 @@
 
 	printk("sh %llu, pd_idx %d, state %ld.\n",
 		(unsigned long long)sh->sector, sh->pd_idx, sh->state);
-	printk("sh %llu,  count %d.\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count));
+	printk("sh %llu,  count %d, disks %d.\n",
+		(unsigned long long)sh->sector, atomic_read(&sh->count), sh->disks);
 	printk("sh %llu, ", (unsigned long long)sh->sector);
 	for (i = 0; i < sh->raid_conf->raid_disks; i++) {
 		printk("(cache%d: %p %ld) ", 
@@ -1865,6 +2420,9 @@
 	mdk_rdev_t *rdev;
 	struct disk_info *p = conf->disks + number;
 
+	printk("we were asked to remove a disk\n");
+	return -EBUSY;  // sesse hack
+	
 	print_raid5_conf(conf);
 	rdev = p->rdev;
 	if (rdev) {
@@ -1894,27 +2452,37 @@
 	int disk;
 	struct disk_info *p;
 
-	if (mddev->degraded > 1)
+	printk("RAID5 ADD DISK PLZ: %p\n", rdev);
+	
+	if (mddev->degraded > 1) {
+		printk("GAVE UP\n");
+		
 		/* no point adding a device */
 		return 0;
+	}
 
 	/*
 	 * find the disk ...
 	 */
-	for (disk=0; disk < mddev->raid_disks; disk++)
+	for (disk=0; disk < mddev->raid_disks; disk++) {
 		if ((p=conf->disks + disk)->rdev == NULL) {
+			printk("adding disk to %u\n", disk);
+			
+			rdev->faulty = 0;
 			rdev->in_sync = 0;
 			rdev->raid_disk = disk;
 			found = 1;
 			p->rdev = rdev;
 			break;
 		}
+	}
 	print_raid5_conf(conf);
 	return found;
 }
 
 static int raid5_resize(mddev_t *mddev, sector_t sectors)
 {
+        raid5_conf_t *conf = mddev_to_conf(mddev);
 	/* no resync is happening, and there is enough space
 	 * on all devices, so we can resize.
 	 * We need to make sure resync covers any new space.
@@ -1922,8 +2490,14 @@
 	 * any io in the removed space completes, but it hardly seems
 	 * worth it.
 	 */
+	printk("asked to resize\n");
+	if (conf->expand_in_progress)
+		return -EBUSY;
+		
 	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
+	printk("old array_size: %llu\n", mddev->array_size);
 	mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
+	printk("new array_size: %llu (%llu x %u)\n", mddev->array_size, sectors, mddev->raid_disks - 1);
 	set_capacity(mddev->gendisk, mddev->array_size << 1);
 	mddev->changed = 1;
 	if (sectors/2  > mddev->size && mddev->recovery_cp == MaxSector) {
@@ -1934,6 +2508,221 @@
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+        raid5_conf_t *conf = mddev_to_conf(mddev);
+	raid5_conf_t *newconf;
+        struct list_head *tmp;
+	mdk_rdev_t *rdev;
+	unsigned long flags;
+
+        int d, i;
+
+	if (mddev->degraded >= 1 || conf->expand_in_progress)
+		return -EBUSY;
+	
+	printk("sesse was here: reshape to %u disks\n", raid_disks);
+	print_raid5_conf(conf);
+	
+	newconf = kmalloc (sizeof (raid5_conf_t)
+			+ raid_disks * sizeof(struct disk_info),
+			GFP_KERNEL);
+	if (newconf == NULL)
+		return -ENOMEM;	
+	
+	memset(newconf, 0, sizeof (raid5_conf_t) + raid_disks * sizeof(struct disk_info));
+	memcpy(newconf, conf, sizeof (raid5_conf_t) + conf->raid_disks * sizeof(struct disk_info));
+
+	newconf->expand_in_progress = 1;
+	newconf->expand_progress = 0;
+	newconf->raid_disks = mddev->raid_disks = raid_disks;	
+	newconf->previous_raid_disks = conf->raid_disks;	
+	
+	INIT_LIST_HEAD(&newconf->inactive_list_expand);
+	
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	mddev->private = newconf;
+
+	printk("conf=%p newconf=%p\n", conf, newconf);
+	
+	if (newconf->handle_list.next)
+		newconf->handle_list.next->prev = &newconf->handle_list;
+	if (newconf->delayed_list.next)
+		newconf->delayed_list.next->prev = &newconf->delayed_list;
+	if (newconf->inactive_list.next)
+		newconf->inactive_list.next->prev = &newconf->inactive_list;
+
+	if (newconf->handle_list.prev == &conf->handle_list)
+		newconf->handle_list.prev = &newconf->handle_list;
+	if (newconf->delayed_list.prev == &conf->delayed_list)
+		newconf->delayed_list.prev = &newconf->delayed_list;
+	if (newconf->inactive_list.prev == &conf->inactive_list)
+		newconf->inactive_list.prev = &newconf->inactive_list;
+	
+	if (newconf->wait_for_stripe.task_list.prev == &conf->wait_for_stripe.task_list)
+		newconf->wait_for_stripe.task_list.prev = &newconf->wait_for_stripe.task_list;
+	if (newconf->wait_for_overlap.task_list.prev == &conf->wait_for_overlap.task_list)
+		newconf->wait_for_overlap.task_list.prev = &newconf->wait_for_overlap.task_list;
+	
+	init_waitqueue_head(&newconf->wait_for_stripe_expand);
+	init_waitqueue_head(&newconf->wait_for_expand_progress);
+	INIT_LIST_HEAD(&newconf->wait_for_expand_list);
+	
+	// update all the stripes
+	for (i = 0; i < NR_STRIPES; ++i) {
+		struct stripe_head *sh = newconf->stripe_hashtbl[i];
+		while (sh) {
+			sh->raid_conf = newconf;
+			
+			if (sh->lru.next == &conf->inactive_list)
+				sh->lru.next = &newconf->inactive_list;
+			if (sh->lru.next == &conf->handle_list)
+				sh->lru.next = &newconf->handle_list;
+
+			sh = sh->hash_next;
+		}
+	}
+
+	// ...and all on the inactive queue
+	{
+		struct list_head *first = newconf->inactive_list.next;
+		
+		while (1) {
+			struct stripe_head *sh = list_entry(first, struct stripe_head, lru);
+			sh->raid_conf = newconf;
+		
+			if (sh->lru.next == &conf->inactive_list)
+				sh->lru.next = &newconf->inactive_list;
+			if (sh->lru.next == &conf->handle_list)
+				sh->lru.next = &newconf->handle_list;
+
+			if (first->next == &conf->inactive_list || first->next == &newconf->inactive_list) {
+				first->next = &newconf->inactive_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+
+	// update the pointer for the other lists as well
+	{
+		struct list_head *first = &newconf->handle_list;
+		while (1) {
+			if (first->next == &conf->handle_list) {
+				first->next = &newconf->handle_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->delayed_list;
+		while (1) {
+			if (first->next == &conf->delayed_list) {
+				first->next = &newconf->delayed_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->wait_for_stripe.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_stripe.task_list) {
+				first->next = &newconf->wait_for_stripe.task_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->wait_for_overlap.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_overlap.task_list) {
+				first->next = &newconf->wait_for_overlap.task_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	
+	ITERATE_RDEV(mddev,rdev,tmp) {
+		printk("disk: %p\n", rdev);
+		for (d= 0; d < newconf->raid_disks; d++) {
+			if (newconf->disks[d].rdev == rdev) {
+				goto already_there;
+			}
+		}
+
+		raid5_add_disk(mddev, rdev);
+		newconf->failed_disks++;
+		
+already_there:		
+		1;
+	}
+
+	// argh! we can't hold this lock while allocating memory
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+	// allocate new stripes
+	atomic_set(&newconf->active_stripes_expand, 0);
+	if (grow_stripes(newconf, newconf->max_nr_stripes, 1)) {
+		int memory = newconf->max_nr_stripes * (sizeof(struct stripe_head) +
+			newconf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand stripes\n", memory);
+		shrink_stripes(newconf);
+		kfree(newconf);
+		return -ENOMEM;
+	}
+
+	// and space for our temporary expansion buffers
+	newconf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+	if (newconf->expand_buffer == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+			(conf->chunk_size * (raid_disks-1)) >> 10);
+		shrink_stripes(newconf);
+		kfree(newconf);
+		return -ENOMEM;
+	}
+	
+	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+		newconf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+		if (newconf->expand_buffer[i].page == NULL) {
+			printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+				(conf->chunk_size * (raid_disks-1)) >> 10);
+			shrink_stripes(newconf);
+			kfree(newconf);
+			return -ENOMEM;
+		}
+		newconf->expand_buffer[i].up_to_date = 0;
+	}
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	
+	print_raid5_conf(newconf);
+
+	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+        set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	mddev->recovery_cp = 0;
+        md_wakeup_thread(mddev->thread);
+//        md_check_recovery(mddev);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+
+	kfree(conf);
+
+	printk("Starting expand.\n");
+	
+        return 0;
+}
+
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",
@@ -1948,6 +2737,7 @@
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+	.reshape	= raid5_reshape
 };
 
 static int __init raid5_init (void)
diff -ur linux-2.6-2.6.12/include/linux/raid/raid5.h ../linux-2.6-2.6.12/include/linux/raid/raid5.h
--- linux-2.6-2.6.12/include/linux/raid/raid5.h	2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6-2.6.12.patch/include/linux/raid/raid5.h	2005-09-17 00:47:25.000000000 +0200
@@ -92,7 +92,11 @@
  * stripe is also (potentially) linked to a hash bucket in the hash
  * table so that it can be found by sector number.  Stripes that are
  * not hashed must be on the inactive_list, and will normally be at
- * the front.  All stripes start life this way.
+ * the front.  All stripes start life this way. There is also a
+ * "inactive_list_expand"; this is only used during an expand, and
+ * it contains stripes with "disks" set to the correct number of disks
+ * after the expand (and with the correct amount of memory allocated,
+ * of course).
  *
  * The inactive_list, handle_list and hash bucket lists are all protected by the
  * device_lock.
@@ -134,6 +138,7 @@
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -171,6 +176,7 @@
 #define	STRIPE_INSYNC		4
 #define	STRIPE_PREREAD_ACTIVE	5
 #define	STRIPE_DELAYED		6
+#define	STRIPE_DELAY_EXPAND	7
 
 /*
  * Plugging:
@@ -199,6 +205,10 @@
 struct disk_info {
 	mdk_rdev_t	*rdev;
 };
+struct expand_buf {
+	struct page     *page;
+	int		up_to_date;
+};
 
 struct raid5_private_data {
 	struct stripe_head	**stripe_hashtbl;
@@ -208,22 +218,38 @@
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	int			expand_in_progress;
+	sector_t		expand_progress;
+	int			previous_raid_disks;
+	struct list_head	wait_for_expand_list;
+	
+	struct expand_buf	*expand_buffer;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
 	char			cache_name[20];
+	char			cache_name_expand[20];
 	kmem_cache_t		*slab_cache; /* for allocating stripes */
+	kmem_cache_t		*slab_cache_expand;
+	
 	/*
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
+	atomic_t		active_stripes_expand;
 	struct list_head	inactive_list;
+	struct list_head	inactive_list_expand;
 	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_stripe_expand;
+	wait_queue_head_t	wait_for_expand_progress;
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
-							 */        
+							 */       
+	int			inactive_blocked_expand;
 	spinlock_t		device_lock;
 	struct disk_info	disks[0];
 };


* Re: [PATCH] Online RAID-5 resizing
  2005-09-20 14:33 [PATCH] Online RAID-5 resizing Steinar H. Gunderson
@ 2005-09-20 15:01 ` Neil Brown
  2005-09-20 15:36   ` Steinar H. Gunderson
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Neil Brown @ 2005-09-20 15:01 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Tuesday September 20, sgunderson@bigfoot.com wrote:
> (Please Cc me on any replies, I'm not subscribed)
Of course...
> 
> Hi,
> 
> Attached is a patch (against 2.6.12) for adding online RAID-5 resize
> capabilities to Linux' RAID code.

Wow!  Thanks for this.  It's something I've wanted done for some time,
but it hasn't got even close to the top of my todo list.

> 
> - It's RAID-5 only; I don't really use RAID-0, and RAID-6 would probably be
>   more complex.

I doubt raid6 would be much more difficult.  However it is definitely
best to get it working well in raid5 first.
I'd never thought of doing raid0, but I cannot now think of a good
reason not to.....


> - It supports only growing, not shrinking. (Not sure if I really care about
>   fixing this one.)

Shrinking certainly adds a lot of complications, and you would have to
start at the 'top' and work backwards.  Probably not worth the effort,
except that people might want to be able to back-out a change...

> - It leaks memory; it doesn't properly free up the old stripes etc. at the
>   end of the resize. (This also makes it impossible to do a grow and then
>   another grow without stopping and starting the volumes.)

I'm sure that can be fixed.

> - There is absolutely no crash recovery -- this shouldn't be so hard to do
>   (just update the superblock every time, with some progress meter, and
>   restart from that spot in case of a crash), but I have no knowledge of the
>   on-disk superblock format at all, so some help would be appreciated here.
>   Also, I'm not really sure what happens if it encounters a bad block during
>   the restripe.

Crash recovery is essential I think.  There are some awkward cases,
particularly while growing the first few stripes.  I'm sure we can
work it out together.


> - It's quite slow; on my test system with old IDE disks, it achieves about
>   1MB/sec. One could probably make a speed/memory tradeoff here, and move
>   more chunks at a time instead of just one by one; I'm a bit concerned
>   about the implications of the kernel allocating something like 64MB in one
>   go, though :-)

I doubt speed is a top priority.

I'll try to have a read through your code over the next week or so and
give you more detailed feedback.

Thanks again,
NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-09-20 15:01 ` Neil Brown
@ 2005-09-20 15:36   ` Steinar H. Gunderson
  2005-09-22 16:16     ` Neil Brown
  2005-09-20 18:54   ` Al Boldi
  2005-09-21 19:23   ` Steinar H. Gunderson
  2 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-20 15:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Wed, Sep 21, 2005 at 01:01:42AM +1000, Neil Brown wrote:
> Shrinking certainly adds a lot of complications, and you would have to
> start at the 'top' and work backwards.  Probably not worth the effort,
> except that people might want to be able to back-out a change...

I worked on EVMS' resizing code prior to doing this, and there it seemed
like shrinking was simply the same operation run in the other direction,
without any further complications... I don't know how the underlying block
layer in Linux would like it, though.

>> - It leaks memory; it doesn't properly free up the old stripes etc. at the
>>   end of the resize. (This also makes it impossible to do a grow and then
>>   another grow without stopping and starting the volumes.)
> I'm sure that can be fixed.

Yes, of course; it's mostly about not having gotten around to doing it yet. A
good start would be doing shrink_stripes(), but the “finish up the expanding”
code is currently called from __release_stripe() when the last stripe from
the old array is freed, and thus is done under the device_lock, and I had
problems doing memory management under the spinlock. The correct solution
would probably be moving it into raid5d, outside the spinlock.
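
Roughly something like this (the flag is hypothetical, it's not in the
current patch) -- just note the event under the lock, and let raid5d do
the actual freeing where sleeping is allowed:

	/* in __release_stripe(), under device_lock: only record that
	 * the expand has finished -- don't free anything here */
	if (atomic_dec_and_test(&conf->active_stripes))
		conf->expand_cleanup_pending = 1;   /* hypothetical flag */

	/* in raid5d(), with device_lock dropped: now it's safe to free */
	if (conf->expand_cleanup_pending) {
		conf->expand_cleanup_pending = 0;
		shrink_stripes(conf);   /* retire the old stripe cache */
		kfree(conf->expand_buffer);   /* plus its pages, etc. */
	}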

> Crash recovery is essential I think.  There are some awkward cases,
> particularly while growing the first few stripes.  I'm sure we can
> work it out together.

Mm, or at least the very first stripe. I'm not really sure if it's worth it,
though; perfect crash recovery is pretty hard (for one, you'd have to disable
all write caching on the destination disks), and I'm not sure how probable
a power loss 20ms into the resizing is.

>> - It's quite slow; on my test system with old IDE disks, it achieves about
>>   1MB/sec. One could probably make a speed/memory tradeoff here, and move
>>   more chunks at a time instead of just one by one; I'm a bit concerned
>>   about the implications of the kernel allocating something like 64MB in one
>>   go, though :-)
> I doubt speed is a top priority.

Well, with multi-terabyte arrays, restriping at those speeds will take
_weeks_, so more speed is always good. I agree that we don't need to be
pushing it very hard, though.
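(Quick sanity check on that: 2TB at 1MB/sec is 2*10^6 seconds, which is
over three weeks.)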

> I'll try to have a read through your code over the next week or so and
> give you more detailed feedback.

OK, thanks. :-) There's a lot of unneeded junk in the patch, BTW (some
reindenting here and there that I can't account for, plus lots
of temporary added printks), but I guess we can sort out the cleanup after
a while. :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-20 15:01 ` Neil Brown
  2005-09-20 15:36   ` Steinar H. Gunderson
@ 2005-09-20 18:54   ` Al Boldi
  2005-09-21 19:23   ` Steinar H. Gunderson
  2 siblings, 0 replies; 23+ messages in thread
From: Al Boldi @ 2005-09-20 18:54 UTC (permalink / raw)
  To: Neil Brown, Steinar H. Gunderson; +Cc: linux-raid

Neil Brown wrote:
> On Tuesday September 20, sgunderson@bigfoot.com wrote:
> > - It's RAID-5 only; I don't really use RAID-0, and RAID-6 would probably
> > be more complex.
>
> I doubt raid6 would be much more difficult.  However it is definitely
> best to get it working well in raid5 first.
> I'd never thought of doing raid0, but I cannot now think of a good
> reason not to.....

Always generalize the code, which forces modularization, which implies 
simplicity, structure, and scalability, thus gaining reliability and 
performance.

Thanks, and keep up the good work!

--
Al



* Re: [PATCH] Online RAID-5 resizing
  2005-09-20 15:01 ` Neil Brown
  2005-09-20 15:36   ` Steinar H. Gunderson
  2005-09-20 18:54   ` Al Boldi
@ 2005-09-21 19:23   ` Steinar H. Gunderson
  2005-09-22  0:14     ` Steinar H. Gunderson
  2 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-21 19:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Wed, Sep 21, 2005 at 01:01:42AM +1000, Neil Brown wrote:
> I doubt speed is a top priority.

I just found something interesting; if I raise speed_limit_min, I get much
better speed than I initially got. My 2->4 slow IDE disk setup:

md1 : active raid5 hdg1[4] hde1[5] hdc1[1] hda1[0]
      39078016 blocks level 5, 64k chunk, algorithm 2 [4/2] [UU__]
      [==========>..........]  resync = 52.4% (20499840/39078016) finish=50.0min speed=6187K/sec

This is more than good enough for me, at least. :-)
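
(For reference, that's the usual md tunable; something like

  echo 6000 > /proc/sys/dev/raid/speed_limit_min

where the value is in KB/sec -- 6000 here is just an example matching the
speed above.)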

I noticed a problem during a big test resize now, BTW; for some reason,
pdflush goes into uninterruptable sleep, and kjournald and thus the
filesystem seems to follow. I'm not sure if this is a starvation issue or
not, but it sure is a bug -- I'll look into it and see if I find something
obvious.

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-21 19:23   ` Steinar H. Gunderson
@ 2005-09-22  0:14     ` Steinar H. Gunderson
  2005-09-22  1:00       ` Steinar H. Gunderson
  0 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-22  0:14 UTC (permalink / raw)
  To: linux-raid

On Wed, Sep 21, 2005 at 09:23:26PM +0200, Steinar H. Gunderson wrote:
> I noticed a problem during a big test resize now, BTW; for some reason,
> pdflush goes into uninterruptible sleep, and kjournald and thus the
> filesystem seem to follow. I'm not sure if this is a starvation issue or
> not, but it sure is a bug -- I'll look into it and see if I find something
> obvious.

This might be it (apologies for the inline patch, but it should be simple
enough to apply by hand):

--- drivers/md/raid5.c.orig     2005-09-22 02:15:30.000000000 +0200
+++ drivers/md/raid5.c  2005-09-22 01:49:30.000000000 +0200
@@ -374,7 +374,7 @@
					BUG();
			} else {
				if (!test_bit(STRIPE_HANDLE, &sh->state)) {
-					if (conf->expand_in_progress && sector < conf->expand_progress) {
+					if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
						atomic_inc(&conf->active_stripes_expand);
					} else {
						atomic_inc(&conf->active_stripes);

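(If I read the code right, the multiplication is a unit conversion:
sh->sector is a per-disk offset, while conf->expand_progress counts logical
array sectors, so a stripe at per-disk sector s holds data from logical
sector s * (raid_disks - 1) onwards -- e.g. with 4 disks, per-disk sector
128 corresponds to logical sector 384.)
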
It seems to survive even under quite heavy load now (but of course, restripe
performance suffers badly when doing heavy stuff on the volume :-) ). Too
soon to say whether there are still problems, of course.

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-22  0:14     ` Steinar H. Gunderson
@ 2005-09-22  1:00       ` Steinar H. Gunderson
  0 siblings, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-22  1:00 UTC (permalink / raw)
  To: linux-raid

On Thu, Sep 22, 2005 at 02:14:21AM +0200, Steinar H. Gunderson wrote:
> It seems to survive even under quite heavy load now (but of course, restripe
> performance suffers badly when doing heavy stuff on the volume :-) ). Too
> soon to say if there are problems still, of course.

Nope, it's still corrupting if I'm doing heavy I/O against it while
expanding, even after two more patches of the same type. I haven't seen the
uninterruptible sleep problem in this round of tests, but I won't say for 
sure that it went away either. :-)

(Yes, I'm replying to myself _again_ :-) )

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-20 15:36   ` Steinar H. Gunderson
@ 2005-09-22 16:16     ` Neil Brown
  2005-09-22 16:32       ` Steinar H. Gunderson
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Neil Brown @ 2005-09-22 16:16 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Tuesday September 20, sgunderson@bigfoot.com wrote:
> 
> > I'll try to have a read through your code over the next week or so and
> > give you more detailed feedback.
> 
> OK, thanks. :-) There's a lot of unneeded junk in the patch, BTW (some
> reindenting here and there that I can't account for, plus lots
> of temporary added printks), but I guess we can sort out the cleanup after
> a while. :-)

Yes, that reindenting is a problem as it makes the patch hard to read
-- it's hard to see which bits need to be checked and which don't.  If
you could remove them for the next version, it would help....

Can I make two suggestions for a start?

1/ in raid5_reshape, rather than allocate a separate set of
  stripe_heads, I think it would be good to re-size all of the stripes
  and then continue running with the new set of stripe_heads (rough
  sketch below, after point 2/).  This would involve repeatedly

     get_inactive_stripe
     allocate new stripe slightly bigger
     copy the pages across
     allocate the extra pages
     put it on a private list

  Repeat this until we have all the stripes.  This will temporarily
  stall the raid5 as the stripes will be exhausted.  
  As soon as you have them all, you release them again, and the raid
  will continue to work.  
  Avoiding the two lists of stripe_heads will remove a fair bit of
  code.

2/ Reserve the stripe_heads needed for a chunk-resize in make_request
   (where it is safe to block) rather than in handle_stripe.

   so make_request reserves all the stripes needed to read, and all
   needed to write (which may overlap for the first chunk or two),
   stores them in an array or list in ->conf, and arranges for
   handle_stripe to trigger the reads, and arranges that new write
   requests to any of these stripes block.
   Once the reads are done, shuffle the pages (rather than memcpy,
   just fiddle with pointers), and cause write-out to commence.
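
To make 1/ concrete, the loop I have in mind is roughly this (helper
names are hypothetical, error handling is omitted, and olddisks/newdisks
stand for the disk counts before and after the grow):

	struct list_head private_list = LIST_HEAD_INIT(private_list);
	int i, d;

	for (i = 0; i < conf->max_nr_stripes; i++) {
		struct stripe_head *osh, *nsh;

		/* blocks until a stripe becomes inactive; this is what
		 * temporarily stalls the raid5 */
		osh = get_inactive_stripe(conf);
		nsh = alloc_stripe_head(conf, newdisks); /* slightly bigger */

		/* move the existing pages across... */
		for (d = 0; d < olddisks; d++)
			nsh->dev[d].page = osh->dev[d].page;
		/* ...and allocate the extra ones */
		for (; d < newdisks; d++)
			nsh->dev[d].page = alloc_page(GFP_KERNEL);

		free_stripe_head(conf, osh);
		list_add(&nsh->lru, &private_list); /* park until we have all */
	}

	/* all stripes collected: release the private list and the raid
	 * continues to work, now entirely on the bigger stripe_heads */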

Please let me know if that makes sense, or if you don't think it will
work, or if you just don't have the time....

Thanks,
NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-09-22 16:16     ` Neil Brown
@ 2005-09-22 16:32       ` Steinar H. Gunderson
  2005-09-23  8:59         ` Neil Brown
  2005-09-22 20:53       ` Steinar H. Gunderson
  2005-09-24  1:44       ` Steinar H. Gunderson
  2 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-22 16:32 UTC (permalink / raw)
  To: linux-raid

On Thu, Sep 22, 2005 at 06:16:41PM +0200, Neil Brown wrote:
> Yes, that reindenting is a problem as it makes the patch hard to read
> -- it's hard to see which bits need to be checked and which don't.  If
> you could remove them for the next version, it would help....

I'll look into it.

> 1/ in raid5_reshape, rather than allocate a separate set of
>   stripe_heads, I think it would be good to re-size all of the stripes
>   and the continue running with the new set of stripe_heads.

Hm, that's an idea. OTOH, I'm not sure how much code it really saves; you
still need a notion of how many disks there are in a stripe, so
you can read and write the correct amount, calculate parity correctly etc...
and thus there is a distinction after all. It's an interesting thought,
though; I'd probably try to get it all working properly first, and then go
on to that later.

> 2/ Reserve the stripe_heads needed for a chunk-resize in make_request
>    (where it is safe to block) rather than in handle_stripe.

Hm. make_request is never called from the sync code, is it? My understanding
was that make_request was called whenever userspace wanted to read/write
something, and only then.

>    Once the reads are done, shuffle the pages (rather then memcpy,
>    just fiddle with pointers), and cause write-out to commence.

Ah, flipping the pointers is a good idea. I was searching for a way of doing
optimized page-to-page memcpy operations, but of course just moving the
pointers will work fine.
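
I.e. something like this instead of the memcpy() (just a sketch -- the
index names are made up, and it assumes nothing else holds a reference
to either page at that point):

	/* swap the cached pages rather than copying STRIPE_SIZE bytes */
	struct page *tmp = newsh->dev[d].page;
	newsh->dev[d].page = conf->expand_buffer[idx].page;
	conf->expand_buffer[idx].page = tmp;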

I've fixed what I believe is a race condition that might lead to the subtle
corruption I've been seeing (conf->expand_progress can change between the
geometry calculation and actual writing in make_request); I'm testing it now,
but random bugs are naturally hard to verify as fixed :-) OTOH, I haven't seen
the problem with uninterruptible sleep for a while, so I hope my other fix
got that.

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-22 16:16     ` Neil Brown
  2005-09-22 16:32       ` Steinar H. Gunderson
@ 2005-09-22 20:53       ` Steinar H. Gunderson
  2005-09-24  1:44       ` Steinar H. Gunderson
  2 siblings, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-22 20:53 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

On Thu, Sep 22, 2005 at 06:16:41PM +0200, Neil Brown wrote:
> Yes, that reindenting is a problem as it makes the patch hard to read
> -- it's hard to see which bits need to be checked and which don't.  If
> you could remove them for the next version, it would help....

Here's an updated version of the patch (against raid5.c only, the changes
against raid5.h should be the same) cleaned up for readability -- almost all
indent-only changes should be fixed now (I have no idea how they got in in
the first place), and I've removed some of the extra debug printk statements.
In addition, it has a bugfix or two over the previous one.

It still corrupts some stripes (usually something like half of a cluster)
when doing I/O against the RAID while it's restriping, and it might still
have the problem with uninterruptible sleep (unsure about the last one, 
though).

I haven't done the other changes you proposed (yet).

/* Steinar */
-- 
Homepage: http://www.sesse.net/

[-- Attachment #2: raid5-online-expand-02.diff --]
[-- Type: text/plain, Size: 39615 bytes --]

--- /usr/src/orig/linux-2.6-2.6.12/drivers/md/raid5.c	2005-06-17 21:48:29.000000000 +0200
+++ drivers/md/raid5.c	2005-09-22 23:04:58.000000000 +0200
@@ -68,19 +68,38 @@
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static int sync_request (mddev_t *mddev, sector_t sector_nr, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+			unsigned int data_disks, unsigned int * dd_idx,
+			unsigned int * pd_idx, raid5_conf_t *conf);
 
 static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	BUG_ON(atomic_read(&sh->count) == 0);
 	if (atomic_dec_and_test(&sh->count)) {
 		if (!list_empty(&sh->lru))
 			BUG();
-		if (atomic_read(&conf->active_stripes)==0)
-			BUG();
+		if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+			if (atomic_read(&conf->active_stripes_expand)==0)
+				BUG();
+		} else {
+			if (atomic_read(&conf->active_stripes)==0)
+				BUG();
+		}
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAY_EXPAND, &sh->state)) {
+				list_add_tail(&sh->lru, &conf->wait_for_expand_list);
+				printk("delaying stripe with sector %llu (expprog=%llu, active=%d)\n", sh->sector,
+					conf->expand_progress, atomic_read(&conf->active_stripes_expand));
+			} else if (test_bit(STRIPE_DELAYED, &sh->state)) {
 				list_add_tail(&sh->lru, &conf->delayed_list);
-			else
+			} else {
 				list_add_tail(&sh->lru, &conf->handle_list);
+			}
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
 			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -88,11 +107,34 @@
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
 					md_wakeup_thread(conf->mddev->thread);
 			}
-			list_add_tail(&sh->lru, &conf->inactive_list);
-			atomic_dec(&conf->active_stripes);
-			if (!conf->inactive_blocked ||
-			    atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4))
-				wake_up(&conf->wait_for_stripe);
+			if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+				list_add_tail(&sh->lru, &conf->inactive_list_expand);
+				atomic_dec(&conf->active_stripes_expand);
+			} else {
+				list_add_tail(&sh->lru, &conf->inactive_list);
+				if (conf->expand_in_progress == 2) {
+					// we are in the process of finishing up an expand, see
+					// if we have no active stripes left
+					if (atomic_dec_and_test(&conf->active_stripes)) {
+						printk("Finishing up expand\n");
+						raid5_finish_expand(conf);
+						printk("Expand done.\n");
+					}
+				} else {
+					atomic_dec(&conf->active_stripes);
+				}
+			}
+			if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+				if (!conf->inactive_blocked_expand ||
+				    atomic_read(&conf->active_stripes_expand) < (NR_STRIPES*3/4)) {
+					wake_up(&conf->wait_for_stripe_expand);
+				}
+			} else {
+				if (!conf->inactive_blocked ||
+				    atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4)) {
+					wake_up(&conf->wait_for_stripe);
+				}
+			}
 		}
 	}
 }
@@ -133,20 +175,44 @@
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
 
 	CHECK_DEVLOCK();
-	if (list_empty(&conf->inactive_list))
-		goto out;
-	first = conf->inactive_list.next;
-	sh = list_entry(first, struct stripe_head, lru);
-	list_del_init(first);
-	remove_hash(sh);
-	atomic_inc(&conf->active_stripes);
+
+	if (expand) {
+		if (list_empty(&conf->inactive_list_expand))
+			goto out;
+		first = conf->inactive_list_expand.next;
+		sh = list_entry(first, struct stripe_head, lru);
+		list_del_init(first);
+		remove_hash(sh);
+		atomic_inc(&conf->active_stripes_expand);
+	} else {
+		if (list_empty(&conf->inactive_list))
+			goto out;
+		first = conf->inactive_list.next;
+		sh = list_entry(first, struct stripe_head, lru);
+		list_del_init(first);
+		remove_hash(sh);
+		atomic_inc(&conf->active_stripes);
+	}
 out:
+
+	if (sh) {
+		if (conf->expand_in_progress) {
+			if (expand)
+				BUG_ON(sh->disks != conf->raid_disks);
+			else
+				BUG_ON(sh->disks != conf->previous_raid_disks);
+		} else {
+			BUG_ON(expand);
+			BUG_ON(sh->disks != conf->raid_disks);
+		}
+	}
+
 	return sh;
 }
 
@@ -184,7 +250,7 @@
 static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -245,21 +311,59 @@
 
 	do {
 		sh = __find_stripe(conf, sector);
+
+		// make sure this is of the right size; if not, remove it from the hash
+		if (sh) {
+			int correct_disks = conf->raid_disks;
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+				correct_disks = conf->previous_raid_disks;
+			}
+
+			if (sh->disks != correct_disks) {
+				BUG_ON(atomic_read(&sh->count) != 0);
+
+				remove_hash(sh);
+				sh = NULL;
+			}
+		}
+		
 		if (!sh) {
-			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+				if (!conf->inactive_blocked_expand) {
+					sh = get_free_stripe(conf, 1);
+				}
+			} else {
+				if (!conf->inactive_blocked) {
+					sh = get_free_stripe(conf, 0);
+				}
+			}
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
-				conf->inactive_blocked = 1;
-				wait_event_lock_irq(conf->wait_for_stripe,
-						    !list_empty(&conf->inactive_list) &&
-						    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
-						     || !conf->inactive_blocked),
-						    conf->device_lock,
-						    unplug_slaves(conf->mddev);
-					);
-				conf->inactive_blocked = 0;
+				if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+//					printk("WAITING FOR AN EXPAND STRIPE\n");
+					conf->inactive_blocked_expand = 1;
+					wait_event_lock_irq(conf->wait_for_stripe_expand,
+							    !list_empty(&conf->inactive_list_expand) &&
+							    (atomic_read(&conf->active_stripes_expand) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked_expand),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked_expand = 0;
+				} else {
+//					printk("WAITING FOR A NON-EXPAND STRIPE, sector=%llu\n", sector);
+					conf->inactive_blocked = 1;
+					wait_event_lock_irq(conf->wait_for_stripe,
+							    !list_empty(&conf->inactive_list) &&
+							    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked = 0;
+				}
+//				printk("INACTIVITY DONE\n");
 			} else
 				init_stripe(sh, sector, pd_idx);
 		} else {
@@ -267,8 +371,13 @@
 				if (!list_empty(&sh->lru))
 					BUG();
 			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
+				if (!test_bit(STRIPE_HANDLE, &sh->state)) {
+					if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+						atomic_inc(&conf->active_stripes_expand);
+					} else {
+						atomic_inc(&conf->active_stripes);
+					}
+				}
 				if (list_empty(&sh->lru))
 					BUG();
 				list_del_init(&sh->lru);
@@ -283,26 +392,33 @@
 	return sh;
 }
 
-static int grow_stripes(raid5_conf_t *conf, int num)
+static int grow_stripes(raid5_conf_t *conf, int num, int expand)
 {
 	struct stripe_head *sh;
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
+	if (expand)
+		sprintf(conf->cache_name, "raid5e/%s", mdname(conf->mddev));
+	else
+		sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
 
 	sc = kmem_cache_create(conf->cache_name, 
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
-	conf->slab_cache = sc;
+	if (expand)
+		conf->slab_cache_expand = sc;
+	else
+		conf->slab_cache = sc;
 	while (num--) {
 		sh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!sh)
 			return 1;
 		memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
 		sh->raid_conf = conf;
+		sh->disks = conf->raid_disks;
 		spin_lock_init(&sh->lock);
 
 		if (grow_buffers(sh, conf->raid_disks)) {
@@ -312,7 +428,11 @@
 		}
 		/* we just created an active stripe so... */
 		atomic_set(&sh->count, 1);
-		atomic_inc(&conf->active_stripes);
+		if (expand) {
+			atomic_inc(&conf->active_stripes_expand);
+		} else {
+			atomic_inc(&conf->active_stripes);
+		}
 		INIT_LIST_HEAD(&sh->lru);
 		release_stripe(sh);
 	}
@@ -325,7 +445,7 @@
 
 	while (1) {
 		spin_lock_irq(&conf->device_lock);
-		sh = get_free_stripe(conf);
+		sh = get_free_stripe(conf, 0);
 		spin_unlock_irq(&conf->device_lock);
 		if (!sh)
 			break;
@@ -344,7 +464,7 @@
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -411,12 +531,93 @@
 	return 0;
 }
 
+							
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+	int i;
+	struct disk_info *tmp;
+//	shrink_stripes(conf);
+	
+	conf->expand_in_progress = 0;
+	conf->active_stripes = conf->active_stripes_expand;
+	conf->inactive_list = conf->inactive_list_expand;
+	conf->wait_for_stripe = conf->wait_for_stripe_expand;
+	conf->slab_cache = conf->slab_cache_expand;
+	conf->inactive_blocked = conf->inactive_blocked_expand;
+
+	// fix up linked list
+	conf->inactive_list.next->prev = &conf->inactive_list;
+	{
+		struct list_head *first = &conf->inactive_list;
+		while (1) {
+			if (first->next == &conf->inactive_list_expand) {
+				first->next = &conf->inactive_list;
+				break;
+			}
+
+			first = first->next;
+		}
+	}
+
+	conf->wait_for_stripe.task_list.next->prev = &conf->wait_for_stripe.task_list;
+	{
+		struct list_head *first = &conf->wait_for_stripe.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_stripe_expand.task_list) {
+				first->next = &conf->wait_for_stripe.task_list;
+				break;
+			}
+
+			first = first->next;
+		}
+	}
+
+	for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+		tmp = conf->disks + i;
+		if (tmp->rdev
+		    && !tmp->rdev->faulty
+		    && !tmp->rdev->in_sync) {
+			conf->mddev->degraded--;
+			conf->failed_disks--;
+			conf->working_disks++;
+			tmp->rdev->in_sync = 1;
+		}
+	}
+
+	// inform the md code that we have more space now
+ 	{	
+		struct block_device *bdev;
+		sector_t sync_sector;
+		unsigned dummy1, dummy2;
+
+		conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+			conf->raid_disks - 1, &dummy1, &dummy2, conf);
+		
+		conf->mddev->recovery_cp = sync_sector << 1;    // FIXME: hum, hum
+		set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			down(&bdev->bd_inode->i_sem);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			up(&bdev->bd_inode->i_sem);
+			bdput(bdev);
+		}
+	}
+	
+	/* FIXME: free old stuff here! (what are we missing?) */
+}
+
 static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -436,7 +637,7 @@
 	}
 
 	spin_lock_irqsave(&conf->device_lock, flags);
-	if (!uptodate)
+	if (!uptodate) 
 		md_error(conf->mddev, conf->disks[i].rdev);
 
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
@@ -570,7 +771,7 @@
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -605,7 +806,8 @@
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+				check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
 		return 0;
 	}
 	return r_sector;
@@ -671,8 +873,7 @@
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -691,7 +892,6 @@
 			printk("compute_block() %d, stripe %llu, %d"
 				" not present\n", dd_idx,
 				(unsigned long long)sh->sector, i);
-
 		check_xor();
 	}
 	if (count != 1)
@@ -702,7 +902,7 @@
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -876,11 +1076,11 @@
  * get BH_Lock set before the stripe lock is released.
  *
  */
- 
+
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -897,6 +1097,7 @@
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
+	clear_bit(STRIPE_DELAY_EXPAND, &sh->state);
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
@@ -945,19 +1146,20 @@
 		}
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
-		if (!rdev || !rdev->in_sync) {
+		if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
 			failed++;
 			failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
-	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
 	if (failed > 1 && to_read+to_write+written) {
+		printk("Need to fail requests!\n");
+		printk("locked=%d uptodate=%d to_read=%d"
+			" to_write=%d failed=%d failed_num=%d disks=%d\n",
+			locked, uptodate, to_read, to_write, failed, failed_num, disks);
 		spin_lock_irq(&conf->device_lock);
 		for (i=disks; i--; ) {
 			/* fail all writes first */
@@ -1012,7 +1214,7 @@
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	if (failed > 1 && syncing) {
+	if (failed > 1 && syncing && !conf->expand_in_progress) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1085,7 +1287,7 @@
 					/* if I am just reading this block and we don't have
 					   a failed drive, or any pending writes then sidestep the cache */
 					if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
-					    ! syncing && !failed && !to_write) {
+						! syncing && !failed && !to_write) {
 						sh->bh_cache[i]->b_page =  sh->bh_read[i]->b_page;
 						sh->bh_cache[i]->b_data =  sh->bh_read[i]->b_data;
 					}
@@ -1093,7 +1295,7 @@
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
 						i, syncing);
-					if (syncing)
+					if (syncing && !conf->expand_in_progress)
 						md_sync_acct(conf->disks[i].rdev->bdev,
 							     STRIPE_SECTORS);
 				}
@@ -1102,6 +1304,288 @@
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
+	// see if we have the data we need to expand by another block
+	if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+		int uptodate = 0, delay_to_future=0, d = 0, count = 0, needed_uptodate = 0;
+		for (i=0; i<disks; ++i) {
+			sector_t start_sector, dest_sector;
+			unsigned int dd_idx, pd_idx;
+
+			if (i == sh->pd_idx)
+				continue;
+
+			start_sector = sh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+			++d;
+
+			// see what sector this block would land in the new layout
+			dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+				conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+			if (dd_idx > pd_idx)
+				--dd_idx;
+
+/*			printk("start_sector = %llu (base=%llu, i=%u, d=%u) || dest_stripe = %llu\n", start_sector, sh->sector,
+				i, d, dest_stripe); */
+			
+			if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ 			    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+/*				printk("UPDATING CHUNK %u FROM DISK %u (sec=%llu, dest_sector=%llu, uptodate=%u)\n",
+					dd_idx, i, start_sector, dest_sector, test_bit(R5_UPTODATE, &sh->dev[i].flags)); */
+				unsigned int buf_sector;
+				sector_t base = conf->expand_progress;
+				sector_div(base, conf->raid_disks - 1);
+
+				buf_sector = dd_idx * (conf->chunk_size / STRIPE_SIZE) + (dest_sector - base) / STRIPE_SECTORS;
+				
+				if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					conf->expand_buffer[buf_sector].up_to_date = 1;
+//					printk("memcpy device %u/%u: %p <- %p\n", i, sh->disks,
+//						page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page));
+					memcpy(page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+//					printk("memcpy done\n");
+					count = 1;
+					PRINTK("Updating %u\n", buf_sector);
+				} else {
+					conf->expand_buffer[buf_sector].up_to_date = 0;
+				}
+			} else if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) &&
+				   dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) * 2 &&
+				   syncing) {
+				delay_to_future = 1;
+			}
+		}
+
+		for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+			uptodate += conf->expand_buffer[i].up_to_date;
+		}
+		if (count) 
+			PRINTK("%u/%lu is up to date\n", uptodate, (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE));
+	
+		/*
+		 * Figure out how many stripes we need for this chunk to be complete.
+		 * In almost all cases, this will be a full destination stripe, but our
+		 * original volume might not be big enough for that at the very end --
+		 * so use the rest of the volume then.
+	         */
+		needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+		if (((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+			needed_uptodate = ((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+//			printk("reading partial block at the end: %u\n", needed_uptodate);
+		}
+		if (needed_uptodate > 0 && uptodate == needed_uptodate) {
+			// we can do an expand!
+			struct stripe_head *newsh[256];   // FIXME: dynamic allocation somewhere instead?
+			sector_t dest_sector, advance;
+			unsigned i;
+			unsigned int dummy1, dummy2, pd_idx;
+
+			if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+			} else {
+				advance = (conf->mddev->size << 1) - conf->expand_progress;
+			}
+
+//			sector_div(new_sector, (conf->raid_disks - 1));
+//			printk("EXPANDING ONTO SECTOR %llu\n", conf->expand_progress);
+//			printk("EXPAND => %llu/%llu\n", conf->expand_progress, conf->mddev->size << 1);
+			
+			// find the parity disk and starting sector
+			dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+				conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+//			printk("Expanding onto %llu\n", dest_sector);
+		
+			spin_lock_irq(&conf->device_lock);
+			
+			/*
+			 * Check that we won't try to expand over an area where there's
+			 * still active stripes; if we do, we'll risk inconsistency since we
+			 * suddenly have two different sets of stripes referring to the
+			 * same logical sector.
+			 */
+			{
+				struct stripe_head *ash;
+				int activity = 0, i;
+				sector_t first_touched_sector, last_touched_sector;
+				
+				first_touched_sector = raid5_compute_sector(conf->expand_progress,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+				last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->previous_raid_disks - 1)) >> 9) - 1,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+				for (i = 0; i < NR_HASH; i++) {
+					ash = conf->stripe_hashtbl[i];
+					for (; ash; ash = ash->hash_next) {						
+						if (sh == ash && atomic_read(&ash->count) == 1 && !to_write)
+							continue;   // we'll release it shortly, so it's OK (?)
+
+						// is this stripe active, and within the region we're expanding?
+						if (atomic_read(&ash->count) > 0 &&
+						    ash->disks == conf->previous_raid_disks &&
+						    ash->sector >= first_touched_sector &&
+						    ash->sector <= last_touched_sector) {
+							activity = 1;
+							break;
+						}
+					}
+				}
+				
+				if (activity) {
+					printk("Aborting, active stripes in the area\n");
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+
+			/*
+			 * Check that we have enough free stripes to write out our
+			 * entire chunk in the new layout. If not, we'll have to wait
+			 * until some writes have been retired. We can't just do
+			 * as in get_active_stripe() and sleep here until enough are
+			 * free, since all busy stripes might have STRIPE_HANDLE set
+			 * and thus won't be retired until somebody (our thread!) takes
+			 * care of them.
+			 */	
+			
+			{
+				int not_enough_free = 0;
+				
+				for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+					newsh[i] = get_free_stripe(conf, 1);
+					if (newsh[i] == NULL) {
+						not_enough_free = 1;
+						break;
+					}
+					init_stripe(newsh[i], dest_sector + i * STRIPE_SECTORS, pd_idx);		
+				}
+
+				if (not_enough_free) {
+					// release all the stripes we allocated
+					for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+						if (newsh[i] == NULL)
+							break;
+						atomic_inc(&newsh[i]->count);
+						__release_stripe(conf, newsh[i]);
+					}
+					printk("Aborting, not enough destination stripes free\n");
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				for (d = 0; d < conf->raid_disks; ++d) {
+					unsigned dd_idx = d;
+					
+					if (d != pd_idx) {
+						if (dd_idx > pd_idx)
+							--dd_idx;
+
+						memcpy(page_address(newsh[i]->dev[d].page), page_address(conf->expand_buffer[dd_idx * conf->chunk_size / STRIPE_SIZE + i].page), STRIPE_SIZE);
+					}
+					set_bit(R5_Wantwrite, &newsh[i]->dev[d].flags);
+					set_bit(R5_Syncio, &newsh[i]->dev[d].flags);
+				}
+			}
+			
+			for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+				conf->expand_buffer[i].up_to_date = 0;
+			}
+
+			conf->expand_progress += advance;
+			
+			spin_unlock_irq(&conf->device_lock);
+			
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				compute_parity(newsh[i], RECONSTRUCT_WRITE);
+					
+				atomic_inc(&newsh[i]->count);
+				set_bit(STRIPE_INSYNC, &newsh[i]->state);
+				set_bit(STRIPE_HANDLE, &newsh[i]->state);
+				release_stripe(newsh[i]);
+			}
+
+			spin_lock_irq(&conf->device_lock);
+			md_done_sync(conf->mddev, advance, 1);
+			wake_up(&conf->wait_for_expand_progress);
+			spin_unlock_irq(&conf->device_lock);
+
+//			md_sync_acct(conf->disks[0].rdev->bdev, STRIPE_SECTORS * (conf->raid_disks - 1));
+
+			// see if we have delayed data that we can process now
+			{			
+				struct list_head *l, *next;
+				
+				spin_lock_irq(&conf->device_lock);
+				l = conf->wait_for_expand_list.next;
+
+//				printk("printing delay list:\n");
+				while (l != &conf->wait_for_expand_list) {
+					int i, d = 0;
+					int do_process = 0;
+					
+					struct stripe_head *dsh;
+					dsh = list_entry(l, struct stripe_head, lru);
+//					printk("sector: %llu\n", dsh->sector);
+					
+					for (i=0; i<disks; ++i) {
+						sector_t start_sector, dest_sector;
+						unsigned int dd_idx, pd_idx;
+
+						if (i == dsh->pd_idx)
+							continue;
+
+						start_sector = dsh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+
+						// see what sector this block would land in in the new layout
+						dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+								conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+						if (/*dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&*/
+						    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->raid_disks - 1) * (conf->chunk_size >> 9)) {
+							do_process = 1;
+						}
+
+						++d;
+					}
+					
+					next = l->next;
+					
+					if (do_process) {
+						list_del_init(l);
+
+						set_bit(STRIPE_HANDLE, &dsh->state);
+						clear_bit(STRIPE_DELAYED, &dsh->state);
+						clear_bit(STRIPE_DELAY_EXPAND, &dsh->state);
+						atomic_inc(&dsh->count);
+						atomic_inc(&dsh->count);
+						printk("pulling in stuff from delayed, sector=%llu\n",
+							dsh->sector);
+						__release_stripe(conf, dsh);
+					} else {
+						printk("still there\n");
+					}
+
+					l = next;
+				}
+
+				spin_unlock_irq(&conf->device_lock);
+			}
+
+			// see if we are done
+			if (conf->expand_progress >= conf->mddev->array_size << 1) {
+				printk("expand done, waiting for last activity to settle...\n");
+//				conf->mddev->raid_disks = conf->raid_disks;
+//				raid5_resize(conf->mddev, conf->mddev->size << 1);
+				conf->expand_in_progress = 2;
+			}
+
+please_wait:			
+			;
+		}
+
+		if (delay_to_future) { // && atomic_dec_and_test(&sh->count)) {
+			set_bit(STRIPE_DELAY_EXPAND, &sh->state);
+		}
+	}
+
 	/* now to consider writing and what else, if anything should be read */
 	if (to_write) {
 		int rmw=0, rcw=0;
@@ -1237,7 +1721,9 @@
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		if (!conf->expand_in_progress) {
+			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		}
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 	
@@ -1279,7 +1765,7 @@
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1427,9 +1913,18 @@
 		md_write_start(mddev);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
-		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+	recalculate:		
+		if (conf->expand_in_progress && logical_sector >= conf->expand_progress) {
+			new_sector = raid5_compute_sector(logical_sector,
+				conf->previous_raid_disks, conf->previous_raid_disks - 1, &dd_idx, &pd_idx, conf);
+			disks = conf->previous_raid_disks;
+		} else {
+			new_sector = raid5_compute_sector(logical_sector,
+				 raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+			disks = conf->raid_disks;
+		}
 
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
@@ -1438,15 +1933,21 @@
 	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
-		if (sh) {
-			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+		if (sh) {			
+			if (sh->disks != disks || !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 				/* Add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
 				release_stripe(sh);
 				schedule();
-				goto retry;
+				if (sh->disks != disks) {
+					// just expanded past this point! re-process using the new structure
+					printk("recalculate!\n");
+					finish_wait(&conf->wait_for_overlap, &w);
+					goto recalculate;
+				} else
+					goto retry;
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
 			raid5_plug_device(conf);
@@ -1488,6 +1989,13 @@
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
 
+	if (conf->expand_in_progress) {
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks-1;
+	}
+
+	BUG_ON(data_disks == 0 || raid_disks == 0);
+	
 	if (sector_nr >= mddev->size <<1) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
@@ -1502,12 +2010,31 @@
 		md_done_sync(mddev, rv, 1);
 		return rv;
 	}
+	
+	/* if we're in an expand, we can't allow the process
+	 * to keep reading in stripes; we might not have enough buffer
+	 * space to keep it all in RAM.
+	 */
+	if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_expand_progress,
+			    sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+		);
+		spin_unlock_irq(&conf->device_lock);
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
 	stripe = x;
 	BUG_ON(x != stripe);
-
+	
+	PRINTK("sync_request:%llu/%llu, %u+%u active, pr=%llu v. %llu\n", sector_nr, mddev->size<<1,
+		atomic_read(&conf->active_stripes), atomic_read(&conf->active_stripes_expand),
+		sector_nr,
+		conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)); 
+ 
 	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
 		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
 	sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
@@ -1553,6 +2080,8 @@
 	while (1) {
 		struct list_head *first;
 
+		conf = mddev_to_conf(mddev);
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
@@ -1650,6 +2179,7 @@
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_in_progress = 0;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1691,7 +2221,7 @@
 	}
 memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
 		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
-	if (grow_stripes(conf, conf->max_nr_stripes)) {
+	if (grow_stripes(conf, conf->max_nr_stripes, 0)) {
 		printk(KERN_ERR 
 			"raid5: couldn't allocate %dkB for buffers\n", memory);
 		shrink_stripes(conf);
@@ -1767,8 +2297,8 @@
 
 	printk("sh %llu, pd_idx %d, state %ld.\n",
 		(unsigned long long)sh->sector, sh->pd_idx, sh->state);
-	printk("sh %llu,  count %d.\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count));
+	printk("sh %llu,  count %d, disks %d.\n",
+		(unsigned long long)sh->sector, atomic_read(&sh->count), sh->disks);
 	printk("sh %llu, ", (unsigned long long)sh->sector);
 	for (i = 0; i < sh->raid_conf->raid_disks; i++) {
 		printk("(cache%d: %p %ld) ", 
@@ -1865,6 +2395,9 @@
 	mdk_rdev_t *rdev;
 	struct disk_info *p = conf->disks + number;
 
+	printk("we were asked to remove a disk\n");
+	return -EBUSY;  // FIXME: hack
+	
 	print_raid5_conf(conf);
 	rdev = p->rdev;
 	if (rdev) {
@@ -1903,6 +2436,7 @@
 	 */
 	for (disk=0; disk < mddev->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
+			rdev->faulty = 0;
 			rdev->in_sync = 0;
 			rdev->raid_disk = disk;
 			found = 1;
@@ -1915,6 +2449,7 @@
 
 static int raid5_resize(mddev_t *mddev, sector_t sectors)
 {
+        raid5_conf_t *conf = mddev_to_conf(mddev);
 	/* no resync is happening, and there is enough space
 	 * on all devices, so we can resize.
 	 * We need to make sure resync covers any new space.
@@ -1922,6 +2457,9 @@
 	 * any io in the removed space completes, but it hardly seems
 	 * worth it.
 	 */
+	if (conf->expand_in_progress)
+		return -EBUSY;
+		
 	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
 	mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
 	set_capacity(mddev->gendisk, mddev->array_size << 1);
@@ -1934,6 +2472,219 @@
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	raid5_conf_t *newconf;
+	struct list_head *tmp;
+	mdk_rdev_t *rdev;
+	unsigned long flags;
+
+	int d, i;
+
+	if (mddev->degraded >= 1 || conf->expand_in_progress)
+		return -EBUSY;
+	
+	print_raid5_conf(conf);
+	
+	newconf = kmalloc (sizeof (raid5_conf_t)
+			+ raid_disks * sizeof(struct disk_info),
+			GFP_KERNEL);
+	if (newconf == NULL)
+		return -ENOMEM;	
+	
+	memset(newconf, 0, sizeof (raid5_conf_t) + raid_disks * sizeof(struct disk_info));
+	memcpy(newconf, conf, sizeof (raid5_conf_t) + conf->raid_disks * sizeof(struct disk_info));
+
+	newconf->expand_in_progress = 1;
+	newconf->expand_progress = 0;
+	newconf->raid_disks = mddev->raid_disks = raid_disks;	
+	newconf->previous_raid_disks = conf->raid_disks;	
+	
+	INIT_LIST_HEAD(&newconf->inactive_list_expand);
+	
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	mddev->private = newconf;
+
+	printk("conf=%p newconf=%p\n", conf, newconf);
+	
+	if (newconf->handle_list.next)
+		newconf->handle_list.next->prev = &newconf->handle_list;
+	if (newconf->delayed_list.next)
+		newconf->delayed_list.next->prev = &newconf->delayed_list;
+	if (newconf->inactive_list.next)
+		newconf->inactive_list.next->prev = &newconf->inactive_list;
+
+	if (newconf->handle_list.prev == &conf->handle_list)
+		newconf->handle_list.prev = &newconf->handle_list;
+	if (newconf->delayed_list.prev == &conf->delayed_list)
+		newconf->delayed_list.prev = &newconf->delayed_list;
+	if (newconf->inactive_list.prev == &conf->inactive_list)
+		newconf->inactive_list.prev = &newconf->inactive_list;
+	
+	if (newconf->wait_for_stripe.task_list.prev == &conf->wait_for_stripe.task_list)
+		newconf->wait_for_stripe.task_list.prev = &newconf->wait_for_stripe.task_list;
+	if (newconf->wait_for_overlap.task_list.prev == &conf->wait_for_overlap.task_list)
+		newconf->wait_for_overlap.task_list.prev = &newconf->wait_for_overlap.task_list;
+	
+	init_waitqueue_head(&newconf->wait_for_stripe_expand);
+	init_waitqueue_head(&newconf->wait_for_expand_progress);
+	INIT_LIST_HEAD(&newconf->wait_for_expand_list);
+	
+	// update all the stripes
+	for (i = 0; i < NR_STRIPES; ++i) {
+		struct stripe_head *sh = newconf->stripe_hashtbl[i];
+		while (sh) {
+			sh->raid_conf = newconf;
+			
+			if (sh->lru.next == &conf->inactive_list)
+				sh->lru.next = &newconf->inactive_list;
+			if (sh->lru.next == &conf->handle_list)
+				sh->lru.next = &newconf->handle_list;
+
+			sh = sh->hash_next;
+		}
+	}
+
+	// ...and all on the inactive queue
+	{
+		struct list_head *first = newconf->inactive_list.next;
+		
+		while (1) {
+			struct stripe_head *sh = list_entry(first, struct stripe_head, lru);
+			sh->raid_conf = newconf;
+		
+			if (sh->lru.next == &conf->inactive_list)
+				sh->lru.next = &newconf->inactive_list;
+			if (sh->lru.next == &conf->handle_list)
+				sh->lru.next = &newconf->handle_list;
+
+			if (first->next == &conf->inactive_list || first->next == &newconf->inactive_list) {
+				first->next = &newconf->inactive_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+
+	// update the pointer for the other lists as well
+	{
+		struct list_head *first = &newconf->handle_list;
+		while (1) {
+			if (first->next == &conf->handle_list) {
+				first->next = &newconf->handle_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->delayed_list;
+		while (1) {
+			if (first->next == &conf->delayed_list) {
+				first->next = &newconf->delayed_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->wait_for_stripe.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_stripe.task_list) {
+				first->next = &newconf->wait_for_stripe.task_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	{
+		struct list_head *first = &newconf->wait_for_overlap.task_list;
+		while (1) {
+			if (first->next == &conf->wait_for_overlap.task_list) {
+				first->next = &newconf->wait_for_overlap.task_list;
+				break;
+			}
+					
+			first = first->next;
+		};
+	}
+	
+	ITERATE_RDEV(mddev,rdev,tmp) {
+		printk("disk: %p\n", rdev);
+		for (d= 0; d < newconf->raid_disks; d++) {
+			if (newconf->disks[d].rdev == rdev) {
+				goto already_there;
+			}
+		}
+
+		raid5_add_disk(mddev, rdev);
+		newconf->failed_disks++;
+		
+already_there:		
+		;
+	}
+
+	// argh! we can't hold this lock while allocating memory
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+	// allocate new stripes
+	atomic_set(&newconf->active_stripes_expand, 0);
+	if (grow_stripes(newconf, newconf->max_nr_stripes, 1)) {
+		int memory = newconf->max_nr_stripes * (sizeof(struct stripe_head) +
+			newconf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand stripes\n", memory);
+		shrink_stripes(newconf);
+		kfree(newconf);
+		return -ENOMEM;
+	}
+
+	// and space for our temporary expansion buffers
+	newconf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+	if (newconf->expand_buffer == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+			(conf->chunk_size * (raid_disks-1)) >> 10);
+		shrink_stripes(newconf);
+		kfree(newconf);
+		return -ENOMEM;
+	}
+	
+	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+		newconf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+		if (newconf->expand_buffer[i].page == NULL) {
+			printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+				(conf->chunk_size * (raid_disks-1)) >> 10);
+			shrink_stripes(newconf);
+			kfree(newconf);
+			return -ENOMEM;
+		}
+		newconf->expand_buffer[i].up_to_date = 0;
+	}
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	
+	print_raid5_conf(newconf);
+
+	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+	set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	mddev->recovery_cp = 0;
+	md_wakeup_thread(mddev->thread);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+
+	kfree(conf);
+
+	printk("Starting expand.\n");
+	
+        return 0;
+}
+
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",
@@ -1948,6 +2699,7 @@
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+	.reshape	= raid5_reshape
 };
 
 static int __init raid5_init (void)


* Re: [PATCH] Online RAID-5 resizing
  2005-09-22 16:32       ` Steinar H. Gunderson
@ 2005-09-23  8:59         ` Neil Brown
  2005-09-23 12:50           ` Steinar H. Gunderson
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2005-09-23  8:59 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Thursday September 22, sgunderson@bigfoot.com wrote:
> 
> > 2/ Reserve the stripe_heads needed for a chunk-resize in make_request
> >    (where it is safe to block) rather than in handle_stripe.
> 
> Hm. make_request is never called from the sync code, is it? My understanding
> was that make_request was called whenever userspace wanted to read/write
> something, and only then.

Sorry, I meant sync_request.

NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-09-23  8:59         ` Neil Brown
@ 2005-09-23 12:50           ` Steinar H. Gunderson
  0 siblings, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-23 12:50 UTC (permalink / raw)
  To: linux-raid

On Fri, Sep 23, 2005 at 10:59:02AM +0200, Neil Brown wrote:
>> Hm. make_request is never called from the sync code, is it? My understanding
>> was that make_request was called whenever userspace wanted to read/write
>> something, and only then.
> Sorry, I meant sync_request.

Hm, that makes a lot more sense :-) I'll have a look at it and see if I can
get it to work in a sane way.

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-09-22 16:16     ` Neil Brown
  2005-09-22 16:32       ` Steinar H. Gunderson
  2005-09-22 20:53       ` Steinar H. Gunderson
@ 2005-09-24  1:44       ` Steinar H. Gunderson
  2005-10-07  3:09         ` Neil Brown
  2 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-09-24  1:44 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1785 bytes --]

On Thu, Sep 22, 2005 at 06:16:41PM +0200, Neil Brown wrote:
> 2/ Reserve the stripe_heads needed for a chunk-resize in make_request
>    (where it is safe to block) rather than in handle_stripe.

See the attached patch for at least a partial implementation of this (on top
of the last patch); it pre-allocates the write stripes, but not the read
stripes. In other words, it still uses a temporary buffer, and it doesn't
block requests against the to-be-expanded area.

There's also one extra fix I snuck in: before recomputing the parity for the
written-out blocks, I now lock the stripe first. This shouldn't break anything,
and AFAICS it should be a good idea both in this and the previous implementation.

I'm unsure how much this actually buys us (there's a slight reduction in
complexity, but I fear that will go up again once I implement it for read
stripes as well), and I seem to have created a somewhat nasty regression --
if I do I/O against the volume while it's expanding, it causes a panic
_somewhere_ in handle_stripe, which I'm having a hard time tracking down. (If
you have a good way of actually mapping the offsets to C lines, let me know
-- every approach I've tried seems to fail, and the disassembly even shows
lots of nonsensical asm :-) )

In short, I think getting the original patch more or less bug-free is a
higher priority than this, although I do agree that it's a conceptually
cleaner way. What I really need is external testing of this, and of course
preferably someone tracking down the bugs that are still left there... There
might be something that's really obvious to you (like “why on earth isn't he
locking that there?”) that I've overlooked, simply because you know this code
a _lot_ better than me :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/

[-- Attachment #2: raid5-online-expand-prealloc.diff --]
[-- Type: text/plain, Size: 6130 bytes --]

--- include/linux/raid/raid5.h.orig	2005-09-24 01:08:00.000000000 +0200
+++ include/linux/raid/raid5.h	2005-09-24 05:17:50.000000000 +0200
@@ -225,6 +225,9 @@
 	struct list_head	wait_for_expand_list;
 	
 	struct expand_buf	*expand_buffer;
+	
+	int			expand_stripes_ready;	
+	struct stripe_head	**expand_stripes;
 
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
--- drivers/md/raid5.c.orig	2005-09-24 01:05:32.000000000 +0200
+++ drivers/md/raid5.c	2005-09-24 05:22:30.000000000 +0200
@@ -1371,9 +1371,8 @@
 			needed_uptodate = ((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS;
 //			printk("reading partial block at the end: %u\n", needed_uptodate);
 		}
-		if (needed_uptodate > 0 && uptodate == needed_uptodate) {
+		if (needed_uptodate > 0 && uptodate == needed_uptodate && conf->expand_stripes_ready) {
 			// we can do an expand!
-			struct stripe_head *newsh[256];   // FIXME: dynamic allocation somewhere instead?
 			sector_t dest_sector, advance;
 			unsigned i;
 			unsigned int dummy1, dummy2, pd_idx;
@@ -1435,72 +1434,50 @@
 				}
 			}
 
-			/*
-			 * Check that we have enough free stripes to write out our
-			 * entire chunk in the new layout. If not, we'll have to wait
-			 * until some writes have been retired. We can't just do
-			 * as in get_active_stripe() and sleep here until enough are
-			 * free, since all busy stripes might have STRIPE_HANDLE set
-			 * and thus won't be retired until somebody (our thread!) takes
-			 * care of them.
-			 */	
-			
-			{
-				int not_enough_free = 0;
-				
-				for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
-					newsh[i] = get_free_stripe(conf, 1);
-					if (newsh[i] == NULL) {
-						not_enough_free = 1;
-						break;
-					}
-					init_stripe(newsh[i], dest_sector + i * STRIPE_SECTORS, pd_idx);		
-				}
-
-				if (not_enough_free) {
-					// release all the stripes we allocated
-					for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
-						if (newsh[i] == NULL)
-							break;
-						atomic_inc(&newsh[i]->count);
-						__release_stripe(conf, newsh[i]);
-					}
-					printk("Aborting, not enough destination stripes free\n");
-					spin_unlock_irq(&conf->device_lock);
-					goto please_wait;
-				}
-			}
-
 			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				init_stripe(newsh, dest_sector + i * STRIPE_SECTORS, pd_idx);
+
 				for (d = 0; d < conf->raid_disks; ++d) {
 					unsigned dd_idx = d;
-					
+					                                        
 					if (d != pd_idx) {
+						struct page *tmp;
+						unsigned di;
+						
 						if (dd_idx > pd_idx)
 							--dd_idx;
 
-						memcpy(page_address(newsh[i]->dev[d].page), page_address(conf->expand_buffer[dd_idx * conf->chunk_size / STRIPE_SIZE + i].page), STRIPE_SIZE);
+						di = dd_idx * conf->chunk_size / STRIPE_SIZE + i;
+						
+						// swap the two pages, moving the data in place into the stripe
+						tmp = newsh->dev[d].page;
+						newsh->dev[d].page = conf->expand_buffer[di].page;
+						conf->expand_buffer[di].page = tmp;
+						
+						conf->expand_buffer[di].up_to_date = 0;
 					}
-					set_bit(R5_Wantwrite, &newsh[i]->dev[d].flags);
-					set_bit(R5_Syncio, &newsh[i]->dev[d].flags);
+					set_bit(R5_Wantwrite, &newsh->dev[d].flags);
+					set_bit(R5_Syncio, &newsh->dev[d].flags);
 				}
 			}
 			
-			for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
-				conf->expand_buffer[i].up_to_date = 0;
-			}
-
+			conf->expand_stripes_ready = 0;
 			conf->expand_progress += advance;
 			
 			spin_unlock_irq(&conf->device_lock);
 			
 			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
-				compute_parity(newsh[i], RECONSTRUCT_WRITE);
+				struct stripe_head *newsh = conf->expand_stripes[i];
+
+				spin_lock(&newsh->lock);
+				compute_parity(newsh, RECONSTRUCT_WRITE);
 					
-				atomic_inc(&newsh[i]->count);
-				set_bit(STRIPE_INSYNC, &newsh[i]->state);
-				set_bit(STRIPE_HANDLE, &newsh[i]->state);
-				release_stripe(newsh[i]);
+				atomic_inc(&newsh->count);
+				set_bit(STRIPE_INSYNC, &newsh->state);
+				set_bit(STRIPE_HANDLE, &newsh->state);
+				spin_unlock(&newsh->lock);
+				release_stripe(newsh);
 			}
 
 			spin_lock_irq(&conf->device_lock);
@@ -2025,6 +2002,37 @@
 		spin_unlock_irq(&conf->device_lock);
 	}
 
+	/*
+	 * In an expand, we also need to make sure that we have enough destination stripes
+	 * available for writing out the block after we've read in the data, so make sure
+	 * we get them before we start reading any data.
+	 */
+	if (conf->expand_in_progress && !conf->expand_stripes_ready) {
+		unsigned i;
+		
+		spin_lock_irq(&conf->device_lock);
+		for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+			do {
+				conf->expand_stripes[i] = get_free_stripe(conf, 1);
+
+				if (conf->expand_stripes[i] == NULL) {
+					conf->inactive_blocked_expand = 1;
+					wait_event_lock_irq(conf->wait_for_stripe_expand,
+							    !list_empty(&conf->inactive_list_expand) &&
+							    (atomic_read(&conf->active_stripes_expand) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked_expand),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked_expand = 0;
+				}
+			} while (conf->expand_stripes[i] == NULL);
+		}
+		spin_unlock_irq(&conf->device_lock);
+
+		conf->expand_stripes_ready = 1;
+	}
+
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
 	stripe = x;
@@ -2653,6 +2661,15 @@
 		kfree(newconf);
 		return -ENOMEM;
 	}
+
+	newconf->expand_stripes = kmalloc (sizeof(struct stripe_head *) * (conf->chunk_size / STRIPE_SIZE), GFP_KERNEL);
+	if (newconf->expand_stripes == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate memory for expand stripe pointers\n");
+		shrink_stripes(newconf);
+		kfree(newconf);
+		return -ENOMEM;
+	}
+	newconf->expand_stripes_ready = 0;
 	
 	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
 		newconf->expand_buffer[i].page = alloc_page(GFP_KERNEL);


* Re: [PATCH] Online RAID-5 resizing
  2005-09-24  1:44       ` Steinar H. Gunderson
@ 2005-10-07  3:09         ` Neil Brown
  2005-10-07 14:13           ` Steinar H. Gunderson
  2005-10-14 19:46           ` Steinar H. Gunderson
  0 siblings, 2 replies; 23+ messages in thread
From: Neil Brown @ 2005-10-07  3:09 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Saturday September 24, sgunderson@bigfoot.com wrote:
> On Thu, Sep 22, 2005 at 06:16:41PM +0200, Neil Brown wrote:
> > 2/ Reserve the stripe_heads needed for a chunk-resize in make_request
> >    (where it is safe to block) rather than in handle_stripe.
> 
> See the attached patch for at least a partial implementation of this (on top
> of the last patch); it pre-allocates the write stripes, but not the read
> stripes. In other words, it still uses a temporary buffer, and it doesn't
> block requests against the to-be-expanded area.

That certainly looks like it is heading in the right direction.
Thanks.
However it is usually easier to read a whole patch - reading a patch
that removes bits of a previous patch, and depends on other bits of
it, requires holding too much in one's brain at once.  If you could
possibly send a complete patch against a recent release kernel, it
would make review a lot easier.
(In general, patches should be broken into the smallest usable pieces,
and no smaller.  I think the functionality you are currently writing
can not be usefully broken up, so just one patch is best).


> 
> There's also one extra fix I snuck in; before recomputing the parity for the
> written out blocks, I actually lock it. This shouldn't break anything, and
> AFAICS should be a good idea both in this and the previous
> implementation.

That probably shouldn't be needed, but certainly shouldn't hurt.  I'd
like to leave reviewing of the locking until the rest of the patch is
in good shape.

> 
> I'm unsure how much this actually buys us (there's a slight reduction in
> complexity, but I fear that will go up again once I implement it for read
> stripes as well),

I think it buys us a lot.  It means we can wait for stripes to become
free instead of spinning around hoping they will come free soon.

It's possible that you don't need to pre-allocate the read stripes
as well. 
sync_request just pre-allocates the write stripes, then allocates and
schedules the read-stripes.  Once all the read-stripes are full,
handle_stripe shuffles all the pages from 'read' to 'write' and
schedules the writes.
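
Roughly, the flow I have in mind looks like this (just a sketch, not code
from your patch -- reserve_one_stripe() is a made-up name standing in for
the usual get_free_stripe()/wait_event_lock_irq() loop; the rest of the
identifiers are from the patch):

	/* Sketch of the proposed flow only. */
	static void expand_next_chunk(raid5_conf_t *conf, sector_t sector_nr)
	{
		int i, nstripes = conf->chunk_size / STRIPE_SIZE;

		/* 1. sync_request(): safe to block here, so reserve every
		 *    write stripe the chunk needs before reading anything. */
		for (i = 0; i < nstripes; ++i)
			conf->expand_stripes[i] = reserve_one_stripe(conf);

		/* 2. sync_request(): allocate the read stripes in the old
		 *    layout and schedule the reads. */

		/* 3. handle_stripe(): once every read stripe is uptodate,
		 *    move the pages across into conf->expand_stripes[] and
		 *    schedule the writes in the new layout. */
	}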


>                   and I seem to have created a somewhat nasty regression --
> if I do I/O against the volume while it's expanding, it causes a panic
> _somewhere_ in handle_stripe, which I'm having a hard time tracking down. (If
> you have a good way of actually mapping the offsets to C lines, let me know
> -- every approach I've tried seems to fail, and the disassembly even shows
> lots of nonsensical asm :-) )

I find disassembly works quite well.
You can even
   make drivers/md/raid5.lst
which gives you a listing to read.

> 
> In short, I think getting the original patch more or less bug-free is a
> higher priority than this, although I do agree that it's a conceptually
> cleaner way. What I really need is external testing of this, and of course
> preferrably someone tracking down the bugs that are still left there... There
> might be something that's really obvious to you (like “why on earth isn't he
> locking that there?”) that I've overlooked, simply because you know this code
> a _lot_ better than me :-)

What a patch really needs to improve confidence is review.  Testing is
good too of course, but code review will find bugs that testing may
not.  I'm personally not interested in testing it until it looks
right, as any structural changes needed will invalidate any testing.
Currently the patch looks mostly good, but there are a couple of
structural changes that I think it needs as I mentioned previously.
Once these are in place, I can review the code more closely and look
for races and other subtle semantic issues.
Then I'm happy to start testing.

I really want this functionality to get into mainline - and I hope we
can work together to make that happen.


NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-10-07  3:09         ` Neil Brown
@ 2005-10-07 14:13           ` Steinar H. Gunderson
  2005-10-14 19:46           ` Steinar H. Gunderson
  1 sibling, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-07 14:13 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, Oct 07, 2005 at 01:09:21PM +1000, Neil Brown wrote:
> However it is usually easier to read a whole patch - reading a patch
> that removes bits of a previous patch, and depends on other bits of
> it, requires holding too much in one's brain at once.  If you could
> possibly send a complete patch against a recent release kernel, it
> would make review a lot easier.

Mm, OK.

>> I'm unsure how much this actually buys us (there's a slight reduction in
>> complexity, but I fear that will go up again once I implement it for read
>> stripes as well),
> I think it buys us a lot.  It means we can wait for stripes to become
> free instead of spinning around hoping they will come free soon.

Well, I've been doing printk-debugging on this, and it's actually quite a
rare case (even with heavy I/O) that it's starved for stripes.

> Currently the patch looks mostly good, but there are a couple of
> structural changes that I think it needs as I mentioned previously.
> Once these are in place, I can review the code more closely and look
> for races and other subtle semantic issues.

Mm. I'm still a bit ambivalent about a rewrite; I need something working in
almost exactly a month (when we're going to restripe our backup server), and a
rewrite would no doubt destabilize it all for quite a while. My test server
is currently broken after I tried to get in a kdb-enabled kernel; grub
conveniently broke at about the same time :-) I'm not sure how much time I
can dedicate to this ATM either.

I definitely agree moving code into sync_request would result in a better
overall model, though, with less overhead in the usual (non-restripe) paths.

/* Steinar */
-- 
Homepage: http://www.sesse.net/


* Re: [PATCH] Online RAID-5 resizing
  2005-10-07  3:09         ` Neil Brown
  2005-10-07 14:13           ` Steinar H. Gunderson
@ 2005-10-14 19:46           ` Steinar H. Gunderson
  2005-10-16 22:55             ` Neil Brown
  1 sibling, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-14 19:46 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2337 bytes --]

On Fri, Oct 07, 2005 at 01:09:21PM +1000, Neil Brown wrote:
> However it is usually easier to read a whole patch - reading a patch
> that removes bits of a previous patch, and depends on other bits of
> it, requires holding too much in one's brain at once.  If you could
> possibly send a complete patch against a recent release kernel, it
> would make review a lot easier.

Here's the latest version of the patch. What's been done since last time:

- There's no longer a set of “larger” stripes; instead, they're all shrunk
  and then expanded, like you requested the last time.
- The expand stripes are preallocated in sync_request(), again like you 
  requested.
- Likewise, the raid5_conf struct is never reallocated; instead, I just make
  sure it supports MAX_MD_DEVS devices in the first place. This wasted a
  kilobyte or so per active device, but it removed a _lot_ of fiddly code,
  so I believe it's a good thing.
- The patch in general is a lot slimmer (about half the size of the original
  patch). Lots of special-case code has been thrown out and replaced by using
  the generic functions instead (for, say, all the parity disk layout stuff --
  see the sketch just below this list).
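
As an example of that last point, finding where a logical sector lands in
the new layout is now just a call to the existing helper (the variable names
here are made up, but raid5_compute_sector() and its argument list are
exactly as in the patch):

	/* Where does logical sector 's' land once we have 'raid_disks'
	 * devices?  raid5_compute_sector() does all the parity-layout
	 * arithmetic that the old special-case code open-coded. */
	unsigned int dd_idx, pd_idx;
	sector_t new_sector;

	new_sector = raid5_compute_sector(s, raid_disks, raid_disks - 1,
					  &dd_idx, &pd_idx, conf);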

I'm unsure how many regressions there are; there are still problems with
stuff hanging here and there, for one, and I'm unsure if I broke something
at the very end of the expand (raid5_finish_expand). I haven't seen the
problems with data corruption during heavy I/O yet, but OTOH it hasn't gotten
that much testing yet either. Much of the code is new or rewritten, so expect
regressions. :-)

There's no new functionality (in particular, still no crash recovery), but I
think it's a step in the right direction.

This patch is against 2.6.13, which was the latest kernel version I could get
to work with kdb. kdb helps a _lot_ in debugging the more obscure bugs, so
it's taken a significant amount of pain away :-)

> I find disassembly works quite well.
> You can even
>    make drivers/md/raid5.lst
> which gives you a listing to read.

The listings are basically useless. 90% of the code lines map to address 0x0,
and in the main functions (say, handle_stripe) there's hardly a code line
except some __set_bit etc. here and there. And yes, I compile with -O0 :-)
objdump --source gives me almost exactly the same thing.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

[-- Attachment #2: raid5-online-exp-04.diff --]
[-- Type: text/plain, Size: 33625 bytes --]

--- /usr/src/old/linux-2.6.13/drivers/md/raid5.c	2005-08-29 01:41:01.000000000 +0200
+++ drivers/md/raid5.c	2005-10-14 21:50:06.000000000 +0200
@@ -68,16 +68,29 @@
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+			unsigned int data_disks, unsigned int * dd_idx,
+			unsigned int * pd_idx, raid5_conf_t *conf);
 
 static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	BUG_ON(atomic_read(&sh->count) == 0);
 	if (atomic_dec_and_test(&sh->count)) {
 		if (!list_empty(&sh->lru))
 			BUG();
 		if (atomic_read(&conf->active_stripes)==0)
 			BUG();
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAY_EXPAND, &sh->state)) {
+				list_add_tail(&sh->lru, &conf->wait_for_expand_list);
+				printk("delaying stripe with sector %llu (expprog=%llu, active=%d)\n", sh->sector,
+					conf->expand_progress, atomic_read(&conf->active_stripes));
+			} else if (test_bit(STRIPE_DELAYED, &sh->state))
 				list_add_tail(&sh->lru, &conf->delayed_list);
 			else
 				list_add_tail(&sh->lru, &conf->handle_list);
@@ -133,7 +146,7 @@ static __inline__ void insert_hash(raid5
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
@@ -146,6 +159,12 @@ static struct stripe_head *get_free_stri
 	list_del_init(first);
 	remove_hash(sh);
 	atomic_inc(&conf->active_stripes);
+
+	if (expand || !conf->expand_in_progress)
+		sh->disks = conf->raid_disks;
+	else
+		sh->disks = conf->previous_raid_disks;
+
 out:
 	return sh;
 }
@@ -184,7 +203,7 @@ static void raid5_build_block (struct st
 static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -200,8 +219,14 @@ static inline void init_stripe(struct st
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	
+	if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+		sh->disks = conf->previous_raid_disks;
+	} else {
+		sh->disks = conf->raid_disks;
+	}
 
-	for (i=disks; i--; ) {
+	for (i=sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -245,9 +270,26 @@ static struct stripe_head *get_active_st
 
 	do {
 		sh = __find_stripe(conf, sector);
+
+		// make sure this is of the right size; if not, remove it from the hash
+		// FIXME: is this needed now?
+		if (sh) {
+			int correct_disks = conf->raid_disks;
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+				correct_disks = conf->previous_raid_disks;
+			}
+
+			if (sh->disks != correct_disks) {
+				BUG_ON(atomic_read(&sh->count) != 0);
+
+				remove_hash(sh);
+				sh = NULL;
+			}
+		}
+		
 		if (!sh) {
 			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+				sh = get_free_stripe(conf, 1);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
@@ -267,8 +309,9 @@ static struct stripe_head *get_active_st
 				if (!list_empty(&sh->lru))
 					BUG();
 			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
+				if (!test_bit(STRIPE_HANDLE, &sh->state)) {
 					atomic_inc(&conf->active_stripes);
+				}
 				if (list_empty(&sh->lru))
 					BUG();
 				list_del_init(&sh->lru);
@@ -303,6 +346,7 @@ static int grow_stripes(raid5_conf_t *co
 			return 1;
 		memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
 		sh->raid_conf = conf;
+		sh->disks = conf->raid_disks;
 		spin_lock_init(&sh->lock);
 
 		if (grow_buffers(sh, conf->raid_disks)) {
@@ -325,7 +369,7 @@ static void shrink_stripes(raid5_conf_t 
 
 	while (1) {
 		spin_lock_irq(&conf->device_lock);
-		sh = get_free_stripe(conf);
+		sh = get_free_stripe(conf, 0);
 		spin_unlock_irq(&conf->device_lock);
 		if (!sh)
 			break;
@@ -344,7 +388,7 @@ static int raid5_end_read_request (struc
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -411,12 +455,61 @@ static int raid5_end_read_request (struc
 	return 0;
 }
 
+							
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+	int i;
+	struct disk_info *tmp;
+//	shrink_stripes(conf);
+	
+	conf->expand_in_progress = 0;
+
+	for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+		tmp = conf->disks + i;
+		if (tmp->rdev
+		    && !tmp->rdev->faulty
+		    && !tmp->rdev->in_sync) {
+			conf->mddev->degraded--;
+			conf->failed_disks--;
+			conf->working_disks++;
+			tmp->rdev->in_sync = 1;
+		}
+	}
+
+	// inform the md code that we have more space now
+ 	{	
+		struct block_device *bdev;
+		sector_t sync_sector;
+		unsigned dummy1, dummy2;
+
+		conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+			conf->raid_disks - 1, &dummy1, &dummy2, conf);
+		
+		conf->mddev->recovery_cp = sync_sector << 1;    // FIXME: hum, hum
+		set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			down(&bdev->bd_inode->i_sem);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			up(&bdev->bd_inode->i_sem);
+			bdput(bdev);
+		}
+	}
+	
+	/* FIXME: free old stuff here! (what are we missing?) */
+}
+
 static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -570,7 +663,7 @@ static sector_t raid5_compute_sector(sec
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -605,7 +698,8 @@ static sector_t compute_blocknr(struct s
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+				check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
 		return 0;
 	}
 	return r_sector;
@@ -671,8 +765,7 @@ static void copy_data(int frombio, struc
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -702,7 +795,7 @@ static void compute_block(struct stripe_
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -876,11 +969,11 @@ static int add_stripe_bio(struct stripe_
  * get BH_Lock set before the stripe lock is released.
  *
  */
- 
+
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -897,6 +990,7 @@ static void handle_stripe(struct stripe_
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
+	clear_bit(STRIPE_DELAY_EXPAND, &sh->state);
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
@@ -945,19 +1039,20 @@ static void handle_stripe(struct stripe_
 		}
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
-		if (!rdev || !rdev->in_sync) {
+		if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
 			failed++;
 			failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
-	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
 	if (failed > 1 && to_read+to_write+written) {
+		printk("Need to fail requests!\n");
+		printk("locked=%d uptodate=%d to_read=%d"
+			" to_write=%d failed=%d failed_num=%d disks=%d\n",
+			locked, uptodate, to_read, to_write, failed, failed_num, disks);
 		spin_lock_irq(&conf->device_lock);
 		for (i=disks; i--; ) {
 			/* fail all writes first */
@@ -1012,7 +1107,7 @@ static void handle_stripe(struct stripe_
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	if (failed > 1 && syncing) {
+	if (failed > 1 && syncing && !conf->expand_in_progress) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1085,7 +1180,7 @@ static void handle_stripe(struct stripe_
 					/* if I am just reading this block and we don't have
 					   a failed drive, or any pending writes then sidestep the cache */
 					if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
-					    ! syncing && !failed && !to_write) {
+						! syncing && !failed && !to_write) {
 						sh->bh_cache[i]->b_page =  sh->bh_read[i]->b_page;
 						sh->bh_cache[i]->b_data =  sh->bh_read[i]->b_data;
 					}
@@ -1093,7 +1188,7 @@ static void handle_stripe(struct stripe_
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
 						i, syncing);
-					if (syncing)
+					if (syncing && !conf->expand_in_progress)
 						md_sync_acct(conf->disks[i].rdev->bdev,
 							     STRIPE_SECTORS);
 				}
@@ -1102,6 +1197,273 @@ static void handle_stripe(struct stripe_
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
+	// see if we have the data we need to expand by another block
+	if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+		int uptodate = 0, delay_to_future=0, d = 0, needed_uptodate = 0;
+		spin_lock_irq(&conf->expand_progress_lock);
+		for (i=0; i<disks; ++i) {
+			sector_t start_sector, dest_sector;
+			unsigned int dd_idx, pd_idx;
+
+			if (i == sh->pd_idx)
+				continue;
+
+			// see what sector this block would land in the new layout
+			start_sector = compute_blocknr(sh, i);
+			dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+				conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+			if (dd_idx > pd_idx)
+				--dd_idx;
+
+/*			printk("start_sector = %llu (base=%llu, i=%u, d=%u) || dest_stripe = %llu\n", start_sector, sh->sector,
+				i, d, dest_stripe); */
+		
+			if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ 			    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+/*				printk("UPDATING CHUNK %u FROM DISK %u (sec=%llu, dest_sector=%llu, uptodate=%u)\n",
+					dd_idx, i, start_sector, dest_sector, test_bit(R5_UPTODATE, &sh->dev[i].flags)); */
+				unsigned int ind = (start_sector - conf->expand_progress) / STRIPE_SECTORS;
+				if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					unsigned int *ptr = page_address(conf->expand_buffer[ind].page);
+					
+					conf->expand_buffer[ind].up_to_date = 1;
+					memcpy(page_address(conf->expand_buffer[ind].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+//					printk("memcpy done [%u -> %u]: %08x %08x %08x %08x\n", i, ind, ptr[0], ptr[1], ptr[2], ptr[3]);
+				} else {
+					conf->expand_buffer[ind].up_to_date = 0;
+				}
+			} else if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) &&
+				   dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) * 2 &&
+				   syncing) {
+				delay_to_future = 1;
+			}
+		}
+		spin_unlock_irq(&conf->expand_progress_lock);
+
+		for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+			uptodate += conf->expand_buffer[i].up_to_date;
+		}
+	
+		/*
+		 * Figure out how many stripes we need for this chunk to be complete.
+		 * In almost all cases, this will be a full destination stripe, but our
+		 * original volume might not be big enough for that at the very end --
+		 * so use the rest of the volume then.
+	         */
+		needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+		if (((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+			needed_uptodate = ((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+//			printk("reading partial block at the end: %u\n", needed_uptodate);
+		}
+		if (needed_uptodate > 0 && uptodate == needed_uptodate && conf->expand_stripes_ready == 1) {
+			// we can do an expand!
+			sector_t dest_sector, advance;
+			unsigned i;
+			unsigned int dummy1, dummy2, pd_idx, flags;
+
+			if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+			} else {
+				advance = (conf->mddev->size << 1) - conf->expand_progress;
+			}
+
+			// find the parity disk and starting sector
+			dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+				conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+//			printk("Expanding onto %llu\n", dest_sector);
+		
+			spin_lock_irq(&conf->device_lock);
+			
+			if (conf->expand_stripes_ready != 1) {
+				// something else just did the expand, we're done here
+				spin_unlock_irq(&conf->device_lock);
+				goto please_wait;
+			}
+			
+			/*
+			 * Check that we won't try to expand over an area where there's
+			 * still active stripes; if we do, we'll risk inconsistency since we
+			 * suddenly have two different sets of stripes referring to the
+			 * same logical sector.
+			 */
+			{
+				struct stripe_head *ash;
+				unsigned activity = 0, i;
+				sector_t first_touched_sector, last_touched_sector;
+				
+				first_touched_sector = raid5_compute_sector(conf->expand_progress,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+				last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->previous_raid_disks - 1)) >> 9) - 1,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+				for (i = 0; i < NR_HASH; i++) {
+					ash = conf->stripe_hashtbl[i];
+					for (; ash; ash = ash->hash_next) {
+						if (sh == ash && atomic_read(&ash->count) == 1 && !to_write)
+							continue;   // we'll release it shortly, so it's OK (?)
+
+						// is this stripe active, and within the region we're expanding?
+						if (atomic_read(&ash->count) > 0 &&
+						    ash->disks == conf->previous_raid_disks &&
+						    ash->sector >= first_touched_sector &&
+						    ash->sector <= last_touched_sector) {
+							++activity;
+						}
+					}
+				}
+				
+				if (activity > 0) {
+					printk("Aborting, %u active stripes in the area\n", activity);
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+			
+			spin_lock_irqsave(&conf->expand_progress_lock, flags);
+			conf->expand_progress += advance;
+
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				if (atomic_read(&newsh->count) != 0)
+					BUG();
+				init_stripe(newsh, dest_sector + i * STRIPE_SECTORS, pd_idx);
+			//	printk("Generating sector %llu\n", dest_sector + i * STRIPE_SECTORS);
+
+				for (d = 0; d < conf->raid_disks; ++d) {
+					if (d == pd_idx) {
+						clear_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						clear_bit(R5_LOCKED, &newsh->dev[d].flags);
+					} else {
+						unsigned int *ptr;
+						//struct page *tmp;
+						unsigned di;
+						
+						di = (compute_blocknr(newsh, d) - (conf->expand_progress - advance)) / STRIPE_SECTORS;
+						
+						// swap the two pages, moving the data in place into the stripe
+#if 0
+						// FIXME: this doesn't work. we'll need to fiddle with the bio_vec
+						// as well or we'll simply write out the wrong data.
+						tmp = newsh->dev[d].page;
+						newsh->dev[d].page = conf->expand_buffer[di].page;
+						conf->expand_buffer[di].page = tmp; 
+#else
+						memcpy(page_address(newsh->dev[d].page), page_address(conf->expand_buffer[di].page), STRIPE_SIZE);
+#endif
+					
+						ptr = page_address(newsh->dev[d].page);
+//						printk("shuffle done [%u.%u -> %u]: %08x %08x %08x %08x\n", i, d, di, ptr[0], ptr[1], ptr[2], ptr[3]);
+					
+						set_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						set_bit(R5_LOCKED, &newsh->dev[d].flags);
+						conf->expand_buffer[di].up_to_date = 0;
+					}
+					set_bit(R5_Wantwrite, &newsh->dev[d].flags);
+				}
+			}
+			conf->expand_stripes_ready = 2;	
+			spin_unlock_irqrestore(&conf->expand_progress_lock, flags);
+			spin_unlock_irq(&conf->device_lock);
+			
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				
+				compute_block(newsh, newsh->pd_idx);
+
+				spin_lock(&newsh->lock);
+				atomic_inc(&newsh->count);
+				clear_bit(STRIPE_SYNCING, &newsh->state);
+				set_bit(STRIPE_INSYNC, &newsh->state);
+				set_bit(STRIPE_HANDLE, &newsh->state);
+				spin_unlock(&newsh->lock);
+#if 0
+				printk("Releasing stripe %u (%u disks)\n", i, newsh->disks);
+				for (d = 0; d < conf->raid_disks; ++d) {
+					unsigned int *ptr = page_address(newsh->dev[d].page);
+					printk("%u: %08x %08x %08x %08x\n", d, ptr[0], ptr[1], ptr[2], ptr[3]);
+				}
+#endif
+				release_stripe(newsh);
+			}
+			
+			conf->expand_stripes_ready = 0;	
+
+			spin_lock_irq(&conf->device_lock);
+			md_done_sync(conf->mddev, advance, 1);
+			wake_up(&conf->wait_for_expand_progress);
+			spin_unlock_irq(&conf->device_lock);
+
+			// see if we have delayed data that we can process now
+			{			
+				struct list_head *l, *next;
+				
+				spin_lock_irq(&conf->device_lock);
+				l = conf->wait_for_expand_list.next;
+
+				while (l != &conf->wait_for_expand_list) {
+//					int i, d = 0;
+					int do_process = 0;
+					
+					struct stripe_head *dsh;
+					dsh = list_entry(l, struct stripe_head, lru);
+				
+#if 0
+					for (i=0; i<disks; ++i) {
+						sector_t start_sector, dest_sector;
+						unsigned int dd_idx, pd_idx;
+
+						if (i == dsh->pd_idx)
+							continue;
+
+						start_sector = dsh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+
+						// see what sector this block would land in in the new layout
+						dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+								conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+						if (dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->raid_disks - 1) * (conf->chunk_size >> 9)) {
+							do_process = 1;
+						}
+
+						++d;
+					}
+#endif					
+					
+					do_process = 1;
+					next = l->next;
+					
+					if (do_process) {
+						list_del_init(l);
+
+						set_bit(STRIPE_HANDLE, &dsh->state);
+						clear_bit(STRIPE_DELAYED, &dsh->state);
+						clear_bit(STRIPE_DELAY_EXPAND, &dsh->state);
+						atomic_inc(&dsh->count);
+						__release_stripe(conf, dsh);
+					}
+
+					l = next;
+				}
+
+				spin_unlock_irq(&conf->device_lock);
+			}
+
+			// see if we are done
+			if (conf->expand_progress >= conf->mddev->array_size << 1) {
+				printk("Expand done, finishing...\n");
+				raid5_finish_expand(conf);
+				printk("...done.\n");
+			}
+
+please_wait:			
+			1;
+		}
+
+		if (delay_to_future) {
+			atomic_inc(&sh->count);
+			set_bit(STRIPE_DELAY_EXPAND, &sh->state);
+		}
+	}
+
 	/* now to consider writing and what else, if anything should be read */
 	if (to_write) {
 		int rmw=0, rcw=0;
@@ -1237,7 +1599,9 @@ static void handle_stripe(struct stripe_
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		if (!conf->expand_in_progress) {
+			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		}
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 	
@@ -1279,7 +1643,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1404,8 +1768,6 @@ static int make_request (request_queue_t
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - 1;
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
@@ -1428,26 +1790,39 @@ static int make_request (request_queue_t
 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
+	recalculate:		
+		if (conf->expand_in_progress && logical_sector >= conf->expand_progress) {
+			disks = conf->previous_raid_disks;
+		} else {
+			disks = conf->raid_disks;
+		}
 		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-
-		PRINTK("raid5: make_request, sector %llu logical %llu\n",
+			disks, disks - 1, &dd_idx, &pd_idx, conf);	
+/*		printk("raid5: make_request [%u/%u], sector %llu logical %llu\n",
+			dd_idx, disks,
 			(unsigned long long)new_sector, 
-			(unsigned long long)logical_sector);
+			(unsigned long long)logical_sector); */
 
 	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
-		if (sh) {
-			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+		if (sh) {			
+			if (sh->disks != disks || !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 				/* Add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
 				release_stripe(sh);
 				schedule();
-				goto retry;
+				if (sh->disks != disks) {
+					// just expanded past this point! re-process using the new structure
+					printk("recalculate!\n");
+					finish_wait(&conf->wait_for_overlap, &w);
+					goto recalculate;
+				} else
+					goto retry;
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
 			raid5_plug_device(conf);
@@ -1488,7 +1863,14 @@ static sector_t sync_request(mddev_t *md
 	sector_t first_sector;
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
+	
+	if (conf->expand_in_progress) {
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks-1;
+	}
 
+	BUG_ON(data_disks == 0 || raid_disks == 0);
+	
 	if (sector_nr >= mddev->size <<1) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
@@ -1503,12 +1885,57 @@ static sector_t sync_request(mddev_t *md
 		*skipped = 1;
 		return rv;
 	}
+	
+	/* if we're in an expand, we can't allow the process
+	 * to keep reading in stripes; we might not have enough buffer
+	 * space to keep it all in RAM.
+	 */
+	if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_expand_progress,
+			    sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+		);
+		spin_unlock_irq(&conf->device_lock);
+	}
+
+	/*
+	 * In an expand, we also need to make sure that we have enough destination stripes
+	 * available for writing out the block after we've read in the data, so make sure
+	 * we get them before we start reading any data.
+	 */
+	if (conf->expand_in_progress && conf->expand_stripes_ready == 0) {
+		unsigned i;
+
+		spin_lock_irq(&conf->device_lock);
+		for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+			do {
+				conf->expand_stripes[i] = get_free_stripe(conf, 1);
+
+				if (conf->expand_stripes[i] == NULL) {
+					conf->inactive_blocked = 1;
+					wait_event_lock_irq(conf->wait_for_stripe,
+							    !list_empty(&conf->inactive_list) &&
+							    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked = 0;
+				}
+			} while (conf->expand_stripes[i] == NULL);
+		}
+		spin_unlock_irq(&conf->device_lock);
+
+		conf->expand_stripes_ready = 1;
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
 	stripe = x;
 	BUG_ON(x != stripe);
-
+	
 	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
 		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
 	sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
@@ -1553,6 +1980,8 @@ static void raid5d (mddev_t *mddev)
 	while (1) {
 		struct list_head *first;
 
+		conf = mddev_to_conf(mddev);
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
@@ -1600,7 +2029,7 @@ static int run (mddev_t *mddev)
 	}
 
 	mddev->private = kmalloc (sizeof (raid5_conf_t)
-				  + mddev->raid_disks * sizeof(struct disk_info),
+				  + MAX_MD_DEVS * sizeof(struct disk_info),
 				  GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
@@ -1650,6 +2079,7 @@ static int run (mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_in_progress = 0;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1866,6 +2296,9 @@ static int raid5_remove_disk(mddev_t *md
 	mdk_rdev_t *rdev;
 	struct disk_info *p = conf->disks + number;
 
+	printk("we were asked to remove a disk\n");
+	return -EBUSY;  // FIXME: hack
+	
 	print_raid5_conf(conf);
 	rdev = p->rdev;
 	if (rdev) {
@@ -1904,6 +2337,7 @@ static int raid5_add_disk(mddev_t *mddev
 	 */
 	for (disk=0; disk < mddev->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
+			rdev->faulty = 0;
 			rdev->in_sync = 0;
 			rdev->raid_disk = disk;
 			found = 1;
@@ -1916,6 +2350,7 @@ static int raid5_add_disk(mddev_t *mddev
 
 static int raid5_resize(mddev_t *mddev, sector_t sectors)
 {
+        raid5_conf_t *conf = mddev_to_conf(mddev);
 	/* no resync is happening, and there is enough space
 	 * on all devices, so we can resize.
 	 * We need to make sure resync covers any new space.
@@ -1923,6 +2358,9 @@ static int raid5_resize(mddev_t *mddev, 
 	 * any io in the removed space completes, but it hardly seems
 	 * worth it.
 	 */
+	if (conf->expand_in_progress)
+		return -EBUSY;
+		
 	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
 	mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
 	set_capacity(mddev->gendisk, mddev->array_size << 1);
@@ -1936,6 +2374,125 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	struct list_head *tmp;
+	mdk_rdev_t *rdev;
+	unsigned long flags;
+
+	int d, i;
+	
+	if (mddev->degraded >= 1 || conf->expand_in_progress)
+		return -EBUSY;
+	if (conf->raid_disks == raid_disks)
+		return 0;
+	
+	print_raid5_conf(conf);
+	
+	// the old stripes are too small now; remove them (temporarily
+	// stalling the RAID)
+	for (i = 0; i < conf->max_nr_stripes; ++i) {
+		struct stripe_head *sh;
+		
+		spin_lock_irqsave(&conf->device_lock, flags);
+		sh = get_free_stripe(conf, 0);
+		while (sh == NULL) {
+			wait_event_lock_irq(conf->wait_for_stripe,
+					!list_empty(&conf->inactive_list),
+					conf->device_lock,
+					unplug_slaves(conf->mddev);
+					);
+			sh = get_free_stripe(conf, 0);
+		}
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+
+		shrink_buffers(sh, conf->raid_disks);
+		kmem_cache_free(conf->slab_cache, sh);
+		atomic_dec(&conf->active_stripes);
+	}	
+	kmem_cache_destroy(conf->slab_cache);
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	
+	for (d= conf->raid_disks; d < MAX_MD_DEVS; d++) {
+		conf->disks[d].rdev = NULL;
+	}
+
+	conf->expand_in_progress = 1;
+	conf->expand_progress = 0;
+	conf->previous_raid_disks = conf->raid_disks;	
+	conf->raid_disks = mddev->raid_disks = raid_disks;	
+
+	spin_lock_init(&conf->expand_progress_lock);
+	
+	init_waitqueue_head(&conf->wait_for_expand_progress);
+	INIT_LIST_HEAD(&conf->wait_for_expand_list);
+
+	ITERATE_RDEV(mddev,rdev,tmp) {
+		for (d= 0; d < conf->raid_disks; d++) {
+			if (conf->disks[d].rdev == rdev) {
+				goto already_there;
+			}
+		}
+
+		raid5_add_disk(mddev, rdev);
+		conf->failed_disks++;
+		
+already_there:		
+		1;
+	}
+
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+	// allocate stripes of the new size
+	if (grow_stripes(conf, conf->max_nr_stripes)) {
+		BUG();  // FIXME
+		return -ENOMEM;
+	}	
+	
+	// allocate space for our temporary expansion buffers
+	conf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+	if (conf->expand_buffer == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+			(conf->chunk_size * (raid_disks-1)) >> 10);
+		// FIXME
+		return -ENOMEM;
+	}
+
+	conf->expand_stripes = kmalloc (sizeof(struct stripe_head *) * (conf->chunk_size / STRIPE_SIZE), GFP_KERNEL);
+	if (conf->expand_stripes == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate memory for expand stripe pointers\n");
+		// FIXME
+		return -ENOMEM;
+	}
+	conf->expand_stripes_ready = 0;
+
+	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+		conf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+		if (conf->expand_buffer[i].page == NULL) {
+			printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+					(conf->chunk_size * (raid_disks-1)) >> 10);
+			// FIXME
+			return -ENOMEM;
+		}
+		conf->expand_buffer[i].up_to_date = 0;
+	}
+	
+	print_raid5_conf(conf);
+
+	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+	set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	mddev->recovery_cp = 0;
+	md_wakeup_thread(mddev->thread);
+
+	printk("Starting expand.\n");
+	
+        return 0;
+}
+
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",
@@ -1950,6 +2507,7 @@ static mdk_personality_t raid5_personali
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+	.reshape	= raid5_reshape
 };
 
 static int __init raid5_init (void)
--- /usr/src/old/linux-2.6.13/include/linux/raid/raid5.h	2005-08-29 01:41:01.000000000 +0200
+++ include/linux/raid/raid5.h	2005-10-14 21:28:42.000000000 +0200
@@ -134,6 +134,7 @@ struct stripe_head {
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -171,6 +172,7 @@ struct stripe_head {
 #define	STRIPE_INSYNC		4
 #define	STRIPE_PREREAD_ACTIVE	5
 #define	STRIPE_DELAYED		6
+#define	STRIPE_DELAY_EXPAND	7
 
 /*
  * Plugging:
@@ -199,6 +201,10 @@ struct stripe_head {
 struct disk_info {
 	mdk_rdev_t	*rdev;
 };
+struct expand_buf {
+	struct page    	*page;
+	int		up_to_date;
+};
 
 struct raid5_private_data {
 	struct stripe_head	**stripe_hashtbl;
@@ -208,22 +214,38 @@ struct raid5_private_data {
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	int			expand_in_progress;
+	sector_t		expand_progress;
+	spinlock_t		expand_progress_lock;
+	int			previous_raid_disks;
+	struct list_head	wait_for_expand_list;
+	
+	struct expand_buf	*expand_buffer;
+	
+	int			expand_stripes_ready;	
+	struct stripe_head	**expand_stripes;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
 	char			cache_name[20];
+	char			cache_name_expand[20];
 	kmem_cache_t		*slab_cache; /* for allocating stripes */
+	
 	/*
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_expand_progress;
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
-							 */        
+							 */
+	int			inactive_blocked_expand;
 	spinlock_t		device_lock;
 	struct disk_info	disks[0];
 };


* Re: [PATCH] Online RAID-5 resizing
  2005-10-14 19:46           ` Steinar H. Gunderson
@ 2005-10-16 22:55             ` Neil Brown
  2005-10-17  0:16               ` Steinar H. Gunderson
  2005-10-19 23:18               ` Steinar H. Gunderson
  0 siblings, 2 replies; 23+ messages in thread
From: Neil Brown @ 2005-10-16 22:55 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Friday October 14, sgunderson@bigfoot.com wrote:
> On Fri, Oct 07, 2005 at 01:09:21PM +1000, Neil Brown wrote:
> > However it is usually easier to read a whole patch - reading a patch
> > that removes bits of a previous patch, and depends on other bits of
> > it, requires holding too much in one's brain at once.  If you could
> > possibly send a complete patch against a recent release kernel, it
> > would make review a lot easier.
> 
> Here's the latest version of the patch. What's been done since last
> time:

Thanks a lot for this, I really appreciate it!

> 
> - There's no longer a set of “larger” stripes; instead, they're all shrunk
>   and then expanded, like you requested the last time.
> - The expand stripes are preallocated in sync_request(), again like you 
>   requested.
> - Likewise, the raid5_conf struct is never reallocated; instead, I just make
>   sure it supports MAX_MD_DEVS devices in the first place. This wasted a
>   kilobyte or so per active device, but it removed a _lot_ of fiddly code,
>   so I believe it's a good thing.

I really don't like having hard-coded maximums like this.  However it
makes perfect sense to keep that piece of code really simple while we
make sure the rest of the code work.  So I'm happy for it to stay with
a hard-coded maximum for now, but I will probably want to change it
back to re-allocating raid5_conf once the rest of the code is stable.


> - The patch in general is a lot slimmer (about half the size of the original
>   patch). Lots of special-case code has been thrown out and replaced by using
>   the generic functions instead (for, say, all the parity disk layout stuff).

Half the size sounds like a great step forward!! :-)
I'll have a close look at all the code sometime today and get back to
you with any comments.

Thanks again,
NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-10-16 22:55             ` Neil Brown
@ 2005-10-17  0:16               ` Steinar H. Gunderson
  2005-10-19 23:18               ` Steinar H. Gunderson
  1 sibling, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-17  0:16 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1353 bytes --]

[My mail setup is somewhat broken due to a fried CPU (an odd accident
 involving multiple buggy BIOSes); I hope this gets through correctly :-)]

On Mon, Oct 17, 2005 at 08:55:45AM +1000, Neil Brown wrote:
> Half the size sounds like a great step forward!! :-)
> I'll have a close look at all the code sometime today and get back to
> you with any comments.

Here's another version with a few minor (but important) bug fixes. Also, I
removed the “delay stripes” code, as it doesn't look like it's ever used or
needed anymore.

I still see data corruption from time to time, though, and sometimes the
odd crash (and deadlocks on _something_ holding the mddev semaphore
forever; I haven't seen that one in a while, though). I'm a bit unsure as to
what could cause it, but it only seems to happen on I/O, and I think one of
the fixes reduced it a bit. (The current stripe could have R5_LOCKED
buffers but not have dev->towrite set, and I didn't take that into account,
so I could be expanding over an area with one still-dirty stripe referring
to it and thus “leak” a stripe, causing problems.)
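
(To make that concrete, here is the relevant check from the attached patch,
condensed: when handle_stripe() scans the hash for activity in the region
about to be moved, the current stripe is only exempt if it is truly idle --
no pending writes _and_ no locked buffers:)

	if (sh == ash && atomic_read(&ash->count) == 1
	    && !to_write && !locked)
		continue;   // we'll release it shortly, so it's OK

	// is this stripe active, and within the region we're expanding?
	if (atomic_read(&ash->count) > 0 &&
	    ash->disks == conf->previous_raid_disks &&
	    ash->sector >= first_touched_sector &&
	    ash->sector <= last_touched_sector)
		++activity;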

I think I want to move the entire restripe logic to the very bottom of
handle_stripe(); that might solve a few problems. It will have to wait for
another day when I get a replacement CPU in, though :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/


[-- Attachment #2: raid5-online-exp-06.diff --]
[-- Type: text/plain, Size: 28625 bytes --]

--- /usr/src/old/linux-2.6.13/drivers/md/raid5.c	2005-08-29 01:41:01.000000000 +0200
+++ drivers/md/raid5.c	2005-10-16 18:20:39.000000000 +0200
@@ -68,9 +68,18 @@
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+			unsigned int data_disks, unsigned int * dd_idx,
+			unsigned int * pd_idx, raid5_conf_t *conf);
 
 static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	BUG_ON(atomic_read(&sh->count) == 0);
 	if (atomic_dec_and_test(&sh->count)) {
 		if (!list_empty(&sh->lru))
 			BUG();
@@ -133,7 +142,7 @@ static __inline__ void insert_hash(raid5
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
@@ -146,6 +155,12 @@ static struct stripe_head *get_free_stri
 	list_del_init(first);
 	remove_hash(sh);
 	atomic_inc(&conf->active_stripes);
+
+	if (expand || !conf->expand_in_progress)
+		sh->disks = conf->raid_disks;
+	else
+		sh->disks = conf->previous_raid_disks;
+
 out:
 	return sh;
 }
@@ -184,7 +199,7 @@ static void raid5_build_block (struct st
 static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -200,8 +215,14 @@ static inline void init_stripe(struct st
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	
+	if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+		sh->disks = conf->previous_raid_disks;
+	} else {
+		sh->disks = conf->raid_disks;
+	}
 
-	for (i=disks; i--; ) {
+	for (i=sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -245,9 +266,29 @@ static struct stripe_head *get_active_st
 
 	do {
 		sh = __find_stripe(conf, sector);
+
+		// make sure this is of the right size; if not, remove it from the hash
+		// FIXME: is this needed now?
+		if (sh) {
+			int correct_disks = conf->raid_disks;
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+				correct_disks = conf->previous_raid_disks;
+			}
+
+			if (sh->disks != correct_disks) {
+				BUG_ON(atomic_read(&sh->count) != 0);
+
+				printk("get_stripe %llu with different number of disks (%u, should be %u)\n",
+					sector, sh->disks, correct_disks);
+
+				remove_hash(sh);
+				sh = NULL;
+			}
+		}
+		
 		if (!sh) {
 			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+				sh = get_free_stripe(conf, 1);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
@@ -303,6 +344,7 @@ static int grow_stripes(raid5_conf_t *co
 			return 1;
 		memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
 		sh->raid_conf = conf;
+		sh->disks = conf->raid_disks;
 		spin_lock_init(&sh->lock);
 
 		if (grow_buffers(sh, conf->raid_disks)) {
@@ -325,7 +367,7 @@ static void shrink_stripes(raid5_conf_t 
 
 	while (1) {
 		spin_lock_irq(&conf->device_lock);
-		sh = get_free_stripe(conf);
+		sh = get_free_stripe(conf, 0);
 		spin_unlock_irq(&conf->device_lock);
 		if (!sh)
 			break;
@@ -344,7 +386,7 @@ static int raid5_end_read_request (struc
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -411,12 +453,60 @@ static int raid5_end_read_request (struc
 	return 0;
 }
 
+							
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+	int i;
+	struct disk_info *tmp;
+	
+	for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+		tmp = conf->disks + i;
+		if (tmp->rdev
+		    && !tmp->rdev->faulty
+		    && !tmp->rdev->in_sync) {
+			conf->mddev->degraded--;
+			conf->failed_disks--;
+			conf->working_disks++;
+			tmp->rdev->in_sync = 1;
+		}
+	}
+	
+	conf->expand_in_progress = 0;
+	
+	// inform the md code that we have more space now
+ 	{	
+		struct block_device *bdev;
+		sector_t sync_sector;
+		unsigned dummy1, dummy2;
+
+		conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+			conf->raid_disks - 1, &dummy1, &dummy2, conf);
+		
+		conf->mddev->recovery_cp = sync_sector << 1;    // FIXME: hum, hum
+		set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			down(&bdev->bd_inode->i_sem);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			up(&bdev->bd_inode->i_sem);
+			bdput(bdev);
+		}
+	}
+	
+	/* FIXME: free old stuff here! (what are we missing?) */
+}
+
 static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -570,7 +660,7 @@ static sector_t raid5_compute_sector(sec
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -605,7 +695,8 @@ static sector_t compute_blocknr(struct s
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+				check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
 		return 0;
 	}
 	return r_sector;
@@ -671,8 +762,7 @@ static void copy_data(int frombio, struc
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -702,7 +792,7 @@ static void compute_block(struct stripe_
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -880,7 +970,7 @@ static int add_stripe_bio(struct stripe_
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -945,19 +1035,20 @@ static void handle_stripe(struct stripe_
 		}
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
-		if (!rdev || !rdev->in_sync) {
+		if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
 			failed++;
 			failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
-	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
 	if (failed > 1 && to_read+to_write+written) {
+		printk("Need to fail requests!\n");
+		printk("locked=%d uptodate=%d to_read=%d"
+			" to_write=%d failed=%d failed_num=%d disks=%d\n",
+			locked, uptodate, to_read, to_write, failed, failed_num, disks);
 		spin_lock_irq(&conf->device_lock);
 		for (i=disks; i--; ) {
 			/* fail all writes first */
@@ -1012,7 +1103,7 @@ static void handle_stripe(struct stripe_
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	if (failed > 1 && syncing) {
+	if (failed > 1 && syncing && !conf->expand_in_progress) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1093,7 +1184,7 @@ static void handle_stripe(struct stripe_
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
 						i, syncing);
-					if (syncing)
+					if (syncing && !conf->expand_in_progress)
 						md_sync_acct(conf->disks[i].rdev->bdev,
 							     STRIPE_SECTORS);
 				}
@@ -1102,6 +1193,193 @@ static void handle_stripe(struct stripe_
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
+	// see if we have the data we need to expand by another block
+	if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+		int uptodate = 0, d = 0, needed_uptodate = 0;
+		spin_lock_irq(&conf->expand_progress_lock);
+		for (i=0; i<disks; ++i) {
+			sector_t start_sector, dest_sector;
+			unsigned int dd_idx, pd_idx;
+
+			if (i == sh->pd_idx)
+				continue;
+
+			// see what sector this block would land in the new layout
+			start_sector = compute_blocknr(sh, i);
+			dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+				conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+			if (dd_idx > pd_idx)
+				--dd_idx;
+
+			if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ 			    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				unsigned int ind = (start_sector - conf->expand_progress) / STRIPE_SECTORS;
+				if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					memcpy(page_address(conf->expand_buffer[ind].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+					conf->expand_buffer[ind].up_to_date = 1;
+				} else {
+					conf->expand_buffer[ind].up_to_date = 0;
+				}
+			}
+		}
+
+		for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+			uptodate += conf->expand_buffer[i].up_to_date;
+		}
+		spin_unlock_irq(&conf->expand_progress_lock);
+	
+		/*
+		 * Figure out how many stripes we need for this chunk to be complete.
+		 * In almost all cases, this will be a full destination stripe, but our
+		 * original volume might not be big enough for that at the very end --
+		 * so use the rest of the volume then.
+	         */
+		needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+		if (((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+			needed_uptodate = ((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+		}
+		if (needed_uptodate > 0 && uptodate == needed_uptodate && conf->expand_stripes_ready == 1) {
+			// we can do an expand!
+			sector_t dest_sector, advance;
+			unsigned i;
+			unsigned int dummy1, dummy2, pd_idx;
+
+			if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+			} else {
+				advance = (conf->mddev->size << 1) - conf->expand_progress;
+			}
+
+			// find the parity disk and starting sector
+			dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+				conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+		
+			spin_lock_irq(&conf->device_lock);
+			
+			if (conf->expand_stripes_ready != 1) {
+				// something else just did the expand, we're done here
+				spin_unlock_irq(&conf->device_lock);
+				goto please_wait;
+			}
+			
+			/*
+			 * Check that we won't try to move an area where there's
+			 * still active stripes; if we do, we'll risk inconsistency since we
+			 * suddenly have two different sets of stripes referring to the
+			 * same logical sector.
+			 */
+			{
+				struct stripe_head *ash;
+				unsigned activity = 0, i;
+				sector_t first_touched_sector, last_touched_sector;
+				
+				first_touched_sector = raid5_compute_sector(conf->expand_progress,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+				last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->raid_disks - 1)) >> 9) - 1,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+				for (i = 0; i < NR_HASH; i++) {
+					ash = conf->stripe_hashtbl[i];
+					for (; ash; ash = ash->hash_next) {
+						if (sh == ash && atomic_read(&ash->count) == 1 && !to_write && !locked)
+							continue;   // we'll release it shortly, so it's OK (?)
+
+						// is this stripe active, and within the region we're expanding?
+						if (atomic_read(&ash->count) > 0 &&
+						    ash->disks == conf->previous_raid_disks &&
+						    ash->sector >= first_touched_sector &&
+						    ash->sector <= last_touched_sector) {
+							++activity;
+						}
+					}
+				}
+				
+				if (activity > 0) {
+					printk("Aborting, %u active stripes in the area\n", activity);
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+			
+			spin_lock(&conf->expand_progress_lock);
+			conf->expand_progress += advance;
+
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				if (atomic_read(&newsh->count) != 0)
+					BUG();
+				init_stripe(newsh, dest_sector + i * STRIPE_SECTORS, pd_idx);
+
+				for (d = 0; d < conf->raid_disks; ++d) {
+					if (d == pd_idx) {
+						clear_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						clear_bit(R5_LOCKED, &newsh->dev[d].flags);
+					} else {
+						//struct page *tmp;
+						unsigned di;
+						
+						di = (compute_blocknr(newsh, d) - (conf->expand_progress - advance)) / STRIPE_SECTORS;
+						
+						// swap the two pages, moving the data in place into the stripe
+#if 0
+						// FIXME: this doesn't work. we'll need to fiddle with the bio_vec
+						// as well or we'll simply write out the wrong data.
+						tmp = newsh->dev[d].page;
+						newsh->dev[d].page = conf->expand_buffer[di].page;
+						conf->expand_buffer[di].page = tmp; 
+#else
+						memcpy(page_address(newsh->dev[d].page), page_address(conf->expand_buffer[di].page), STRIPE_SIZE);
+#endif
+					
+						set_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						set_bit(R5_LOCKED, &newsh->dev[d].flags);
+						conf->expand_buffer[di].up_to_date = 0;
+					}
+					set_bit(R5_Wantwrite, &newsh->dev[d].flags);
+				}
+			}
+			conf->expand_stripes_ready = 2;	
+			spin_unlock(&conf->expand_progress_lock);
+			spin_unlock_irq(&conf->device_lock);
+			
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				
+				compute_block(newsh, newsh->pd_idx);
+
+				spin_lock(&newsh->lock);
+				atomic_inc(&newsh->count);
+				clear_bit(STRIPE_SYNCING, &newsh->state);
+				set_bit(STRIPE_INSYNC, &newsh->state);
+				set_bit(STRIPE_HANDLE, &newsh->state);
+				spin_unlock(&newsh->lock);
+#if 0
+				printk("Releasing stripe %u (%u disks)\n", i, newsh->disks);
+				for (d = 0; d < conf->raid_disks; ++d) {
+					unsigned int *ptr = page_address(newsh->dev[d].page);
+					printk("%u: %08x %08x %08x %08x\n", d, ptr[0], ptr[1], ptr[2], ptr[3]);
+				}
+#endif
+				release_stripe(newsh);
+			}
+			
+			conf->expand_stripes_ready = 0;	
+
+			md_done_sync(conf->mddev, advance, 1);
+			wake_up(&conf->wait_for_expand_progress);
+
+			// see if we are done
+			if (conf->expand_progress >= conf->mddev->array_size << 1) {
+				printk("Expand done, finishing...\n");
+				raid5_finish_expand(conf);
+				printk("...done.\n");
+			}
+
+please_wait:			
+			1;
+		}
+	}
+
 	/* now to consider writing and what else, if anything should be read */
 	if (to_write) {
 		int rmw=0, rcw=0;
@@ -1237,7 +1515,9 @@ static void handle_stripe(struct stripe_
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		if (!conf->expand_in_progress) {
+			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		}
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 	
@@ -1279,7 +1559,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1404,8 +1684,6 @@ static int make_request (request_queue_t
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - 1;
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
@@ -1428,18 +1706,55 @@ static int make_request (request_queue_t
 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
+	retry:
+		disks = conf->raid_disks;
+		if (conf->expand_in_progress) {
+			spin_lock_irq(&conf->expand_progress_lock);
+			if (logical_sector >= conf->expand_progress) {
+				disks = conf->previous_raid_disks;
+			}
+			spin_unlock_irq(&conf->expand_progress_lock);
+		}
 		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-
+			disks, disks - 1, &dd_idx, &pd_idx, conf);	
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 		if (sh) {
+			/*
+			 * At this point, our stripe is active and _will_ get
+			 * counted by handle_stripe() if it decides to do an
+			 * expand (which will delay it if that overlaps over
+			 * us). However, we also need to check that there
+			 * wasn't an expand happening while we waited for our
+			 * stripe in get_active_stripe() (or one is in progress
+			 * right now).
+			 */
+			if (conf->expand_in_progress) {
+				int new_disks;
+
+				spin_lock(&conf->expand_progress_lock);
+
+				// recalculate what side we are on
+				if (logical_sector >= conf->expand_progress) {
+					new_disks = conf->previous_raid_disks;
+				} else {
+					new_disks = conf->raid_disks;
+				}
+
+				spin_unlock(&conf->expand_progress_lock);
+				
+				if (disks != new_disks || sh->disks != disks) {
+					printk("progressed\n");
+					release_stripe(sh);
+					goto retry;
+				}
+			}
 			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 				/* Add failed due to overlap.  Flush everything
 				 * and wait a while
@@ -1488,7 +1803,14 @@ static sector_t sync_request(mddev_t *md
 	sector_t first_sector;
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
+	
+	if (conf->expand_in_progress) {
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks-1;
+	}
 
+	BUG_ON(data_disks == 0 || raid_disks == 0);
+	
 	if (sector_nr >= mddev->size <<1) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
@@ -1503,6 +1825,51 @@ static sector_t sync_request(mddev_t *md
 		*skipped = 1;
 		return rv;
 	}
+	
+	/* if we're in an expand, we can't allow the process
+	 * to keep reading in stripes; we might not have enough buffer
+	 * space to keep it all in RAM.
+	 */
+	if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_expand_progress,
+			    sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+		);
+		spin_unlock_irq(&conf->device_lock);
+	}
+
+	/*
+	 * In an expand, we also need to make sure that we have enough destination stripes
+	 * available for writing out the block after we've read in the data, so make sure
+	 * we get them before we start reading any data.
+	 */
+	if (conf->expand_in_progress && conf->expand_stripes_ready == 0) {
+		unsigned i;
+
+		spin_lock_irq(&conf->device_lock);
+		for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+			do {
+				conf->expand_stripes[i] = get_free_stripe(conf, 1);
+
+				if (conf->expand_stripes[i] == NULL) {
+					conf->inactive_blocked = 1;
+					wait_event_lock_irq(conf->wait_for_stripe,
+							    !list_empty(&conf->inactive_list) &&
+							    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked = 0;
+				}
+			} while (conf->expand_stripes[i] == NULL);
+		}
+		spin_unlock_irq(&conf->device_lock);
+
+		conf->expand_stripes_ready = 1;
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1553,6 +1920,8 @@ static void raid5d (mddev_t *mddev)
 	while (1) {
 		struct list_head *first;
 
+		conf = mddev_to_conf(mddev);
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
@@ -1600,7 +1969,7 @@ static int run (mddev_t *mddev)
 	}
 
 	mddev->private = kmalloc (sizeof (raid5_conf_t)
-				  + mddev->raid_disks * sizeof(struct disk_info),
+				  + MAX_MD_DEVS * sizeof(struct disk_info),
 				  GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
@@ -1650,6 +2019,7 @@ static int run (mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_in_progress = 0;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1866,6 +2236,9 @@ static int raid5_remove_disk(mddev_t *md
 	mdk_rdev_t *rdev;
 	struct disk_info *p = conf->disks + number;
 
+	printk("we were asked to remove a disk\n");
+	return -EBUSY;  // FIXME: hack
+	
 	print_raid5_conf(conf);
 	rdev = p->rdev;
 	if (rdev) {
@@ -1904,6 +2277,7 @@ static int raid5_add_disk(mddev_t *mddev
 	 */
 	for (disk=0; disk < mddev->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
+			rdev->faulty = 0;
 			rdev->in_sync = 0;
 			rdev->raid_disk = disk;
 			found = 1;
@@ -1916,6 +2290,7 @@ static int raid5_add_disk(mddev_t *mddev
 
 static int raid5_resize(mddev_t *mddev, sector_t sectors)
 {
+        raid5_conf_t *conf = mddev_to_conf(mddev);
 	/* no resync is happening, and there is enough space
 	 * on all devices, so we can resize.
 	 * We need to make sure resync covers any new space.
@@ -1923,6 +2298,9 @@ static int raid5_resize(mddev_t *mddev, 
 	 * any io in the removed space completes, but it hardly seems
 	 * worth it.
 	 */
+	if (conf->expand_in_progress)
+		return -EBUSY;
+		
 	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
 	mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
 	set_capacity(mddev->gendisk, mddev->array_size << 1);
@@ -1936,6 +2314,125 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	struct list_head *tmp;
+	mdk_rdev_t *rdev;
+	unsigned long flags;
+
+	int d, i;
+	
+	if (mddev->degraded >= 1 || conf->expand_in_progress)
+		return -EBUSY;
+	if (conf->raid_disks == raid_disks)
+		return 0;
+	
+	print_raid5_conf(conf);
+	
+	// the old stripes are too small now; remove them (temporarily
+	// stalling the RAID)
+	for (i = 0; i < conf->max_nr_stripes; ++i) {
+		struct stripe_head *sh;
+		
+		spin_lock_irqsave(&conf->device_lock, flags);
+		sh = get_free_stripe(conf, 0);
+		while (sh == NULL) {
+			wait_event_lock_irq(conf->wait_for_stripe,
+					!list_empty(&conf->inactive_list),
+					conf->device_lock,
+					unplug_slaves(conf->mddev);
+					);
+			sh = get_free_stripe(conf, 0);
+		}
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+
+		shrink_buffers(sh, conf->raid_disks);
+		kmem_cache_free(conf->slab_cache, sh);
+		atomic_dec(&conf->active_stripes);
+	}	
+	kmem_cache_destroy(conf->slab_cache);
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	
+	for (d= conf->raid_disks; d < MAX_MD_DEVS; d++) {
+		conf->disks[d].rdev = NULL;
+	}
+
+	conf->expand_progress = 0;
+	conf->previous_raid_disks = conf->raid_disks;	
+	conf->raid_disks = mddev->raid_disks = raid_disks;	
+
+	spin_lock_init(&conf->expand_progress_lock);
+	
+	init_waitqueue_head(&conf->wait_for_expand_progress);
+
+	ITERATE_RDEV(mddev,rdev,tmp) {
+		for (d= 0; d < conf->raid_disks; d++) {
+			if (conf->disks[d].rdev == rdev) {
+				goto already_there;
+			}
+		}
+
+		raid5_add_disk(mddev, rdev);
+		conf->failed_disks++;
+		
+already_there:		
+		1;
+	}
+
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+	// allocate space for our temporary expansion buffers
+	conf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+	if (conf->expand_buffer == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+			(conf->chunk_size * (raid_disks-1)) >> 10);
+		// FIXME
+		return -ENOMEM;
+	}
+
+	conf->expand_stripes = kmalloc (sizeof(struct stripe_head *) * (conf->chunk_size / STRIPE_SIZE), GFP_KERNEL);
+	if (conf->expand_stripes == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate memory for expand stripe pointers\n");
+		// FIXME
+		return -ENOMEM;
+	}
+	conf->expand_stripes_ready = 0;
+
+	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+		conf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+		if (conf->expand_buffer[i].page == NULL) {
+			printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+					(conf->chunk_size * (raid_disks-1)) >> 10);
+			// FIXME
+			return -ENOMEM;
+		}
+		conf->expand_buffer[i].up_to_date = 0;
+	}
+	
+	conf->expand_in_progress = 1;
+	
+	// allocate stripes of the new size, and get the RAID going again
+	if (grow_stripes(conf, conf->max_nr_stripes)) {
+		BUG();  // FIXME
+		return -ENOMEM;
+	}	
+	
+	print_raid5_conf(conf);
+
+	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+	set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	mddev->recovery_cp = 0;
+	md_wakeup_thread(mddev->thread);
+
+	printk("Starting expand.\n");
+	
+	return 0;
+}
+
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",
@@ -1950,6 +2447,7 @@ static mdk_personality_t raid5_personali
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+	.reshape	= raid5_reshape
 };
 
 static int __init raid5_init (void)
--- /usr/src/old/linux-2.6.13/include/linux/raid/raid5.h	2005-08-29 01:41:01.000000000 +0200
+++ include/linux/raid/raid5.h	2005-10-16 18:22:51.000000000 +0200
@@ -134,6 +134,7 @@ struct stripe_head {
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -199,6 +200,10 @@ struct stripe_head {
 struct disk_info {
 	mdk_rdev_t	*rdev;
 };
+struct expand_buf {
+	struct page    	*page;
+	int		up_to_date;
+};
 
 struct raid5_private_data {
 	struct stripe_head	**stripe_hashtbl;
@@ -208,6 +213,17 @@ struct raid5_private_data {
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	int			expand_in_progress;
+	sector_t		expand_progress;
+	spinlock_t		expand_progress_lock;
+	int			previous_raid_disks;
+	
+	struct expand_buf	*expand_buffer;
+	
+	int			expand_stripes_ready;	
+	struct stripe_head	**expand_stripes;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
@@ -220,6 +236,7 @@ struct raid5_private_data {
 	atomic_t		active_stripes;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_expand_progress;
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Online RAID-5 resizing
  2005-10-16 22:55             ` Neil Brown
  2005-10-17  0:16               ` Steinar H. Gunderson
@ 2005-10-19 23:18               ` Steinar H. Gunderson
  2005-10-20 13:07                 ` Steinar H. Gunderson
                                   ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-19 23:18 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1047 bytes --]

On Mon, Oct 17, 2005 at 08:55:45AM +1000, Neil Brown wrote:
> I'll have a close look at all the code sometime today and get back to
> you with any comments.

Any progress?

I've made a small extra patch now; most of the logic has been moved down to
the bottom of handle_stripe(). (I tried moving it out to sync_request(), but
that caused infinite stalls for a number of reasons -- it would need a more
thorough redesign than just moving the code if we want it down there, plus
some sort of wait queue. I'm not sure if it's worth it.)
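
(For reference, this is the shape of the wait involved -- the attached patch
already throttles the reader in sync_request() this way, so it cannot run
ahead of the buffered chunk; moving the writer side down there would need a
second wait of the same form in the opposite direction:)

	if (conf->expand_in_progress &&
	    sector_nr >= conf->expand_progress +
			(conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
		spin_lock_irq(&conf->device_lock);
		wait_event_lock_irq(conf->wait_for_expand_progress,
			    sector_nr < conf->expand_progress +
				(conf->chunk_size >> 9) * (conf->raid_disks - 1),
			    conf->device_lock,
			    unplug_slaves(conf->mddev);
		);
		spin_unlock_irq(&conf->device_lock);
	}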

The good news is that this actually seems to have fixed the data corruption
issue.  Either that, or I'm luckier than usual; I've done five or six of
my usual stress tests (one intensive writer and one intensive reader while
restriping), and while roughly every other run used to hit corruption
before, none did now. With a bit of luck I fixed some odd race, so we might
be on track for our November 1st resize :-) (Yes, I realize those are the
famous last words. :-) )

/* Steinar */
-- 
Homepage: http://www.sesse.net/


[-- Attachment #2: raid5-online-exp-07.diff --]
[-- Type: text/plain, Size: 28891 bytes --]

--- /usr/src/old/linux-2.6.13/drivers/md/raid5.c	2005-08-29 01:41:01.000000000 +0200
+++ drivers/md/raid5.c	2005-10-20 01:05:52.000000000 +0200
@@ -68,9 +68,18 @@
 #endif
 
 static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+			unsigned int data_disks, unsigned int * dd_idx,
+			unsigned int * pd_idx, raid5_conf_t *conf);
 
 static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	BUG_ON(atomic_read(&sh->count) == 0);
 	if (atomic_dec_and_test(&sh->count)) {
 		if (!list_empty(&sh->lru))
 			BUG();
@@ -133,7 +142,7 @@ static __inline__ void insert_hash(raid5
 
 
 /* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
 {
 	struct stripe_head *sh = NULL;
 	struct list_head *first;
@@ -146,6 +155,12 @@ static struct stripe_head *get_free_stri
 	list_del_init(first);
 	remove_hash(sh);
 	atomic_inc(&conf->active_stripes);
+
+	if (expand || !conf->expand_in_progress)
+		sh->disks = conf->raid_disks;
+	else
+		sh->disks = conf->previous_raid_disks;
+
 out:
 	return sh;
 }
@@ -184,7 +199,7 @@ static void raid5_build_block (struct st
 static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -200,8 +215,14 @@ static inline void init_stripe(struct st
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	
+	if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+		sh->disks = conf->previous_raid_disks;
+	} else {
+		sh->disks = conf->raid_disks;
+	}
 
-	for (i=disks; i--; ) {
+	for (i=sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -245,9 +266,29 @@ static struct stripe_head *get_active_st
 
 	do {
 		sh = __find_stripe(conf, sector);
+
+		// make sure this is of the right size; if not, remove it from the hash
+		// FIXME: is this needed now?
+		if (sh) {
+			int correct_disks = conf->raid_disks;
+			if (conf->expand_in_progress && sector * (conf->raid_disks - 1) >= conf->expand_progress) {
+				correct_disks = conf->previous_raid_disks;
+			}
+
+			if (sh->disks != correct_disks) {
+				BUG_ON(atomic_read(&sh->count) != 0);
+
+				printk("get_stripe %llu with different number of disks (%u, should be %u)\n",
+					sector, sh->disks, correct_disks);
+
+				remove_hash(sh);
+				sh = NULL;
+			}
+		}
+		
 		if (!sh) {
 			if (!conf->inactive_blocked)
-				sh = get_free_stripe(conf);
+				sh = get_free_stripe(conf, 1);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
@@ -303,6 +344,7 @@ static int grow_stripes(raid5_conf_t *co
 			return 1;
 		memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
 		sh->raid_conf = conf;
+		sh->disks = conf->raid_disks;
 		spin_lock_init(&sh->lock);
 
 		if (grow_buffers(sh, conf->raid_disks)) {
@@ -325,7 +367,7 @@ static void shrink_stripes(raid5_conf_t 
 
 	while (1) {
 		spin_lock_irq(&conf->device_lock);
-		sh = get_free_stripe(conf);
+		sh = get_free_stripe(conf, 0);
 		spin_unlock_irq(&conf->device_lock);
 		if (!sh)
 			break;
@@ -344,7 +386,7 @@ static int raid5_end_read_request (struc
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -411,12 +453,60 @@ static int raid5_end_read_request (struc
 	return 0;
 }
 
+							
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+	int i;
+	struct disk_info *tmp;
+	
+	for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+		tmp = conf->disks + i;
+		if (tmp->rdev
+		    && !tmp->rdev->faulty
+		    && !tmp->rdev->in_sync) {
+			conf->mddev->degraded--;
+			conf->failed_disks--;
+			conf->working_disks++;
+			tmp->rdev->in_sync = 1;
+		}
+	}
+	
+	conf->expand_in_progress = 0;
+	
+	// inform the md code that we have more space now
+ 	{	
+		struct block_device *bdev;
+		sector_t sync_sector;
+		unsigned dummy1, dummy2;
+
+		conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+		set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+		conf->mddev->changed = 1;
+
+		sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+			conf->raid_disks - 1, &dummy1, &dummy2, conf);
+		
+		conf->mddev->recovery_cp = sync_sector << 1;    // FIXME: hum, hum
+		set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+		bdev = bdget_disk(conf->mddev->gendisk, 0);
+		if (bdev) {
+			down(&bdev->bd_inode->i_sem);
+			i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+			up(&bdev->bd_inode->i_sem);
+			bdput(bdev);
+		}
+	}
+	
+	/* FIXME: free old stuff here! (what are we missing?) */
+}
+
 static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -570,7 +660,7 @@ static sector_t raid5_compute_sector(sec
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -605,7 +695,8 @@ static sector_t compute_blocknr(struct s
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+				check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
 		return 0;
 	}
 	return r_sector;
@@ -671,8 +762,7 @@ static void copy_data(int frombio, struc
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -702,7 +792,7 @@ static void compute_block(struct stripe_
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -880,7 +970,7 @@ static int add_stripe_bio(struct stripe_
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -945,19 +1035,20 @@ static void handle_stripe(struct stripe_
 		}
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
-		if (!rdev || !rdev->in_sync) {
+		if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
 			failed++;
 			failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
-	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
 	if (failed > 1 && to_read+to_write+written) {
+		printk("Need to fail requests!\n");
+		printk("locked=%d uptodate=%d to_read=%d"
+			" to_write=%d failed=%d failed_num=%d disks=%d\n",
+			locked, uptodate, to_read, to_write, failed, failed_num, disks);
 		spin_lock_irq(&conf->device_lock);
 		for (i=disks; i--; ) {
 			/* fail all writes first */
@@ -1012,7 +1103,7 @@ static void handle_stripe(struct stripe_
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	if (failed > 1 && syncing) {
+	if (failed > 1 && syncing && !conf->expand_in_progress) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
@@ -1093,7 +1184,7 @@ static void handle_stripe(struct stripe_
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
 						i, syncing);
-					if (syncing)
+					if (syncing && !conf->expand_in_progress)
 						md_sync_acct(conf->disks[i].rdev->bdev,
 							     STRIPE_SECTORS);
 				}
@@ -1102,6 +1193,37 @@ static void handle_stripe(struct stripe_
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
+	/* see if we can use this stripe's data in an ongoing expand */
+	if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+		spin_lock_irq(&conf->expand_progress_lock);
+		for (i=0; i<disks; ++i) {
+			sector_t start_sector, dest_sector;
+			unsigned int dd_idx, pd_idx;
+
+			if (i == sh->pd_idx)
+				continue;
+
+			// see what sector this block would land in the new layout
+			start_sector = compute_blocknr(sh, i);
+			dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+				conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+			if (dd_idx > pd_idx)
+				--dd_idx;
+
+			if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ 			    dest_sector * (conf->raid_disks - 1) <  conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				unsigned int ind = (start_sector - conf->expand_progress) / STRIPE_SECTORS;
+				if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+					memcpy(page_address(conf->expand_buffer[ind].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+					conf->expand_buffer[ind].up_to_date = 1;
+				} else {
+					conf->expand_buffer[ind].up_to_date = 0;
+				}
+			}
+		}
+		spin_unlock_irq(&conf->expand_progress_lock);
+	}
+	
 	/* now to consider writing and what else, if anything should be read */
 	if (to_write) {
 		int rmw=0, rcw=0;
@@ -1237,7 +1359,9 @@ static void handle_stripe(struct stripe_
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		if (!conf->expand_in_progress) {
+			md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+		}
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 	
@@ -1279,7 +1403,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1304,6 +1428,167 @@ static void handle_stripe(struct stripe_
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
 	}
+
+	// see if we have the data we need to expand by another block
+	if (conf->expand_in_progress) {
+		int uptodate = 0, needed_uptodate;
+		
+		for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+			uptodate += conf->expand_buffer[i].up_to_date;
+		}
+		/*
+		 * Figure out how many stripes we need for this chunk to be complete.
+		 * In almost all cases, this will be a full destination stripe, but our
+		 * original volume might not be big enough for that at the very end --
+		 * so use the rest of the volume then.
+	         */
+		needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+		if (((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+			needed_uptodate = ((conf->mddev->array_size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+		}
+
+		if (needed_uptodate > 0 && uptodate == needed_uptodate && conf->expand_stripes_ready == 1) {
+			// we can do an expand!
+			sector_t dest_sector, advance;
+			unsigned i;
+			unsigned int dummy1, dummy2, pd_idx;
+
+			if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+				advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+			} else {
+				advance = (conf->mddev->size << 1) - conf->expand_progress;
+			}
+
+			// find the parity disk and starting sector
+			dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+				conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+		
+			spin_lock_irq(&conf->device_lock);
+			
+			if (conf->expand_stripes_ready != 1) {
+				// something else just did the expand, we're done here
+				spin_unlock_irq(&conf->device_lock);
+				goto please_wait;
+			}
+			
+			/*
+			 * Check that we won't try to move an area where there's
+			 * still active stripes; if we do, we'll risk inconsistency since we
+			 * suddenly have two different sets of stripes referring to the
+			 * same logical sector.
+			 */
+			{
+				struct stripe_head *ash;
+				unsigned activity = 0, i;
+				sector_t first_touched_sector, last_touched_sector;
+				
+				first_touched_sector = raid5_compute_sector(conf->expand_progress,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+				last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->raid_disks - 1)) >> 9) - 1,
+					conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+				for (i = 0; i < NR_HASH; i++) {
+					ash = conf->stripe_hashtbl[i];
+					for (; ash; ash = ash->hash_next) {
+						if (sh == ash && atomic_read(&ash->count) == 1)
+							continue;   // we'll release it shortly, so it's OK (?)
+
+						// is this stripe active, and within the region we're expanding?
+						if (atomic_read(&ash->count) > 0 &&
+						    ash->disks == conf->previous_raid_disks &&
+						    ash->sector >= first_touched_sector &&
+						    ash->sector <= last_touched_sector) {
+							++activity;
+						}
+					}
+				}
+				
+				if (activity > 0) {
+					printk("Aborting, %u active stripes in the area\n", activity);
+					spin_unlock_irq(&conf->device_lock);
+					goto please_wait;
+				}
+			}
+			
+			spin_lock(&conf->expand_progress_lock);
+			conf->expand_progress += advance;
+
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				int d;
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				if (atomic_read(&newsh->count) != 0)
+					BUG();
+				init_stripe(newsh, dest_sector + i * STRIPE_SECTORS, pd_idx);
+
+				for (d = 0; d < conf->raid_disks; ++d) {
+					if (d == pd_idx) {
+						clear_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						clear_bit(R5_LOCKED, &newsh->dev[d].flags);
+					} else {
+						//struct page *tmp;
+						unsigned di;
+						
+						di = (compute_blocknr(newsh, d) - (conf->expand_progress - advance)) / STRIPE_SECTORS;
+						
+						// swap the two pages, moving the data in place into the stripe
+#if 0
+						// FIXME: this doesn't work. we'll need to fiddle with the bio_vec
+						// as well or we'll simply write out the wrong data.
+						tmp = newsh->dev[d].page;
+						newsh->dev[d].page = conf->expand_buffer[di].page;
+						conf->expand_buffer[di].page = tmp; 
+#else
+						memcpy(page_address(newsh->dev[d].page), page_address(conf->expand_buffer[di].page), STRIPE_SIZE);
+#endif
+					
+						set_bit(R5_UPTODATE, &newsh->dev[d].flags);
+						set_bit(R5_LOCKED, &newsh->dev[d].flags);
+						conf->expand_buffer[di].up_to_date = 0;
+					}
+					set_bit(R5_Wantwrite, &newsh->dev[d].flags);
+				}
+			}
+			conf->expand_stripes_ready = 2;	
+			spin_unlock(&conf->expand_progress_lock);
+			spin_unlock_irq(&conf->device_lock);
+			
+			for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+				struct stripe_head *newsh = conf->expand_stripes[i];
+				
+				compute_block(newsh, newsh->pd_idx);
+
+				spin_lock(&newsh->lock);
+				atomic_inc(&newsh->count);
+				clear_bit(STRIPE_SYNCING, &newsh->state);
+				set_bit(STRIPE_INSYNC, &newsh->state);
+				set_bit(STRIPE_HANDLE, &newsh->state);
+				spin_unlock(&newsh->lock);
+#if 0
+				printk("Releasing stripe %u (%u disks)\n", i, newsh->disks);
+				for (d = 0; d < conf->raid_disks; ++d) {
+					unsigned int *ptr = page_address(newsh->dev[d].page);
+					printk("%u: %08x %08x %08x %08x\n", d, ptr[0], ptr[1], ptr[2], ptr[3]);
+				}
+#endif
+				release_stripe(newsh);
+			}
+			
+			conf->expand_stripes_ready = 0;	
+
+			md_done_sync(conf->mddev, advance, 1);
+			wake_up(&conf->wait_for_expand_progress);
+
+			// see if we are done
+			if (conf->expand_progress >= conf->mddev->array_size << 1) {
+				printk("Expand done, finishing...\n");
+				raid5_finish_expand(conf);
+				printk("...done.\n");
+			}
+
+please_wait:			
+			1;
+		}
+	}
 }
 
 static inline void raid5_activate_delayed(raid5_conf_t *conf)
@@ -1404,8 +1689,6 @@ static int make_request (request_queue_t
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - 1;
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
@@ -1428,18 +1711,55 @@ static int make_request (request_queue_t
 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
+	retry:
+		disks = conf->raid_disks;
+		if (conf->expand_in_progress) {
+			spin_lock_irq(&conf->expand_progress_lock);
+			if (logical_sector >= conf->expand_progress) {
+				disks = conf->previous_raid_disks;
+			}
+			spin_unlock_irq(&conf->expand_progress_lock);
+		}
 		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-
+			disks, disks - 1, &dd_idx, &pd_idx, conf);	
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 		if (sh) {
+			/*
+			 * At this point, our stripe is active and _will_ get
+			 * counted by handle_stripe() if it decides to do an
+			 * expand (which will delay it if that overlaps over
+			 * us). However, we also need to check that there
+			 * wasn't an expand happening while we waited for our
+			 * stripe in get_active_stripe() (or one is in progress
+			 * right now).
+			 */
+			if (conf->expand_in_progress) {
+				int new_disks;
+
+				spin_lock(&conf->expand_progress_lock);
+
+				// recalculate what side we are on
+				if (logical_sector >= conf->expand_progress) {
+					new_disks = conf->previous_raid_disks;
+				} else {
+					new_disks = conf->raid_disks;
+				}
+
+				spin_unlock(&conf->expand_progress_lock);
+				
+				if (disks != new_disks || sh->disks != disks) {
+					printk("progressed\n");
+					release_stripe(sh);
+					goto retry;
+				}
+			}
 			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 				/* Add failed due to overlap.  Flush everything
 				 * and wait a while
@@ -1488,7 +1808,14 @@ static sector_t sync_request(mddev_t *md
 	sector_t first_sector;
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
+	
+	if (conf->expand_in_progress) {
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks-1;
+	}
 
+	BUG_ON(data_disks == 0 || raid_disks == 0);
+	
 	if (sector_nr >= mddev->size <<1) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
@@ -1503,6 +1830,51 @@ static sector_t sync_request(mddev_t *md
 		*skipped = 1;
 		return rv;
 	}
+	
+	/* if we're in an expand, we can't allow the process
+	 * to keep reading in stripes; we might not have enough buffer
+	 * space to keep it all in RAM.
+	 */
+	if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_expand_progress,
+			    sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+		);
+		spin_unlock_irq(&conf->device_lock);
+	}
+
+	/*
+	 * In an expand, we also need to make sure that we have enough destination stripes
+	 * available for writing out the block after we've read in the data, so make sure
+	 * we get them before we start reading any data.
+	 */
+	if (conf->expand_in_progress && conf->expand_stripes_ready == 0) {
+		unsigned i;
+
+		spin_lock_irq(&conf->device_lock);
+		for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+			do {
+				conf->expand_stripes[i] = get_free_stripe(conf, 1);
+
+				if (conf->expand_stripes[i] == NULL) {
+					conf->inactive_blocked = 1;
+					wait_event_lock_irq(conf->wait_for_stripe,
+							    !list_empty(&conf->inactive_list) &&
+							    (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+							     || !conf->inactive_blocked),
+							    conf->device_lock,
+							    unplug_slaves(conf->mddev);
+						);
+					conf->inactive_blocked = 0;
+				}
+			} while (conf->expand_stripes[i] == NULL);
+		}
+		spin_unlock_irq(&conf->device_lock);
+
+		conf->expand_stripes_ready = 1;
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1553,6 +1925,8 @@ static void raid5d (mddev_t *mddev)
 	while (1) {
 		struct list_head *first;
 
+		conf = mddev_to_conf(mddev);
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
@@ -1600,7 +1974,7 @@ static int run (mddev_t *mddev)
 	}
 
 	mddev->private = kmalloc (sizeof (raid5_conf_t)
-				  + mddev->raid_disks * sizeof(struct disk_info),
+				  + MAX_MD_DEVS * sizeof(struct disk_info),
 				  GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
@@ -1650,6 +2024,7 @@ static int run (mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_in_progress = 0;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1866,6 +2241,9 @@ static int raid5_remove_disk(mddev_t *md
 	mdk_rdev_t *rdev;
 	struct disk_info *p = conf->disks + number;
 
+	printk("we were asked to remove a disk\n");
+	return -EBUSY;  // FIXME: hack
+	
 	print_raid5_conf(conf);
 	rdev = p->rdev;
 	if (rdev) {
@@ -1904,6 +2282,7 @@ static int raid5_add_disk(mddev_t *mddev
 	 */
 	for (disk=0; disk < mddev->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
+			rdev->faulty = 0;
 			rdev->in_sync = 0;
 			rdev->raid_disk = disk;
 			found = 1;
@@ -1916,6 +2295,7 @@ static int raid5_add_disk(mddev_t *mddev
 
 static int raid5_resize(mddev_t *mddev, sector_t sectors)
 {
+        raid5_conf_t *conf = mddev_to_conf(mddev);
 	/* no resync is happening, and there is enough space
 	 * on all devices, so we can resize.
 	 * We need to make sure resync covers any new space.
@@ -1923,6 +2303,9 @@ static int raid5_resize(mddev_t *mddev, 
 	 * any io in the removed space completes, but it hardly seems
 	 * worth it.
 	 */
+	if (conf->expand_in_progress)
+		return -EBUSY;
+		
 	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
 	mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
 	set_capacity(mddev->gendisk, mddev->array_size << 1);
@@ -1936,6 +2319,125 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	struct list_head *tmp;
+	mdk_rdev_t *rdev;
+	unsigned long flags;
+
+	int d, i;
+	
+	if (mddev->degraded >= 1 || conf->expand_in_progress)
+		return -EBUSY;
+	if (conf->raid_disks == raid_disks)
+		return 0;
+	
+	print_raid5_conf(conf);
+	
+	// the old stripes are too small now; remove them (temporarily
+	// stalling the RAID)
+	for (i = 0; i < conf->max_nr_stripes; ++i) {
+		struct stripe_head *sh;
+		
+		spin_lock_irqsave(&conf->device_lock, flags);
+		sh = get_free_stripe(conf, 0);
+		while (sh == NULL) {
+			wait_event_lock_irq(conf->wait_for_stripe,
+					!list_empty(&conf->inactive_list),
+					conf->device_lock,
+					unplug_slaves(conf->mddev);
+					);
+			sh = get_free_stripe(conf, 0);
+		}
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+
+		shrink_buffers(sh, conf->raid_disks);
+		kmem_cache_free(conf->slab_cache, sh);
+		atomic_dec(&conf->active_stripes);
+	}	
+	kmem_cache_destroy(conf->slab_cache);
+	
+	spin_lock_irqsave(&conf->device_lock, flags);
+	
+	for (d= conf->raid_disks; d < MAX_MD_DEVS; d++) {
+		conf->disks[d].rdev = NULL;
+	}
+
+	conf->expand_progress = 0;
+	conf->previous_raid_disks = conf->raid_disks;	
+	conf->raid_disks = mddev->raid_disks = raid_disks;	
+
+	spin_lock_init(&conf->expand_progress_lock);
+	
+	init_waitqueue_head(&conf->wait_for_expand_progress);
+
+	ITERATE_RDEV(mddev,rdev,tmp) {
+		for (d= 0; d < conf->raid_disks; d++) {
+			if (conf->disks[d].rdev == rdev) {
+				goto already_there;
+			}
+		}
+
+		raid5_add_disk(mddev, rdev);
+		conf->failed_disks++;
+		
+already_there:
+	;
+	}
+
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+	// allocate space for our temporary expansion buffers
+	conf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+	if (conf->expand_buffer == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+			(conf->chunk_size * (raid_disks-1)) >> 10);
+		// FIXME
+		return -ENOMEM;
+	}
+
+	conf->expand_stripes = kmalloc (sizeof(struct stripe_head *) * (conf->chunk_size / STRIPE_SIZE), GFP_KERNEL);
+	if (conf->expand_stripes == NULL) {
+		printk(KERN_ERR "raid5: couldn't allocate memory for expand stripe pointers\n");
+		// FIXME
+		return -ENOMEM;
+	}
+	conf->expand_stripes_ready = 0;
+
+	for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+		conf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+		if (conf->expand_buffer[i].page == NULL) {
+			printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+					(conf->chunk_size * (raid_disks-1)) >> 10);
+			// FIXME
+			return -ENOMEM;
+		}
+		conf->expand_buffer[i].up_to_date = 0;
+	}
+	
+	conf->expand_in_progress = 1;
+	
+	// allocate stripes of the new size, and get the RAID going again
+	if (grow_stripes(conf, conf->max_nr_stripes)) {
+		BUG();  // FIXME
+		return -ENOMEM;
+	}	
+	
+	print_raid5_conf(conf);
+
+	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+	set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	mddev->recovery_cp = 0;
+	md_wakeup_thread(mddev->thread);
+
+	printk("Starting expand.\n");
+	
+	return 0;
+}
+
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",
@@ -1950,6 +2452,7 @@ static mdk_personality_t raid5_personali
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+	.reshape	= raid5_reshape
 };
 
 static int __init raid5_init (void)
--- /usr/src/old/linux-2.6.13/include/linux/raid/raid5.h	2005-08-29 01:41:01.000000000 +0200
+++ include/linux/raid/raid5.h	2005-10-20 00:40:01.000000000 +0200
@@ -134,6 +134,7 @@ struct stripe_head {
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -199,6 +200,10 @@ struct stripe_head {
 struct disk_info {
 	mdk_rdev_t	*rdev;
 };
+struct expand_buf {
+	struct page    	*page;
+	int		up_to_date;
+};
 
 struct raid5_private_data {
 	struct stripe_head	**stripe_hashtbl;
@@ -208,6 +213,17 @@ struct raid5_private_data {
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	int			expand_in_progress;
+	sector_t		expand_progress;
+	spinlock_t		expand_progress_lock;
+	int			previous_raid_disks;
+	
+	struct expand_buf	*expand_buffer;
+	
+	int			expand_stripes_ready;	
+	struct stripe_head	**expand_stripes;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
@@ -220,6 +236,7 @@ struct raid5_private_data {
 	atomic_t		active_stripes;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_expand_progress;
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free


* Re: [PATCH] Online RAID-5 resizing
  2005-10-19 23:18               ` Steinar H. Gunderson
@ 2005-10-20 13:07                 ` Steinar H. Gunderson
  2005-10-22 13:45                 ` Steinar H. Gunderson
  2005-10-24  0:37                 ` Neil Brown
  2 siblings, 0 replies; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-20 13:07 UTC (permalink / raw)
  To: linux-raid

On Thu, Oct 20, 2005 at 01:18:30AM +0200, Steinar H. Gunderson wrote:
> The good news is that this actually seems to have fixed the data corruption
> issue.  Either that, or I'm more lucky than usual;

It resurfaced in a second set of heavier tests, so it's still there. :-/

/* Steinar */
-- 
Homepage: http://www.sesse.net/



* Re: [PATCH] Online RAID-5 resizing
  2005-10-19 23:18               ` Steinar H. Gunderson
  2005-10-20 13:07                 ` Steinar H. Gunderson
@ 2005-10-22 13:45                 ` Steinar H. Gunderson
  2005-10-22 13:52                   ` Neil Brown
  2005-10-24  0:37                 ` Neil Brown
  2 siblings, 1 reply; 23+ messages in thread
From: Steinar H. Gunderson @ 2005-10-22 13:45 UTC (permalink / raw)
  To: linux-raid

On Thu, Oct 20, 2005 at 01:18:30AM +0200, Steinar H. Gunderson wrote:
> Any progress?

Still nothing. I'm at the point now where I'll just let it lie -- there's
been zero real feedback since the very first patch, and it's really not
going anywhere anytime soon. Versions 05 and 06 of the patch are broken,
BTW; the delay code needs to go back in for cases like expanding 3 -> 4
disks, but it seems to interfere with the code for not expanding over
active stripes, so something must be done. Oh well, I have other things to do...

/* Steinar */
-- 
Homepage: http://www.sesse.net/



* Re: [PATCH] Online RAID-5 resizing
  2005-10-22 13:45                 ` Steinar H. Gunderson
@ 2005-10-22 13:52                   ` Neil Brown
  0 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2005-10-22 13:52 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Saturday October 22, sgunderson@bigfoot.com wrote:
> On Thu, Oct 20, 2005 at 01:18:30AM +0200, Steinar H. Gunderson wrote:
> > Any progress?
> 
> Still nothing. I'm at the point now where I'll just let it lie -- there's
> been zero real feedback since the very first patch, and it's really not
> going anywhere anytime soon. Versions 05 and 06 of the patch are broken,
> BTW; the delay code needs to go back in for cases like expanding 3 -> 4
> disks, but it seems to interfere with the code for not expanding over
> active stripes, so something must be done. Oh well, I have other things to do...

I am definitely interested in seeing this progress.  I've been busy
lately and haven't had a chance to give it the thought that it
deserves.  I'll try to schedule some time this week.

NeilBrown


* Re: [PATCH] Online RAID-5 resizing
  2005-10-19 23:18               ` Steinar H. Gunderson
  2005-10-20 13:07                 ` Steinar H. Gunderson
  2005-10-22 13:45                 ` Steinar H. Gunderson
@ 2005-10-24  0:37                 ` Neil Brown
  2 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2005-10-24  0:37 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: linux-raid

On Thursday October 20, sgunderson@bigfoot.com wrote:
> On Mon, Oct 17, 2005 at 08:55:45AM +1000, Neil Brown wrote:
> > I'll have a close look at all the code sometime today and get back to
> > you with any comments.
> 
> Any progress?
> 

Ok, I've had another fairly detailed look...
I'd like to suggest another simplification (at least, I think it's a
simplification, let me know what you think).

While working on expanding one section, you currently have a bunch of
stripe_heads (expand_stripes) sitting idle, waiting for all the reads
to complete, and you also have a collection of pages (expand_buffer)
to copy the data into when it is read.
I think you can combine these two, so that the expand_stripes are less
idle (i.e. their buffer space gets used) and so that expand_buffer
isn't needed.

I would first include the "sh->disks" value in the 'key' used to look
stripe_heads up in the hash table.  That way you could easily have 
'old' stripes and 'new' stripes, possibly covering overlapping regions,
living in the hash table at the same time.
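
A rough sketch of the lookup, modelled on the existing __find_stripe
(the 'disks' argument and the sh->disks match are the new parts; treat
this as shape, not final code):

static struct stripe_head *__find_stripe(raid5_conf_t *conf,
					 sector_t sector, int disks)
{
	struct stripe_head *sh;

	/* hash on sector as before; 'disks' only refines the match, so
	 * old-geometry and new-geometry stripes can share a hash chain */
	for (sh = *stripe_hash(conf, sector); sh; sh = sh->hash_next)
		if (sh->sector == sector && sh->disks == disks)
			return sh;
	return NULL;
}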

I would change sync_request to work in units of chunk_size rather than
STRIPE_SECTORS, as I think that makes it a bit easier too
(sync_request can schedule as much sync activity as it likes - it just
has to return the correct number of sectors).

So what sync_request would do would be (a rough code sketch follows the list):
 1/ get_active_stripe a collection of stripes with sh->disks being the
   new size, and flag all of them as being in an expand, and set the
   sh->count to be the number of blocks that need to be loaded into
   the stripe.  Also set STRIPE_HANDLE so as soon as the count reaches
   0, the stripe will be handled. (Actually, you probably wouldn't use
   get_active_stripe, you would use get_free_stripe, and then init_stripe).
 2/ advance 'expand_progress' to the end of this set of stripes
 3/ get_active_stripe all of the stripes (with old size in sh->disks)
    that need to be read to fill in the new set of stripes, and flag
    them so that handle_stripe will read all the blocks.
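
As a shape-only sketch (STRIPE_EXPANDING, blocks_needed() and
read_source_stripes() are illustrative names, not symbols from the
patch or from mainline; locking and error handling are elided):

static void expand_one_chunk(raid5_conf_t *conf, sector_t sector_nr)
{
	int i, stripes_per_chunk = conf->chunk_size / STRIPE_SIZE;
	struct stripe_head *sh;

	/* 1/ destination stripes, set up at the new width */
	for (i = 0; i < stripes_per_chunk; i++) {
		sh = get_free_stripe(conf);	/* plus init_stripe() */
		sh->disks = conf->raid_disks;
		set_bit(STRIPE_EXPANDING, &sh->state);
		/* handled as soon as every source block has arrived */
		atomic_set(&sh->count, blocks_needed(conf, sh));
		set_bit(STRIPE_HANDLE, &sh->state);
	}

	/* 2/ publish the new frontier before scheduling any reads */
	spin_lock_irq(&conf->expand_progress_lock);
	conf->expand_progress = sector_nr + conf->chunk_size / 512;
	spin_unlock_irq(&conf->expand_progress_lock);
	wake_up(&conf->wait_for_expand_progress);

	/* 3/ read the old-geometry stripes that feed this chunk */
	read_source_stripes(conf, sector_nr);
}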
  
make_request will now find the new-sized stripes when looking for
blocks in this area of the array.  It will not find the old stripes, so
we know that no new IO requests will be added to old stripes in this
region of the array.
If make_request finds a stripe that is flagged as being in an expand,
then it should block until the expand moves onward
(wait_for_expand_progress).
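
Something like this in make_request, assuming a retry label around the
stripe lookup (sector_in_expand_window() is a made-up helper for the
progress test, and STRIPE_EXPANDING is likewise only illustrative):

	sh = get_active_stripe(conf, new_sector, pd_idx, 0);
	if (sh && test_bit(STRIPE_EXPANDING, &sh->state)) {
		/* stripe is being rewritten at the new width */
		release_stripe(sh);
		wait_event(conf->wait_for_expand_progress,
			   !sector_in_expand_window(conf, new_sector));
		goto retry;
	}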

In handle_stripe, if we get a stripe that
  - was flagged for reading prior to expand and
  - has all the required blocks read in and
  - has no pending IO requests in the region of the current expand
then
  - transfer (either memcpy or pointer fiddling) the data into the
    stripes that are waiting for the data, and decrement the ->count
    for those stripes by the number of blocks that were transferred.

When handle_stripe gets a stripe that is flagged for expanding, it
knows that all the data has been transferred in, so it updates the
parity block and schedules a write.
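
The transfer step might then look roughly like this (the
STRIPE_EXPAND_SOURCE flag and the all_blocks_uptodate() and
dest_stripe_for() helpers are invented for the sketch):

	if (test_bit(STRIPE_EXPAND_SOURCE, &sh->state) &&
	    all_blocks_uptodate(sh)) {
		for (i = 0; i < sh->disks; i++) {
			int dd_idx;
			struct stripe_head *dst =
				dest_stripe_for(conf, sh, i, &dd_idx);
			if (dst == NULL)
				continue;	/* e.g. the old parity block */
			memcpy(page_address(dst->dev[dd_idx].page),
			       page_address(sh->dev[i].page), STRIPE_SIZE);
			set_bit(R5_UPTODATE, &dst->dev[dd_idx].flags);
			/* last block in: dst computes parity and writes */
			if (atomic_dec_and_test(&dst->count))
				set_bit(STRIPE_HANDLE, &dst->state);
		}
	}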

I think that with that change in place, I'll be happy with the
structure and will then start a closer review of the fine detail and
help you find the corruption bug (if it still exists).

Thanks for your continuing efforts.

NeilBrown


Thread overview: 23+ messages
2005-09-20 14:33 [PATCH] Online RAID-5 resizing Steinar H. Gunderson
2005-09-20 15:01 ` Neil Brown
2005-09-20 15:36   ` Steinar H. Gunderson
2005-09-22 16:16     ` Neil Brown
2005-09-22 16:32       ` Steinar H. Gunderson
2005-09-23  8:59         ` Neil Brown
2005-09-23 12:50           ` Steinar H. Gunderson
2005-09-22 20:53       ` Steinar H. Gunderson
2005-09-24  1:44       ` Steinar H. Gunderson
2005-10-07  3:09         ` Neil Brown
2005-10-07 14:13           ` Steinar H. Gunderson
2005-10-14 19:46           ` Steinar H. Gunderson
2005-10-16 22:55             ` Neil Brown
2005-10-17  0:16               ` Steinar H. Gunderson
2005-10-19 23:18               ` Steinar H. Gunderson
2005-10-20 13:07                 ` Steinar H. Gunderson
2005-10-22 13:45                 ` Steinar H. Gunderson
2005-10-22 13:52                   ` Neil Brown
2005-10-24  0:37                 ` Neil Brown
2005-09-20 18:54   ` Al Boldi
2005-09-21 19:23   ` Steinar H. Gunderson
2005-09-22  0:14     ` Steinar H. Gunderson
2005-09-22  1:00       ` Steinar H. Gunderson
