* [md PATCH 2/6] md/raid5: enhance raid5_size to work correctly with negative delta_disks
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-24 8:53 ` [md PATCH 3/6] md: add explicit method to signal the end of a reshape NeilBrown
` (4 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
This is the first of four patches which combine to allow md/raid5 to
reduce the number of devices in the array by restriping the data over
a subset of the devices.
If the number of disks in a raid4/5/6 is being reduced, then the
default size must be based on the new number, not the old number
of devices.
In general, it should be based on the smaller of new and old.
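For illustration only (a sketch of the intent, not a separate change; the hunk below spells the comparison out with an explicit if/else rather than the kernel's min() helper), the default-size computation in raid5_size() becomes:

	if (!raid_disks)
		/* size is defined by the smallest of previous and new */
		raid_disks = min(conf->raid_disks,
				 conf->previous_raid_disks);
	sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
	return sectors * (raid_disks - conf->max_degraded);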
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3b71f01..74e89c1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4169,8 +4169,13 @@ raid5_size(mddev_t *mddev, sector_t sectors, int raid_disks)
if (!sectors)
sectors = mddev->dev_sectors;
- if (!raid_disks)
- raid_disks = conf->previous_raid_disks;
+ if (!raid_disks) {
+ /* size is defined by the smallest of previous and new size */
+ if (conf->raid_disks < conf->previous_raid_disks)
+ raid_disks = conf->raid_disks;
+ else
+ raid_disks = conf->previous_raid_disks;
+ }
sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
return sectors * (raid_disks - conf->max_degraded);
* [md PATCH 3/6] md: add explicit method to signal the end of a reshape.
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
2009-03-24 8:53 ` [md PATCH 2/6] md/raid5: enhance raid5_size to work correctly with negative delta_disks NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-24 8:53 ` [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards NeilBrown
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished by sync_request being
given a large value, and handles any cleanup then.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method instead.
The key difference is that this method is called with the
device lock held so finish_reshape is free to make config changes.
This means we can remove the slightly-ugly
md_set_array_sectors_lock, and it means that when raid5 finishes
a reshape that reduces the number of devices, it is free to
remove devices from the array (which requires a lock).
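For reference, the hook added to md_check_recovery() (mirroring the hunk below) is just a guarded call made while the mddev is locked:

	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
	    mddev->pers->finish_reshape)
		mddev->pers->finish_reshape(mddev);

Because this runs under the lock md_check_recovery() already holds, raid5's implementation is free to resize the array and remove devices directly.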
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 11 +++--------
drivers/md/md.h | 2 +-
drivers/md/raid5.c | 32 +++++++++++++++-----------------
3 files changed, 19 insertions(+), 26 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 923d125..c509313 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5073,14 +5073,6 @@ void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors)
}
EXPORT_SYMBOL(md_set_array_sectors);
-void md_set_array_sectors_lock(mddev_t *mddev, sector_t array_sectors)
-{
- mddev_lock(mddev);
- md_set_array_sectors(mddev, array_sectors);
- mddev_unlock(mddev);
-}
-EXPORT_SYMBOL(md_set_array_sectors_lock);
-
static int update_size(mddev_t *mddev, sector_t num_sectors)
{
mdk_rdev_t *rdev;
@@ -6641,6 +6633,9 @@ void md_check_recovery(mddev_t *mddev)
sysfs_notify(&mddev->kobj, NULL,
"degraded");
}
+ if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+ mddev->pers->finish_reshape)
+ mddev->pers->finish_reshape(mddev);
md_update_sb(mddev, 1);
/* if array is no-longer degraded, then any saved_raid_disk
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 614329d..50b7fc3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -316,6 +316,7 @@ struct mdk_personality
sector_t (*size) (mddev_t *mddev, sector_t sectors, int raid_disks);
int (*check_reshape) (mddev_t *mddev);
int (*start_reshape) (mddev_t *mddev);
+ void (*finish_reshape) (mddev_t *mddev);
int (*reconfig) (mddev_t *mddev, int layout, int chunk_size);
/* quiesce moves between quiescence states
* 0 - fully active
@@ -432,4 +433,3 @@ extern void md_new_event(mddev_t *mddev);
extern int md_allow_write(mddev_t *mddev);
extern void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
extern void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors);
-extern void md_set_array_sectors_lock(mddev_t *mddev, sector_t array_sectors);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 74e89c1..163e5fc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1979,8 +1979,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
return 0;
}
-static void end_reshape(raid5_conf_t *conf);
-
static int page_is_zero(struct page *p)
{
char *a = page_address(p);
@@ -3850,10 +3848,6 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
if (sector_nr >= max_sector) {
/* just being told to finish up .. nothing much to do */
unplug_slaves(mddev);
- if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
- end_reshape(conf);
- return 0;
- }
if (mddev->curr_resync < max_sector) /* aborted */
bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
@@ -4833,30 +4827,30 @@ static int raid5_start_reshape(mddev_t *mddev)
}
#endif
-static void end_reshape(raid5_conf_t *conf)
+static void raid5_finish_reshape(mddev_t *mddev)
{
struct block_device *bdev;
+ raid5_conf_t *conf = mddev_to_conf(mddev);
- if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
- mddev_t *mddev = conf->mddev;
+ if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
- md_set_array_sectors_lock(mddev, raid5_size(mddev, 0, conf->raid_disks));
+ conf->previous_raid_disks = conf->raid_disks;
+ md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
set_capacity(mddev->gendisk, mddev->array_sectors);
mddev->changed = 1;
- conf->previous_raid_disks = conf->raid_disks;
- bdev = bdget_disk(conf->mddev->gendisk, 0);
+ bdev = bdget_disk(mddev->gendisk, 0);
if (bdev) {
mutex_lock(&bdev->bd_inode->i_mutex);
i_size_write(bdev->bd_inode,
- (loff_t)conf->mddev->array_sectors << 9);
+ (loff_t)mddev->array_sectors << 9);
mutex_unlock(&bdev->bd_inode->i_mutex);
bdput(bdev);
}
spin_lock_irq(&conf->device_lock);
conf->expand_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
- conf->mddev->reshape_position = MaxSector;
+ mddev->reshape_position = MaxSector;
/* read-ahead size must cover two whole stripes, which is
* 2 * (datadisks) * chunksize where 'n' is the number of raid devices
@@ -4864,9 +4858,10 @@ static void end_reshape(raid5_conf_t *conf)
{
int data_disks = conf->previous_raid_disks - conf->max_degraded;
int stripe = data_disks *
- (conf->mddev->chunk_size / PAGE_SIZE);
- if (conf->mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
- conf->mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
+ (mddev->chunk_size / PAGE_SIZE);
+ if (mddev->queue->backing_dev_info.ra_pages < 2*stripe)
+ mddev->queue->backing_dev_info.ra_pages
+ = 2 * stripe;
}
}
}
@@ -5096,6 +5091,7 @@ static struct mdk_personality raid6_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
@@ -5119,6 +5115,7 @@ static struct mdk_personality raid5_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
@@ -5144,6 +5141,7 @@ static struct mdk_personality raid4_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
};
* [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
2009-03-24 8:53 ` [md PATCH 2/6] md/raid5: enhance raid5_size to work correctly with negative delta_disks NeilBrown
2009-03-24 8:53 ` [md PATCH 3/6] md: add explicit method to signal the end of a reshape NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-27 16:19 ` Andre Noll
2009-03-24 8:53 ` [md PATCH 1/6] md/raid5: drop qd_idx from r6_state NeilBrown
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning. So we need to handle expand_progress and expand_lo
differently.
The patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.
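As a minimal sketch (this helper is hypothetical and not part of the patch, which writes the test out as a ternary at each call site), the direction-dependent check reduces to:

	/* Hypothetical helper: true if this logical sector still maps to
	 * the old ("previous") geometry because the reshape has not
	 * reached it yet.  When shrinking (delta_disks < 0) the reshape
	 * runs from the end of the array downwards, so the comparison
	 * flips.
	 */
	static inline int sector_not_yet_reshaped(mddev_t *mddev,
						  raid5_conf_t *conf,
						  sector_t logical_sector)
	{
		if (mddev->delta_disks < 0)
			return logical_sector < conf->reshape_progress;
		return logical_sector >= conf->reshape_progress;
	}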
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 93 ++++++++++++++++++++++++++++++++++------------------
drivers/md/raid5.h | 15 ++++++--
2 files changed, 70 insertions(+), 38 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 163e5fc..8447c9f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3591,24 +3591,28 @@ static int make_request(struct request_queue *q, struct bio * bi)
retry:
previous = 0;
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- if (likely(conf->expand_progress == MaxSector))
+ if (likely(conf->reshape_progress == MaxSector))
disks = conf->raid_disks;
else {
- /* spinlock is needed as expand_progress may be
+ /* spinlock is needed as reshape_progress may be
* 64bit on a 32bit platform, and so it might be
* possible to see a half-updated value
- * Ofcourse expand_progress could change after
+ * Ofcourse reshape_progress could change after
* the lock is dropped, so once we get a reference
* to the stripe that we think it is, we will have
* to check again.
*/
spin_lock_irq(&conf->device_lock);
disks = conf->raid_disks;
- if (logical_sector >= conf->expand_progress) {
+ if (mddev->delta_disks < 0
+ ? logical_sector < conf->reshape_progress
+ : logical_sector >= conf->reshape_progress) {
disks = conf->previous_raid_disks;
previous = 1;
} else {
- if (logical_sector >= conf->expand_lo) {
+ if (mddev->delta_disks < 0
+ ? logical_sector < conf->reshape_safe
+ : logical_sector >= conf->reshape_safe) {
spin_unlock_irq(&conf->device_lock);
schedule();
goto retry;
@@ -3628,7 +3632,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
sh = get_active_stripe(conf, new_sector, previous,
(bi->bi_rw&RWA_MASK));
if (sh) {
- if (unlikely(conf->expand_progress != MaxSector)) {
+ if (unlikely(conf->reshape_progress != MaxSector)) {
/* expansion might have moved on while waiting for a
* stripe, so we must do the range check again.
* Expansion could still move past after this
@@ -3639,8 +3643,10 @@ static int make_request(struct request_queue *q, struct bio * bi)
*/
int must_retry = 0;
spin_lock_irq(&conf->device_lock);
- if (logical_sector < conf->expand_progress &&
- disks == conf->previous_raid_disks)
+ if ((mddev->delta_disks < 0
+ ? logical_sector >= conf->reshape_progress
+ : logical_sector < conf->reshape_progress)
+ && disks == conf->previous_raid_disks)
/* mismatch, need to try again */
must_retry = 1;
spin_unlock_irq(&conf->device_lock);
@@ -3718,13 +3724,20 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int dd_idx;
sector_t writepos, safepos, gap;
- if (sector_nr == 0 &&
- conf->expand_progress != 0) {
- /* restarting in the middle, skip the initial sectors */
- sector_nr = conf->expand_progress;
+ if (sector_nr == 0) {
+ /* If restarting in the middle, skip the initial sectors */
+ if (mddev->delta_disks < 0 &&
+ conf->reshape_progress < raid5_size(mddev, 0, 0)) {
+ sector_nr = raid5_size(mddev, 0, 0)
+ - conf->reshape_progress;
+ } else if (mddev->delta_disks > 0 &&
+ conf->reshape_progress > 0)
+ sector_nr = conf->reshape_progress;
sector_div(sector_nr, new_data_disks);
- *skipped = 1;
- return sector_nr;
+ if (sector_nr) {
+ *skipped = 1;
+ return sector_nr;
+ }
}
/* we update the metadata when there is more than 3Meg
@@ -3732,28 +3745,36 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
* probably be time based) or when the data about to be
* copied would over-write the source of the data at
* the front of the range.
- * i.e. one new_stripe forward from expand_progress new_maps
- * to after where expand_lo old_maps to
+ * i.e. one new_stripe along from reshape_progress new_maps
+ * to after where reshape_safe old_maps to
*/
- writepos = conf->expand_progress +
- conf->chunk_size/512*(new_data_disks);
+ if (mddev->delta_disks < 0) {
+ writepos = conf->reshape_progress -
+ conf->chunk_size/512*(new_data_disks);
+ gap = conf->reshape_safe - conf->reshape_progress;
+ } else {
+ writepos = conf->reshape_progress +
+ conf->chunk_size/512*(new_data_disks);
+ gap = conf->reshape_progress - conf->reshape_safe;
+ }
sector_div(writepos, new_data_disks);
- safepos = conf->expand_lo;
+ safepos = conf->reshape_safe;
sector_div(safepos, data_disks);
- gap = conf->expand_progress - conf->expand_lo;
- if (writepos >= safepos ||
+ if ((mddev->delta_disks < 0
+ ? writepos < safepos
+ : writepos > safepos) ||
gap > (new_data_disks)*3000*2 /*3Meg*/) {
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes)==0);
- mddev->reshape_position = conf->expand_progress;
+ mddev->reshape_position = conf->reshape_progress;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait, mddev->flags == 0 ||
kthread_should_stop());
spin_lock_irq(&conf->device_lock);
- conf->expand_lo = mddev->reshape_position;
+ conf->reshape_safe = mddev->reshape_position;
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
}
@@ -3790,7 +3811,10 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
release_stripe(sh);
}
spin_lock_irq(&conf->device_lock);
- conf->expand_progress = (sector_nr + i) * new_data_disks;
+ if (mddev->delta_disks < 0)
+ conf->reshape_progress -= i * new_data_disks;
+ else
+ conf->reshape_progress += i * new_data_disks;
spin_unlock_irq(&conf->device_lock);
/* Ok, those stripe are ready. We can start scheduling
* reads on the source stripes.
@@ -3821,14 +3845,14 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes) == 0);
- mddev->reshape_position = conf->expand_progress;
+ mddev->reshape_position = conf->reshape_progress;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_DEVS, &mddev->flags)
|| kthread_should_stop());
spin_lock_irq(&conf->device_lock);
- conf->expand_lo = mddev->reshape_position;
+ conf->reshape_safe = mddev->reshape_position;
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
}
@@ -4276,7 +4300,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
conf->max_degraded = 1;
conf->algorithm = mddev->new_layout;
conf->max_nr_stripes = NR_STRIPES;
- conf->expand_progress = mddev->reshape_position;
+ conf->reshape_progress = mddev->reshape_position;
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
@@ -4434,9 +4458,9 @@ static int run(mddev_t *mddev)
print_raid5_conf(conf);
- if (conf->expand_progress != MaxSector) {
+ if (conf->reshape_progress != MaxSector) {
printk("...ok start reshape thread\n");
- conf->expand_lo = conf->expand_progress;
+ conf->reshape_safe = conf->reshape_progress;
atomic_set(&conf->reshape_stripes, 0);
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -4774,8 +4798,11 @@ static int raid5_start_reshape(mddev_t *mddev)
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
conf->raid_disks += mddev->delta_disks;
- conf->expand_progress = 0;
- conf->expand_lo = 0;
+ if (mddev->delta_disks < 0)
+ conf->reshape_progress = raid5_size(mddev, 0, 0);
+ else
+ conf->reshape_progress = 0;
+ conf->reshape_safe = conf->reshape_progress;
spin_unlock_irq(&conf->device_lock);
/* Add some new drives, as many as will fit.
@@ -4817,7 +4844,7 @@ static int raid5_start_reshape(mddev_t *mddev)
mddev->recovery = 0;
spin_lock_irq(&conf->device_lock);
mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
- conf->expand_progress = MaxSector;
+ conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
return -EAGAIN;
}
@@ -4848,7 +4875,7 @@ static void raid5_finish_reshape(mddev_t *mddev)
bdput(bdev);
}
spin_lock_irq(&conf->device_lock);
- conf->expand_progress = MaxSector;
+ conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
mddev->reshape_position = MaxSector;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c2f37f2..48083a2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -337,11 +337,16 @@ struct raid5_private_data {
int raid_disks;
int max_nr_stripes;
- /* used during an expand */
- sector_t expand_progress; /* MaxSector when no expand happening */
- sector_t expand_lo; /* from here up to expand_progress it out-of-bounds
- * as we haven't flushed the metadata yet
- */
+ /* reshape_progress is the leading edge of a 'reshape'
+ * It has value MaxSector when no expand is happening
+ * If delta_disks < 0, it is the last sector we started work on,
+ * else is it the next sector to work on.
+ */
+ sector_t reshape_progress;
+ /* reshape_safe is the trailing edge of a reshape. We know that
+ * before (or after) this address, all reshape has completed.
+ */
+ sector_t reshape_safe;
int previous_raid_disks;
struct list_head handle_list; /* stripes needing handling */
* Re: [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
2009-03-24 8:53 ` [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards NeilBrown
@ 2009-03-27 16:19 ` Andre Noll
2009-03-27 19:54 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Andre Noll @ 2009-03-27 16:19 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 19:53, NeilBrown wrote:
> - if (logical_sector >= conf->expand_progress) {
> + if (mddev->delta_disks < 0
> + ? logical_sector < conf->reshape_progress
> + : logical_sector >= conf->reshape_progress) {
> disks = conf->previous_raid_disks;
> previous = 1;
> } else {
> - if (logical_sector >= conf->expand_lo) {
> + if (mddev->delta_disks < 0
> + ? logical_sector < conf->reshape_safe
> + : logical_sector >= conf->reshape_safe) {
> spin_unlock_irq(&conf->device_lock);
> schedule();
> goto retry;
Is it only me who finds such code hard to comprehend? Given that
the patch adds checks of the form
(delta < 0 && s < r) || (delta >= 0 && s >= r)
at several locations, it might make sense to introduce a macro or an
inline function for this check.
> + /* reshape_progress is the leading edge of a 'reshape'
> + * It has value MaxSector when no expand is happening
s/expand/reshape
Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* Re: [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
2009-03-27 16:19 ` Andre Noll
@ 2009-03-27 19:54 ` NeilBrown
2009-03-30 9:09 ` Andre Noll
[not found] ` <49CE1713.9070707@tmr.com>
0 siblings, 2 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-27 19:54 UTC (permalink / raw)
To: Andre Noll; +Cc: linux-raid
On Sat, March 28, 2009 3:19 am, Andre Noll wrote:
> On 19:53, NeilBrown wrote:
>> - if (logical_sector >= conf->expand_progress) {
>> + if (mddev->delta_disks < 0
>> + ? logical_sector < conf->reshape_progress
>> + : logical_sector >= conf->reshape_progress) {
>> disks = conf->previous_raid_disks;
>> previous = 1;
>> } else {
>> - if (logical_sector >= conf->expand_lo) {
>> + if (mddev->delta_disks < 0
>> + ? logical_sector < conf->reshape_safe
>> + : logical_sector >= conf->reshape_safe) {
>> spin_unlock_irq(&conf->device_lock);
>> schedule();
>> goto retry;
>
> Is it only me who finds such code hard to comprehend? Given that
> the patch adds checks of the form
>
> (delta < 0 && s < r) || (delta >= 0 && s >= r)
They are really of the form
delta < 0 ? s < r : s >= r
I guess I could use something like
static inline int inorder(mddev_t *mddev, sector_t a, sector_t b)
{
	if (mddev->delta_disks < 0)
		return b > a;
	else
		return a <= b;
}
However sometimes it is '<' vs '>=' and sometimes '<' vs '>',
so I'm not sure it would apply universally.....
>
> at several locations, it might make sense to introduce a macro or an
> inline function for this check.
>
>> + /* reshape_progress is the leading edge of a 'reshape'
>> + * It has value MaxSector when no expand is happening
>
> s/expand/reshape
Thanks.
Fixed.
NeilBrown
>
> Regards
> Andre
> --
> The only person who always got his work done by Friday was Robinson Crusoe
>
* Re: [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
2009-03-27 19:54 ` NeilBrown
@ 2009-03-30 9:09 ` Andre Noll
[not found] ` <49CE1713.9070707@tmr.com>
1 sibling, 0 replies; 15+ messages in thread
From: Andre Noll @ 2009-03-30 9:09 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 06:54, NeilBrown wrote:
> > Is it only me who finds such code hard to comprehend? Given that
> > the patch adds checks of the form
> >
> > (delta < 0 && s < r) || (delta >= 0 && s >= r)
>
> They are really of the form
> delta < 0 ? s < r : s >= r
Yeah right. This underlines that it is easy to get wrong :)
> static inline int inorder(mddev_t *mddev, sector_t a, sector_t b)
> {
> 	if (mddev->delta_disks < 0)
> 		return b > a;
> 	else
> 		return a <= b;
> }
>
> However sometimes it is '<' vs '>=' and sometimes '<' vs '>',
> so I'm not sure it would apply universally.....
Return -1, 0, or 1, i.e. something like this:
static inline int inorder(int delta, sector_t a, sector_t b)
{
	int x;

	if (a < b)
		x = 1;
	else if (a > b)
		x = -1;
	else
		x = 0;
	if (delta < 0)
		x = -x;
	return x;
}
Of course, the callers would need to be adapted slightly.
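As a hypothetical follow-on (not code posted in this thread), one of the strictly-ordered comparisons in reshape_request() could then be written as:

	/* was: mddev->delta_disks < 0 ? writepos < safepos
	 *                             : writepos > safepos */
	if (inorder(mddev->delta_disks, writepos, safepos) < 0 ||
	    gap > (new_data_disks)*3000*2 /*3Meg*/) {
		/* cannot proceed until the superblock has been updated */
	}

The mixed '<' versus '>=' cases that Neil mentions would still need an explicit decision about how to treat equality.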
Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
[parent not found: <49CE1713.9070707@tmr.com>]
* Re: [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
[not found] ` <49CE1713.9070707@tmr.com>
@ 2009-03-30 9:20 ` Andre Noll
0 siblings, 0 replies; 15+ messages in thread
From: Andre Noll @ 2009-03-30 9:20 UTC (permalink / raw)
To: Bill Davidsen; +Cc: NeilBrown, linux-raid
On 08:24, Bill Davidsen wrote:
> Let me say that (a) I find the existing code very readable, but (b) I
> find the suggested rewrite even more so, using (?:) notation. If you
> really want to use a macro, see the attached and preprocess with "cc
> -E" to expand the macro. I personally find the macro version really
> hard to read.
Yes, the macro is not much easier to read either. It's not only about
readability btw. Doing the check at only one place makes it easier to
change the code later, should that ever be necessary. It also reduces
the probability of getting it wrong at one location without noticing.
Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* [md PATCH 1/6] md/raid5: drop qd_idx from r6_state
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
` (2 preceding siblings ...)
2009-03-24 8:53 ` [md PATCH 4/6] md/raid5: change reshape-progress measurement to cope with reshaping backwards NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-24 8:53 ` [md PATCH 6/6] Documentation/md.txt update NeilBrown
2009-03-24 8:53 ` [md PATCH 5/6] md: allow number of drives in raid5 to be reduced NeilBrown
5 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
We now have this value in stripe_head so we don't need to duplicate
it.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 17 ++++++++---------
drivers/md/raid5.h | 2 +-
2 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e643fee..3b71f01 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2371,7 +2371,7 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
struct r6_state *r6s, int disks)
{
int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
- int qd_idx = r6s->qd_idx;
+ int qd_idx = sh->qd_idx;
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
/* Would I have to read this buffer for reconstruct_write */
@@ -2561,7 +2561,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
int update_p = 0, update_q = 0;
struct r5dev *dev;
int pd_idx = sh->pd_idx;
- int qd_idx = r6s->qd_idx;
+ int qd_idx = sh->qd_idx;
set_bit(STRIPE_HANDLE, &sh->state);
@@ -2657,7 +2657,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
struct dma_async_tx_descriptor *tx = NULL;
clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
for (i = 0; i < sh->disks; i++)
- if (i != sh->pd_idx && (!r6s || i != r6s->qd_idx)) {
+ if (i != sh->pd_idx && i != sh->qd_idx) {
int dd_idx, j;
struct stripe_head *sh2;
@@ -2984,17 +2984,16 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
struct bio *return_bi = NULL;
- int i, pd_idx = sh->pd_idx;
+ int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
struct stripe_head_state s;
struct r6_state r6s;
struct r5dev *dev, *pdev, *qdev;
mdk_rdev_t *blocked_rdev = NULL;
- r6s.qd_idx = sh->qd_idx;
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
"pd_idx=%d, qd_idx=%d\n",
(unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), pd_idx, r6s.qd_idx);
+ atomic_read(&sh->count), pd_idx, qd_idx);
memset(&s, 0, sizeof(s));
spin_lock(&sh->lock);
@@ -3105,9 +3104,9 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
pdev = &sh->dev[pd_idx];
r6s.p_failed = (s.failed >= 1 && r6s.failed_num[0] == pd_idx)
|| (s.failed >= 2 && r6s.failed_num[1] == pd_idx);
- qdev = &sh->dev[r6s.qd_idx];
- r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == r6s.qd_idx)
- || (s.failed >= 2 && r6s.failed_num[1] == r6s.qd_idx);
+ qdev = &sh->dev[qd_idx];
+ r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == qd_idx)
+ || (s.failed >= 2 && r6s.failed_num[1] == qd_idx);
if ( s.written &&
( r6s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2934ee0..c2f37f2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -239,7 +239,7 @@ struct stripe_head_state {
/* r6_state - extra state data only relevant to r6 */
struct r6_state {
- int p_failed, q_failed, qd_idx, failed_num[2];
+ int p_failed, q_failed, failed_num[2];
};
/* Flags */
^ permalink raw reply related [flat|nested] 15+ messages in thread* [md PATCH 6/6] Documentation/md.txt update
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
` (3 preceding siblings ...)
2009-03-24 8:53 ` [md PATCH 1/6] md/raid5: drop qd_idx from r6_state NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-27 16:19 ` Andre Noll
2009-03-24 8:53 ` [md PATCH 5/6] md: allow number of drives in raid5 to be reduced NeilBrown
5 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
Update md.txt to reflect recent changes in a number of sysfs
attributes.
Signed-off-by: NeilBrown <neilb@suse.de>
---
Documentation/md.txt | 37 ++++++++++++++++++++++++++++++-------
1 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/Documentation/md.txt b/Documentation/md.txt
index 1da9d1b..12a8053 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -164,15 +164,19 @@ All md devices contain:
raid_disks
a text file with a simple number indicating the number of devices
in a fully functional array. If this is not yet known, the file
- will be empty. If an array is being resized (not currently
- possible) this will contain the larger of the old and new sizes.
- Some raid level (RAID1) allow this value to be set while the
- array is active. This will reconfigure the array. Otherwise
- it can only be set while assembling an array.
+ will be empty. If an array is being resized this will contain
+ the new number of devices.
+ Some raid levels allow this value to be set while the array is
+ active. This will reconfigure the array. Otherwise it can only
+ be set while assembling an array.
+ A change to this attribute will not be permitted if it would
+ reduce the size of the array. To reduce the number of drives
+ in an e.g. raid5, the array size musts first be reduced by
+ setting the 'array_size' attribute.
chunk_size
- This is the size if bytes for 'chunks' and is only relevant to
- raid levels that involve striping (1,4,5,6,10). The address space
+ This is the size in bytes for 'chunks' and is only relevant to
+ raid levels that involve striping (0,4,5,6,10). The address space
of the array is conceptually divided into chunks and consecutive
chunks are striped onto neighbouring devices.
The size should be at least PAGE_SIZE (4k) and should be a power
@@ -183,6 +187,20 @@ All md devices contain:
simply a number that is interpretted differently by different
levels. It can be written while assembling an array.
+ array_size
+ This can be used to artificially constrain the available space in
+ the array to be less than is actually available on the combined
+ devices. Writing a number (in Kilobytes) which is less than
+ the available size will set the size. Any reconfiguration of the
+ array (e.g. adding devices) will not cause the size to change.
+ Writing the word 'default' will cause the effective size of the
+ array to be whatever size is actually available based on
+ 'level', 'chunk_size' and 'component_size'.
+
+ This can be used to reduce the size of the array before reducing
+ the number of devices in a raid4/5/6, or to support external
+ metadata formats which mandate such clipping.
+
reshape_position
This is either "none" or a sector number within the devices of
the array where "reshape" is up to. If this is set, the three
@@ -207,6 +225,11 @@ All md devices contain:
about the array. It can be 0.90 (traditional format), 1.0, 1.1,
1.2 (newer format in varying locations) or "none" indicating that
the kernel isn't managing metadata at all.
+ Alternately it can be "external:" followed by a string which
+ is set by user-space. This indicates that metadata is managed
+ by a user-space program. Any device failure or other event that
+ requires a metadata update will cause array activity to be
+ suspended until the event is acknowledged.
resync_start
The point at which resync should start. If no resync is needed,
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [md PATCH 6/6] Documentation/md.txt update
2009-03-24 8:53 ` [md PATCH 6/6] Documentation/md.txt update NeilBrown
@ 2009-03-27 16:19 ` Andre Noll
2009-03-27 19:43 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Andre Noll @ 2009-03-27 16:19 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 19:53, NeilBrown wrote:
> + reduce the size of the array. To reduce the number of drives
> + in an e.g. raid5, the array size musts first be reduced by
s/musts/must
Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* Re: [md PATCH 6/6] Documentation/md.txt update
2009-03-27 16:19 ` Andre Noll
@ 2009-03-27 19:43 ` NeilBrown
0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-27 19:43 UTC (permalink / raw)
To: Andre Noll; +Cc: linux-raid
On Sat, March 28, 2009 3:19 am, Andre Noll wrote:
> On 19:53, NeilBrown wrote:
>> + reduce the size of the array. To reduce the number of drives
>> + in an e.g. raid5, the array size musts first be reduced by
>
> s/musts/must
Thanks. Fixed.
NeilBrown
>
> Regards
> Andre
> --
> The only person who always got his work done by Friday was Robinson Crusoe
>
* [md PATCH 5/6] md: allow number of drives in raid5 to be reduced
2009-03-24 8:53 [md PATCH 0/6] Reduce the number of devices in RAID4/5/6 NeilBrown
` (4 preceding siblings ...)
2009-03-24 8:53 ` [md PATCH 6/6] Documentation/md.txt update NeilBrown
@ 2009-03-24 8:53 ` NeilBrown
2009-03-27 16:19 ` Andre Noll
5 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2009-03-24 8:53 UTC (permalink / raw)
To: linux-raid
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end. So we largely ignore them and use the internal state variable
"reshape_progress" to keep track of what to do next.
Never allow the size to be reduced below the minimum (4 doe raid6,
3 otherwise).
We require that the size of the array has already been reduced before
the array is reshaped to a smaller size. This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.
When the reshape finishes, we remove any drives that are no longer
needed and fix up ->degraded.
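For reference, the guard added to raid5_start_reshape() (see the hunk below) is what enforces that ordering:

	if (raid5_size(mddev, 0, conf->raid_disks + mddev->delta_disks)
	    < mddev->array_sectors) {
		printk(KERN_ERR "md: %s: array size must be reduced "
		       "before number of disks\n", mdname(mddev));
		return -EINVAL;
	}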
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 114 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 87 insertions(+), 27 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8447c9f..aaeb176 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3723,6 +3723,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int i;
int dd_idx;
sector_t writepos, safepos, gap;
+ sector_t stripe_addr;
if (sector_nr == 0) {
/* If restarting in the middle, skip the initial sectors */
@@ -3779,10 +3780,21 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
wake_up(&conf->wait_for_overlap);
}
+ if (mddev->delta_disks < 0) {
+ BUG_ON(conf->reshape_progress == 0);
+ stripe_addr = writepos;
+ BUG_ON((mddev->dev_sectors &
+ ~((sector_t)mddev->chunk_size / 512 - 1))
+ - (conf->chunk_size / 512) - stripe_addr
+ != sector_nr);
+ } else {
+ BUG_ON(writepos != sector_nr + conf->chunk_size / 512);
+ stripe_addr = writepos;
+ }
for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
int j;
int skipped = 0;
- sh = get_active_stripe(conf, sector_nr+i, 0, 0);
+ sh = get_active_stripe(conf, stripe_addr+i, 0, 0);
set_bit(STRIPE_EXPANDING, &sh->state);
atomic_inc(&conf->reshape_stripes);
/* If any of this stripe is beyond the end of the old
@@ -3822,10 +3834,10 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
* block on the destination stripes.
*/
first_sector =
- raid5_compute_sector(conf, sector_nr*(new_data_disks),
+ raid5_compute_sector(conf, stripe_addr*(new_data_disks),
1, &dd_idx, NULL);
last_sector =
- raid5_compute_sector(conf, ((sector_nr+conf->chunk_size/512)
+ raid5_compute_sector(conf, ((stripe_addr+conf->chunk_size/512)
*(new_data_disks) - 1),
1, &dd_idx, NULL);
if (last_sector >= mddev->dev_sectors)
@@ -4640,6 +4652,10 @@ static int raid5_remove_disk(mddev_t *mddev, int number)
print_raid5_conf(conf);
rdev = p->rdev;
if (rdev) {
+ if (number >= conf->raid_disks &&
+ conf->reshape_progress == MaxSector)
+ clear_bit(In_sync, &rdev->flags);
+
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
err = -EBUSY;
@@ -4649,7 +4665,8 @@ static int raid5_remove_disk(mddev_t *mddev, int number)
* isn't possible.
*/
if (!test_bit(Faulty, &rdev->flags) &&
- mddev->degraded <= conf->max_degraded) {
+ mddev->degraded <= conf->max_degraded &&
+ number < conf->raid_disks) {
err = -EBUSY;
goto abort;
}
@@ -4738,14 +4755,25 @@ static int raid5_check_reshape(mddev_t *mddev)
raid5_conf_t *conf = mddev_to_conf(mddev);
int err;
- if (mddev->delta_disks < 0 ||
- mddev->new_level != mddev->level)
- return -EINVAL; /* Cannot shrink array or change level yet */
if (mddev->delta_disks == 0)
return 0; /* nothing to do */
if (mddev->bitmap)
/* Cannot grow a bitmap yet */
return -EBUSY;
+ if (mddev->degraded > conf->max_degraded)
+ return -EINVAL;
+ if (mddev->delta_disks < 0) {
+ /* We might be able to shrink, but the devices must
+ * be made bigger first.
+ * For raid6, 4 is the minimum size.
+ * Otherwise 2 is the minimum
+ */
+ int min = 2;
+ if (mddev->level == 6)
+ min = 4;
+ if (mddev->raid_disks + mddev->delta_disks < min)
+ return -EINVAL;
+ }
/* Can only proceed if there are plenty of stripe_heads.
* We need a minimum of one full stripe,, and for sensible progress
@@ -4762,12 +4790,13 @@ static int raid5_check_reshape(mddev_t *mddev)
return -ENOSPC;
}
- err = resize_stripes(conf, conf->raid_disks + mddev->delta_disks);
- if (err)
- return err;
+ if (mddev->delta_disks > 0) {
+ err = resize_stripes(conf, (conf->raid_disks +
+ mddev->delta_disks));
+ if (err)
+ return err;
+ }
- if (mddev->degraded > conf->max_degraded)
- return -EINVAL;
/* looks like we might be able to manage this */
return 0;
}
@@ -4794,6 +4823,17 @@ static int raid5_start_reshape(mddev_t *mddev)
*/
return -EINVAL;
+ /* Refuse to reduce size of the array. Any reductions in
+ * array size must be through explicit setting of array_size
+ * attribute.
+ */
+ if (raid5_size(mddev, 0, conf->raid_disks + mddev->delta_disks)
+ < mddev->array_sectors) {
+ printk(KERN_ERR "md: %s: array size must be reduced "
+ "before number of disks\n", mdname(mddev));
+ return -EINVAL;
+ }
+
atomic_set(&conf->reshape_stripes, 0);
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
@@ -4827,9 +4867,12 @@ static int raid5_start_reshape(mddev_t *mddev)
break;
}
- spin_lock_irqsave(&conf->device_lock, flags);
- mddev->degraded = (conf->raid_disks - conf->previous_raid_disks) - added_devices;
- spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (mddev->delta_disks > 0) {
+ spin_lock_irqsave(&conf->device_lock, flags);
+ mddev->degraded = (conf->raid_disks - conf->previous_raid_disks)
+ - added_devices;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ }
mddev->raid_disks = conf->raid_disks;
mddev->reshape_position = 0;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -4862,21 +4905,38 @@ static void raid5_finish_reshape(mddev_t *mddev)
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
conf->previous_raid_disks = conf->raid_disks;
- md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
- set_capacity(mddev->gendisk, mddev->array_sectors);
- mddev->changed = 1;
-
- bdev = bdget_disk(mddev->gendisk, 0);
- if (bdev) {
- mutex_lock(&bdev->bd_inode->i_mutex);
- i_size_write(bdev->bd_inode,
- (loff_t)mddev->array_sectors << 9);
- mutex_unlock(&bdev->bd_inode->i_mutex);
- bdput(bdev);
- }
spin_lock_irq(&conf->device_lock);
conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
+ if (mddev->delta_disks > 0) {
+ conf->previous_raid_disks = conf->raid_disks;
+ md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
+ set_capacity(mddev->gendisk, mddev->array_sectors);
+ mddev->changed = 1;
+ bdev = bdget_disk(mddev->gendisk, 0);
+ if (bdev) {
+ mutex_lock(&bdev->bd_inode->i_mutex);
+ i_size_write(bdev->bd_inode,
+ (loff_t)mddev->array_sectors << 9);
+ mutex_unlock(&bdev->bd_inode->i_mutex);
+ bdput(bdev);
+ }
+ } else {
+ int d;
+ mddev->degraded = conf->raid_disks;
+ for (d = 0; d < conf->raid_disks ; d++)
+ if (conf->disks[d].rdev &&
+ test_bit(In_sync,
+ &conf->disks[d].rdev->flags))
+ mddev->degraded--;
+ for (d = conf->raid_disks ;
+ d < conf->previous_raid_disks ;
+ d++)
+ raid5_remove_disk(mddev, d);
+
+ conf->previous_raid_disks = conf->raid_disks;
+ }
+ mddev->delta_disks = 0;
mddev->reshape_position = MaxSector;
/* read-ahead size must cover two whole stripes, which is
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [md PATCH 5/6] md: allow number of drives in raid5 to be reduced
2009-03-24 8:53 ` [md PATCH 5/6] md: allow number of drives in raid5 to be reduced NeilBrown
@ 2009-03-27 16:19 ` Andre Noll
2009-03-27 19:39 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Andre Noll @ 2009-03-27 16:19 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 19:53, NeilBrown wrote:
> Never allow the size to be reduced below the minimum (4 doe raid6,
> 3 otherwise).
doe?
> @@ -3723,6 +3723,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
> int i;
> int dd_idx;
> sector_t writepos, safepos, gap;
> + sector_t stripe_addr;
>
> if (sector_nr == 0) {
> /* If restarting in the middle, skip the initial sectors */
> @@ -3779,10 +3780,21 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
> wake_up(&conf->wait_for_overlap);
> }
>
> + if (mddev->delta_disks < 0) {
> + BUG_ON(conf->reshape_progress == 0);
> + stripe_addr = writepos;
> + BUG_ON((mddev->dev_sectors &
> + ~((sector_t)mddev->chunk_size / 512 - 1))
> + - (conf->chunk_size / 512) - stripe_addr
> + != sector_nr);
> + } else {
> + BUG_ON(writepos != sector_nr + conf->chunk_size / 512);
> + stripe_addr = writepos;
> + }
What's the point of the new stripe_addr variable? Isn't it equal to
writepos in any case?
> @@ -4738,14 +4755,25 @@ static int raid5_check_reshape(mddev_t *mddev)
> raid5_conf_t *conf = mddev_to_conf(mddev);
> int err;
>
> - if (mddev->delta_disks < 0 ||
> - mddev->new_level != mddev->level)
> - return -EINVAL; /* Cannot shrink array or change level yet */
> if (mddev->delta_disks == 0)
> return 0; /* nothing to do */
> if (mddev->bitmap)
> /* Cannot grow a bitmap yet */
> return -EBUSY;
> + if (mddev->degraded > conf->max_degraded)
> + return -EINVAL;
> + if (mddev->delta_disks < 0) {
> + /* We might be able to shrink, but the devices must
> + * be made bigger first.
> + * For raid6, 4 is the minimum size.
> + * Otherwise 2 is the minimum
> + */
> + int min = 2;
> + if (mddev->level == 6)
> + min = 4;
> + if (mddev->raid_disks + mddev->delta_disks < min)
> + return -EINVAL;
> + }
Hm, this code doesn't seem to do what the comment suggests. IMHO,
we must check that
(a) (raid_disks + delta_disks) * sizeof(smallest) is big enough
and
(b) raid_disks + delta_disks >= minimal admissible number of disks
The comment says the devices must be made bigger (to satisfy (a))
but the code only checks (b).
> @@ -4862,21 +4905,38 @@ static void raid5_finish_reshape(mddev_t *mddev)
> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>
> conf->previous_raid_disks = conf->raid_disks;
> - md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
> - set_capacity(mddev->gendisk, mddev->array_sectors);
> - mddev->changed = 1;
> -
> - bdev = bdget_disk(mddev->gendisk, 0);
> - if (bdev) {
> - mutex_lock(&bdev->bd_inode->i_mutex);
> - i_size_write(bdev->bd_inode,
> - (loff_t)mddev->array_sectors << 9);
> - mutex_unlock(&bdev->bd_inode->i_mutex);
> - bdput(bdev);
> - }
> spin_lock_irq(&conf->device_lock);
> conf->reshape_progress = MaxSector;
> spin_unlock_irq(&conf->device_lock);
> + if (mddev->delta_disks > 0) {
> + conf->previous_raid_disks = conf->raid_disks;
previous_raid_disks was already assigned earlier, so this can go away,
imo.
Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* Re: [md PATCH 5/6] md: allow number of drives in raid5 to be reduced
2009-03-27 16:19 ` Andre Noll
@ 2009-03-27 19:39 ` NeilBrown
0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-27 19:39 UTC (permalink / raw)
To: Andre Noll; +Cc: linux-raid
On Sat, March 28, 2009 3:19 am, Andre Noll wrote:
> On 19:53, NeilBrown wrote:
>
>> Never allow the size to be reduced below the minimum (4 doe raid6,
>> 3 otherwise).
>
> doe?
Left hand was one column too far left. d->f e->r.
Fixed.
>
>> @@ -3723,6 +3723,7 @@ static sector_t reshape_request(mddev_t *mddev,
>> sector_t sector_nr, int *skipped
>> int i;
>> int dd_idx;
>> sector_t writepos, safepos, gap;
>> + sector_t stripe_addr;
>>
>> if (sector_nr == 0) {
>> /* If restarting in the middle, skip the initial sectors */
>> @@ -3779,10 +3780,21 @@ static sector_t reshape_request(mddev_t *mddev,
>> sector_t sector_nr, int *skipped
>> wake_up(&conf->wait_for_overlap);
>> }
>>
>> + if (mddev->delta_disks < 0) {
>> + BUG_ON(conf->reshape_progress == 0);
>> + stripe_addr = writepos;
>> + BUG_ON((mddev->dev_sectors &
>> + ~((sector_t)mddev->chunk_size / 512 - 1))
>> + - (conf->chunk_size / 512) - stripe_addr
>> + != sector_nr);
>> + } else {
>> + BUG_ON(writepos != sector_nr + conf->chunk_size / 512);
>> + stripe_addr = writepos;
>> + }
>
> What's the point of the new stripe_addr variable? Isn't it equal to
> writepos in any case?
Yes, but it shouldn't be. In the second branch of the if, it should be
stripe_addr = sector_nr;
as I discovered yesterday while testing.
>
>> @@ -4738,14 +4755,25 @@ static int raid5_check_reshape(mddev_t *mddev)
>> raid5_conf_t *conf = mddev_to_conf(mddev);
>> int err;
>>
>> - if (mddev->delta_disks < 0 ||
>> - mddev->new_level != mddev->level)
>> - return -EINVAL; /* Cannot shrink array or change level yet */
>> if (mddev->delta_disks == 0)
>> return 0; /* nothing to do */
>> if (mddev->bitmap)
>> /* Cannot grow a bitmap yet */
>> return -EBUSY;
>> + if (mddev->degraded > conf->max_degraded)
>> + return -EINVAL;
>> + if (mddev->delta_disks < 0) {
>> + /* We might be able to shrink, but the devices must
>> + * be made bigger first.
>> + * For raid6, 4 is the minimum size.
>> + * Otherwise 2 is the minimum
>> + */
>> + int min = 2;
>> + if (mddev->level == 6)
>> + min = 4;
>> + if (mddev->raid_disks + mddev->delta_disks < min)
>> + return -EINVAL;
>> + }
>
> Hm, this code doesn't seem to do what the comment suggests. IMHO,
> we must check that
>
> (a) (raid_disks + delta_disks) * sizeof(smallest) is big enough
> and
> (b) raid_disks + delta_disks >= minimal admissible number of disks
>
> The comment says the devices must be made bigger (to satisfy (a))
> but the code only checks (b).
Yes, the comment could probably be improved.
The check for 'a' happens later when start_reshape is called.
>
>> @@ -4862,21 +4905,38 @@ static void raid5_finish_reshape(mddev_t *mddev)
>> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>>
>> conf->previous_raid_disks = conf->raid_disks;
>> - md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
>> - set_capacity(mddev->gendisk, mddev->array_sectors);
>> - mddev->changed = 1;
>> -
>> - bdev = bdget_disk(mddev->gendisk, 0);
>> - if (bdev) {
>> - mutex_lock(&bdev->bd_inode->i_mutex);
>> - i_size_write(bdev->bd_inode,
>> - (loff_t)mddev->array_sectors << 9);
>> - mutex_unlock(&bdev->bd_inode->i_mutex);
>> - bdput(bdev);
>> - }
>> spin_lock_irq(&conf->device_lock);
>> conf->reshape_progress = MaxSector;
>> spin_unlock_irq(&conf->device_lock);
>> + if (mddev->delta_disks > 0) {
>> + conf->previous_raid_disks = conf->raid_disks;
>
> previous_raid_disks was already assigned earlier, so this can go away,
> imo.
Thanks. It is actually the first one that needs to go.
The 'else' branch of that 'if' needs to have access to the
original value of previous_raid_disks.
Thanks a lot for the review.
NeilBrown