* [md PATCH 00/14] Final set of patches heading for 2.6.30
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid
Hi again,
following are some more patches that are heading for 2.6.30.
I'll probably send a pull request in a day or two.
There is some overlap between this set and the previous one, as I
had to significantly change
md: add explicit method to signal the end of a reshape.
once I found out that some things need to be done in the reshape
thread, and some need to be done under the mddev_lock in the
raid5d thread.
The main new functionality here is the ability to change chunk_size
and layout during reshape. This means that we can really change
a RAID5 into a RAID6, though it needs some mdadm support to do it
safely.
e.g. if you write a new number to .../md/chunk_size and then write
'reshape' to 'sync_action', it will restripe the whole array with
the new chunk size. If you get a system crash during this, you lose
your data (sorry). mdadm will not do this for you until it is able
to back up each group of a few stripes while they are being reshaped,
so that in the event of a crash you can restore from the backup and
not lose all your data.
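For the record, the sysfs sequence for the chunk-size case looks
roughly like the sketch below. This is illustrative only (the md0
device name and the 256K value are assumptions, and per the warning
above it is not crash-safe without mdadm managing a backup):

/* sketch: trigger a chunk-size restripe through sysfs */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	/* new chunk size in bytes, then kick off the restripe */
	if (write_attr("/sys/block/md0/md/chunk_size", "262144") ||
	    write_attr("/sys/block/md0/md/sync_action", "reshape")) {
		perror("sysfs write");
		return 1;
	}
	return 0;
}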
Any review comments most welcome.
NeilBrown
---
NeilBrown (14):
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
Documentation/md.txt | 37 +++-
drivers/md/Kconfig | 29 ---
drivers/md/md.c | 15 +
drivers/md/md.h | 2
drivers/md/raid5.c | 506 +++++++++++++++++++++++++++++++++++---------------
drivers/md/raid5.h | 21 ++
6 files changed, 411 insertions(+), 199 deletions(-)
--
Signature
* [md PATCH 01/14] md: add explicit method to signal the end of a reshape.
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished by sync_request being
given a large value, and handles any cleanup then.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop an array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 11 +++--------
drivers/md/md.h | 2 +-
drivers/md/raid5.c | 50 ++++++++++++++++++++++++++++++--------------------
3 files changed, 34 insertions(+), 29 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 923d125..c509313 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5073,14 +5073,6 @@ void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors)
}
EXPORT_SYMBOL(md_set_array_sectors);
-void md_set_array_sectors_lock(mddev_t *mddev, sector_t array_sectors)
-{
- mddev_lock(mddev);
- md_set_array_sectors(mddev, array_sectors);
- mddev_unlock(mddev);
-}
-EXPORT_SYMBOL(md_set_array_sectors_lock);
-
static int update_size(mddev_t *mddev, sector_t num_sectors)
{
mdk_rdev_t *rdev;
@@ -6641,6 +6633,9 @@ void md_check_recovery(mddev_t *mddev)
sysfs_notify(&mddev->kobj, NULL,
"degraded");
}
+ if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+ mddev->pers->finish_reshape)
+ mddev->pers->finish_reshape(mddev);
md_update_sb(mddev, 1);
/* if array is no-longer degraded, then any saved_raid_disk
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d13e34f..e9b7f54 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -317,6 +317,7 @@ struct mdk_personality
sector_t (*size) (mddev_t *mddev, sector_t sectors, int raid_disks);
int (*check_reshape) (mddev_t *mddev);
int (*start_reshape) (mddev_t *mddev);
+ void (*finish_reshape) (mddev_t *mddev);
int (*reconfig) (mddev_t *mddev, int layout, int chunk_size);
/* quiesce moves between quiescence states
* 0 - fully active
@@ -433,4 +434,3 @@ extern void md_new_event(mddev_t *mddev);
extern int md_allow_write(mddev_t *mddev);
extern void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
extern void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors);
-extern void md_set_array_sectors_lock(mddev_t *mddev, sector_t array_sectors);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5694eb8..a0f22dd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3850,6 +3850,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
if (sector_nr >= max_sector) {
/* just being told to finish up .. nothing much to do */
unplug_slaves(mddev);
+
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
end_reshape(conf);
return 0;
@@ -4836,43 +4837,49 @@ static int raid5_start_reshape(mddev_t *mddev)
static void end_reshape(raid5_conf_t *conf)
{
- struct block_device *bdev;
if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
- mddev_t *mddev = conf->mddev;
-
- md_set_array_sectors_lock(mddev, raid5_size(mddev, 0,
- conf->raid_disks));
- set_capacity(mddev->gendisk, mddev->array_sectors);
- mddev->changed = 1;
- conf->previous_raid_disks = conf->raid_disks;
- bdev = bdget_disk(conf->mddev->gendisk, 0);
- if (bdev) {
- mutex_lock(&bdev->bd_inode->i_mutex);
- i_size_write(bdev->bd_inode,
- (loff_t)conf->mddev->array_sectors << 9);
- mutex_unlock(&bdev->bd_inode->i_mutex);
- bdput(bdev);
- }
spin_lock_irq(&conf->device_lock);
+ conf->previous_raid_disks = conf->raid_disks;
conf->expand_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
- conf->mddev->reshape_position = MaxSector;
/* read-ahead size must cover two whole stripes, which is
* 2 * (datadisks) * chunksize where 'n' is the number of raid devices
*/
{
- int data_disks = conf->previous_raid_disks - conf->max_degraded;
- int stripe = data_disks *
- (conf->mddev->chunk_size / PAGE_SIZE);
+ int data_disks = conf->raid_disks - conf->max_degraded;
+ int stripe = data_disks * (conf->chunk_size
+ / PAGE_SIZE);
if (conf->mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
conf->mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
}
}
}
+static void raid5_finish_reshape(mddev_t *mddev)
+{
+ struct block_device *bdev;
+
+ if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+
+ md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
+ set_capacity(mddev->gendisk, mddev->array_sectors);
+ mddev->changed = 1;
+ mddev->reshape_position = MaxSector;
+
+ bdev = bdget_disk(mddev->gendisk, 0);
+ if (bdev) {
+ mutex_lock(&bdev->bd_inode->i_mutex);
+ i_size_write(bdev->bd_inode,
+ (loff_t)mddev->array_sectors << 9);
+ mutex_unlock(&bdev->bd_inode->i_mutex);
+ bdput(bdev);
+ }
+ }
+}
+
static void raid5_quiesce(mddev_t *mddev, int state)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -5098,6 +5105,7 @@ static struct mdk_personality raid6_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
@@ -5121,6 +5129,7 @@ static struct mdk_personality raid5_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
@@ -5146,6 +5155,7 @@ static struct mdk_personality raid4_personality =
#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
+ .finish_reshape = raid5_finish_reshape,
#endif
.quiesce = raid5_quiesce,
};
* [md PATCH 02/14] md/raid5: change reshape-progress measurement to cope with reshaping backwards.
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning. So we need to handle expand_progress and expand_lo
differently.
This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand -> reshape), and
at every place they are used we make sure they are handled correctly
depending on whether delta_disks is positive or negative.
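The direction-dependent test that this adds inline in several places
can be condensed as below. This is just an illustrative userspace
sketch, not code from the patch, and the helper name is made up:

/* A sector is still laid out in the OLD geometry if the reshape has
 * not reached it yet: when growing (delta_disks > 0) the reshape
 * moves upwards from 0, when shrinking (delta_disks < 0) it moves
 * downwards from the end of the array.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

static bool uses_old_layout(int delta_disks, sector_t sector,
			    sector_t reshape_progress)
{
	return delta_disks < 0 ? sector < reshape_progress
			       : sector >= reshape_progress;
}

int main(void)
{
	/* growing, progress at 1024: sector 2048 still uses the old map */
	return uses_old_layout(1, 2048, 1024) ? 0 : 1;
}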
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 94 ++++++++++++++++++++++++++++++++++------------------
drivers/md/raid5.h | 15 ++++++--
2 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a0f22dd..1023c4e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3593,24 +3593,28 @@ static int make_request(struct request_queue *q, struct bio * bi)
retry:
previous = 0;
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- if (likely(conf->expand_progress == MaxSector))
+ if (likely(conf->reshape_progress == MaxSector))
disks = conf->raid_disks;
else {
- /* spinlock is needed as expand_progress may be
+ /* spinlock is needed as reshape_progress may be
* 64bit on a 32bit platform, and so it might be
* possible to see a half-updated value
- * Ofcourse expand_progress could change after
+ * Ofcourse reshape_progress could change after
* the lock is dropped, so once we get a reference
* to the stripe that we think it is, we will have
* to check again.
*/
spin_lock_irq(&conf->device_lock);
disks = conf->raid_disks;
- if (logical_sector >= conf->expand_progress) {
+ if (mddev->delta_disks < 0
+ ? logical_sector < conf->reshape_progress
+ : logical_sector >= conf->reshape_progress) {
disks = conf->previous_raid_disks;
previous = 1;
} else {
- if (logical_sector >= conf->expand_lo) {
+ if (mddev->delta_disks < 0
+ ? logical_sector < conf->reshape_safe
+ : logical_sector >= conf->reshape_safe) {
spin_unlock_irq(&conf->device_lock);
schedule();
goto retry;
@@ -3630,7 +3634,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
sh = get_active_stripe(conf, new_sector, previous,
(bi->bi_rw&RWA_MASK));
if (sh) {
- if (unlikely(conf->expand_progress != MaxSector)) {
+ if (unlikely(conf->reshape_progress != MaxSector)) {
/* expansion might have moved on while waiting for a
* stripe, so we must do the range check again.
* Expansion could still move past after this
@@ -3641,8 +3645,10 @@ static int make_request(struct request_queue *q, struct bio * bi)
*/
int must_retry = 0;
spin_lock_irq(&conf->device_lock);
- if (logical_sector < conf->expand_progress &&
- disks == conf->previous_raid_disks)
+ if ((mddev->delta_disks < 0
+ ? logical_sector >= conf->reshape_progress
+ : logical_sector < conf->reshape_progress)
+ && disks == conf->previous_raid_disks)
/* mismatch, need to try again */
must_retry = 1;
spin_unlock_irq(&conf->device_lock);
@@ -3720,13 +3726,20 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int dd_idx;
sector_t writepos, safepos, gap;
- if (sector_nr == 0 &&
- conf->expand_progress != 0) {
- /* restarting in the middle, skip the initial sectors */
- sector_nr = conf->expand_progress;
+ if (sector_nr == 0) {
+ /* If restarting in the middle, skip the initial sectors */
+ if (mddev->delta_disks < 0 &&
+ conf->reshape_progress < raid5_size(mddev, 0, 0)) {
+ sector_nr = raid5_size(mddev, 0, 0)
+ - conf->reshape_progress;
+ } else if (mddev->delta_disks > 0 &&
+ conf->reshape_progress > 0)
+ sector_nr = conf->reshape_progress;
sector_div(sector_nr, new_data_disks);
- *skipped = 1;
- return sector_nr;
+ if (sector_nr) {
+ *skipped = 1;
+ return sector_nr;
+ }
}
/* we update the metadata when there is more than 3Meg
@@ -3734,28 +3747,37 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
* probably be time based) or when the data about to be
* copied would over-write the source of the data at
* the front of the range.
- * i.e. one new_stripe forward from expand_progress new_maps
- * to after where expand_lo old_maps to
+ * i.e. one new_stripe along from reshape_progress new_maps
+ * to after where reshape_safe old_maps to
*/
- writepos = conf->expand_progress +
- conf->chunk_size/512*(new_data_disks);
+ writepos = conf->reshape_progress;
sector_div(writepos, new_data_disks);
- safepos = conf->expand_lo;
+ safepos = conf->reshape_safe;
sector_div(safepos, data_disks);
- gap = conf->expand_progress - conf->expand_lo;
+ if (mddev->delta_disks < 0) {
+ writepos -= conf->chunk_size/512;
+ safepos += conf->chunk_size/512;
+ gap = conf->reshape_safe - conf->reshape_progress;
+ } else {
+ writepos += conf->chunk_size/512;
+ safepos -= conf->chunk_size/512;
+ gap = conf->reshape_progress - conf->reshape_safe;
+ }
- if (writepos >= safepos ||
+ if ((mddev->delta_disks < 0
+ ? writepos < safepos
+ : writepos > safepos) ||
gap > (new_data_disks)*3000*2 /*3Meg*/) {
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes)==0);
- mddev->reshape_position = conf->expand_progress;
+ mddev->reshape_position = conf->reshape_progress;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait, mddev->flags == 0 ||
kthread_should_stop());
spin_lock_irq(&conf->device_lock);
- conf->expand_lo = mddev->reshape_position;
+ conf->reshape_safe = mddev->reshape_position;
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
}
@@ -3792,7 +3814,10 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
release_stripe(sh);
}
spin_lock_irq(&conf->device_lock);
- conf->expand_progress = (sector_nr + i) * new_data_disks;
+ if (mddev->delta_disks < 0)
+ conf->reshape_progress -= i * new_data_disks;
+ else
+ conf->reshape_progress += i * new_data_disks;
spin_unlock_irq(&conf->device_lock);
/* Ok, those stripe are ready. We can start scheduling
* reads on the source stripes.
@@ -3823,14 +3848,14 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes) == 0);
- mddev->reshape_position = conf->expand_progress;
+ mddev->reshape_position = conf->reshape_progress;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_DEVS, &mddev->flags)
|| kthread_should_stop());
spin_lock_irq(&conf->device_lock);
- conf->expand_lo = mddev->reshape_position;
+ conf->reshape_safe = mddev->reshape_position;
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
}
@@ -4283,7 +4308,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
conf->max_degraded = 1;
conf->algorithm = mddev->new_layout;
conf->max_nr_stripes = NR_STRIPES;
- conf->expand_progress = mddev->reshape_position;
+ conf->reshape_progress = mddev->reshape_position;
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
@@ -4441,9 +4466,9 @@ static int run(mddev_t *mddev)
print_raid5_conf(conf);
- if (conf->expand_progress != MaxSector) {
+ if (conf->reshape_progress != MaxSector) {
printk("...ok start reshape thread\n");
- conf->expand_lo = conf->expand_progress;
+ conf->reshape_safe = conf->reshape_progress;
atomic_set(&conf->reshape_stripes, 0);
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -4782,8 +4807,11 @@ static int raid5_start_reshape(mddev_t *mddev)
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
conf->raid_disks += mddev->delta_disks;
- conf->expand_progress = 0;
- conf->expand_lo = 0;
+ if (mddev->delta_disks < 0)
+ conf->reshape_progress = raid5_size(mddev, 0, 0);
+ else
+ conf->reshape_progress = 0;
+ conf->reshape_safe = conf->reshape_progress;
spin_unlock_irq(&conf->device_lock);
/* Add some new drives, as many as will fit.
@@ -4825,7 +4853,7 @@ static int raid5_start_reshape(mddev_t *mddev)
mddev->recovery = 0;
spin_lock_irq(&conf->device_lock);
mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
- conf->expand_progress = MaxSector;
+ conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
return -EAGAIN;
}
@@ -4842,7 +4870,7 @@ static void end_reshape(raid5_conf_t *conf)
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
- conf->expand_progress = MaxSector;
+ conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
/* read-ahead size must cover two whole stripes, which is
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c2f37f2..b2edcc4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -337,11 +337,16 @@ struct raid5_private_data {
int raid_disks;
int max_nr_stripes;
- /* used during an expand */
- sector_t expand_progress; /* MaxSector when no expand happening */
- sector_t expand_lo; /* from here up to expand_progress it out-of-bounds
- * as we haven't flushed the metadata yet
- */
+ /* reshape_progress is the leading edge of a 'reshape'
+ * It has value MaxSector when no reshape is happening
+ * If delta_disks < 0, it is the last sector we started work on,
+ * else is it the next sector to work on.
+ */
+ sector_t reshape_progress;
+ /* reshape_safe is the trailing edge of a reshape. We know that
+ * before (or after) this address, all reshape has completed.
+ */
+ sector_t reshape_safe;
int previous_raid_disks;
struct list_head handle_list; /* stripes needing handling */
* [md PATCH 03/14] md: allow number of drives in raid5 to be reduced
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end, so we largely ignore them and use the internal state
variable "reshape_progress" to keep track of what to do next.
Never allow the number of devices to be reduced below the minimum
(4 for raid6, 3 otherwise).
We require that the size of the array has already been reduced before
the array is reshaped to a smaller size. This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.
When the reshape finishes, we remove any drives that are no longer
needed and fix up ->degraded.
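From user-space that makes shrinking a two-step sysfs operation; a
rough sketch follows (the device name and numbers are made up, and
mdadm would normally drive this, including the backup handling):

/* sketch: clamp array_size first (reversible), then reduce
 * raid_disks and start the (destructive) backwards reshape.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	if (write_attr("/sys/block/md0/md/array_size", "976762368") ||
	    write_attr("/sys/block/md0/md/raid_disks", "4") ||
	    write_attr("/sys/block/md0/md/sync_action", "reshape")) {
		perror("sysfs write");
		return 1;
	}
	return 0;
}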
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 124 ++++++++++++++++++++++++++++++++++++----------------
1 files changed, 87 insertions(+), 37 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1023c4e..76eed59 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3725,6 +3725,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int i;
int dd_idx;
sector_t writepos, safepos, gap;
+ sector_t stripe_addr;
if (sector_nr == 0) {
/* If restarting in the middle, skip the initial sectors */
@@ -3782,10 +3783,21 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
wake_up(&conf->wait_for_overlap);
}
+ if (mddev->delta_disks < 0) {
+ BUG_ON(conf->reshape_progress == 0);
+ stripe_addr = writepos;
+ BUG_ON((mddev->dev_sectors &
+ ~((sector_t)mddev->chunk_size / 512 - 1))
+ - (conf->chunk_size / 512) - stripe_addr
+ != sector_nr);
+ } else {
+ BUG_ON(writepos != sector_nr + conf->chunk_size / 512);
+ stripe_addr = sector_nr;
+ }
for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
int j;
int skipped = 0;
- sh = get_active_stripe(conf, sector_nr+i, 0, 0);
+ sh = get_active_stripe(conf, stripe_addr+i, 0, 0);
set_bit(STRIPE_EXPANDING, &sh->state);
atomic_inc(&conf->reshape_stripes);
/* If any of this stripe is beyond the end of the old
@@ -3825,10 +3837,10 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
* block on the destination stripes.
*/
first_sector =
- raid5_compute_sector(conf, sector_nr*(new_data_disks),
+ raid5_compute_sector(conf, stripe_addr*(new_data_disks),
1, &dd_idx, NULL);
last_sector =
- raid5_compute_sector(conf, ((sector_nr+conf->chunk_size/512)
+ raid5_compute_sector(conf, ((stripe_addr+conf->chunk_size/512)
*(new_data_disks) - 1),
1, &dd_idx, NULL);
if (last_sector >= mddev->dev_sectors)
@@ -4366,12 +4378,6 @@ static int run(mddev_t *mddev)
mdname(mddev));
return -EINVAL;
}
- if (mddev->delta_disks <= 0) {
- printk(KERN_ERR "raid5: %s: unsupported reshape "
- "(reduce disks) required - aborting.\n",
- mdname(mddev));
- return -EINVAL;
- }
old_disks = mddev->raid_disks - mddev->delta_disks;
/* reshape_position must be on a new-stripe boundary, and one
* further up in new geometry must map after here in old
@@ -4648,6 +4654,10 @@ static int raid5_remove_disk(mddev_t *mddev, int number)
print_raid5_conf(conf);
rdev = p->rdev;
if (rdev) {
+ if (number >= conf->raid_disks &&
+ conf->reshape_progress == MaxSector)
+ clear_bit(In_sync, &rdev->flags);
+
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
err = -EBUSY;
@@ -4657,7 +4667,8 @@ static int raid5_remove_disk(mddev_t *mddev, int number)
* isn't possible.
*/
if (!test_bit(Faulty, &rdev->flags) &&
- mddev->degraded <= conf->max_degraded) {
+ mddev->degraded <= conf->max_degraded &&
+ number < conf->raid_disks) {
err = -EBUSY;
goto abort;
}
@@ -4745,16 +4756,26 @@ static int raid5_resize(mddev_t *mddev, sector_t sectors)
static int raid5_check_reshape(mddev_t *mddev)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
- int err;
- if (mddev->delta_disks < 0 ||
- mddev->new_level != mddev->level)
- return -EINVAL; /* Cannot shrink array or change level yet */
if (mddev->delta_disks == 0)
return 0; /* nothing to do */
if (mddev->bitmap)
/* Cannot grow a bitmap yet */
return -EBUSY;
+ if (mddev->degraded > conf->max_degraded)
+ return -EINVAL;
+ if (mddev->delta_disks < 0) {
+ /* We might be able to shrink, but the devices must
+ * be made bigger first.
+ * For raid6, 4 is the minimum size.
+ * Otherwise 2 is the minimum
+ */
+ int min = 2;
+ if (mddev->level == 6)
+ min = 4;
+ if (mddev->raid_disks + mddev->delta_disks < min)
+ return -EINVAL;
+ }
/* Can only proceed if there are plenty of stripe_heads.
* We need a minimum of one full stripe,, and for sensible progress
@@ -4771,14 +4792,7 @@ static int raid5_check_reshape(mddev_t *mddev)
return -ENOSPC;
}
- err = resize_stripes(conf, conf->raid_disks + mddev->delta_disks);
- if (err)
- return err;
-
- if (mddev->degraded > conf->max_degraded)
- return -EINVAL;
- /* looks like we might be able to manage this */
- return 0;
+ return resize_stripes(conf, conf->raid_disks + mddev->delta_disks);
}
static int raid5_start_reshape(mddev_t *mddev)
@@ -4803,6 +4817,17 @@ static int raid5_start_reshape(mddev_t *mddev)
*/
return -EINVAL;
+ /* Refuse to reduce size of the array. Any reductions in
+ * array size must be through explicit setting of array_size
+ * attribute.
+ */
+ if (raid5_size(mddev, 0, conf->raid_disks + mddev->delta_disks)
+ < mddev->array_sectors) {
+ printk(KERN_ERR "md: %s: array size must be reduced "
+ "before number of disks\n", mdname(mddev));
+ return -EINVAL;
+ }
+
atomic_set(&conf->reshape_stripes, 0);
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
@@ -4836,9 +4861,12 @@ static int raid5_start_reshape(mddev_t *mddev)
break;
}
- spin_lock_irqsave(&conf->device_lock, flags);
- mddev->degraded = (conf->raid_disks - conf->previous_raid_disks) - added_devices;
- spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (mddev->delta_disks > 0) {
+ spin_lock_irqsave(&conf->device_lock, flags);
+ mddev->degraded = (conf->raid_disks - conf->previous_raid_disks)
+ - added_devices;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ }
mddev->raid_disks = conf->raid_disks;
mddev->reshape_position = 0;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -4863,6 +4891,9 @@ static int raid5_start_reshape(mddev_t *mddev)
}
#endif
+/* This is called from the reshape thread and should make any
+ * changes needed in 'conf'
+ */
static void end_reshape(raid5_conf_t *conf)
{
@@ -4886,25 +4917,44 @@ static void end_reshape(raid5_conf_t *conf)
}
}
+/* This is called from the raid5d thread with mddev_lock held.
+ * It makes config changes to the device.
+ */
static void raid5_finish_reshape(mddev_t *mddev)
{
struct block_device *bdev;
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
- md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
- set_capacity(mddev->gendisk, mddev->array_sectors);
- mddev->changed = 1;
- mddev->reshape_position = MaxSector;
-
- bdev = bdget_disk(mddev->gendisk, 0);
- if (bdev) {
- mutex_lock(&bdev->bd_inode->i_mutex);
- i_size_write(bdev->bd_inode,
- (loff_t)mddev->array_sectors << 9);
- mutex_unlock(&bdev->bd_inode->i_mutex);
- bdput(bdev);
+ if (mddev->delta_disks > 0) {
+ md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
+ set_capacity(mddev->gendisk, mddev->array_sectors);
+ mddev->changed = 1;
+
+ bdev = bdget_disk(mddev->gendisk, 0);
+ if (bdev) {
+ mutex_lock(&bdev->bd_inode->i_mutex);
+ i_size_write(bdev->bd_inode,
+ (loff_t)mddev->array_sectors << 9);
+ mutex_unlock(&bdev->bd_inode->i_mutex);
+ bdput(bdev);
+ }
+ } else {
+ int d;
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ mddev->degraded = conf->raid_disks;
+ for (d = 0; d < conf->raid_disks ; d++)
+ if (conf->disks[d].rdev &&
+ test_bit(In_sync,
+ &conf->disks[d].rdev->flags))
+ mddev->degraded--;
+ for (d = conf->raid_disks ;
+ d < conf->raid_disks - mddev->delta_disks;
+ d++)
+ raid5_remove_disk(mddev, d);
}
+ mddev->reshape_position = MaxSector;
+ mddev->delta_disks = 0;
}
}
* [md PATCH 04/14] Documentation/md.txt update
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Update md.txt to reflect recent changes in a number of sysfs
attributes.
Signed-off-by: NeilBrown <neilb@suse.de>
---
Documentation/md.txt | 37 ++++++++++++++++++++++++++++++-------
1 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/Documentation/md.txt b/Documentation/md.txt
index 1da9d1b..4edd39e 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -164,15 +164,19 @@ All md devices contain:
raid_disks
a text file with a simple number indicating the number of devices
in a fully functional array. If this is not yet known, the file
- will be empty. If an array is being resized (not currently
- possible) this will contain the larger of the old and new sizes.
- Some raid level (RAID1) allow this value to be set while the
- array is active. This will reconfigure the array. Otherwise
- it can only be set while assembling an array.
+ will be empty. If an array is being resized this will contain
+ the new number of devices.
+ Some raid levels allow this value to be set while the array is
+ active. This will reconfigure the array. Otherwise it can only
+ be set while assembling an array.
+ A change to this attribute will not be permitted if it would
+ reduce the size of the array. To reduce the number of drives
+ in an e.g. raid5, the array size must first be reduced by
+ setting the 'array_size' attribute.
chunk_size
- This is the size if bytes for 'chunks' and is only relevant to
- raid levels that involve striping (1,4,5,6,10). The address space
+ This is the size in bytes for 'chunks' and is only relevant to
+ raid levels that involve striping (0,4,5,6,10). The address space
of the array is conceptually divided into chunks and consecutive
chunks are striped onto neighbouring devices.
The size should be at least PAGE_SIZE (4k) and should be a power
@@ -183,6 +187,20 @@ All md devices contain:
simply a number that is interpretted differently by different
levels. It can be written while assembling an array.
+ array_size
+ This can be used to artificially constrain the available space in
+ the array to be less than is actually available on the combined
+ devices. Writing a number (in Kilobytes) which is less than
+ the available size will set the size. Any reconfiguration of the
+ array (e.g. adding devices) will not cause the size to change.
+ Writing the word 'default' will cause the effective size of the
+ array to be whatever size is actually available based on
+ 'level', 'chunk_size' and 'component_size'.
+
+ This can be used to reduce the size of the array before reducing
+ the number of devices in a raid4/5/6, or to support external
+ metadata formats which mandate such clipping.
+
reshape_position
This is either "none" or a sector number within the devices of
the array where "reshape" is up to. If this is set, the three
@@ -207,6 +225,11 @@ All md devices contain:
about the array. It can be 0.90 (traditional format), 1.0, 1.1,
1.2 (newer format in varying locations) or "none" indicating that
the kernel isn't managing metadata at all.
+ Alternately it can be "external:" followed by a string which
+ is set by user-space. This indicates that metadata is managed
+ by a user-space program. Any device failure or other event that
+ requires a metadata update will cause array activity to be
+ suspended until the event is acknowledged.
resync_start
The point at which resync should start. If no resync is needed,
* [md PATCH 05/14] md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'. They are currently differentiated by having different
->disks values, as the only reshape currently supported involves
changing the number of disks.
However we will soon support reshapes that do not change the number
of disks (layout or chunk size). So make the difference more
explicit with a 'generation' number.
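In effect the stripe-cache key changes from (sector, disks) to
(sector, generation). A simplified userspace model of that lookup
(types and names here are illustrative, not the kernel's):

#include <stddef.h>
#include <stdint.h>

struct stripe {
	uint64_t sector;
	short generation;	/* copy of conf->generation at init time */
	struct stripe *next;	/* hash-chain link */
};

/* Find a cached stripe for this sector that belongs to the wanted
 * reshape generation; old- and new-geometry stripes for the same
 * sector no longer collide even when the disk count is unchanged.
 */
static struct stripe *find_stripe(struct stripe *chain,
				  uint64_t sector, short generation)
{
	for (; chain; chain = chain->next)
		if (chain->sector == sector &&
		    chain->generation == generation)
			return chain;
	return NULL;
}

int main(void)
{
	struct stripe old = { .sector = 128, .generation = 1, .next = NULL };
	struct stripe new = { .sector = 128, .generation = 2, .next = &old };

	/* looking up generation 1 skips the generation-2 stripe */
	return find_stripe(&new, 128, 1) == &old ? 0 : 1;
}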
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 12 +++++++-----
drivers/md/raid5.h | 3 +++
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 76eed59..73cdf43 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -318,6 +318,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
remove_hash(sh);
+ sh->generation = conf->generation - previous;
sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
sh->sector = sector;
stripe_set_idx(sector, conf, previous, sh);
@@ -341,7 +342,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
insert_hash(conf, sh);
}
-static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, int disks)
+static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector,
+ short generation)
{
struct stripe_head *sh;
struct hlist_node *hn;
@@ -349,7 +351,7 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
CHECK_DEVLOCK();
pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
- if (sh->sector == sector && sh->disks == disks)
+ if (sh->sector == sector && sh->generation == generation)
return sh;
pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
@@ -363,7 +365,6 @@ get_active_stripe(raid5_conf_t *conf, sector_t sector,
int previous, int noblock)
{
struct stripe_head *sh;
- int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
@@ -373,7 +374,7 @@ get_active_stripe(raid5_conf_t *conf, sector_t sector,
wait_event_lock_irq(conf->wait_for_stripe,
conf->quiesce == 0,
conf->device_lock, /* nothing */);
- sh = __find_stripe(conf, sector, disks);
+ sh = __find_stripe(conf, sector, conf->generation - previous);
if (!sh) {
if (!conf->inactive_blocked)
sh = get_free_stripe(conf);
@@ -3648,7 +3649,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
if ((mddev->delta_disks < 0
? logical_sector >= conf->reshape_progress
: logical_sector < conf->reshape_progress)
- && disks == conf->previous_raid_disks)
+ && previous)
/* mismatch, need to try again */
must_retry = 1;
spin_unlock_irq(&conf->device_lock);
@@ -4837,6 +4838,7 @@ static int raid5_start_reshape(mddev_t *mddev)
else
conf->reshape_progress = 0;
conf->reshape_safe = conf->reshape_progress;
+ conf->generation++;
spin_unlock_irq(&conf->device_lock);
/* Add some new drives, as many as will fit.
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b2edcc4..a081fb4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -198,6 +198,8 @@ struct stripe_head {
struct hlist_node hash;
struct list_head lru; /* inactive_list or handle_list */
struct raid5_private_data *raid_conf;
+ short generation; /* increments with every
+ * reshape */
sector_t sector; /* sector of this row */
short pd_idx; /* parity disk index */
short qd_idx; /* 'Q' disk index for raid6 */
@@ -348,6 +350,7 @@ struct raid5_private_data {
*/
sector_t reshape_safe;
int previous_raid_disks;
+ short generation; /* increments with every reshape */
struct list_head handle_list; /* stripes needing handling */
struct list_head hold_list; /* preread ready stripes */
* [md PATCH 06/14] md/raid5: prepare for allowing reshape to change chunksize.
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.
This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.
Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position. In contrast, the new_chunk value
is set when the sysfs file is written, which could be well before the
reshape starts.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 43 +++++++++++++++++++++++++++----------------
drivers/md/raid5.h | 1 +
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 73cdf43..7638cc3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -299,7 +299,7 @@ static int grow_buffers(struct stripe_head *sh, int num)
return 0;
}
-static void raid5_build_block(struct stripe_head *sh, int i);
+static void raid5_build_block(struct stripe_head *sh, int i, int previous);
static void stripe_set_idx(sector_t stripe, raid5_conf_t *conf, int previous,
struct stripe_head *sh);
@@ -337,7 +337,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
BUG();
}
dev->flags = 0;
- raid5_build_block(sh, i);
+ raid5_build_block(sh, i, previous);
}
insert_hash(conf, sh);
}
@@ -1212,9 +1212,9 @@ static void raid5_end_write_request(struct bio *bi, int error)
}
-static sector_t compute_blocknr(struct stripe_head *sh, int i);
+static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous);
-static void raid5_build_block(struct stripe_head *sh, int i)
+static void raid5_build_block(struct stripe_head *sh, int i, int previous)
{
struct r5dev *dev = &sh->dev[i];
@@ -1230,7 +1230,7 @@ static void raid5_build_block(struct stripe_head *sh, int i)
dev->req.bi_private = sh;
dev->flags = 0;
- dev->sector = compute_blocknr(sh, i);
+ dev->sector = compute_blocknr(sh, i, previous);
}
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
@@ -1273,7 +1273,8 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
int pd_idx, qd_idx;
int ddf_layout = 0;
sector_t new_sector;
- int sectors_per_chunk = conf->chunk_size >> 9;
+ int sectors_per_chunk = previous ? (conf->prev_chunk >> 9)
+ : (conf->chunk_size >> 9);
int raid_disks = previous ? conf->previous_raid_disks
: conf->raid_disks;
int data_disks = raid_disks - conf->max_degraded;
@@ -1472,13 +1473,14 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
}
-static sector_t compute_blocknr(struct stripe_head *sh, int i)
+static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
{
raid5_conf_t *conf = sh->raid_conf;
int raid_disks = sh->disks;
int data_disks = raid_disks - conf->max_degraded;
sector_t new_sector = sh->sector, check;
- int sectors_per_chunk = conf->chunk_size >> 9;
+ int sectors_per_chunk = previous ? (conf->prev_chunk >> 9)
+ : (conf->chunk_size >> 9);
sector_t stripe;
int chunk_offset;
int chunk_number, dummy1, dd_idx = i;
@@ -1579,8 +1581,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset;
check = raid5_compute_sector(conf, r_sector,
- (raid_disks != conf->raid_disks),
- &dummy1, &sh2);
+ previous, &dummy1, &sh2);
if (check != sh->sector || dummy1 != dd_idx || sh2.pd_idx != sh->pd_idx
|| sh2.qd_idx != sh->qd_idx) {
printk(KERN_ERR "compute_blocknr: map not correct\n");
@@ -1992,7 +1993,9 @@ static int page_is_zero(struct page *p)
static void stripe_set_idx(sector_t stripe, raid5_conf_t *conf, int previous,
struct stripe_head *sh)
{
- int sectors_per_chunk = conf->chunk_size >> 9;
+ int sectors_per_chunk =
+ previous ? (conf->prev_chunk >> 9)
+ : (conf->chunk_size >> 9);
int dd_idx;
int chunk_offset = sector_div(stripe, sectors_per_chunk);
int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
@@ -2662,7 +2665,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
int dd_idx, j;
struct stripe_head *sh2;
- sector_t bn = compute_blocknr(sh, i);
+ sector_t bn = compute_blocknr(sh, i, 1);
sector_t s = raid5_compute_sector(conf, bn, 0,
&dd_idx, NULL);
sh2 = get_active_stripe(conf, s, 0, 1);
@@ -3318,6 +3321,8 @@ static int raid5_mergeable_bvec(struct request_queue *q,
if ((bvm->bi_rw & 1) == WRITE)
return biovec->bv_len; /* always allow writes to be mergeable */
+ if (mddev->new_chunk < mddev->chunk_size)
+ chunk_sectors = mddev->new_chunk >> 9;
max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
if (max < 0) max = 0;
if (max <= biovec->bv_len && bio_sectors == 0)
@@ -3333,6 +3338,8 @@ static int in_chunk_boundary(mddev_t *mddev, struct bio *bio)
unsigned int chunk_sectors = mddev->chunk_size >> 9;
unsigned int bio_sectors = bio->bi_size >> 9;
+ if (mddev->new_chunk < mddev->chunk_size)
+ chunk_sectors = mddev->new_chunk >> 9;
return chunk_sectors >=
((sector & (chunk_sectors - 1)) + bio_sectors);
}
@@ -3788,7 +3795,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
BUG_ON(conf->reshape_progress == 0);
stripe_addr = writepos;
BUG_ON((mddev->dev_sectors &
- ~((sector_t)mddev->chunk_size / 512 - 1))
+ ~((sector_t)conf->chunk_size / 512 - 1))
- (conf->chunk_size / 512) - stripe_addr
!= sector_nr);
} else {
@@ -3811,7 +3818,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
if (conf->level == 6 &&
j == sh->qd_idx)
continue;
- s = compute_blocknr(sh, j);
+ s = compute_blocknr(sh, j, 0);
if (s < raid5_size(mddev, 0, 0)) {
skipped = 1;
continue;
@@ -4217,6 +4224,7 @@ raid5_size(mddev_t *mddev, sector_t sectors, int raid_disks)
}
sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
+ sectors &= ~((sector_t)mddev->new_chunk/512 - 1);
return sectors * (raid_disks - conf->max_degraded);
}
@@ -4322,6 +4330,8 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
conf->algorithm = mddev->new_layout;
conf->max_nr_stripes = NR_STRIPES;
conf->reshape_progress = mddev->reshape_position;
+ if (conf->reshape_progress != MaxSector)
+ conf->prev_chunk = mddev->chunk_size;
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
@@ -4385,7 +4395,7 @@ static int run(mddev_t *mddev)
* geometry.
*/
here_new = mddev->reshape_position;
- if (sector_div(here_new, (mddev->chunk_size>>9)*
+ if (sector_div(here_new, (mddev->new_chunk>>9)*
(mddev->raid_disks - max_degraded))) {
printk(KERN_ERR "raid5: reshape_position not "
"on a stripe boundary\n");
@@ -4789,7 +4799,8 @@ static int raid5_check_reshape(mddev_t *mddev)
if ((mddev->chunk_size / STRIPE_SIZE) * 4 > conf->max_nr_stripes ||
(mddev->new_chunk / STRIPE_SIZE) * 4 > conf->max_nr_stripes) {
printk(KERN_WARNING "raid5: reshape: not enough stripes. Needed %lu\n",
- (mddev->chunk_size / STRIPE_SIZE)*4);
+ (max(mddev->chunk_size, mddev->new_chunk)
+ / STRIPE_SIZE)*4);
return -ENOSPC;
}
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a081fb4..b9c9328 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -350,6 +350,7 @@ struct raid5_private_data {
*/
sector_t reshape_safe;
int previous_raid_disks;
+ int prev_chunk;
short generation; /* increments with every reshape */
struct list_head handle_list; /* stripes needing handling */
* [md PATCH 07/14] md/raid5: prepare for allowing reshape to change layout
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
Add prev_algo to raid5_conf_t along the same lines as prev_chunk
and previous_raid_disks.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 32 +++++++++++++++++++-------------
drivers/md/raid5.h | 2 +-
2 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7638cc3..80ec9a6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1273,6 +1273,8 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
int pd_idx, qd_idx;
int ddf_layout = 0;
sector_t new_sector;
+ int algorithm = previous ? conf->prev_algo
+ : conf->algorithm;
int sectors_per_chunk = previous ? (conf->prev_chunk >> 9)
: (conf->chunk_size >> 9);
int raid_disks = previous ? conf->previous_raid_disks
@@ -1307,7 +1309,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
pd_idx = data_disks;
break;
case 5:
- switch (conf->algorithm) {
+ switch (algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
pd_idx = data_disks - stripe % raid_disks;
if (*dd_idx >= pd_idx)
@@ -1335,13 +1337,13 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
break;
default:
printk(KERN_ERR "raid5: unsupported algorithm %d\n",
- conf->algorithm);
+ algorithm);
BUG();
}
break;
case 6:
- switch (conf->algorithm) {
+ switch (algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
pd_idx = raid_disks - 1 - (stripe % raid_disks);
qd_idx = pd_idx + 1;
@@ -1454,7 +1456,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
default:
printk(KERN_CRIT "raid6: unsupported algorithm %d\n",
- conf->algorithm);
+ algorithm);
BUG();
}
break;
@@ -1481,6 +1483,8 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
sector_t new_sector = sh->sector, check;
int sectors_per_chunk = previous ? (conf->prev_chunk >> 9)
: (conf->chunk_size >> 9);
+ int algorithm = previous ? conf->prev_algo
+ : conf->algorithm;
sector_t stripe;
int chunk_offset;
int chunk_number, dummy1, dd_idx = i;
@@ -1497,7 +1501,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
switch(conf->level) {
case 4: break;
case 5:
- switch (conf->algorithm) {
+ switch (algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
case ALGORITHM_RIGHT_ASYMMETRIC:
if (i > sh->pd_idx)
@@ -1516,14 +1520,14 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
break;
default:
printk(KERN_ERR "raid5: unsupported algorithm %d\n",
- conf->algorithm);
+ algorithm);
BUG();
}
break;
case 6:
if (i == sh->qd_idx)
return 0; /* It is the Q disk */
- switch (conf->algorithm) {
+ switch (algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
case ALGORITHM_RIGHT_ASYMMETRIC:
case ALGORITHM_ROTATING_ZERO_RESTART:
@@ -1571,7 +1575,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
break;
default:
printk(KERN_CRIT "raid6: unsupported algorithm %d\n",
- conf->algorithm);
+ algorithm);
BUG();
}
break;
@@ -4330,8 +4334,10 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
conf->algorithm = mddev->new_layout;
conf->max_nr_stripes = NR_STRIPES;
conf->reshape_progress = mddev->reshape_position;
- if (conf->reshape_progress != MaxSector)
+ if (conf->reshape_progress != MaxSector) {
conf->prev_chunk = mddev->chunk_size;
+ conf->prev_algo = mddev->layout;
+ }
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
@@ -4472,14 +4478,14 @@ static int run(mddev_t *mddev)
if (mddev->degraded == 0)
printk("raid5: raid level %d set %s active with %d out of %d"
- " devices, algorithm %d\n", conf->level, mdname(mddev),
- mddev->raid_disks-mddev->degraded, mddev->raid_disks,
- conf->algorithm);
+ " devices, algorithm %d\n", conf->level, mdname(mddev),
+ mddev->raid_disks-mddev->degraded, mddev->raid_disks,
+ mddev->new_layout);
else
printk(KERN_ALERT "raid5: raid level %d set %s active with %d"
" out of %d devices, algorithm %d\n", conf->level,
mdname(mddev), mddev->raid_disks - mddev->degraded,
- mddev->raid_disks, conf->algorithm);
+ mddev->raid_disks, mddev->new_layout);
print_raid5_conf(conf);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b9c9328..cdd0456 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -350,7 +350,7 @@ struct raid5_private_data {
*/
sector_t reshape_safe;
int previous_raid_disks;
- int prev_chunk;
+ int prev_chunk, prev_algo;
short generation; /* increments with every reshape */
struct list_head handle_list; /* stripes needing handling */
* [md PATCH 08/14] md/raid5: reshape using largest of old and new chunk size
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
This ensures that even when old and new stripes are overlapping,
we will try to read all of the old before having to write any
of the new.
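Concretely, each pass of reshape_request now covers
max(old_chunk, new_chunk) worth of data per data disk, so when e.g.
growing from 64K to 256K chunks, one pass spans a whole new 256K
chunk and the four old 64K chunks beneath it can all be read before
any new-layout writes need to happen. A trivial sketch of the size
choice (illustrative values, not the patch's code):

#include <stdio.h>

int main(void)
{
	int chunk_size = 64 * 1024;	/* old chunk, bytes (assumed) */
	int new_chunk = 256 * 1024;	/* new chunk, bytes (assumed) */
	int reshape_sectors;

	/* process whole chunks of whichever geometry is larger */
	if (new_chunk > chunk_size)
		reshape_sectors = new_chunk / 512;
	else
		reshape_sectors = chunk_size / 512;

	printf("%d sectors per reshape pass\n", reshape_sectors);
	return 0;
}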
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 34 ++++++++++++++++++++++------------
1 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 80ec9a6..f7fb2b8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int dd_idx;
sector_t writepos, safepos, gap;
sector_t stripe_addr;
+ int reshape_sectors;
if (sector_nr == 0) {
/* If restarting in the middle, skip the initial sectors */
@@ -3755,6 +3756,15 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
}
}
+ /* We need to process a full chunk at a time.
+ * If old and new chunk sizes differ, we need to process the
+ * largest of these
+ */
+ if (mddev->new_chunk > mddev->chunk_size)
+ reshape_sectors = mddev->new_chunk / 512;
+ else
+ reshape_sectors = mddev->chunk_size / 512;
+
/* we update the metadata when there is more than 3Meg
* in the block range (that is rather arbitrary, should
* probably be time based) or when the data about to be
@@ -3768,12 +3778,12 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
safepos = conf->reshape_safe;
sector_div(safepos, data_disks);
if (mddev->delta_disks < 0) {
- writepos -= conf->chunk_size/512;
- safepos += conf->chunk_size/512;
+ writepos -= reshape_sectors;
+ safepos += reshape_sectors;
gap = conf->reshape_safe - conf->reshape_progress;
} else {
- writepos += conf->chunk_size/512;
- safepos -= conf->chunk_size/512;
+ writepos += reshape_sectors;
+ safepos -= reshape_sectors;
gap = conf->reshape_progress - conf->reshape_safe;
}
@@ -3799,14 +3809,14 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
BUG_ON(conf->reshape_progress == 0);
stripe_addr = writepos;
BUG_ON((mddev->dev_sectors &
- ~((sector_t)conf->chunk_size / 512 - 1))
- - (conf->chunk_size / 512) - stripe_addr
+ ~((sector_t)reshape_sectors - 1))
+ - reshape_sectors - stripe_addr
!= sector_nr);
} else {
- BUG_ON(writepos != sector_nr + conf->chunk_size / 512);
+ BUG_ON(writepos != sector_nr + reshape_sectors);
stripe_addr = sector_nr;
}
- for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
+ for (i = 0; i < reshape_sectors; i += STRIPE_SECTORS) {
int j;
int skipped = 0;
sh = get_active_stripe(conf, stripe_addr+i, 0, 0);
@@ -3839,9 +3849,9 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
}
spin_lock_irq(&conf->device_lock);
if (mddev->delta_disks < 0)
- conf->reshape_progress -= i * new_data_disks;
+ conf->reshape_progress -= reshape_sectors * new_data_disks;
else
- conf->reshape_progress += i * new_data_disks;
+ conf->reshape_progress += reshape_sectors * new_data_disks;
spin_unlock_irq(&conf->device_lock);
/* Ok, those stripe are ready. We can start scheduling
* reads on the source stripes.
@@ -3867,7 +3877,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
/* If this takes us to the resync_max point where we have to pause,
* then we need to write out the superblock.
*/
- sector_nr += conf->chunk_size>>9;
+ sector_nr += reshape_sectors;
if (sector_nr >= mddev->resync_max) {
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
@@ -3883,7 +3893,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
}
- return conf->chunk_size>>9;
+ return reshape_sectors;
}
/* FIXME go_faster isn't used */
* [md PATCH 09/14] md/raid5: allow layout and chunksize to be changed on active array.
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
If an array has 3 or more devices, we allow the chunksize or layout
to be changed and when a reshape starts, we use these as the 'new'
values.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 76 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 56 insertions(+), 20 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f7fb2b8..4fdc6d0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4397,9 +4397,7 @@ static int run(mddev_t *mddev)
int old_disks;
int max_degraded = (mddev->level == 6 ? 2 : 1);
- if (mddev->new_level != mddev->level ||
- mddev->new_layout != mddev->layout ||
- mddev->new_chunk != mddev->chunk_size) {
+ if (mddev->new_level != mddev->level) {
printk(KERN_ERR "raid5: %s: unsupported reshape "
"required - aborting.\n",
mdname(mddev));
@@ -4784,8 +4782,10 @@ static int raid5_check_reshape(mddev_t *mddev)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
- if (mddev->delta_disks == 0)
- return 0; /* nothing to do */
+ if (mddev->delta_disks == 0 &&
+ mddev->new_layout == mddev->layout &&
+ mddev->new_chunk == mddev->chunk_size)
+ return -EINVAL; /* nothing to do */
if (mddev->bitmap)
/* Cannot grow a bitmap yet */
return -EBUSY;
@@ -4860,6 +4860,10 @@ static int raid5_start_reshape(mddev_t *mddev)
spin_lock_irq(&conf->device_lock);
conf->previous_raid_disks = conf->raid_disks;
conf->raid_disks += mddev->delta_disks;
+ conf->prev_chunk = conf->chunk_size;
+ conf->chunk_size = mddev->new_chunk;
+ conf->prev_algo = conf->algorithm;
+ conf->algorithm = mddev->new_layout;
if (mddev->delta_disks < 0)
conf->reshape_progress = raid5_size(mddev, 0, 0);
else
@@ -4952,6 +4956,7 @@ static void end_reshape(raid5_conf_t *conf)
static void raid5_finish_reshape(mddev_t *mddev)
{
struct block_device *bdev;
+ raid5_conf_t *conf = mddev_to_conf(mddev);
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -4970,7 +4975,6 @@ static void raid5_finish_reshape(mddev_t *mddev)
}
} else {
int d;
- raid5_conf_t *conf = mddev_to_conf(mddev);
mddev->degraded = conf->raid_disks;
for (d = 0; d < conf->raid_disks ; d++)
if (conf->disks[d].rdev &&
@@ -4982,6 +4986,8 @@ static void raid5_finish_reshape(mddev_t *mddev)
d++)
raid5_remove_disk(mddev, d);
}
+ mddev->layout = conf->algorithm;
+ mddev->chunk_size = conf->chunk_size;
mddev->reshape_position = MaxSector;
mddev->delta_disks = 0;
}
@@ -5080,11 +5086,10 @@ static void *raid5_takeover_raid6(mddev_t *mddev)
static int raid5_reconfig(mddev_t *mddev, int new_layout, int new_chunk)
{
- /* Currently the layout and chunk size can only be changed
- * for a 2-drive raid array, as in that case no data shuffling
- * is required.
- * Later we might validate these and set new_* so a reshape
- * can complete the change.
+ /* For a 2-drive array, the layout and chunk size can be changed
+ * immediately as no restriping is needed.
+ * For larger arrays we record the new value - after validation
+ * to be used by a reshape pass.
*/
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -5103,19 +5108,49 @@ static int raid5_reconfig(mddev_t *mddev, int new_layout, int new_chunk)
/* They look valid */
- if (mddev->raid_disks != 2)
- return -EINVAL;
+ if (mddev->raid_disks == 2) {
- if (new_layout >= 0) {
- conf->algorithm = new_layout;
- mddev->layout = mddev->new_layout = new_layout;
+ if (new_layout >= 0) {
+ conf->algorithm = new_layout;
+ mddev->layout = mddev->new_layout = new_layout;
+ }
+ if (new_chunk > 0) {
+ conf->chunk_size = new_chunk;
+ mddev->chunk_size = mddev->new_chunk = new_chunk;
+ }
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ md_wakeup_thread(mddev->thread);
+ } else {
+ if (new_layout >= 0)
+ mddev->new_layout = new_layout;
+ if (new_chunk > 0)
+ mddev->new_chunk = new_chunk;
}
+ return 0;
+}
+
+static int raid6_reconfig(mddev_t *mddev, int new_layout, int new_chunk)
+{
+ if (new_layout >= 0 && !algorithm_valid_raid6(new_layout))
+ return -EINVAL;
if (new_chunk > 0) {
- conf->chunk_size = new_chunk;
- mddev->chunk_size = mddev->new_chunk = new_chunk;
+ if (new_chunk & (new_chunk-1))
+ /* not a power of 2 */
+ return -EINVAL;
+ if (new_chunk < PAGE_SIZE)
+ return -EINVAL;
+ if (mddev->array_sectors & ((new_chunk>>9)-1))
+ /* not factor of array size */
+ return -EINVAL;
}
- set_bit(MD_CHANGE_DEVS, &mddev->flags);
- md_wakeup_thread(mddev->thread);
+
+ /* They look valid */
+
+ if (new_layout >= 0)
+ mddev->new_layout = new_layout;
+ if (new_chunk > 0)
+ mddev->new_chunk = new_chunk;
+
return 0;
}
@@ -5216,6 +5251,7 @@ static struct mdk_personality raid6_personality =
#endif
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
+ .reconfig = raid6_reconfig,
};
static struct mdk_personality raid5_personality =
{
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [md PATCH 10/14] md: don't display meaningless values in sysfs files resync_start and sync_speed
2009-03-31 4:54 [md PATCH 00/14] Final set of patches head for 2.6.30 NeilBrown
` (12 preceding siblings ...)
2009-03-31 4:54 ` [md PATCH 12/14] md: remove CONFIG_MD_RAID_RESHAPE config option NeilBrown
@ 2009-03-31 4:54 ` NeilBrown
13 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
When no resync is happening, both of these files currently show
meaningless values (in slightly different ways).
Change them to report "none" in that case.
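As a usage illustration (not part of the patch), a small userspace
reader that copes with the new "none" value; the /dev/md0 sysfs path is
only an example:

#include <stdio.h>
#include <string.h>

/* Read sync_speed for md0 and distinguish "none" from a numeric value.
 * The sysfs path is an example; adjust for the array being inspected.
 */
int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/block/md0/md/sync_speed", "r");

	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror("sync_speed");
		if (f)
			fclose(f);
		return 1;
	}
	fclose(f);

	if (strncmp(buf, "none", 4) == 0)
		printf("no resync in progress\n");
	else
		printf("resync speed: %s", buf);	/* numeric speed value */
	return 0;
}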
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c509313..2be574c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2888,6 +2888,8 @@ __ATTR(chunk_size, S_IRUGO|S_IWUSR, chunk_size_show, chunk_size_store);
static ssize_t
resync_start_show(mddev_t *mddev, char *page)
{
+ if (mddev->recovery_cp == MaxSector)
+ return sprintf(page, "none\n");
return sprintf(page, "%llu\n", (unsigned long long)mddev->recovery_cp);
}
@@ -3469,6 +3471,8 @@ static ssize_t
sync_speed_show(mddev_t *mddev, char *page)
{
unsigned long resync, dt, db;
+ if (mddev->curr_resync == 0)
+ return sprintf(page, "none\n");
resync = mddev->curr_mark_cnt - atomic_read(&mddev->recovery_active);
dt = (jiffies - mddev->resync_mark) / HZ;
if (!dt) dt++;
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [md PATCH 12/14] md: remove CONFIG_MD_RAID_RESHAPE config option.
2009-03-31 4:54 [md PATCH 00/14] Final set of patches head for 2.6.30 NeilBrown
` (11 preceding siblings ...)
2009-03-31 4:54 ` [md PATCH 11/14] md/raid5: be more careful about write ordering when reshaping NeilBrown
@ 2009-03-31 4:54 ` NeilBrown
2009-03-31 4:54 ` [md PATCH 10/14] md: don't display meaningless values in sysfs files resync_start and sync_speed NeilBrown
13 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
This was only needed when the code was experimental. Most of it
is well tested now, so the option is no longer useful.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/Kconfig | 29 -----------------------------
drivers/md/raid5.c | 10 ----------
2 files changed, 0 insertions(+), 39 deletions(-)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 449d0b9..36e0675 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -152,35 +152,6 @@ config MD_RAID456
If unsure, say Y.
-config MD_RAID5_RESHAPE
- bool "Support adding drives to a raid-5 array"
- depends on MD_RAID456
- default y
- ---help---
- A RAID-5 set can be expanded by adding extra drives. This
- requires "restriping" the array which means (almost) every
- block must be written to a different place.
-
- This option allows such restriping to be done while the array
- is online.
-
- You will need mdadm version 2.4.1 or later to use this
- feature safely. During the early stage of reshape there is
- a critical section where live data is being over-written. A
- crash during this time needs extra care for recovery. The
- newer mdadm takes a copy of the data in the critical section
- and will restore it, if necessary, after a crash.
-
- The mdadm usage is e.g.
- mdadm --grow /dev/md1 --raid-disks=6
- to grow '/dev/md1' to having 6 disks.
-
- Note: The array can only be expanded, not contracted.
- There should be enough spares already present to make the new
- array workable.
-
- If unsure, say Y.
-
config MD_RAID6_PQ
tristate
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 062df84..fb11c13 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -948,7 +948,6 @@ static int grow_stripes(raid5_conf_t *conf, int num)
return 0;
}
-#ifdef CONFIG_MD_RAID5_RESHAPE
static int resize_stripes(raid5_conf_t *conf, int newsize)
{
/* Make all the stripes able to hold 'newsize' devices.
@@ -1073,7 +1072,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
conf->pool_size = newsize;
return err;
}
-#endif
static int drop_one_stripe(raid5_conf_t *conf)
{
@@ -4822,7 +4820,6 @@ static int raid5_resize(mddev_t *mddev, sector_t sectors)
return 0;
}
-#ifdef CONFIG_MD_RAID5_RESHAPE
static int raid5_check_reshape(mddev_t *mddev)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -4967,7 +4964,6 @@ static int raid5_start_reshape(mddev_t *mddev)
md_new_event(mddev);
return 0;
}
-#endif
/* This is called from the reshape thread and should make any
* changes needed in 'conf'
@@ -5289,11 +5285,9 @@ static struct mdk_personality raid6_personality =
.sync_request = sync_request,
.resize = raid5_resize,
.size = raid5_size,
-#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
-#endif
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
.reconfig = raid6_reconfig,
@@ -5314,11 +5308,9 @@ static struct mdk_personality raid5_personality =
.sync_request = sync_request,
.resize = raid5_resize,
.size = raid5_size,
-#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
-#endif
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
.reconfig = raid5_reconfig,
@@ -5340,11 +5332,9 @@ static struct mdk_personality raid4_personality =
.sync_request = sync_request,
.resize = raid5_resize,
.size = raid5_size,
-#ifdef CONFIG_MD_RAID5_RESHAPE
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
-#endif
.quiesce = raid5_quiesce,
};
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [md PATCH 11/14] md/raid5: be more careful about write ordering when reshaping.
2009-03-31 4:54 [md PATCH 00/14] Final set of patches head for 2.6.30 NeilBrown
` (10 preceding siblings ...)
2009-03-31 4:54 ` [md PATCH 14/14] md/raid5 revise rules for when to update metadata during reshape NeilBrown
@ 2009-03-31 4:54 ` NeilBrown
2009-03-31 4:54 ` [md PATCH 12/14] md: remove CONFIG_MD_RAID_RESHAPE config option NeilBrown
2009-03-31 4:54 ` [md PATCH 10/14] md: don't display meaningless values in sysfs files resync_start and sync_speed NeilBrown
13 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
When we are reshaping an array, it is very important that we read
the data from a particular sector offset before writing new data
at that offset.
In most cases when growing or shrinking an array we read long before
we even consider writing. But when restriping an array without
changing its size, there is a small possibility that we might have
some data available to write before the read has happened at the same
location. This would require some stripes to be in cache already.
To guard against this small possibility, we check, before writing,
that the 'old' stripe at the same location is not in the process of
being read. And we ensure that we mark all 'source' stripes as such
before allowing new 'destination' stripes to proceed.
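A schematic userspace model (hypothetical toy types, not the kernel
code) of the ordering this enforces: destination stripes are parked on a
local list and only released once every source stripe has been marked:

#include <stdio.h>

/* Toy model of the two-phase release in reshape_request(): destinations
 * are collected first, sources are flagged, and only then are the
 * destinations handed back for processing.
 */
struct toy_stripe {
	int sector;
	int expand_source;	/* set once the stripe is known to be a read source */
	struct toy_stripe *next;
};

int main(void)
{
	struct toy_stripe dest[3] = { { 0 }, { 8 }, { 16 } };
	struct toy_stripe src[3]  = { { 0 }, { 8 }, { 16 } };
	struct toy_stripe *deferred = NULL;
	int i;

	/* Phase 1: prepare destination stripes but defer their release. */
	for (i = 0; i < 3; i++) {
		dest[i].next = deferred;
		deferred = &dest[i];
	}

	/* Phase 2: mark every source stripe before any destination runs. */
	for (i = 0; i < 3; i++)
		src[i].expand_source = 1;
	printf("all source stripes marked\n");

	/* Phase 3: only now is it safe to release the deferred destinations. */
	while (deferred) {
		struct toy_stripe *sh = deferred;

		deferred = deferred->next;
		printf("destination stripe at sector %d released\n", sh->sector);
	}
	return 0;
}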
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4fdc6d0..062df84 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -395,7 +395,8 @@ get_active_stripe(raid5_conf_t *conf, sector_t sector,
init_stripe(sh, sector, previous);
} else {
if (atomic_read(&sh->count)) {
- BUG_ON(!list_empty(&sh->lru));
+ BUG_ON(!list_empty(&sh->lru)
+ && !test_bit(STRIPE_EXPANDING, &sh->state));
} else {
if (!test_bit(STRIPE_HANDLE, &sh->state))
atomic_inc(&conf->active_stripes);
@@ -2944,6 +2945,23 @@ static bool handle_stripe5(struct stripe_head *sh)
/* Finish reconstruct operations initiated by the expansion process */
if (sh->reconstruct_state == reconstruct_state_result) {
+ struct stripe_head *sh2
+ = get_active_stripe(conf, sh->sector, 1, 1);
+ if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
+ /* sh cannot be written until sh2 has been read.
+ * so arrange for sh to be delayed a little
+ */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
+ &sh2->state))
+ atomic_inc(&conf->preread_active_stripes);
+ release_stripe(sh2);
+ goto unlock;
+ }
+ if (sh2)
+ release_stripe(sh2);
+
sh->reconstruct_state = reconstruct_state_idle;
clear_bit(STRIPE_EXPANDING, &sh->state);
for (i = conf->raid_disks; i--; ) {
@@ -3172,6 +3190,23 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
}
if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+ struct stripe_head *sh2
+ = get_active_stripe(conf, sh->sector, 1, 1);
+ if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
+ /* sh cannot be written until sh2 has been read.
+ * so arrange for sh to be delayed a little
+ */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
+ &sh2->state))
+ atomic_inc(&conf->preread_active_stripes);
+ release_stripe(sh2);
+ goto unlock;
+ }
+ if (sh2)
+ release_stripe(sh2);
+
/* Need to write out all blocks after computing P&Q */
sh->disks = conf->raid_disks;
stripe_set_idx(sh->sector, conf, 0, sh);
@@ -3739,6 +3774,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
sector_t writepos, safepos, gap;
sector_t stripe_addr;
int reshape_sectors;
+ struct list_head stripes;
if (sector_nr == 0) {
/* If restarting in the middle, skip the initial sectors */
@@ -3816,6 +3852,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
BUG_ON(writepos != sector_nr + reshape_sectors);
stripe_addr = sector_nr;
}
+ INIT_LIST_HEAD(&stripes);
for (i = 0; i < reshape_sectors; i += STRIPE_SECTORS) {
int j;
int skipped = 0;
@@ -3845,7 +3882,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
set_bit(STRIPE_EXPAND_READY, &sh->state);
set_bit(STRIPE_HANDLE, &sh->state);
}
- release_stripe(sh);
+ list_add(&sh->lru, &stripes);
}
spin_lock_irq(&conf->device_lock);
if (mddev->delta_disks < 0)
@@ -3874,6 +3911,14 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
release_stripe(sh);
first_sector += STRIPE_SECTORS;
}
+ /* Now that the sources are clearly marked, we can release
+ * the destination stripes
+ */
+ while (!list_empty(&stripes)) {
+ sh = list_entry(stripes.next, struct stripe_head, lru);
+ list_del_init(&sh->lru);
+ release_stripe(sh);
+ }
/* If this takes us to the resync_max point where we have to pause,
* then we need to write out the superblock.
*/
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [md PATCH 13/14] md/raid5: minor code cleanups in make_request.
2009-03-31 4:54 [md PATCH 00/14] Final set of patches head for 2.6.30 NeilBrown
` (8 preceding siblings ...)
2009-03-31 4:54 ` [md PATCH 03/14] md: allow number of drives in raid5 to be reduced NeilBrown
@ 2009-03-31 4:54 ` NeilBrown
2009-03-31 4:54 ` [md PATCH 14/14] md/raid5 revise rules for when to update metadata during reshape NeilBrown
` (3 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
... and to be certain that make_request doesn't wait forever,
add a 'wake_up' when ->reshape_progress has been set to MaxSector
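A userspace analogue using pthreads (purely illustrative; the kernel
side uses wait_event/wake_up on conf->wait_for_overlap) of why the
wake_up matters: without it, a waiter blocked on the old condition would
never re-check it once the reshape has finished:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SECTOR UINT64_MAX	/* stand-in for the kernel's MaxSector */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static uint64_t reshape_progress = 1024;	/* some in-progress position */

/* Models make_request() waiting for the reshape to move out of its way. */
static void *waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (reshape_progress != MAX_SECTOR)
		pthread_cond_wait(&cond, &lock);	/* needs a matching wake-up */
	pthread_mutex_unlock(&lock);
	printf("waiter released after reshape ended\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, waiter, NULL);

	/* Models end_reshape(): set progress to MaxSector, then wake waiters.
	 * Without the broadcast, the waiter above could block forever.
	 */
	pthread_mutex_lock(&lock);
	reshape_progress = MAX_SECTOR;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);
	return 0;
}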
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index fb11c13..bb4b12e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3637,10 +3637,9 @@ static int make_request(struct request_queue *q, struct bio * bi)
retry:
previous = 0;
+ disks = conf->raid_disks;
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- if (likely(conf->reshape_progress == MaxSector))
- disks = conf->raid_disks;
- else {
+ if (unlikely(conf->reshape_progress != MaxSector)) {
/* spinlock is needed as reshape_progress may be
* 64bit on a 32bit platform, and so it might be
* possible to see a half-updated value
@@ -3650,7 +3649,6 @@ static int make_request(struct request_queue *q, struct bio * bi)
* to check again.
*/
spin_lock_irq(&conf->device_lock);
- disks = conf->raid_disks;
if (mddev->delta_disks < 0
? logical_sector < conf->reshape_progress
: logical_sector >= conf->reshape_progress) {
@@ -3679,7 +3677,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
sh = get_active_stripe(conf, new_sector, previous,
(bi->bi_rw&RWA_MASK));
if (sh) {
- if (unlikely(conf->reshape_progress != MaxSector)) {
+ if (unlikely(previous)) {
/* expansion might have moved on while waiting for a
* stripe, so we must do the range check again.
* Expansion could still move past after this
@@ -3690,10 +3688,9 @@ static int make_request(struct request_queue *q, struct bio * bi)
*/
int must_retry = 0;
spin_lock_irq(&conf->device_lock);
- if ((mddev->delta_disks < 0
- ? logical_sector >= conf->reshape_progress
- : logical_sector < conf->reshape_progress)
- && previous)
+ if (mddev->delta_disks < 0
+ ? logical_sector >= conf->reshape_progress
+ : logical_sector < conf->reshape_progress)
/* mismatch, need to try again */
must_retry = 1;
spin_unlock_irq(&conf->device_lock);
@@ -4977,6 +4974,7 @@ static void end_reshape(raid5_conf_t *conf)
conf->previous_raid_disks = conf->raid_disks;
conf->reshape_progress = MaxSector;
spin_unlock_irq(&conf->device_lock);
+ wake_up(&conf->wait_for_overlap);
/* read-ahead size must cover two whole stripes, which is
* 2 * (datadisks) * chunksize where 'n' is the number of raid devices
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [md PATCH 14/14] md/raid5 revise rules for when to update metadata during reshape
2009-03-31 4:54 [md PATCH 00/14] Final set of patches head for 2.6.30 NeilBrown
` (9 preceding siblings ...)
2009-03-31 4:54 ` [md PATCH 13/14] md/raid5: minor code cleanups in make_request NeilBrown
@ 2009-03-31 4:54 ` NeilBrown
2009-03-31 4:54 ` [md PATCH 11/14] md/raid5: be more careful about write ordering when reshaping NeilBrown
` (2 subsequent siblings)
13 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2009-03-31 4:54 UTC (permalink / raw)
To: linux-raid; +Cc: NeilBrown
We currently update the metadata:
1/ every 3 megabytes
2/ When the place we will write new-layout data to is recorded in
the metadata as still containing old-layout data.
Rule one exists to avoid having to re-do too much reshaping in the
face of a crash/restart. So it should really be time based rather
than size based. So change it to "every 10 seconds".
Rule two turns out to be too harsh when restriping an array
'in-place', as in that case the metadata must be updated for every
stripe.
For the in-place update, it can only possibly be safe from a crash if
some user-space program takes a backup of, say, every few hundred
stripes before allowing them to be reshaped. In that case, the
constant metadata update is pointless.
So only update the metadata if the new metadata will report that the
end of the 'old-layout' data is beyond where we are currently
writing 'new-layout' data.
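To make the rule concrete, a small userspace sketch (illustrative names,
not the kernel code) of the resulting decision, where 'shrinking' stands
for a reshape that runs backwards (mddev->delta_disks < 0):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Update the metadata when 'safepos' lags 'writepos' while 'readpos' is
 * already beyond it, or when 10 seconds have passed since the last
 * checkpoint; the comparisons flip when the reshape runs backwards.
 */
static bool need_metadata_update(uint64_t writepos, uint64_t readpos,
				 uint64_t safepos, bool shrinking,
				 double secs_since_checkpoint)
{
	bool positional = shrinking
		? (safepos > writepos && readpos < writepos)
		: (safepos < writepos && readpos > writepos);

	return positional || secs_since_checkpoint > 10.0;
}

int main(void)
{
	/* Growing array: safepos behind writepos, readpos ahead -> update. */
	printf("%d\n", need_metadata_update(2048, 4096, 1024, false, 3.0));
	/* In-place restripe: readpos behind writepos -> rely on the timer. */
	printf("%d\n", need_metadata_update(2048, 1024, 1024, false, 3.0));
	return 0;
}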
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 34 ++++++++++++++++++++++++++++------
drivers/md/raid5.h | 2 ++
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bb4b12e..3bbc6d6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3766,7 +3766,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
int new_data_disks = conf->raid_disks - conf->max_degraded;
int i;
int dd_idx;
- sector_t writepos, safepos, gap;
+ sector_t writepos, readpos, safepos;
sector_t stripe_addr;
int reshape_sectors;
struct list_head stripes;
@@ -3806,26 +3806,46 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
*/
writepos = conf->reshape_progress;
sector_div(writepos, new_data_disks);
+ readpos = conf->reshape_progress;
+ sector_div(readpos, data_disks);
safepos = conf->reshape_safe;
sector_div(safepos, data_disks);
if (mddev->delta_disks < 0) {
writepos -= reshape_sectors;
+ readpos += reshape_sectors;
safepos += reshape_sectors;
- gap = conf->reshape_safe - conf->reshape_progress;
} else {
writepos += reshape_sectors;
+ readpos -= reshape_sectors;
safepos -= reshape_sectors;
- gap = conf->reshape_progress - conf->reshape_safe;
}
+ /* 'writepos' is the most advanced device address we might write.
+ * 'readpos' is the least advanced device address we might read.
+ * 'safepos' is the least address recorded in the metadata as having
+ * been reshaped.
+ * If 'readpos' is behind 'writepos', then there is no way that we can
+ * ensure safety in the face of a crash - that must be done by userspace
+ * making a backup of the data. So in that case there is no particular
+ * rush to update metadata.
+ * Otherwise if 'safepos' is behind 'writepos', then we really need to
+ * update the metadata to advance 'safepos' to match 'readpos' so that
+ * we can be safe in the event of a crash.
+ * So we insist on updating metadata if safepos is behind writepos and
+ * readpos is beyond writepos.
+ * In any case, update the metadata every 10 seconds.
+ * Maybe that number should be configurable, but I'm not sure it is
+ * worth it.... maybe it could be a multiple of safemode_delay???
+ */
if ((mddev->delta_disks < 0
- ? writepos < safepos
- : writepos > safepos) ||
- gap > (new_data_disks)*3000*2 /*3Meg*/) {
+ ? (safepos > writepos && readpos < writepos)
+ : (safepos < writepos && readpos > writepos)) ||
+ time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
/* Cannot proceed until we've updated the superblock... */
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes)==0);
mddev->reshape_position = conf->reshape_progress;
+ conf->reshape_checkpoint = jiffies;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait, mddev->flags == 0 ||
@@ -3923,6 +3943,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
wait_event(conf->wait_for_overlap,
atomic_read(&conf->reshape_stripes) == 0);
mddev->reshape_position = conf->reshape_progress;
+ conf->reshape_checkpoint = jiffies;
set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait,
@@ -4957,6 +4978,7 @@ static int raid5_start_reshape(mddev_t *mddev)
spin_unlock_irq(&conf->device_lock);
return -EAGAIN;
}
+ conf->reshape_checkpoint = jiffies;
md_wakeup_thread(mddev->sync_thread);
md_new_event(mddev);
return 0;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index cdd0456..52ba999 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -352,6 +352,8 @@ struct raid5_private_data {
int previous_raid_disks;
int prev_chunk, prev_algo;
short generation; /* increments with every reshape */
+ unsigned long reshape_checkpoint; /* Time we last updated
+ * metadata */
struct list_head handle_list; /* stripes needing handling */
struct list_head hold_list; /* preread ready stripes */
^ permalink raw reply related [flat|nested] 15+ messages in thread