* [md PATCH 01/16] md: refine interpretation of "hold_active == UNTIL_IOCTL".
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 03/16] md: remove test for duplicate device when setting slot number NeilBrown
` (17 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not. We can only tell from recent history of changes.
In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.
So we always preserve a newly created md device until something
significant happens. This state is stored in 'hold_active'.
The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it. If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.
We can also modify an array via sysfs (instead of via ioctl), and we
currently treat any change via sysfs like an ioctl: a sign that, if
the array isn't active afterwards, it should be destroyed.
However this is not appropriate, as changes made via sysfs are more
gradual, so we should look for a more definitive change.
So this patch only clears 'hold_active' (from UNTIL_IOCTL back to
zero) when the array_state is changed via sysfs. Other changes via
sysfs are ignored.
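A user-space sketch of the intended behaviour (the type, the
HOLD_NONE/UNTIL_IOCTL values and the helper functions below are
illustrative stand-ins, not the kernel's exact code):

```c
#include <assert.h>

/* Toy model of the hold_active policy after this patch.
 * All names here are illustrative only. */
enum hold { HOLD_NONE = 0, UNTIL_IOCTL = 1 };

struct toy_mddev {
	enum hold hold_active;
};

/* Any ioctl still clears the hold, as before. */
static void toy_ioctl(struct toy_mddev *m)
{
	if (m->hold_active == UNTIL_IOCTL)
		m->hold_active = HOLD_NONE;
}

/* After this patch only a write to array_state clears the hold via
 * sysfs; stores to other attributes leave hold_active untouched. */
static void toy_sysfs_store(struct toy_mddev *m, int is_array_state)
{
	if (is_array_state && m->hold_active == UNTIL_IOCTL)
		m->hold_active = HOLD_NONE;
}
```

So a stream of gradual sysfs tweaks no longer destroys a newly created
array; only the definitive array_state transition (or an ioctl) does.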
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 266e82e..fd68243 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3791,6 +3791,8 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
if (err)
return err;
else {
+ if (mddev->hold_active == UNTIL_IOCTL)
+ mddev->hold_active = 0;
sysfs_notify_dirent_safe(mddev->sysfs_state);
return len;
}
@@ -4511,8 +4513,6 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
rv = mddev_lock(mddev);
- if (mddev->hold_active == UNTIL_IOCTL)
- mddev->hold_active = 0;
if (!rv) {
rv = entry->store(mddev, page, length);
mddev_unlock(mddev);
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [md PATCH 03/16] md: remove test for duplicate device when setting slot number.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
2011-10-26 1:43 ` [md PATCH 01/16] md: refine interpretation of "hold_active == UNTIL_IOCTL" NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 04/16] md: change hot_remove_disk to take an rdev rather than a number NeilBrown
` (16 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.
Thus the common test is not needed.
As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.
So remove the common test.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5905526..5164856 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2699,7 +2699,6 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
md_wakeup_thread(rdev->mddev->thread);
} else if (rdev->mddev->pers) {
- struct md_rdev *rdev2;
/* Activating a spare .. or possibly reactivating
* if we ever get bitmaps working here.
*/
@@ -2713,10 +2712,6 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
if (rdev->mddev->pers->hot_add_disk == NULL)
return -EINVAL;
- list_for_each_entry(rdev2, &rdev->mddev->disks, same_set)
- if (rdev2->raid_disk == slot)
- return -EEXIST;
-
if (slot >= rdev->mddev->raid_disks &&
slot >= rdev->mddev->raid_disks + rdev->mddev->delta_disks)
return -ENOSPC;
* [md PATCH 04/16] md: change hot_remove_disk to take an rdev rather than a number.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
2011-10-26 1:43 ` [md PATCH 01/16] md: refine interpretation of "hold_active == UNTIL_IOCTL" NeilBrown
2011-10-26 1:43 ` [md PATCH 03/16] md: remove test for duplicate device when setting slot number NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 02/16] md: take a reference to mddev during sysfs access NeilBrown
` (15 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement), so removing
a device based on that number won't work. Instead, pass the actual
device handle.
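A minimal sketch of why the handle is needed (illustrative types, not
the kernel's): once two devices can occupy one slot, removal must
compare device identity rather than slot number.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model: with hot-replace, a slot can hold both an
 * original and a replacement sharing the same raid_disk number. */
struct toy_rdev { int raid_disk; };
struct toy_slot { struct toy_rdev *rdev, *replacement; };

/* Remove by handle: only the exact device passed in is detached. */
static int toy_remove(struct toy_slot *p, struct toy_rdev *dev)
{
	if (p->rdev == dev) {
		p->rdev = NULL;
		return 0;
	}
	if (p->replacement == dev) {
		p->replacement = NULL;
		return 0;
	}
	return -1;	/* device not in this slot */
}
```

A remove-by-number interface could not distinguish the two devices
above, since both report the same raid_disk.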
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 6 +++---
drivers/md/md.h | 2 +-
drivers/md/multipath.c | 7 +++----
drivers/md/raid1.c | 7 +++----
drivers/md/raid10.c | 7 +++----
drivers/md/raid5.c | 9 ++++-----
6 files changed, 17 insertions(+), 21 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5164856..182056a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2691,7 +2691,7 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
if (rdev->mddev->pers->hot_remove_disk == NULL)
return -EINVAL;
err = rdev->mddev->pers->
- hot_remove_disk(rdev->mddev, rdev->raid_disk);
+ hot_remove_disk(rdev->mddev, rdev);
if (err)
return err;
sysfs_unlink_rdev(rdev->mddev, rdev);
@@ -7335,7 +7335,7 @@ static int remove_and_add_spares(struct mddev *mddev)
! test_bit(In_sync, &rdev->flags)) &&
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
- mddev, rdev->raid_disk)==0) {
+ mddev, rdev)==0) {
sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
@@ -7473,7 +7473,7 @@ void md_check_recovery(struct mddev *mddev)
test_bit(Faulty, &rdev->flags) &&
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
- mddev, rdev->raid_disk)==0) {
+ mddev, rdev)==0) {
sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51c1d91..15a0346 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -428,7 +428,7 @@ struct md_personality
*/
void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
- int (*hot_remove_disk) (struct mddev *mddev, int number);
+ int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*spare_active) (struct mddev *mddev);
sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);
int (*resize) (struct mddev *mddev, sector_t sectors);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index d32c785..164f074 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -291,17 +291,16 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
return err;
}
-static int multipath_remove_disk(struct mddev *mddev, int number)
+static int multipath_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct mpconf *conf = mddev->private;
int err = 0;
- struct md_rdev *rdev;
+ int number = rdev->raid_disk;
struct multipath_info *p = conf->multipaths + number;
print_multipath_conf(conf);
- rdev = p->rdev;
- if (rdev) {
+ if (rdev == p->rdev) {
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
printk(KERN_ERR "hot-remove-disk, slot %d is identified"
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4602fc5..a52979b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1328,16 +1328,15 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
return err;
}
-static int raid1_remove_disk(struct mddev *mddev, int number)
+static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct r1conf *conf = mddev->private;
int err = 0;
- struct md_rdev *rdev;
+ int number = rdev->raid_disk;
struct mirror_info *p = conf->mirrors+ number;
print_conf(conf);
- rdev = p->rdev;
- if (rdev) {
+ if (rdev == p->rdev) {
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
err = -EBUSY;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 132c18e..aa3796d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1387,16 +1387,15 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
return err;
}
-static int raid10_remove_disk(struct mddev *mddev, int number)
+static int raid10_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct r10conf *conf = mddev->private;
int err = 0;
- struct md_rdev *rdev;
+ int number = rdev->raid_disk;
struct mirror_info *p = conf->mirrors+ number;
print_conf(conf);
- rdev = p->rdev;
- if (rdev) {
+ if (rdev == p->rdev) {
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
err = -EBUSY;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f6fe053..27056d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5020,16 +5020,15 @@ static int raid5_spare_active(struct mddev *mddev)
return count;
}
-static int raid5_remove_disk(struct mddev *mddev, int number)
+static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct r5conf *conf = mddev->private;
int err = 0;
- struct md_rdev *rdev;
+ int number = rdev->raid_disk;
struct disk_info *p = conf->disks + number;
print_raid5_conf(conf);
- rdev = p->rdev;
- if (rdev) {
+ if (rdev == p->rdev) {
if (number >= conf->raid_disks &&
conf->reshape_progress == MaxSector)
clear_bit(In_sync, &rdev->flags);
@@ -5355,7 +5354,7 @@ static void raid5_finish_reshape(struct mddev *mddev)
d < conf->raid_disks - mddev->delta_disks;
d++) {
struct md_rdev *rdev = conf->disks[d].rdev;
- if (rdev && raid5_remove_disk(mddev, d) == 0) {
+ if (rdev && raid5_remove_disk(mddev, rdev) == 0) {
sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
* [md PATCH 02/16] md: take a reference to mddev during sysfs access.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (2 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 04/16] md: change hot_remove_disk to take an rdev rather than a number NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 13/16] md/raid5: handle activation of replacement device when recovery completes NeilBrown
` (14 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.
The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.
To guard against this we:
- re-initialise mddev->all_mddevs on the last put (list_del_init), so
this state can be easily detected.
- in md_attr_show and md_attr_store, check ->all_mddevs under
all_mddevs_lock and mddev_get the mddev if it still appears to
be active.
This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.
rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection. They check that rdev->mddev still points to
mddev after taking mddev_lock. As this is cleared before delayed
removal, which can only be requested under the mddev_lock, this
ensures the rdev and mddev are still alive.
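The pattern can be sketched in user space (all names illustrative;
-16 stands in for -EBUSY): the liveness check and the reference
acquisition happen under one lock, so a concurrent final put cannot
free the object while the attribute handler runs.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative model of the md_attr_show/md_attr_store pattern. */
struct toy_mddev {
	int on_list;	/* stands in for !list_empty(&mddev->all_mddevs) */
	int refs;	/* stands in for the mddev refcount */
};

static pthread_mutex_t all_lock = PTHREAD_MUTEX_INITIALIZER;

static int toy_attr_access(struct toy_mddev *m)
{
	pthread_mutex_lock(&all_lock);
	if (!m->on_list) {		/* already being torn down */
		pthread_mutex_unlock(&all_lock);
		return -16;		/* -EBUSY */
	}
	m->refs++;			/* mddev_get() */
	pthread_mutex_unlock(&all_lock);

	/* ... safe to use *m here, under a further per-device lock ... */

	m->refs--;			/* mddev_put() */
	return 0;
}
```

If the object has already left the list, the access fails fast with
-EBUSY instead of racing with the delayed delete.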
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index fd68243..5905526 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -573,7 +573,7 @@ static void mddev_put(struct mddev *mddev)
mddev->ctime == 0 && !mddev->hold_active) {
/* Array is not configured at all, and not held active,
* so destroy it */
- list_del(&mddev->all_mddevs);
+ list_del_init(&mddev->all_mddevs);
bs = mddev->bio_set;
mddev->bio_set = NULL;
if (mddev->gendisk) {
@@ -4492,11 +4492,20 @@ md_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
if (!entry->show)
return -EIO;
+ spin_lock(&all_mddevs_lock);
+ if (list_empty(&mddev->all_mddevs)) {
+ spin_unlock(&all_mddevs_lock);
+ return -EBUSY;
+ }
+ mddev_get(mddev);
+ spin_unlock(&all_mddevs_lock);
+
rv = mddev_lock(mddev);
if (!rv) {
rv = entry->show(mddev, page);
mddev_unlock(mddev);
}
+ mddev_put(mddev);
return rv;
}
@@ -4512,11 +4521,19 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
return -EIO;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
+ spin_lock(&all_mddevs_lock);
+ if (list_empty(&mddev->all_mddevs)) {
+ spin_unlock(&all_mddevs_lock);
+ return -EBUSY;
+ }
+ mddev_get(mddev);
+ spin_unlock(&all_mddevs_lock);
rv = mddev_lock(mddev);
if (!rv) {
rv = entry->store(mddev, page, length);
mddev_unlock(mddev);
}
+ mddev_put(mddev);
return rv;
}
* [md PATCH 13/16] md/raid5: handle activation of replacement device when recovery completes.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (3 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 02/16] md: take a reference to mddev during sysfs access NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 08/16] md/raid5: remove redundant bio initialisations NeilBrown
` (13 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
When recovery completes, as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.
Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.
This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader may see the same device twice, but will never see no device at all.
Then the readers guard against using both devices, which could
only happen when writing.
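The reader-side discipline can be sketched as follows (illustrative
user-space code; __sync_synchronize() stands in for smp_mb(), and real
readers run under rcu_read_lock()). The writer publishes with
"p->rdev = p->replacement; barrier; p->replacement = NULL;", so a
reader that loads replacement first and rdev second may see
duplicates, but never two NULLs:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative slot holding an original and a replacement device. */
struct toy_slot { void *rdev, *replacement; };

/* Pick the device(s) to use, mirroring the ops_run_io ordering. */
static void *reader_pick(struct toy_slot *p, void **second)
{
	void *rrdev = p->replacement;
	__sync_synchronize();		/* stands in for smp_mb() */
	void *rdev = p->rdev;

	if (!rdev) {			/* replacement has moved down */
		rdev = rrdev;
		rrdev = NULL;
	}
	if (rdev == rrdev)		/* raced: saw duplicates */
		rrdev = NULL;
	*second = rrdev;
	return rdev;
}
```

Whichever intermediate state the reader observes, it always ends up
with at least one non-NULL device and never uses the same device
twice.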
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 61 insertions(+), 7 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 08f388c..6c22f09 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -519,13 +519,21 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_end_io = raid5_end_read_request;
rcu_read_lock();
- rdev = rcu_dereference(conf->disks[i].rdev);
rrdev = rcu_dereference(conf->disks[i].replacement);
+ smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */
+ rdev = rcu_dereference(conf->disks[i].rdev);
+ if (!rdev) {
+ rdev = rrdev;
+ rrdev = NULL;
+ }
if (rw & WRITE) {
if (replace_only)
rdev = NULL;
+ if (rdev == rrdev)
+ /* We raced and saw duplicates */
+ rrdev = NULL;
} else {
- if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ if (test_bit(R5_ReadRepl, &sh->dev[i].flags) && rrdev)
rdev = rrdev;
rrdev = NULL;
}
@@ -1627,7 +1635,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
int disks = sh->disks, i;
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
char b[BDEVNAME_SIZE];
- struct md_rdev *rdev;
+ struct md_rdev *rdev = NULL;
for (i=0 ; i<disks; i++)
@@ -1642,8 +1650,13 @@ static void raid5_end_read_request(struct bio * bi, int error)
return;
}
if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ /* If replacement finished while this request was outstanding,
+ * 'replacement' might be NULL already.
+ * In that case it moved down to 'rdev'.
+ * rdev is not removed until all requests are finished.
+ */
rdev = conf->disks[i].replacement;
- else
+ if (!rdev)
rdev = conf->disks[i].rdev;
if (uptodate) {
@@ -1740,7 +1753,13 @@ static void raid5_end_write_request(struct bio *bi, int error)
}
if (bi == &sh->dev[i].rreq) {
rdev = conf->disks[i].replacement;
- replacement = 1;
+ if (rdev)
+ replacement = 1;
+ else
+ /* rdev was removed and 'replacement'
+ * replaced it.
+ */
+ rdev = conf->disks[i].rdev;
break;
}
}
@@ -3515,6 +3534,9 @@ finish:
}
if (test_and_clear_bit(R5_MadeGoodRepl, &dev->flags)) {
rdev = conf->disks[i].replacement;
+ if (!rdev)
+ /* rdev has been moved down */
+ rdev = conf->disks[i].rdev;
rdev_clear_badblocks(rdev, sh->sector,
STRIPE_SECTORS);
rdev_dec_pending(rdev, conf->mddev);
@@ -5183,7 +5205,25 @@ static int raid5_spare_active(struct mddev *mddev)
for (i = 0; i < conf->raid_disks; i++) {
tmp = conf->disks + i;
- if (tmp->rdev
+ if (tmp->replacement
+ && tmp->replacement->recovery_offset == MaxSector
+ && !test_bit(Faulty, &tmp->replacement->flags)
+ && !test_and_set_bit(In_sync, &tmp->replacement->flags)) {
+ /* Replacement has just become active. */
+ if (!tmp->rdev
+ || !test_and_clear_bit(In_sync, &tmp->rdev->flags))
+ count++;
+ if (tmp->rdev) {
+ /* Replaced device not technically faulty,
+ * but we need to be sure it gets removed
+ * and never re-added.
+ */
+ set_bit(Faulty, &tmp->rdev->flags);
+ sysfs_notify_dirent_safe(
+ tmp->rdev->sysfs_state);
+ }
+ sysfs_notify_dirent_safe(tmp->replacement->sysfs_state);
+ } else if (tmp->rdev
&& tmp->rdev->recovery_offset == MaxSector
&& !test_bit(Faulty, &tmp->rdev->flags)
&& !test_and_set_bit(In_sync, &tmp->rdev->flags)) {
@@ -5229,6 +5269,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
if (!test_bit(Faulty, &rdev->flags) &&
mddev->recovery_disabled != conf->recovery_disabled &&
!has_failed(conf) &&
+ (!p->replacement || p->replacement == rdev) &&
number < conf->raid_disks) {
err = -EBUSY;
goto abort;
@@ -5239,7 +5280,20 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
/* lost the race, try later */
err = -EBUSY;
*rdevp = rdev;
- }
+ } else if (p->replacement) {
+ /* We must have just cleared 'rdev' */
+ p->rdev = p->replacement;
+ clear_bit(Replacement, &p->replacement->flags);
+ smp_mb(); /* Make sure other CPUs may see both as identical
+ * but will never see neither - if they are careful
+ */
+ p->replacement = NULL;
+ clear_bit(Replaceable, &rdev->flags);
+ } else
+ /* We might have just removed the Replacement as faulty-
+ * clear the bit just in case
+ */
+ clear_bit(Replaceable, &rdev->flags);
abort:
print_raid5_conf(conf);
* [md PATCH 08/16] md/raid5: remove redundant bio initialisations.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (4 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 13/16] md/raid5: handle activation of replacement device when recovery completes NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 11/16] md/raid5: writes should get directed to replacement as well as original NeilBrown
` (12 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
We currently initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.
Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 9 +--------
1 files changed, 1 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e6d10df..12e54c0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -562,10 +562,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
atomic_inc(&sh->count);
bi->bi_sector = sh->sector + rdev->data_offset;
bi->bi_flags = 1 << BIO_UPTODATE;
- bi->bi_vcnt = 1;
- bi->bi_max_vecs = 1;
bi->bi_idx = 0;
- bi->bi_io_vec = &sh->dev[i].vec;
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
bi->bi_io_vec[0].bv_offset = 0;
bi->bi_size = STRIPE_SIZE;
@@ -1708,12 +1705,8 @@ static void raid5_build_block(struct stripe_head *sh, int i, int previous)
dev->req.bi_io_vec = &dev->vec;
dev->req.bi_vcnt++;
dev->req.bi_max_vecs++;
- dev->vec.bv_page = dev->page;
- dev->vec.bv_len = STRIPE_SIZE;
- dev->vec.bv_offset = 0;
-
- dev->req.bi_sector = sh->sector;
dev->req.bi_private = sh;
+ dev->vec.bv_page = dev->page;
dev->flags = 0;
dev->sector = compute_blocknr(sh, i, previous);
* [md PATCH 11/16] md/raid5: writes should get directed to replacement as well as original.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (5 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 08/16] md/raid5: remove redundant bio initialisations NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 07/16] md/raid5: raid5.h cleanup NeilBrown
` (11 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.
If the write to the replacement results in a write error, we just fail
the device. We only try to record write errors to the original.
When writing for recovery, we shouldn't write to the original. This
will be addressed in a subsequent patch that generally addresses
recovery.
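The completion accounting can be sketched like this (user-space toy;
the flag names mirror R5_LOCKED/R5_DOUBLE_LOCKED, but the helpers are
illustrative): with a replacement present two writes are in flight,
and the stripe stays locked until the second one completes.

```c
#include <assert.h>

/* Illustrative flag values, mirroring R5_LOCKED/R5_DOUBLE_LOCKED. */
enum { LOCKED = 1, DOUBLE_LOCKED = 2 };

/* Submitting a write: one bio to the original, and if a replacement
 * exists a second bio, recorded via DOUBLE_LOCKED. */
static void submit(unsigned *flags, int has_replacement)
{
	*flags |= LOCKED;
	if (has_replacement)
		*flags |= DOUBLE_LOCKED;
}

/* Write completion: the first of two completions clears only
 * DOUBLE_LOCKED; the last one clears LOCKED. */
static void write_done(unsigned *flags)
{
	if (*flags & DOUBLE_LOCKED)
		*flags &= ~DOUBLE_LOCKED;	/* one write still pending */
	else
		*flags &= ~LOCKED;		/* last write: unlocked */
}
```

Without the extra flag, the first completion would unlock the stripe
while the second write was still outstanding.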
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 119 ++++++++++++++++++++++++++++++++++++++++++----------
drivers/md/raid5.h | 1
2 files changed, 97 insertions(+), 23 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 20f42f6..4ba4b8c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -491,8 +491,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
for (i = disks; i--; ) {
int rw;
- struct bio *bi;
- struct md_rdev *rdev;
+ struct bio *bi, *rbi;
+ struct md_rdev *rdev, *rrdev = NULL;
if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
rw = WRITE_FUA;
@@ -504,27 +504,36 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
continue;
bi = &sh->dev[i].req;
+ rbi = &sh->dev[i].rreq; /* For writing to replacement */
bi->bi_rw = rw;
- if (rw & WRITE)
+ rbi->bi_rw = rw;
+ if (rw & WRITE) {
bi->bi_end_io = raid5_end_write_request;
- else
+ rbi->bi_end_io = raid5_end_write_request;
+ } else
bi->bi_end_io = raid5_end_read_request;
rcu_read_lock();
- if (rw == READ &&
- test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ rdev = rcu_dereference(conf->disks[i].rdev);
+ if (rw & WRITE)
+ rrdev = rcu_dereference(conf->disks[i].replacement);
+ else if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
rdev = rcu_dereference(conf->disks[i].replacement);
- else
- rdev = rcu_dereference(conf->disks[i].rdev);
+
if (rdev && test_bit(Faulty, &rdev->flags))
rdev = NULL;
if (rdev)
atomic_inc(&rdev->nr_pending);
+ if (rrdev && test_bit(Faulty, &rrdev->flags))
+ rrdev = NULL;
+ if (rrdev)
+ atomic_inc(&rrdev->nr_pending);
rcu_read_unlock();
/* We have already checked bad blocks for reads. Now
- * need to check for writes.
+ * need to check for writes. We never accept write errors
+ * on the replacement, so we don't need to check rrdev.
*/
while ((rw & WRITE) && rdev &&
test_bit(WriteErrorSeen, &rdev->flags)) {
@@ -571,8 +580,32 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_io_vec[0].bv_offset = 0;
bi->bi_size = STRIPE_SIZE;
bi->bi_next = NULL;
+ if (rrdev)
+ set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags);
generic_make_request(bi);
- } else {
+ }
+ if (rrdev) {
+ if (s->syncing || s->expanding || s->expanded)
+ md_sync_acct(rrdev->bdev, STRIPE_SECTORS);
+
+ set_bit(STRIPE_IO_STARTED, &sh->state);
+
+ rbi->bi_bdev = rrdev->bdev;
+ pr_debug("%s: for %llu schedule op %ld on "
+ "replacement disc %d\n",
+ __func__, (unsigned long long)sh->sector,
+ rbi->bi_rw, i);
+ atomic_inc(&sh->count);
+ rbi->bi_sector = sh->sector + rrdev->data_offset;
+ rbi->bi_flags = 1 << BIO_UPTODATE;
+ rbi->bi_idx = 0;
+ rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
+ rbi->bi_io_vec[0].bv_offset = 0;
+ rbi->bi_size = STRIPE_SIZE;
+ rbi->bi_next = NULL;
+ generic_make_request(rbi);
+ }
+ if (!rdev && !rrdev) {
if (rw & WRITE)
set_bit(STRIPE_DEGRADED, &sh->state);
pr_debug("skip op %ld on disc %d for sector %llu\n",
@@ -1683,14 +1716,23 @@ static void raid5_end_write_request(struct bio *bi, int error)
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
int disks = sh->disks, i;
+ struct md_rdev *uninitialized_var(rdev);
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
sector_t first_bad;
int bad_sectors;
+ int replacement = 0;
- for (i=0 ; i<disks; i++)
- if (bi == &sh->dev[i].req)
+ for (i = 0 ; i < disks; i++) {
+ if (bi == &sh->dev[i].req) {
+ rdev = conf->disks[i].rdev;
break;
-
+ }
+ if (bi == &sh->dev[i].rreq) {
+ rdev = conf->disks[i].replacement;
+ replacement = 1;
+ break;
+ }
+ }
pr_debug("end_write_request %llu/%d, count %d, uptodate: %d.\n",
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
@@ -1699,21 +1741,30 @@ static void raid5_end_write_request(struct bio *bi, int error)
return;
}
- if (!uptodate) {
- set_bit(WriteErrorSeen, &conf->disks[i].rdev->flags);
- set_bit(R5_WriteError, &sh->dev[i].flags);
- } else if (is_badblock(conf->disks[i].rdev, sh->sector, STRIPE_SECTORS,
- &first_bad, &bad_sectors))
- set_bit(R5_MadeGood, &sh->dev[i].flags);
+ if (replacement) {
+ if (!uptodate)
+ md_error(conf->mddev, rdev);
+ else if (is_badblock(rdev, sh->sector,
+ STRIPE_SECTORS,
+ &first_bad, &bad_sectors))
+ set_bit(R5_MadeGoodRepl, &sh->dev[i].flags);
+ } else {
+ if (!uptodate) {
+ set_bit(WriteErrorSeen, &rdev->flags);
+ set_bit(R5_WriteError, &sh->dev[i].flags);
+ } else if (is_badblock(rdev, sh->sector,
+ STRIPE_SECTORS,
+ &first_bad, &bad_sectors))
+ set_bit(R5_MadeGood, &sh->dev[i].flags);
+ }
+ rdev_dec_pending(rdev, conf->mddev);
- rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
-
- clear_bit(R5_LOCKED, &sh->dev[i].flags);
+ if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
+ clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
}
-
static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous);
static void raid5_build_block(struct stripe_head *sh, int i, int previous)
@@ -1727,6 +1778,13 @@ static void raid5_build_block(struct stripe_head *sh, int i, int previous)
dev->req.bi_private = sh;
dev->vec.bv_page = dev->page;
+ bio_init(&dev->rreq);
+ dev->rreq.bi_io_vec = &dev->rvec;
+ dev->rreq.bi_vcnt++;
+ dev->rreq.bi_max_vecs++;
+ dev->rreq.bi_private = sh;
+ dev->rvec.bv_page = dev->page;
+
dev->flags = 0;
dev->sector = compute_blocknr(sh, i, previous);
}
@@ -3115,6 +3173,15 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
} else
clear_bit(R5_MadeGood, &dev->flags);
}
+ if (test_bit(R5_MadeGoodRepl, &dev->flags)) {
+ struct md_rdev *rdev2 = rcu_dereference(
+ conf->disks[i].replacement);
+ if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
+ s->handle_bad_blocks = 1;
+ atomic_inc(&rdev2->nr_pending);
+ } else
+ clear_bit(R5_MadeGoodRepl, &dev->flags);
+ }
if (!test_bit(R5_Insync, &dev->flags)) {
/* The ReadError flag will just be confusing now */
clear_bit(R5_ReadError, &dev->flags);
@@ -3383,6 +3450,12 @@ finish:
STRIPE_SECTORS);
rdev_dec_pending(rdev, conf->mddev);
}
+ if (test_and_clear_bit(R5_MadeGoodRepl, &dev->flags)) {
+ rdev = conf->disks[i].replacement;
+ rdev_clear_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS);
+ rdev_dec_pending(rdev, conf->mddev);
+ }
}
if (s.ops_request)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4cfd801..f6faaa1 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -259,6 +259,7 @@ struct stripe_head_state {
enum r5dev_flags {
R5_UPTODATE, /* page contains current data */
R5_LOCKED, /* IO has been submitted on "req" */
+ R5_DOUBLE_LOCKED,/* Cannot clear R5_LOCKED until 2 writes complete */
R5_OVERWRITE, /* towrite covers whole page */
/* and some that are internal to handle_stripe */
R5_Insync, /* rdev && rdev->in_sync at start */
* [md PATCH 07/16] md/raid5: raid5.h cleanup
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (6 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 11/16] md/raid5: writes should get directed to replacement as well as original NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 14/16] md/raid5: recognise replacements when assembling array NeilBrown
` (10 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
Remove some #defines that are no longer used, and replace some
others with an enum.
And remove an unused field.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.h | 27 +++++++++------------------
1 files changed, 9 insertions(+), 18 deletions(-)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 43106f0..4cfd801 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -27,7 +27,7 @@
* The possible state transitions are:
*
* Empty -> Want - on read or write to get old data for parity calc
- * Empty -> Dirty - on compute_parity to satisfy write/sync request.(RECONSTRUCT_WRITE)
+ * Empty -> Dirty - on compute_parity to satisfy write/sync request.
* Empty -> Clean - on compute_block when computing a block for failed drive
* Want -> Empty - on failed read
* Want -> Clean - on successful completion of read request
@@ -284,15 +284,6 @@ enum r5dev_flags {
R5_MadeGoodRepl,/* A bad block on the replacement device has been
* fixed by writing to it */
};
-/*
- * Write method
- */
-#define RECONSTRUCT_WRITE 1
-#define READ_MODIFY_WRITE 2
-/* not a write method, but a compute_parity mode */
-#define CHECK_PARITY 3
-/* Additional compute_parity mode -- updates the parity w/o LOCKING */
-#define UPDATE_PARITY 4
/*
* Stripe state
@@ -320,13 +311,14 @@ enum {
/*
* Operation request flags
*/
-#define STRIPE_OP_BIOFILL 0
-#define STRIPE_OP_COMPUTE_BLK 1
-#define STRIPE_OP_PREXOR 2
-#define STRIPE_OP_BIODRAIN 3
-#define STRIPE_OP_RECONSTRUCT 4
-#define STRIPE_OP_CHECK 5
-
+enum {
+ STRIPE_OP_BIOFILL,
+ STRIPE_OP_COMPUTE_BLK,
+ STRIPE_OP_PREXOR,
+ STRIPE_OP_BIODRAIN,
+ STRIPE_OP_RECONSTRUCT,
+ STRIPE_OP_CHECK,
+};
/*
* Plugging:
*
@@ -359,7 +351,6 @@ struct disk_info {
struct r5conf {
struct hlist_head *stripe_hashtbl;
struct mddev *mddev;
- struct disk_info *spare;
int chunk_sectors;
int level, algorithm;
int max_degraded;
* [md PATCH 14/16] md/raid5: recognise replacements when assembling array.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (7 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 07/16] md/raid5: raid5.h cleanup NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 05/16] md: create externally visible flags for supporting hot-replace NeilBrown
` (9 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
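The filing rule can be sketched as follows (illustrative model of the
logic added to setup_conf, not the kernel's code): each slot accepts
at most one normal device and one replacement, and a second of either
kind aborts assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative slot: one original device plus one replacement. */
struct toy_slot { void *rdev, *replacement; };

static int toy_file(struct toy_slot *s, void *dev, int is_replacement)
{
	void **p = is_replacement ? &s->replacement : &s->rdev;

	if (*p)
		return -1;	/* duplicate for this slot: abort */
	*p = dev;
	return 0;
}
```

This keeps assembly unambiguous: a device carrying the Replacement
flag can never silently displace an original, and vice versa.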
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6c22f09..efe3679 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4824,7 +4824,15 @@ static struct r5conf *setup_conf(struct mddev *mddev)
continue;
disk = conf->disks + raid_disk;
- disk->rdev = rdev;
+ if (test_bit(Replacement, &rdev->flags)) {
+ if (disk->replacement)
+ goto abort;
+ disk->replacement = rdev;
+ } else {
+ if (disk->rdev)
+ goto abort;
+ disk->rdev = rdev;
+ }
if (test_bit(In_sync, &rdev->flags)) {
char b[BDEVNAME_SIZE];
* [md PATCH 05/16] md: create externally visible flags for supporting hot-replace.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (8 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 14/16] md/raid5: recognise replacements when assembling array NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 06/16] md/raid5: allow each slot to have an extra replacement device NeilBrown
` (8 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.
With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition. It can also be used when the original
device is still reliable but is not wanted for some reason.
This will eventually be supported in RAID4/5/6 and RAID10.
This patch adds a super-block flag to distinguish the replacement
device. If an old kernel sees this flag it will reject the device.
It also adds two per-device flags which are viewable and settable via
sysfs.
"replaceable" can be set to request that a device be replaced.
"replacement" is set to show that this device is replacing another
device.
The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement. We currently don't make links for the
replacement - there doesn't seem to be a need.
Signed-off-by: NeilBrown <neilb@suse.de>
---
Documentation/md.txt | 22 ++++++++++--
drivers/md/md.c | 64 +++++++++++++++++++++++++++++++++++-
drivers/md/md.h | 80 +++++++++++++++++++++++++++------------------
include/linux/raid/md_p.h | 7 +++-
4 files changed, 134 insertions(+), 39 deletions(-)
diff --git a/Documentation/md.txt b/Documentation/md.txt
index fc94770..9c47a1f 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -357,14 +357,14 @@ Each directory contains:
written to, that device.
state
- A file recording the current state of the device in the array
+ A file recording the current state of the device in the array
which can be a comma separated list of
faulty - device has been kicked from active use due to
- a detected fault or it has unacknowledged bad
- blocks
+ a detected fault, or it has unacknowledged bad
+ blocks
in_sync - device is a fully in-sync member of the array
writemostly - device will only be subject to read
- requests if there are no other options.
+ requests if there are no other options.
This applies only to raid1 arrays.
blocked - device has failed, and the failure hasn't been
acknowledged yet by the metadata handler.
@@ -374,6 +374,13 @@ Each directory contains:
This includes spares that are in the process
of being recovered to
write_error - device has ever seen a write error.
+ replaceable - device is (mostly) working but probably
+ should be replaced, either due to errors or
+ due to user request.
+ replacement - device is a replacement for another active
+ device with same raid_disk.
+
+
This list may grow in future.
This can be written to.
Writing "faulty" simulates a failure on the device.
@@ -386,6 +393,13 @@ Each directory contains:
Writing "in_sync" sets the in_sync flag.
Writing "write_error" sets writeerrorseen flag.
Writing "-write_error" clears writeerrorseen flag.
+ Writing "replaceable" is allowed at any time except to a
+ replacement device. It sets the flag.
+ Writing "-replaceable" is allowed at any time. It clears
+ the flag.
+ Writing "replacement" or "-replacement" is only allowed before
+ starting the array. It sets or clears the flag.
+
This file responds to select/poll. Any change to 'faulty'
or 'blocked' causes an event.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 182056a..72a07d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1717,6 +1717,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
}
if (sb->devflags & WriteMostly1)
set_bit(WriteMostly, &rdev->flags);
+ if (le32_to_cpu(sb->feature_map) & MD_FEATURE_REPLACEMENT)
+ set_bit(Replacement, &rdev->flags);
} else /* MULTIPATH are always insync */
set_bit(In_sync, &rdev->flags);
@@ -1770,6 +1772,9 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->recovery_offset =
cpu_to_le64(rdev->recovery_offset);
}
+ if (test_bit(Replacement, &rdev->flags))
+ sb->feature_map |=
+ cpu_to_le32(MD_FEATURE_REPLACEMENT);
if (mddev->reshape_position != MaxSector) {
sb->feature_map |= cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE);
@@ -2562,6 +2567,15 @@ state_show(struct md_rdev *rdev, char *page)
len += sprintf(page+len, "%swrite_error", sep);
sep = ",";
}
+ if (test_bit(Replaceable, &rdev->flags)) {
+ len += sprintf(page+len, "%sreplaceable", sep);
+ sep = ",";
+ }
+ if (test_bit(Replacement, &rdev->flags)) {
+ len += sprintf(page+len, "%sreplacement", sep);
+ sep = ",";
+ }
+
return len+sprintf(page+len, "\n");
}
@@ -2580,6 +2594,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
* -write_error - clears WriteErrorSeen
*/
int err = -EINVAL;
+ struct md_rdev *rdev2;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
md_error(rdev->mddev, rdev);
if (test_bit(Faulty, &rdev->flags))
@@ -2630,6 +2645,50 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
} else if (cmd_match(buf, "-write_error")) {
clear_bit(WriteErrorSeen, &rdev->flags);
err = 0;
+ } else if (cmd_match(buf, "replaceable")) {
+ /* Any device that is not a replacement can become
+ * replaceable at any time, but we then need to
+ * check if recovery is needed.
+ */
+ if (!test_bit(Replacement, &rdev->flags))
+ set_bit(Replaceable, &rdev->flags);
+ set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+ md_wakeup_thread(rdev->mddev->thread);
+ err = 0;
+ } else if (cmd_match(buf, "-replaceable")) {
+ /* Clearing 'replaceable' is only allowed when there
+ * is no recovery happening, and when there is no
+ * replacement in place
+ */
+ err = 0;
+ if (test_bit(MD_RECOVERY_RUNNING, &rdev->mddev->recovery))
+ err = -EBUSY;
+ else list_for_each_entry(rdev2, &rdev->mddev->disks, same_set)
+ if (rdev2 != rdev &&
+ rdev2->raid_disk == rdev->raid_disk)
+ /* Must be a replacement - remove it first */
+ err = -EBUSY;
+ if (!err)
+ clear_bit(Replaceable, &rdev->flags);
+ } else if (cmd_match(buf, "replacement")) {
+ /* Can only set a device as a replacement when array has not
+ * yet been started. Once running, replacement is automatic
+ * from spares, or by assigning 'slot'.
+ */
+ if (rdev->mddev->pers)
+ err = -EBUSY;
+ else {
+ set_bit(Replacement, &rdev->flags);
+ err = 0;
+ }
+ } else if (cmd_match(buf, "-replacement")) {
+ /* Similarly, can only clear Replacement before start */
+ if (rdev->mddev->pers)
+ err = -EBUSY;
+ else {
+ clear_bit(Replacement, &rdev->flags);
+ err = 0;
+ }
}
if (!err)
sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -6712,8 +6771,11 @@ static int md_seq_show(struct seq_file *seq, void *v)
if (test_bit(Faulty, &rdev->flags)) {
seq_printf(seq, "(F)");
continue;
- } else if (rdev->raid_disk < 0)
+ }
+ if (rdev->raid_disk < 0)
seq_printf(seq, "(S)"); /* spare */
+ if (test_bit(Replacement, &rdev->flags))
+ seq_printf(seq, "(R)");
sectors += rdev->sectors;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 15a0346..2c59e73 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -72,34 +72,7 @@ struct md_rdev {
* This reduces the burden of testing multiple flags in many cases
*/
- unsigned long flags;
-#define Faulty 1 /* device is known to have a fault */
-#define In_sync 2 /* device is in_sync with rest of array */
-#define WriteMostly 4 /* Avoid reading if at all possible */
-#define AutoDetected 7 /* added by auto-detect */
-#define Blocked 8 /* An error occurred but has not yet
- * been acknowledged by the metadata
- * handler, so don't allow writes
- * until it is cleared */
-#define WriteErrorSeen 9 /* A write error has been seen on this
- * device
- */
-#define FaultRecorded 10 /* Intermediate state for clearing
- * Blocked. The Fault is/will-be
- * recorded in the metadata, but that
- * metadata hasn't been stored safely
- * on disk yet.
- */
-#define BlockedBadBlocks 11 /* A writer is blocked because they
- * found an unacknowledged bad-block.
- * This can safely be cleared at any
- * time, and the writer will re-check.
- * It may be set at any time, and at
- * worst the writer will timeout and
- * re-check. So setting it as
- * accurately as possible is good, but
- * not absolutely critical.
- */
+ unsigned long flags; /* bit set of 'enum flag_bits' bits. */
wait_queue_head_t blocked_wait;
int desc_nr; /* descriptor index in the superblock */
@@ -152,6 +125,44 @@ struct md_rdev {
sector_t size; /* in sectors */
} badblocks;
};
+enum flag_bits {
+ Faulty, /* device is known to have a fault */
+ In_sync, /* device is in_sync with rest of array */
+ WriteMostly, /* Avoid reading if at all possible */
+ AutoDetected, /* added by auto-detect */
+ Blocked, /* An error occurred but has not yet
+ * been acknowledged by the metadata
+ * handler, so don't allow writes
+ * until it is cleared */
+ WriteErrorSeen, /* A write error has been seen on this
+ * device
+ */
+ FaultRecorded, /* Intermediate state for clearing
+ * Blocked. The Fault is/will-be
+ * recorded in the metadata, but that
+ * metadata hasn't been stored safely
+ * on disk yet.
+ */
+ BlockedBadBlocks, /* A writer is blocked because they
+ * found an unacknowledged bad-block.
+ * This can safely be cleared at any
+ * time, and the writer will re-check.
+ * It may be set at any time, and at
+ * worst the writer will timeout and
+ * re-check. So setting it as
+ * accurately as possible is good, but
+ * not absolutely critical.
+ */
+ Replaceable, /* This device is a candidate to be
+ * hot-replaced, either because it has
+ * reported some faults, or because
+ * of explicit request.
+ */
+ Replacement, /* This device is a replacement for
+ * a replaceable device with same
+ * raid_disk number.
+ */
+};
#define BB_LEN_MASK (0x00000000000001FFULL)
#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
@@ -482,15 +493,20 @@ static inline char * mdname (struct mddev * mddev)
static inline int sysfs_link_rdev(struct mddev *mddev, struct md_rdev *rdev)
{
char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+ if (!test_bit(Replacement, &rdev->flags)) {
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+ } else
+ return 0;
}
static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
{
char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ if (!test_bit(Replacement, &rdev->flags)) {
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ sysfs_remove_link(&mddev->kobj, nm);
+ }
}
/*
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 9e65d9e..6f6df86 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -277,7 +277,10 @@ struct mdp_superblock_1 {
*/
#define MD_FEATURE_RESHAPE_ACTIVE 4
#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */
-
-#define MD_FEATURE_ALL (1|2|4|8)
+#define MD_FEATURE_REPLACEMENT 16 /* This device is replacing an
+ * active device with same 'role'.
+ * 'recovery_offset' is also set.
+ */
+#define MD_FEATURE_ALL (1|2|4|8|16)
#endif
* [md PATCH 06/16] md/raid5: allow each slot to have an extra replacement device
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (9 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 05/16] md: create externally visible flags for supporting hot-replace NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 12/16] md/raid5: detect and handle replacements during recovery NeilBrown
` (7 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
Just enhance data structures to record a second device per slot to be
used as a 'replacement' device, replacing the original.
We also have a second bio in each slot in each stripe_head. This will
only be used when writing to the array - we need to write to both the
original and the replacement at the same time, so will need two bios.
For now, only try using the replacement drive for aligned-reads.
In this case, we prefer the replacement if it has been recovered far
enough, otherwise use the original.
This includes a small enhancement. Previously we would only do
aligned reads if the target device was fully recovered. Now we also
do them if it has recovered far enough.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 15 ++++++++++++--
drivers/md/raid5.h | 57 ++++++++++++++++++++++++++++++----------------------
2 files changed, 46 insertions(+), 26 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 27056d3..e6d10df 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3573,6 +3573,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
int dd_idx;
struct bio* align_bi;
struct md_rdev *rdev;
+ sector_t end_sector;
if (!in_chunk_boundary(mddev, raid_bio)) {
pr_debug("chunk_aligned_read : non aligned\n");
@@ -3597,9 +3598,19 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
0,
&dd_idx, NULL);
+ end_sector = align_bi->bi_sector + (align_bi->bi_size >>9);
rcu_read_lock();
- rdev = rcu_dereference(conf->disks[dd_idx].rdev);
- if (rdev && test_bit(In_sync, &rdev->flags)) {
+ rdev = rcu_dereference(conf->disks[dd_idx].replacement);
+ if (!rdev || test_bit(Faulty, &rdev->flags) ||
+ rdev->recovery_offset < end_sector) {
+ rdev = rcu_dereference(conf->disks[dd_idx].rdev);
+ if (rdev &&
+ (test_bit(Faulty, &rdev->flags) ||
+ !(test_bit(In_sync, &rdev->flags) ||
+ rdev->recovery_offset >= end_sector)))
+ rdev = NULL;
+ }
+ if (rdev) {
sector_t first_bad;
int bad_sectors;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index e10c553..43106f0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -226,8 +226,11 @@ struct stripe_head {
#endif
} ops;
struct r5dev {
- struct bio req;
- struct bio_vec vec;
+ /* rreq and rvec are used for the replacement device when
+ * writing data to both devices.
+ */
+ struct bio req, rreq;
+ struct bio_vec vec, rvec;
struct page *page;
struct bio *toread, *read, *towrite, *written;
sector_t sector; /* sector of this page */
@@ -252,29 +255,35 @@ struct stripe_head_state {
int handle_bad_blocks;
};
-/* Flags */
-#define R5_UPTODATE 0 /* page contains current data */
-#define R5_LOCKED 1 /* IO has been submitted on "req" */
-#define R5_OVERWRITE 2 /* towrite covers whole page */
+/* Flags for struct r5dev.flags */
+enum r5dev_flags {
+ R5_UPTODATE, /* page contains current data */
+ R5_LOCKED, /* IO has been submitted on "req" */
+ R5_OVERWRITE, /* towrite covers whole page */
/* and some that are internal to handle_stripe */
-#define R5_Insync 3 /* rdev && rdev->in_sync at start */
-#define R5_Wantread 4 /* want to schedule a read */
-#define R5_Wantwrite 5
-#define R5_Overlap 7 /* There is a pending overlapping request on this block */
-#define R5_ReadError 8 /* seen a read error here recently */
-#define R5_ReWrite 9 /* have tried to over-write the readerror */
+ R5_Insync, /* rdev && rdev->in_sync at start */
+ R5_Wantread, /* want to schedule a read */
+ R5_Wantwrite,
+ R5_Overlap, /* There is a pending overlapping request
+ * on this block */
+ R5_ReadError, /* seen a read error here recently */
+ R5_ReWrite, /* have tried to over-write the readerror */
-#define R5_Expanded 10 /* This block now has post-expand data */
-#define R5_Wantcompute 11 /* compute_block in progress treat as
- * uptodate
- */
-#define R5_Wantfill 12 /* dev->toread contains a bio that needs
- * filling
- */
-#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
-#define R5_WantFUA 14 /* Write should be FUA */
-#define R5_WriteError 15 /* got a write error - need to record it */
-#define R5_MadeGood 16 /* A bad block has been fixed by writing to it*/
+ R5_Expanded, /* This block now has post-expand data */
+ R5_Wantcompute, /* compute_block in progress treat as
+ * uptodate
+ */
+ R5_Wantfill, /* dev->toread contains a bio that needs
+ * filling
+ */
+ R5_Wantdrain, /* dev->towrite needs to be drained */
+ R5_WantFUA, /* Write should be FUA */
+ R5_WriteError, /* got a write error - need to record it */
+ R5_MadeGood, /* A bad block has been fixed by writing to it */
+ R5_ReadRepl, /* Will/did read from replacement rather than orig */
+ R5_MadeGoodRepl,/* A bad block on the replacement device has been
+ * fixed by writing to it */
+};
/*
* Write method
*/
@@ -344,7 +353,7 @@ enum {
struct disk_info {
- struct md_rdev *rdev;
+ struct md_rdev *rdev, *replacement;
};
struct r5conf {
* [md PATCH 12/16] md/raid5: detect and handle replacements during recovery.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (10 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 06/16] md/raid5: allow each slot to have an extra replacement device NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 09/16] md/raid5: preferentially read from replacement device if possible NeilBrown
` (6 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
During recovery we want to write to the replacement but not
the original. So we have two new flags
- R5_NeedReplace if this stripe has a replacement that needs to
be written at some stage
- R5_WantReplace if NeedReplace, and the data is available, and
a 'sync' has been requested on this stripe.
We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 120 +++++++++++++++++++++++++++++++++++++++-------------
drivers/md/raid5.h | 13 +++++-
2 files changed, 103 insertions(+), 30 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4ba4b8c..08f388c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -491,6 +491,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
for (i = disks; i--; ) {
int rw;
+ int replace_only = 0;
struct bio *bi, *rbi;
struct md_rdev *rdev, *rrdev = NULL;
if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
@@ -500,7 +501,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
rw = WRITE;
} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
rw = READ;
- else
+ else if (test_and_clear_bit(R5_WantReplace, &sh->dev[i].flags)) {
+ rw = WRITE;
+ replace_only = 1;
+ } else
continue;
bi = &sh->dev[i].req;
@@ -516,10 +520,15 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
rcu_read_lock();
rdev = rcu_dereference(conf->disks[i].rdev);
- if (rw & WRITE)
- rrdev = rcu_dereference(conf->disks[i].replacement);
- else if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
- rdev = rcu_dereference(conf->disks[i].replacement);
+ rrdev = rcu_dereference(conf->disks[i].replacement);
+ if (rw & WRITE) {
+ if (replace_only)
+ rdev = NULL;
+ } else {
+ if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ rdev = rrdev;
+ rrdev = NULL;
+ }
if (rdev && test_bit(Faulty, &rdev->flags))
rdev = NULL;
@@ -563,7 +572,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
}
if (rdev) {
- if (s->syncing || s->expanding || s->expanded)
+ if (s->syncing || s->expanding || s->expanded
+ || s->replacing)
md_sync_acct(rdev->bdev, STRIPE_SECTORS);
set_bit(STRIPE_IO_STARTED, &sh->state);
@@ -585,7 +595,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
generic_make_request(bi);
}
if (rrdev) {
- if (s->syncing || s->expanding || s->expanded)
+ if (s->syncing || s->expanding || s->expanded
+ || s->replacing)
md_sync_acct(rrdev->bdev, STRIPE_SECTORS);
set_bit(STRIPE_IO_STARTED, &sh->state);
@@ -2431,8 +2442,9 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
clear_bit(STRIPE_SYNCING, &sh->state);
s->syncing = 0;
+ s->replacing = 0;
/* There is nothing more to do for sync/check/repair.
- * For recover we need to record a bad block on all
+ * For recover/replace we need to record a bad block on all
* non-sync devices, or abort the recovery
*/
if (!test_bit(MD_RECOVERY_RECOVER, &conf->mddev->recovery))
@@ -2442,12 +2454,18 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
*/
for (i = 0; i < conf->raid_disks; i++) {
struct md_rdev *rdev = conf->disks[i].rdev;
- if (!rdev
- || test_bit(Faulty, &rdev->flags)
- || test_bit(In_sync, &rdev->flags))
- continue;
- if (!rdev_set_badblocks(rdev, sh->sector,
- STRIPE_SECTORS, 0))
+ if (rdev
+ && !test_bit(Faulty, &rdev->flags)
+ && !test_bit(In_sync, &rdev->flags)
+ && !rdev_set_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS, 0))
+ abort = 1;
+ rdev = conf->disks[i].replacement;
+ if (rdev
+ && !test_bit(Faulty, &rdev->flags)
+ && !test_bit(In_sync, &rdev->flags)
+ && !rdev_set_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS, 0))
abort = 1;
}
if (abort) {
@@ -2456,6 +2474,21 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
}
}
+static int want_replace(struct stripe_head *sh, int disk_idx)
+{
+ struct md_rdev *rdev;
+ int rv = 0;
+ /* Doing recovery so rcu locking not required */
+ rdev = sh->raid_conf->disks[disk_idx].replacement;
+ if (rdev &&
+ !test_bit(Faulty, &rdev->flags) &&
+ !test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset <= sh->sector)
+ rv = 1;
+
+ return rv;
+}
+
/* fetch_block - checks the given member device to see if its data needs
* to be read or computed to satisfy a request.
*
@@ -2475,6 +2508,7 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
(dev->toread ||
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
s->syncing || s->expanding ||
+ (s->replacing && want_replace(sh, disk_idx)) ||
(s->failed >= 1 && fdev[0]->toread) ||
(s->failed >= 2 && fdev[1]->toread) ||
(sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
@@ -3028,22 +3062,18 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
}
}
-
/*
* handle_stripe - do things to a stripe.
*
- * We lock the stripe and then examine the state of various bits
- * to see what needs to be done.
+ * We lock the stripe by setting STRIPE_ACTIVE and then examine the
+ * state of various bits to see what needs to be done.
* Possible results:
- * return some read request which now have data
- * return some write requests which are safely on disc
+ * return some read requests which now have data
+ * return some write requests which are safely on storage
* schedule a read on some buffers
* schedule a write of some buffers
* return confirmation of parity correctness
*
- * buffers are taken off read_list or write_list, and bh_cache buffers
- * get BH_Lock set before the stripe lock is released.
- *
*/
static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
@@ -3052,10 +3082,10 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
int disks = sh->disks;
struct r5dev *dev;
int i;
+ int do_recovery = 0;
memset(s, 0, sizeof(*s));
- s->syncing = test_bit(STRIPE_SYNCING, &sh->state);
s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
s->failed_num[0] = -1;
@@ -3073,7 +3103,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
dev = &sh->dev[i];
pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
- i, dev->flags, dev->toread, dev->towrite, dev->written);
+ i, dev->flags, dev->toread, dev->towrite, dev->written);
/* maybe we can reply to a read
*
* new wantfill requests are only permitted while
@@ -3114,6 +3144,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
&first_bad, &bad_sectors))
set_bit(R5_ReadRepl, &dev->flags);
else {
+ if (rdev)
+ set_bit(R5_NeedReplace, &dev->flags);
rdev = rcu_dereference(conf->disks[i].rdev);
clear_bit(R5_ReadRepl, &dev->flags);
}
@@ -3193,9 +3225,25 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
if (s->failed < 2)
s->failed_num[s->failed] = i;
s->failed++;
+ if (rdev && !test_bit(Faulty, &rdev->flags))
+ do_recovery = 1;
}
}
spin_unlock_irq(&conf->device_lock);
+ if (test_bit(STRIPE_SYNCING, &sh->state)) {
+ /* If there is a failed device being replaced,
+ * we must be recovering.
+ * else if we are after recovery_cp, we must be syncing
+ * else we can only be replacing
+ * sync and recovery both need to read all devices, and so
+ * use the same flag.
+ */
+ if (do_recovery ||
+ sh->sector >= conf->mddev->recovery_cp)
+ s->syncing = 1;
+ else
+ s->replacing = 1;
+ }
rcu_read_unlock();
}
@@ -3237,7 +3285,7 @@ static void handle_stripe(struct stripe_head *sh)
if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
- s.to_write || s.written) {
+ s.replacing || s.to_write || s.written) {
set_bit(STRIPE_HANDLE, &sh->state);
goto finish;
}
@@ -3260,7 +3308,7 @@ static void handle_stripe(struct stripe_head *sh)
*/
if (s.failed > conf->max_degraded && s.to_read+s.to_write+s.written)
handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
- if (s.failed > conf->max_degraded && s.syncing)
+ if (s.failed > conf->max_degraded && s.syncing + s.replacing)
handle_failed_sync(conf, sh, &s);
/*
@@ -3290,7 +3338,9 @@ static void handle_stripe(struct stripe_head *sh)
*/
if (s.to_read || s.non_overwrite
|| (conf->level == 6 && s.to_write && s.failed)
- || (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
+ || (s.syncing && (s.uptodate + s.compute < disks))
+ || s.replacing
+ || s.expanding)
handle_stripe_fill(sh, &s, disks);
/* Now we check to see if any write operations have recently
@@ -3352,7 +3402,20 @@ static void handle_stripe(struct stripe_head *sh)
handle_parity_checks5(conf, sh, &s, disks);
}
- if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ if (s.replacing && s.locked == 0
+ && !test_bit(STRIPE_INSYNC, &sh->state)) {
+ /* Write out to replacement devices where possible */
+ for (i = 0; i < conf->raid_disks; i++)
+ if (test_bit(R5_UPTODATE, &sh->dev[i].flags) &&
+ test_bit(R5_NeedReplace, &sh->dev[i].flags)) {
+ set_bit(R5_WantReplace, &sh->dev[i].flags);
+ set_bit(R5_LOCKED, &sh->dev[i].flags);
+ s.locked++;
+ }
+ set_bit(STRIPE_INSYNC, &sh->state);
+ }
+ if ((s.syncing || s.replacing) && s.locked == 0 &&
+ test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -4243,7 +4306,6 @@ static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int
return sync_blocks * STRIPE_SECTORS; /* keep things rounded to whole stripes */
}
-
bitmap_cond_end_sync(mddev->bitmap, sector_nr);
sh = get_active_stripe(conf, sector_nr, 0, 1, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f6faaa1..8d8e139 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -242,7 +242,13 @@ struct stripe_head {
* for handle_stripe.
*/
struct stripe_head_state {
- int syncing, expanding, expanded;
+ /* 'syncing' means that we need to read all devices, either
+ * to check/correct parity, or to reconstruct a missing device.
+ * 'replacing' means we are replacing one or more drives and
+ * the source is valid at this point so we don't need to
+ * read all devices, just the replacement targets.
+ */
+ int syncing, expanding, expanded, replacing;
int locked, uptodate, to_read, to_write, failed, written;
int to_fill, compute, req_compute, non_overwrite;
int failed_num[2];
@@ -284,6 +290,11 @@ enum r5dev_flags {
R5_ReadRepl, /* Will/did read from replacement rather than orig */
R5_MadeGoodRepl,/* A bad block on the replacement device has been
* fixed by writing to it */
+ R5_NeedReplace, /* This device has a replacement which is not
+ * up-to-date at this stripe. */
+ R5_WantReplace, /* We need to update the replacement, we have read
+ * data in, and now is a good time to write it out.
+ */
};
/*
* [md PATCH 09/16] md/raid5: preferentially read from replacement device if possible.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (11 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 12/16] md/raid5: detect and handle replacements during recovery NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 10/16] md/raid5: allow removal for failed replacement devices NeilBrown
` (5 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.
If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.
This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache. It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 69 +++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 54 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 12e54c0..7d77a38 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -512,7 +512,11 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_end_io = raid5_end_read_request;
rcu_read_lock();
- rdev = rcu_dereference(conf->disks[i].rdev);
+ if (rw == READ &&
+ test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ rdev = rcu_dereference(conf->disks[i].replacement);
+ else
+ rdev = rcu_dereference(conf->disks[i].rdev);
if (rdev && test_bit(Faulty, &rdev->flags))
rdev = NULL;
if (rdev)
@@ -1593,11 +1597,18 @@ static void raid5_end_read_request(struct bio * bi, int error)
BUG();
return;
}
+ if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ rdev = conf->disks[i].replacement;
+ else
+ rdev = conf->disks[i].rdev;
if (uptodate) {
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
- rdev = conf->disks[i].rdev;
+ /* Note that this cannot happen on a
+ * replacement device. We just fail those on
+ * any error
+ */
printk_ratelimited(
KERN_INFO
"md/raid:%s: read error corrected"
@@ -1610,16 +1621,24 @@ static void raid5_end_read_request(struct bio * bi, int error)
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
- if (atomic_read(&conf->disks[i].rdev->read_errors))
- atomic_set(&conf->disks[i].rdev->read_errors, 0);
+ if (atomic_read(&rdev->read_errors))
+ atomic_set(&rdev->read_errors, 0);
} else {
- const char *bdn = bdevname(conf->disks[i].rdev->bdev, b);
+ const char *bdn = bdevname(rdev->bdev, b);
int retry = 0;
- rdev = conf->disks[i].rdev;
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
- if (conf->mddev->degraded >= conf->max_degraded)
+ if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
+ printk_ratelimited(
+ KERN_WARNING
+ "md/raid:%s: read error on replacement device "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
+ else if (conf->mddev->degraded >= conf->max_degraded)
printk_ratelimited(
KERN_WARNING
"md/raid:%s: read error not correctable "
@@ -1653,7 +1672,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
md_error(conf->mddev, rdev);
}
}
- rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
+ rdev_dec_pending(rdev, conf->mddev);
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
@@ -3027,7 +3046,19 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
}
if (dev->written)
s->written++;
- rdev = rcu_dereference(conf->disks[i].rdev);
+ /* Prefer to use the replacement for reads, but only
+ * if it is recovered enough and has no bad blocks.
+ */
+ rdev = rcu_dereference(conf->disks[i].replacement);
+ if (rdev && !test_bit(Faulty, &rdev->flags) &&
+ rdev->recovery_offset >= sh->sector + STRIPE_SECTORS &&
+ !is_badblock(rdev, sh->sector, STRIPE_SECTORS,
+ &first_bad, &bad_sectors))
+ set_bit(R5_ReadRepl, &dev->flags);
+ else {
+ rdev = rcu_dereference(conf->disks[i].rdev);
+ clear_bit(R5_ReadRepl, &dev->flags);
+ }
if (rdev) {
is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
&first_bad, &bad_sectors);
@@ -3061,17 +3092,26 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
set_bit(R5_Insync, &dev->flags);
}
if (test_bit(R5_WriteError, &dev->flags)) {
- clear_bit(R5_Insync, &dev->flags);
- if (!test_bit(Faulty, &rdev->flags)) {
+ /* This flag does not apply to '.replacement'
+ * only to .rdev, so make sure to check that*/
+ struct md_rdev *rdev2 = rcu_dereference(
+ conf->disks[i].rdev);
+ if (rdev2 == rdev)
+ clear_bit(R5_Insync, &dev->flags);
+ if (!test_bit(Faulty, &rdev2->flags)) {
s->handle_bad_blocks = 1;
- atomic_inc(&rdev->nr_pending);
+ atomic_inc(&rdev2->nr_pending);
} else
clear_bit(R5_WriteError, &dev->flags);
}
if (test_bit(R5_MadeGood, &dev->flags)) {
- if (!test_bit(Faulty, &rdev->flags)) {
+ /* This flag does not apply to '.replacement'
+ * only to .rdev, so make sure to check that*/
+ struct md_rdev *rdev2 = rcu_dereference(
+ conf->disks[i].rdev);
+ if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
s->handle_bad_blocks = 1;
- atomic_inc(&rdev->nr_pending);
+ atomic_inc(&rdev2->nr_pending);
} else
clear_bit(R5_MadeGood, &dev->flags);
}
@@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
return handled;
}
- set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
release_stripe(sh);
raid5_set_bi_hw_segments(raid_bio, scnt);
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [md PATCH 10/16] md/raid5: allow removal for failed replacement devices.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (12 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 09/16] md/raid5: preferentially read from replacement device if possible NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 15/16] md/raid5: If there is a spare and a replaceable device, start replacement NeilBrown
` (4 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 58 +++++++++++++++++++++++++++++-----------------------
1 files changed, 32 insertions(+), 26 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7d77a38..20f42f6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5068,36 +5068,42 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
struct r5conf *conf = mddev->private;
int err = 0;
int number = rdev->raid_disk;
+ struct md_rdev **rdevp;
struct disk_info *p = conf->disks + number;
print_raid5_conf(conf);
- if (rdev == p->rdev) {
- if (number >= conf->raid_disks &&
- conf->reshape_progress == MaxSector)
- clear_bit(In_sync, &rdev->flags);
+ if (rdev == p->rdev)
+ rdevp = &p->rdev;
+ else if (rdev == p->replacement)
+ rdevp = &p->replacement;
+ else
+ return 0;
- if (test_bit(In_sync, &rdev->flags) ||
- atomic_read(&rdev->nr_pending)) {
- err = -EBUSY;
- goto abort;
- }
- /* Only remove non-faulty devices if recovery
- * isn't possible.
- */
- if (!test_bit(Faulty, &rdev->flags) &&
- mddev->recovery_disabled != conf->recovery_disabled &&
- !has_failed(conf) &&
- number < conf->raid_disks) {
- err = -EBUSY;
- goto abort;
- }
- p->rdev = NULL;
- synchronize_rcu();
- if (atomic_read(&rdev->nr_pending)) {
- /* lost the race, try later */
- err = -EBUSY;
- p->rdev = rdev;
- }
+ if (number >= conf->raid_disks &&
+ conf->reshape_progress == MaxSector)
+ clear_bit(In_sync, &rdev->flags);
+
+ if (test_bit(In_sync, &rdev->flags) ||
+ atomic_read(&rdev->nr_pending)) {
+ err = -EBUSY;
+ goto abort;
+ }
+ /* Only remove non-faulty devices if recovery
+ * isn't possible.
+ */
+ if (!test_bit(Faulty, &rdev->flags) &&
+ mddev->recovery_disabled != conf->recovery_disabled &&
+ !has_failed(conf) &&
+ number < conf->raid_disks) {
+ err = -EBUSY;
+ goto abort;
+ }
+ *rdevp = NULL;
+ synchronize_rcu();
+ if (atomic_read(&rdev->nr_pending)) {
+ /* lost the race, try later */
+ err = -EBUSY;
+ *rdevp = rdev;
}
abort:
* [md PATCH 15/16] md/raid5: If there is a spare and a replaceable device, start replacement.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (13 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 10/16] md/raid5: allow removal for failed replacement devices NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 1:43 ` [md PATCH 16/16] md/raid5: Mark device replaceable when we see a write error NeilBrown
` (3 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a replaceable device.
This requires that common md code attempt hot_add even when the array
is not formally degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 34 ++++++++++++++++------------------
drivers/md/raid5.c | 16 ++++++++++++++--
2 files changed, 30 insertions(+), 20 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 72a07d0..1b43f86 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7403,25 +7403,23 @@ static int remove_and_add_spares(struct mddev *mddev)
}
}
- if (mddev->degraded) {
- list_for_each_entry(rdev, &mddev->disks, same_set) {
- if (rdev->raid_disk >= 0 &&
- !test_bit(In_sync, &rdev->flags) &&
- !test_bit(Faulty, &rdev->flags))
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(In_sync, &rdev->flags) &&
+ !test_bit(Faulty, &rdev->flags))
+ spares++;
+ if (rdev->raid_disk < 0
+ && !test_bit(Faulty, &rdev->flags)) {
+ rdev->recovery_offset = 0;
+ if (mddev->pers->
+ hot_add_disk(mddev, rdev) == 0) {
+ if (sysfs_link_rdev(mddev, rdev))
+ /* failure here is OK */;
spares++;
- if (rdev->raid_disk < 0
- && !test_bit(Faulty, &rdev->flags)) {
- rdev->recovery_offset = 0;
- if (mddev->pers->
- hot_add_disk(mddev, rdev) == 0) {
- if (sysfs_link_rdev(mddev, rdev))
- /* failure here is OK */;
- spares++;
- md_new_event(mddev);
- set_bit(MD_CHANGE_DEVS, &mddev->flags);
- } else
- break;
- }
+ md_new_event(mddev);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ } else
+ break;
}
}
return spares;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index efe3679..3f55bef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5337,8 +5337,9 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
disk = rdev->saved_raid_disk;
else
disk = first;
- for ( ; disk <= last ; disk++)
- if ((p=conf->disks + disk)->rdev == NULL) {
+ for ( ; disk <= last ; disk++) {
+ p = conf->disks + disk;
+ if (p->rdev == NULL) {
clear_bit(In_sync, &rdev->flags);
rdev->raid_disk = disk;
err = 0;
@@ -5347,6 +5348,17 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
rcu_assign_pointer(p->rdev, rdev);
break;
}
+ if (test_bit(Replaceable, &p->rdev->flags) &&
+ p->replacement == NULL) {
+ clear_bit(In_sync, &rdev->flags);
+ set_bit(Replacement, &rdev->flags);
+ rdev->raid_disk = disk;
+ err = 0;
+ conf->fullsync = 1;
+ rcu_assign_pointer(p->replacement, rdev);
+ break;
+ }
+ }
print_raid5_conf(conf);
return err;
}
* [md PATCH 16/16] md/raid5: Mark device replaceable when we see a write error.
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (14 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 15/16] md/raid5: If there is a spare and a replaceable device, start replacement NeilBrown
@ 2011-10-26 1:43 ` NeilBrown
2011-10-26 6:38 ` [md PATCH 00/16] hot-replace support for RAID4/5/6 David Brown
` (2 subsequent siblings)
18 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 1:43 UTC (permalink / raw)
To: linux-raid
Now that Replaceable drives are replaced cleanly, mark a drive
as replaceable when we see a write error. It might get failed soon so
the Replaceable flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3f55bef..0ec48d0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1782,6 +1782,9 @@ static void raid5_end_write_request(struct bio *bi, int error)
if (!uptodate) {
set_bit(WriteErrorSeen, &rdev->flags);
set_bit(R5_WriteError, &sh->dev[i].flags);
+ if (!test_and_set_bit(Replaceable, &rdev->flags))
+ set_bit(MD_RECOVERY_NEEDED,
+ &rdev->mddev->recovery);
} else if (is_badblock(rdev, sh->sector,
STRIPE_SECTORS,
&first_bad, &bad_sectors))
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (15 preceding siblings ...)
2011-10-26 1:43 ` [md PATCH 16/16] md/raid5: Mark device replaceable when we see a write error NeilBrown
@ 2011-10-26 6:38 ` David Brown
2011-10-26 7:42 ` NeilBrown
2011-10-26 9:01 ` John Robinson
2011-10-27 17:10 ` Peter W. Morreale
2011-12-14 22:18 ` Dan Williams
18 siblings, 2 replies; 31+ messages in thread
From: David Brown @ 2011-10-26 6:38 UTC (permalink / raw)
To: linux-raid
On 26/10/2011 03:43, NeilBrown wrote:
> The following series - on top of my for-linus branch which should appear in
> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> certainly the most requested feature over the last few years.
Fantastic news - well done, and thanks for all your work (and to the
other md developers too, of course).
> I hope to submit this together with support for RAID10 (and maybe some
> minimal support for RAID1) for Linux-3.3. By the time it comes out
> mdadm-3.3 should exist with full support for hot-replace.
>
Hot-replace for RAID1 is surely just a normal "add", a resync, a "fail",
and a "remove"? So support here should be mainly about the consistency,
such as supporting the same "replaceable" tag. RAID10, I imagine, would
be more complicated.
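For concreteness, one way to approximate that manual RAID1 sequence today is the add/grow/fail/remove dance below. This is a dry-run sketch only: the array and device names are placeholders, and every command is printed rather than executed, since none of this should be tried against live data.

```shell
# Dry-run sketch of a manual RAID1 "replacement" using only existing md
# operations. /dev/md0, /dev/sdb1 and /dev/sdc1 are hypothetical names;
# the run() helper prints each step instead of executing it.
run() { echo "would run: $*"; }

run mdadm /dev/md0 --add /dev/sdc1              # attach the new disk as a spare
run mdadm --grow /dev/md0 --raid-devices=3      # widen the mirror so the spare resyncs
run cat /proc/mdstat                            # wait here until the resync finishes
run mdadm /dev/md0 --fail /dev/sdb1             # now drop the old disk
run mdadm /dev/md0 --remove /dev/sdb1
run mdadm --grow /dev/md0 --raid-devices=2      # shrink back to a two-way mirror
```

The obvious gap compared with real hot-replace support is that nothing here is automatic: there is no fail-after-copy, and the array spends the whole resync window as a temporarily wider mirror.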
mvh.,
David
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 6:38 ` [md PATCH 00/16] hot-replace support for RAID4/5/6 David Brown
@ 2011-10-26 7:42 ` NeilBrown
2011-10-26 9:01 ` John Robinson
1 sibling, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-10-26 7:42 UTC (permalink / raw)
To: David Brown; +Cc: linux-raid
On Wed, 26 Oct 2011 08:38:00 +0200 David Brown <david@westcontrol.com> wrote:
> On 26/10/2011 03:43, NeilBrown wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
>
> Fantastic news - well done, and thanks for all your work (and to the
> other md developers too, of course).
>
> > I hope to submit this together with support for RAID10 (and maybe some
> > minimal support for RAID1) for Linux-3.3. By the time it comes out
> > mdadm-3.3 should exist with full support for hot-replace.
> >
>
> Hot-replace for RAID1 is surely just a normal "add", a resync, a "fail",
> and a "remove"? So support here should be mainly about the consistency,
> such as supporting the same "replaceable" tag. RAID10, I imagine, would
> be more complicated.
Zigactly.
The thing that particularly needs to work for RAID1 is if you have an array
with a spare and you have bad-block-logs enabled and you get a write error,
then the write error needs to be logged, but also recovery needs to start
onto the spare and when that completes the drive with the write error should
be failed. I just need to add enough bits so that works, which shouldn't be
hard.
NeilBrown
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 6:38 ` [md PATCH 00/16] hot-replace support for RAID4/5/6 David Brown
2011-10-26 7:42 ` NeilBrown
@ 2011-10-26 9:01 ` John Robinson
2011-10-26 13:57 ` Peter W. Morreale
1 sibling, 1 reply; 31+ messages in thread
From: John Robinson @ 2011-10-26 9:01 UTC (permalink / raw)
To: Linux RAID
On 26/10/2011 07:38, David Brown wrote:
> On 26/10/2011 03:43, NeilBrown wrote:
>> The following series - on top of my for-linus branch which should
>> appear in
>> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
>> certainly the most requested feature over the last few years.
>
> Fantastic news - well done, and thanks for all your work (and to the
> other md developers too, of course).
Seconded! This is doubleplusgood!
Cheers,
John.
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 9:01 ` John Robinson
@ 2011-10-26 13:57 ` Peter W. Morreale
2011-10-26 17:27 ` Piergiorgio Sartor
0 siblings, 1 reply; 31+ messages in thread
From: Peter W. Morreale @ 2011-10-26 13:57 UTC (permalink / raw)
To: John Robinson; +Cc: Linux RAID, NeilBrown
On Wed, 2011-10-26 at 10:01 +0100, John Robinson wrote:
> On 26/10/2011 07:38, David Brown wrote:
> > On 26/10/2011 03:43, NeilBrown wrote:
> >> The following series - on top of my for-linus branch which should
> >> appear in
> >> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> >> certainly the most requested feature over the last few years.
> >
> > Fantastic news - well done, and thanks for all your work (and to the
> > other md developers too, of course).
>
> Seconded! This is doubleplusgood!
+1. Thank you Neil and everyone else.
Will be testing shortly....
Best
-PWM
>
> Cheers,
>
> John.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 13:57 ` Peter W. Morreale
@ 2011-10-26 17:27 ` Piergiorgio Sartor
0 siblings, 0 replies; 31+ messages in thread
From: Piergiorgio Sartor @ 2011-10-26 17:27 UTC (permalink / raw)
To: Peter W. Morreale; +Cc: John Robinson, Linux RAID, NeilBrown
On Wed, Oct 26, 2011 at 07:57:53AM -0600, Peter W. Morreale wrote:
> On Wed, 2011-10-26 at 10:01 +0100, John Robinson wrote:
> > On 26/10/2011 07:38, David Brown wrote:
> > > On 26/10/2011 03:43, NeilBrown wrote:
> > >> The following series - on top of my for-linus branch which should
> > >> appear in
> > >> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > >> certainly the most requested feature over the last few years.
> > >
> > > Fantastic news - well done, and thanks for all your work (and to the
> > > other md developers too, of course).
> >
> > Seconded! This is doubleplusgood!
>
> +1. Thank you Neil and everyone else.
>
> Will be testing shortly....
+1, me too!!!
This is really great news!
Thanks Neil,
bye,
--
piergiorgio
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (16 preceding siblings ...)
2011-10-26 6:38 ` [md PATCH 00/16] hot-replace support for RAID4/5/6 David Brown
@ 2011-10-27 17:10 ` Peter W. Morreale
2011-10-27 20:44 ` NeilBrown
2011-12-14 22:18 ` Dan Williams
18 siblings, 1 reply; 31+ messages in thread
From: Peter W. Morreale @ 2011-10-27 17:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> The following series - on top of my for-linus branch which should appear in
> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> certainly the most requested feature over the last few years.
> The whole series can be pulled from my md-devel branch:
> git://neil.brown.name/md md-devel
> (please don't do a full clone, it is not a very fast link).
>
> There is currently no mdadm support, but you can test it out and
> experiment without mdadm.
>
> In order to activate hot-replace you need to mark the device as
> 'replaceable'.
> This happens automatically when a write error is recorded in a
> bad-block log (if you happen to have one).
> It can be achieved manually by
> echo replaceable > /sys/block/mdXX/md/dev-YYY/state
>
> This makes YYY, in XX, replaceable.
>
> If md notices that there is a replaceable drive and a spare it will
> attach the spare to the replaceable drive and mark it as a
> 'replacement'.
> This word appears in the 'state' file and as (R) in /proc/mdstat.
>
> md will then copy data from the replaceable drive to the replacement.
> If there is a bad block on the replaceable drive, it will get the data
> from elsewhere. This looks like a "recovery" operation.
>
> When the replacement completes the replaceable device will be marked
> as Failed and will be disconnected from the array (i.e. the 'slot'
> will be set to 'none') and the replacement drive will take up full
> possession of that slot.
Neil,
Seems to work quite well. Note I have not yet performed a data
consistency check, just the mechanics of 'replacing' an existing
drive.
I see in the code that a recovery is kicked immediately after changing
the state of a drive. One question is whether it will be possible to
mark multiple drives for replacement, then invoke the recovery one time,
replacing all disks marked in a single pass?
Right now, changing state on multiple drives kicks off sequential
recoveries. For larger disks (3TB/etc), recovery takes a long time and
there is a non-zero performance hit on the live array.
There are two common use cases to think about. First being an array
disk replacement to (say) larger disks. Second being a new array in use
for a period of time where the disks are approaching end-of-life, and
multiple disks are showing signs of possible failure. So we want to
replace a number of them at one time and incur the performance hit one
time.
I see where the code limits a recovery to one sync at a time, would it
be possible to extend this to multiple concurrent replacements?
What would it take to enable this?
Thanks again for this effort, this is terrific.
Best,
-PWM
>
> It is not possible to assemble an array with replacement with mdadm.
> To do this by hand:
>
> mknod /dev/md27 b 9 27
> < /dev/md27
> cd /sys/block/md27/md
> echo 1.2 > metadata_version
> echo 8:1 > new_dev
> echo 8:17 > new_dev
> ...
> echo active > array_state
>
> Replace '27' by the md number you want. Replace 1.2 by the metadata
> version number (must be 1.x for some x). Replace 8:1, 8:17 etc
> by the major:minor numbers of each device in the array.
>
> Yes: this is clumsy. But then you aren't doing this on live data -
> only on test devices to experiment.
>
> You can still assemble the array without the replacement using mdadm.
> Just list all the drives except the replacement in the --assemble
> command.
> Also once the replacement operation completes you can of course stop
> and assemble the new array with old mdadm.
>
> I hope to submit this together with support for RAID10 (and maybe some
> minimal support for RAID1) for Linux-3.3. By the time it comes out
> mdadm-3.3 should exist with full support for hot-replace.
>
> Review and testing are very welcome, but please do not try it on live
> data.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (16):
> md/raid5: Mark device replaceable when we see a write error.
> md/raid5: If there is a spare and a replaceable device, start replacement.
> md/raid5: recognise replacements when assembling array.
> md/raid5: handle activation of replacement device when recovery completes.
> md/raid5: detect and handle replacements during recovery.
> md/raid5: writes should get directed to replacement as well as original.
> md/raid5: allow removal for failed replacement devices.
> md/raid5: preferentially read from replacement device if possible.
> md/raid5: remove redundant bio initialisations.
> md/raid5: raid5.h cleanup
> md/raid5: allow each slot to have an extra replacement device
> md: create externally visible flags for supporting hot-replace.
> md: change hot_remove_disk to take an rdev rather than a number.
> md: remove test for duplicate device when setting slot number.
> > md: take a reference to mddev during sysfs access.
> md: refine interpretation of "hold_active == UNTIL_IOCTL".
>
>
> Documentation/md.txt | 22 ++
> drivers/md/md.c | 132 ++++++++++---
> drivers/md/md.h | 82 +++++---
> drivers/md/multipath.c | 7 -
> drivers/md/raid1.c | 7 -
> drivers/md/raid10.c | 7 -
> drivers/md/raid5.c | 462 +++++++++++++++++++++++++++++++++++----------
> drivers/md/raid5.h | 98 +++++-----
> include/linux/raid/md_p.h | 7 -
> 9 files changed, 599 insertions(+), 225 deletions(-)
>
> --
> Signature
>
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-27 17:10 ` Peter W. Morreale
@ 2011-10-27 20:44 ` NeilBrown
2011-10-27 20:53 ` Peter W. Morreale
0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2011-10-27 20:44 UTC (permalink / raw)
To: Peter W. Morreale; +Cc: linux-raid
On Thu, 27 Oct 2011 11:10:34 -0600 "Peter W. Morreale" <morreale@sgi.com>
wrote:
> On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> > git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
> >
> > There is currently no mdadm support, but you can test it out and
> > experiment without mdadm.
> >
> > In order to activate hot-replace you need to mark the device as
> > 'replaceable'.
> > This happens automatically when a write error is recorded in a
> > bad-block log (if you happen to have one).
> > It can be achieved manually by
> > echo replaceable > /sys/block/mdXX/md/dev-YYY/state
> >
> > This makes YYY, in XX, replaceable.
> >
> > If md notices that there is a replaceable drive and a spare it will
> > attach the spare to the replaceable drive and mark it as a
> > 'replacement'.
> > This word appears in the 'state' file and as (R) in /proc/mdstat.
> >
> > md will then copy data from the replaceable drive to the replacement.
> > If there is a bad block on the replaceable drive, it will get the data
> > from elsewhere. This looks like a "recovery" operation.
> >
> > When the replacement completes the replaceable device will be marked
> > as Failed and will be disconnected from the array (i.e. the 'slot'
> > will be set to 'none') and the replacement drive will take up full
> > possession of that slot.
>
> Neil,
>
> Seems to work quite well. Note I have not yet performed a data
> consistency check, just the mechanics of 'replacing' an existing
> drive.
>
> I see in the code that a recovery is kicked immediately after changing
> the state of a drive. One question is whether it will be possible to
> mark multiple drives for replacement, then invoke the recovery one time,
> replacing all disks marked in a single pass?
>
> Right now, changing state on multiple drives kicks off sequential
> recoveries. For larger disks (3TB/etc), recovery takes a long time and
> there is a non-zero performance hit on the live array.
>
> There are two common use cases to think about. First being an array
> disk replacement to (say) larger disks. Second being a new array in use
> for a period of time where the disks are approaching end-of-life, and
> multiple disks are showing signs of possible failure. So we want to
> replace a number of them at one time and incur the performance hit one
> time.
>
> I see where the code limits a recovery to one sync at a time, would it
> be possible to extend this to multiple concurrent replacements?
>
> What would it take to enable this?
echo frozen > /sys/block/mdX/md/sync_action
for i in /sys/block/mdX/md/dev-*/state
do echo replaceable > $i
done
echo repair > /sys/block/mdX/md/sync_action
should do it. You certainly should be able to replace several devices at the
same time using this approach, though I haven't tried it.
(hmmm... it probably shouldn't accept a 'replaceable' flag on spares - I'll
make a note of that).
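To go with that recipe, a read-only helper for watching the replacements in progress might look like this (the array name is a placeholder, the sysfs paths are the files discussed in this thread, and nothing below writes anything; it just bails out if the array doesn't exist):

```shell
# Read-only sketch: dump the pieces of state that show a replacement in
# progress. $1 is the md name, e.g. "md0".
watch_replace() {
    md=$1
    [ -d "/sys/block/$md/md" ] || { echo "no array $md"; return 1; }
    grep -A 2 "^$md" /proc/mdstat        # a replacement device shows up as (R)
    cat "/sys/block/$md/md/sync_action"  # "recovery" while the copy is running
    for s in /sys/block/$md/md/dev-*/state; do
        printf '%s: ' "${s%/state}"
        cat "$s"                         # look for "replaceable" / "replacement"
    done
}
```

Run as `watch_replace mdX` in a loop (or under `watch`) while the repair is underway.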
>
> Thanks again for this effort, this is terrific.
Thanks.
NeilBrown
>
> Best,
> -PWM
>
>
> >
> > It is not possible to assemble an array with replacement with mdadm.
> > To do this by hand:
> >
> > mknod /dev/md27 b 9 27
> > < /dev/md27
> > cd /sys/block/md27/md
> > echo 1.2 > metadata_version
> > echo 8:1 > new_dev
> > echo 8:17 > new_dev
> > ...
> > echo active > array_state
> >
> > Replace '27' by the md number you want. Replace 1.2 by the metadata
> > version number (must be 1.x for some x). Replace 8:1, 8:17 etc
> > by the major:minor numbers of each device in the array.
> >
> > Yes: this is clumsy. But then you aren't doing this on live data -
> > only on test devices to experiment.
> >
> > You can still assemble the array without the replacement using mdadm.
> > Just list all the drives except the replacement in the --assemble
> > command.
> > Also once the replacement operation completes you can of course stop
> > and assemble the new array with old mdadm.
> >
> > I hope to submit this together with support for RAID10 (and maybe some
> > minimal support for RAID1) for Linux-3.3. By the time it comes out
> > mdadm-3.3 should exist with full support for hot-replace.
> >
> > Review and testing are very welcome, but please do not try it on live
> > data.
> >
> > NeilBrown
> >
> >
> > ---
> >
> > NeilBrown (16):
> > md/raid5: Mark device replaceable when we see a write error.
> > md/raid5: If there is a spare and a replaceable device, start replacement.
> > md/raid5: recognise replacements when assembling array.
> > md/raid5: handle activation of replacement device when recovery completes.
> > md/raid5: detect and handle replacements during recovery.
> > md/raid5: writes should get directed to replacement as well as original.
> > md/raid5: allow removal for failed replacement devices.
> > md/raid5: preferentially read from replacement device if possible.
> > md/raid5: remove redundant bio initialisations.
> > md/raid5: raid5.h cleanup
> > md/raid5: allow each slot to have an extra replacement device
> > md: create externally visible flags for supporting hot-replace.
> > md: change hot_remove_disk to take an rdev rather than a number.
> > md: remove test for duplicate device when setting slot number.
> > md: take a reference to mddev during sysfs access.
> > md: refine interpretation of "hold_active == UNTIL_IOCTL".
> >
> >
> > Documentation/md.txt | 22 ++
> > drivers/md/md.c | 132 ++++++++++---
> > drivers/md/md.h | 82 +++++---
> > drivers/md/multipath.c | 7 -
> > drivers/md/raid1.c | 7 -
> > drivers/md/raid10.c | 7 -
> > drivers/md/raid5.c | 462 +++++++++++++++++++++++++++++++++++----------
> > drivers/md/raid5.h | 98 +++++-----
> > include/linux/raid/md_p.h | 7 -
> > 9 files changed, 599 insertions(+), 225 deletions(-)
> >
> > --
> > Signature
> >
>
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-27 20:44 ` NeilBrown
@ 2011-10-27 20:53 ` Peter W. Morreale
0 siblings, 0 replies; 31+ messages in thread
From: Peter W. Morreale @ 2011-10-27 20:53 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Fri, 2011-10-28 at 07:44 +1100, NeilBrown wrote:
> On Thu, 27 Oct 2011 11:10:34 -0600 "Peter W. Morreale" <morreale@sgi.com>
> wrote:
>
> > On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> > > The following series - on top of my for-linus branch which should appear in
> > > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > > certainly the most requested feature over the last few years.
> > > The whole series can be pulled from my md-devel branch:
> > > git://neil.brown.name/md md-devel
> > > (please don't do a full clone, it is not a very fast link).
> > >
> > > There is currently no mdadm support, but you can test it out and
> > > experiment without mdadm.
> > >
> > > In order to activate hot-replace you need to mark the device as
> > > 'replaceable'.
> > > This happens automatically when a write error is recorded in a
> > > bad-block log (if you happen to have one).
> > > It can be achieved manually by
> > > echo replaceable > /sys/block/mdXX/md/dev-YYY/state
> > >
> > > This makes YYY, in XX, replaceable.
> > >
> > > If md notices that there is a replaceable drive and a spare it will
> > > attach the spare to the replaceable drive and mark it as a
> > > 'replacement'.
> > > This word appears in the 'state' file and as (R) in /proc/mdstat.
> > >
> > > md will then copy data from the replaceable drive to the replacement.
> > > If there is a bad block on the replaceable drive, it will get the data
> > > from elsewhere. This looks like a "recovery" operation.
> > >
> > > When the replacement completes the replaceable device will be marked
> > > as Failed and will be disconnected from the array (i.e. the 'slot'
> > > will be set to 'none') and the replacement drive will take up full
> > > possession of that slot.
> >
> > Neil,
> >
> > Seems to work quite well. Note I have not yet performed a data
> > consistency check, just the mechanics of 'replacing' an existing
> > drive.
> >
> > I see in the code that a recovery is kicked immediately after changing
> > the state of a drive. One question is whether it will be possible to
> > mark multiple drives for replacement, then invoke the recovery one time,
> > replacing all disks marked in a single pass?
> >
> > Right now, changing the state on multiple drives kicks off sequential
> > recoveries. For larger disks (3TB/etc), recovery takes a long time and
> > there is a non-zero performance hit on the live array.
> >
> > There are two common use cases to think about. First being an array
> > disk replacement to (say) larger disks. Second being a new array in use
> > for a period of time where the disks are approaching end-of-life, and
> > multiple disks are showing signs of possible failure. So we want to
> > replace a number of them at one time and incur the performance hit one
> > time.
> >
> > I see where the code limits a recovery to one sync at a time, would it
> > be possible to extend this to multiple concurrent replacements?
> >
> > What would it take to enable this?
>
> echo frozen > /sys/block/mdX/md/sync_action
> for i in /sys/block/mdX/md/dev-*/state
> do echo replaceable > $i
> done
> echo repair > /sys/block/mdX/md/sync_action
>
> should do it. You certainly should be able to replace several devices at the
> same time using this approach, though I haven't tried it.
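Wrapped up as a function, the freeze/mark/repair recipe above might look like the following sketch (a hypothetical helper; the sysfs directory is a parameter rather than hard-coded):

```shell
# Sketch of the freeze -> mark-all -> repair recipe quoted above.
# Freezing first stops each state change from kicking off its own
# recovery pass; the single "repair" at the end handles them all in
# one sweep across the array.
replace_all_members() {
    local md_dir="$1"            # e.g. /sys/block/md0/md
    echo frozen > "$md_dir/sync_action"
    local state
    for state in "$md_dir"/dev-*/state; do
        echo replaceable > "$state"
    done
    echo repair > "$md_dir/sync_action"
}
```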
No worries, I will and will let you know...
Awesome. I'm only at about 10% of understanding the code at this point.
Investigating 'frozen' was on the list...
Thx
-PWM
>
> (hmmm... it probably shouldn't accept a 'replaceable' flag on spares - I'll
> make a note of that).
>
> >
> > Thanks again for this effort, this is terrific.
>
> Thanks.
>
> NeilBrown
>
>
> >
> > Best,
> > -PWM
> >
> >
> > >
> > > It is not possible to assemble an array with replacement with mdadm.
> > > To do this by hand:
> > >
> > > mknod /dev/md27 b 9 27
> > > < /dev/md27
> > > cd /sys/block/md27/md
> > > echo 1.2 > metadata_version
> > > echo 8:1 > new_dev
> > > echo 8:17 > new_dev
> > > ...
> > > echo active > array_state
> > >
> > > Replace '27' by the md number you want. Replace 1.2 by the metadata
> > > version number (must be 1.x for some x). Replace 8:1, 8:17 etc
> > > by the major:minor numbers of each device in the array.
> > >
> > > Yes: this is clumsy. But then you aren't doing this on live data -
> > > only on test devices to experiment.
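Assuming the device node already exists (the mknod step above), the sysfs writes can be collected into a function; `assemble_array` and its argument list are illustrative names for this sketch, not an mdadm interface.

```shell
# Sketch of the by-hand assembly sequence above. In real sysfs each
# write to new_dev adds one member device to the array; the metadata
# version and the major:minor list are passed in as parameters.
assemble_array() {
    local md_dir="$1"; shift     # e.g. /sys/block/md27/md
    local version="$1"; shift    # must be 1.x, e.g. 1.2
    echo "$version" > "$md_dir/metadata_version"
    local dev
    for dev in "$@"; do          # e.g. 8:1 8:17 ...
        echo "$dev" > "$md_dir/new_dev"
    done
    echo active > "$md_dir/array_state"
}
```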
> > >
> > > You can still assemble the array without the replacement using mdadm.
> > > Just list all the drives except the replacement in the --assemble
> > > command.
> > > Also once the replacement operation completes you can of course stop
> > > and assemble the new array with old mdadm.
> > >
> > > I hope to submit this together with support for RAID10 (and maybe some
> > > minimal support for RAID1) for Linux-3.3. By the time it comes out
> > > mdadm-3.3 should exist with full support for hot-replace.
> > >
> > > Review and testing is very welcome, but please do not try it on live
> > > data.
> > >
> > > NeilBrown
> > >
> > >
> > > ---
> > >
> > > NeilBrown (16):
> > > md/raid5: Mark device replaceable when we see a write error.
> > > md/raid5: If there is a spare and a replaceable device, start replacement.
> > > md/raid5: recognise replacements when assembling array.
> > > md/raid5: handle activation of replacement device when recovery completes.
> > > md/raid5: detect and handle replacements during recovery.
> > > md/raid5: writes should get directed to replacement as well as original.
> > > md/raid5: allow removal for failed replacement devices.
> > > md/raid5: preferentially read from replacement device if possible.
> > > md/raid5: remove redundant bio initialisations.
> > > md/raid5: raid5.h cleanup
> > > md/raid5: allow each slot to have an extra replacement device
> > > md: create externally visible flags for supporting hot-replace.
> > > md: change hot_remove_disk to take an rdev rather than a number.
> > > md: remove test for duplicate device when setting slot number.
> > > md: take a reference to mddev during sysfs access.
> > > md: refine interpretation of "hold_active == UNTIL_IOCTL".
> > >
> > >
> > > Documentation/md.txt | 22 ++
> > > drivers/md/md.c | 132 ++++++++++---
> > > drivers/md/md.h | 82 +++++---
> > > drivers/md/multipath.c | 7 -
> > > drivers/md/raid1.c | 7 -
> > > drivers/md/raid10.c | 7 -
> > > drivers/md/raid5.c | 462 +++++++++++++++++++++++++++++++++++----------
> > > drivers/md/raid5.h | 98 +++++-----
> > > include/linux/raid/md_p.h | 7 -
> > > 9 files changed, 599 insertions(+), 225 deletions(-)
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-10-26 1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
` (17 preceding siblings ...)
2011-10-27 17:10 ` Peter W. Morreale
@ 2011-12-14 22:18 ` Dan Williams
2011-12-15 6:18 ` NeilBrown
18 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2011-12-14 22:18 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@suse.de> wrote:
> The following series - on top of my for-linus branch which should appear in
> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> certainly the most requested feature over the last few years.
> The whole series can be pulled from my md-devel branch:
> git://neil.brown.name/md md-devel
> (please don't do a full clone, it is not a very fast link).
Some belated comments based on the commit ids at the time:
88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
9c22832 md: take a reference to mddev during sysfs access.
a7d6ae4 md: remove test for duplicate device when setting slot number.
6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
last 4 reviewed-by.
f248f8c md: create externally visible flags for supporting hot-replace.
'replaceable' just strikes me as a confusing name as all devices are
nominally "replaceable", but whether you want it to be actively
replaced is a different consideration. What about "incumbent" to mark
the disk as currently holding a position we want it to vacate and
remove any potential confusion with 'replacement'.
ce8fd05 md/raid5: allow each slot to have an extra replacement device
fd7557d md/raid5: raid5.h cleanup
15e9a58 md/raid5: remove redundant bio initialisations.
last 3 reviewed-by.
37aebb5 md/raid5: preferentially read from replacement device if possible.
+ /* This flag does not apply to '.replacement'
+ * only to .rdev, so make sure to check that*/
+ struct md_rdev *rdev2 = rcu_dereference(
+ conf->disks[i].rdev);
+ if (rdev2 == rdev)
+ clear_bit(R5_Insync, &dev->flags);
+ if (!test_bit(Faulty, &rdev2->flags)) {
can't rdev2 be NULL here?
@@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf
*conf, struct bio *raid_bio)
return handled;
}
- set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
release_stripe(sh);
raid5_set_bi_hw_segments(raid_bio, scnt);
Should this one liner be broken out for -stable?
8e2c0f9 md/raid5: allow removal for failed replacement devices.
17df00a md/raid5: writes should get directed to replacement as well as original.
last 2 reviewed-by
dba5a681 md/raid5: detect and handle replacements during recovery.
This one got me looking back to recall the rules about when
rcu_deference must be used for an rdev (the ones outlined in commit
9910f16a "md: fix up some rdev rcu locking in raid5/6"). But the
casual future reader may have a hard time finding that commit. Maybe
we could introduce our own rdev_deref() macro so that sparse and
lockdep can automatically validate rdev dereferences like below.
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8d8e139..6023583 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -357,9 +357,14 @@ enum {
struct disk_info {
- struct md_rdev *rdev, *replacement;
+ struct md_rdev __rcu *rdev;
+ struct md_rdev __rcu *replacement;
};
+#define rdev_deref(p, md, sh) \
+ rcu_dereference_check((p), (md) ? mddev_is_locked(md) : 1 || \
+ (sh) ? test_bit(STRIPE_SYNCING, &(sh)->state) : 1)
+
struct r5conf {
struct hlist_head *stripe_hashtbl;
struct mddev *mddev;
...but not sure if it's worth the code uglification.
Nit, not sure if it's worth fixing but this one introduces some
inconsistent line wrapping around logical operators... "at the end" vs
"beginning of next line"
+ if (rdev
+ && !test_bit(Faulty, &rdev->flags)
+ && !test_bit(In_sync, &rdev->flags)
+ && !rdev_set_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS, 0))
+ abort = 1;
+ rdev = conf->disks[i].replacement;
+ if (rdev
+ && !test_bit(Faulty, &rdev->flags)
+ && !test_bit(In_sync, &rdev->flags)
+ && !rdev_set_badblocks(rdev, sh->sector,
+ STRIPE_SECTORS, 0))
abort = 1;
}
if (abort) {
@@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct
stripe_head *sh,
}
}
+static int want_replace(struct stripe_head *sh, int disk_idx)
+{
+ struct md_rdev *rdev;
+ int rv = 0;
+ /* Doing recovery so rcu locking not required */
+ rdev = sh->raid_conf->disks[disk_idx].replacement;
+ if (rdev &&
+ !test_bit(Faulty, &rdev->flags) &&
+ !test_bit(In_sync, &rdev->flags) &&
+ (rdev->recovery_offset <= sh->sector ||
+ rdev->mddev->recovery_cp <= sh->sector))
+ rv = 1;
+
+ return rv;
2693b9e md/raid5: handle activation of replacement device when
recovery completes.
I questioned not needing a barrier in raid5_end_write_request after
finding conf->disks[i].replacement == NULL until I found the note in
raid5_end_read_request about the rdev being pinned until all i/o
returns. Maybe a similar note to raid5_end_write_request?
d6db3d0 md/raid5: recognise replacements when assembling array.
6cdb4fb md/raid5: If there is a spare and a replaceable device, start
replacement.
0124565 md/raid5: Mark device replaceable when we see a write error.
last 3 reviewed-by.
058c478..678a66d
raid10 and raid1 patches not reviewed.
--
Dan
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-12-14 22:18 ` Dan Williams
@ 2011-12-15 6:18 ` NeilBrown
2011-12-15 7:14 ` Williams, Dan J
0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2011-12-15 6:18 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid
On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@intel.com>
wrote:
> On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@suse.de> wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> > git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
>
> Some belated comments based on the commit ids at the time:
>
> 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
> 9c22832 md: take a reference to mddev during sysfs access.
> a7d6ae4 md: remove test for duplicate device when setting slot number.
> 6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
>
> last 4 reviewed-by.
Thanks. I've annotated the two that haven't gone upstream yet.
>
> f248f8c md: create externally visible flags for supporting hot-replace.
>
> 'replaceable' just strikes me as a confusing name as all devices are
> nominally "replaceable", but whether you want it to be actively
> replaced is a different consideration. What about "incumbent" to mark
> the disk as currently holding a position we want it to vacate and
> remove any potential confusion with 'replacement'.
Fair point. I had wondered if I should not have the flag and just use the
"write_error" flag. However the meaning is slightly different.
I don't really like "incumbent" as it gives no indication that there is a
desire to replace the device. Maybe "want_replacement" ??
>
> ce8fd05 md/raid5: allow each slot to have an extra replacement device
> fd7557d md/raid5: raid5.h cleanup
> 15e9a58 md/raid5: remove redundant bio initialisations.
>
> last 3 reviewed-by.
Thanks.
>
> 37aebb5 md/raid5: preferentially read from replacement device if possible.
>
> + /* This flag does not apply to '.replacement'
> + * only to .rdev, so make sure to check that*/
> + struct md_rdev *rdev2 = rcu_dereference(
> + conf->disks[i].rdev);
> + if (rdev2 == rdev)
> + clear_bit(R5_Insync, &dev->flags);
> + if (!test_bit(Faulty, &rdev2->flags)) {
>
> can't rdev2 be NULL here?
Uhm... probably. I've added a test for rdev2 like I have in the "MadeGood"
case below.
Thanks.
>
> @@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf
> *conf, struct bio *raid_bio)
> return handled;
> }
>
> - set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
> if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
> release_stripe(sh);
> raid5_set_bi_hw_segments(raid_bio, scnt);
>
>
> Should this one liner be broken out for -stable?
Uhmm... maybe. If the array is degraded we'll hit problems soon anyway, and
if it isn't, the read-errors will all soon be fixed up.
Do you see a particular problem that this fixes that is already possible
without hot-replace?
>
> 8e2c0f9 md/raid5: allow removal for failed replacement devices.
> 17df00a md/raid5: writes should get directed to replacement as well as original.
>
> last 2 reviewed-by
Thanks.
>
> dba5a681 md/raid5: detect and handle replacements during recovery.
>
> This one got me looking back to recall the rules about when
> rcu_deference must be used for an rdev (the ones outlined in commit
> 9910f16a "md: fix up some rdev rcu locking in raid5/6"). But the
> casual future reader may have a hard time finding that commit. Maybe
> we could introduce our own rdev_deref() macro so that sparse and
> lockdep can automatically validate rdev dereferences like below.
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8d8e139..6023583 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -357,9 +357,14 @@ enum {
>
>
> struct disk_info {
> - struct md_rdev *rdev, *replacement;
> + struct md_rdev __rcu *rdev;
> + struct md_rdev __rcu *replacement;
> };
>
> +#define rdev_deref(p, md, sh) \
> + rcu_dereference_check((p), (md) ? mddev_is_locked(md) : 1 || \
> + (sh) ? test_bit(STRIPE_SYNCING, &(sh)->state) : 1)
> +
> struct r5conf {
> struct hlist_head *stripe_hashtbl;
> struct mddev *mddev;
>
> ...but not sure if it's worth the code uglification.
No, I'm not sure either... If it comes up again I might...
>
>
> Nit, not sure if it's worth fixing but this one introduces some
> inconsistent line wrapping around logical operators... "at the end" vs
> "beginning of next line"
>
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> + abort = 1;
> + rdev = conf->disks[i].replacement;
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> abort = 1;
> }
> if (abort) {
> @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct
> stripe_head *sh,
> }
> }
>
> +static int want_replace(struct stripe_head *sh, int disk_idx)
> +{
> + struct md_rdev *rdev;
> + int rv = 0;
> + /* Doing recovery so rcu locking not required */
> + rdev = sh->raid_conf->disks[disk_idx].replacement;
> + if (rdev &&
> + !test_bit(Faulty, &rdev->flags) &&
> + !test_bit(In_sync, &rdev->flags) &&
> + (rdev->recovery_offset <= sh->sector ||
> + rdev->mddev->recovery_cp <= sh->sector))
> + rv = 1;
> +
> + return rv;
Thanks.
I almost always prefer 'at the start' as important things should be obvious.
So I have updated 'want_replace'.
>
> 2693b9e md/raid5: handle activation of replacement device when
> recovery completes.
>
> I questioned not needing a barrier in raid5_end_write_request after
> finding conf->disks[i].replacement == NULL until I found the note in
> raid5_end_read_request about the rdev being pinned until all i/o
> returns. Maybe a similar note to raid5_end_write_request?
I like adding explanatory notes ... but I'm not quite sure what you are
suggesting here. Could you be a little more explicit? Thanks.
>
> d6db3d0 md/raid5: recognise replacements when assembling array.
> 6cdb4fb md/raid5: If there is a spare and a replaceable device, start
> replacement.
> 0124565 md/raid5: Mark device replaceable when we see a write error.
>
> last 3 reviewed-by.
Thanks.
>
> 058c478..678a66d
> raid10 and raid1 patches not reviewed.
That's what a Christmas break is for, isn't it??
Thanks for all the review - I really appreciate it.
NeilBrown
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-12-15 6:18 ` NeilBrown
@ 2011-12-15 7:14 ` Williams, Dan J
2011-12-20 5:18 ` NeilBrown
0 siblings, 1 reply; 31+ messages in thread
From: Williams, Dan J @ 2011-12-15 7:14 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Wed, Dec 14, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
>> f248f8c md: create externally visible flags for supporting hot-replace.
>>
>> 'replaceable' just strikes me as a confusing name as all devices are
>> nominally "replaceable", but whether you want it to be actively
>> replaced is a different consideration. What about "incumbent" to mark
>> the disk as currently holding a position we want it to vacate and
>> remove any potential confusion with 'replacement'.
>
> Fair point. I had wondered if I should not have the flag and just use the
> "write_error" flag. However the meaning is slightly different.
>
> I don't really like "incumbent" as it gives no indication that there is a
> desire to replace the device. Maybe "want_replacement" ??
Yeah that works. I was hung up on the previous scheme only differing
in the "able" vs "ment" suffix, so a "want_" prefix fits the bill.
>> 37aebb5 md/raid5: preferentially read from replacement device if possible.
[..]
>> Should this one liner be broken out for -stable?
> Do you see a particular problem that this fixes that is already possible
> without hot-replace?
I would need to think it through a bit further, but the changelog was
sufficiently convincing so I thought I would ask.
>> Nit, not sure if it's worth fixing but this one introduces some
>> inconsistent line wrapping around logical operators... "at the end" vs
>> "beginning of next line"
>>
[..]
>
> Thanks.
> I almost always prefer 'at the start' as important things should be obvious.
> So I have updated 'want_replace'.
...and here I've been an 'at the end' Sneetch and convinced myself
that it's easier to read...
>>
>> 2693b9e md/raid5: handle activation of replacement device when
>> recovery completes.
>>
>> I questioned not needing a barrier in raid5_end_write_request after
>> finding conf->disks[i].replacement == NULL until I found the note in
>> raid5_end_read_request about the rdev being pinned until all i/o
>> returns. Maybe a similar note to raid5_end_write_request?
>
> I like adding explanatory notes ... but I'm not quite sure what you are
> suggesting here. Could you be a little more explicit? Thanks.
>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 59f5b05..8074515 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1758,7 +1758,8 @@ static void raid5_end_write_request(struct bio
*bi, int error)
replacement = 1;
else
/* rdev was removed and 'replacement'
- * replaced it.
+ * replaced it. rdev is not removed
+ * until all requests are finished.
*/
rdev = conf->disks[i].rdev;
break;
>> 058c478..678a66d
>> raid10 and raid1 patches not reviewed.
>
> That's what a Christmas break is for, isn't it??
True, maybe that's my "wish I could find the time to do more code
review" wish coming true :-).
> Thanks for all the review - I really appreciate it.
>
Anytime.
--
Dan
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-12-15 7:14 ` Williams, Dan J
@ 2011-12-20 5:18 ` NeilBrown
2011-12-22 20:54 ` Alexander Kühn
0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2011-12-20 5:18 UTC (permalink / raw)
To: Williams, Dan J; +Cc: linux-raid
On Wed, 14 Dec 2011 23:14:04 -0800 "Williams, Dan J"
<dan.j.williams@intel.com> wrote:
> On Wed, Dec 14, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
> >> f248f8c md: create externally visible flags for supporting hot-replace.
> >>
> >> 'replaceable' just strikes me as a confusing name as all devices are
> >> nominally "replaceable", but whether you want it to be actively
> >> replaced is a different consideration. What about "incumbent" to mark
> >> the disk as currently holding a position we want it to vacate and
> >> remove any potential confusion with 'replacement'.
> >
> > Fair point. I had wondered if I should not have the flag and just use the
> > "write_error" flag. However the meaning is slightly different.
> >
> > I don't really like "incumbent" as it gives no indication that there is a
> > desire to replace the device. Maybe "want_replacement" ??
>
> Yeah that works. I was hung up on the previous scheme only differing
> the in "able" vs "ment" suffix, so a "want_" prefix fits the bill.
I've changed it all to "want_replacements" and agree that it is clearer.
>
> >> 37aebb5 md/raid5: preferentially read from replacement device if possible.
> [..]
> >> Should this one liner be broken out for -stable?
> >> Do you see a particular problem that this fixes that is already possible
> > without hot-replace?
>
> I would need to think it through a bit further, but the changelog was
> sufficiently convincing so I thought I would ask.
I'm pretty sure it isn't really needed before replacement support is added,
but thanks for checking.
>
> >> Nit, not sure if it's worth fixing but this one introduces some
> >> inconsistent line wrapping around logical operators... "at the end" vs
> >> "beginning of next line"
> >>
> [..]
> >
> > Thanks.
> > I almost always prefer 'at the start' as important things should be obvious.
> > So I have updated 'want_replace'.
>
> ...and here I've been an 'at the end' Sneetch and convinced myself
> that it's easier to read...
>
> >>
> >> 2693b9e md/raid5: handle activation of replacement device when
> >> recovery completes.
> >>
> >> I questioned not needing a barrier in raid5_end_write_request after
> >> finding conf->disks[i].replacement == NULL until I found the note in
> >> raid5_end_read_request about the rdev being pinned until all i/o
> >> returns. Maybe a similar note to raid5_end_write_request?
> >
> > I like adding explanatory notes ... but I'm not quite sure what you are
> > suggesting here. Could you be a little more explicit? Thanks.
> >
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 59f5b05..8074515 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1758,7 +1758,8 @@ static void raid5_end_write_request(struct bio
> *bi, int error)
> replacement = 1;
> else
> /* rdev was removed and 'replacement'
> - * replaced it.
> + * replaced it. rdev is not removed
> + * until all requests are finished.
> */
> rdev = conf->disks[i].rdev;
> break;
>
I've added that - thanks.
Thanks,
NeilBrown
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-12-20 5:18 ` NeilBrown
@ 2011-12-22 20:54 ` Alexander Kühn
2011-12-22 21:14 ` NeilBrown
0 siblings, 1 reply; 31+ messages in thread
From: Alexander Kühn @ 2011-12-22 20:54 UTC (permalink / raw)
To: NeilBrown; +Cc: Williams, Dan J, linux-raid
Zitat von NeilBrown <neilb@suse.de>:
> On Wed, 14 Dec 2011 23:14:04 -0800 "Williams, Dan J"
> <dan.j.williams@intel.com> wrote:
>
>> On Wed, Dec 14, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
>> >> f248f8c md: create externally visible flags for supporting hot-replace.
>> >>
>> >> 'replaceable' just strikes me as a confusing name as all devices are
>> >> nominally "replaceable", but whether you want it to be actively
>> >> replaced is a different consideration. What about "incumbent" to mark
>> >> the disk as currently holding a position we want it to vacate and
>> >> remove any potential confusion with 'replacement'.
'vacating' strikes me as the obvious choice, no?
* Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
2011-12-22 20:54 ` Alexander Kühn
@ 2011-12-22 21:14 ` NeilBrown
0 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2011-12-22 21:14 UTC (permalink / raw)
To: Alexander Kühn; +Cc: Williams, Dan J, linux-raid
On Thu, 22 Dec 2011 21:54:42 +0100 Alexander Kühn
<alexander.kuehn@nagilum.de> wrote:
>
> Zitat von NeilBrown <neilb@suse.de>:
>
> > On Wed, 14 Dec 2011 23:14:04 -0800 "Williams, Dan J"
> > <dan.j.williams@intel.com> wrote:
> >
> >> On Wed, Dec 14, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
> >> >> f248f8c md: create externally visible flags for supporting hot-replace.
> >> >>
> >> >> 'replaceable' just strikes me as a confusing name as all devices are
> >> >> nominally "replaceable", but whether you want it to be actively
> >> >> replaced is a different consideration. What about "incumbent" to mark
> >> >> the disk as currently holding a position we want it to vacate and
> >> >> remove any potential confusion with 'replacement'.
>
> 'vacating' strikes me as the obvious choice, no?
Interesting suggestion, but not quite the right meaning.
The bit can be set before a replacement is available. So it isn't a
statement about what is happening to the device, but about what should be
happening to the device.
The meaning we want is "replace this device as soon as possible".
I don't think there is one word that really has a meaning like that.
"replaceable" was the closest I could get but that means "can be replaced"
rather than "should be replaced" and as Dan pointed out that is a significant
difference.
Maybe "deprecated"?? However I don't think using that would improve clarity.
Thanks,
NeilBrown