* [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window
@ 2008-04-29 3:34 NeilBrown
2008-04-29 3:34 ` [PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs NeilBrown
` (8 more replies)
0 siblings, 9 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-raid, linux-kernel, Bernd Schubert, Dan Williams, stable
The following 9 patches for md are suitable for 2.6.26.
Yes, I really should have submitted them to -mm earlier, but I've been
on leave, sorry.
First patch is a bug fix that should go in 2.6.25.stable.
It has been copied to stable@kernel.org.
The others mostly relate to "external" metadata support, i.e. a userspace
program does all the management of metadata and needs to communicate
with the kernel to do its job properly.
NeilBrown
[PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs
[PATCH 002 of 9] md: Skip all metadata update processing when using external metadata.
[PATCH 003 of 9] md: Reinitialise more mddev fields in do_md_stop.
[PATCH 004 of 9] md: Fix 'safemode' handling for external metadata.
[PATCH 005 of 9] md: Fix up switching md arrays between read-only and read-write
[PATCH 006 of 9] md: Remove a stray command from a copy and paste error in resync_start_store
[PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array
[PATCH 008 of 9] md: md: raid5 rate limit error printk
[PATCH 009 of 9] md: md: support blocking writes to an array on device failure
* [PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
@ 2008-04-29 3:34 ` NeilBrown
2008-04-29 3:34 ` [PATCH 002 of 9] md: Skip all metadata update processing when using external metadata NeilBrown
` (7 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams, stable
From: Dan Williams <dan.j.williams@intel.com>
rdev->mddev is no longer valid upon return from entry->store() when the
'remove' command is given.
This should go in 2.6.25.stable.
Cc: stable@kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:50.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:55.000000000 +1000
@@ -2096,7 +2096,7 @@ rdev_attr_store(struct kobject *kobj, st
rv = -EBUSY;
else
rv = entry->store(rdev, page, length);
- mddev_unlock(rdev->mddev);
+ mddev_unlock(mddev);
}
return rv;
}
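
The failure mode fixed here can be sketched in plain user-space C. This is
only a model, not the md code: `struct array`, `struct member`,
`member_remove` and `attr_store` are hypothetical stand-ins for mddev, rdev,
the 'remove' store handler and rdev_attr_store. The point is that the caller
takes its own copy of the back-pointer before invoking a handler that may
detach (and, in the kernel, free) the object.

```c
#include <stddef.h>

/* Hypothetical stand-ins for mddev_t and mdk_rdev_t. */
struct array  { int locked; };
struct member { struct array *owner; };

static void array_lock(struct array *a)   { a->locked = 1; }
static void array_unlock(struct array *a) { a->locked = 0; }

/* Models the 'remove' command: the member is detached from its array,
 * so m->owner must not be used by the caller afterwards. */
static int member_remove(struct member *m)
{
	m->owner = NULL;
	return 0;
}

/* Models the fixed attribute-store path: the owner pointer is copied
 * before the handler runs, and only the copy is used to unlock. */
static int attr_store(struct member *m, int (*store)(struct member *))
{
	struct array *a = m->owner;	/* local copy, taken up front */
	int rv;

	if (a == NULL)
		return -1;
	array_lock(a);
	rv = store(m);
	array_unlock(a);		/* not array_unlock(m->owner) */
	return rv;
}
```

Calling `attr_store(&m, member_remove)` leaves the array unlocked even though
`m.owner` is NULL by the time the handler returns.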
* [PATCH 002 of 9] md: Skip all metadata update processing when using external metadata.
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
2008-04-29 3:34 ` [PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs NeilBrown
@ 2008-04-29 3:34 ` NeilBrown
2008-04-29 3:35 ` [PATCH 003 of 9] md: Reinitialise more mddev fields in do_md_stop NeilBrown
` (6 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
All the metadata update processing for external metadata is done
in user-space or through the sysfs interfaces, so make "md_update_sb"
a no-op in that case.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 ++
1 file changed, 2 insertions(+)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:55.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
@@ -1651,6 +1651,8 @@ static void md_update_sb(mddev_t * mddev
int sync_req;
int nospares = 0;
+ if (mddev->external)
+ return;
repeat:
spin_lock_irq(&mddev->write_lock);
* [PATCH 003 of 9] md: Reinitialise more mddev fields in do_md_stop.
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
2008-04-29 3:34 ` [PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs NeilBrown
2008-04-29 3:34 ` [PATCH 002 of 9] md: Skip all metadata update processing when using external metadata NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:35 ` [PATCH 004 of 9] md: Fix 'safemode' handling for external metadata NeilBrown
` (5 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
I keep finding problems where an mddev gets reused and some field
has a value from a previous usage that confuses the new usage. So
clear all fields that could possibly need clearing when calling do_md_stop.
Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
@@ -276,6 +276,7 @@ static mddev_t * mddev_find(dev_t unit)
init_waitqueue_head(&new->sb_wait);
new->reshape_position = MaxSector;
new->resync_max = MaxSector;
+ new->level = LEVEL_NONE;
new->queue = blk_alloc_queue(GFP_KERNEL);
if (!new->queue) {
@@ -3713,6 +3714,30 @@ static int do_md_stop(mddev_t * mddev, i
mddev->reshape_position = MaxSector;
mddev->external = 0;
mddev->persistent = 0;
+ mddev->level = LEVEL_NONE;
+ mddev->clevel[0] = 0;
+ mddev->flags = 0;
+ mddev->ro = 0;
+ mddev->metadata_type[0] = 0;
+ mddev->chunk_size = 0;
+ mddev->ctime = mddev->utime = 0;
+ mddev->layout = 0;
+ mddev->max_disks = 0;
+ mddev->events = 0;
+ mddev->delta_disks = 0;
+ mddev->new_level = LEVEL_NONE;
+ mddev->new_layout = 0;
+ mddev->new_chunk = 0;
+ mddev->curr_resync = 0;
+ mddev->resync_mismatches = 0;
+ mddev->suspend_lo = mddev->suspend_hi = 0;
+ mddev->sync_speed_min = mddev->sync_speed_max = 0;
+ mddev->recovery = 0;
+ mddev->in_sync = 0;
+ mddev->changed = 0;
+ mddev->degraded = 0;
+ mddev->barriers_work = 0;
+ mddev->safemode = 0;
} else if (mddev->pers)
printk(KERN_INFO "md: %s switched to read-only mode.\n",
* [PATCH 004 of 9] md: Fix 'safemode' handling for external metadata.
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (2 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 003 of 9] md: Reinitialise more mddev fields in do_md_stop NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:35 ` [PATCH 005 of 9] md: Fix up switching md arrays between read-only and read-write NeilBrown
` (4 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
'safemode' relates to marking an array as 'clean' if there has been
no write traffic for a while (a couple of seconds), to reduce the chance
of the array being found dirty on reboot.
->safemode is set to '1' when there have been no writes for a while, and
it gets set to '0' when the superblock is updated with the 'clean' flag set.
This requires a few fixes for 'external' metadata:
- When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
- when we write to an array that has 'safemode' set (there must have been
some delay in updating the metadata), we need to clear safemode.
- Don't try to update external metadata in md_check_recovery for safemode
transitions - it won't work.
Also, don't try to support "immediate safe mode" (safemode==2) for external
metadata, it cannot really work (the safemode timeout can be set very low
if this is really needed).
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
@@ -2614,6 +2614,8 @@ array_state_store(mddev_t *mddev, const
if (atomic_read(&mddev->writes_pending) == 0) {
if (mddev->in_sync == 0) {
mddev->in_sync = 1;
+ if (mddev->safemode == 1)
+ mddev->safemode = 0;
if (mddev->persistent)
set_bit(MD_CHANGE_CLEAN,
&mddev->flags);
@@ -5391,6 +5393,8 @@ void md_write_start(mddev_t *mddev, stru
md_wakeup_thread(mddev->sync_thread);
}
atomic_inc(&mddev->writes_pending);
+ if (mddev->safemode == 1)
+ mddev->safemode = 0;
if (mddev->in_sync) {
spin_lock_irq(&mddev->write_lock);
if (mddev->in_sync) {
@@ -5815,7 +5819,7 @@ void md_check_recovery(mddev_t *mddev)
return;
if (signal_pending(current)) {
- if (mddev->pers->sync_request) {
+ if (mddev->pers->sync_request && !mddev->external) {
printk(KERN_INFO "md: %s in immediate safe mode\n",
mdname(mddev));
mddev->safemode = 2;
@@ -5827,7 +5831,7 @@ void md_check_recovery(mddev_t *mddev)
(mddev->flags && !mddev->external) ||
test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
- (mddev->safemode == 1) ||
+ (mddev->external == 0 && mddev->safemode == 1) ||
(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
&& !mddev->in_sync && mddev->recovery_cp == MaxSector)
))
@@ -5836,16 +5840,20 @@ void md_check_recovery(mddev_t *mddev)
if (mddev_trylock(mddev)) {
int spares = 0;
- spin_lock_irq(&mddev->write_lock);
- if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
- !mddev->in_sync && mddev->recovery_cp == MaxSector) {
- mddev->in_sync = 1;
- if (mddev->persistent)
- set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+ if (!mddev->external) {
+ spin_lock_irq(&mddev->write_lock);
+ if (mddev->safemode &&
+ !atomic_read(&mddev->writes_pending) &&
+ !mddev->in_sync &&
+ mddev->recovery_cp == MaxSector) {
+ mddev->in_sync = 1;
+ if (mddev->persistent)
+ set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+ }
+ if (mddev->safemode == 1)
+ mddev->safemode = 0;
+ spin_unlock_irq(&mddev->write_lock);
}
- if (mddev->safemode == 1)
- mddev->safemode = 0;
- spin_unlock_irq(&mddev->write_lock);
if (mddev->flags)
md_update_sb(mddev, 0);
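
The three rules from the description can be modelled in a few lines of
user-space C. This is a sketch under stated assumptions, not the md code:
`struct md_model` and the `model_*` functions are hypothetical, the
recovery_cp check is left out, and returning -EBUSY when writes are pending
is an assumption about the sysfs path.

```c
#include <errno.h>

struct md_model {
	int external;		/* metadata managed from user space */
	int safemode;		/* 1 = idle timer fired, array may go clean */
	int in_sync;		/* 1 = array is recorded as clean */
	int writes_pending;	/* writes currently in flight */
};

/* Models the write path: a write cancels a pending safemode request
 * and makes the array dirty. */
static void model_write_start(struct md_model *m)
{
	m->writes_pending++;
	if (m->safemode == 1)
		m->safemode = 0;
	m->in_sync = 0;
}

/* Models writing "clean" via sysfs: marking the array clean also
 * clears safemode. */
static int model_set_clean(struct md_model *m)
{
	if (m->writes_pending != 0)
		return -EBUSY;	/* assumption: refuse while writes are pending */
	if (m->in_sync == 0) {
		m->in_sync = 1;
		if (m->safemode == 1)
			m->safemode = 0;
	}
	return 0;
}

/* Models the safemode part of the periodic check: for external
 * metadata the kernel side does nothing and leaves the transition
 * to user space. */
static void model_check_recovery(struct md_model *m)
{
	if (m->external)
		return;
	if (m->safemode && m->writes_pending == 0 && !m->in_sync)
		m->in_sync = 1;
	if (m->safemode == 1)
		m->safemode = 0;
}
```

With `external` set, `model_check_recovery` leaves `in_sync` and `safemode`
alone, and the array only becomes clean once `model_set_clean` is called,
which corresponds to the metadata handler writing to sysfs.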
* [PATCH 005 of 9] md: Fix up switching md arrays between read-only and read-write
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (3 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 004 of 9] md: Fix 'safemode' handling for external metadata NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:35 ` [PATCH 006 of 9] md: Remove a stray command from a copy and paste error in resync_start_store NeilBrown
` (3 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
When setting an array to 'readonly' or to 'active' via sysfs, we must
make the appropriate set_disk_ro call too.
Also when switching to "read_auto" (which is like readonly, but blocks on the
first write so that metadata can be marked 'dirty') we need to be more careful
about what state we are changing from.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:56.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
@@ -2593,15 +2593,20 @@ array_state_store(mddev_t *mddev, const
err = do_md_stop(mddev, 1);
else {
mddev->ro = 1;
+ set_disk_ro(mddev->gendisk, 1);
err = do_md_run(mddev);
}
break;
case read_auto:
- /* stopping an active array */
if (mddev->pers) {
- err = do_md_stop(mddev, 1);
- if (err == 0)
- mddev->ro = 2; /* FIXME mark devices writable */
+ if (mddev->ro != 1)
+ err = do_md_stop(mddev, 1);
+ else
+ err = restart_array(mddev);
+ if (err == 0) {
+ mddev->ro = 2;
+ set_disk_ro(mddev->gendisk, 0);
+ }
} else {
mddev->ro = 2;
err = do_md_run(mddev);
@@ -2639,6 +2644,7 @@ array_state_store(mddev_t *mddev, const
err = 0;
} else {
mddev->ro = 0;
+ set_disk_ro(mddev->gendisk, 0);
err = do_md_run(mddev);
}
break;
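
As a rough illustration of why "read_auto" pairs mddev->ro = 2 with
set_disk_ro(..., 0), here is a small user-space model. The names
(`struct md_ro_model`, `model_set_mode`, `model_write`) are hypothetical, and
the switch to read-write on the first write is a simplification of what the
write path does.

```c
#include <errno.h>

enum { RO_NONE = 0, RO_READONLY = 1, RO_AUTO = 2 };

struct md_ro_model {
	int ro;		/* mirrors the three values of mddev->ro */
	int disk_ro;	/* what was passed to set_disk_ro() */
	int dirty;	/* metadata has been marked dirty */
};

/* Only true read-only mode marks the disk itself read-only;
 * read-auto must still let the first write through. */
static void model_set_mode(struct md_ro_model *m, int ro)
{
	m->ro = ro;
	m->disk_ro = (ro == RO_READONLY);
}

static int model_write(struct md_ro_model *m)
{
	if (m->disk_ro)
		return -EROFS;		/* refused before reaching md */
	if (m->ro == RO_AUTO)
		m->ro = RO_NONE;	/* first write: switch to read-write */
	m->dirty = 1;
	return 0;
}
```

If read-auto also marked the disk read-only, the first write would be
rejected and the array could never leave that state on its own.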
* [PATCH 006 of 9] md: Remove a stray command from a copy and paste error in resync_start_store
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (4 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 005 of 9] md: Fix up switching md arrays between read-only and read-write NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:35 ` [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array NeilBrown
` (2 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams
From: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 1 -
1 file changed, 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
@@ -2459,7 +2459,6 @@ resync_start_show(mddev_t *mddev, char *
static ssize_t
resync_start_store(mddev_t *mddev, const char *buf, size_t len)
{
- /* can only set chunk_size if array is not yet active */
char *e;
unsigned long long n = simple_strtoull(buf, &e, 10);
* [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (5 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 006 of 9] md: Remove a stray command from a copy and paste error in resync_start_store NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:51 ` Andrew Morton
2008-04-29 3:35 ` [PATCH 008 of 9] md: md: raid5 rate limit error printk NeilBrown
2008-04-29 3:35 ` [PATCH 009 of 9] md: md: support blocking writes to an array on device failure NeilBrown
8 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams
From: Dan Williams <dan.j.williams@intel.com>
Found when trying to reassemble an active externally managed array.
Without this check we hit the more noisy "sysfs duplicate" warning in
the later call to kobject_add.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 5 +++++
1 file changed, 5 insertions(+)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
@@ -1369,6 +1369,11 @@ static int bind_rdev_to_array(mdk_rdev_t
MD_BUG();
return -EINVAL;
}
+
+ /* prevent duplicates */
+ if (find_rdev(mddev, rdev->bdev->bd_dev))
+ return -EEXIST;
+
/* make sure rdev->size exceeds mddev->size */
if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
if (mddev->pers) {
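
The check added above is a plain "look up, then refuse" test. A minimal
user-space sketch of the same pattern follows; `struct member_set`,
`set_contains` and `set_bind` are hypothetical names, device numbers are
plain integers, and the sketch is single-threaded, so it says nothing about
what locking the real call site needs.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_MEMBERS 8

struct member_set {
	unsigned int devs[MAX_MEMBERS];	/* device numbers already bound */
	size_t count;
};

/* Plays the role of the lookup: is this device number already bound? */
static int set_contains(const struct member_set *s, unsigned int dev)
{
	size_t i;

	for (i = 0; i < s->count; i++)
		if (s->devs[i] == dev)
			return 1;
	return 0;
}

/* Plays the role of the bind step: reject duplicates up front, so a
 * later registration step never sees the same device twice. */
static int set_bind(struct member_set *s, unsigned int dev)
{
	if (set_contains(s, dev))
		return -EEXIST;
	if (s->count == MAX_MEMBERS)
		return -ENOSPC;
	s->devs[s->count++] = dev;
	return 0;
}
```

Binding the same device number twice returns -EEXIST on the second call and
leaves the set unchanged.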
* [PATCH 008 of 9] md: md: raid5 rate limit error printk
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (6 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
2008-04-29 3:55 ` Andrew Morton
2008-04-29 3:35 ` [PATCH 009 of 9] md: md: support blocking writes to an array on device failure NeilBrown
8 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Bernd Schubert, Dan Williams
From: Bernd Schubert <bernd-schubert@gmx.de>
Last night we had SCSI problems and a hardware RAID
unit was offlined during heavy I/O. While this happened we got, for
about 3 minutes, a huge number of messages like this:
Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
I guess the high error rate is responsible for other events not being
scheduled: during this time the system was not pingable, and in the end
other devices also ran into SCSI command timeouts, causing problems on
these unrelated devices as well.
Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 27 +++++++++++++++------------
./include/linux/raid/md_k.h | 3 +++
2 files changed, 18 insertions(+), 12 deletions(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-04-29 12:27:50.000000000 +1000
+++ ./drivers/md/raid5.c 2008-04-29 12:27:58.000000000 +1000
@@ -1143,10 +1143,11 @@ static void raid5_end_read_request(struc
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ printk_rl(KERN_INFO "raid5:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1160,16 +1161,18 @@ static void raid5_end_read_request(struc
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
if (conf->mddev->degraded)
- printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdn);
+ printk_rl(KERN_WARNING "raid5:%s: read error not correctable "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector + rdev->data_offset),
+ bdn);
else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
/* Oh, no!!! */
- printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdn);
+ printk_rl(KERN_WARNING "raid5:%s: read error NOT corrected!! "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector + rdev->data_offset),
+ bdn);
else if (atomic_read(&rdev->read_errors)
> conf->max_nr_stripes)
printk(KERN_WARNING
diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2008-04-29 12:25:24.000000000 +1000
+++ ./include/linux/raid/md_k.h 2008-04-29 12:27:58.000000000 +1000
@@ -368,6 +368,9 @@ static inline void safe_put_page(struct
if (p) put_page(p);
}
+#define printk_rl printk_ratelimit() ?: printk
+
+
#endif /* CONFIG_BLOCK */
#endif
* [PATCH 009 of 9] md: md: support blocking writes to an array on device failure
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
` (7 preceding siblings ...)
2008-04-29 3:35 ` [PATCH 008 of 9] md: md: raid5 rate limit error printk NeilBrown
@ 2008-04-29 3:35 ` NeilBrown
8 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2008-04-29 3:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams
From: Dan Williams <dan.j.williams@intel.com>
Allows a userspace metadata handler to take action upon detecting a device
failure.
Based on an original patch by Neil Brown.
Changes:
- added blocked_wait waitqueue to rdev
- don't qualify Blocked with Faulty; always let userspace block writes
- added md_wait_for_blocked_rdev to wait for the device to become unblocked;
  if userspace misses the notification, another one is sent every 5 seconds
- set MD_RECOVERY_NEEDED after clearing "blocked"
- kill DoBlock flag, just test mddev->external
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 33 ++++++++++++++++++++++++++++++++-
./drivers/md/raid1.c | 27 ++++++++++++++++++++++++---
./drivers/md/raid10.c | 29 ++++++++++++++++++++++++++---
./drivers/md/raid5.c | 33 +++++++++++++++++++++++++++++++++
./include/linux/raid/md.h | 1 +
./include/linux/raid/md_k.h | 4 ++++
6 files changed, 120 insertions(+), 7 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
+++ ./drivers/md/md.c 2008-04-29 12:27:58.000000000 +1000
@@ -1827,6 +1827,10 @@ state_show(mdk_rdev_t *rdev, char *page)
len += sprintf(page+len, "%swrite_mostly",sep);
sep = ",";
}
+ if (test_bit(Blocked, &rdev->flags)) {
+ len += sprintf(page+len, "%sblocked", sep);
+ sep = ",";
+ }
if (!test_bit(Faulty, &rdev->flags) &&
!test_bit(In_sync, &rdev->flags)) {
len += sprintf(page+len, "%sspare", sep);
@@ -1843,6 +1847,8 @@ state_store(mdk_rdev_t *rdev, const char
* remove - disconnects the device
* writemostly - sets write_mostly
* -writemostly - clears write_mostly
+ * blocked - sets the Blocked flag
+ * -blocked - clears the Blocked flag
*/
int err = -EINVAL;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -1865,6 +1871,16 @@ state_store(mdk_rdev_t *rdev, const char
} else if (cmd_match(buf, "-writemostly")) {
clear_bit(WriteMostly, &rdev->flags);
err = 0;
+ } else if (cmd_match(buf, "blocked")) {
+ set_bit(Blocked, &rdev->flags);
+ err = 0;
+ } else if (cmd_match(buf, "-blocked")) {
+ clear_bit(Blocked, &rdev->flags);
+ wake_up(&rdev->blocked_wait);
+ set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+ md_wakeup_thread(rdev->mddev->thread);
+
+ err = 0;
}
return err ? err : len;
}
@@ -2193,7 +2209,9 @@ static mdk_rdev_t *md_import_device(dev_
goto abort_free;
}
}
+
INIT_LIST_HEAD(&rdev->same_set);
+ init_waitqueue_head(&rdev->blocked_wait);
return rdev;
@@ -4957,6 +4975,9 @@ void md_error(mddev_t *mddev, mdk_rdev_t
if (!rdev || test_bit(Faulty, &rdev->flags))
return;
+
+ if (mddev->external)
+ set_bit(Blocked, &rdev->flags);
/*
dprintk("md_error dev:%s, rdev:(%d:%d), (caller: %p,%p,%p,%p).\n",
mdname(mddev),
@@ -5759,7 +5780,7 @@ static int remove_and_add_spares(mddev_t
rdev_for_each(rdev, rtmp, mddev)
if (rdev->raid_disk >= 0 &&
- !mddev->external &&
+ !test_bit(Blocked, &rdev->flags) &&
(test_bit(Faulty, &rdev->flags) ||
! test_bit(In_sync, &rdev->flags)) &&
atomic_read(&rdev->nr_pending)==0) {
@@ -5958,6 +5979,16 @@ void md_check_recovery(mddev_t *mddev)
}
}
+void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
+{
+ sysfs_notify(&rdev->kobj, NULL, "state");
+ wait_event_timeout(rdev->blocked_wait,
+ !test_bit(Blocked, &rdev->flags),
+ msecs_to_jiffies(5000));
+ rdev_dec_pending(rdev, mddev);
+}
+EXPORT_SYMBOL(md_wait_for_blocked_rdev);
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-04-29 12:25:24.000000000 +1000
+++ ./drivers/md/raid10.c 2008-04-29 12:27:58.000000000 +1000
@@ -790,6 +790,7 @@ static int make_request(struct request_q
const int do_sync = bio_sync(bio);
struct bio_list bl;
unsigned long flags;
+ mdk_rdev_t *blocked_rdev;
if (unlikely(bio_barrier(bio))) {
bio_endio(bio, -EOPNOTSUPP);
@@ -879,17 +880,23 @@ static int make_request(struct request_q
/*
* WRITE:
*/
- /* first select target devices under spinlock and
+ /* first select target devices under rcu_lock and
* inc refcount on their rdev. Record them by setting
* bios[x] to bio
*/
raid10_find_phys(conf, r10_bio);
+ retry_write:
+ blocked_rdev = 0;
rcu_read_lock();
for (i = 0; i < conf->copies; i++) {
int d = r10_bio->devs[i].devnum;
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[d].rdev);
- if (rdev &&
- !test_bit(Faulty, &rdev->flags)) {
+ if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+ atomic_inc(&rdev->nr_pending);
+ blocked_rdev = rdev;
+ break;
+ }
+ if (rdev && !test_bit(Faulty, &rdev->flags)) {
atomic_inc(&rdev->nr_pending);
r10_bio->devs[i].bio = bio;
} else {
@@ -899,6 +906,22 @@ static int make_request(struct request_q
}
rcu_read_unlock();
+ if (unlikely(blocked_rdev)) {
+ /* Have to wait for this device to get unblocked, then retry */
+ int j;
+ int d;
+
+ for (j = 0; j < i; j++)
+ if (r10_bio->devs[j].bio) {
+ d = r10_bio->devs[j].devnum;
+ rdev_dec_pending(conf->mirrors[d].rdev, mddev);
+ }
+ allow_barrier(conf);
+ md_wait_for_blocked_rdev(blocked_rdev, mddev);
+ wait_barrier(conf);
+ goto retry_write;
+ }
+
atomic_set(&r10_bio->remaining, 0);
bio_list_init(&bl);
diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-04-29 12:25:24.000000000 +1000
+++ ./drivers/md/raid1.c 2008-04-29 12:27:58.000000000 +1000
@@ -773,7 +773,6 @@ static int make_request(struct request_q
r1bio_t *r1_bio;
struct bio *read_bio;
int i, targets = 0, disks;
- mdk_rdev_t *rdev;
struct bitmap *bitmap = mddev->bitmap;
unsigned long flags;
struct bio_list bl;
@@ -781,6 +780,7 @@ static int make_request(struct request_q
const int rw = bio_data_dir(bio);
const int do_sync = bio_sync(bio);
int do_barriers;
+ mdk_rdev_t *blocked_rdev;
/*
* Register the new request and wait if the reconstruction
@@ -862,10 +862,17 @@ static int make_request(struct request_q
first = 0;
}
#endif
+ retry_write:
+ blocked_rdev = NULL;
rcu_read_lock();
for (i = 0; i < disks; i++) {
- if ((rdev=rcu_dereference(conf->mirrors[i].rdev)) != NULL &&
- !test_bit(Faulty, &rdev->flags)) {
+ mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
+ if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+ atomic_inc(&rdev->nr_pending);
+ blocked_rdev = rdev;
+ break;
+ }
+ if (rdev && !test_bit(Faulty, &rdev->flags)) {
atomic_inc(&rdev->nr_pending);
if (test_bit(Faulty, &rdev->flags)) {
rdev_dec_pending(rdev, mddev);
@@ -878,6 +885,20 @@ static int make_request(struct request_q
}
rcu_read_unlock();
+ if (unlikely(blocked_rdev)) {
+ /* Wait for this device to become unblocked */
+ int j;
+
+ for (j = 0; j < i; j++)
+ if (r1_bio->bios[j])
+ rdev_dec_pending(conf->mirrors[j].rdev, mddev);
+
+ allow_barrier(conf);
+ md_wait_for_blocked_rdev(blocked_rdev, mddev);
+ wait_barrier(conf);
+ goto retry_write;
+ }
+
BUG_ON(targets == 0); /* we never fail the last device */
if (targets < conf->raid_disks) {
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-04-29 12:27:58.000000000 +1000
+++ ./drivers/md/raid5.c 2008-04-29 12:27:58.000000000 +1000
@@ -2610,6 +2610,7 @@ static void handle_stripe_expansion(raid
}
}
+
/*
* handle_stripe - do things to a stripe.
*
@@ -2635,6 +2636,7 @@ static void handle_stripe5(struct stripe
struct stripe_head_state s;
struct r5dev *dev;
unsigned long pending = 0;
+ mdk_rdev_t *blocked_rdev = NULL;
memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
@@ -2694,6 +2696,11 @@ static void handle_stripe5(struct stripe
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
+ if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+ blocked_rdev = rdev;
+ atomic_inc(&rdev->nr_pending);
+ break;
+ }
if (!rdev || !test_bit(In_sync, &rdev->flags)) {
/* The ReadError flag will just be confusing now */
clear_bit(R5_ReadError, &dev->flags);
@@ -2708,6 +2715,11 @@ static void handle_stripe5(struct stripe
}
rcu_read_unlock();
+ if (unlikely(blocked_rdev)) {
+ set_bit(STRIPE_HANDLE, &sh->state);
+ goto unlock;
+ }
+
if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
sh->ops.count++;
@@ -2897,8 +2909,13 @@ static void handle_stripe5(struct stripe
if (sh->ops.count)
pending = get_stripe_work(sh);
+ unlock:
spin_unlock(&sh->lock);
+ /* wait for this device to become unblocked */
+ if (unlikely(blocked_rdev))
+ md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+
if (pending)
raid5_run_ops(sh, pending);
@@ -2915,6 +2932,7 @@ static void handle_stripe6(struct stripe
struct stripe_head_state s;
struct r6_state r6s;
struct r5dev *dev, *pdev, *qdev;
+ mdk_rdev_t *blocked_rdev = NULL;
r6s.qd_idx = raid6_next_disk(pd_idx, disks);
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
@@ -2978,6 +2996,11 @@ static void handle_stripe6(struct stripe
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
+ if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+ blocked_rdev = rdev;
+ atomic_inc(&rdev->nr_pending);
+ break;
+ }
if (!rdev || !test_bit(In_sync, &rdev->flags)) {
/* The ReadError flag will just be confusing now */
clear_bit(R5_ReadError, &dev->flags);
@@ -2992,6 +3015,11 @@ static void handle_stripe6(struct stripe
set_bit(R5_Insync, &dev->flags);
}
rcu_read_unlock();
+
+ if (unlikely(blocked_rdev)) {
+ set_bit(STRIPE_HANDLE, &sh->state);
+ goto unlock;
+ }
pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d,%d\n",
s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
@@ -3097,8 +3125,13 @@ static void handle_stripe6(struct stripe
!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
handle_stripe_expansion(conf, sh, &r6s);
+ unlock:
spin_unlock(&sh->lock);
+ /* wait for this device to become unblocked */
+ if (unlikely(blocked_rdev))
+ md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+
return_io(return_bi);
for (i=disks; i-- ;) {
diff .prev/include/linux/raid/md.h ./include/linux/raid/md.h
--- .prev/include/linux/raid/md.h 2008-04-29 12:25:24.000000000 +1000
+++ ./include/linux/raid/md.h 2008-04-29 12:27:58.000000000 +1000
@@ -95,6 +95,7 @@ extern int sync_page_io(struct block_dev
extern void md_do_sync(mddev_t *mddev);
extern void md_new_event(mddev_t *mddev);
extern void md_allow_write(mddev_t *mddev);
+extern void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
#endif /* CONFIG_MD */
#endif
diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2008-04-29 12:27:58.000000000 +1000
+++ ./include/linux/raid/md_k.h 2008-04-29 12:27:58.000000000 +1000
@@ -84,6 +84,10 @@ struct mdk_rdev_s
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
+#define Blocked 8 /* An error occured on an externally
+ * managed array, don't allow writes
+ * until it is cleared */
+ wait_queue_head_t blocked_wait;
int desc_nr; /* descriptor index in the superblock */
int raid_disk; /* role of device in array */
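
The retry logic added to the raid1 and raid10 write paths follows one
pattern: take references while scanning, and if a blocked device is found,
drop the references taken so far, wait, and start over. The sketch below
models that pattern in user-space C. All names (`struct dev_model`,
`model_wait_unblocked`, `model_select_targets`) are hypothetical, and the
wait is replaced by an immediate unblock, standing in for the metadata
handler writing "-blocked" to the state attribute.

```c
#include <stddef.h>

#define NDEV 3

struct dev_model {
	int present;	/* slot has a device */
	int faulty;
	int blocked;
	int nr_pending;	/* references held by in-flight requests */
};

/* Stand-in for the wait: user space is assumed to clear the flag
 * right away; the reference taken on the blocked device is dropped. */
static void model_wait_unblocked(struct dev_model *d)
{
	d->blocked = 0;
	d->nr_pending--;
}

/* Picks write targets; returns how many devices were selected and
 * sets target[i] to 1 for each of them. */
static int model_select_targets(struct dev_model devs[NDEV], int target[NDEV])
{
	struct dev_model *blocked;
	int i, j, targets;

retry:
	blocked = NULL;
	targets = 0;
	for (i = 0; i < NDEV; i++) {
		struct dev_model *d = &devs[i];

		target[i] = 0;
		if (d->present && d->blocked) {
			d->nr_pending++;
			blocked = d;
			break;
		}
		if (d->present && !d->faulty) {
			d->nr_pending++;
			target[i] = 1;
			targets++;
		}
	}
	if (blocked) {
		/* undo references taken before the blocked device was seen */
		for (j = 0; j < i; j++)
			if (target[j])
				devs[j].nr_pending--;
		model_wait_unblocked(blocked);
		goto retry;
	}
	return targets;
}
```

After the call, every selected device holds exactly one reference, no matter
how many times the scan was restarted.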
* Re: [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array
2008-04-29 3:35 ` [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array NeilBrown
@ 2008-04-29 3:51 ` Andrew Morton
2008-04-29 4:09 ` Neil Brown
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-29 3:51 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel, Dan Williams
On Tue, 29 Apr 2008 13:35:27 +1000 NeilBrown <neilb@suse.de> wrote:
>
> From: Dan Williams <dan.j.williams@intel.com>
>
> Found when trying to reassemble an active externally managed array.
> Without this check we hit the more noisy "sysfs duplicate" warning in
> the later call to kobject_add.
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
> ./drivers/md/md.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff .prev/drivers/md/md.c ./drivers/md/md.c
> --- .prev/drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
> +++ ./drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
> @@ -1369,6 +1369,11 @@ static int bind_rdev_to_array(mdk_rdev_t
> MD_BUG();
> return -EINVAL;
> }
> +
> + /* prevent duplicates */
> + if (find_rdev(mddev, rdev->bdev->bd_dev))
> + return -EEXIST;
> +
> /* make sure rdev->size exceeds mddev->size */
> if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
> if (mddev->pers) {
Smells racy. Do we have enough locking in place here to make this more
than a best-effort thing?
* Re: [PATCH 008 of 9] md: md: raid5 rate limit error printk
2008-04-29 3:35 ` [PATCH 008 of 9] md: md: raid5 rate limit error printk NeilBrown
@ 2008-04-29 3:55 ` Andrew Morton
2008-04-29 4:14 ` Neil Brown
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-29 3:55 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel, Bernd Schubert, Dan Williams
On Tue, 29 Apr 2008 13:35:34 +1000 NeilBrown <neilb@suse.de> wrote:
> + printk_rl(KERN_WARNING "raid5:%s: read error NOT corrected!! "
> + "(sector %llu on %s).\n",
> + mdname(conf->mddev),
> + (unsigned long long)(sh->sector + rdev->data_offset),
> + bdn);
> else if (atomic_read(&rdev->read_errors)
> > conf->max_nr_stripes)
> printk(KERN_WARNING
>
> diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
> --- .prev/include/linux/raid/md_k.h 2008-04-29 12:25:24.000000000 +1000
> +++ ./include/linux/raid/md_k.h 2008-04-29 12:27:58.000000000 +1000
> @@ -368,6 +368,9 @@ static inline void safe_put_page(struct
> if (p) put_page(p);
> }
>
> +#define printk_rl printk_ratelimit() ?: printk
(boggle)
Isn't this backwards? Should be !printk_ratelimit()?
open-coding the printk_ratelimit() at each callsite would be more
conventional.
* Re: [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array
2008-04-29 3:51 ` Andrew Morton
@ 2008-04-29 4:09 ` Neil Brown
0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2008-04-29 4:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams
On Monday April 28, akpm@linux-foundation.org wrote:
> On Tue, 29 Apr 2008 13:35:27 +1000 NeilBrown <neilb@suse.de> wrote:
>
> >
> > From: Dan Williams <dan.j.williams@intel.com>
> >
> > Found when trying to reassemble an active externally managed array.
> > Without this check we hit the more noisy "sysfs duplicate" warning in
> > the later call to kobject_add.
> >
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Neil Brown <neilb@suse.de>
> >
> > ### Diffstat output
> > ./drivers/md/md.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff .prev/drivers/md/md.c ./drivers/md/md.c
> > --- .prev/drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
> > +++ ./drivers/md/md.c 2008-04-29 12:27:57.000000000 +1000
> > @@ -1369,6 +1369,11 @@ static int bind_rdev_to_array(mdk_rdev_t
> > MD_BUG();
> > return -EINVAL;
> > }
> > +
> > + /* prevent duplicates */
> > + if (find_rdev(mddev, rdev->bdev->bd_dev))
> > + return -EEXIST;
> > +
> > /* make sure rdev->size exceeds mddev->size */
> > if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
> > if (mddev->pers) {
>
> Smells racy. Do we have enough locking in place here to make this more
> than a best-effort thing?
Yes. We have exclusive access to the mddev at this point, so no race.
NeilBrown
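
For readers following the locking question, here is a minimal userspace sketch of the check-then-insert pattern, assuming a toy singly linked list in place of the real mddev disk list. All names (toy_rdev, toy_array, toy_find_rdev, toy_bind_rdev) are hypothetical and are not the md.c functions. The duplicate check is only reliable if the caller holds exclusive access to the array across both the lookup and the insertion, which is what the reply above asserts for bind_rdev_to_array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for mdk_rdev_t and mddev_t. */
struct toy_rdev {
	unsigned int devnum;		/* plays the role of bdev->bd_dev */
	struct toy_rdev *next;
};

struct toy_array {
	struct toy_rdev *disks;		/* head of the member list */
};

/* Return the member with this device number, or NULL if absent. */
static struct toy_rdev *toy_find_rdev(struct toy_array *a, unsigned int devnum)
{
	struct toy_rdev *r;

	for (r = a->disks; r; r = r->next)
		if (r->devnum == devnum)
			return r;
	return NULL;
}

/*
 * Caller must hold exclusive access to 'a' (assumed, not enforced here).
 * Without that, another thread could insert the same device between the
 * lookup and the list update, and the check would be best-effort only.
 */
static int toy_bind_rdev(struct toy_array *a, struct toy_rdev *rdev)
{
	/* prevent duplicates */
	if (toy_find_rdev(a, rdev->devnum))
		return -EEXIST;

	rdev->next = a->disks;
	a->disks = rdev;
	return 0;
}

/* Small self-check: a second bind of the same device number is refused. */
static void toy_bind_self_check(void)
{
	struct toy_array a = { NULL };
	struct toy_rdev first = { 0x801, NULL };
	struct toy_rdev again = { 0x801, NULL };

	assert(toy_bind_rdev(&a, &first) == 0);
	assert(toy_bind_rdev(&a, &again) == -EEXIST);
}
```

The sketch deliberately leaves the lock out: the point is that the lookup and the insertion form one critical section, and whoever calls the bind function owns that section.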
* Re: [PATCH 008 of 9] md: md: raid5 rate limit error printk
2008-04-29 3:55 ` Andrew Morton
@ 2008-04-29 4:14 ` Neil Brown
0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2008-04-29 4:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Bernd Schubert, Dan Williams
On Monday April 28, akpm@linux-foundation.org wrote:
> On Tue, 29 Apr 2008 13:35:34 +1000 NeilBrown <neilb@suse.de> wrote:
>
> > + printk_rl(KERN_WARNING "raid5:%s: read error NOT corrected!! "
> > + "(sector %llu on %s).\n",
> > + mdname(conf->mddev),
> > + (unsigned long long)(sh->sector + rdev->data_offset),
> > + bdn);
> > else if (atomic_read(&rdev->read_errors)
> > > conf->max_nr_stripes)
> > printk(KERN_WARNING
> >
> > diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
> > --- .prev/include/linux/raid/md_k.h 2008-04-29 12:25:24.000000000 +1000
> > +++ ./include/linux/raid/md_k.h 2008-04-29 12:27:58.000000000 +1000
> > @@ -368,6 +368,9 @@ static inline void safe_put_page(struct
> > if (p) put_page(p);
> > }
> >
> > +#define printk_rl printk_ratelimit() ?: printk
>
> (boggle)
You don't like the "?:" operator?
Maybe
printk_ratelimit() && printk
?
>
> Isn't this backwards? Should be !printk_ratelimit()?
Arggg.. Did I do that? Bother.
>
> open-coding the printk_ratelimit() at each callsite would be more
> conventional.
True, but it can get noisy, adding an extra level of indent where it
isn't really needed (and the printk lines tend to be fairly long
already).
NeilBrown
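
To make the "backwards" point concrete, here is a small userspace sketch, assuming stand-in functions rather than the kernel ones. fake_ratelimit() and fake_printk() are hypothetical; fake_ratelimit() follows the printk_ratelimit() convention of returning nonzero when it is OK to print. The "a ?: b" form is a GNU C extension that evaluates b only when a is zero, so the macro as posted prints only when the limiter says to suppress. The two corrected forms are the ones suggested in this thread.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; these are not the kernel functions. */
static int allow;	/* answer the fake rate limiter will give */
static int printed;	/* number of messages actually emitted */

/* Like printk_ratelimit(): nonzero means "OK to print now". */
static int fake_ratelimit(void)
{
	return allow;
}

static int fake_printk(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vprintf(fmt, ap);
	va_end(ap);
	printed++;
	return n;
}

/* As posted: prints only when the limiter returns 0 (backwards). */
#define printk_rl_as_posted	fake_ratelimit() ?: fake_printk
/* Corrected, keeping the GNU "?:" form. */
#define printk_rl_negated	!fake_ratelimit() ?: fake_printk
/* Corrected, using the "&&" form, which is standard C. */
#define printk_rl_and		fake_ratelimit() && fake_printk

/* Each helper returns how many messages were emitted for one call. */
static int count_as_posted(int ok)
{
	allow = ok;
	printed = 0;
	printk_rl_as_posted("as posted: allow=%d\n", ok);
	return printed;
}

static int count_negated(int ok)
{
	allow = ok;
	printed = 0;
	printk_rl_negated("negated: allow=%d\n", ok);
	return printed;
}

static int count_and(int ok)
{
	allow = ok;
	printed = 0;
	printk_rl_and("and: allow=%d\n", ok);
	return printed;
}

/* Self-check of the behaviour described above. */
static void check_printk_rl_variants(void)
{
	assert(count_as_posted(1) == 0);	/* allowed, yet nothing printed */
	assert(count_as_posted(0) == 1);	/* suppressed, yet it printed */
	assert(count_negated(1) == 1 && count_negated(0) == 0);
	assert(count_and(1) == 1 && count_and(0) == 0);
}
```

All three macros expand to a single expression statement, so they still fit an unbraced if/else chain like the one in the raid5 hunk without adding a level of indent.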
end of thread, other threads:[~2008-04-29 4:14 UTC | newest]
Thread overview: 14+ messages
2008-04-29 3:34 [PATCH 000 of 9] md: Assorted patches for the 2.5.26 merge window NeilBrown
2008-04-29 3:34 ` [PATCH 001 of 9] md: Fix use after free when removing rdev via sysfs NeilBrown
2008-04-29 3:34 ` [PATCH 002 of 9] md: Skip all metadata update processing when using external metadata NeilBrown
2008-04-29 3:35 ` [PATCH 003 of 9] md: Reinitialise more mddev fields in do_md_stop NeilBrown
2008-04-29 3:35 ` [PATCH 004 of 9] md: Fix 'safemode' handling for external metadata NeilBrown
2008-04-29 3:35 ` [PATCH 005 of 9] md: Fix up switching md arrays between read-only and read-write NeilBrown
2008-04-29 3:35 ` [PATCH 006 of 9] md: Remove a stray command from a copy and paste error in resync_start_store NeilBrown
2008-04-29 3:35 ` [PATCH 007 of 9] md: prevent duplicates in bind_rdev_to_array NeilBrown
2008-04-29 3:51 ` Andrew Morton
2008-04-29 4:09 ` Neil Brown
2008-04-29 3:35 ` [PATCH 008 of 9] md: md: raid5 rate limit error printk NeilBrown
2008-04-29 3:55 ` Andrew Morton
2008-04-29 4:14 ` Neil Brown
2008-04-29 3:35 ` [PATCH 009 of 9] md: md: support blocking writes to an array on device failure NeilBrown