* [PATCH] proactive raid5 disk replacement for 2.6.11
@ 2005-08-14 20:10 Pallai Roland
2005-08-14 21:29 ` [PATCH] proactive raid5 disk replacement for 2.6.11 [fixed patch] Pallai Roland
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Pallai Roland @ 2005-08-14 20:10 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 3527 bytes --]
Hi,
this is a feature patch that implements 'proactive raid5 disk
replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
which could help a lot on large raid5 arrays built from cheap sata
drives, where the IO traffic is so heavy that a daily media scan of the
disks isn't possible.
linux software raid is very fragile by default; the typical (nervous)
breakdown situation is: I notice a bad block on a drive, replace it,
and the resync fails because another 2-3 disks have hidden bad blocks too.
I have to save the disks and rebuild the bad blocks with a userspace tool
(by hand..), and meanwhile the site is down for hours. bad, especially when
a pair of simple steps is enough to avoid this kind of problem:
1. don't kick a drive on a read error, because 99.99% of it is probably
still usable and will help (both to serve and to save data) if another
drive in the same array shows bad sectors
2. let a partially failed drive be mirrored to a spare _online_, and replace
the source of the mirror with the spare when the copy is done. bad blocks
aren't a problem unless the same sector is damaged on two disks, which is a
rare case. this way it is possible to fix an array with partially failed
drives without data loss and without downtime (a rough sketch of the
reconstruction this relies on follows below)
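(just to show why a mostly-good drive is still worth keeping: a block that
one disk failed to read can be recomputed by XOR-ing the corresponding
blocks of all the other members, data and parity; this is only a user-space
sketch, the function name and the 4k block size are made up, the real thing
is compute_block() in raid5.c)

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed stripe chunk, for illustration only */

/* recompute blocks[failed] by XOR-ing all the other members (data + parity) */
static void reconstruct_block(unsigned char blocks[][BLOCK_SIZE],
                              int ndisks, int failed)
{
    int d, i;

    memset(blocks[failed], 0, BLOCK_SIZE);
    for (d = 0; d < ndisks; d++) {
        if (d == failed)
            continue;
        for (i = 0; i < BLOCK_SIZE; i++)
            blocks[failed][i] ^= blocks[d][i];
    }
}

int main(void)
{
    unsigned char stripe[3][BLOCK_SIZE];
    int i;

    memset(stripe[0], 0x11, BLOCK_SIZE);        /* data block on disk 0 */
    memset(stripe[1], 0x22, BLOCK_SIZE);        /* data block on disk 1 */
    for (i = 0; i < BLOCK_SIZE; i++)            /* parity block on disk 2 */
        stripe[2][i] = stripe[0][i] ^ stripe[1][i];

    memset(stripe[1], 0, BLOCK_SIZE);           /* pretend disk 1 had a read error */
    reconstruct_block(stripe, 3, 1);
    printf("disk 1 block recovered: %s\n", stripe[1][0] == 0x22 ? "yes" : "no");
    return 0;
}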
I'm not a programmer, just a sysadmin who runs a large software sata
array, but my anger got bigger than my laziness, so I made this patch
over the weekend.. I don't understand every piece of the md code yet
(eg. the if-forest of handle_stripe :), so this patch may be a bug colony
and wrong by design, but I've tested it under heavy stress with both the
'faulty' module and real disks, and it works fine!
ideas, advice, bugfixes and enhancements are welcome!
(I know, raid6 could be another solution to this problem, but that adds a
lot of overhead.)
use:
1. patch the kernel, this one is against 2.6.11
2. type:
# make drives
mdadm -B -n1 -l faulty /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty /dev/md/3 /dev/rd/2
# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
# .. wait for sync ..
# grow bad blocks as ma*tor does
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2
# add a spare
mdadm -a /dev/md/0 /dev/rd/4
# -> fail a drive, sync begins <-
# md/1 will not be marked as failed, that is the point, but if you want to,
# you can issue this command again!
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# resync from md1 to spare ram4
# added spare for active resync
# .. note the read errors from md[12], yet the sync goes on!
# feel free to stress the md at this time, mkfs, dd, badblocks, etc
# kernel:
# raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id, which means the spare is a complete mirror;
# if you stop the array you can assemble it with ram4 instead of md1,
# the superblock is the same on both of them
# check the mirror (stop write stress if any)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4
# hot-replace the mirrored (partially failed) device with the active spare
# (yes, mark it as failed again: if there is a syncing or synced 'active spare',
# -f either really fails the device or replaces it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# replace md1 with in_sync active spare ram4
# and voila!
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1]
.. I hope someone can use it for something; for me it's a must-have feature ..
--
dap
[-- Attachment #2: 00_raid5-asp-dap1.diff --]
[-- Type: text/x-patch, Size: 22926 bytes --]
this is a feature patch that implements 'proactive raid5 disk replacement'
(http://www.arctic.org/~dean/raid-wishlist.html),
which could help a lot on large raid5 arrays built from cheap sata drives,
where the IO traffic is so heavy that a daily media scan of the disks isn't
possible.
linux software raid is very fragile by default; the typical (nervous) breakdown
situation is: I notice a bad block on a drive, replace it, and the resync
fails because another 2-3 disks have hidden bad blocks too. I have to save
the disks and rebuild the bad blocks with a userspace tool (by hand..), and
meanwhile the site is down for hours. bad, especially when a pair of simple
steps is enough to avoid this kind of problem:
1. don't kick a drive on a read error, because 99.99% of it is probably still
usable and will help (both to serve and to save data) if another drive in the
same array shows bad sectors
2. let a mirror be built to a spare _online_ and replace the source of the
mirror with the spare when it's done. bad blocks aren't a problem unless the
same sector is bad on two disks, which is a rare case. this way it is possible
to fix an array with partially failed drives without data loss and without downtime
I'm not a programmer, just a sysadmin who runs a large software sata array, but my
anger got bigger than my laziness, so I made this patch over the weekend.. I don't
understand every piece of the md code yet (eg. the if-forest of handle_stripe :),
so this patch may be a bug colony and wrong by design, but I've tested it
under heavy stress with both the 'faulty' module and real disks, and it works fine!
ideas, advice, bugfixes and enhancements are welcome!
(I know, raid6 could be another solution to this problem, but that adds a lot
of overhead too.)
use:
1. patch the kernel, this one is against 2.6.11
2. type:
# make drives
mdadm -B -n1 -l faulty /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty /dev/md/3 /dev/rd/2
# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
# .. wait for the sync ..
# grow bad blocks as ma*tor does
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2
# add a spare
mdadm -a /dev/md/0 /dev/rd/4
# -> fail a drive, sync begins <-
# md/1 will not be marked as failed, that is the point, but if you want to,
# you can issue this command again!
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# resync from md1 to spare ram4
# added spare for active resync
# .. note the read errors from md[12], yet the sync goes on!
# feel free to stress the md at this time, mkfs, dd, badblocks, etc
# kernel:
# raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id, which means the spare is a complete mirror;
# if you stop the array you can assemble it with ram4 instead of md1,
# the superblock is the same on both of them
# check the mirror (stop write stress)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4
# hot-replace the mirrored (partially failed) device with the active spare
# (yes, mark it as failed again: if there is a syncing or synced 'active spare',
# -f either really fails the device or replaces it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# replace md1 with in_sync active spare ram4
# and voila!
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1]
--- linux/include/linux/raid/raid5.h.orig 2005-03-03 23:51:29.000000000 +0100
+++ linux/include/linux/raid/raid5.h 2005-08-14 03:02:11.000000000 +0200
@@ -147,6 +147,7 @@
#define R5_UPTODATE 0 /* page contains current data */
#define R5_LOCKED 1 /* IO has been submitted on "req" */
#define R5_OVERWRITE 2 /* towrite covers whole page */
+#define R5_FAILED 8 /* failed to read this stripe */
/* and some that are internal to handle_stripe */
#define R5_Insync 3 /* rdev && rdev->in_sync at start */
#define R5_Wantread 4 /* want to schedule a read */
@@ -224,6 +225,8 @@
int inactive_blocked; /* release of inactive stripes blocked,
* waiting for 25% to be free
*/
+ int mirrorit; /* source for active spare resync */
+
spinlock_t device_lock;
struct disk_info disks[0];
};
--- linux/drivers/md/md.c.orig 2005-08-14 21:22:08.000000000 +0200
+++ linux/drivers/md/md.c 2005-08-14 17:20:15.000000000 +0200
@@ -3545,18 +3547,19 @@
/* no recovery is running.
* remove any failed drives, then
- * add spares if possible
+ * add spares if possible.
+ * Spare are also removed and re-added, to allow
+ * the personality to fail the re-add.
*/
- ITERATE_RDEV(mddev,rdev,rtmp) {
+ ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev->raid_disk >= 0 &&
- rdev->faulty &&
+ (rdev->faulty || ! rdev->in_sync) &&
atomic_read(&rdev->nr_pending)==0) {
+printk("md_check_recovery: hot_remove_disk\n");
if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
rdev->raid_disk = -1;
}
- if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
- spares++;
- }
+
if (mddev->degraded) {
ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev->raid_disk < 0
@@ -3764,4 +3771,5 @@
EXPORT_SYMBOL(md_wakeup_thread);
EXPORT_SYMBOL(md_print_devices);
EXPORT_SYMBOL(md_check_recovery);
+EXPORT_SYMBOL(kick_rdev_from_array); // fixme
MODULE_LICENSE("GPL");
--- linux/drivers/md/raid5.c.orig 2005-08-14 21:22:08.000000000 +0200
+++ linux/drivers/md/raid5.c 2005-08-14 20:49:49.000000000 +0200
@@ -53,7 +53,7 @@
/*
* The following can be used to debug the driver
*/
-#define RAID5_DEBUG 0
+#define RAID5_DEBUG 1
#define RAID5_PARANOIA 1
#if RAID5_PARANOIA && defined(CONFIG_SMP)
# define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
@@ -61,7 +61,7 @@
# define CHECK_DEVLOCK()
#endif
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
+#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x)))
#if RAID5_DEBUG
#define inline
#define __inline__
@@ -201,7 +201,7 @@
sh->pd_idx = pd_idx;
sh->state = 0;
- for (i=disks; i--; ) {
+ for (i=disks+1; i--; ) {
struct r5dev *dev = &sh->dev[i];
if (dev->toread || dev->towrite || dev->written ||
@@ -291,8 +291,10 @@
sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
+ /* +1: we need extra space in the *sh->devs for the 'active spare' to keep
+ handle_stripe() simple */
sc = kmem_cache_create(conf->cache_name,
- sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
+ sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev),
0, 0, NULL, NULL);
if (!sc)
return 1;
@@ -301,12 +303,12 @@
sh = kmem_cache_alloc(sc, GFP_KERNEL);
if (!sh)
return 1;
- memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
+ memset(sh, 0, sizeof(*sh) + (devs-1+1)*sizeof(struct r5dev));
sh->raid_conf = conf;
spin_lock_init(&sh->lock);
- if (grow_buffers(sh, conf->raid_disks)) {
- shrink_buffers(sh, conf->raid_disks);
+ if (grow_buffers(sh, conf->raid_disks+1)) {
+ shrink_buffers(sh, conf->raid_disks+1);
kmem_cache_free(sc, sh);
return 1;
}
@@ -391,18 +393,39 @@
}
#else
set_bit(R5_UPTODATE, &sh->dev[i].flags);
+ clear_bit(R5_FAILED, &sh->dev[i].flags);
#endif
} else {
- if (conf->working_disks < conf->raid_disks && !conf->mddev->curr_resync &&
- conf->disks[i].rdev->in_sync) {
- char b[BDEVNAME_SIZE];
- printk (KERN_ALERT
- "raid5_end_read_request: Disk failure on %s, but just failing this req.\n"
- ,bdevname(conf->disks[i].rdev->bdev, b));
- conf->disks[i].rdev->rerr = 1;
+ char b[BDEVNAME_SIZE];
+
+ /*
+ rule 1.,: try to keep all disk in_sync even if we've got read errors,
+ cause the 'active spare' maybe can rebuild a complete column from
+ partially failed drives
+ */
+ if (conf->disks[i].rdev->in_sync && conf->working_disks < conf->raid_disks) {
+ /* bad news, but keep it, cause md_error() would do a complete
+ array shutdown, even if 99.99% is useable */
+ printk(KERN_ALERT
+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in degraded mode\n"
+ ,bdevname(conf->disks[i].rdev->bdev, b),
+ (unsigned long long)sh->sector, atomic_read(&sh->count));
+ if (conf->mddev->curr_resync)
+ /* raid5_add_disk() will no accept the spare again,
+ and will not loop forever */
+ conf->mddev->degraded = 2;
+ } else if (conf->disks[i].rdev->in_sync && conf->working_disks >= conf->raid_disks) {
+ /* will be computed */
+ printk(KERN_ALERT
+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in optimal mode\n"
+ ,bdevname(conf->disks[i].rdev->bdev, b),
+ (unsigned long long)sh->sector, atomic_read(&sh->count));
+ /* conf->disks[i].rerr++ */
} else
+ /* practically it never happens */
md_error(conf->mddev, conf->disks[i].rdev);
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+ set_bit(R5_FAILED, &sh->dev[i].flags);
}
rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
#if 0
@@ -438,10 +461,11 @@
PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n",
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
+ /* sorry
if (i == disks) {
BUG();
return 0;
- }
+ }*/
spin_lock_irqsave(&conf->device_lock, flags);
if (!uptodate)
@@ -475,33 +499,142 @@
dev->req.bi_private = sh;
dev->flags = 0;
- if (i != sh->pd_idx)
+ if (i != sh->pd_idx && i < sh->raid_conf->raid_disks) /* active spare? */
dev->sector = compute_blocknr(sh, i);
}
+static int raid5_remove_disk(mddev_t *mddev, int number);
+static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev);
+/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev);
+//static void md_update_sb(mddev_t * mddev);
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
{
char b[BDEVNAME_SIZE];
+ char b2[BDEVNAME_SIZE];
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
PRINTK("raid5: error called\n");
if (!rdev->faulty) {
- mddev->sb_dirty = 1;
- if (rdev->in_sync) {
- conf->working_disks--;
- mddev->degraded++;
- conf->failed_disks++;
- rdev->in_sync = 0;
- /*
- * if recovery was running, make sure it aborts.
- */
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
- }
- rdev->faulty = 1;
- printk (KERN_ALERT
- "raid5: Disk failure on %s, disabling device."
- " Operation continuing on %d devices\n",
- bdevname(rdev->bdev,b), conf->working_disks);
+ int mddisks = 0;
+ mdk_rdev_t *rd;
+ mdk_rdev_t *rdevs = NULL;
+ struct list_head *rtmp;
+ int i;
+
+ ITERATE_RDEV(mddev,rd,rtmp)
+ {
+ printk(KERN_INFO "mddev%d: %s\n", mddisks, bdevname(rd->bdev,b));
+ mddisks++;
+ }
+ for (i = 0; (rd = conf->disks[i].rdev); i++) {
+ printk(KERN_INFO "r5dev%d: %s\n", i, bdevname(rd->bdev,b));
+ }
+ ITERATE_RDEV(mddev,rd,rtmp)
+ {
+ rdevs = rd;
+ break;
+ }
+printk("%d %d > %d %d ins:%d %p\n",
+ mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded, rdev->in_sync, rdevs);
+ if (conf->disks[conf->raid_disks].rdev == rdev && rdev->in_sync) {
+ /* in_sync, but must be handled specially, don't let 'degraded++' */
+ printk ("active spare failed %s (in_sync)\n",
+ bdevname(rdev->bdev,b));
+ mddev->sb_dirty = 1;
+ rdev->in_sync = 0;
+ rdev->faulty = 1;
+ rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */
+ conf->mirrorit = -1;
+ } else if (mddisks > conf->raid_disks && !mddev->degraded && rdev->in_sync) {
+ /* have active spare, array is optimal, removed disk member
+ of it (but not the active spare) */
+ if (rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) {
+ if (!conf->disks[conf->raid_disks].rdev->in_sync) {
+ printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n",
+ bdevname(rdev->bdev,b));
+ /* maybe shouldn't stop here, but we can't call this disk as
+ 'active spare' anymore, cause it's a simple rebuild from
+ a degraded array, fear of bad blocks! */
+ conf->mirrorit = -1;
+ goto letitgo;
+ } else {
+ int ret;
+
+ /* hot replace the synced drive with the 'active spare'
+ this is really "hot", I can't see clearly the things
+ what I have to do here. :}
+ pray. */
+
+ printk(KERN_ALERT "replace %s with in_sync active spare %s\n",
+ bdevname(rdev->bdev,b),
+ bdevname(rdevs->bdev,b2));
+ rdev->in_sync = 0;
+ rdev->faulty = 1;
+
+ conf->mirrorit = -1;
+
+ /* my God, am I sane? */
+ while ((i = atomic_read(&rdev->nr_pending))) {
+ printk("waiting for disk %d .. %d\n",
+ rdev->raid_disk, i);
+ }
+ ret = raid5_remove_disk(mddev, rdev->raid_disk);
+ if (ret) {
+ printk(KERN_WARNING "raid5_remove_disk1: busy?!\n");
+ return; // should it nothing to do
+ }
+ ret = raid5_add_disk(mddev, conf->disks[conf->raid_disks].rdev);
+ if (!ret) {
+ printk(KERN_WARNING "raid5_add_disk: no free slot?!\n");
+ return; // ..
+ }
+
+ while ((i = atomic_read(&conf->disks[conf->raid_disks].rdev->nr_pending))) {
+ printk("waiting for disk %d .. %d\n",
+ conf->raid_disks, i);
+ }
+ ret = raid5_remove_disk(mddev, conf->raid_disks);
+ if (ret) {
+ printk(KERN_WARNING "raid5_remove_disk2: busy?!\n");
+ return; // ..
+ }
+
+ conf->disks[rdev->raid_disk].rdev->in_sync = 1;
+
+ /* borrowed from hot_remove_disk() */
+ kick_rdev_from_array(rdev);
+ //md_update_sb(mddev);
+ }
+ } else {
+ /* in_sync disk failed (!degraded), trying to make a copy
+ to a spare {and we could call it 'active spare' from now:} */
+ printk(KERN_ALERT "resync from %s to spare %s (%d)\n",
+ bdevname(rdev->bdev,b),
+ bdevname(rdevs->bdev,b2),
+ conf->raid_disks);
+ conf->mirrorit = rdev->raid_disk;
+
+ mddev->degraded++; /* for call raid5_hot_add_disk(), reset there */
+ }
+ } else {
+letitgo:
+ mddev->sb_dirty = 1;
+ if (rdev->in_sync) {
+ conf->working_disks--;
+ mddev->degraded++;
+ conf->failed_disks++;
+ rdev->in_sync = 0;
+ /*
+ * if recovery was running, make sure it aborts.
+ */
+ set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ }
+ rdev->faulty = 1;
+ printk (KERN_ALERT
+ "raid5: Disk failure on %s, disabling device."
+ " Operation continuing on %d devices\n",
+ bdevname(rdev->bdev,b), conf->working_disks);
+ }
}
}
@@ -896,6 +1029,8 @@
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
int non_overwrite = 0;
int failed_num=0;
+ int aspare=0, asparenum=-1;
+ struct disk_info *asparedev;
struct r5dev *dev;
PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
@@ -907,9 +1042,16 @@
clear_bit(STRIPE_DELAYED, &sh->state);
syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ asparedev = &conf->disks[conf->raid_disks];
+ if (!conf->mddev->degraded && asparedev->rdev && !asparedev->rdev->faulty &&
+ conf->mirrorit != -1) {
+ aspare++;
+ asparenum = sh->raid_conf->mirrorit;
+ PRINTK("has aspare (%d)\n", asparenum);
+ }
/* Now to look around and see what can be done */
- for (i=disks; i--; ) {
+ for (i=disks+aspare; i--; ) {
mdk_rdev_t *rdev;
dev = &sh->dev[i];
clear_bit(R5_Insync, &dev->flags);
@@ -953,12 +1095,17 @@
}
if (dev->written) written++;
rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
- if (!rdev || !rdev->in_sync || rdev->rerr /*forced error*/) {
+ if (!rdev || !rdev->in_sync ||
+ (test_bit(R5_FAILED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags))) {
failed++;
failed_num = i;
+ PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite);
} else
set_bit(R5_Insync, &dev->flags);
}
+ if (aspare && failed > 1)
+ failed--; /* failed = 1 means "all ok" if we've aspare, this is simplest
+ method to do our work */
PRINTK("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d\n",
locked, uptodate, to_read, to_write, failed, failed_num);
@@ -1000,14 +1147,8 @@
bi = bi2;
}
- /* fail any reads if this device is non-operational or
- there's forced read error */
+ /* fail any reads if this device is non-operational */
if (!test_bit(R5_Insync, &sh->dev[i].flags)) {
- if (conf->disks[i].rdev && conf->disks[i].rdev->rerr) {
- conf->disks[i].rdev->rerr = 0;
- printk (KERN_NOTICE "raid5: forced error on %d (r%d w%d)\n",
- i, to_read, to_write);
- }
bi = sh->dev[i].toread;
sh->dev[i].toread = NULL;
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -1027,6 +1168,7 @@
spin_unlock_irq(&conf->device_lock);
}
if (failed > 1 && syncing) {
+ printk(KERN_ALERT "sync stopped by IO error\n");
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
syncing = 0;
@@ -1198,6 +1340,22 @@
PRINTK("Writing block %d\n", i);
locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
+ if (aspare && i == asparenum) {
+ char *ps, *pd;
+
+ /* mirroring this new block */
+ PRINTK("Writing to aspare too %d->%d\n",
+ i, conf->raid_disks);
+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+ printk("bazmeg, ez lokkolt1!!!\n");
+ }*/
+ ps = page_address(sh->dev[i].page);
+ pd = page_address(sh->dev[conf->raid_disks].page);
+ /* better idea? */
+ memcpy(pd, ps, STRIPE_SIZE);
+ set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
+ set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
+ }
if (!test_bit(R5_Insync, &sh->dev[i].flags)
|| (i==sh->pd_idx && failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
@@ -1234,14 +1392,30 @@
if (failed==0)
failed_num = sh->pd_idx;
/* should be able to compute the missing block and write it to spare */
+ if (aspare)
+ failed_num = asparenum;
if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
if (uptodate+1 != disks)
BUG();
compute_block(sh, failed_num);
uptodate++;
}
+ if (aspare) {
+ char *ps, *pd;
+
+ ps = page_address(sh->dev[failed_num].page);
+ pd = page_address(sh->dev[conf->raid_disks].page);
+ memcpy(pd, ps, STRIPE_SIZE);
+ PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
+ uptodate, ps, pd);
+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+ printk("bazmeg, ez lokkolt2!!!\n");
+ }*/
+ }
if (uptodate != disks)
BUG();
+ if (aspare)
+ failed_num = conf->raid_disks;
dev = &sh->dev[failed_num];
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
@@ -1265,7 +1439,7 @@
bi->bi_size = 0;
bi->bi_end_io(bi, bytes, 0);
}
- for (i=disks; i-- ;) {
+ for (i=disks+aspare; i-- ;) {
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
@@ -1507,6 +1681,15 @@
unplug_slaves(mddev);
return 0;
}
+ /* if there is 1 or more failed drives and we are trying
+ * to resync, then assert that we are finished, because there is
+ * nothing we can do.
+ */
+ if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+ int rv = (mddev->size << 1) - sector_nr;
+ md_done_sync(mddev, rv, 1);
+ return rv;
+ }
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1605,11 +1788,11 @@
}
mddev->private = kmalloc (sizeof (raid5_conf_t)
- + mddev->raid_disks * sizeof(struct disk_info),
+ + (mddev->raid_disks + 1) * sizeof(struct disk_info),
GFP_KERNEL);
if ((conf = mddev->private) == NULL)
goto abort;
- memset (conf, 0, sizeof (*conf) + mddev->raid_disks * sizeof(struct disk_info) );
+ memset (conf, 0, sizeof (*conf) + (mddev->raid_disks + 1) * sizeof(struct disk_info) );
conf->mddev = mddev;
if ((conf->stripe_hashtbl = (struct stripe_head **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL)
@@ -1639,7 +1822,6 @@
disk->rdev = rdev;
- rdev->rerr = 0; /* !!! */
if (rdev->in_sync) {
char b[BDEVNAME_SIZE];
printk(KERN_INFO "raid5: device %s operational as raid"
@@ -1650,6 +1832,7 @@
}
conf->raid_disks = mddev->raid_disks;
+ conf->mirrorit = -1;
/*
* 0 for a fully functional array, 1 for a degraded array.
*/
@@ -1699,7 +1882,7 @@
}
}
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
- conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+ (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
if (grow_stripes(conf, conf->max_nr_stripes)) {
printk(KERN_ERR
"raid5: couldn't allocate %dkB for buffers\n", memory);
@@ -1859,6 +2042,17 @@
tmp->rdev->in_sync = 1;
}
}
+ tmp = conf->disks + i;
+ if (tmp->rdev && !tmp->rdev->faulty && !tmp->rdev->in_sync) {
+ /* sync done to the 'active spare' */
+ tmp->rdev->in_sync = 1;
+
+ printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
+ i, tmp->rdev->raid_disk, conf->mirrorit);
+
+ /* scary..? :} */
+ tmp->rdev->raid_disk = conf->mirrorit;
+ }
print_raid5_conf(conf);
return 0;
}
@@ -1872,6 +2066,7 @@
print_raid5_conf(conf);
rdev = p->rdev;
+printk("raid5_remove_disk %d\n", number);
if (rdev) {
if (rdev->in_sync ||
atomic_read(&rdev->nr_pending)) {
@@ -1899,18 +2094,35 @@
int disk;
struct disk_info *p;
+ if (mddev->degraded > 1)
+ /* no point adding a device */
+ return 0;
+
/*
* find the disk ...
*/
for (disk=0; disk < mddev->raid_disks; disk++)
if ((p=conf->disks + disk)->rdev == NULL) {
rdev->in_sync = 0;
- rdev->rerr = 0; /* !!! */
rdev->raid_disk = disk;
found = 1;
p->rdev = rdev;
break;
}
+
+ if (!found) {
+ /* array optimal, this should be the 'active spare' */
+ conf->disks[disk].rdev = rdev;
+ rdev->in_sync = 0;
+ rdev->raid_disk = conf->raid_disks;
+
+ mddev->degraded--;
+ found++; /* call resync */
+
+ printk(KERN_INFO "added spare for active resync\n");
+ }
+ printk(KERN_INFO "raid5_add_disk: %d (%d)\n", disk, found);
+
print_raid5_conf(conf);
return found;
}
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11 [fixed patch]
2005-08-14 20:10 [PATCH] proactive raid5 disk replacement for 2.6.11 Pallai Roland
@ 2005-08-14 21:29 ` Pallai Roland
2005-08-15 6:45 ` [PATCH] proactive raid5 disk replacement for 2.6.11 Claas Hilbrecht
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Pallai Roland @ 2005-08-14 21:29 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 284 bytes --]
On Sun, 2005-08-14 at 22:10 +0200, Pallai Roland wrote:
> this is a feature patch that implements 'proactive raid5 disk
> replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
> [...]
sorry, the previous patch doesn't apply against a vanilla kernel; this one
does..
--
dap
[-- Attachment #2: 04_raid5-asp-dap2.diff --]
[-- Type: text/x-patch, Size: 21744 bytes --]
this is a feature patch that implements 'proactive raid5 disk
replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
which could help a lot on large raid5 arrays built from cheap sata
drives, where the IO traffic is so heavy that a daily media scan of the
disks isn't possible.
linux software raid is very fragile by default; the typical (nervous)
breakdown situation is: I notice a bad block on a drive, replace it,
and the resync fails because another 2-3 disks have hidden bad blocks too.
I have to save the disks and rebuild the bad blocks with a userspace tool (by
hand..), and meanwhile the site is down for hours. bad, especially when a
pair of simple steps is enough to avoid this kind of problem:
1. don't kick a drive on a read error, because 99.99% of it is probably
still usable and will help (both to serve and to save data) if another
drive in the same array shows bad sectors
2. let a partially failed drive be mirrored to a spare _online_, and replace
the source of the mirror with the spare when the copy is done. bad blocks
aren't a problem unless the same sector is damaged on two disks, which is a
rare case. this way it is possible to fix an array with partially failed
drives without data loss and without downtime
I'm not a programmer, just a sysadmin who runs a large software sata
array, but my anger got bigger than my laziness, so I made this patch
over the weekend.. I don't understand every piece of the md code yet (eg. the
if-forest of handle_stripe :), so this patch may be a bug colony
and wrong by design, but I've tested it under heavy stress with both the
'faulty' module and real disks, and it works fine!
ideas, advice, bugfixes and enhancements are welcome!
(I know, raid6 could be another solution to this problem, but that adds a
lot of overhead.)
use:
1. patch the kernel, this one is against 2.6.11
2. type:
# make drives
mdadm -B -n1 -l faulty /dev/md/1 /dev/rd/0
mdadm -B -n1 -l faulty /dev/md/2 /dev/rd/1
mdadm -B -n1 -l faulty /dev/md/3 /dev/rd/2
# make the array
mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
# .. wait for sync ..
# grow bad blocks as ma*tor does
mdadm --grow -l faulty -p rp454 /dev/md/1
mdadm --grow -l faulty -p rp738 /dev/md/2
# add a spare
mdadm -a /dev/md/0 /dev/rd/4
# -> fail a drive, sync begins <-
# md/1 will not be marked as failed, that is the point, but if you want to,
# you can issue this command again!
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# resync from md1 to spare ram4
# added spare for active resync
# .. note the read errors from md[12], yet the sync goes on!
# feel free to stress the md at this time, mkfs, dd, badblocks, etc
# kernel:
# raid5_spare_active: 3 in_sync 3->0
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
# -> ram4 and md1 have the same id, which means the spare is a complete mirror;
# if you stop the array you can assemble it with ram4 instead of md1,
# the superblock is the same on both of them
# check the mirror (stop write stress if any)
mdadm --grow -l faulty -p none /dev/md/1
cmp /dev/md/1 /dev/rd/4
# hot-replace the mirrored (partially failed) device with the active spare
# (yes, mark it as failed again: if there is a syncing or synced 'active spare',
# -f either really fails the device or replaces it with the synced spare)
mdadm -f /dev/md/0 /dev/md/1
# kernel:
# replace md1 with in_sync active spare ram4
# and voila!
# /proc/mdstat:
# md0 : active raid5 ram4[0] md3[2] md2[1]
.. I hope someone can use it for something; for me it's a must-have feature ..
--- linux/include/linux/raid/raid5.h.orig 2005-03-03 23:51:29.000000000 +0100
+++ linux/include/linux/raid/raid5.h 2005-08-14 03:02:11.000000000 +0200
@@ -147,6 +147,7 @@
#define R5_UPTODATE 0 /* page contains current data */
#define R5_LOCKED 1 /* IO has been submitted on "req" */
#define R5_OVERWRITE 2 /* towrite covers whole page */
+#define R5_FAILED 8 /* failed to read this stripe */
/* and some that are internal to handle_stripe */
#define R5_Insync 3 /* rdev && rdev->in_sync at start */
#define R5_Wantread 4 /* want to schedule a read */
@@ -224,6 +225,8 @@
int inactive_blocked; /* release of inactive stripes blocked,
* waiting for 25% to be free
*/
+ int mirrorit; /* source for active spare resync */
+
spinlock_t device_lock;
struct disk_info disks[0];
};
--- linux/drivers/md/md.c.orig 2005-08-14 21:22:08.000000000 +0200
+++ linux/drivers/md/md.c 2005-08-14 17:20:15.000000000 +0200
@@ -3545,18 +3547,19 @@
/* no recovery is running.
* remove any failed drives, then
- * add spares if possible
+ * add spares if possible.
+ * Spare are also removed and re-added, to allow
+ * the personality to fail the re-add.
*/
- ITERATE_RDEV(mddev,rdev,rtmp) {
+ ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev->raid_disk >= 0 &&
- rdev->faulty &&
+ (rdev->faulty || ! rdev->in_sync) &&
atomic_read(&rdev->nr_pending)==0) {
+printk("md_check_recovery: hot_remove_disk\n");
if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
rdev->raid_disk = -1;
}
- if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
- spares++;
- }
+
if (mddev->degraded) {
ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev->raid_disk < 0
@@ -3764,4 +3771,5 @@
EXPORT_SYMBOL(md_wakeup_thread);
EXPORT_SYMBOL(md_print_devices);
EXPORT_SYMBOL(md_check_recovery);
+EXPORT_SYMBOL(kick_rdev_from_array); // fixme
MODULE_LICENSE("GPL");
--- linux/drivers/md/raid5.c.orig 2005-08-14 21:22:08.000000000 +0200
+++ linux/drivers/md/raid5.c 2005-08-14 20:49:49.000000000 +0200
@@ -53,7 +53,7 @@
/*
* The following can be used to debug the driver
*/
-#define RAID5_DEBUG 0
+#define RAID5_DEBUG 1
#define RAID5_PARANOIA 1
#if RAID5_PARANOIA && defined(CONFIG_SMP)
# define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
@@ -61,7 +61,7 @@
# define CHECK_DEVLOCK()
#endif
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
+#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x)))
#if RAID5_DEBUG
#define inline
#define __inline__
@@ -201,7 +201,7 @@
sh->pd_idx = pd_idx;
sh->state = 0;
- for (i=disks; i--; ) {
+ for (i=disks+1; i--; ) {
struct r5dev *dev = &sh->dev[i];
if (dev->toread || dev->towrite || dev->written ||
@@ -291,8 +291,10 @@
sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
+ /* +1: we need extra space in the *sh->devs for the 'active spare' to keep
+ handle_stripe() simple */
sc = kmem_cache_create(conf->cache_name,
- sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
+ sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev),
0, 0, NULL, NULL);
if (!sc)
return 1;
@@ -301,12 +303,12 @@
sh = kmem_cache_alloc(sc, GFP_KERNEL);
if (!sh)
return 1;
- memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
+ memset(sh, 0, sizeof(*sh) + (devs-1+1)*sizeof(struct r5dev));
sh->raid_conf = conf;
spin_lock_init(&sh->lock);
- if (grow_buffers(sh, conf->raid_disks)) {
- shrink_buffers(sh, conf->raid_disks);
+ if (grow_buffers(sh, conf->raid_disks+1)) {
+ shrink_buffers(sh, conf->raid_disks+1);
kmem_cache_free(sc, sh);
return 1;
}
@@ -391,10 +393,39 @@
}
#else
set_bit(R5_UPTODATE, &sh->dev[i].flags);
+ clear_bit(R5_FAILED, &sh->dev[i].flags);
#endif
} else {
+ char b[BDEVNAME_SIZE];
+
+ /*
+ rule 1.,: try to keep all disk in_sync even if we've got read errors,
+ cause the 'active spare' maybe can rebuild a complete column from
+ partially failed drives
+ */
+ if (conf->disks[i].rdev->in_sync && conf->working_disks < conf->raid_disks) {
+ /* bad news, but keep it, cause md_error() would do a complete
+ array shutdown, even if 99.99% is useable */
+ printk(KERN_ALERT
+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in degraded mode\n"
+ ,bdevname(conf->disks[i].rdev->bdev, b),
+ (unsigned long long)sh->sector, atomic_read(&sh->count));
+ if (conf->mddev->curr_resync)
+ /* raid5_add_disk() will no accept the spare again,
+ and will not loop forever */
+ conf->mddev->degraded = 2;
+ } else if (conf->disks[i].rdev->in_sync && conf->working_disks >= conf->raid_disks) {
+ /* will be computed */
+ printk(KERN_ALERT
+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in optimal mode\n"
+ ,bdevname(conf->disks[i].rdev->bdev, b),
+ (unsigned long long)sh->sector, atomic_read(&sh->count));
+ /* conf->disks[i].rerr++ */
+ } else
+ /* practically it never happens */
md_error(conf->mddev, conf->disks[i].rdev);
- clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+ clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+ set_bit(R5_FAILED, &sh->dev[i].flags);
}
rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
#if 0
@@ -430,10 +461,11 @@
PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n",
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
+ /* sorry
if (i == disks) {
BUG();
return 0;
- }
+ }*/
spin_lock_irqsave(&conf->device_lock, flags);
if (!uptodate)
@@ -467,33 +499,142 @@
dev->req.bi_private = sh;
dev->flags = 0;
- if (i != sh->pd_idx)
+ if (i != sh->pd_idx && i < sh->raid_conf->raid_disks) /* active spare? */
dev->sector = compute_blocknr(sh, i);
}
+static int raid5_remove_disk(mddev_t *mddev, int number);
+static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev);
+/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev);
+//static void md_update_sb(mddev_t * mddev);
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
{
char b[BDEVNAME_SIZE];
+ char b2[BDEVNAME_SIZE];
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
PRINTK("raid5: error called\n");
if (!rdev->faulty) {
- mddev->sb_dirty = 1;
- if (rdev->in_sync) {
- conf->working_disks--;
- mddev->degraded++;
- conf->failed_disks++;
- rdev->in_sync = 0;
- /*
- * if recovery was running, make sure it aborts.
- */
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
- }
- rdev->faulty = 1;
- printk (KERN_ALERT
- "raid5: Disk failure on %s, disabling device."
- " Operation continuing on %d devices\n",
- bdevname(rdev->bdev,b), conf->working_disks);
+ int mddisks = 0;
+ mdk_rdev_t *rd;
+ mdk_rdev_t *rdevs = NULL;
+ struct list_head *rtmp;
+ int i;
+
+ ITERATE_RDEV(mddev,rd,rtmp)
+ {
+ printk(KERN_INFO "mddev%d: %s\n", mddisks, bdevname(rd->bdev,b));
+ mddisks++;
+ }
+ for (i = 0; (rd = conf->disks[i].rdev); i++) {
+ printk(KERN_INFO "r5dev%d: %s\n", i, bdevname(rd->bdev,b));
+ }
+ ITERATE_RDEV(mddev,rd,rtmp)
+ {
+ rdevs = rd;
+ break;
+ }
+printk("%d %d > %d %d ins:%d %p\n",
+ mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded, rdev->in_sync, rdevs);
+ if (conf->disks[conf->raid_disks].rdev == rdev && rdev->in_sync) {
+ /* in_sync, but must be handled specially, don't let 'degraded++' */
+ printk ("active spare failed %s (in_sync)\n",
+ bdevname(rdev->bdev,b));
+ mddev->sb_dirty = 1;
+ rdev->in_sync = 0;
+ rdev->faulty = 1;
+ rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */
+ conf->mirrorit = -1;
+ } else if (mddisks > conf->raid_disks && !mddev->degraded && rdev->in_sync) {
+ /* have active spare, array is optimal, removed disk member
+ of it (but not the active spare) */
+ if (rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) {
+ if (!conf->disks[conf->raid_disks].rdev->in_sync) {
+ printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n",
+ bdevname(rdev->bdev,b));
+ /* maybe shouldn't stop here, but we can't call this disk as
+ 'active spare' anymore, cause it's a simple rebuild from
+ a degraded array, fear of bad blocks! */
+ conf->mirrorit = -1;
+ goto letitgo;
+ } else {
+ int ret;
+
+ /* hot replace the synced drive with the 'active spare'
+ this is really "hot", I can't see clearly the things
+ what I have to do here. :}
+ pray. */
+
+ printk(KERN_ALERT "replace %s with in_sync active spare %s\n",
+ bdevname(rdev->bdev,b),
+ bdevname(rdevs->bdev,b2));
+ rdev->in_sync = 0;
+ rdev->faulty = 1;
+
+ conf->mirrorit = -1;
+
+ /* my God, am I sane? */
+ while ((i = atomic_read(&rdev->nr_pending))) {
+ printk("waiting for disk %d .. %d\n",
+ rdev->raid_disk, i);
+ }
+ ret = raid5_remove_disk(mddev, rdev->raid_disk);
+ if (ret) {
+ printk(KERN_WARNING "raid5_remove_disk1: busy?!\n");
+ return; // should it nothing to do
+ }
+ ret = raid5_add_disk(mddev, conf->disks[conf->raid_disks].rdev);
+ if (!ret) {
+ printk(KERN_WARNING "raid5_add_disk: no free slot?!\n");
+ return; // ..
+ }
+
+ while ((i = atomic_read(&conf->disks[conf->raid_disks].rdev->nr_pending))) {
+ printk("waiting for disk %d .. %d\n",
+ conf->raid_disks, i);
+ }
+ ret = raid5_remove_disk(mddev, conf->raid_disks);
+ if (ret) {
+ printk(KERN_WARNING "raid5_remove_disk2: busy?!\n");
+ return; // ..
+ }
+
+ conf->disks[rdev->raid_disk].rdev->in_sync = 1;
+
+ /* borrowed from hot_remove_disk() */
+ kick_rdev_from_array(rdev);
+ //md_update_sb(mddev);
+ }
+ } else {
+ /* in_sync disk failed (!degraded), trying to make a copy
+ to a spare {and we could call it 'active spare' from now:} */
+ printk(KERN_ALERT "resync from %s to spare %s (%d)\n",
+ bdevname(rdev->bdev,b),
+ bdevname(rdevs->bdev,b2),
+ conf->raid_disks);
+ conf->mirrorit = rdev->raid_disk;
+
+ mddev->degraded++; /* for call raid5_hot_add_disk(), reset there */
+ }
+ } else {
+letitgo:
+ mddev->sb_dirty = 1;
+ if (rdev->in_sync) {
+ conf->working_disks--;
+ mddev->degraded++;
+ conf->failed_disks++;
+ rdev->in_sync = 0;
+ /*
+ * if recovery was running, make sure it aborts.
+ */
+ set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ }
+ rdev->faulty = 1;
+ printk (KERN_ALERT
+ "raid5: Disk failure on %s, disabling device."
+ " Operation continuing on %d devices\n",
+ bdevname(rdev->bdev,b), conf->working_disks);
+ }
}
}
@@ -888,6 +1029,8 @@
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
int non_overwrite = 0;
int failed_num=0;
+ int aspare=0, asparenum=-1;
+ struct disk_info *asparedev;
struct r5dev *dev;
PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
@@ -899,9 +1042,16 @@
clear_bit(STRIPE_DELAYED, &sh->state);
syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ asparedev = &conf->disks[conf->raid_disks];
+ if (!conf->mddev->degraded && asparedev->rdev && !asparedev->rdev->faulty &&
+ conf->mirrorit != -1) {
+ aspare++;
+ asparenum = sh->raid_conf->mirrorit;
+ PRINTK("has aspare (%d)\n", asparenum);
+ }
/* Now to look around and see what can be done */
- for (i=disks; i--; ) {
+ for (i=disks+aspare; i--; ) {
mdk_rdev_t *rdev;
dev = &sh->dev[i];
clear_bit(R5_Insync, &dev->flags);
@@ -945,12 +1095,17 @@
}
if (dev->written) written++;
rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
- if (!rdev || !rdev->in_sync) {
+ if (!rdev || !rdev->in_sync ||
+ (test_bit(R5_FAILED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags))) {
failed++;
failed_num = i;
+ PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite);
} else
set_bit(R5_Insync, &dev->flags);
}
+ if (aspare && failed > 1)
+ failed--; /* failed = 1 means "all ok" if we've aspare, this is simplest
+ method to do our work */
PRINTK("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d\n",
locked, uptodate, to_read, to_write, failed, failed_num);
@@ -1013,6 +1168,7 @@
spin_unlock_irq(&conf->device_lock);
}
if (failed > 1 && syncing) {
+ printk(KERN_ALERT "sync stopped by IO error\n");
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
syncing = 0;
@@ -1184,6 +1340,22 @@
PRINTK("Writing block %d\n", i);
locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
+ if (aspare && i == asparenum) {
+ char *ps, *pd;
+
+ /* mirroring this new block */
+ PRINTK("Writing to aspare too %d->%d\n",
+ i, conf->raid_disks);
+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+ printk("bazmeg, ez lokkolt1!!!\n");
+ }*/
+ ps = page_address(sh->dev[i].page);
+ pd = page_address(sh->dev[conf->raid_disks].page);
+ /* better idea? */
+ memcpy(pd, ps, STRIPE_SIZE);
+ set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
+ set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
+ }
if (!test_bit(R5_Insync, &sh->dev[i].flags)
|| (i==sh->pd_idx && failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
@@ -1220,14 +1392,30 @@
if (failed==0)
failed_num = sh->pd_idx;
/* should be able to compute the missing block and write it to spare */
+ if (aspare)
+ failed_num = asparenum;
if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
if (uptodate+1 != disks)
BUG();
compute_block(sh, failed_num);
uptodate++;
}
+ if (aspare) {
+ char *ps, *pd;
+
+ ps = page_address(sh->dev[failed_num].page);
+ pd = page_address(sh->dev[conf->raid_disks].page);
+ memcpy(pd, ps, STRIPE_SIZE);
+ PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
+ uptodate, ps, pd);
+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
+ printk("bazmeg, ez lokkolt2!!!\n");
+ }*/
+ }
if (uptodate != disks)
BUG();
+ if (aspare)
+ failed_num = conf->raid_disks;
dev = &sh->dev[failed_num];
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
@@ -1251,7 +1439,7 @@
bi->bi_size = 0;
bi->bi_end_io(bi, bytes, 0);
}
- for (i=disks; i-- ;) {
+ for (i=disks+aspare; i-- ;) {
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
@@ -1493,6 +1681,15 @@
unplug_slaves(mddev);
return 0;
}
+ /* if there is 1 or more failed drives and we are trying
+ * to resync, then assert that we are finished, because there is
+ * nothing we can do.
+ */
+ if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+ int rv = (mddev->size << 1) - sector_nr;
+ md_done_sync(mddev, rv, 1);
+ return rv;
+ }
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1591,11 +1788,11 @@
}
mddev->private = kmalloc (sizeof (raid5_conf_t)
- + mddev->raid_disks * sizeof(struct disk_info),
+ + (mddev->raid_disks + 1) * sizeof(struct disk_info),
GFP_KERNEL);
if ((conf = mddev->private) == NULL)
goto abort;
- memset (conf, 0, sizeof (*conf) + mddev->raid_disks * sizeof(struct disk_info) );
+ memset (conf, 0, sizeof (*conf) + (mddev->raid_disks + 1) * sizeof(struct disk_info) );
conf->mddev = mddev;
if ((conf->stripe_hashtbl = (struct stripe_head **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL)
@@ -1635,6 +1832,7 @@
}
conf->raid_disks = mddev->raid_disks;
+ conf->mirrorit = -1;
/*
* 0 for a fully functional array, 1 for a degraded array.
*/
@@ -1684,7 +1882,7 @@
}
}
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
- conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+ (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
if (grow_stripes(conf, conf->max_nr_stripes)) {
printk(KERN_ERR
"raid5: couldn't allocate %dkB for buffers\n", memory);
@@ -1844,6 +2042,17 @@
tmp->rdev->in_sync = 1;
}
}
+ tmp = conf->disks + i;
+ if (tmp->rdev && !tmp->rdev->faulty && !tmp->rdev->in_sync) {
+ /* sync done to the 'active spare' */
+ tmp->rdev->in_sync = 1;
+
+ printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
+ i, tmp->rdev->raid_disk, conf->mirrorit);
+
+ /* scary..? :} */
+ tmp->rdev->raid_disk = conf->mirrorit;
+ }
print_raid5_conf(conf);
return 0;
}
@@ -1857,6 +2066,7 @@
print_raid5_conf(conf);
rdev = p->rdev;
+printk("raid5_remove_disk %d\n", number);
if (rdev) {
if (rdev->in_sync ||
atomic_read(&rdev->nr_pending)) {
@@ -1884,6 +2094,10 @@
int disk;
struct disk_info *p;
+ if (mddev->degraded > 1)
+ /* no point adding a device */
+ return 0;
+
/*
* find the disk ...
*/
@@ -1895,6 +2109,20 @@
p->rdev = rdev;
break;
}
+
+ if (!found) {
+ /* array optimal, this should be the 'active spare' */
+ conf->disks[disk].rdev = rdev;
+ rdev->in_sync = 0;
+ rdev->raid_disk = conf->raid_disks;
+
+ mddev->degraded--;
+ found++; /* call resync */
+
+ printk(KERN_INFO "added spare for active resync\n");
+ }
+ printk(KERN_INFO "raid5_add_disk: %d (%d)\n", disk, found);
+
print_raid5_conf(conf);
return found;
}
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-14 20:10 [PATCH] proactive raid5 disk replacement for 2.6.11 Pallai Roland
2005-08-14 21:29 ` [PATCH] proactive raid5 disk replacement for 2.6.11 [fixed patch] Pallai Roland
@ 2005-08-15 6:45 ` Claas Hilbrecht
2005-08-15 11:29 ` Mario 'BitKoenig' Holbe
[not found] ` <C0A1E607B5206F88D89CAF42@192.168.1.22>
3 siblings, 0 replies; 9+ messages in thread
From: Claas Hilbrecht @ 2005-08-15 6:45 UTC (permalink / raw)
To: Pallai Roland, linux-raid
--On Sunday, 14 August 2005 22:10 +0200 Pallai Roland <dap@mail.index.hu>
wrote:
> this is a feature patch that implements 'proactive raid5 disk
> replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
After my experience with a broken raid5 (read the list) I think the
"partially failed disks" feature you describe is really useful. I agree
with you that this kind of error is rather common.
--
Claas Hilbrecht
http://www.jucs-kramkiste.de
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-14 20:10 [PATCH] proactive raid5 disk replacement for 2.6.11 Pallai Roland
2005-08-14 21:29 ` [PATCH] proactive raid5 disk replacement for 2.6.11 [fixed patch] Pallai Roland
2005-08-15 6:45 ` [PATCH] proactive raid5 disk replacement for 2.6.11 Claas Hilbrecht
@ 2005-08-15 11:29 ` Mario 'BitKoenig' Holbe
2005-08-15 13:50 ` Pallai Roland
[not found] ` <C0A1E607B5206F88D89CAF42@192.168.1.22>
3 siblings, 1 reply; 9+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2005-08-15 11:29 UTC (permalink / raw)
To: linux-raid
Hi,
Pallai Roland <dap@mail.index.hu> wrote:
> this is a feature patch that implements 'proactive raid5 disk
> replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
> that could help a lot on large raid5 arrays built from cheap sata
...
> linux software raid is very fragile by default, the typical (nervous)
I've only had a quick look over your patch, so please forgive me if I could
have found the answer in the code.
What I'm wondering about is how your patch makes the whole system
behave in case of more harmful errors.
The read errors you are talking about are quite harmless with regard to
subsequent access to the device. Unfortunately there *are* errors (even
read errors, too), especially when you are talking about cheap IDE (ATA,
SATA) equipment, where subsequent access to the device results in
infinite (bus-)lockups. I think this is the reason why software RAID
never ever touches a failing drive again. If you change this
behaviour in general, you risk lock-ups of the raid device just because
one of its drives got locked up.
What I did not find in your patch is any differentiation between
harmless and harmful error conditions. I'm not even sure whether that is
possible at all.
regards
Mario
--
To be happy with a man, you must understand him very well
and love him a little.
To be happy with a woman, you must love her very much
and not even try to understand her.
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-15 11:29 ` Mario 'BitKoenig' Holbe
@ 2005-08-15 13:50 ` Pallai Roland
0 siblings, 0 replies; 9+ messages in thread
From: Pallai Roland @ 2005-08-15 13:50 UTC (permalink / raw)
To: linux-raid
On Mon, 2005-08-15 at 13:29 +0200, Mario 'BitKoenig' Holbe wrote:
> Pallai Roland <dap@mail.index.hu> wrote:
> > this is a feature patch that implements 'proactive raid5 disk
> > replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
> > that could help a lot on large raid5 arrays built from cheap sata
> ...
> > linux software raid is very fragile by default, the typical (nervous)
>
> What I'm wondering about is how does your patch make the whole system
> behave in case of more harmful errors?
> The read errors you are talking about are quite harmless regarding
> subsequent access to the device. Unfortunately there *are* errors (even
> read errors, too), especially when you are talking about cheap IDE (ATA,
> SATA) equipment, where subsequent access to the device results in
> infinite (bus-)lockups. I think, this is the reason why Software-RAID
> does never ever touch a failing drive again. If you are changing this
> behaviour in general, you risk lock-ups of the raid-device just because
> one of the drives got locked up.
yes, I understand your point, but I think the low-level ATA driver must
be fixed if it lets a drive lock up. as far as I know, the SCSI layer sends
an abort/reset to the device driver if a request isn't served within a
timeout value ("hey, give me some kind of result, now!"); that's good
behaviour, and only a really braindead driver ignores that alarm..
as I've seen in practice, modern sata drivers don't let a drive lock
up; the others should be taught about "timeout"
unfortunately, bad blocks are often served slowly by damaged disks,
and since the array keeps trying to access them periodically, this could
slow down the whole array. I've thought about it, and a good starting
point would be a table meaning 'this disk is bad for this stripe': an
entry is inserted after a read error and deleted once the stripe is
rewritten to the disk. it would also reduce the error lines about bad
sectors in dmesg
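something like this is what I mean, as a user-space sketch only: the struct
name, the fixed size and the linear scan are just my assumptions, in the
kernel it would have to live in raid5_conf_t under device_lock:

#include <stdio.h>

#define BAD_TABLE_SIZE 1024              /* arbitrary */

struct bad_stripe {
    unsigned long long sector;           /* stripe sector */
    int disk;                            /* disk that failed to read it */
    int used;
};

static struct bad_stripe bad_table[BAD_TABLE_SIZE];

/* remember that 'disk' could not read this stripe */
static void bad_stripe_insert(unsigned long long sector, int disk)
{
    int i;
    for (i = 0; i < BAD_TABLE_SIZE; i++)
        if (!bad_table[i].used) {
            bad_table[i].sector = sector;
            bad_table[i].disk = disk;
            bad_table[i].used = 1;
            return;
        }
    /* table full: just fall back to issuing the read anyway */
}

/* should we skip reading this stripe from this disk? */
static int bad_stripe_lookup(unsigned long long sector, int disk)
{
    int i;
    for (i = 0; i < BAD_TABLE_SIZE; i++)
        if (bad_table[i].used &&
            bad_table[i].sector == sector && bad_table[i].disk == disk)
            return 1;
    return 0;
}

/* the stripe was rewritten to the disk, so the block is remapped/fixed */
static void bad_stripe_delete(unsigned long long sector, int disk)
{
    int i;
    for (i = 0; i < BAD_TABLE_SIZE; i++)
        if (bad_table[i].used &&
            bad_table[i].sector == sector && bad_table[i].disk == disk)
            bad_table[i].used = 0;
}

int main(void)
{
    bad_stripe_insert(454, 1);
    printf("skip disk 1 at sector 454? %d\n", bad_stripe_lookup(454, 1));
    bad_stripe_delete(454, 1);
    printf("after rewrite: %d\n", bad_stripe_lookup(454, 1));
    return 0;
}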
> What I did not find in your patch is some differentiation between the
> harmless and harmful error conditions. I'm not even sure, if this is
> possible at all.
currently it doesn't tolerate write errors: if a write fails, the drive
gets kicked immediately, so a fully failed disk will not be accessed
forever.. anyway, it's really hard to determine what counts as a harmful
error (at this layer we've only got a single bit for that :), maybe we
should compute a success/fail ratio (%) over some time window, or scan the
'this disk is bad for this stripe' table and disable the disk if the count
of bad blocks goes over a user-defined threshold
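the policy check could be as dumb as this (the names, the percentage form
and the numbers are only assumptions for illustration):

#include <stdio.h>

/* per-disk counters the raid5 code would have to maintain */
struct disk_health {
    unsigned long reads_ok;
    unsigned long reads_failed;   /* or: entries in the bad-stripe table */
};

/* kick the disk if its failure ratio or its bad-block count is too high;
 * max_fail_pct and max_bad_blocks would be user-defined knobs */
static int should_kick_disk(const struct disk_health *h,
                            double max_fail_pct, unsigned long max_bad_blocks)
{
    unsigned long total = h->reads_ok + h->reads_failed;

    if (h->reads_failed > max_bad_blocks)
        return 1;
    if (total && 100.0 * h->reads_failed / total > max_fail_pct)
        return 1;
    return 0;
}

int main(void)
{
    struct disk_health a = { 100000, 12 };    /* a few bad sectors: keep it */
    struct disk_health b = { 100000, 900 };   /* clearly dying: kick it */

    printf("disk a: %s\n", should_kick_disk(&a, 0.5, 128) ? "kick" : "keep");
    printf("disk b: %s\n", should_kick_disk(&b, 0.5, 128) ? "kick" : "keep");
    return 0;
}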
summary (todo..!):
- I think we shouldn't care about drive lockups
- a 'this disk is bad for this stripe' table would be good to speed up
arrays with partially failed drives, and it's easy to implement
- add a switch to enable the 'partially failed' feature on a per-array
basis once the patch is applied (eg. to remain compatible with buggy,
forever-locking device drivers)
well?
--
dap
[parent not found: <C0A1E607B5206F88D89CAF42@192.168.1.22>]
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
[not found] ` <C0A1E607B5206F88D89CAF42@192.168.1.22>
@ 2005-08-22 10:47 ` Molle Bestefich
2005-08-22 11:56 ` Pallai Roland
0 siblings, 1 reply; 9+ messages in thread
From: Molle Bestefich @ 2005-08-22 10:47 UTC (permalink / raw)
To: Claas Hilbrecht, Pallai Roland; +Cc: linux-raid
Claas Hilbrecht wrote:
> Pallai Roland wrote:
> > this is a feature patch that implements 'proactive raid5 disk
> > replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
>
> After my experience with a broken raid5 (read the list) I think the
> "partially failed disks" feature you describe is really useful. I agree
> with you that this kind of error is rather common.
Horrible idea.
Once you have a bad block on one disk, you have definitely lost your
data redundancy.
That's bad.
What should be done about bad blocks instead of your suggestion is to
try and write the data back to the bad block before kicking the disk.
If this succeeds, and the data can then be read from the failed block,
the disk has automatically reassigned the sector to the spare sector
area. You have redundancy again and the bad sector is "fixed".
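A user-space sketch of what I mean, just to spell it out (the helper name,
the 4k block size and the temp-file demo are assumptions; in md the
known-good data would come from parity reconstruction):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* write known-good data over the unreadable sector and verify it;
 * returns 1 if the block reads back fine afterwards, 0 otherwise */
static int rewrite_bad_block(int fd, off_t offset, const unsigned char *good)
{
    unsigned char verify[BLOCK_SIZE];

    if (pwrite(fd, good, BLOCK_SIZE, offset) != BLOCK_SIZE)
        return 0;                 /* write error: now really kick the disk */
    if (fsync(fd) != 0)
        return 0;
    if (pread(fd, verify, BLOCK_SIZE, offset) != BLOCK_SIZE)
        return 0;                 /* still unreadable: spare area full? */
    return memcmp(verify, good, BLOCK_SIZE) == 0;
}

int main(void)
{
    unsigned char good[BLOCK_SIZE];
    FILE *f = tmpfile();          /* demo on a plain file, not a real device */

    if (!f)
        return 1;
    memset(good, 0xab, sizeof(good));
    printf("block readable after rewrite: %d\n",
           rewrite_bad_block(fileno(f), 0, good));
    fclose(f);
    return 0;
}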
If you're having a lot of problems with disks getting kicked because
of bad blocks, then you need to diagnose some more to find out what
the actual problem is.
My best guess would be that either you're using an old version of MD
that won't try to write to bad blocks, or the spare area on your disk
is full, in which case it should be replaced. You can check the
status of spare areas on disks with 'smartctl' or similar.
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-22 10:47 ` Molle Bestefich
@ 2005-08-22 11:56 ` Pallai Roland
2005-08-22 13:55 ` Molle Bestefich
0 siblings, 1 reply; 9+ messages in thread
From: Pallai Roland @ 2005-08-22 11:56 UTC (permalink / raw)
To: Molle Bestefich; +Cc: Claas Hilbrecht, linux-raid
On Mon, 2005-08-22 at 12:47 +0200, Molle Bestefich wrote:
> Claas Hilbrecht wrote:
> > Pallai Roland wrote:
> > > this is a feature patch that implements 'proactive raid5 disk
> > > replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
> >
> > After my experience with a broken raid5 (read the list) I think the
> > "partially failed disks" feature you describe is really useful. I agree
> > with you that this kind of error is rather common.
>
> Horrible idea.
> Once you have a bad block on one disk, you have definitively lost your
> data redundancy.
> That's bad.
Hm, I think you've missed the point: yes, that drive should be replaced
as soon as you can, but the good sectors of that drive can still be
useful if bad sectors are discovered on another drive during the
rebuild. we must keep that drive in sync to keep those sectors useful;
that is what the badblock tolerance is for.
This is a common error when you have a lot of disks and can't do daily
media checks because of the IO load.
> What should be done about bad blocks instead of your suggestion is to
> try and write the data back to the bad block before kicking the disk.
> If this succeeds, and the data can then be read from the failed block,
> the disk has automatically reassigned the sector to the spare sector
> area. You have redundancy again and the bad sector is "fixed".
>
> If you're having a lot of problems with disks getting kicked because
> of bad blocks, then you need to diagnose some more to find out what
> the actual problem is.
>
> My best guess would be that either you're using an old version of MD
> that won't try to write to bad blocks, or the spare area on your disk
> is full, in which case it should be replaced. You can check the
> status of spare areas on disks with 'smartctl' or similar.
Which version of md tries to rewrite bad blocks in raid5?
My problem is with "hidden" bad blocks (never mind whether they're repairable
or not); the rewrite can't help, because you don't know a block is bad
until you try to rebuild the array from the degraded state onto a
replaced disk. I want to avoid rebuilding from the degraded state;
that is what the 'proactive replacement' feature is for. handling of *known*
bad blocks is another subject; yes, those should be rewritten asap
(but I think not necessarily the moment they are detected, see my previous
mails; one problem is that you can never be sure whether the rewrite succeeded or not).
--
dap
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-22 11:56 ` Pallai Roland
@ 2005-08-22 13:55 ` Molle Bestefich
2005-08-28 23:35 ` Neil Brown
0 siblings, 1 reply; 9+ messages in thread
From: Molle Bestefich @ 2005-08-22 13:55 UTC (permalink / raw)
To: Pallai Roland; +Cc: Claas Hilbrecht, linux-raid
Pallai Roland wrote:
> Molle Bestefich wrote:
> > Claas Hilbrecht wrote:
> > > Pallai Roland wrote:
> > > > this is a feature patch that implements 'proactive raid5 disk
> > > > replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
> > >
> > > After my experience with a broken raid5 (read the list) I think the
> > > "partially failed disks" feature you describe is really useful. I agree
> > > with you that this kind of error is rather common.
> >
> > Horrible idea.
> > Once you have a bad block on one disk, you have definitively lost your
> > data redundancy.
> > That's bad.
>
> Hm, I think you don't understand the point, yes, that should be
> replaced as soon as you can, but the good sectors of that drive can be
> useful if some bad sectors are discovered on an another drive during the
> rebuilding. we must keep that drive in sync to keep that sectors useful,
> this is why the badblock tolerance is.
Ok, I misunderstood you. Sorry, and thanks for the explanation.
> It is the common error if you've lot of disks and can't do daily media
> checks because of the IO load.
Agreed.
> > What should be done about bad blocks instead of your suggestion is to
> > try and write the data back to the bad block before kicking the disk.
> > If this succeeds, and the data can then be read from the failed block,
> > the disk has automatically reassigned the sector to the spare sector
> > area. You have redundancy again and the bad sector is "fixed".
> >
> > If you're having a lot of problems with disks getting kicked because
> > of bad blocks, then you need to diagnose some more to find out what
> > the actual problem is.
> >
> > My best guess would be that either you're using an old version of MD
> > that won't try to write to bad blocks, or the spare area on your disk
> > is full, in which case it should be replaced. You can check the
> > status of spare areas on disks with 'smartctl' or similar.
>
> Which version of md tries to rewrite bad blocks in raid5?
Haven't followed the discussions closely, but I sure hope that the
newest version does. (After all, spare areas are a somewhat old
feature in harddrives..)
> I've problem with "hidden" bad blocks (never mind if that's repairable
> or not), the rewrite can't help, cause you don't know if that's there
> until you don't try to rebuild the array from degraded state to a
> replaced disk. I want to avoid from the rebuiling from degraded state,
> this is why the 'proactive replacement' feature is.
Got it now. Super. Sounds good ;-).
(I hope that you're simply rebuilding to a spare before kicking the
drive, not doing something funky like remapping sectors or some
such..)
* Re: [PATCH] proactive raid5 disk replacement for 2.6.11
2005-08-22 13:55 ` Molle Bestefich
@ 2005-08-28 23:35 ` Neil Brown
0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2005-08-28 23:35 UTC (permalink / raw)
To: Molle Bestefich; +Cc: Pallai Roland, Claas Hilbrecht, linux-raid
On Monday August 22, molle.bestefich@gmail.com wrote:
> > >
> > > My best guess would be that either you're using an old version of MD
> > > that won't try to write to bad blocks, or the spare area on your disk
> > > is full, in which case it should be replaced. You can check the
> > > status of spare areas on disks with 'smartctl' or similar.
> >
> > Which version of md tries to rewrite bad blocks in raid5?
>
> Haven't followed the discussions closely, but I sure hope that the
> newest version does. (After all, spare areas are a somewhat old
> feature in harddrives..)
>
Sorry, but no - not yet.
Over-writing bad blocks is on my todo list, and will probably happen
this year, but it certainly isn't available yet.
NeilBrown