From: Tyler <pml@dtbb.net>
To: Pallai Roland <dap@mail.index.hu>
Cc: linux-raid@vger.kernel.org
Subject: Re: [PATCH] proactive raid5 disk replacement for 2.6.11, updated
Date: Wed, 17 Aug 2005 18:55:53 -0700
Message-ID: <4303EAA9.3080201@dtbb.net>
In-Reply-To: <1124322731.3810.77.camel@localhost.localdomain>
I think some of these features are great :) When you get into 15+
device raids, this becomes a very active issue.
Tyler.
Pallai Roland wrote:
> a per-device bad-block cache has been implemented to speed up arrays with
>partially failed drives (replies from those are often slow). it also
>helps identify badly damaged drives by their number of bad blocks,
>and can take action if that count steps over a user-defined threshold
>(see /proc/sys/dev/raid/badblock_tolerance).
>rewriting a bad stripe deletes the entry from the cache, so the cache
>honors the automatic sector reallocation feature of ATA drives
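The cache buckets sectors by stripe before hashing. Here is a minimal userspace model of the bucket math; the constants are assumptions matching a 4 KiB page, and `bb_hashnr` mirrors the macro of the same name in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the patch's per-device bad-block hash indexing.
 * BB_SHIFT = PAGE_SHIFT - 9, so with 4 KiB pages the 8 consecutive
 * 512-byte sectors of one stripe page share a hash chain.
 * BB_NR_HASH is assumed here; the patch derives it from PAGE_SIZE. */
#define BB_SHIFT     3
#define BB_NR_HASH   1024
#define BB_HASH_MASK (BB_NR_HASH - 1)

static unsigned bb_hashnr(uint64_t sector)
{
    /* drop the in-stripe bits, then mask down to the table size */
    return (unsigned)((sector >> BB_SHIFT) & BB_HASH_MASK);
}
```

With these assumed constants, sectors 0..7 land in bucket 0, sector 56 in bucket 7, and the bucket index wraps every BB_NR_HASH stripes.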
>
> performance is affected only slightly when there are no or few
>registered bad blocks, but over a million entries could currently be
>a problem; I'll examine it later..
>
> if we have a spare and a drive has been kicked, that spare becomes an
>'active spare' and a sync begins, but the original (failed) drive won't be
>kicked until the sync has finished. if the original drive still throws
>errors after being synced, the in_sync spare replaces it online;
>otherwise you can do it manually (mdadm -f)
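The lifecycle above can be summarized as a small decision function. This is a sketch with invented field and enum names, not the patch's actual code (the real logic lives in the reworked `error()` in raid5.c):

```c
#include <assert.h>

/* Hypothetical model of what happens when a member drive errors
 * while the array may have an 'active spare'. */
struct dev_state {
    int spare_present;   /* a spare has been added to the array */
    int spare_in_sync;   /* the active-spare mirror has completed */
};

enum action { FAIL_DRIVE, START_MIRROR, REPLACE_WITH_SPARE };

static enum action on_drive_error(const struct dev_state *s)
{
    if (!s->spare_present)
        return FAIL_DRIVE;        /* no spare: classic md behavior */
    if (!s->spare_in_sync)
        return START_MIRROR;      /* keep the failing drive until sync ends */
    return REPLACE_WITH_SPARE;    /* synced spare takes over online */
}
```

The key point the sketch captures: the failing drive is only ever removed once a fully synced replacement exists, or when the admin forces it.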
>
> you can check the list of registered bad sectors in /proc/mdstat (in
>debug mode), and the size of the cache with: grep _bbc /proc/slabinfo
>
>
> please let me know if you're interested; otherwise I won't flood
>the list with this topic..
>
>
>my /proc/mdstat now:
>
>md0 : active raid5 ram4[2] md2[1] md1[0]
> 8064 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> known bad sectors on active devices:
> ram4
> md2
> md1 56 136 232 472 600 872 1176 1248 1336 1568 1688 1952 2104
>
>md2 : active faulty ram1[0]
> 4096 blocks nfaults=0
>
>md1 : active faulty ram0[0]
> 4096 blocks ReadPersistent=92(100) nfaults=13
>
>
>--
> dap
>
>
>
>
>------------------------------------------------------------------------
>
>
> this is a feature patch that implements 'proactive raid5 disk
>replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
>which could help a lot on large raid5 arrays built from cheap sata
>drives when the IO traffic is so heavy that a daily media scan of the
>disks isn't possible.
> linux software raid is very fragile by default; the typical (nervous)
>breakdown scenario: I notice a bad block on a drive, replace the drive,
>and the resync fails because another 2-3 disks have hidden bad blocks too.
>I have to save the disks and rebuild the bad blocks with a userspace tool
>(by hand..), and meanwhile the site is down for hours. bad; especially
>when a pair of simple steps is enough to avoid this typical problem:
> 1. don't kick a drive on a read error, because it's possible that 99.99%
>of it is usable and it will help (to serve and to save data) if another
>drive shows bad sectors in the same array
> 2. allow mirroring of a partially failed drive to a spare _online_, and
>replace the source of the mirror with the spare when it's done. bad blocks
>aren't a problem unless the same sector is damaged on two disks, which is
>a rare case. this way it's possible to fix an array with partially failed
>drives without data loss and without downtime
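Step 2's online mirroring boils down to this: every stripe buffer written to the mirror-source column is duplicated into the active spare's column, and a successful rewrite drops any cached bad-block entry for that stripe. A sketch, with the struct and STRIPE_SIZE assumed for illustration (the patch does the copy with `memcpy` in `handle_stripe()` and the cache drop via `delete_badblock()`):

```c
#include <assert.h>
#include <string.h>

#define STRIPE_SIZE 4096            /* assumed: one page per stripe buffer */

struct column {
    unsigned char buf[STRIPE_SIZE];
    int bad;                        /* 1 if a cached bad block covers it */
};

/* Sketch of the patched write path with an active spare: the payload
 * lands in the source column, is copied into the spare's column, and
 * the rewrite clears the cached bad-block flag (honoring the drive's
 * automatic sector reallocation). */
static void write_with_active_spare(struct column *src, struct column *spare,
                                    const unsigned char *data)
{
    memcpy(src->buf, data, STRIPE_SIZE);
    memcpy(spare->buf, src->buf, STRIPE_SIZE);
    src->bad = 0;
}
```

Because ongoing writes are mirrored as they happen, the array can stay under full load while the spare catches up.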
>
> I'm not a programmer, just a sysadmin who runs a large software sata
>array, but my anger grew bigger than my laziness, so I made this patch
>over the weekend.. I don't understand every piece of the md code (e.g. the
>if-forest of handle_stripe :) yet, so this patch may be a bug colony
>and wrong by design, but I've tested it under heavy stress with both the
>'faulty' module and real disks, and it works fine!
>
> ideas, advice, bugfixes, and enhancements are welcome!
>
>
> (I know, raid6 could be another solution to this problem, but that has a
>large overhead.)
>
>
>use:
>
>1. patch the kernel, this one is against 2.6.11
>2. type:
>
># make drives
>mdadm -B -n1 -l faulty /dev/md/1 /dev/rd/0
>mdadm -B -n1 -l faulty /dev/md/2 /dev/rd/1
>mdadm -B -n1 -l faulty /dev/md/3 /dev/rd/2
>
># make the array
>mdadm -C -n3 -l5 /dev/md/0 /dev/md/1 /dev/md/2 /dev/md/3
>
># .. wait for sync ..
>
># grow bad blocks as ma*tor does
>mdadm --grow -l faulty -p rp454 /dev/md/1
>mdadm --grow -l faulty -p rp738 /dev/md/2
>
># add a spare
>mdadm -a /dev/md/0 /dev/rd/4
>
># -> fail a drive, sync begins <-
># md/1 will not be marked as failed, that is the point, but if you want to,
># you can issue this command again!
>mdadm -f /dev/md/0 /dev/md/1
>
># kernel:
># resync from md1 to spare ram4
># added spare for active resync
>
># .. watch the read errors from md[12] while the sync goes on!
># feel free to stress the md at this time, mkfs, dd, badblocks, etc
>
># kernel:
># raid5_spare_active: 3 in_sync 3->0
># /proc/mdstat:
># md0 : active raid5 ram4[0] md3[2] md2[1] md1[0]
># -> ram4 and md1 have the same id, which means the spare is a complete mirror;
># if you stop the array you can assemble it with ram4 instead of md1,
># the superblock is the same on both of them
>
># check the mirror (stop write stress if any)
>mdadm --grow -l faulty -p none /dev/md/1
>cmp /dev/md/1 /dev/rd/4
>
># hot-replace the mirrored (partially failed) device with the active spare
># (yes, mark it as failed again; if there's a syncing or already-synced
># 'active spare', -f either really fails the device or replaces it with
># the synced spare)
>mdadm -f /dev/md/0 /dev/md/1
>
># kernel:
># replace md1 with in_sync active spare ram4
>
># and voila!
># /proc/mdstat:
># md0 : active raid5 ram4[0] md3[2] md2[1]
>
>
>
>--- linux/include/linux/raid/raid5.h.orig 2005-03-03 23:51:29.000000000 +0100
>+++ linux/include/linux/raid/raid5.h 2005-08-14 03:02:11.000000000 +0200
>@@ -147,6 +147,7 @@
> #define R5_UPTODATE 0 /* page contains current data */
> #define R5_LOCKED 1 /* IO has been submitted on "req" */
> #define R5_OVERWRITE 2 /* towrite covers whole page */
>+#define R5_FAILED 8 /* failed to read this stripe */
> /* and some that are internal to handle_stripe */
> #define R5_Insync 3 /* rdev && rdev->in_sync at start */
> #define R5_Wantread 4 /* want to schedule a read */
>@@ -196,8 +197,16 @@
> */
>
>
>+struct badblock {
>+ struct badblock *hash_next, **hash_pprev; /* hash pointers */
>+ sector_t sector; /* stripe # */
>+};
>+
> struct disk_info {
> mdk_rdev_t *rdev;
>+ struct badblock **badblock_hashtbl; /* list of known badblocks */
>+ char cache_name[20];
>+ kmem_cache_t *slab_cache; /* badblock db */
> };
>
> struct raid5_private_data {
>@@ -224,6 +233,8 @@
> int inactive_blocked; /* release of inactive stripes blocked,
> * waiting for 25% to be free
> */
>+ int mirrorit; /* source for active spare resync */
>+
> spinlock_t device_lock;
> struct disk_info disks[0];
> };
>--- linux/include/linux/sysctl.h.orig 2005-07-06 20:19:10.000000000 +0200
>+++ linux/include/linux/sysctl.h 2005-08-17 22:01:28.000000000 +0200
>@@ -778,7 +778,8 @@
> /* /proc/sys/dev/raid */
> enum {
> DEV_RAID_SPEED_LIMIT_MIN=1,
>- DEV_RAID_SPEED_LIMIT_MAX=2
>+ DEV_RAID_SPEED_LIMIT_MAX=2,
>+ DEV_RAID_BADBLOCK_TOLERANCE=3
> };
>
> /* /proc/sys/dev/parport/default */
>--- linux/drivers/md/md.c.orig 2005-08-14 21:22:08.000000000 +0200
>+++ linux/drivers/md/md.c 2005-08-14 17:20:15.000000000 +0200
>@@ -78,6 +78,10 @@
> static int sysctl_speed_limit_min = 1000;
> static int sysctl_speed_limit_max = 200000;
>
>+/* over this limit the drive'll be marked as failed. measure is block. */
>+int sysctl_badblock_tolerance = 10000;
>+
>+
> static struct ctl_table_header *raid_table_header;
>
> static ctl_table raid_table[] = {
>@@ -97,6 +101,14 @@
> .mode = 0644,
> .proc_handler = &proc_dointvec,
> },
>+ {
>+ .ctl_name = DEV_RAID_BADBLOCK_TOLERANCE,
>+ .procname = "badblock_tolerance",
>+ .data = &sysctl_badblock_tolerance,
>+ .maxlen = sizeof(int),
>+ .mode = 0644,
>+ .proc_handler = &proc_dointvec,
>+ },
> { .ctl_name = 0 }
> };
>
>@@ -3525,10 +3537,12 @@
> }
> if (mddev->sync_thread) {
> /* resync has finished, collect result */
>+printk("md_check_recovery: resync has finished\n");
> md_unregister_thread(mddev->sync_thread);
> mddev->sync_thread = NULL;
> if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
> !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>+printk("md_check_recovery: activate any spares\n");
> /* success...*/
> /* activate any spares */
> mddev->pers->spare_active(mddev);
>@@ -3545,18 +3559,19 @@
>
> /* no recovery is running.
> * remove any failed drives, then
>- * add spares if possible
>+ * add spares if possible.
>+ * Spare are also removed and re-added, to allow
>+ * the personality to fail the re-add.
> */
>- ITERATE_RDEV(mddev,rdev,rtmp) {
>+ ITERATE_RDEV(mddev,rdev,rtmp)
> if (rdev->raid_disk >= 0 &&
>- rdev->faulty &&
>+ (rdev->faulty || ! rdev->in_sync) &&
> atomic_read(&rdev->nr_pending)==0) {
>+printk("md_check_recovery: hot_remove_disk\n");
> if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
> rdev->raid_disk = -1;
> }
>- if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
>- spares++;
>- }
>+
> if (mddev->degraded) {
> ITERATE_RDEV(mddev,rdev,rtmp)
> if (rdev->raid_disk < 0
>@@ -3764,4 +3783,6 @@
> EXPORT_SYMBOL(md_wakeup_thread);
> EXPORT_SYMBOL(md_print_devices);
> EXPORT_SYMBOL(md_check_recovery);
>+EXPORT_SYMBOL(kick_rdev_from_array); // fixme
>+EXPORT_SYMBOL(sysctl_badblock_tolerance);
> MODULE_LICENSE("GPL");
>--- linux/drivers/md/raid5.c.orig 2005-08-14 21:22:08.000000000 +0200
>+++ linux/drivers/md/raid5.c 2005-08-14 20:49:49.000000000 +0200
>@@ -40,6 +40,18 @@
>
> #define stripe_hash(conf, sect) ((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK])
>
>+ /*
>+ * per-device badblock cache
>+ */
>+
>+#define BB_SHIFT (PAGE_SHIFT/*12*/ - 9)
>+#define BB_HASH_PAGES 1
>+#define BB_NR_HASH (HASH_PAGES * PAGE_SIZE / sizeof(struct badblock *))
>+#define BB_HASH_MASK (BB_NR_HASH - 1)
>+
>+#define bb_hash(disk, sect) ((disk)->badblock_hashtbl[((sect) >> BB_SHIFT) & BB_HASH_MASK])
>+#define bb_hashnr(sect) (((sect) >> BB_SHIFT) & BB_HASH_MASK)
>+
> /* bio's attached to a stripe+device for I/O are linked together in bi_sector
> * order without overlap. There may be several bio's per stripe+device, and
> * a bio could span several devices.
>@@ -53,7 +65,7 @@
> /*
> * The following can be used to debug the driver
> */
>-#define RAID5_DEBUG 0
>+#define RAID5_DEBUG 1
> #define RAID5_PARANOIA 1
> #if RAID5_PARANOIA && defined(CONFIG_SMP)
> # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
>@@ -61,13 +73,159 @@
> # define CHECK_DEVLOCK()
> #endif
>
>-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
>+#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(KERN_DEBUG x)))
> #if RAID5_DEBUG
> #define inline
> #define __inline__
> #endif
>
> static void print_raid5_conf (raid5_conf_t *conf);
>+extern int sysctl_badblock_tolerance;
>+
>+
>+static void bb_insert_hash(struct disk_info *disk, struct badblock *bb)
>+{
>+ struct badblock **bbp = &bb_hash(disk, bb->sector);
>+
>+ /*printk("bb_insert_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector,
>+ bb_hashnr(bb->sector));*/
>+
>+ if ((bb->hash_next = *bbp) != NULL)
>+ (*bbp)->hash_pprev = &bb->hash_next;
>+ *bbp = bb;
>+ bb->hash_pprev = bbp;
>+}
>+
>+static void bb_remove_hash(struct badblock *bb)
>+{
>+ /*printk("remove_hash(), sector %llu hashnr %lu\n", (unsigned long long)bb->sector,
>+ bb_hashnr(bb->sector));*/
>+
>+ if (bb->hash_pprev) {
>+ if (bb->hash_next)
>+ bb->hash_next->hash_pprev = bb->hash_pprev;
>+ *bb->hash_pprev = bb->hash_next;
>+ bb->hash_pprev = NULL;
>+ }
>+}
>+
>+static struct badblock *__find_badblock(struct disk_info *disk, sector_t sector)
>+{
>+ struct badblock *bb;
>+
>+ for (bb = bb_hash(disk, sector); bb; bb = bb->hash_next)
>+ if (bb->sector == sector)
>+ return bb;
>+ return NULL;
>+}
>+
>+static struct badblock *find_badblock(struct disk_info *disk, sector_t sector)
>+{
>+ raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
>+ struct badblock *bb;
>+
>+ spin_lock_irq(&conf->device_lock);
>+ bb = __find_badblock(disk, sector);
>+ spin_unlock_irq(&conf->device_lock);
>+ return bb;
>+}
>+
>+static unsigned long count_badblocks (struct disk_info *disk)
>+{
>+ raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
>+ struct badblock *bb;
>+ int j;
>+ int n = 0;
>+
>+ spin_lock_irq(&conf->device_lock);
>+ for (j = 0; j < BB_NR_HASH; j++) {
>+ bb = disk->badblock_hashtbl[j];
>+ for (; bb; bb = bb->hash_next)
>+ n++;
>+ }
>+ spin_unlock_irq(&conf->device_lock);
>+
>+ return n;
>+}
>+
>+static int grow_badblocks(struct disk_info *disk)
>+{
>+ char b[BDEVNAME_SIZE];
>+ kmem_cache_t *sc;
>+
>+ /* hash table */
>+ if ((disk->badblock_hashtbl = (struct badblock **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL) {
>+ printk("grow_badblocks: __get_free_pages failed\n");
>+ return 0;
>+ }
>+ memset(disk->badblock_hashtbl, 0, BB_HASH_PAGES * PAGE_SIZE);
>+
>+ /* badblocks db */
>+ sprintf(disk->cache_name, "raid5/%s_%s_bbc", mdname(disk->rdev->mddev),
>+ bdevname(disk->rdev->bdev, b));
>+ sc = kmem_cache_create(disk->cache_name,
>+ sizeof(struct badblock),
>+ 0, 0, NULL, NULL);
>+ if (!sc) {
>+ printk("grow_badblocks: kmem_cache_create failed\n");
>+ return 1;
>+ }
>+ disk->slab_cache = sc;
>+
>+ return 0;
>+}
>+
>+static void shrink_badblocks(struct disk_info *disk)
>+{
>+ struct badblock *bb;
>+ int j;
>+
>+ /* badblocks db */
>+ for (j = 0; j < BB_NR_HASH; j++) {
>+ bb = disk->badblock_hashtbl[j];
>+ for (; bb; bb = bb->hash_next)
>+ kmem_cache_free(disk->slab_cache, bb);
>+ }
>+ kmem_cache_destroy(disk->slab_cache);
>+ disk->slab_cache = NULL;
>+
>+ /* hash table */
>+ free_pages((unsigned long) disk->badblock_hashtbl, HASH_PAGES_ORDER);
>+}
>+
>+static void store_badblock(struct disk_info *disk, sector_t sector)
>+{
>+ struct badblock *bb;
>+ raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
>+
>+ bb = kmem_cache_alloc(disk->slab_cache, GFP_KERNEL);
>+ if (!bb) {
>+ printk("store_badblock: kmem_cache_alloc failed\n");
>+ return;
>+ }
>+ memset(bb, 0, sizeof(*bb));
>+ bb->sector = sector;
>+
>+ spin_lock_irq(&conf->device_lock);
>+ bb_insert_hash(disk, bb);
>+ spin_unlock_irq(&conf->device_lock);
>+}
>+
>+static void delete_badblock(struct disk_info *disk, sector_t sector)
>+{
>+ struct badblock *bb;
>+ raid5_conf_t *conf = (raid5_conf_t *) disk->rdev->mddev->private;
>+
>+ bb = find_badblock(disk, sector);
>+ if (!bb)
>+ /* reset on write'll call us like an idiot :} */
>+ return;
>+ spin_lock_irq(&conf->device_lock);
>+ bb_remove_hash(bb);
>+ kmem_cache_free(disk->slab_cache, bb);
>+ spin_unlock_irq(&conf->device_lock);
>+}
>+
>
> static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
> {
>@@ -201,7 +359,7 @@
> sh->pd_idx = pd_idx;
> sh->state = 0;
>
>- for (i=disks; i--; ) {
>+ for (i=disks+1; i--; ) {
> struct r5dev *dev = &sh->dev[i];
>
> if (dev->toread || dev->towrite || dev->written ||
>@@ -291,8 +449,10 @@
>
> sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
>
>+ /* +1: we need extra space in the *sh->devs for the 'active spare' to keep
>+ handle_stripe() simple */
> sc = kmem_cache_create(conf->cache_name,
>- sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
>+ sizeof(struct stripe_head)+(devs-1+1)*sizeof(struct r5dev),
> 0, 0, NULL, NULL);
> if (!sc)
> return 1;
>@@ -301,12 +461,12 @@
> sh = kmem_cache_alloc(sc, GFP_KERNEL);
> if (!sh)
> return 1;
>- memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
>+ memset(sh, 0, sizeof(*sh) + (devs-1+1)*sizeof(struct r5dev));
> sh->raid_conf = conf;
> spin_lock_init(&sh->lock);
>
>- if (grow_buffers(sh, conf->raid_disks)) {
>- shrink_buffers(sh, conf->raid_disks);
>+ if (grow_buffers(sh, conf->raid_disks+1)) {
>+ shrink_buffers(sh, conf->raid_disks+1);
> kmem_cache_free(sc, sh);
> return 1;
> }
>@@ -391,10 +551,39 @@
> }
> #else
> set_bit(R5_UPTODATE, &sh->dev[i].flags);
>+ clear_bit(R5_FAILED, &sh->dev[i].flags);
> #endif
> } else {
>+ char b[BDEVNAME_SIZE];
>+
>+ /*
>+ rule 1.,: try to keep all disk in_sync even if we've got read errors,
>+ cause the 'active spare' may can rebuild a complete column from
>+ partially failed drives
>+ */
>+ if (conf->disks[i].rdev->in_sync && conf->working_disks < conf->raid_disks) {
>+ /* bad news, but keep it, cause md_error() would do a complete
>+ array shutdown, even if 99.99% is useable */
>+ printk(KERN_ALERT
>+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in degraded mode\n"
>+ ,bdevname(conf->disks[i].rdev->bdev, b),
>+ (unsigned long long)sh->sector, atomic_read(&sh->count));
>+ if (conf->mddev->curr_resync)
>+ /* raid5_add_disk() will no accept the spare again,
>+ and will not loop forever */
>+ conf->mddev->degraded = 2;
>+ } else if (conf->disks[i].rdev->in_sync && conf->working_disks >= conf->raid_disks) {
>+ /* will be computed */
>+ printk(KERN_ALERT
>+ "raid5_end_read_request: Read failure %s on sector %llu (%d) in optimal mode\n"
>+ ,bdevname(conf->disks[i].rdev->bdev, b),
>+ (unsigned long long)sh->sector, atomic_read(&sh->count));
>+ /* conf->disks[i].rerr++ */
>+ } else
>+ /* practically it never happens */
> md_error(conf->mddev, conf->disks[i].rdev);
>- clear_bit(R5_UPTODATE, &sh->dev[i].flags);
>+ clear_bit(R5_UPTODATE, &sh->dev[i].flags);
>+ set_bit(R5_FAILED, &sh->dev[i].flags);
> }
> rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
> #if 0
>@@ -430,10 +619,11 @@
> PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n",
> (unsigned long long)sh->sector, i, atomic_read(&sh->count),
> uptodate);
>+ /* sorry
> if (i == disks) {
> BUG();
> return 0;
>- }
>+ }*/
>
> spin_lock_irqsave(&conf->device_lock, flags);
> if (!uptodate)
>@@ -467,33 +657,144 @@
> dev->req.bi_private = sh;
>
> dev->flags = 0;
>- if (i != sh->pd_idx)
>+ if (i != sh->pd_idx && i < sh->raid_conf->raid_disks) /* active spare? */
> dev->sector = compute_blocknr(sh, i);
> }
>
>+static int raid5_remove_disk(mddev_t *mddev, int number);
>+static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev);
>+/*static*/ void kick_rdev_from_array(mdk_rdev_t * rdev);
>+//static void md_update_sb(mddev_t * mddev);
> static void error(mddev_t *mddev, mdk_rdev_t *rdev)
> {
> char b[BDEVNAME_SIZE];
>+ char b2[BDEVNAME_SIZE];
> raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
> PRINTK("raid5: error called\n");
>
> if (!rdev->faulty) {
>- mddev->sb_dirty = 1;
>- if (rdev->in_sync) {
>- conf->working_disks--;
>- mddev->degraded++;
>- conf->failed_disks++;
>- rdev->in_sync = 0;
>- /*
>- * if recovery was running, make sure it aborts.
>- */
>- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
>- }
>- rdev->faulty = 1;
>- printk (KERN_ALERT
>- "raid5: Disk failure on %s, disabling device."
>- " Operation continuing on %d devices\n",
>- bdevname(rdev->bdev,b), conf->working_disks);
>+ int mddisks = 0;
>+ mdk_rdev_t *rd;
>+ mdk_rdev_t *rdevs = NULL;
>+ struct list_head *rtmp;
>+ int i;
>+
>+ ITERATE_RDEV(mddev,rd,rtmp)
>+ {
>+ printk(KERN_INFO "mddev%d: %s\n", mddisks, bdevname(rd->bdev,b));
>+ mddisks++;
>+ }
>+ for (i = 0; (rd = conf->disks[i].rdev); i++) {
>+ printk(KERN_INFO "r5dev%d: %s\n", i, bdevname(rd->bdev,b));
>+ }
>+ ITERATE_RDEV(mddev,rd,rtmp)
>+ {
>+ rdevs = rd;
>+ break;
>+ }
>+printk("%d %d > %d %d ins:%d %p\n",
>+ mddev->raid_disks, mddisks, conf->raid_disks, mddev->degraded, rdev->in_sync, rdevs);
>+ if (conf->disks[conf->raid_disks].rdev == rdev && rdev->in_sync) {
>+ /* in_sync, but must be handled specially, don't let 'degraded++' */
>+ printk ("active spare failed %s (in_sync)\n",
>+ bdevname(rdev->bdev,b));
>+ mddev->sb_dirty = 1;
>+ rdev->in_sync = 0;
>+ rdev->faulty = 1;
>+ rdev->raid_disk = conf->raid_disks; /* me as myself, again ;) */
>+ conf->mirrorit = -1;
>+ } else if (mddisks > conf->raid_disks && !mddev->degraded && rdev->in_sync) {
>+ /* have active spare, array is optimal, removed disk member
>+ of it (but not the active spare) */
>+ if (rdev->raid_disk == conf->mirrorit && conf->disks[conf->raid_disks].rdev) {
>+ if (!conf->disks[conf->raid_disks].rdev->in_sync) {
>+ printk(KERN_ALERT "disk %s failed and active spare isn't in_sync yet, readd as normal spare\n",
>+ bdevname(rdev->bdev,b));
>+ /* maybe shouldn't stop here, but we can't call this disk as
>+ 'active spare' anymore, cause it's a simple rebuild from
>+ a degraded array, fear of bad blocks! */
>+ conf->mirrorit = -1;
>+ goto letitgo;
>+ } else {
>+ int ret;
>+
>+ /* hot replace the mirrored drive with the 'active spare'
>+ this is really "hot", I can't see clearly the things
>+ what I have to do here. :}
>+ pray. */
>+
>+ printk(KERN_ALERT "replace %s with in_sync active spare %s\n",
>+ bdevname(rdev->bdev,b),
>+ bdevname(rdevs->bdev,b2));
>+ rdev->in_sync = 0;
>+ rdev->faulty = 1;
>+
>+ conf->mirrorit = -1;
>+
>+ /* my God, am I sane? */
>+ while ((i = atomic_read(&rdev->nr_pending))) {
>+ printk("waiting for disk %d .. %d\n",
>+ rdev->raid_disk, i);
>+ }
>+ ret = raid5_remove_disk(mddev, rdev->raid_disk);
>+ if (ret) {
>+ printk(KERN_WARNING "raid5_remove_disk1: busy?!\n");
>+ return; // should nothing to do
>+ }
>+
>+ rd = conf->disks[conf->raid_disks].rdev;
>+ while ((i = atomic_read(&rd->nr_pending))) {
>+ printk("waiting for disk %d .. %d\n",
>+ conf->raid_disks, i);
>+ }
>+ rd->in_sync = 0;
>+ ret = raid5_remove_disk(mddev, conf->raid_disks);
>+ if (ret) {
>+ printk(KERN_WARNING "raid5_remove_disk2: busy?!\n");
>+ return; // ..
>+ }
>+
>+ ret = raid5_add_disk(mddev, rd);
>+ if (!ret) {
>+ printk(KERN_WARNING "raid5_add_disk: no free slot?!\n");
>+ return; // ..
>+ }
>+ rd->in_sync = 1;
>+
>+ /* borrowed from hot_remove_disk() */
>+ kick_rdev_from_array(rdev);
>+ //md_update_sb(mddev);
>+ }
>+ } else {
>+ /* in_sync disk failed (!degraded), trying to make a copy
>+ to a spare {and we can call it 'active spare' from now:} */
>+ printk(KERN_ALERT "resync from %s to spare %s (%d)\n",
>+ bdevname(rdev->bdev,b),
>+ bdevname(rdevs->bdev,b2),
>+ conf->raid_disks);
>+ conf->mirrorit = rdev->raid_disk;
>+
>+ mddev->degraded++; /* for call raid5_hot_add_disk(), reset there */
>+ }
>+ } else {
>+letitgo:
>+ mddev->sb_dirty = 1;
>+ if (rdev->in_sync) {
>+ conf->working_disks--;
>+ mddev->degraded++;
>+ conf->failed_disks++;
>+ rdev->in_sync = 0;
>+ /*
>+ * if recovery was running, make sure it aborts.
>+ */
>+ set_bit(MD_RECOVERY_ERR, &mddev->recovery);
>+ }
>+ rdev->faulty = 1;
>+ printk (KERN_ALERT
>+ "raid5: Disk failure on %s, disabling device."
>+ " Operation continuing on %d devices\n",
>+ bdevname(rdev->bdev,b), conf->working_disks);
>+ }
> }
> }
>
>@@ -888,6 +1189,8 @@
> int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
> int non_overwrite = 0;
> int failed_num=0;
>+ int aspare=0, asparenum=-1;
>+ struct disk_info *asparedev;
> struct r5dev *dev;
>
> PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
>@@ -899,10 +1202,18 @@
> clear_bit(STRIPE_DELAYED, &sh->state);
>
> syncing = test_bit(STRIPE_SYNCING, &sh->state);
>+ asparedev = &conf->disks[conf->raid_disks];
>+ if (!conf->mddev->degraded && asparedev->rdev && !asparedev->rdev->faulty &&
>+ conf->mirrorit != -1) {
>+ aspare++;
>+ asparenum = sh->raid_conf->mirrorit;
>+ PRINTK("has aspare (%d)\n", asparenum);
>+ }
> /* Now to look around and see what can be done */
>
>- for (i=disks; i--; ) {
>+ for (i=disks+aspare; i--; ) {
> mdk_rdev_t *rdev;
>+ struct badblock *bb = NULL;
> dev = &sh->dev[i];
> clear_bit(R5_Insync, &dev->flags);
> clear_bit(R5_Syncio, &dev->flags);
>@@ -945,12 +1256,43 @@
> }
> if (dev->written) written++;
> rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
>- if (!rdev || !rdev->in_sync) {
>+ if (rdev && rdev->in_sync &&
>+ !test_bit(R5_UPTODATE, &dev->flags) &&
>+ !test_bit(R5_LOCKED, &dev->flags)) {
>+ /* ..potentially deserved to read, we must check it
>+ checkme, it could be a big performance penalty if called
>+ without a good reason! it's seems ok for now
>+ */
>+ PRINTK("find_badblock %d: %llu\n", i, sh->sector);
>+ bb = find_badblock(&conf->disks[i], sh->sector);
>+ }
>+ if (!rdev || !rdev->in_sync
>+ || (test_bit(R5_FAILED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags))
>+ || bb) {
>+ if (rdev && rdev->in_sync && test_bit(R5_FAILED, &dev->flags) && !bb) {
>+ if (/*(!aspare || (aspare && asparedev->rdev->in_sync)) &&
>+ it would be clear, but too early, the thread hasn't woken, yet */
>+ conf->mirrorit == -1 &&
>+ count_badblocks(&conf->disks[i]) >= sysctl_badblock_tolerance) {
>+ char b[BDEVNAME_SIZE];
>+
>+ printk(KERN_ALERT "too many badblocks (%lu) on device %s, marking as failed\n",
>+ count_badblocks(&conf->disks[i]) + 1, bdevname(conf->disks[i].rdev->bdev, b));
>+ md_error(conf->mddev, conf->disks[i].rdev);
>+ }
>+ PRINTK("store_badblock %d: %llu\n", i, sh->sector);
>+ store_badblock(&conf->disks[i], sh->sector);
>+ }
> failed++;
> failed_num = i;
>- } else
>+ PRINTK("device %d failed for this stripe r%p w%p\n", i, dev->toread, dev->towrite);
>+ } else {
> set_bit(R5_Insync, &dev->flags);
>+ }
> }
>+ if (aspare && failed > 1)
>+ failed--; /* failed = 1 means "all ok" if we've aspare, this is simplest
>+ method to do our work */
> PRINTK("locked=%d uptodate=%d to_read=%d"
> " to_write=%d failed=%d failed_num=%d\n",
> locked, uptodate, to_read, to_write, failed, failed_num);
>@@ -1013,6 +1355,7 @@
> spin_unlock_irq(&conf->device_lock);
> }
> if (failed > 1 && syncing) {
>+ printk(KERN_ALERT "sync stopped by IO error\n");
> md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> clear_bit(STRIPE_SYNCING, &sh->state);
> syncing = 0;
>@@ -1184,6 +1527,26 @@
> PRINTK("Writing block %d\n", i);
> locked++;
> set_bit(R5_Wantwrite, &sh->dev[i].flags);
>+ if (aspare && i == asparenum) {
>+ char *ps, *pd;
>+
>+ /* mirroring this new block */
>+ PRINTK("Writing to aspare too %d->%d\n",
>+ i, conf->raid_disks);
>+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
>+ printk("damn, this is locked1!!!\n");
>+ }*/
>+ ps = page_address(sh->dev[i].page);
>+ pd = page_address(sh->dev[conf->raid_disks].page);
>+ /* better idea? */
>+ memcpy(pd, ps, STRIPE_SIZE);
>+ set_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags);
>+ set_bit(R5_Wantwrite, &sh->dev[conf->raid_disks].flags);
>+ }
>+ if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync) {
>+ PRINTK("reset badblock on %d: %llu\n", i, sh->sector);
>+ delete_badblock(&conf->disks[i], sh->sector);
>+ }
> if (!test_bit(R5_Insync, &sh->dev[i].flags)
> || (i==sh->pd_idx && failed == 0))
> set_bit(STRIPE_INSYNC, &sh->state);
>@@ -1220,20 +1583,39 @@
> if (failed==0)
> failed_num = sh->pd_idx;
> /* should be able to compute the missing block and write it to spare */
>+ if (aspare)
>+ failed_num = asparenum;
> if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
> if (uptodate+1 != disks)
> BUG();
> compute_block(sh, failed_num);
> uptodate++;
> }
>+ if (aspare) {
>+ char *ps, *pd;
>+
>+ ps = page_address(sh->dev[failed_num].page);
>+ pd = page_address(sh->dev[conf->raid_disks].page);
>+ memcpy(pd, ps, STRIPE_SIZE);
>+ PRINTK("R5_Wantwrite to aspare, uptodate: %d %p->%p\n",
>+ uptodate, ps, pd);
>+ /*if (test_bit(R5_LOCKED, &sh->dev[conf->raid_disks].flags)) {
>+ printk("damn, this is locked2!!!\n");
>+ }*/
>+ }
> if (uptodate != disks)
> BUG();
>+ if (aspare)
>+ failed_num = conf->raid_disks;
> dev = &sh->dev[failed_num];
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantwrite, &dev->flags);
> locked++;
> set_bit(STRIPE_INSYNC, &sh->state);
> set_bit(R5_Syncio, &dev->flags);
>+ /* !in_sync..
>+ printk("reset badblock on %d: %llu\n", failed_num, sh->sector);
>+ delete_badblock(&conf->disks[failed_num], sh->sector);*/
> }
> }
> if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
>@@ -1251,7 +1633,7 @@
> bi->bi_size = 0;
> bi->bi_end_io(bi, bytes, 0);
> }
>- for (i=disks; i-- ;) {
>+ for (i=disks+aspare; i-- ;) {
> int rw;
> struct bio *bi;
> mdk_rdev_t *rdev;
>@@ -1493,6 +1875,15 @@
> unplug_slaves(mddev);
> return 0;
> }
>+ /* if there is 1 or more failed drives and we are trying
>+ * to resync, then assert that we are finished, because there is
>+ * nothing we can do.
>+ */
>+ if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
>+ int rv = (mddev->size << 1) - sector_nr;
>+ md_done_sync(mddev, rv, 1);
>+ return rv;
>+ }
>
> x = sector_nr;
> chunk_offset = sector_div(x, sectors_per_chunk);
>@@ -1591,11 +1982,11 @@
> }
>
> mddev->private = kmalloc (sizeof (raid5_conf_t)
>- + mddev->raid_disks * sizeof(struct disk_info),
>+ + (mddev->raid_disks + 1) * sizeof(struct disk_info),
> GFP_KERNEL);
> if ((conf = mddev->private) == NULL)
> goto abort;
>- memset (conf, 0, sizeof (*conf) + mddev->raid_disks * sizeof(struct disk_info) );
>+ memset (conf, 0, sizeof (*conf) + (mddev->raid_disks + 1) * sizeof(struct disk_info) );
> conf->mddev = mddev;
>
> if ((conf->stripe_hashtbl = (struct stripe_head **) __get_free_pages(GFP_ATOMIC, HASH_PAGES_ORDER)) == NULL)
>@@ -1625,6 +2016,8 @@
>
> disk->rdev = rdev;
>
>+ grow_badblocks(disk);
>+
> if (rdev->in_sync) {
> char b[BDEVNAME_SIZE];
> printk(KERN_INFO "raid5: device %s operational as raid"
>@@ -1635,6 +2028,7 @@
> }
>
> conf->raid_disks = mddev->raid_disks;
>+ conf->mirrorit = -1;
> /*
> * 0 for a fully functional array, 1 for a degraded array.
> */
>@@ -1684,7 +2078,7 @@
> }
> }
> memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
>- conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
>+ (conf->raid_disks+1) * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
> if (grow_stripes(conf, conf->max_nr_stripes)) {
> printk(KERN_ERR
> "raid5: couldn't allocate %dkB for buffers\n", memory);
>@@ -1739,10 +2133,14 @@
> static int stop (mddev_t *mddev)
> {
> raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
>+ int i;
>
> md_unregister_thread(mddev->thread);
> mddev->thread = NULL;
> shrink_stripes(conf);
>+ for (i = conf->raid_disks; i--; )
>+ if (conf->disks[i].rdev && conf->disks[i].rdev->in_sync)
>+ shrink_badblocks(&conf->disks[i]);
> free_pages((unsigned long) conf->stripe_hashtbl, HASH_PAGES_ORDER);
> blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
> kfree(conf);
>@@ -1788,7 +2186,9 @@
> static void status (struct seq_file *seq, mddev_t *mddev)
> {
> raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
>- int i;
>+ int i, j;
>+ char b[BDEVNAME_SIZE];
>+ struct badblock *bb;
>
> seq_printf (seq, " level %d, %dk chunk, algorithm %d", mddev->level, mddev->chunk_size >> 10, mddev->layout);
> seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->working_disks);
>@@ -1801,6 +2201,20 @@
> #define D(x) \
> seq_printf (seq, "<"#x":%d>", atomic_read(&conf->x))
> printall(conf);
>+
>+ spin_lock_irq(&conf->device_lock); /* it's ok now for debug */
>+ seq_printf (seq, "\n known bad sectors on active devices:");
>+ for (i = conf->raid_disks; i--; ) {
>+ if (conf->disks[i].rdev) {
>+ seq_printf (seq, "\n %s", bdevname(conf->disks[i].rdev->bdev, b));
>+ for (j = 0; j < BB_NR_HASH; j++) {
>+ bb = conf->disks[i].badblock_hashtbl[j];
>+ for (; bb; bb = bb->hash_next)
>+ seq_printf (seq, " %llu-%llu", bb->sector, bb->sector + (unsigned long long)(STRIPE_SIZE / 512) - 1);
>+ }
>+ }
>+ }
>+ spin_unlock_irq(&conf->device_lock);
> #endif
> }
>
>@@ -1844,6 +2258,17 @@
> tmp->rdev->in_sync = 1;
> }
> }
>+ tmp = conf->disks + i;
>+ if (tmp->rdev && !tmp->rdev->faulty && !tmp->rdev->in_sync) {
>+ /* sync done to the 'active spare' */
>+ tmp->rdev->in_sync = 1;
>+
>+ printk(KERN_NOTICE "raid5_spare_active: %d in_sync %d->%d\n",
>+ i, tmp->rdev->raid_disk, conf->mirrorit);
>+
>+ /* scary..? :} */
>+ tmp->rdev->raid_disk = conf->mirrorit;
>+ }
> print_raid5_conf(conf);
> return 0;
> }
>@@ -1857,6 +2282,7 @@
>
> print_raid5_conf(conf);
> rdev = p->rdev;
>+printk("raid5_remove_disk %d\n", number);
> if (rdev) {
> if (rdev->in_sync ||
> atomic_read(&rdev->nr_pending)) {
>@@ -1870,6 +2296,8 @@
> err = -EBUSY;
> p->rdev = rdev;
> }
>+ if (!err)
>+ shrink_badblocks(p);
> }
> abort:
>
>@@ -1884,6 +2312,10 @@
> int disk;
> struct disk_info *p;
>
>+ if (mddev->degraded > 1)
>+ /* no point adding a device */
>+ return 0;
>+
> /*
> * find the disk ...
> */
>@@ -1895,6 +2327,22 @@
> p->rdev = rdev;
> break;
> }
>+
>+ if (!found) {
>+ /* array optimal, this should be the 'active spare' */
>+ conf->disks[disk].rdev = rdev;
>+ rdev->in_sync = 0;
>+ rdev->raid_disk = conf->raid_disks;
>+
>+ mddev->degraded--;
>+ found++; /* call resync */
>+
>+ printk(KERN_INFO "added spare for active resync\n");
>+ }
>+ if (found)
>+ grow_badblocks(&conf->disks[disk]);
>+ printk(KERN_INFO "raid5_add_disk: %d (%d)\n", disk, found);
>+
> print_raid5_conf(conf);
> return found;
> }
>
>
>------------------------------------------------------------------------
>
>
>