* [PATCH md 1 of 9] Remove possible oops in md/raid1
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 7 of 9] Optimised resync using Bitmap based intent logging NeilBrown
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


When we fetch a pointer and check that it is non-NULL, we must
not fetch it again but should instead use the pointer we got the
first time.  The rdev pointers in conf->mirrors[] can become NULL
at any moment (while a failed device is being removed), so a
second fetch after the check can return NULL and oops.
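
For illustration, a minimal sketch of the safe pattern (not the raid1
code itself; 'conf' and 'disk' as in read_balance, RCU held by the
caller):

	mdk_rdev_t *rdev = conf->mirrors[disk].rdev; /* fetch exactly once */

	if (rdev && rdev->in_sync)
		atomic_inc(&rdev->nr_pending);   /* same pointer we tested */

The buggy form re-read conf->mirrors[disk].rdev after the NULL check;
a failing device can set it to NULL in between, so the second fetch
could be dereferenced as NULL and oops.  The patch also re-checks
in_sync after the atomic_inc and retries, so a device that fails
between the check and the inc is never returned.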

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c |   40 ++++++++++++++++++++++++++++------------
 1 files changed, 28 insertions(+), 12 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:11.000000000 +1100
@@ -338,6 +338,7 @@ static int read_balance(conf_t *conf, r1
 	int new_disk = conf->last_used, disk = new_disk;
 	const int sectors = r1_bio->sectors;
 	sector_t new_distance, current_distance;
+	mdk_rdev_t *new_rdev, *rdev;
 
 	rcu_read_lock();
 	/*
@@ -345,13 +346,14 @@ static int read_balance(conf_t *conf, r1
 	 * device if no resync is going on, or below the resync window.
 	 * We take the first readable disk when above the resync window.
 	 */
+ retry:
 	if (conf->mddev->recovery_cp < MaxSector &&
 	    (this_sector + sectors >= conf->next_resync)) {
 		/* Choose the first operational device, for consistency */
 		new_disk = 0;
 
-		while (!conf->mirrors[new_disk].rdev ||
-		       !conf->mirrors[new_disk].rdev->in_sync) {
+		while ((new_rdev=conf->mirrors[new_disk].rdev) == NULL ||
+		       !new_rdev->in_sync) {
 			new_disk++;
 			if (new_disk == conf->raid_disks) {
 				new_disk = -1;
@@ -363,8 +365,8 @@ static int read_balance(conf_t *conf, r1
 
 
 	/* make sure the disk is operational */
-	while (!conf->mirrors[new_disk].rdev ||
-	       !conf->mirrors[new_disk].rdev->in_sync) {
+	while ((new_rdev=conf->mirrors[new_disk].rdev) == NULL ||
+	       !new_rdev->in_sync) {
 		if (new_disk <= 0)
 			new_disk = conf->raid_disks;
 		new_disk--;
@@ -393,18 +395,20 @@ static int read_balance(conf_t *conf, r1
 			disk = conf->raid_disks;
 		disk--;
 
-		if (!conf->mirrors[disk].rdev ||
-		    !conf->mirrors[disk].rdev->in_sync)
+		if ((rdev=conf->mirrors[disk].rdev) == NULL ||
+		    !rdev->in_sync)
 			continue;
 
-		if (!atomic_read(&conf->mirrors[disk].rdev->nr_pending)) {
+		if (!atomic_read(&rdev->nr_pending)) {
 			new_disk = disk;
+			new_rdev = rdev;
 			break;
 		}
 		new_distance = abs(this_sector - conf->mirrors[disk].head_position);
 		if (new_distance < current_distance) {
 			current_distance = new_distance;
 			new_disk = disk;
+			new_rdev = rdev;
 		}
 	} while (disk != conf->last_used);
 
@@ -414,7 +418,14 @@ rb_out:
 	if (new_disk >= 0) {
 		conf->next_seq_sect = this_sector + sectors;
 		conf->last_used = new_disk;
-		atomic_inc(&conf->mirrors[new_disk].rdev->nr_pending);
+		atomic_inc(&new_rdev->nr_pending);
+		if (!new_rdev->in_sync) {
+			/* cannot risk returning a device that failed
+			 * before we inc'ed nr_pending
+			 */
+			atomic_dec(&new_rdev->nr_pending);
+			goto retry;
+		}
 	}
 	rcu_read_unlock();
 
@@ -512,6 +523,7 @@ static int make_request(request_queue_t 
 	r1bio_t *r1_bio;
 	struct bio *read_bio;
 	int i, disks;
+	mdk_rdev_t *rdev;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -585,10 +597,14 @@ static int make_request(request_queue_t 
 	disks = conf->raid_disks;
 	rcu_read_lock();
 	for (i = 0;  i < disks; i++) {
-		if (conf->mirrors[i].rdev &&
-		    !conf->mirrors[i].rdev->faulty) {
-			atomic_inc(&conf->mirrors[i].rdev->nr_pending);
-			r1_bio->bios[i] = bio;
+		if ((rdev=conf->mirrors[i].rdev) != NULL &&
+		    !rdev->faulty) {
+			atomic_inc(&rdev->nr_pending);
+			if (rdev->faulty) {
+				atomic_dec(&rdev->nr_pending);
+				r1_bio->bios[i] = NULL;
+			} else
+				r1_bio->bios[i] = bio;
 		} else
 			r1_bio->bios[i] = NULL;
 	}


* [PATCH md 0 of 9] Introduction
@ 2005-02-18  0:40 NeilBrown
  2005-02-18  0:40 ` [PATCH md 1 of 9] Remove possible oops in md/raid1 NeilBrown
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


9 patches for md in 2.6.11-rc3-mm2 follow.

1 - tightens up some locking to close a race that is extremely unlikely
    to happen (famous last words).  It can only happen when failed devices are
    being removed.
2 - Fixes a resync related problem for raid5 and raid6 arrays that have
    lost more devices than they can cope with.
3 - Fixes a clumsy test that has very little real effect (just a printk).

These three can be considered stable.

4,5 Improve handling of 'safemode' which is where we mark the superblock
    clean whenever there have been no writes for a little while, and mark 
    it dirty before the next write.
6-9 Introduce "bitmap based intent logging" which allows a file to 
    store - as a bitmap - all blocks that may have been written recently.
    This allows resync to be optimised, which is more important as drives
    get larger.

These should not be considered stable - certainly actually using
the bitmap mode (which requires a 2.0 series version of mdadm) should
be seen as 'experimental'.

mdadm-2.0-devel-1  was released today.

NeilBrown


* [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (7 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 4 of 9] Merge md_enter_safemode into md_check_recovery NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:51   ` Mike Hardy
  2005-02-18 14:25 ` [PATCH md 0 of 9] Introduction Phantazm
  9 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


When an array is degraded, bits in the intent-bitmap are
never cleared.  So if a recently failed drive is re-added, we only
need to reconstruct the blocks that are still set in the
bitmap.
This patch adds support for this re-adding.
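
As a sketch, the slot selection in raid1_add_disk becomes the
following (note that the diff below re-initialises 'mirror' to 0 in
the for loop, so this shows the apparent intent rather than a literal
copy):

	int mirror = 0;

	/* prefer the slot this device held before it failed, if free */
	if (rdev->saved_raid_disk >= 0 &&
	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
		mirror = rdev->saved_raid_disk;

	for (; mirror < mddev->raid_disks; mirror++)
		if (conf->mirrors[mirror].rdev == NULL) {
			/* back in its old slot: a bitmap-based partial
			 * resync is enough; any other slot forces a
			 * full sync */
			if (rdev->saved_raid_disk != mirror)
				conf->fullsync = 1;
			break;
		}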

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c           |   71 ++++++++++++++++++++++++++++++++++----------
 ./drivers/md/raid1.c        |    7 +++-
 ./include/linux/raid/md_k.h |    4 ++
 3 files changed, 65 insertions(+), 17 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:26.000000000 +1100
@@ -620,6 +620,8 @@ static int super_90_validate(mddev_t *md
 	mdp_disk_t *desc;
 	mdp_super_t *sb = (mdp_super_t *)page_address(rdev->sb_page);
 
+	rdev->raid_disk = -1;
+	rdev->in_sync = 0;
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
@@ -650,16 +652,24 @@ static int super_90_validate(mddev_t *md
 		memcpy(mddev->uuid+12,&sb->set_uuid3, 4);
 
 		mddev->max_disks = MD_SB_DISKS;
-	} else {
-		__u64 ev1;
-		ev1 = md_event(sb);
+	} else if (mddev->pers == NULL) {
+		/* Insist on good event counter while assembling */
+		__u64 ev1 = md_event(sb);
 		++ev1;
 		if (ev1 < mddev->events) 
 			return -EINVAL;
-	}
+	} else if (mddev->bitmap) {
+		/* if adding to array with a bitmap, then we can accept an
+		 * older device ... but not too old.
+		 */
+		__u64 ev1 = md_event(sb);
+		if (ev1 < mddev->bitmap->events_cleared)
+			return 0;
+	} else /* just a hot-add of a new device, leave raid_disk at -1 */
+		return 0;
+
 	if (mddev->level != LEVEL_MULTIPATH) {
-		rdev->raid_disk = -1;
-		rdev->in_sync = rdev->faulty = 0;
+		rdev->faulty = 0;
 		desc = sb->disks + rdev->desc_nr;
 
 		if (desc->state & (1<<MD_DISK_FAULTY))
@@ -669,7 +679,8 @@ static int super_90_validate(mddev_t *md
 			rdev->in_sync = 1;
 			rdev->raid_disk = desc->raid_disk;
 		}
-	}
+	} else /* MULTIPATH are always insync */
+		rdev->in_sync = 1;
 	return 0;
 }
 
@@ -911,6 +922,8 @@ static int super_1_validate(mddev_t *mdd
 {
 	struct mdp_superblock_1 *sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
 
+	rdev->raid_disk = -1;
+	rdev->in_sync = 0;
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
@@ -928,13 +941,21 @@ static int super_1_validate(mddev_t *mdd
 		memcpy(mddev->uuid, sb->set_uuid, 16);
 
 		mddev->max_disks =  (4096-256)/2;
-	} else {
-		__u64 ev1;
-		ev1 = le64_to_cpu(sb->events);
+	} else if (mddev->pers == NULL) {
+		/* Insist on good event counter while assembling */
+		__u64 ev1 = le64_to_cpu(sb->events);
 		++ev1;
 		if (ev1 < mddev->events)
 			return -EINVAL;
-	}
+	} else if (mddev->bitmap) {
+		/* If adding to array with a bitmap, then we can accept an
+		 * older device, but not too old.
+		 */
+		__u64 ev1 = le64_to_cpu(sb->events);
+		if (ev1 < mddev->bitmap->events_cleared)
+			return 0;
+	} else /* just a hot-add of a new device, leave raid_disk at -1 */
+		return 0;
 
 	if (mddev->level != LEVEL_MULTIPATH) {
 		int role;
@@ -942,14 +963,10 @@ static int super_1_validate(mddev_t *mdd
 		role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
 		switch(role) {
 		case 0xffff: /* spare */
-			rdev->in_sync = 0;
 			rdev->faulty = 0;
-			rdev->raid_disk = -1;
 			break;
 		case 0xfffe: /* faulty */
-			rdev->in_sync = 0;
 			rdev->faulty = 1;
-			rdev->raid_disk = -1;
 			break;
 		default:
 			rdev->in_sync = 1;
@@ -957,7 +974,9 @@ static int super_1_validate(mddev_t *mdd
 			rdev->raid_disk = role;
 			break;
 		}
-	}
+	} else /* MULTIPATH are always insync */
+		rdev->in_sync = 1;
+
 	return 0;
 }
 
@@ -2211,6 +2230,18 @@ static int add_new_disk(mddev_t * mddev,
 				PTR_ERR(rdev));
 			return PTR_ERR(rdev);
 		}
+		/* set save_raid_disk if appropriate */
+		if (!mddev->persistent) {
+			if (info->state & (1<<MD_DISK_SYNC)  &&
+			    info->raid_disk < mddev->raid_disks)
+				rdev->raid_disk = info->raid_disk;
+			else
+				rdev->raid_disk = -1;
+		} else
+			super_types[mddev->major_version].
+				validate_super(mddev, rdev);
+		rdev->saved_raid_disk = rdev->raid_disk;
+
 		rdev->in_sync = 0; /* just to be sure */
 		rdev->raid_disk = -1;
 		err = bind_rdev_to_array(rdev, mddev);
@@ -3829,6 +3860,14 @@ void md_check_recovery(mddev_t *mddev)
 				mddev->pers->spare_active(mddev);
 			}
 			md_update_sb(mddev);
+
+			/* if array is no-longer degraded, then any saved_raid_disk
+			 * information must be scrapped
+			 */
+			if (!mddev->degraded)
+				ITERATE_RDEV(mddev,rdev,rtmp)
+					rdev->saved_raid_disk = -1;
+			
 			mddev->recovery = 0;
 			/* flag recovery needed just to double check */
 			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:26.000000000 +1100
@@ -811,9 +811,12 @@ static int raid1_add_disk(mddev_t *mddev
 {
 	conf_t *conf = mddev->private;
 	int found = 0;
-	int mirror;
+	int mirror = 0;
 	mirror_info_t *p;
 
+	if (rdev->saved_raid_disk >= 0 &&
+	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
+		mirror = rdev->saved_raid_disk;
 	for (mirror=0; mirror < mddev->raid_disks; mirror++)
 		if ( !(p=conf->mirrors+mirror)->rdev) {
 
@@ -830,6 +833,8 @@ static int raid1_add_disk(mddev_t *mddev
 			p->head_position = 0;
 			rdev->raid_disk = mirror;
 			found = 1;
+			if (rdev->saved_raid_disk != mirror)
+				conf->fullsync = 1;
 			p->rdev = rdev;
 			break;
 		}

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-02-18 11:11:26.000000000 +1100
@@ -183,6 +183,10 @@ struct mdk_rdev_s
 
 	int desc_nr;			/* descriptor index in the superblock */
 	int raid_disk;			/* role of device in array */
+	int saved_raid_disk;		/* role that device used to have in the 
+					 * array and could again if we did a partial
+					 * resync from the bitmap
+					 */
 
 	atomic_t	nr_pending;	/* number of pending requests.
 					 * only maintained for arrays that


* [PATCH md 6 of 9] Improve the interface to sync_request
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (3 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 5 of 9] Improve locking on 'safemode' and move superblock writes NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 3 of 9] Remove kludgy level check from md.c NeilBrown
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


1/ change the return value (which is number-of-sectors synced)
 from 'int' to 'sector_t'.
 The number of sectors is usually small enough to fit
 in an int, but if resync needs to abort, it may want to return
 the total number of remaining sectors, which could be large.
 Also, errors can no longer be returned as negative numbers, so
 0 is used instead.
2/ Add a 'skipped' return parameter to allow the array to report
 that it skipped the sectors.  This allows md to take this into account
 in the speed calculations.
 Currently there is no important skipping, but the bitmap-based-resync
 that is coming will use this.
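
A toy personality showing the new convention (a sketch, not code from
the patch; the 64-sector window is arbitrary):

	static sector_t demo_sync_request(mddev_t *mddev, sector_t sector_nr,
					  int *skipped, int go_faster)
	{
		sector_t max_sector = mddev->size << 1;

		if (sector_nr >= max_sector)
			return 0;		/* errors/end now return 0 */

		if (mddev->degraded) {
			/* no I/O possible: report the remainder as
			 * skipped so the speed calculation ignores it */
			*skipped = 1;
			return max_sector - sector_nr;
		}
		/* ... issue real resync I/O for one window here ... */
		return 64;	/* sectors actually submitted */
	}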

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c           |   39 ++++++++++++++++++++++++---------------
 ./drivers/md/raid1.c        |    8 ++++----
 ./drivers/md/raid10.c       |   19 ++++++++++++-------
 ./drivers/md/raid5.c        |    6 +++---
 ./drivers/md/raid6main.c    |    6 +++---
 ./include/linux/raid/md_k.h |    2 +-
 6 files changed, 47 insertions(+), 33 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:26.000000000 +1100
@@ -3361,12 +3361,13 @@ static void md_do_sync(mddev_t *mddev)
 	mddev_t *mddev2;
 	unsigned int currspeed = 0,
 		 window;
-	sector_t max_sectors,j;
+	sector_t max_sectors,j, io_sectors;
 	unsigned long mark[SYNC_MARKS];
 	sector_t mark_cnt[SYNC_MARKS];
 	int last_mark,m;
 	struct list_head *tmp;
 	sector_t last_check;
+	int skipped = 0;
 
 	/* just incase thread restarts... */
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
@@ -3432,7 +3433,7 @@ static void md_do_sync(mddev_t *mddev)
 
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 		/* resync follows the size requested by the personality,
-		 * which default to physical size, but can be virtual size
+		 * which defaults to physical size, but can be virtual size
 		 */
 		max_sectors = mddev->resync_max_sectors;
 	else
@@ -3451,9 +3452,10 @@ static void md_do_sync(mddev_t *mddev)
 		j = mddev->recovery_cp;
 	else
 		j = 0;
+	io_sectors = 0;
 	for (m = 0; m < SYNC_MARKS; m++) {
 		mark[m] = jiffies;
-		mark_cnt[m] = j;
+		mark_cnt[m] = io_sectors;
 	}
 	last_mark = 0;
 	mddev->resync_mark = mark[last_mark];
@@ -3478,26 +3480,32 @@ static void md_do_sync(mddev_t *mddev)
 	}
 
 	while (j < max_sectors) {
-		int sectors;
-
-		sectors = mddev->pers->sync_request(mddev, j, currspeed < sysctl_speed_limit_min);
-		if (sectors < 0) {
+		sector_t sectors;
+		skipped = 0;
+		sectors = mddev->pers->sync_request(mddev, j, &skipped,
+						    currspeed < sysctl_speed_limit_min);
+		if (sectors == 0) {
 			set_bit(MD_RECOVERY_ERR, &mddev->recovery);
 			goto out;
 		}
-		atomic_add(sectors, &mddev->recovery_active);
+		if (!skipped) { /* actual IO requested */
+			io_sectors += sectors;
+			atomic_add(sectors, &mddev->recovery_active);
+		}
+
 		j += sectors;
 		if (j>1) mddev->curr_resync = j;
+
+		if (last_check + window > io_sectors || j == max_sectors)
+			continue;
+
 		if (last_check==0)
 			/* This is the earliest that rebuild will be visible
 			 * in /proc/mdstat
 			 */
 			md_new_event();
 
-		if (last_check + window > j || j == max_sectors)
-			continue;
-
-		last_check = j;
+		last_check = io_sectors;
 
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
 		    test_bit(MD_RECOVERY_ERR, &mddev->recovery))
@@ -3511,7 +3519,7 @@ static void md_do_sync(mddev_t *mddev)
 			mddev->resync_mark = mark[next];
 			mddev->resync_mark_cnt = mark_cnt[next];
 			mark[next] = jiffies;
-			mark_cnt[next] = j - atomic_read(&mddev->recovery_active);
+			mark_cnt[next] = io_sectors - atomic_read(&mddev->recovery_active);
 			last_mark = next;
 		}
 
@@ -3538,7 +3546,8 @@ static void md_do_sync(mddev_t *mddev)
 		mddev->queue->unplug_fn(mddev->queue);
 		cond_resched();
 
-		currspeed = ((unsigned long)(j-mddev->resync_mark_cnt))/2/((jiffies-mddev->resync_mark)/HZ +1) +1;
+		currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2
+			/((jiffies-mddev->resync_mark)/HZ +1) +1;
 
 		if (currspeed > sysctl_speed_limit_min) {
 			if ((currspeed > sysctl_speed_limit_max) ||
@@ -3558,7 +3567,7 @@ static void md_do_sync(mddev_t *mddev)
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
-	mddev->pers->sync_request(mddev, max_sectors, 1);
+	mddev->pers->sync_request(mddev, max_sectors, &skipped, 1);
 
 	if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
 	    mddev->curr_resync > 2 &&

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:26.000000000 +1100
@@ -1010,7 +1010,7 @@ static int init_resync(conf_t *conf)
  * that can be installed to exclude normal IO requests.
  */
 
-static int sync_request(mddev_t *mddev, sector_t sector_nr, int go_faster)
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
 {
 	conf_t *conf = mddev_to_conf(mddev);
 	mirror_info_t *mirror;
@@ -1023,7 +1023,7 @@ static int sync_request(mddev_t *mddev, 
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
-			return -ENOMEM;
+			return 0;
 
 	max_sector = mddev->size << 1;
 	if (sector_nr >= max_sector) {
@@ -1107,8 +1107,8 @@ static int sync_request(mddev_t *mddev, 
 		/* There is nowhere to write, so all non-sync
 		 * drives must be failed - so we are finished
 		 */
-		int rv = max_sector - sector_nr;
-		md_done_sync(mddev, rv, 1);
+		sector_t rv = max_sector - sector_nr;
+		*skipped = 1;
 		put_buf(r1_bio);
 		rdev_dec_pending(conf->mirrors[disk].rdev, mddev);
 		return rv;

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid10.c	2005-02-18 11:11:26.000000000 +1100
@@ -1321,7 +1321,7 @@ static int init_resync(conf_t *conf)
  *
  */
 
-static int sync_request(mddev_t *mddev, sector_t sector_nr, int go_faster)
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
 {
 	conf_t *conf = mddev_to_conf(mddev);
 	r10bio_t *r10_bio;
@@ -1335,7 +1335,7 @@ static int sync_request(mddev_t *mddev, 
 
 	if (!conf->r10buf_pool)
 		if (init_resync(conf))
-			return -ENOMEM;
+			return 0;
 
  skipped:
 	max_sector = mddev->size << 1;
@@ -1343,15 +1343,15 @@ static int sync_request(mddev_t *mddev, 
 		max_sector = mddev->resync_max_sectors;
 	if (sector_nr >= max_sector) {
 		close_sync(conf);
+		*skipped = 1;
 		return sectors_skipped;
 	}
 	if (chunks_skipped >= conf->raid_disks) {
 		/* if there has been nothing to do on any drive,
 		 * then there is nothing to do at all..
 		 */
-		sector_t sec = max_sector - sector_nr;
-		md_done_sync(mddev, sec, 1);
-		return sec + sectors_skipped;
+		*skipped = 1;
+		return (max_sector - sector_nr) + sectors_skipped;
 	}
 
 	/* make sure whole request will fit in a chunk - if chunks
@@ -1565,17 +1565,22 @@ static int sync_request(mddev_t *mddev, 
 		}
 	}
 
+	if (sectors_skipped)
+		/* pretend they weren't skipped, it makes
+		 * no important difference in this case
+		 */
+		md_done_sync(mddev, sectors_skipped, 1);
+
 	return sectors_skipped + nr_sectors;
  giveup:
 	/* There is nowhere to write, so all non-sync
 	 * drives must be failed, so try the next chunk...
 	 */
 	{
-	int sec = max_sector - sector_nr;
+	sector_t sec = max_sector - sector_nr;
 	sectors_skipped += sec;
 	chunks_skipped ++;
 	sector_nr = max_sector;
-	md_done_sync(mddev, sec, 1);
 	goto skipped;
 	}
 }

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/raid5.c	2005-02-18 11:11:26.000000000 +1100
@@ -1475,7 +1475,7 @@ static int make_request (request_queue_t
 }
 
 /* FIXME go_faster isn't used */
-static int sync_request (mddev_t *mddev, sector_t sector_nr, int go_faster)
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
@@ -1498,8 +1498,8 @@ static int sync_request (mddev_t *mddev,
 	 * nothing we can do.
 	 */
 	if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-		int rv = (mddev->size << 1) - sector_nr;
-		md_done_sync(mddev, rv, 1);
+		sector_t rv = (mddev->size << 1) - sector_nr;
+		*skipped = 1;
 		return rv;
 	}
 

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-02-18 11:11:26.000000000 +1100
@@ -1634,7 +1634,7 @@ static int make_request (request_queue_t
 }
 
 /* FIXME go_faster isn't used */
-static int sync_request (mddev_t *mddev, sector_t sector_nr, int go_faster)
+static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
 {
 	raid6_conf_t *conf = (raid6_conf_t *) mddev->private;
 	struct stripe_head *sh;
@@ -1657,8 +1657,8 @@ static int sync_request (mddev_t *mddev,
 	 * nothing we can do.
 	 */
 	if (mddev->degraded >= 2 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-		int rv = (mddev->size << 1) - sector_nr;
-		md_done_sync(mddev, rv, 1);
+		sector_t rv = (mddev->size << 1) - sector_nr;
+		*skipped = 1;
 		return rv;
 	}
 

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-02-18 11:11:26.000000000 +1100
@@ -298,7 +298,7 @@ struct mdk_personality_s
 	int (*hot_add_disk) (mddev_t *mddev, mdk_rdev_t *rdev);
 	int (*hot_remove_disk) (mddev_t *mddev, int number);
 	int (*spare_active) (mddev_t *mddev);
-	int (*sync_request)(mddev_t *mddev, sector_t sector_nr, int go_faster);
+	sector_t (*sync_request)(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster);
 	int (*resize) (mddev_t *mddev, sector_t sectors);
 	int (*reshape) (mddev_t *mddev, int raid_disks);
 	int (*reconfig) (mddev_t *mddev, int layout, int chunk_size);


* [PATCH md 5 of 9] Improve locking on 'safemode' and move superblock writes
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (2 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 8 of 9] raid1 support for bitmap " NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 6 of 9] Improve the interface to sync_request NeilBrown
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which
could cause problems later.
With this patch, the superblock write is always done by the
helper thread, and write requests are delayed until that
write completes.

Also, the locking around marking the array dirty and writing
the superblock is improved to avoid possible races.
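
The calling pattern each personality adopts below (a sketch; error
handling elided):

	static int make_request(request_queue_t *q, struct bio *bio)
	{
		mddev_t *mddev = q->queuedata;

		/* Returns 0 if the superblock must be marked dirty first:
		 * the bio has been queued on mddev->write_list and the md
		 * thread will resubmit it after the superblock write. */
		if (md_write_start(mddev, bio) == 0)
			return 0;

		/* ... normal request processing for reads, and for writes
		 * once the superblock is known to be marked dirty ... */
		return 0;
	}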

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c           |   73 +++++++++++++++++++++++++++++++++++---------
 ./drivers/md/raid1.c        |    4 +-
 ./drivers/md/raid10.c       |    5 ++-
 ./drivers/md/raid5.c        |    6 ++-
 ./drivers/md/raid6main.c    |    6 ++-
 ./include/linux/raid/md.h   |    2 -
 ./include/linux/raid/md_k.h |    7 ++++
 7 files changed, 82 insertions(+), 21 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:25.000000000 +1100
@@ -261,6 +261,8 @@ static mddev_t * mddev_find(dev_t unit)
 	INIT_LIST_HEAD(&new->all_mddevs);
 	init_timer(&new->safemode_timer);
 	atomic_set(&new->active, 1);
+	bio_list_init(&new->write_list);
+	spin_lock_init(&new->write_lock);
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -1294,9 +1296,11 @@ static void md_update_sb(mddev_t * mddev
 	int err, count = 100;
 	struct list_head *tmp;
 	mdk_rdev_t *rdev;
+	int sync_req;
 
-	mddev->sb_dirty = 0;
 repeat:
+	spin_lock(&mddev->write_lock);
+	sync_req = mddev->in_sync;
 	mddev->utime = get_seconds();
 	mddev->events ++;
 
@@ -1315,8 +1319,12 @@ repeat:
 	 * do not write anything to disk if using
 	 * nonpersistent superblocks
 	 */
-	if (!mddev->persistent)
+	if (!mddev->persistent) {
+		mddev->sb_dirty = 0;
+		spin_unlock(&mddev->write_lock);
 		return;
+	}
+	spin_unlock(&mddev->write_lock);
 
 	dprintk(KERN_INFO 
 		"md: updating %s RAID superblock on device (in sync %d)\n",
@@ -1347,6 +1355,15 @@ repeat:
 		printk(KERN_ERR \
 			"md: excessive errors occurred during superblock update, exiting\n");
 	}
+	spin_lock(&mddev->write_lock);
+	if (mddev->in_sync != sync_req) {
+		/* have to write it out again */
+		spin_unlock(&mddev->write_lock);
+		goto repeat;
+	}
+	mddev->sb_dirty = 0;
+	spin_unlock(&mddev->write_lock);
+
 }
 
 /*
@@ -3298,19 +3315,31 @@ void md_done_sync(mddev_t *mddev, int bl
 }
 
 
-void md_write_start(mddev_t *mddev)
+/* md_write_start(mddev, bi)
+ * If we need to update some array metadata (e.g. 'active' flag
+ * in superblock) before writing, queue bi for later writing
+ * and return 0, else return 1 and it will be written now
+ */
+int md_write_start(mddev_t *mddev, struct bio *bi)
 {
-	if (!atomic_read(&mddev->writes_pending)) {
-		mddev_lock_uninterruptible(mddev);
-		if (mddev->in_sync) {
-			mddev->in_sync = 0;
- 			del_timer(&mddev->safemode_timer);
-			md_update_sb(mddev);
-		}
-		atomic_inc(&mddev->writes_pending);
-		mddev_unlock(mddev);
-	} else
-		atomic_inc(&mddev->writes_pending);
+	if (bio_data_dir(bi) != WRITE)
+		return 1;
+
+	atomic_inc(&mddev->writes_pending);
+	spin_lock(&mddev->write_lock);
+	if (mddev->in_sync == 0 && mddev->sb_dirty == 0) {
+		spin_unlock(&mddev->write_lock);
+		return 1;
+	}
+	bio_list_add(&mddev->write_list, bi);
+
+	if (mddev->in_sync) {
+		mddev->in_sync = 0;
+		mddev->sb_dirty = 1;
+	}
+	spin_unlock(&mddev->write_lock);
+	md_wakeup_thread(mddev->thread);
+	return 0;
 }
 
 void md_write_end(mddev_t *mddev)
@@ -3597,6 +3626,7 @@ void md_check_recovery(mddev_t *mddev)
 		mddev->sb_dirty ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		mddev->write_list.head ||
 		(mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -3605,7 +3635,9 @@ void md_check_recovery(mddev_t *mddev)
 
 	if (mddev_trylock(mddev)==0) {
 		int spares =0;
+		struct bio *blist;
 
+		spin_lock(&mddev->write_lock);
 		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
 		    !mddev->in_sync && mddev->recovery_cp == MaxSector) {
 			mddev->in_sync = 1;
@@ -3613,9 +3645,22 @@ void md_check_recovery(mddev_t *mddev)
 		}
 		if (mddev->safemode == 1)
 			mddev->safemode = 0;
+		blist = bio_list_get(&mddev->write_list);
+		spin_unlock(&mddev->write_lock);
 
 		if (mddev->sb_dirty)
 			md_update_sb(mddev);
+
+		while (blist) {
+			struct bio *b = blist;
+			blist = blist->bi_next;
+			b->bi_next = NULL;
+			generic_make_request(b);
+			/* we already counted this, so need to un-count */
+			md_write_end(mddev);
+		}
+
+
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
 			/* resync/recovery still happening */

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:25.000000000 +1100
@@ -530,6 +530,8 @@ static int make_request(request_queue_t 
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
 	 */
+	if (md_write_start(mddev, bio)==0)
+		return 0;
 	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_resume, !conf->barrier, conf->resync_lock, );
 	conf->nr_pending++;
@@ -611,7 +613,7 @@ static int make_request(request_queue_t 
 	rcu_read_unlock();
 
 	atomic_set(&r1_bio->remaining, 1);
-	md_write_start(mddev);
+
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
 		if (!r1_bio->bios[i])

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid10.c	2005-02-18 11:11:25.000000000 +1100
@@ -700,6 +700,9 @@ static int make_request(request_queue_t 
 		return 0;
 	}
 
+	if (md_write_start(mddev, bio) == 0)
+		return 0;
+
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
@@ -774,7 +777,7 @@ static int make_request(request_queue_t 
 	rcu_read_unlock();
 
 	atomic_set(&r10_bio->remaining, 1);
-	md_write_start(mddev);
+
 	for (i = 0; i < conf->copies; i++) {
 		struct bio *mbio;
 		int d = r10_bio->devs[i].devnum;

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid5.c	2005-02-18 11:11:26.000000000 +1100
@@ -1409,6 +1409,9 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
+	if (md_write_start(mddev, bi)==0)
+		return 0;
+
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);
 		disk_stat_add(mddev->gendisk, write_sectors, bio_sectors(bi));
@@ -1421,8 +1424,7 @@ static int make_request (request_queue_t
 	last_sector = bi->bi_sector + (bi->bi_size>>9);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
-	if ( bio_data_dir(bi) == WRITE )
-		md_write_start(mddev);
+
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-02-18 11:11:25.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-02-18 11:11:26.000000000 +1100
@@ -1568,6 +1568,9 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
+	if (md_write_start(mddev, bi)==0)
+		return 0;
+
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);
 		disk_stat_add(mddev->gendisk, write_sectors, bio_sectors(bi));
@@ -1581,8 +1584,7 @@ static int make_request (request_queue_t
 
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
-	if ( bio_data_dir(bi) == WRITE )
-		md_write_start(mddev);
+
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./include/linux/raid/md.h	2005-02-18 11:11:26.000000000 +1100
@@ -69,7 +69,7 @@ extern mdk_thread_t * md_register_thread
 extern void md_unregister_thread (mdk_thread_t *thread);
 extern void md_wakeup_thread(mdk_thread_t *thread);
 extern void md_check_recovery(mddev_t *mddev);
-extern void md_write_start(mddev_t *mddev);
+extern int md_write_start(mddev_t *mddev, struct bio *bi);
 extern void md_write_end(mddev_t *mddev);
 extern void md_handle_safemode(mddev_t *mddev);
 extern void md_done_sync(mddev_t *mddev, int blocks, int ok);

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-02-18 11:11:26.000000000 +1100
@@ -15,6 +15,9 @@
 #ifndef _MD_K_H
 #define _MD_K_H
 
+/* and dm-bio-list.h is not under include/linux because.... ??? */
+#include "../../../drivers/md/dm-bio-list.h"
+
 #define MD_RESERVED       0UL
 #define LINEAR            1UL
 #define RAID0             2UL
@@ -252,6 +255,10 @@ struct mddev_s
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
+
+	spinlock_t			write_lock;
+	struct bio_list			write_list;
+
 	unsigned int			safemode;	/* if set, update "clean" superblock
 							 * when no writes pending.
 							 */ 


* [PATCH md 8 of 9] raid1 support for bitmap intent logging
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
  2005-02-18  0:40 ` [PATCH md 1 of 9] Remove possible oops in md/raid1 NeilBrown
  2005-02-18  0:40 ` [PATCH md 7 of 9] Optimised resync using Bitmap based intent logging NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 5 of 9] Improve locking on 'safemode' and move superblock writes NeilBrown
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid



Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c         |  184 ++++++++++++++++++++++++++++++++++++-------
 ./include/linux/raid/raid1.h |   16 +++
 2 files changed, 169 insertions(+), 31 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:26.000000000 +1100
@@ -12,6 +12,15 @@
  * Fixes to reconstruction by Jakob Østergaard" <jakob@ostenfeld.dk>
  * Various fixes by Neil Brown <neilb@cse.unsw.edu.au>
  *
+ * Changes by Peter T. Breuer <ptb@it.uc3m.es> 31/1/2003 to support
+ * bitmapped intelligence in resync:
+ *
+ *      - bitmap marked during normal i/o
+ *      - bitmap used to skip nondirty blocks during sync
+ *
+ * Additions to bitmap code, (C) 2003-2004 Paul Clements, SteelEye Technology:
+ * - persistent bitmap code
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2, or (at your option)
@@ -22,7 +31,16 @@
  * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
+#include "dm-bio-list.h"
 #include <linux/raid/raid1.h>
+#include <linux/raid/bitmap.h>
+
+#define DEBUG 0
+#if DEBUG
+#define PRINTK(x...) printk(x)
+#else
+#define PRINTK(x...)
+#endif
 
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
@@ -287,9 +305,11 @@ static int raid1_end_write_request(struc
 	/*
 	 * this branch is our 'one mirror IO has finished' event handler:
 	 */
-	if (!uptodate)
+	if (!uptodate) {
 		md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
-	else
+		/* an I/O failed, we can't clear the bitmap */
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	} else
 		/*
 		 * Set R1BIO_Uptodate in our master bio, so that
 		 * we will return a good error code for to the higher
@@ -309,6 +329,10 @@ static int raid1_end_write_request(struc
 	 * already.
 	 */
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
+		/* clear the bitmap if all writes complete successfully */
+		bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+				r1_bio->sectors,
+				!test_bit(R1BIO_Degraded, &r1_bio->state));
 		md_write_end(r1_bio->mddev);
 		raid_end_bio_io(r1_bio);
 	}
@@ -458,7 +482,10 @@ static void unplug_slaves(mddev_t *mddev
 
 static void raid1_unplug(request_queue_t *q)
 {
-	unplug_slaves(q->queuedata);
+	mddev_t *mddev = q->queuedata;
+
+	unplug_slaves(mddev);
+	md_wakeup_thread(mddev->thread);
 }
 
 static int raid1_issue_flush(request_queue_t *q, struct gendisk *disk,
@@ -501,16 +528,16 @@ static void device_barrier(conf_t *conf,
 {
 	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_idle, !waitqueue_active(&conf->wait_resume),
-			    conf->resync_lock, unplug_slaves(conf->mddev));
+			    conf->resync_lock, raid1_unplug(conf->mddev->queue));
 	
 	if (!conf->barrier++) {
 		wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-				    conf->resync_lock, unplug_slaves(conf->mddev));
+				    conf->resync_lock, raid1_unplug(conf->mddev->queue));
 		if (conf->nr_pending)
 			BUG();
 	}
 	wait_event_lock_irq(conf->wait_resume, conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock, unplug_slaves(conf->mddev));
+			    conf->resync_lock, raid1_unplug(conf->mddev->queue));
 	conf->next_resync = sect;
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -522,8 +549,12 @@ static int make_request(request_queue_t 
 	mirror_info_t *mirror;
 	r1bio_t *r1_bio;
 	struct bio *read_bio;
-	int i, disks;
+	int i, targets = 0, disks;
 	mdk_rdev_t *rdev;
+	struct bitmap *bitmap = mddev->bitmap;
+	unsigned long flags;
+	struct bio_list bl;
+
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -554,7 +585,7 @@ static int make_request(request_queue_t 
 
 	r1_bio->master_bio = bio;
 	r1_bio->sectors = bio->bi_size >> 9;
-
+	r1_bio->state = 0;
 	r1_bio->mddev = mddev;
 	r1_bio->sector = bio->bi_sector;
 
@@ -597,6 +628,13 @@ static int make_request(request_queue_t 
 	 * bios[x] to bio
 	 */
 	disks = conf->raid_disks;
+#if 0
+	{ static int first=1;
+	if (first) printk("First Write sector %llu disks %d\n",
+			  (unsigned long long)r1_bio->sector, disks);
+	first = 0;
+	}
+#endif
 	rcu_read_lock();
 	for (i = 0;  i < disks; i++) {
 		if ((rdev=conf->mirrors[i].rdev) != NULL &&
@@ -607,13 +645,21 @@ static int make_request(request_queue_t 
 				r1_bio->bios[i] = NULL;
 			} else
 				r1_bio->bios[i] = bio;
+			targets++;
 		} else
 			r1_bio->bios[i] = NULL;
 	}
 	rcu_read_unlock();
 
-	atomic_set(&r1_bio->remaining, 1);
+	if (targets < conf->raid_disks) {
+		/* array is degraded, we will not clear the bitmap
+		 * on I/O completion (see raid1_end_write_request) */
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	}
+
+	atomic_set(&r1_bio->remaining, 0);
 
+	bio_list_init(&bl);
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
 		if (!r1_bio->bios[i])
@@ -629,14 +675,23 @@ static int make_request(request_queue_t 
 		mbio->bi_private = r1_bio;
 
 		atomic_inc(&r1_bio->remaining);
-		generic_make_request(mbio);
-	}
 
-	if (atomic_dec_and_test(&r1_bio->remaining)) {
-		md_write_end(mddev);
-		raid_end_bio_io(r1_bio);
+		bio_list_add(&bl, mbio);
 	}
 
+	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors);
+	spin_lock_irqsave(&conf->device_lock, flags);
+	bio_list_merge(&conf->pending_bio_list, &bl);
+	bio_list_init(&bl);
+
+	blk_plug_device(mddev->queue);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	
+#if 0
+	while ((bio = bio_list_pop(&bl)) != NULL) 
+		generic_make_request(bio);
+#endif
+
 	return 0;
 }
 
@@ -716,7 +771,7 @@ static void close_sync(conf_t *conf)
 {
 	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_resume, !conf->barrier,
-			    conf->resync_lock, 	unplug_slaves(conf->mddev));
+			    conf->resync_lock, 	raid1_unplug(conf->mddev->queue));
 	spin_unlock_irq(&conf->resync_lock);
 
 	if (conf->barrier) BUG();
@@ -830,10 +885,11 @@ static int end_sync_read(struct bio *bio
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (!uptodate)
+	if (!uptodate) {
 		md_error(r1_bio->mddev,
 			 conf->mirrors[r1_bio->read_disk].rdev);
-	else
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	} else
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 	rdev_dec_pending(conf->mirrors[r1_bio->read_disk].rdev, conf->mddev);
 	reschedule_retry(r1_bio);
@@ -857,8 +913,10 @@ static int end_sync_write(struct bio *bi
 			mirror = i;
 			break;
 		}
-	if (!uptodate)
+	if (!uptodate) {
 		md_error(mddev, conf->mirrors[mirror].rdev);
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	}
 	update_head_pos(mirror, r1_bio);
 
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
@@ -878,6 +936,9 @@ static void sync_request_write(mddev_t *
 
 	bio = r1_bio->bios[r1_bio->read_disk];
 
+/*
+	if (r1_bio->sector == 0) printk("First sync write starts\n");
+*/
 	/*
 	 * schedule writes
 	 */
@@ -905,10 +966,12 @@ static void sync_request_write(mddev_t *
 		atomic_inc(&conf->mirrors[i].rdev->nr_pending);
 		atomic_inc(&r1_bio->remaining);
 		md_sync_acct(conf->mirrors[i].rdev->bdev, wbio->bi_size >> 9);
+
 		generic_make_request(wbio);
 	}
 
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
+		/* if we're here, all write(s) have completed, so clean up */
 		md_done_sync(mddev, r1_bio->sectors, 1);
 		put_buf(r1_bio);
 	}
@@ -937,6 +1000,26 @@ static void raid1d(mddev_t *mddev)
 	for (;;) {
 		char b[BDEVNAME_SIZE];
 		spin_lock_irqsave(&conf->device_lock, flags);
+
+		if (conf->pending_bio_list.head) {
+			bio = bio_list_get(&conf->pending_bio_list);
+			blk_remove_plug(mddev->queue);
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+			if (bitmap_unplug(mddev->bitmap) != 0)
+				printk("%s: bitmap file write failed!\n", mdname(mddev));
+
+			while (bio) { /* submit pending writes */
+				struct bio *next = bio->bi_next;
+				bio->bi_next = NULL;
+				generic_make_request(bio);
+				bio = next;
+			}
+			unplug = 1;
+
+			continue;
+		}
+
 		if (list_empty(head))
 			break;
 		r1_bio = list_entry(head->prev, r1bio_t, retry_list);
@@ -1020,17 +1103,43 @@ static sector_t sync_request(mddev_t *md
 	int disk;
 	int i;
 	int write_targets = 0;
+	int sync_blocks;
 
-	if (!conf->r1buf_pool)
+	if (!conf->r1buf_pool) 
+	{
+/*
+		printk("sync start - bitmap %p\n", mddev->bitmap);
+*/
 		if (init_resync(conf))
 			return 0;
+	}
 
 	max_sector = mddev->size << 1;
 	if (sector_nr >= max_sector) {
+		/* If we aborted, we need to abort the
+		 * sync on the 'current' bitmap chunk (there will
+		 * only be one in raid1 resync).
+		 * We can find the current address in mddev->curr_resync
+		 */
+		if (!conf->fullsync) {
+			if (mddev->curr_resync < max_sector)
+				bitmap_end_sync(mddev->bitmap,
+						mddev->curr_resync,
+						&sync_blocks, 1);
+			bitmap_close_sync(mddev->bitmap);
+		}
+		if (mddev->curr_resync >= max_sector)
+			conf->fullsync = 0;
 		close_sync(conf);
 		return 0;
 	}
 
+	if (!conf->fullsync &&
+	    !bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks)) {
+		/* We can skip this block, and probably several more */
+		*skipped = 1;
+		return sync_blocks;
+	}
 	/*
 	 * If there is non-resync activity waiting for us then
 	 * put in a delay to throttle resync.
@@ -1069,6 +1178,7 @@ static sector_t sync_request(mddev_t *md
 
 	r1_bio->mddev = mddev;
 	r1_bio->sector = sector_nr;
+	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
 	r1_bio->read_disk = disk;
 
@@ -1103,6 +1213,11 @@ static sector_t sync_request(mddev_t *md
 		bio->bi_bdev = conf->mirrors[i].rdev->bdev;
 		bio->bi_private = r1_bio;
 	}
+
+	if (write_targets + 1 < conf->raid_disks)
+		/* array degraded, can't clear bitmap */
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+
 	if (write_targets == 0) {
 		/* There is nowhere to write, so all non-sync
 		 * drives must be failed - so we are finished
@@ -1122,6 +1237,14 @@ static sector_t sync_request(mddev_t *md
 			len = (max_sector - sector_nr) << 9;
 		if (len == 0)
 			break;
+		if (!conf->fullsync && sync_blocks == 0)
+			if (!bitmap_start_sync(mddev->bitmap, 
+					       sector_nr, &sync_blocks)) 
+				break;
+		if (sync_blocks < (PAGE_SIZE>>9))
+			BUG();
+		if (len > (sync_blocks<<9)) len = sync_blocks<<9;
+
 		for (i=0 ; i < conf->raid_disks; i++) {
 			bio = r1_bio->bios[i];
 			if (bio->bi_end_io) {
@@ -1144,6 +1267,7 @@ static sector_t sync_request(mddev_t *md
 		}
 		nr_sectors += len>>9;
 		sector_nr += len>>9;
+		sync_blocks -= (len>>9);
 	} while (r1_bio->bios[disk]->bi_vcnt < RESYNC_PAGES);
  bio_full:
 	bio = r1_bio->bios[disk];
@@ -1236,6 +1360,9 @@ static int run(mddev_t *mddev)
 	init_waitqueue_head(&conf->wait_idle);
 	init_waitqueue_head(&conf->wait_resume);
 
+	bio_list_init(&conf->pending_bio_list);
+	bio_list_init(&conf->flushing_bio_list);
+
 	if (!conf->working_disks) {
 		printk(KERN_ERR "raid1: no operational mirrors for %s\n",
 			mdname(mddev));
@@ -1264,16 +1391,15 @@ static int run(mddev_t *mddev)
 	conf->last_used = j;
 
 
-
-	{
-		mddev->thread = md_register_thread(raid1d, mddev, "%s_raid1");
-		if (!mddev->thread) {
-			printk(KERN_ERR 
-				"raid1: couldn't allocate thread for %s\n", 
-				mdname(mddev));
-			goto out_free_conf;
-		}
+	mddev->thread = md_register_thread(raid1d, mddev, "%s_raid1");
+	if (!mddev->thread) {
+		printk(KERN_ERR 
+		       "raid1: couldn't allocate thread for %s\n", 
+		       mdname(mddev));
+		goto out_free_conf;
 	}
+	if (mddev->bitmap) mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
+
 	printk(KERN_INFO 
 		"raid1: raid set %s active with %d out of %d mirrors\n",
 		mdname(mddev), mddev->raid_disks - mddev->degraded, 
@@ -1386,7 +1512,7 @@ static int raid1_reshape(mddev_t *mddev,
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-			    conf->resync_lock, unplug_slaves(mddev));
+			    conf->resync_lock, raid1_unplug(mddev->queue));
 	spin_unlock_irq(&conf->resync_lock);
 
 	/* ok, everything is stopped */

diff ./include/linux/raid/raid1.h~current~ ./include/linux/raid/raid1.h
--- ./include/linux/raid/raid1.h~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./include/linux/raid/raid1.h	2005-02-18 11:11:26.000000000 +1100
@@ -36,12 +36,21 @@ struct r1_private_data_s {
 	spinlock_t		device_lock;
 
 	struct list_head	retry_list;
+	/* queue pending writes and submit them on unplug */
+	struct bio_list		pending_bio_list;
+	/* queue of writes that have been unplugged */
+	struct bio_list		flushing_bio_list;
+
 	/* for use when syncing mirrors: */
 
 	spinlock_t		resync_lock;
-	int nr_pending;
-	int barrier;
+	int			nr_pending;
+	int			barrier;
 	sector_t		next_resync;
+	int			fullsync;  /* set to 1 if a full sync is needed,
+					    * (fresh device added).
+					    * Cleared when a sync completes.
+					    */
 
 	wait_queue_head_t	wait_idle;
 	wait_queue_head_t	wait_resume;
@@ -85,14 +94,17 @@ struct r1bio_s {
 	int			read_disk;
 
 	struct list_head	retry_list;
+	struct bitmap_update	*bitmap_update;
 	/*
 	 * if the IO is in WRITE direction, then multiple bios are used.
 	 * We choose the number when they are allocated.
 	 */
 	struct bio		*bios[0];
+	/* DO NOT PUT ANY NEW FIELDS HERE - bios array is contiguously alloced*/
 };
 
 /* bits for r1bio.state */
 #define	R1BIO_Uptodate	0
 #define	R1BIO_IsSync	1
+#define	R1BIO_Degraded	2
 #endif
-


* [PATCH md 3 of 9] Remove kludgy level check from md.c
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (4 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 6 of 9] Improve the interface to sync_request NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 2 of 9] Make raid5 and raid6 robust against failure during recovery NeilBrown
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


This test is overly specific, and misses raid10.
Assume all levels >= 1 might need reconstruction instead.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:11.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:17.000000000 +1100
@@ -1478,9 +1478,8 @@ static int analyze_sbs(mddev_t * mddev)
 
 
 
-	if ((mddev->recovery_cp != MaxSector) &&
-	    ((mddev->level == 1) ||
-	     ((mddev->level >= 4) && (mddev->level <= 6))))
+	if (mddev->recovery_cp != MaxSector &&
+	    mddev->level >= 1)
 		printk(KERN_ERR "md: %s: raid array is not clean"
 		       " -- starting background reconstruction\n",
 		       mdname(mddev));


* [PATCH md 2 of 9] Make raid5 and raid6 robust against failure during recovery.
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (5 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 3 of 9] Remove kludgy level check from md.c NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 4 of 9] Merge md_enter_safemode into md_check_recovery NeilBrown
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


Two problems are fixed here.
1/ if the array is known to require a resync (parity update),
  but there are too many failed devices,  the resync cannot complete
  but will be retried indefinitely.
2/ if the array has too many failed drives to be usable and a spare is
  available, reconstruction will be attempted, but cannot work.  This
  also is retried indefinitely.
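
Both fixes use the same idiom, sketched here with 'max_degraded'
standing for the per-level constant (1 for raid5, 2 for raid6):

	if (mddev->degraded >= max_degraded &&
	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
		/* too many failures for a resync to mean anything:
		 * claim the remaining range is done, so the thread
		 * completes instead of retrying indefinitely */
		int rv = (mddev->size << 1) - sector_nr;
		md_done_sync(mddev, rv, 1);
		return rv;
	}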


Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c        |   12 ++++++------
 ./drivers/md/raid5.c     |   13 +++++++++++++
 ./drivers/md/raid6main.c |   12 ++++++++++++
 3 files changed, 31 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:11.000000000 +1100
@@ -3655,18 +3655,18 @@ void md_check_recovery(mddev_t *mddev)
 
 		/* no recovery is running.
 		 * remove any failed drives, then
-		 * add spares if possible
+		 * add spares if possible.
+		 * Spares are also removed and re-added, to allow
+		 * the personality to fail the re-add.
 		 */
-		ITERATE_RDEV(mddev,rdev,rtmp) {
+		ITERATE_RDEV(mddev,rdev,rtmp)
 			if (rdev->raid_disk >= 0 &&
-			    rdev->faulty &&
+			    (rdev->faulty || ! rdev->in_sync) &&
 			    atomic_read(&rdev->nr_pending)==0) {
 				if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
 					rdev->raid_disk = -1;
 			}
-			if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
-				spares++;
-		}
+
 		if (mddev->degraded) {
 			ITERATE_RDEV(mddev,rdev,rtmp)
 				if (rdev->raid_disk < 0

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/raid5.c	2005-02-18 11:11:11.000000000 +1100
@@ -1491,6 +1491,15 @@ static int sync_request (mddev_t *mddev,
 		unplug_slaves(mddev);
 		return 0;
 	}
+	/* if there is 1 or more failed drives and we are trying
+	 * to resync, then assert that we are finished, because there is
+	 * nothing we can do.
+	 */
+	if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1882,6 +1891,10 @@ static int raid5_add_disk(mddev_t *mddev
 	int disk;
 	struct disk_info *p;
 
+	if (mddev->degraded > 1)
+		/* no point adding a device */
+		return 0;
+
 	/*
 	 * find the disk ...
 	 */

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-02-18 11:11:11.000000000 +1100
@@ -1650,6 +1650,15 @@ static int sync_request (mddev_t *mddev,
 		unplug_slaves(mddev);
 		return 0;
 	}
+	/* if there are 2 or more failed drives and we are trying
+	 * to resync, then assert that we are finished, because there is
+	 * nothing we can do.
+	 */
+	if (mddev->degraded >= 2 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -2048,6 +2057,9 @@ static int raid6_add_disk(mddev_t *mddev
 	int disk;
 	struct disk_info *p;
 
+	if (mddev->degraded > 2)
+		/* no point adding a device */
+		return 0;
 	/*
 	 * find the disk ...
 	 */


* [PATCH md 4 of 9] Merge md_enter_safemode into md_check_recovery
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (6 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 2 of 9] Make raid5 and raid6 robust against failure during recovery NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive NeilBrown
  2005-02-18 14:25 ` [PATCH md 0 of 9] Introduction Phantazm
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


md_enter_safemode checks whether it is time to mark the md
superblock as 'clean', i.e. whether all writes have completed
and a suitable delay has passed.

This is currently called from md_handle_safemode, which in turn
is called (almost) every time md_check_recovery is called,
and from the end of md_do_sync, which causes the mddev->thread
to run, which will always call md_check_recovery as well.

So it doesn't need to be a separate function and fits quite
well into md_check_recovery.

The "almost" is because multipathd calls md_check_recovery but
not md_handle_safemode.  This is OK because the code from 
md_enter_safemode is a no-op if mddev->safemode == 0, which it 
always is for a multipathd (providing we don't allow it to be set to 
2 on a signal...)
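
For reference, the safemode values as the merged code below treats
them (the glosses are my reading of the checks, not comments from the
patch):

	/* mddev->safemode:
	 *  0 - normal operation; the clean-marking code is a no-op
	 *  1 - mark the array clean once writes_pending reaches zero,
	 *      then drop back to 0
	 *  2 - immediate safe mode, set when a signal is pending
	 */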


Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c        |   58 +++++++++++++++++++----------------------------
 ./drivers/md/raid1.c     |    1 
 ./drivers/md/raid10.c    |    1 
 ./drivers/md/raid5.c     |    1 
 ./drivers/md/raid6main.c |    1 
 5 files changed, 24 insertions(+), 38 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:17.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:25.000000000 +1100
@@ -3323,37 +3323,6 @@ void md_write_end(mddev_t *mddev)
 	}
 }
 
-static inline void md_enter_safemode(mddev_t *mddev)
-{
-	if (!mddev->safemode) return;
-	if (mddev->safemode == 2 &&
-	    (atomic_read(&mddev->writes_pending) || mddev->in_sync ||
-		    mddev->recovery_cp != MaxSector))
-		return; /* avoid the lock */
-	mddev_lock_uninterruptible(mddev);
-	if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
-	    !mddev->in_sync && mddev->recovery_cp == MaxSector) {
-		mddev->in_sync = 1;
-		md_update_sb(mddev);
-	}
-	mddev_unlock(mddev);
-
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
-}
-
-void md_handle_safemode(mddev_t *mddev)
-{
-	if (signal_pending(current)) {
-		printk(KERN_INFO "md: %s in immediate safe mode\n",
-			mdname(mddev));
-		mddev->safemode = 2;
-		flush_signals(current);
-	}
-	md_enter_safemode(mddev);
-}
-
-
 DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 
 #define SYNC_MARKS	10
@@ -3574,7 +3543,6 @@ static void md_do_sync(mddev_t *mddev)
 			mddev->recovery_cp = MaxSector;
 	}
 
-	md_enter_safemode(mddev);
  skip:
 	mddev->curr_resync = 0;
 	wake_up(&resync_wait);
@@ -3615,14 +3583,37 @@ void md_check_recovery(mddev_t *mddev)
 
 	if (mddev->ro)
 		return;
+
+	if (signal_pending(current)) {
+		if (mddev->pers->sync_request) {
+			printk(KERN_INFO "md: %s in immediate safe mode\n",
+			       mdname(mddev));
+			mddev->safemode = 2;
+		}
+		flush_signals(current);
+	}
+
 	if ( ! (
 		mddev->sb_dirty ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
-		test_bit(MD_RECOVERY_DONE, &mddev->recovery)
+		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		(mddev->safemode == 1) ||
+		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
 		))
 		return;
+
 	if (mddev_trylock(mddev)==0) {
 		int spares =0;
+
+		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
+		    !mddev->in_sync && mddev->recovery_cp == MaxSector) {
+			mddev->in_sync = 1;
+			mddev->sb_dirty = 1;
+		}
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
+
 		if (mddev->sb_dirty)
 			md_update_sb(mddev);
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
@@ -3869,7 +3860,6 @@ EXPORT_SYMBOL(md_error);
 EXPORT_SYMBOL(md_done_sync);
 EXPORT_SYMBOL(md_write_start);
 EXPORT_SYMBOL(md_write_end);
-EXPORT_SYMBOL(md_handle_safemode);
 EXPORT_SYMBOL(md_register_thread);
 EXPORT_SYMBOL(md_unregister_thread);
 EXPORT_SYMBOL(md_wakeup_thread);

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-02-18 11:11:11.000000000 +1100
+++ ./drivers/md/raid1.c	2005-02-18 11:11:25.000000000 +1100
@@ -931,7 +931,6 @@ static void raid1d(mddev_t *mddev)
 	mdk_rdev_t *rdev;
 
 	md_check_recovery(mddev);
-	md_handle_safemode(mddev);
 	
 	for (;;) {
 		char b[BDEVNAME_SIZE];

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/raid10.c	2005-02-18 11:11:25.000000000 +1100
@@ -1216,7 +1216,6 @@ static void raid10d(mddev_t *mddev)
 	mdk_rdev_t *rdev;
 
 	md_check_recovery(mddev);
-	md_handle_safemode(mddev);
 
 	for (;;) {
 		char b[BDEVNAME_SIZE];

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-02-18 11:11:11.000000000 +1100
+++ ./drivers/md/raid5.c	2005-02-18 11:11:25.000000000 +1100
@@ -1544,7 +1544,6 @@ static void raid5d (mddev_t *mddev)
 	PRINTK("+++ raid5d active\n");
 
 	md_check_recovery(mddev);
-	md_handle_safemode(mddev);
 
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-02-18 11:11:11.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-02-18 11:11:25.000000000 +1100
@@ -1703,7 +1703,6 @@ static void raid6d (mddev_t *mddev)
 	PRINTK("+++ raid6d active\n");
 
 	md_check_recovery(mddev);
-	md_handle_safemode(mddev);
 
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH md 7 of 9] Optimised resync using Bitmap based intent logging
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
  2005-02-18  0:40 ` [PATCH md 1 of 9] Remove possible oops in md/raid1 NeilBrown
@ 2005-02-18  0:40 ` NeilBrown
  2005-02-18  0:40 ` [PATCH md 8 of 9] raid1 support for bitmap " NeilBrown
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2005-02-18  0:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


With this patch, the intent to write to some block in the array can
be logged to a bitmap file.  Each bit represents some number of sectors;
it is set before any update happens, and only cleared once all writes
touching those sectors have completed.
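
In outline, a personality brackets each array write with the new calls,
roughly like this (a sketch of the intended usage only -- the real hooks
arrive with a later patch, and 'uptodate' stands for whatever success
flag the personality tracks):

        /* before issuing the write to the underlying devices */
        bitmap_startwrite(mddev->bitmap, bio->bi_sector, bio_sectors(bio));

        /* ... md calls bitmap_unplug() when unplugging the queues, so
         * the dirty bitmap pages reach disk before the data writes ... */

        /* after the last write for this request completes; on failure
         * the chunk stays marked as needing resync */
        bitmap_endwrite(mddev->bitmap, bio->bi_sector, bio_sectors(bio),
                        uptodate);

All of these are safe no-ops when mddev->bitmap is NULL.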

After an unclean shutdown, information in this bitmap can be used
to optimise resync - only sectors which could be out-of-sync need to
be updated.
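
On the resync side, a personality's sync_request can ask the bitmap
whether a chunk needs attention and skip it otherwise, along these
lines (again a sketch only; sector_nr is the resync position the
personality is working through):

        int blocks;

        if (!bitmap_start_sync(mddev->bitmap, sector_nr, &blocks))
                /* chunk is clean -- advance past it */
                return blocks;

        /* ... resync the chunk as usual ... */

        bitmap_end_sync(mddev->bitmap, sector_nr, &blocks, aborted);

With no bitmap, bitmap_start_sync always answers 1, so a full resync
still happens; bitmap_close_sync() then clears any remaining RESYNC
bits once the whole pass completes.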

Also if a drive is removed and then added back into an array, the
recovery can make use of the bitmap to optimise reconstruction.  This
is not implemented in this patch.

Currently the bitmap is stored in a file which must (obviously)
be stored on a separate device.
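
For scale: the file holds a 256-byte superblock followed by one bit per
chunk, rounded up to whole bytes.  A 100GB array with a 1MB bitmap
chunk, for example, needs about 102,400 bits -- roughly 12.5KB of
bitmap plus the superblock.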

The patch only provides infrastructure.  It does not update any
personalities to use bitmap intent logging.

Md arrays can still be used with no bitmap file.  This patch has minimal
impact on such arrays.

This is still a work-in-progress.  Don't trust data to it yet.


Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/Makefile         |    3 
 ./drivers/md/bitmap.c         | 1517 ++++++++++++++++++++++++++++++++++++++++++
 ./drivers/md/md.c             |  172 ++++
 ./include/linux/raid/bitmap.h |  280 +++++++
 ./include/linux/raid/md_k.h   |    4 
 ./include/linux/raid/md_u.h   |    7 
 6 files changed, 1968 insertions(+), 15 deletions(-)

diff ./drivers/md/Makefile~current~ ./drivers/md/Makefile
--- ./drivers/md/Makefile~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/Makefile	2005-02-18 11:11:26.000000000 +1100
@@ -6,6 +6,7 @@ dm-mod-objs	:= dm.o dm-table.o dm-target
 		   dm-ioctl.o dm-io.o kcopyd.o
 dm-snapshot-objs := dm-snap.o dm-exception-store.o
 dm-mirror-objs	:= dm-log.o dm-raid1.o
+md-mod-objs     := md.o bitmap.o
 raid6-objs	:= raid6main.o raid6algos.o raid6recov.o raid6tables.o \
 		   raid6int1.o raid6int2.o raid6int4.o \
 		   raid6int8.o raid6int16.o raid6int32.o \
@@ -27,7 +28,7 @@ obj-$(CONFIG_MD_RAID5)		+= raid5.o xor.o
 obj-$(CONFIG_MD_RAID6)		+= raid6.o xor.o
 obj-$(CONFIG_MD_MULTIPATH)	+= multipath.o
 obj-$(CONFIG_MD_FAULTY)		+= faulty.o
-obj-$(CONFIG_BLK_DEV_MD)	+= md.o
+obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_SNAPSHOT)	+= dm-snapshot.o

diff ./drivers/md/bitmap.c~current~ ./drivers/md/bitmap.c
--- ./drivers/md/bitmap.c~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./drivers/md/bitmap.c	2005-02-18 11:11:26.000000000 +1100
@@ -0,0 +1,1517 @@
+/*
+ * bitmap.c two-level bitmap (C) Peter T. Breuer (ptb@ot.uc3m.es) 2003
+ *
+ * bitmap_create  - sets up the bitmap structure
+ * bitmap_destroy - destroys the bitmap structure
+ *
+ * additions, Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, Inc.:
+ * - added disk storage for bitmap
+ * - changes to allow various bitmap chunk sizes
+ * - added bitmap daemon (to asynchronously clear bitmap bits from disk)
+ */
+
+/*
+ * Still to do:
+ *
+ * flush after percent set rather than just time based. (maybe both).
+ * wait if count gets too high, wake when it drops to half.
+ * allow bitmap to be mirrored with superblock (before or after...)
+ * allow hot-add to reinstate a current device.
+ * allow hot-add of bitmap after quiescing device
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/config.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/buffer_head.h>
+#include <linux/raid/md.h>
+#include <linux/raid/bitmap.h>
+
+/* debug macros */
+
+#define DEBUG 0
+
+#if DEBUG
+/* these are for debugging purposes only! */
+
+/* define one and only one of these */
+#define INJECT_FAULTS_1 0 /* cause bitmap_alloc_page to fail always */
+#define INJECT_FAULTS_2 0 /* cause bitmap file to be kicked when first bit set*/
+#define INJECT_FAULTS_3 0 /* treat bitmap file as kicked at init time */
+#define INJECT_FAULTS_4 0 /* undef */
+#define INJECT_FAULTS_5 0 /* undef */
+#define INJECT_FAULTS_6 0
+
+/* if these are defined, the driver will fail! debug only */
+#define INJECT_FATAL_FAULT_1 0 /* fail kmalloc, causing bitmap_create to fail */
+#define INJECT_FATAL_FAULT_2 0 /* undef */
+#define INJECT_FATAL_FAULT_3 0 /* undef */
+#endif
+
+//#define DPRINTK PRINTK /* set this NULL to avoid verbose debug output */
+#define DPRINTK(x...) do { } while(0)
+
+#ifndef PRINTK
+#  if DEBUG > 0 
+#    define PRINTK(x...) printk(KERN_DEBUG x)
+#  else
+#    define PRINTK(x...)
+#  endif
+#endif
+
+static inline char * bmname(struct bitmap *bitmap)
+{
+	return bitmap->mddev ? mdname(bitmap->mddev) : "mdX";
+}
+
+
+/* 
+ * test if the bitmap is active
+ */
+int bitmap_active(struct bitmap *bitmap)
+{
+	unsigned long flags;
+	int res = 0;
+
+	if (!bitmap)
+		return res;
+	spin_lock_irqsave(&bitmap->lock, flags);
+	res = bitmap->flags & BITMAP_ACTIVE;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	return res;
+}
+
+#define WRITE_POOL_SIZE 256
+/* mempool for queueing pending writes on the bitmap file */
+static void *write_pool_alloc(int gfp_flags, void *data)
+{
+	return kmalloc(sizeof(struct page_list), gfp_flags);
+}
+
+static void write_pool_free(void *ptr, void *data)
+{
+	kfree(ptr);
+}
+
+/*
+ * just a placeholder - calls kmalloc for bitmap pages
+ */
+static unsigned char *bitmap_alloc_page(struct bitmap *bitmap)
+{
+	unsigned char *page;
+
+#if INJECT_FAULTS_1
+	page = NULL;
+#else
+	page = kmalloc(PAGE_SIZE, GFP_NOIO);
+#endif
+	if (!page)
+		printk("%s: bitmap_alloc_page FAILED\n", bmname(bitmap));
+	else
+		PRINTK("%s: bitmap_alloc_page: allocated page at %p\n",
+			bmname(bitmap), page);
+	return page;
+}
+
+/*
+ * for now just a placeholder -- just calls kfree for bitmap pages
+ */
+static void bitmap_free_page(struct bitmap *bitmap, unsigned char *page)
+{
+	PRINTK("%s: bitmap_free_page: free page %p\n", bmname(bitmap), page);
+	kfree(page);
+}
+
+/* 
+ * check a page and, if necessary, allocate it (or hijack it if the alloc fails)
+ *
+ * 1) check to see if this page is allocated, if it's not then try to alloc
+ * 2) if the alloc fails, set the page's hijacked flag so we'll use the 
+ *    page pointer directly as a counter
+ *
+ * if we find our page, we increment the page's refcount so that it stays
+ * allocated while we're using it
+ */
+static int bitmap_checkpage(struct bitmap *bitmap, unsigned long page, int create)
+{
+	unsigned char *mappage;
+
+	if (page >= bitmap->pages) {
+		printk(KERN_ALERT 
+			"%s: invalid bitmap page request: %lu (> %lu)\n",
+			bmname(bitmap), page, bitmap->pages-1);
+		return -EINVAL;
+	}
+
+
+	if (bitmap->bp[page].hijacked) /* it's hijacked, don't try to alloc */
+		return 0;
+
+	if (bitmap->bp[page].map) /* page is already allocated, just return */
+		return 0;
+
+	if (!create)
+		return -ENOENT;
+
+	spin_unlock_irq(&bitmap->lock);
+
+	/* this page has not been allocated yet */
+
+	if ((mappage = bitmap_alloc_page(bitmap)) == NULL) {
+		PRINTK("%s: bitmap map page allocation failed, hijacking\n", 
+			bmname(bitmap));
+		/* failed - set the hijacked flag so that we can use the
+		 * pointer as a counter */
+		spin_lock_irq(&bitmap->lock);
+		if (!bitmap->bp[page].map)
+			bitmap->bp[page].hijacked = 1;
+		goto out;
+	}
+
+	/* got a page */
+
+	spin_lock_irq(&bitmap->lock);
+
+	/* recheck the page */
+
+	if (bitmap->bp[page].map || bitmap->bp[page].hijacked) {
+		/* somebody beat us to getting the page */
+		bitmap_free_page(bitmap, mappage);
+		return 0;
+	}
+
+	/* no page was in place and we have one, so install it */
+
+	memset(mappage, 0, PAGE_SIZE);
+	bitmap->bp[page].map = mappage;
+	bitmap->missing_pages--;
+out:
+	return 0;
+}
+
+
+/* if page is completely empty, put it back on the free list, or dealloc it */
+/* if page was hijacked, unmark the flag so it might get alloced next time */
+/* Note: lock should be held when calling this */
+static inline void bitmap_checkfree(struct bitmap *bitmap, unsigned long page)
+{
+	char *ptr;
+
+	if (bitmap->bp[page].count) /* page is still busy */
+		return;
+
+	/* page is no longer in use, it can be released */
+
+	if (bitmap->bp[page].hijacked) { /* page was hijacked, undo this now */
+		bitmap->bp[page].hijacked = 0;
+		bitmap->bp[page].map = NULL;
+		return;
+	}
+
+	/* normal case, free the page */
+
+#if 0
+/* actually ... let's not.  We will probably need the page again exactly when
+ * memory is tight and we are flushing to disk
+ */
+	return;
+#else
+	ptr = bitmap->bp[page].map;
+	bitmap->bp[page].map = NULL;
+	bitmap->missing_pages++;
+	bitmap_free_page(bitmap, ptr);
+	return;
+#endif
+}
+
+
+/*
+ * bitmap file handling - read and write the bitmap file and its superblock
+ */
+
+/* copy the pathname of a file to a buffer */
+char *file_path(struct file *file, char *buf, int count)
+{
+	struct dentry *d;
+	struct vfsmount *v;
+
+	if (!buf)
+		return NULL;
+
+	d = file->f_dentry;
+	v = file->f_vfsmnt;
+
+	buf = d_path(d, v, buf, count);
+
+	return IS_ERR(buf) ? NULL : buf;
+}
+
+/*
+ * basic page I/O operations
+ */
+
+/*
+ * write out a page
+ */
+static int write_page(struct page *page, int wait)
+{
+	int ret = -ENOMEM;
+
+	lock_page(page);
+
+	if (page->mapping == NULL)
+		goto unlock_out;
+	else if (i_size_read(page->mapping->host) < page->index << PAGE_SHIFT) {
+		ret = -ENOENT;
+		goto unlock_out;
+	}
+		
+	ret = page->mapping->a_ops->prepare_write(NULL, page, 0, PAGE_SIZE);
+	if (!ret)
+		ret = page->mapping->a_ops->commit_write(NULL, page, 0,
+			PAGE_SIZE);
+	if (ret) {
+unlock_out:
+		unlock_page(page);
+		return ret;
+	}
+
+	set_page_dirty(page); /* force it to be written out */
+	return write_one_page(page, wait);
+}
+
+/* read a page from a file, pinning it into cache, and return bytes_read */
+static struct page *read_page(struct file *file, unsigned long index,
+					unsigned long *bytes_read)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct page *page = NULL;
+	loff_t isize = i_size_read(inode);
+	unsigned long end_index = isize >> PAGE_CACHE_SHIFT;
+
+	PRINTK("read bitmap file (%dB @ %Lu)\n", (int)PAGE_CACHE_SIZE,
+			(unsigned long long)index << PAGE_CACHE_SHIFT);
+
+	page = read_cache_page(inode->i_mapping, index,
+			(filler_t *)inode->i_mapping->a_ops->readpage, file);
+	if (IS_ERR(page))
+		goto out;
+	wait_on_page_locked(page);
+	if (!PageUptodate(page) || PageError(page)) {
+		page_cache_release(page);
+		page = ERR_PTR(-EIO);
+		goto out;
+	}
+
+	if (index > end_index) /* we have read beyond EOF */
+		*bytes_read = 0;
+	else if (index == end_index) /* possible short read */
+		*bytes_read = isize & ~PAGE_CACHE_MASK;
+	else
+		*bytes_read = PAGE_CACHE_SIZE; /* got a full page */
+out:
+	if (IS_ERR(page))
+		printk(KERN_ALERT "md: bitmap read error: (%dB @ %Lu): %ld\n",
+			(int)PAGE_CACHE_SIZE,
+			(unsigned long long)index << PAGE_CACHE_SHIFT,
+			PTR_ERR(page));
+	return page;
+}
+
+/*
+ * bitmap file superblock operations
+ */
+
+/* update the event counter and sync the superblock to disk */
+int bitmap_update_sb(struct bitmap *bitmap)
+{
+	bitmap_super_t *sb;
+	unsigned long flags;
+
+	if (!bitmap || !bitmap->mddev) /* no bitmap for this array */
+		return 0;
+	spin_lock_irqsave(&bitmap->lock, flags);
+	if (!bitmap->sb_page) { /* no superblock */
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		return 0;
+	}
+	page_cache_get(bitmap->sb_page);
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	sb = (bitmap_super_t *)kmap(bitmap->sb_page);
+	sb->events = cpu_to_le64(bitmap->mddev->events);
+	if (!bitmap->mddev->degraded)
+		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+	kunmap(bitmap->sb_page);
+	write_page(bitmap->sb_page, 0);
+	return 0;
+}
+
+/* print out the bitmap file superblock */
+void bitmap_print_sb(struct bitmap *bitmap)
+{
+	bitmap_super_t *sb;
+
+	if (!bitmap || !bitmap->sb_page)
+		return;
+	sb = (bitmap_super_t *)kmap(bitmap->sb_page);
+	printk(KERN_DEBUG "%s: bitmap file superblock:\n", bmname(bitmap));
+	printk(KERN_DEBUG "       magic: %08x\n", le32_to_cpu(sb->magic));
+	printk(KERN_DEBUG "     version: %d\n", le32_to_cpu(sb->version));
+	printk(KERN_DEBUG "        uuid: %08x.%08x.%08x.%08x\n",
+					*(__u32 *)(sb->uuid+0),
+					*(__u32 *)(sb->uuid+4),
+					*(__u32 *)(sb->uuid+8),
+					*(__u32 *)(sb->uuid+12));
+	printk(KERN_DEBUG "      events: %llu\n",
+			(unsigned long long) le64_to_cpu(sb->events));
+	printk(KERN_DEBUG "events_clred: %llu\n",
+			(unsigned long long) le64_to_cpu(sb->events_cleared));
+	printk(KERN_DEBUG "       state: %08x\n", le32_to_cpu(sb->state));
+	printk(KERN_DEBUG "   chunksize: %d B\n", le32_to_cpu(sb->chunksize));
+	printk(KERN_DEBUG "daemon sleep: %ds\n", le32_to_cpu(sb->daemon_sleep));
+	printk(KERN_DEBUG "   sync size: %llu KB\n", le64_to_cpu(sb->sync_size));
+	kunmap(bitmap->sb_page);
+}
+
+/* read the superblock from the bitmap file and initialize some bitmap fields */
+static int bitmap_read_sb(struct bitmap *bitmap)
+{
+	char *reason = NULL;
+	bitmap_super_t *sb;
+	unsigned long chunksize, daemon_sleep;
+	unsigned long bytes_read;
+	unsigned long long events;
+	int err = -EINVAL;
+
+	/* page 0 is the superblock, read it... */
+	bitmap->sb_page = read_page(bitmap->file, 0, &bytes_read);
+	if (IS_ERR(bitmap->sb_page)) {
+		err = PTR_ERR(bitmap->sb_page);
+		bitmap->sb_page = NULL;
+		return err;
+	}
+
+	sb = (bitmap_super_t *)kmap(bitmap->sb_page);
+
+	if (bytes_read < sizeof(*sb)) { /* short read */
+		printk(KERN_INFO "%s: bitmap file superblock truncated\n",
+			bmname(bitmap));
+		err = -ENOSPC;
+		goto out;
+	}
+
+	chunksize = le32_to_cpu(sb->chunksize);
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+
+	/* verify that the bitmap-specific fields are valid */
+	if (sb->magic != cpu_to_le32(BITMAP_MAGIC))
+		reason = "bad magic";
+	else if (sb->version != cpu_to_le32(BITMAP_MAJOR))
+		reason = "unrecognized superblock version";
+	else if (chunksize < 512 || chunksize > (1024 * 1024 * 4))
+		reason = "bitmap chunksize out of range (512B - 4MB)";
+	else if ((1 << ffz(~chunksize)) != chunksize)
+		reason = "bitmap chunksize not a power of 2";
+	else if (daemon_sleep < 1 || daemon_sleep > 15)
+		reason = "daemon sleep period out of range";
+	if (reason) {
+		printk(KERN_INFO "%s: invalid bitmap file superblock: %s\n",
+			bmname(bitmap), reason);
+		goto out;
+	}
+
+	/* keep the array size field of the bitmap superblock up to date */
+	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
+
+	if (!bitmap->mddev->persistent)
+		goto success;
+
+	/*
+	 * if we have a persistent array superblock, compare the
+	 * bitmap's UUID and event counter to the mddev's
+	 */
+	if (memcmp(sb->uuid, bitmap->mddev->uuid, 16)) {
+		printk(KERN_INFO "%s: bitmap superblock UUID mismatch\n",
+			bmname(bitmap));
+		goto out;
+	}
+	events = le64_to_cpu(sb->events);
+	if (events < bitmap->mddev->events) {
+		printk(KERN_INFO "%s: bitmap file is out of date (%llu < %llu) "
+			"-- forcing full recovery\n", bmname(bitmap), events,
+			(unsigned long long) bitmap->mddev->events);
+		sb->state |= BITMAP_STALE;
+	}
+success:
+	/* assign fields using values from superblock */
+	bitmap->chunksize = chunksize;
+	bitmap->daemon_sleep = daemon_sleep;
+	bitmap->flags |= sb->state;
+	bitmap->events_cleared = le64_to_cpu(sb->events_cleared);
+	err = 0;
+out:
+	kunmap(bitmap->sb_page);
+	if (err)
+		bitmap_print_sb(bitmap);
+	return err;
+}
+
+enum bitmap_mask_op {
+	MASK_SET,
+	MASK_UNSET
+};
+
+/* record the state of the bitmap in the superblock */
+static void bitmap_mask_state(struct bitmap *bitmap, enum bitmap_state bits,
+				enum bitmap_mask_op op)
+{
+	bitmap_super_t *sb;
+	unsigned long flags;
+
+	if (!bitmap) /* check before taking the lock */
+		return;
+	spin_lock_irqsave(&bitmap->lock, flags);
+	if (!bitmap->sb_page) { /* can't set the state */
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		return;
+	}
+	page_cache_get(bitmap->sb_page);
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	sb = (bitmap_super_t *)kmap(bitmap->sb_page);
+	switch (op) {
+		case MASK_SET: sb->state |= bits;
+				break;
+		case MASK_UNSET: sb->state &= ~bits;
+				break;
+		default: BUG();
+	}
+	kunmap(bitmap->sb_page);
+	page_cache_release(bitmap->sb_page);
+}
+
+/*
+ * general bitmap file operations
+ */
+
+/* calculate the index of the page that contains this bit */
+static inline unsigned long file_page_index(unsigned long chunk)
+{
+	return CHUNK_BIT_OFFSET(chunk) >> PAGE_BIT_SHIFT;
+}
+
+/* calculate the (bit) offset of this bit within a page */
+static inline unsigned long file_page_offset(unsigned long chunk)
+{
+	return CHUNK_BIT_OFFSET(chunk) & (PAGE_BITS - 1);
+}
+
+/*
+ * return a pointer to the page in the filemap that contains the given bit
+ *
+ * this lookup is complicated by the fact that the bitmap sb might be exactly
+ * 1 page (e.g., x86) or less than 1 page -- so the bitmap might start on page
+ * 0 or page 1
+ */
+static inline struct page *filemap_get_page(struct bitmap *bitmap,
+					unsigned long chunk)
+{
+	return bitmap->filemap[file_page_index(chunk) - file_page_index(0)];
+}
+
+
+static void bitmap_file_unmap(struct bitmap *bitmap)
+{
+	struct page **map, *sb_page;
+	unsigned long *attr;
+	int pages;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	map = bitmap->filemap;
+	bitmap->filemap = NULL;
+	attr = bitmap->filemap_attr;
+	bitmap->filemap_attr = NULL;
+	pages = bitmap->file_pages;
+	bitmap->file_pages = 0;
+	sb_page = bitmap->sb_page;
+	bitmap->sb_page = NULL;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+
+	while (pages--)
+		if (map[pages]->index != 0) /* 0 is sb_page, release it below */
+			page_cache_release(map[pages]);
+	kfree(map);
+	kfree(attr);
+
+	if (sb_page)
+		page_cache_release(sb_page);
+}
+
+static void bitmap_stop_daemons(struct bitmap *bitmap);
+
+/* dequeue the next item in a page list -- don't call from irq context */
+static struct page_list *dequeue_page(struct bitmap *bitmap,
+					struct list_head *head)
+{
+	struct page_list *item = NULL;
+
+	spin_lock(&bitmap->write_lock);
+	if (list_empty(head))
+		goto out;
+	item = list_entry(head->prev, struct page_list, list);
+	list_del(head->prev);
+out:
+	spin_unlock(&bitmap->write_lock);
+	return item;
+}
+
+static void drain_write_queues(struct bitmap *bitmap)
+{
+	struct list_head *queues[] = { 	&bitmap->complete_pages, NULL };
+	struct list_head *head;
+	struct page_list *item;
+	int i;
+
+	for (i = 0; queues[i]; i++) {
+		head = queues[i];
+		while ((item = dequeue_page(bitmap, head))) {
+			page_cache_release(item->page);
+			mempool_free(item, bitmap->write_pool);
+		}
+	}
+
+	spin_lock(&bitmap->write_lock);
+	bitmap->writes_pending = 0; /* make sure waiters continue */
+	wake_up(&bitmap->write_wait);
+	spin_unlock(&bitmap->write_lock);
+}
+
+static void bitmap_file_put(struct bitmap *bitmap)
+{
+	struct file *file;
+	struct inode *inode;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	file = bitmap->file;
+	bitmap->file = NULL;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+
+	bitmap_stop_daemons(bitmap);
+
+	drain_write_queues(bitmap);
+
+	bitmap_file_unmap(bitmap);
+
+	if (file) {
+		inode = file->f_mapping->host;
+		spin_lock(&inode->i_lock);
+		atomic_set(&inode->i_writecount, 1); /* allow writes again */
+		spin_unlock(&inode->i_lock);
+		fput(file);
+	}
+}
+
+
+/*
+ * bitmap_file_kick - if an error occurs while manipulating the bitmap file
+ * then it is no longer reliable, so we stop using it and we mark the file 
+ * as failed in the superblock
+ */
+static void bitmap_file_kick(struct bitmap *bitmap)
+{
+	char *path, *ptr = NULL;
+
+	bitmap_mask_state(bitmap, BITMAP_STALE, MASK_SET);
+	bitmap_update_sb(bitmap);
+
+	path = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (path)
+		ptr = file_path(bitmap->file, path, PAGE_SIZE);
+
+	printk(KERN_ALERT "%s: kicking failed bitmap file %s from array!\n",
+		bmname(bitmap), ptr ? ptr : "");
+
+	kfree(path);
+
+	bitmap_file_put(bitmap);
+
+	return;
+}
+
+enum bitmap_page_attr {
+	BITMAP_PAGE_DIRTY = 1, // there are set bits that need to be synced
+	BITMAP_PAGE_CLEAN = 2, // there are bits that might need to be cleared
+	BITMAP_PAGE_NEEDWRITE=4, // there are cleared bits that need to be synced
+};
+
+static inline void set_page_attr(struct bitmap *bitmap, struct page *page,
+				enum bitmap_page_attr attr)
+{
+	bitmap->filemap_attr[page->index] |= attr;
+}
+
+static inline void clear_page_attr(struct bitmap *bitmap, struct page *page,
+				enum bitmap_page_attr attr)
+{
+	bitmap->filemap_attr[page->index] &= ~attr;
+}
+
+static inline unsigned long get_page_attr(struct bitmap *bitmap, struct page *page)
+{
+	return bitmap->filemap_attr[page->index];
+}
+				
+/* 
+ * bitmap_file_set_bit -- called before performing a write to the md device
+ * to set (and eventually sync) a particular bit in the bitmap file
+ *
+ * we set the bit immediately, then we record the page number so that
+ * when an unplug occurs, we can flush the dirty pages out to disk
+ */
+static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
+{
+	unsigned long bit;
+	struct page *page;
+	void *kaddr;
+	unsigned long chunk = block >> CHUNK_BLOCK_SHIFT(bitmap);
+
+	if (!bitmap->file || !bitmap->filemap) {
+		return;
+	}
+
+	page = filemap_get_page(bitmap, chunk);
+	bit = file_page_offset(chunk);
+
+
+	/* make sure the page stays cached until it gets written out */
+	if (! (get_page_attr(bitmap, page) & BITMAP_PAGE_DIRTY))
+		page_cache_get(page);
+
+ 	/* set the bit */
+	kaddr = kmap_atomic(page, KM_USER0);
+	set_bit(bit, kaddr);
+	kunmap_atomic(kaddr, KM_USER0);
+	PRINTK("set file bit %lu page %lu\n", bit, page->index);
+
+	/* record page number so it gets flushed to disk when unplug occurs */
+	set_page_attr(bitmap, page, BITMAP_PAGE_DIRTY);
+
+}
+
+/* this gets called when the md device is ready to unplug its underlying
+ * (slave) device queues -- before we let any writes go down, we need to
+ * sync the dirty pages of the bitmap file to disk */
+int bitmap_unplug(struct bitmap *bitmap)
+{
+	unsigned long i, attr, flags;
+	struct page *page;
+	int wait = 0;
+
+	if (!bitmap)
+		return 0;
+
+	/* look at each page to see if there are any set bits that need to be
+	 * flushed out to disk */
+	for (i = 0; i < bitmap->file_pages; i++) {
+		spin_lock_irqsave(&bitmap->lock, flags);
+		if (!bitmap->file || !bitmap->filemap) {
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			return 0;
+		}
+		page = bitmap->filemap[i];
+		attr = get_page_attr(bitmap, page);
+		clear_page_attr(bitmap, page, BITMAP_PAGE_DIRTY);
+		clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
+		if ((attr & BITMAP_PAGE_DIRTY))
+			wait = 1;
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+
+		if (attr & (BITMAP_PAGE_DIRTY | BITMAP_PAGE_NEEDWRITE))
+			write_page(page, 0);
+	}
+	if (wait) { /* if any writes were performed, we need to wait on them */
+		spin_lock_irq(&bitmap->write_lock);
+		wait_event_lock_irq(bitmap->write_wait,
+			bitmap->writes_pending == 0, bitmap->write_lock,
+			wake_up_process(bitmap->writeback_daemon->tsk));
+		spin_unlock_irq(&bitmap->write_lock);
+	}
+	return 0;
+}
+
+static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset,
+	unsigned long sectors, int set);
+/* bitmap_init_from_disk -- called at bitmap_create time to initialize
+ * the in-memory bitmap from the on-disk bitmap -- also, sets up the
+ * memory mapping of the bitmap file
+ * Special cases:
+ *   if there's no bitmap file, or if the bitmap file has been
+ *   previously kicked from the array, we mark all the bits as
+ *   1's in order to cause a full resync.
+ */
+static int bitmap_init_from_disk(struct bitmap *bitmap)
+{
+	unsigned long i, chunks, index, oldindex, bit;
+	struct page *page = NULL, *oldpage = NULL;
+	unsigned long num_pages, bit_cnt = 0;
+	struct file *file;
+	unsigned long bytes, offset, dummy;
+	int outofdate;
+	int ret = -ENOSPC;
+
+	chunks = bitmap->chunks;
+	file = bitmap->file;
+
+	if (!file) { /* no file, dirty all the in-memory bits */
+		printk(KERN_INFO "%s: no bitmap file, doing full recovery\n",
+			bmname(bitmap));
+		bitmap_set_memory_bits(bitmap, 0,
+				       chunks << CHUNK_BLOCK_SHIFT(bitmap), 1);
+		return 0;
+	}
+
+#if INJECT_FAULTS_3
+	outofdate = 1;
+#else
+	outofdate = bitmap->flags & BITMAP_STALE;
+#endif
+	if (outofdate)
+		printk(KERN_INFO "%s: bitmap file is out of date, doing full "
+			"recovery\n", bmname(bitmap));
+
+	bytes = (chunks + 7) / 8;
+	num_pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+	if (i_size_read(file->f_mapping->host) < bytes + sizeof(bitmap_super_t)) {
+		printk(KERN_INFO "%s: bitmap file too short %lu < %lu\n",
+			bmname(bitmap),
+			(unsigned long) i_size_read(file->f_mapping->host),
+			bytes + sizeof(bitmap_super_t));
+		goto out;
+	}
+	num_pages++;
+	bitmap->filemap = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL);
+	if (!bitmap->filemap) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	bitmap->filemap_attr = kmalloc(sizeof(long) * num_pages, GFP_KERNEL);
+	if (!bitmap->filemap_attr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memset(bitmap->filemap_attr, 0, sizeof(long) * num_pages);
+
+	oldindex = ~0L;
+
+	for (i = 0; i < chunks; i++) {
+		index = file_page_index(i);
+		bit = file_page_offset(i);
+		if (index != oldindex) { /* this is a new page, read it in */
+			/* unmap the old page, we're done with it */
+			if (oldpage != NULL)
+				kunmap(oldpage);
+			if (index == 0) {
+				/*
+				 * if we're here then the superblock page
+				 * contains some bits (PAGE_SIZE != sizeof sb)
+				 * we've already read it in, so just use it
+				 */
+				page = bitmap->sb_page;
+				offset = sizeof(bitmap_super_t);
+			} else {
+				page = read_page(file, index, &dummy);
+				if (IS_ERR(page)) { /* read error */
+					ret = PTR_ERR(page);
+					goto out;
+				}
+				offset = 0;
+			}
+			oldindex = index;
+			oldpage = page;
+			kmap(page);
+
+			if (outofdate) {
+				/*
+				 * if bitmap is out of date, dirty the
+			 	 * whole page and write it out
+				 */
+				memset(page_address(page) + offset, 0xff,
+					PAGE_SIZE - offset);
+				ret = write_page(page, 1);
+				if (ret) {
+					kunmap(page);
+					/* release, page not in filemap yet */
+					page_cache_release(page);
+					goto out;
+				}
+			}
+
+			bitmap->filemap[bitmap->file_pages++] = page;
+		}
+		if (test_bit(bit, page_address(page))) {
+			/* if the disk bit is set, set the memory bit */
+			bitmap_set_memory_bits(bitmap,
+					i << CHUNK_BLOCK_SHIFT(bitmap), 1, 1);
+			bit_cnt++;
+		} 
+#if 0		
+		else 
+			bitmap_set_memory_bits(bitmap,
+				       i << CHUNK_BLOCK_SHIFT(bitmap), 1, 0);
+#endif
+	}
+
+ 	/* everything went OK */
+	ret = 0;
+	bitmap_mask_state(bitmap, BITMAP_STALE, MASK_UNSET);
+
+	if (page) /* unmap the last page */
+		kunmap(page);
+
+	if (bit_cnt) { /* Kick recovery if any bits were set */
+		set_bit(MD_RECOVERY_NEEDED, &bitmap->mddev->recovery);
+		md_wakeup_thread(bitmap->mddev->thread);
+	}
+
+out:
+	printk(KERN_INFO "%s: bitmap initialized from disk: "
+		"read %lu/%lu pages, set %lu bits, status: %d\n",
+		bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt, ret);
+
+	return ret;
+}
+
+
+static void bitmap_count_page(struct bitmap *bitmap, sector_t offset, int inc)
+{
+	sector_t chunk = offset >> CHUNK_BLOCK_SHIFT(bitmap);
+	unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
+	bitmap->bp[page].count += inc;
+/*
+	if (page == 0) printk("count page 0, offset %llu: %d gives %d\n",
+			      (unsigned long long)offset, inc, bitmap->bp[page].count);
+*/
+	bitmap_checkfree(bitmap, page);
+}
+static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap, 
+					    sector_t offset, int *blocks,
+					    int create);
+
+/*
+ * bitmap daemon -- periodically wakes up to clean bits and flush pages
+ *			out to disk
+ */
+
+int bitmap_daemon_work(struct bitmap *bitmap)
+{
+	unsigned long bit, j;
+	unsigned long flags;
+	struct page *page = NULL, *lastpage = NULL;
+	int err = 0;
+	int blocks;
+	int attr;
+
+	if (bitmap == NULL)
+		return 0;
+	if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
+		return 0;
+	bitmap->daemon_lastrun = jiffies;
+
+	for (j = 0; j < bitmap->chunks; j++) {
+		bitmap_counter_t *bmc;
+		spin_lock_irqsave(&bitmap->lock, flags);
+		if (!bitmap->file || !bitmap->filemap) {
+			/* error or shutdown */
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			break;
+		}
+
+		page = filemap_get_page(bitmap, j);
+		/* skip this page unless it's marked as needing cleaning */
+		if (!((attr=get_page_attr(bitmap, page)) & BITMAP_PAGE_CLEAN)) {
+			if (attr & BITMAP_PAGE_NEEDWRITE) {
+				page_cache_get(page);
+				clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
+			}
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			if (attr & BITMAP_PAGE_NEEDWRITE) {
+				if (write_page(page, 0))
+					bitmap_file_kick(bitmap);
+				page_cache_release(page);
+			}
+			continue;
+		}
+
+		bit = file_page_offset(j);
+
+		if (page != lastpage) {
+			/* grab the new page, sync and release the old */
+			page_cache_get(page);
+			if (lastpage != NULL) {
+				if (get_page_attr(bitmap, lastpage) & BITMAP_PAGE_NEEDWRITE) {
+					clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
+					spin_unlock_irqrestore(&bitmap->lock, flags);
+					err = write_page(lastpage, 0);
+				} else {
+					set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
+					spin_unlock_irqrestore(&bitmap->lock, flags);
+				}
+				kunmap(lastpage);
+				page_cache_release(lastpage);
+				if (err)
+					bitmap_file_kick(bitmap);
+			} else
+				spin_unlock_irqrestore(&bitmap->lock, flags);
+			lastpage = page;
+			kmap(page);
+/*
+			printk("bitmap clean at page %lu\n", j);
+*/
+			spin_lock_irqsave(&bitmap->lock, flags);
+			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
+		}
+		bmc = bitmap_get_counter(bitmap, j << CHUNK_BLOCK_SHIFT(bitmap),
+					&blocks, 0);
+		if (bmc) {
+/*
+  if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
+*/
+			if (*bmc == 2) {
+				*bmc=1; /* maybe clear the bit next time */
+				set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
+			} else if (*bmc == 1) {
+				/* we can clear the bit */
+				*bmc = 0;
+				bitmap_count_page(bitmap, j << CHUNK_BLOCK_SHIFT(bitmap),
+						  -1);
+
+				/* clear the bit */
+				clear_bit(bit, page_address(page));
+			}
+		}
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+	}
+
+	/* now sync the final page */
+	if (lastpage != NULL) {
+		kunmap(lastpage);
+		spin_lock_irqsave(&bitmap->lock, flags);
+		if (get_page_attr(bitmap, lastpage) & BITMAP_PAGE_NEEDWRITE) {
+			clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			write_page(lastpage, 0);
+		} else {
+			set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+		}
+
+		page_cache_release(lastpage);
+	}
+
+	return err;
+}
+
+static void daemon_exit(struct bitmap *bitmap, mdk_thread_t **daemon)
+{
+	mdk_thread_t *dmn;
+	unsigned long flags;
+
+	/* if no one is waiting on us, we'll free the md thread struct
+	 * and exit, otherwise we let the waiter clean things up */
+	spin_lock_irqsave(&bitmap->lock, flags);
+	if ((dmn = *daemon)) { /* no one is waiting, cleanup and exit */
+		*daemon = NULL;
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		kfree(dmn);
+		complete_and_exit(NULL, 0); /* do_exit not exported */
+	}
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+}
+
+static void bitmap_writeback_daemon(mddev_t *mddev)
+{
+	struct bitmap *bitmap = mddev->bitmap;
+	struct page *page;
+	struct page_list *item;
+	int err = 0;
+
+	while (1) {
+		PRINTK("%s: bitmap writeback daemon waiting...\n", bmname(bitmap));
+		down_interruptible(&bitmap->write_done);
+		if (signal_pending(current)) {
+			printk(KERN_INFO 
+			    "%s: bitmap writeback daemon got signal, exiting...\n",
+			    bmname(bitmap));
+			break;
+		}
+
+		PRINTK("%s: bitmap writeback daemon woke up...\n", bmname(bitmap));
+		/* wait on bitmap page writebacks */
+		while ((item = dequeue_page(bitmap, &bitmap->complete_pages))) {
+			page = item->page;
+			mempool_free(item, bitmap->write_pool);
+			PRINTK("wait on page writeback: %p %lu\n", page, bitmap->writes_pending);
+			wait_on_page_writeback(page);
+			PRINTK("finished page writeback: %p %lu\n", page, bitmap->writes_pending);
+			spin_lock(&bitmap->write_lock);
+			if (!--bitmap->writes_pending)
+				wake_up(&bitmap->write_wait);
+			spin_unlock(&bitmap->write_lock);
+			err = PageError(page);
+			page_cache_release(page);
+			if (err) {
+				printk(KERN_WARNING "%s: bitmap file writeback "
+					"failed (page %lu): %d\n",
+					bmname(bitmap), page->index, err);
+				bitmap_file_kick(bitmap);
+				goto out;
+			}
+		}
+	}
+out:
+	if (err) {
+		printk(KERN_INFO "%s: bitmap writeback daemon exiting (%d)\n",
+			bmname(bitmap), err);
+		daemon_exit(bitmap, &bitmap->writeback_daemon);
+	}
+	return;
+}
+
+static int bitmap_start_daemon(struct bitmap *bitmap, mdk_thread_t **ptr,
+				void (*func)(mddev_t *), char *name)
+{
+	mdk_thread_t *daemon;
+	unsigned long flags;
+	char namebuf[32];
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	*ptr = NULL;
+	if (!bitmap->file) /* no need for daemon if there's no backing file */
+		goto out_unlock;
+
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+
+#if INJECT_FATAL_FAULT_2
+	daemon = NULL;
+#else
+	sprintf(namebuf, "%%s_%s", name);
+	daemon = md_register_thread(func, bitmap->mddev, namebuf);
+#endif
+	if (!daemon) {
+		printk(KERN_ERR "%s: failed to start bitmap daemon\n",
+			bmname(bitmap));
+		return -ECHILD;
+	}
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	*ptr = daemon;
+
+	md_wakeup_thread(daemon); /* start it running */
+
+	PRINTK("%s: %s daemon (pid %d) started...\n",
+		bmname(bitmap), name, daemon->tsk->pid);
+out_unlock:
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	return 0;
+}
+
+static int bitmap_start_daemons(struct bitmap *bitmap)
+{
+	int err = bitmap_start_daemon(bitmap, &bitmap->writeback_daemon,
+					bitmap_writeback_daemon, "bitmap_wb");
+	return err;
+}
+
+static void bitmap_stop_daemon(struct bitmap *bitmap, mdk_thread_t **ptr)
+{
+	mdk_thread_t *daemon;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	daemon = *ptr;
+	*ptr = NULL;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	if (daemon)
+		md_unregister_thread(daemon); /* destroy the thread */
+}
+
+static void bitmap_stop_daemons(struct bitmap *bitmap)
+{
+	/* the daemons can't stop themselves... they'll just exit instead... */
+	if (bitmap->writeback_daemon &&
+	    current->pid != bitmap->writeback_daemon->tsk->pid)
+		bitmap_stop_daemon(bitmap, &bitmap->writeback_daemon);
+}
+
+static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap, 
+					    sector_t offset, int *blocks,
+					    int create)
+{
+	/* If 'create', we might release the lock and reclaim it.
+	 * The lock must have been taken with interrupts enabled.
+	 * If !create, we don't release the lock.
+	 */
+	sector_t chunk = offset >> CHUNK_BLOCK_SHIFT(bitmap);
+	unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
+	unsigned long pageoff = (chunk & PAGE_COUNTER_MASK) << COUNTER_BYTE_SHIFT;
+	sector_t csize;
+
+	if (bitmap_checkpage(bitmap, page, create) < 0) {
+		csize = ((sector_t)1) << (CHUNK_BLOCK_SHIFT(bitmap));
+		*blocks = csize - (offset & (csize- 1));
+		return NULL;
+	}
+	/* now locked ... */
+
+	if (bitmap->bp[page].hijacked) { /* hijacked pointer */
+		/* should we use the first or second counter field
+		 * of the hijacked pointer? */
+		int hi = (pageoff > PAGE_COUNTER_MASK);
+		csize = ((sector_t)1) << (CHUNK_BLOCK_SHIFT(bitmap) + 
+					  PAGE_COUNTER_SHIFT - 1);
+		*blocks = csize - (offset & (csize- 1));
+		return  &((bitmap_counter_t *)
+			  &bitmap->bp[page].map)[hi];
+	} else { /* page is allocated */
+		csize = ((sector_t)1) << (CHUNK_BLOCK_SHIFT(bitmap));
+		*blocks = csize - (offset & (csize- 1));
+		return (bitmap_counter_t *)
+			&(bitmap->bp[page].map[pageoff]);
+	}
+}
+
+int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sectors)
+{
+	if (!bitmap) return 0;
+	while (sectors) {
+		int blocks;
+		bitmap_counter_t *bmc;
+
+		spin_lock_irq(&bitmap->lock);
+		bmc = bitmap_get_counter(bitmap, offset, &blocks, 1);
+		if (!bmc) {
+			spin_unlock_irq(&bitmap->lock);
+			return 0;
+		}
+
+		switch(*bmc) {
+		case 0:
+			bitmap_file_set_bit(bitmap, offset);
+			bitmap_count_page(bitmap,offset, 1);
+			blk_plug_device(bitmap->mddev->queue);
+			/* fall through */
+		case 1:
+			*bmc = 2;
+		}
+		if ((*bmc & COUNTER_MAX) == COUNTER_MAX) BUG();
+		(*bmc)++;
+
+		spin_unlock_irq(&bitmap->lock);
+
+		offset += blocks;
+		if (sectors > blocks)
+			sectors -= blocks;
+		else sectors = 0;
+	}
+	return 0;
+}
+
+void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long sectors,
+		     int success)
+{
+	if (!bitmap) return;
+	while (sectors) {
+		int blocks;
+		unsigned long flags;
+		bitmap_counter_t *bmc;
+		spin_lock_irqsave(&bitmap->lock, flags);
+		bmc = bitmap_get_counter(bitmap, offset, &blocks, 0);
+		if (!bmc) {
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			return;
+		}
+
+		if (!success && ! (*bmc & NEEDED_MASK))
+			*bmc |= NEEDED_MASK;
+
+		(*bmc)--;
+		if (*bmc <= 2) {
+			set_page_attr(bitmap,
+				      filemap_get_page(bitmap, offset >> CHUNK_BLOCK_SHIFT(bitmap)),
+				      BITMAP_PAGE_CLEAN);
+		}
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		offset += blocks;
+		if (sectors > blocks)
+			sectors -= blocks;
+		else sectors = 0;
+	}
+}
+
+int bitmap_start_sync(struct bitmap *bitmap, sector_t offset, int *blocks)
+{
+	bitmap_counter_t *bmc;
+	int rv;
+	if (bitmap == NULL) {/* FIXME or bitmap set as 'failed' */
+		*blocks = 1024;
+		return 1; /* always resync if no bitmap */
+	}
+	spin_lock_irq(&bitmap->lock);
+	bmc = bitmap_get_counter(bitmap, offset, blocks, 0);
+	rv = 0;
+	if (bmc) {
+		/* locked */
+		if (RESYNC(*bmc))
+			rv = 1;
+		else if (NEEDED(*bmc)) {
+			rv = 1;
+			*bmc |= RESYNC_MASK;
+			*bmc &= ~NEEDED_MASK;
+		}
+	}
+	spin_unlock_irq(&bitmap->lock);
+	return rv;
+}
+
+void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, int *blocks, int aborted)
+{
+	bitmap_counter_t *bmc;
+	unsigned long flags;
+/*
+	if (offset == 0) printk("bitmap_end_sync 0 (%d)\n", aborted);
+*/	if (bitmap == NULL) {
+		*blocks = 1024;
+		return;
+	}
+	spin_lock_irqsave(&bitmap->lock, flags);
+	bmc = bitmap_get_counter(bitmap, offset, blocks, 0);
+	if (bmc == NULL)
+		goto unlock;
+	/* locked */
+/*
+	if (offset == 0) printk("bitmap_end sync found 0x%x, blocks %d\n", *bmc, *blocks);
+*/
+	if (RESYNC(*bmc)) {
+		*bmc &= ~RESYNC_MASK;
+
+		if (!NEEDED(*bmc) && aborted)
+			*bmc |= NEEDED_MASK;
+		else {
+			if (*bmc <= 2) {
+				set_page_attr(bitmap,
+					      filemap_get_page(bitmap, offset >> CHUNK_BLOCK_SHIFT(bitmap)),
+					      BITMAP_PAGE_CLEAN);
+			}
+		}
+	}
+ unlock:
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+}
+
+void bitmap_close_sync(struct bitmap *bitmap)
+{
+	/* Sync has finished, and any bitmap chunks that weren't synced
+	 * properly have been aborted.  It remains to us to clear the
+	 * RESYNC bit wherever it is still on
+	 */
+	sector_t sector = 0;
+	int blocks;
+	if (!bitmap) return;
+	while (sector < bitmap->mddev->resync_max_sectors) {
+		bitmap_end_sync(bitmap, sector, &blocks, 0);
+/*
+		if (sector < 500) printk("bitmap_close_sync: sec %llu blks %d\n",	
+					 (unsigned long long)sector, blocks);
+*/		sector += blocks;
+	}
+}
+
+static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset,
+				   unsigned long sectors, int set)
+{
+	/* For each chunk covered by any of these sectors, set the
+	 * resync needed bit, and the counter to 1.  They should all
+	 * be 0 at this point
+	 */
+	while (sectors) {
+		int secs;
+		bitmap_counter_t *bmc;
+		spin_lock_irq(&bitmap->lock);
+		bmc = bitmap_get_counter(bitmap, offset, &secs, 1);
+		if (!bmc) {
+			spin_unlock_irq(&bitmap->lock);
+			return;
+		}
+		if (set && !NEEDED(*bmc)) {
+			BUG_ON(*bmc);
+			*bmc = NEEDED_MASK | 1;
+			bitmap_count_page(bitmap, offset, 1);
+		}
+		spin_unlock_irq(&bitmap->lock);
+		if (sectors > secs)
+			sectors -= secs;
+		else
+			sectors = 0;
+	}
+}
+
+/* dirty the entire bitmap */
+int bitmap_setallbits(struct bitmap *bitmap)
+{
+	unsigned long flags;
+	unsigned long j;
+
+	/* dirty the in-memory bitmap */
+	bitmap_set_memory_bits(bitmap, 0, bitmap->chunks << CHUNK_BLOCK_SHIFT(bitmap), 1);
+
+	/* dirty the bitmap file */
+	for (j = 0; j < bitmap->file_pages; j++) {
+		struct page *page = bitmap->filemap[j];
+
+		spin_lock_irqsave(&bitmap->lock, flags);
+		page_cache_get(page);
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		memset(kmap(page), 0xff, PAGE_SIZE);
+		kunmap(page);
+		write_page(page, 0);
+	}
+
+	return 0;
+}
+
+/*
+ * free memory that was allocated
+ */
+void bitmap_destroy(mddev_t *mddev)
+{
+	unsigned long k, pages;
+	struct bitmap_page *bp;
+	struct bitmap *bitmap = mddev->bitmap;
+
+	if (!bitmap) /* there was no bitmap */
+		return;
+
+	mddev->bitmap = NULL; /* disconnect from the md device */
+
+	/* release the bitmap file and kill the daemon */
+	bitmap_file_put(bitmap);
+
+	bp = bitmap->bp;
+	pages = bitmap->pages;
+
+	/* free all allocated memory */
+
+	mempool_destroy(bitmap->write_pool);
+
+	if (bp) /* deallocate the page memory */
+		for (k = 0; k < pages; k++)
+			if (bp[k].map && !bp[k].hijacked)
+				kfree(bp[k].map);
+	kfree(bp);
+	kfree(bitmap);
+}
+
+/*
+ * initialize the bitmap structure
+ * if this returns an error, bitmap_destroy must be called to do clean up
+ */
+int bitmap_create(mddev_t *mddev)
+{
+	struct bitmap *bitmap;
+	unsigned long blocks = mddev->resync_max_sectors;
+	unsigned long chunks;
+	unsigned long pages;
+	struct file *file = mddev->bitmap_file;
+	int err;
+
+	BUG_ON(sizeof(bitmap_super_t) != 256);
+
+	if (!file) /* bitmap disabled, nothing to do */
+		return 0;
+
+	bitmap = kmalloc(sizeof(*bitmap), GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
+
+	memset(bitmap, 0, sizeof(*bitmap));
+
+	spin_lock_init(&bitmap->lock);
+	bitmap->mddev = mddev;
+	mddev->bitmap = bitmap;
+
+	spin_lock_init(&bitmap->write_lock);
+	init_MUTEX_LOCKED(&bitmap->write_done);
+	INIT_LIST_HEAD(&bitmap->complete_pages);
+	init_waitqueue_head(&bitmap->write_wait);
+	bitmap->write_pool = mempool_create(WRITE_POOL_SIZE, write_pool_alloc,
+				write_pool_free, NULL);
+	if (!bitmap->write_pool)
+		return -ENOMEM;
+
+	bitmap->file = file;
+	get_file(file);
+	/* read superblock from bitmap file (this sets bitmap->chunksize) */
+	err = bitmap_read_sb(bitmap);
+	if (err)
+		return err;
+
+	bitmap->chunkshift = find_first_bit(&bitmap->chunksize,
+					sizeof(bitmap->chunksize) * 8);
+
+	/* now that chunksize and chunkshift are set, we can use these macros */
+ 	chunks = (blocks + CHUNK_BLOCK_RATIO(bitmap) - 1) /
+			CHUNK_BLOCK_RATIO(bitmap);
+ 	pages = (chunks + PAGE_COUNTER_RATIO - 1) / PAGE_COUNTER_RATIO;
+
+	BUG_ON(!pages);
+
+	bitmap->chunks = chunks;
+	bitmap->pages = pages;
+	bitmap->missing_pages = pages;
+	bitmap->counter_bits = COUNTER_BITS;
+
+	bitmap->syncchunk = ~0UL;
+
+#if INJECT_FATAL_FAULT_1
+	bitmap->bp = NULL;
+#else
+	bitmap->bp = kmalloc(pages * sizeof(*bitmap->bp), GFP_KERNEL);
+#endif
+	if (!bitmap->bp)
+		return -ENOMEM;
+	memset(bitmap->bp, 0, pages * sizeof(*bitmap->bp));
+
+	bitmap->flags |= BITMAP_ACTIVE;
+
+	/* now that we have some pages available, initialize the in-memory
+	 * bitmap from the on-disk bitmap */
+	err = bitmap_init_from_disk(bitmap);
+	if (err)
+		return err;
+
+	printk(KERN_INFO "created bitmap (%lu pages) for device %s\n", 
+		pages, bmname(bitmap));
+
+	/* kick off the bitmap daemons */
+	err = bitmap_start_daemons(bitmap);
+	if (err)
+		return err;
+	return bitmap_update_sb(bitmap);
+}
+
+/* the bitmap API -- for raid personalities */
+EXPORT_SYMBOL(bitmap_startwrite);
+EXPORT_SYMBOL(bitmap_endwrite);
+EXPORT_SYMBOL(bitmap_start_sync);
+EXPORT_SYMBOL(bitmap_end_sync);
+EXPORT_SYMBOL(bitmap_unplug);
+EXPORT_SYMBOL(bitmap_close_sync);
+EXPORT_SYMBOL(bitmap_daemon_work);

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./drivers/md/md.c	2005-02-18 11:11:26.000000000 +1100
@@ -19,6 +19,9 @@
 
      Neil Brown <neilb@cse.unsw.edu.au>.
 
+   - persistent bitmap code
+     Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, Inc.
+
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2, or (at your option)
@@ -33,6 +36,7 @@
 #include <linux/config.h>
 #include <linux/linkage.h>
 #include <linux/raid/md.h>
+#include <linux/raid/bitmap.h>
 #include <linux/sysctl.h>
 #include <linux/devfs_fs_kernel.h>
 #include <linux/buffer_head.h> /* for invalidate_bdev */
@@ -41,6 +45,8 @@
 
 #include <linux/init.h>
 
+#include <linux/file.h>
+
 #ifdef CONFIG_KMOD
 #include <linux/kmod.h>
 #endif
@@ -1241,8 +1247,11 @@ void md_print_devices(void)
 	printk("md:	* <COMPLETE RAID STATE PRINTOUT> *\n");
 	printk("md:	**********************************\n");
 	ITERATE_MDDEV(mddev,tmp) {
-		printk("%s: ", mdname(mddev));
 
+		if (mddev->bitmap)
+			bitmap_print_sb(mddev->bitmap);
+		else
+			printk("%s: ", mdname(mddev));
 		ITERATE_RDEV(mddev,rdev,tmp2)
 			printk("<%s>", bdevname(rdev->bdev,b));
 		printk("\n");
@@ -1330,7 +1339,7 @@ repeat:
 		"md: updating %s RAID superblock on device (in sync %d)\n",
 		mdname(mddev),mddev->in_sync);
 
-	err = 0;
+	err = bitmap_update_sb(mddev->bitmap);
 	ITERATE_RDEV(mddev,rdev,tmp) {
 		char b[BDEVNAME_SIZE];
 		dprintk(KERN_INFO "md: ");
@@ -1676,12 +1685,19 @@ static int do_md_run(mddev_t * mddev)
 
 	mddev->resync_max_sectors = mddev->size << 1; /* may be over-ridden by personality */
 
-	err = mddev->pers->run(mddev);
+	/* before we start the array running, initialise the bitmap */
+	err = bitmap_create(mddev);
+	if (err)
+		printk(KERN_ERR "%s: failed to create bitmap (%d)\n",
+			mdname(mddev), err);
+	else
+		err = mddev->pers->run(mddev);
 	if (err) {
 		printk(KERN_ERR "md: pers->run() failed ...\n");
 		module_put(mddev->pers->owner);
 		mddev->pers = NULL;
-		return -EINVAL;
+		bitmap_destroy(mddev);
+		return err;
 	}
  	atomic_set(&mddev->writes_pending,0);
 	mddev->safemode = 0;
@@ -1795,6 +1811,14 @@ static int do_md_stop(mddev_t * mddev, i
 		if (ro)
 			set_disk_ro(disk, 1);
 	}
+	
+	bitmap_destroy(mddev);
+	if (mddev->bitmap_file) {
+		atomic_set(&mddev->bitmap_file->f_dentry->d_inode->i_writecount, 1);
+		fput(mddev->bitmap_file);
+		mddev->bitmap_file = NULL;
+	}
+
 	/*
 	 * Free resources if final stop
 	 */
@@ -2056,6 +2080,42 @@ static int get_array_info(mddev_t * mdde
 	return 0;
 }
 
+static int get_bitmap_file(mddev_t * mddev, void * arg)
+{
+	mdu_bitmap_file_t *file = NULL; /* too big for stack allocation */
+	char *ptr, *buf = NULL;
+	int err = -ENOMEM;
+
+	file = kmalloc(sizeof(*file), GFP_KERNEL);
+	if (!file)
+		goto out;
+
+	/* bitmap disabled, zero the first byte and copy out */
+	if (!mddev->bitmap || !mddev->bitmap->file) {
+		file->pathname[0] = '\0';
+		goto copy_out;
+	}
+
+	buf = kmalloc(sizeof(file->pathname), GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	ptr = file_path(mddev->bitmap->file, buf, sizeof(file->pathname));
+	if (!ptr)
+		goto out;
+
+	strcpy(file->pathname, ptr);
+
+copy_out:
+	err = 0;
+	if (copy_to_user(arg, file, sizeof(*file)))
+		err = -EFAULT;
+out:
+	kfree(buf);
+	kfree(file);
+	return err;
+}
+
 static int get_disk_info(mddev_t * mddev, void __user * arg)
 {
 	mdu_disk_info_t info;
@@ -2330,6 +2390,48 @@ abort_export:
 	return err;
 }
 
+/* similar to deny_write_access, but accounts for our holding a reference
+ * to the file ourselves */
+static int deny_bitmap_write_access(struct file * file)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	spin_lock(&inode->i_lock);
+	if (atomic_read(&inode->i_writecount) > 1) {
+		spin_unlock(&inode->i_lock);
+		return -ETXTBSY;
+	}
+	atomic_set(&inode->i_writecount, -1);
+	spin_unlock(&inode->i_lock);
+
+	return 0;
+}
+
+static int set_bitmap_file(mddev_t *mddev, int fd)
+{
+	int err;
+
+	if (mddev->pers)
+		return -EBUSY;
+
+	mddev->bitmap_file = fget(fd);
+
+	if (mddev->bitmap_file == NULL) {
+		printk(KERN_ERR "%s: error: failed to get bitmap file\n",
+			mdname(mddev));
+		return -EBADF;
+	}
+
+	err = deny_bitmap_write_access(mddev->bitmap_file);
+	if (err) {
+		printk(KERN_ERR "%s: error: bitmap file is already in use\n",
+			mdname(mddev));
+		fput(mddev->bitmap_file);
+		mddev->bitmap_file = NULL;
+	}
+	return err;
+}
+
 /*
  * set_array_info is used two different ways
  * The original usage is when creating a new array.
@@ -2641,8 +2743,10 @@ static int md_ioctl(struct inode *inode,
 	/*
 	 * Commands querying/configuring an existing array:
 	 */
-	/* if we are initialised yet, only ADD_NEW_DISK or STOP_ARRAY is allowed */
-	if (!mddev->raid_disks && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY && cmd != RUN_ARRAY) {
+	/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
+	 * RUN_ARRAY, and SET_BITMAP_FILE are allowed */
+	if (!mddev->raid_disks && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
+			&& cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE) {
 		err = -ENODEV;
 		goto abort_unlock;
 	}
@@ -2656,6 +2760,10 @@ static int md_ioctl(struct inode *inode,
 			err = get_array_info(mddev, argp);
 			goto done_unlock;
 
+		case GET_BITMAP_FILE:
+			err = get_bitmap_file(mddev, (void *)arg);
+			goto done_unlock;
+
 		case GET_DISK_INFO:
 			err = get_disk_info(mddev, argp);
 			goto done_unlock;
@@ -2735,6 +2843,10 @@ static int md_ioctl(struct inode *inode,
 		case RUN_ARRAY:
 			err = do_md_run (mddev);
 			goto done_unlock;
+ 
+		case SET_BITMAP_FILE:
+			err = set_bitmap_file(mddev, (int)arg);
+			goto done_unlock;
 
 		default:
 			if (_IOC_TYPE(cmd) == MD_MAJOR)
@@ -2847,8 +2959,9 @@ int md_thread(void * arg)
 	while (thread->run) {
 		void (*run)(mddev_t *);
 
-		wait_event_interruptible(thread->wqueue,
-					 test_bit(THREAD_WAKEUP, &thread->flags));
+		wait_event_interruptible_timeout(thread->wqueue,
+						 test_bit(THREAD_WAKEUP, &thread->flags),
+						 thread->timeout);
 		if (current->flags & PF_FREEZE)
 			refrigerator(PF_FREEZE);
 
@@ -2894,6 +3007,7 @@ mdk_thread_t *md_register_thread(void (*
 	thread->run = run;
 	thread->mddev = mddev;
 	thread->name = name;
+	thread->timeout = MAX_SCHEDULE_TIMEOUT;
 	ret = kernel_thread(md_thread, thread, 0);
 	if (ret < 0) {
 		kfree(thread);
@@ -2936,13 +3050,13 @@ void md_error(mddev_t *mddev, mdk_rdev_t
 
 	if (!rdev || rdev->faulty)
 		return;
-
+/*
 	dprintk("md_error dev:%s, rdev:(%d:%d), (caller: %p,%p,%p,%p).\n",
 		mdname(mddev),
 		MAJOR(rdev->bdev->bd_dev), MINOR(rdev->bdev->bd_dev),
 		__builtin_return_address(0),__builtin_return_address(1),
 		__builtin_return_address(2),__builtin_return_address(3));
-
+*/
 	if (!mddev->pers->error_handler)
 		return;
 	mddev->pers->error_handler(mddev,rdev);
@@ -3098,6 +3212,7 @@ static int md_seq_show(struct seq_file *
 	mdk_rdev_t *rdev;
 	struct mdstat_info *mi = seq->private;
 	int i;
+	struct bitmap *bitmap;
 
 	if (v == (void*)1) {
 		seq_printf(seq, "Personalities : ");
@@ -3152,10 +3267,36 @@ static int md_seq_show(struct seq_file *
 		if (mddev->pers) {
 			mddev->pers->status (seq, mddev);
 	 		seq_printf(seq, "\n      ");
-			if (mddev->curr_resync > 2)
+			if (mddev->curr_resync > 2) {
 				status_resync (seq, mddev);
-			else if (mddev->curr_resync == 1 || mddev->curr_resync == 2)
-				seq_printf(seq, "	resync=DELAYED");
+				seq_printf(seq, "\n      ");
+			} else if (mddev->curr_resync == 1 || mddev->curr_resync == 2)
+				seq_printf(seq, "	resync=DELAYED\n      ");
+		} else
+			seq_printf(seq, "\n       ");
+
+		if ((bitmap = mddev->bitmap)) {
+			char *buf, *path;
+			unsigned long chunk_kb;
+			unsigned long flags;
+			buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+			spin_lock_irqsave(&bitmap->lock, flags);
+			chunk_kb = bitmap->chunksize >> 10;
+			seq_printf(seq, "bitmap: %lu/%lu pages [%luKB], "
+				"%lu%s chunk",
+				bitmap->pages - bitmap->missing_pages,
+				bitmap->pages,
+				(bitmap->pages - bitmap->missing_pages)
+					<< (PAGE_SHIFT - 10),
+				chunk_kb ? chunk_kb : bitmap->chunksize,
+				chunk_kb ? "KB" : "B");
+			if (bitmap->file && buf) {
+				path = file_path(bitmap->file, buf, PAGE_SIZE);
+				seq_printf(seq, ", file: %s", path ? path : "");
+			}
+			seq_printf(seq, "\n");
+			spin_unlock_irqrestore(&bitmap->lock, flags);
+			kfree(buf);
 		}
 
 		seq_printf(seq, "\n");
@@ -3448,7 +3589,8 @@ static void md_do_sync(mddev_t *mddev)
 	       sysctl_speed_limit_max);
 
 	is_mddev_idle(mddev); /* this also initializes IO event counters */
-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+	/* we don't use the checkpoint if there's a bitmap */
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && !mddev->bitmap)
 		j = mddev->recovery_cp;
 	else
 		j = 0;
@@ -3799,6 +3941,8 @@ int __init md_init(void)
 			" MD_SB_DISKS=%d\n",
 			MD_MAJOR_VERSION, MD_MINOR_VERSION,
 			MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MD_SB_DISKS);
+	printk(KERN_INFO "md: bitmap version %d.%d\n", BITMAP_MAJOR,
+			BITMAP_MINOR);
 
 	if (register_blkdev(MAJOR_NR, "md"))
 		return -1;

diff ./include/linux/raid/bitmap.h~current~ ./include/linux/raid/bitmap.h
--- ./include/linux/raid/bitmap.h~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./include/linux/raid/bitmap.h	2005-02-18 11:11:26.000000000 +1100
@@ -0,0 +1,280 @@
+/*
+ * bitmap.h: Copyright (C) Peter T. Breuer (ptb@ot.uc3m.es) 2003
+ *
+ * additions: Copyright (C) 2003-2004, Paul Clements, SteelEye Technology, Inc.
+ */
+#ifndef BITMAP_H
+#define BITMAP_H 1
+
+#define BITMAP_MAJOR 3
+#define BITMAP_MINOR 38
+
+/*
+ * in-memory bitmap:
+ *
+ * Use 16 bit block counters to track pending writes to each "chunk".
+ * The 2 high order bits are special-purpose: the first is a flag indicating
+ * whether a resync is needed.  The second is a flag indicating whether a
+ * resync is active.
+ * This means that the counter is actually 14 bits:
+ *
+ * +--------+--------+------------------------------------------------+
+ * | resync | resync |               counter                          |
+ * | needed | active |                                                |
+ * |  (0-1) |  (0-1) |              (0-16383)                         |
+ * +--------+--------+------------------------------------------------+
+ *
+ * The "resync needed" bit is set when:
+ *    a '1' bit is read from storage at startup,
+ *    a write request fails on some drives, or
+ *    a resync is aborted on a chunk with 'resync active' set.
+ * It is cleared (and resync-active set) when a resync starts across all drives
+ * of the chunk.
+ *
+ *
+ * The "resync active" bit is set when:
+ *    a resync is started on all drives, and resync_needed is set.
+ *       resync_needed will be cleared (as long as resync_active wasn't already set).
+ * It is cleared when a resync completes.
+ *
+ * The counter counts pending write requests, plus the on-disk bit.
+ * When the counter is '1' and the resync bits are clear, the on-disk
+ * bit can be cleared as well, thus setting the counter to 0.
+ * When we set a bit, or increment the counter (to start a write), if the field
+ * is 0, we first set the disk bit and set the counter to 1.
+ *
+ * If the counter is 0, the on-disk bit is clear and the stripe is clean.
+ * Anything that dirties the stripe pushes the counter to 2 (at least)
+ * and sets the on-disk bit (lazily).
+ * If a periodic sweep finds the counter at 2, it is decremented to 1.
+ * If the sweep finds the counter at 1, the on-disk bit is cleared and the
+ * counter goes to zero.
+ *
+ * Also, we'll hijack the "map" pointer itself and use it as two 16 bit block
+ * counters as a fallback when "page" memory cannot be allocated:
+ *
+ * Normal case (page memory allocated):
+ *
+ *     page pointer (32-bit)
+ *
+ *     [ ] ------+
+ *               |
+ *               +-------> [   ][   ]..[   ] (4096 byte page == 2048 counters)
+ *                          c1   c2    c2048
+ *
+ * Hijacked case (page memory allocation failed):
+ *
+ *     hijacked page pointer (32-bit)
+ *
+ *     [		  ][		  ] (no page memory allocated)
+ *      counter #1 (16-bit) counter #2 (16-bit)
+ *
+ */
+
+#ifdef __KERNEL__
+
+#define PAGE_BITS (PAGE_SIZE << 3)
+#define PAGE_BIT_SHIFT (PAGE_SHIFT + 3)
+
+typedef __u16 bitmap_counter_t;
+#define COUNTER_BITS 16
+#define COUNTER_BIT_SHIFT 4
+#define COUNTER_BYTE_RATIO (COUNTER_BITS / 8)
+#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
+
+#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
+#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
+#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
+#define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
+#define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
+#define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
+
+/* how many counters per page? */
+#define PAGE_COUNTER_RATIO (PAGE_BITS / COUNTER_BITS)
+/* same, except a shift value for more efficient bitops */
+#define PAGE_COUNTER_SHIFT (PAGE_BIT_SHIFT - COUNTER_BIT_SHIFT)
+/* same, except a mask value for more efficient bitops */
+#define PAGE_COUNTER_MASK  (PAGE_COUNTER_RATIO - 1)
+
+#define BITMAP_BLOCK_SIZE 512
+#define BITMAP_BLOCK_SHIFT 9
+
+/* how many blocks per chunk? (this is variable) */
+#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->chunksize >> BITMAP_BLOCK_SHIFT)
+#define CHUNK_BLOCK_SHIFT(bitmap) ((bitmap)->chunkshift - BITMAP_BLOCK_SHIFT)
+#define CHUNK_BLOCK_MASK(bitmap) (CHUNK_BLOCK_RATIO(bitmap) - 1)
+
+/* when hijacked, the counters and bits represent even larger "chunks" */
+/* there will be 1024 chunks represented by each counter in the page pointers */
+#define PAGEPTR_BLOCK_RATIO(bitmap) \
+			(CHUNK_BLOCK_RATIO(bitmap) << PAGE_COUNTER_SHIFT >> 1)
+#define PAGEPTR_BLOCK_SHIFT(bitmap) \
+			(CHUNK_BLOCK_SHIFT(bitmap) + PAGE_COUNTER_SHIFT - 1)
+#define PAGEPTR_BLOCK_MASK(bitmap) (PAGEPTR_BLOCK_RATIO(bitmap) - 1)
+
+/*
+ * on-disk bitmap:
+ *
+ * Use one bit per "chunk" (block set). We do the disk I/O on the bitmap
+ * file a page at a time. There's a superblock at the start of the file.
+ */
+
+/* map chunks (bits) to file pages - offset by the size of the superblock */
+#define CHUNK_BIT_OFFSET(chunk) ((chunk) + (sizeof(bitmap_super_t) << 3))
+
+#endif
+
+/*
+ * bitmap structures:
+ */
+
+#define BITMAP_MAGIC 0x6d746962
+
+/* use these for bitmap->flags and bitmap->sb->state bit-fields */
+enum bitmap_state {
+	BITMAP_ACTIVE = 0x001, /* the bitmap is in use */
+	BITMAP_STALE  = 0x002  /* the bitmap file is out of date or had -EIO */
+};
+
+/* the superblock at the front of the bitmap file -- little endian */
+typedef struct bitmap_super_s {
+	__u32 magic;        /*  0  BITMAP_MAGIC */
+	__u32 version;      /*  4  the bitmap major for now, could change... */
+	__u8  uuid[16];     /*  8  128 bit uuid - must match md device uuid */
+	__u64 events;       /* 24  event counter for the bitmap (1)*/
+	__u64 events_cleared;/*32  event counter when last bit cleared (2) */
+	__u64 sync_size;    /* 40  the size of the md device's sync range(3) */
+	__u32 state;        /* 48  bitmap state information */
+	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
+	__u32 daemon_sleep; /* 56  seconds between disk flushes */
+
+	__u8  pad[256 - 60]; /* set to zero */
+} bitmap_super_t;
+
+/* notes:
+ * (1) This event counter is updated before the event counter in the md superblock.
+ *    When a bitmap is loaded, it is only accepted if this event counter is equal
+ *    to, or one greater than, the event counter in the superblock.
+ * (2) This event counter is updated when the other one is *if*and*only*if* the 
+ *    array is not degraded.  As bits are not cleared when the array is degraded,
+ *    this represents the last time that any bits were cleared.
+ *    If a device is being added that has an event count with this value or
+ *    higher, it is accepted as conforming to the bitmap.
+ * (3) This is the number of sectors represented by the bitmap, and is the range that
+ *    resync happens across.  For raid1 and raid5/6 it is the size of individual
+ *    devices.  For raid10 it is the size of the array.
+ */
+
+#ifdef __KERNEL__
+
+/* the in-memory bitmap is represented by bitmap_pages */
+struct bitmap_page {
+	/*
+	 * map points to the actual memory page
+	 */
+	char *map;
+	/*
+	 * in emergencies (when map cannot be alloced), hijack the map
+	 * pointer and use it as two counters itself
+	 */
+	unsigned int hijacked:1;
+	/*
+	 * count of dirty bits on the page
+	 */ 
+	unsigned int  count:31;
+};
+
+/* keep track of bitmap file pages that have pending writes on them */
+struct page_list {
+	struct list_head list;
+	struct page *page;
+};
+
+/* the main bitmap structure - one per mddev */
+struct bitmap {
+	struct bitmap_page *bp;
+	unsigned long pages; /* total number of pages in the bitmap */
+	unsigned long missing_pages; /* number of pages not yet allocated */
+
+	mddev_t *mddev; /* the md device that the bitmap is for */
+
+	int counter_bits; /* how many bits per block counter */
+
+	/* bitmap chunksize -- how much data does each bit represent? */
+	unsigned long chunksize;
+	unsigned long chunkshift; /* chunksize = 2^chunkshift (for bitops) */
+	unsigned long chunks; /* total number of data chunks for the array */
+
+	/* We hold a count on the chunk currently being synced, and drop
+	 * it when the last block is started.  If the resync is aborted
+	 * midway, we need to be able to drop that count, so we remember
+	 * the counted chunk..
+	 */
+	unsigned long syncchunk;
+
+	__u64	events_cleared;
+
+	/* bitmap spinlock */
+	spinlock_t lock;
+
+	struct file *file; /* backing disk file */
+	struct page *sb_page; /* cached copy of the bitmap file superblock */
+	struct page **filemap; /* list of cache pages for the file */
+	unsigned long *filemap_attr; /* attributes associated w/ filemap pages */
+	unsigned long file_pages; /* number of pages in the file */
+
+	unsigned long flags;
+
+	/*
+	 * the bitmap daemon - periodically wakes up and sweeps the bitmap
+	 * file, cleaning up bits and flushing out pages to disk as necessary
+	 */
+	unsigned long daemon_lastrun; /* jiffies of last run */
+	unsigned long daemon_sleep; /* how many seconds between updates? */
+
+	/*
+	 * bitmap write daemon - this daemon performs writes to the bitmap file
+	 * this thread is only needed because of a limitation in ext3 (jbd)
+	 * that does not allow a task to have two journal transactions ongoing
+	 * simultaneously (even if the transactions are for two different
+	 * filesystems) -- in the case of bitmap, that would be the filesystem
+	 * that the bitmap file resides on and the filesystem that is mounted
+	 * on the md device -- see current->journal_info in jbd/transaction.c
+	 */
+	mdk_thread_t *writeback_daemon;
+	spinlock_t write_lock;
+	struct semaphore write_ready;
+	struct semaphore write_done;
+	unsigned long writes_pending;
+	wait_queue_head_t write_wait;
+	struct list_head write_pages;
+	struct list_head complete_pages;
+	mempool_t *write_pool;
+};
+
+/* the bitmap API */
+
+/* these are used only by md/bitmap */
+int  bitmap_create(mddev_t *mddev);
+void bitmap_destroy(mddev_t *mddev);
+int  bitmap_active(struct bitmap *bitmap);
+
+char *file_path(struct file *file, char *buf, int count);
+void bitmap_print_sb(struct bitmap *bitmap);
+int bitmap_update_sb(struct bitmap *bitmap);
+
+int  bitmap_setallbits(struct bitmap *bitmap);
+
+/* these are exported */
+int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sectors);
+void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long sectors,
+		     int success);
+int bitmap_start_sync(struct bitmap *bitmap, sector_t offset, int *blocks);
+void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, int *blocks, int aborted);
+void bitmap_close_sync(struct bitmap *bitmap);
+
+int bitmap_unplug(struct bitmap *bitmap);
+int bitmap_daemon_work(struct bitmap *bitmap);
+#endif
+
+#endif

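Purely as illustration (this little test program is not part of the patch),
the counter macros above can be exercised from userspace by mirroring the
definitions locally:

#include <stdio.h>

typedef unsigned short bitmap_counter_t;	/* stands in for __u16 */
#define COUNTER_BITS 16
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
#define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
#define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
#define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)

int main(void)
{
	/* "resync needed" flag set; disk bit plus three pending writes */
	bitmap_counter_t c = NEEDED_MASK | 4;

	printf("needed=%d resync=%d count=%u\n",
	       NEEDED(c) ? 1 : 0, RESYNC(c) ? 1 : 0, (unsigned) COUNTER(c));
	return 0;
}

which prints "needed=1 resync=0 count=4".
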
diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-02-18 11:11:26.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-02-18 11:11:26.000000000 +1100
@@ -267,6 +267,9 @@ struct mddev_s
 	atomic_t			writes_pending; 
 	request_queue_t			*queue;	/* for plugging ... */
 
+	struct bitmap                   *bitmap; /* the bitmap for the device */
+	struct file			*bitmap_file; /* the bitmap file */
+
 	struct list_head		all_mddevs;
 };
 
@@ -341,6 +344,7 @@ typedef struct mdk_thread_s {
 	unsigned long           flags;
 	struct completion	*event;
 	struct task_struct	*tsk;
+	unsigned long		timeout;
 	const char		*name;
 } mdk_thread_t;
 

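The new timeout field (initialised to MAX_SCHEDULE_TIMEOUT in
md_register_thread above, so plain md threads behave as before) lets the
bitmap daemons wake periodically instead of only on an explicit wakeup.
As a sketch of the intent, assuming the thread's existing wqueue and
THREAD_WAKEUP flag, the wait in md_thread() becomes roughly:

	wait_event_interruptible_timeout(thread->wqueue,
			test_bit(THREAD_WAKEUP, &thread->flags),
			thread->timeout);
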
diff ./include/linux/raid/md_u.h~current~ ./include/linux/raid/md_u.h
--- ./include/linux/raid/md_u.h~current~	2005-02-18 11:08:39.000000000 +1100
+++ ./include/linux/raid/md_u.h	2005-02-18 11:11:26.000000000 +1100
@@ -23,6 +23,7 @@
 #define GET_DISK_INFO		_IOR (MD_MAJOR, 0x12, mdu_disk_info_t)
 #define PRINT_RAID_DEBUG	_IO (MD_MAJOR, 0x13)
 #define RAID_AUTORUN		_IO (MD_MAJOR, 0x14)
+#define GET_BITMAP_FILE		_IOR (MD_MAJOR, 0x15, mdu_bitmap_file_t)
 
 /* configuration */
 #define CLEAR_ARRAY		_IO (MD_MAJOR, 0x20)
@@ -36,6 +37,7 @@
 #define HOT_ADD_DISK		_IO (MD_MAJOR, 0x28)
 #define SET_DISK_FAULTY		_IO (MD_MAJOR, 0x29)
 #define HOT_GENERATE_ERROR	_IO (MD_MAJOR, 0x2a)
+#define SET_BITMAP_FILE		_IOW (MD_MAJOR, 0x2b, int)
 
 /* usage */
 #define RUN_ARRAY		_IOW (MD_MAJOR, 0x30, mdu_param_t)
@@ -106,6 +108,11 @@ typedef struct mdu_start_info_s {
 
 } mdu_start_info_t;
 
+typedef struct mdu_bitmap_file_s
+{
+	char pathname[4096];
+} mdu_bitmap_file_t;
+
 typedef struct mdu_param_s
 {
 	int			personality;	/* 1,2,3,4 */

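As a usage sketch (not part of the patch; a 2.0-series mdadm is the real
consumer), the new GET_BITMAP_FILE ioctl can be driven from a small
userspace program, mirroring the struct and ioctl number locally:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/major.h>	/* MD_MAJOR */

/* mirrored from the md_u.h additions above */
typedef struct mdu_bitmap_file_s {
	char pathname[4096];
} mdu_bitmap_file_t;
#define GET_BITMAP_FILE _IOR(MD_MAJOR, 0x15, mdu_bitmap_file_t)

int main(int argc, char **argv)
{
	mdu_bitmap_file_t bmf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /dev/mdX\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, GET_BITMAP_FILE, &bmf) < 0) {
		perror("GET_BITMAP_FILE");
		return 1;
	}
	printf("bitmap file: %s\n", bmf.pathname);
	close(fd);
	return 0;
}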

* Re: [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.
  2005-02-18  0:40 ` [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive NeilBrown
@ 2005-02-18  0:51   ` Mike Hardy
  2005-02-18  0:58     ` Neil Brown
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Hardy @ 2005-02-18  0:51 UTC (permalink / raw)
  To: linux-raid


NeilBrown wrote:
> When an array is degraded, bit in the intent-bitmap are
> never cleared. So if a recently failed drive is re-added, we only need
> to reconstruct the block that are still reflected in the
> bitmap.
> This patch adds support for this re-adding.

Hi there -

If I understand this correctly, this means that:

1) if I had a raid1 mirror (for example) that has no writes to it since 
a resync
2) a drive fails out, and some writes occur
3) when I re-add the drive, only the areas where the writes occurred 
would be re-synced?

I can think of a bunch of peripheral questions around this scenario and
around bad sectors / bad sector clearing, but I may not be understanding the
basic idea, so I wanted to ask first.

-Mike


* Re: [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.
  2005-02-18  0:51   ` Mike Hardy
@ 2005-02-18  0:58     ` Neil Brown
  0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2005-02-18  0:58 UTC (permalink / raw)
  To: Mike Hardy; +Cc: linux-raid

On Thursday February 17, mhardy@h3c.com wrote:
> 
> NeilBrown wrote:
> > When an array is degraded, bits in the intent-bitmap are
> > never cleared. So if a recently failed drive is re-added, we only need
> > to reconstruct the blocks that are still reflected in the
> > bitmap.
> > This patch adds support for this re-adding.
> 
> Hi there -
> 
> If I understand this correctly, this means that:
> 
> 1) if I had a raid1 mirror (for example) that has no writes to it since 
> a resync
> 2) a drive fails out, and some writes occur
> 3) when I re-add the drive, only the areas where the writes occurred 
> would be re-synced?
> 
> I can think of a bunch of peripheral questions around this scenario, and 
> bad sectors / bad sector clearing, but I may not be understanding the 
> basic idea, so I wanted to ask first.

You seem to understand the basic idea.
I believe one of the motivators for this code (I didn't originate it)
is when a raid1 has one device locally and one device over a network
connection.

If the network connection breaks, that device has to be thrown
out. But when it comes back, we don't want to resync the whole array
over the network.  This functionality helps there (though there are a
few other things needed before that scenario can work smoothly).

You would only re-add a device if you thought it was OK.  i.e. if it
was a connection problem rather than a media problem, or if you had
resolved any media issues.


NeilBrown


* Re: [PATCH md 0 of 9] Introduction
  2005-02-18  0:40 [PATCH md 0 of 9] Introduction NeilBrown
                   ` (8 preceding siblings ...)
  2005-02-18  0:40 ` [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive NeilBrown
@ 2005-02-18 14:25 ` Phantazm
  2005-02-20 22:55   ` Neil Brown
  9 siblings, 1 reply; 14+ messages in thread
From: Phantazm @ 2005-02-18 14:25 UTC (permalink / raw)
  To: linux-raid

Would you recommend applying this package
http://neilb.web.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2005-02-18-00/patch-all-2005-02-18-00
to a 2.6.10 kernel?

Thank you for a very good raid tool :)

"NeilBrown" <neilb@cse.unsw.edu.au> skrev i meddelandet 
news:20050218111131.790.patches@notabene...
>
> 9 patches for md in 2.6.11-rc3-mm2 follow.
>
> 1 - tightens up some locking to close a race that is extremely unlikely
>    to happen (famous last words).  It can only happen when failed devices
>    are being removed.
> 2 - Fixes a resync related problem for raid5 and raid6 arrays that have
>    lost more devices than they can cope with.
> 3 - Fixes a clumsy test that has very little real effect (just a printk).
>
> These three can be considered stable.
>
> 4,5 Improve handling of 'safemode' which is where we mark the superblock
>    clean whenever there have been no writes for a little while, and mark
>    it dirty before the next write.
> 6-9 Introduce "bitmap based intent logging" which allows a file to
>    store - as a bitmap - all blocks that may have been written recently.
>    This allows resync to be optimised, which is more important as drives
>    get larger.
>
> These should not be considered stable - certainly actually using
> the bitmap mode (which requires a 2.0 series version of mdadm) should
> be seen as 'experimental'.
>
> mdadm-2.0-devel-1  was released today.
>
> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: [PATCH md 0 of 9] Introduction
  2005-02-18 14:25 ` [PATCH md 0 of 9] Introduction Phantazm
@ 2005-02-20 22:55   ` Neil Brown
  0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2005-02-20 22:55 UTC (permalink / raw)
  To: Phantazm; +Cc: linux-raid

On Friday February 18, phantazm@phantazm.nu wrote:
> Would you recommend to apply this package 
> http://neilb.web.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2005-02-18-00/patch-all-2005-02-18-00
> To a 2.6.10 kernel?

No.  I don't think it would apply.
That patch is mostly experimental stuff.  Only apply it if you want to
experiment with the "bitmap resync" code.

NeilBrown
