* [PATCH md 000 of 18] Introduction
From: NeilBrown @ 2005-11-27 23:39 UTC
  To: Andrew Morton; +Cc: linux-raid

Following are 18 patches against md/raid in 2.6.15-rc2-mm1

The first 5 are bug fixes against recent additions to 2.6.15-rc.
They should preferably go to Linus for inclusion in 2.6.15.

The remaining 13 are code cleanups and new functionality.  These
should sit in -mm until 2.6.16-rc opens up.

The main elements of functionality I have been working on are
 - bitmap based write intent logging
      This is now available in all raid levels (where useful).
 - better handling of read errors
      This is now available for raid5, raid6, raid1
      raid10 is still outstanding
 - Support 'check' action which checks redundancy but doesn't repair.
      This is now available in raid5 and raid6
      raid10 and raid1 are still outstanding.

Future work includes
  - enhancements to the sysfs interface
  - improve raid5/6 read speed by bypassing cache

Thanks,
NeilBrown

2.6.15 material
 [PATCH md 001 of 18] Improve read speed to raid10 arrays using 'far copies'.
 [PATCH md 002 of 18] Fix locking problem in r5/r6
 [PATCH md 003 of 18] Fix problem with raid6 intent bitmap
 [PATCH md 004 of 18] Set default_bitmap_offset properly in set_array_info
 [PATCH md 005 of 18] Fix --re-add for raid1 and raid6
2.6.16 material
 [PATCH md 006 of 18] Improve raid1 "IO Barrier" concept.
 [PATCH md 007 of 18] Improve raid10 "IO Barrier" concept.
 [PATCH md 008 of 18] Small cleanups for raid5
 [PATCH md 009 of 18] Allow dirty raid[456] arrays to be started at boot.
 [PATCH md 010 of 18] Move bitmap_create to after md array has been initialised.
 [PATCH md 011 of 18] Write intent bitmap support for raid10
 [PATCH md 012 of 18] Fix raid6 resync check/repair code.
 [PATCH md 013 of 18] Improve handing of read errors with raid6
 [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1.
 [PATCH md 015 of 18] Tidyup some issues with raid1 resync and prepare for catching read errors.
 [PATCH md 016 of 18] Better handling for read error in raid1 during resync
 [PATCH md 017 of 18] Handle errors when read-only
 [PATCH md 018 of 18] Fix up some rdev rcu locking in raid5/6


* [PATCH md 001 of 18] Improve read speed to raid10 arrays using 'far copies'.
From: NeilBrown @ 2005-11-27 23:39 UTC
  To: Andrew Morton; +Cc: linux-raid


raid10 has two different layouts.  One uses near-copies (multiple
copies of a block are at the same or similar offsets on different
devices) and the other uses far-copies (multiple copies of a block
are stored at greatly different offsets on different devices).  The
point of far-copies is that it allows the first section (normally the
first half) of each device to be laid out in normal raid0 style, and
thus provide raid0 sequential read performance.

Unfortunately, the read balancing in raid10 makes some poor decisions
for far-copies arrays and you don't get the desired performance.  So
turn off that bad bit of read_balance for far-copies arrays.

With this patch, read speed of an 'f2' array is comparable to that of
a raid0 array with the same number of devices, though write speed is
of course still very slow.
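
For reference, a minimal userspace sketch (made-up chunk counts, not
the kernel code) of how a 2-copy 'far' layout places data: the first
half of each device is a plain raid0 stripe, and the second half
holds the same chunks shifted by one device.

#include <stdio.h>

int main(void)
{
	int ndevs = 4;		/* devices in the array (illustrative) */
	long half = 1000;	/* chunks per device half (made up) */
	long chunk;

	for (chunk = 0; chunk < 8; chunk++) {
		int dev0 = chunk % ndevs;	/* first copy: raid0 style */
		long off0 = chunk / ndevs;
		int dev1 = (dev0 + 1) % ndevs;	/* far copy on next device */
		long off1 = half + chunk / ndevs; /* in the second half */

		printf("chunk %ld: copy0 dev%d@%ld, copy1 dev%d@%ld\n",
		       chunk, dev0, off0, dev1, off1);
	}
	return 0;
}

Reading only the first copies in order is a straight raid0 scan,
which is why sequential reads can run at raid0 speed once
read_balance stops sending them to the far copies.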

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/raid10.c	2005-11-28 10:08:26.000000000 +1100
@@ -552,7 +552,11 @@ static int read_balance(conf_t *conf, r1
 		    !test_bit(In_sync, &rdev->flags))
 			continue;
 
-		if (!atomic_read(&rdev->nr_pending)) {
+		/* This optimisation is debatable, and completely destroys
+		 * sequential read speed for 'far copies' arrays.  So only
+		 * keep it for 'near' arrays, and review those later.
+		 */
+		if (conf->near_copies > 1 && !atomic_read(&rdev->nr_pending)) {
 			disk = ndisk;
 			slot = nslot;
 			break;


* [PATCH md 002 of 18] Fix locking problem in r5/r6
From: NeilBrown @ 2005-11-27 23:39 UTC
  To: Andrew Morton; +Cc: linux-raid


bitmap_unplug actually writes data (bits) to storage, so we
shouldn't be holding a spinlock when we call it.
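
For illustration, a userspace analogue of the fix (a pthreads sketch,
not the kernel code): snapshot the state under the lock, drop the
lock around the call that can block, then retake it to publish the
result.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int seq_flush = 5, seq_write;

static void slow_io(void)	/* stands in for bitmap_unplug() */
{
	usleep(1000);		/* blocks: must not hold the lock here */
}

int main(void)
{
	int seq;

	pthread_mutex_lock(&lock);
	seq = seq_flush;		/* snapshot while locked */
	pthread_mutex_unlock(&lock);	/* drop the lock for the slow part */
	slow_io();
	pthread_mutex_lock(&lock);	/* retake before updating state */
	seq_write = seq;
	pthread_mutex_unlock(&lock);
	printf("seq_write = %d\n", seq_write);
	return 0;
}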

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c     |    2 ++
 ./drivers/md/raid6main.c |    2 ++
 2 files changed, 4 insertions(+)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/raid5.c	2005-11-28 10:08:33.000000000 +1100
@@ -1704,7 +1704,9 @@ static void raid5d (mddev_t *mddev)
 
 		if (conf->seq_flush - conf->seq_write > 0) {
 			int seq = conf->seq_flush;
+			spin_unlock_irq(&conf->device_lock);
 			bitmap_unplug(mddev->bitmap);
+			spin_lock_irq(&conf->device_lock);
 			conf->seq_write = seq;
 			activate_bit_delay(conf);
 		}

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:08:33.000000000 +1100
@@ -1784,7 +1784,9 @@ static void raid6d (mddev_t *mddev)
 
 		if (conf->seq_flush - conf->seq_write > 0) {
 			int seq = conf->seq_flush;
+			spin_unlock_irq(&conf->device_lock);
 			bitmap_unplug(mddev->bitmap);
+			spin_lock_irq(&conf->device_lock);
 			conf->seq_write = seq;
 			activate_bit_delay(conf);
 		}


* [PATCH md 003 of 18] Fix problem with raid6 intent bitmap
From: NeilBrown @ 2005-11-27 23:39 UTC
  To: Andrew Morton; +Cc: linux-raid


When doing a recovery, we need to know whether the array will
still be degraded after the recovery has finished, so we know
whether bits can be cleared yet or not.  (If a device is still
missing afterwards, its blocks are not in sync, so the bits must
stay set until it is re-added and recovered.)  This patch
performs the required check.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid6main.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:08:33.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:09:27.000000000 +1100
@@ -1702,6 +1702,8 @@ static sector_t sync_request(mddev_t *md
 	int data_disks = raid_disks - 2;
 	sector_t max_sector = mddev->size << 1;
 	int sync_blocks;
+	int still_degraded = 0;
+	int i;
 
 	if (sector_nr >= max_sector) {
 		/* just being told to finish up .. nothing much to do */
@@ -1710,7 +1712,7 @@ static sector_t sync_request(mddev_t *md
 		if (mddev->curr_resync < max_sector) /* aborted */
 			bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
 					&sync_blocks, 1);
-		else /* compelted sync */
+		else /* completed sync */
 			conf->fullsync = 0;
 		bitmap_close_sync(mddev->bitmap);
 
@@ -1748,7 +1750,16 @@ static sector_t sync_request(mddev_t *md
 		 */
 		schedule_timeout_uninterruptible(1);
 	}
-	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, 0);
+	/* Need to check if array will still be degraded after recovery/resync
+	 * We don't need to check the 'failed' flag as when that gets set,
+	 * recovery aborts.
+	 */
+	for (i=0; i<mddev->raid_disks; i++)
+		if (conf->disks[i].rdev == NULL)
+			still_degraded = 1;
+
+	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
+
 	spin_lock(&sh->lock);
 	set_bit(STRIPE_SYNCING, &sh->state);
 	clear_bit(STRIPE_INSYNC, &sh->state);


* [PATCH md 004 of 18] Set default_bitmap_offset properly in set_array_info
From: NeilBrown @ 2005-11-27 23:39 UTC
  To: Andrew Morton; +Cc: linux-raid


If an array is created using set_array_info, default_bitmap_offset
isn't set properly, meaning that an internal bitmap cannot be
hot-added until the array is stopped and re-assembled.
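
As a worked example of the value set below: MD_SB_BYTES is 4096, so
default_bitmap_offset becomes 4096 >> 9 = 8 sectors, i.e. an internal
bitmap placed immediately after the 4K v0.90 superblock.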


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/md.c	2005-11-28 10:10:54.000000000 +1100
@@ -1028,7 +1028,6 @@ static int super_1_validate(mddev_t *mdd
 		mddev->size = le64_to_cpu(sb->size)/2;
 		mddev->events = le64_to_cpu(sb->events);
 		mddev->bitmap_offset = 0;
-		mddev->default_bitmap_offset = 0;
 		mddev->default_bitmap_offset = 1024;
 		
 		mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
@@ -2932,6 +2931,9 @@ static int set_array_info(mddev_t * mdde
 
 	mddev->sb_dirty      = 1;
 
+	mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
+	mddev->bitmap_offset = 0;
+
 	/*
 	 * Generate a 128 bit UUID
 	 */


* [PATCH md 005 of 18] Fix --re-add for raid1 and raid6
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid


If you have an array with a write-intent-bitmap, and you remove a device,
then re-add it, a full recovery isn't needed.  We detect a re-add by looking
at saved_raid_disk.  For raid1, it doesn't matter which disk it was, only
whether or not it was an active device.
The old code being removed set a value of 'mirror' which was then ignored, 
so it can go.  The changed code performs the correct check.

For raid6, if there are two missing devices, make sure we choose the
right slot on --re-add rather than always the first slot.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c     |    8 ++++----
 ./drivers/md/raid6main.c |   10 ++++++++--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:11:52.000000000 +1100
@@ -953,9 +953,6 @@ static int raid1_add_disk(mddev_t *mddev
 	int mirror = 0;
 	mirror_info_t *p;
 
-	if (rdev->saved_raid_disk >= 0 &&
-	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
-		mirror = rdev->saved_raid_disk;
 	for (mirror=0; mirror < mddev->raid_disks; mirror++)
 		if ( !(p=conf->mirrors+mirror)->rdev) {
 
@@ -972,7 +969,10 @@ static int raid1_add_disk(mddev_t *mddev
 			p->head_position = 0;
 			rdev->raid_disk = mirror;
 			found = 1;
-			if (rdev->saved_raid_disk != mirror)
+			/* As all devices are equivalent, we don't need a full recovery
+			 * if this was recently any drive of the array
+			 */
+			if (rdev->saved_raid_disk < 0)
 				conf->fullsync = 1;
 			rcu_assign_pointer(p->rdev, rdev);
 			break;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:09:27.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:11:52.000000000 +1100
@@ -2158,9 +2158,15 @@ static int raid6_add_disk(mddev_t *mddev
 		/* no point adding a device */
 		return 0;
 	/*
-	 * find the disk ...
+	 * find the disk ... but prefer rdev->saved_raid_disk
+	 * if possible.
 	 */
-	for (disk=0; disk < mddev->raid_disks; disk++)
+	if (rdev->saved_raid_disk >= 0 &&
+	    conf->disks[rdev->saved_raid_disk].rdev == NULL)
+		disk = rdev->saved_raid_disk;
+	else
+		disk = 0;
+	for ( ; disk < mddev->raid_disks; disk++)
 		if ((p=conf->disks + disk)->rdev == NULL) {
 			clear_bit(In_sync, &rdev->flags);
 			rdev->raid_disk = disk;


* [PATCH md 006 of 18] Improve raid1 "IO Barrier" concept.
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid


raid1 needs to put up a barrier to new requests while it
does resync or other background recovery.
The code for this is currently open-coded, made slightly obscure
by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions,
and includes a comment which (hopefully) explains what is happening.
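
As a rough illustration, the pairing can be sketched in userspace
(a simplified pthreads analogue, not the kernel code: the kernel
versions use wait_event_lock_irq plus queue unplugging, and the
RESYNC_DEPTH throttle is omitted here).  Regular IO brackets itself
with wait_barrier()/allow_barrier(); resync/recovery brackets itself
with raise_barrier()/lower_barrier().

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t barrier_cond = PTHREAD_COND_INITIALIZER;
static int barrier, nr_pending, nr_waiting;

static void raise_barrier(void)	/* background IO (resync) starts */
{
	pthread_mutex_lock(&lock);
	while (nr_waiting)	/* let queued regular IO go first */
		pthread_cond_wait(&barrier_cond, &lock);
	barrier++;		/* block any new regular IO */
	while (nr_pending)	/* then drain IO already in flight */
		pthread_cond_wait(&barrier_cond, &lock);
	pthread_mutex_unlock(&lock);
}

static void lower_barrier(void)	/* background IO completes */
{
	pthread_mutex_lock(&lock);
	barrier--;
	pthread_mutex_unlock(&lock);
	pthread_cond_broadcast(&barrier_cond);
}

static void wait_barrier(void)	/* before each regular IO */
{
	pthread_mutex_lock(&lock);
	if (barrier) {
		nr_waiting++;
		while (barrier)
			pthread_cond_wait(&barrier_cond, &lock);
		nr_waiting--;
		pthread_cond_broadcast(&barrier_cond); /* nr_waiting fell */
	}
	nr_pending++;
	pthread_mutex_unlock(&lock);
}

static void allow_barrier(void)	/* after each regular IO */
{
	pthread_mutex_lock(&lock);
	nr_pending--;
	pthread_mutex_unlock(&lock);
	pthread_cond_broadcast(&barrier_cond);
}

int main(void)
{
	wait_barrier();		/* a regular IO starts ... */
	allow_barrier();	/* ... and completes */
	raise_barrier();	/* a resync window opens ... */
	lower_barrier();	/* ... and closes */
	return 0;
}

With this pairing, close_sync() in the patch reduces to wait_barrier()
followed immediately by allow_barrier(): it simply queues behind any
barrier that is up and then lets go.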


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c         |  169 ++++++++++++++++++++++---------------------
 ./include/linux/raid/raid1.h |    4 -
 2 files changed, 92 insertions(+), 81 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:11:52.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:12:17.000000000 +1100
@@ -51,6 +51,8 @@ static mdk_personality_t raid1_personali
 
 static void unplug_slaves(mddev_t *mddev);
 
+static void allow_barrier(conf_t *conf);
+static void lower_barrier(conf_t *conf);
 
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -160,20 +162,13 @@ static void put_all_bios(conf_t *conf, r
 
 static inline void free_r1bio(r1bio_t *r1_bio)
 {
-	unsigned long flags;
-
 	conf_t *conf = mddev_to_conf(r1_bio->mddev);
 
 	/*
 	 * Wake up any possible resync thread that waits for the device
 	 * to go idle.
 	 */
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	if (!--conf->nr_pending) {
-		wake_up(&conf->wait_idle);
-		wake_up(&conf->wait_resume);
-	}
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	allow_barrier(conf);
 
 	put_all_bios(conf, r1_bio);
 	mempool_free(r1_bio, conf->r1bio_pool);
@@ -182,22 +177,10 @@ static inline void free_r1bio(r1bio_t *r
 static inline void put_buf(r1bio_t *r1_bio)
 {
 	conf_t *conf = mddev_to_conf(r1_bio->mddev);
-	unsigned long flags;
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	if (!conf->barrier)
-		BUG();
-	--conf->barrier;
-	wake_up(&conf->wait_resume);
-	wake_up(&conf->wait_idle);
-
-	if (!--conf->nr_pending) {
-		wake_up(&conf->wait_idle);
-		wake_up(&conf->wait_resume);
-	}
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	lower_barrier(conf);
 }
 
 static void reschedule_retry(r1bio_t *r1_bio)
@@ -210,6 +193,7 @@ static void reschedule_retry(r1bio_t *r1
 	list_add(&r1_bio->retry_list, &conf->retry_list);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	wake_up(&conf->wait_barrier);
 	md_wakeup_thread(mddev->thread);
 }
 
@@ -592,30 +576,83 @@ static int raid1_issue_flush(request_que
 	return ret;
 }
 
-/*
- * Throttle resync depth, so that we can both get proper overlapping of
- * requests, but are still able to handle normal requests quickly.
+/* Barriers....
+ * Sometimes we need to suspend IO while we do something else,
+ * either some resync/recovery, or reconfigure the array.
+ * To do this we raise a 'barrier'.
+ * The 'barrier' is a counter that can be raised multiple times
+ * to count how many activities are happening which preclude
+ * normal IO.
+ * We can only raise the barrier if there is no pending IO.
+ * i.e. if nr_pending == 0.
+ * We choose only to raise the barrier if no-one is waiting for the
+ * barrier to go down.  This means that as soon as an IO request
+ * is ready, no other operations which require a barrier will start
+ * until the IO request has had a chance.
+ *
+ * So: regular IO calls 'wait_barrier'.  When that returns there
+ *    is no background IO happening.  It must arrange to call
+ *    allow_barrier when it has finished its IO.
+ * background IO calls must call raise_barrier.  Once that returns
+ *    there is no normal IO happening.  It must arrange to call
+ *    lower_barrier when the particular background IO completes.
  */
 #define RESYNC_DEPTH 32
 
-static void device_barrier(conf_t *conf, sector_t sect)
+static void raise_barrier(conf_t *conf)
 {
 	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_idle, !waitqueue_active(&conf->wait_resume),
-			    conf->resync_lock, raid1_unplug(conf->mddev->queue));
-	
-	if (!conf->barrier++) {
-		wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-				    conf->resync_lock, raid1_unplug(conf->mddev->queue));
-		if (conf->nr_pending)
-			BUG();
-	}
-	wait_event_lock_irq(conf->wait_resume, conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock, raid1_unplug(conf->mddev->queue));
-	conf->next_resync = sect;
+
+	/* Wait until no block IO is waiting */
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+			    conf->resync_lock,
+			    raid1_unplug(conf->mddev->queue));
+
+	/* block any new IO from starting */
+	conf->barrier++;
+
+	/* Now wait for all pending IO to complete */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+			    conf->resync_lock,
+			    raid1_unplug(conf->mddev->queue));
+
+	spin_unlock_irq(&conf->resync_lock);
+}
+
+static void lower_barrier(conf_t *conf)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&conf->resync_lock, flags);
+	conf->barrier--;
+	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	wake_up(&conf->wait_barrier);
+}
+
+static void wait_barrier(conf_t *conf)
+{
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->barrier) {
+		conf->nr_waiting++;
+		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
+				    conf->resync_lock,
+				    raid1_unplug(conf->mddev->queue));
+		conf->nr_waiting--;
+	}
+	conf->nr_pending++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+static void allow_barrier(conf_t *conf)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&conf->resync_lock, flags);
+	conf->nr_pending--;
+	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	wake_up(&conf->wait_barrier);
+}
+
+
 /* duplicate the data pages for behind I/O */
 static struct page **alloc_behind_pages(struct bio *bio)
 {
@@ -677,10 +714,7 @@ static int make_request(request_queue_t 
 	 */
 	md_write_start(mddev, bio); /* wait on superblock update early */
 
-	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_resume, !conf->barrier, conf->resync_lock, );
-	conf->nr_pending++;
-	spin_unlock_irq(&conf->resync_lock);
+	wait_barrier(conf);
 
 	disk_stat_inc(mddev->gendisk, ios[rw]);
 	disk_stat_add(mddev->gendisk, sectors[rw], bio_sectors(bio));
@@ -908,13 +942,8 @@ static void print_conf(conf_t *conf)
 
 static void close_sync(conf_t *conf)
 {
-	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_resume, !conf->barrier,
-			    conf->resync_lock, 	raid1_unplug(conf->mddev->queue));
-	spin_unlock_irq(&conf->resync_lock);
-
-	if (conf->barrier) BUG();
-	if (waitqueue_active(&conf->wait_idle)) BUG();
+	wait_barrier(conf);
+	allow_barrier(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
@@ -1316,12 +1345,16 @@ static sector_t sync_request(mddev_t *md
 		return sync_blocks;
 	}
 	/*
-	 * If there is non-resync activity waiting for us then
-	 * put in a delay to throttle resync.
+	 * If there is non-resync activity waiting for a turn,
+	 * and resync is going fast enough,
+	 * then let it through before starting on this new sync request.
 	 */
-	if (!go_faster && waitqueue_active(&conf->wait_resume))
+	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
-	device_barrier(conf, sector_nr + RESYNC_SECTORS);
+
+	raise_barrier(conf);
+
+	conf->next_resync = sector_nr;
 
 	/*
 	 * If reconstructing, and >1 working disc,
@@ -1354,10 +1387,6 @@ static sector_t sync_request(mddev_t *md
 
 	r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
 
-	spin_lock_irq(&conf->resync_lock);
-	conf->nr_pending++;
-	spin_unlock_irq(&conf->resync_lock);
-
 	r1_bio->mddev = mddev;
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
@@ -1541,8 +1570,7 @@ static int run(mddev_t *mddev)
 		mddev->recovery_cp = MaxSector;
 
 	spin_lock_init(&conf->resync_lock);
-	init_waitqueue_head(&conf->wait_idle);
-	init_waitqueue_head(&conf->wait_resume);
+	init_waitqueue_head(&conf->wait_barrier);
 
 	bio_list_init(&conf->pending_bio_list);
 	bio_list_init(&conf->flushing_bio_list);
@@ -1713,11 +1741,7 @@ static int raid1_reshape(mddev_t *mddev,
 	}
 	memset(newmirrors, 0, sizeof(struct mirror_info)*raid_disks);
 
-	spin_lock_irq(&conf->resync_lock);
-	conf->barrier++;
-	wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-			    conf->resync_lock, raid1_unplug(mddev->queue));
-	spin_unlock_irq(&conf->resync_lock);
+	raise_barrier(conf);
 
 	/* ok, everything is stopped */
 	oldpool = conf->r1bio_pool;
@@ -1737,12 +1761,7 @@ static int raid1_reshape(mddev_t *mddev,
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 
 	conf->last_used = 0; /* just make sure it is in-range */
-	spin_lock_irq(&conf->resync_lock);
-	conf->barrier--;
-	spin_unlock_irq(&conf->resync_lock);
-	wake_up(&conf->wait_resume);
-	wake_up(&conf->wait_idle);
-
+	lower_barrier(conf);
 
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
@@ -1757,18 +1776,10 @@ static void raid1_quiesce(mddev_t *mddev
 
 	switch(state) {
 	case 1:
-		spin_lock_irq(&conf->resync_lock);
-		conf->barrier++;
-		wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-				    conf->resync_lock, raid1_unplug(mddev->queue));
-		spin_unlock_irq(&conf->resync_lock);
+		raise_barrier(conf);
 		break;
 	case 0:
-		spin_lock_irq(&conf->resync_lock);
-		conf->barrier--;
-		spin_unlock_irq(&conf->resync_lock);
-		wake_up(&conf->wait_resume);
-		wake_up(&conf->wait_idle);
+		lower_barrier(conf);
 		break;
 	}
 	if (mddev->thread) {

diff ./include/linux/raid/raid1.h~current~ ./include/linux/raid/raid1.h
--- ./include/linux/raid/raid1.h~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./include/linux/raid/raid1.h	2005-11-28 10:12:17.000000000 +1100
@@ -45,6 +45,7 @@ struct r1_private_data_s {
 
 	spinlock_t		resync_lock;
 	int			nr_pending;
+	int			nr_waiting;
 	int			barrier;
 	sector_t		next_resync;
 	int			fullsync;  /* set to 1 if a full sync is needed,
@@ -52,8 +53,7 @@ struct r1_private_data_s {
 					    * Cleared when a sync completes.
 					    */
 
-	wait_queue_head_t	wait_idle;
-	wait_queue_head_t	wait_resume;
+	wait_queue_head_t	wait_barrier;
 
 	struct pool_info	*poolinfo;
 


* [PATCH md 007 of 18] Improve raid10 "IO Barrier" concept.
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid


raid10 needs to put up a barrier to new requests while it
does resync or other background recovery.
The code for this is currently open-coded, made slightly obscure
by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions,
and includes a comment which (hopefully) explains what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c         |  135 ++++++++++++++++++++++++------------------
 ./include/linux/raid/raid10.h |    4 -
 2 files changed, 81 insertions(+), 58 deletions(-)

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-11-28 10:08:26.000000000 +1100
+++ ./drivers/md/raid10.c	2005-11-28 10:12:24.000000000 +1100
@@ -47,6 +47,9 @@
 
 static void unplug_slaves(mddev_t *mddev);
 
+static void allow_barrier(conf_t *conf);
+static void lower_barrier(conf_t *conf);
+
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	conf_t *conf = data;
@@ -175,20 +178,13 @@ static void put_all_bios(conf_t *conf, r
 
 static inline void free_r10bio(r10bio_t *r10_bio)
 {
-	unsigned long flags;
-
 	conf_t *conf = mddev_to_conf(r10_bio->mddev);
 
 	/*
 	 * Wake up any possible resync thread that waits for the device
 	 * to go idle.
 	 */
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	if (!--conf->nr_pending) {
-		wake_up(&conf->wait_idle);
-		wake_up(&conf->wait_resume);
-	}
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	allow_barrier(conf);
 
 	put_all_bios(conf, r10_bio);
 	mempool_free(r10_bio, conf->r10bio_pool);
@@ -197,22 +193,10 @@ static inline void free_r10bio(r10bio_t 
 static inline void put_buf(r10bio_t *r10_bio)
 {
 	conf_t *conf = mddev_to_conf(r10_bio->mddev);
-	unsigned long flags;
 
 	mempool_free(r10_bio, conf->r10buf_pool);
 
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	if (!conf->barrier)
-		BUG();
-	--conf->barrier;
-	wake_up(&conf->wait_resume);
-	wake_up(&conf->wait_idle);
-
-	if (!--conf->nr_pending) {
-		wake_up(&conf->wait_idle);
-		wake_up(&conf->wait_resume);
-	}
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	lower_barrier(conf);
 }
 
 static void reschedule_retry(r10bio_t *r10_bio)
@@ -640,30 +624,82 @@ static int raid10_issue_flush(request_qu
 	return ret;
 }
 
-/*
- * Throttle resync depth, so that we can both get proper overlapping of
- * requests, but are still able to handle normal requests quickly.
+/* Barriers....
+ * Sometimes we need to suspend IO while we do something else,
+ * either some resync/recovery, or reconfigure the array.
+ * To do this we raise a 'barrier'.
+ * The 'barrier' is a counter that can be raised multiple times
+ * to count how many activities are happening which preclude
+ * normal IO.
+ * We can only raise the barrier if there is no pending IO.
+ * i.e. if nr_pending == 0.
+ * We choose only to raise the barrier if no-one is waiting for the
+ * barrier to go down.  This means that as soon as an IO request
+ * is ready, no other operations which require a barrier will start
+ * until the IO request has had a chance.
+ *
+ * So: regular IO calls 'wait_barrier'.  When that returns there
+ *    is no background IO happening.  It must arrange to call
+ *    allow_barrier when it has finished its IO.
+ * background IO calls must call raise_barrier.  Once that returns
+ *    there is no normal IO happening.  It must arrange to call
+ *    lower_barrier when the particular background IO completes.
  */
 #define RESYNC_DEPTH 32
 
-static void device_barrier(conf_t *conf, sector_t sect)
+static void raise_barrier(conf_t *conf)
 {
 	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_idle, !waitqueue_active(&conf->wait_resume),
-			    conf->resync_lock, unplug_slaves(conf->mddev));
 
-	if (!conf->barrier++) {
-		wait_event_lock_irq(conf->wait_idle, !conf->nr_pending,
-				    conf->resync_lock, unplug_slaves(conf->mddev));
-		if (conf->nr_pending)
-			BUG();
-	}
-	wait_event_lock_irq(conf->wait_resume, conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock, unplug_slaves(conf->mddev));
-	conf->next_resync = sect;
+	/* Wait until no block IO is waiting */
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+			    conf->resync_lock,
+			    raid10_unplug(conf->mddev->queue));
+
+	/* block any new IO from starting */
+	conf->barrier++;
+
+	/* Now wait for all pending IO to complete */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+			    conf->resync_lock,
+			    raid10_unplug(conf->mddev->queue));
+
+	spin_unlock_irq(&conf->resync_lock);
+}
+
+static void lower_barrier(conf_t *conf)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&conf->resync_lock, flags);
+	conf->barrier--;
+	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	wake_up(&conf->wait_barrier);
+}
+
+static void wait_barrier(conf_t *conf)
+{
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->barrier) {
+		conf->nr_waiting++;
+		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
+				    conf->resync_lock,
+				    raid10_unplug(conf->mddev->queue));
+		conf->nr_waiting--;
+	}
+	conf->nr_pending++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+static void allow_barrier(conf_t *conf)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&conf->resync_lock, flags);
+	conf->nr_pending--;
+	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	wake_up(&conf->wait_barrier);
+}
+
 static int make_request(request_queue_t *q, struct bio * bio)
 {
 	mddev_t *mddev = q->queuedata;
@@ -719,10 +755,7 @@ static int make_request(request_queue_t 
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
 	 */
-	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_resume, !conf->barrier, conf->resync_lock, );
-	conf->nr_pending++;
-	spin_unlock_irq(&conf->resync_lock);
+	wait_barrier(conf);
 
 	disk_stat_inc(mddev->gendisk, ios[rw]);
 	disk_stat_add(mddev->gendisk, sectors[rw], bio_sectors(bio));
@@ -897,13 +930,8 @@ static void print_conf(conf_t *conf)
 
 static void close_sync(conf_t *conf)
 {
-	spin_lock_irq(&conf->resync_lock);
-	wait_event_lock_irq(conf->wait_resume, !conf->barrier,
-			    conf->resync_lock, 	unplug_slaves(conf->mddev));
-	spin_unlock_irq(&conf->resync_lock);
-
-	if (conf->barrier) BUG();
-	if (waitqueue_active(&conf->wait_idle)) BUG();
+	wait_barrier(conf);
+	allow_barrier(conf);
 
 	mempool_destroy(conf->r10buf_pool);
 	conf->r10buf_pool = NULL;
@@ -1395,9 +1423,10 @@ static sector_t sync_request(mddev_t *md
 	 * If there is non-resync activity waiting for us then
 	 * put in a delay to throttle resync.
 	 */
-	if (!go_faster && waitqueue_active(&conf->wait_resume))
+	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
-	device_barrier(conf, sector_nr + RESYNC_SECTORS);
+	raise_barrier(conf);
+	conf->next_resync = sector_nr;
 
 	/* Again, very different code for resync and recovery.
 	 * Both must result in an r10bio with a list of bios that
@@ -1427,7 +1456,6 @@ static sector_t sync_request(mddev_t *md
 
 				r10_bio = mempool_alloc(conf->r10buf_pool, GFP_NOIO);
 				spin_lock_irq(&conf->resync_lock);
-				conf->nr_pending++;
 				if (rb2) conf->barrier++;
 				spin_unlock_irq(&conf->resync_lock);
 				atomic_set(&r10_bio->remaining, 0);
@@ -1500,10 +1528,6 @@ static sector_t sync_request(mddev_t *md
 		int count = 0;
 		r10_bio = mempool_alloc(conf->r10buf_pool, GFP_NOIO);
 
-		spin_lock_irq(&conf->resync_lock);
-		conf->nr_pending++;
-		spin_unlock_irq(&conf->resync_lock);
-
 		r10_bio->mddev = mddev;
 		atomic_set(&r10_bio->remaining, 0);
 
@@ -1713,8 +1737,7 @@ static int run(mddev_t *mddev)
 	INIT_LIST_HEAD(&conf->retry_list);
 
 	spin_lock_init(&conf->resync_lock);
-	init_waitqueue_head(&conf->wait_idle);
-	init_waitqueue_head(&conf->wait_resume);
+	init_waitqueue_head(&conf->wait_barrier);
 
 	/* need to check that every block has at least one working mirror */
 	if (!enough(conf)) {

diff ./include/linux/raid/raid10.h~current~ ./include/linux/raid/raid10.h
--- ./include/linux/raid/raid10.h~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./include/linux/raid/raid10.h	2005-11-28 10:12:24.000000000 +1100
@@ -39,11 +39,11 @@ struct r10_private_data_s {
 
 	spinlock_t		resync_lock;
 	int nr_pending;
+	int nr_waiting;
 	int barrier;
 	sector_t		next_resync;
 
-	wait_queue_head_t	wait_idle;
-	wait_queue_head_t	wait_resume;
+	wait_queue_head_t	wait_barrier;
 
 	mempool_t *r10bio_pool;
 	mempool_t *r10buf_pool;


* [PATCH md 008 of 18] Small cleanups for raid5
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid


Resync code:
  A test that isn't needed,
  a 'compute_block' that makes more sense elsewhere
    (and then doesn't need a test),
  a couple of BUG_ONs to confirm the change makes sense.

Printks:
  A few were missing a KERN_* level.

Also fix a typo in a comment.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |   41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-11-28 10:08:33.000000000 +1100
+++ ./drivers/md/raid5.c	2005-11-28 10:12:30.000000000 +1100
@@ -416,7 +416,7 @@ static int raid5_end_read_request(struct
 		set_bit(R5_UPTODATE, &sh->dev[i].flags);
 #endif
 		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
-			printk("R5: read error corrected!!\n");
+			printk(KERN_INFO "raid5: read error corrected!!\n");
 			clear_bit(R5_ReadError, &sh->dev[i].flags);
 			clear_bit(R5_ReWrite, &sh->dev[i].flags);
 		}
@@ -427,13 +427,14 @@ static int raid5_end_read_request(struct
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
 		atomic_inc(&conf->disks[i].rdev->read_errors);
 		if (conf->mddev->degraded)
-			printk("R5: read error not correctable.\n");
+			printk(KERN_WARNING "raid5: read error not correctable.\n");
 		else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
 			/* Oh, no!!! */
-			printk("R5: read error NOT corrected!!\n");
+			printk(KERN_WARNING "raid5: read error NOT corrected!!\n");
 		else if (atomic_read(&conf->disks[i].rdev->read_errors)
 			 > conf->max_nr_stripes)
-			printk("raid5: Too many read errors, failing device.\n");
+			printk(KERN_WARNING
+			       "raid5: Too many read errors, failing device.\n");
 		else
 			retry = 1;
 		if (retry)
@@ -603,7 +604,7 @@ static sector_t raid5_compute_sector(sec
 			*dd_idx = (*pd_idx + 1 + *dd_idx) % raid_disks;
 			break;
 		default:
-			printk("raid5: unsupported algorithm %d\n",
+			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
 				conf->algorithm);
 	}
 
@@ -644,7 +645,7 @@ static sector_t compute_blocknr(struct s
 			i -= (sh->pd_idx + 1);
 			break;
 		default:
-			printk("raid5: unsupported algorithm %d\n",
+			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
 				conf->algorithm);
 	}
 
@@ -653,7 +654,7 @@ static sector_t compute_blocknr(struct s
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
-		printk("compute_blocknr: map not correct\n");
+		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
 	}
 	return r_sector;
@@ -736,7 +737,7 @@ static void compute_block(struct stripe_
 		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
 			ptr[count++] = p;
 		else
-			printk("compute_block() %d, stripe %llu, %d"
+			printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
 				" not present\n", dd_idx,
 				(unsigned long long)sh->sector, i);
 
@@ -1004,7 +1005,7 @@ static void handle_stripe(struct stripe_
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
-			/* The ReadError flag wil just be confusing now */
+			/* The ReadError flag will just be confusing now */
 			clear_bit(R5_ReadError, &dev->flags);
 			clear_bit(R5_ReWrite, &dev->flags);
 		}
@@ -1287,7 +1288,7 @@ static void handle_stripe(struct stripe_
 	 * is available
 	 */
 	if (syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
+	    !test_bit(STRIPE_INSYNC, &sh->state)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		if (failed == 0) {
 			char *pagea;
@@ -1305,21 +1306,20 @@ static void handle_stripe(struct stripe_
 				if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
 					/* don't try to repair!! */
 					set_bit(STRIPE_INSYNC, &sh->state);
+				else {
+					compute_block(sh, sh->pd_idx);
+					uptodate++;
+				}
 			}
 		}
 		if (!test_bit(STRIPE_INSYNC, &sh->state)) {
+			/* either failed parity check, or recovery is happening */
 			if (failed==0)
 				failed_num = sh->pd_idx;
-			/* should be able to compute the missing block and write it to spare */
-			if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
-				if (uptodate+1 != disks)
-					BUG();
-				compute_block(sh, failed_num);
-				uptodate++;
-			}
-			if (uptodate != disks)
-				BUG();
 			dev = &sh->dev[failed_num];
+			BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
+			BUG_ON(uptodate != disks);
+
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
 			clear_bit(STRIPE_DEGRADED, &sh->state);
@@ -1821,7 +1821,8 @@ static int run(mddev_t *mddev)
 	struct list_head *tmp;
 
 	if (mddev->level != 5 && mddev->level != 4) {
-		printk("raid5: %s: raid level not set to 4/5 (%d)\n", mdname(mddev), mddev->level);
+		printk(KERN_ERR "raid5: %s: raid level not set to 4/5 (%d)\n",
+		       mdname(mddev), mddev->level);
 		return -EIO;
 	}
 


* [PATCH md 010 of 18] Move bitmap_create to after md array has been initialised.
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid


This is important because bitmap_create uses
  mddev->resync_max_sectors
and that doesn't have a valid value until after the array
has been initialised (with pers->run()).
[It doesn't make a difference for current personalities that
 support bitmaps, but it will make a difference for raid10.]

This has the added advantage that we can move the
thread->timeout manipulation inside the bitmap.c code instead
of sprinkling identical code throughout all personalities.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c    |    4 ++++
 ./drivers/md/md.c        |   16 +++++++++-------
 ./drivers/md/raid1.c     |    8 --------
 ./drivers/md/raid5.c     |   11 +----------
 ./drivers/md/raid6main.c |   11 +----------
 5 files changed, 15 insertions(+), 35 deletions(-)

diff ./drivers/md/bitmap.c~current~ ./drivers/md/bitmap.c
--- ./drivers/md/bitmap.c~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./drivers/md/bitmap.c	2005-11-28 10:12:40.000000000 +1100
@@ -1530,6 +1530,8 @@ void bitmap_destroy(mddev_t *mddev)
 		return;
 
 	mddev->bitmap = NULL; /* disconnect from the md device */
+	if (mddev->thread)
+		mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
 
 	bitmap_free(bitmap);
 }
@@ -1636,6 +1638,8 @@ int bitmap_create(mddev_t *mddev)
 
 	if (IS_ERR(bitmap->writeback_daemon))
 		return PTR_ERR(bitmap->writeback_daemon);
+	mddev->thread->timeout = bitmap->daemon_sleep * HZ;
+
 	return bitmap_update_sb(bitmap);
 
  error:

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-11-28 10:12:33.000000000 +1100
+++ ./drivers/md/md.c	2005-11-28 10:12:40.000000000 +1100
@@ -2054,13 +2054,15 @@ static int do_md_run(mddev_t * mddev)
 	if (start_readonly)
 		mddev->ro = 2; /* read-only, but switch on first write */
 
-	/* before we start the array running, initialise the bitmap */
-	err = bitmap_create(mddev);
-	if (err)
-		printk(KERN_ERR "%s: failed to create bitmap (%d)\n",
-			mdname(mddev), err);
-	else
-		err = mddev->pers->run(mddev);
+	err = mddev->pers->run(mddev);
+	if (!err && mddev->pers->sync_request) {
+		err = bitmap_create(mddev);
+		if (err) {
+			printk(KERN_ERR "%s: failed to create bitmap (%d)\n",
+			       mdname(mddev), err);
+			mddev->pers->stop(mddev);
+		}
+	}
 	if (err) {
 		printk(KERN_ERR "md: pers->run() failed ...\n");
 		module_put(mddev->pers->owner);

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:12:17.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:12:40.000000000 +1100
@@ -1610,7 +1610,6 @@ static int run(mddev_t *mddev)
 		       mdname(mddev));
 		goto out_free_conf;
 	}
-	if (mddev->bitmap) mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
 
 	printk(KERN_INFO 
 		"raid1: raid set %s active with %d out of %d mirrors\n",
@@ -1782,13 +1781,6 @@ static void raid1_quiesce(mddev_t *mddev
 		lower_barrier(conf);
 		break;
 	}
-	if (mddev->thread) {
-		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
-		else
-			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
-		md_wakeup_thread(mddev->thread);
-	}
 }
 
 

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-11-28 10:12:33.000000000 +1100
+++ ./drivers/md/raid5.c	2005-11-28 10:12:40.000000000 +1100
@@ -1963,9 +1963,6 @@ memory = conf->max_nr_stripes * (sizeof(
 	/* Ok, everything is just fine now */
 	sysfs_create_group(&mddev->kobj, &raid5_attrs_group);
 
-	if (mddev->bitmap)
-		mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
-
 	mddev->queue->unplug_fn = raid5_unplug_device;
 	mddev->queue->issue_flush_fn = raid5_issue_flush;
 
@@ -2199,14 +2196,8 @@ static void raid5_quiesce(mddev_t *mddev
 		spin_unlock_irq(&conf->device_lock);
 		break;
 	}
-	if (mddev->thread) {
-		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
-		else
-			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
-		md_wakeup_thread(mddev->thread);
-	}
 }
+
 static mdk_personality_t raid5_personality=
 {
 	.name		= "raid5",

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:12:33.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:12:40.000000000 +1100
@@ -1990,9 +1990,6 @@ static int run(mddev_t *mddev)
 	/* Ok, everything is just fine now */
 	mddev->array_size =  mddev->size * (mddev->raid_disks - 2);
 
-	if (mddev->bitmap)
-		mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
-
 	mddev->queue->unplug_fn = raid6_unplug_device;
 	mddev->queue->issue_flush_fn = raid6_issue_flush;
 	return 0;
@@ -2228,14 +2225,8 @@ static void raid6_quiesce(mddev_t *mddev
 		spin_unlock_irq(&conf->device_lock);
 		break;
 	}
-	if (mddev->thread) {
-		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
-		else
-			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
-		md_wakeup_thread(mddev->thread);
-	}
 }
+
 static mdk_personality_t raid6_personality=
 {
 	.name		= "raid6",


* [PATCH md 011 of 18] Write intent bitmap support for raid10
From: NeilBrown @ 2005-11-27 23:40 UTC
  To: Andrew Morton; +Cc: linux-raid



Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c             |   10 +-
 ./drivers/md/raid10.c         |  178 +++++++++++++++++++++++++++++++++++++-----
 ./include/linux/raid/raid10.h |    9 +-
 3 files changed, 171 insertions(+), 26 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-11-28 10:12:40.000000000 +1100
+++ ./drivers/md/md.c	2005-11-28 10:12:52.000000000 +1100
@@ -714,9 +714,10 @@ static int super_90_validate(mddev_t *md
 
 		if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
 		    mddev->bitmap_file == NULL) {
-			if (mddev->level != 1 && mddev->level != 5 && mddev->level != 6) {
+			if (mddev->level != 1 && mddev->level != 5 && mddev->level != 6
+			    && mddev->level != 10) {
 				/* FIXME use a better test */
-				printk(KERN_WARNING "md: bitmaps only support for raid1\n");
+				printk(KERN_WARNING "md: bitmaps not supported for this level.\n");
 				return -EINVAL;
 			}
 			mddev->bitmap_offset = mddev->default_bitmap_offset;
@@ -1037,8 +1038,9 @@ static int super_1_validate(mddev_t *mdd
 
 		if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
 		    mddev->bitmap_file == NULL ) {
-			if (mddev->level != 1) {
-				printk(KERN_WARNING "md: bitmaps only supported for raid1\n");
+			if (mddev->level != 1 && mddev->level != 5 && mddev->level != 6
+			    && mddev->level != 10) {
+				printk(KERN_WARNING "md: bitmaps not supported for this level.\n");
 				return -EINVAL;
 			}
 			mddev->bitmap_offset = (__s32)le32_to_cpu(sb->bitmap_offset);

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-11-28 10:12:24.000000000 +1100
+++ ./drivers/md/raid10.c	2005-11-28 10:12:52.000000000 +1100
@@ -18,7 +18,9 @@
  * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
+#include "dm-bio-list.h"
 #include <linux/raid/raid10.h>
+#include <linux/raid/bitmap.h>
 
 /*
  * RAID10 provides a combination of RAID0 and RAID1 functionality.
@@ -306,9 +308,11 @@ static int raid10_end_write_request(stru
 	/*
 	 * this branch is our 'one mirror IO has finished' event handler:
 	 */
-	if (!uptodate)
+	if (!uptodate) {
 		md_error(r10_bio->mddev, conf->mirrors[dev].rdev);
-	else
+		/* an I/O failed, we can't clear the bitmap */
+		set_bit(R10BIO_Degraded, &r10_bio->state);
+	} else
 		/*
 		 * Set R10BIO_Uptodate in our master bio, so that
 		 * we will return a good error code for to the higher
@@ -328,6 +332,11 @@ static int raid10_end_write_request(stru
 	 * already.
 	 */
 	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		/* clear the bitmap if all writes complete successfully */
+		bitmap_endwrite(r10_bio->mddev->bitmap, r10_bio->sector,
+				r10_bio->sectors,
+				!test_bit(R10BIO_Degraded, &r10_bio->state),
+				0);
 		md_write_end(r10_bio->mddev);
 		raid_end_bio_io(r10_bio);
 	}
@@ -486,8 +495,9 @@ static int read_balance(conf_t *conf, r1
 	rcu_read_lock();
 	/*
 	 * Check if we can balance. We can balance on the whole
-	 * device if no resync is going on, or below the resync window.
-	 * We take the first readable disk when above the resync window.
+	 * device if no resync is going on (recovery is ok), or below
+	 * the resync window. We take the first readable disk when
+	 * above the resync window.
 	 */
 	if (conf->mddev->recovery_cp < MaxSector
 	    && (this_sector + sectors >= conf->next_resync)) {
@@ -591,7 +601,10 @@ static void unplug_slaves(mddev_t *mddev
 
 static void raid10_unplug(request_queue_t *q)
 {
+	mddev_t *mddev = q->queuedata;
+
 	unplug_slaves(q->queuedata);
+	md_wakeup_thread(mddev->thread);
 }
 
 static int raid10_issue_flush(request_queue_t *q, struct gendisk *disk,
@@ -647,12 +660,13 @@ static int raid10_issue_flush(request_qu
  */
 #define RESYNC_DEPTH 32
 
-static void raise_barrier(conf_t *conf)
+static void raise_barrier(conf_t *conf, int force)
 {
+	BUG_ON(force && !conf->barrier);
 	spin_lock_irq(&conf->resync_lock);
 
-	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	/* Wait until no block IO is waiting (unless 'force') */
+	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
 			    conf->resync_lock,
 			    raid10_unplug(conf->mddev->queue));
 
@@ -710,6 +724,8 @@ static int make_request(request_queue_t 
 	int i;
 	int chunk_sects = conf->chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
+	struct bio_list bl;
+	unsigned long flags;
 
 	if (unlikely(bio_barrier(bio))) {
 		bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
@@ -767,6 +783,7 @@ static int make_request(request_queue_t 
 
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_sector;
+	r10_bio->state = 0;
 
 	if (rw == READ) {
 		/*
@@ -811,13 +828,16 @@ static int make_request(request_queue_t 
 		    !test_bit(Faulty, &rdev->flags)) {
 			atomic_inc(&rdev->nr_pending);
 			r10_bio->devs[i].bio = bio;
-		} else
+		} else {
 			r10_bio->devs[i].bio = NULL;
+			set_bit(R10BIO_Degraded, &r10_bio->state);
+		}
 	}
 	rcu_read_unlock();
 
-	atomic_set(&r10_bio->remaining, 1);
+	atomic_set(&r10_bio->remaining, 0);
 
+	bio_list_init(&bl);
 	for (i = 0; i < conf->copies; i++) {
 		struct bio *mbio;
 		int d = r10_bio->devs[i].devnum;
@@ -835,13 +855,14 @@ static int make_request(request_queue_t 
 		mbio->bi_private = r10_bio;
 
 		atomic_inc(&r10_bio->remaining);
-		generic_make_request(mbio);
+		bio_list_add(&bl, mbio);
 	}
 
-	if (atomic_dec_and_test(&r10_bio->remaining)) {
-		md_write_end(mddev);
-		raid_end_bio_io(r10_bio);
-	}
+	bitmap_startwrite(mddev->bitmap, bio->bi_sector, r10_bio->sectors, 0);
+	spin_lock_irqsave(&conf->device_lock, flags);
+	bio_list_merge(&conf->pending_bio_list, &bl);
+	blk_plug_device(mddev->queue);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	return 0;
 }
@@ -999,7 +1020,12 @@ static int raid10_add_disk(mddev_t *mdde
 	if (!enough(conf))
 		return 0;
 
-	for (mirror=0; mirror < mddev->raid_disks; mirror++)
+	if (rdev->saved_raid_disk >= 0 &&
+	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
+		mirror = rdev->saved_raid_disk;
+	else
+		mirror = 0;
+	for ( ; mirror < mddev->raid_disks; mirror++)
 		if ( !(p=conf->mirrors+mirror)->rdev) {
 
 			blk_queue_stack_limits(mddev->queue,
@@ -1015,6 +1041,8 @@ static int raid10_add_disk(mddev_t *mdde
 			p->head_position = 0;
 			rdev->raid_disk = mirror;
 			found = 1;
+			if (rdev->saved_raid_disk != mirror)
+				conf->fullsync = 1;
 			rcu_assign_pointer(p->rdev, rdev);
 			break;
 		}
@@ -1282,6 +1310,26 @@ static void raid10d(mddev_t *mddev)
 	for (;;) {
 		char b[BDEVNAME_SIZE];
 		spin_lock_irqsave(&conf->device_lock, flags);
+
+		if (conf->pending_bio_list.head) {
+			bio = bio_list_get(&conf->pending_bio_list);
+			blk_remove_plug(mddev->queue);
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+			if (bitmap_unplug(mddev->bitmap) != 0)
+				printk("%s: bitmap file write failed!\n", mdname(mddev));
+
+			while (bio) { /* submit pending writes */
+				struct bio *next = bio->bi_next;
+				bio->bi_next = NULL;
+				generic_make_request(bio);
+				bio = next;
+			}
+			unplug = 1;
+
+			continue;
+		}
+
 		if (list_empty(head))
 			break;
 		r10_bio = list_entry(head->prev, r10bio_t, retry_list);
@@ -1388,6 +1436,8 @@ static sector_t sync_request(mddev_t *md
 	sector_t max_sector, nr_sectors;
 	int disk;
 	int i;
+	int max_sync;
+	int sync_blocks;
 
 	sector_t sectors_skipped = 0;
 	int chunks_skipped = 0;
@@ -1401,6 +1451,29 @@ static sector_t sync_request(mddev_t *md
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 		max_sector = mddev->resync_max_sectors;
 	if (sector_nr >= max_sector) {
+		/* If we aborted, we need to abort the
+		 * sync on the 'current' bitmap chunks (there can
+		 * be several when recovering multiple devices),
+		 * as we may have started syncing them but not finished.
+		 * We can find the current address in
+		 * mddev->curr_resync, but for recovery,
+		 * we need to convert that to several
+		 * virtual addresses.
+		 */
+		if (mddev->curr_resync < max_sector) { /* aborted */
+			if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+				bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
+						&sync_blocks, 1);
+			else for (i=0; i<conf->raid_disks; i++) {
+				sector_t sect =
+					raid10_find_virt(conf, mddev->curr_resync, i);
+				bitmap_end_sync(mddev->bitmap, sect,
+						&sync_blocks, 1);
+			}
+		} else /* completed sync */
+			conf->fullsync = 0;
+
+		bitmap_close_sync(mddev->bitmap);
 		close_sync(conf);
 		*skipped = 1;
 		return sectors_skipped;
@@ -1425,8 +1498,6 @@ static sector_t sync_request(mddev_t *md
 	 */
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
-	raise_barrier(conf);
-	conf->next_resync = sector_nr;
 
 	/* Again, very different code for resync and recovery.
 	 * Both must result in an r10bio with a list of bios that
@@ -1443,6 +1514,7 @@ static sector_t sync_request(mddev_t *md
 	 * end_sync_write if we will want to write.
 	 */
 
+	max_sync = RESYNC_PAGES << (PAGE_SHIFT-9);
 	if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 		/* recovery... the complicated one */
 		int i, j, k;
@@ -1451,13 +1523,29 @@ static sector_t sync_request(mddev_t *md
 		for (i=0 ; i<conf->raid_disks; i++)
 			if (conf->mirrors[i].rdev &&
 			    !test_bit(In_sync, &conf->mirrors[i].rdev->flags)) {
+				int still_degraded = 0;
 				/* want to reconstruct this device */
 				r10bio_t *rb2 = r10_bio;
+				sector_t sect = raid10_find_virt(conf, sector_nr, i);
+				int must_sync;
+				/* Unless we are doing a full sync, we only need
+				 * to recover the block if it is set in the bitmap
+				 */
+				must_sync = bitmap_start_sync(mddev->bitmap, sect,
+							      &sync_blocks, 1);
+				if (sync_blocks < max_sync)
+					max_sync = sync_blocks;
+				if (!must_sync &&
+				    !conf->fullsync) {
+					/* yep, skip the sync_blocks here, but don't assume
+					 * that there will never be anything to do here
+					 */
+					chunks_skipped = -1;
+					continue;
+				}
 
 				r10_bio = mempool_alloc(conf->r10buf_pool, GFP_NOIO);
-				spin_lock_irq(&conf->resync_lock);
-				if (rb2) conf->barrier++;
-				spin_unlock_irq(&conf->resync_lock);
+				raise_barrier(conf, rb2 != NULL);
 				atomic_set(&r10_bio->remaining, 0);
 
 				r10_bio->master_bio = (struct bio*)rb2;
@@ -1465,8 +1553,21 @@ static sector_t sync_request(mddev_t *md
 					atomic_inc(&rb2->remaining);
 				r10_bio->mddev = mddev;
 				set_bit(R10BIO_IsRecover, &r10_bio->state);
-				r10_bio->sector = raid10_find_virt(conf, sector_nr, i);
+				r10_bio->sector = sect;
+
 				raid10_find_phys(conf, r10_bio);
+				/* Need to check if this section will still be
+				 * degraded
+				 */
+				for (j=0; j<conf->copies;j++) {
+					int d = r10_bio->devs[j].devnum;
+					if (conf->mirrors[d].rdev == NULL ||
+					    test_bit(Faulty, &conf->mirrors[d].rdev->flags))
+						still_degraded = 1;
+				}
+				must_sync = bitmap_start_sync(mddev->bitmap, sect,
+							      &sync_blocks, still_degraded);
+
 				for (j=0; j<conf->copies;j++) {
 					int d = r10_bio->devs[j].devnum;
 					if (conf->mirrors[d].rdev &&
@@ -1526,10 +1627,22 @@ static sector_t sync_request(mddev_t *md
 	} else {
 		/* resync. Schedule a read for every block at this virt offset */
 		int count = 0;
+
+		if (!bitmap_start_sync(mddev->bitmap, sector_nr,
+				       &sync_blocks, mddev->degraded) &&
+		    !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
+			/* We can skip this block */
+			*skipped = 1;
+			return sync_blocks + sectors_skipped;
+		}
+		if (sync_blocks < max_sync)
+			max_sync = sync_blocks;
 		r10_bio = mempool_alloc(conf->r10buf_pool, GFP_NOIO);
 
 		r10_bio->mddev = mddev;
 		atomic_set(&r10_bio->remaining, 0);
+		raise_barrier(conf, 0);
+		conf->next_resync = sector_nr;
 
 		r10_bio->master_bio = NULL;
 		r10_bio->sector = sector_nr;
@@ -1582,6 +1695,8 @@ static sector_t sync_request(mddev_t *md
 	}
 
 	nr_sectors = 0;
+	if (sector_nr + max_sync < max_sector)
+		max_sector = sector_nr + max_sync;
 	do {
 		struct page *page;
 		int len = PAGE_SIZE;
@@ -1821,6 +1936,26 @@ static int stop(mddev_t *mddev)
 	return 0;
 }
 
+static void raid10_quiesce(mddev_t *mddev, int state)
+{
+	conf_t *conf = mddev_to_conf(mddev);
+
+	switch(state) {
+	case 1:
+		raise_barrier(conf, 0);
+		break;
+	case 0:
+		lower_barrier(conf);
+		break;
+	}
+	if (mddev->thread) {
+		if (mddev->bitmap)
+			mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
+		else
+			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+		md_wakeup_thread(mddev->thread);
+	}
+}
 
 static mdk_personality_t raid10_personality =
 {
@@ -1835,6 +1970,7 @@ static mdk_personality_t raid10_personal
 	.hot_remove_disk= raid10_remove_disk,
 	.spare_active	= raid10_spare_active,
 	.sync_request	= sync_request,
+	.quiesce	= raid10_quiesce,
 };
 
 static int __init raid_init(void)

diff ./include/linux/raid/raid10.h~current~ ./include/linux/raid/raid10.h
--- ./include/linux/raid/raid10.h~current~	2005-11-28 10:12:24.000000000 +1100
+++ ./include/linux/raid/raid10.h	2005-11-28 10:12:52.000000000 +1100
@@ -35,13 +35,19 @@ struct r10_private_data_s {
 	sector_t chunk_mask;
 
 	struct list_head	retry_list;
-	/* for use when syncing mirrors: */
+	/* queue pending writes and submit them on unplug */
+	struct bio_list		pending_bio_list;
+
 
 	spinlock_t		resync_lock;
 	int nr_pending;
 	int nr_waiting;
 	int barrier;
 	sector_t		next_resync;
+	int			fullsync;  /* set to 1 if a full sync is needed,
+					    * (fresh device added).
+					    * Cleared when a sync completes.
+					    */
 
 	wait_queue_head_t	wait_barrier;
 
@@ -100,4 +106,5 @@ struct r10bio_s {
 #define	R10BIO_Uptodate	0
 #define	R10BIO_IsSync	1
 #define	R10BIO_IsRecover 2
+#define	R10BIO_Degraded 3
 #endif
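
The resync changes above are driven by the write-intent bitmap:
bitmap_start_sync() reports whether the region around the current sector
needs syncing and how many sectors that answer covers, and sync_request()
either skips the whole clean region or clamps the amount of work to
max_sync.  A rough userspace sketch of that skip/clamp walk
(bitmap_start_sync_toy() is a made-up stand-in for the real
bitmap_start_sync() in drivers/md/bitmap.c):

#include <stdio.h>

/* Hypothetical stand-in for bitmap_start_sync(): says whether the chunk
 * containing 'sector' is dirty, and how many sectors that verdict covers. */
static int bitmap_start_sync_toy(const unsigned char *dirty, long chunk_sects,
				 long sector, long *sync_blocks)
{
	long chunk = sector / chunk_sects;

	*sync_blocks = chunk_sects - (sector % chunk_sects);
	return dirty[chunk];
}

int main(void)
{
	const unsigned char dirty[4] = { 0, 1, 0, 0 };	/* only chunk 1 dirty */
	const long chunk_sects = 64, max_sync = 128;
	long sync_blocks;

	for (long sector = 0; sector < 4 * chunk_sects; ) {
		long step = max_sync;

		if (!bitmap_start_sync_toy(dirty, chunk_sects, sector,
					   &sync_blocks)) {
			/* clean region: the '*skipped = 1' path */
			printf("sector %3ld: skip %ld sectors\n",
			       sector, sync_blocks);
			sector += sync_blocks;
			continue;
		}
		if (sync_blocks < step)	/* clamp, as max_sync is clamped above */
			step = sync_blocks;
		printf("sector %3ld: sync %ld sectors\n", sector, step);
		sector += step;
	}
	return 0;
}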

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 012 of 18] Fix raid6 resync check/repair code.
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (9 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 011 of 18] Write intent bitmap support for raid10 NeilBrown
@ 2005-11-27 23:40 ` NeilBrown
  2005-11-27 23:40 ` [PATCH md 013 of 18] Improve handing of read errors with raid6 NeilBrown
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


raid6 currently does not check the P/Q syndromes when doing a resync;
it just calculates the correct values and writes them.
Doing the check can reduce writes (often to zero) for a resync,
and it is needed to properly implement the
  echo check > sync_action
operation.

This patch implements the appropriate checks and tidies up some related
code.

It also allows raid6 user-requested resync to bypass the intent bitmap.
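
As a hedged illustration of the check-before-write idea -- not the kernel
code (raid6 verifies P and Q syndromes; this sketch uses plain XOR parity,
with a scratch buffer standing in for the spare_page the patch adds):

#include <stdio.h>
#include <string.h>

#define STRIPE 16

static int mismatches;			/* cf. mddev->resync_mismatches */

static void xor_parity(unsigned char out[STRIPE],
		       unsigned char data[][STRIPE], int ndata)
{
	memset(out, 0, STRIPE);
	for (int d = 0; d < ndata; d++)
		for (int i = 0; i < STRIPE; i++)
			out[i] ^= data[d][i];
}

/* Returns 1 if the stored parity was wrong.  'repair' is 0 for
 * "echo check > sync_action" and 1 for a normal repair. */
static int check_stripe(unsigned char data[][STRIPE], int ndata,
			unsigned char parity[STRIPE], int repair)
{
	unsigned char tmp[STRIPE];	/* plays the role of spare_page */

	xor_parity(tmp, data, ndata);
	if (memcmp(tmp, parity, STRIPE) == 0)
		return 0;		/* in sync: no write needed */
	mismatches++;
	if (repair)
		memcpy(parity, tmp, STRIPE);	/* the corrective write */
	return 1;
}

int main(void)
{
	unsigned char data[2][STRIPE] = { "hello", "world" };
	unsigned char parity[STRIPE];

	xor_parity(parity, data, 2);
	parity[3] ^= 0xff;		/* inject a parity error */
	printf("mismatch found: %d (total %d)\n",
	       check_stripe(data, 2, parity, 1), mismatches);
	return 0;
}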

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid6main.c     |  184 +++++++++++++++++++++++++------------------
 ./include/linux/raid/raid5.h |    2 
 2 files changed, 109 insertions(+), 77 deletions(-)

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:12:40.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:12:56.000000000 +1100
@@ -805,7 +805,7 @@ static void compute_parity(struct stripe
 }
 
 /* Compute one missing block */
-static void compute_block_1(struct stripe_head *sh, int dd_idx)
+static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 {
 	raid6_conf_t *conf = sh->raid_conf;
 	int i, count, disks = conf->raid_disks;
@@ -821,7 +821,7 @@ static void compute_block_1(struct strip
 		compute_parity(sh, UPDATE_PARITY);
 	} else {
 		ptr[0] = page_address(sh->dev[dd_idx].page);
-		memset(ptr[0], 0, STRIPE_SIZE);
+		if (!nozero) memset(ptr[0], 0, STRIPE_SIZE);
 		count = 1;
 		for (i = disks ; i--; ) {
 			if (i == dd_idx || i == qd_idx)
@@ -838,7 +838,8 @@ static void compute_block_1(struct strip
 		}
 		if (count != 1)
 			xor_block(count, STRIPE_SIZE, ptr);
-		set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
+		if (!nozero) set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
+		else clear_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
 	}
 }
 
@@ -871,7 +872,7 @@ static void compute_block_2(struct strip
 			return;
 		} else {
 			/* We're missing D+Q; recompute D from P */
-			compute_block_1(sh, (dd_idx1 == qd_idx) ? dd_idx2 : dd_idx1);
+			compute_block_1(sh, (dd_idx1 == qd_idx) ? dd_idx2 : dd_idx1, 0);
 			compute_parity(sh, UPDATE_PARITY); /* Is this necessary? */
 			return;
 		}
@@ -982,6 +983,12 @@ static int add_stripe_bio(struct stripe_
 }
 
 
+static int page_is_zero(struct page *p)
+{
+	char *a = page_address(p);
+	return ((*(u32*)a) == 0 &&
+		memcmp(a, a+4, STRIPE_SIZE-4)==0);
+}
 /*
  * handle_stripe - do things to a stripe.
  *
@@ -1000,7 +1007,7 @@ static int add_stripe_bio(struct stripe_
  *
  */
 
-static void handle_stripe(struct stripe_head *sh)
+static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 {
 	raid6_conf_t *conf = sh->raid_conf;
 	int disks = conf->raid_disks;
@@ -1228,7 +1235,7 @@ static void handle_stripe(struct stripe_
 				if (uptodate == disks-1) {
 					PRINTK("Computing stripe %llu block %d\n",
 					       (unsigned long long)sh->sector, i);
-					compute_block_1(sh, i);
+					compute_block_1(sh, i, 0);
 					uptodate++;
 				} else if ( uptodate == disks-2 && failed >= 2 ) {
 					/* Computing 2-failure is *very* expensive; only do it if failed >= 2 */
@@ -1323,7 +1330,7 @@ static void handle_stripe(struct stripe_
 				/* We have failed blocks and need to compute them */
 				switch ( failed ) {
 				case 0:	BUG();
-				case 1: compute_block_1(sh, failed_num[0]); break;
+				case 1: compute_block_1(sh, failed_num[0], 0); break;
 				case 2: compute_block_2(sh, failed_num[0], failed_num[1]); break;
 				default: BUG();	/* This request should have been failed? */
 				}
@@ -1338,12 +1345,10 @@ static void handle_stripe(struct stripe_
 					       (unsigned long long)sh->sector, i);
 					locked++;
 					set_bit(R5_Wantwrite, &sh->dev[i].flags);
-#if 0 /**** FIX: I don't understand the logic here... ****/
-					if (!test_bit(R5_Insync, &sh->dev[i].flags)
-					    || ((i==pd_idx || i==qd_idx) && failed == 0)) /* FIX? */
-						set_bit(STRIPE_INSYNC, &sh->state);
-#endif
 				}
+			/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
+			set_bit(STRIPE_INSYNC, &sh->state);
+
 			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 				atomic_dec(&conf->preread_active_stripes);
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
@@ -1356,79 +1361,97 @@ static void handle_stripe(struct stripe_
 	 * Any reads will already have been scheduled, so we just see if enough data
 	 * is available
 	 */
-	if (syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 2) {
-		set_bit(STRIPE_HANDLE, &sh->state);
-#if 0 /* RAID-6: Don't support CHECK PARITY yet */
-		if (failed == 0) {
-			char *pagea;
-			if (uptodate != disks)
-				BUG();
-			compute_parity(sh, CHECK_PARITY);
-			uptodate--;
-			pagea = page_address(sh->dev[pd_idx].page);
-			if ((*(u32*)pagea) == 0 &&
-			    !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
-				/* parity is correct (on disc, not in buffer any more) */
-				set_bit(STRIPE_INSYNC, &sh->state);
-			}
-		}
-#endif
-		if (!test_bit(STRIPE_INSYNC, &sh->state)) {
-			int failed_needupdate[2];
-			struct r5dev *adev, *bdev;
-
-			if ( failed < 1 )
-				failed_num[0] = pd_idx;
-			if ( failed < 2 )
-				failed_num[1] = (failed_num[0] == qd_idx) ? pd_idx : qd_idx;
+	if (syncing && locked == 0 && !test_bit(STRIPE_INSYNC, &sh->state)) {
+		int update_p = 0, update_q = 0;
+		struct r5dev *dev;
 
-			failed_needupdate[0] = !test_bit(R5_UPTODATE, &sh->dev[failed_num[0]].flags);
-			failed_needupdate[1] = !test_bit(R5_UPTODATE, &sh->dev[failed_num[1]].flags);
+		set_bit(STRIPE_HANDLE, &sh->state);
 
-			PRINTK("sync: failed=%d num=%d,%d fnu=%u%u\n",
-			       failed, failed_num[0], failed_num[1], failed_needupdate[0], failed_needupdate[1]);
+		BUG_ON(failed>2);
+		BUG_ON(uptodate < disks);
+		/* Want to check and possibly repair P and Q.
+		 * However there could be one 'failed' device, in which
+		 * case we can only check one of them, possibly using the
+		 * other to generate missing data
+		 */
 
-#if 0  /* RAID-6: This code seems to require that CHECK_PARITY destroys the uptodateness of the parity */
-			/* should be able to compute the missing block(s) and write to spare */
-			if ( failed_needupdate[0] ^ failed_needupdate[1] ) {
-				if (uptodate+1 != disks)
-					BUG();
-				compute_block_1(sh, failed_needupdate[0] ? failed_num[0] : failed_num[1]);
-				uptodate++;
-			} else if ( failed_needupdate[0] & failed_needupdate[1] ) {
-				if (uptodate+2 != disks)
-					BUG();
-				compute_block_2(sh, failed_num[0], failed_num[1]);
-				uptodate += 2;
+		/* If !tmp_page, we cannot do the calculations,
+		 * but as we have set STRIPE_HANDLE, we will soon be called
+		 * by handle_stripe with a tmp_page - just wait until then.
+		 */
+		if (tmp_page) {
+			if (failed == q_failed) {
+				/* The only possible failed device holds 'Q', so it makes
+				 * sense to check P (If anything else were failed, we would
+				 * have used P to recreate it).
+				 */
+				compute_block_1(sh, pd_idx, 1);
+				if (!page_is_zero(sh->dev[pd_idx].page)) {
+					compute_block_1(sh,pd_idx,0);
+					update_p = 1;
+				}
+			}
+			if (!q_failed && failed < 2) {
+				/* q is not failed, and we didn't use it to generate
+				 * anything, so it makes sense to check it
+				 */
+				memcpy(page_address(tmp_page),
+				       page_address(sh->dev[qd_idx].page),
+				       STRIPE_SIZE);
+				compute_parity(sh, UPDATE_PARITY);
+				if (memcmp(page_address(tmp_page),
+					   page_address(sh->dev[qd_idx].page),
+					   STRIPE_SIZE)!= 0) {
+					clear_bit(STRIPE_INSYNC, &sh->state);
+					update_q = 1;
+				}
+			}
+			if (update_p || update_q) {
+				conf->mddev->resync_mismatches += STRIPE_SECTORS;
+				if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+					/* don't try to repair!! */
+					update_p = update_q = 0;
 			}
-#else
-			compute_block_2(sh, failed_num[0], failed_num[1]);
-			uptodate += failed_needupdate[0] + failed_needupdate[1];
-#endif
-
-			if (uptodate != disks)
-				BUG();
 
-			PRINTK("Marking for sync stripe %llu blocks %d,%d\n",
-			       (unsigned long long)sh->sector, failed_num[0], failed_num[1]);
+			/* now write out any block on a failed drive,
+			 * or P or Q if they need it
+			 */
+
+			if (failed == 2) {
+				dev = &sh->dev[failed_num[1]];
+				locked++;
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantwrite, &dev->flags);
+				set_bit(R5_Syncio, &dev->flags);
+			}
+			if (failed >= 1) {
+				dev = &sh->dev[failed_num[0]];
+				locked++;
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantwrite, &dev->flags);
+				set_bit(R5_Syncio, &dev->flags);
+			}
 
-			/**** FIX: Should we really do both of these unconditionally? ****/
-			adev = &sh->dev[failed_num[0]];
-			locked += !test_bit(R5_LOCKED, &adev->flags);
-			set_bit(R5_LOCKED, &adev->flags);
-			set_bit(R5_Wantwrite, &adev->flags);
-			bdev = &sh->dev[failed_num[1]];
-			locked += !test_bit(R5_LOCKED, &bdev->flags);
-			set_bit(R5_LOCKED, &bdev->flags);
+			if (update_p) {
+				dev = &sh->dev[pd_idx];
+				locked ++;
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantwrite, &dev->flags);
+				set_bit(R5_Syncio, &dev->flags);
+			}
+			if (update_q) {
+				dev = &sh->dev[qd_idx];
+				locked++;
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantwrite, &dev->flags);
+				set_bit(R5_Syncio, &dev->flags);
+			}
 			clear_bit(STRIPE_DEGRADED, &sh->state);
-			set_bit(R5_Wantwrite, &bdev->flags);
 
 			set_bit(STRIPE_INSYNC, &sh->state);
-			set_bit(R5_Syncio, &adev->flags);
-			set_bit(R5_Syncio, &bdev->flags);
 		}
 	}
+
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
 		clear_bit(STRIPE_SYNCING, &sh->state);
@@ -1664,7 +1687,7 @@ static int make_request (request_queue_t
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
 			raid6_plug_device(conf);
-			handle_stripe(sh);
+			handle_stripe(sh, NULL);
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
@@ -1728,6 +1751,7 @@ static sector_t sync_request(mddev_t *md
 		return rv;
 	}
 	if (!bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, 1) &&
+	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    !conf->fullsync && sync_blocks >= STRIPE_SECTORS) {
 		/* we can skip this block, and probably more */
 		sync_blocks /= STRIPE_SECTORS;
@@ -1765,7 +1789,7 @@ static sector_t sync_request(mddev_t *md
 	clear_bit(STRIPE_INSYNC, &sh->state);
 	spin_unlock(&sh->lock);
 
-	handle_stripe(sh);
+	handle_stripe(sh, NULL);
 	release_stripe(sh);
 
 	return STRIPE_SECTORS;
@@ -1821,7 +1845,7 @@ static void raid6d (mddev_t *mddev)
 		spin_unlock_irq(&conf->device_lock);
 
 		handled++;
-		handle_stripe(sh);
+		handle_stripe(sh, conf->spare_page);
 		release_stripe(sh);
 
 		spin_lock_irq(&conf->device_lock);
@@ -1860,6 +1884,10 @@ static int run(mddev_t *mddev)
 		goto abort;
 	memset(conf->stripe_hashtbl, 0, HASH_PAGES * PAGE_SIZE);
 
+	conf->spare_page = alloc_page(GFP_KERNEL);
+	if (!conf->spare_page)
+		goto abort;
+
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
@@ -1996,6 +2024,8 @@ static int run(mddev_t *mddev)
 abort:
 	if (conf) {
 		print_raid6_conf(conf);
+		if (conf->spare_page)
+			page_cache_release(conf->spare_page);
 		if (conf->stripe_hashtbl)
 			free_pages((unsigned long) conf->stripe_hashtbl,
 							HASH_PAGES_ORDER);

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2005-11-28 10:08:19.000000000 +1100
+++ ./include/linux/raid/raid5.h	2005-11-28 10:12:56.000000000 +1100
@@ -228,6 +228,8 @@ struct raid5_private_data {
 					    * Cleared when a sync completes.
 					    */
 
+	struct page 		*spare_page; /* Used when checking P/Q in raid6 */
+
 	/*
 	 * Free stripes pool
 	 */

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 013 of 18] Improve handing of read errors with raid6
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (10 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 012 of 18] Fix raid6 resync check/repair code NeilBrown
@ 2005-11-27 23:40 ` NeilBrown
  2005-11-30 22:33   ` Carlos Carvalho
  2005-11-27 23:40 ` [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1 NeilBrown
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


This is a simple port of matching functionality across from raid5.
If we get a read error, we don't kick the drive straight away, but
try to over-write with good data first.
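
The heart of the port is the decision made when a read completes with an
error.  A hedged sketch of that decision tree (max_errors stands in for
conf->max_nr_stripes, which the patch uses as the error threshold):

#include <stdio.h>

enum action { FAIL_DEVICE, RETRY_REWRITE };

/* One read on this device just failed; decide what to do about it. */
static enum action on_read_error(int array_degraded, int already_rewritten,
				 int read_errors, int max_errors)
{
	if (array_degraded)
		return FAIL_DEVICE;	/* no redundancy left to rebuild from */
	if (already_rewritten)
		return FAIL_DEVICE;	/* the corrective over-write didn't stick */
	if (read_errors > max_errors)
		return FAIL_DEVICE;	/* too many errors: drive is unhealthy */
	return RETRY_REWRITE;		/* recompute the block and over-write it */
}

int main(void)
{
	printf("first error, healthy array: %s\n",
	       on_read_error(0, 0, 1, 256) == RETRY_REWRITE ? "retry" : "fail");
	printf("re-write already tried:     %s\n",
	       on_read_error(0, 1, 2, 256) == RETRY_REWRITE ? "retry" : "fail");
	return 0;
}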

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid6main.c |   70 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 66 insertions(+), 4 deletions(-)

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:12:56.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:12:59.000000000 +1100
@@ -367,8 +367,8 @@ static void shrink_stripes(raid6_conf_t 
 	conf->slab_cache = NULL;
 }
 
-static int raid6_end_read_request (struct bio * bi, unsigned int bytes_done,
-				   int error)
+static int raid6_end_read_request(struct bio * bi, unsigned int bytes_done,
+				  int error)
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid6_conf_t *conf = sh->raid_conf;
@@ -420,9 +420,35 @@ static int raid6_end_read_request (struc
 #else
 		set_bit(R5_UPTODATE, &sh->dev[i].flags);
 #endif
+		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			printk(KERN_INFO "raid6: read error corrected!!\n");
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+		}
+		if (atomic_read(&conf->disks[i].rdev->read_errors))
+			atomic_set(&conf->disks[i].rdev->read_errors, 0);
 	} else {
-		md_error(conf->mddev, conf->disks[i].rdev);
+		int retry = 0;
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+		atomic_inc(&conf->disks[i].rdev->read_errors);
+		if (conf->mddev->degraded)
+			printk(KERN_WARNING "raid6: read error not correctable.\n");
+		else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
+			/* Oh, no!!! */
+			printk(KERN_WARNING "raid6: read error NOT corrected!!\n");
+		else if (atomic_read(&conf->disks[i].rdev->read_errors)
+			 > conf->max_nr_stripes)
+			printk(KERN_WARNING
+			       "raid6: Too many read errors, failing device.\n");
+		else
+			retry = 1;
+		if (retry)
+			set_bit(R5_ReadError, &sh->dev[i].flags);
+		else {
+			clear_bit(R5_ReadError, &sh->dev[i].flags);
+			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			md_error(conf->mddev, conf->disks[i].rdev);
+		}
 	}
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 #if 0
@@ -1079,6 +1105,12 @@ static void handle_stripe(struct stripe_
 		if (dev->written) written++;
 		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
+			/* The ReadError flag will just be confusing now */
+			clear_bit(R5_ReadError, &dev->flags);
+			clear_bit(R5_ReWrite, &dev->flags);
+		}
+		if (!rdev || !test_bit(In_sync, &rdev->flags)
+		    || test_bit(R5_ReadError, &dev->flags)) {
 			if ( failed < 2 )
 				failed_num[failed] = i;
 			failed++;
@@ -1095,6 +1127,14 @@ static void handle_stripe(struct stripe_
 	if (failed > 2 && to_read+to_write+written) {
 		for (i=disks; i--; ) {
 			int bitmap_end = 0;
+
+			if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+				mdk_rdev_t *rdev = conf->disks[i].rdev;
+				if (rdev && test_bit(In_sync, &rdev->flags))
+					/* multiple read failures in one stripe */
+					md_error(conf->mddev, rdev);
+			}
+
 			spin_lock_irq(&conf->device_lock);
 			/* fail all writes first */
 			bi = sh->dev[i].towrite;
@@ -1130,7 +1170,8 @@ static void handle_stripe(struct stripe_
 			}
 
 			/* fail any reads if this device is non-operational */
-			if (!test_bit(R5_Insync, &sh->dev[i].flags)) {
+			if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+			    test_bit(R5_ReadError, &sh->dev[i].flags)) {
 				bi = sh->dev[i].toread;
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -1457,6 +1498,27 @@ static void handle_stripe(struct stripe_
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
 
+	/* If the failed drives are just a ReadError, then we might need
+	 * to progress the repair/check process
+	 */
+	if (failed <= 2 && ! conf->mddev->ro)
+		for (i=0; i<failed;i++) {
+			dev = &sh->dev[failed_num[i]];
+			if (test_bit(R5_ReadError, &dev->flags)
+			    && !test_bit(R5_LOCKED, &dev->flags)
+			    && test_bit(R5_UPTODATE, &dev->flags)
+				) {
+				if (!test_bit(R5_ReWrite, &dev->flags)) {
+					set_bit(R5_Wantwrite, &dev->flags);
+					set_bit(R5_ReWrite, &dev->flags);
+					set_bit(R5_LOCKED, &dev->flags);
+				} else {
+					/* let's read it back */
+					set_bit(R5_Wantread, &dev->flags);
+					set_bit(R5_LOCKED, &dev->flags);
+				}
+			}
+		}
 	spin_unlock(&sh->lock);
 
 	while ((bi=return_bi)) {

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1.
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (11 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 013 of 18] Improve handing of read errors with raid6 NeilBrown
@ 2005-11-27 23:40 ` NeilBrown
  2005-11-29 16:38   ` Paul Clements
  2005-11-27 23:40 ` [PATCH md 015 of 18] Tidyup some issues with raid1 resync and prepare for catching read errors NeilBrown
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


On a read error we suspend the array, then synchronously read the
block from the other devices until we find one from which we can read
it.  Then we try writing the good data back everywhere and make sure it
works.  If any write or subsequent read fails, only then do we fail the
device out of the array.

To be able to suspend the array, we need to also keep track of how
many requests are queued for handling by raid1d.
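
A minimal in-memory model of the recovery loop -- the real code does this
with sync_page_io() against conf->tmppage while the array is frozen, and
only touches In_sync devices; the toy devices here are just arrays:

#include <stdio.h>
#include <string.h>

#define NDISKS 3
#define SECT   512

static unsigned char disks[NDISKS][SECT];
static int bad_read[NDISKS];		/* simulated media errors */

static int read_sect(int d, unsigned char *buf)
{
	if (bad_read[d])
		return 0;		/* read error */
	memcpy(buf, disks[d], SECT);
	return 1;
}

static void write_sect(int d, const unsigned char *buf)
{
	memcpy(disks[d], buf, SECT);
	bad_read[d] = 0;		/* the over-write "fixes" the sector */
}

static int fix_read_error(int read_disk, unsigned char *page)
{
	int d = read_disk, success = 0;

	do {				/* search forward for a readable copy */
		if (read_sect(d, page)) {
			success = 1;
			break;
		}
		d = (d + 1) % NDISKS;
	} while (d != read_disk);

	if (!success)
		return -1;		/* nowhere to read from: md_error time */

	/* Walk back towards the failing disk, over-writing each mirror
	 * (including the failing one) with the good data.  If the re-read
	 * succeeded on read_disk itself, this loop runs zero times and
	 * nothing is re-written. */
	while (d != read_disk) {
		d = (d - 1 + NDISKS) % NDISKS;
		write_sect(d, page);
	}
	return 0;
}

int main(void)
{
	unsigned char page[SECT];

	memcpy(disks[0], "stale!", 7);
	memcpy(disks[1], "fresh!", 7);
	memcpy(disks[2], "fresh!", 7);
	bad_read[0] = 1;		/* disk 0 has a media error */

	if (fix_read_error(0, page) < 0)
		puts("unrecoverable: fail the array");
	else
		printf("disk 0 now reads: %s\n", (char *)disks[0]);
	return 0;
}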

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c            |    1 
 ./drivers/md/raid1.c         |  115 +++++++++++++++++++++++++++++++++++++++----
 ./include/linux/raid/raid1.h |    3 +
 3 files changed, 109 insertions(+), 10 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-11-28 10:12:52.000000000 +1100
+++ ./drivers/md/md.c	2005-11-28 10:13:11.000000000 +1100
@@ -461,6 +461,7 @@ int sync_page_io(struct block_device *bd
 	bio_put(bio);
 	return ret;
 }
+EXPORT_SYMBOL(sync_page_io);
 
 static int read_disk_sb(mdk_rdev_t * rdev, int size)
 {

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:12:40.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:13:11.000000000 +1100
@@ -191,6 +191,7 @@ static void reschedule_retry(r1bio_t *r1
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
+	conf->nr_queued ++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -245,9 +246,9 @@ static int raid1_end_read_request(struct
 	/*
 	 * this branch is our 'one mirror IO has finished' event handler:
 	 */
-	if (!uptodate)
-		md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
-	else
+	update_head_pos(mirror, r1_bio);
+
+	if (uptodate || conf->working_disks <= 1) {
 		/*
 		 * Set R1BIO_Uptodate in our master bio, so that
 		 * we will return a good error code for to the higher
@@ -259,14 +260,8 @@ static int raid1_end_read_request(struct
 		 */
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 
-	update_head_pos(mirror, r1_bio);
-
-	/*
-	 * we have only one bio on the read side
-	 */
-	if (uptodate)
 		raid_end_bio_io(r1_bio);
-	else {
+	} else {
 		/*
 		 * oops, read error:
 		 */
@@ -652,6 +647,32 @@ static void allow_barrier(conf_t *conf)
 	wake_up(&conf->wait_barrier);
 }
 
+static void freeze_array(conf_t *conf)
+{
+	/* stop syncio and normal IO and wait for everything to
+	 * go quiet.
+	 * We increment barrier and nr_waiting, and then
+	 * wait until barrier+nr_pending match nr_queued+2
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	conf->barrier++;
+	conf->nr_waiting++;
+	wait_event_lock_irq(conf->wait_barrier,
+			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
+			    conf->resync_lock,
+			    raid1_unplug(conf->mddev->queue));
+	spin_unlock_irq(&conf->resync_lock);
+}
+static void unfreeze_array(conf_t *conf)
+{
+	/* reverse the effect of the freeze */
+	spin_lock_irq(&conf->resync_lock);
+	conf->barrier--;
+	conf->nr_waiting--;
+	wake_up(&conf->wait_barrier);
+	spin_unlock_irq(&conf->resync_lock);
+}
+
 
 /* duplicate the data pages for behind I/O */
 static struct page **alloc_behind_pages(struct bio *bio)
@@ -1195,6 +1216,7 @@ static void raid1d(mddev_t *mddev)
 			break;
 		r1_bio = list_entry(head->prev, r1bio_t, retry_list);
 		list_del(head->prev);
+		conf->nr_queued--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -1234,6 +1256,74 @@ static void raid1d(mddev_t *mddev)
 				}
 		} else {
 			int disk;
+
+			/* we got a read error. Maybe the drive is bad.  Maybe just
+			 * the block and we can fix it.
+			 * We freeze all other IO, and try reading the block from
+			 * other devices.  When we find one, we re-write
+			 * and check that this fixes the read error.
+			 * This is all done synchronously while the array is
+			 * frozen
+			 */
+			sector_t sect = r1_bio->sector;
+			int sectors = r1_bio->sectors;
+			freeze_array(conf);
+			while(sectors) {
+				int s = sectors;
+				int d = r1_bio->read_disk;
+				int success = 0;
+
+				if (s > (PAGE_SIZE>>9))
+					s = PAGE_SIZE >> 9;
+
+				do {
+					rdev = conf->mirrors[d].rdev;
+					if (rdev &&
+					    test_bit(In_sync, &rdev->flags) &&
+					    sync_page_io(rdev->bdev,
+							 sect + rdev->data_offset,
+							 s<<9,
+							 conf->tmppage, READ))
+						success = 1;
+					else {
+						d++;
+						if (d == conf->raid_disks)
+							d = 0;
+					}
+				} while (!success && d != r1_bio->read_disk);
+
+				if (success) {
+					/* write it back and re-read */
+					while (d != r1_bio->read_disk) {
+						if (d==0)
+							d = conf->raid_disks;
+						d--;
+						rdev = conf->mirrors[d].rdev;
+						if (rdev &&
+						    test_bit(In_sync, &rdev->flags)) {
+							if (sync_page_io(rdev->bdev,
+									 sect + rdev->data_offset,
+									 s<<9, conf->tmppage, WRITE) == 0 ||
+							    sync_page_io(rdev->bdev,
+									 sect + rdev->data_offset,
+									 s<<9, conf->tmppage, READ) == 0) {
+								/* Well, this device is dead */
+								md_error(mddev, rdev);
+							}
+						}
+					}
+				} else {
+					/* Cannot read from anywhere -- bye bye array */
+					md_error(mddev, conf->mirrors[r1_bio->read_disk].rdev);
+					break;
+				}
+				sectors -= s;
+				sect += s;
+			}
+
+
+			unfreeze_array(conf);
+
 			bio = r1_bio->bios[r1_bio->read_disk];
 			if ((disk=read_balance(conf, r1_bio)) == -1) {
 				printk(KERN_ALERT "raid1: %s: unrecoverable I/O"
@@ -1528,6 +1618,10 @@ static int run(mddev_t *mddev)
 
 	memset(conf->mirrors, 0, sizeof(struct mirror_info)*mddev->raid_disks);
 
+	conf->tmppage = alloc_page(GFP_KERNEL);
+	if (!conf->tmppage)
+		goto out_no_mem;
+
 	conf->poolinfo = kmalloc(sizeof(*conf->poolinfo), GFP_KERNEL);
 	if (!conf->poolinfo)
 		goto out_no_mem;
@@ -1634,6 +1728,7 @@ out_free_conf:
 		if (conf->r1bio_pool)
 			mempool_destroy(conf->r1bio_pool);
 		kfree(conf->mirrors);
+		__free_page(conf->tmppage);
 		kfree(conf->poolinfo);
 		kfree(conf);
 		mddev->private = NULL;

diff ./include/linux/raid/raid1.h~current~ ./include/linux/raid/raid1.h
--- ./include/linux/raid/raid1.h~current~	2005-11-28 10:12:17.000000000 +1100
+++ ./include/linux/raid/raid1.h	2005-11-28 10:13:11.000000000 +1100
@@ -46,6 +46,7 @@ struct r1_private_data_s {
 	spinlock_t		resync_lock;
 	int			nr_pending;
 	int			nr_waiting;
+	int			nr_queued;
 	int			barrier;
 	sector_t		next_resync;
 	int			fullsync;  /* set to 1 if a full sync is needed,
@@ -57,6 +58,8 @@ struct r1_private_data_s {
 
 	struct pool_info	*poolinfo;
 
+	struct page		*tmppage;
+
 	mempool_t *r1bio_pool;
 	mempool_t *r1buf_pool;
 };

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 015 of 18] Tidyup some issues with raid1 resync and prepare for catching read errors.
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (12 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1 NeilBrown
@ 2005-11-27 23:40 ` NeilBrown
  2005-11-27 23:40 ` [PATCH md 016 of 18] Better handling for read error in raid1 during resync NeilBrown
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


We are dereferencing ->rdev without an rcu lock!

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |  112 +++++++++++++++++++++++++--------------------------
 1 file changed, 57 insertions(+), 55 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:13:11.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:13:14.000000000 +1100
@@ -177,6 +177,13 @@ static inline void free_r1bio(r1bio_t *r
 static inline void put_buf(r1bio_t *r1_bio)
 {
 	conf_t *conf = mddev_to_conf(r1_bio->mddev);
+	int i;
+
+	for (i=0; i<conf->raid_disks; i++) {
+		struct bio *bio = r1_bio->bios[i];
+		if (bio->bi_end_io)
+			rdev_dec_pending(conf->mirrors[i].rdev, r1_bio->mddev);
+	}
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
@@ -1084,7 +1091,6 @@ static int end_sync_read(struct bio *bio
 			 conf->mirrors[r1_bio->read_disk].rdev);
 	} else
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
-	rdev_dec_pending(conf->mirrors[r1_bio->read_disk].rdev, conf->mddev);
 	reschedule_retry(r1_bio);
 	return 0;
 }
@@ -1115,7 +1121,6 @@ static int end_sync_write(struct bio *bi
 		md_done_sync(mddev, r1_bio->sectors, uptodate);
 		put_buf(r1_bio);
 	}
-	rdev_dec_pending(conf->mirrors[mirror].rdev, mddev);
 	return 0;
 }
 
@@ -1152,10 +1157,14 @@ static void sync_request_write(mddev_t *
 	atomic_set(&r1_bio->remaining, 1);
 	for (i = 0; i < disks ; i++) {
 		wbio = r1_bio->bios[i];
-		if (wbio->bi_end_io != end_sync_write)
+		if (wbio->bi_end_io == NULL ||
+		    (wbio->bi_end_io == end_sync_read &&
+		     (i == r1_bio->read_disk ||
+		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
 			continue;
 
-		atomic_inc(&conf->mirrors[i].rdev->nr_pending);
+		wbio->bi_rw = WRITE;
+		wbio->bi_end_io = end_sync_write;
 		atomic_inc(&r1_bio->remaining);
 		md_sync_acct(conf->mirrors[i].rdev->bdev, wbio->bi_size >> 9);
 
@@ -1387,14 +1396,13 @@ static int init_resync(conf_t *conf)
 static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
 {
 	conf_t *conf = mddev_to_conf(mddev);
-	mirror_info_t *mirror;
 	r1bio_t *r1_bio;
 	struct bio *bio;
 	sector_t max_sector, nr_sectors;
-	int disk;
+	int disk = -1;
 	int i;
-	int wonly;
-	int write_targets = 0;
+	int wonly = -1;
+	int write_targets = 0, read_targets = 0;
 	int sync_blocks;
 	int still_degraded = 0;
 
@@ -1446,44 +1454,24 @@ static sector_t sync_request(mddev_t *md
 
 	conf->next_resync = sector_nr;
 
-	/*
-	 * If reconstructing, and >1 working disc,
-	 * could dedicate one to rebuild and others to
-	 * service read requests ..
-	 */
-	disk = conf->last_used;
-	/* make sure disk is operational */
-	wonly = disk;
-	while (conf->mirrors[disk].rdev == NULL ||
-	       !test_bit(In_sync, &conf->mirrors[disk].rdev->flags) ||
-	       test_bit(WriteMostly, &conf->mirrors[disk].rdev->flags)
-		) {
-		if (conf->mirrors[disk].rdev  &&
-		    test_bit(In_sync, &conf->mirrors[disk].rdev->flags))
-			wonly = disk;
-		if (disk <= 0)
-			disk = conf->raid_disks;
-		disk--;
-		if (disk == conf->last_used) {
-			disk = wonly;
-			break;
-		}
-	}
-	conf->last_used = disk;
-	atomic_inc(&conf->mirrors[disk].rdev->nr_pending);
-
-
-	mirror = conf->mirrors + disk;
-
 	r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
+	rcu_read_lock();
+	/*
+	 * If we get a correctably read error during resync or recovery,
+	 * we might want to read from a different device.  So we
+	 * flag all drives that could conceivably be read from for READ,
+	 * and any others (which will be non-In_sync devices) for WRITE.
+	 * If a read fails, we try reading from something else for which READ
+	 * is OK.
+	 */
 
 	r1_bio->mddev = mddev;
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
-	r1_bio->read_disk = disk;
 
 	for (i=0; i < conf->raid_disks; i++) {
+		mdk_rdev_t *rdev;
 		bio = r1_bio->bios[i];
 
 		/* take from bio_init */
@@ -1498,35 +1486,49 @@ static sector_t sync_request(mddev_t *md
 		bio->bi_end_io = NULL;
 		bio->bi_private = NULL;
 
-		if (i == disk) {
-			bio->bi_rw = READ;
-			bio->bi_end_io = end_sync_read;
-		} else if (conf->mirrors[i].rdev == NULL ||
-			   test_bit(Faulty, &conf->mirrors[i].rdev->flags)) {
+		rdev = rcu_dereference(conf->mirrors[i].rdev);
+		if (rdev == NULL ||
+			   test_bit(Faulty, &rdev->flags)) {
 			still_degraded = 1;
 			continue;
-		} else if (!test_bit(In_sync, &conf->mirrors[i].rdev->flags) ||
-			   sector_nr + RESYNC_SECTORS > mddev->recovery_cp   ||
-			   test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
+		} else if (!test_bit(In_sync, &rdev->flags)) {
 			bio->bi_rw = WRITE;
 			bio->bi_end_io = end_sync_write;
 			write_targets ++;
-		} else
-			/* no need to read or write here */
-			continue;
-		bio->bi_sector = sector_nr + conf->mirrors[i].rdev->data_offset;
-		bio->bi_bdev = conf->mirrors[i].rdev->bdev;
+		} else {
+			/* may need to read from here */
+			bio->bi_rw = READ;
+			bio->bi_end_io = end_sync_read;
+			if (test_bit(WriteMostly, &rdev->flags)) {
+				if (wonly < 0)
+					wonly = i;
+			} else {
+				if (disk < 0)
+					disk = i;
+			}
+			read_targets++;
+		}
+		atomic_inc(&rdev->nr_pending);
+		bio->bi_sector = sector_nr + rdev->data_offset;
+		bio->bi_bdev = rdev->bdev;
 		bio->bi_private = r1_bio;
 	}
+	rcu_read_unlock();
+	if (disk < 0)
+		disk = wonly;
+	r1_bio->read_disk = disk;
 
-	if (write_targets == 0) {
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && read_targets > 0)
+		/* extra read targets are also write targets */
+		write_targets += read_targets-1;
+
+	if (write_targets == 0 || read_targets == 0) {
 		/* There is nowhere to write, so all non-sync
 		 * drives must be failed - so we are finished
 		 */
 		sector_t rv = max_sector - sector_nr;
 		*skipped = 1;
 		put_buf(r1_bio);
-		rdev_dec_pending(conf->mirrors[disk].rdev, mddev);
 		return rv;
 	}
 
@@ -1577,10 +1579,10 @@ static sector_t sync_request(mddev_t *md
 		sync_blocks -= (len>>9);
 	} while (r1_bio->bios[disk]->bi_vcnt < RESYNC_PAGES);
  bio_full:
-	bio = r1_bio->bios[disk];
+	bio = r1_bio->bios[r1_bio->read_disk];
 	r1_bio->sectors = nr_sectors;
 
-	md_sync_acct(mirror->rdev->bdev, nr_sectors);
+	md_sync_acct(conf->mirrors[r1_bio->read_disk].rdev->bdev, nr_sectors);
 
 	generic_make_request(bio);
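
The rewritten loop above classifies every slot before choosing
read_disk: anything In_sync is a potential read source (write-mostly
devices only as a last resort), anything present but not In_sync is a
write target, and an empty or faulty slot leaves the array degraded.
A toy sketch of just that classification:

#include <stdio.h>

enum role { SKIP, READ_TARGET, WRITE_TARGET };

struct mirror { int present; int in_sync; };

static enum role classify(const struct mirror *m, int *still_degraded)
{
	if (!m->present) {
		*still_degraded = 1;	/* missing/faulty: cannot fully sync */
		return SKIP;
	}
	return m->in_sync ? READ_TARGET : WRITE_TARGET;
}

int main(void)
{
	const struct mirror mirrors[3] = {
		{ 1, 1 },	/* healthy: may read from it */
		{ 1, 0 },	/* being rebuilt: write to it */
		{ 0, 0 },	/* missing: array stays degraded */
	};
	static const char *names[] = { "skip", "read", "write" };
	int still_degraded = 0;

	for (int i = 0; i < 3; i++)
		printf("disk %d -> %s\n", i,
		       names[classify(&mirrors[i], &still_degraded)]);
	printf("still degraded: %d\n", still_degraded);
	return 0;
}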
 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 016 of 18] Better handling for read error in raid1 during resync
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (13 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 015 of 18] Tidyup some issues with raid1 resync and prepare for catching read errors NeilBrown
@ 2005-11-27 23:40 ` NeilBrown
  2005-11-27 23:41 ` [PATCH md 017 of 18] Handle errors when read-only NeilBrown
  2005-11-27 23:41 ` [PATCH md 018 of 18] Fix up some rdev rcu locking in raid5/6 NeilBrown
  16 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


... incomplete

Signed-off-by: Neil Brown <neilb@suse.de>
Fix md raid1 fix-read-error-during-resync


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |   99 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 78 insertions(+), 21 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:13:14.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:13:27.000000000 +1100
@@ -1071,9 +1071,7 @@ abort:
 
 static int end_sync_read(struct bio *bio, unsigned int bytes_done, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
 	r1bio_t * r1_bio = (r1bio_t *)(bio->bi_private);
-	conf_t *conf = mddev_to_conf(r1_bio->mddev);
 
 	if (bio->bi_size)
 		return 1;
@@ -1086,10 +1084,7 @@ static int end_sync_read(struct bio *bio
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (!uptodate) {
-		md_error(r1_bio->mddev,
-			 conf->mirrors[r1_bio->read_disk].rdev);
-	} else
+	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 	reschedule_retry(r1_bio);
 	return 0;
@@ -1133,27 +1128,89 @@ static void sync_request_write(mddev_t *
 
 	bio = r1_bio->bios[r1_bio->read_disk];
 
-/*
-	if (r1_bio->sector == 0) printk("First sync write startss\n");
-*/
+
 	/*
 	 * schedule writes
 	 */
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) {
-		/*
-		 * There is no point trying a read-for-reconstruct as
-		 * reconstruct is about to be aborted
+		/* ouch - failed to read all of that.
+		 * Try some synchronous reads of other devices to get
+		 * good data, much like with normal read errors.  Only
+		 * read into the pages we already have so that we don't
+		 * need to re-issue the read request.
+		 * We don't need to freeze the array, because being in an
+		 * active sync request, there is no normal IO, and
+		 * no overlapping syncs.
 		 */
-		char b[BDEVNAME_SIZE];
-		printk(KERN_ALERT "raid1: %s: unrecoverable I/O read error"
-			" for block %llu\n",
-			bdevname(bio->bi_bdev,b), 
-			(unsigned long long)r1_bio->sector);
-		md_done_sync(mddev, r1_bio->sectors, 0);
-		put_buf(r1_bio);
-		return;
+		sector_t sect = r1_bio->sector;
+		int sectors = r1_bio->sectors;
+		int idx = 0;
+
+		while(sectors) {
+			int s = sectors;
+			int d = r1_bio->read_disk;
+			int success = 0;
+			mdk_rdev_t *rdev;
+
+			if (s > (PAGE_SIZE>>9))
+				s = PAGE_SIZE >> 9;
+			do {
+				if (r1_bio->bios[d]->bi_end_io == end_sync_read) {
+					rdev = conf->mirrors[d].rdev;
+					if (sync_page_io(rdev->bdev,
+							 sect + rdev->data_offset,
+							 s<<9,
+							 bio->bi_io_vec[idx].bv_page,
+							 READ)) {
+						success = 1;
+						break;
+					}
+				}
+				d++;
+				if (d == conf->raid_disks)
+					d = 0;
+			} while (!success && d != r1_bio->read_disk);
+
+			if (success) {
+				/* write it back and re-read */
+				set_bit(R1BIO_Uptodate, &r1_bio->state);
+				while (d != r1_bio->read_disk) {
+					if (d == 0)
+						d = conf->raid_disks;
+					d--;
+					if (r1_bio->bios[d]->bi_end_io != end_sync_read)
+						continue;
+					rdev = conf->mirrors[d].rdev;
+					if (sync_page_io(rdev->bdev,
+							 sect + rdev->data_offset,
+							 s<<9,
+							 bio->bi_io_vec[idx].bv_page,
+							 WRITE) == 0 ||
+					    sync_page_io(rdev->bdev,
+							 sect + rdev->data_offset,
+							 s<<9,
+							 bio->bi_io_vec[idx].bv_page,
+							 READ) == 0) {
+						md_error(mddev, rdev);
+					}
+				}
+			} else {
+				char b[BDEVNAME_SIZE];
+				/* Cannot read from anywhere, array is toast */
+				md_error(mddev, conf->mirrors[r1_bio->read_disk].rdev);
+				printk(KERN_ALERT "raid1: %s: unrecoverable I/O read error"
+				       " for block %llu\n",
+				       bdevname(bio->bi_bdev,b),
+				       (unsigned long long)r1_bio->sector);
+				md_done_sync(mddev, r1_bio->sectors, 0);
+				put_buf(r1_bio);
+				return;
+			}
+			sectors -= s;
+			sect += s;
+			idx ++;
+		}
 	}
-
 	atomic_set(&r1_bio->remaining, 1);
 	for (i = 0; i < disks ; i++) {
 		wbio = r1_bio->bios[i];

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 017 of 18] Handle errors when read-only
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (14 preceding siblings ...)
  2005-11-27 23:40 ` [PATCH md 016 of 18] Better handling for read error in raid1 during resync NeilBrown
@ 2005-11-27 23:41 ` NeilBrown
  2005-12-10  6:41   ` Yanggun
  2005-11-27 23:41 ` [PATCH md 018 of 18] Fix up some rdev rcu locking in raid5/6 NeilBrown
  16 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid



Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c         |   18 +++++++++++-------
 ./include/linux/raid/raid1.h |    7 +++++++
 2 files changed, 18 insertions(+), 7 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-11-28 10:13:27.000000000 +1100
+++ ./drivers/md/raid1.c	2005-11-28 10:13:32.000000000 +1100
@@ -154,7 +154,7 @@ static void put_all_bios(conf_t *conf, r
 
 	for (i = 0; i < conf->raid_disks; i++) {
 		struct bio **bio = r1_bio->bios + i;
-		if (*bio)
+		if (*bio && *bio != IO_BLOCKED)
 			bio_put(*bio);
 		*bio = NULL;
 	}
@@ -418,11 +418,13 @@ static int read_balance(conf_t *conf, r1
 		new_disk = 0;
 
 		for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
+		     r1_bio->bios[new_disk] == IO_BLOCKED ||
 		     !rdev || !test_bit(In_sync, &rdev->flags)
 			     || test_bit(WriteMostly, &rdev->flags);
 		     rdev = rcu_dereference(conf->mirrors[++new_disk].rdev)) {
 
-			if (rdev && test_bit(In_sync, &rdev->flags))
+			if (rdev && test_bit(In_sync, &rdev->flags) &&
+				r1_bio->bios[new_disk] != IO_BLOCKED)
 				wonly_disk = new_disk;
 
 			if (new_disk == conf->raid_disks - 1) {
@@ -436,11 +438,13 @@ static int read_balance(conf_t *conf, r1
 
 	/* make sure the disk is operational */
 	for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
+	     r1_bio->bios[new_disk] == IO_BLOCKED ||
 	     !rdev || !test_bit(In_sync, &rdev->flags) ||
 		     test_bit(WriteMostly, &rdev->flags);
 	     rdev = rcu_dereference(conf->mirrors[new_disk].rdev)) {
 
-		if (rdev && test_bit(In_sync, &rdev->flags))
+		if (rdev && test_bit(In_sync, &rdev->flags) &&
+		    r1_bio->bios[new_disk] != IO_BLOCKED)
 			wonly_disk = new_disk;
 
 		if (new_disk <= 0)
@@ -477,7 +481,7 @@ static int read_balance(conf_t *conf, r1
 
 		rdev = rcu_dereference(conf->mirrors[disk].rdev);
 
-		if (!rdev ||
+		if (!rdev || r1_bio->bios[disk] == IO_BLOCKED ||
 		    !test_bit(In_sync, &rdev->flags) ||
 		    test_bit(WriteMostly, &rdev->flags))
 			continue;
@@ -1334,7 +1338,7 @@ static void raid1d(mddev_t *mddev)
 			sector_t sect = r1_bio->sector;
 			int sectors = r1_bio->sectors;
 			freeze_array(conf);
-			while(sectors) {
+			if (mddev->ro == 0) while(sectors) {
 				int s = sectors;
 				int d = r1_bio->read_disk;
 				int success = 0;
@@ -1387,7 +1391,6 @@ static void raid1d(mddev_t *mddev)
 				sect += s;
 			}
 
-
 			unfreeze_array(conf);
 
 			bio = r1_bio->bios[r1_bio->read_disk];
@@ -1398,7 +1401,8 @@ static void raid1d(mddev_t *mddev)
 				       (unsigned long long)r1_bio->sector);
 				raid_end_bio_io(r1_bio);
 			} else {
-				r1_bio->bios[r1_bio->read_disk] = NULL;
+				r1_bio->bios[r1_bio->read_disk] =
+					mddev->ro ? IO_BLOCKED : NULL;
 				r1_bio->read_disk = disk;
 				bio_put(bio);
 				bio = bio_clone(r1_bio->master_bio, GFP_NOIO);

diff ./include/linux/raid/raid1.h~current~ ./include/linux/raid/raid1.h
--- ./include/linux/raid/raid1.h~current~	2005-11-28 10:13:11.000000000 +1100
+++ ./include/linux/raid/raid1.h	2005-11-28 10:13:32.000000000 +1100
@@ -109,6 +109,13 @@ struct r1bio_s {
 	/* DO NOT PUT ANY NEW FIELDS HERE - bios array is contiguously alloced*/
 };
 
+/* when we get a read error on a read-only array, we redirect to another
+ * device without failing the first device, or trying to over-write to
+ * correct the read error.  To keep track of bad blocks on a per-bio
+ * level, we store IO_BLOCKED in the appropriate 'bios' pointer
+ */
+#define IO_BLOCKED ((struct bio*)1)
+
 /* bits for r1bio.state */
 #define	R1BIO_Uptodate	0
 #define	R1BIO_IsSync	1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH md 018 of 18] Fix up some rdev rcu locking in raid5/6
  2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
                   ` (15 preceding siblings ...)
  2005-11-27 23:41 ` [PATCH md 017 of 18] Handle errors when read-only NeilBrown
@ 2005-11-27 23:41 ` NeilBrown
  16 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2005-11-27 23:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid


There is this "FIXME" comment with a typo in it!! that has been
annoying me for days, so I just had to remove it.

conf->disks[i].rdev should only be accessed if
  - we know we hold a reference or
  - the mddev->reconfig_sem is down or
  - we hold an rcu_read_lock

handle_stripe was referencing rdev in three places without any of these.
For the first two, get an rcu_read_lock.
For the last, the same access (the md_sync_acct call) is made a little
later, after the rdev has been claimed under an rcu_read_lock, if
R5_Syncio is set.  So just use that access...  However, R5_Syncio isn't
really needed as the 'syncing' variable contains the same information.
So use that instead.

Issues, comment, and fix are identical in raid5 and raid6.
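
A userspace analogue of the rule being enforced, using C11 atomics
rather than the kernel RCU API: take one snapshot of the shared pointer
inside the read-side critical section and dereference only that
snapshot, never re-load conf->disks[i].rdev a second time.

#include <stdatomic.h>
#include <stdio.h>

struct rdev { int in_sync; };

/* Stands in for conf->disks[i].rdev, which another thread may replace. */
static _Atomic(struct rdev *) disk_rdev;

static void reader(void)
{
	/* One acquire load, one snapshot -- the moral equivalent of
	 * rcu_dereference() between rcu_read_lock()/rcu_read_unlock(). */
	struct rdev *rdev = atomic_load_explicit(&disk_rdev,
						 memory_order_acquire);

	if (rdev && rdev->in_sync)
		puts("device usable");
	else
		puts("device missing or out of sync");
}

int main(void)
{
	static struct rdev r = { .in_sync = 1 };

	atomic_store_explicit(&disk_rdev, &r, memory_order_release);
	reader();
	atomic_store_explicit(&disk_rdev, NULL, memory_order_release);
	reader();			/* snapshot is NULL: handled safely */
	return 0;
}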

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   16 ++++++++--------
 ./drivers/md/raid6main.c     |   19 ++++++++-----------
 ./include/linux/raid/raid5.h |    1 -
 3 files changed, 16 insertions(+), 20 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-11-28 10:12:40.000000000 +1100
+++ ./drivers/md/raid5.c	2005-11-28 10:13:37.000000000 +1100
@@ -960,11 +960,11 @@ static void handle_stripe(struct stripe_
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
 
+	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
-		clear_bit(R5_Syncio, &dev->flags);
 
 		PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev->toread, dev->towrite, dev->written);
@@ -1003,7 +1003,7 @@ static void handle_stripe(struct stripe_
 				non_overwrite++;
 		}
 		if (dev->written) written++;
-		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
+		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
 			clear_bit(R5_ReadError, &dev->flags);
@@ -1016,6 +1016,7 @@ static void handle_stripe(struct stripe_
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
+	rcu_read_unlock();
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 		" to_write=%d failed=%d failed_num=%d\n",
 		locked, uptodate, to_read, to_write, failed, failed_num);
@@ -1027,10 +1028,13 @@ static void handle_stripe(struct stripe_
 			int bitmap_end = 0;
 
 			if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
-				mdk_rdev_t *rdev = conf->disks[i].rdev;
+				mdk_rdev_t *rdev;
+				rcu_read_lock();
+				rdev = rcu_dereference(conf->disks[i].rdev);
 				if (rdev && test_bit(In_sync, &rdev->flags))
 					/* multiple read failures in one stripe */
 					md_error(conf->mddev, rdev);
+				rcu_read_unlock();
 			}
 
 			spin_lock_irq(&conf->device_lock);
@@ -1179,9 +1183,6 @@ static void handle_stripe(struct stripe_
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n", 
 						i, syncing);
-					if (syncing)
-						md_sync_acct(conf->disks[i].rdev->bdev,
-							     STRIPE_SECTORS);
 				}
 			}
 		}
@@ -1325,7 +1326,6 @@ static void handle_stripe(struct stripe_
 			clear_bit(STRIPE_DEGRADED, &sh->state);
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
-			set_bit(R5_Syncio, &dev->flags);
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
@@ -1391,7 +1391,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (syncing)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-11-28 10:12:59.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-11-28 10:13:37.000000000 +1100
@@ -1060,11 +1060,11 @@ static void handle_stripe(struct stripe_
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	/* Now to look around and see what can be done */
 
+	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
-		clear_bit(R5_Syncio, &dev->flags);
 
 		PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev->toread, dev->towrite, dev->written);
@@ -1103,7 +1103,7 @@ static void handle_stripe(struct stripe_
 				non_overwrite++;
 		}
 		if (dev->written) written++;
-		rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
+		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
 			clear_bit(R5_ReadError, &dev->flags);
@@ -1117,6 +1117,7 @@ static void handle_stripe(struct stripe_
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
+	rcu_read_unlock();
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 	       " to_write=%d failed=%d failed_num=%d,%d\n",
 	       locked, uptodate, to_read, to_write, failed,
@@ -1129,10 +1130,13 @@ static void handle_stripe(struct stripe_
 			int bitmap_end = 0;
 
 			if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
-				mdk_rdev_t *rdev = conf->disks[i].rdev;
+				mdk_rdev_t *rdev;
+				rcu_read_lock();
+				rdev = rcu_dereference(conf->disks[i].rdev);
 				if (rdev && test_bit(In_sync, &rdev->flags))
 					/* multiple read failures in one stripe */
 					md_error(conf->mddev, rdev);
+				rcu_read_unlock();
 			}
 
 			spin_lock_irq(&conf->device_lock);
@@ -1307,9 +1311,6 @@ static void handle_stripe(struct stripe_
 					locked++;
 					PRINTK("Reading block %d (sync=%d)\n",
 						i, syncing);
-					if (syncing)
-						md_sync_acct(conf->disks[i].rdev->bdev,
-							     STRIPE_SECTORS);
 				}
 			}
 		}
@@ -1463,14 +1464,12 @@ static void handle_stripe(struct stripe_
 				locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
-				set_bit(R5_Syncio, &dev->flags);
 			}
 			if (failed >= 1) {
 				dev = &sh->dev[failed_num[0]];
 				locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
-				set_bit(R5_Syncio, &dev->flags);
 			}
 
 			if (update_p) {
@@ -1478,14 +1477,12 @@ static void handle_stripe(struct stripe_
 				locked ++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
-				set_bit(R5_Syncio, &dev->flags);
 			}
 			if (update_q) {
 				dev = &sh->dev[qd_idx];
 				locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
-				set_bit(R5_Syncio, &dev->flags);
 			}
 			clear_bit(STRIPE_DEGRADED, &sh->state);
 
@@ -1557,7 +1554,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
 
 		if (rdev) {
-			if (test_bit(R5_Syncio, &sh->dev[i].flags))
+			if (syncing)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2005-11-28 10:12:56.000000000 +1100
+++ ./include/linux/raid/raid5.h	2005-11-28 10:13:37.000000000 +1100
@@ -152,7 +152,6 @@ struct stripe_head {
 #define	R5_Insync	3	/* rdev && rdev->in_sync at start */
 #define	R5_Wantread	4	/* want to schedule a read */
 #define	R5_Wantwrite	5
-#define	R5_Syncio	6	/* this io need to be accounted as resync io */
 #define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1.
  2005-11-27 23:40 ` [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1 NeilBrown
@ 2005-11-29 16:38   ` Paul Clements
  2005-11-29 23:21     ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Clements @ 2005-11-29 16:38 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid

Hi Neil,

Glad to see this patch is making its way to mainline. I have a couple of 
questions on the patch, though...

NeilBrown wrote:

> +	if (uptodate || conf->working_disks <= 1) {

Is it valid to mask a read error just because we have only 1 working disk?



> +				do {
> +					rdev = conf->mirrors[d].rdev;
> +					if (rdev &&
> +					    test_bit(In_sync, &rdev->flags) &&
> +					    sync_page_io(rdev->bdev,
> +							 sect + rdev->data_offset,
> +							 s<<9,
> +							 conf->tmppage, READ))
> +						success = 1;
> +					else {
> +						d++;
> +						if (d == conf->raid_disks)
> +							d = 0;
> +					}
> +				} while (!success && d != r1_bio->read_disk);
> +
> +				if (success) {
> +					/* write it back and re-read */
> +					while (d != r1_bio->read_disk) {

Here, it looks like if we retry the read on the same disk that just gave 
the read error, then we will not do any re-writes? I assume that is 
intentional? I guess it's a judgment call whether the sector is really 
bad at that point.

> +						if (d==0)
> +							d = conf->raid_disks;
> +						d--;
> +						rdev = conf->mirrors[d].rdev;
> +						if (rdev &&
> +						    test_bit(In_sync, &rdev->flags)) {
> +							if (sync_page_io(rdev->bdev,
> +									 sect + rdev->data_offset,
> +									 s<<9, conf->tmppage, WRITE) == 0 ||
> +							    sync_page_io(rdev->bdev,
> +									 sect + rdev->data_offset,
> +									 s<<9, conf->tmppage, READ) == 0) {
> +								/* Well, this device is dead */
> +								md_error(mddev, rdev);

Here, we might have gotten garbage back from the sync_page_io(..., 
READ), if it failed. So don't we have to quit the re-write loop at this 
point? Otherwise, aren't we potentially writing bad data over other 
disks? Granted, this particular case might never actually happen in the 
real world.


Thanks,
Paul

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1.
  2005-11-29 16:38   ` Paul Clements
@ 2005-11-29 23:21     ` Neil Brown
  0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2005-11-29 23:21 UTC (permalink / raw)
  To: Paul Clements; +Cc: Andrew Morton, linux-raid

On Tuesday November 29, paul.clements@steeleye.com wrote:
> Hi Neil,
> 
> Glad to see this patch is making its way to mainline. I have a couple of 
> questions on the patch, though...

Thanks for reviewing the code - I really value that!

> 
> NeilBrown wrote:
> 
> > +	if (uptodate || conf->working_disks <= 1) {
> 
> Is it valid to mask a read error just because we have only 1 working disk?
> 

The purpose of this was that if there is only one working disk, there
is nothing we can do in the face of a read error, except return it
upstream.  However, I did get the logic immediately after that wrong,
as I discovered when porting it across to raid10.  A patch will be out
shortly.

> 
> 
> > +				do {
> > +					rdev = conf->mirrors[d].rdev;
> > +					if (rdev &&
> > +					    test_bit(In_sync, &rdev->flags) &&
> > +					    sync_page_io(rdev->bdev,
> > +							 sect + rdev->data_offset,
> > +							 s<<9,
> > +							 conf->tmppage, READ))
> > +						success = 1;
> > +					else {
> > +						d++;
> > +						if (d == conf->raid_disks)
> > +							d = 0;
> > +					}
> > +				} while (!success && d != r1_bio->read_disk);
> > +
> > +				if (success) {
> > +					/* write it back and re-read */
> > +					while (d != r1_bio->read_disk) {
> 
> Here, it looks like if we retry the read on the same disk that just gave 
> the read error, then we will not do any re-writes? I assume that is 
> intentional? I guess it's a judgment call whether the sector is really 
> bad at that point.

The read that failed was quite possibly a very large multipage read -
64K maybe (depends a lot on the filesystem).
What I do is walk through that one page at a time and retry the read.
If the re-read succeeds, then I assume the failure that stopped the
original (possibly larger) request was somewhere else, and I move on.

So yes, if a block sometimes fails, and then succeeds again on the
next read, we might not try to 'fix' it.  Is that likely to happen I
wonder?? 
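
Roughly, the walk over the failed request looks like this (toy numbers;
PAGE_SIZE>>9 is 8 sectors with 4K pages):

#include <stdio.h>

#define PAGE_SECTORS (4096 >> 9)	/* 8 sectors per 4K page */

int main(void)
{
	long sect = 1000, sectors = 128;	/* a failed 64K read */

	while (sectors) {
		long s = sectors;

		if (s > PAGE_SECTORS)
			s = PAGE_SECTORS;	/* retry page-sized pieces */
		printf("retry sectors %ld..%ld\n", sect, sect + s - 1);
		sectors -= s;
		sect += s;
	}
	return 0;
}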

> 
> > +						if (d==0)
> > +							d = conf->raid_disks;
> > +						d--;
> > +						rdev = conf->mirrors[d].rdev;
> > +						if (rdev &&
> > +						    test_bit(In_sync, &rdev->flags)) {
> > +							if (sync_page_io(rdev->bdev,
> > +									 sect + rdev->data_offset,
> > +									 s<<9, conf->tmppage, WRITE) == 0 ||
> > +							    sync_page_io(rdev->bdev,
> > +									 sect + rdev->data_offset,
> > +									 s<<9, conf->tmppage, READ) == 0) {
> > +								/* Well, this device is dead */
> > +								md_error(mddev, rdev);
> 
> Here, we might have gotten garbage back from the sync_page_io(..., 
> READ), if it failed. So don't we have to quit the re-write loop at this 
> point? Otherwise, aren't we potentially writing bad data over other 
> disks? Granted, this particular case might never actually happen in the 
> real world.

Yes, you are right.  I guess I really should be reading back into a
different buffer just in case something goes screwy...
Or maybe I could do all the writes, and then do all the reads in a
separate loop - that would be just as safe.
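
For the second option, something like this (a sketch only, reusing
the names from the patch; 'start' stands for whichever disk the
search loop finished on, and it is untested):

	/* pass 1: over-write the bad block on each other mirror with
	 * the good data we found in conf->tmppage */
	d = start;
	while (d != r1_bio->read_disk) {
		if (d == 0)
			d = conf->raid_disks;
		d--;
		rdev = conf->mirrors[d].rdev;
		if (rdev && test_bit(In_sync, &rdev->flags) &&
		    sync_page_io(rdev->bdev, sect + rdev->data_offset,
				 s<<9, conf->tmppage, WRITE) == 0)
			md_error(mddev, rdev);
	}

	/* pass 2: only now re-read to verify; a failing READ can no
	 * longer leave garbage in tmppage for a later WRITE to copy */
	d = start;
	while (d != r1_bio->read_disk) {
		if (d == 0)
			d = conf->raid_disks;
		d--;
		rdev = conf->mirrors[d].rdev;
		if (rdev && test_bit(In_sync, &rdev->flags) &&
		    sync_page_io(rdev->bdev, sect + rdev->data_offset,
				 s<<9, conf->tmppage, READ) == 0)
			md_error(mddev, rdev);
	}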

I'll see which one looks neatest.

Thanks again,
NeilBrown

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH md 013 of 18] Improve handing of read errors with raid6
  2005-11-27 23:40 ` [PATCH md 013 of 18] Improve handing of read errors with raid6 NeilBrown
@ 2005-11-30 22:33   ` Carlos Carvalho
  2005-12-01  2:54     ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Carlos Carvalho @ 2005-11-30 22:33 UTC (permalink / raw)
  To: linux-raid

NeilBrown (neilb@suse.de) wrote on 28 November 2005 10:40:
 >This is a simple port of matching functionality across from raid5.
 >If we get a read error, we don't kick the drive straight away, but
 >try to over-write with good data first.

Does it really mean that this functionality is already available for
raid5?  That would be great news.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH md 013 of 18] Improve handing of read errors with raid6
  2005-11-30 22:33   ` Carlos Carvalho
@ 2005-12-01  2:54     ` Neil Brown
  0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2005-12-01  2:54 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

On Wednesday November 30, carlos@fisica.ufpr.br wrote:
> NeilBrown (neilb@suse.de) wrote on 28 November 2005 10:40:
>  >This is a simple port of matching functionality across from raid5.
>  >If we get a read error, we don't kick the drive straight away, but
>  >try to over-write with good data first.
> 
> Does it really mean that this functionality is already available for
> raid5?  That would be great news.

Yes.  It has been in -mm for a while, and will be in 2.6.15.  The
raid1/raid6/raid10 versions will have to wait for 2.6.16.

I'll probably do a little 'md Release notes' announcement when 2.6.15
comes out if I remember.

NeilBrown

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH md 017 of 18] Handle errors when read-only
  2005-11-27 23:41 ` [PATCH md 017 of 18] Handle errors when read-only NeilBrown
@ 2005-12-10  6:41   ` Yanggun
  2005-12-10  6:59     ` raid1 mysteriously switching to read-only Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Yanggun @ 2005-12-10  6:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid

Hi Neil,

I have a raid1 array called md0. It basically runs fine, but
something is switching md0 to read-only during writes to disk (cp, mv).

In what cases does the RAID become read-only? Could you describe
the possible cases?

I am quite anxious about this. Does this patch solve it? Is there any
way to keep the array from going read-only?

Setup:
  HW:
   2 S-ATA-Disks (240GB each) -> /dev/md0 RAID-1
   Promise S150 TX2Plus - Controller
   Intel Pentium4  - 2.8GHz

  SW:
   Debian Linux Testing
   Kernel 2.6.13.2
   Software Raid-1



2005/11/28, NeilBrown <neilb@suse.de>:
>
>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
>  ./drivers/md/raid1.c         |   18 +++++++++++-------
>  ./include/linux/raid/raid1.h |    7 +++++++
>  2 files changed, 18 insertions(+), 7 deletions(-)
>
> diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
> --- ./drivers/md/raid1.c~current~       2005-11-28 10:13:27.000000000 +1100
> +++ ./drivers/md/raid1.c        2005-11-28 10:13:32.000000000 +1100
> @@ -154,7 +154,7 @@ static void put_all_bios(conf_t *conf, r
>
>         for (i = 0; i < conf->raid_disks; i++) {
>                 struct bio **bio = r1_bio->bios + i;
> -               if (*bio)
> +               if (*bio && *bio != IO_BLOCKED)
>                         bio_put(*bio);
>                 *bio = NULL;
>         }
> @@ -418,11 +418,13 @@ static int read_balance(conf_t *conf, r1
>                 new_disk = 0;
>
>                 for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
> +                    r1_bio->bios[new_disk] == IO_BLOCKED ||
>                      !rdev || !test_bit(In_sync, &rdev->flags)
>                              || test_bit(WriteMostly, &rdev->flags);
>                      rdev = rcu_dereference(conf->mirrors[++new_disk].rdev)) {
>
> -                       if (rdev && test_bit(In_sync, &rdev->flags))
> +                       if (rdev && test_bit(In_sync, &rdev->flags) &&
> +                               r1_bio->bios[new_disk] != IO_BLOCKED)
>                                 wonly_disk = new_disk;
>
>                         if (new_disk == conf->raid_disks - 1) {
> @@ -436,11 +438,13 @@ static int read_balance(conf_t *conf, r1
>
>         /* make sure the disk is operational */
>         for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
> +            r1_bio->bios[new_disk] == IO_BLOCKED ||
>              !rdev || !test_bit(In_sync, &rdev->flags) ||
>                      test_bit(WriteMostly, &rdev->flags);
>              rdev = rcu_dereference(conf->mirrors[new_disk].rdev)) {
>
> -               if (rdev && test_bit(In_sync, &rdev->flags))
> +               if (rdev && test_bit(In_sync, &rdev->flags) &&
> +                   r1_bio->bios[new_disk] != IO_BLOCKED)
>                         wonly_disk = new_disk;
>
>                 if (new_disk <= 0)
> @@ -477,7 +481,7 @@ static int read_balance(conf_t *conf, r1
>
>                 rdev = rcu_dereference(conf->mirrors[disk].rdev);
>
> -               if (!rdev ||
> +               if (!rdev || r1_bio->bios[disk] == IO_BLOCKED ||
>                     !test_bit(In_sync, &rdev->flags) ||
>                     test_bit(WriteMostly, &rdev->flags))
>                         continue;
> @@ -1334,7 +1338,7 @@ static void raid1d(mddev_t *mddev)
>                         sector_t sect = r1_bio->sector;
>                         int sectors = r1_bio->sectors;
>                         freeze_array(conf);
> -                       while(sectors) {
> +                       if (mddev->ro == 0) while(sectors) {
>                                 int s = sectors;
>                                 int d = r1_bio->read_disk;
>                                 int success = 0;
> @@ -1387,7 +1391,6 @@ static void raid1d(mddev_t *mddev)
>                                 sect += s;
>                         }
>
> -
>                         unfreeze_array(conf);
>
>                         bio = r1_bio->bios[r1_bio->read_disk];
> @@ -1398,7 +1401,8 @@ static void raid1d(mddev_t *mddev)
>                                        (unsigned long long)r1_bio->sector);
>                                 raid_end_bio_io(r1_bio);
>                         } else {
> -                               r1_bio->bios[r1_bio->read_disk] = NULL;
> +                               r1_bio->bios[r1_bio->read_disk] =
> +                                       mddev->ro ? IO_BLOCKED : NULL;
>                                 r1_bio->read_disk = disk;
>                                 bio_put(bio);
>                                 bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
>
> diff ./include/linux/raid/raid1.h~current~ ./include/linux/raid/raid1.h
> --- ./include/linux/raid/raid1.h~current~       2005-11-28 10:13:11.000000000 +1100
> +++ ./include/linux/raid/raid1.h        2005-11-28 10:13:32.000000000 +1100
> @@ -109,6 +109,13 @@ struct r1bio_s {
>         /* DO NOT PUT ANY NEW FIELDS HERE - bios array is contiguously alloced*/
>  };
>
> +/* when we get a read error on a read-only array, we redirect to another
> + * device without failing the first device, or trying to over-write to
> + * correct the read error.  To keep track of bad blocks on a per-bio
> + * level, we store IO_BLOCKED in the appropriate 'bios' pointer
> + */
> +#define IO_BLOCKED ((struct bio*)1)
> +
>  /* bits for r1bio.state */
>  #define        R1BIO_Uptodate  0
>  #define        R1BIO_IsSync    1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10  6:41   ` Yanggun
@ 2005-12-10  6:59     ` Neil Brown
  2005-12-10  7:50       ` Yanggun
  0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2005-12-10  6:59 UTC (permalink / raw)
  To: Yanggun; +Cc: linux-raid

On Saturday December 10, yang.geum.seok@gmail.com wrote:
> Hi Neil,
> 
> I have a raid1 array called md0. It basically runs fine, but
> something is switching md0 to read-only during writes to disk (cp, mv).
> 
> In what cases does the RAID become read-only? Could you describe
> the possible cases?
> 
> I am quite anxious about this. Does this patch solve it? Is there any
> way to keep the array from going read-only?

You will need to give more details about what is happening.  Lots more.

What makes you say the array is 'read-only'  - what are the messages
     you get - exactly?
What filesystem are you using?
What messages are there in the kernel log ('dmesg' might show these)?
What does 'cat /proc/mdstat' show?

NeilBrown

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10  6:59     ` raid1 mysteriously switching to read-only Neil Brown
@ 2005-12-10  7:50       ` Yanggun
  2005-12-10  8:02         ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Yanggun @ 2005-12-10  7:50 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

I am sorry for not providing enough information.

I am currently using the ext3 file system.

I composed the RAID-1 (/dev/md0) disk array with the following commands:

mkfs.ext3 -j /dev/sda1
mkfs.ext3 -j /dev/sda1
mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

	[root@sentry24 root]# cat /proc/mdstat ; cat /proc/mounts
	Personalities : [raid1]
	md0 : active raid1 sda1[0] sdb1[1]
    		244195904 blocks [2/2] [UU]

	unused devices: <none>
	rootfs / rootfs rw 0 0
	/dev/root / ext3 rw,noatime 0 0
	proc /proc proc rw,nodiratime 0 0
	sysfs /sys sysfs rw 0 0
	devpts /dev/pts devpts rw 0 0
	tmpfs /dev/shm tmpfs rw 0 0
	tmpfs /tmp tmpfs rw 0 0
	tmpfs /var tmpfs rw 0 0
	none /proc/bus/usb usbfs rw 0 0
	/dev/md0 /data/disk1 ext3 rw,noatime 0 0

After composing the RAID-1 disk array, I put heavy disk I/O on it
with the following command:

	scp -r root@192.168.0.24:/data/disk3/*.avi /data/disk1/

The mount switched to read-only after about 30 minutes of copying.

	[root@root root]# cat /proc/mdstat ; cat /proc/mounts
	Personalities : [raid1]
	md0 : active raid1 sda1[0] sdb1[1]
     		 244195904 blocks [2/2] [UU]

	unused devices: <none>
	rootfs / rootfs rw 0 0
	/dev/root / ext3 rw,noatime 0 0
	proc /proc proc rw,nodiratime 0 0
	sysfs /sys sysfs rw 0 0
	devpts /dev/pts devpts rw 0 0
	tmpfs /dev/shm tmpfs rw 0 0
	tmpfs /tmp tmpfs rw 0 0
	tmpfs /var tmpfs rw 0 0
	none /proc/bus/usb usbfs rw 0 0
	/dev/md0 /data/disk3 ext3 ro,noatime 0 0       <------------- changed to read-only

	[root@rootroot]# ls -l /data/disk3/dvr
	total 0

	[root@rootroot]# vmstat 1
	procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 	r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 	5  0    128   6592 299528  32676    0    0   386     2 1328  1268 35 64  1  1
 	2  0    128   6468 299528  32676    0    0     0     0 1346  6578 52 48  0  0
 	2  0    128   6468 299528  32676    0    0     0     0 1344  7606 28 72  0  0

       [root@sentry24 root]# tail -f /var/log/kern.log
       Dec 10 01:36:32 kernel: [   48.647619] EXT3 FS on hda1, internal journal
       Dec 10 01:36:32 kernel: [   52.516666] SCSI subsystem initialized
       Dec 10 01:36:32 kernel: [   52.551594] PROMISE SATA-II 150/300
Series Linux Driver v1.01.0.20
       Dec 10 01:36:32 kernel: [   52.553736] ACPI: PCI Interrupt
0000:01:05.0[A] -> GSI 16 (level, low) -> IRQ 177
       Dec 10 01:36:32 kernel: [   52.693131] ulsata2:[info] Drive
1/0: WDC WD2500JS-22MHB0    488397167s 250059MB  UDMA6
       Dec 10 01:36:32 kernel: [   52.807338] ulsata2:[info] Drive
3/0: WDC WD2500JS-22MHB0    488397167s 250059MB  UDMA6
       Dec 10 01:36:32 kernel: [   52.809447] scsi0 : ulsata2
       Dec 10 01:36:32 kernel: [   52.823914]   Vendor:          
Model: WDC WD2500JS-22M  Rev:
       Dec 10 01:36:32 kernel: [   52.826003]   Type:   Direct-Access 
                    ANSI SCSI revision: 02
       Dec 10 01:36:32 kernel: [   52.852253]   Vendor:          
Model: WDC WD2500JS-22M  Rev:
       Dec 10 01:36:32 kernel: [   52.854580]   Type:   Direct-Access 
                    ANSI SCSI revision: 02
       Dec 10 01:36:32 kernel: [   52.942756] SCSI device sda:
488397168 512-byte hdwr sectors (250059 MB)
       Dec 10 01:36:32 kernel: [   52.944896] sda: got wrong page
       Dec 10 01:36:32 kernel: [   52.946855] sda: assuming drive
cache: write through
       Dec 10 01:36:32 kernel: [   52.961355] SCSI device sda:
488397168 512-byte hdwr sectors (250059 MB)
       Dec 10 01:36:32 kernel: [   52.963487] sda: got wrong page
       Dec 10 01:36:32 kernel: [   52.965451] sda: assuming drive
cache: write through
       Dec 10 01:36:32 kernel: [   52.967431]  sda: sda1
       Dec 10 01:36:32 kernel: [   52.989793] Attached scsi disk sda
at scsi0, channel 0, id 0, lun 0
       Dec 10 01:36:32 kernel: [   52.991983] SCSI device sdb:
488397168 512-byte hdwr sectors (250059 MB)
       Dec 10 01:36:32 kernel: [   52.994016] sdb: got wrong page
       Dec 10 01:36:32 kernel: [   52.996018] sdb: assuming drive
cache: write through
       Dec 10 01:36:32 kernel: [   53.004583] SCSI device sdb:
488397168 512-byte hdwr sectors (250059 MB)
       Dec 10 01:36:32 kernel: [   53.006722] sdb: got wrong page
       Dec 10 01:36:32 kernel: [   53.008714] sdb: assuming drive
cache: write through
       Dec 10 01:36:32 kernel: [   53.010803]  sdb: sdb1
       Dec 10 01:36:32 kernel: [   53.038174] Attached scsi disk sdb
at scsi0, channel 0, id 2, lun 0
       Dec 10 01:36:32 kernel: [   53.195957] Intel(R) PRO/1000
Network Driver - version 6.0.60-k2
       Dec 10 01:36:32 kernel: [   53.198108] Copyright (c) 1999-2005
Intel Corporation.
       Dec 10 01:36:32 kernel: [   53.212695] ACPI: PCI Interrupt
0000:01:03.0[A] -> GSI 21 (level, low) -> IRQ 185
       Dec 10 01:36:32 kernel: [   53.682839] e1000: eth0:
e1000_probe: Intel(R) PRO/1000 Network Connection
       Dec 10 01:36:32 kernel: [   53.999922] usbcore: registered new
driver usbfs
       Dec 10 01:36:32 kernel: [   54.014605] usbcore: registered new driver hub
       Dec 10 01:36:32 kernel: [   54.059477] ACPI: PCI Interrupt
0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 193
       Dec 10 01:36:32 kernel: [   54.061754] PCI: Setting latency
timer of device 0000:00:1d.7 to 64
       Dec 10 01:36:32 kernel: [   54.061761] ehci_hcd 0000:00:1d.7:
Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller
       Dec 10 01:36:32 kernel: [   54.078687] ehci_hcd 0000:00:1d.7:
new USB bus registered, assigned bus number 1
       Dec 10 01:36:32 kernel: [   54.081030] ehci_hcd 0000:00:1d.7:
irq 193, io mem 0xe8180000
       Dec 10 01:36:32 kernel: [   54.087181] PCI: cache line size of
128 is not supported by device 0000:00:1d.7
       Dec 10 01:36:32 kernel: [   54.087189] ehci_hcd 0000:00:1d.7:
USB 2.0 initialized, EHCI 1.00, driver 10 Dec 2004
       Dec 10 01:36:32 kernel: [   54.116718] hub 1-0:1.0: USB hub found
       Dec 10 01:36:32 kernel: [   54.119049] hub 1-0:1.0: 6 ports detected
       Dec 10 01:36:32 kernel: [   54.675098] md: md driver 0.90.2
MAX_MD_DEVS=256, MD_SB_DISKS=27
       Dec 10 01:36:32 kernel: [   54.677398] md: bitmap version 3.38
       Dec 10 01:36:32 kernel: [   54.694205] md: raid1 personality
registered as nr 3
       Dec 10 01:36:32 kernel: [   54.868492] USB Universal Host
Controller Interface driver v2.3
       Dec 10 01:36:32 kernel: [   54.883020] ACPI: PCI Interrupt
0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 177
       Dec 10 01:36:32 kernel: [   54.885145] PCI: Setting latency
timer of device 0000:00:1d.0 to 64
       Dec 10 01:36:32 kernel: [   54.885151] uhci_hcd 0000:00:1d.0:
Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI
Controller #1
       Dec 10 01:36:32 kernel: [   54.901825] uhci_hcd 0000:00:1d.0:
new USB bus registered, assigned bus number 2
       Dec 10 01:36:32 kernel: [   54.904054] uhci_hcd 0000:00:1d.0:
irq 177, io base 0x0000e200
       Dec 10 01:36:32 kernel: [   55.034027] hub 2-0:1.0: USB hub found
       Dec 10 01:36:32 kernel: [   55.036202] hub 2-0:1.0: 2 ports detected
       Dec 10 01:36:32 kernel: [   55.362260] ACPI: PCI Interrupt
0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 201
       Dec 10 01:36:32 kernel: [   55.364448] PCI: Setting latency
timer of device 0000:00:1d.1 to 64
       Dec 10 01:36:32 kernel: [   55.364454] uhci_hcd 0000:00:1d.1:
Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI
Controller #2
       Dec 10 01:36:32 kernel: [   55.381310] uhci_hcd 0000:00:1d.1:
new USB bus registered, assigned bus number 3
       Dec 10 01:36:32 kernel: [   55.383568] uhci_hcd 0000:00:1d.1:
irq 201, io base 0x0000e000
       Dec 10 01:36:32 kernel: [   55.434004] hub 3-0:1.0: USB hub found
       Dec 10 01:36:32 kernel: [   55.436192] hub 3-0:1.0: 2 ports detected
       Dec 10 01:36:32 kernel: [   55.457925] ACPI: PCI Interrupt
0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 169
       Dec 10 01:36:32 kernel: [   55.460171] PCI: Setting latency
timer of device 0000:00:1d.2 to 64
       Dec 10 01:36:32 kernel: [   55.460177] uhci_hcd 0000:00:1d.2:
Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI
Controller #3
       Dec 10 01:36:32 kernel: [   55.477009] uhci_hcd 0000:00:1d.2:
new USB bus registered, assigned bus number 4
       Dec 10 01:36:32 kernel: [   55.479289] uhci_hcd 0000:00:1d.2:
irq 169, io base 0x0000e100
       Dec 10 01:36:32 kernel: [   55.503270] hub 4-0:1.0: USB hub found
       Dec 10 01:36:32 kernel: [   55.505478] hub 4-0:1.0: 2 ports detected
       Dec 10 01:36:32 kernel: [   56.429056] Initializing USB Mass
Storage driver...
       Dec 10 01:36:32 kernel: [   56.597725] usb 4-2: new full speed
USB device using uhci_hcd and address 2
       Dec 10 01:36:32 kernel: [   58.327192] usbcore: registered new
driver usb-storage
       Dec 10 01:36:32 kernel: [   58.329386] USB Mass Storage support
registered.
       Dec 10 01:36:32 kernel: [   58.382477] usbcore: registered new
driver usbserial
       Dec 10 01:36:32 kernel: [   58.397053]
drivers/usb/serial/usb-serial.c: USB Serial support registered for
Generic
       Dec 10 01:36:32 kernel: [   58.411747] usbcore: registered new
driver usbserial_generic
       Dec 10 01:36:32 kernel: [   58.633823]
drivers/usb/serial/usb-serial.c: USB Serial Driver core v2.0
       Dec 10 01:36:32 kernel: [   58.760438] usbcore: registered new
driver hiddev
       Dec 10 01:36:32 kernel: [   58.778876] input: USB HID v1.00
Device [Burr-Brown from TI               USB Audio CODEC ] on
usb-0000:00:1d.2-2
       Dec 10 01:36:32 kernel: [   58.783803] usbcore: registered new
driver usbhid
       Dec 10 01:36:32 kernel: [   58.786300]
drivers/usb/input/hid-core.c: v2.01:USB HID core driver
       Dec 10 01:36:32 kernel: [   58.894333]
drivers/usb/serial/usb-serial.c: USB Serial support registered for
PL-2303
       Dec 10 01:36:32 kernel: [   58.909269] usbcore: registered new
driver pl2303
       Dec 10 01:36:32 kernel: [   58.911796]
drivers/usb/serial/pl2303.c: Prolific PL2303 USB to serial adaptor
driver v0.12
       Dec 10 01:36:32 kernel: [   59.014389] ieee1394: Initialized
config rom entry `ip1394'
       Dec 10 01:36:32 kernel: [   59.078076] usbcore: registered new
driver snd-usb-audio
       Dec 10 01:36:32 kernel: [   59.199443] sbp2: $Rev: 1306 $ Ben
Collins <bcollins@debian.org>
       Dec 10 01:36:32 kernel: [   59.232766] ACPI: Power Button (FF) [PWRF]
       Dec 10 01:36:32 kernel: [   59.235234] ACPI: Power Button (CM) [PWRB]
       Dec 10 01:36:32 kernel: [   59.526876] odcap_driver.c:910:
odcap: driver registered for major 97.
       Dec 10 01:36:32 kernel: [   59.541867] odcap_driver.c:278:
odcap0: device found.
       Dec 10 01:36:32 kernel: [   59.544359] ACPI: PCI Interrupt
0000:01:06.0[A] -> GSI 17 (level, low) -> IRQ 209
       Dec 10 01:36:32 kernel: [   59.546836] PCI: Setting latency
timer of device 0000:01:06.0 to 64
       Dec 10 01:36:32 kernel: [   59.546847] kernel_api.c:251:
odcap0: (240b rev 0) at 0000:01:06.0, irq:209, latency:0
       Dec 10 01:36:32 kernel: [   59.549297] kernel_api.c:267:
odcap0: port_addr:0x0000d300 port_len:0x00000004
       Dec 10 01:36:32 kernel: [   59.551750] kernel_api.c:292:
odcap0: io_mem:0xe002a000 (mmio:0xe8080000, len:0x1000)
       Dec 10 01:36:32 kernel: [   61.224923] ACPI: PCI Interrupt
0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 177
       Dec 10 01:36:32 kernel: [   61.226432] mtrr:
0xe0000000,0x8000000 overlaps existing 0xe0000000,0x400000
       Dec 10 01:36:32 kernel: [   61.240403] [drm] Initialized i830
1.3.2 20021108 on minor 0: Intel Corporation 82845G/GL[Brookdale-G]/GE
Chipset Integrated Graphics Device
       Dec 10 01:36:32 kernel: [   61.420124] Attached scsi generic
sg0 at scsi0, channel 0, id 0, lun 0,  type 0
       Dec 10 01:36:32 kernel: [   61.435944] Attached scsi generic
sg1 at scsi0, channel 0, id 2, lun 0,  type 0
       Dec 10 01:36:32 kernel: [   65.656940] e1000: eth0:
e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
       Dec 10 01:36:32 kernel: [   66.997072] lp: driver loaded but no
devices found
       Dec 10 01:36:35 kernel: [   69.496288] apm: BIOS version 1.2
Flags 0x07 (Driver version 1.16ac)
       Dec 10 01:36:35 kernel: [   69.496295] apm: overridden by ACPI.
       Dec 10 01:36:35 kernel: [   69.937106] mtrr:
0xe0000000,0x8000000 overlaps existing 0xe0000000,0x400000
       Dec 10 01:36:41 kernel: [   75.979540] e1000: eth0:
e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
       Dec 10 01:36:45 kernel: Kernel logging (proc) stopped.
       Dec 10 01:36:45 kernel: Kernel log daemon terminating.
       Dec 10 01:36:46 kernel: klogd 1.4.1#10, log source = /proc/kmsg started.
       Dec 10 01:36:46 kernel: Inspecting /boot/System.map-2.6.13.2
       Dec 10 01:36:46 kernel: Loaded 28981 symbols from
/boot/System.map-2.6.13.2.
       Dec 10 01:36:46 kernel: Symbols match kernel version 2.6.13.
       Dec 10 01:36:46 kernel: No module symbols loaded - kernel
modules not enabled.
       Dec 10 01:36:50 kernel: [   84.661743] md: bind<sda1>
       Dec 10 01:36:50 kernel: [   84.661911] md: bind<sdb1>
       Dec 10 01:36:50 kernel: [   84.662081] raid1: raid set md0
active with 2 out of 2 mirrors
       Dec 10 01:36:50 kernel: [   84.663584] md: syncing RAID array md0
       Dec 10 01:36:50 kernel: [   84.663789] md: minimum _guaranteed_
reconstruction speed: 1000 KB/sec/disc.
       Dec 10 01:36:50 kernel: [   84.663923] md: using maximum
available idle IO bandwith (but not more than 200000 KB/sec) for
reconstruction.
       Dec 10 01:36:50 kernel: [   84.664062] md: using 128k window,
over a total of 244195904 blocks.
       Dec 10 01:36:50 kernel: [   84.733296] kjournald starting. 
Commit interval 5 seconds
       Dec 10 01:36:50 kernel: [   84.740243] EXT3 FS on md0, internal journal
       Nov 17 01:36:50 kernel: [   84.740347] EXT3-fs: mounted
filesystem with ordered data mode.

Yanggun

2005/12/10, Neil Brown <neilb@suse.de>:
> On Saturday December 10, yang.geum.seok@gmail.com wrote:
> > Hi Neil,
> >
> > I have a raid1 array called md0. It basically runs fine, but
> > something is switching md0 to read-only during writes to disk (cp, mv).
> >
> > In what cases does the RAID become read-only? Could you describe
> > the possible cases?
> >
> > I am quite anxious about this. Does this patch solve it? Is there any
> > way to keep the array from going read-only?
>
> You will need to give more details about what is happening.  Lots more.
>
> What makes you say the array is 'read-only'  - what are the messages
>      you get - exactly?
> What filesystem are you using?
> What messages are there in the kernel log ('dmesg' might show these)?
> What does 'cat /proc/mdstat' show?
>
> NeilBrown
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10  7:50       ` Yanggun
@ 2005-12-10  8:02         ` Neil Brown
  2005-12-10  8:10           ` Yanggun
  0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2005-12-10  8:02 UTC (permalink / raw)
  To: Yanggun; +Cc: linux-raid

On Saturday December 10, yang.geum.seok@gmail.com wrote:
> I am sorry for not providing enough information.
> 
> I am currently using the ext3 file system.
> 
> I composed the RAID-1 (/dev/md0) disk array with the following commands:
> 
> mkfs.ext3 -j /dev/sda1
> mkfs.ext3 -j /dev/sda1
> mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

This is not the way raid1 works.
The raid1 array /dev/md0 will be slightly smaller than either sda1 or
sdb1.  So the filesystem will 'think' it is the size of sda1, will
eventually discover it is smaller, and will fail.
You should:
  mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mkfs.ext3 -j /dev/md0
  mount /dev/md0 /data/disk1

and THEN use the filesystem.
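
One quick way to see the size difference for yourself (illustrative
only - the md superblock is stored near the end of each component):

  cat /proc/partitions   # the '#blocks' figure for md0 is slightly
                         # smaller than for sda1 and sdb1
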
NeilBrown

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10  8:02         ` Neil Brown
@ 2005-12-10  8:10           ` Yanggun
  2005-12-10 12:10             ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Yanggun @ 2005-12-10  8:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Oh, sorry, that was a typo in my mail.

What I actually did was:
  mkfs.ext3 -j /dev/sda1
  mkfs.ext3 -j /dev/sdb1
  mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mount /dev/md0 /data/disk1
  scp ... [snip]

Should I format after making the RAID-1 (/dev/md0) device?

After formatting the disks (/dev/sda1, /dev/sdb1), shouldn't I compose
the RAID-1?

Can that cause a problem?

> You should:
>   mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
>   mkfs.ext3 -j /dev/md0
>   mount /dev/md0 /data/disk1


2005/12/10, Neil Brown <neilb@suse.de>:
> On Saturday December 10, yang.geum.seok@gmail.com wrote:
> > I am sorry for not providing enough information.
> >
> > I am currently using the ext3 file system.
> >
> > I composed the RAID-1 (/dev/md0) disk array with the following commands:
> >
> > mkfs.ext3 -j /dev/sda1
> > mkfs.ext3 -j /dev/sda1
> > mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
>
> This is not the way raid1 works.
> The raid1 array /dev/md0 will be slightly smaller than either sda1 or
> sdb1.  So the filesystem will 'think' it is the size of sda1, will
> eventually discover it is smaller, and will fail.
> You should:
>   mdadm -Cv /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
>   mkfs.ext3 -j /dev/md0
>   mount /dev/md0 /data/disk1
>
> and THEN use the filesystem.
> NeilBrown
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10  8:10           ` Yanggun
@ 2005-12-10 12:10             ` Neil Brown
  2005-12-11 13:04               ` Yanggun
  0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2005-12-10 12:10 UTC (permalink / raw)
  To: Yanggun; +Cc: linux-raid

On Saturday December 10, yang.geum.seok@gmail.com wrote:
> 
> Should I format after making the RAID-1 (/dev/md0) device?

Yes.  The format (mkfs) must come AFTER making the RAID-1 (mdadm -C).
> 
> After formatting the disks (/dev/sda1, /dev/sdb1), shouldn't I compose
> the RAID-1?

Don't format sda1 or sdb1.  Compose the RAID-1 first, and then format
md0.

NeilBrown

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-10 12:10             ` Neil Brown
@ 2005-12-11 13:04               ` Yanggun
  2005-12-11 14:14                 ` Patrik Jonsson
  0 siblings, 1 reply; 33+ messages in thread
From: Yanggun @ 2005-12-11 13:04 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Thank you very much.

Could you explain what the technical difference is?

Yanggun

2005/12/10, Neil Brown <neilb@suse.de>:
> On Saturday December 10, yang.geum.seok@gmail.com wrote:
> >
> > Should I format after making the RAID-1 (/dev/md0) device?
>
> Yes.  The format (mkfs) must come AFTER making the RAID-1 (mdadm -C).
> >
> > After formatting the disks (/dev/sda1, /dev/sdb1), shouldn't I compose
> > the RAID-1?
>
> Don't format sda1 or sdb1.  Compose the RAID-1 first, and then format
> md0.
>
> NeilBrown
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-11 13:04               ` Yanggun
@ 2005-12-11 14:14                 ` Patrik Jonsson
  2005-12-11 14:29                   ` Yanggun
  0 siblings, 1 reply; 33+ messages in thread
From: Patrik Jonsson @ 2005-12-11 14:14 UTC (permalink / raw)
  To: Yanggun; +Cc: Neil Brown, linux-raid

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

hello,

Perhaps this can answer some questions:

http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html

cheers,

/Patrik

Yanggun wrote:
> Thank you very much.
> 
> Could you explain what the technical difference is?
> 
> Yanggun
> 
> 2005/12/10, Neil Brown <neilb@suse.de>:
> 
>>On Saturday December 10, yang.geum.seok@gmail.com wrote:
>>
>>>Should I format after making the RAID-1 (/dev/md0) device?
>>
>>Yes.  The format (mkfs) must come AFTER making the RAID-1 (mdadm -C).
>>
>>>After formatting the disks (/dev/sda1, /dev/sdb1), shouldn't I compose
>>>the RAID-1?
>>
>>Don't format sda1 or sdb1.  Compose the RAID-1 first, and then format
>>md0.
>>
>>NeilBrown
>>
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-11 14:14                 ` Patrik Jonsson
@ 2005-12-11 14:29                   ` Yanggun
  2005-12-11 17:13                     ` Ross Vandegrift
  0 siblings, 1 reply; 33+ messages in thread
From: Yanggun @ 2005-12-11 14:29 UTC (permalink / raw)
  To: Patrik Jonsson; +Cc: Neil Brown, linux-raid

I could not find the relevant part in that document.

I am sorry - could you tell me which section it is in?

2005/12/11, Patrik Jonsson <patrik@ucolick.org>:
> hello,
>
> Perhaps this can answer some questions:
>
> http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
>
> cheers,
>
> /Patrik
>
> Yanggun wrote:
> > Thank you very much.
> >
> > Could you explain what the technical difference is?
> >
> > Yanggun
> >
> > 2005/12/10, Neil Brown <neilb@suse.de>:
> >
> >>On Saturday December 10, yang.geum.seok@gmail.com wrote:
> >>
> >>>Should I format after making the RAID-1 (/dev/md0) device?
> >>
> >>Yes.  The format (mkfs) must come AFTER making the RAID-1 (mdadm -C).
> >>
> >>>After formatting the disks (/dev/sda1, /dev/sdb1), shouldn't I compose
> >>>the RAID-1?
> >>
> >>Don't format sda1 or sdb1.  Compose the RAID-1 first, and then format
> >>md0.
> >>
> >>NeilBrown
> >>
> >
>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-11 14:29                   ` Yanggun
@ 2005-12-11 17:13                     ` Ross Vandegrift
  2005-12-11 23:28                       ` Yanggun
  0 siblings, 1 reply; 33+ messages in thread
From: Ross Vandegrift @ 2005-12-11 17:13 UTC (permalink / raw)
  To: Yanggun; +Cc: Patrik Jonsson, Neil Brown, linux-raid

On Sun, Dec 11, 2005 at 11:29:23PM +0900, Yanggun wrote:
> I could not find the relevant part in that document.
> 
> I am sorry - could you tell me which section it is in?

Basically, the md driver acts like another hard disk.  So a regular
setup looks like this:

---------------------------------
|				|
|	Applications (ls, cat)	|
|				|
---------------------------------
		|
		v
---------------------------------
|				|
|	Filesystem (ext3)	|
|				|
---------------------------------
		|
		v
---------------------------------
|				|
|	Block layer (sda1/sdb1)	|	
|				|
---------------------------------

and a software RAID setup looks like this:

---------------------------------
|				|
|	Applications (ls, cat)	|
|				|
---------------------------------
		|
		v
---------------------------------
|				|
|	Filesystem (ext3)	|
|				|
---------------------------------
		|
		v
---------------------------------
|				|
|	Block layer (md0)	|	
|				|
---------------------------------
		|
		v
---------------------------------
|				|
|	Block Layer (sda1/sdb1)	|
|				|
---------------------------------

The format of sda1/sdb1 is the RAID data that the md driver writes to
them.  It forms a complete layer of abstraction between your
applications and the actual devices you are storing data on.
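
You can poke at both layers directly, for example (output omitted
here):

  mdadm --examine /dev/sda1   # the md superblock stored on a component
  mdadm --detail /dev/md0     # the assembled array the filesystem sees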

-- 
Ross Vandegrift
ross@lug.udel.edu

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: raid1 mysteriously switching to read-only
  2005-12-11 17:13                     ` Ross Vandegrift
@ 2005-12-11 23:28                       ` Yanggun
  0 siblings, 0 replies; 33+ messages in thread
From: Yanggun @ 2005-12-11 23:28 UTC (permalink / raw)
  To: Ross Vandegrift; +Cc: Patrik Jonsson, Neil Brown, linux-raid

Thanks.

I really appreciate you explaining it with the diagrams.

Yanggun.

2005/12/12, Ross Vandegrift <ross@jose.lug.udel.edu>:
> On Sun, Dec 11, 2005 at 11:29:23PM +0900, Yanggun wrote:
> > I could not find the relevant part in that document.
> >
> > I am sorry - could you tell me which section it is in?
>
> Basically, the md driver acts like another hard disk.  So a regular
> setup looks like this:
>
> ---------------------------------
> |                               |
> |       Applications (ls, cat)  |
> |                               |
> ---------------------------------
>                 |
>                 v
> ---------------------------------
> |                               |
> |       Filesystem (ext3)       |
> |                               |
> ---------------------------------
>                 |
>                 v
> ---------------------------------
> |                               |
> |       Block layer (sda1/sdb1) |
> |                               |
> ---------------------------------
>
> and a software RAID setup looks like this:
>
> ---------------------------------
> |                               |
> |       Applications (ls, cat)  |
> |                               |
> ---------------------------------
>                 |
>                 v
> ---------------------------------
> |                               |
> |       Filesystem (ext3)       |
> |                               |
> ---------------------------------
>                 |
>                 v
> ---------------------------------
> |                               |
> |       Block layer (md0)       |
> |                               |
> ---------------------------------
>                 |
>                 v
> ---------------------------------
> |                               |
> |       Block Layer (sda1/sdb1) |
> |                               |
> ---------------------------------
>
> The format of sda1/sdb1 is the RAID data that the md driver writes to
> them.  It forms a complete layer of abstraction between your
> applications and the actual devices you are storing data on.
>
> --
> Ross Vandegrift
> ross@lug.udel.edu
>
> "The good Christian should beware of mathematicians, and all those who
> make empty prophecies. The danger already exists that the mathematicians
> have made a covenant with the devil to darken the spirit and to confine
> man in the bonds of Hell."
>         --St. Augustine, De Genesi ad Litteram, Book II, xviii, 37
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2005-12-11 23:28 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-27 23:39 [PATCH md 000 of 18] Introduction NeilBrown
2005-11-27 23:39 ` [PATCH md 001 of 18] Improve read speed to raid10 arrays using 'far copies' NeilBrown
2005-11-27 23:39 ` [PATCH md 002 of 18] Fix locking problem in r5/r6 NeilBrown
2005-11-27 23:39 ` [PATCH md 003 of 18] Fix problem with raid6 intent bitmap NeilBrown
2005-11-27 23:39 ` [PATCH md 004 of 18] Set default_bitmap_offset properly in set_array_info NeilBrown
2005-11-27 23:40 ` [PATCH md 005 of 18] Fix --re-add for raid1 and raid6 NeilBrown
2005-11-27 23:40 ` [PATCH md 006 of 18] Improve raid1 "IO Barrier" concept NeilBrown
2005-11-27 23:40 ` [PATCH md 007 of 18] Improve raid10 " NeilBrown
2005-11-27 23:40 ` [PATCH md 008 of 18] Small cleanups for raid5 NeilBrown
2005-11-27 23:40 ` [PATCH md 010 of 18] Move bitmap_create to after md array has been initialised NeilBrown
2005-11-27 23:40 ` [PATCH md 011 of 18] Write intent bitmap support for raid10 NeilBrown
2005-11-27 23:40 ` [PATCH md 012 of 18] Fix raid6 resync check/repair code NeilBrown
2005-11-27 23:40 ` [PATCH md 013 of 18] Improve handing of read errors with raid6 NeilBrown
2005-11-30 22:33   ` Carlos Carvalho
2005-12-01  2:54     ` Neil Brown
2005-11-27 23:40 ` [PATCH md 014 of 18] Attempt to auto-correct read errors in raid1 NeilBrown
2005-11-29 16:38   ` Paul Clements
2005-11-29 23:21     ` Neil Brown
2005-11-27 23:40 ` [PATCH md 015 of 18] Tidyup some issues with raid1 resync and prepare for catching read errors NeilBrown
2005-11-27 23:40 ` [PATCH md 016 of 18] Better handling for read error in raid1 during resync NeilBrown
2005-11-27 23:41 ` [PATCH md 017 of 18] Handle errors when read-only NeilBrown
2005-12-10  6:41   ` Yanggun
2005-12-10  6:59     ` raid1 mysteriously switching to read-only Neil Brown
2005-12-10  7:50       ` Yanggun
2005-12-10  8:02         ` Neil Brown
2005-12-10  8:10           ` Yanggun
2005-12-10 12:10             ` Neil Brown
2005-12-11 13:04               ` Yanggun
2005-12-11 14:14                 ` Patrik Jonsson
2005-12-11 14:29                   ` Yanggun
2005-12-11 17:13                     ` Ross Vandegrift
2005-12-11 23:28                       ` Yanggun
2005-11-27 23:41 ` [PATCH md 018 of 18] Fix up some rdev rcu locking in raid5/6 NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).