[md PATCH 00/22] MD patches queued for 2.6.33

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [md PATCH 00/22] MD patches queued for 2.6.33
@ 2009-12-04  6:48 NeilBrown
  2009-12-04  6:48 ` [md PATCH 15/22] md: Support write-intent bitmaps with externally managed metadata NeilBrown
                   ` (20 more replies)
  0 siblings, 21 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

Greetings,

 As 2.6.32 is now out I thought it was well past time to publish my
 current queue of patched for the 2.6.33 merge window.

 Many of these have been sitting in my queue for a while but some are
 newly written or re-written (I found more problems with the 'barrier'
 handling).
 So I will need to do some testing before I ask Linus to pull them.

 Any review that anyone cares to do will be greatly appreciated.

Thanks,
NeilBrown


---

Arnd Bergmann (1):
      md: move compat_ioctl handling into md.c

NeilBrown (19):
      md/bitmap: protect against bitmap removal while being updated.
      md: adjust resync_min usefully when resync aborts.
      md: don't reset curr_resync_completed after an interrupted resync
      md: support barrier requests on all personalities.
      md/raid5: don't complete make_request on barrier until writes are scheduled
      md: add honouring of suspend_{lo,hi} to raid1.
      md/raid1: add takeover support for raid5->raid1
      md: remove sparse warning:symbol XXX was not declared.
      md: collect bitmap-specific fields into one structure.
      md: move offset, daemon_sleep and chunksize out of bitmap structure
      md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
      md: remove needless setting of thread->timeout in raid10_quiesce
      md: support bitmap offset appropriate for external-metadata arrays.
      md: support updating bitmap parameters via sysfs.
      md: Support write-intent bitmaps with externally managed metadata.
      md/bitmap: update dirty flag when bitmap bits are explicitly set.
      md: add MODULE_DESCRIPTION for all md related modules.
      md: revise Kconfig help for MD_MULTIPATH
      md: integrate spares into array at earliest opportunity.

Robert Becker (2):
      md/raid10: print more useful messages on device failure.
      raid: improve MD/raid10 handling of correctable read errors.


 drivers/md/Kconfig      |    9 -
 drivers/md/bitmap.c     |  450 +++++++++++++++++++++++++++++++++++++++++------
 drivers/md/bitmap.h     |   19 --
 drivers/md/faulty.c     |    1 
 drivers/md/linear.c     |    3 
 drivers/md/md.c         |  289 +++++++++++++++++++++++++-----
 drivers/md/md.h         |   48 ++++-
 drivers/md/multipath.c  |    3 
 drivers/md/raid0.c      |    3 
 drivers/md/raid1.c      |  219 +++++++++++++++--------
 drivers/md/raid1.h      |    5 +
 drivers/md/raid10.c     |  116 +++++++++++-
 drivers/md/raid5.c      |   60 +++++-
 drivers/md/raid6algos.c |   20 --
 fs/compat_ioctl.c       |   22 --
 include/linux/raid/pq.h |   19 ++
 16 files changed, 997 insertions(+), 289 deletions(-)

-- 
Signature


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [md PATCH 01/22] md/bitmap: protect against bitmap removal while being updated.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (13 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 16/22] md/bitmap: update dirty flag when bitmap bits are explicitly set NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 11/22] md: change daemon_sleep to be in 'jiffies' rather than 'seconds' NeilBrown
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid; +Cc: stable

A write intent bitmap can be removed from an array while the
array is active.
When this happens, all IO is suspended and flushed before the
bitmap is removed.
However it is possible that bitmap_daemon_work is still running to
clear old bits from the bitmap.  If it is, it can dereference the
bitmap after it has been freed.

So borrow mddev->open_mutex to protect bitmap_daemon_work and
get it before destroying a bitmap.

This is suitable for any current -stable kernel.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
---
 drivers/md/bitmap.c |   24 ++++++++++++++++++------
 drivers/md/bitmap.h |    2 +-
 drivers/md/md.c     |    4 ++--
 drivers/md/md.h     |    3 ++-
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 60e2b32..a39f632 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1078,23 +1078,31 @@ static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
  *			out to disk
  */
 
-void bitmap_daemon_work(struct bitmap *bitmap)
+void bitmap_daemon_work(mddev_t *mddev)
 {
+	struct bitmap *bitmap;
 	unsigned long j;
 	unsigned long flags;
 	struct page *page = NULL, *lastpage = NULL;
 	int blocks;
 	void *paddr;
 
-	if (bitmap == NULL)
+	/* Borrow open_mutex to guard daemon_work against
+	 * bitmap_destroy.
+	 */
+	mutex_lock(&mddev->open_mutex);
+	bitmap = mddev->bitmap;
+	if (bitmap == NULL) {
+		mutex_unlock(&mddev->open_mutex);
 		return;
+	}
 	if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
 		goto done;
 
 	bitmap->daemon_lastrun = jiffies;
 	if (bitmap->allclean) {
 		bitmap->mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
-		return;
+		goto done;
 	}
 	bitmap->allclean = 1;
 
@@ -1203,6 +1211,7 @@ void bitmap_daemon_work(struct bitmap *bitmap)
  done:
 	if (bitmap->allclean == 0)
 		bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
+	mutex_unlock(&mddev->open_mutex);
 }
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
@@ -1541,9 +1550,9 @@ void bitmap_flush(mddev_t *mddev)
 	 */
 	sleep = bitmap->daemon_sleep;
 	bitmap->daemon_sleep = 0;
-	bitmap_daemon_work(bitmap);
-	bitmap_daemon_work(bitmap);
-	bitmap_daemon_work(bitmap);
+	bitmap_daemon_work(mddev);
+	bitmap_daemon_work(mddev);
+	bitmap_daemon_work(mddev);
 	bitmap->daemon_sleep = sleep;
 	bitmap_update_sb(bitmap);
 }
@@ -1574,6 +1583,7 @@ static void bitmap_free(struct bitmap *bitmap)
 	kfree(bp);
 	kfree(bitmap);
 }
+
 void bitmap_destroy(mddev_t *mddev)
 {
 	struct bitmap *bitmap = mddev->bitmap;
@@ -1581,7 +1591,9 @@ void bitmap_destroy(mddev_t *mddev)
 	if (!bitmap) /* there was no bitmap */
 		return;
 
+	mutex_lock(&mddev->open_mutex);
 	mddev->bitmap = NULL; /* disconnect from the md device */
+	mutex_unlock(&mddev->open_mutex);
 	if (mddev->thread)
 		mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
 
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index e989006..7e38d13 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -282,7 +282,7 @@ void bitmap_close_sync(struct bitmap *bitmap);
 void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector);
 
 void bitmap_unplug(struct bitmap *bitmap);
-void bitmap_daemon_work(struct bitmap *bitmap);
+void bitmap_daemon_work(mddev_t *mddev);
 #endif
 
 #endif
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b182f86..69af3b8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4314,7 +4314,7 @@ static int deny_bitmap_write_access(struct file * file)
 	return 0;
 }
 
-static void restore_bitmap_write_access(struct file *file)
+void restore_bitmap_write_access(struct file *file)
 {
 	struct inode *inode = file->f_mapping->host;
 
@@ -6629,7 +6629,7 @@ void md_check_recovery(mddev_t *mddev)
 
 
 	if (mddev->bitmap)
-		bitmap_daemon_work(mddev->bitmap);
+		bitmap_daemon_work(mddev);
 
 	if (mddev->ro)
 		return;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index f184b69..57ae80c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -443,6 +443,7 @@ extern void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
 extern void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors);
 extern int md_check_no_bitmap(mddev_t *mddev);
 extern int md_integrity_register(mddev_t *mddev);
-void md_integrity_add_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
+extern void md_integrity_add_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
+extern void restore_bitmap_write_access(struct file *file);
 
 #endif /* _MD_MD_H */



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 02/22] md: adjust resync_min usefully when resync aborts.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (17 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 09/22] md: collect bitmap-specific fields into one structure NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 05/22] md/raid5: don't complete make_request on barrier until writes are scheduled NeilBrown
  2009-12-04  6:48 ` [md PATCH 03/22] md: don't reset curr_resync_completed after an interrupted resync NeilBrown
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

When a 'check' or 'repair' finished we should clear resync_min
so that a future check/repair will cover the whole array (by default).
However if it is interrupted, we should update resync_min to
where we got up to, so that when the check/repair continues it
just does the remainder of the array.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 69af3b8..a43539e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6527,11 +6527,15 @@ void md_do_sync(mddev_t *mddev)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
  skip:
+	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+		/* We completed so min/max setting can be forgotten if used. */
+		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+			mddev->resync_min = 0;
+		mddev->resync_max = MaxSector;
+	} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		mddev->resync_min = mddev->curr_resync_completed;
 	mddev->curr_resync = 0;
 	mddev->curr_resync_completed = 0;
-	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
-		/* We completed so max setting can be forgotten. */
-		mddev->resync_max = MaxSector;
 	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	wake_up(&resync_wait);
 	set_bit(MD_RECOVERY_DONE, &mddev->recovery);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 03/22] md: don't reset curr_resync_completed after an interrupted resync
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (19 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 05/22] md/raid5: don't complete make_request on barrier until writes are scheduled NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

If a resync/recovery/check/repair is interrupted for some reason, it
can be useful to know exactly where it got up to.
So in that case, do clear curr_resync_completed.
Initialise it when starting a resync/recovery/... instead.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a43539e..20a3660 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6383,7 +6383,9 @@ void md_do_sync(mddev_t *mddev)
 		       "md: resuming %s of %s from checkpoint.\n",
 		       desc, mdname(mddev));
 		mddev->curr_resync = j;
-	}
+		mddev->curr_resync_completed = j;
+	} else
+		mddev->curr_resync_completed = 0;
 
 	while (j < max_sectors) {
 		sector_t sectors;
@@ -6535,7 +6537,8 @@ void md_do_sync(mddev_t *mddev)
 	} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
 		mddev->resync_min = mddev->curr_resync_completed;
 	mddev->curr_resync = 0;
-	mddev->curr_resync_completed = 0;
+	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		mddev->curr_resync_completed = 0;
 	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	wake_up(&resync_wait);
 	set_bit(MD_RECOVERY_DONE, &mddev->recovery);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 04/22] md: support barrier requests on all personalities.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (15 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 11/22] md: change daemon_sleep to be in 'jiffies' rather than 'seconds' NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-08 13:54   ` Andre Noll
  2009-12-04  6:48 ` [md PATCH 09/22] md: collect bitmap-specific fields into one structure NeilBrown
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

Previously barriers were only supported on RAID1.  This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and the submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/linear.c    |    2 -
 drivers/md/md.c        |  117 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/md.h        |   12 +++++
 drivers/md/multipath.c |    2 -
 drivers/md/raid0.c     |    2 -
 drivers/md/raid10.c    |    2 -
 drivers/md/raid5.c     |    8 +++
 7 files changed, 138 insertions(+), 7 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 1ceceb3..3b3f77c 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -292,7 +292,7 @@ static int linear_make_request (struct request_queue *q, struct bio *bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 20a3660..ffe92db 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -217,12 +217,12 @@ static int md_make_request(struct request_queue *q, struct bio *bio)
 		return 0;
 	}
 	rcu_read_lock();
-	if (mddev->suspended) {
+	if (mddev->suspended || mddev->barrier) {
 		DEFINE_WAIT(__wait);
 		for (;;) {
 			prepare_to_wait(&mddev->sb_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
-			if (!mddev->suspended)
+			if (!mddev->suspended && !mddev->barrier)
 				break;
 			rcu_read_unlock();
 			schedule();
@@ -264,10 +264,122 @@ static void mddev_resume(mddev_t *mddev)
 
 int mddev_congested(mddev_t *mddev, int bits)
 {
+	if (mddev->barrier)
+		return 1;
 	return mddev->suspended;
 }
 EXPORT_SYMBOL(mddev_congested);
 
+/*
+ * Generic barrier handling for md
+ */
+
+static void md_end_barrier(struct bio *bio, int err)
+{
+	mdk_rdev_t *rdev = bio->bi_private;
+	mddev_t *mddev = rdev->mddev;
+	if (err == -EOPNOTSUPP && mddev->barrier != (void*)1)
+		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);
+
+	rdev_dec_pending(rdev, mddev);
+
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		if (mddev->barrier == (void*)1) {
+			/* This was a post-request barrier */
+			mddev->barrier = NULL;
+			wake_up(&mddev->sb_wait);
+		} else
+			/* The pre-request barrier has finished */
+			schedule_work(&mddev->barrier_work);
+	}
+	bio_put(bio);
+}
+
+static void md_submit_barrier(struct work_struct *ws)
+{
+	mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
+	struct bio *bio = mddev->barrier;
+
+	atomic_set(&mddev->flush_pending, 1);
+
+	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
+		bio_endio(bio, -EOPNOTSUPP);
+	else if (bio->bi_size == 0)
+		/* an empty barrier - all done */
+		bio_endio(bio, 0);
+	else {
+		mdk_rdev_t *rdev;
+
+		bio->bi_rw &= ~(1<<BIO_RW_BARRIER);
+		if (mddev->pers->make_request(mddev->queue, bio))
+			generic_make_request(bio);
+		mddev->barrier = (void*)1;
+		rcu_read_lock();
+		list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+			if (rdev->raid_disk >= 0 &&
+			    !test_bit(Faulty, &rdev->flags)) {
+				/* Take two references, one is dropped
+				 * when request finishes, one after
+				 * we reclaim rcu_read_lock
+				 */
+				struct bio *bi;
+				atomic_inc(&rdev->nr_pending);
+				atomic_inc(&rdev->nr_pending);
+				rcu_read_unlock();
+				bi = bio_alloc(GFP_KERNEL, 0);
+				bi->bi_end_io = md_end_barrier;
+				bi->bi_private = rdev;
+				bi->bi_bdev = rdev->bdev;
+				atomic_inc(&mddev->flush_pending);
+				submit_bio(WRITE_BARRIER, bi);
+				rcu_read_lock();
+				rdev_dec_pending(rdev, mddev);
+			}
+		rcu_read_unlock();
+	}
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		mddev->barrier = NULL;
+		wake_up(&mddev->sb_wait);
+	}
+}
+
+void md_barrier_request(mddev_t *mddev, struct bio *bio)
+{
+	mdk_rdev_t *rdev;
+
+	spin_lock_irq(&mddev->write_lock);
+	wait_event_lock_irq(mddev->sb_wait,
+			    !mddev->barrier,
+			    mddev->write_lock, /*nothing*/);
+	mddev->barrier = bio;
+	spin_unlock_irq(&mddev->write_lock);
+
+	atomic_set(&mddev->flush_pending, 1);
+	INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= 0 &&
+		    !test_bit(Faulty, &rdev->flags)) {
+			struct bio *bi;
+
+			atomic_inc(&rdev->nr_pending);
+			atomic_inc(&rdev->nr_pending);
+			rcu_read_unlock();
+			bi = bio_alloc(GFP_KERNEL, 0);
+			bi->bi_end_io = md_end_barrier;
+			bi->bi_private = rdev;
+			bi->bi_bdev = rdev->bdev;
+			atomic_inc(&mddev->flush_pending);
+			submit_bio(WRITE_BARRIER, bi);
+			rcu_read_lock();
+			rdev_dec_pending(rdev, mddev);
+		}
+	rcu_read_unlock();
+	if (atomic_dec_and_test(&mddev->flush_pending))
+		schedule_work(&mddev->barrier_work);
+}
+EXPORT_SYMBOL(md_barrier_request);
 
 static inline mddev_t *mddev_get(mddev_t *mddev)
 {
@@ -374,6 +486,7 @@ static mddev_t * mddev_find(dev_t unit)
 	atomic_set(&new->openers, 0);
 	atomic_set(&new->active_io, 0);
 	spin_lock_init(&new->write_lock);
+	atomic_set(&new->flush_pending, 0);
 	init_waitqueue_head(&new->sb_wait);
 	init_waitqueue_head(&new->recovery_wait);
 	new->reshape_position = MaxSector;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 57ae80c..77f0aad 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -291,6 +291,17 @@ struct mddev_s
 								*/
 
 	struct list_head		all_mddevs;
+
+	/* Generic barrier handling.
+	 * If there is a pending barrier request, all other
+	 * writes are blocked while the devices are flushed.
+	 * The last to finish a flush schedules a worker to
+	 * submit the barrier request (without the barrier flag),
+	 * then submit more flush requests.
+	 */
+	struct bio *barrier;
+	atomic_t flush_pending;
+	struct work_struct barrier_work;
 };
 
 
@@ -431,6 +442,7 @@ extern void md_done_sync(mddev_t *mddev, int blocks, int ok);
 extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);
 
 extern int mddev_congested(mddev_t *mddev, int bits);
+extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
 extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 			   sector_t sector, int size, struct page *page);
 extern void md_super_wait(mddev_t *mddev);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index ee7646f..cbc0a99 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -145,7 +145,7 @@ static int multipath_make_request (struct request_queue *q, struct bio * bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index d3a4ce0..122d07a 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -453,7 +453,7 @@ static int raid0_make_request(struct request_queue *q, struct bio *bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c2cb7b8..2fbf867 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -804,7 +804,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	mdk_rdev_t *blocked_rdev;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d29215d..ecf89c8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3866,7 +3866,13 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	int cpu, remaining;
 
 	if (unlikely(bio_rw_flagged(bi, BIO_RW_BARRIER))) {
-		bio_endio(bi, -EOPNOTSUPP);
+		/* Drain all pending writes.  We only really need
+		 * to ensure they have been submitted, but this is
+		 * easier.
+		 */
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
+		md_barrier_request(mddev, bi);
 		return 0;
 	}
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 05/22] md/raid5: don't complete make_request on barrier until writes are scheduled
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (18 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 02/22] md: adjust resync_min usefully when resync aborts NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2010-01-21 21:07   ` [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil " Tirumala Reddy Marri
  2009-12-04  6:48 ` [md PATCH 03/22] md: don't reset curr_resync_completed after an interrupted resync NeilBrown
  20 siblings, 1 reply; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

The post-barrier-flush is sent be md as soon as make_request on the
barrier write completes.  For raid5, the data might not be in the
per-device queues yet.  So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.

We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid5.c |   51 +++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ecf89c8..8c772b2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2947,6 +2947,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	struct r5dev *dev;
 	mdk_rdev_t *blocked_rdev = NULL;
 	int prexor;
+	int dec_preread_active = 0;
 
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
@@ -3096,12 +3097,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-				IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			dec_preread_active = 1;
 	}
 
 	/* Now to consider new write requests and what else, if anything
@@ -3208,6 +3205,16 @@ static void handle_stripe5(struct stripe_head *sh)
 
 	ops_run_io(sh, &s);
 
+	if (dec_preread_active) {
+		/* We delay this until after ops_run_io so that if make_request
+		 * is waiting on a barrier, it won't continue until the writes
+		 * have actually been submitted.
+		 */
+		atomic_dec(&conf->preread_active_stripes);
+		if (atomic_read(&conf->preread_active_stripes) <
+		    IO_THRESHOLD)
+			md_wakeup_thread(conf->mddev->thread);
+	}
 	return_io(return_bi);
 }
 
@@ -3221,6 +3228,7 @@ static void handle_stripe6(struct stripe_head *sh)
 	struct r6_state r6s;
 	struct r5dev *dev, *pdev, *qdev;
 	mdk_rdev_t *blocked_rdev = NULL;
+	int dec_preread_active = 0;
 
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
 		"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
@@ -3380,12 +3388,8 @@ static void handle_stripe6(struct stripe_head *sh)
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-				IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			dec_preread_active = 1;
 	}
 
 	/* Now to consider new write requests and what else, if anything
@@ -3494,6 +3498,18 @@ static void handle_stripe6(struct stripe_head *sh)
 
 	ops_run_io(sh, &s);
 
+
+	if (dec_preread_active) {
+		/* We delay this until after ops_run_io so that if make_request
+		 * is waiting on a barrier, it won't continue until the writes
+		 * have actually been submitted.
+		 */
+		atomic_dec(&conf->preread_active_stripes);
+		if (atomic_read(&conf->preread_active_stripes) <
+		    IO_THRESHOLD)
+			md_wakeup_thread(conf->mddev->thread);
+	}
+
 	return_io(return_bi);
 }
 
@@ -3996,6 +4012,9 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			finish_wait(&conf->wait_for_overlap, &w);
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
+			if (mddev->barrier && 
+			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+				atomic_inc(&conf->preread_active_stripes);
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
@@ -4015,6 +4034,14 @@ static int make_request(struct request_queue *q, struct bio * bi)
 
 		bio_endio(bi, 0);
 	}
+
+	if (mddev->barrier) {
+		/* We need to wait for the stripes to all be handled.
+		 * So: wait for preread_active_stripes to drop to 0.
+		 */
+		wait_event(mddev->thread->wqueue,
+			   atomic_read(&conf->preread_active_stripes) == 0);
+	}
 	return 0;
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 06/22] md: add honouring of suspend_{lo,hi} to raid1.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (4 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 20/22] md: revise Kconfig help for MD_MULTIPATH NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 12/22] md: remove needless setting of thread->timeout in raid10_quiesce NeilBrown
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

This will allow us to stop writeout to portions of the array
while  they are resynced by someone else - e.g. another node in
a cluster.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid1.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e07ce2e..35b2d86 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -801,6 +801,25 @@ static int make_request(struct request_queue *q, struct bio * bio)
 
 	md_write_start(mddev, bio); /* wait on superblock update early */
 
+	if (bio_data_dir(bio) == WRITE &&
+	    bio->bi_sector + bio->bi_size/512 > mddev->suspend_lo &&
+	    bio->bi_sector < mddev->suspend_hi) {
+		/* As the suspend_* range is controlled by
+		 * userspace, we want an interruptible
+		 * wait.
+		 */
+		DEFINE_WAIT(w);
+		for (;;) {
+			flush_signals(current);
+			prepare_to_wait(&conf->wait_barrier,
+					&w, TASK_INTERRUPTIBLE);
+			if (bio->bi_sector + bio->bi_size/512 <= mddev->suspend_lo ||
+			    bio->bi_sector >= mddev->suspend_hi)
+				break;
+			schedule();
+		}
+		finish_wait(&conf->wait_barrier, &w);
+	}
 	if (unlikely(!mddev->barriers_work &&
 		     bio_rw_flagged(bio, BIO_RW_BARRIER))) {
 		if (rw == WRITE)
@@ -2271,6 +2290,9 @@ static void raid1_quiesce(mddev_t *mddev, int state)
 	conf_t *conf = mddev->private;
 
 	switch(state) {
+	case 2: /* wake for suspend */
+		wake_up(&conf->wait_barrier);
+		break;
 	case 1:
 		raise_barrier(conf);
 		break;



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 07/22] md/raid1: add takeover support for raid5->raid1
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (7 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 10/22] md: move offset, daemon_sleep and chunksize out of bitmap structure NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 19/22] md: add MODULE_DESCRIPTION for all md related modules NeilBrown
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

A 2-device raid5 array can now be converted to raid1.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid1.c |  193 +++++++++++++++++++++++++++++++---------------------
 drivers/md/raid1.h |    5 +
 2 files changed, 121 insertions(+), 77 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 35b2d86..18da334 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -677,6 +677,7 @@ static void raise_barrier(conf_t *conf)
 static void lower_barrier(conf_t *conf)
 {
 	unsigned long flags;
+	BUG_ON(conf->barrier <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
 	conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
@@ -1960,74 +1961,48 @@ static sector_t raid1_size(mddev_t *mddev, sector_t sectors, int raid_disks)
 	return mddev->dev_sectors;
 }
 
-static int run(mddev_t *mddev)
+static conf_t *setup_conf(mddev_t *mddev)
 {
 	conf_t *conf;
-	int i, j, disk_idx;
-	mirror_info_t *disk;
 	mdk_rdev_t *rdev;
+	int err = -ENOMEM;
+	mirror_info_t *disk;
+	int i;
 
-	if (mddev->level != 1) {
-		printk("raid1: %s: raid level not set to mirroring (%d)\n",
-		       mdname(mddev), mddev->level);
-		goto out;
-	}
-	if (mddev->reshape_position != MaxSector) {
-		printk("raid1: %s: reshape_position set but not supported\n",
-		       mdname(mddev));
-		goto out;
-	}
-	/*
-	 * copy the already verified devices into our private RAID1
-	 * bookkeeping area. [whatever we allocate in run(),
-	 * should be freed in stop()]
-	 */
 	conf = kzalloc(sizeof(conf_t), GFP_KERNEL);
-	mddev->private = conf;
 	if (!conf)
-		goto out_no_mem;
+		goto abort;
 
 	conf->mirrors = kzalloc(sizeof(struct mirror_info)*mddev->raid_disks,
 				 GFP_KERNEL);
 	if (!conf->mirrors)
-		goto out_no_mem;
+		goto abort;
 
 	conf->tmppage = alloc_page(GFP_KERNEL);
 	if (!conf->tmppage)
-		goto out_no_mem;
+		goto abort;
 
-	conf->poolinfo = kmalloc(sizeof(*conf->poolinfo), GFP_KERNEL);
+	conf->poolinfo = kzalloc(sizeof(*conf->poolinfo), GFP_KERNEL);
 	if (!conf->poolinfo)
-		goto out_no_mem;
-	conf->poolinfo->mddev = NULL;
+		goto abort;
 	conf->poolinfo->raid_disks = mddev->raid_disks;
 	conf->r1bio_pool = mempool_create(NR_RAID1_BIOS, r1bio_pool_alloc,
 					  r1bio_pool_free,
 					  conf->poolinfo);
 	if (!conf->r1bio_pool)
-		goto out_no_mem;
+		goto abort;
+
 	conf->poolinfo->mddev = mddev;
 
 	spin_lock_init(&conf->device_lock);
-	mddev->queue->queue_lock = &conf->device_lock;
-
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
-		disk_idx = rdev->raid_disk;
+		int disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)
 			continue;
 		disk = conf->mirrors + disk_idx;
 
 		disk->rdev = rdev;
-		disk_stack_limits(mddev->gendisk, rdev->bdev,
-				  rdev->data_offset << 9);
-		/* as we don't honour merge_bvec_fn, we must never risk
-		 * violating it, so limit ->max_sector to one PAGE, as
-		 * a one page request is never in violation.
-		 */
-		if (rdev->bdev->bd_disk->queue->merge_bvec_fn &&
-		    queue_max_sectors(mddev->queue) > (PAGE_SIZE>>9))
-			blk_queue_max_sectors(mddev->queue, PAGE_SIZE>>9);
 
 		disk->head_position = 0;
 	}
@@ -2041,8 +2016,7 @@ static int run(mddev_t *mddev)
 	bio_list_init(&conf->pending_bio_list);
 	bio_list_init(&conf->flushing_bio_list);
 
-
-	mddev->degraded = 0;
+	conf->last_used = -1;
 	for (i = 0; i < conf->raid_disks; i++) {
 
 		disk = conf->mirrors + i;
@@ -2050,38 +2024,97 @@ static int run(mddev_t *mddev)
 		if (!disk->rdev ||
 		    !test_bit(In_sync, &disk->rdev->flags)) {
 			disk->head_position = 0;
-			mddev->degraded++;
 			if (disk->rdev)
 				conf->fullsync = 1;
-		}
+		} else if (conf->last_used < 0)
+			/*
+			 * The first working device is used as a
+			 * starting point to read balancing.
+			 */
+			conf->last_used = i;
 	}
-	if (mddev->degraded == conf->raid_disks) {
+
+	err = -EIO;
+	if (conf->last_used < 0) {
 		printk(KERN_ERR "raid1: no operational mirrors for %s\n",
-			mdname(mddev));
-		goto out_free_conf;
+		       mdname(mddev));
+		goto abort;
 	}
-	if (conf->raid_disks - mddev->degraded == 1)
-		mddev->recovery_cp = MaxSector;
+	err = -ENOMEM;
+	conf->thread = md_register_thread(raid1d, mddev, NULL);
+	if (!conf->thread) {
+		printk(KERN_ERR
+		       "raid1: couldn't allocate thread for %s\n",
+		       mdname(mddev));
+		goto abort;
+	}
+
+	return conf;
+
+ abort:
+	if (conf) {
+		if (conf->r1bio_pool)
+			mempool_destroy(conf->r1bio_pool);
+		kfree(conf->mirrors);
+		safe_put_page(conf->tmppage);
+		kfree(conf->poolinfo);
+		kfree(conf);
+	}
+	return ERR_PTR(err);
+}
 
+static int run(mddev_t *mddev)
+{
+	conf_t *conf;
+	int i;
+	mdk_rdev_t *rdev;
+
+	if (mddev->level != 1) {
+		printk("raid1: %s: raid level not set to mirroring (%d)\n",
+		       mdname(mddev), mddev->level);
+		return -EIO;
+	}
+	if (mddev->reshape_position != MaxSector) {
+		printk("raid1: %s: reshape_position set but not supported\n",
+		       mdname(mddev));
+		return -EIO;
+	}
 	/*
-	 * find the first working one and use it as a starting point
-	 * to read balancing.
+	 * copy the already verified devices into our private RAID1
+	 * bookkeeping area. [whatever we allocate in run(),
+	 * should be freed in stop()]
 	 */
-	for (j = 0; j < conf->raid_disks &&
-		     (!conf->mirrors[j].rdev ||
-		      !test_bit(In_sync, &conf->mirrors[j].rdev->flags)) ; j++)
-		/* nothing */;
-	conf->last_used = j;
+	if (mddev->private == NULL)
+		conf = setup_conf(mddev);
+	else
+		conf = mddev->private;
 
+	if (IS_ERR(conf))
+		return PTR_ERR(conf);
 
-	mddev->thread = md_register_thread(raid1d, mddev, NULL);
-	if (!mddev->thread) {
-		printk(KERN_ERR
-		       "raid1: couldn't allocate thread for %s\n",
-		       mdname(mddev));
-		goto out_free_conf;
+	mddev->queue->queue_lock = &conf->device_lock;
+	list_for_each_entry(rdev, &mddev->disks, same_set) {
+		disk_stack_limits(mddev->gendisk, rdev->bdev,
+				  rdev->data_offset << 9);
+		/* as we don't honour merge_bvec_fn, we must never risk
+		 * violating it, so limit ->max_sector to one PAGE, as
+		 * a one page request is never in violation.
+		 */
+		if (rdev->bdev->bd_disk->queue->merge_bvec_fn &&
+		    queue_max_sectors(mddev->queue) > (PAGE_SIZE>>9))
+			blk_queue_max_sectors(mddev->queue, PAGE_SIZE>>9);
 	}
 
+	mddev->degraded = 0;
+	for (i=0; i < conf->raid_disks; i++)
+		if (conf->mirrors[i].rdev == NULL ||
+		    !test_bit(In_sync, &conf->mirrors[i].rdev->flags) ||
+		    test_bit(Faulty, &conf->mirrors[i].rdev->flags))
+			mddev->degraded++;
+
+	if (conf->raid_disks - mddev->degraded == 1)
+		mddev->recovery_cp = MaxSector;
+
 	if (mddev->recovery_cp != MaxSector)
 		printk(KERN_NOTICE "raid1: %s is not clean"
 		       " -- starting background reconstruction\n",
@@ -2090,9 +2123,14 @@ static int run(mddev_t *mddev)
 		"raid1: raid set %s active with %d out of %d mirrors\n",
 		mdname(mddev), mddev->raid_disks - mddev->degraded, 
 		mddev->raid_disks);
+
 	/*
 	 * Ok, everything is just fine now
 	 */
+	mddev->thread = conf->thread;
+	conf->thread = NULL;
+	mddev->private = conf;
+
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
 	mddev->queue->unplug_fn = raid1_unplug;
@@ -2100,23 +2138,6 @@ static int run(mddev_t *mddev)
 	mddev->queue->backing_dev_info.congested_data = mddev;
 	md_integrity_register(mddev);
 	return 0;
-
-out_no_mem:
-	printk(KERN_ERR "raid1: couldn't allocate memory for %s\n",
-	       mdname(mddev));
-
-out_free_conf:
-	if (conf) {
-		if (conf->r1bio_pool)
-			mempool_destroy(conf->r1bio_pool);
-		kfree(conf->mirrors);
-		safe_put_page(conf->tmppage);
-		kfree(conf->poolinfo);
-		kfree(conf);
-		mddev->private = NULL;
-	}
-out:
-	return -EIO;
 }
 
 static int stop(mddev_t *mddev)
@@ -2302,6 +2323,23 @@ static void raid1_quiesce(mddev_t *mddev, int state)
 	}
 }
 
+static void *raid1_takeover(mddev_t *mddev)
+{
+	/* raid1 can take over:
+	 *  raid5 with 2 devices, any layout or chunk size
+	 */
+	if (mddev->level == 5 && mddev->raid_disks == 2) {
+		conf_t *conf;
+		mddev->new_level = 1;
+		mddev->new_layout = 0;
+		mddev->new_chunk_sectors = 0;
+		conf = setup_conf(mddev);
+		if (!IS_ERR(conf))
+			conf->barrier = 1;
+		return conf;
+	}
+	return ERR_PTR(-EINVAL);
+}
 
 static struct mdk_personality raid1_personality =
 {
@@ -2321,6 +2359,7 @@ static struct mdk_personality raid1_personality =
 	.size		= raid1_size,
 	.check_reshape	= raid1_reshape,
 	.quiesce	= raid1_quiesce,
+	.takeover	= raid1_takeover,
 };
 
 static int __init raid_init(void)
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index e87b84d..5f2d443 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -59,6 +59,11 @@ struct r1_private_data_s {
 
 	mempool_t *r1bio_pool;
 	mempool_t *r1buf_pool;
+
+	/* When taking over an array from a different personality, we store
+	 * the new thread here until we fully activate the array.
+	 */
+	struct mdk_thread_s	*thread;
 };
 
 typedef struct r1_private_data_s conf_t;



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 09/22] md: collect bitmap-specific fields into one structure.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (16 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 04/22] md: support barrier requests on all personalities NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 02/22] md: adjust resync_min usefully when resync aborts NeilBrown
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

In preparation for making bitmap fields configurable via sysfs,
start tidying up by making a single structure to contain the
configuration fields.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |    8 +++--
 drivers/md/md.c     |   75 +++++++++++++++++++++++++++------------------------
 drivers/md/md.h     |   21 +++++++-------
 3 files changed, 54 insertions(+), 50 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index a39f632..4c70a01 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1610,16 +1610,16 @@ int bitmap_create(mddev_t *mddev)
 	sector_t blocks = mddev->resync_max_sectors;
 	unsigned long chunks;
 	unsigned long pages;
-	struct file *file = mddev->bitmap_file;
+	struct file *file = mddev->bitmap_info.file;
 	int err;
 	sector_t start;
 
 	BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
 
-	if (!file && !mddev->bitmap_offset) /* bitmap disabled, nothing to do */
+	if (!file && !mddev->bitmap_info.offset) /* bitmap disabled, nothing to do */
 		return 0;
 
-	BUG_ON(file && mddev->bitmap_offset);
+	BUG_ON(file && mddev->bitmap_info.offset);
 
 	bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
 	if (!bitmap)
@@ -1633,7 +1633,7 @@ int bitmap_create(mddev_t *mddev)
 	bitmap->mddev = mddev;
 
 	bitmap->file = file;
-	bitmap->offset = mddev->bitmap_offset;
+	bitmap->offset = mddev->bitmap_info.offset;
 	if (file) {
 		get_file(file);
 		/* As future accesses to this file will use bmap,
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ffe92db..b989189 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -865,7 +865,7 @@ struct super_type  {
  */
 int md_check_no_bitmap(mddev_t *mddev)
 {
-	if (!mddev->bitmap_file && !mddev->bitmap_offset)
+	if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
 		return 0;
 	printk(KERN_ERR "%s: bitmaps are not supported for %s\n",
 		mdname(mddev), mddev->pers->name);
@@ -993,8 +993,8 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->raid_disks = sb->raid_disks;
 		mddev->dev_sectors = sb->size * 2;
 		mddev->events = ev1;
-		mddev->bitmap_offset = 0;
-		mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
+		mddev->bitmap_info.offset = 0;
+		mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
 
 		if (mddev->minor_version >= 91) {
 			mddev->reshape_position = sb->reshape_position;
@@ -1028,8 +1028,9 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->max_disks = MD_SB_DISKS;
 
 		if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
-		    mddev->bitmap_file == NULL)
-			mddev->bitmap_offset = mddev->default_bitmap_offset;
+		    mddev->bitmap_info.file == NULL)
+			mddev->bitmap_info.offset =
+				mddev->bitmap_info.default_offset;
 
 	} else if (mddev->pers == NULL) {
 		/* Insist on good event counter while assembling */
@@ -1146,7 +1147,7 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 	sb->layout = mddev->layout;
 	sb->chunk_size = mddev->chunk_sectors << 9;
 
-	if (mddev->bitmap && mddev->bitmap_file == NULL)
+	if (mddev->bitmap && mddev->bitmap_info.file == NULL)
 		sb->state |= (1<<MD_SB_BITMAP_PRESENT);
 
 	sb->disks[0].state = (1<<MD_DISK_REMOVED);
@@ -1224,7 +1225,7 @@ super_90_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
 {
 	if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
 		return 0; /* component must fit device */
-	if (rdev->mddev->bitmap_offset)
+	if (rdev->mddev->bitmap_info.offset)
 		return 0; /* can't move bitmap */
 	rdev->sb_start = calc_dev_sboffset(rdev->bdev);
 	if (!num_sectors || num_sectors > rdev->sb_start)
@@ -1403,8 +1404,8 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->raid_disks = le32_to_cpu(sb->raid_disks);
 		mddev->dev_sectors = le64_to_cpu(sb->size);
 		mddev->events = ev1;
-		mddev->bitmap_offset = 0;
-		mddev->default_bitmap_offset = 1024 >> 9;
+		mddev->bitmap_info.offset = 0;
+		mddev->bitmap_info.default_offset = 1024 >> 9;
 		
 		mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
 		memcpy(mddev->uuid, sb->set_uuid, 16);
@@ -1412,8 +1413,9 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->max_disks =  (4096-256)/2;
 
 		if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
-		    mddev->bitmap_file == NULL )
-			mddev->bitmap_offset = (__s32)le32_to_cpu(sb->bitmap_offset);
+		    mddev->bitmap_info.file == NULL )
+			mddev->bitmap_info.offset =
+				(__s32)le32_to_cpu(sb->bitmap_offset);
 
 		if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
 			mddev->reshape_position = le64_to_cpu(sb->reshape_position);
@@ -1507,8 +1509,8 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 	sb->level = cpu_to_le32(mddev->level);
 	sb->layout = cpu_to_le32(mddev->layout);
 
-	if (mddev->bitmap && mddev->bitmap_file == NULL) {
-		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_offset);
+	if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
+		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
 		sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
 	}
 
@@ -1575,7 +1577,7 @@ super_1_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
 		max_sectors -= rdev->data_offset;
 		if (!num_sectors || num_sectors > max_sectors)
 			num_sectors = max_sectors;
-	} else if (rdev->mddev->bitmap_offset) {
+	} else if (rdev->mddev->bitmap_info.offset) {
 		/* minor version 0 with bitmap we can't move */
 		return 0;
 	} else {
@@ -4522,12 +4524,12 @@ out:
 		printk(KERN_INFO "md: %s stopped.\n", mdname(mddev));
 
 		bitmap_destroy(mddev);
-		if (mddev->bitmap_file) {
-			restore_bitmap_write_access(mddev->bitmap_file);
-			fput(mddev->bitmap_file);
-			mddev->bitmap_file = NULL;
+		if (mddev->bitmap_info.file) {
+			restore_bitmap_write_access(mddev->bitmap_info.file);
+			fput(mddev->bitmap_info.file);
+			mddev->bitmap_info.file = NULL;
 		}
-		mddev->bitmap_offset = 0;
+		mddev->bitmap_info.offset = 0;
 
 		/* make sure all md_delayed_delete calls have finished */
 		flush_scheduled_work();
@@ -4753,7 +4755,7 @@ static int get_array_info(mddev_t * mddev, void __user * arg)
 	info.state         = 0;
 	if (mddev->in_sync)
 		info.state = (1<<MD_SB_CLEAN);
-	if (mddev->bitmap && mddev->bitmap_offset)
+	if (mddev->bitmap && mddev->bitmap_info.offset)
 		info.state = (1<<MD_SB_BITMAP_PRESENT);
 	info.active_disks  = insync;
 	info.working_disks = working;
@@ -5111,23 +5113,23 @@ static int set_bitmap_file(mddev_t *mddev, int fd)
 	if (fd >= 0) {
 		if (mddev->bitmap)
 			return -EEXIST; /* cannot add when bitmap is present */
-		mddev->bitmap_file = fget(fd);
+		mddev->bitmap_info.file = fget(fd);
 
-		if (mddev->bitmap_file == NULL) {
+		if (mddev->bitmap_info.file == NULL) {
 			printk(KERN_ERR "%s: error: failed to get bitmap file\n",
 			       mdname(mddev));
 			return -EBADF;
 		}
 
-		err = deny_bitmap_write_access(mddev->bitmap_file);
+		err = deny_bitmap_write_access(mddev->bitmap_info.file);
 		if (err) {
 			printk(KERN_ERR "%s: error: bitmap file is already in use\n",
 			       mdname(mddev));
-			fput(mddev->bitmap_file);
-			mddev->bitmap_file = NULL;
+			fput(mddev->bitmap_info.file);
+			mddev->bitmap_info.file = NULL;
 			return err;
 		}
-		mddev->bitmap_offset = 0; /* file overrides offset */
+		mddev->bitmap_info.offset = 0; /* file overrides offset */
 	} else if (mddev->bitmap == NULL)
 		return -ENOENT; /* cannot remove what isn't there */
 	err = 0;
@@ -5142,11 +5144,11 @@ static int set_bitmap_file(mddev_t *mddev, int fd)
 		mddev->pers->quiesce(mddev, 0);
 	}
 	if (fd < 0) {
-		if (mddev->bitmap_file) {
-			restore_bitmap_write_access(mddev->bitmap_file);
-			fput(mddev->bitmap_file);
+		if (mddev->bitmap_info.file) {
+			restore_bitmap_write_access(mddev->bitmap_info.file);
+			fput(mddev->bitmap_info.file);
 		}
-		mddev->bitmap_file = NULL;
+		mddev->bitmap_info.file = NULL;
 	}
 
 	return err;
@@ -5213,8 +5215,8 @@ static int set_array_info(mddev_t * mddev, mdu_array_info_t *info)
 		mddev->flags         = 0;
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
-	mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
-	mddev->bitmap_offset = 0;
+	mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
+	mddev->bitmap_info.offset = 0;
 
 	mddev->reshape_position = MaxSector;
 
@@ -5314,7 +5316,7 @@ static int update_array_info(mddev_t *mddev, mdu_array_info_t *info)
 	int state = 0;
 
 	/* calculate expected state,ignoring low bits */
-	if (mddev->bitmap && mddev->bitmap_offset)
+	if (mddev->bitmap && mddev->bitmap_info.offset)
 		state |= (1 << MD_SB_BITMAP_PRESENT);
 
 	if (mddev->major_version != info->major_version ||
@@ -5373,9 +5375,10 @@ static int update_array_info(mddev_t *mddev, mdu_array_info_t *info)
 			/* add the bitmap */
 			if (mddev->bitmap)
 				return -EEXIST;
-			if (mddev->default_bitmap_offset == 0)
+			if (mddev->bitmap_info.default_offset == 0)
 				return -EINVAL;
-			mddev->bitmap_offset = mddev->default_bitmap_offset;
+			mddev->bitmap_info.offset =
+				mddev->bitmap_info.default_offset;
 			mddev->pers->quiesce(mddev, 1);
 			rv = bitmap_create(mddev);
 			if (rv)
@@ -5390,7 +5393,7 @@ static int update_array_info(mddev_t *mddev, mdu_array_info_t *info)
 			mddev->pers->quiesce(mddev, 1);
 			bitmap_destroy(mddev);
 			mddev->pers->quiesce(mddev, 0);
-			mddev->bitmap_offset = 0;
+			mddev->bitmap_info.offset = 0;
 		}
 	}
 	md_update_sb(mddev, 1);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 77f0aad..94a8e51 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -280,16 +280,17 @@ struct mddev_s
 	unsigned int                    max_write_behind; /* 0 = sync */
 
 	struct bitmap                   *bitmap; /* the bitmap for the device */
-	struct file			*bitmap_file; /* the bitmap file */
-	long				bitmap_offset; /* offset from superblock of
-							* start of bitmap. May be
-							* negative, but not '0'
-							*/
-	long				default_bitmap_offset; /* this is the offset to use when
-								* hot-adding a bitmap.  It should
-								* eventually be settable by sysfs.
-								*/
-
+	struct {
+		struct file		*file; /* the bitmap file */
+		long			offset; /* offset from superblock of
+						 * start of bitmap. May be
+						 * negative, but not '0'
+						 */
+		long			default_offset; /* this is the offset to use when
+							 * hot-adding a bitmap.  It should
+							 * eventually be settable by sysfs.
+							 */
+	} bitmap_info;
 	struct list_head		all_mddevs;
 
 	/* Generic barrier handling.



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 10/22] md: move offset, daemon_sleep and chunksize out of bitmap structure
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (6 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 12/22] md: remove needless setting of thread->timeout in raid10_quiesce NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 07/22] md/raid1: add takeover support for raid5->raid1 NeilBrown
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

... and into bitmap_info.  These are all configuration parameters
the need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |   51 ++++++++++++++++++++++++++++-----------------------
 drivers/md/bitmap.h |    6 +-----
 drivers/md/md.c     |    4 ++--
 drivers/md/md.h     |    3 +++
 drivers/md/raid1.c  |    3 ++-
 drivers/md/raid10.c |    2 +-
 6 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 4c70a01..921b084 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -287,27 +287,28 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 			int size = PAGE_SIZE;
+			long offset = mddev->bitmap_info.offset;
 			if (page->index == bitmap->file_pages-1)
 				size = roundup(bitmap->last_page_size,
 					       bdev_logical_block_size(rdev->bdev));
 			/* Just make sure we aren't corrupting data or
 			 * metadata
 			 */
-			if (bitmap->offset < 0) {
+			if (offset < 0) {
 				/* DATA  BITMAP METADATA  */
-				if (bitmap->offset
+				if (offset
 				    + (long)(page->index * (PAGE_SIZE/512))
 				    + size/512 > 0)
 					/* bitmap runs in to metadata */
 					goto bad_alignment;
 				if (rdev->data_offset + mddev->dev_sectors
-				    > rdev->sb_start + bitmap->offset)
+				    > rdev->sb_start + offset)
 					/* data runs in to bitmap */
 					goto bad_alignment;
 			} else if (rdev->sb_start < rdev->data_offset) {
 				/* METADATA BITMAP DATA */
 				if (rdev->sb_start
-				    + bitmap->offset
+				    + offset
 				    + page->index*(PAGE_SIZE/512) + size/512
 				    > rdev->data_offset)
 					/* bitmap runs in to data */
@@ -316,7 +317,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				/* DATA METADATA BITMAP - no problems */
 			}
 			md_super_write(mddev, rdev,
-				       rdev->sb_start + bitmap->offset
+				       rdev->sb_start + offset
 				       + page->index * (PAGE_SIZE/512),
 				       size,
 				       page);
@@ -550,7 +551,8 @@ static int bitmap_read_sb(struct bitmap *bitmap)
 
 		bitmap->sb_page = read_page(bitmap->file, 0, bitmap, bytes);
 	} else {
-		bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset,
+		bitmap->sb_page = read_sb_page(bitmap->mddev,
+					       bitmap->mddev->bitmap_info.offset,
 					       NULL,
 					       0, sizeof(bitmap_super_t));
 	}
@@ -610,10 +612,10 @@ static int bitmap_read_sb(struct bitmap *bitmap)
 	}
 success:
 	/* assign fields using values from superblock */
-	bitmap->chunksize = chunksize;
-	bitmap->daemon_sleep = daemon_sleep;
+	bitmap->mddev->bitmap_info.chunksize = chunksize;
+	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
 	bitmap->daemon_lastrun = jiffies;
-	bitmap->max_write_behind = write_behind;
+	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
 	bitmap->flags |= le32_to_cpu(sb->state);
 	if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN)
 		bitmap->flags |= BITMAP_HOSTENDIAN;
@@ -907,7 +909,7 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 	chunks = bitmap->chunks;
 	file = bitmap->file;
 
-	BUG_ON(!file && !bitmap->offset);
+	BUG_ON(!file && !bitmap->mddev->bitmap_info.offset);
 
 #ifdef INJECT_FAULTS_3
 	outofdate = 1;
@@ -967,14 +969,15 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 				offset = sizeof(bitmap_super_t);
 				if (!file)
 					read_sb_page(bitmap->mddev,
-						     bitmap->offset,
+						     bitmap->mddev->bitmap_info.offset,
 						     page,
 						     index, count);
 			} else if (file) {
 				page = read_page(file, index, bitmap, count);
 				offset = 0;
 			} else {
-				page = read_sb_page(bitmap->mddev, bitmap->offset,
+				page = read_sb_page(bitmap->mddev,
+						    bitmap->mddev->bitmap_info.offset,
 						    NULL,
 						    index, count);
 				offset = 0;
@@ -1096,7 +1099,8 @@ void bitmap_daemon_work(mddev_t *mddev)
 		mutex_unlock(&mddev->open_mutex);
 		return;
 	}
-	if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
+	if (time_before(jiffies, bitmap->daemon_lastrun
+			+ bitmap->mddev->bitmap_info.daemon_sleep*HZ))
 		goto done;
 
 	bitmap->daemon_lastrun = jiffies;
@@ -1210,7 +1214,8 @@ void bitmap_daemon_work(mddev_t *mddev)
 
  done:
 	if (bitmap->allclean == 0)
-		bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
+		bitmap->mddev->thread->timeout = 
+			bitmap->mddev->bitmap_info.daemon_sleep * HZ;
 	mutex_unlock(&mddev->open_mutex);
 }
 
@@ -1479,7 +1484,7 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
 		return;
 	}
 	if (time_before(jiffies, (bitmap->last_end_sync
-				  + bitmap->daemon_sleep * HZ)))
+				  + bitmap->mddev->bitmap_info.daemon_sleep * HZ)))
 		return;
 	wait_event(bitmap->mddev->recovery_wait,
 		   atomic_read(&bitmap->mddev->recovery_active) == 0);
@@ -1540,7 +1545,7 @@ void bitmap_dirty_bits(struct bitmap *bitmap, unsigned long s, unsigned long e)
 void bitmap_flush(mddev_t *mddev)
 {
 	struct bitmap *bitmap = mddev->bitmap;
-	int sleep;
+	long sleep;
 
 	if (!bitmap) /* there was no bitmap */
 		return;
@@ -1548,12 +1553,13 @@ void bitmap_flush(mddev_t *mddev)
 	/* run the daemon_work three time to ensure everything is flushed
 	 * that can be
 	 */
-	sleep = bitmap->daemon_sleep;
-	bitmap->daemon_sleep = 0;
+	sleep = mddev->bitmap_info.daemon_sleep * HZ * 2;
+	bitmap->daemon_lastrun -= sleep;
 	bitmap_daemon_work(mddev);
+	bitmap->daemon_lastrun -= sleep;
 	bitmap_daemon_work(mddev);
+	bitmap->daemon_lastrun -= sleep;
 	bitmap_daemon_work(mddev);
-	bitmap->daemon_sleep = sleep;
 	bitmap_update_sb(bitmap);
 }
 
@@ -1633,7 +1639,6 @@ int bitmap_create(mddev_t *mddev)
 	bitmap->mddev = mddev;
 
 	bitmap->file = file;
-	bitmap->offset = mddev->bitmap_info.offset;
 	if (file) {
 		get_file(file);
 		/* As future accesses to this file will use bmap,
@@ -1642,12 +1647,12 @@ int bitmap_create(mddev_t *mddev)
 		 */
 		vfs_fsync(file, file->f_dentry, 1);
 	}
-	/* read superblock from bitmap file (this sets bitmap->chunksize) */
+	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
 	err = bitmap_read_sb(bitmap);
 	if (err)
 		goto error;
 
-	bitmap->chunkshift = ffz(~bitmap->chunksize);
+	bitmap->chunkshift = ffz(~mddev->bitmap_info.chunksize);
 
 	/* now that chunksize and chunkshift are set, we can use these macros */
  	chunks = (blocks + CHUNK_BLOCK_RATIO(bitmap) - 1) >>
@@ -1689,7 +1694,7 @@ int bitmap_create(mddev_t *mddev)
 
 	mddev->bitmap = bitmap;
 
-	mddev->thread->timeout = bitmap->daemon_sleep * HZ;
+	mddev->thread->timeout = mddev->bitmap_info.daemon_sleep * HZ;
 
 	bitmap_update_sb(bitmap);
 
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 7e38d13..50ee424 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -106,7 +106,7 @@ typedef __u16 bitmap_counter_t;
 #define BITMAP_BLOCK_SHIFT 9
 
 /* how many blocks per chunk? (this is variable) */
-#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->chunksize >> BITMAP_BLOCK_SHIFT)
+#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->mddev->bitmap_info.chunksize >> BITMAP_BLOCK_SHIFT)
 #define CHUNK_BLOCK_SHIFT(bitmap) ((bitmap)->chunkshift - BITMAP_BLOCK_SHIFT)
 #define CHUNK_BLOCK_MASK(bitmap) (CHUNK_BLOCK_RATIO(bitmap) - 1)
 
@@ -209,7 +209,6 @@ struct bitmap {
 	int counter_bits; /* how many bits per block counter */
 
 	/* bitmap chunksize -- how much data does each bit represent? */
-	unsigned long chunksize;
 	unsigned long chunkshift; /* chunksize = 2^chunkshift (for bitops) */
 	unsigned long chunks; /* total number of data chunks for the array */
 
@@ -226,7 +225,6 @@ struct bitmap {
 	/* bitmap spinlock */
 	spinlock_t lock;
 
-	long offset; /* offset from superblock if file is NULL */
 	struct file *file; /* backing disk file */
 	struct page *sb_page; /* cached copy of the bitmap file superblock */
 	struct page **filemap; /* list of cache pages for the file */
@@ -238,7 +236,6 @@ struct bitmap {
 
 	int allclean;
 
-	unsigned long max_write_behind; /* write-behind mode */
 	atomic_t behind_writes;
 
 	/*
@@ -246,7 +243,6 @@ struct bitmap {
 	 * file, cleaning up bits and flushing out pages to disk as necessary
 	 */
 	unsigned long daemon_lastrun; /* jiffies of last run */
-	unsigned long daemon_sleep; /* how many seconds between updates? */
 	unsigned long last_end_sync; /* when we lasted called end_sync to
 				      * update bitmap with resync progress */
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b989189..9f797ac 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6102,14 +6102,14 @@ static int md_seq_show(struct seq_file *seq, void *v)
 			unsigned long chunk_kb;
 			unsigned long flags;
 			spin_lock_irqsave(&bitmap->lock, flags);
-			chunk_kb = bitmap->chunksize >> 10;
+			chunk_kb = mddev->bitmap_info.chunksize >> 10;
 			seq_printf(seq, "bitmap: %lu/%lu pages [%luKB], "
 				"%lu%s chunk",
 				bitmap->pages - bitmap->missing_pages,
 				bitmap->pages,
 				(bitmap->pages - bitmap->missing_pages)
 					<< (PAGE_SHIFT - 10),
-				chunk_kb ? chunk_kb : bitmap->chunksize,
+				chunk_kb ? chunk_kb : mddev->bitmap_info.chunksize,
 				chunk_kb ? "KB" : "B");
 			if (bitmap->file) {
 				seq_printf(seq, ", file: ");
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 94a8e51..b35b29d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -290,6 +290,9 @@ struct mddev_s
 							 * hot-adding a bitmap.  It should
 							 * eventually be settable by sysfs.
 							 */
+		unsigned long		chunksize;
+		unsigned long		daemon_sleep; /* how many seconds between updates? */
+		unsigned long		max_write_behind; /* write-behind mode */
 	} bitmap_info;
 	struct list_head		all_mddevs;
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 18da334..da017f0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -943,7 +943,8 @@ static int make_request(struct request_queue *q, struct bio * bio)
 
 	/* do behind I/O ? */
 	if (bitmap &&
-	    atomic_read(&bitmap->behind_writes) < bitmap->max_write_behind &&
+	    (atomic_read(&bitmap->behind_writes)
+	     < mddev->bitmap_info.max_write_behind) &&
 	    (behind_pages = alloc_behind_pages(bio)) != NULL)
 		set_bit(R1BIO_BehindIO, &r1_bio->state);
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2fbf867..2255e33 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2277,7 +2277,7 @@ static void raid10_quiesce(mddev_t *mddev, int state)
 	}
 	if (mddev->thread) {
 		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap->daemon_sleep * HZ;
+			mddev->thread->timeout = mddev->bitmap_info.daemon_sleep * HZ;
 		else
 			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
 		md_wakeup_thread(mddev->thread);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 11/22] md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (14 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 01/22] md/bitmap: protect against bitmap removal while being updated NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 04/22] md: support barrier requests on all personalities NeilBrown
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

This removes a lot of multiplications by HZ.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |   14 +++++++-------
 drivers/md/raid10.c |    2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 921b084..5e6fd37 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -565,7 +565,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
 	sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
 
 	chunksize = le32_to_cpu(sb->chunksize);
-	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ;
 	write_behind = le32_to_cpu(sb->write_behind);
 
 	/* verify that the bitmap-specific fields are valid */
@@ -578,7 +578,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
 		reason = "bitmap chunksize too small";
 	else if ((1 << ffz(~chunksize)) != chunksize)
 		reason = "bitmap chunksize not a power of 2";
-	else if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ)
+	else if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT)
 		reason = "daemon sleep period out of range";
 	else if (write_behind > COUNTER_MAX)
 		reason = "write-behind limit out of range (0 - 16383)";
@@ -1100,7 +1100,7 @@ void bitmap_daemon_work(mddev_t *mddev)
 		return;
 	}
 	if (time_before(jiffies, bitmap->daemon_lastrun
-			+ bitmap->mddev->bitmap_info.daemon_sleep*HZ))
+			+ bitmap->mddev->bitmap_info.daemon_sleep))
 		goto done;
 
 	bitmap->daemon_lastrun = jiffies;
@@ -1215,7 +1215,7 @@ void bitmap_daemon_work(mddev_t *mddev)
  done:
 	if (bitmap->allclean == 0)
 		bitmap->mddev->thread->timeout = 
-			bitmap->mddev->bitmap_info.daemon_sleep * HZ;
+			bitmap->mddev->bitmap_info.daemon_sleep;
 	mutex_unlock(&mddev->open_mutex);
 }
 
@@ -1484,7 +1484,7 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
 		return;
 	}
 	if (time_before(jiffies, (bitmap->last_end_sync
-				  + bitmap->mddev->bitmap_info.daemon_sleep * HZ)))
+				  + bitmap->mddev->bitmap_info.daemon_sleep)))
 		return;
 	wait_event(bitmap->mddev->recovery_wait,
 		   atomic_read(&bitmap->mddev->recovery_active) == 0);
@@ -1553,7 +1553,7 @@ void bitmap_flush(mddev_t *mddev)
 	/* run the daemon_work three time to ensure everything is flushed
 	 * that can be
 	 */
-	sleep = mddev->bitmap_info.daemon_sleep * HZ * 2;
+	sleep = mddev->bitmap_info.daemon_sleep * 2;
 	bitmap->daemon_lastrun -= sleep;
 	bitmap_daemon_work(mddev);
 	bitmap->daemon_lastrun -= sleep;
@@ -1694,7 +1694,7 @@ int bitmap_create(mddev_t *mddev)
 
 	mddev->bitmap = bitmap;
 
-	mddev->thread->timeout = mddev->bitmap_info.daemon_sleep * HZ;
+	mddev->thread->timeout = mddev->bitmap_info.daemon_sleep;
 
 	bitmap_update_sb(bitmap);
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2255e33..064c2bb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2277,7 +2277,7 @@ static void raid10_quiesce(mddev_t *mddev, int state)
 	}
 	if (mddev->thread) {
 		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap_info.daemon_sleep * HZ;
+			mddev->thread->timeout = mddev->bitmap_info.daemon_sleep;
 		else
 			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
 		md_wakeup_thread(mddev->thread);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 12/22] md: remove needless setting of thread->timeout in raid10_quiesce
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (5 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 06/22] md: add honouring of suspend_{lo,hi} to raid1 NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 10/22] md: move offset, daemon_sleep and chunksize out of bitmap structure NeilBrown
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |    1 +
 drivers/md/raid10.c |    7 -------
 2 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 5e6fd37..4df173a 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1695,6 +1695,7 @@ int bitmap_create(mddev_t *mddev)
 	mddev->bitmap = bitmap;
 
 	mddev->thread->timeout = mddev->bitmap_info.daemon_sleep;
+	md_wakeup_thread(mddev->thread);
 
 	bitmap_update_sb(bitmap);
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 064c2bb..d9e28a6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2275,13 +2275,6 @@ static void raid10_quiesce(mddev_t *mddev, int state)
 		lower_barrier(conf);
 		break;
 	}
-	if (mddev->thread) {
-		if (mddev->bitmap)
-			mddev->thread->timeout = mddev->bitmap_info.daemon_sleep;
-		else
-			mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
-		md_wakeup_thread(mddev->thread);
-	}
 }
 
 static struct mdk_personality raid10_personality =



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 13/22] md: support bitmap offset appropriate for external-metadata arrays.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (10 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 22/22] md: integrate spares into array at earliest opportunity NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 21/22] md: move compat_ioctl handling into md.c NeilBrown
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

For md arrays were metadata is managed externally, the kernel does not
know about a superblock so the superblock offset is 0.
If we want to have a write-intent-bitmap near the end of the
devices of such an array, we should support sector_t sized offset.
We need offset be possibly negative for when the bitmap is before
the metadata, so use loff_t instead.

Also add sanity check that bitmap does not overlap with data.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |   14 +++++++++++---
 drivers/md/md.h     |    6 ++++--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 4df173a..5b77e5c 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -212,7 +212,7 @@ static void bitmap_checkfree(struct bitmap *bitmap, unsigned long page)
  */
 
 /* IO operations when bitmap is stored near all superblocks */
-static struct page *read_sb_page(mddev_t *mddev, long offset,
+static struct page *read_sb_page(mddev_t *mddev, loff_t offset,
 				 struct page *page,
 				 unsigned long index, int size)
 {
@@ -287,14 +287,22 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 			int size = PAGE_SIZE;
-			long offset = mddev->bitmap_info.offset;
+			loff_t offset = mddev->bitmap_info.offset;
 			if (page->index == bitmap->file_pages-1)
 				size = roundup(bitmap->last_page_size,
 					       bdev_logical_block_size(rdev->bdev));
 			/* Just make sure we aren't corrupting data or
 			 * metadata
 			 */
-			if (offset < 0) {
+			if (mddev->external) {
+				/* Bitmap could be anywhere. */
+				if (rdev->sb_start + offset + (page->index *(PAGE_SIZE/512)) >
+				    rdev->data_offset &&
+				    rdev->sb_start + offset < 
+				    rdev->data_offset + mddev->dev_sectors +
+				    (PAGE_SIZE/512))
+					goto bad_alignment;
+			} else if (offset < 0) {
 				/* DATA  BITMAP METADATA  */
 				if (offset
 				    + (long)(page->index * (PAGE_SIZE/512))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b35b29d..aa67532 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -282,11 +282,13 @@ struct mddev_s
 	struct bitmap                   *bitmap; /* the bitmap for the device */
 	struct {
 		struct file		*file; /* the bitmap file */
-		long			offset; /* offset from superblock of
+		loff_t			offset; /* offset from superblock of
 						 * start of bitmap. May be
 						 * negative, but not '0'
+						 * For external metadata, offset
+						 * from start of device. 
 						 */
-		long			default_offset; /* this is the offset to use when
+		loff_t			default_offset; /* this is the offset to use when
 							 * hot-adding a bitmap.  It should
 							 * eventually be settable by sysfs.
 							 */



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 14/22] md: support updating bitmap parameters via sysfs.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
  2009-12-04  6:48 ` [md PATCH 15/22] md: Support write-intent bitmaps with externally managed metadata NeilBrown
  2009-12-04  6:48 ` [md PATCH 17/22] md/raid10: print more useful messages on device failure NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-08 10:29   ` Andre Noll
  2009-12-04  6:48 ` [md PATCH 18/22] raid: improve MD/raid10 handling of correctable read errors NeilBrown
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.

'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.

'time_base' and 'backlog' can be updated at any time.


Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.c     |    3 +
 drivers/md/md.h     |    2 
 3 files changed, 218 insertions(+), 1 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 5b77e5c..0f3d9a8 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -27,6 +27,7 @@
 #include <linux/file.h>
 #include <linux/mount.h>
 #include <linux/buffer_head.h>
+#include <linux/ctype.h>
 #include "md.h"
 #include "bitmap.h"
 
@@ -510,6 +511,9 @@ void bitmap_update_sb(struct bitmap *bitmap)
 		bitmap->events_cleared = bitmap->mddev->events;
 		sb->events_cleared = cpu_to_le64(bitmap->events_cleared);
 	}
+	/* Just in case these have been changed via sysfs: */
+	sb->daemon_sleep = cpu_to_le32(bitmap->mddev->bitmap_info.daemon_sleep/HZ);
+	sb->write_behind = cpu_to_le32(bitmap->mddev->bitmap_info.max_write_behind);
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1714,6 +1718,216 @@ int bitmap_create(mddev_t *mddev)
 	return err;
 }
 
+static ssize_t
+location_show(mddev_t *mddev, char *page)
+{
+	ssize_t len = 0;
+	if (mddev->bitmap_info.file) {
+		len = sprintf(page, "file");
+	} else if (mddev->bitmap_info.offset) {
+		len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
+	} else
+		len = sprintf(page, "none");
+	len += sprintf(page+len, "\n");
+	return len;
+}
+
+static ssize_t
+location_store(mddev_t *mddev, const char *buf, size_t len)
+{
+
+	if (mddev->pers) {
+		if (!mddev->pers->quiesce)
+			return -EBUSY;
+		if (mddev->recovery || mddev->sync_thread)
+			return -EBUSY;
+	}
+
+	if (mddev->bitmap || mddev->bitmap_info.file ||
+	    mddev->bitmap_info.offset) {
+		/* bitmap already configured.  Only option is to clear it */
+		if (strncmp(buf, "none", 4) != 0)
+			return -EBUSY;
+		if (mddev->pers) {
+			mddev->pers->quiesce(mddev, 1);
+			bitmap_destroy(mddev);
+			mddev->pers->quiesce(mddev, 0);
+		}
+		mddev->bitmap_info.offset = 0;
+		if (mddev->bitmap_info.file) {
+			struct file *f = mddev->bitmap_info.file;
+			mddev->bitmap_info.file = NULL;
+			restore_bitmap_write_access(f);
+			fput(f);
+		}
+	} else {
+		/* No bitmap, OK to set a location */
+		long long offset;
+		if (strncmp(buf, "none", 4) == 0)
+			/* nothing to be done */;
+		else if (strncmp(buf, "file:", 5) == 0) {
+			/* Not supported yet */
+			return -EINVAL;
+		} else {
+			int rv;
+			if (buf[0] == '+')
+				rv = strict_strtoll(buf+1, 10, &offset);
+			else
+				rv = strict_strtoll(buf, 10, &offset);
+			if (rv)
+				return rv;
+			mddev->bitmap_info.offset = offset;
+			if (mddev->pers) {
+				mddev->pers->quiesce(mddev, 1);
+				rv = bitmap_create(mddev);
+				if (rv)
+					bitmap_destroy(mddev);
+				mddev->pers->quiesce(mddev, 0);
+				if (rv)
+					return rv;
+			}
+		}
+	}
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_location =
+__ATTR(location, S_IRUGO|S_IWUSR, location_show, location_store);
+
+static ssize_t
+timeout_show(mddev_t *mddev, char *page)
+{
+	ssize_t len = 0;
+	unsigned long secs = mddev->bitmap_info.daemon_sleep / HZ;
+	unsigned long jifs = mddev->bitmap_info.daemon_sleep % HZ;
+	
+	len = sprintf(page, "%lu", secs);
+	if (jifs)
+		len += sprintf(page+len, ".%03u", jiffies_to_msecs(jifs));
+	len += sprintf(page+len, "\n");
+	return len;
+}
+
+int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale)
+{
+	unsigned long result = 0;
+	long decimals = -1;
+	while (isdigit(*cp) || (*cp == '.' && decimals < 0)) {
+		if (*cp == '.')
+			decimals = 0;
+		else if (decimals < scale) {
+			unsigned int value;
+			value = *cp - '0';
+			result = result * 10 + value;
+			if (decimals >= 0)
+				decimals++;
+		}
+		cp++;
+	}
+	if (*cp == '\n')
+		cp++;
+	if (*cp)
+		return -EINVAL;
+	if (decimals < 0)
+		decimals = 0;
+	while (decimals < scale) {
+		result *= 10;
+		decimals ++;
+	}
+	*res = result;
+	return 0;
+}
+
+static ssize_t
+timeout_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	/* timeout can be set at any time */
+	unsigned long timeout;
+	int rv = strict_strtoul_scaled(buf, &timeout, 4);
+	if (rv)
+		return rv;
+
+	timeout = timeout * HZ / 10000;
+
+	if (timeout > MAX_SCHEDULE_TIMEOUT)
+		timeout = MAX_SCHEDULE_TIMEOUT;
+	if (timeout < 1)
+		timeout = 1;
+	mddev->bitmap_info.daemon_sleep = timeout;
+	if (mddev->thread) {
+		if (mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT) {
+			mddev->thread->timeout = timeout;
+			md_wakeup_thread(mddev->thread);
+		}
+	}
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_timeout =
+__ATTR(time_base, S_IRUGO|S_IWUSR, timeout_show, timeout_store);
+
+static ssize_t
+backlog_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.max_write_behind);
+}
+
+static ssize_t
+backlog_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	unsigned long backlog;
+	int rv = strict_strtoul(buf, 10, &backlog);
+	if (rv)
+		return rv;
+	if (backlog > COUNTER_MAX)
+		backlog = COUNTER_MAX;
+	mddev->bitmap_info.max_write_behind = backlog;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_backlog =
+__ATTR(backlog, S_IRUGO|S_IWUSR, backlog_show, backlog_store);
+
+static ssize_t
+chunksize_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.chunksize);
+}
+
+static ssize_t
+chunksize_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	/* Can only be changed when no bitmap is active */
+	int rv;
+	unsigned long csize;
+	if (mddev->bitmap)
+		return -EBUSY;
+	rv = strict_strtoul(buf, 10, &csize);
+	if (rv)
+		return rv;
+	if (csize < 512 ||
+	    !is_power_of_2(csize))
+		return -EINVAL;
+	mddev->bitmap_info.chunksize = csize;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_chunksize =
+__ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store);
+
+static struct attribute *md_bitmap_attrs[] = {
+	&bitmap_location.attr,
+	&bitmap_timeout.attr,
+	&bitmap_backlog.attr,
+	&bitmap_chunksize.attr,
+	NULL
+};
+struct attribute_group md_bitmap_group = {
+	.name = "bitmap",
+	.attrs = md_bitmap_attrs,
+};
+
+
 /* the bitmap API -- for raid personalities */
 EXPORT_SYMBOL(bitmap_startwrite);
 EXPORT_SYMBOL(bitmap_endwrite);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9f797ac..fbff790 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4013,6 +4013,7 @@ static void mddev_delayed_delete(struct work_struct *ws)
 		mddev->sysfs_action = NULL;
 		mddev->private = NULL;
 	}
+	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
 }
@@ -4104,6 +4105,8 @@ static int md_alloc(dev_t dev, char *name)
 		       disk->disk_name);
 		error = 0;
 	}
+	if (sysfs_create_group(&mddev->kobj, &md_bitmap_group))
+		printk(KERN_DEBUG "pointless warning\n");
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index aa67532..05b91c3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -370,7 +370,7 @@ struct md_sysfs_entry {
 	ssize_t (*show)(mddev_t *, char *);
 	ssize_t (*store)(mddev_t *, const char *, size_t);
 };
-
+extern struct attribute_group md_bitmap_group;
 
 static inline char * mdname (mddev_t * mddev)
 {



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 15/22] md: Support write-intent bitmaps with externally managed metadata.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 17/22] md/raid10: print more useful messages on device failure NeilBrown
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

In this case, the metadata needs to not be in the same
sector as the bitmap.
md will not read/write any bitmap metadata.  Config must be
done via sysfs and when a recovery makes the array non-degraded
again, writing 'true' to 'bitmap/can_clear' will allow bits in
the bitmap to be cleared again.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |  134 +++++++++++++++++++++++++++++++++++++++++++--------
 drivers/md/bitmap.h |   11 ----
 drivers/md/md.h     |    1 
 3 files changed, 114 insertions(+), 32 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 0f3d9a8..aeaa30e 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -498,6 +498,8 @@ void bitmap_update_sb(struct bitmap *bitmap)
 
 	if (!bitmap || !bitmap->mddev) /* no bitmap for this array */
 		return;
+	if (bitmap->mddev->bitmap_info.external)
+		return;
 	spin_lock_irqsave(&bitmap->lock, flags);
 	if (!bitmap->sb_page) { /* no superblock */
 		spin_unlock_irqrestore(&bitmap->lock, flags);
@@ -678,16 +680,26 @@ static int bitmap_mask_state(struct bitmap *bitmap, enum bitmap_state bits,
  * general bitmap file operations
  */
 
+/*
+ * on-disk bitmap:
+ *
+ * Use one bit per "chunk" (block set). We do the disk I/O on the bitmap
+ * file a page at a time. There's a superblock at the start of the file.
+ */
 /* calculate the index of the page that contains this bit */
-static inline unsigned long file_page_index(unsigned long chunk)
+static inline unsigned long file_page_index(struct bitmap *bitmap, unsigned long chunk)
 {
-	return CHUNK_BIT_OFFSET(chunk) >> PAGE_BIT_SHIFT;
+	if (!bitmap->mddev->bitmap_info.external)
+		chunk += sizeof(bitmap_super_t) << 3;
+	return chunk >> PAGE_BIT_SHIFT;
 }
 
 /* calculate the (bit) offset of this bit within a page */
-static inline unsigned long file_page_offset(unsigned long chunk)
+static inline unsigned long file_page_offset(struct bitmap *bitmap, unsigned long chunk)
 {
-	return CHUNK_BIT_OFFSET(chunk) & (PAGE_BITS - 1);
+	if (!bitmap->mddev->bitmap_info.external)
+		chunk += sizeof(bitmap_super_t) << 3;
+	return chunk & (PAGE_BITS - 1);
 }
 
 /*
@@ -700,8 +712,9 @@ static inline unsigned long file_page_offset(unsigned long chunk)
 static inline struct page *filemap_get_page(struct bitmap *bitmap,
 					unsigned long chunk)
 {
-	if (file_page_index(chunk) >= bitmap->file_pages) return NULL;
-	return bitmap->filemap[file_page_index(chunk) - file_page_index(0)];
+	if (file_page_index(bitmap, chunk) >= bitmap->file_pages) return NULL;
+	return bitmap->filemap[file_page_index(bitmap, chunk)
+			       - file_page_index(bitmap, 0)];
 }
 
 
@@ -724,7 +737,7 @@ static void bitmap_file_unmap(struct bitmap *bitmap)
 	spin_unlock_irqrestore(&bitmap->lock, flags);
 
 	while (pages--)
-		if (map[pages]->index != 0) /* 0 is sb_page, release it below */
+		if (map[pages] != sb_page) /* 0 is sb_page, release it below */
 			free_buffers(map[pages]);
 	kfree(map);
 	kfree(attr);
@@ -835,7 +848,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
 
 	page = filemap_get_page(bitmap, chunk);
 	if (!page) return;
-	bit = file_page_offset(chunk);
+	bit = file_page_offset(bitmap, chunk);
 
  	/* set the bit */
 	kaddr = kmap_atomic(page, KM_USER0);
@@ -933,14 +946,17 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 			"recovery\n", bmname(bitmap));
 
 	bytes = (chunks + 7) / 8;
+	if (!bitmap->mddev->bitmap_info.external)
+		bytes += sizeof(bitmap_super_t);
 
-	num_pages = (bytes + sizeof(bitmap_super_t) + PAGE_SIZE - 1) / PAGE_SIZE;
+	
+	num_pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 
-	if (file && i_size_read(file->f_mapping->host) < bytes + sizeof(bitmap_super_t)) {
+	if (file && i_size_read(file->f_mapping->host) < bytes) {
 		printk(KERN_INFO "%s: bitmap file too short %lu < %lu\n",
 			bmname(bitmap),
 			(unsigned long) i_size_read(file->f_mapping->host),
-			bytes + sizeof(bitmap_super_t));
+			bytes);
 		goto err;
 	}
 
@@ -961,17 +977,16 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 
 	for (i = 0; i < chunks; i++) {
 		int b;
-		index = file_page_index(i);
-		bit = file_page_offset(i);
+		index = file_page_index(bitmap, i);
+		bit = file_page_offset(bitmap, i);
 		if (index != oldindex) { /* this is a new page, read it in */
 			int count;
 			/* unmap the old page, we're done with it */
 			if (index == num_pages-1)
-				count = bytes + sizeof(bitmap_super_t)
-					- index * PAGE_SIZE;
+				count = bytes - index * PAGE_SIZE;
 			else
 				count = PAGE_SIZE;
-			if (index == 0) {
+			if (index == 0 && bitmap->sb_page) {
 				/*
 				 * if we're here then the superblock page
 				 * contains some bits (PAGE_SIZE != sizeof sb)
@@ -1166,7 +1181,8 @@ void bitmap_daemon_work(mddev_t *mddev)
 			/* We are possibly going to clear some bits, so make
 			 * sure that events_cleared is up-to-date.
 			 */
-			if (bitmap->need_sync) {
+			if (bitmap->need_sync &&
+			    bitmap->mddev->bitmap_info.external == 0) {
 				bitmap_super_t *sb;
 				bitmap->need_sync = 0;
 				sb = kmap_atomic(bitmap->sb_page, KM_USER0);
@@ -1176,7 +1192,8 @@ void bitmap_daemon_work(mddev_t *mddev)
 				write_page(bitmap, bitmap->sb_page, 1);
 			}
 			spin_lock_irqsave(&bitmap->lock, flags);
-			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
+			if (!bitmap->need_sync)
+				clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}
 		bmc = bitmap_get_counter(bitmap,
 					 (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
@@ -1191,7 +1208,7 @@ void bitmap_daemon_work(mddev_t *mddev)
 			if (*bmc == 2) {
 				*bmc=1; /* maybe clear the bit next time */
 				set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
-			} else if (*bmc == 1) {
+			} else if (*bmc == 1 && !bitmap->need_sync) {
 				/* we can clear the bit */
 				*bmc = 0;
 				bitmap_count_page(bitmap,
@@ -1201,9 +1218,11 @@ void bitmap_daemon_work(mddev_t *mddev)
 				/* clear the bit */
 				paddr = kmap_atomic(page, KM_USER0);
 				if (bitmap->flags & BITMAP_HOSTENDIAN)
-					clear_bit(file_page_offset(j), paddr);
+					clear_bit(file_page_offset(bitmap, j),
+						  paddr);
 				else
-					ext2_clear_bit(file_page_offset(j), paddr);
+					ext2_clear_bit(file_page_offset(bitmap, j),
+						       paddr);
 				kunmap_atomic(paddr, KM_USER0);
 			}
 		} else
@@ -1358,6 +1377,7 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
 		    bitmap->events_cleared < bitmap->mddev->events) {
 			bitmap->events_cleared = bitmap->mddev->events;
 			bitmap->need_sync = 1;
+			sysfs_notify_dirent(bitmap->sysfs_can_clear);
 		}
 
 		if (!success && ! (*bmc & NEEDED_MASK))
@@ -1615,6 +1635,9 @@ void bitmap_destroy(mddev_t *mddev)
 	if (mddev->thread)
 		mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
 
+	if (bitmap->sysfs_can_clear)
+		sysfs_put(bitmap->sysfs_can_clear);
+
 	bitmap_free(bitmap);
 }
 
@@ -1631,6 +1654,7 @@ int bitmap_create(mddev_t *mddev)
 	struct file *file = mddev->bitmap_info.file;
 	int err;
 	sector_t start;
+	struct sysfs_dirent *bm;
 
 	BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
 
@@ -1650,6 +1674,13 @@ int bitmap_create(mddev_t *mddev)
 
 	bitmap->mddev = mddev;
 
+	bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap");
+	if (bm) {
+		bitmap->sysfs_can_clear = sysfs_get_dirent(bm, "can_clear");
+		sysfs_put(bm);
+	} else
+		bitmap->sysfs_can_clear = NULL;
+
 	bitmap->file = file;
 	if (file) {
 		get_file(file);
@@ -1660,7 +1691,11 @@ int bitmap_create(mddev_t *mddev)
 		vfs_fsync(file, file->f_dentry, 1);
 	}
 	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
-	err = bitmap_read_sb(bitmap);
+	if (!mddev->bitmap_info.external)
+		err = bitmap_read_sb(bitmap);
+	else
+		/* FIXME validate chunksize etc */
+		err = 0;
 	if (err)
 		goto error;
 
@@ -1915,11 +1950,66 @@ chunksize_store(mddev_t *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry bitmap_chunksize =
 __ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store);
 
+static ssize_t metadata_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%s\n", (mddev->bitmap_info.external
+				      ? "external" : "internal"));
+}
+
+static ssize_t metadata_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	if (mddev->bitmap ||
+	    mddev->bitmap_info.file ||
+	    mddev->bitmap_info.offset)
+		return -EBUSY;
+	if (strncmp(buf, "external", 8) == 0)
+		mddev->bitmap_info.external = 1;
+	else if (strncmp(buf, "internal", 8) == 0)
+		mddev->bitmap_info.external = 0;
+	else
+		return -EINVAL;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_metadata =
+__ATTR(metadata, S_IRUGO|S_IWUSR, metadata_show, metadata_store);
+
+static ssize_t can_clear_show(mddev_t *mddev, char *page)
+{
+	int len;
+	if (mddev->bitmap)
+		len = sprintf(page, "%s\n", (mddev->bitmap->need_sync ?
+					     "false" : "true"));
+	else
+		len = sprintf(page, "\n");
+	return len;
+}
+
+static ssize_t can_clear_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	if (mddev->bitmap == 0)
+		return -ENOENT;
+	if (strncmp(buf, "false", 5) == 0)
+		mddev->bitmap->need_sync = 1;
+	else if (strncmp(buf, "true", 4) == 0) {
+		if (mddev->degraded)
+			return -EBUSY;
+		mddev->bitmap->need_sync = 0;
+	} else
+		return -EINVAL;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_can_clear =
+__ATTR(can_clear, S_IRUGO|S_IWUSR, can_clear_show, can_clear_store);
+
 static struct attribute *md_bitmap_attrs[] = {
 	&bitmap_location.attr,
 	&bitmap_timeout.attr,
 	&bitmap_backlog.attr,
 	&bitmap_chunksize.attr,
+	&bitmap_metadata.attr,
+	&bitmap_can_clear.attr,
 	NULL
 };
 struct attribute_group md_bitmap_group = {
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 50ee424..cb821d7 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -118,16 +118,6 @@ typedef __u16 bitmap_counter_t;
 			(CHUNK_BLOCK_SHIFT(bitmap) + PAGE_COUNTER_SHIFT - 1)
 #define PAGEPTR_BLOCK_MASK(bitmap) (PAGEPTR_BLOCK_RATIO(bitmap) - 1)
 
-/*
- * on-disk bitmap:
- *
- * Use one bit per "chunk" (block set). We do the disk I/O on the bitmap
- * file a page at a time. There's a superblock at the start of the file.
- */
-
-/* map chunks (bits) to file pages - offset by the size of the superblock */
-#define CHUNK_BIT_OFFSET(chunk) ((chunk) + (sizeof(bitmap_super_t) << 3))
-
 #endif
 
 /*
@@ -250,6 +240,7 @@ struct bitmap {
 	wait_queue_head_t write_wait;
 	wait_queue_head_t overflow_wait;
 
+	struct sysfs_dirent *sysfs_can_clear;
 };
 
 /* the bitmap API */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 05b91c3..b74b05d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -295,6 +295,7 @@ struct mddev_s
 		unsigned long		chunksize;
 		unsigned long		daemon_sleep; /* how many seconds between updates? */
 		unsigned long		max_write_behind; /* write-behind mode */
+		int			external;
 	} bitmap_info;
 	struct list_head		all_mddevs;
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 16/22] md/bitmap: update dirty flag when bitmap bits are explicitly set.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (12 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 21/22] md: move compat_ioctl handling into md.c NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 01/22] md/bitmap: protect against bitmap removal while being updated NeilBrown
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

There is a sysfs file which allows bits in the write-intent
bitmap to be explicit set - indicating that the block is thought
to be 'dirty'.
When this happens we should really set recovery_cp backwards
to include the block to reflect this dirtiness.

In particular, a 'resync' process will refuse to start if
recovery_cp is beyond the end of the array, so this is needed
to allow a resync to be triggered.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/bitmap.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index aeaa30e..5960bd5 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1568,6 +1568,12 @@ void bitmap_dirty_bits(struct bitmap *bitmap, unsigned long s, unsigned long e)
 		sector_t sec = (sector_t)chunk << CHUNK_BLOCK_SHIFT(bitmap);
 		bitmap_set_memory_bits(bitmap, sec, 1);
 		bitmap_file_set_bit(bitmap, sec);
+		if (sec < bitmap->mddev->recovery_cp)
+			/* We are asserting that the array is dirty,
+			 * so move the recovery_cp address back so
+			 * that it is obvious that it is dirty
+			 */
+			bitmap->mddev->recovery_cp = sec;
 	}
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 17/22] md/raid10: print more useful messages on device failure.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
  2009-12-04  6:48 ` [md PATCH 15/22] md: Support write-intent bitmaps with externally managed metadata NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 14/22] md: support updating bitmap parameters via sysfs NeilBrown
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert Becker

From: Robert Becker <Rob.Becker@riverbed.com>

When we get a read error on a device in a RAID10, and attempting to
repair the error fails, print more useful messages about why it
failed.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid10.c |   32 +++++++++++++++++++++++++++++---
 1 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d9e28a6..670449f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1488,6 +1488,7 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 		/* write it back and re-read */
 		rcu_read_lock();
 		while (sl != r10_bio->read_slot) {
+			char b[BDEVNAME_SIZE];
 			int d;
 			if (sl==0)
 				sl = conf->copies;
@@ -1503,9 +1504,21 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 						 r10_bio->devs[sl].addr +
 						 sect + rdev->data_offset,
 						 s<<9, conf->tmppage, WRITE)
-				    == 0)
+				    == 0) {
 					/* Well, this device is dead */
+					printk(KERN_NOTICE
+					       "raid10:%s: read correction "
+					       "write failed"
+					       " (%d sectors at %llu on %s)\n",
+					       mdname(mddev), s,
+					       (unsigned long long)(sect+
+					       rdev->data_offset),
+					       bdevname(rdev->bdev, b));
+					printk(KERN_NOTICE "raid10:%s: failing "
+					       "drive\n",
+					       bdevname(rdev->bdev, b));
 					md_error(mddev, rdev);
+				}
 				rdev_dec_pending(rdev, mddev);
 				rcu_read_lock();
 			}
@@ -1526,10 +1539,22 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 				if (sync_page_io(rdev->bdev,
 						 r10_bio->devs[sl].addr +
 						 sect + rdev->data_offset,
-						 s<<9, conf->tmppage, READ) == 0)
+						 s<<9, conf->tmppage,
+						 READ) == 0) {
 					/* Well, this device is dead */
+					printk(KERN_NOTICE
+					       "raid10:%s: unable to read back "
+					       "corrected sectors"
+					       " (%d sectors at %llu on %s)\n",
+					       mdname(mddev), s,
+					       (unsigned long long)(sect+
+						    rdev->data_offset),
+					       bdevname(rdev->bdev, b));
+					printk(KERN_NOTICE "raid10:%s: failing drive\n",
+					       bdevname(rdev->bdev, b));
+
 					md_error(mddev, rdev);
-				else
+				} else {
 					printk(KERN_INFO
 					       "raid10:%s: read error corrected"
 					       " (%d sectors at %llu on %s)\n",
@@ -1537,6 +1562,7 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 					       (unsigned long long)(sect+
 					            rdev->data_offset),
 					       bdevname(rdev->bdev, b));
+				}
 
 				rdev_dec_pending(rdev, mddev);
 				rcu_read_lock();



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 18/22] raid: improve MD/raid10 handling of correctable read errors.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (2 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 14/22] md: support updating bitmap parameters via sysfs NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 20/22] md: revise Kconfig help for MD_MULTIPATH NeilBrown
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Robert Becker

From: Robert Becker <Rob.Becker@riverbed.com>

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts.  Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c     |   34 +++++++++++++++++++++++
 drivers/md/md.h     |    4 +++
 drivers/md/raid10.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fbff790..0ebd6b6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -68,6 +68,12 @@ static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 
 /*
+ * Default number of read corrections we'll attempt on an rdev
+ * before ejecting it from the array. We divide the read error
+ * count by 2 for every hour elapsed between read errors.
+ */
+#define MD_DEFAULT_MAX_CORRECTED_READ_ERRORS 20
+/*
  * Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
  * is 1000 KB/sec, so the extra system load does not show up that much.
  * Increase it if you want to have more _guaranteed_ speed. Note that
@@ -2668,6 +2674,8 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
 	rdev->flags = 0;
 	rdev->data_offset = 0;
 	rdev->sb_events = 0;
+	rdev->last_read_error.tv_sec  = 0;
+	rdev->last_read_error.tv_nsec = 0;
 	atomic_set(&rdev->nr_pending, 0);
 	atomic_set(&rdev->read_errors, 0);
 	atomic_set(&rdev->corrected_errors, 0);
@@ -3285,6 +3293,29 @@ static struct md_sysfs_entry md_array_state =
 __ATTR(array_state, S_IRUGO|S_IWUSR, array_state_show, array_state_store);
 
 static ssize_t
+max_corrected_read_errors_show(mddev_t *mddev, char *page) {
+	return sprintf(page, "%d\n",
+		       atomic_read(&mddev->max_corr_read_errors));
+}
+
+static ssize_t
+max_corrected_read_errors_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+
+	if (*buf && (*e == 0 || *e == '\n')) {
+		atomic_set(&mddev->max_corr_read_errors, n);
+		return len;
+	}
+	return -EINVAL;
+}
+
+static struct md_sysfs_entry max_corr_read_errors =
+__ATTR(max_read_errors, S_IRUGO|S_IWUSR, max_corrected_read_errors_show,
+	max_corrected_read_errors_store);
+
+static ssize_t
 null_show(mddev_t *mddev, char *page)
 {
 	return -EINVAL;
@@ -3909,6 +3940,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_array_state.attr,
 	&md_reshape_position.attr,
 	&md_array_size.attr,
+	&max_corr_read_errors.attr,
 	NULL,
 };
 
@@ -4328,6 +4360,8 @@ static int do_md_run(mddev_t * mddev)
 		mddev->ro = 0;
 
  	atomic_set(&mddev->writes_pending,0);
+	atomic_set(&mddev->max_corr_read_errors,
+		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
 	mddev->safemode = 0;
 	mddev->safemode_timer.function = md_safemode_timeout;
 	mddev->safemode_timer.data = (unsigned long) mddev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b74b05d..3e94232 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -97,6 +97,9 @@ struct mdk_rdev_s
 	atomic_t	read_errors;	/* number of consecutive read errors that
 					 * we have tried to ignore.
 					 */
+	struct timespec last_read_error;	/* monotonic time since our
+						 * last read error
+						 */
 	atomic_t	corrected_errors; /* number of corrected read errors,
 					   * for reporting to userspace and storing
 					   * in superblock.
@@ -297,6 +300,7 @@ struct mddev_s
 		unsigned long		max_write_behind; /* write-behind mode */
 		int			external;
 	} bitmap_info;
+	atomic_t 			max_corr_read_errors; /* max read retries */
 	struct list_head		all_mddevs;
 
 	/* Generic barrier handling.
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 670449f..5c71a46 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1432,6 +1432,43 @@ static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 
 
 /*
+ * Used by fix_read_error() to decay the per rdev read_errors.
+ * We halve the read error count for every hour that has elapsed
+ * since the last recorded read error.
+ *
+ */
+static void check_decay_read_errors(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	struct timespec cur_time_mon;
+	unsigned long hours_since_last;
+	unsigned int read_errors = atomic_read(&rdev->read_errors);
+
+	ktime_get_ts(&cur_time_mon);
+
+	if (rdev->last_read_error.tv_sec == 0 &&
+	    rdev->last_read_error.tv_nsec == 0) {
+		/* first time we've seen a read error */
+		rdev->last_read_error = cur_time_mon;
+		return;
+	}
+
+	hours_since_last = (cur_time_mon.tv_sec -
+			    rdev->last_read_error.tv_sec) / 3600;
+
+	rdev->last_read_error = cur_time_mon;
+
+	/*
+	 * if hours_since_last is > the number of bits in read_errors
+	 * just set read errors to 0. We do this to avoid
+	 * overflowing the shift of read_errors by hours_since_last.
+	 */
+	if (hours_since_last >= 8 * sizeof(read_errors))
+		atomic_set(&rdev->read_errors, 0);
+	else
+		atomic_set(&rdev->read_errors, read_errors >> hours_since_last);
+}
+
+/*
  * This is a kernel thread which:
  *
  *	1.	Retries failed read operations on working mirrors.
@@ -1444,6 +1481,43 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 	int sect = 0; /* Offset from r10_bio->sector */
 	int sectors = r10_bio->sectors;
 	mdk_rdev_t*rdev;
+	int max_read_errors = atomic_read(&mddev->max_corr_read_errors);
+
+	rcu_read_lock();
+	{
+		int d = r10_bio->devs[r10_bio->read_slot].devnum;
+		char b[BDEVNAME_SIZE];
+		int cur_read_error_count = 0;
+
+		rdev = rcu_dereference(conf->mirrors[d].rdev);
+		bdevname(rdev->bdev, b);
+
+		if (test_bit(Faulty, &rdev->flags)) {
+			rcu_read_unlock();
+			/* drive has already been failed, just ignore any
+			   more fix_read_error() attempts */
+			return;
+		}
+
+		check_decay_read_errors(mddev, rdev);
+		atomic_inc(&rdev->read_errors);
+		cur_read_error_count = atomic_read(&rdev->read_errors);
+		if (cur_read_error_count > max_read_errors) {
+			rcu_read_unlock();
+			printk(KERN_NOTICE
+			       "raid10: %s: Raid device exceeded "
+			       "read_error threshold "
+			       "[cur %d:max %d]\n",
+			       b, cur_read_error_count, max_read_errors);
+			printk(KERN_NOTICE
+			       "raid10: %s: Failing raid "
+			       "device\n", b);
+			md_error(mddev, conf->mirrors[d].rdev);
+			return;
+		}
+	}
+	rcu_read_unlock();
+
 	while(sectors) {
 		int s = sectors;
 		int sl = r10_bio->read_slot;



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 19/22] md: add MODULE_DESCRIPTION for all md related modules.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (8 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 07/22] md/raid1: add takeover support for raid5->raid1 NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 22/22] md: integrate spares into array at earliest opportunity NeilBrown
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

Suggested by  Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/faulty.c     |    1 +
 drivers/md/linear.c     |    1 +
 drivers/md/md.c         |    1 +
 drivers/md/multipath.c  |    1 +
 drivers/md/raid0.c      |    1 +
 drivers/md/raid1.c      |    1 +
 drivers/md/raid10.c     |    1 +
 drivers/md/raid5.c      |    1 +
 drivers/md/raid6algos.c |    1 +
 9 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 87d88db..713acd0 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -360,6 +360,7 @@ static void raid_exit(void)
 module_init(raid_init);
 module_exit(raid_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Fault injection personality for MD");
 MODULE_ALIAS("md-personality-10"); /* faulty */
 MODULE_ALIAS("md-faulty");
 MODULE_ALIAS("md-level--5");
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3b3f77c..00435bd 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -383,6 +383,7 @@ static void linear_exit (void)
 module_init(linear_init);
 module_exit(linear_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Linear device concatenation personality for MD");
 MODULE_ALIAS("md-personality-1"); /* LINEAR - deprecated*/
 MODULE_ALIAS("md-linear");
 MODULE_ALIAS("md-level--1");
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ebd6b6..5111587 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7159,5 +7159,6 @@ EXPORT_SYMBOL(md_unregister_thread);
 EXPORT_SYMBOL(md_wakeup_thread);
 EXPORT_SYMBOL(md_check_recovery);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MD RAID framework");
 MODULE_ALIAS("md");
 MODULE_ALIAS_BLOCKDEV_MAJOR(MD_MAJOR);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index cbc0a99..32a662f 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -581,6 +581,7 @@ static void __exit multipath_exit (void)
 module_init(multipath_init);
 module_exit(multipath_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("simple multi-path personality for MD");
 MODULE_ALIAS("md-personality-7"); /* MULTIPATH */
 MODULE_ALIAS("md-multipath");
 MODULE_ALIAS("md-level--4");
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 122d07a..77605cd 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -567,6 +567,7 @@ static void raid0_exit (void)
 module_init(raid0_init);
 module_exit(raid0_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID0 (striping) personality for MD");
 MODULE_ALIAS("md-personality-2"); /* RAID0 */
 MODULE_ALIAS("md-raid0");
 MODULE_ALIAS("md-level-0");
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index da017f0..2be48a4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2376,6 +2376,7 @@ static void raid_exit(void)
 module_init(raid_init);
 module_exit(raid_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID1 (mirroring) personality for MD");
 MODULE_ALIAS("md-personality-3"); /* RAID1 */
 MODULE_ALIAS("md-raid1");
 MODULE_ALIAS("md-level-1");
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5c71a46..d119b7b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2408,6 +2408,7 @@ static void raid_exit(void)
 module_init(raid_init);
 module_exit(raid_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID10 (striped mirror) personality for MD");
 MODULE_ALIAS("md-personality-9"); /* RAID10 */
 MODULE_ALIAS("md-raid10");
 MODULE_ALIAS("md-level-10");
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8c772b2..7d5808d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5893,6 +5893,7 @@ static void raid5_exit(void)
 module_init(raid5_init);
 module_exit(raid5_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID4/5/6 (striping with parity) personality for MD");
 MODULE_ALIAS("md-personality-4"); /* RAID5 */
 MODULE_ALIAS("md-raid5");
 MODULE_ALIAS("md-raid4");
diff --git a/drivers/md/raid6algos.c b/drivers/md/raid6algos.c
index 8ce102c..bffc61b 100644
--- a/drivers/md/raid6algos.c
+++ b/drivers/md/raid6algos.c
@@ -150,3 +150,4 @@ static void raid6_exit(void)
 subsys_initcall(raid6_select_algo);
 module_exit(raid6_exit);
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID6 Q-syndrome calculations");



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 20/22] md: revise Kconfig help for MD_MULTIPATH
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (3 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 18/22] raid: improve MD/raid10 handling of correctable read errors NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 06/22] md: add honouring of suspend_{lo,hi} to raid1 NeilBrown
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Oren Held

Make it clear in the config message that MD_MULTIPATH is not under
active development.

Cc: Oren Held <orenhe@il.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/Kconfig |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2158377..acb3a4e 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -185,11 +185,10 @@ config MD_MULTIPATH
 	tristate "Multipath I/O support"
 	depends on BLK_DEV_MD
 	help
-	  Multipath-IO is the ability of certain devices to address the same
-	  physical disk over multiple 'IO paths'. The code ensures that such
-	  paths can be defined and handled at runtime, and ensures that a
-	  transparent failover to the backup path(s) happens if a IO errors
-	  arrives on the primary path.
+	  MD_MULTIPATH provides a simple multi-path personality for use
+	  the MD framework.  It is not under active development.  New
+	  projects should consider using DM_MULTIPATH which has more
+	  features and more testing.
 
 	  If unsure, say N.
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 22/22] md: integrate spares into array at earliest opportunity.
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (9 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 19/22] md: add MODULE_DESCRIPTION for all md related modules NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 13/22] md: support bitmap offset appropriate for external-metadata arrays NeilBrown
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid

As v1.x metadata can record that a member of the array is
not completely recovered, it make sense to record that a
spare has become a regular member of the array at the earliest
opportunity.
So remove the tests on "recovery_offset > 0" in super_1_sync
as they really aren't needed, and schedule a metadata update
immediately after adding spares to a degraded array.

This means that if a crash happens immediately after a recovery
starts, the new device will be included in the array and recovery will
continue from wherever it was up to.  Previously this didn't happen
unless recovery was at least 1/16 of the way through.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c4a0e59..8f70166 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1523,12 +1523,10 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 
 	if (rdev->raid_disk >= 0 &&
 	    !test_bit(In_sync, &rdev->flags)) {
-		if (rdev->recovery_offset > 0) {
-			sb->feature_map |=
-				cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
-			sb->recovery_offset =
-				cpu_to_le64(rdev->recovery_offset);
-		}
+		sb->feature_map |=
+			cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
+		sb->recovery_offset =
+			cpu_to_le64(rdev->recovery_offset);
 	}
 
 	if (mddev->reshape_position != MaxSector) {
@@ -1562,7 +1560,7 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 			sb->dev_roles[i] = cpu_to_le16(0xfffe);
 		else if (test_bit(In_sync, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
-		else if (rdev2->raid_disk >= 0 && rdev2->recovery_offset > 0)
+		else if (rdev2->raid_disk >= 0)
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
 		else
 			sb->dev_roles[i] = cpu_to_le16(0xffff);
@@ -6777,6 +6775,7 @@ static int remove_and_add_spares(mddev_t *mddev)
 						       nm, mdname(mddev));
 					spares++;
 					md_new_event(mddev);
+					set_bit(MD_CHANGE_DEVS, &mddev->flags);
 				} else
 					break;
 			}



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [md PATCH 21/22] md: move compat_ioctl handling into md.c
  2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
                   ` (11 preceding siblings ...)
  2009-12-04  6:48 ` [md PATCH 13/22] md: support bitmap offset appropriate for external-metadata arrays NeilBrown
@ 2009-12-04  6:48 ` NeilBrown
  2009-12-04  6:48 ` [md PATCH 16/22] md/bitmap: update dirty flag when bitmap bits are explicitly set NeilBrown
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2009-12-04  6:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Andre Noll, Arnd Bergmann

From: Arnd Bergmann <arnd@arndb.de>

The RAID ioctls are only implemented in md.c, so the
handling for them should also be moved there from
fs/compat_ioctl.c.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Andre Noll <maan@systemlinux.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c   |   23 +++++++++++++++++++++++
 fs/compat_ioctl.c |   22 ----------------------
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5111587..c4a0e59 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -44,6 +44,7 @@
 #include <linux/random.h>
 #include <linux/reboot.h>
 #include <linux/file.h>
+#include <linux/compat.h>
 #include <linux/delay.h>
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
@@ -5681,6 +5682,25 @@ done:
 abort:
 	return err;
 }
+#ifdef CONFIG_COMPAT
+static int md_compat_ioctl(struct block_device *bdev, fmode_t mode,
+		    unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case HOT_REMOVE_DISK:
+	case HOT_ADD_DISK:
+	case SET_DISK_FAULTY:
+	case SET_BITMAP_FILE:
+		/* These take in integer arg, do not convert */
+		break;
+	default:
+		arg = (unsigned long)compat_ptr(arg);
+		break;
+	}
+
+	return md_ioctl(bdev, mode, cmd, arg);
+}
+#endif /* CONFIG_COMPAT */
 
 static int md_open(struct block_device *bdev, fmode_t mode)
 {
@@ -5746,6 +5766,9 @@ static const struct block_device_operations md_fops =
 	.open		= md_open,
 	.release	= md_release,
 	.ioctl		= md_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= md_compat_ioctl,
+#endif
 	.getgeo		= md_getgeo,
 	.media_changed	= md_media_changed,
 	.revalidate_disk= md_revalidate,
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index d84e705..ad485ba 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1912,28 +1912,6 @@ COMPATIBLE_IOCTL(FIGETBSZ)
 /* 'X' - originally XFS but some now in the VFS */
 COMPATIBLE_IOCTL(FIFREEZE)
 COMPATIBLE_IOCTL(FITHAW)
-/* RAID */
-COMPATIBLE_IOCTL(RAID_VERSION)
-COMPATIBLE_IOCTL(GET_ARRAY_INFO)
-COMPATIBLE_IOCTL(GET_DISK_INFO)
-COMPATIBLE_IOCTL(PRINT_RAID_DEBUG)
-COMPATIBLE_IOCTL(RAID_AUTORUN)
-COMPATIBLE_IOCTL(CLEAR_ARRAY)
-COMPATIBLE_IOCTL(ADD_NEW_DISK)
-ULONG_IOCTL(HOT_REMOVE_DISK)
-COMPATIBLE_IOCTL(SET_ARRAY_INFO)
-COMPATIBLE_IOCTL(SET_DISK_INFO)
-COMPATIBLE_IOCTL(WRITE_RAID_INFO)
-COMPATIBLE_IOCTL(UNPROTECT_ARRAY)
-COMPATIBLE_IOCTL(PROTECT_ARRAY)
-ULONG_IOCTL(HOT_ADD_DISK)
-ULONG_IOCTL(SET_DISK_FAULTY)
-COMPATIBLE_IOCTL(RUN_ARRAY)
-COMPATIBLE_IOCTL(STOP_ARRAY)
-COMPATIBLE_IOCTL(STOP_ARRAY_RO)
-COMPATIBLE_IOCTL(RESTART_ARRAY_RW)
-COMPATIBLE_IOCTL(GET_BITMAP_FILE)
-ULONG_IOCTL(SET_BITMAP_FILE)
 /* Big K */
 COMPATIBLE_IOCTL(PIO_FONT)
 COMPATIBLE_IOCTL(GIO_FONT)



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [md PATCH 14/22] md: support updating bitmap parameters via sysfs.
  2009-12-04  6:48 ` [md PATCH 14/22] md: support updating bitmap parameters via sysfs NeilBrown
@ 2009-12-08 10:29   ` Andre Noll
  2009-12-10  6:14     ` Neil Brown
  0 siblings, 1 reply; 30+ messages in thread
From: Andre Noll @ 2009-12-08 10:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4969 bytes --]

On 17:48, NeilBrown wrote:
> A new attribute directory 'bitmap' in 'md' is created which
> contains files for configuring the bitmap.

Nice.

> 'location' identifies where the bitmap is, either 'none',
> or 'file' or 'sector offset from metadata'.
> Writing 'location' can create or remove a bitmap.
> Adding a 'file' bitmap this way is not yet supported.
> 
> 'chunksize' can be set before creating a bitmap, but is
> currently always over-ridden by the bitmap superblock.
> 
> 'time_base' and 'backlog' can be updated at any time.

This text should probably go into the mdadm documentation as well.

> +static ssize_t
> +location_show(mddev_t *mddev, char *page)
> +{
> +	ssize_t len = 0;
> +	if (mddev->bitmap_info.file) {
> +		len = sprintf(page, "file");
> +	} else if (mddev->bitmap_info.offset) {
> +		len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
> +	} else
> +		len = sprintf(page, "none");
> +	len += sprintf(page+len, "\n");
> +	return len;
> +}

Pointless initialization to zero.

> +	} else {
> +		/* No bitmap, OK to set a location */
> +		long long offset;
> +		if (strncmp(buf, "none", 4) == 0)
> +			/* nothing to be done */;
> +		else if (strncmp(buf, "file:", 5) == 0) {
> +			/* Not supported yet */
> +			return -EINVAL;
> +		} else {
> +			int rv;
> +			if (buf[0] == '+')
> +				rv = strict_strtoll(buf+1, 10, &offset);
> +			else
> +				rv = strict_strtoll(buf, 10, &offset);
> +			if (rv)
> +				return rv;
> +			mddev->bitmap_info.offset = offset;

Introducing a sanity check here would probably save some sysadmin's
feet.

> +static struct md_sysfs_entry bitmap_location =
> +__ATTR(location, S_IRUGO|S_IWUSR, location_show, location_store);
> +
> +static ssize_t
> +timeout_show(mddev_t *mddev, char *page)
> +{
> +	ssize_t len = 0;
> +	unsigned long secs = mddev->bitmap_info.daemon_sleep / HZ;
> +	unsigned long jifs = mddev->bitmap_info.daemon_sleep % HZ;
> +	
> +	len = sprintf(page, "%lu", secs);
> +	if (jifs)
> +		len += sprintf(page+len, ".%03u", jiffies_to_msecs(jifs));
> +	len += sprintf(page+len, "\n");
> +	return len;
> +}

Again, no need to initialize len.

> +
> +int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale)
> +{
> +	unsigned long result = 0;
> +	long decimals = -1;
> +	while (isdigit(*cp) || (*cp == '.' && decimals < 0)) {
> +		if (*cp == '.')
> +			decimals = 0;
> +		else if (decimals < scale) {
> +			unsigned int value;
> +			value = *cp - '0';
> +			result = result * 10 + value;
> +			if (decimals >= 0)
> +				decimals++;
> +		}
> +		cp++;
> +	}
> +	if (*cp == '\n')
> +		cp++;
> +	if (*cp)
> +		return -EINVAL;
> +	if (decimals < 0)
> +		decimals = 0;
> +	while (decimals < scale) {
> +		result *= 10;
> +		decimals ++;
> +	}
> +	*res = result;
> +	return 0;
> +}

A comment which describes what exactly this function is supposed to
to would be nice. It has only one caller and could be made static BTW.
OTOH, it probably belongs to lib/vsprintf.c.

> +static ssize_t
> +timeout_store(mddev_t *mddev, const char *buf, size_t len)
> +{
> +	/* timeout can be set at any time */
> +	unsigned long timeout;
> +	int rv = strict_strtoul_scaled(buf, &timeout, 4);
> +	if (rv)
> +		return rv;
> +
> +	timeout = timeout * HZ / 10000;

This might overflow. But maybe that's OK as only insane values (written
by root) will cause an overflow.

> +
> +	if (timeout > MAX_SCHEDULE_TIMEOUT)
> +		timeout = MAX_SCHEDULE_TIMEOUT;
> +	if (timeout < 1)
> +		timeout = 1;
> +	mddev->bitmap_info.daemon_sleep = timeout;
> +	if (mddev->thread) {
> +		if (mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT) {
> +			mddev->thread->timeout = timeout;
> +			md_wakeup_thread(mddev->thread);
> +		}

I don't get this. Why do we want to set mddev->thread->timeout only
if mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT)?

Also, if the timeout is being set while the array is stopped, the
command succeeds but has no effect (as mddev->thread is NULL). Maybe
it's better to return an error in this case.

> +static ssize_t
> +backlog_store(mddev_t *mddev, const char *buf, size_t len)
> +{
> +	unsigned long backlog;
> +	int rv = strict_strtoul(buf, 10, &backlog);
> +	if (rv)
> +		return rv;
> +	if (backlog > COUNTER_MAX)
> +		backlog = COUNTER_MAX;
> +	mddev->bitmap_info.max_write_behind = backlog;
> +	return len;
> +}

If the given value is too large this code silently ignores that value
and uses the maximal possible value instead. Even if we'd prefer "I
know better what you really want" over "you asked for it you got it",
we should at least be consistent. min_sync_store() for example returns
EINVAL if the given value is not suitable. So does chunksize_store()
below.

The same comment applies to timeout_store() of course.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [md PATCH 04/22] md: support barrier requests on all personalities.
  2009-12-04  6:48 ` [md PATCH 04/22] md: support barrier requests on all personalities NeilBrown
@ 2009-12-08 13:54   ` Andre Noll
  2009-12-10  6:25     ` Neil Brown
  0 siblings, 1 reply; 30+ messages in thread
From: Andre Noll @ 2009-12-08 13:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2890 bytes --]

On 17:48, NeilBrown wrote:
> When a barrier arrives, we send a zero-length barrier to every active
> device.  When that completes - and if the original request was not
> empty -  we submit the barrier request itself (with the barrier flag
> cleared) and the submit a fresh load of zero length barriers.

s/the/then

> +/*
> + * Generic barrier handling for md
> + */
> +
> +static void md_end_barrier(struct bio *bio, int err)
> +{
> +	mdk_rdev_t *rdev = bio->bi_private;
> +	mddev_t *mddev = rdev->mddev;
> +	if (err == -EOPNOTSUPP && mddev->barrier != (void*)1)

How about

	#define POST_REQUEST_BARRIER ((void*)1)

?

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
> +			if (rdev->raid_disk >= 0 &&
> +			    !test_bit(Faulty, &rdev->flags)) {
> +				/* Take two references, one is dropped
> +				 * when request finishes, one after
> +				 * we reclaim rcu_read_lock
> +				 */
> +				struct bio *bi;
> +				atomic_inc(&rdev->nr_pending);
> +				atomic_inc(&rdev->nr_pending);
> +				rcu_read_unlock();
> +				bi = bio_alloc(GFP_KERNEL, 0);
> +				bi->bi_end_io = md_end_barrier;
> +				bi->bi_private = rdev;
> +				bi->bi_bdev = rdev->bdev;
> +				atomic_inc(&mddev->flush_pending);
> +				submit_bio(WRITE_BARRIER, bi);
> +				rcu_read_lock();
> +				rdev_dec_pending(rdev, mddev);
> +			}
> +		rcu_read_unlock();

Calling atomic_inc() twice isn't an atomic operation any more. If
this doesn't matter (because all modifications of rdev->nr_pending
are supposed to happen within RCU read-side critical sections) then
why is rdev->nr_pending an atomic_t at all?

> +void md_barrier_request(mddev_t *mddev, struct bio *bio)
> +{
> +	mdk_rdev_t *rdev;
> +
> +	spin_lock_irq(&mddev->write_lock);
> +	wait_event_lock_irq(mddev->sb_wait,
> +			    !mddev->barrier,
> +			    mddev->write_lock, /*nothing*/);
> +	mddev->barrier = bio;
> +	spin_unlock_irq(&mddev->write_lock);
> +
> +	atomic_set(&mddev->flush_pending, 1);
> +	INIT_WORK(&mddev->barrier_work, md_submit_barrier);
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
> +		if (rdev->raid_disk >= 0 &&
> +		    !test_bit(Faulty, &rdev->flags)) {
> +			struct bio *bi;
> +
> +			atomic_inc(&rdev->nr_pending);
> +			atomic_inc(&rdev->nr_pending);
> +			rcu_read_unlock();
> +			bi = bio_alloc(GFP_KERNEL, 0);
> +			bi->bi_end_io = md_end_barrier;
> +			bi->bi_private = rdev;
> +			bi->bi_bdev = rdev->bdev;
> +			atomic_inc(&mddev->flush_pending);
> +			submit_bio(WRITE_BARRIER, bi);
> +			rcu_read_lock();
> +			rdev_dec_pending(rdev, mddev);
> +		}
> +	rcu_read_unlock();

This loop is identical to the one above, so it might make sense
to put it into a separate function.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [md PATCH 14/22] md: support updating bitmap parameters via sysfs.
  2009-12-08 10:29   ` Andre Noll
@ 2009-12-10  6:14     ` Neil Brown
  2009-12-11 11:46       ` Andre Noll
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Brown @ 2009-12-10  6:14 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

On Tue, 8 Dec 2009 11:29:43 +0100
Andre Noll <maan@systemlinux.org> wrote:

Thanks for the very thorough review.

I think I have address all your issues - see below.

> > +		/* No bitmap, OK to set a location */
> > +		long long offset;
> > +		if (strncmp(buf, "none", 4) == 0)
> > +			/* nothing to be done */;
> > +		else if (strncmp(buf, "file:", 5) == 0) {
> > +			/* Not supported yet */
> > +			return -EINVAL;
> > +		} else {
> > +			int rv;
> > +			if (buf[0] == '+')
> > +				rv = strict_strtoll(buf+1, 10, &offset);
> > +			else
> > +				rv = strict_strtoll(buf, 10, &offset);
> > +			if (rv)
> > +				return rv;
> > +			mddev->bitmap_info.offset = offset;
> 
> Introducing a sanity check here would probably save some sysadmin's
> feet.

What sort of sanity check did you have in mind.  When we come to
write out bitmap data a sanity check is performed and if there is
any overlap the bitmap is disabled.

But sysadmins shouldn't really be touching this directly - that is what mdadm
is (or will be) for.



> > +
> > +int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale)
> > +{
> > +	unsigned long result = 0;
> > +	long decimals = -1;
> > +	while (isdigit(*cp) || (*cp == '.' && decimals < 0)) {
> > +		if (*cp == '.')
> > +			decimals = 0;
> > +		else if (decimals < scale) {
> > +			unsigned int value;
> > +			value = *cp - '0';
> > +			result = result * 10 + value;
> > +			if (decimals >= 0)
> > +				decimals++;
> > +		}
> > +		cp++;
> > +	}
> > +	if (*cp == '\n')
> > +		cp++;
> > +	if (*cp)
> > +		return -EINVAL;
> > +	if (decimals < 0)
> > +		decimals = 0;
> > +	while (decimals < scale) {
> > +		result *= 10;
> > +		decimals ++;
> > +	}
> > +	*res = result;
> > +	return 0;
> > +}
> 
> A comment which describes what exactly this function is supposed to
> to would be nice. It has only one caller and could be made static BTW.
> OTOH, it probably belongs to lib/vsprintf.c.

I've created a separate patch which factors this code out of
safe_delay_store and makes it a global function in md.c
So the bitmap patch just uses that function.
I guess I could move it to lib/vsprintf.c....



> This might overflow. But maybe that's OK as only insane values (written
> by root) will cause an overflow.
> 
> > +
> > +	if (timeout > MAX_SCHEDULE_TIMEOUT)
> > +		timeout = MAX_SCHEDULE_TIMEOUT;
> > +	if (timeout < 1)
> > +		timeout = 1;
> > +	mddev->bitmap_info.daemon_sleep = timeout;
> > +	if (mddev->thread) {
> > +		if (mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT) {
> > +			mddev->thread->timeout = timeout;
> > +			md_wakeup_thread(mddev->thread);
> > +		}
> 
> I don't get this. Why do we want to set mddev->thread->timeout only
> if mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT)?

If mddev->thread->timeout is MAX_SCHEDULE_TIMEOUT then the bitmap is
currently idle (it is all clean) so there is no need to wake up any time soon.

> 
> Also, if the timeout is being set while the array is stopped, the
> command succeeds but has no effect (as mddev->thread is NULL). Maybe
> it's better to return an error in this case.

No - bitmap_info.daemon_sleep is always set.  This is the important part of
the function.  This "if (mddev->thread) {" branch is just to ensure it takes
effect instantly instead of soon.

Thanks for all your very helpful comments.
Here is the new version.

NeilBrown

From 9ebade9bec51bb65559e1fb56fa875dfbadaacf5 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 1 Dec 2009 17:49:54 +1100
Subject: [PATCH] md: support updating bitmap parameters via sysfs.

A new attribute directory 'bitmap' in 'md' is created which
contains files for configuring the bitmap.
'location' identifies where the bitmap is, either 'none',
or 'file' or 'sector offset from metadata'.
Writing 'location' can create or remove a bitmap.
Adding a 'file' bitmap this way is not yet supported.
'chunksize' and 'time_base' must be set before 'location'
can be set.

'chunksize' can be set before creating a bitmap, but is
currently always over-ridden by the bitmap superblock.

'time_base' and 'backlog' can be updated at any time.


Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
---
 Documentation/md.txt |   29 +++++++
 drivers/md/bitmap.c  |  206 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.c      |    3 +
 drivers/md/md.h      |    2 +-
 4 files changed, 239 insertions(+), 1 deletions(-)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..18fad68 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -296,6 +296,35 @@ All md devices contain:
      active-idle
          like active, but no writes have been seen for a while (safe_mode_delay).
 
+  bitmap/location
+     This indicates where the write-intent bitmap for the array is
+     stored.
+     It can be one of "none", "file" or "[+-]N".
+     "file" may later be extended to "file:/file/name"
+     "[+-]N" means that many sectors from the start of the metadata.
+       This is replicated on all devices.  For arrays with externally
+       managed metadata, the offset is from the beginning of the
+       device.
+  bitmap/chunksize
+     The size, in bytes, of the chunk which will be represented by a
+     single bit.  For RAID456, it is a portion of an individual
+     device. For RAID10, it is a portion of the array.  For RAID1, it
+     is both (they come to the same thing).
+  bitmap/time_base
+     The time, in seconds, between looking for bits in the bitmap to
+     be cleared. In the current implementation, a bit will be cleared
+     between 2 and 3 times "time_base" after all the covered blocks
+     are known to be in-sync.
+  bitmap/backlog
+     When write-mostly devices are active in a RAID1, write requests
+     to those devices proceed in the background - the filesystem (or
+     other user of the device) does not have to wait for them.
+     'backlog' sets a limit on the number of concurrent background
+     writes.  If there are more than this, new writes will by
+     synchronous.
+     
+     
+     
 
 As component devices are added to an md array, they appear in the 'md'
 directory as new directories named
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9588654..3ff7c5c 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -27,6 +27,7 @@
 #include <linux/file.h>
 #include <linux/mount.h>
 #include <linux/buffer_head.h>
+#include <linux/ctype.h>
 #include "md.h"
 #include "bitmap.h"
 
@@ -510,6 +511,9 @@ void bitmap_update_sb(struct bitmap *bitmap)
 		bitmap->events_cleared = bitmap->mddev->events;
 		sb->events_cleared = cpu_to_le64(bitmap->events_cleared);
 	}
+	/* Just in case these have been changed via sysfs: */
+	sb->daemon_sleep = cpu_to_le32(bitmap->mddev->bitmap_info.daemon_sleep/HZ);
+	sb->write_behind = cpu_to_le32(bitmap->mddev->bitmap_info.max_write_behind);
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1714,6 +1718,208 @@ int bitmap_create(mddev_t *mddev)
 	return err;
 }
 
+static ssize_t
+location_show(mddev_t *mddev, char *page)
+{
+	ssize_t len;
+	if (mddev->bitmap_info.file) {
+		len = sprintf(page, "file");
+	} else if (mddev->bitmap_info.offset) {
+		len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
+	} else
+		len = sprintf(page, "none");
+	len += sprintf(page+len, "\n");
+	return len;
+}
+
+static ssize_t
+location_store(mddev_t *mddev, const char *buf, size_t len)
+{
+
+	if (mddev->pers) {
+		if (!mddev->pers->quiesce)
+			return -EBUSY;
+		if (mddev->recovery || mddev->sync_thread)
+			return -EBUSY;
+	}
+
+	if (mddev->bitmap || mddev->bitmap_info.file ||
+	    mddev->bitmap_info.offset) {
+		/* bitmap already configured.  Only option is to clear it */
+		if (strncmp(buf, "none", 4) != 0)
+			return -EBUSY;
+		if (mddev->pers) {
+			mddev->pers->quiesce(mddev, 1);
+			bitmap_destroy(mddev);
+			mddev->pers->quiesce(mddev, 0);
+		}
+		mddev->bitmap_info.offset = 0;
+		if (mddev->bitmap_info.file) {
+			struct file *f = mddev->bitmap_info.file;
+			mddev->bitmap_info.file = NULL;
+			restore_bitmap_write_access(f);
+			fput(f);
+		}
+	} else {
+		/* No bitmap, OK to set a location */
+		long long offset;
+		if (strncmp(buf, "none", 4) == 0)
+			/* nothing to be done */;
+		else if (strncmp(buf, "file:", 5) == 0) {
+			/* Not supported yet */
+			return -EINVAL;
+		} else {
+			int rv;
+			if (buf[0] == '+')
+				rv = strict_strtoll(buf+1, 10, &offset);
+			else
+				rv = strict_strtoll(buf, 10, &offset);
+			if (rv)
+				return rv;
+			if (offset == 0)
+				return -EINVAL;
+			if (mddev->major_version == 0 &&
+			    offset != mddev->bitmap_info.default_offset)
+				return -EINVAL;
+			mddev->bitmap_info.offset = offset;
+			if (mddev->pers) {
+				mddev->pers->quiesce(mddev, 1);
+				rv = bitmap_create(mddev);
+				if (rv) {
+					bitmap_destroy(mddev);
+					mddev->bitmap_info.offset = 0;
+				}
+				mddev->pers->quiesce(mddev, 0);
+				if (rv)
+					return rv;
+			}
+		}
+	}
+	if (!mddev->external) {
+		/* Ensure new bitmap info is stored in
+		 * metadata promptly.
+		 */
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+	}
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_location =
+__ATTR(location, S_IRUGO|S_IWUSR, location_show, location_store);
+
+static ssize_t
+timeout_show(mddev_t *mddev, char *page)
+{
+	ssize_t len;
+	unsigned long secs = mddev->bitmap_info.daemon_sleep / HZ;
+	unsigned long jifs = mddev->bitmap_info.daemon_sleep % HZ;
+	
+	len = sprintf(page, "%lu", secs);
+	if (jifs)
+		len += sprintf(page+len, ".%03u", jiffies_to_msecs(jifs));
+	len += sprintf(page+len, "\n");
+	return len;
+}
+
+static ssize_t
+timeout_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	/* timeout can be set at any time */
+	unsigned long timeout;
+	int rv = strict_strtoul_scaled(buf, &timeout, 4);
+	if (rv)
+		return rv;
+
+	/* just to make sure we don't overflow... */
+	if (timeout >= LONG_MAX / HZ)
+		return -EINVAL;
+
+	timeout = timeout * HZ / 10000;
+
+	if (timeout >= MAX_SCHEDULE_TIMEOUT)
+		timeout = MAX_SCHEDULE_TIMEOUT-1;
+	if (timeout < 1)
+		timeout = 1;
+	mddev->bitmap_info.daemon_sleep = timeout;
+	if (mddev->thread) {
+		/* if thread->timeout is MAX_SCHEDULE_TIMEOUT, then
+		 * the bitmap is all clean and we don't need to
+		 * adjust the timeout right now
+		 */
+		if (mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT) {
+			mddev->thread->timeout = timeout;
+			md_wakeup_thread(mddev->thread);
+		}
+	}
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_timeout =
+__ATTR(time_base, S_IRUGO|S_IWUSR, timeout_show, timeout_store);
+
+static ssize_t
+backlog_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.max_write_behind);
+}
+
+static ssize_t
+backlog_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	unsigned long backlog;
+	int rv = strict_strtoul(buf, 10, &backlog);
+	if (rv)
+		return rv;
+	if (backlog > COUNTER_MAX)
+		return -EINVAL;
+	mddev->bitmap_info.max_write_behind = backlog;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_backlog =
+__ATTR(backlog, S_IRUGO|S_IWUSR, backlog_show, backlog_store);
+
+static ssize_t
+chunksize_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.chunksize);
+}
+
+static ssize_t
+chunksize_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	/* Can only be changed when no bitmap is active */
+	int rv;
+	unsigned long csize;
+	if (mddev->bitmap)
+		return -EBUSY;
+	rv = strict_strtoul(buf, 10, &csize);
+	if (rv)
+		return rv;
+	if (csize < 512 ||
+	    !is_power_of_2(csize))
+		return -EINVAL;
+	mddev->bitmap_info.chunksize = csize;
+	return len;
+}
+
+static struct md_sysfs_entry bitmap_chunksize =
+__ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store);
+
+static struct attribute *md_bitmap_attrs[] = {
+	&bitmap_location.attr,
+	&bitmap_timeout.attr,
+	&bitmap_backlog.attr,
+	&bitmap_chunksize.attr,
+	NULL
+};
+struct attribute_group md_bitmap_group = {
+	.name = "bitmap",
+	.attrs = md_bitmap_attrs,
+};
+
+
 /* the bitmap API -- for raid personalities */
 EXPORT_SYMBOL(bitmap_startwrite);
 EXPORT_SYMBOL(bitmap_endwrite);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 29a935b..4ac1984 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4034,6 +4034,7 @@ static void mddev_delayed_delete(struct work_struct *ws)
 		mddev->sysfs_action = NULL;
 		mddev->private = NULL;
 	}
+	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
 }
@@ -4125,6 +4126,8 @@ static int md_alloc(dev_t dev, char *name)
 		       disk->disk_name);
 		error = 0;
 	}
+	if (sysfs_create_group(&mddev->kobj, &md_bitmap_group))
+		printk(KERN_DEBUG "pointless warning\n");
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ed91c39..7c3e86e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -372,7 +372,7 @@ struct md_sysfs_entry {
 	ssize_t (*show)(mddev_t *, char *);
 	ssize_t (*store)(mddev_t *, const char *, size_t);
 };
-
+extern struct attribute_group md_bitmap_group;
 
 static inline char * mdname (mddev_t * mddev)
 {
-- 
1.6.5.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [md PATCH 04/22] md: support barrier requests on all personalities.
  2009-12-08 13:54   ` Andre Noll
@ 2009-12-10  6:25     ` Neil Brown
  2009-12-11 11:46       ` Andre Noll
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Brown @ 2009-12-10  6:25 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

On Tue, 8 Dec 2009 14:54:42 +0100
Andre Noll <maan@systemlinux.org> wrote:

Thanks again.
I have made most of the changes you suggest.

> > +		rcu_read_lock();
> > +		list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
> > +			if (rdev->raid_disk >= 0 &&
> > +			    !test_bit(Faulty, &rdev->flags)) {
> > +				/* Take two references, one is dropped
> > +				 * when request finishes, one after
> > +				 * we reclaim rcu_read_lock
> > +				 */
> > +				struct bio *bi;
> > +				atomic_inc(&rdev->nr_pending);
> > +				atomic_inc(&rdev->nr_pending);
> > +				rcu_read_unlock();
> > +				bi = bio_alloc(GFP_KERNEL, 0);
> > +				bi->bi_end_io = md_end_barrier;
> > +				bi->bi_private = rdev;
> > +				bi->bi_bdev = rdev->bdev;
> > +				atomic_inc(&mddev->flush_pending);
> > +				submit_bio(WRITE_BARRIER, bi);
> > +				rcu_read_lock();
> > +				rdev_dec_pending(rdev, mddev);
> > +			}
> > +		rcu_read_unlock();
> 
> Calling atomic_inc() twice isn't an atomic operation any more. If
> this doesn't matter (because all modifications of rdev->nr_pending
> are supposed to happen within RCU read-side critical sections) then
> why is rdev->nr_pending an atomic_t at all?

Calling atomic_inc() twice is still two atomic operations, and that is what I
need.
The important thing is that the read-modify-write cycle of atomic_inc isn't
interrupted, so if two thread both do atomic_inc at the same time the result
really is adding 2, not just adding one.

Multiple RCU read-side blocks can run at the same time on different
processors - so we definitely need to protection of 'atomic_'.

Here is the new version of the patch - thanks.

NeilBrown

From 87d04b8950f80e77b376531506cae08387578f64 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 1 Dec 2009 17:49:51 +1100
Subject: [PATCH] md: support barrier requests on all personalities.

Previously barriers were only supported on RAID1.  This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/linear.c    |    2 +-
 drivers/md/md.c        |  105 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/md.h        |   12 +++++
 drivers/md/multipath.c |    2 +-
 drivers/md/raid0.c     |    2 +-
 drivers/md/raid10.c    |    2 +-
 drivers/md/raid5.c     |    8 +++-
 7 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 1ceceb3..3b3f77c 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -292,7 +292,7 @@ static int linear_make_request (struct request_queue *q, struct bio *bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index eaf0a92..c94b6cd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -217,12 +217,12 @@ static int md_make_request(struct request_queue *q, struct bio *bio)
 		return 0;
 	}
 	rcu_read_lock();
-	if (mddev->suspended) {
+	if (mddev->suspended || mddev->barrier) {
 		DEFINE_WAIT(__wait);
 		for (;;) {
 			prepare_to_wait(&mddev->sb_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
-			if (!mddev->suspended)
+			if (!mddev->suspended && !mddev->barrier)
 				break;
 			rcu_read_unlock();
 			schedule();
@@ -264,10 +264,110 @@ static void mddev_resume(mddev_t *mddev)
 
 int mddev_congested(mddev_t *mddev, int bits)
 {
+	if (mddev->barrier)
+		return 1;
 	return mddev->suspended;
 }
 EXPORT_SYMBOL(mddev_congested);
 
+/*
+ * Generic barrier handling for md
+ */
+
+#define POST_REQUEST_BARRIER ((void*)1)
+
+static void md_end_barrier(struct bio *bio, int err)
+{
+	mdk_rdev_t *rdev = bio->bi_private;
+	mddev_t *mddev = rdev->mddev;
+	if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
+		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);
+
+	rdev_dec_pending(rdev, mddev);
+
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		if (mddev->barrier == POST_REQUEST_BARRIER) {
+			/* This was a post-request barrier */
+			mddev->barrier = NULL;
+			wake_up(&mddev->sb_wait);
+		} else
+			/* The pre-request barrier has finished */
+			schedule_work(&mddev->barrier_work);
+	}
+	bio_put(bio);
+}
+
+static void submit_barriers(mddev_t *mddev)
+{
+	mdk_rdev_t *rdev;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= 0 &&
+		    !test_bit(Faulty, &rdev->flags)) {
+			/* Take two references, one is dropped
+			 * when request finishes, one after
+			 * we reclaim rcu_read_lock
+			 */
+			struct bio *bi;
+			atomic_inc(&rdev->nr_pending);
+			atomic_inc(&rdev->nr_pending);
+			rcu_read_unlock();
+			bi = bio_alloc(GFP_KERNEL, 0);
+			bi->bi_end_io = md_end_barrier;
+			bi->bi_private = rdev;
+			bi->bi_bdev = rdev->bdev;
+			atomic_inc(&mddev->flush_pending);
+			submit_bio(WRITE_BARRIER, bi);
+			rcu_read_lock();
+			rdev_dec_pending(rdev, mddev);
+		}
+	rcu_read_unlock();
+}
+
+static void md_submit_barrier(struct work_struct *ws)
+{
+	mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
+	struct bio *bio = mddev->barrier;
+
+	atomic_set(&mddev->flush_pending, 1);
+
+	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
+		bio_endio(bio, -EOPNOTSUPP);
+	else if (bio->bi_size == 0)
+		/* an empty barrier - all done */
+		bio_endio(bio, 0);
+	else {
+		bio->bi_rw &= ~(1<<BIO_RW_BARRIER);
+		if (mddev->pers->make_request(mddev->queue, bio))
+			generic_make_request(bio);
+		mddev->barrier = POST_REQUEST_BARRIER;
+		submit_barriers(mddev);
+	}
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		mddev->barrier = NULL;
+		wake_up(&mddev->sb_wait);
+	}
+}
+
+void md_barrier_request(mddev_t *mddev, struct bio *bio)
+{
+	spin_lock_irq(&mddev->write_lock);
+	wait_event_lock_irq(mddev->sb_wait,
+			    !mddev->barrier,
+			    mddev->write_lock, /*nothing*/);
+	mddev->barrier = bio;
+	spin_unlock_irq(&mddev->write_lock);
+
+	atomic_set(&mddev->flush_pending, 1);
+	INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+
+	submit_barriers(mddev);
+
+	if (atomic_dec_and_test(&mddev->flush_pending))
+		schedule_work(&mddev->barrier_work);
+}
+EXPORT_SYMBOL(md_barrier_request);
 
 static inline mddev_t *mddev_get(mddev_t *mddev)
 {
@@ -375,6 +475,7 @@ static mddev_t * mddev_find(dev_t unit)
 	atomic_set(&new->openers, 0);
 	atomic_set(&new->active_io, 0);
 	spin_lock_init(&new->write_lock);
+	atomic_set(&new->flush_pending, 0);
 	init_waitqueue_head(&new->sb_wait);
 	init_waitqueue_head(&new->recovery_wait);
 	new->reshape_position = MaxSector;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 9e2c412..0569e8f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -292,6 +292,17 @@ struct mddev_s
 	struct mutex			bitmap_mutex;
 
 	struct list_head		all_mddevs;
+
+	/* Generic barrier handling.
+	 * If there is a pending barrier request, all other
+	 * writes are blocked while the devices are flushed.
+	 * The last to finish a flush schedules a worker to
+	 * submit the barrier request (without the barrier flag),
+	 * then submit more flush requests.
+	 */
+	struct bio *barrier;
+	atomic_t flush_pending;
+	struct work_struct barrier_work;
 };
 
 
@@ -432,6 +443,7 @@ extern void md_done_sync(mddev_t *mddev, int blocks, int ok);
 extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);
 
 extern int mddev_congested(mddev_t *mddev, int bits);
+extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
 extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 			   sector_t sector, int size, struct page *page);
 extern void md_super_wait(mddev_t *mddev);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index ee7646f..cbc0a99 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -145,7 +145,7 @@ static int multipath_make_request (struct request_queue *q, struct bio * bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index d3a4ce0..122d07a 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -453,7 +453,7 @@ static int raid0_make_request(struct request_queue *q, struct bio *bio)
 	int cpu;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c2cb7b8..2fbf867 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -804,7 +804,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	mdk_rdev_t *blocked_rdev;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
-		bio_endio(bio, -EOPNOTSUPP);
+		md_barrier_request(mddev, bio);
 		return 0;
 	}
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d29215d..ecf89c8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3866,7 +3866,13 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	int cpu, remaining;
 
 	if (unlikely(bio_rw_flagged(bi, BIO_RW_BARRIER))) {
-		bio_endio(bi, -EOPNOTSUPP);
+		/* Drain all pending writes.  We only really need
+		 * to ensure they have been submitted, but this is
+		 * easier.
+		 */
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
+		md_barrier_request(mddev, bi);
 		return 0;
 	}
 
-- 
1.6.5.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [md PATCH 14/22] md: support updating bitmap parameters via sysfs.
  2009-12-10  6:14     ` Neil Brown
@ 2009-12-11 11:46       ` Andre Noll
  0 siblings, 0 replies; 30+ messages in thread
From: Andre Noll @ 2009-12-11 11:46 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1942 bytes --]

On 17:14, Neil Brown wrote:
> On Tue, 8 Dec 2009 11:29:43 +0100
> Andre Noll <maan@systemlinux.org> wrote:
> 
> Thanks for the very thorough review.
> 
> I think I have address all your issues - see below.

Jup, looks fine to me now.

> 
> > > +		/* No bitmap, OK to set a location */
> > > +		long long offset;
> > > +		if (strncmp(buf, "none", 4) == 0)
> > > +			/* nothing to be done */;
> > > +		else if (strncmp(buf, "file:", 5) == 0) {
> > > +			/* Not supported yet */
> > > +			return -EINVAL;
> > > +		} else {
> > > +			int rv;
> > > +			if (buf[0] == '+')
> > > +				rv = strict_strtoll(buf+1, 10, &offset);
> > > +			else
> > > +				rv = strict_strtoll(buf, 10, &offset);
> > > +			if (rv)
> > > +				return rv;
> > > +			mddev->bitmap_info.offset = offset;
> > 
> > Introducing a sanity check here would probably save some sysadmin's
> > feet.
> 
> What sort of sanity check did you have in mind.  When we come to
> write out bitmap data a sanity check is performed and if there is
> any overlap the bitmap is disabled.

I thought of a simple check for overlaps, not being aware of the fact
that such a check is already done in write_sb_page() of bitmap.c. Doing
the check there is obviously good as well. The only difference is
that for invalid offset values the bitmap is accepted (i.e. setting
the bitmap with mdadm succeeds) but then kicked immediately.


> > Also, if the timeout is being set while the array is stopped, the
> > command succeeds but has no effect (as mddev->thread is NULL). Maybe
> > it's better to return an error in this case.
> 
> No - bitmap_info.daemon_sleep is always set.  This is the important part of
> the function.  This "if (mddev->thread) {" branch is just to ensure it takes
> effect instantly instead of soon.

Ah, I see. Thanks for the explanation.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [md PATCH 04/22] md: support barrier requests on all personalities.
  2009-12-10  6:25     ` Neil Brown
@ 2009-12-11 11:46       ` Andre Noll
  0 siblings, 0 replies; 30+ messages in thread
From: Andre Noll @ 2009-12-11 11:46 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

On 17:25, Neil Brown wrote:
tely Tue, 8 Dec 2009 14:54:42 +0100
> Andre Noll <maan@systemlinux.org> wrote:
> 
> Thanks again.
> I have made most of the changes you suggest.

Thanks. Feel free to add

Reviewed-by: Andre Noll <maan@systemlinux.org>

> > Calling atomic_inc() twice isn't an atomic operation any more. If
> > this doesn't matter (because all modifications of rdev->nr_pending
> > are supposed to happen within RCU read-side critical sections) then
> > why is rdev->nr_pending an atomic_t at all?
> 
> Calling atomic_inc() twice is still two atomic operations, and that is what I
> need.
> The important thing is that the read-modify-write cycle of atomic_inc isn't
> interrupted, so if two thread both do atomic_inc at the same time the result
> really is adding 2, not just adding one.

A conceivable problem is that someone else might remove the
whole thing between the two atomic_inc() calls because it calls
atomic_dec_and_test() and finds it is zero. But this can't possibly
happen here, can it?

> Multiple RCU read-side blocks can run at the same time on different
> processors - so we definitely need to protection of 'atomic_'.

Yes, definitely.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil writes are scheduled
  2009-12-04  6:48 ` [md PATCH 05/22] md/raid5: don't complete make_request on barrier until writes are scheduled NeilBrown
@ 2010-01-21 21:07   ` Tirumala Reddy Marri
  2010-01-28  2:44     ` Neil Brown
  0 siblings, 1 reply; 30+ messages in thread
From: Tirumala Reddy Marri @ 2010-01-21 21:07 UTC (permalink / raw)
  To: NeilBrown, linux-raid

[-- Attachment #1: Type: text/plain, Size: 6116 bytes --]

I tried these patches and I still don’t see barriers working. Do I have to enable to device mapper to see the effect ?

 

 

-------- log -----

~ # 

~ # mount -o barrier=1 /dev/md0 /mnt/tmpmnt/

kjournald starting.  Commit interval 5 seconds

EXT3 FS on md0, internal journal

EXT3-fs: mounted filesystem with writeback data mode.

~ # touch /mnt/tmpmnt/test

~ # sync

JBD: barrier-based sync failed on md0 - disabling barriers

~ #

 

 

From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of NeilBrown
Sent: Thursday, December 03, 2009 10:48 PM
To: linux-raid@vger.kernel.org
Subject: [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil writes are scheduled

 

This message has been archived. View the original item <http://sdcmailvault.ad.amcc.com/EnterpriseVault/ViewMessage.asp?VaultId=1E9560FDB597EB744B7F046F24F9462D91110000sdcmailvault.ad.amcc.com&SavesetId=821000000000000~200912040648020000~0~73C17918B4764F9EBF4C8875EC1437B> 

The post-barrier-flush is sent be md as soon as make_request on the
barrier write completes.  For raid5, the data might not be in the
per-device queues yet.  So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.

We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid5.c |   51 +++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ecf89c8..8c772b2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2947,6 +2947,7 @@ static void handle_stripe5(struct stripe_head *sh)
     struct r5dev *dev;
     mdk_rdev_t *blocked_rdev = NULL;
     int prexor;
+    int dec_preread_active = 0;
 
     memset(&s, 0, sizeof(s));
     pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
@@ -3096,12 +3097,8 @@ static void handle_stripe5(struct stripe_head *sh)
                     set_bit(STRIPE_INSYNC, &sh->state);
             }
         }
-        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-            atomic_dec(&conf->preread_active_stripes);
-            if (atomic_read(&conf->preread_active_stripes) <
-                IO_THRESHOLD)
-                md_wakeup_thread(conf->mddev->thread);
-        }
+        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+            dec_preread_active = 1;
     }
 
     /* Now to consider new write requests and what else, if anything
@@ -3208,6 +3205,16 @@ static void handle_stripe5(struct stripe_head *sh)
 
     ops_run_io(sh, &s);
 
+    if (dec_preread_active) {
+        /* We delay this until after ops_run_io so that if make_request
+         * is waiting on a barrier, it won't continue until the writes
+         * have actually been submitted.
+         */
+        atomic_dec(&conf->preread_active_stripes);
+        if (atomic_read(&conf->preread_active_stripes) <
+            IO_THRESHOLD)
+            md_wakeup_thread(conf->mddev->thread);
+    }
     return_io(return_bi);
 }
 
@@ -3221,6 +3228,7 @@ static void handle_stripe6(struct stripe_head *sh)
     struct r6_state r6s;
     struct r5dev *dev, *pdev, *qdev;
     mdk_rdev_t *blocked_rdev = NULL;
+    int dec_preread_active = 0;
 
     pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
         "pd_idx=%d, qd_idx=%d, check:%d, reconstruct:%d",
@@ -3380,12 +3388,8 @@ static void handle_stripe6(struct stripe_head *sh)
                     set_bit(STRIPE_INSYNC, &sh->state);
             }
         }
-        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-            atomic_dec(&conf->preread_active_stripes);
-            if (atomic_read(&conf->preread_active_stripes) <
-                IO_THRESHOLD)
-                md_wakeup_thread(conf->mddev->thread);
-        }
+        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+            dec_preread_active = 1;
     }
 
     /* Now to consider new write requests and what else, if anything
@@ -3494,6 +3498,18 @@ static void handle_stripe6(struct stripe_head *sh)
 
     ops_run_io(sh, &s);
 
+
+    if (dec_preread_active) {
+        /* We delay this until after ops_run_io so that if make_request
+         * is waiting on a barrier, it won't continue until the writes
+         * have actually been submitted.
+         */
+        atomic_dec(&conf->preread_active_stripes);
+        if (atomic_read(&conf->preread_active_stripes) <
+            IO_THRESHOLD)
+            md_wakeup_thread(conf->mddev->thread);
+    }
+
     return_io(return_bi);
 }
 
@@ -3996,6 +4012,9 @@ static int make_request(struct request_queue *q, struct bio * bi)
             finish_wait(&conf->wait_for_overlap, &w);
             set_bit(STRIPE_HANDLE, &sh->state);
             clear_bit(STRIPE_DELAYED, &sh->state);
+            if (mddev->barrier &&
+                !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+                atomic_inc(&conf->preread_active_stripes);
             release_stripe(sh);
         } else {
             /* cannot get stripe for read-ahead, just give-up */
@@ -4015,6 +4034,14 @@ static int make_request(struct request_queue *q, struct bio * bi)
 
         bio_endio(bi, 0);
     }
+
+    if (mddev->barrier) {
+        /* We need to wait for the stripes to all be handled.
+         * So: wait for preread_active_stripes to drop to 0.
+         */
+        wait_event(mddev->thread->wqueue,
+               atomic_read(&conf->preread_active_stripes) == 0);
+    }
     return 0;
 }
 


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: winmail.dat --]
[-- Type: application/ms-tnef, Size: 17218 bytes --]

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil writes are scheduled
  2010-01-21 21:07   ` [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil " Tirumala Reddy Marri
@ 2010-01-28  2:44     ` Neil Brown
  0 siblings, 0 replies; 30+ messages in thread
From: Neil Brown @ 2010-01-28  2:44 UTC (permalink / raw)
  To: Tirumala Reddy Marri; +Cc: linux-raid

On Thu, 21 Jan 2010 13:07:44 -0800
"Tirumala Reddy Marri" <tmarri@amcc.com> wrote:

> I tried these patches and I still don’t see barriers working. Do I have to enable to device mapper to see the effect ?
> 
>  

You would need to be sure that the component devices of md0 honour barriers.
So try the same test on each component.  If they all work but md0 doesn't,
then ask again and I will look into it.

You would need to give more details, like what sort of array md0 is, what it's
component are, and exactly what kernel you are using.

NeilBrown


> 
>  
> 
> -------- log -----
> 
> ~ # 
> 
> ~ # mount -o barrier=1 /dev/md0 /mnt/tmpmnt/
> 
> kjournald starting.  Commit interval 5 seconds
> 
> EXT3 FS on md0, internal journal
> 
> EXT3-fs: mounted filesystem with writeback data mode.
> 
> ~ # touch /mnt/tmpmnt/test
> 
> ~ # sync
> 
> JBD: barrier-based sync failed on md0 - disabling barriers
> 
> ~ #
> 
>  
> 
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2010-01-28  2:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-04  6:48 [md PATCH 00/22] MD patches queued for 2.6.33 NeilBrown
2009-12-04  6:48 ` [md PATCH 15/22] md: Support write-intent bitmaps with externally managed metadata NeilBrown
2009-12-04  6:48 ` [md PATCH 17/22] md/raid10: print more useful messages on device failure NeilBrown
2009-12-04  6:48 ` [md PATCH 14/22] md: support updating bitmap parameters via sysfs NeilBrown
2009-12-08 10:29   ` Andre Noll
2009-12-10  6:14     ` Neil Brown
2009-12-11 11:46       ` Andre Noll
2009-12-04  6:48 ` [md PATCH 18/22] raid: improve MD/raid10 handling of correctable read errors NeilBrown
2009-12-04  6:48 ` [md PATCH 20/22] md: revise Kconfig help for MD_MULTIPATH NeilBrown
2009-12-04  6:48 ` [md PATCH 06/22] md: add honouring of suspend_{lo,hi} to raid1 NeilBrown
2009-12-04  6:48 ` [md PATCH 12/22] md: remove needless setting of thread->timeout in raid10_quiesce NeilBrown
2009-12-04  6:48 ` [md PATCH 10/22] md: move offset, daemon_sleep and chunksize out of bitmap structure NeilBrown
2009-12-04  6:48 ` [md PATCH 07/22] md/raid1: add takeover support for raid5->raid1 NeilBrown
2009-12-04  6:48 ` [md PATCH 19/22] md: add MODULE_DESCRIPTION for all md related modules NeilBrown
2009-12-04  6:48 ` [md PATCH 22/22] md: integrate spares into array at earliest opportunity NeilBrown
2009-12-04  6:48 ` [md PATCH 13/22] md: support bitmap offset appropriate for external-metadata arrays NeilBrown
2009-12-04  6:48 ` [md PATCH 21/22] md: move compat_ioctl handling into md.c NeilBrown
2009-12-04  6:48 ` [md PATCH 16/22] md/bitmap: update dirty flag when bitmap bits are explicitly set NeilBrown
2009-12-04  6:48 ` [md PATCH 01/22] md/bitmap: protect against bitmap removal while being updated NeilBrown
2009-12-04  6:48 ` [md PATCH 11/22] md: change daemon_sleep to be in 'jiffies' rather than 'seconds' NeilBrown
2009-12-04  6:48 ` [md PATCH 04/22] md: support barrier requests on all personalities NeilBrown
2009-12-08 13:54   ` Andre Noll
2009-12-10  6:25     ` Neil Brown
2009-12-11 11:46       ` Andre Noll
2009-12-04  6:48 ` [md PATCH 09/22] md: collect bitmap-specific fields into one structure NeilBrown
2009-12-04  6:48 ` [md PATCH 02/22] md: adjust resync_min usefully when resync aborts NeilBrown
2009-12-04  6:48 ` [md PATCH 05/22] md/raid5: don't complete make_request on barrier until writes are scheduled NeilBrown
2010-01-21 21:07   ` [md PATCH 05/22] md/raid5: don't complete make_request on barrieruntil " Tirumala Reddy Marri
2010-01-28  2:44     ` Neil Brown
2009-12-04  6:48 ` [md PATCH 03/22] md: don't reset curr_resync_completed after an interrupted resync NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).