* [PATCH md 0 of 4] Introduction
@ 2005-03-08 5:50 NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid
4 patches for md/raid in 2.6.11-mm1
The first two are trivial and should apply equally to 2.6.11
The second two fix bugs that were introduced by the recent
bitmap-based-intent-logging patches and so are not relevant
to 2.6.11 yet.
[PATCH md 1 of 4] Fix typo in super_1_sync
[PATCH md 2 of 4] Erroneous sizeof use in raid1
[PATCH md 3 of 4] Initialise sync_blocks in raid1 resync
[PATCH md 4 of 4] Fix md deadlock due to md thread processing delayed requests.

* [PATCH md 1 of 4] Fix typo in super_1_sync
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Instead of setting one value lots of times, let's set lots of values
once each, as we should..

From: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-03-07 11:32:41.000000000 +1100
+++ ./drivers/md/md.c	2005-03-07 11:32:41.000000000 +1100
@@ -1010,7 +1010,7 @@ static void super_1_sync(mddev_t *mddev,
 	sb->max_dev = cpu_to_le32(max_dev);
 	for (i=0; i<max_dev;i++)
-		sb->dev_roles[max_dev] = cpu_to_le16(0xfffe);
+		sb->dev_roles[i] = cpu_to_le16(0xfffe);
 
 	ITERATE_RDEV(mddev,rdev2,tmp) {
 		i = rdev2->desc_nr;

* [PATCH md 3 of 4] Initialise sync_blocks in raid1 resync
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Otherwise it could have a random value and might BUG.

This fixes a BUG hit during raid1 resync that was introduced by the
bitmap-based-intent-logging patches.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c |    1 +
 1 files changed, 1 insertion(+)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-07 15:49:55.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 15:53:55.000000000 +1100
@@ -1235,6 +1235,7 @@ static sector_t sync_request(mddev_t *md
 	}
 
 	nr_sectors = 0;
+	sync_blocks = 0;
 	do {
 		struct page *page;
 		int len = PAGE_SIZE;

* [PATCH md 4 of 4] Fix md deadlock due to md thread processing delayed requests.
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this update and queues the write request for
later handling by the md_thread.

However, some personalities (raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so that the process submitting the request
waits for the superblock to be written, and then submits the request
itself.

This fixes a recently-created deadlock in raid5/raid6.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c           |   45 +++++++++++++++-----------------------------
 ./drivers/md/raid1.c        |    4 +--
 ./drivers/md/raid10.c       |    3 --
 ./drivers/md/raid5.c        |    3 --
 ./drivers/md/raid6main.c    |    3 --
 ./include/linux/raid/md.h   |    2 -
 ./include/linux/raid/md_k.h |    2 -
 7 files changed, 23 insertions(+), 39 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/md.c	2005-03-08 16:11:44.000000000 +1100
@@ -267,8 +267,8 @@ static mddev_t * mddev_find(dev_t unit)
 	INIT_LIST_HEAD(&new->all_mddevs);
 	init_timer(&new->safemode_timer);
 	atomic_set(&new->active, 1);
-	bio_list_init(&new->write_list);
 	spin_lock_init(&new->write_lock);
+	init_waitqueue_head(&new->sb_wait);
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -1350,6 +1350,7 @@ repeat:
 		if (!mddev->persistent) {
 			mddev->sb_dirty = 0;
 			spin_unlock(&mddev->write_lock);
+			wake_up(&mddev->sb_wait);
 			return;
 		}
 		spin_unlock(&mddev->write_lock);
@@ -1391,6 +1392,7 @@ repeat:
 	}
 	mddev->sb_dirty = 0;
 	spin_unlock(&mddev->write_lock);
+	wake_up(&mddev->sb_wait);
 
 }
@@ -3489,29 +3491,26 @@ void md_done_sync(mddev_t *mddev, int bl
 
 /* md_write_start(mddev, bi)
  * If we need to update some array metadata (e.g. 'active' flag
- * in superblock) before writing, queue bi for later writing
- * and return 0, else return 1 and it will be written now
+ * in superblock) before writing, schedule a superblock update
+ * and wait for it to complete.
  */
-int md_write_start(mddev_t *mddev, struct bio *bi)
+void md_write_start(mddev_t *mddev, struct bio *bi)
 {
+	DEFINE_WAIT(w);
 	if (bio_data_dir(bi) != WRITE)
-		return 1;
+		return;
 
 	atomic_inc(&mddev->writes_pending);
-	spin_lock(&mddev->write_lock);
-	if (mddev->in_sync == 0 && mddev->sb_dirty == 0) {
-		spin_unlock(&mddev->write_lock);
-		return 1;
-	}
-	bio_list_add(&mddev->write_list, bi);
-	if (mddev->in_sync) {
-		mddev->in_sync = 0;
-		mddev->sb_dirty = 1;
+	if (mddev->in_sync) {
+		spin_lock(&mddev->write_lock);
+		if (mddev->in_sync) {
+			mddev->in_sync = 0;
+			mddev->sb_dirty = 1;
+			md_wakeup_thread(mddev->thread);
+		}
+		spin_unlock(&mddev->write_lock);
 	}
-	spin_unlock(&mddev->write_lock);
-	md_wakeup_thread(mddev->thread);
-	return 0;
+	wait_event(mddev->sb_wait, mddev->sb_dirty==0);
 }
 
 void md_write_end(mddev_t *mddev)
@@ -3808,7 +3807,6 @@ void md_check_recovery(mddev_t *mddev)
 		mddev->sb_dirty ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-		mddev->write_list.head ||
 		(mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -3817,7 +3815,6 @@ void md_check_recovery(mddev_t *mddev)
 
 	if (mddev_trylock(mddev)==0) {
 		int spares =0;
-		struct bio *blist;
 
 		spin_lock(&mddev->write_lock);
 		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
@@ -3827,21 +3824,11 @@ void md_check_recovery(mddev_t *mddev)
 		}
 		if (mddev->safemode == 1)
 			mddev->safemode = 0;
-		blist = bio_list_get(&mddev->write_list);
 		spin_unlock(&mddev->write_lock);
 
 		if (mddev->sb_dirty)
 			md_update_sb(mddev);
 
-		while (blist) {
-			struct bio *b = blist;
-			blist = blist->bi_next;
-			b->bi_next = NULL;
-			generic_make_request(b);
-			/* we already counted this, so need to un-count */
-			md_write_end(mddev);
-		}
-
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 16:33:42.000000000 +1100
@@ -561,8 +561,8 @@ static int make_request(request_queue_t
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
 	 */
-	if (md_write_start(mddev, bio)==0)
-		return 0;
+	md_write_start(mddev, bio); /* wait on superblock update early */
+
 	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_resume, !conf->barrier, conf->resync_lock, );
 	conf->nr_pending++;

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid10.c	2005-03-07 16:33:59.000000000 +1100
@@ -700,8 +700,7 @@ static int make_request(request_queue_t
 		return 0;
 	}
 
-	if (md_write_start(mddev, bio) == 0)
-		return 0;
+	md_write_start(mddev, bio);
 
 	/*
 	 * Register the new request and wait if the reconstruction

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid5.c	2005-03-07 16:34:09.000000000 +1100
@@ -1411,8 +1411,7 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
-	if (md_write_start(mddev, bi)==0)
-		return 0;
+	md_write_start(mddev, bi);
 
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-03-07 16:34:29.000000000 +1100
@@ -1570,8 +1570,7 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
-	if (md_write_start(mddev, bi)==0)
-		return 0;
+	md_write_start(mddev, bi);
 
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./include/linux/raid/md.h	2005-03-07 16:32:55.000000000 +1100
@@ -69,7 +69,7 @@ extern mdk_thread_t * md_register_thread
 extern void md_unregister_thread (mdk_thread_t *thread);
 extern void md_wakeup_thread(mdk_thread_t *thread);
 extern void md_check_recovery(mddev_t *mddev);
-extern int md_write_start(mddev_t *mddev, struct bio *bi);
+extern void md_write_start(mddev_t *mddev, struct bio *bi);
 extern void md_write_end(mddev_t *mddev);
 extern void md_handle_safemode(mddev_t *mddev);
 extern void md_done_sync(mddev_t *mddev, int blocks, int ok);

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-03-07 16:31:44.000000000 +1100
@@ -261,7 +261,7 @@ struct mddev_s
 	sector_t			recovery_cp;
 
 	spinlock_t			write_lock;
-	struct bio_list			write_list;
+	wait_queue_head_t		sb_wait;	/* for waiting on superblock updates */
 
 	unsigned int			safemode;	/* if set, update "clean" superblock
 							 * when no writes pending.

* [PATCH md 2 of 4] Erroneous sizeof use in raid1
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

This isn't a real bug as the smallest slab-size is 32 bytes, but please
apply for consistency.  Found by the Coverity tool.

Signed-off-by: Alexander Nyberg <alexn@dsv.su.se>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-07 15:49:25.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 15:49:55.000000000 +1100
@@ -1494,7 +1494,7 @@ static int raid1_reshape(mddev_t *mddev,
 		if (conf->mirrors[d].rdev)
 			return -EBUSY;
 
-	newpoolinfo = kmalloc(sizeof(newpoolinfo), GFP_KERNEL);
+	newpoolinfo = kmalloc(sizeof(*newpoolinfo), GFP_KERNEL);
 	if (!newpoolinfo)
 		return -ENOMEM;
 	newpoolinfo->mddev = mddev;
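
For readers who have not hit this one before, the distinction the patch
corrects is between the size of a pointer and the size of the structure it
points at.  A tiny standalone illustration (the struct below is made up for
the example and malloc stands in for kmalloc; only the sizeof behaviour
matters):

#include <stdio.h>
#include <stdlib.h>

struct pool_info {        /* stand-in structure, not the kernel's */
	void *mddev;
	int   raid_disks;
};

int main(void)
{
	struct pool_info *newpoolinfo;

	/* sizeof(newpoolinfo) is the size of the pointer (4 or 8 bytes), which is
	 * what the old code allocated; sizeof(*newpoolinfo) is the size of the
	 * structure, which is what the allocation actually needs to hold. */
	printf("sizeof(newpoolinfo)  = %zu\n", sizeof(newpoolinfo));
	printf("sizeof(*newpoolinfo) = %zu\n", sizeof(*newpoolinfo));

	newpoolinfo = malloc(sizeof(*newpoolinfo));  /* the corrected idiom */
	free(newpoolinfo);
	return 0;
}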

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 6:10 ` Andrew Morton
From: Andrew Morton @ 2005-03-08 6:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@cse.unsw.edu.au> wrote:
>
> The first two are trivial and should apply equally to 2.6.11
>
> The second two fix bugs that were introduced by the recent
> bitmap-based-intent-logging patches and so are not relevant
> to 2.6.11 yet.

The changelog for the "Fix typo in super_1_sync" patch doesn't actually
say what the patch does.  What are the user-visible consequences of not
fixing this?

Is the bitmap stuff now ready for Linus?

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 3:17 ` Neil Brown
From: Neil Brown @ 2005-03-09 3:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

On Monday March 7, akpm@osdl.org wrote:
> NeilBrown <neilb@cse.unsw.edu.au> wrote:
> >
> > The first two are trivial and should apply equally to 2.6.11
> >
> > The second two fix bugs that were introduced by the recent
> > bitmap-based-intent-logging patches and so are not relevant
> > to 2.6.11 yet.
>
> The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
> what the patch does.  What are the user-visible consequences of not fixing
> this?

-------
This fixes possible inconsistencies that might arise in a version-1
superblock when devices fail and are removed.

Usage of version-1 superblocks is not yet widespread and no actual
problems have been reported.
--------

>
> Is the bitmap stuff now ready for Linus?

I agree with Paul - not yet.

I'd also like to get a bit more functionality in before it goes to
Linus, as the functionality may necessitate an interface change (I'm
not sure).  Specifically, I want the bitmap to be able to live near the
superblock rather than having to be in a file on a different
filesystem.

NeilBrown

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 9:27 ` Mike Tran
From: Mike Tran @ 2005-03-09 9:27 UTC (permalink / raw)
To: linux-raid

Hi Neil,

On Tue, 2005-03-08 at 21:17, Neil Brown wrote:
> On Monday March 7, akpm@osdl.org wrote:
> > NeilBrown <neilb@cse.unsw.edu.au> wrote:
> > >
> > > The first two are trivial and should apply equally to 2.6.11
> > >
> > > The second two fix bugs that were introduced by the recent
> > > bitmap-based-intent-logging patches and so are not relevant
> > > to 2.6.11 yet.
> >
> > The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
> > what the patch does.  What are the user-visible consequences of not fixing
> > this?
>
> -------
> This fixes possible inconsistencies that might arise in a version-1
> superblock when devices fail and are removed.
>
> Usage of version-1 superblocks is not yet widespread and no actual
> problems have been reported.
> --------

EVMS 2.5.1 (http://evms.sf.net) has provided support for creation of MD
arrays using a version-1 superblock.  Some EVMS users have actually
tried this new functionality.  You probably remember I posted a problem
and a patch to fix the version-1 superblock update code.

We will continue to test and will report any problems.

--
Regards,
Mike T.

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 12:49 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-08 12:49 UTC (permalink / raw)
To: linux-raid

NeilBrown <neilb@cse.unsw.edu.au> wrote:
> The second two fix bugs that were introduced by the recent
> bitmap-based-intent-logging patches and so are not relevant

Neil - can you describe for me (us all?) what is meant by
intent-logging here.

Well, I can guess - I suppose the driver marks the bitmap before a write
(or group of writes) and unmarks it when they have completed
successfully.  Is that it?

If so, how does it manage to mark what it is _going_ to do (without
psychic powers) on the disk bitmap?  Unmarking is easy - that needs a
queue of things due to be unmarked in the bitmap, and a point in time
at which they are all unmarked at once on disk.

Then resync would only deal with the marked blocks.

Peter

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 17:02 ` Paul Clements
From: Paul Clements @ 2005-03-08 17:02 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

Peter T. Breuer wrote:

> Neil - can you describe for me (us all?) what is meant by
> intent-logging here.

Since I wrote a lot of the code, I guess I'll try...

> Well, I can guess - I suppose the driver marks the bitmap before a write
> (or group of writes) and unmarks it when they have completed
> successfully.  Is that it?

Yes.  It marks the bitmap before writing (actually queues up the bitmap
and normal writes in bunches for the sake of performance).  The code is
actually (loosely) based on your original bitmap (fr1) code.

> If so, how does it manage to mark what it is _going_ to do (without
> psychic powers) on the disk bitmap?

That's actually fairly easy.  The pages for the bitmap are locked in
memory, so you just dirty the bits you want (which doesn't actually
incur any I/O) and then when you're about to perform the normal writes,
you flush the dirty bitmap pages to disk.

Once the writes are complete, a thread (we have the raid1d thread doing
this) comes back along and flushes the (now clean) bitmap pages back to
disk.  If the pages get dirty again in the meantime (because of more
I/O), we just leave them dirty and don't touch the disk.

> Then resync would only deal with the marked blocks.

Right.  It clears the bitmap once things are back in sync.

--
Paul
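
A rough, self-contained model of the sequence Paul describes above (mark bits
in memory, flush the dirty bitmap pages out before the data writes, clear the
bits lazily afterwards).  The helper names and array sizes here are invented
for illustration and are not the md/bitmap kernel API:

#include <stdio.h>
#include <string.h>

#define CHUNKS 64

static unsigned char bitmap_mem[CHUNKS];   /* in-memory intent bitmap                 */
static unsigned char bitmap_disk[CHUNKS];  /* model of the bitmap on stable storage   */

static void flush_bitmap(void)             /* stands in for writing dirty bitmap pages */
{
	memcpy(bitmap_disk, bitmap_mem, sizeof(bitmap_disk));
}

static void mirror_write(int chunk)
{
	bitmap_mem[chunk] = 1;   /* 1. mark intent in memory - no I/O yet              */
	flush_bitmap();          /* 2. dirty bitmap pages hit disk before the data     */
	/* 3. ... the real data writes to all mirrors would be submitted here ...     */
	bitmap_mem[chunk] = 0;   /* 4. clear in memory once the writes have completed  */
	/* 5. a daemon (raid1d in md) later flushes the cleaned pages back, lazily    */
}

int main(void)
{
	mirror_write(7);
	printf("on-disk bit for chunk 7 before lazy clean: %d\n", bitmap_disk[7]);
	flush_bitmap();          /* the lazy clean-up pass */
	printf("on-disk bit for chunk 7 after lazy clean:  %d\n", bitmap_disk[7]);
	return 0;
}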

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 19:05 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-08 19:05 UTC (permalink / raw)
To: linux-raid

Paul Clements <paul.clements@steeleye.com> wrote:
> Peter T. Breuer wrote:
> > Neil - can you describe for me (us all?) what is meant by
> > intent-logging here.
>
> Since I wrote a lot of the code, I guess I'll try...

Hi, Paul.  Thanks.

> > Well, I can guess - I suppose the driver marks the bitmap before a write
> > (or group of writes) and unmarks it when they have completed
> > successfully.  Is that it?
>
> Yes.  It marks the bitmap before writing (actually queues up the bitmap
> and normal writes in bunches for the sake of performance).  The code is
> actually (loosely) based on your original bitmap (fr1) code.

Yeah, I can see the traces.  I'm a little tired right now, but some
aspects of this idea vaguely worry me.  I'll see if I manage to
articulate those worries here despite my state.  And you can dispel
them :).

Let me first of all guess at the intervals involved.  I assume you will
write the marked parts of the bitmap to disk every 1/100th of a second
or so?  (I'd probably opt for 1/10th of a second or even every second,
just to make sure it's not noticeable on bandwidth, and to heck with
the safety until we learn better what the tradeoffs are.)  Or perhaps
once every hundred transactions in busy times.

Now, there are races here.  You must mark the bitmap in memory before
every write, and unmark it after every complete write.  That is an
ordering constraint.  There is a race, however, to record the bitmap
state to disk.  Without any rendezvous or handshake or other
synchronization, one would simply be snapshotting the in-memory bitmap
to disk every so often, and the on-disk bitmap would not always
accurately reflect the current state of completed transactions to the
mirror.  The question is whether it shows an overly-pessimistic
picture, an overly-optimistic picture, or neither one nor the other.

I would naively imagine straight off that it cannot in general be
(appropriately) pessimistic because it does not know what writes will
occur in the next 1/100th second in order to be able to mark those on
the disk bitmap before they happen.

In the next section of your answer, however, you say this is what
happens, and therefore I deduce that

  a) 1/100th second's worth of writes to the mirror are first queued
  b) the in-memory bitmap is marked for these (if it exists as separate)
  c) the dirty parts of that bitmap are written to disk(s)
  d) the queued writes are carried out on the mirror
  e) the in-memory bitmap is unmarked for these
  f) the newly cleaned parts of that bitmap are written to disk.

You may even have some sort of direct mapping between the on-disk
bitmap and the memory image, which could be quite effective, but may
run into problems with the address range available (bitmap must be
less than 2GB, no?), unless it maps only the necessary parts of the
bitmap at a time.  Well, if the kernel can manage that mapping window
on its own, it would be useful and probably what you have done.

But I digress.  My immediate problem is that writes must be queued
first.  I thought md traditionally did not queue requests, but instead
used its own make_request substitute to dispatch incoming requests as
they arrived.

Have you remodelled the md/raid1 make_request() fn?  And if so, do you
also aggregate them?  And what steps are taken to preserve write
ordering constraints (do some overlying file systems still require
these)?

> > If so, how does it manage to mark what it is _going_ to do (without
> > psychic powers) on the disk bitmap?
>
> That's actually fairly easy.  The pages for the bitmap are locked in
> memory,

That limits the size to about 2GB - oh, but perhaps you are doing as I
did and release bitmap pages when they are not dirty.  Yes, you must.

> so you just dirty the bits you want (which doesn't actually
> incur any I/O) and then when you're about to perform the normal writes,
> you flush the dirty bitmap pages to disk.

Hmm.  I don't know how one can select pages to flush, but clearly one
can!  You maintain a list of dirtied pages, clearly.  This list cannot
be larger than the list of outstanding requests.  If you use the
generic kernel mechanisms, that will be 1000 or so, max.

> Once the writes are complete, a thread (we have the raid1d thread doing
> this) comes back along and flushes the (now clean) bitmap pages back to
> disk.

OK .. there is a potential race here too, however, ...

> If the pages get dirty again in the meantime (because of more
> I/O), we just leave them dirty and don't touch the disk.

Hmm.  This appears to me to be an optimization.  OK.

> > Then resync would only deal with the marked blocks.
>
> Right.  It clears the bitmap once things are back in sync.

Well, OK.  Thinking it through as I write, I see fewer problems.  Thank
you for the explanation, and well done.  I have been meaning to merge
the patches and see what comes out.

I presume you left out the mechanisms I included to allow a mirror
component to aggressively notify the array when it feels sick, and when
it feels better again.  That required the array to be able to notify
the mirror components that they have been included in an array, and
lodge a callback hotline with them.

Thanks again.

Peter

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 5:07 ` Neil Brown
From: Neil Brown @ 2005-03-09 5:07 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

On Tuesday March 8, ptb@lab.it.uc3m.es wrote:
>
> But I digress.  My immediate problem is that writes must be queued
> first.  I thought md traditionally did not queue requests, but instead
> used its own make_request substitute to dispatch incoming requests as
> they arrived.
>
> Have you remodelled the md/raid1 make_request() fn?

Somewhat.  Write requests are queued, and raid1d submits them when it
is happy that all bitmap updates have been done.

There is no '1/100th' second or anything like that.

When a write request arrives, the queue is 'plugged', requests are
queued, and bits in the in-memory bitmap are set.

When the queue is unplugged (by the filesystem or timeout) the bitmap
changes (if any) are flushed to disk, then the queued requests are
submitted.

Bits on disk are cleaned lazily.

Note that for many applications, the bitmap does not need to be huge.
4K is enough for 1 bit per 2-3 megabytes on many large drives.  Having
to sync 3 meg when just one block might be out-of-sync may seem like a
waste, but it is heaps better than syncing 100Gig!!  If a resync
without bitmap logging takes 1 hour, I suspect a resync with a 4K
bitmap would have a good chance of finishing in under 1 minute
(depending on locality of references).  That is good enough for me.

Of course, if one mirror is on the other side of the country, and a
normal sync requires 5 days over ADSL, then you would have a strong
case for a finer-grained bitmap.

>
> And if so, do you also aggregate them?  And what steps are taken to
> preserve write ordering constraints (do some overlying file systems
> still require these)?

Filesystems have never had any write ordering constraints, except that
IO must not be processed before it is requested, nor after it has been
acknowledged.  md continues to obey these restraints.

NeilBrown
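
A quick back-of-the-envelope check of the "4K is enough for 1 bit per 2-3
megabytes" figure above, as a standalone snippet; the 100GB member-device
size is only an assumed example, not a number taken from the thread:

#include <stdio.h>

int main(void)
{
	unsigned long long device_mb   = 100ULL * 1024;  /* assume a 100GB member device */
	unsigned long      bitmap_bits = 4096 * 8;       /* one 4K bitmap = 32768 bits   */

	/* each bit has to cover device_size / number_of_bits of the array */
	printf("each bit covers about %.1f MB\n",
	       (double)device_mb / bitmap_bits);          /* ~3.1 MB, i.e. 2-3 megabytes  */
	return 0;
}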

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 15:37 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-09 15:37 UTC (permalink / raw)
To: linux-raid

Neil Brown <neilb@cse.unsw.edu.au> wrote:
> On Tuesday March 8, ptb@lab.it.uc3m.es wrote:
> > Have you remodelled the md/raid1 make_request() fn?
>
> Somewhat.  Write requests are queued, and raid1d submits them when
> it is happy that all bitmap updates have been done.

OK - so a slight modification of the kernel generic_make_request (I
haven't looked).  Mind you, I think that Paul said that just before
clearing bitmap entries, incoming requests were checked to see if a
bitmap entry should be marked again..  Perhaps both things happen.
Bitmap pages in memory are updated as clean after pending writes have
finished and then marked as dirty as necessary, and then flushed, and
when the flush finishes new accumulated requests are started.  One can

> There is no '1/100th' second or anything like that.

I was trying in a way to give a definite image to what happens, rather
than speak abstractly.  I'm sure that the ordinary kernel mechanism for
plugging and unplugging is used, as much as it is possible.  If you
unplug when the request struct reservoir is exhausted, then it will be
at 1K requests.  If they are each 4KB, that will be every 4MB.  At say
64MB/s, that will be every 1/16 s.  And unplugging may happen more
frequently because of other kernel magic mumble mumble ...

> When a write request arrives, the queue is 'plugged', requests are
> queued, and bits in the in-memory bitmap are set.

OK.

> When the queue is unplugged (by the filesystem or timeout) the bitmap
> changes (if any) are flushed to disk, then the queued requests are
> submitted.

That accumulates bitmap markings into the minimum number of extra
transactions.  It does impose extra latency, however.  I'm intrigued by
exactly how you exert the memory pressure required to force just the
dirty bitmap pages out.  I'll have to look it up.

> Bits on disk are cleaned lazily.

OK - so the disk bitmap state is always pessimistic.  That's fine.
Very good.

> Note that for many applications, the bitmap does not need to be huge.
> 4K is enough for 1 bit per 2-3 megabytes on many large drives.
> Having to sync 3 meg when just one block might be out-of-sync may seem
> like a waste, but it is heaps better than syncing 100Gig!!

Yes - I used 1 bit per 1K, falling back to 1 bit per 2MB under memory
pressure.

> > And if so, do you also aggregate them?  And what steps are taken to
> > preserve write ordering constraints (do some overlying file systems
> > still require these)?
>
> Filesystems have never had any write ordering constraints, except that
> IO must not be processed before it is requested, nor after it has been
> acknowledged.  md continues to obey these restraints.

Out of curiosity, is aggregation done on the queued requests?  Or are
they all kept at 4KB? (or whatever - 1KB).

Thanks!

Peter

* [PATCH md 0 of 4] Introduction
@ 2004-11-02 3:37 NeilBrown
From: NeilBrown @ 2004-11-02 3:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid
Following are 4 patches for md/raid against 2.6.10-rc1-mm2.
1/ Fix problem with linear arrays if component devices are > 2 terabytes
2/ Fix data corruption in (experimental) RAID6 personality
3/ Fix possible oops with unplug_timer firing at the wrong time.
4/ Add new md personality "faulty".
"Faulty" can be used to inject faults and so test failure modes
of other raid levels and of filesystems.
NeilBrown

* [PATCH md 0 of 4] Introduction
@ 2004-09-03 2:20 NeilBrown
From: NeilBrown @ 2004-09-03 2:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Following are 4 patches for md in 2.6.8.1-mm4

The first three are minor improvements and modifications either
required by or inspired by the fourth.

The fourth adds a new raid pers

* [PATCH md 0 of 4] Introduction
@ 2004-08-23 3:10 NeilBrown
From: NeilBrown @ 2004-08-23 3:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Following are 4 patches for md in 2.6.8.1-mm4

The first three are minor improvements and modifications either
required by or inspired by the fourth.

The fourth adds a new raid personality - raid10.  At 56K, I'm not sure
it will get through the mailing list, but interested parties can find
it at:
   http://neilb.web.cse.unsw.edu.au/patches/linux-devel/2.6/2004-08-23-03

raid10 provides a combination of raid0 and raid1.  It requires mdadm
1.7.0 or later to use.  The next release of mdadm should have better
documentation of raid10, but from the comment in the .c file:

/*
 * RAID10 provides a combination of RAID0 and RAID1 functionality.
 * The layout of data is defined by
 *    chunk_size
 *    raid_disks
 *    near_copies (stored in low byte of layout)
 *    far_copies (stored in second byte of layout)
 *
 * The data to be stored is divided into chunks using chunksize.
 * Each device is divided into far_copies sections.
 * In each section, chunks are laid out in a style similar to raid0, but
 * near_copies copies of each chunk are stored (each on a different drive).
 * The starting device for each section is offset near_copies from the starting
 * device of the previous section.
 * Thus there are (near_copies*far_copies) copies of each chunk, and each is on a different
 * drive.
 * near_copies and far_copies must be at least one, and their product is at most
 * raid_disks.
 */

raid10 is currently marked EXPERIMENTAL, and this should be taken
seriously.  A reasonable amount of basic testing hasn't shown any bugs,
and it seems to resync and rebuild correctly.  However wider testing
would help.

NeilBrown
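
To make those layout rules concrete, here is a small userspace sketch that
simply applies the rules quoted in the comment above (raid0-style striping
within each section, with each section's starting device offset by
near_copies).  It is not the kernel's raid10 mapping code, and the
configuration values are picked only for illustration:

#include <stdio.h>

int main(void)
{
	int raid_disks = 4, near_copies = 2, far_copies = 2;  /* example config only */
	int chunk, n, f;

	for (chunk = 0; chunk < 4; chunk++) {
		printf("chunk %d:", chunk);
		for (f = 0; f < far_copies; f++)          /* which far-copy section      */
			for (n = 0; n < near_copies; n++) /* which near copy in a section */
				printf("  dev%d/section%d",
				       (chunk * near_copies + n + f * near_copies) % raid_disks, f);
		printf("\n");
	}
	return 0;
}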