* [RFC] [PATCH 0/4] Multiple block allocation and delayed allocation for ext3
       [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
@ 2005-07-17 17:40 ` Mingming Cao
  2005-07-17 17:45   ` Mingming Cao
  2005-07-17 17:40 ` [RFC] [PATCH 1/4]Multiple block " Mingming Cao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:40 UTC (permalink / raw)
  To: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel
  Cc: Badari Pulavarty, suparna, tytso, alex, adilger

Hi All, 

Here are the updated patches to support multiple block allocation and
delayed allocation for ext3, done by me, Badari and Suparna.

[PATCH 1/4] -- multiple block allocation for current ext3
(ext3_get_blocks()).

[PATCH 2/4] -- adding delayed allocation for writeback mode

[PATCH 3/4] -- generic support to cluster pages together in
mpage_writepages() to make use of getblocks()

[PATCH 4/4] -- support multiple block allocation for ext3 writeback mode
through writepages(). 


We have done initial testing with dbench and tiobench on a 2.6.11
version of this patch set. Dbench 8-thread throughput is increased by
20% with this patch set.

dbench comparison: (ext3-dm represents ext3 + this patch set)
http://www.sudhaa.com/~ram/ols2005presentation/dbench.jpg
tiobench comparison:
http://www.sudhaa.com/~ram/ols2005presentation/tio_seq_write.jpg


Todo:
- bmap() support for delayed allocation
- a page reserve flag to indicate delayed allocation
- ordered mode support for delayed allocation
- "bh" support to enable blocksize = 1k/2k filesystems



Cheers,

Mingming




* [RFC] [PATCH 1/4] Multiple block allocation for ext3
       [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
  2005-07-17 17:40 ` [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3 Mingming Cao
@ 2005-07-17 17:40 ` Mingming Cao
  2005-07-17 17:40 ` [RFC] [PATCH 2/4]delayed " Mingming Cao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:40 UTC (permalink / raw)
  To: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel
  Cc: Badari Pulavarty, suparna, tytso

Here is the patch to support multiple block allocation for ext3.
Current ext3 allocates one block at a time, which is not efficient for
large sequential write IO.

This patch implements a simple multiple block allocation for current
ext3.  The basic idea is to allocate the first block in the existing
way, and to attempt to allocate the next adjacent blocks on a
best-effort basis.  If contiguous allocation is blocked by an
already-allocated block, the blocks found free so far are allocated
and no further search is tried.
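
A minimal sketch of that best-effort loop (illustration only; the real
change is to ext3_try_to_allocate() in the diff below, and
ext3_test_allocatable()/claim_block()/sb_bgl_lock() are the existing
balloc.c helpers):

static int try_to_allocate_contig(struct super_block *sb, int group,
		struct buffer_head *bitmap_bh, int goal,
		unsigned long *count)
{
	unsigned long num = 0;

	/* allocate the first block in the existing way */
	if (!ext3_test_allocatable(goal, bitmap_bh) ||
	    !claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh))
		return -1;	/* first block is busy, give up */
	num++;
	/* best effort: extend the run until we hit an in-use block */
	while (num < *count &&
		ext3_test_allocatable(goal + num, bitmap_bh) &&
		claim_block(sb_bgl_lock(EXT3_SB(sb), group),
				goal + num, bitmap_bh))
		num++;
	*count = num;	/* how many contiguous blocks we actually won */
	return goal;	/* first block of the run */
}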

This implementation makes use of block reservations.  With the
knowledge of how many blocks to allocate, the reservation window size
is enlarged accordingly before block allocation to increase the chance
of getting contiguous blocks.

A previous post of this patch with more description can be found here:
http://marc.theaimsgroup.com/?l=ext2-devel&m=111471578328685&w=2 
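
At the interface level, ext3_new_block() becomes ext3_new_blocks(),
which takes the desired number of blocks in/out.  A small usage sketch
(assuming a caller that already has a handle, inode and goal; the
converted single-block callers in the diff, e.g. fs/ext3/xattr.c,
simply pass count = 1):

	unsigned long count = 16;	/* ask for up to 16 blocks */
	int err;
	int block;

	block = ext3_new_blocks(handle, inode, goal, &count, &err);
	/*
	 * On success err is 0, 'block' is the first allocated block,
	 * and 'count' now holds how many contiguous blocks were
	 * actually allocated (possibly fewer than requested).
	 */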




---

 linux-2.6.12-ming/fs/ext3/balloc.c        |  121 +++++++--
 linux-2.6.12-ming/fs/ext3/inode.c         |  380 ++++++++++++++++++++++++++++--
 linux-2.6.12-ming/fs/ext3/xattr.c         |    3 
 linux-2.6.12-ming/include/linux/ext3_fs.h |    2 
 4 files changed, 458 insertions(+), 48 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-get-blocks fs/ext3/balloc.c
--- linux-2.6.12/fs/ext3/balloc.c~ext3-get-blocks	2005-07-14 21:55:55.110385896 -0700
+++ linux-2.6.12-ming/fs/ext3/balloc.c	2005-07-14 22:40:32.265396472 -0700
@@ -20,6 +20,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 
+#define		NBS_DEBUG	0
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -652,9 +653,11 @@ claim_block(spinlock_t *lock, int block,
  */
 static int
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
-	struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv)
+		struct buffer_head *bitmap_bh, int goal, unsigned long *count,
+		struct ext3_reserve_window *my_rsv)
 {
 	int group_first_block, start, end;
+	unsigned long num = 0;
 
 	/* we do allocation within the reservation window if we have a window */
 	if (my_rsv) {
@@ -712,8 +715,22 @@ repeat:
 			goto fail_access;
 		goto repeat;
 	}
-	return goal;
+	num++;
+	goal++;
+	if (NBS_DEBUG)
+		printk("ext3_new_block: first block allocated:block %d,num %d\n", goal, num);
+	while (num < *count && goal < end
+		&& ext3_test_allocatable(goal, bitmap_bh)
+		&& claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
+		num++;
+		goal++;
+	}
+	*count = num;
+	if (NBS_DEBUG)
+		printk("ext3_new_block: additional block allocated:block %d,num %d,goal-num %d\n", goal, num, goal-num);
+	return goal - num;
 fail_access:
+	*count = num;
 	return -1;
 }
 
@@ -998,6 +1015,28 @@ retry:
 	goto retry;
 }
 
+static void try_to_extend_reservation(struct ext3_reserve_window_node *my_rsv,
+			struct super_block *sb, int size)
+{
+	struct ext3_reserve_window_node *next_rsv;
+	struct rb_node *next;
+	spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+
+	spin_lock(rsv_lock);
+	next = rb_next(&my_rsv->rsv_node);
+
+	if (!next)
+		my_rsv->rsv_end += size;
+	else {
+		next_rsv = list_entry(next, struct ext3_reserve_window_node, rsv_node);
+
+		if ((next_rsv->rsv_start - my_rsv->rsv_end) > size)
+			my_rsv->rsv_end += size;
+		else
+			my_rsv->rsv_end = next_rsv->rsv_start - 1;
+	}
+	spin_unlock(rsv_lock);
+}
 /*
  * This is the main function used to allocate a new block and its reservation
  * window.
@@ -1023,11 +1062,12 @@ static int
 ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
 			unsigned int group, struct buffer_head *bitmap_bh,
 			int goal, struct ext3_reserve_window_node * my_rsv,
-			int *errp)
+			unsigned long *count, int *errp)
 {
 	unsigned long group_first_block;
 	int ret = 0;
 	int fatal;
+	unsigned long num = *count;
 
 	*errp = 0;
 
@@ -1050,7 +1090,8 @@ ext3_try_to_allocate_with_rsv(struct sup
 	 * or last attempt to allocate a block with reservation turned on failed
 	 */
 	if (my_rsv == NULL ) {
-		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+					count, NULL);
 		goto out;
 	}
 	/*
@@ -1080,6 +1121,10 @@ ext3_try_to_allocate_with_rsv(struct sup
 	while (1) {
 		if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) ||
 			!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) {
+			if (my_rsv->rsv_goal_size < *count)
+				my_rsv->rsv_goal_size = *count;
+
+
 			ret = alloc_new_reservation(my_rsv, goal, sb,
 							group, bitmap_bh);
 			if (ret < 0)
@@ -1088,15 +1133,21 @@ ext3_try_to_allocate_with_rsv(struct sup
 			if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb))
 				goal = -1;
 		}
+		else {
+			if (goal > 0 && (my_rsv->rsv_end - goal + 1) < *count)
+				try_to_extend_reservation(my_rsv, sb,
+					*count-my_rsv->rsv_end+goal-1);
+		}
 		if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb))
 		    || (my_rsv->rsv_end < group_first_block))
 			BUG();
 		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
-					   &my_rsv->rsv_window);
+					   &num, &my_rsv->rsv_window);
 		if (ret >= 0) {
-			my_rsv->rsv_alloc_hit++;
+			my_rsv->rsv_alloc_hit += num;
 			break;				/* succeed */
 		}
+		num = *count;
 	}
 out:
 	if (ret >= 0) {
@@ -1153,8 +1204,8 @@ int ext3_should_retry_alloc(struct super
  * bitmap, and then for any free bit if that fails.
  * This function also updates quota and i_blocks field.
  */
-int ext3_new_block(handle_t *handle, struct inode *inode,
-			unsigned long goal, int *errp)
+int ext3_new_blocks(handle_t *handle, struct inode *inode,
+			unsigned long goal, unsigned long* count, int *errp)
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head *gdp_bh;
@@ -1177,7 +1228,8 @@ int ext3_new_block(handle_t *handle, str
 	static int goal_hits, goal_attempts;
 #endif
 	unsigned long ngroups;
-
+	unsigned long num = *count;
+	int i;
 	*errp = -ENOSPC;
 	sb = inode->i_sb;
 	if (!sb) {
@@ -1188,7 +1240,7 @@ int ext3_new_block(handle_t *handle, str
 	/*
 	 * Check quota for allocation of this block.
 	 */
-	if (DQUOT_ALLOC_BLOCK(inode, 1)) {
+	if (DQUOT_ALLOC_BLOCK(inode, num)) {
 		*errp = -EDQUOT;
 		return 0;
 	}
@@ -1243,7 +1295,7 @@ retry:
 		if (!bitmap_bh)
 			goto io_error;
 		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
-					bitmap_bh, ret_block, my_rsv, &fatal);
+					bitmap_bh, ret_block, my_rsv, &num, &fatal);
 		if (fatal)
 			goto out;
 		if (ret_block >= 0)
@@ -1280,7 +1332,7 @@ retry:
 		if (!bitmap_bh)
 			goto io_error;
 		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
-					bitmap_bh, -1, my_rsv, &fatal);
+					bitmap_bh, -1, my_rsv, &num, &fatal);
 		if (fatal)
 			goto out;
 		if (ret_block >= 0) 
@@ -1315,14 +1367,17 @@ allocated:
 	target_block = ret_block + group_no * EXT3_BLOCKS_PER_GROUP(sb)
 				+ le32_to_cpu(es->s_first_data_block);
 
-	if (target_block == le32_to_cpu(gdp->bg_block_bitmap) ||
-	    target_block == le32_to_cpu(gdp->bg_inode_bitmap) ||
-	    in_range(target_block, le32_to_cpu(gdp->bg_inode_table),
-		      EXT3_SB(sb)->s_itb_per_group))
-		ext3_error(sb, "ext3_new_block",
-			    "Allocating block in system zone - "
-			    "block = %u", target_block);
-
+	for (i = 0; i < num; i++, target_block++) {
+		if (target_block == le32_to_cpu(gdp->bg_block_bitmap) ||
+		    target_block == le32_to_cpu(gdp->bg_inode_bitmap) ||
+		    in_range(target_block, le32_to_cpu(gdp->bg_inode_table),
+			      EXT3_SB(sb)->s_itb_per_group)) {
+			ext3_error(sb, "ext3_new_block",
+				    "Allocating block in system zone - "
+				    "block = %u", target_block);
+			goto out;
+		}
+	}
 	performed_allocation = 1;
 
 #ifdef CONFIG_JBD_DEBUG
@@ -1340,10 +1395,12 @@ allocated:
 	jbd_lock_bh_state(bitmap_bh);
 	spin_lock(sb_bgl_lock(sbi, group_no));
 	if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
-		if (ext3_test_bit(ret_block,
-				bh2jh(bitmap_bh)->b_committed_data)) {
-			printk("%s: block was unexpectedly set in "
-				"b_committed_data\n", __FUNCTION__);
+		for (i = 0; i < num; i++) {
+			if (ext3_test_bit(ret_block++,
+					bh2jh(bitmap_bh)->b_committed_data)) {
+				printk("%s: block was unexpectedly set in "
+					"b_committed_data\n", __FUNCTION__);
+			}
 		}
 	}
 	ext3_debug("found bit %d\n", ret_block);
@@ -1352,12 +1409,12 @@ allocated:
 #endif
 
 	/* ret_block was blockgroup-relative.  Now it becomes fs-relative */
-	ret_block = target_block;
+	ret_block = target_block - num;
 
-	if (ret_block >= le32_to_cpu(es->s_blocks_count)) {
+	if (target_block - 1 >= le32_to_cpu(es->s_blocks_count)) {
 		ext3_error(sb, "ext3_new_block",
-			    "block(%d) >= blocks count(%d) - "
-			    "block_group = %d, es == %p ", ret_block,
+			    "block(%d) >= fs blocks count(%d) - "
+			    "block_group = %d, es == %p ", target_block - 1,
 			le32_to_cpu(es->s_blocks_count), group_no, es);
 		goto out;
 	}
@@ -1372,9 +1429,9 @@ allocated:
 
 	spin_lock(sb_bgl_lock(sbi, group_no));
 	gdp->bg_free_blocks_count =
-			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
+			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -1);
+	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1386,6 +1443,8 @@ allocated:
 		goto out;
 
 	*errp = 0;
+	DQUOT_FREE_BLOCK(inode, *count - num);
+	*count = num;
 	brelse(bitmap_bh);
 	return ret_block;
 
@@ -1400,7 +1459,7 @@ out:
 	 * Undo the block allocation
 	 */
 	if (!performed_allocation)
-		DQUOT_FREE_BLOCK(inode, 1);
+		DQUOT_FREE_BLOCK(inode, *count);
 	brelse(bitmap_bh);
 	return 0;
 }
diff -puN fs/ext3/inode.c~ext3-get-blocks fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~ext3-get-blocks	2005-07-14 21:55:55.114385288 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-14 22:42:26.225071968 -0700
@@ -237,12 +237,12 @@ static int ext3_alloc_block (handle_t *h
 			struct inode * inode, unsigned long goal, int *err)
 {
 	unsigned long result;
+	unsigned long count = 1;
 
-	result = ext3_new_block(handle, inode, goal, err);
+	result = ext3_new_blocks(handle, inode, goal, &count,  err);
 	return result;
 }
 
-
 typedef struct {
 	__le32	*p;
 	__le32	key;
@@ -328,7 +328,7 @@ static int ext3_block_to_path(struct ino
 		ext3_warning (inode->i_sb, "ext3_block_to_path", "block > big");
 	}
 	if (boundary)
-		*boundary = (i_block & (ptrs - 1)) == (final - 1);
+		*boundary = final - 1 - (i_block & (ptrs - 1));
 	return n;
 }
 
@@ -375,8 +375,10 @@ static Indirect *ext3_get_branch(struct 
 		goto no_block;
 	while (--depth) {
 		bh = sb_bread(sb, le32_to_cpu(p->key));
-		if (!bh)
+		if (!bh) {
+			printk("ext3_get_branch failure: key is %d, depth is %d\n", le32_to_cpu(p->key), depth);
 			goto failure;
+		}
 		/* Reader: pointers */
 		if (!verify_chain(chain, p))
 			goto changed;
@@ -429,11 +431,11 @@ static unsigned long ext3_find_near(stru
 	/* Try to find previous block */
 	for (p = ind->p - 1; p >= start; p--)
 		if (*p)
-			return le32_to_cpu(*p);
+			return le32_to_cpu(*p) + 1;
 
 	/* No such thing, so let's try location of indirect block */
 	if (ind->bh)
-		return ind->bh->b_blocknr;
+		return ind->bh->b_blocknr + 1;
 
 	/*
 	 * It is going to be refered from inode itself? OK, just put it into
@@ -526,7 +528,7 @@ static int ext3_alloc_branch(handle_t *h
 			/*
 			 * Get buffer_head for parent block, zero it out
 			 * and set the pointer to new one, then send
-			 * parent to disk.  
+			 * parent to disk.
 			 */
 			bh = sb_getblk(inode->i_sb, parent);
 			branch[n].bh = bh;
@@ -566,6 +568,196 @@ static int ext3_alloc_branch(handle_t *h
 		ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1);
 	return err;
 }
+#define GBS_DEBUG	0
+#define GBS_DEBUG1	0
+#define GBS_DEBUG2	0
+static int ext3_alloc_splice_branch(handle_t *handle, struct inode *inode,
+		     unsigned long goal, unsigned long* maxblocks,
+		     int *offsets, Indirect *branch, unsigned int minblocks)
+{
+	int blocksize = inode->i_sb->s_blocksize;
+	int err = 0;
+	int i, n = 0;
+	unsigned long required, target, count;
+	int meta_num, data_num;
+	unsigned long first_data_block = 0;
+	unsigned long current_block = 0;
+	struct buffer_head *bh;
+	unsigned long long new_meta_blocks[3];
+
+	/*
+	 * We must allocate the required number of metadata blocks
+	 * for the first data block if necessary.  Thus the
+	 * minimum number of blocks needed (required) = the
+	 * number of needed metablocks (minblocks) + 1 (the first
+	 * data block).
+	 *
+	 * The multiple allocation of the rest of the data blocks
+	 * is targeted but not required.
+	 *
+	 */
+	target = *maxblocks;
+	required = minblocks + 1;
+	meta_num = 0; i = 0; data_num = 0; count = 0;
+
+	if (GBS_DEBUG)
+		printk("Come to mballoc: minblocks %d, maxblocks %d \n", minblocks, *maxblocks);
+
+	while (required > 0) {
+		i = 0;
+		count = target;
+		/* allocate blocks for metadata blocks and data blocks */
+		current_block = ext3_new_blocks(handle, inode, goal, &count, &err);
+		if (err)
+			goto failed;
+
+		/*
+		 * If we need to allocate blocks for metadata blocks
+		 * (indirect/double/triple blocks), save the newly
+		 * allocated metadata block numbers for the later
+		 * branch update.
+		 */
+		if (required > 1)
+			for (i = 0; meta_num < minblocks && i < count; i++) {
+				new_meta_blocks[meta_num++] = current_block++;
+				if (GBS_DEBUG)
+					printk(" meta_num = %d, minblocks :%d\n", meta_num, minblocks);
+			}
+		/* if allocated blocks is less than the minimum # of blocks */
+		if (count < required) {
+			required -= count;
+			target -= count;
+		}
+		else {
+			if (GBS_DEBUG)
+				printk("count: %d, i:%d, required:%d\n", count, i, required);
+			/* done with allocation */
+			data_num = count - i;
+			first_data_block = current_block;
+			if (GBS_DEBUG) {
+				printk("ext3 mballoc allocation done. metablocks:%d,"
+				"datablocks %d, goal metablocks:%d, goal"
+				"datablocks:%d\n", meta_num, data_num, minblocks,
+				*maxblocks - minblocks);
+
+				printk("new metablocks are:");
+				for (i = 0; i<meta_num; i++)
+					printk("meta[%d]:%d",i, new_meta_blocks[i]);
+
+				printk(" over\n");
+			}
+			if (meta_num != minblocks) {
+				printk("ext3 mballoc error: allocated %d "
+					"metablocks, different from "
+					"required: %d", meta_num, minblocks);
+				BUG();
+			}
+			break;
+		}
+	}
+
+	if (meta_num == 0)
+		branch[0].key = cpu_to_le32(first_data_block);
+	else
+		branch[0].key = cpu_to_le32(new_meta_blocks[0]);
+	/*
+	 * metadata blocks and data blocks are allocated.
+	 */
+	for (n = 1; n <= meta_num; n++) {
+
+		/*
+		 * Get buffer_head for parent block, zero it out
+		 * and set the pointer to new one, then send
+		 * parent to disk.
+		 */
+		bh = sb_getblk(inode->i_sb, new_meta_blocks[n-1]);
+		branch[n].bh = bh;
+		lock_buffer(bh);
+		BUFFER_TRACE(bh, "call get_create_access");
+		err = ext3_journal_get_create_access(handle, bh);
+		if (err) {
+			unlock_buffer(bh);
+			brelse(bh);
+			goto failed;
+		}
+
+		memset(bh->b_data, 0, blocksize);
+		branch[n].p = (__le32*) bh->b_data + offsets[n];
+		if (n != meta_num) {
+			branch[n].key = cpu_to_le32(new_meta_blocks[n]);
+			*branch[n].p = branch[n].key;
+		}
+		else {
+			branch[n].key = cpu_to_le32(first_data_block);
+			/* end of chain, update the last new metablock of
+			 * the chain to point to the new allocated
+			 * data blocks numbers
+			 */
+			for (i = 0; i < data_num; i++)
+				*(branch[n].p + i) = cpu_to_le32(current_block++);
+		}
+		BUFFER_TRACE(bh, "marking uptodate");
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+
+		BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+		err = ext3_journal_dirty_metadata(handle, bh);
+		if (err)
+			goto failed;
+	}
+
+	bh = branch[0].bh;
+
+	/* now splice the new branch into the tree */
+	if (bh) {
+		BUFFER_TRACE(bh, "call get_write_access");
+		err = ext3_journal_get_write_access(handle, bh);
+		if (err)
+			goto failed;
+	}
+
+	*(branch[0].p) = branch[0].key;
+	current_block += 1;
+	/* update host bufferhead or inode to point to
+	 * new data blocks */
+	if (meta_num == 0)
+		for (i = 1; i < data_num; i++)
+			*(branch[0].p + i) = cpu_to_le32(current_block++);
+
+	if (bh) {
+		BUFFER_TRACE(bh, "marking uptodate");
+		/*set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+		*/
+		BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+		err = ext3_journal_dirty_metadata(handle, bh);
+		if (err)
+			goto failed;
+	}
+
+	*maxblocks = data_num;
+
+	if (GBS_DEBUG) {
+		for (i = 0; i <= meta_num; i++)
+			printk("inode %p, branch[%d].p: %p, branch[%d].key: %d\n", inode, i, branch[i].p, i, le32_to_cpu(branch[i].key));
+		for (i = 0; i < data_num - 1; i++)
+			printk("inode %p, branch[%d].p + %d + 1: %p, *(branch[%d].p+%d+1): %d, branch[%d].bh: %p\n", inode, n-1, i, branch[n-1].p + i + 1, n-1, i, le32_to_cpu(*(branch[n-1].p+i+1)), n-1, branch[n-1].bh);
+	}
+
+	return err;
+failed:
+	/* Allocation failed, free what we already allocated */
+	for (i = 0; i < n ; i++) {
+		BUFFER_TRACE(branch[i].bh, "call journal_forget");
+		ext3_journal_forget(handle, branch[i].bh);
+	}
+	for (i = 0; i < meta_num; i++)
+		ext3_free_blocks(handle, inode, new_meta_blocks[i], 1);
+
+	if (data_num)
+		ext3_free_blocks(handle, inode, first_data_block, data_num);
+	return err;
+}
 
 /**
  *	ext3_splice_branch - splice the allocated branch onto inode.
@@ -783,8 +975,154 @@ out:
 	return err;
 }
 
-static int ext3_get_block(struct inode *inode, sector_t iblock,
-			struct buffer_head *bh_result, int create)
+static int
+ext3_count_blocks_to_allocate(Indirect * branch, int k,
+				unsigned long maxblocks, int blocks_to_boundary)
+{
+	unsigned long count = 0;
+
+	if (k == 0) return 0;
+	/*
+	 * Simple case, [t,d]Indirect block(s) has not allocated yet
+	 * then it's clear blocks on that path have not allocated
+	 */
+	if (GBS_DEBUG1 || GBS_DEBUG)
+		printk("maxblocks: %d, k: %d, boundary : %d \n",maxblocks, k,
+			blocks_to_boundary);
+	if (k > 1) {
+		/* right now we don't handle cross-boundary allocation */
+		if ((maxblocks - count) < blocks_to_boundary)
+			count += maxblocks;
+		else
+			count += blocks_to_boundary;
+		count += k - 1; /* blocks for [t,d]indirect blocks */
+		return count;
+	}
+
+	count++;
+	while (count < maxblocks && count <= blocks_to_boundary
+		&& *(branch[0].p + count) == 0) {
+		count++;
+	}
+	return count;
+}
+static int
+ext3_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock,
+			unsigned long *maxblocks, struct buffer_head *bh_result,
+			int create, int extend_disksize)
+{
+	int err = -EIO;
+	int offsets[4];
+	Indirect chain[4];
+	Indirect *partial = NULL;
+	unsigned long goal;
+	int left;
+	int blocks_to_boundary = 0;
+	int depth;
+	struct ext3_inode_info *ei = EXT3_I(inode);
+	unsigned long count = 0;
+	unsigned long first_block = 0;
+	struct ext3_block_alloc_info *block_i = EXT3_I(inode)->i_block_alloc_info;
+
+
+	J_ASSERT(handle != NULL || create == 0);
+
+	if (GBS_DEBUG1 || GBS_DEBUG)
+		printk("ext3_get_blocks_handle: inode %x, maxblocks= %d, iblock = %d, create = %d\n", inode, (int)*maxblocks, (int)iblock, create);
+	down(&ei->truncate_sem);
+	depth = ext3_block_to_path(inode, iblock, offsets, &blocks_to_boundary);
+	if (depth == 0) {
+		printk ("depth == 0\n");
+		goto out;
+	}
+	partial = ext3_get_branch(inode, depth,
+				offsets, chain, &err);
+	/* Simplest case - block found */
+	if (!partial) {
+		first_block = chain[depth-1].key;
+		clear_buffer_new(bh_result);
+		first_block = le32_to_cpu(chain[depth-1].key);
+		clear_buffer_new(bh_result);
+		count++;
+		/* map more blocks */
+		while (count < *maxblocks && count <= blocks_to_boundary
+			&& (le32_to_cpu(*(chain[depth-1].p+count)) == first_block + count)) {
+			count++;
+		goto got_it;
+	}
+	/* got mapped blocks, or plain lookup, or failed read of indirect block */
+	if (!create || err == -EIO) {
+		up(&ei->truncate_sem);
+		goto out;
+	}
+	/*
+	 * Okay, we need to do block allocation.  Lazily initialize the block
+	 * allocation info here if necessary
+	 */
+	if (S_ISREG(inode->i_mode) && (!ei->i_block_alloc_info))
+		ext3_init_block_alloc_info(inode);
+
+	goal = ext3_find_goal(inode, iblock, chain, partial);
+
+	/* number of missing metadata blocks we need to allocate for this branch */
+	left = chain + depth - partial;
+	count = ext3_count_blocks_to_allocate(partial, left, *maxblocks, blocks_to_boundary);
+	if (GBS_DEBUG1 || GBS_DEBUG)
+		printk("blocks to allocate: %d\n", count);
+	if (!err)
+		err = ext3_alloc_splice_branch(handle, inode, goal, &count,
+			offsets+(partial-chain), partial, left-1);
+	if (err) {
+		up(&ei->truncate_sem);
+		goto cleanup;
+	}
+	/* i_disksize growing is protected by truncate_sem
+	 * don't forget to protect it if you're about to implement
+	 * concurrent ext3_get_block() -bzzz */
+	if (extend_disksize && inode->i_size > ei->i_disksize)
+		ei->i_disksize = inode->i_size;
+	/*
+	 * update the most recently allocated logical & physical block
+	 * in i_block_alloc_info, to assist find the proper goal block for next
+	 * allocation
+	 */
+	block_i = ei->i_block_alloc_info;
+	if (block_i) {
+		block_i->last_alloc_logical_block = iblock + count - 1;
+		block_i->last_alloc_physical_block = le32_to_cpu(chain[depth-1].key) + count - 1;
+	}
+
+	inode->i_ctime = CURRENT_TIME_SEC;
+	ext3_mark_inode_dirty(handle, inode);
+
+	up(&ei->truncate_sem);
+	if (err)
+		goto cleanup;
+
+	set_buffer_new(bh_result);
+got_it:
+	map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
+	if (blocks_to_boundary == 0)
+		set_buffer_boundary(bh_result);
+	/* Clean up and exit */
+	partial = chain+depth-1; /* the whole chain */
+cleanup:
+	while (partial > chain) {
+		BUFFER_TRACE(partial->bh, "call brelse");
+		brelse(partial->bh);
+		partial--;
+	}
+	BUFFER_TRACE(bh_result, "returned");
+out:
+	if (GBS_DEBUG1 ||GBS_DEBUG)
+		printk("ext3_get_blocks_handle returned, logical:%d, physical:%d, count: %d, err is %d\n", (int)iblock, (int) first_block, count, err);
+	*maxblocks = count;
+	return err;
+}
+
+static int ext3_get_blocks(struct inode *inode, sector_t iblock,
+		unsigned long maxblocks, struct buffer_head *bh_result,
+		int create)
 {
 	handle_t *handle = NULL;
 	int ret;
@@ -793,15 +1131,23 @@ static int ext3_get_block(struct inode *
 		handle = ext3_journal_current_handle();
 		J_ASSERT(handle != 0);
 	}
-	ret = ext3_get_block_handle(handle, inode, iblock,
+	ret = ext3_get_blocks_handle(handle, inode, iblock, &maxblocks,
 				bh_result, create, 1);
-	return ret;
+	bh_result->b_size = (maxblocks << inode->i_blkbits);
+	return ret;
+}
+
+static int ext3_get_block(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create)
+{
+	if (GBS_DEBUG)
+		printk("ext3_get_block is called\n");
+	return ext3_get_blocks(inode, iblock, 1, bh_result, create);
 }
 
 #define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
 
-static int
-ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
+static int ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
 		unsigned long max_blocks, struct buffer_head *bh_result,
 		int create)
 {
@@ -837,10 +1183,14 @@ ext3_direct_io_get_blocks(struct inode *
 	}
 
 get_block:
+	if (GBS_DEBUG)
+		printk("Calling ext3_get_blocks_handle from dio: maxblocks= %d, iblock = %d, create = %d\n", (int)max_blocks, (int)iblock, create);
 	if (ret == 0)
-		ret = ext3_get_block_handle(handle, inode, iblock,
+		ret = ext3_get_blocks_handle(handle, inode, iblock, &max_blocks,
 					bh_result, create, 0);
-	bh_result->b_size = (1 << inode->i_blkbits);
+	bh_result->b_size = (max_blocks << inode->i_blkbits);
+	if (GBS_DEBUG)
+		printk("ext3_get_blocks_handle returns to dio: maxblocks= %d, iblock = %d\n", (int)max_blocks, (int)iblock);
 	return ret;
 }
 
diff -puN fs/ext3/xattr.c~ext3-get-blocks fs/ext3/xattr.c
--- linux-2.6.12/fs/ext3/xattr.c~ext3-get-blocks	2005-07-14 21:55:55.118384680 -0700
+++ linux-2.6.12-ming/fs/ext3/xattr.c	2005-07-14 21:55:55.173376320 -0700
@@ -796,7 +796,8 @@ inserted:
 					EXT3_SB(sb)->s_es->s_first_data_block) +
 				EXT3_I(inode)->i_block_group *
 				EXT3_BLOCKS_PER_GROUP(sb);
-			int block = ext3_new_block(handle, inode, goal, &error);
+			unsigned long count = 1;
+			int block = ext3_new_blocks(handle, inode, goal, &count, &error);
 			if (error)
 				goto cleanup;
 			ea_idebug(inode, "creating block %d", block);
diff -puN include/linux/ext3_fs.h~ext3-get-blocks include/linux/ext3_fs.h
--- linux-2.6.12/include/linux/ext3_fs.h~ext3-get-blocks	2005-07-14 21:55:55.122384072 -0700
+++ linux-2.6.12-ming/include/linux/ext3_fs.h	2005-07-14 21:55:55.177375712 -0700
@@ -729,7 +729,7 @@ struct dir_private_info {
 /* balloc.c */
 extern int ext3_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
-extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
+extern int ext3_new_blocks (handle_t *, struct inode *, unsigned long, unsigned long*, int *);
 extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
 			      unsigned long);
 extern void ext3_free_blocks_sb (handle_t *, struct super_block *,

_






* [RFC] [PATCH 2/4] delayed allocation for ext3
       [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
  2005-07-17 17:40 ` [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3 Mingming Cao
  2005-07-17 17:40 ` [RFC] [PATCH 1/4]Multiple block " Mingming Cao
@ 2005-07-17 17:40 ` Mingming Cao
  2005-07-18  1:47   ` [Ext2-devel] " Andreas Dilger
  2005-07-26 22:52   ` Andrew Morton
  2005-07-17 17:40 ` [RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages Mingming Cao
  2005-07-17 17:41 ` [RFC] [PATCH 4/4] add ext3 writeback writepages Mingming Cao
  4 siblings, 2 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:40 UTC (permalink / raw)
  To: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel
  Cc: Badari Pulavarty, suparna, tytso

Here is the updated patch from Badari for delayed allocation for ext3.
Delayed allocation defers block allocation from prepare-write time to
page writeout time. 
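
With this patch, a filesystem mounted with "-o data=writeback,delalloc"
skips allocation entirely at prepare-write time; roughly, as the
inode.c hunk below does (a sketch, not additional code):

static int delalloc_prepare_write(struct page *page,
		unsigned from, unsigned to)
{
	/*
	 * create == 0: ext3_get_block() only looks up already-existing
	 * blocks, so nothing is allocated and no journal handle is
	 * needed here; the real block allocation is deferred until the
	 * page is written out.
	 */
	return __nobh_prepare_write(page, from, to, ext3_get_block, 0);
}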


---

 linux-2.6.12-ming/fs/buffer.c             |   13 +++++++++----
 linux-2.6.12-ming/fs/ext3/inode.c         |    6 ++++++
 linux-2.6.12-ming/fs/ext3/super.c         |   14 +++++++++++++-
 linux-2.6.12-ming/include/linux/ext3_fs.h |    1 +
 4 files changed, 29 insertions(+), 5 deletions(-)

diff -puN include/linux/ext3_fs.h~ext3-delalloc include/linux/ext3_fs.h
--- linux-2.6.12/include/linux/ext3_fs.h~ext3-delalloc	2005-07-14 23:15:34.861753240 -0700
+++ linux-2.6.12-ming/include/linux/ext3_fs.h	2005-07-14 23:15:34.881750200 -0700
@@ -373,6 +373,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_BARRIER		0x20000 /* Use block barriers */
 #define EXT3_MOUNT_NOBH			0x40000 /* No bufferheads */
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
+#define EXT3_MOUNT_DELAYED_ALLOC	0x100000 /* Delayed Allocation */
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
diff -puN fs/ext3/inode.c~ext3-delalloc fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc	2005-07-14 23:15:34.866752480 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-14 23:15:34.889748984 -0700
@@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil
 	handle_t *handle;
 	int retries = 0;
 
+
+	if (test_opt(inode->i_sb, DELAYED_ALLOC))
+		return __nobh_prepare_write(page, from, to, ext3_get_block, 0);
 retry:
 	handle = ext3_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
@@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s
 	else
 		ret = generic_commit_write(file, page, from, to);
 
+	if (test_opt(inode->i_sb, DELAYED_ALLOC))
+		return ret;
+
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
diff -puN fs/ext3/super.c~ext3-delalloc fs/ext3/super.c
--- linux-2.6.12/fs/ext3/super.c~ext3-delalloc	2005-07-14 23:15:34.870751872 -0700
+++ linux-2.6.12-ming/fs/ext3/super.c	2005-07-14 23:15:34.896747920 -0700
@@ -585,7 +585,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
 	Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
 	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
-	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh,
+	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_delayed_alloc,
 	Opt_commit, Opt_journal_update, Opt_journal_inum,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
@@ -621,6 +621,7 @@ static match_table_t tokens = {
 	{Opt_noreservation, "noreservation"},
 	{Opt_noload, "noload"},
 	{Opt_nobh, "nobh"},
+	{Opt_delayed_alloc, "delalloc"},
 	{Opt_commit, "commit=%u"},
 	{Opt_journal_update, "journal=update"},
 	{Opt_journal_inum, "journal=%u"},
@@ -954,6 +955,10 @@ clear_qf_name:
 		case Opt_nobh:
 			set_opt(sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_delayed_alloc:
+			set_opt(sbi->s_mount_opt, NOBH);
+			set_opt(sbi->s_mount_opt, DELAYED_ALLOC);
+			break;
 		default:
 			printk (KERN_ERR
 				"EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1612,6 +1617,13 @@ static int ext3_fill_super (struct super
 			clear_opt(sbi->s_mount_opt, NOBH);
 		}
 	}
+	if (test_opt(sb, DELAYED_ALLOC)) {
+		if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
+			printk(KERN_WARNING "EXT3-fs: Ignoring delalloc option - "
+				"it is supported only with writeback mode\n");
+			clear_opt(sbi->s_mount_opt, DELAYED_ALLOC);
+		}
+	}
 	/*
 	 * The journal_load will have done any necessary log recovery,
 	 * so we can safely mount the rest of the filesystem now.
diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c
--- linux-2.6.12/fs/buffer.c~ext3-delalloc	2005-07-14 23:15:34.875751112 -0700
+++ linux-2.6.12-ming/fs/buffer.c	2005-07-14 23:15:34.903746856 -0700
@@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct 
  * On entry, the page is fully not uptodate.
  * On exit the page is fully uptodate in the areas outside (from,to)
  */
-int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
-			get_block_t *get_block)
+int __nobh_prepare_write(struct page *page, unsigned from, unsigned to,
+			get_block_t *get_block, int create)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -2370,10 +2370,8 @@ int nobh_prepare_write(struct page *page
 		  block_start < PAGE_CACHE_SIZE;
 		  block_in_page++, block_start += blocksize) {
 		unsigned block_end = block_start + blocksize;
-		int create;
 
 		map_bh.b_state = 0;
-		create = 1;
 		if (block_start >= to)
 			create = 0;
 		ret = get_block(inode, block_in_file + block_in_page,
@@ -2482,6 +2480,13 @@ failed:
 	set_page_dirty(page);
 	return ret;
 }
+
+int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
+			get_block_t *get_block)
+{
+	return __nobh_prepare_write(page, from, to, get_block, 1);
+}
+
 EXPORT_SYMBOL(nobh_prepare_write);
 
 int nobh_commit_write(struct file *file, struct page *page,

_




* [RFC] [PATCH 3/4] generic getblocks() support in mpage_writepages
       [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
                   ` (2 preceding siblings ...)
  2005-07-17 17:40 ` [RFC] [PATCH 2/4]delayed " Mingming Cao
@ 2005-07-17 17:40 ` Mingming Cao
  2005-07-17 17:41 ` [RFC] [PATCH 4/4] add ext3 writeback writepages Mingming Cao
  4 siblings, 0 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:40 UTC (permalink / raw)
  To: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel
  Cc: Badari Pulavarty, suparna, tytso

Updated patch from Suparna for generic support to cluster pages
together in mpage_writepages() to make use of getblocks().
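
The core is the new struct mpageio plus mpage_get_more_blocks() in the
diff below: together they accumulate a single contiguous on-disk extent
across successive dirty pages.  A simplified sketch of the clustering
loop ('mapped' and 'next' are illustrative local names):

	unsigned long mapped = mio->map_bh.b_size >> inode->i_blkbits;
	unsigned long next = mio->block_in_file + mapped;

	while (next < mio->final_block_in_request) {
		struct buffer_head bh = { .b_state = 0 };

		/* ask the fs for as many blocks as it can map at once */
		if (get_blocks(inode, next,
				mio->final_block_in_request - next, &bh, 1))
			break;
		/* stop as soon as the new extent is not adjacent */
		if (mapped && bh.b_blocknr != mio->map_bh.b_blocknr + mapped)
			break;
		if (!mapped)	/* first extent: remember where it starts */
			mio->map_bh.b_blocknr = bh.b_blocknr;
		mapped += bh.b_size >> inode->i_blkbits;
		next = mio->block_in_file + mapped;
		mio->map_bh.b_size += bh.b_size;
	}
	/* __mpage_writepage() then consumes map_bh one page at a time */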

---

 linux-2.6.12-ming/fs/buffer.c                 |   49 -----
 linux-2.6.12-ming/fs/ext2/inode.c             |   15 -
 linux-2.6.12-ming/fs/ext3/inode.c             |   15 +
 linux-2.6.12-ming/fs/ext3/super.c             |    3 
 linux-2.6.12-ming/fs/hfs/inode.c              |    2 
 linux-2.6.12-ming/fs/hfsplus/inode.c          |    2 
 linux-2.6.12-ming/fs/jfs/inode.c              |   24 ++
 linux-2.6.12-ming/fs/mpage.c                  |  214 ++++++++++++++++++--------
 linux-2.6.12-ming/include/linux/buffer_head.h |    4 
 linux-2.6.12-ming/include/linux/fs.h          |    2 
 linux-2.6.12-ming/include/linux/mpage.h       |   11 -
 linux-2.6.12-ming/include/linux/pagemap.h     |    3 
 linux-2.6.12-ming/include/linux/pagevec.h     |    3 
 linux-2.6.12-ming/include/linux/radix-tree.h  |   14 +
 linux-2.6.12-ming/lib/radix-tree.c            |   25 ++-
 linux-2.6.12-ming/mm/filemap.c                |    9 -
 linux-2.6.12-ming/mm/swap.c                   |   11 +
 17 files changed, 270 insertions(+), 136 deletions(-)

diff -puN fs/buffer.c~mpage_writepages_getblocks fs/buffer.c
--- linux-2.6.12/fs/buffer.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/buffer.c	2005-07-15 00:11:01.000000000 -0700
@@ -2509,53 +2509,10 @@ EXPORT_SYMBOL(nobh_commit_write);
  * that it tries to operate without attaching bufferheads to
  * the page.
  */
-int nobh_writepage(struct page *page, get_block_t *get_block,
-			struct writeback_control *wbc)
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+		struct writeback_control *wbc, writepage_t bh_writepage_fn)
 {
-	struct inode * const inode = page->mapping->host;
-	loff_t i_size = i_size_read(inode);
-	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
-	unsigned offset;
-	void *kaddr;
-	int ret;
-
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		goto out;
-
-	/* Is the page fully outside i_size? (truncate in progress) */
-	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
-		unlock_page(page);
-		return 0; /* don't care */
-	}
-
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invocation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	kaddr = kmap_atomic(page, KM_USER0);
-	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
-	flush_dcache_page(page);
-	kunmap_atomic(kaddr, KM_USER0);
-out:
-	ret = mpage_writepage(page, get_block, wbc);
-	if (ret == -EAGAIN)
-		ret = __block_write_full_page(inode, page, get_block, wbc);
-	return ret;
+	return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn);
 }
 EXPORT_SYMBOL(nobh_writepage);
 
diff -puN fs/ext2/inode.c~mpage_writepages_getblocks fs/ext2/inode.c
--- linux-2.6.12/fs/ext2/inode.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/ext2/inode.c	2005-07-15 00:11:01.000000000 -0700
@@ -650,12 +650,6 @@ ext2_nobh_prepare_write(struct file *fil
 	return nobh_prepare_write(page,from,to,ext2_get_block);
 }
 
-static int ext2_nobh_writepage(struct page *page,
-			struct writeback_control *wbc)
-{
-	return nobh_writepage(page, ext2_get_block, wbc);
-}
-
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,ext2_get_block);
@@ -673,6 +667,12 @@ ext2_get_blocks(struct inode *inode, sec
 	return ret;
 }
 
+static int ext2_nobh_writepage(struct page *page,
+			struct writeback_control *wbc)
+{
+	return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage);
+}
+
 static ssize_t
 ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs)
@@ -687,7 +687,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-	return mpage_writepages(mapping, wbc, ext2_get_block);
+	return __mpage_writepages(mapping, wbc, ext2_get_blocks,
+					ext2_writepage);
 }
 
 struct address_space_operations ext2_aops = {
diff -puN fs/ext3/super.c~mpage_writepages_getblocks fs/ext3/super.c
--- linux-2.6.12/fs/ext3/super.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/ext3/super.c	2005-07-15 00:11:01.000000000 -0700
@@ -1353,6 +1353,7 @@ static int ext3_fill_super (struct super
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
 
 	set_opt(sbi->s_mount_opt, RESERVATION);
+	set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
 
 	if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0))
 		goto failed_mount;
@@ -1599,6 +1600,7 @@ static int ext3_fill_super (struct super
 			printk(KERN_ERR "EXT3-fs: Journal does not support "
 			       "requested data journaling mode\n");
 			goto failed_mount3;
 		}
+		set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */
 	default:
 		break;
@@ -1616,6 +1618,7 @@ static int ext3_fill_super (struct super
 				"its supported only with writeback mode\n");
 			clear_opt(sbi->s_mount_opt, NOBH);
 		}
+		printk("NOBH option set\n");
 	}
 	if (test_opt(sb, DELAYED_ALLOC)) {
 		if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
diff -puN fs/ext3/inode.c~mpage_writepages_getblocks fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~mpage_writepages_getblocks	2005-07-15 17:32:05.865000480 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-15 18:06:49.384257408 -0700
@@ -1195,6 +1195,11 @@ get_block:
 }
 
 
+static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock,
+		unsigned long max_blocks, struct buffer_head *bh, int create)
+{
+	return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create);
+}
 /*
  * `handle' can be NULL if create is zero
  */
@@ -1674,6 +1679,13 @@ out_fail:
 	return ret;
 }
 
+static int
+ext3_writeback_writepage_helper(struct page *page,
+				struct writeback_control *wbc)
+{
+	return block_write_full_page(page, ext3_get_block, wbc);
+}
+
 static int ext3_writeback_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
@@ -1692,7 +1704,8 @@ static int ext3_writeback_writepage(stru
 	}
 
 	if (test_opt(inode->i_sb, NOBH))
-		ret = nobh_writepage(page, ext3_get_block, wbc);
+		ret = nobh_writepage(page, ext3_writepages_get_blocks, wbc,
+			ext3_writeback_writepage_helper);
 	else
 		ret = block_write_full_page(page, ext3_get_block, wbc);
 
diff -puN fs/hfs/inode.c~mpage_writepages_getblocks fs/hfs/inode.c
--- linux-2.6.12/fs/hfs/inode.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/hfs/inode.c	2005-07-15 00:11:01.000000000 -0700
@@ -124,7 +124,7 @@ static ssize_t hfs_direct_IO(int rw, str
 static int hfs_writepages(struct address_space *mapping,
 			  struct writeback_control *wbc)
 {
-	return mpage_writepages(mapping, wbc, hfs_get_block);
+	return mpage_writepages(mapping, wbc, hfs_get_blocks);
 }
 
 struct address_space_operations hfs_btree_aops = {
diff -puN fs/hfsplus/inode.c~mpage_writepages_getblocks fs/hfsplus/inode.c
--- linux-2.6.12/fs/hfsplus/inode.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/hfsplus/inode.c	2005-07-15 00:11:01.000000000 -0700
@@ -121,7 +121,7 @@ static ssize_t hfsplus_direct_IO(int rw,
 static int hfsplus_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
 {
-	return mpage_writepages(mapping, wbc, hfsplus_get_block);
+	return mpage_writepages(mapping, wbc, hfsplus_get_blocks);
 }
 
 struct address_space_operations hfsplus_btree_aops = {
diff -puN fs/jfs/inode.c~mpage_writepages_getblocks fs/jfs/inode.c
--- linux-2.6.12/fs/jfs/inode.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/jfs/inode.c	2005-07-15 00:11:01.000000000 -0700
@@ -249,21 +249,41 @@ jfs_get_blocks(struct inode *ip, sector_
 	return rc;
 }
 
+static int
+jfs_mpage_get_blocks(struct inode *ip, sector_t lblock, unsigned long
+			max_blocks, struct buffer_head *bh_result, int create)
+{
+	/*
+	 * fixme: temporary workaround: return one block at a time until
+	 * we figure out why we see exposures with truncate on
+	 * allocating multiple blocks in one shot.
+	 */
+	return jfs_get_blocks(ip, lblock, 1, bh_result, create);
+}
+
 static int jfs_get_block(struct inode *ip, sector_t lblock,
 			 struct buffer_head *bh_result, int create)
 {
 	return jfs_get_blocks(ip, lblock, 1, bh_result, create);
 }
 
+static int jfs_bh_writepage(struct page *page,
+				struct writeback_control *wbc)
+{
+	return block_write_full_page(page, jfs_get_block, wbc);
+}
+
+
 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return nobh_writepage(page, jfs_get_block, wbc);
+	return nobh_writepage(page, jfs_mpage_get_blocks, wbc, jfs_bh_writepage);
 }
 
 static int jfs_writepages(struct address_space *mapping,
 			struct writeback_control *wbc)
 {
-	return mpage_writepages(mapping, wbc, jfs_get_block);
+	return __mpage_writepages(mapping, wbc, jfs_mpage_get_blocks,
+					jfs_bh_writepage);
 }
 
 static int jfs_readpage(struct file *file, struct page *page)
diff -puN fs/mpage.c~mpage_writepages_getblocks fs/mpage.c
--- linux-2.6.12/fs/mpage.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/fs/mpage.c	2005-07-15 18:06:49.397255432 -0700
@@ -373,6 +373,67 @@ int mpage_readpage(struct page *page, ge
 }
 EXPORT_SYMBOL(mpage_readpage);
 
+struct mpageio {
+	struct bio *bio;
+	struct buffer_head map_bh;
+	unsigned long block_in_file;
+	unsigned long final_block_in_request;
+	sector_t block_in_bio;
+	int boundary;
+	sector_t boundary_block;
+	struct block_device *boundary_bdev;
+};
+
+/*
+ * Maps as many contiguous disk blocks as it can within the range of
+ * the request, and returns the total number of contiguous mapped
+ * blocks in the mpageio.
+ */
+static unsigned long mpage_get_more_blocks(struct mpageio *mio,
+	struct inode *inode, get_blocks_t get_blocks)
+{
+	struct buffer_head map_bh = {.b_state = 0};
+	unsigned long mio_nblocks = mio->map_bh.b_size >> inode->i_blkbits;
+	unsigned long first_unmapped = mio->block_in_file + mio_nblocks;
+	unsigned long next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
+
+	while ((first_unmapped < mio->final_block_in_request) &&
+		(mio->map_bh.b_size < PAGE_SIZE)) {
+
+		if (get_blocks(inode, first_unmapped,
+			mio->final_block_in_request - first_unmapped,
+			&map_bh, 1))
+			break;
+		if (mio_nblocks && ((map_bh.b_blocknr != next_contig_block) ||
+			map_bh.b_bdev != mio->map_bh.b_bdev))
+			break;
+
+		if (buffer_new(&map_bh)) {
+			int i = 0;
+			for (; i < map_bh.b_size >> inode->i_blkbits; i++)
+				unmap_underlying_metadata(map_bh.b_bdev,
+					map_bh.b_blocknr + i);
+		}
+
+		if (buffer_boundary(&map_bh)) {
+			mio->boundary = 1;
+			mio->boundary_block = map_bh.b_blocknr;
+			mio->boundary_bdev = map_bh.b_bdev;
+		}
+		if (mio_nblocks == 0) {
+			mio->map_bh.b_bdev = map_bh.b_bdev;
+			mio->map_bh.b_blocknr = map_bh.b_blocknr;
+		}
+
+		mio_nblocks += map_bh.b_size >> inode->i_blkbits;
+		first_unmapped = mio->block_in_file + mio_nblocks;
+		next_contig_block = mio->map_bh.b_blocknr + mio_nblocks;
+		mio->map_bh.b_size += map_bh.b_size;
+	}
+
+	return mio_nblocks;
+}
+
 /*
  * Writing is not so simple.
  *
@@ -389,9 +450,9 @@ EXPORT_SYMBOL(mpage_readpage);
  * written, so it can intelligently allocate a suitably-sized BIO.  For now,
  * just allocate full-size (16-page) BIOs.
  */
-static struct bio *
-__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
-	sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
+static int
+__mpage_writepage(struct mpageio *mio, struct page *page,
+	get_blocks_t get_blocks, struct writeback_control *wbc,
 	writepage_t writepage_fn)
 {
 	struct address_space *mapping = page->mapping;
@@ -399,9 +460,8 @@ __mpage_writepage(struct bio *bio, struc
 	const unsigned blkbits = inode->i_blkbits;
 	unsigned long end_index;
 	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
-	sector_t last_block;
+	sector_t last_block, blocks_to_skip;
 	sector_t block_in_file;
-	sector_t blocks[MAX_BUF_PER_PAGE];
 	unsigned page_block;
 	unsigned first_unmapped = blocks_per_page;
 	struct block_device *bdev = NULL;
@@ -409,8 +469,10 @@ __mpage_writepage(struct bio *bio, struc
 	sector_t boundary_block = 0;
 	struct block_device *boundary_bdev = NULL;
 	int length;
-	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
+	struct buffer_head *map_bh = &mio->map_bh;
+	struct bio *bio = mio->bio;
+	int ret = 0;
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -438,10 +500,13 @@ __mpage_writepage(struct bio *bio, struc
 			if (!buffer_dirty(bh) || !buffer_uptodate(bh))
 				goto confused;
 			if (page_block) {
-				if (bh->b_blocknr != blocks[page_block-1] + 1)
+				if (bh->b_blocknr != map_bh->b_blocknr
+					+ page_block)
 					goto confused;
+			} else {
+				map_bh->b_blocknr = bh->b_blocknr;
+				map_bh->b_size = PAGE_SIZE;
 			}
-			blocks[page_block++] = bh->b_blocknr;
 			boundary = buffer_boundary(bh);
 			if (boundary) {
 				boundary_block = bh->b_blocknr;
@@ -468,33 +533,30 @@ __mpage_writepage(struct bio *bio, struc
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
 	last_block = (i_size - 1) >> blkbits;
-	map_bh.b_page = page;
-	for (page_block = 0; page_block < blocks_per_page; ) {
-
-		map_bh.b_state = 0;
-		if (get_block(inode, block_in_file, &map_bh, 1))
-			goto confused;
-		if (buffer_new(&map_bh))
-			unmap_underlying_metadata(map_bh.b_bdev,
-						map_bh.b_blocknr);
-		if (buffer_boundary(&map_bh)) {
-			boundary_block = map_bh.b_blocknr;
-			boundary_bdev = map_bh.b_bdev;
-		}
-		if (page_block) {
-			if (map_bh.b_blocknr != blocks[page_block-1] + 1)
-				goto confused;
-		}
-		blocks[page_block++] = map_bh.b_blocknr;
-		boundary = buffer_boundary(&map_bh);
-		bdev = map_bh.b_bdev;
-		if (block_in_file == last_block)
-			break;
-		block_in_file++;
+	blocks_to_skip = block_in_file - mio->block_in_file;
+	mio->block_in_file = block_in_file;
+	if (blocks_to_skip < (map_bh->b_size >> blkbits)) {
+		map_bh->b_blocknr += blocks_to_skip;
+		map_bh->b_size -= blocks_to_skip << blkbits;
+	} else {
+		map_bh->b_state = 0;
+		map_bh->b_size = 0;
+		if (mio->final_block_in_request > last_block)
+			mio->final_block_in_request = last_block;
+		mpage_get_more_blocks(mio, inode, get_blocks);
 	}
-	BUG_ON(page_block == 0);
+	if (map_bh->b_size < PAGE_SIZE)
+		goto confused;
 
-	first_unmapped = page_block;
+	if (mio->boundary && (mio->boundary_block < map_bh->b_blocknr
+		+ blocks_per_page)) {
+		boundary = 1;
+		boundary_block = mio->boundary_block;
+		boundary_bdev = mio->boundary_bdev;
+	}
+
+	bdev = map_bh->b_bdev;
+	first_unmapped = blocks_per_page;
 
 page_is_mapped:
 	end_index = i_size >> PAGE_CACHE_SHIFT;
@@ -521,12 +583,16 @@ page_is_mapped:
 	/*
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
-	if (bio && *last_block_in_bio != blocks[0] - 1)
+	if (bio && mio->block_in_bio != map_bh->b_blocknr - 1)
 		bio = mpage_bio_submit(WRITE, bio);
 
 alloc_new:
 	if (bio == NULL) {
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		/*
+		 * Fixme: bio size can be limited to final_block - block, or
+		 * even mio->map_bh.b_size
+		 */
+		bio = mpage_alloc(bdev, map_bh->b_blocknr << (blkbits - 9),
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
@@ -542,6 +608,9 @@ alloc_new:
 		bio = mpage_bio_submit(WRITE, bio);
 		goto alloc_new;
 	}
+	map_bh->b_blocknr += blocks_per_page;
+	map_bh->b_size -= PAGE_SIZE;
+	mio->block_in_file += blocks_per_page;
 
 	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
@@ -578,7 +647,8 @@ alloc_new:
 					boundary_block, 1 << blkbits);
 		}
 	} else {
-		*last_block_in_bio = blocks[blocks_per_page - 1];
+		/* we can pack more pages into the bio, don't submit yet */
+		mio->block_in_bio = map_bh->b_blocknr - 1;
 	}
 	goto out;
 
@@ -587,22 +657,23 @@ confused:
 		bio = mpage_bio_submit(WRITE, bio);
 
 	if (writepage_fn) {
-		*ret = (*writepage_fn)(page, wbc);
+		ret = (*writepage_fn)(page, wbc);
 	} else {
-		*ret = -EAGAIN;
+		ret = -EAGAIN;
 		goto out;
 	}
 	/*
 	 * The caller has a ref on the inode, so *mapping is stable
 	 */
-	if (*ret) {
-		if (*ret == -ENOSPC)
+	if (ret) {
+		if (ret == -ENOSPC)
 			set_bit(AS_ENOSPC, &mapping->flags);
 		else
 			set_bit(AS_EIO, &mapping->flags);
 	}
 out:
-	return bio;
+	mio->bio = bio;
+	return ret;
 }
 
 /**
@@ -628,11 +699,21 @@ out:
  */
 int
 mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block)
+		struct writeback_control *wbc, get_blocks_t get_blocks)
+{
+	return __mpage_writepages(mapping, wbc, get_blocks,
+			mapping->a_ops->writepage);
+}
+
+int
+__mpage_writepages(struct address_space *mapping,
+		struct writeback_control *wbc, get_blocks_t get_blocks,
+		writepage_t writepage_fn)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	struct bio *bio = NULL;
-	sector_t last_block_in_bio = 0;
+	struct inode *inode = mapping->host;
+	const unsigned blkbits = inode->i_blkbits;
 	int ret = 0;
 	int done = 0;
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -642,6 +723,9 @@ mpage_writepages(struct address_space *m
 	pgoff_t end = -1;		/* Inclusive */
 	int scanned = 0;
 	int is_range = 0;
+	struct mpageio mio = {
+		.bio = NULL
+	};
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -649,7 +733,7 @@ mpage_writepages(struct address_space *m
 	}
 
 	writepage = NULL;
-	if (get_block == NULL)
+	if (get_blocks == NULL)
 		writepage = mapping->a_ops->writepage;
 
 	pagevec_init(&pvec, 0);
@@ -666,12 +750,15 @@ mpage_writepages(struct address_space *m
 		scanned = 1;
 	}
 retry:
+	down_read(&inode->i_alloc_sem);
 	while (!done && (index <= end) &&
-			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			PAGECACHE_TAG_DIRTY,
+			(nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
+			&index, PAGECACHE_TAG_DIRTY,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
+		mio.final_block_in_request = min(index, end) <<
+			(PAGE_CACHE_SHIFT - blkbits);
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
@@ -696,7 +783,7 @@ retry:
 				unlock_page(page);
 				continue;
 			}
-
+
 			if (wbc->sync_mode != WB_SYNC_NONE)
 				wait_on_page_writeback(page);
 
@@ -717,9 +804,9 @@ retry:
 							&mapping->flags);
 				}
 			} else {
-				bio = __mpage_writepage(bio, page, get_block,
-						&last_block_in_bio, &ret, wbc,
-						page->mapping->a_ops->writepage);
+				ret = __mpage_writepage(&mio, page, get_blocks,
+						wbc, writepage_fn);
+				bio = mio.bio;
 			}
 			if (unlikely(ret == WRITEPAGE_ACTIVATE))
 				unlock_page(page);
@@ -733,6 +820,9 @@ retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	up_read(&inode->i_alloc_sem);
+
 	if (!scanned && !done) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
@@ -749,18 +839,24 @@ retry:
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
+EXPORT_SYMBOL(__mpage_writepages);
 
-int mpage_writepage(struct page *page, get_block_t get_block,
-	struct writeback_control *wbc)
+int mpage_writepage(struct page *page, get_blocks_t get_blocks,
+		struct writeback_control *wbc, writepage_t writepage_fn)
 {
 	int ret = 0;
-	struct bio *bio;
-	sector_t last_block_in_bio = 0;
-
-	bio = __mpage_writepage(NULL, page, get_block,
-			&last_block_in_bio, &ret, wbc, NULL);
-	if (bio)
-		mpage_bio_submit(WRITE, bio);
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	const unsigned blkbits = inode->i_blkbits;
+	struct mpageio mio = {
+		.final_block_in_request = (page->index + 1) << (PAGE_CACHE_SHIFT
+			- blkbits)
+	};
+
+	ret = __mpage_writepage(&mio, page, get_blocks,
+			wbc, writepage_fn);
+	if (mio.bio)
+		mpage_bio_submit(WRITE, mio.bio);
 
 	return ret;
 }
diff -puN include/linux/buffer_head.h~mpage_writepages_getblocks include/linux/buffer_head.h
--- linux-2.6.12/include/linux/buffer_head.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/buffer_head.h	2005-07-15 00:11:01.000000000 -0700
@@ -206,8 +206,8 @@ int file_fsync(struct file *, struct den
 int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
 int nobh_truncate_page(struct address_space *, loff_t);
-int nobh_writepage(struct page *page, get_block_t *get_block,
-                        struct writeback_control *wbc);
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+	struct writeback_control *wbc, writepage_t bh_writepage);
 
 
 /*
diff -puN include/linux/fs.h~mpage_writepages_getblocks include/linux/fs.h
--- linux-2.6.12/include/linux/fs.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/fs.h	2005-07-15 00:11:01.000000000 -0700
@@ -305,6 +305,8 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
diff -puN include/linux/mpage.h~mpage_writepages_getblocks include/linux/mpage.h
--- linux-2.6.12/include/linux/mpage.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/mpage.h	2005-07-15 18:06:49.398255280 -0700
@@ -11,15 +11,18 @@
  */
 
 struct writeback_control;
-typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
+
 int mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block);
-int mpage_writepage(struct page *page, get_block_t *get_block,
-		struct writeback_control *wbc);
+                struct writeback_control *wbc, get_blocks_t get_blocks);
+int mpage_writepage(struct page *page, get_blocks_t *get_blocks,
+                struct writeback_control *wbc, writepage_t writepage);
+int __mpage_writepages(struct address_space *mapping,
+                struct writeback_control *wbc, get_blocks_t get_blocks,
+                writepage_t writepage);
 
 static inline int
 generic_writepages(struct address_space *mapping, struct writeback_control *wbc)
diff -puN include/linux/pagemap.h~mpage_writepages_getblocks include/linux/pagemap.h
--- linux-2.6.12/include/linux/pagemap.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/pagemap.h	2005-07-15 00:11:01.000000000 -0700
@@ -73,7 +73,8 @@ extern struct page * find_or_create_page
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-			int tag, unsigned int nr_pages, struct page **pages);
+			int tag, unsigned int nr_pages, struct page **pages,
+			int contig);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -puN include/linux/pagevec.h~mpage_writepages_getblocks include/linux/pagevec.h
--- linux-2.6.12/include/linux/pagevec.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/pagevec.h	2005-07-15 00:11:01.000000000 -0700
@@ -28,6 +28,9 @@ unsigned pagevec_lookup(struct pagevec *
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
 		struct address_space *mapping, pgoff_t *index, int tag,
 		unsigned nr_pages);
+unsigned pagevec_contig_lookup_tag(struct pagevec *pvec,
+		struct address_space *mapping, pgoff_t *index, int tag,
+		unsigned nr_pages);
 
 static inline void pagevec_init(struct pagevec *pvec, int cold)
 {
diff -puN include/linux/radix-tree.h~mpage_writepages_getblocks include/linux/radix-tree.h
--- linux-2.6.12/include/linux/radix-tree.h~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/include/linux/radix-tree.h	2005-07-15 00:11:01.000000000 -0700
@@ -59,8 +59,18 @@ void *radix_tree_tag_clear(struct radix_
 int radix_tree_tag_get(struct radix_tree_root *root,
 			unsigned long index, int tag);
 unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
-		unsigned long first_index, unsigned int max_items, int tag);
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+		unsigned long first_index, unsigned int max_items, int tag,
+		int contig);
+
+static inline unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root
+		*root, void **results, unsigned long first_index,
+		unsigned int max_items, int tag)
+{
+	return __radix_tree_gang_lookup_tag(root, results, first_index,
+		max_items, tag, 0);
+}
+
 int radix_tree_tagged(struct radix_tree_root *root, int tag);
 
 static inline void radix_tree_preload_end(void)
diff -puN lib/radix-tree.c~mpage_writepages_getblocks lib/radix-tree.c
--- linux-2.6.12/lib/radix-tree.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/lib/radix-tree.c	2005-07-15 00:11:01.000000000 -0700
@@ -557,12 +557,13 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  */
 static unsigned int
 __lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
-	unsigned int max_items, unsigned long *next_index, int tag)
+	unsigned int max_items, unsigned long *next_index, int tag, int contig)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift;
 	unsigned int height = root->height;
 	struct radix_tree_node *slot;
+	unsigned long cindex = (contig && (*next_index)) ? *next_index : -1;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
@@ -575,6 +576,11 @@ __lookup_tag(struct radix_tree_root *roo
 				BUG_ON(slot->slots[i] == NULL);
 				break;
 			}
+			if (contig && index >= cindex) {
+				/* break in contiguity */
+				index = 0;
+				goto out;
+			}
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
 			if (index == 0)
@@ -593,6 +599,10 @@ __lookup_tag(struct radix_tree_root *roo
 					results[nr_found++] = slot->slots[j];
 					if (nr_found == max_items)
 						goto out;
+				} else if (contig && nr_found) {
+					/* break in contiguity */
+					index = 0;
+					goto out;
 				}
 			}
 		}
@@ -618,29 +628,32 @@ out:
  *	returns the number of items which were placed at *@results.
  */
 unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
-		unsigned long first_index, unsigned int max_items, int tag)
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+		unsigned long first_index, unsigned int max_items, int tag,
+		int contig)
 {
 	const unsigned long max_index = radix_tree_maxindex(root->height);
 	unsigned long cur_index = first_index;
+	unsigned long next_index = 0;	/* Index of next contiguous search */
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
 		unsigned int nr_found;
-		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
 		nr_found = __lookup_tag(root, results + ret, cur_index,
-					max_items - ret, &next_index, tag);
+				max_items - ret, &next_index, tag, contig);
 		ret += nr_found;
 		if (next_index == 0)
 			break;
 		cur_index = next_index;
+		if (!nr_found)
+			next_index = 0;
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+EXPORT_SYMBOL(__radix_tree_gang_lookup_tag);
 
 /**
  *	radix_tree_delete    -    delete an item from a radix tree
diff -puN mm/filemap.c~mpage_writepages_getblocks mm/filemap.c
--- linux-2.6.12/mm/filemap.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/mm/filemap.c	2005-07-15 00:11:01.000000000 -0700
@@ -649,16 +649,19 @@ unsigned find_get_pages(struct address_s
 /*
  * Like find_get_pages, except we only return pages which are tagged with
  * `tag'.   We update *index to index the next page for the traversal.
+ * If 'contig' is 1, then we return only pages which are contiguous in the
+ * file.
  */
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-			int tag, unsigned int nr_pages, struct page **pages)
+			int tag, unsigned int nr_pages, struct page **pages,
+			int contig)
 {
 	unsigned int i;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
-				(void **)pages, *index, nr_pages, tag);
+	ret = __radix_tree_gang_lookup_tag(&mapping->page_tree,
+			(void **)pages, *index, nr_pages, tag, contig);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
diff -puN mm/swap.c~mpage_writepages_getblocks mm/swap.c
--- linux-2.6.12/mm/swap.c~mpage_writepages_getblocks	2005-07-15 00:11:01.000000000 -0700
+++ linux-2.6.12-ming/mm/swap.c	2005-07-15 00:11:01.000000000 -0700
@@ -384,7 +384,16 @@ unsigned pagevec_lookup_tag(struct pagev
 		pgoff_t *index, int tag, unsigned nr_pages)
 {
 	pvec->nr = find_get_pages_tag(mapping, index, tag,
-					nr_pages, pvec->pages);
+					nr_pages, pvec->pages, 0);
+	return pagevec_count(pvec);
+}
+
+unsigned int
+pagevec_contig_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
+		pgoff_t *index, int tag, unsigned nr_pages)
+{
+	pvec->nr = find_get_pages_tag(mapping, index, tag,
+					nr_pages, pvec->pages, 1);
 	return pagevec_count(pvec);
 }
 

_
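
To make the contiguous-lookup semantics concrete, here is a minimal,
hypothetical caller (not part of the patch; assume `mapping' is the
address_space under writeback).  Unlike plain pagevec_lookup_tag(), the
lookup stops at the first break in the run of dirty-tagged pages:

	struct pagevec pvec;
	pgoff_t index = 0;
	unsigned nr;

	pagevec_init(&pvec, 0);
	nr = pagevec_contig_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE);
	/* pvec.pages[0..nr-1] now cover adjacent file offsets, so one
	 * get_blocks() request can try to map them all in one go */
	pagevec_release(&pvec);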





^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC] [PATCH 4/4]add ext3 writeback writepages
       [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
                   ` (3 preceding siblings ...)
  2005-07-17 17:40 ` [RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages Mingming Cao
@ 2005-07-17 17:41 ` Mingming Cao
  4 siblings, 0 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:41 UTC (permalink / raw)
  To: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel
  Cc: Badari Pulavarty, suparna, tytso, alex, adilger

This patch supports multiple block allocation for ext3 writeback mode through writepages().


---

 linux-2.6.12-ming/fs/ext3/inode.c       |  131 ++++++++++++++++++++++++++++++++
 linux-2.6.12-ming/fs/mpage.c            |    8 +
 linux-2.6.12-ming/include/linux/mpage.h |   17 ++++
 3 files changed, 153 insertions(+), 3 deletions(-)

diff -puN fs/ext3/inode.c~writepages fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~writepages	2005-07-17 17:11:43.239274864 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-17 17:11:43.259271824 -0700
@@ -36,6 +36,7 @@
 #include <linux/writeback.h>
 #include <linux/mpage.h>
 #include <linux/uio.h>
+#include <linux/pagevec.h>
 #include "xattr.h"
 #include "acl.h"
 
@@ -1719,6 +1720,135 @@ out_fail:
 	return ret;
 }
 
+static int
+ext3_writeback_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	const unsigned blkbits = inode->i_blkbits;
+	int err = 0;
+	int ret = 0;
+	int done = 0;
+	unsigned int max_pages_to_cluster = 0;
+	struct pagevec pvec;
+	int nr_pages;
+	pgoff_t index;
+	pgoff_t end = -1;		/* Inclusive */
+	int scanned = 0;
+	int is_range = 0;
+	struct page *page;
+	struct mpageio mio = {
+		.bio = NULL
+	};
+
+	pagevec_init(&pvec, 0);
+	if (wbc->sync_mode == WB_SYNC_NONE) {
+		index = mapping->writeback_index; /* Start from prev offset */
+	} else {
+		index = 0;			  /* whole-file sweep */
+		scanned = 1;
+	}
+	if (wbc->start || wbc->end) {
+		index = wbc->start >> PAGE_CACHE_SHIFT;
+		end = wbc->end >> PAGE_CACHE_SHIFT;
+		is_range = 1;
+		scanned = 1;
+	}
+	max_pages_to_cluster = min(EXT3_MAX_TRANS_DATA, (pgoff_t)PAGEVEC_SIZE);
+
+retry:
+	down_read(&inode->i_alloc_sem);
+	while (!done && (index <= end) &&
+			(nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
+			&index, PAGECACHE_TAG_DIRTY,
+			min(end - index, max_pages_to_cluster-1) + 1))) {
+		unsigned i;
+
+		scanned = 1;
+		for (i = 0; i < nr_pages; i++) {
+			page = pvec.pages[i];
+
+			lock_page(page);
+
+			if (unlikely(page->mapping != mapping)) {
+				unlock_page(page);
+				break;
+			}
+
+			if (unlikely(is_range) && page->index > end) {
+				unlock_page(page);
+				break;
+			}
+
+			if (wbc->sync_mode != WB_SYNC_NONE)
+				wait_on_page_writeback(page);
+
+			if (PageWriteback(page) ||
+					!clear_page_dirty_for_io(page)) {
+				unlock_page(page);
+				break;
+			}
+		}
+
+		if (i) {
+			unsigned j;
+			handle_t *handle;
+
+			page = pvec.pages[i-1];
+			index = page->index + 1;
+			mio.final_block_in_request =
+				min(index, end) << (PAGE_CACHE_SHIFT - blkbits);
+
+			handle = ext3_journal_start(inode,
+					i + ext3_writepage_trans_blocks(inode));
+
+			if (IS_ERR(handle)) {
+				err = PTR_ERR(handle);
+				done = 1;
+			}
+			for (j = 0; j < i; j++) {
+				page = pvec.pages[j];
+				if (!done) {
+					ret = __mpage_writepage(&mio, page,
+						ext3_writepages_get_blocks, wbc,
+						ext3_writeback_writepage_helper);
+					if (ret || (--(wbc->nr_to_write) <= 0))
+						done = 1;
+				} else {
+					redirty_page_for_writepage(wbc, page);
+					unlock_page(page);
+				}
+			}
+			if (!err && mio.bio)
+				mio.bio = mpage_bio_submit(WRITE, mio.bio);
+			if (!err)
+				err = ext3_journal_stop(handle);
+			if (!ret) {
+				ret = err;
+				if (ret)
+					done = 1;
+			}
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+
+	up_read(&inode->i_alloc_sem);
+
+	if (!scanned && !done) {
+		/*
+		 * We hit the last page and there is more work to be done: wrap
+		 * back to the start of the file
+		 */
+		scanned = 1;
+		index = 0;
+		goto retry;
+	}
+	if (!is_range)
+		mapping->writeback_index = index;
+	return ret;
+}
+
 static int ext3_journalled_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
@@ -1923,6 +2053,7 @@ static struct address_space_operations e
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_writeback_writepage,
+	.writepages	= ext3_writeback_writepages,
 	.sync_page	= block_sync_page,
 	.prepare_write	= ext3_prepare_write,
 	.commit_write	= ext3_writeback_commit_write,
diff -puN fs/mpage.c~writepages fs/mpage.c
--- linux-2.6.12/fs/mpage.c~writepages	2005-07-17 17:11:43.243274256 -0700
+++ linux-2.6.12-ming/fs/mpage.c	2005-07-17 17:12:43.220156384 -0700
@@ -90,7 +90,7 @@ static int mpage_end_io_write(struct bio
 	return 0;
 }
 
-static struct bio *mpage_bio_submit(int rw, struct bio *bio)
+struct bio *mpage_bio_submit(int rw, struct bio *bio)
 {
 	bio->bi_end_io = mpage_end_io_read;
 	if (rw == WRITE)
@@ -373,6 +373,7 @@ int mpage_readpage(struct page *page, ge
 }
 EXPORT_SYMBOL(mpage_readpage);
 
+#if 0
 struct mpageio {
 	struct bio *bio;
 	struct buffer_head map_bh;
@@ -383,6 +384,7 @@ struct mpageio {
 	sector_t boundary_block;
 	struct block_device *boundary_bdev;
 };
+#endif
 
 /*
  * Maps as many contiguous disk blocks as it can within the range of
@@ -450,7 +452,7 @@ static unsigned long mpage_get_more_bloc
  * written, so it can intelligently allocate a suitably-sized BIO.  For now,
  * just allocate full-size (16-page) BIOs.
  */
-static int
+int
 __mpage_writepage(struct mpageio *mio, struct page *page,
 	get_blocks_t get_blocks, struct writeback_control *wbc,
 	writepage_t writepage_fn)
@@ -532,7 +534,7 @@ __mpage_writepage(struct mpageio *mio, s
 	 */
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (i_size - 1) >> blkbits;
+	last_block = (i_size) >> blkbits;
 	blocks_to_skip = block_in_file - mio->block_in_file;
 	mio->block_in_file = block_in_file;
 	if (blocks_to_skip < (map_bh->b_size >> blkbits)) {
diff -puN include/linux/mpage.h~writepages include/linux/mpage.h
--- linux-2.6.12/include/linux/mpage.h~writepages	2005-07-17 17:11:43.246273800 -0700
+++ linux-2.6.12-ming/include/linux/mpage.h	2005-07-17 17:11:43.263271216 -0700
@@ -9,9 +9,23 @@
  * (And no, it doesn't do the #ifdef __MPAGE_H thing, and it doesn't do
  * nested includes.  Get it right in the .c file).
  */
+#ifndef _LINUX_BUFFER_HEAD_H
+#include <linux/buffer_head.h>
+#endif
 
 struct writeback_control;
 
+struct mpageio {
+	struct bio *bio;
+	struct buffer_head map_bh;
+	unsigned long block_in_file;
+	unsigned long final_block_in_request;
+	sector_t block_in_bio;
+	int boundary;
+	sector_t boundary_block;
+	struct block_device *boundary_bdev;
+};
+
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
@@ -23,6 +37,9 @@ int mpage_writepage(struct page *page, g
 int __mpage_writepages(struct address_space *mapping,
                 struct writeback_control *wbc, get_blocks_t get_blocks,
                 writepage_t writepage);
+int __mpage_writepage(struct mpageio *mio, struct page *page,
+        get_blocks_t get_blocks, struct writeback_control *wbc,
+        writepage_t writepage_fn);
 
 static inline int
 generic_writepages(struct address_space *mapping, struct writeback_control *wbc)

_
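
Condensed, the per-cluster flow added above in
ext3_writeback_writepages() looks roughly like this (a hypothetical
helper for illustration only; locking, redirtying and error paths
elided):

	static int ext3_write_cluster(struct inode *inode, struct page **pages,
			unsigned nr, struct mpageio *mio,
			struct writeback_control *wbc)
	{
		handle_t *handle;
		unsigned i;
		int ret = 0, err;

		/* one journal handle covers the whole page cluster */
		handle = ext3_journal_start(inode,
				nr + ext3_writepage_trans_blocks(inode));
		if (IS_ERR(handle))
			return PTR_ERR(handle);

		for (i = 0; i < nr && !ret; i++)
			ret = __mpage_writepage(mio, pages[i],
					ext3_writepages_get_blocks, wbc,
					ext3_writeback_writepage_helper);

		/* submit the accumulated BIO before closing the handle */
		if (mio->bio)
			mio->bio = mpage_bio_submit(WRITE, mio->bio);

		err = ext3_journal_stop(handle);
		return ret ? ret : err;
	}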





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3
  2005-07-17 17:40 ` [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3 Mingming Cao
@ 2005-07-17 17:45   ` Mingming Cao
  0 siblings, 0 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-17 17:45 UTC (permalink / raw)
  To: ext2-devel
  Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, linux-fsdevel,
	Badari Pulavarty, suparna, tytso, alex, adilger

On Sun, 2005-07-17 at 10:40 -0700, Mingming Cao wrote:
> Hi All, 
> 
> Here are the updated patches to support multiple block allocation and
> delayed allocation for ext3 done by me, Badari and Suparna.

Patches are against 2.6.13-rc3.


Mingming




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3
  2005-07-17 17:40 ` [RFC] [PATCH 2/4]delayed " Mingming Cao
@ 2005-07-18  1:47   ` Andreas Dilger
  2005-07-18 17:32     ` Mingming Cao
  2005-07-19  0:25     ` Badari Pulavarty
  2005-07-26 22:52   ` Andrew Morton
  1 sibling, 2 replies; 11+ messages in thread
From: Andreas Dilger @ 2005-07-18  1:47 UTC (permalink / raw)
  To: Mingming Cao
  Cc: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel, Badari Pulavarty, suparna, tytso

On Jul 17, 2005  10:40 -0700, Mingming Cao wrote:
> @@ -373,6 +373,7 @@ struct ext3_inode {
>  #define EXT3_MOUNT_BARRIER		0x20000 /* Use block barriers */
>  #define EXT3_MOUNT_NOBH			0x40000 /* No bufferheads */
>  #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
> + #define EXT3_MOUNT_DELAYED_ALLOC	0xC0000 /* Delayed Allocation */

This doesn't make sense.  DELAYED_ALLOC == QUOTA | NOBH?

> +     {Opt_delayed_alloc, "delalloc"},

Is this a replacement for Alex's delalloc code?  We also use delalloc for
that code and if they are not interchangeable it will cause confusion
about which one is in use.

> +     if (test_opt(sb, DELAYED_ALLOC)) {
> +             if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
> +                     printk(KERN_WARNING "EXT3-fs: Ignoring delall option - "
> +                             "its supported only with writeback mode\n");

Should be "ignoring delalloc option".
 
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3
  2005-07-18  1:47   ` [Ext2-devel] " Andreas Dilger
@ 2005-07-18 17:32     ` Mingming Cao
  2005-07-19  0:25     ` Badari Pulavarty
  1 sibling, 0 replies; 11+ messages in thread
From: Mingming Cao @ 2005-07-18 17:32 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: ext2-devel, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	linux-fsdevel, Badari Pulavarty, suparna, tytso

On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote:
> On Jul 17, 2005  10:40 -0700, Mingming Cao wrote:
> > @@ -373,6 +373,7 @@ struct ext3_inode {
> >  #define EXT3_MOUNT_BARRIER		0x20000 /* Use block barriers */
> >  #define EXT3_MOUNT_NOBH			0x40000 /* No bufferheads */
> >  #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
> > + #define EXT3_MOUNT_DELAYED_ALLOC	0xC0000 /* Delayed Allocation */
> 
> This doesn't make sense.  DELAYED_ALLOC == QUOTA | NOBH?
> 


Ah... :-)  Badari used 0x80000 for DELAYED_ALLOC in the previous patch
(2.6.11 based). When moving those patches forward to 2.6.13-rc3, we found
the conflict with QUOTA and obviously picked up a wrong value.
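
Since these are bit flags, 0xC0000 is literally EXT3_MOUNT_NOBH |
EXT3_MOUNT_QUOTA; presumably the fix is simply the next free bit:

	#define EXT3_MOUNT_DELAYED_ALLOC	0x100000 /* Delayed Allocation */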

> > +     {Opt_delayed_alloc, "delalloc"},
> 
> Is this a replacement for Alex's delalloc code?  We also use delalloc for
> that code and if they are not interchangeable it will cause confusion
> about which one is in use.
> 

Okay, I will think of a new name for this feature to avoid confusion.
Alex's delalloc is bound to the extent tree structure, so it is hard to
adopt directly to the current ext3 layout; I'd say this work done by
Badari (inspired by Alex's work) is a different implementation.

> > +     if (test_opt(sb, DELAYED_ALLOC)) {
> > +             if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
> > +                     printk(KERN_WARNING "EXT3-fs: Ignoring delall option - "
> > +                             "its supported only with writeback mode\n");
> 
> Should be "ignoring delalloc option".
>  
Fixed. 


Thanks for looking at this.

Mingming



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3
  2005-07-18  1:47   ` [Ext2-devel] " Andreas Dilger
  2005-07-18 17:32     ` Mingming Cao
@ 2005-07-19  0:25     ` Badari Pulavarty
  1 sibling, 0 replies; 11+ messages in thread
From: Badari Pulavarty @ 2005-07-19  0:25 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Mingming Cao, ext2-devel, Andrew Morton, Stephen C. Tweedie,
	linux-kernel, linux-fsdevel, suparna, tytso

On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote:
> On Jul 17, 2005  10:40 -0700, Mingming Cao wrote:
> > @@ -373,6 +373,7 @@ struct ext3_inode {
> >  #define EXT3_MOUNT_BARRIER		0x20000 /* Use block barriers */
> >  #define EXT3_MOUNT_NOBH			0x40000 /* No bufferheads */
> >  #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
> > + #define EXT3_MOUNT_DELAYED_ALLOC	0xC0000 /* Delayed Allocation */
> 
> This doesn't make sense.  DELAYED_ALLOC == QUOTA | NOBH?

My fault. I will fix it.

> 
> > +     {Opt_delayed_alloc, "delalloc"},
> 
> Is this a replacement for Alex's delalloc code?  We also use delalloc for
> that code and if they are not interchangeable it will cause confusion
> about which one is in use.
> 

Well, basically the "delalloc" concept is the same - whether we use it on
the current ext3 layout or with the new extent layout doesn't matter.


> > +     if (test_opt(sb, DELAYED_ALLOC)) {
> > +             if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
> > +                     printk(KERN_WARNING "EXT3-fs: Ignoring delall option - "
> > +                             "its supported only with writeback mode\n");
> 
> Should be "ignoring delalloc option".

Yep.

Thanks,
Badari




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] [PATCH 2/4]delayed allocation for ext3
  2005-07-17 17:40 ` [RFC] [PATCH 2/4]delayed " Mingming Cao
  2005-07-18  1:47   ` [Ext2-devel] " Andreas Dilger
@ 2005-07-26 22:52   ` Andrew Morton
  2005-07-26 22:55     ` Badari Pulavarty
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2005-07-26 22:52 UTC (permalink / raw)
  To: cmm; +Cc: ext2-devel, sct, linux-kernel, linux-fsdevel, pbadari, suparna,
	tytso

Mingming Cao <cmm@us.ibm.com> wrote:
>
> Here is the updated patch from Badari for delayed allocation for ext3.
> Delayed allocation defers block allocation from prepare-write time to
> page writeout time. 

For data=writeback only, yes?

> ...
> --- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc	2005-07-14 23:15:34.866752480 -0700
> +++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-14 23:15:34.889748984 -0700
> @@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil
>  	handle_t *handle;
>  	int retries = 0;
>  
> +
> +	if (test_opt(inode->i_sb, DELAYED_ALLOC))
> +		return __nobh_prepare_write(page, from, to, ext3_get_block, 0);

Rather than performing this test on each ->prepare_write(), would it not be
better to set up a new set of address_space_operations for this mode?
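
Something like this, say (a sketch only - ext3_delalloc_prepare_write
is a placeholder for whatever the delalloc prepare_write variant ends
up being called), selected once when the aops are chosen instead of
testing the mount option on every call:

	static struct address_space_operations ext3_delalloc_aops = {
		.readpage	= ext3_readpage,
		.readpages	= ext3_readpages,
		.writepage	= ext3_writeback_writepage,
		.writepages	= ext3_writeback_writepages,
		.sync_page	= block_sync_page,
		.prepare_write	= ext3_delalloc_prepare_write,
		.commit_write	= ext3_writeback_commit_write,
	};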

__nobh_prepare_write() seems like a poor choice of name?

>  retry:
>  	handle = ext3_journal_start(inode, needed_blocks);
>  	if (IS_ERR(handle)) {
> @@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s
>  	else
>  		ret = generic_commit_write(file, page, from, to);
>  
> +	if (test_opt(inode->i_sb, DELAYED_ALLOC))
> +		return ret;
> +

Here too, perhaps.

> +		}
> +	}
>  	/*
>  	 * The journal_load will have done any necessary log recovery,
>  	 * so we can safely mount the rest of the filesystem now.
> diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c
> --- linux-2.6.12/fs/buffer.c~ext3-delalloc	2005-07-14 23:15:34.875751112 -0700
> +++ linux-2.6.12-ming/fs/buffer.c	2005-07-14 23:15:34.903746856 -0700
> @@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct 
>   * On entry, the page is fully not uptodate.
>   * On exit the page is fully uptodate in the areas outside (from,to)
>   */
> -int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> -			get_block_t *get_block)
> +int __nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> +			get_block_t *get_block, int create)

Suggest you make this static and update the comment.

>  {
>  	struct inode *inode = page->mapping->host;
>  	const unsigned blkbits = inode->i_blkbits;
> @@ -2370,10 +2370,8 @@ int nobh_prepare_write(struct page *page
>  		  block_start < PAGE_CACHE_SIZE;
>  		  block_in_page++, block_start += blocksize) {
>  		unsigned block_end = block_start + blocksize;
> -		int create;
>  
>  		map_bh.b_state = 0;
> -		create = 1;
>  		if (block_start >= to)
>  			create = 0;
>  		ret = get_block(inode, block_in_file + block_in_page,

What's going on here?  Seems that we'll call get_block() with `create=0'. 
Is there any point in doing that?  For delayed allocation we should be able
to skip get_block() altogether here and, err, delay it.

> +int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> + 				get_block_t *get_block)
> +{
> +	return __nobh_prepare_write(page, from, to, get_block, 1);
> +}
> +
> EXPORT_SYMBOL(nobh_prepare_write);

Here you add nobh_dalloc_prepare_write() and remember to export it to
modules this time ;)
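
i.e., presumably something like:

	int nobh_dalloc_prepare_write(struct page *page, unsigned from,
			unsigned to, get_block_t *get_block)
	{
		return __nobh_prepare_write(page, from, to, get_block, 0);
	}
	EXPORT_SYMBOL(nobh_dalloc_prepare_write);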



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] [PATCH 2/4]delayed allocation for ext3
  2005-07-26 22:52   ` Andrew Morton
@ 2005-07-26 22:55     ` Badari Pulavarty
  0 siblings, 0 replies; 11+ messages in thread
From: Badari Pulavarty @ 2005-07-26 22:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, ext2-devel, sct, lkml, linux-fsdevel, suparna, tytso

On Tue, 2005-07-26 at 15:52 -0700, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > Here is the updated patch from Badari for delayed allocation for ext3.
> > Delayed allocation defers block allocation from prepare-write time to
> > page writeout time. 
> 
> For data=writeback only, yes?

Yes.

> 
> > ...
> > --- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc	2005-07-14 23:15:34.866752480 -0700
> > +++ linux-2.6.12-ming/fs/ext3/inode.c	2005-07-14 23:15:34.889748984 -0700
> > @@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil
> >  	handle_t *handle;
> >  	int retries = 0;
> >  
> > +
> > +	if (test_opt(inode->i_sb, DELAYED_ALLOC))
> > +		return __nobh_prepare_write(page, from, to, ext3_get_block, 0);
> 
> Rather than performing this test on each ->prepare_write(), would it not be
> better to set up a new set of address_space_operations for this mode?
> 
> __nobh_prepare_write() seems like a poor choice of name?

You are correct. I was trying to minimize the changes to the interfaces.
Once we get it working, I will do that as part of the cleanups.

> 
> >  retry:
> >  	handle = ext3_journal_start(inode, needed_blocks);
> >  	if (IS_ERR(handle)) {
> > @@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s
> >  	else
> >  		ret = generic_commit_write(file, page, from, to);
> >  
> > +	if (test_opt(inode->i_sb, DELAYED_ALLOC))
> > +		return ret;
> > +
> 
> Here too, perhaps.
> 
> > +		}
> > +	}
> >  	/*
> >  	 * The journal_load will have done any necessary log recovery,
> >  	 * so we can safely mount the rest of the filesystem now.
> > diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c
> > --- linux-2.6.12/fs/buffer.c~ext3-delalloc	2005-07-14 23:15:34.875751112 -0700
> > +++ linux-2.6.12-ming/fs/buffer.c	2005-07-14 23:15:34.903746856 -0700
> > @@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct 
> >   * On entry, the page is fully not uptodate.
> >   * On exit the page is fully uptodate in the areas outside (from,to)
> >   */
> > -int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> > -			get_block_t *get_block)
> > +int __nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> > +			get_block_t *get_block, int create)
> 
> Suggest you make this static and update the comment.
> 

Sure.

> >  {
> >  	struct inode *inode = page->mapping->host;
> >  	const unsigned blkbits = inode->i_blkbits;
> > @@ -2370,10 +2370,8 @@ int nobh_prepare_write(struct page *page
> >  		  block_start < PAGE_CACHE_SIZE;
> >  		  block_in_page++, block_start += blocksize) {
> >  		unsigned block_end = block_start + blocksize;
> > -		int create;
> >  
> >  		map_bh.b_state = 0;
> > -		create = 1;
> >  		if (block_start >= to)
> >  			create = 0;
> >  		ret = get_block(inode, block_in_file + block_in_page,
> 
> What's going on here?  Seems that we'll call get_block() with `create=0'. 
> Is there any point in doing that?  For delayed allocation we should be able
> to skip get_block() altogether here and, err, delay it.

For delayed allocation, I need to delay the block allocation - but I
still need to do a get_block() read (create == 0) to read the data from
the block if the block already exists.
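
As an annotated sketch of that loop body under delalloc (illustration
only, mirroring the hunk quoted above with create passed in as 0):

	map_bh.b_state = 0;
	if (block_start >= to)
		create = 0;
	/* create == 0: map an existing block, never allocate one */
	ret = get_block(inode, block_in_file + block_in_page,
			&map_bh, create);
	if (buffer_mapped(&map_bh)) {
		/* block exists: read it in so the page outside
		 * (from, to) can be brought uptodate */
	} else {
		/* hole: leave it unmapped, writepages() will allocate
		 * at writeout time */
	}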

> 
> > +int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
> > + 				get_block_t *get_block)
> > +{
> > +	return __nobh_prepare_write(page, from, to, get_block, 1);
> > +}
> > +
> > EXPORT_SYMBOL(nobh_prepare_write);
> 
> Here you add nobh_dalloc_prepare_write() and remember to export it to
> modules this time ;)
> 

Will do.

Thanks,
Badari




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2005-07-26 22:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1110839154.24286.302.camel@dyn318077bld.beaverton.ibm.com>
2005-07-17 17:40 ` [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3 Mingming Cao
2005-07-17 17:45   ` Mingming Cao
2005-07-17 17:40 ` [RFC] [PATCH 1/4]Multiple block " Mingming Cao
2005-07-17 17:40 ` [RFC] [PATCH 2/4]delayed " Mingming Cao
2005-07-18  1:47   ` [Ext2-devel] " Andreas Dilger
2005-07-18 17:32     ` Mingming Cao
2005-07-19  0:25     ` Badari Pulavarty
2005-07-26 22:52   ` Andrew Morton
2005-07-26 22:55     ` Badari Pulavarty
2005-07-17 17:40 ` [RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages Mingming Cao
2005-07-17 17:41 ` [RFC] [PATCH 4/4]add ext3 writeback writepages Mingming Cao
