public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment
@ 2008-05-21  6:52 Hisashi Hifumi
  2008-05-21  7:19 ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-21  6:52 UTC (permalink / raw)
  To: akpm, linux-kernel

Hi.

When we read some part of a file through pagecache, if there is a pagecache 
of corresponding index but this page is not uptodate, read IO is issued and
this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is room
for improvement on pagesize != blocksize environment. Because in this case
a page can have multiple buffers and even if a page is not uptodate, some buffers 
can be uptodate. So I suggest that when all buffers which correspond to a part
of a file that we want to read are uptodate, use this pagecache and copy data
from this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.

I did a performance test using the sysbench.

#sysbench --num-threads=4 --max-requests=120000 --test=fileio --file-num=1 --file-block-size=1K --file-total-size=100M --file-test-mode=rndrw --file-fsync-freq=0 --file-rw-ratio=0.5 run

The result was:

	-- 2.6.26-rc3
	Operations performed:  40002 Read, 79998 Write, 1 Other = 120001 Total
	Read 39.064Mb  Written 78.123Mb  Total transferred 117.19Mb  (375Kb/sec)
	  375.00 Requests/sec executed

	Test execution summary:
	    total time:                          320.0027s
	    total number of events:              120000
	    total time taken by event execution: 1231.5564
	    per-request statistics:
	         min:                            0.0000s
	         avg:                            0.0103s
	         max:                            2.7605s
	         approx.  95 percentile:         0.0381s


	-- 2.6.26-rc3-patched
	Operations performed:  40002 Read, 79998 Write, 1 Other = 120001 Total
	Read 39.064Mb  Written 78.123Mb  Total transferred 117.19Mb  (409.78Kb/sec)
	  409.78 Requests/sec executed

	Test execution summary:
	    total time:                          292.8406s
	    total number of events:              120000
	    total time taken by event execution: 1106.3995
	    per-request statistics:
	         min:                            0.0000s
	         avg:                            0.0092s
	         max:                            3.7366s
	         approx.  95 percentile:         0.0327s

 
	arch:i386 
	filesystem:ext3
	blocksize:1024 bytes
	Memory: 1GB

Random read/write throughput was somewhat improved with following patch.
Thanks.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
--- linux-2.6.26-rc3.org/fs/buffer.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/buffer.c	2008-05-19 14:29:25.000000000 +0900
@@ -2084,6 +2084,48 @@ int generic_write_end(struct file *file,
 EXPORT_SYMBOL(generic_write_end);
 
 /*
+ * check_buffers_uptodate checks whether buffers within a page are
+ * uptodate or not.
+ *
+ * Returns true if all buffers which correspond to a file portion
+ * we want to read are uptodate.
+ */
+int check_buffers_uptodate(unsigned long from,
+			read_descriptor_t *desc, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long block_start, block_end, blocksize;
+	unsigned long to;
+	struct buffer_head *bh, *head;
+	int ret = 1;
+
+	blocksize = 1 << inode->i_blkbits;
+	to = from + desc->count;
+	if (to > PAGE_CACHE_SIZE)
+		to = PAGE_CACHE_SIZE;
+	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
+		return 0;
+
+	head = page_buffers(page);
+
+	for (bh = head, block_start = 0; bh != head || !block_start;
+	     block_start = block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to)
+			continue;
+		else {
+			if (!buffer_uptodate(bh)) {
+				ret = 0;
+				break;
+			}
+			if (block_end >= to)
+				break;
+		}
+	}
+	return ret;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
--- linux-2.6.26-rc3.org/include/linux/buffer_head.h	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/include/linux/buffer_head.h	2008-05-19 12:13:46.000000000 +0900
@@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int check_buffers_uptodate(unsigned long from,
+			read_descriptor_t *desc, struct page *page);
 int block_write_begin(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page **, void **, get_block_t*);
diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
--- linux-2.6.26-rc3.org/mm/filemap.c	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/mm/filemap.c	2008-05-19 14:29:23.000000000 +0900
@@ -932,8 +932,16 @@ find_page:
 					ra, filp, page,
 					index, last_index - index);
 		}
-		if (!PageUptodate(page))
-			goto page_not_up_to_date;
+		if (!PageUptodate(page)) {
+			if (inode->i_blkbits == PAGE_CACHE_SHIFT)
+				goto page_not_up_to_date;
+			if (TestSetPageLocked(page))
+				goto page_not_up_to_date;
+			if (!page_has_buffers(page) ||
+			      !check_buffers_uptodate(offset, desc, page))
+				goto page_not_up_to_date_locked;
+			unlock_page(page);
+		}
 page_ok:
 		/*
 		* i_size must be checked after we know the page is Uptodate.
@@ -1003,6 +1011,7 @@ page_not_up_to_date:
 		if (lock_page_killable(page))
 			goto readpage_eio;
 
+page_not_up_to_date_locked:
 		/* Did it get truncated before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment
  2008-05-21  6:52 [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment Hisashi Hifumi
@ 2008-05-21  7:19 ` Andrew Morton
  2008-05-21  7:38   ` Andrew Morton
  2008-05-22  7:31   ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Hisashi Hifumi
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2008-05-21  7:19 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: linux-kernel

On Wed, 21 May 2008 15:52:04 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> Hi.
> 
> When we read some part of a file through pagecache, if there is a pagecache 
> of corresponding index but this page is not uptodate, read IO is issued and
> this page will be uptodate.
> I think this is good for pagesize == blocksize environment but there is room
> for improvement on pagesize != blocksize environment. Because in this case
> a page can have multiple buffers and even if a page is not uptodate, some buffers 
> can be uptodate. So I suggest that when all buffers which correspond to a part
> of a file that we want to read are uptodate, use this pagecache and copy data
> from this pagecache to user buffer even if a page is not uptodate. This can
> reduce read IO and improve system throughput.

I suppose that makes sense.

> I did a performance test using the sysbench.

That's not a terribly good benchmark, IMO.  It's too complex.

To work out the best-case for a change like this I'd suggest a
microbenchmark which does something such as seeking all around a file
doing single-byte reads.

Then one should think up a benchmark which demonstrates the worst-case,
such as reading one-byte-quantities from a file at offsets 0, 0x2000,
0x4000, 0x6000, ...  and then read more one-byte-quantities at offsets
0x1000, 0x3000, 0x5000, etc.  That would be a pretty cruel comparison,
but as one tosses in more such artificial worklaods, one is in a better
position to work out whether the change is an aggregate benefit.

The results from a great big lumped-together benchmark such as sysbench 
aren't a lot of use to us in predicting how effective this change will
be across all the workloads which the kernel implements.

> @@ -932,8 +932,16 @@ find_page:
>  					ra, filp, page,
>  					index, last_index - index);
>  		}
> -		if (!PageUptodate(page))
> -			goto page_not_up_to_date;
> +		if (!PageUptodate(page)) {
> +			if (inode->i_blkbits == PAGE_CACHE_SHIFT)
> +				goto page_not_up_to_date;
> +			if (TestSetPageLocked(page))
> +				goto page_not_up_to_date;
> +			if (!page_has_buffers(page) ||
> +			      !check_buffers_uptodate(offset, desc, page))

We shouldn't do this.

> +				goto page_not_up_to_date_locked;
> +			unlock_page(page);
> +		}

See, the code which you have here is assuming that if PagePrivate is
set, then the thing which is at page.private is a ring of buffer_heads.

But this code (do_generic_file_read) doesn't know that!  Take a look at
afs, nfs, perhaps other filesystems, grep for set_page_private().

Only the address_space implementation (ie: the filesystem) knows
whether page.private holds buffer_heads and only the
address_space_operations functions are allowed to call into library
functions which treat page.private as a buffer_head ring.

Now, your code _may_ not crash, because perhaps there is no filesystem
which puts something else into page.private which also uses
do_generic_file_read().  But it's still wrong.

I guess a suitable fix might be to implement the above using a new
address_space_operations callback:

	if (PagePrivate(page) && aops->is_partially_uptodate) {
		if (aops->is_partially_uptodate(page, desc, offset))
			<OK, we can copy the data>

then implement a generic_file_is_partially_uptodate() in fs/buffer.c
and wire that up in the filesystems.

Note that things like network filesystems can then implement this also.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment
  2008-05-21  7:19 ` Andrew Morton
@ 2008-05-21  7:38   ` Andrew Morton
  2008-05-22  7:31   ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Hisashi Hifumi
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2008-05-21  7:38 UTC (permalink / raw)
  To: Hisashi Hifumi, linux-kernel

On Wed, 21 May 2008 00:19:30 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> I guess a suitable fix might be to implement the above using a new
> address_space_operations callback:
> 
> 	if (PagePrivate(page) && aops->is_partially_uptodate) {

This is a bit presumptuous.  A filesytem could in theory be able to
work out if a section of a page is uptodate without necessarily
maintaining per-page metadata at page.private.

So one really should just do

	if (aops->is_partially_uptodate)

however it's hard to conceive how a filesystem could sanely do this
_without_ putting data at page.private, so the PagePrivate() test could
perhaps be retained as a microoptimisation, as long as it is suitably
commented.

otoh, non-zero aops->is_partially_uptodate might well be less common
than non-zero PagePrivate(), so perhaps these should be tested in the
other order.  Not sure.

> 		if (aops->is_partially_uptodate(page, desc, offset))
> 			<OK, we can copy the data>
> 
> then implement a generic_file_is_partially_uptodate() in fs/buffer.c

Make that block_is_partially_uptodate().

> and wire that up in the filesystems.
> 
> Note that things like network filesystems can then implement this also.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment
  2008-05-21  7:19 ` Andrew Morton
  2008-05-21  7:38   ` Andrew Morton
@ 2008-05-22  7:31   ` Hisashi Hifumi
  2008-05-22  8:03     ` Andrew Morton
  1 sibling, 1 reply; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-22  7:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Thank you for your comment.

At 16:19 08/05/21, Andrew Morton wrote:
>That's not a terribly good benchmark, IMO.  It's too complex.
>
>To work out the best-case for a change like this I'd suggest a
>microbenchmark which does something such as seeking all around a file
>doing single-byte reads.
>

I will try microbenchmark later.

>See, the code which you have here is assuming that if PagePrivate is
>set, then the thing which is at page.private is a ring of buffer_heads.
>
>But this code (do_generic_file_read) doesn't know that!  Take a look at
>afs, nfs, perhaps other filesystems, grep for set_page_private().
>
>Only the address_space implementation (ie: the filesystem) knows
>whether page.private holds buffer_heads and only the
>address_space_operations functions are allowed to call into library
>functions which treat page.private as a buffer_head ring.
>
>Now, your code _may_ not crash, because perhaps there is no filesystem
>which puts something else into page.private which also uses
>do_generic_file_read().  But it's still wrong.
>
>I guess a suitable fix might be to implement the above using a new
>address_space_operations callback:
>
>	if (PagePrivate(page) && aops->is_partially_uptodate) {
>		if (aops->is_partially_uptodate(page, desc, offset))
>			<OK, we can copy the data>
>
>then implement a generic_file_is_partially_uptodate() in fs/buffer.c
>and wire that up in the filesystems.

I modified my patch based on your comment.

I added is_partially_uptodate method to address_space_operations, and
I registered block_is_partially_uptodate in fs/buffer.c to ext2,ext3,ext4's aops.
Thanks.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
--- linux-2.6.26-rc3.org/fs/buffer.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/buffer.c	2008-05-22 13:23:29.000000000 +0900
@@ -2084,6 +2084,51 @@ int generic_write_end(struct file *file,
 EXPORT_SYMBOL(generic_write_end);
 
 /*
+ * block_is_partially_uptodate checks whether buffers within a page are
+ * uptodate or not.
+ *
+ * Returns true if all buffers which correspond to a file portion
+ * we want to read are uptodate.
+ */
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+					unsigned long from)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long block_start, block_end, blocksize;
+	unsigned long to;
+	struct buffer_head *bh, *head;
+	int ret = 1;
+
+	if (!page_has_buffers(page))
+		return 0;
+
+	blocksize = 1 << inode->i_blkbits;
+	to = from + desc->count;
+	if (to > PAGE_CACHE_SIZE)
+		to = PAGE_CACHE_SIZE;
+	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
+		return 0;
+
+	head = page_buffers(page);
+
+	for (bh = head, block_start = 0; bh != head || !block_start;
+	     block_start = block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to)
+			continue;
+		else {
+			if (!buffer_uptodate(bh)) {
+				ret = 0;
+				break;
+			}
+			if (block_end >= to)
+				break;
+		}
+	}
+	return ret;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -Nrup linux-2.6.26-rc3.org/fs/ext2/inode.c linux-2.6.26-rc3/fs/ext2/inode.c
--- linux-2.6.26-rc3.org/fs/ext2/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext2/inode.c	2008-05-22 12:39:46.000000000 +0900
@@ -791,6 +791,7 @@ const struct address_space_operations ex
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate	= block_is_partially_uptodate,
 };
 
 const struct address_space_operations ext2_aops_xip = {
diff -Nrup linux-2.6.26-rc3.org/fs/ext3/inode.c linux-2.6.26-rc3/fs/ext3/inode.c
--- linux-2.6.26-rc3.org/fs/ext3/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext3/inode.c	2008-05-22 13:10:40.000000000 +0900
@@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_ordered_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_ordered_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_writeback_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_writeback_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_journalled_write_end,
-	.set_page_dirty	= ext3_journalled_set_page_dirty,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_journalled_write_end,
+	.set_page_dirty		= ext3_journalled_set_page_dirty,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext3_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc3.org/fs/ext4/inode.c linux-2.6.26-rc3/fs/ext4/inode.c
--- linux-2.6.26-rc3.org/fs/ext4/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext4/inode.c	2008-05-22 13:13:17.000000000 +0900
@@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_ordered_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_ordered_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_writeback_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_writeback_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_writeback_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_journalled_write_end,
-	.set_page_dirty	= ext4_journalled_set_page_dirty,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_journalled_write_end,
+	.set_page_dirty		= ext4_journalled_set_page_dirty,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext4_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
--- linux-2.6.26-rc3.org/include/linux/buffer_head.h	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/include/linux/buffer_head.h	2008-05-22 11:22:26.000000000 +0900
@@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+				unsigned long from);
 int block_write_begin(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page **, void **, get_block_t*);
diff -Nrup linux-2.6.26-rc3.org/include/linux/fs.h linux-2.6.26-rc3/include/linux/fs.h
--- linux-2.6.26-rc3.org/include/linux/fs.h	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/include/linux/fs.h	2008-05-22 15:13:39.000000000 +0900
@@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
 	return i->count;
 }
 
+/*
+ * "descriptor" for what we're up to with a read.
+ * This allows us to use the same read code yet
+ * have multiple different users of the data that
+ * we read from a file.
+ *
+ * The simplest case just copies the data to user
+ * mode.
+ */
+typedef struct {
+	size_t written;
+	size_t count;
+	union {
+		char __user *buf;
+		void *data;
+	} arg;
+	int error;
+} read_descriptor_t;
+
+typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
+		unsigned long, unsigned long);
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -480,6 +501,8 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+					unsigned long);
 };
 
 /*
@@ -1186,27 +1209,6 @@ struct block_device_operations {
 	struct module *owner;
 };
 
-/*
- * "descriptor" for what we're up to with a read.
- * This allows us to use the same read code yet
- * have multiple different users of the data that
- * we read from a file.
- *
- * The simplest case just copies the data to user
- * mode.
- */
-typedef struct {
-	size_t written;
-	size_t count;
-	union {
-		char __user * buf;
-		void *data;
-	} arg;
-	int error;
-} read_descriptor_t;
-
-typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
-
 /* These macros are for out of kernel modules to test that
  * the kernel supports the unlocked_ioctl and compat_ioctl
  * fields in struct file_operations. */
diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
--- linux-2.6.26-rc3.org/mm/filemap.c	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/mm/filemap.c	2008-05-22 12:34:58.000000000 +0900
@@ -932,8 +932,17 @@ find_page:
 					ra, filp, page,
 					index, last_index - index);
 		}
-		if (!PageUptodate(page))
-			goto page_not_up_to_date;
+		if (!PageUptodate(page)) {
+			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
+					!mapping->a_ops->is_partially_uptodate)
+				goto page_not_up_to_date;
+			if (TestSetPageLocked(page))
+				goto page_not_up_to_date;
+			if (!mapping->a_ops->is_partially_uptodate(page,
+								desc, offset))
+				goto page_not_up_to_date_locked;
+			unlock_page(page);
+		}
 page_ok:
 		/*
 		 * i_size must be checked after we know the page is Uptodate.
@@ -1003,6 +1012,7 @@ page_not_up_to_date:
 		if (lock_page_killable(page))
 			goto readpage_eio;
 
+page_not_up_to_date_locked:
 		/* Did it get truncated before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment
  2008-05-22  7:31   ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Hisashi Hifumi
@ 2008-05-22  8:03     ` Andrew Morton
  2008-05-22 12:08       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Andrew Morton @ 2008-05-22  8:03 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: linux-kernel

On Thu, 22 May 2008 16:31:15 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> I modified my patch based on your comment.

Looks pretty close to me.

> I added is_partially_uptodate method to address_space_operations, and
> I registered block_is_partially_uptodate in fs/buffer.c to ext2,ext3,ext4's aops.
> Thanks.
> 
> Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
> 
> diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
> --- linux-2.6.26-rc3.org/fs/buffer.c	2008-05-19 11:35:10.000000000 +0900
> +++ linux-2.6.26-rc3/fs/buffer.c	2008-05-22 13:23:29.000000000 +0900
> @@ -2084,6 +2084,51 @@ int generic_write_end(struct file *file,
>  EXPORT_SYMBOL(generic_write_end);
>  
>  /*
> + * block_is_partially_uptodate checks whether buffers within a page are
> + * uptodate or not.
> + *
> + * Returns true if all buffers which correspond to a file portion
> + * we want to read are uptodate.
> + */
> +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
> +					unsigned long from)
> +{
> +	struct inode *inode = page->mapping->host;
> +	unsigned long block_start, block_end, blocksize;
> +	unsigned long to;

These functions can get quite confusing, and the appropriate use of
types can help.

For offsets within a page I'd suggest `unsigned'.  I expect that the
32-bit quantities are more efficient on at least some 64-bit machines.

For page indices use pgoff_t.

For byte-offset-within-a-file use loff_t.

For byte-range-within-a-file use size_t.

Be careful about overflows and truncation when shifting, comparing,
assigning, etc.  Be sure that the code is still correct outside the
4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.

It doesn't hurt at all to put each variable definition on its own line
with a little comment off to the right.  It helps to mention what the
variable's units are too - bytes/pages/sectors/etc.

> +	struct buffer_head *bh, *head;
> +	int ret = 1;
> +
> +	if (!page_has_buffers(page))
> +		return 0;
> +
> +	blocksize = 1 << inode->i_blkbits;
> +	to = from + desc->count;
> +	if (to > PAGE_CACHE_SIZE)
> +		to = PAGE_CACHE_SIZE;
> +	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
> +		return 0;
> +
> +	head = page_buffers(page);
> +
> +	for (bh = head, block_start = 0; bh != head || !block_start;
> +	     block_start = block_end, bh = bh->b_this_page) {
> +		block_end = block_start + blocksize;
> +		if (block_end <= from || block_start >= to)
> +			continue;

Oh dear, you've copied one of my least favourite parts of the VFS :(
Just look at it!

Do you think it can be simplified or clarified?

> +		else {
> +			if (!buffer_uptodate(bh)) {
> +				ret = 0;
> +				break;
> +			}
> +			if (block_end >= to)
> +				break;
> +		}
> +	}
> +	return ret;
> +}

We'll need the EXPORT_SYMBOL() here.

> --- linux-2.6.26-rc3.org/fs/ext2/inode.c	2008-05-19 11:35:10.000000000 +0900
> +++ linux-2.6.26-rc3/fs/ext2/inode.c	2008-05-22 12:39:46.000000000 +0900
> @@ -791,6 +791,7 @@ const struct address_space_operations ex
>  	.direct_IO		= ext2_direct_IO,
>  	.writepages		= ext2_writepages,
>  	.migratepage		= buffer_migrate_page,
> +	.is_partially_uptodate	= block_is_partially_uptodate,

Sometime we should update Documentation/Locking to reflect the new
address_space_operation.


One other thing we should think about here is the `nobh' mode which the
extX filesystems support (although I have a feeling that Nick might
have broken this ;)) We also have data=ordered, data=writeback and
data=journal to think about.  This optimisation might not be
appropriate at all to data=journal mode, but I haven't looked into
that.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment
  2008-05-22  8:03     ` Andrew Morton
@ 2008-05-22 12:08       ` Hisashi Hifumi
  2008-05-23 22:51       ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Jan Kara
  2008-05-27  8:38       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
  2 siblings, 0 replies; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-22 12:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hi.

>> +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
>> +					unsigned long from)
>> +{
>> +	struct inode *inode = page->mapping->host;
>> +	unsigned long block_start, block_end, blocksize;
>> +	unsigned long to;
>
>These functions can get quite confusing, and the appropriate use of
>types can help.
>
>For offsets within a page I'd suggest `unsigned'.  I expect that the
>32-bit quantities are more efficient on at least some 64-bit machines.
>
>For page indices use pgoff_t.
>
>For byte-offset-within-a-file use loff_t.
>
>For byte-range-within-a-file use size_t.
>
>Be careful about overflows and truncation when shifting, comparing,
>assigning, etc.  Be sure that the code is still correct outside the
>4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.
>
>It doesn't hurt at all to put each variable definition on its own line
>with a little comment off to the right.  It helps to mention what the
>variable's units are too - bytes/pages/sectors/etc.
>

>> +	head = page_buffers(page);
>> +
>> +	for (bh = head, block_start = 0; bh != head || !block_start;
>> +	     block_start = block_end, bh = bh->b_this_page) {
>> +		block_end = block_start + blocksize;
>> +		if (block_end <= from || block_start >= to)
>> +			continue;
>
>Oh dear, you've copied one of my least favourite parts of the VFS :(
>Just look at it!
>
>Do you think it can be simplified or clarified?

I cleaned up block_is_partially_uptodate.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
--- linux-2.6.26-rc3.org/fs/buffer.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/buffer.c	2008-05-22 19:48:35.000000000 +0900
@@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
 EXPORT_SYMBOL(generic_write_end);
 
 /*
+ * block_is_partially_uptodate checks whether buffers within a page are
+ * uptodate or not.
+ *
+ * Returns true if all buffers which correspond to a file portion
+ * we want to read are uptodate.
+ */
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+					unsigned long from)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned block_start, block_end, blocksize;
+	unsigned to;
+	struct buffer_head *bh, *head;
+	int ret = 1;
+
+	if (!page_has_buffers(page))
+		return 0;
+
+	blocksize = 1 << inode->i_blkbits;
+	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
+	to = from + to;
+	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
+		return 0;
+
+	head = page_buffers(page);
+	bh = head;
+	block_start = 0;
+	do {
+		block_end = block_start + blocksize;
+		if (block_end > from && block_start < to) {
+			if (!buffer_uptodate(bh)) {
+				ret = 0;
+				break;
+			}
+			if (block_end >= to)
+				break;
+		}
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	return ret;
+}
+EXPORT_SYMBOL(block_is_partially_uptodate);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -Nrup linux-2.6.26-rc3.org/fs/ext2/inode.c linux-2.6.26-rc3/fs/ext2/inode.c
--- linux-2.6.26-rc3.org/fs/ext2/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext2/inode.c	2008-05-22 12:39:46.000000000 +0900
@@ -791,6 +791,7 @@ const struct address_space_operations ex
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate	= block_is_partially_uptodate,
 };
 
 const struct address_space_operations ext2_aops_xip = {
diff -Nrup linux-2.6.26-rc3.org/fs/ext3/inode.c linux-2.6.26-rc3/fs/ext3/inode.c
--- linux-2.6.26-rc3.org/fs/ext3/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext3/inode.c	2008-05-22 13:10:40.000000000 +0900
@@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_ordered_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_ordered_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_writeback_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_writeback_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_journalled_write_end,
-	.set_page_dirty	= ext3_journalled_set_page_dirty,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_journalled_write_end,
+	.set_page_dirty		= ext3_journalled_set_page_dirty,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext3_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc3.org/fs/ext4/inode.c linux-2.6.26-rc3/fs/ext4/inode.c
--- linux-2.6.26-rc3.org/fs/ext4/inode.c	2008-05-19 11:35:10.000000000 +0900
+++ linux-2.6.26-rc3/fs/ext4/inode.c	2008-05-22 13:13:17.000000000 +0900
@@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_ordered_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_ordered_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_writeback_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_writeback_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_writeback_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_journalled_write_end,
-	.set_page_dirty	= ext4_journalled_set_page_dirty,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_journalled_write_end,
+	.set_page_dirty		= ext4_journalled_set_page_dirty,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext4_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
--- linux-2.6.26-rc3.org/include/linux/buffer_head.h	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/include/linux/buffer_head.h	2008-05-22 19:46:05.000000000 +0900
@@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+				unsigned long from);
 int block_write_begin(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page **, void **, get_block_t*);
diff -Nrup linux-2.6.26-rc3.org/include/linux/fs.h linux-2.6.26-rc3/include/linux/fs.h
--- linux-2.6.26-rc3.org/include/linux/fs.h	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/include/linux/fs.h	2008-05-22 15:13:39.000000000 +0900
@@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
 	return i->count;
 }
 
+/*
+ * "descriptor" for what we're up to with a read.
+ * This allows us to use the same read code yet
+ * have multiple different users of the data that
+ * we read from a file.
+ *
+ * The simplest case just copies the data to user
+ * mode.
+ */
+typedef struct {
+	size_t written;
+	size_t count;
+	union {
+		char __user *buf;
+		void *data;
+	} arg;
+	int error;
+} read_descriptor_t;
+
+typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
+		unsigned long, unsigned long);
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -480,6 +501,8 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+					unsigned long);
 };
 
 /*
@@ -1186,27 +1209,6 @@ struct block_device_operations {
 	struct module *owner;
 };
 
-/*
- * "descriptor" for what we're up to with a read.
- * This allows us to use the same read code yet
- * have multiple different users of the data that
- * we read from a file.
- *
- * The simplest case just copies the data to user
- * mode.
- */
-typedef struct {
-	size_t written;
-	size_t count;
-	union {
-		char __user * buf;
-		void *data;
-	} arg;
-	int error;
-} read_descriptor_t;
-
-typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
-
 /* These macros are for out of kernel modules to test that
  * the kernel supports the unlocked_ioctl and compat_ioctl
  * fields in struct file_operations. */
diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
--- linux-2.6.26-rc3.org/mm/filemap.c	2008-05-19 11:35:11.000000000 +0900
+++ linux-2.6.26-rc3/mm/filemap.c	2008-05-22 12:34:58.000000000 +0900
@@ -932,8 +932,17 @@ find_page:
 					ra, filp, page,
 					index, last_index - index);
 		}
-		if (!PageUptodate(page))
-			goto page_not_up_to_date;
+		if (!PageUptodate(page)) {
+			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
+					!mapping->a_ops->is_partially_uptodate)
+				goto page_not_up_to_date;
+			if (TestSetPageLocked(page))
+				goto page_not_up_to_date;
+			if (!mapping->a_ops->is_partially_uptodate(page,
+								desc, offset))
+				goto page_not_up_to_date_locked;
+			unlock_page(page);
+		}
 page_ok:
 		/*
 		 * i_size must be checked after we know the page is Uptodate.
@@ -1003,6 +1012,7 @@ page_not_up_to_date:
 		if (lock_page_killable(page))
 			goto readpage_eio;
 
+page_not_up_to_date_locked:
 		/* Did it get truncated before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment
  2008-05-22  8:03     ` Andrew Morton
  2008-05-22 12:08       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
@ 2008-05-23 22:51       ` Jan Kara
  2008-05-26  7:20         ` Hisashi Hifumi
  2008-05-27  8:38       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
  2 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2008-05-23 22:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hisashi Hifumi, linux-kernel

> On Thu, 22 May 2008 16:31:15 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:
> 
> One other thing we should think about here is the `nobh' mode which the
> extX filesystems support (although I have a feeling that Nick might
> have broken this ;)) We also have data=ordered, data=writeback and
> data=journal to think about.  This optimisation might not be
> appropriate at all to data=journal mode, but I haven't looked into
> that.
  Why do you think so? We mess with buffer_dirty bits in data=journal
mode but as far as I understand the patch, it only does not read the
page if the part we need is marked as uptodate in buffers. And this
should be safe.
  But I'm slightly confused that the patch helps because I've always
thought that mpage_readpage() (which is what we end up calling from
do_generic_mapping_read()) always reads the whole page. Thus either all
buffers in the page or none of them are uptodate... So what do I miss
here?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment
  2008-05-23 22:51       ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Jan Kara
@ 2008-05-26  7:20         ` Hisashi Hifumi
  2008-05-26 11:40           ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-26  7:20 UTC (permalink / raw)
  To: Jan Kara, Andrew Morton; +Cc: linux-kernel


>  But I'm slightly confused that the patch helps because I've always
>thought that mpage_readpage() (which is what we end up calling from
>do_generic_mapping_read()) always reads the whole page. Thus either all
>buffers in the page or none of them are uptodate... So what do I miss
>here?
>
>								Honza

On ext3/4, a file is written through buffer/block. So if a page has multiple 
buffers, buffers can be uptodate partially especially under random write workloads.
See __block_prepare_write and __block_commit_write.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment
  2008-05-26  7:20         ` Hisashi Hifumi
@ 2008-05-26 11:40           ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2008-05-26 11:40 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: Andrew Morton, linux-kernel

On Mon 26-05-08 16:20:10, Hisashi Hifumi wrote:
> 
> >  But I'm slightly confused that the patch helps because I've always
> >thought that mpage_readpage() (which is what we end up calling from
> >do_generic_mapping_read()) always reads the whole page. Thus either all
> >buffers in the page or none of them are uptodate... So what do I miss
> >here?
> >
> >								Honza
> 
> On ext3/4, a file is written through buffer/block. So if a page has multiple 
> buffers, buffers can be uptodate partially especially under random write workloads.
> See __block_prepare_write and __block_commit_write.
  Ah, I see. Thanks for explanation.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment
  2008-05-22  8:03     ` Andrew Morton
  2008-05-22 12:08       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
  2008-05-23 22:51       ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Jan Kara
@ 2008-05-27  8:38       ` Hisashi Hifumi
  2008-05-27  8:51         ` Andrew Morton
  2 siblings, 1 reply; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-27  8:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hi Andrew.

>> +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
>> +					unsigned long from)
>> +{
>> +	struct inode *inode = page->mapping->host;
>> +	unsigned long block_start, block_end, blocksize;
>> +	unsigned long to;
>
>These functions can get quite confusing, and the appropriate use of
>types can help.
>
>For offsets within a page I'd suggest `unsigned'.  I expect that the
>32-bit quantities are more efficient on at least some 64-bit machines.
>
>For page indices use pgoff_t.
>
>For byte-offset-within-a-file use loff_t.
>
>For byte-range-within-a-file use size_t.
>
>Be careful about overflows and truncation when shifting, comparing,
>assigning, etc.  Be sure that the code is still correct outside the
>4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.
>
>It doesn't hurt at all to put each variable definition on its own line
>with a little comment off to the right.  It helps to mention what the
>variable's units are too - bytes/pages/sectors/etc.


>> +	head = page_buffers(page);
>> +
>> +	for (bh = head, block_start = 0; bh != head || !block_start;
>> +	     block_start = block_end, bh = bh->b_this_page) {
>> +		block_end = block_start + blocksize;
>> +		if (block_end <= from || block_start >= to)
>> +			continue;
>
>Oh dear, you've copied one of my least favourite parts of the VFS :(
>Just look at it!
>
>Do you think it can be simplified or clarified?


I cleaned up and simplified block_is_partially_uptodate.
Following patch is for linux-2.6.26-rc4.
Please review my patch.
Thanks.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.26-rc4.org/fs/buffer.c linux-2.6.26-rc4/fs/buffer.c
--- linux-2.6.26-rc4.org/fs/buffer.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/buffer.c	2008-05-27 14:38:20.000000000 +0900
@@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
 EXPORT_SYMBOL(generic_write_end);
 
 /*
+ * block_is_partially_uptodate checks whether buffers within a page are
+ * uptodate or not.
+ *
+ * Returns true if all buffers which correspond to a file portion
+ * we want to read are uptodate.
+ */
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+					unsigned long from)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned block_start, block_end, blocksize;
+	unsigned to;
+	struct buffer_head *bh, *head;
+	int ret = 1;
+
+	if (!page_has_buffers(page))
+		return 0;
+
+	blocksize = 1 << inode->i_blkbits;
+	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
+	to = from + to;
+	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
+		return 0;
+
+	head = page_buffers(page);
+	bh = head;
+	block_start = 0;
+	do {
+		block_end = block_start + blocksize;
+		if (block_end > from && block_start < to) {
+			if (!buffer_uptodate(bh)) {
+				ret = 0;
+				break;
+			}
+			if (block_end >= to)
+				break;
+		}
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	return ret;
+}
+EXPORT_SYMBOL(block_is_partially_uptodate);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -Nrup linux-2.6.26-rc4.org/fs/ext2/inode.c linux-2.6.26-rc4/fs/ext2/inode.c
--- linux-2.6.26-rc4.org/fs/ext2/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext2/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -791,6 +791,7 @@ const struct address_space_operations ex
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate	= block_is_partially_uptodate,
 };
 
 const struct address_space_operations ext2_aops_xip = {
diff -Nrup linux-2.6.26-rc4.org/fs/ext3/inode.c linux-2.6.26-rc4/fs/ext3/inode.c
--- linux-2.6.26-rc4.org/fs/ext3/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext3/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_ordered_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_ordered_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_writeback_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_writeback_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_journalled_write_end,
-	.set_page_dirty	= ext3_journalled_set_page_dirty,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_journalled_write_end,
+	.set_page_dirty		= ext3_journalled_set_page_dirty,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext3_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc4.org/fs/ext4/inode.c linux-2.6.26-rc4/fs/ext4/inode.c
--- linux-2.6.26-rc4.org/fs/ext4/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext4/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_ordered_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_ordered_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_writeback_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_writeback_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_writeback_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_journalled_write_end,
-	.set_page_dirty	= ext4_journalled_set_page_dirty,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_journalled_write_end,
+	.set_page_dirty		= ext4_journalled_set_page_dirty,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext4_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc4.org/include/linux/buffer_head.h linux-2.6.26-rc4/include/linux/buffer_head.h
--- linux-2.6.26-rc4.org/include/linux/buffer_head.h	2008-05-27 14:36:22.000000000 +0900
+++ linux-2.6.26-rc4/include/linux/buffer_head.h	2008-05-27 14:38:20.000000000 +0900
@@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+				unsigned long from);
 int block_write_begin(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page **, void **, get_block_t*);
diff -Nrup linux-2.6.26-rc4.org/include/linux/fs.h linux-2.6.26-rc4/include/linux/fs.h
--- linux-2.6.26-rc4.org/include/linux/fs.h	2008-05-27 14:36:22.000000000 +0900
+++ linux-2.6.26-rc4/include/linux/fs.h	2008-05-27 14:38:20.000000000 +0900
@@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
 	return i->count;
 }
 
+/*
+ * "descriptor" for what we're up to with a read.
+ * This allows us to use the same read code yet
+ * have multiple different users of the data that
+ * we read from a file.
+ *
+ * The simplest case just copies the data to user
+ * mode.
+ */
+typedef struct {
+	size_t written;
+	size_t count;
+	union {
+		char __user *buf;
+		void *data;
+	} arg;
+	int error;
+} read_descriptor_t;
+
+typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
+		unsigned long, unsigned long);
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -480,6 +501,8 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+					unsigned long);
 };
 
 /*
@@ -1186,27 +1209,6 @@ struct block_device_operations {
 	struct module *owner;
 };
 
-/*
- * "descriptor" for what we're up to with a read.
- * This allows us to use the same read code yet
- * have multiple different users of the data that
- * we read from a file.
- *
- * The simplest case just copies the data to user
- * mode.
- */
-typedef struct {
-	size_t written;
-	size_t count;
-	union {
-		char __user * buf;
-		void *data;
-	} arg;
-	int error;
-} read_descriptor_t;
-
-typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
-
 /* These macros are for out of kernel modules to test that
  * the kernel supports the unlocked_ioctl and compat_ioctl
  * fields in struct file_operations. */
diff -Nrup linux-2.6.26-rc4.org/mm/filemap.c linux-2.6.26-rc4/mm/filemap.c
--- linux-2.6.26-rc4.org/mm/filemap.c	2008-05-27 14:36:23.000000000 +0900
+++ linux-2.6.26-rc4/mm/filemap.c	2008-05-27 14:38:20.000000000 +0900
@@ -932,8 +932,17 @@ find_page:
 					ra, filp, page,
 					index, last_index - index);
 		}
-		if (!PageUptodate(page))
-			goto page_not_up_to_date;
+		if (!PageUptodate(page)) {
+			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
+					!mapping->a_ops->is_partially_uptodate)
+				goto page_not_up_to_date;
+			if (TestSetPageLocked(page))
+				goto page_not_up_to_date;
+			if (!mapping->a_ops->is_partially_uptodate(page,
+								desc, offset))
+				goto page_not_up_to_date_locked;
+			unlock_page(page);
+		}
 page_ok:
 		/*
 		 * i_size must be checked after we know the page is Uptodate.
@@ -1003,6 +1012,7 @@ page_not_up_to_date:
 		if (lock_page_killable(page))
 			goto readpage_eio;
 
+page_not_up_to_date_locked:
 		/* Did it get truncated before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment
  2008-05-27  8:38       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
@ 2008-05-27  8:51         ` Andrew Morton
  2008-05-27  9:34           ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Hisashi Hifumi
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2008-05-27  8:51 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: linux-kernel

On Tue, 27 May 2008 17:38:34 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> I cleaned up and simplified block_is_partially_uptodate.
> Following patch is for linux-2.6.26-rc4.
> Please review my patch.

We seem to have lost the changelog which means that the patch can't go
much further and it makes it considerably harder for people to review
the patch.

Please maintain the changelog alongside the patch and resend it (possibly after
suitable updates) along with each iteration of the patch.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment
  2008-05-27  8:51         ` Andrew Morton
@ 2008-05-27  9:34           ` Hisashi Hifumi
  2008-05-28 23:23             ` Andrew Morton
  2008-07-11 23:39             ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Andrew Morton
  0 siblings, 2 replies; 15+ messages in thread
From: Hisashi Hifumi @ 2008-05-27  9:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hi

At 17:51 08/05/27, Andrew Morton wrote:
>On Tue, 27 May 2008 17:38:34 +0900 Hisashi Hifumi 
><hifumi.hisashi@oss.ntt.co.jp> wrote:
>
>> I cleaned up and simplified block_is_partially_uptodate.
>> Following patch is for linux-2.6.26-rc4.
>> Please review my patch.
>
>We seem to have lost the changelog which means that the patch can't go
>much further and it makes it considerably harder for people to review
>the patch.
>
>Please maintain the changelog alongside the patch and resend it (possibly after
>suitable updates) along with each iteration of the patch.
>
>Thanks.

Sorry, I attach changelog.

When we read some part of a file through pagecache, if there is a pagecache 
of corresponding index but this page is not uptodate, read IO is issued and
this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is room
for improvement on pagesize != blocksize environment. Because in this case
a page can have multiple buffers and even if a page is not uptodate, some buffers 
can be uptodate. So I suggest that when all buffers which correspond to a part
of a file that we want to read are uptodate, use this pagecache and copy data
from this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.

v2: add new address_space_operations member is_partially_uptodate, and 
     block_is_partially_uptodate was registered to ext2/3/4's aops.
     modify do_generic_file_read to use this aops callback.
v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
     cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.26-rc4.org/fs/buffer.c linux-2.6.26-rc4/fs/buffer.c
--- linux-2.6.26-rc4.org/fs/buffer.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/buffer.c	2008-05-27 14:38:20.000000000 +0900
@@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
 EXPORT_SYMBOL(generic_write_end);
 
 /*
+ * block_is_partially_uptodate checks whether buffers within a page are
+ * uptodate or not.
+ *
+ * Returns true if all buffers which correspond to a file portion
+ * we want to read are uptodate.
+ */
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+					unsigned long from)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned block_start, block_end, blocksize;
+	unsigned to;
+	struct buffer_head *bh, *head;
+	int ret = 1;
+
+	if (!page_has_buffers(page))
+		return 0;
+
+	blocksize = 1 << inode->i_blkbits;
+	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
+	to = from + to;
+	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
+		return 0;
+
+	head = page_buffers(page);
+	bh = head;
+	block_start = 0;
+	do {
+		block_end = block_start + blocksize;
+		if (block_end > from && block_start < to) {
+			if (!buffer_uptodate(bh)) {
+				ret = 0;
+				break;
+			}
+			if (block_end >= to)
+				break;
+		}
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	return ret;
+}
+EXPORT_SYMBOL(block_is_partially_uptodate);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
diff -Nrup linux-2.6.26-rc4.org/fs/ext2/inode.c linux-2.6.26-rc4/fs/ext2/inode.c
--- linux-2.6.26-rc4.org/fs/ext2/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext2/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -791,6 +791,7 @@ const struct address_space_operations ex
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate	= block_is_partially_uptodate,
 };
 
 const struct address_space_operations ext2_aops_xip = {
diff -Nrup linux-2.6.26-rc4.org/fs/ext3/inode.c linux-2.6.26-rc4/fs/ext3/inode.c
--- linux-2.6.26-rc4.org/fs/ext3/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext3/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_ordered_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_ordered_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_writeback_write_end,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
-	.direct_IO	= ext3_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_writeback_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
-	.readpage	= ext3_readpage,
-	.readpages	= ext3_readpages,
-	.writepage	= ext3_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext3_write_begin,
-	.write_end	= ext3_journalled_write_end,
-	.set_page_dirty	= ext3_journalled_set_page_dirty,
-	.bmap		= ext3_bmap,
-	.invalidatepage	= ext3_invalidatepage,
-	.releasepage	= ext3_releasepage,
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_journalled_write_end,
+	.set_page_dirty		= ext3_journalled_set_page_dirty,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext3_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc4.org/fs/ext4/inode.c linux-2.6.26-rc4/fs/ext4/inode.c
--- linux-2.6.26-rc4.org/fs/ext4/inode.c	2008-05-27 14:36:21.000000000 +0900
+++ linux-2.6.26-rc4/fs/ext4/inode.c	2008-05-27 14:38:20.000000000 +0900
@@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_ordered_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_ordered_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_ordered_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_ordered_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_writeback_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_writeback_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_writeback_write_end,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
-	.direct_IO	= ext4_direct_IO,
-	.migratepage	= buffer_migrate_page,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_writeback_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_writeback_write_end,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.direct_IO		= ext4_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
-	.readpage	= ext4_readpage,
-	.readpages	= ext4_readpages,
-	.writepage	= ext4_journalled_writepage,
-	.sync_page	= block_sync_page,
-	.write_begin	= ext4_write_begin,
-	.write_end	= ext4_journalled_write_end,
-	.set_page_dirty	= ext4_journalled_set_page_dirty,
-	.bmap		= ext4_bmap,
-	.invalidatepage	= ext4_invalidatepage,
-	.releasepage	= ext4_releasepage,
+	.readpage		= ext4_readpage,
+	.readpages		= ext4_readpages,
+	.writepage		= ext4_journalled_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext4_write_begin,
+	.write_end		= ext4_journalled_write_end,
+	.set_page_dirty		= ext4_journalled_set_page_dirty,
+	.bmap			= ext4_bmap,
+	.invalidatepage		= ext4_invalidatepage,
+	.releasepage		= ext4_releasepage,
+	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
 void ext4_set_aops(struct inode *inode)
diff -Nrup linux-2.6.26-rc4.org/include/linux/buffer_head.h linux-2.6.26-rc4/include/linux/buffer_head.h
--- linux-2.6.26-rc4.org/include/linux/buffer_head.h	2008-05-27 14:36:22.000000000 +0900
+++ linux-2.6.26-rc4/include/linux/buffer_head.h	2008-05-27 14:38:20.000000000 +0900
@@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
+				unsigned long from);
 int block_write_begin(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page **, void **, get_block_t*);
diff -Nrup linux-2.6.26-rc4.org/include/linux/fs.h linux-2.6.26-rc4/include/linux/fs.h
--- linux-2.6.26-rc4.org/include/linux/fs.h	2008-05-27 14:36:22.000000000 +0900
+++ linux-2.6.26-rc4/include/linux/fs.h	2008-05-27 14:38:20.000000000 +0900
@@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
 	return i->count;
 }
 
+/*
+ * "descriptor" for what we're up to with a read.
+ * This allows us to use the same read code yet
+ * have multiple different users of the data that
+ * we read from a file.
+ *
+ * The simplest case just copies the data to user
+ * mode.
+ */
+typedef struct {
+	size_t written;
+	size_t count;
+	union {
+		char __user *buf;
+		void *data;
+	} arg;
+	int error;
+} read_descriptor_t;
+
+typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
+		unsigned long, unsigned long);
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -480,6 +501,8 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
+					unsigned long);
 };
 
 /*
@@ -1186,27 +1209,6 @@ struct block_device_operations {
 	struct module *owner;
 };
 
-/*
- * "descriptor" for what we're up to with a read.
- * This allows us to use the same read code yet
- * have multiple different users of the data that
- * we read from a file.
- *
- * The simplest case just copies the data to user
- * mode.
- */
-typedef struct {
-	size_t written;
-	size_t count;
-	union {
-		char __user * buf;
-		void *data;
-	} arg;
-	int error;
-} read_descriptor_t;
-
-typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
-
 /* These macros are for out of kernel modules to test that
  * the kernel supports the unlocked_ioctl and compat_ioctl
  * fields in struct file_operations. */
diff -Nrup linux-2.6.26-rc4.org/mm/filemap.c linux-2.6.26-rc4/mm/filemap.c
--- linux-2.6.26-rc4.org/mm/filemap.c	2008-05-27 14:36:23.000000000 +0900
+++ linux-2.6.26-rc4/mm/filemap.c	2008-05-27 14:38:20.000000000 +0900
@@ -932,8 +932,17 @@ find_page:
 					ra, filp, page,
 					index, last_index - index);
 		}
-		if (!PageUptodate(page))
-			goto page_not_up_to_date;
+		if (!PageUptodate(page)) {
+			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
+					!mapping->a_ops->is_partially_uptodate)
+				goto page_not_up_to_date;
+			if (TestSetPageLocked(page))
+				goto page_not_up_to_date;
+			if (!mapping->a_ops->is_partially_uptodate(page,
+								desc, offset))
+				goto page_not_up_to_date_locked;
+			unlock_page(page);
+		}
 page_ok:
 		/*
 		* i_size must be checked after we know the page is Uptodate.
@@ -1003,6 +1012,7 @@ page_not_up_to_date:
 		if (lock_page_killable(page))
 			goto readpage_eio;
 
+page_not_up_to_date_locked:
 		/* Did it get truncated before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment
  2008-05-27  9:34           ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Hisashi Hifumi
@ 2008-05-28 23:23             ` Andrew Morton
  2008-06-10  1:52               ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksizeenvironment Hisashi Hifumi
  2008-07-11 23:39             ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Andrew Morton
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2008-05-28 23:23 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: linux-kernel

	On Tue, 27 May 2008 18:34:02 +0900
Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> When we read some part of a file through pagecache, if there is a pagecache 
> of corresponding index but this page is not uptodate, read IO is issued and
> this page will be uptodate.
> I think this is good for pagesize == blocksize environment but there is room
> for improvement on pagesize != blocksize environment. Because in this case
> a page can have multiple buffers and even if a page is not uptodate, some buffers 
> can be uptodate. So I suggest that when all buffers which correspond to a part
> of a file that we want to read are uptodate, use this pagecache and copy data
> from this pagecache to user buffer even if a page is not uptodate. This can
> reduce read IO and improve system throughput.
> 
> v2: add new address_space_operations member is_partially_uptodate, and 
>      block_is_partially_uptodate was registered to ext2/3/4's aops.
>      modify do_generic_file_read to use this aops callback.
> v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
>      cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.
> 
> Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

OK, thanks, let's give it a try.

It would be excellent if we had some benchmark numbers which justify
this change.

It also would be good if we can think up some workload which might be
slowed down by the change, and then demonstrate that this is not a
significant problem.

After all, it _is_ a performance patch...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksizeenvironment
  2008-05-28 23:23             ` Andrew Morton
@ 2008-06-10  1:52               ` Hisashi Hifumi
  0 siblings, 0 replies; 15+ messages in thread
From: Hisashi Hifumi @ 2008-06-10  1:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


At 08:23 08/05/29, Andrew Morton wrote:
>	On Tue, 27 May 2008 18:34:02 +0900
>Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:
>
>> When we read some part of a file through pagecache, if there is a pagecache 
>> of corresponding index but this page is not uptodate, read IO is issued and
>> this page will be uptodate.
>> I think this is good for pagesize == blocksize environment but there is room
>> for improvement on pagesize != blocksize environment. Because in this case
>> a page can have multiple buffers and even if a page is not uptodate, some 
>buffers 
>> can be uptodate. So I suggest that when all buffers which correspond to a part
>> of a file that we want to read are uptodate, use this pagecache and copy data
>> from this pagecache to user buffer even if a page is not uptodate. This can
>> reduce read IO and improve system throughput.
>> 
>> v2: add new address_space_operations member is_partially_uptodate, and 
>>      block_is_partially_uptodate was registered to ext2/3/4's aops.
>>      modify do_generic_file_read to use this aops callback.
>> v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
>>      cleaned up and simplified page buffer iteration code in 
>block_is_partially_uptodate.
>> 
>> Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
>
>OK, thanks, let's give it a try.
>
>It would be excellent if we had some benchmark numbers which justify
>this change.
>
>It also would be good if we can think up some workload which might be
>slowed down by the change, and then demonstrate that this is not a
>significant problem.
>
>After all, it _is_ a performance patch...

I wrote a benchmark program and got result number with this program.
This benchmark do:
	1, mount and open a test file.
	2, create a 512MB file.
	3, close a file and umount.
	4, mount and again open a test file.
	5, pwrite randomly 300000 times on a test file. offset is aligned by IO size(1024bytes).
	6, measure time of preading randomly 100000 times on a test file.

The result was:
	2.6.26-rc5
        329 sec

	2.6.25-rc5-patched
        223 sec

arch:i386 
filesystem:ext3
blocksize:1024 bytes
Memory: 1GB

The read IO throughput was improved.

The benchmark program is as follows:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>

#define LEN 1024
#define LOOP 1024*512 /* 512MB */

main(void)
{
	unsigned long i, offset, filesize;
	int fd;
	char buf[LEN];
	time_t t1, t2;

	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	memset(buf, 0, LEN);
	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}
	for (i = 0; i < LOOP; i++)
		write(fd, buf, LEN);
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	fd = open("/root/test1/testfile", O_RDWR);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}

	filesize = LEN * LOOP;
	for (i = 0; i < 300000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pwrite(fd, buf, LEN, offset);
	}
	printf("start test\n");
	time(&t1);
	for (i = 0; i < 100000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pread(fd, buf, LEN, offset);
	}
	time(&t2);
	printf("%ld sec\n", t2-t1);
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
}



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment
  2008-05-27  9:34           ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Hisashi Hifumi
  2008-05-28 23:23             ` Andrew Morton
@ 2008-07-11 23:39             ` Andrew Morton
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2008-07-11 23:39 UTC (permalink / raw)
  To: Hisashi Hifumi; +Cc: linux-kernel, linux-ext4

On Tue, 27 May 2008 18:34:02 +0900 Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> wrote:

> When we read some part of a file through pagecache, if there is a pagecache 
> of corresponding index but this page is not uptodate, read IO is issued and
> this page will be uptodate.
> I think this is good for pagesize == blocksize environment but there is room
> for improvement on pagesize != blocksize environment. Because in this case
> a page can have multiple buffers and even if a page is not uptodate, some buffers 
> can be uptodate. So I suggest that when all buffers which correspond to a part
> of a file that we want to read are uptodate, use this pagecache and copy data
> from this pagecache to user buffer even if a page is not uptodate. This can
> reduce read IO and improve system throughput.
> 
> v2: add new address_space_operations member is_partially_uptodate, and 
>      block_is_partially_uptodate was registered to ext2/3/4's aops.
>      modify do_generic_file_read to use this aops callback.
> v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
>      cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.
> 
> Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

ext4 has now gained a new set of address_space_operations:
ext4_da_aops.  I put a

        .is_partially_uptodate  = block_is_partially_uptodate,

into there as well, but I have no clue whether it will work OK.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-07-11 23:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-21  6:52 [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment Hisashi Hifumi
2008-05-21  7:19 ` Andrew Morton
2008-05-21  7:38   ` Andrew Morton
2008-05-22  7:31   ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Hisashi Hifumi
2008-05-22  8:03     ` Andrew Morton
2008-05-22 12:08       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
2008-05-23 22:51       ` [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment Jan Kara
2008-05-26  7:20         ` Hisashi Hifumi
2008-05-26 11:40           ` Jan Kara
2008-05-27  8:38       ` [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment Hisashi Hifumi
2008-05-27  8:51         ` Andrew Morton
2008-05-27  9:34           ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Hisashi Hifumi
2008-05-28 23:23             ` Andrew Morton
2008-06-10  1:52               ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksizeenvironment Hisashi Hifumi
2008-07-11 23:39             ` [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox