public inbox for linux-kernel@vger.kernel.org
* [RFC 0/8] Variable Order Page Cache
@ 2007-04-19 16:35 Christoph Lameter
  2007-04-19 16:35 ` [RFC 1/8] Add order field to address_space struct Christoph Lameter
                   ` (13 more replies)
  0 siblings, 14 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Christoph Lameter, Paul Jackson,
	Dave Chinner, Andi Kleen

Variable Order Page Cache Patchset

This patchset modifies the core VM so that higher order page cache pages
become possible. The higher order page cache pages are compound pages
and can be handled in the same way as regular pages.

The order of the pages is determined by the order set up in the mapping
(struct address_space). By default the order is set to zero.
This means that higher order pages are optional. There is no attempt here
to generally change the page order of the page cache. 4K pages are effective
for small files.
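
Since the order lives in the mapping, the size of a page cache page is
just the base page size shifted by that order. A minimal arithmetic
sketch (assuming 4KB base pages, i.e. PAGE_SHIFT == 12, which is an
illustration and not necessarily the architecture's real value):

```c
/* Assumption for illustration: 4KB base pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* A page cache page of a given order covers PAGE_SIZE << order bytes:
 * order 0 is an ordinary 4KB page, order 10 is the 4MB page used in
 * the ramfs example in this cover letter. */
unsigned long page_cache_bytes(unsigned int order)
{
	return PAGE_SIZE << order;
}
```

With order 10 this yields 4194304 bytes, matching the 4MB figure in the
ramfs mount example below.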

However, it would be good if the VM would support I/O to higher order pages
to enable efficient support for large scale I/O. If one wants to write a
long file of a few gigabytes then the filesystem should have a choice of
selecting a larger page size for that file and handle larger chunks of
memory at once.

The support here is only for buffered I/O and only for one filesystem (ramfs).
Modification of other filesystems to support higher order pages may require
extensive work on other components of the kernel. But I hope this shows that
there is a relatively easy path to that goal that can be taken in steps.

Note that the higher order pages are subject to reclaim. This works in general
since we are always operating on a single page struct. Reclaim is fooled into
thinking that it is touching page sized objects (there are likely issues to be
fixed there if we want to go down this road).

What is currently not supported:
- Buffer heads for higher order pages (possible with the compound pages in mm,
  which do not use page->private, but this requires upgrading the buffer
  cache layers).
- Higher order pages in the block layer etc.
- Mmapping higher order pages

Note that this is a proof of concept. Lots of functionality is missing and
various issues have not been dealt with. Use of higher order pages may cause
memory fragmentation. Mel Gorman's anti-fragmentation work is probably
essential if we want to do this. We likely need actual defragmentation
support.

The main point of this patchset is to demonstrate that it is basically
possible to have higher order support with straightforward changes to the
VM.

The ramfs driver can be used to test higher order page cache functionality
(and may help troubleshoot the VM support until we get some real filesystem
and real devices supporting higher order pages).

After applying this patchset you can, for example, try the following:

mount -tramfs -o10 none /media

	Mounts a ramfs filesystem with order 10 pages (4 MB)

cp linux-2.6.21-rc7.tar.gz /media

	Populates the ramfs. Note that we allocate 14 pages of 4MB each
	instead of 13508 4KB pages.

umount /media

	Gets rid of the large pages again
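
The page counts in the example follow from simple rounding: the tarball
occupies 13508 base pages, and dividing that up into order-10 pages
needs 14 of them. A sketch of the arithmetic (the byte count below is an
upper bound derived from the 13508-page figure; 4KB base pages assumed):

```c
#define BASE_PAGE_SIZE 4096UL                     /* assumed 4KB base page */
#define ORDER_10_SIZE  (BASE_PAGE_SIZE << 10)    /* 4MB, as mounted above */

/* Round up: how many pages of the given size are needed to hold nbytes? */
unsigned long pages_needed(unsigned long nbytes, unsigned long page_size)
{
	return (nbytes + page_size - 1) / page_size;
}
```

13508 base pages hold at most 13508 * 4096 = 55328768 bytes, and
pages_needed(55328768, ORDER_10_SIZE) comes out as 14.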

Comments appreciated.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC 1/8] Add order field to address_space struct
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-19 16:35 ` [RFC 2/8] Basic allocation for higher order page cache pages Christoph Lameter
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Add order field in mapping

Add an "order" field in the address space structure that
specifies the page order of pages in an address space.

Set the field to zero by default so that filesystems not prepared to
deal with higher order pages can be left as is.

Putting page order in the address space structure means that the order of the
pages in the page cache can be varied per file that a filesystem creates.
This means we can keep small 4k pages for small files. Larger files can
be configured by the file system to use a higher order.
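
A filesystem prepared for higher order pages would set the field when it
creates an inode; everything else inherits the zero default from
alloc_inode(). A hypothetical userspace model of that per-file decision
(the struct is a stand-in and the 1MB threshold and order 8 are made-up
illustrations, not kernel policy):

```c
/* Illustrative stand-in for the kernel's struct address_space. */
struct address_space {
	unsigned int order;	/* page order in this space, 0 by default */
};

/* A filesystem could choose a higher order only for large files and
 * keep 4KB pages for small ones. Threshold and order are invented. */
void fs_pick_order(struct address_space *mapping, unsigned long expected_size)
{
	mapping->order = (expected_size >= (1UL << 20)) ? 8 : 0;
}
```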

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/inode.c         |    1 +
 include/linux/fs.h |    1 +
 2 files changed, 2 insertions(+)

Index: linux-2.6.21-rc7/fs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/inode.c	2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/fs/inode.c	2007-04-18 21:26:31.000000000 -0700
@@ -145,6 +145,7 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
+		mapping->order = 0;
 		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
Index: linux-2.6.21-rc7/include/linux/fs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/fs.h	2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/fs.h	2007-04-18 21:26:31.000000000 -0700
@@ -435,6 +435,7 @@ struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
 	rwlock_t		tree_lock;	/* and rwlock protecting it */
+	unsigned int		order;		/* Page order in this space */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */


* [RFC 2/8] Basic allocation for higher order page cache pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
  2007-04-19 16:35 ` [RFC 1/8] Add order field to address_space struct Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-20 10:55   ` Mel Gorman
  2007-04-19 16:35 ` [RFC 3/8] Flushing and zeroing " Christoph Lameter
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Add basic allocation functions

Extend __page_cache_alloc to take an order parameter and
modify caller sites. Modify mapping_set_gfp_mask to set
__GFP_COMP if the mapping requires higher order allocations.

put_page() is already capable of handling compound pages. So
there are no changes needed to release higher order page
cache pages.
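
The effect of the mapping_set_gfp_mask() change can be modeled in
isolation: whenever the mapping has a non-zero order, the compound-page
flag is ORed into the allocation mask automatically. A sketch (the flag's
bit value and the struct are illustrative, not the kernel's):

```c
#define MY_GFP_COMP 0x4000UL	/* illustrative bit, not the real __GFP_COMP */

struct mapping {
	unsigned int order;
	unsigned long flags;
};

/* Mirrors the patched mapping_set_gfp_mask(): higher order mappings
 * always request compound pages, so put_page() can free them whole. */
void set_gfp_mask(struct mapping *m, unsigned long mask)
{
	m->flags = mask;
	if (m->order)
		m->flags |= MY_GFP_COMP;
}
```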

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/pagemap.h |   15 +++++++++------
 mm/filemap.c            |    9 +++++----
 2 files changed, 14 insertions(+), 10 deletions(-)

Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 21:29:02.000000000 -0700
@@ -32,6 +32,8 @@ static inline void mapping_set_gfp_mask(
 {
 	m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
 				(__force unsigned long)mask;
+	if (m->order)
+		m->flags |= __GFP_COMP;
 }
 
 /*
@@ -40,7 +42,7 @@ static inline void mapping_set_gfp_mask(
  * throughput (it can then be mapped into user
  * space in smaller chunks for same flexibility).
  *
- * Or rather, it _will_ be done in larger chunks.
+ * This is the base page size
  */
 #define PAGE_CACHE_SHIFT	PAGE_SHIFT
 #define PAGE_CACHE_SIZE		PAGE_SIZE
@@ -52,22 +54,23 @@ static inline void mapping_set_gfp_mask(
 void release_pages(struct page **pages, int nr, int cold);
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 #endif
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), x->order);
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+		x->order);
 }
 
 typedef int filler_t(void *, struct page *);
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-18 21:28:30.000000000 -0700
@@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p
 }
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, int order)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_node(n, gfp, order);
 	}
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -803,7 +803,8 @@ grab_cache_page_nowait(struct address_sp
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+		mapping->order);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
 		page_cache_release(page);
 		page = NULL;


* [RFC 3/8] Flushing and zeroing higher order page cache pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
  2007-04-19 16:35 ` [RFC 1/8] Add order field to address_space struct Christoph Lameter
  2007-04-19 16:35 ` [RFC 2/8] Basic allocation for higher order page cache pages Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-20 11:02   ` Mel Gorman
  2007-04-19 16:35 ` [RFC 4/8] Enhance fallback functions in libs to support higher order pages Christoph Lameter
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

---
 include/linux/pagemap.h |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 22:08:36.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 22:09:13.000000000 -0700
@@ -210,6 +210,33 @@ static inline void wait_on_page_writebac
 
 extern void end_page_writeback(struct page *page);
 
+/* Support for clearing higher order pages */
+static inline void clear_mapping_page(struct address_space *a,
+						struct page *page)
+{
+	int nr_pages = 1 << a->order;
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		clear_highpage(page + i);
+}
+
+/*
+ * Support for flushing higher order pages.
+ *
+ * A bit stupid: On many platforms flushing the first page
+ * will flush any TLB starting there
+ */
+static inline void flush_mapping_page(struct address_space *a,
+						struct page *page)
+{
+	int nr_pages = 1 << a->order;
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		flush_dcache_page(page + i);
+}
+
 /*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *


* [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 3/8] Flushing and zeroing " Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-19 18:48   ` Adam Litke
  2007-04-20 11:05   ` Mel Gorman
  2007-04-19 16:35 ` [RFC 5/8] Enhance generic_read/write " Christoph Lameter
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Fixup fallback functions

Fix up the fallback functions in fs/libfs.c so that they can handle
higher order page cache pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/libfs.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c	2007-04-18 21:52:49.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c	2007-04-18 21:54:51.000000000 -0700
@@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir,
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
-	flush_dcache_page(page);
+	clear_mapping_page(file->f_mapping, page);
+	flush_mapping_page(file->f_mapping, page);
 	SetPageUptodate(page);
 	unlock_page(page);
 	return 0;
@@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
 			unsigned from, unsigned to)
 {
 	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
+		if (to - from != page_cache_size(file->f_mapping)) {
+			/*
+			 * Mapping to higher order pages need to be supported
+			 * if higher order pages can be in highmem
+			 */
 			void *kaddr = kmap_atomic(page, KM_USER0);
 			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
+			memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to);
+			flush_mapping_page(file->f_mapping, page);
 			kunmap_atomic(kaddr, KM_USER0);
 		}
 	}
@@ -346,7 +350,7 @@ int simple_commit_write(struct file *fil
 			unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	loff_t pos = ((loff_t)page->index << page_cache_shift(file->f_mapping)) + to;
 
 	if (!PageUptodate(page))
 		SetPageUptodate(page);


* [RFC 5/8] Enhance generic_read/write to support higher order pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 4/8] Enhance fallback functions in libs to support higher order pages Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-19 16:35 ` [RFC 6/8] Account for pages in the page cache in terms of base pages Christoph Lameter
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Fix up mm/filemap.c

Fix up the functions in mm/filemap.c to use the variable order page cache.
This is pretty straightforward:

1. Convert the PAGE_CACHE_* constants to function calls.
2. Use the mapping flush function.
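
The page_cache_shift/size/mask helpers used throughout this patch are
not shown in the excerpt; presumably they are per-mapping variants of
the PAGE_CACHE_* constants. A sketch of the assumed definitions (4KB
base pages assumed; the struct is a stand-in for struct address_space):

```c
#define BASE_PAGE_CACHE_SHIFT 12	/* assumption: 4KB base pages */

struct mapping {
	unsigned int order;
};

unsigned int page_cache_shift(struct mapping *m)
{
	return BASE_PAGE_CACHE_SHIFT + m->order;
}

unsigned long page_cache_size(struct mapping *m)
{
	return 1UL << page_cache_shift(m);
}

/* Like PAGE_CACHE_MASK: clears the offset bits within a page, so
 * "pos & ~page_cache_mask(m)" extracts the in-page offset. */
unsigned long page_cache_mask(struct mapping *m)
{
	return ~(page_cache_size(m) - 1);
}
```

With order 0 these collapse back to the PAGE_CACHE_* constants, which is
why the conversions in the patch are behavior-neutral for ordinary
mappings.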

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap.c |   63 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 31 deletions(-)

Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-18 21:28:30.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-18 22:07:10.000000000 -0700
@@ -302,8 +302,8 @@ int wait_on_page_writeback_range(struct 
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = pos >> page_cache_shift(mapping);
+	pgoff_t end = (pos + count - 1) >> page_cache_shift(mapping);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -334,8 +334,8 @@ EXPORT_SYMBOL(sync_page_range);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = pos >> page_cache_shift(mapping);
+	pgoff_t end = (pos + count - 1) >> page_cache_shift(mapping);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -364,7 +364,7 @@ int filemap_fdatawait(struct address_spa
 		return 0;
 
 	return wait_on_page_writeback_range(mapping, 0,
-				(i_size - 1) >> PAGE_CACHE_SHIFT);
+				(i_size - 1) >> page_cache_shift(mapping));
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
@@ -412,8 +412,8 @@ int filemap_write_and_wait_range(struct 
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
-						lstart >> PAGE_CACHE_SHIFT,
-						lend >> PAGE_CACHE_SHIFT);
+						lstart >> page_cache_shift(mapping),
+						lend >> page_cache_shift(mapping));
 			if (!err)
 				err = err2;
 		}
@@ -875,27 +875,28 @@ void do_generic_mapping_read(struct addr
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
-	index = *ppos >> PAGE_CACHE_SHIFT;
+	index = *ppos >> page_cache_shift(mapping);
 	next_index = index;
 	prev_index = ra.prev_page;
-	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	last_index = (*ppos + desc->count + page_cache_size(mapping)-1)
+				>> page_cache_shift(mapping);
+	offset = *ppos & ~page_cache_mask(mapping);
 
 	isize = i_size_read(inode);
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	end_index = (isize - 1) >> page_cache_shift(mapping);
 	for (;;) {
 		struct page *page;
 		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = ((isize - 1) & ~page_cache_mask(mapping)) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -922,7 +923,7 @@ page_ok:
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_mapping_page(mapping, page);
 
 		/*
 		 * When (part of) the same page is read multiple times
@@ -944,8 +945,8 @@ page_ok:
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += offset >> page_cache_shift(mapping);
+		offset &= ~page_cache_mask(mapping);
 
 		page_cache_release(page);
 		if (ret == nr && desc->count)
@@ -1009,16 +1010,16 @@ readpage:
 		 * another truncate extends the file - this is desired though).
 		 */
 		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		end_index = (isize - 1) >> page_cache_shift(mapping);
 		if (unlikely(!isize || index > end_index)) {
 			page_cache_release(page);
 			goto out;
 		}
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index == end_index) {
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = ((isize - 1) & ~page_cache_mask(mapping)) + 1;
 			if (nr <= offset) {
 				page_cache_release(page);
 				goto out;
@@ -1061,7 +1062,7 @@ no_cached_page:
 out:
 	*_ra = ra;
 
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = ((loff_t) index << page_cache_shift(mapping)) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
 	if (filp)
@@ -1257,8 +1258,8 @@ asmlinkage ssize_t sys_readahead(int fd,
 	if (file) {
 		if (file->f_mode & FMODE_READ) {
 			struct address_space *mapping = file->f_mapping;
-			unsigned long start = offset >> PAGE_CACHE_SHIFT;
-			unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+			unsigned long start = offset >> page_cache_shift(mapping);
+			unsigned long end = (offset + count - 1) >> page_cache_shift(mapping);
 			unsigned long len = end - start + 1;
 			ret = do_readahead(mapping, file, start, len);
 		}
@@ -2073,9 +2074,9 @@ generic_file_buffered_write(struct kiocb
 		unsigned long offset;
 		size_t copied;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = (pos & (page_cache_size(mapping) -1)); /* Within page */
+		index = pos >> page_cache_shift(mapping);
+		bytes = page_cache_size(mapping) - offset;
 
 		/* Limit the size of the copy to the caller's write size */
 		bytes = min(bytes, count);
@@ -2136,7 +2137,7 @@ generic_file_buffered_write(struct kiocb
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
 						cur_iov, iov_base, bytes);
-		flush_dcache_page(page);
+		flush_mapping_page(mapping, page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
 			page_cache_release(page);
@@ -2302,8 +2303,8 @@ __generic_file_aio_write_nolock(struct k
 		if (err == 0) {
 			written = written_buffered;
 			invalidate_mapping_pages(mapping,
-						 pos >> PAGE_CACHE_SHIFT,
-						 endbyte >> PAGE_CACHE_SHIFT);
+						 pos >> page_cache_shift(mapping),
+						 endbyte >> page_cache_shift(mapping));
 		} else {
 			/*
 			 * We don't know how much we wrote, so just return
@@ -2390,7 +2391,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
-		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
+		end = (offset + write_len - 1) >> page_cache_shift(mapping);
 	       	if (mapping_mapped(mapping))
 			unmap_mapping_range(mapping, offset, write_len, 0);
 	}
@@ -2407,7 +2408,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE && mapping->nrpages) {
 		retval = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
+					offset >> page_cache_shift(mapping), end);
 		if (retval)
 			goto out;
 	}
@@ -2425,7 +2426,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE && mapping->nrpages) {
 		int err = invalidate_inode_pages2_range(mapping,
-					      offset >> PAGE_CACHE_SHIFT, end);
+					      offset >> page_cache_shift(mapping), end);
 		if (err && retval >= 0)
 			retval = err;
 	}


* [RFC 6/8] Account for pages in the page cache in terms of base pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 5/8] Enhance generic_read/write " Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-19 17:45   ` Nish Aravamudan
  2007-04-19 16:35 ` [RFC 7/8] Enhance ramfs to support higher order pages Christoph Lameter
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Account for higher order pages

NR_FILE_PAGES now counts pages of different order. Maybe we need to
account in base page sized pages? If so then we need to change
the way we update the counters. Note that the same would have to be
done for other counters.
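
The change keeps mapping->nrpages counting page cache entries while
NR_FILE_PAGES moves by the number of base pages each entry represents.
The bookkeeping can be modeled as follows (a simplified stand-in for the
zone counters, not kernel code):

```c
struct zone_stat {
	long nr_file_pages;	/* counted in base pages */
};

/* One page cache entry of order N stands for 1 << N base pages, so
 * adding or removing an entry moves the counter by that amount, as
 * the patched add_to_page_cache()/__remove_from_page_cache() do. */
void account_page_cache(struct zone_stat *z, unsigned int order, int add)
{
	z->nr_file_pages += (add ? 1L : -1L) << order;
}
```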

Signed-off-by: Christoph Lameter <clameter@sgi.com>


---
 mm/filemap.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-19 09:11:13.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-19 09:14:16.000000000 -0700
@@ -119,7 +119,8 @@ void __remove_from_page_cache(struct pag
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
+			-(1 << mapping->order));
 }
 
 void remove_from_page_cache(struct page *page)
@@ -448,7 +449,8 @@ int add_to_page_cache(struct page *page,
 			page->mapping = mapping;
 			page->index = offset;
 			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
+			__mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
+					1 << mappig->order);
 		}
 		write_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();


* [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 6/8] Account for pages in the page cache in terms of base pages Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-20 13:42   ` Mel Gorman
  2007-04-19 16:35 ` [RFC 8/8] Add some debug output Christoph Lameter
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Variable Order Page Cache: Add support to ramfs

The simplest file system to use is ramfs. Add a mount parameter that
specifies the page order of the pages that ramfs should use. If the
order is greater than zero then disable mmap functionality.

This could be removed if the VM were changed to support faulting in
higher order pages, but for now we are content with buffered I/O on
higher order pages.
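
The patch carries the order through the superblock block size:
ramfs_fill_super() stores PAGE_CACHE_SIZE << order, and
ramfs_get_inode() recovers the order as s_blocksize_bits -
PAGE_CACHE_SHIFT. The round trip is plain arithmetic (4KB base pages
assumed for illustration):

```c
#define PC_SHIFT 12	/* assumed PAGE_CACHE_SHIFT for 4KB base pages */

/* As in ramfs_fill_super(): encode the mount-option order in the
 * block size and its bit count. */
unsigned long blocksize_for_order(unsigned int order)
{
	return (1UL << PC_SHIFT) << order;
}

unsigned int blocksize_bits_for_order(unsigned int order)
{
	return PC_SHIFT + order;
}

/* As in ramfs_get_inode(): recover the order for the mapping. */
unsigned int order_from_blocksize_bits(unsigned int blocksize_bits)
{
	return blocksize_bits - PC_SHIFT;
}
```

For the "-o10" mount in the cover letter this gives a 4194304-byte block
size and the inode side recovers order 10 from it.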

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/ramfs/file-mmu.c   |   11 +++++++++++
 fs/ramfs/inode.c      |   15 ++++++++++++---
 include/linux/ramfs.h |    1 +
 3 files changed, 24 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c	2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c	2007-04-18 22:02:03.000000000 -0700
@@ -45,6 +45,17 @@ const struct file_operations ramfs_file_
 	.llseek		= generic_file_llseek,
 };
 
+/* Higher order mappings do not support mmap */
+const struct file_operations ramfs_file_higher_order_operations = {
+	.read		= do_sync_read,
+	.aio_read	= generic_file_aio_read,
+	.write		= do_sync_write,
+	.aio_write	= generic_file_aio_write,
+	.fsync		= simple_sync_file,
+	.sendfile	= generic_file_sendfile,
+	.llseek		= generic_file_llseek,
+};
+
 const struct inode_operations ramfs_file_inode_operations = {
 	.getattr	= simple_getattr,
 };
Index: linux-2.6.21-rc7/fs/ramfs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/inode.c	2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/inode.c	2007-04-18 22:02:03.000000000 -0700
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
 		inode->i_blocks = 0;
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+		inode->i_mapping->order = sb->s_blocksize_bits - PAGE_CACHE_SHIFT;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
 		default:
@@ -68,7 +69,10 @@ struct inode *ramfs_get_inode(struct sup
 			break;
 		case S_IFREG:
 			inode->i_op = &ramfs_file_inode_operations;
-			inode->i_fop = &ramfs_file_operations;
+			if (inode->i_mapping->order)
+				inode->i_fop = &ramfs_file_higher_order_operations;
+			else
+				inode->i_fop = &ramfs_file_operations;
 			break;
 		case S_IFDIR:
 			inode->i_op = &ramfs_dir_inode_operations;
@@ -164,10 +168,15 @@ static int ramfs_fill_super(struct super
 {
 	struct inode * inode;
 	struct dentry * root;
+	int order = 0;
+	char *options = data;
+
+	if (options && *options)
+		order = simple_strtoul(options, NULL, 10);
 
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_CACHE_SIZE;
-	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+	sb->s_blocksize = PAGE_CACHE_SIZE << order;
+	sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
 	sb->s_magic = RAMFS_MAGIC;
 	sb->s_op = &ramfs_ops;
 	sb->s_time_gran = 1;
Index: linux-2.6.21-rc7/include/linux/ramfs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/ramfs.h	2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/ramfs.h	2007-04-18 22:02:03.000000000 -0700
@@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file 
 #endif
 
 extern const struct file_operations ramfs_file_operations;
+extern const struct file_operations ramfs_file_higher_order_operations;
 extern struct vm_operations_struct generic_file_vm_ops;
 extern int __init init_rootfs(void);
 


* [RFC 8/8] Add some debug output
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (6 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 7/8] Enhance ramfs to support higher order pages Christoph Lameter
@ 2007-04-19 16:35 ` Christoph Lameter
  2007-04-19 19:09 ` [RFC 0/8] Variable Order Page Cache Badari Pulavarty
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 16:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Nick Piggin, Andi Kleen, Paul Jackson,
	Dave Chinner, Christoph Lameter

Debugging patch

Show some output as to what is going on.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/ramfs/inode.c |    1 +
 mm/filemap.c     |    8 ++++++++
 2 files changed, 9 insertions(+)

Index: linux-2.6.21-rc7/fs/ramfs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/inode.c	2007-04-19 09:14:35.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/inode.c	2007-04-19 09:14:42.000000000 -0700
@@ -174,6 +174,7 @@ static int ramfs_fill_super(struct super
 	if (options && *options)
 		order = simple_strtoul(options, NULL, 10);
 
+	printk(KERN_ERR "ramfs_fill_super: '%s' order=%d\n", options, order);
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = PAGE_CACHE_SIZE << order;
 	sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-19 09:14:16.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-19 09:16:45.000000000 -0700
@@ -121,6 +121,10 @@ void __remove_from_page_cache(struct pag
 	mapping->nrpages--;
 	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
 			-(1 << mapping->order));
+
+	if (mapping->order)
+		printk(KERN_ERR "Removing page %p order %d from pagecache\n",
+				page, mapping->order);
 }
 
 void remove_from_page_cache(struct page *page)
@@ -451,6 +455,10 @@ int add_to_page_cache(struct page *page,
 			mapping->nrpages++;
 			__mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
 					1 << mappig->order);
+
+			if (mapping->order)
+				printk(KERN_ERR "add_to_page_cache page %p order=%d\n",
+						page, mapping->order);
 		}
 		write_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();


* Re: [RFC 6/8] Account for pages in the page cache in terms of base pages
  2007-04-19 16:35 ` [RFC 6/8] Account for pages in the page cache in terms of base pages Christoph Lameter
@ 2007-04-19 17:45   ` Nish Aravamudan
  2007-04-19 17:52     ` Christoph Lameter
  0 siblings, 1 reply; 58+ messages in thread
From: Nish Aravamudan @ 2007-04-19 17:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On 4/19/07, Christoph Lameter <clameter@sgi.com> wrote:
> Variable Order Page Cache: Account for higher order pages
>
> NR_FILE_PAGES now counts pages of different order. Maybe we need to
> account in base page sized pages? If so then we need to change
> the way we update the counters. Note that the same would have to be
> done for other counters.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
>
> ---
>  mm/filemap.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.21-rc7/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/filemap.c  2007-04-19 09:11:13.000000000 -0700
> +++ linux-2.6.21-rc7/mm/filemap.c       2007-04-19 09:14:16.000000000 -0700
> @@ -119,7 +119,8 @@ void __remove_from_page_cache(struct pag
>         radix_tree_delete(&mapping->page_tree, page->index);
>         page->mapping = NULL;
>         mapping->nrpages--;
> -       __dec_zone_page_state(page, NR_FILE_PAGES);
> +       __mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
> +                       -(1 << mapping->order));
>  }
>
>  void remove_from_page_cache(struct page *page)
> @@ -448,7 +449,8 @@ int add_to_page_cache(struct page *page,
>                         page->mapping = mapping;
>                         page->index = offset;
>                         mapping->nrpages++;
> -                       __inc_zone_page_state(page, NR_FILE_PAGES);
> +                       __mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
> +                                       1 << mappig->order);

Typo? should be mapping->order?

Thanks,
Nish


* Re: [RFC 6/8] Account for pages in the page cache in terms of base pages
  2007-04-19 17:45   ` Nish Aravamudan
@ 2007-04-19 17:52     ` Christoph Lameter
  2007-04-19 17:54       ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 17:52 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Thu, 19 Apr 2007, Nish Aravamudan wrote:

> > NR_FILE_PAGES,
> > +                                       1 << mappig->order);
> 
> Typo? should be mapping->order?

Correct. Sigh. Why do these things creep in at the last minute before 
posting???



* Re: [RFC 6/8] Account for pages in the page cache in terms of base pages
  2007-04-19 17:52     ` Christoph Lameter
@ 2007-04-19 17:54       ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2007-04-19 17:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nish Aravamudan, linux-kernel, Peter Zijlstra, Nick Piggin,
	Andi Kleen, Paul Jackson, Dave Chinner

Christoph Lameter wrote:
> On Thu, 19 Apr 2007, Nish Aravamudan wrote:
>
>   
>>> NR_FILE_PAGES,
>>> +                                       1 << mappig->order);
>>>       
>> Typo? should be mapping->order?
>>     
>
> Correct. Sigh. Why do these things creep in at the last minute before 
> posting???
>   

To make sure you know that somebody's read them?


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 16:35 ` [RFC 4/8] Enhance fallback functions in libs to support higher order pages Christoph Lameter
@ 2007-04-19 18:48   ` Adam Litke
  2007-04-19 19:10     ` Christoph Lameter
  2007-04-20 11:05   ` Mel Gorman
  1 sibling, 1 reply; 58+ messages in thread
From: Adam Litke @ 2007-04-19 18:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On 4/19/07, Christoph Lameter <clameter@sgi.com> wrote:
> @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
>                         unsigned from, unsigned to)
>  {
>         if (!PageUptodate(page)) {
> -               if (to - from != PAGE_CACHE_SIZE) {
> +               if (to - from != page_cache_size(file->f_mapping)) {

Where do you introduce page_cache_size()?  Is this added by a
different set of patches I should have applied first?

--
Adam Litke ( agl at us.ibm.com )
IBM Linux Technology Center


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (7 preceding siblings ...)
  2007-04-19 16:35 ` [RFC 8/8] Add some debug output Christoph Lameter
@ 2007-04-19 19:09 ` Badari Pulavarty
  2007-04-19 19:12   ` Christoph Lameter
  2007-04-19 19:11 ` Andi Kleen
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 58+ messages in thread
From: Badari Pulavarty @ 2007-04-19 19:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, Peter Zijlstra, Nick Piggin, Paul Jackson, Dave Chinner,
	Andi Kleen

On Thu, 2007-04-19 at 09:35 -0700, Christoph Lameter wrote:
> Variable Order Page Cache Patchset
> 

..
mm/built-in.o: In function `__generic_file_aio_write_nolock':
filemap.c:(.text+0x295c): undefined reference to `page_cache_shift'
filemap.c:(.text+0x296c): undefined reference to `page_cache_shift'
fs/built-in.o: In function `simple_commit_write':
(.text+0x20866): undefined reference to `page_cache_shift'
fs/built-in.o: In function `simple_prepare_write':
(.text+0x2097b): undefined reference to `page_cache_size'
fs/built-in.o: In function `simple_prepare_write':
(.text+0x209cd): undefined reference to `page_cache_size'


Thanks,
Badari



* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 18:48   ` Adam Litke
@ 2007-04-19 19:10     ` Christoph Lameter
  2007-04-19 22:50       ` David Chinner
  2007-04-20  8:21       ` Jens Axboe
  0 siblings, 2 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 19:10 UTC (permalink / raw)
  To: Adam Litke
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Thu, 19 Apr 2007, Adam Litke wrote:

> On 4/19/07, Christoph Lameter <clameter@sgi.com> wrote:
> > @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
> >                         unsigned from, unsigned to)
> >  {
> >         if (!PageUptodate(page)) {
> > -               if (to - from != PAGE_CACHE_SIZE) {
> > +               if (to - from != page_cache_size(file->f_mapping)) {
> 
> Where do you introduce page_cache_size()?  Is this added by a
> different set of patches I should have applied first?

Yuck. Missed that one in the control file. Insert this patch before this 
one.



Variable Order Page Cache: Add functions to establish sizes

We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK,
and PAGE_CACHE_ALIGN in various places in the kernel. These currently refer to
the base page size, but we do not have a means of calculating these
values for higher order pages.

Provide these functions. An address_space pointer must be passed
to them.

New function			Related base page constant
---------------------------------------------------
page_cache_shift(a)		PAGE_CACHE_SHIFT
page_cache_size(a)		PAGE_CACHE_SIZE
page_cache_mask(a)		PAGE_CACHE_MASK
page_cache_align(addr,a)	PAGE_CACHE_ALIGN(addr)

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/pagemap.h |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 23:01:09.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 23:03:32.000000000 -0700
@@ -49,6 +49,27 @@ static inline void mapping_set_gfp_mask(
 #define PAGE_CACHE_MASK		PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
+static inline int page_cache_shift(struct address_space *a)
+{
+	return a->order + PAGE_CACHE_SHIFT;
+}
+
+static inline unsigned long page_cache_size(struct address_space *a)
+{
+	return PAGE_CACHE_SIZE << a->order;
+}
+
+static inline unsigned long page_cache_mask(struct address_space *a)
+{
+	return PAGE_CACHE_MASK << a->order;
+}
+
+static inline unsigned long page_cache_align(unsigned long addr,
+		struct address_space *a)
+{
+	return (((addr) + page_cache_size(a) - 1) & page_cache_mask(a));
+}
+
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (8 preceding siblings ...)
  2007-04-19 19:09 ` [RFC 0/8] Variable Order Page Cache Badari Pulavarty
@ 2007-04-19 19:11 ` Andi Kleen
  2007-04-19 19:15   ` Christoph Lameter
  2007-04-20 14:37   ` Mel Gorman
  2007-04-19 22:42 ` David Chinner
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 58+ messages in thread
From: Andi Kleen @ 2007-04-19 19:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner

> We likely need actual defragmentation support.

To be honest it looks quite pointless before this is solved. So far it is
not even clear if it is feasible to solve it.

-Andi


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 19:09 ` [RFC 0/8] Variable Order Page Cache Badari Pulavarty
@ 2007-04-19 19:12   ` Christoph Lameter
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 19:12 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: lkml, Peter Zijlstra, Nick Piggin, Paul Jackson, Dave Chinner,
	Andi Kleen

On Thu, 19 Apr 2007, Badari Pulavarty wrote:

> On Thu, 2007-04-19 at 09:35 -0700, Christoph Lameter wrote:
> > Variable Order Page Cache Patchset
> > 
> 
> ..
> mm/built-in.o: In function `__generic_file_aio_write_nolock':
> filemap.c:(.text+0x295c): undefined reference to `page_cache_shift'
> filemap.c:(.text+0x296c): undefined reference to `page_cache_shift'
> fs/built-in.o: In function `simple_commit_write':
> (.text+0x20866): undefined reference to `page_cache_shift'
> fs/built-in.o: In function `simple_prepare_write':
> (.text+0x2097b): undefined reference to `page_cache_size'
> fs/built-in.o: In function `simple_prepare_write':
> (.text+0x209cd): undefined reference to `page_cache_size'

Sigh. You need this patch, which I did not include:

Variable Order Page Cache: Add functions to establish sizes

We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK,
and PAGE_CACHE_ALIGN in various places in the kernel. These currently refer to
the base page size, but we do not have a means of calculating these
values for higher order pages.

Provide these functions. An address_space pointer must be passed
to them.

New function			Related base page constant
---------------------------------------------------
page_cache_shift(a)		PAGE_CACHE_SHIFT
page_cache_size(a)		PAGE_CACHE_SIZE
page_cache_mask(a)		PAGE_CACHE_MASK
page_cache_align(addr,a)	PAGE_CACHE_ALIGN(addr)

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/pagemap.h |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 23:01:09.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 23:03:32.000000000 -0700
@@ -49,6 +49,27 @@ static inline void mapping_set_gfp_mask(
 #define PAGE_CACHE_MASK		PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
+static inline int page_cache_shift(struct address_space *a)
+{
+	return a->order + PAGE_CACHE_SHIFT;
+}
+
+static inline unsigned long page_cache_size(struct address_space *a)
+{
+	return PAGE_CACHE_SIZE << a->order;
+}
+
+static inline unsigned long page_cache_mask(struct address_space *a)
+{
+	return PAGE_CACHE_MASK << a->order;
+}
+
+static inline unsigned long page_cache_align(unsigned long addr,
+		struct address_space *a)
+{
+	return (((addr) + page_cache_size(a) - 1) & page_cache_mask(a));
+}
+
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 19:11 ` Andi Kleen
@ 2007-04-19 19:15   ` Christoph Lameter
  2007-04-20 14:37   ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-19 19:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner

On Thu, 19 Apr 2007, Andi Kleen wrote:

> > We likely need actual defragmentation support.
> 
> To be honest it looks quite pointless before this is solved. So far it is
> not even clear if it is feasible to solve it.

We have done order 1 / 2 allocations for some limited purposes for some 
time (task struct etc). But you are right: We need to figure out how high 
we can go with Mel's antifrag and defrag work.


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (9 preceding siblings ...)
  2007-04-19 19:11 ` Andi Kleen
@ 2007-04-19 22:42 ` David Chinner
  2007-04-20  1:14   ` Christoph Lameter
  2007-04-20  6:32   ` Jens Axboe
  2007-04-19 23:58 ` Maxim Levitsky
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 58+ messages in thread
From: David Chinner @ 2007-04-19 22:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> Variable Order Page Cache Patchset
> 
> This patchset modifies the core VM so that higher order page cache pages
> become possible. The higher order page cache pages are compound pages
> and can be handled in the same way as regular pages.
> 
> The order of the pages is determined by the order set up in the mapping
> (struct address_space). By default the order is set to zero.
> This means that higher order pages are optional. There is no attempt here
> to generally change the page order of the page cache. 4K pages are effective
> for small files.
> 
> However, it would be good if the VM would support I/O to higher order pages
> to enable efficient support for large scale I/O. If one wants to write a
> long file of a few gigabytes then the filesystem should have a choice of
> selecting a larger page size for that file and handle larger chunks of
> memory at once.
> 
> The support here is only for buffered I/O and only for one filesystem (ramfs).
> Modification of other filesystems to support higher order pages may require
> extensive work of other components of the kernel. But I hope this shows that
> there is a relatively easy way to that goal that could be taken in steps..

So looking at this the main thing for converting a filesystem is some extra
bits in the mount process and replacing PAGE_CACHE_* macros with
page_cache_*() wrapper functions.

We can probably set all this up trivially with XFS by allowing block size > page
size filesystems to be mounted and modifying the way we feed pages to a bio
to be aware of compound pages.

> Note that the higher order pages are subject to reclaim. This works in general
> since we are always operating on a single page struct. Reclaim is fooled to
> think that it is touching page sized objects (there are likely issues to be
> fixed there if we want to go down this road).
> 
> What is currently not supported:
> - Buffer heads for higher order pages (possible with the compound pages in mm
>   that do not use page->private requires upgrade of the buffer cache layers).

Does this mean that the -mm code will currently support bufferheads on compound
pages? We need that before we can get XFS to work with compound pages.

> - Higher order pages in the block layer etc.

It's more drivers that we have to worry about, I think.  We don't need to
modify bios to explicitly support compound pages. From bio.h:

/*
 * was unsigned short, but we might as well be ready for > 64kB I/O pages
 */
struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

So compound pages should be transparent to anything that doesn't
look at the contents of bio_vecs....

> - Mmapping higher order pages

*nod*

hmmm - what about the way we do copyin and copyout from the page cache? ie
we kmap_atomic() them before we access them. Does this need to change?

> The ramfs driver can be used to test higher order page cache functionality
> (and may help troubleshoot the VM support until we get some real filesystem
> and real devices supporting higher order pages).

I don't think it will take much to get XFS to work with a high order
page cache and we can probably insulate the block layer initially with some
kind of bio_add_compound_page() wrapper and some similar
wrapper on the io completion side.

> Comments appreciated.

So far it's much less intrusive than I expected ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 19:10     ` Christoph Lameter
@ 2007-04-19 22:50       ` David Chinner
  2007-04-20  1:15         ` Christoph Lameter
  2007-04-20  8:21       ` Jens Axboe
  1 sibling, 1 reply; 58+ messages in thread
From: David Chinner @ 2007-04-19 22:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Adam Litke, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Thu, Apr 19, 2007 at 12:10:34PM -0700, Christoph Lameter wrote:
> Variable Order Page Cache: Add functions to establish sizes
> 
> We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK,
> and PAGE_CACHE_ALIGN in various places in the kernel. These currently refer to
> the base page size, but we do not have a means of calculating these
> values for higher order pages.
> 
> Provide these functions. An address_space pointer must be passed
> to them.
> 
> New function			Related base page constant
> ---------------------------------------------------
> page_cache_shift(a)		PAGE_CACHE_SHIFT
> page_cache_size(a)		PAGE_CACHE_SIZE
> page_cache_mask(a)		PAGE_CACHE_MASK
> page_cache_align(addr,a)	PAGE_CACHE_ALIGN(addr)

I think PAGE_CACHE_SIZE is a redundant define with these
modifications.  The page cache size is now variable and it is based
on a multiple of PAGE_SIZE. Hence I suggest that PAGE_CACHE_SIZE and
its derivatives should be made to go away completely with this
change.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (10 preceding siblings ...)
  2007-04-19 22:42 ` David Chinner
@ 2007-04-19 23:58 ` Maxim Levitsky
  2007-04-20  1:15   ` Christoph Lameter
  2007-04-20  4:47 ` William Lee Irwin III
  2007-04-20 14:14 ` Mel Gorman
  13 siblings, 1 reply; 58+ messages in thread
From: Maxim Levitsky @ 2007-04-19 23:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Thursday 19 April 2007 19:35:04 Christoph Lameter wrote:
> Variable Order Page Cache Patchset
> 
> This patchset modifies the core VM so that higher order page cache pages
> become possible. The higher order page cache pages are compound pages
> and can be handled in the same way as regular pages.
> 
> The order of the pages is determined by the order set up in the mapping
> (struct address_space). By default the order is set to zero.
> This means that higher order pages are optional. There is no attempt here
> to generally change the page order of the page cache. 4K pages are effective
> for small files.
> 
> However, it would be good if the VM would support I/O to higher order pages
> to enable efficient support for large scale I/O. If one wants to write a
> long file of a few gigabytes then the filesystem should have a choice of
> selecting a larger page size for that file and handle larger chunks of
> memory at once.
> 
> The support here is only for buffered I/O and only for one filesystem (ramfs).
> Modification of other filesystems to support higher order pages may require
> extensive work of other components of the kernel. But I hope this shows that
> there is a relatively easy way to that goal that could be taken in steps..
> 
> Note that the higher order pages are subject to reclaim. This works in general
> since we are always operating on a single page struct. Reclaim is fooled to
> think that it is touching page sized objects (there are likely issues to be
> fixed there if we want to go down this road).
> 
> What is currently not supported:
> - Buffer heads for higher order pages (possible with the compound pages in mm
>   that do not use page->private requires upgrade of the buffer cache layers).
> - Higher order pages in the block layer etc.
> - Mmapping higher order pages
> 
> Note that this is proof-of-concept. Lots of functionality is missing and
> various issues have not been dealt with. Use of higher order pages may cause
> memory fragmentation. Mel Gorman's anti-fragmentation work is probably
> essential if we want to do this. We likely need actual defragmentation
> support.
> 
> The main point of this patchset is to demonstrates that it is basically
> possible to have higher order support with straightforward changes to the
> VM.
> 
> The ramfs driver can be used to test higher order page cache functionality
> (and may help troubleshoot the VM support until we get some real filesystem
> and real devices supporting higher order pages).
> 
> If you apply this patch and then you can f.e. try this:
> 
> mount -tramfs -o10 none /media
> 
> 	Mounts a ramfs filesystem with order 10 pages (4 MB)
> 
> cp linux-2.6.21-rc7.tar.gz /media
> 
> 	Populate the ramfs. Note that we allocate 14 pages of 4M each
> 	instead of 13508..
> 
> umount /media
> 
> 	Gets rid of the large pages again
> 
> Comments appreciated.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Hello,

This is exactly what I wanted some time ago. Thank you very much; I was
almost thinking of doing this myself (but decided that it is too difficult
for me right now and may not be worth the effort).

I want to point out a number of problems that this will solve (and the
reasons I wanted to do it).

First of all, packet writing on CD/DVD does not work well today; it is very
slow, because all filesystems are currently limited to 4k blocks while a
CD/DVD can only write 32k/64k packets. This is why pktcdvd was written: it
emulates those 4k sectors by doing a read/modify/write cycle, which causes
a lot of seeks and read/write switches and is therefore very slow.

By introducing a page cache with pages bigger than 4k, a DVD/CD can be
divided into 64k/32k blocks that can be read and written freely. (Although
a DVD can read 2k sectors, I don't think reading a 64k block will hurt,
since most of the time the drive is busy seeking and locating a specific
sector.)

For now I am thinking of implementing this another way: teaching the udf
filesystem to do packet writing on its own, bypassing the disk cache (but
not the page cache).

Secondly, the same 32k/64k limitation is present on flash devices, so they
can benefit too, and I am almost sure that future hard disks will use
bigger block sizes as well.

To summarize, bigger page sizes will allow devices with big hardware
sectors to work well under Linux.

Best regards,
	Maxim Levitsky


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 22:42 ` David Chinner
@ 2007-04-20  1:14   ` Christoph Lameter
  2007-04-20  6:32   ` Jens Axboe
  1 sibling, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20  1:14 UTC (permalink / raw)
  To: David Chinner
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Andi Kleen

On Fri, 20 Apr 2007, David Chinner wrote:

> So looking at this the main thing for converting a filesystem is some extra
> bits in the mount process and replacing PAGE_CACHE_* macros with
> page_cache_*() wrapper functions.

Right.

> We can probably set all this up trivially with XFS by allowing block size > page
> size filesystems to be mounted and modifying the way we feed pages to a bio
> to be aware of compound pages.

That would be great! Anyone volunteering for the block layer?

> > What is currently not supported:
> > - Buffer heads for higher order pages (possible with the compound pages in mm
> >   that do not use page->private requires upgrade of the buffer cache layers).
> 
> Does this mean that the -mm code will currently support bufferheads on compound
> pages? We need that before we can get XFS to work with compound pages.

There needs to be some work done on that level. But page->private can be 
used for compound pages now which should make this simple to do.

> > - Higher order pages in the block layer etc.
> 
> It's more drivers that we have to worry about, I think.  We don't need to
> modify bios to explicitly support compound pages. From bio.h:
> 
> /*
>  * was unsigned short, but we might as well be ready for > 64kB I/O pages
>  */
> struct bio_vec {
>         struct page     *bv_page;
>         unsigned int    bv_len;
>         unsigned int    bv_offset;
> };
> 
> So compound pages should be transparent to anything that doesn't
> look at the contents of bio_vecs....

Great!

> > - Mmapping higher order pages
> 
> *nod*
> 
> hmmm - what about the way we do copyin and copyout from the page cache? ie
> we kmap_atomic() them before we access them. Does this need to change?

kmap_atomic does not do anything if we do not use highmem. If we want to 
support highmem with higher order pages then kmap_atomic needs to support 
arbitrary page orders.

> > The ramfs driver can be used to test higher order page cache functionality
> > (and may help troubleshoot the VM support until we get some real filesystem
> > and real devices supporting higher order pages).
> 
> I don't think it will take much to get XFS to work with a high order
> page cache and we can probably insulate the block layer initially with some
> kind of bio_add_compound_page() wrapper and some similar
> wrapper on the io completion side.

I'd be happy if we could make this work soon.

> So far it's much less intrusive than I expected ;)

I was surprised too. Seems that multiple people have been preparing for 
the great day when we finally support higher order pages in the page 
cache.



* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 22:50       ` David Chinner
@ 2007-04-20  1:15         ` Christoph Lameter
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20  1:15 UTC (permalink / raw)
  To: David Chinner
  Cc: Adam Litke, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson

On Fri, 20 Apr 2007, David Chinner wrote:

> I think PAGE_CACHE_SIZE is a redundant define with these
> modifications.  The page cache size is now variable and it is based
> on a multiple of PAGE_SIZE. Hence I suggest that PAGE_CACHE_SIZE and
> its derivatives should be made to go away completely with this
> change.

Ultimately we should do so, but for right now let's stay on the
least-intrusive and as-clean-as-possible road.


* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 23:58 ` Maxim Levitsky
@ 2007-04-20  1:15   ` Christoph Lameter
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20  1:15 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Fri, 20 Apr 2007, Maxim Levitsky wrote:

> First of all, packet writing on CD/DVD does not work well today; it is very
> slow, because all filesystems are currently limited to 4k blocks while a
> CD/DVD can only write 32k/64k packets. This is why pktcdvd was written: it
> emulates those 4k sectors by doing a read/modify/write cycle, which causes
> a lot of seeks and read/write switches and is therefore very slow.
> 
> By introducing a page cache with pages bigger than 4k, a DVD/CD can be
> divided into 64k/32k blocks that can be read and written freely. (Although
> a DVD can read 2k sectors, I don't think reading a 64k block will hurt,
> since most of the time the drive is busy seeking and locating a specific
> sector.)
> 
> For now I am thinking of implementing this another way: teaching the udf
> filesystem to do packet writing on its own, bypassing the disk cache (but
> not the page cache).
> 
> Secondly, the same 32k/64k limitation is present on flash devices, so they
> can benefit too, and I am almost sure that future hard disks will use
> bigger block sizes as well.
> 
> To summarize, bigger page sizes will allow devices with big hardware
> sectors to work well under Linux.

Great arguments in support of this feature. Thank you.



* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (11 preceding siblings ...)
  2007-04-19 23:58 ` Maxim Levitsky
@ 2007-04-20  4:47 ` William Lee Irwin III
  2007-04-20  5:27   ` Christoph Lameter
  2007-04-20  8:42   ` David Chinner
  2007-04-20 14:14 ` Mel Gorman
  13 siblings, 2 replies; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20  4:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> This patchset modifies the core VM so that higher order page cache pages
> become possible. The higher order page cache pages are compound pages
> and can be handled in the same way as regular pages.
> The order of the pages is determined by the order set up in the mapping
> (struct address_space). By default the order is set to zero.
> This means that higher order pages are optional. There is no attempt here
> to generally change the page order of the page cache. 4K pages are effective
> for small files.

Oh dear. Per-file pagesizes are foul. Better to fix up the pagecache's
radix tree than to restrict it like this. There are other attacks on the
multiple horizontal internal tree node allocation problem beyond
outright B+ trees that allow radix trees to continue to be used.


On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> However, it would be good if the VM would support I/O to higher order pages
> to enable efficient support for large scale I/O. If one wants to write a
> long file of a few gigabytes then the filesystem should have a choice of
> selecting a larger page size for that file and handle larger chunks of
> memory at once.
> The support here is only for buffered I/O and only for one filesystem (ramfs).
> Modification of other filesystems to support higher order pages may require
> extensive work of other components of the kernel. But I hope this shows that
> there is a relatively easy way to that goal that could be taken in steps..

I've always wanted the awareness to be pervasive, so it's good to hear
there's someone with a common interest. If this effort takes off, I'd be
happy to contribute to it. I do wonder what ever happened with the gelato
codebase, though.


On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> Note that the higher order pages are subject to reclaim. This works in general
> since we are always operating on a single page struct. Reclaim is fooled to
> think that it is touching page sized objects (there are likely issues to be
> fixed there if we want to go down this road).

I'm afraid this may be approaching an underappreciated research topic.
Most sponsors of such research seem to have an active disinterest in
getting page replacement to properly interoperate with all this.


On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> What is currently not supported:
> - Buffer heads for higher order pages (possible with the compound pages in mm
>   that do not use page->private requires upgrade of the buffer cache layers).
> - Higher order pages in the block layer etc.
> - Mmapping higher order pages
> Note that this is proof-of-concept. Lots of functionality is missing and
> various issues have not been dealt with. Use of higher order pages may cause
> memory fragmentation. Mel Gorman's anti-fragmentation work is probably
> essential if we want to do this. We likely need actual defragmentation
> support.
> The main point of this patchset is to demonstrate that it is basically
> possible to have higher order support with straightforward changes to the
> VM.

You don't know how glad I am to see someone actually hammering out code
on this front.


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20  4:47 ` William Lee Irwin III
@ 2007-04-20  5:27   ` Christoph Lameter
  2007-04-20  6:22     ` William Lee Irwin III
  2007-04-20  8:42   ` David Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20  5:27 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Thu, 19 Apr 2007, William Lee Irwin III wrote:

> Oh dear. Per-file pagesizes are foul. Better to fix up the pagecache's
> radix tree than to restrict it like this. There are other attacks on the
> multiple horizontal internal tree node allocation problem beyond
> outright B+ trees that allow radix trees to continue to be used.

per-file pagesizes are just the granularity that is available. The order
value is then readily available for all page cache operations. In practice 
it is likely that filesystems will have one consistent page size. If you 
look at the ramfs implementation then you will see that is exactly the
approach taken here. I want to avoid any modifications to key data 
structures or locking. If possible straight code transformation.

> I've always wanted the awareness to be pervasive, so it's good to hear
> there's someone with a common interest. If this effort takes off, I'd be
> happy to contribute to it. I do wonder what ever happened with the gelato
> codebase, though.

The superpages? I do not think that we should be getting that complicated 
here. Maybe we can pick up some ideas at some point.

> > since we are always operating on a single page struct. Reclaim is fooled to
> > think that it is touching page sized objects (there are likely issues to be
> > fixed there if we want to go down this road).
> 
> I'm afraid this may be approaching an underappreciated research topic.
> Most sponsors of such research seem to have an active disinterest in
> getting page replacement to properly interoperate with all this.

Well, that is the difference between academia, where one gets a Ph.D. for
superpages, publishes a couple of papers and then it's over, and real
kernel work, where this actually has to work consistently with the
rest of the system. Let us thus take small steps towards the goal
while keeping things as simple and straightforward as possible.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20  5:27   ` Christoph Lameter
@ 2007-04-20  6:22     ` William Lee Irwin III
  0 siblings, 0 replies; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20  6:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Thu, 19 Apr 2007, William Lee Irwin III wrote:
>> Oh dear. Per-file pagesizes are foul. Better to fix up the pagecache's
>> radix tree than to restrict it like this. There are other attacks on the
>> multiple horizontal internal tree node allocation problem beyond
>> outright B+ trees that allow radix trees to continue to be used.

On Thu, Apr 19, 2007 at 10:27:34PM -0700, Christoph Lameter wrote:
> per-file pagesizes are just the granularity that is available. The order
> value is then readily available for all page cache operations. In practice 
> it is likely that filesystems will have one consistent page size. If you 
> look at the ramfs implementation then you will see that is exactly the
> approach taken here. I want to avoid any modifications to key data 
> structures or locking. If possible straight code transformation.

I'm not sure how to say this politely, so please don't take it as a
slight. Hacks to avoid diffsize increases that would result from data
structure code are highly specious. There are typically functionality
or efficiency issues deliberately left unaddressed to accomplish such.
When committing the patch, you generally end up implicitly committed
to later updates to address those issues.

That said, even if some maintainers consciously agree (it's actually
rather clear that my opinion here is not representative of the majority
if anyone else at all), it does still present an issue for "marketing"
patches since diffsize is so easily latched onto as a metric of risk.

Don't worry about it for now. This will do fine.


On Thu, 19 Apr 2007, William Lee Irwin III wrote:
>> I've always wanted the awareness to be pervasive, so it's good to hear
>> there's someone with a common interest. If this effort takes off, I'd be
>> happy to contribute to it. I do wonder what ever happened with the gelato
>> codebase, though.

On Thu, Apr 19, 2007 at 10:27:34PM -0700, Christoph Lameter wrote:
> The superpages? I do not think that we should be getting that complicated 
> here. Maybe we can pick up some ideas at some point.

They're all TLB so picking things up on the TLB side can be done there.
I think they also keep the long format VHPT code updated to recent
mainline, which makes higher-order pages meaningful on ia64 (which with
the region register -based affair they are effectively not).


On Thu, 19 Apr 2007, William Lee Irwin III wrote:
>> I'm afraid this may be approaching an underappreciated research topic.
>> Most sponsors of such research seem to have an active disinterest in
>> getting page replacement to properly interoperate with all this.

On Thu, Apr 19, 2007 at 10:27:34PM -0700, Christoph Lameter wrote:
> Well that is the difference between academia where one gets his Ph.D. for
> superpages, publishes a couple of papers and then its over and real 
> kernel work where this actually will have to work consistently with the 
> rest of the system. Let us thus do small steps towards the goal 
> while keeping things as simple and straightforward as possible.

A more careful reading would reveal this as a criticism of industrial
sponsorship of research, not academia per se. For instance, what IHV
would bother sponsoring research on page replacement, since when
grinding out [trademarked name of major benchmark censored] numbers to
sell their systems, they arrange for page replacement never to happen?


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 22:42 ` David Chinner
  2007-04-20  1:14   ` Christoph Lameter
@ 2007-04-20  6:32   ` Jens Axboe
  2007-04-20  7:48     ` David Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Jens Axboe @ 2007-04-20  6:32 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, linux-kernel, Peter Zijlstra, Nick Piggin,
	Paul Jackson, Andi Kleen

On Fri, Apr 20 2007, David Chinner wrote:
> > - Higher order pages in the block layer etc.
> 
> It's more drivers that we have to worry about, I think.  We don't need to
> modify bios to explicitly support compound pages. From bio.h:
> 
> /*
>  * was unsigned short, but we might as well be ready for > 64kB I/O pages
>  */
> struct bio_vec {
>         struct page     *bv_page;
>         unsigned int    bv_len;
>         unsigned int    bv_offset;
> };
> 
> So compound pages should be transparent to anything that doesn't
> look at the contents of bio_vecs....

That just means you don't have to modify the bio_vec, there's still some
work to be done. But it should not be too hard, it's mainly updating the
merging checks. And grep for where PAGE_SIZE or PAGE_CACHE_SIZE is used
in fs/bio.c include/linux/bio.h block/ll_rw_blk.c

> > The ramfs driver can be used to test higher order page cache functionality
> > (and may help troubleshoot the VM support until we get some real filesystem
> > and real devices supporting higher order pages).
> 
> I don't think it will take much to get XFS to work with a high order
> page cache and we can probably insulate the block layer initially with some
> kind of bio_add_compound_page() wrapper and some similar
> wrapper on the io completion side.

Eh no way, at least not if you want it merged. Let's not repeat the XFS
kiobuf IO disaster :-). If this is to be done and merged, it needs to be
integrated nicely with the current framework, not attached to the side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20  6:32   ` Jens Axboe
@ 2007-04-20  7:48     ` David Chinner
  2007-04-21 22:18       ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: David Chinner @ 2007-04-20  7:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Chinner, Christoph Lameter, linux-kernel, Peter Zijlstra,
	Nick Piggin, Paul Jackson, Andi Kleen

On Fri, Apr 20, 2007 at 08:32:57AM +0200, Jens Axboe wrote:
> On Fri, Apr 20 2007, David Chinner wrote:
> > > The ramfs driver can be used to test higher order page cache functionality
> > > (and may help troubleshoot the VM support until we get some real filesystem
> > > and real devices supporting higher order pages).
> > 
> > I don't think it will take much to get XFS to work with a high order
> > page cache and we can probably insulate the block layer initially with some
> > kind of bio_add_compound_page() wrapper and some similar
> > wrapper on the io completion side.
> 
> Eh no way, at least not if you want it merged. Let's not repeat the XFS
> kiobuf IO disaster :-). If this is to be done and merged, it needs to be
> integrated nicely with the current framework, not attached to the side.

Agreed - I was talking about a quick way to hack a real filesystem
in to the VM to start exercising the new VM code without needing to
implement compound page support down the whole I/O stack. 

Any sort of hack we do like this will be throwaway code, but we're
going to need something to start with. I'll leave it up to people who
know much more about the block layer than I do to decide how to
do this properly.... ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Enhance fallback functions in libs to support higher  order pages
  2007-04-19 19:10     ` Christoph Lameter
  2007-04-19 22:50       ` David Chinner
@ 2007-04-20  8:21       ` Jens Axboe
  2007-04-20 16:01         ` Christoph Lameter
  1 sibling, 1 reply; 58+ messages in thread
From: Jens Axboe @ 2007-04-20  8:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Adam Litke, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Thu, Apr 19 2007, Christoph Lameter wrote:
> +static inline int page_cache_shift(struct address_space *a)
> +{
> +	return a->order + PAGE_CACHE_SHIFT;
> +}
> +
> +static inline unsigned long page_cache_size(struct address_space *a)
> +{
> +	return PAGE_CACHE_SIZE << a->order;
> +}

This works fine as long as you are in the submitter context, but once
you pass the into the block layer, we don't have any way to find the
address space (at least we don't want to). Would something like this be
workable, name withstanding:

static unsigned long page_size(struct page *page)
{
        struct address_space *mapping;
        int order = 0;

        mapping = page_mapping(page);
        if (mapping)
                order = mapping->order;

        return PAGE_CACHE_SIZE << order;
}

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20  4:47 ` William Lee Irwin III
  2007-04-20  5:27   ` Christoph Lameter
@ 2007-04-20  8:42   ` David Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: David Chinner @ 2007-04-20  8:42 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christoph Lameter, linux-kernel, Peter Zijlstra, Nick Piggin,
	Paul Jackson, Dave Chinner, Andi Kleen

On Thu, Apr 19, 2007 at 09:47:18PM -0700, William Lee Irwin III wrote:
> On Thu, Apr 19, 2007 at 09:35:04AM -0700, Christoph Lameter wrote:
> > This patchset modifies the core VM so that higher order page cache pages
> > become possible. The higher order page cache pages are compound pages
> > and can be handled in the same way as regular pages.
> > The order of the pages is determined by the order set up in the mapping
> > (struct address_space). By default the order is set to zero.
> > This means that higher order pages are optional. There is no attempt here
> > to generally change the page order of the page cache. 4K pages are effective
> > for small files.
> 
> Oh dear. Per-file pagesizes are foul. Better to fix up the pagecache's
> radix tree than to restrict it like this.

How is this restrictive? This opens up much goodness to filesystems.
I *want* to be able to use different radix tree index orders for
different inodes in a single filesystem. Not just for data, either;
XFS has metadata constructs larger than a page that mean we have to
carry our own buffer cache around.  Being able to use the page cache
directly for these metadata constructs would enable us to remove a
good chunk of code from XFS....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 2/8] Basic allocation for higher order page cache pages
  2007-04-19 16:35 ` [RFC 2/8] Basic allocation for higher order page cache pages Christoph Lameter
@ 2007-04-20 10:55   ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 10:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On (19/04/07 09:35), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Add basic allocation functions
> 
> Extend __page_cache_alloc to take an order parameter and
> modify caller sites. Modify mapping_set_gfp_mask to set
> __GFP_COMP if the mapping requires higher order allocations.
> 
> put_page() is already capable of handling compound pages. So
> there are no changes needed to release higher order page
> cache pages.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/pagemap.h |   15 +++++++++------
>  mm/filemap.c            |    9 +++++----
>  2 files changed, 14 insertions(+), 10 deletions(-)
> 
> Index: linux-2.6.21-rc7/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 21:21:56.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 21:29:02.000000000 -0700
> @@ -32,6 +32,8 @@ static inline void mapping_set_gfp_mask(
>  {
>  	m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
>  				(__force unsigned long)mask;
> +	if (m->order)
> +		m->flags |= __GFP_COMP;

This might need a comment. It might not be superclear that compound
pages have destructors that know how to lookup the order on free.

>  }
>  
>  /*
> @@ -40,7 +42,7 @@ static inline void mapping_set_gfp_mask(
>   * throughput (it can then be mapped into user
>   * space in smaller chunks for same flexibility).
>   *
> - * Or rather, it _will_ be done in larger chunks.
> + * This is the base page size
>   */
>  #define PAGE_CACHE_SHIFT	PAGE_SHIFT
>  #define PAGE_CACHE_SIZE		PAGE_SIZE
> @@ -52,22 +54,23 @@ static inline void mapping_set_gfp_mask(
>  void release_pages(struct page **pages, int nr, int cold);
>  
>  #ifdef CONFIG_NUMA
> -extern struct page *__page_cache_alloc(gfp_t gfp);
> +extern struct page *__page_cache_alloc(gfp_t gfp, int order);
>  #else
> -static inline struct page *__page_cache_alloc(gfp_t gfp)
> +static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
>  {
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
>  }
>  #endif
>  
>  static inline struct page *page_cache_alloc(struct address_space *x)
>  {
> -	return __page_cache_alloc(mapping_gfp_mask(x));
> +	return __page_cache_alloc(mapping_gfp_mask(x), x->order);
>  }
>  
>  static inline struct page *page_cache_alloc_cold(struct address_space *x)
>  {
> -	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> +	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
> +		x->order);
>  }
>  
>  typedef int filler_t(void *, struct page *);
> Index: linux-2.6.21-rc7/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-18 21:21:56.000000000 -0700
> +++ linux-2.6.21-rc7/mm/filemap.c	2007-04-18 21:28:30.000000000 -0700
> @@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p
>  }
>  
>  #ifdef CONFIG_NUMA
> -struct page *__page_cache_alloc(gfp_t gfp)
> +struct page *__page_cache_alloc(gfp_t gfp, int order)
>  {
>  	if (cpuset_do_page_mem_spread()) {
>  		int n = cpuset_mem_spread_node();
> -		return alloc_pages_node(n, gfp, 0);
> +		return alloc_pages_node(n, gfp, order);
>  	}
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
>  }
>  EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
> @@ -803,7 +803,8 @@ grab_cache_page_nowait(struct address_sp
>  		page_cache_release(page);
>  		return NULL;
>  	}
> -	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
> +	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
> +		mapping->order);
>  	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
>  		page_cache_release(page);
>  		page = NULL;
> -

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 3/8] Flushing and zeroing higher order page cache pages
  2007-04-19 16:35 ` [RFC 3/8] Flushing and zeroing " Christoph Lameter
@ 2007-04-20 11:02   ` Mel Gorman
  2007-04-20 16:15     ` Christoph Lameter
  0 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 11:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On (19/04/07 09:35), Christoph Lameter didst pronounce:
> ---
>  include/linux/pagemap.h |   27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> Index: linux-2.6.21-rc7/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/pagemap.h	2007-04-18 22:08:36.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/pagemap.h	2007-04-18 22:09:13.000000000 -0700
> @@ -210,6 +210,33 @@ static inline void wait_on_page_writebac
>  
>  extern void end_page_writeback(struct page *page);
>  
> +/* Support for clearing higher order pages */
> +static inline void clear_mapping_page(struct address_space *a,
> +						struct page *page)
> +{
> +	int nr_pages = 1 << a->order;
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		clear_highpage(page + i);
> +}

While this looks fine, it seems that clear_huge_page() and
clear_mapping_page() could share a common helper. I also note that
clear_huge_page() calls cond_resched() and this doesn't, which may be
the type of behavioral difference we want to avoid.

That said, if this goes ahead, it might be an excuse to look at using
ramfs as the basis for hugetlbfs instead of its current approach. I
believe using ramfs for hugepages is something that wli wants anyway.

> +
> +/*
> + * Support for flushing higher order pages.
> + *
> + * A bit stupid: On many platforms flushing the first page
> + * will flush any TLB starting there
> + */
> +static inline void flush_mapping_page(struct address_space *a,
> +						struct page *page)
> +{
> +	int nr_pages = 1 << a->order;
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		flush_dcache_page(page + i);
> +}
> +
>  /*
>   * Fault a userspace page into pagetables.  Return non-zero on a fault.
>   *

I suppose I'll see where this is used later in the patchset. It might be
easier to review if these helpers are introduced at the point they are
used. Not super-sure, though.


> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-19 16:35 ` [RFC 4/8] Enhance fallback functions in libs to support higher order pages Christoph Lameter
  2007-04-19 18:48   ` Adam Litke
@ 2007-04-20 11:05   ` Mel Gorman
  2007-04-20 18:50     ` Dave Kleikamp
  1 sibling, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 11:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner, shaggy

On (19/04/07 09:35), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Fixup fallback functions
> 
> Fixup the fallback function in fs/libfs.c to be able to handle
> higher order page cache pages.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  fs/libfs.c |   16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6.21-rc7/fs/libfs.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/fs/libfs.c	2007-04-18 21:52:49.000000000 -0700
> +++ linux-2.6.21-rc7/fs/libfs.c	2007-04-18 21:54:51.000000000 -0700
> @@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir,
>  
>  int simple_readpage(struct file *file, struct page *page)
>  {
> -	clear_highpage(page);
> -	flush_dcache_page(page);
> +	clear_mapping_page(file->f_mapping, page);
> +	flush_mapping_page(file->f_mapping, page);

Right, I think it would be easier to read if patches 3 and 4 were
merged. Personal preference, feel free to ignore.

>  	SetPageUptodate(page);
>  	unlock_page(page);
>  	return 0;
> @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
>  			unsigned from, unsigned to)
>  {
>  	if (!PageUptodate(page)) {
> -		if (to - from != PAGE_CACHE_SIZE) {
> +		if (to - from != page_cache_size(file->f_mapping)) {
> +			/*
> +			 * Mapping to higher order pages need to be supported
> +			 * if higher order pages can be in highmem
> +			 */

comments about missing page_cache_size() covered elsewhere. However, I
note that Dave Kleikamp might be interested in this changing of
page_cache_size() from the perspective of page cache tails. I've added
him to the cc so he can take a quick look.

>  			void *kaddr = kmap_atomic(page, KM_USER0);
>  			memset(kaddr, 0, from);
> -			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> -			flush_dcache_page(page);
> +			memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to);
> +			flush_mapping_page(file->f_mapping, page);
>  			kunmap_atomic(kaddr, KM_USER0);
>  		}
>  	}
> @@ -346,7 +350,7 @@ int simple_commit_write(struct file *fil
>  			unsigned from, unsigned to)
>  {
>  	struct inode *inode = page->mapping->host;
> -	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +	loff_t pos = ((loff_t)page->index << page_cache_shift(file->f_mapping)) + to;
>  
>  	if (!PageUptodate(page))
>  		SetPageUptodate(page);
> -

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-19 16:35 ` [RFC 7/8] Enhance ramfs to support higher order pages Christoph Lameter
@ 2007-04-20 13:42   ` Mel Gorman
  2007-04-20 14:47     ` William Lee Irwin III
  2007-04-20 16:20     ` Christoph Lameter
  0 siblings, 2 replies; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 13:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On (19/04/07 09:35), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Add support to ramfs
> 
> The simplest file system to use is ramfs. Add a mount parameter that
> specifies the page order of the pages that ramfs should use. If the
> order is greater than zero then disable mmap functionality.
> 
> This could be removed if the VM would be changes to support faulting
> higher order pages but for now we are content with buffered I/O on higher
> order pages.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  fs/ramfs/file-mmu.c   |   11 +++++++++++
>  fs/ramfs/inode.c      |   15 ++++++++++++---
>  include/linux/ramfs.h |    1 +
>  3 files changed, 24 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c	2007-04-18 21:46:38.000000000 -0700
> +++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c	2007-04-18 22:02:03.000000000 -0700
> @@ -45,6 +45,17 @@ const struct file_operations ramfs_file_
>  	.llseek		= generic_file_llseek,
>  };
>  
> +/* Higher order mappings do not support mmap */
> +const struct file_operations ramfs_file_higher_order_operations = {
> +	.read		= do_sync_read,
> +	.aio_read	= generic_file_aio_read,
> +	.write		= do_sync_write,
> +	.aio_write	= generic_file_aio_write,
> +	.fsync		= simple_sync_file,
> +	.sendfile	= generic_file_sendfile,
> +	.llseek		= generic_file_llseek,
> +};
> +
>  const struct inode_operations ramfs_file_inode_operations = {
>  	.getattr	= simple_getattr,
>  };
> Index: linux-2.6.21-rc7/fs/ramfs/inode.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/fs/ramfs/inode.c	2007-04-18 21:46:38.000000000 -0700
> +++ linux-2.6.21-rc7/fs/ramfs/inode.c	2007-04-18 22:02:03.000000000 -0700
> @@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
>  		inode->i_blocks = 0;
>  		inode->i_mapping->a_ops = &ramfs_aops;
>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> +		inode->i_mapping->order = sb->s_blocksize_bits - PAGE_CACHE_SHIFT;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>  		switch (mode & S_IFMT) {
>  		default:
> @@ -68,7 +69,10 @@ struct inode *ramfs_get_inode(struct sup
>  			break;
>  		case S_IFREG:
>  			inode->i_op = &ramfs_file_inode_operations;
> -			inode->i_fop = &ramfs_file_operations;
> +			if (inode->i_mapping->order)
> +				inode->i_fop = &ramfs_file_higher_order_operations;
> +			else
> +				inode->i_fop = &ramfs_file_operations;

So the difference here appears to be that specifying an order means you
can't mmap(), right?

That's fair enough for the moment, but relaxing it would make ramfs
potentially usable as a replacement for hugetlbfs, so there would be just
one ram-based filesystem instead of two.

>  			break;
>  		case S_IFDIR:
>  			inode->i_op = &ramfs_dir_inode_operations;
> @@ -164,10 +168,15 @@ static int ramfs_fill_super(struct super
>  {
>  	struct inode * inode;
>  	struct dentry * root;
> +	int order = 0;
> +	char *options = data;
> +
> +	if (options && *options)
> +		order = simple_strtoul(options, NULL, 10);
>  

Not the nicest option there but no harm for the moment.

>  	sb->s_maxbytes = MAX_LFS_FILESIZE;
> -	sb->s_blocksize = PAGE_CACHE_SIZE;
> -	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
> +	sb->s_blocksize = PAGE_CACHE_SIZE << order;
> +	sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
>  	sb->s_magic = RAMFS_MAGIC;
>  	sb->s_op = &ramfs_ops;
>  	sb->s_time_gran = 1;
> Index: linux-2.6.21-rc7/include/linux/ramfs.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/ramfs.h	2007-04-18 21:46:38.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/ramfs.h	2007-04-18 22:02:03.000000000 -0700
> @@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file 
>  #endif
>  
>  extern const struct file_operations ramfs_file_operations;
> +extern const struct file_operations ramfs_file_higher_order_operations;
>  extern struct vm_operations_struct generic_file_vm_ops;
>  extern int __init init_rootfs(void);
>  

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
                   ` (12 preceding siblings ...)
  2007-04-20  4:47 ` William Lee Irwin III
@ 2007-04-20 14:14 ` Mel Gorman
  2007-04-20 16:23   ` Christoph Lameter
  13 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 14:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On (19/04/07 09:35), Christoph Lameter didst pronounce:
> Variable Order Page Cache Patchset
> 
> This patchset modifies the core VM so that higher order page cache pages
> become possible. The higher order page cache pages are compound pages
> and can be handled in the same way as regular pages.
> 
> The order of the pages is determined by the order set up in the mapping
> (struct address_space). By default the order is set to zero.
> This means that higher order pages are optional. There is no attempt here
> to generally change the page order of the page cache. 4K pages are effective
> for small files.
> 
> However, it would be good if the VM would support I/O to higher order pages
> to enable efficient support for large scale I/O. If one wants to write a
> long file of a few gigabytes then the filesystem should have a choice of
> selecting a larger page size for that file and handle larger chunks of
> memory at once.
> 

Interesting stuff. It seems promising.

> The support here is only for buffered I/O and only for one filesystem (ramfs).
> Modification of other filesystems to support higher order pages may require
> extensive work of other components of the kernel. But I hope this shows that
> there is a relatively easy way to that goal that could be taken in steps..
> 

And having ramfs even remotely aware of compound pages is a step in the
direction of collapsing hugetlbfs and ramfs into being two sides of the
same coin. I haven't thought about it much but it seems plausible.

> Note that the higher order pages are subject to reclaim. This works in general
> since we are always operating on a single page struct. Reclaim is fooled to
> think that it is touching page sized objects (there are likely issues to be
> fixed there if we want to go down this road).
> 

I believe there is an assumption in parts of reclaim that LRU pages are
order-0. An interesting bug or two is likely to rear its head there.

> What is currently not supported:
> - Buffer heads for higher order pages (possible with the compound pages in mm
>   that do not use page->private requires upgrade of the buffer cache layers).
> - Higher order pages in the block layer etc.
> - Mmapping higher order pages
> 
> Note that this is proof-of-concept. Lots of functionality is missing and
> various issues have not been dealt with. Use of higher order pages may cause
> memory fragmentation. Mel Gorman's anti-fragmentation work is probably
> essential if we want to do this. We likely need actual defragmentation
> support.
> 

Ok, anti-fragmentation will help up to a point but it's awkward with ramfs
because those pages are not reclaimable or migratable no matter what the
order. Normal filesystems would fare much better fragmentation-wise.

The problem is that the mapping gfp_mask is normally GFP_HIGH_MOVABLE but it's
GFP_HIGHUSER for ramfs. This patchset will increase the number of non-movable
high-order allocations quite considerably and it will tend to fragment memory
worse than we do currently. I can think of ways it can be dealt with 
(even marking them RECLAIMABLE would help) so I'm not massively worried
now but I'll keep it in mind as things develop.

> The main point of this patchset is to demonstrate that it is basically
> possible to have higher order support with straightforward changes to the
> VM.
> 
> The ramfs driver can be used to test higher order page cache functionality
> (and may help troubleshoot the VM support until we get some real filesystem
> and real devices supporting higher order pages).
> 
> If you apply this patch and then you can f.e. try this:
> 
> mount -tramfs -o10 none /media
> 
> 	Mounts a ramfs filesystem with order 10 pages (4 MB)
> 
> cp linux-2.6.21-rc7.tar.gz /media
> 
> 	Populate the ramfs. Note that we allocate 14 pages of 4M each
> 	instead of 13508..
> 
> umount /media
> 
> 	Gets rid of the large pages again
> 
> Comments appreciated.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-19 19:11 ` Andi Kleen
  2007-04-19 19:15   ` Christoph Lameter
@ 2007-04-20 14:37   ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2007-04-20 14:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, linux-kernel, Peter Zijlstra, Nick Piggin,
	Paul Jackson, Dave Chinner

On (19/04/07 21:11), Andi Kleen didst pronounce:
> > We likely need actual defragmentation support.
> 
> To be honest it looks quite pointless before this is solved. So far it is
> not even clear if it is feasible to solve it.
> 

I've written a proposal in an OLS paper on how such a mechanism would work
based on the existing page migration feature.  However, it depends heavily
on grouping pages by mobility to work because without it, a defragmentation
mechanism will be ineffective. I was holding off posting an RFC until I saw
how fragmentation avoidance got on in the next merge window and found the
time to prototype it.

Without going into unnecessary detail, the end result of a compaction run is
that all movable pages are at the end of the zone and all unmovable pages
are at the start with contiguous free space in the middle. Grouping pages
by mobility as it is biases the location of non-movable pages towards the
lower PFNs. Assuming it had a reasonable level of success, even high-order
ramfs pages that were unmovable should work out.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 13:42   ` Mel Gorman
@ 2007-04-20 14:47     ` William Lee Irwin III
  2007-04-20 16:30       ` Christoph Lameter
  2007-04-20 16:20     ` Christoph Lameter
  1 sibling, 1 reply; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, linux-kernel, Peter Zijlstra, Nick Piggin,
	Andi Kleen, Paul Jackson, Dave Chinner

On Fri, Apr 20, 2007 at 02:42:27PM +0100, Mel Gorman wrote:
> That's fair enough for the moment but relaxing would make ramfs
> potentially usable as a replacement for hugetlbfs so there would be just
> one ram-based filesystem instead of two.

Careful there. mmap() needs more than this.

(1) mapping->order is variable within an fs, so the architectural code
	would need some vague awareness of the underlying page size
	being variable unless the fs restricts it properly.
(2) by and large even large ptes are assumed to map a single object;
	where mapping->order exceeds the next smallest TLB entry size
	some potentially intricate machinations are required unless
	mapping->order restriction is used to avoid it.
(3) a backward compatibility wrapper for expand-on-mmap() semantics
	is needed, among other things.
(4) various odd hugetlb things like quotas are in there
(5) There are doubtless several oddities about SHM_HUGETLB to emulate
	that would not be automatic when substituting such an extended
	ramfs for hugetlbfs in ipc/shm.c
(6) ->get_unmapped_area() horrors.

The hugetlbfs fs stub has by and large been a huge embarrassment to me,
so I'd welcome the opportunity to foist off the vfs lifting onto ramfs.
I'd be happier with real superpages, but it's not my kernel.


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-20  8:21       ` Jens Axboe
@ 2007-04-20 16:01         ` Christoph Lameter
  2007-04-20 16:51           ` Jens Axboe
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Adam Litke, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, Jens Axboe wrote:

> This works fine as long as you are in the submitter context, but once
> you pass it into the block layer, we don't have any way to find the
> address space (at least we don't want to). Would something like this be
> workable, name notwithstanding:
> 
> static unsigned long page_size(struct page *page)
> {
>         struct address_space *mapping;
>         int order = 0;
> 
>         mapping = page_mapping(page);
>         if (mapping)
>                 order = mapping->order;
> 
>         return PAGE_CACHE_SIZE << order;
> }

There is a much simpler solution (possible with mm):

PAGE_SIZE << compound_order(page)

compound_order will return 0 for a non compound page.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 3/8] Flushing and zeroing higher order page cache pages
  2007-04-20 11:02   ` Mel Gorman
@ 2007-04-20 16:15     ` Christoph Lameter
  2007-04-20 16:51       ` William Lee Irwin III
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, Mel Gorman wrote:

> While this looks fine, it seems that clear_huge_page() and
> clear_mapping_page() could share a common helper. I also note that
> clear_huge_page() calls cond_resched() and this doesn't, which may be
> the type of different behavior we want to avoid.

I am really thinking that this variable order page cache approach 
is likely going to result in the final death of the huge page subsystem. I 
would like to keep huge pages separate from this so that the huge page 
subsystem can be removed at some point without too much trouble. Right now 
it is a very sore point at least from a performance standpoint since the 
hugetlb subsystem is serialized with a single lock. There is a weird maze 
of locking and accounting constraints in there that makes it difficult to 
fix this.

> That said, if this goes ahead, it might be an excuse to look at using
> ramfs as the basis for hugetlbfs instead of its current approach. I
> believe using ramfs for hugepages is something that wli wants anyway.

Right. There is no reason for hugetlbfs to exist anymore. We will have 
very transparent and flexible support for huge pages.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 13:42   ` Mel Gorman
  2007-04-20 14:47     ` William Lee Irwin III
@ 2007-04-20 16:20     ` Christoph Lameter
  1 sibling, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, Mel Gorman wrote:

> So the difference here appears to be that specifying an order means you
> can't mmap(). right?
> 
> That's fair enough for the moment but relaxing would make ramfs
> potentially usable as a replacement for hugetlbfs so there would be just
> one ram-based filesystem instead of two.

Yes, I have some draft patches that enable mmap. But I think we should 
first make the non-mmap case work cleanly.

The current approach is to map higher order pages into an address space on 
a per PTE basis. A page fault will establish one pte which may point to a 
tail page of a compound page. This means that the 4k semantics are 
preserved. We essentially manage pointers into 4k sections of larger 
pages.

Later we could add support for PMD faults. If the page size is larger than
pmd size then establish pmds mapping 2M instead of ptes. But that would be 
much much later when everything else works.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20 14:14 ` Mel Gorman
@ 2007-04-20 16:23   ` Christoph Lameter
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Peter Zijlstra, Nick Piggin, Paul Jackson,
	Dave Chinner, Andi Kleen

On Fri, 20 Apr 2007, Mel Gorman wrote:

> I believe there is an assumption in parts of reclaim that LRU pages are
> order-0. An interesting bug or two is likely to rear its head there.

Correct. We need to deal with reclaim etc.

> > Note that this is proof-of-concept. Lots of functionality is missing and
> > various issues have not been dealt with. Use of higher order pages may cause
> > memory fragmentation. Mel Gorman's anti-fragmentation work is probably
> > essential if we want to do this. We likely need actual defragmentation
> > support.
> > 
> 
> Ok, anti-fragmentation will help up to a point but it's awkward with ramfs
> because those pages are not reclaimable or migratable no matter what the
> order. Normal filesystems would fare much better fragmentation-wise.
> 
> The problem is that the mapping gfp_mask is normally GFP_HIGH_MOVABLE but it's
> GFP_HIGHUSER for ramfs. This patchset will increase the number of non-movable
> high-order allocations quite considerably and it will tend to fragment memory
> worse than we do currently. I can think of ways it can be dealt with 
> (even marking them RECLAIMABLE would help) so I'm not massively worried
> now but I'll keep it in mind as things develop.

Well I think we will have xfs support soon. Then we can deal with more 
issues and be more complete. What I wanted from this post was a consensus 
on how to proceed. There are many subsystems involved and I do not want to 
go off the deep end.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 14:47     ` William Lee Irwin III
@ 2007-04-20 16:30       ` Christoph Lameter
  2007-04-20 17:11         ` William Lee Irwin III
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:30 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> On Fri, Apr 20, 2007 at 02:42:27PM +0100, Mel Gorman wrote:
> > That's fair enough for the moment but relaxing would make ramfs
> > potentially usable as a replacement for hugetlbfs so there would be just
> > one ram-based filesystem instead of two.
> 
> Careful there. mmap() needs more than this.
> 
> (1) mapping->order is variable within an fs, so the architectural code
> 	would need some vague awareness of the underlying page size
> 	being variable unless the fs restricts it properly.

We can map arbitrary 4k chunks of larger pages.

> The hugetlbfs fs stub has by and large been a huge embarrassment to me,
> so I'd welcome the opportunity to foist off the vfs lifting onto ramfs.
> I'd be happier with real superpages, but it's not my kernel.

We are not doing superpages in any form. The filesystem determines its page 
cache size, thereby managing pages of arbitrary order. The usual VM 
mmap will work using PAGE_SIZEd mappings as now (once that is working 
right). There is no need to tie the page cache order to the page sizes 
used by mmap or the sizes used by the fault handling of the VM.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Enhance fallback functions in libs to support higher   order pages
  2007-04-20 16:01         ` Christoph Lameter
@ 2007-04-20 16:51           ` Jens Axboe
  0 siblings, 0 replies; 58+ messages in thread
From: Jens Axboe @ 2007-04-20 16:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Adam Litke, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, Apr 20 2007, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Jens Axboe wrote:
> 
> > This works fine as long as you are in the submitter context, but once
> > you pass it into the block layer, we don't have any way to find the
> > address space (at least we don't want to). Would something like this be
> > workable, name notwithstanding:
> > 
> > static unsigned long page_size(struct page *page)
> > {
> >         struct address_space *mapping;
> >         int order = 0;
> > 
> >         mapping = page_mapping(page);
> >         if (mapping)
> >                 order = mapping->order;
> > 
> >         return PAGE_CACHE_SIZE << order;
> > }
> 
> There is a much simpler solution (possible with mm):
> 
> PAGE_SIZE << compound_order(page)
> 
> compound_order will return 0 for a non compound page.

Ah perfect, much easier. I'll spin a patchset for the block bits.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 3/8] Flushing and zeroing higher order page cache pages
  2007-04-20 16:15     ` Christoph Lameter
@ 2007-04-20 16:51       ` William Lee Irwin III
  0 siblings, 0 replies; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20 16:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, Mel Gorman wrote:
>> While this looks fine, it seems that clear_huge_page() and
>> clear_mapping_page() could share a common helper. I also note that
>> clear_huge_page() calls cond_resched() and this doesn't, which may be
>> the type of different behavior we want to avoid.

On Fri, Apr 20, 2007 at 09:15:25AM -0700, Christoph Lameter wrote:
> I am really thinking that this variable order page cache approach 
> is likely going to result in the final death of the huge page subsystem. I 
> would like to keep huge pages separate from this so that the huge page 
> subsystem can be removed at some point without too much trouble. Right now 
> it is a very sore point at least from a performance standpoint since the 
> hugetlb subsystem is serialized with a single lock. There is a weird maze 
> of locking and accounting constraints in there that makes it difficult to 
> fix this.

It'll drive a stake through its heart, but there are a number of stupid
catches to deal with before it'll finally flush down the drain.
Existing apps will require backward compatibility wrappers for various
forms of semantic damage done over the years, some over my objections,
effectively in perpetuity. fs/hugetlbfs/ can effectively be reduced to
wrappers around a normal fs for that, but sadly I doubt the rm -rf I
want will fly barring shoving that crud into fs/ramfs/ (which people
may be loath to do).


On Fri, 20 Apr 2007, Mel Gorman wrote:
>> That said, if this goes ahead, it might be an excuse to look at using
>> ramfs as the basis for hugetlbfs instead of its current approach. I
>> believe using ramfs for hugepages is something that wli wants anyway.

On Fri, Apr 20, 2007 at 09:15:25AM -0700, Christoph Lameter wrote:
> Right. There is no reason for hugetlbfs to exist anymore. We will have 
> very transparent and flexible support for huge pages.

Ramming hugetlb, fs and all, down the garbage disposal is the direction
I really want to go, and in precisely this manner. (Ever wondered why I
never work on extending it?) There's an easy subdivision of labor here
(apart from where I contribute otherwise). Get generic going and keep
me posted and I'll actively go about ripping out those pieces made
redundant by it all.

To date I've been blocked by the absence of collaborators. I can put
company time down on this vs. other issues which are mere spare time.


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 16:30       ` Christoph Lameter
@ 2007-04-20 17:11         ` William Lee Irwin III
  2007-04-20 17:15           ` Christoph Lameter
  0 siblings, 1 reply; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20 17:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> Careful there. mmap() needs more than this.
>> (1) mapping->order is variable within an fs, so the architectural code
>> 	would need some vague awareness of the underlying page size
>> 	being variable unless the fs restricts it properly.

On Fri, Apr 20, 2007 at 09:30:30AM -0700, Christoph Lameter wrote:
> We can map arbitrary 4k chunks of larger pages.

The core VM can do that but the hugetlb architectural code can't fall
back to smaller page sizes. It also should not be put into a situation
where it needs to do so given the semantics it must honor.


On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> The hugetlbfs fs stub has by and large been a huge embarrassment to me,
>> so I'd welcome the opportunity to foist off the vfs lifting onto ramfs.
>> I'd be happier with real superpages, but it's not my kernel.

On Fri, Apr 20, 2007 at 09:30:30AM -0700, Christoph Lameter wrote:
> We are not doing superpages in any form. The filesystem determines it page 
> cache size thereby managing pages of arbitrary order. The usual VM 
> mmap will work using PAGE_SIZEd mappings as now (once that is working 
> right). There is no need to tie the page cache order to the page sizes 
> used by mmap or the sizes used by the fault handling of the VM.

Fine, s/real superpages/superpages/ -- I'm not picky enough.

Also, the final assertion is inaccurate. Fault handlers must instantiate
pages of order mapping->order when faulting in a page of a file with
a given pagecache size. The semantics of faulting and mmap()'ing are
slightly underspecified but no requirements for the size of the newly
established translations are included, so they can be PAGE_SIZE easily
enough.


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:11         ` William Lee Irwin III
@ 2007-04-20 17:15           ` Christoph Lameter
  2007-04-20 17:19             ` William Lee Irwin III
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 17:15 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> 
> On Fri, Apr 20, 2007 at 09:30:30AM -0700, Christoph Lameter wrote:
> > We can map arbitrary 4k chunks of larger pages.
> 
> The core VM can do that but the hugetlb architectural code can't fall
> back to smaller page sizes. It also should not be put into a situation
> where it needs to do so given the semantics it must honor.

Well, we could potentially add a handle_pmd_fault to the vm...?
 
> Also, the final assertion is inaccurate. Fault handlers must instantiate
> pages of order mapping->order when faulting in a page of a file with
> a given pagecache size. The semantics of faulting and mmap()'ing are

Why? I agree that the page state of the higher order page must be updated 
consistently but one can use a pte to map a 4k chunk of a higher 
order page.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:15           ` Christoph Lameter
@ 2007-04-20 17:19             ` William Lee Irwin III
  2007-04-20 17:57               ` Christoph Lameter
                                 ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20 17:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> The core VM can do that but the hugetlb architectural code can't fall
>> back to smaller page sizes. It also should not be put into a situation
>> where it needs to do so given the semantics it must honor.

On Fri, Apr 20, 2007 at 10:15:00AM -0700, Christoph Lameter wrote:
> Well, we could potentially add a handle_pmd_fault to the vm...?

Unconscionably foul. I guess x86-uber-alles pagetables in the core vm
is the Linux way, though.


On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> Also, the final assertion is inaccurate. Fault handlers must instantiate
>> pages of order mapping->order when faulting in a page of a file with
>> a given pagecache size. The semantics of faulting and mmap()'ing are

On Fri, Apr 20, 2007 at 10:15:00AM -0700, Christoph Lameter wrote:
> Why? I agree that the page state of the higher order page must be updated 
> consistently but one can use a pte to map a 4k chunk of a higher 
> order page.

Probably just terminological disagreement here. I was referring to
allocating the higher-order page from the fault path here, not mapping
it or a piece of it with a user pte.


-- wli

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:19             ` William Lee Irwin III
@ 2007-04-20 17:57               ` Christoph Lameter
  2007-04-20 19:21                 ` William Lee Irwin III
  2007-04-20 17:59               ` Christoph Lameter
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 17:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> >> The core VM can do that but the hugetlb architectural code can't fall
> >> back to smaller page sizes. It also should not be put into a situation
> >> where it needs to do so given the semantics it must honor.
> 
> On Fri, Apr 20, 2007 at 10:15:00AM -0700, Christoph Lameter wrote:
> > Well, we could potentially add a handle_pmd_fault to the vm...?
> 
> Unconscionably foul. I guess x86-uber-alles pagetables in the core vm
> is the Linux way, though.

Yes sadly true. The alternative is to add a page table abstraction layer 
first.

> On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> >> Also, the final assertion is inaccurate. Fault handlers must instantiate
> >> pages of order mapping->order when faulting in a page of a file with
> >> a given pagecache size. The semantics of faulting and mmap()'ing are
> 
> On Fri, Apr 20, 2007 at 10:15:00AM -0700, Christoph Lameter wrote:
> > Why? I agree that the page state of the higher order page must be updated 
> > consistently but one can use a pte to map a 4k chunk of a higher 
> > order page.
> 
> Probably just terminological disagreement here. I was referring to
> allocating the higher-order page from the fault path here, not mapping
> it or a piece of it with a user pte.

Ah. Okay. I have some dysfunctional patches here that implement mmap 
support. Would you be willing to take care of that aspect of things? Then 
I can focus on the other VM pieces. I am going to post them following this 
message. These are an absolute mess. They do not compile etc etc.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:19             ` William Lee Irwin III
  2007-04-20 17:57               ` Christoph Lameter
@ 2007-04-20 17:59               ` Christoph Lameter
  2007-04-20 18:01               ` Christoph Lameter
  2007-04-20 18:02               ` Christoph Lameter
  3 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 17:59 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

Variable Order Page Cache: Readahead fixups

Readahead is now dependent on the page size. For larger page sizes
we want less readahead.

Add a parameter to max_sane_readahead specifying the page order
and update the code in mm/readahead.c to be aware of variant
page sizes.

[WARNING untested likely does not compile.....]

---
 include/linux/mm.h |    2 +-
 mm/fadvise.c       |    5 +++--
 mm/filemap.c       |    5 +++--
 mm/madvise.c       |    4 +++-
 mm/readahead.c     |   12 ++++++------
 5 files changed, 16 insertions(+), 12 deletions(-)

Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h	2007-04-19 21:24:12.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h	2007-04-19 21:26:16.000000000 -0700
@@ -1084,7 +1084,7 @@ unsigned long page_cache_readahead(struc
 			  unsigned long size);
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
 
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
Index: linux-2.6.21-rc7/mm/fadvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fadvise.c	2007-04-19 21:24:12.000000000 -0700
+++ linux-2.6.21-rc7/mm/fadvise.c	2007-04-19 21:26:16.000000000 -0700
@@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd,
 		nrpages = end_index - start_index + 1;
 		if (!nrpages)
 			nrpages = ~0UL;
-		
+
 		ret = force_page_cache_readahead(mapping, file,
 				start_index,
-				max_sane_readahead(nrpages));
+				max_sane_readahead(nrpages,
+				mapping->order));
 		if (ret > 0)
 			ret = 0;
 		break;
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-19 21:24:12.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-19 21:26:16.000000000 -0700
@@ -1246,7 +1246,7 @@ do_readahead(struct address_space *mappi
 		return -EINVAL;
 
 	force_page_cache_readahead(mapping, filp, index,
-					max_sane_readahead(nr));
+				max_sane_readahead(nr, mapping->order));
 	return 0;
 }
 
@@ -1381,7 +1381,8 @@ retry_find:
 			count_vm_event(PGMAJFAULT);
 		}
 		did_readaround = 1;
-		ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+		ra_pages = max_sane_readahead(file->f_ra.ra_pages,
+							mapping->order);
 		if (ra_pages) {
 			pgoff_t start = 0;
 
Index: linux-2.6.21-rc7/mm/madvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/madvise.c	2007-04-19 21:24:12.000000000 -0700
+++ linux-2.6.21-rc7/mm/madvise.c	2007-04-19 21:26:16.000000000 -0700
@@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
 	force_page_cache_readahead(file->f_mapping,
-			file, start, max_sane_readahead(end - start));
+			file, start,
+			max_sane_readahead(end - start,
+				file->f_mapping->order));
 	return 0;
 }
 
Index: linux-2.6.21-rc7/mm/readahead.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/readahead.c	2007-04-19 21:24:12.000000000 -0700
+++ linux-2.6.21-rc7/mm/readahead.c	2007-04-19 21:26:16.000000000 -0700
@@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac
 			put_pages_list(pages);
 			break;
 		}
-		task_io_account_read(PAGE_CACHE_SIZE);
+		task_io_account_read(page_cache_size(mapping));
 	}
 	pagevec_lru_add(&lru_pvec);
 	return ret;
@@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ 	end_index = ((isize - 1) >> page_cache_shift(mapping));
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -330,7 +330,7 @@ int force_page_cache_readahead(struct ad
 	while (nr_to_read) {
 		int err;
 
-		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+		unsigned long this_chunk = (2 * 1024 * 1024) / page_cache_size(mapping);
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
@@ -570,11 +570,11 @@ void handle_ra_miss(struct address_space
 }
 
 /*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
  * sensible upper limit.
  */
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
 {
 	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2 >> order);
 }
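
To illustrate the arithmetic of the patched max_sane_readahead(): the reclaimable-page estimate is halved and then shifted right by the mapping's order, so a mapping of order N gets its readahead cap cut by a factor of 2^N. A minimal userspace model (the `reclaimable` parameter stands in for the NR_INACTIVE + NR_FREE_PAGES node counters; not code from the patch):

```c
#include <assert.h>

/* Userspace model of the patched max_sane_readahead().
 * 'reclaimable' stands in for NR_INACTIVE + NR_FREE_PAGES on this node.
 * Note the precedence: '/' binds tighter than '>>', so the estimate is
 * halved first and then scaled down by the mapping order.
 */
static unsigned long max_sane_readahead_model(unsigned long nr, int order,
					      unsigned long reclaimable)
{
	unsigned long limit = reclaimable / 2 >> order;

	return nr < limit ? nr : limit;
}
```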

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:19             ` William Lee Irwin III
  2007-04-20 17:57               ` Christoph Lameter
  2007-04-20 17:59               ` Christoph Lameter
@ 2007-04-20 18:01               ` Christoph Lameter
  2007-04-20 18:02               ` Christoph Lameter
  3 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 18:01 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

Variable Order Page Cache: mmap_nopage and mmap_populate

Fix up both functions so that they can operate on pages of arbitrary
order. However, both functions still establish page table entries in
PAGE_SIZE units only, and the offset/pgoff passed when calling them is
always in PAGE_SIZE units. The parameters were therefore renamed to
pgoff_page, which is in PAGE_SIZE units, in contrast to pgoff, which
is in the order prescribed by the address space.

As a result, both functions may handle a page struct pointer that
refers to a tail page: the page that is to be mapped or that was
mapped. However, that page struct cannot be used to take a refcount
or mark page characteristics. This can only be done on the
head page!

We also need to fix up install_page, since filemap_populate
relies on it.
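
The index arithmetic used in both functions can be summarized as follows. This is a standalone sketch of the decomposition, not code from the patch:

```c
#include <assert.h>

/* Split a PAGE_SIZE-based page offset into an index into the mapping
 * (in units of the mapping's page order) plus the index of the tail
 * page within the compound page.
 */
static void split_pgoff(unsigned long pgoff_page, int order,
			unsigned long *pgoff, unsigned long *compound_index)
{
	*pgoff = pgoff_page >> order;
	*compound_index = pgoff_page % (1UL << order);
}
```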

[WARNING: very early draft; untested and may not compile]

---
 mm/filemap.c |   38 ++++++++++++++++++++++++++++----------
 mm/fremap.c  |   17 +++++++++++------
 2 files changed, 39 insertions(+), 16 deletions(-)

Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c	2007-04-19 21:26:16.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c	2007-04-19 21:27:55.000000000 -0700
@@ -1318,6 +1318,12 @@ static int fastcall page_cache_read(stru
  * The goto's are kind of ugly, but this streamlines the normal case of having
  * it in the page cache, and handles the special cases reasonably without
  * having a lot of duplicated code.
+ *
+ * filemap_nopage returns pointer to a page that may be a tail page
+ * of a compound page suitable for the VM to map a PAGE_SIZE portion.
+ * However, the VM must update state information in the head page
+ * alone. E.g. taking a refcount on a tail page does not have the
+ * intended effect.
  */
 struct page *filemap_nopage(struct vm_area_struct *area,
 				unsigned long address, int *type)
@@ -1328,13 +1334,15 @@ struct page *filemap_nopage(struct vm_ar
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	struct page *page;
-	unsigned long size, pgoff;
+	unsigned long size, pgoff, pgoff_page, compound_index;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
-	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+	pgoff_page = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+	pgoff = pgoff_page >> mapping->order;
+	compound_index = pgoff_page % (1 << mapping->order);
 
 retry_all:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = (i_size_read(inode) + page_cache_size(mapping) - 1) >> page_cache_shift(mapping);
 	if (pgoff >= size)
 		goto outside_data_content;
 
@@ -1412,7 +1420,7 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
-	return page;
+	return page + compound_index;
 
 outside_data_content:
 	/*
@@ -1637,8 +1645,12 @@ err:
 	return NULL;
 }
 
+/*
+ * filemap_populate installs page sized ptes in the indicated area.
+ * However, the underlying pages may be of higher order.
+ */
 int filemap_populate(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long len, pgprot_t prot, unsigned long pgoff,
+		unsigned long len, pgprot_t prot, unsigned long pgoff_page,
 		int nonblock)
 {
 	struct file *file = vma->vm_file;
@@ -1648,14 +1660,20 @@ int filemap_populate(struct vm_area_stru
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	int err;
+	unsigned long pgoff;
+	int compound_index;
 
 	if (!nonblock)
 		force_page_cache_readahead(mapping, vma->vm_file,
-					pgoff, len >> PAGE_CACHE_SHIFT);
+			pgoff_page >> mapping->order,
+			len >> page_cache_shift(mapping));
 
 repeat:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)
+	pgoff = pgoff_page >> mapping->order;
+	compound_index = pgoff_page % (1 << mapping->order);
+
+	size = (i_size_read(inode) + page_cache_size(mapping) - 1) >> page_cache_shift(mapping);
+	if (pgoff + (len >> page_cache_shift(mapping)) > size)
 		return -EINVAL;
 
 	page = filemap_getpage(file, pgoff, nonblock);
@@ -1666,7 +1684,7 @@ repeat:
 		return -ENOMEM;
 
 	if (page) {
-		err = install_page(mm, vma, addr, page, prot);
+		err = install_page(mm, vma, addr, page + compound_index, prot);
 		if (err) {
 			page_cache_release(page);
 			return err;
@@ -1682,7 +1700,7 @@ repeat:
 
 	len -= PAGE_SIZE;
 	addr += PAGE_SIZE;
-	pgoff++;
+	pgoff_page++;
 	if (len)
 		goto repeat;
 
Index: linux-2.6.21-rc7/mm/fremap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fremap.c	2007-04-19 21:33:34.000000000 -0700
+++ linux-2.6.21-rc7/mm/fremap.c	2007-04-19 21:37:30.000000000 -0700
@@ -46,7 +46,9 @@ static int zap_pte(struct mm_struct *mm,
 
 /*
  * Install a file page to a given virtual memory address, release any
- * previously existing mapping.
+ * previously existing mapping. The page may point to a tail page
+ * in which case we update the state in the head page but establish
+ * a PAGE_SIZEd mapping to the tail page alone.
  */
 int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long addr, struct page *page, pgprot_t prot)
@@ -57,6 +59,8 @@ int install_page(struct mm_struct *mm, s
 	pte_t *pte;
 	pte_t pte_val;
 	spinlock_t *ptl;
+	struct address_space *mapping;
+	struct page *head_page = compound_head(page);
 
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
@@ -67,12 +71,13 @@ int install_page(struct mm_struct *mm, s
 	 * caller about it.
 	 */
 	err = -EINVAL;
-	inode = vma->vm_file->f_mapping->host;
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (!page->mapping || page->index >= size)
+	mapping = vma->vm_file->f_mapping;
+	inode = mapping->host;
+	size = (i_size_read(inode) + page_cache_size(mapping) - 1) >> page_cache_shift(mapping);
+	if (!head_page->mapping || head_page->index >= size)
 		goto unlock;
 	err = -ENOMEM;
-	if (page_mapcount(page) > INT_MAX/2)
+	if (page_mapcount(head_page) > INT_MAX/2)
 		goto unlock;
 
 	if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
@@ -81,7 +86,7 @@ int install_page(struct mm_struct *mm, s
 	flush_icache_page(vma, page);
 	pte_val = mk_pte(page, prot);
 	set_pte_at(mm, addr, pte, pte_val);
-	page_add_file_rmap(page);
+	page_add_file_rmap(head_page);
 	update_mmu_cache(vma, addr, pte_val);
 	lazy_mmu_prot_update(pte_val);
 	err = 0;



* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:19             ` William Lee Irwin III
                                 ` (2 preceding siblings ...)
  2007-04-20 18:01               ` Christoph Lameter
@ 2007-04-20 18:02               ` Christoph Lameter
  3 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 18:02 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

Some ideas for memory.c pieces. Just junk like the earlier patches.

---
 mm/memory.c |  108 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 66 insertions(+), 42 deletions(-)

Index: linux-2.6.21-rc7/mm/memory.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/memory.c	2007-04-20 10:55:49.000000000 -0700
+++ linux-2.6.21-rc7/mm/memory.c	2007-04-20 10:56:13.000000000 -0700
@@ -368,6 +368,12 @@ static inline int is_cow_mapping(unsigne
 /*
  * This function gets the "struct page" associated with a pte.
  *
+ * NOTE! For compound pages it may get to a tail page (maybe we
+ * only deal with a portion of a compound page after all). This
+ * means that the result of vm_normal_page may not directly be
+ * used to manipulate the page state. Use compound_head() if
+ * operations (like getting a ref count) are necessary.
+ *
  * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
  * will have each page table entry just pointing to a raw page frame
  * number, and as far as the VM layer is concerned, those do not have
@@ -480,9 +486,11 @@ copy_one_pte(struct mm_struct *dst_mm, s
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
-		get_page(page);
-		page_dup_rmap(page);
-		rss[!!PageAnon(page)]++;
+		struct page *head_page = compound_head(page);
+
+		get_page(head_page);
+		page_dup_rmap(head_page);
+		rss[!!PageAnon(head_page)]++;
 	}
 
 out_set_pte:
@@ -642,8 +650,14 @@ static unsigned long zap_pte_range(struc
 
 		if (pte_present(ptent)) {
 			struct page *page;
+			struct page *head_page;
 
 			page = vm_normal_page(vma, addr, ptent);
+			if (page)
+				head_page = compound_head(page);
+			else
+				head_page = NULL;
+
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
@@ -651,15 +665,15 @@ static unsigned long zap_pte_range(struc
 				 * unmap shared but keep private pages.
 				 */
 				if (details->check_mapping &&
-				    details->check_mapping != page->mapping)
+				    details->check_mapping != head_page->mapping)
 					continue;
 				/*
 				 * Each page->index must be checked when
 				 * invalidating or truncating nonlinear.
 				 */
 				if (details->nonlinear_vma &&
-				    (page->index < details->first_index ||
-				     page->index > details->last_index))
+				    (head_page->index < details->first_index ||
+				     head_page->index > details->last_index))
 					continue;
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -668,21 +682,24 @@ static unsigned long zap_pte_range(struc
 			if (unlikely(!page))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
-			    && linear_page_index(details->nonlinear_vma,
-						addr) != page->index)
+			    && linear_page_index_mapping(details->nonlinear_vma,
+						addr, compound_order(head_page))
+							!= head_page->index)
 				set_pte_at(mm, addr, pte,
-					   pgoff_to_pte(page->index));
-			if (PageAnon(page))
+					   pgoff_to_pte(
+					   	(head_page->index << compound_order(head_page))
+							+ page - head_page));
+			if (PageAnon(head_page))
 				anon_rss--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty(head_page);
 				if (pte_young(ptent))
-					SetPageReferenced(page);
+					SetPageReferenced(head_page);
 				file_rss--;
 			}
-			page_remove_rmap(page, vma);
-			tlb_remove_page(tlb, page);
+			page_remove_rmap(head_page, vma);
+			tlb_remove_page(tlb, head_page);
 			continue;
 		}
 		/*
@@ -899,6 +916,10 @@ unsigned long zap_page_range(struct vm_a
 
 /*
  * Do a quick page-table lookup for a single page.
+ *
+ * Note: This function may return a pointer to a tail page. However,
+ * any operations like getting a page reference and touching it will
+ * have to be performed on the head page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
@@ -949,13 +970,14 @@ struct page *follow_page(struct vm_area_
 	if (unlikely(!page))
 		goto unlock;
 
+	head_page = compound_head(page);
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page(head_page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
-		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
-		mark_page_accessed(page);
+		    !pte_dirty(pte) && !PageDirty(head_page))
+			set_page_dirty(head_page);
+		mark_page_accessed(head_page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1537,6 +1559,7 @@ static int do_wp_page(struct mm_struct *
 		spinlock_t *ptl, pte_t orig_pte)
 {
 	struct page *old_page, *new_page;
+	struct page *old_page_head, *new_page_head;
 	pte_t entry;
 	int reuse = 0, ret = VM_FAULT_MINOR;
 	struct page *dirty_page = NULL;
@@ -1545,14 +1568,15 @@ static int do_wp_page(struct mm_struct *
 	if (!old_page)
 		goto gotten;
 
+	old_page_head = compound_head(old_page);
 	/*
 	 * Take out anonymous pages first, anonymous shared vmas are
 	 * not dirty accountable.
 	 */
-	if (PageAnon(old_page)) {
-		if (!TestSetPageLocked(old_page)) {
-			reuse = can_share_swap_page(old_page);
-			unlock_page(old_page);
+	if (PageAnon(old_page_head)) {
+		if (!TestSetPageLocked(old_page_head)) {
+			reuse = can_share_swap_page(old_page_head);
+			unlock_page(old_page_head);
 		}
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
@@ -1570,10 +1594,10 @@ static int do_wp_page(struct mm_struct *
 			 * We do this without the lock held, so that it can
 			 * sleep if it needs to.
 			 */
-			page_cache_get(old_page);
+			page_cache_get(old_page_head);
 			pte_unmap_unlock(page_table, ptl);
 
-			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+			if (vma->vm_ops->page_mkwrite(vma, old_page_head) < 0)
 				goto unwritable_page;
 
 			/*
@@ -1584,11 +1608,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
+			page_cache_release(old_page_head);
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-		dirty_page = old_page;
+		dirty_page = old_page_head;
 		get_page(dirty_page);
 		reuse = 1;
 	}
@@ -1607,7 +1631,7 @@ static int do_wp_page(struct mm_struct *
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	page_cache_get(old_page);
+	page_cache_get(old_page_head);
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
@@ -1630,8 +1654,8 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page, vma);
-			if (!PageAnon(old_page)) {
+			page_remove_rmap(old_page_head, vma);
+			if (!PageAnon(old_page_head)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
 			}
@@ -1654,13 +1678,13 @@ gotten:
 		page_add_new_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
-		new_page = old_page;
+		new_page = old_page_head;
 		ret |= VM_FAULT_WRITE;
 	}
 	if (new_page)
 		page_cache_release(new_page);
 	if (old_page)
-		page_cache_release(old_page);
+		page_cache_release(old_page_head);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
@@ -1669,8 +1693,8 @@ unlock:
 	}
 	return ret;
 oom:
-	if (old_page)
-		page_cache_release(old_page);
+	if (old_page)
+		page_cache_release(old_page_head);
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2243,7 +2267,7 @@ retry:
 			if (!page)
 				goto oom;
 			copy_user_highpage(page, new_page, address, vma);
-			page_cache_release(new_page);
+			page_cache_release(compound_head(new_page));
 			new_page = page;
 			anon = 1;
 
@@ -2254,7 +2278,7 @@ retry:
 			if (vma->vm_ops->page_mkwrite &&
 			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
 			    ) {
-				page_cache_release(new_page);
+				page_cache_release(compound_head(new_page));
 				return VM_FAULT_SIGBUS;
 			}
 		}
@@ -2268,7 +2292,7 @@ retry:
 	 */
 	if (mapping && unlikely(sequence != mapping->truncate_count)) {
 		pte_unmap_unlock(page_table, ptl);
-		page_cache_release(new_page);
+		page_cache_release(compound_head(new_page));
 		cond_resched();
 		sequence = mapping->truncate_count;
 		smp_rmb();
@@ -2298,15 +2322,15 @@ retry:
 			page_add_new_anon_rmap(new_page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
-			page_add_file_rmap(new_page);
+			page_add_file_rmap(compound_head(new_page));
 			if (write_access) {
-				dirty_page = new_page;
+				dirty_page = compound_head(new_page);
 				get_page(dirty_page);
 			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
-		page_cache_release(new_page);
+		page_cache_release(compound_head(new_page));
 		goto unlock;
 	}
 
@@ -2321,7 +2345,7 @@ unlock:
 	}
 	return ret;
 oom:
-	page_cache_release(new_page);
+	page_cache_release(compound_head(new_page));
 	return VM_FAULT_OOM;
 }
 
@@ -2720,13 +2744,13 @@ int access_process_vm(struct task_struct
 		if (write) {
 			copy_to_user_page(vma, page, addr,
 					  maddr + offset, buf, bytes);
-			set_page_dirty_lock(page);
+			set_page_dirty_lock(compound_head(page));
 		} else {
 			copy_from_user_page(vma, page, addr,
 					    buf, maddr + offset, bytes);
 		}
 		kunmap(page);
-		page_cache_release(page);
+		page_cache_release(compound_head(page));
 		len -= bytes;
 		buf += bytes;
 		addr += bytes;



* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-20 11:05   ` Mel Gorman
@ 2007-04-20 18:50     ` Dave Kleikamp
  2007-04-20 19:10       ` Christoph Lameter
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Kleikamp @ 2007-04-20 18:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, linux-kernel, Peter Zijlstra, Nick Piggin,
	Andi Kleen, Paul Jackson, Dave Chinner

On Fri, 2007-04-20 at 12:05 +0100, Mel Gorman wrote:

> comments about missing page_cache_size() covered elsewhere. However, I
> note that Dave Kleikamp might be interested in this changing of
> page_cache_size() from the perspective of page cache tails. I've added
> him to the cc so he can take a quick look.

Yeah.  I'm working on patches for storing file tails in buffers
allocated from the slab cache, and the tail will be represented by a
fake struct page.  (This is primarily for kernels with a larger page
size).  So my version of page_cache_size(page) may return a different
size for different pages belonging to the same mapping.  I'm in the
midst of cleaning up the patches and plan to post them to linux-mm by
Monday.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-20 18:50     ` Dave Kleikamp
@ 2007-04-20 19:10       ` Christoph Lameter
  2007-04-20 19:27         ` Dave Kleikamp
  2007-04-24 23:00         ` Matt Mackall
  0 siblings, 2 replies; 58+ messages in thread
From: Christoph Lameter @ 2007-04-20 19:10 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, Dave Kleikamp wrote:

> On Fri, 2007-04-20 at 12:05 +0100, Mel Gorman wrote:
> 
> > comments about missing page_cache_size() covered elsewhere. However, I
> > note that Dave Kleikamp might be interested in this changing of
> > page_cache_size() from the perspective of page cache tails. I've added
> > him to the cc so he can take a quick look.
> 
> Yeah.  I'm working on patches for storing file tails in buffers
> allocated from the slab cache, and the tail will be represented by a
> fake struct page.  (This is primarily for kernels with a larger page
> size).  So my version of page_cache_size(page) may return a different
> size for different pages belonging to the same mapping.  I'm in the
> midst of cleaning up the patches and plan to post them to linux-mm by
> Monday.

I am not sure what the point of that patchset would be in this context 
given that this is about support for arbitrary page sizes. If the 
filesystem wants it then it can reduce the page size for small files.

Different page sizes for one mapping may introduce high complexity into a 
filesystem.

And we can already represent different page size. We have compound pages 
support in the kernel.

page_cache_size(page) in terms of current code in mm is

PAGE_SIZE << compound_order(page)


* Re: [RFC 7/8] Enhance ramfs to support higher order pages
  2007-04-20 17:57               ` Christoph Lameter
@ 2007-04-20 19:21                 ` William Lee Irwin III
  0 siblings, 0 replies; 58+ messages in thread
From: William Lee Irwin III @ 2007-04-20 19:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> Probably just terminological disagreement here. I was referring to
>> allocating the higher-order page from the fault path here, not mapping
>> it or a piece of it with a user pte.

On Fri, Apr 20, 2007 at 10:57:25AM -0700, Christoph Lameter wrote:
> Ah. Okay. I have some dysfunctional patches here that implement mmap 
> support. Would you be willing to take care of that aspect of things? Then 
> I can focus on the other VM pieces. I am going to post them following this 
> message. These are an absolute mess. They do not compile etc etc.

Good stuff. Going over this sounds like more fun than trying to 
preemptively clean up after whichever scheduler trainwreck ends up
hitting mainline.


-- wli


* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-20 19:10       ` Christoph Lameter
@ 2007-04-20 19:27         ` Dave Kleikamp
  2007-04-24 23:00         ` Matt Mackall
  1 sibling, 0 replies; 58+ messages in thread
From: Dave Kleikamp @ 2007-04-20 19:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Peter Zijlstra, Nick Piggin, Andi Kleen,
	Paul Jackson, Dave Chinner

On Fri, 2007-04-20 at 12:10 -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Dave Kleikamp wrote:
> > Yeah.  I'm working on patches for storing file tails in buffers
> > allocated from the slab cache, and the tail will be represented by a
> > fake struct page.  (This is primarily for kernels with a larger page
> > size).  So my version of page_cache_size(page) may return a different
> > size for different pages belonging to the same mapping.  I'm in the
> > midst of cleaning up the patches and plan to post them to linux-mm by
> > Monday.
> 
> I am not sure what the point of that patchset would be in this context 
> given that this is about support for arbitrary page sizes. If the 
> filesystem wants it then it can reduce the page size for small files.
> 
> Different page sizes for one mapping may introduce high complexity into a 
> filesystem.

I'm trying to keep it from getting too complicated.

> And we can already represent different page size. We have compound pages 
> support in the kernel.

There are advantages to having a larger base page size, such as
increased TLB reach.  I'm specifically targetting kernels built with
CONFIG_PPC_64K_PAGES, but other architectures could benefit.

> page_cache_size(page) in terms of current code in mm is
> 
> PAGE_SIZE << compound_order(page)

I don't see page_cache_size() in linux-2.6.21-rc6-mm1.  Is there
something newer?

Anyway, my patches are at a proof-of-concept stage.  I look forward to
more discussion when I post the patch set.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [RFC 0/8] Variable Order Page Cache
  2007-04-20  7:48     ` David Chinner
@ 2007-04-21 22:18       ` Andrew Morton
  0 siblings, 0 replies; 58+ messages in thread
From: Andrew Morton @ 2007-04-21 22:18 UTC (permalink / raw)
  To: David Chinner
  Cc: Jens Axboe, Christoph Lameter, linux-kernel, Peter Zijlstra,
	Nick Piggin, Paul Jackson, Andi Kleen

On Fri, 20 Apr 2007 17:48:18 +1000 David Chinner <dgc@sgi.com> wrote:

> Agreed - I was talking about a quick way to hack a real filesystem
> in to the VM to start exercising the new VM code without needing to
> implement compound page support down the whole I/O stack. 

Yes.  The whole point of this work is to speed stuff up, so I'd encourage
people to first work on getting some minimal scruffy prototype in place -
whatever is needed to be able to start running performance tests.

Then we can take a look at the numbers (and the types of machines and
workloads upon which they are based) and decide whether it looks like
there's any point in proceeding with a full-on implementation.

And as part of that decision-making process we should take a detailed look
at the performance of the existing code and see if there are other ways in
which it might be acceptably sped up.

Because right now we're assuming that larger pages are the only way in
which acceptable performance may be obtained.  But that has not been proved.


* Re: [RFC 4/8] Enhance fallback functions in libs to support higher order pages
  2007-04-20 19:10       ` Christoph Lameter
  2007-04-20 19:27         ` Dave Kleikamp
@ 2007-04-24 23:00         ` Matt Mackall
  1 sibling, 0 replies; 58+ messages in thread
From: Matt Mackall @ 2007-04-24 23:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Kleikamp, Mel Gorman, linux-kernel, Peter Zijlstra,
	Nick Piggin, Andi Kleen, Paul Jackson, Dave Chinner

On Fri, Apr 20, 2007 at 12:10:43PM -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Dave Kleikamp wrote:
> 
> > On Fri, 2007-04-20 at 12:05 +0100, Mel Gorman wrote:
> > 
> > > comments about missing page_cache_size() covered elsewhere. However, I
> > > note that Dave Kleikamp might be interested in this changing of
> > > page_cache_size() from the perspective of page cache tails. I've added
> > > him to the cc so he can take a quick look.
> > 
> > Yeah.  I'm working on patches for storing file tails in buffers
> > allocated from the slab cache, and the tail will be represented by a
> > fake struct page.  (This is primarily for kernels with a larger page
> > size).  So my version of page_cache_size(page) may return a different
> > size for different pages belonging to the same mapping.  I'm in the
> > midst of cleaning up the patches and plan to post them to linux-mm by
> > Monday.
> 
> I am not sure what the point of that patchset would be in this context 
> given that this is about support for arbitrary page sizes. If the 
> filesystem wants it then it can reduce the page size for small files.

Shaggy's suggestion is to emulate pages with _negative_ orders.
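
In that scheme the order/size relation would extend symmetrically below PAGE_SIZE; a hypothetical sketch (not from any posted patch), where order -N describes a sub-page fragment of PAGE_SIZE >> N:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extension of the page-order size relation to negative
 * orders: a positive order scales the base page size up, a negative
 * order scales it down to a sub-page fragment. */
static size_t page_size_for_order(size_t base_page_size, int order)
{
	return order >= 0 ? base_page_size << order
			  : base_page_size >> -order;
}
```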

-- 
Mathematics is the supreme nostalgia of our time.


end of thread, other threads:[~2007-04-24 23:13 UTC | newest]

Thread overview: 58+ messages
2007-04-19 16:35 [RFC 0/8] Variable Order Page Cache Christoph Lameter
2007-04-19 16:35 ` [RFC 1/8] Add order field to address_space struct Christoph Lameter
2007-04-19 16:35 ` [RFC 2/8] Basic allocation for higher order page cache pages Christoph Lameter
2007-04-20 10:55   ` Mel Gorman
2007-04-19 16:35 ` [RFC 3/8] Flushing and zeroing " Christoph Lameter
2007-04-20 11:02   ` Mel Gorman
2007-04-20 16:15     ` Christoph Lameter
2007-04-20 16:51       ` William Lee Irwin III
2007-04-19 16:35 ` [RFC 4/8] Enhance fallback functions in libs to support higher order pages Christoph Lameter
2007-04-19 18:48   ` Adam Litke
2007-04-19 19:10     ` Christoph Lameter
2007-04-19 22:50       ` David Chinner
2007-04-20  1:15         ` Christoph Lameter
2007-04-20  8:21       ` Jens Axboe
2007-04-20 16:01         ` Christoph Lameter
2007-04-20 16:51           ` Jens Axboe
2007-04-20 11:05   ` Mel Gorman
2007-04-20 18:50     ` Dave Kleikamp
2007-04-20 19:10       ` Christoph Lameter
2007-04-20 19:27         ` Dave Kleikamp
2007-04-24 23:00         ` Matt Mackall
2007-04-19 16:35 ` [RFC 5/8] Enhance generic_read/write " Christoph Lameter
2007-04-19 16:35 ` [RFC 6/8] Account for pages in the page cache in terms of base pages Christoph Lameter
2007-04-19 17:45   ` Nish Aravamudan
2007-04-19 17:52     ` Christoph Lameter
2007-04-19 17:54       ` Avi Kivity
2007-04-19 16:35 ` [RFC 7/8] Enhance ramfs to support higher order pages Christoph Lameter
2007-04-20 13:42   ` Mel Gorman
2007-04-20 14:47     ` William Lee Irwin III
2007-04-20 16:30       ` Christoph Lameter
2007-04-20 17:11         ` William Lee Irwin III
2007-04-20 17:15           ` Christoph Lameter
2007-04-20 17:19             ` William Lee Irwin III
2007-04-20 17:57               ` Christoph Lameter
2007-04-20 19:21                 ` William Lee Irwin III
2007-04-20 17:59               ` Christoph Lameter
2007-04-20 18:01               ` Christoph Lameter
2007-04-20 18:02               ` Christoph Lameter
2007-04-20 16:20     ` Christoph Lameter
2007-04-19 16:35 ` [RFC 8/8] Add some debug output Christoph Lameter
2007-04-19 19:09 ` [RFC 0/8] Variable Order Page Cache Badari Pulavarty
2007-04-19 19:12   ` Christoph Lameter
2007-04-19 19:11 ` Andi Kleen
2007-04-19 19:15   ` Christoph Lameter
2007-04-20 14:37   ` Mel Gorman
2007-04-19 22:42 ` David Chinner
2007-04-20  1:14   ` Christoph Lameter
2007-04-20  6:32   ` Jens Axboe
2007-04-20  7:48     ` David Chinner
2007-04-21 22:18       ` Andrew Morton
2007-04-19 23:58 ` Maxim Levitsky
2007-04-20  1:15   ` Christoph Lameter
2007-04-20  4:47 ` William Lee Irwin III
2007-04-20  5:27   ` Christoph Lameter
2007-04-20  6:22     ` William Lee Irwin III
2007-04-20  8:42   ` David Chinner
2007-04-20 14:14 ` Mel Gorman
2007-04-20 16:23   ` Christoph Lameter
