[patch][rfc] mm: new address space calls

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch][rfc] mm: new address space calls
@ 2009-02-25 10:48 Nick Piggin
  2009-02-25 20:59 ` Chris Mason
  2009-02-28 23:24 ` Christoph Hellwig
  0 siblings, 2 replies; 11+ messages in thread
From: Nick Piggin @ 2009-02-25 10:48 UTC (permalink / raw)
  To: Linux Memory Management List, linux-fsdevel

This is about the last change to generic code I need for fsblock.
Comments?

Introduce new address space operations sync and release, which can be used
by a filesystem to synchronize and release per-address_space private metadata.
They generalise sync_mapping_buffers, invalidate_inode_buffers, and
remove_inode_buffers calls, and get another step closer to divorcing
buffer heads from core mm/fs code.

---
 fs/block_dev.c              |    7 +++++++
 fs/buffer.c                 |   16 ++++++----------
 fs/fs-writeback.c           |   12 +++++++++---
 fs/inode.c                  |   37 +++++++++++++++++++++++++++++--------
 fs/super.c                  |    3 ++-
 include/linux/buffer_head.h |    2 --
 include/linux/fs.h          |   22 ++++++++++++++++++++++
 mm/filemap.c                |    1 +
 8 files changed, 76 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -454,7 +454,7 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * written back and waited upon before fsync() returns.
  *
  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
- * inode_has_buffers() and invalidate_inode_buffers() are provided for the
+ * mapping_has_private() and invalidate_inode_buffers() are provided for the
  * management of a list of dependent buffers at ->i_mapping->private_list.
  *
  * Locking is a little subtle: try_to_free_buffers() will remove buffers
@@ -507,11 +507,6 @@ static void __remove_assoc_queue(struct
 	bh->b_assoc_map = NULL;
 }
 
-int inode_has_buffers(struct inode *inode)
-{
-	return !list_empty(&inode->i_data.private_list);
-}
-
 /*
  * osync is designed to support O_SYNC io.  It waits synchronously for
  * all already-submitted IO to complete, but does not queue any new
@@ -787,8 +782,9 @@ static int fsync_buffers_list(spinlock_t
  */
 void invalidate_inode_buffers(struct inode *inode)
 {
-	if (inode_has_buffers(inode)) {
-		struct address_space *mapping = &inode->i_data;
+	struct address_space *mapping = &inode->i_data;
+
+	if (mapping_has_private(mapping)) {
 		struct list_head *list = &mapping->private_list;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
@@ -808,10 +804,10 @@ EXPORT_SYMBOL(invalidate_inode_buffers);
  */
 int remove_inode_buffers(struct inode *inode)
 {
+	struct address_space *mapping = &inode->i_data;
 	int ret = 1;
 
-	if (inode_has_buffers(inode)) {
-		struct address_space *mapping = &inode->i_data;
+	if (mapping_has_private(mapping)) {
 		struct list_head *list = &mapping->private_list;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
 
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -158,7 +158,6 @@ void end_buffer_write_sync(struct buffer
 
 /* Things to do with buffers at mapping->private_list */
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
-int inode_has_buffers(struct inode *);
 void invalidate_inode_buffers(struct inode *);
 int remove_inode_buffers(struct inode *inode);
 int sync_mapping_buffers(struct address_space *mapping);
@@ -333,7 +332,6 @@ extern int __set_page_dirty_buffers(stru
 static inline void buffer_init(void) {}
 static inline int try_to_free_buffers(struct page *page) { return 1; }
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
-static inline int inode_has_buffers(struct inode *inode) { return 0; }
 static inline void invalidate_inode_buffers(struct inode *inode) {}
 static inline int remove_inode_buffers(struct inode *inode) { return 1; }
 static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2476,6 +2476,7 @@ int try_to_release_page(struct page *pag
 	if (PageWriteback(page))
 		return 0;
 
+	BUG_ON(!PagePrivate(page));
 	if (mapping && mapping->a_ops->releasepage)
 		return mapping->a_ops->releasepage(page, gfp_mask);
 	return try_to_free_buffers(page);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -782,9 +782,15 @@ int generic_osync_inode(struct inode *in
 	if (what & OSYNC_DATA)
 		err = filemap_fdatawrite(mapping);
 	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
-		err2 = sync_mapping_buffers(mapping);
-		if (!err)
-			err = err2;
+		if (!mapping->a_ops->sync) {
+			err2 = sync_mapping_buffers(mapping);
+			if (!err)
+				err = err2;
+		} else {
+			err2 = mapping->a_ops->sync(mapping);
+			if (!err)
+				err = err2;
+		}
 	}
 	if (what & OSYNC_DATA) {
 		err2 = filemap_fdatawait(mapping);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -208,7 +208,8 @@ static struct inode *alloc_inode(struct
 
 void destroy_inode(struct inode *inode) 
 {
-	BUG_ON(inode_has_buffers(inode));
+	BUG_ON(mapping_has_private(&inode->i_data));
+	BUG_ON(inode->i_data.nrpages);
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
@@ -277,10 +278,14 @@ void __iget(struct inode * inode)
  */
 void clear_inode(struct inode *inode)
 {
+	struct address_space *mapping = &inode->i_data;
+
 	might_sleep();
-	invalidate_inode_buffers(inode);
+	if (!mapping->a_ops->release)
+		invalidate_inode_buffers(inode);
        
-	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(mapping_has_private(mapping));
+	BUG_ON(mapping->nrpages);
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
@@ -343,6 +348,7 @@ static int invalidate_list(struct list_h
 	for (;;) {
 		struct list_head * tmp = next;
 		struct inode * inode;
+		struct address_space * mapping;
 
 		/*
 		 * We can reschedule here without worrying about the list's
@@ -356,7 +362,12 @@ static int invalidate_list(struct list_h
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		invalidate_inode_buffers(inode);
+		mapping = &inode->i_data;
+		if (!mapping->a_ops->release)
+			invalidate_inode_buffers(inode);
+		else
+			mapping->a_ops->release(mapping, 1); /* XXX: should be done in fs? */
+		BUG_ON(mapping_has_private(mapping));
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
 			inode->i_state |= I_FREEING;
@@ -399,13 +410,15 @@ EXPORT_SYMBOL(invalidate_inodes);
 
 static int can_unuse(struct inode *inode)
 {
+	struct address_space *mapping = &inode->i_data;
+
 	if (inode->i_state)
 		return 0;
-	if (inode_has_buffers(inode))
+	if (mapping_has_private(mapping))
 		return 0;
 	if (atomic_read(&inode->i_count))
 		return 0;
-	if (inode->i_data.nrpages)
+	if (mapping->nrpages)
 		return 0;
 	return 1;
 }
@@ -434,6 +447,7 @@ static void prune_icache(int nr_to_scan)
 	spin_lock(&inode_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
+		struct address_space *mapping;
 
 		if (list_empty(&inode_unused))
 			break;
@@ -444,10 +458,17 @@ static void prune_icache(int nr_to_scan)
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		mapping = &inode->i_data;
+		if (mapping_has_private(mapping) || mapping->nrpages) {
+			int ret;
+
 			__iget(inode);
 			spin_unlock(&inode_lock);
-			if (remove_inode_buffers(inode))
+			if (mapping->a_ops->release)
+				ret = mapping->a_ops->release(mapping, 0);
+			else
+				ret = remove_inode_buffers(inode);
+			if (ret)
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -531,6 +531,20 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
+
+	/*
+	 * release_mapping releases any private data on the mapping so that
+	 * it may be reclaimed. Returns 1 on success or 0 on failure. Second
+	 * parameter 'force' causes dirty data to be invalidated. (XXX: could
+	 * have other flags like sync/async, etc).
+	 */
+	int (*release)(struct address_space *, int);
+
+	/*
+	 * sync writes back and waits for any private data on the mapping,
+	 * as a data consistency operation.
+	 */
+	int (*sync)(struct address_space *);
 };
 
 /*
@@ -616,6 +630,14 @@ struct block_device {
 int mapping_tagged(struct address_space *mapping, int tag);
 
 /*
+ * Does this mapping have anything on its private list?
+ */
+static inline int mapping_has_private(struct address_space *mapping)
+{
+	return !list_empty(&mapping->private_list);
+}
+
+/*
  * Might pages of this file be mapped into userspace?
  */
 static inline int mapping_mapped(struct address_space *mapping)
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -352,6 +352,11 @@ static int blkdev_write_end(struct file
 	return ret;
 }
 
+static void blkdev_invalidate_page(struct page *page, unsigned long offset)
+{
+	block_invalidatepage(page, offset);
+}
+
 /*
  * private llseek:
  * for a block special file file->f_path.dentry->d_inode->i_size is zero
@@ -1405,6 +1410,8 @@ static const struct address_space_operat
 	.writepages	= generic_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
+	.set_page_dirty	= __set_page_dirty_buffers,
+	.invalidatepage = blkdev_invalidate_page,
 };
 
 const struct file_operations def_blk_fops = {
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -28,7 +28,7 @@
 #include <linux/blkdev.h>
 #include <linux/quotaops.h>
 #include <linux/namei.h>
-#include <linux/buffer_head.h>		/* for fsync_super() */
+#include <linux/fs.h>			/* for fsync_super() */
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -38,6 +38,7 @@
 #include <linux/kobject.h>
 #include <linux/mutex.h>
 #include <linux/file.h>
+#include <linux/buffer_head.h>		/* sync_blockdev */
 #include <linux/async.h>
 #include <asm/uaccess.h>
 #include "internal.h"

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-25 10:48 [patch][rfc] mm: new address space calls Nick Piggin
@ 2009-02-25 20:59 ` Chris Mason
  2009-02-26  5:17   ` Nick Piggin
  2009-02-28 23:19   ` Christoph Hellwig
  2009-02-28 23:24 ` Christoph Hellwig
  1 sibling, 2 replies; 11+ messages in thread
From: Chris Mason @ 2009-02-25 20:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management List, linux-fsdevel

On Wed, 2009-02-25 at 11:48 +0100, Nick Piggin wrote:
> This is about the last change to generic code I need for fsblock.
> Comments?
> 

Thanks for doing this.

We've got releasepage, invalidatepage, and now release, all with
slightly different semantics and some complex interactions with the rest
of the VM.

One problem I have with the btrfs extent state code is that I might
choose to release the extent state in releasepage, but the VM might not
choose to free the page.  So I've got an up to date page without any of
the rest of my state.

Which of these ops covers that? ;)  I'd love to help better document the
requirements for these callbacks, I find it confusing every time.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-25 20:59 ` Chris Mason
@ 2009-02-26  5:17   ` Nick Piggin
  2009-02-26 13:21     ` Chris Mason
  2009-02-28 23:19   ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2009-02-26  5:17 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linux Memory Management List, linux-fsdevel

On Wed, Feb 25, 2009 at 03:59:57PM -0500, Chris Mason wrote:
> On Wed, 2009-02-25 at 11:48 +0100, Nick Piggin wrote:
> > This is about the last change to generic code I need for fsblock.
> > Comments?
> > 
> 
> Thanks for doing this.
> 
> We've got releasepage, invalidatepage, and now release, all with
> slightly different semantics and some complex interactions with the rest
> of the VM.

Release is I guess basically equivalent of releasepage/invalidatepage
but on a per-inode rather than per-page basis. I imagine they would
come in handy for other filesystems (in fsblock I use them for the
"associated" metadata like buffer-heads have, and for the block extent
map cache).

And I guess we haven't really grown complexity really, because
previously the core code has to do hardwired callbacks for buffer.c
anyway, and after this patch filesystems that don't care can continue
not to define callbacks :)

> One problem I have with the btrfs extent state code is that I might
> choose to release the extent state in releasepage, but the VM might not
> choose to free the page.  So I've got an up to date page without any of
> the rest of my state.

I'm not sure. What semantics do you want there? In most cases (including
fsblock default case where the filesystem does not have a pin), we're
happy to leave clean, uptodate pages in pagecache in that case.

> Which of these ops covers that? ;)  I'd love to help better document the
> requirements for these callbacks, I find it confusing every time.

I find myself having to re-lookup how they work too (which can be painful
following calls back and forth between mm/ and fs/ :P). I'd like to
improve documentation too..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-26  5:17   ` Nick Piggin
@ 2009-02-26 13:21     ` Chris Mason
  2009-02-27 11:26       ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2009-02-26 13:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management List, linux-fsdevel

On Thu, 2009-02-26 at 06:17 +0100, Nick Piggin wrote:
> On Wed, Feb 25, 2009 at 03:59:57PM -0500, Chris Mason wrote:
> > On Wed, 2009-02-25 at 11:48 +0100, Nick Piggin wrote:
> > > This is about the last change to generic code I need for fsblock.
> > > Comments?
> > > 
> > 
> > Thanks for doing this.
> > 
> > We've got releasepage, invalidatepage, and now release, all with
> > slightly different semantics and some complex interactions with the rest
> > of the VM.
> 
> Release is I guess basically equivalent of releasepage/invalidatepage
> but on a per-inode rather than per-page basis. I imagine they would
> come in handy for other filesystems (in fsblock I use them for the
> "associated" metadata like buffer-heads have, and for the block extent
> map cache).
> 

Nod

> And I guess we haven't really grown complexity really, because
> previously the core code has to do hardwired callbacks for buffer.c
> anyway, and after this patch filesystems that don't care can continue
> not to define callbacks :)
> 
> 
> > One problem I have with the btrfs extent state code is that I might
> > choose to release the extent state in releasepage, but the VM might not
> > choose to free the page.  So I've got an up to date page without any of
> > the rest of my state.
> 
> I'm not sure. What semantics do you want there? In most cases (including
> fsblock default case where the filesystem does not have a pin), we're
> happy to leave clean, uptodate pages in pagecache in that case.

Right, but it really limits the state that we can keep outside the page
bits.  Take a subpage block, where we know the first 1k is up to date.
releasepage comes and we free our tracking that says the first 1k is up
to date, but the VM doesn't free the page.

Now we have a page where the uptodate bit isn't set, but the first 1k
has valid data.  We have to reread it.

I'd like a form of releasepage that knows if the vm is going to really
get rid of the page.  Or another callback that happens when the VM is
sure the page will be freed so we can drop extra metadata that doesn't
pin the page, but we always want to stay with the page.

-chris



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-26 13:21     ` Chris Mason
@ 2009-02-27 11:26       ` Nick Piggin
  2009-02-27 13:52         ` Chris Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2009-02-27 11:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linux Memory Management List, linux-fsdevel

On Thu, Feb 26, 2009 at 08:21:45AM -0500, Chris Mason wrote:
> > > One problem I have with the btrfs extent state code is that I might
> > > choose to release the extent state in releasepage, but the VM might not
> > > choose to free the page.  So I've got an up to date page without any of
> > > the rest of my state.
> > 
> > I'm not sure. What semantics do you want there? In most cases (including
> > fsblock default case where the filesystem does not have a pin), we're
> > happy to leave clean, uptodate pages in pagecache in that case.
> 
> Right, but it really limits the state that we can keep outside the page
> bits.  Take a subpage block, where we know the first 1k is up to date.
> releasepage comes and we free our tracking that says the first 1k is up
> to date, but the VM doesn't free the page.
> 
> Now we have a page where the uptodate bit isn't set, but the first 1k
> has valid data.  We have to reread it.

Well I don't see how that limits us? Either we prefer to keep the
metadata, or we throw it away and it is inevitable that we lose
information. 

Regardless of whether you store the data in a tree of extends in the
inode, or per-page buffers, you have the same problem (buffer heads
have that same problem too).

> I'd like a form of releasepage that knows if the vm is going to really
> get rid of the page.  Or another callback that happens when the VM is
> sure the page will be freed so we can drop extra metadata that doesn't
> pin the page, but we always want to stay with the page.

Well, for page reclaim/invalidate/truncate, we have releasepage that you
can use even if the metadata is stored outside the page, just set PagePrivate
and it will still get called when the page is about to be freed.

There are *some* races that can result in the page subsequently not being
freed, but I don't think that should be a big deal. I don't want to add
a callback in the pagecache remove path if possible, but we can try to
rework or improve things if btrfs needs something specific..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-27 11:26       ` Nick Piggin
@ 2009-02-27 13:52         ` Chris Mason
  2009-02-28  5:52           ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2009-02-27 13:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management List, linux-fsdevel

On Fri, 2009-02-27 at 12:26 +0100, Nick Piggin wrote:
> On Thu, Feb 26, 2009 at 08:21:45AM -0500, Chris Mason wrote:
> > > > One problem I have with the btrfs extent state code is that I might
> > > > choose to release the extent state in releasepage, but the VM might not
> > > > choose to free the page.  So I've got an up to date page without any of
> > > > the rest of my state.
> > > 
> > > I'm not sure. What semantics do you want there? In most cases (including
> > > fsblock default case where the filesystem does not have a pin), we're
> > > happy to leave clean, uptodate pages in pagecache in that case.
> > 
> > Right, but it really limits the state that we can keep outside the page
> > bits.  Take a subpage block, where we know the first 1k is up to date.
> > releasepage comes and we free our tracking that says the first 1k is up
> > to date, but the VM doesn't free the page.
> > 
> > Now we have a page where the uptodate bit isn't set, but the first 1k
> > has valid data.  We have to reread it.
> 
> Well I don't see how that limits us? Either we prefer to keep the
> metadata, or we throw it away and it is inevitable that we lose
> information. 
> 

We can't have metadata that isn't freed by releasepage unless we want to
pin the page completely.  There was a time when the btrfs metadata had a
bit for 'this block needs defrag', and I ended up not being able to use
it because releasepage was consistently freeing my extra data while the
page was still around.

> Regardless of whether you store the data in a tree of extends in the
> inode, or per-page buffers, you have the same problem (buffer heads
> have that same problem too).
> 

Right.

> 
> > I'd like a form of releasepage that knows if the vm is going to really
> > get rid of the page.  Or another callback that happens when the VM is
> > sure the page will be freed so we can drop extra metadata that doesn't
> > pin the page, but we always want to stay with the page.
> 
> Well, for page reclaim/invalidate/truncate, we have releasepage that you
> can use even if the metadata is stored outside the page, just set PagePrivate
> and it will still get called when the page is about to be freed.
> 

For clean pages, shrink_page_list seems to check the page count after
the releasepage call.  It was a big enough window for me to see it in
practice under normal workloads.

> There are *some* races that can result in the page subsequently not being
> freed, but I don't think that should be a big deal. I don't want to add
> a callback in the pagecache remove path if possible, but we can try to
> rework or improve things if btrfs needs something specific..

Btrfs doesn't need it today, but it should help once I finally get
subpage blocks going again (and metadata defrag as well).

-chris


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-27 13:52         ` Chris Mason
@ 2009-02-28  5:52           ` Nick Piggin
  0 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2009-02-28  5:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linux Memory Management List, linux-fsdevel

On Fri, Feb 27, 2009 at 08:52:47AM -0500, Chris Mason wrote:
> On Fri, 2009-02-27 at 12:26 +0100, Nick Piggin wrote:
> > Well I don't see how that limits us? Either we prefer to keep the
> > metadata, or we throw it away and it is inevitable that we lose
> > information. 
> > 
> 
> We can't have metadata that isn't freed by releasepage unless we want to
> pin the page completely.  There was a time when the btrfs metadata had a
> bit for 'this block needs defrag', and I ended up not being able to use
> it because releasepage was consistently freeing my extra data while the
> page was still around.

Hmm, it sounds like that data perhaps is more a property of the
filesystem / block management rather than the pagecache (OK, it's
a blurry line)...

But I mean 'this block neds defrag' sounds like important metadata
even if the page is *not* still around? (but the block is)

Having your own private metadata, perhaps with the ->shrinker callback
is an option. In fsblock actually for the block mapping cache tree,
I don't use a shrinker, because (I'm lazy and) reclaim will eventaully
reclaim the inode in which case the tree will be taken down with the
new aop->release callback.

But in theory even when the in-memory inode goes away, the block mapping
is still valid metadata, so you could keep it around somewhere (in which
case it would need a shrinker callback).

> > > I'd like a form of releasepage that knows if the vm is going to really
> > > get rid of the page.  Or another callback that happens when the VM is
> > > sure the page will be freed so we can drop extra metadata that doesn't
> > > pin the page, but we always want to stay with the page.
> > 
> > Well, for page reclaim/invalidate/truncate, we have releasepage that you
> > can use even if the metadata is stored outside the page, just set PagePrivate
> > and it will still get called when the page is about to be freed.
> > 
> 
> For clean pages, shrink_page_list seems to check the page count after
> the releasepage call.  It was a big enough window for me to see it in
> practice under normal workloads.

Oh yes, you would see it, but it just shouldn't be *too* common I think.
It's a hard race to close. You would ned to effectively take a spinlock
to prevent pagecache lookup over the releasepage call (OK, with lockless
pagecache it is no longer really tree_lock, but setting page->_count to
0, which causes lookup to basically do equivalent spinning anyway).

Of course it still may be closed with a new callback at pagecache
removal time... but I'm not convinced you need one yet ;) Maybe I don't
understand the requirements properly yet.

> > There are *some* races that can result in the page subsequently not being
> > freed, but I don't think that should be a big deal. I don't want to add
> > a callback in the pagecache remove path if possible, but we can try to
> > rework or improve things if btrfs needs something specific..
> 
> Btrfs doesn't need it today, but it should help once I finally get
> subpage blocks going again (and metadata defrag as well).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-25 20:59 ` Chris Mason
  2009-02-26  5:17   ` Nick Piggin
@ 2009-02-28 23:19   ` Christoph Hellwig
  2009-03-01  2:38     ` Nick Piggin
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2009-02-28 23:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: Nick Piggin, Linux Memory Management List, linux-fsdevel

On Wed, Feb 25, 2009 at 03:59:57PM -0500, Chris Mason wrote:
> One problem I have with the btrfs extent state code is that I might
> choose to release the extent state in releasepage, but the VM might not
> choose to free the page.  So I've got an up to date page without any of
> the rest of my state.
> 
> Which of these ops covers that? ;)  I'd love to help better document the
> requirements for these callbacks, I find it confusing every time.

releasepage has also another problem.  It only gets called after
discard_buffer discarded lots of valuable information from the buffers,
which gets XFS into really bad trouble as that drops information if
there is a delalloc extent.

I'd really like to see some major overhaul in that area, and that also
extende to documentation (or just naming, why is block_invalidatepage
calling into a method called ->releasepage, but there also is a
->invalidatepage that gets called from truncate*page routines..)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-25 10:48 [patch][rfc] mm: new address space calls Nick Piggin
  2009-02-25 20:59 ` Chris Mason
@ 2009-02-28 23:24 ` Christoph Hellwig
  2009-03-01  2:45   ` Nick Piggin
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2009-02-28 23:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management List, linux-fsdevel

On Wed, Feb 25, 2009 at 11:48:39AM +0100, Nick Piggin wrote:
> This is about the last change to generic code I need for fsblock.
> Comments?
> 
> Introduce new address space operations sync and release, which can be used
> by a filesystem to synchronize and release per-address_space private metadata.
> They generalise sync_mapping_buffers, invalidate_inode_buffers, and
> remove_inode_buffers calls, and get another step closer to divorcing
> buffer heads from core mm/fs code.

>  void invalidate_inode_buffers(struct inode *inode)
>  {
> -	if (inode_has_buffers(inode)) {
> -		struct address_space *mapping = &inode->i_data;
> +	struct address_space *mapping = &inode->i_data;
> +
> +	if (mapping_has_private(mapping)) {
>  		struct list_head *list = &mapping->private_list;
>  		struct address_space *buffer_mapping = mapping->assoc_mapping;

I'ts not really helping much here as we still directly poke into the
buffer_head list.

> --- linux-2.6.orig/fs/fs-writeback.c
> +++ linux-2.6/fs/fs-writeback.c
> @@ -782,9 +782,15 @@ int generic_osync_inode(struct inode *in
>  	if (what & OSYNC_DATA)
>  		err = filemap_fdatawrite(mapping);
>  	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
> -		err2 = sync_mapping_buffers(mapping);
> -		if (!err)
> -			err = err2;
> +		if (!mapping->a_ops->sync) {
> +			err2 = sync_mapping_buffers(mapping);
> +			if (!err)
> +				err = err2;
> +		} else {
> +			err2 = mapping->a_ops->sync(mapping);
> +			if (!err)
> +				err = err2;
> +		}
>  	}
>  	if (what & OSYNC_DATA) {
>  		err2 = filemap_fdatawait(mapping);

I'd really prefer not having the default fallbacks, these kinds
of implicit fallbacks make the code really hard to maintain over
long term.

I also wonder if moving the filemap_fdatawrite/filemap_fdatawait
into the method would help.  In fact it's surprisingly similar
to ->fsync in many ways, that I wonder if these should be one
operation.

>   */
>  void clear_inode(struct inode *inode)
>  {
> +	struct address_space *mapping = &inode->i_data;
> +
>  	might_sleep();
> -	invalidate_inode_buffers(inode);
> +	if (!mapping->a_ops->release)
> +		invalidate_inode_buffers(inode);

That's a weird one.  The implict default shouldn't be in a different
place from the method.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-28 23:19   ` Christoph Hellwig
@ 2009-03-01  2:38     ` Nick Piggin
  0 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2009-03-01  2:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Linux Memory Management List, linux-fsdevel

On Sat, Feb 28, 2009 at 06:19:56PM -0500, Christoph Hellwig wrote:
> On Wed, Feb 25, 2009 at 03:59:57PM -0500, Chris Mason wrote:
> > One problem I have with the btrfs extent state code is that I might
> > choose to release the extent state in releasepage, but the VM might not
> > choose to free the page.  So I've got an up to date page without any of
> > the rest of my state.
> > 
> > Which of these ops covers that? ;)  I'd love to help better document the
> > requirements for these callbacks, I find it confusing every time.
> 
> releasepage has also another problem.  It only gets called after
> discard_buffer discarded lots of valuable information from the buffers,
> which gets XFS into really bad trouble as that drops information if
> there is a delalloc extent.

Then I think it just needs to provide its own invalidatepage?


> I'd really like to see some major overhaul in that area, and that also
> extende to documentation (or just naming, why is block_invalidatepage
> calling into a method called ->releasepage, but there also is a
> ->invalidatepage that gets called from truncate*page routines..)

Those convoluted call paths are really bloody annoying.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch][rfc] mm: new address space calls
  2009-02-28 23:24 ` Christoph Hellwig
@ 2009-03-01  2:45   ` Nick Piggin
  0 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2009-03-01  2:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Linux Memory Management List, linux-fsdevel

On Sat, Feb 28, 2009 at 06:24:21PM -0500, Christoph Hellwig wrote:
> On Wed, Feb 25, 2009 at 11:48:39AM +0100, Nick Piggin wrote:
> > This is about the last change to generic code I need for fsblock.
> > Comments?
> > 
> > Introduce new address space operations sync and release, which can be used
> > by a filesystem to synchronize and release per-address_space private metadata.
> > They generalise sync_mapping_buffers, invalidate_inode_buffers, and
> > remove_inode_buffers calls, and get another step closer to divorcing
> > buffer heads from core mm/fs code.
> 
> >  void invalidate_inode_buffers(struct inode *inode)
> >  {
> > -	if (inode_has_buffers(inode)) {
> > -		struct address_space *mapping = &inode->i_data;
> > +	struct address_space *mapping = &inode->i_data;
> > +
> > +	if (mapping_has_private(mapping)) {
> >  		struct list_head *list = &mapping->private_list;
> >  		struct address_space *buffer_mapping = mapping->assoc_mapping;
> 
> I'ts not really helping much here as we still directly poke into the
> buffer_head list.

This is in fs/buffer.c.

Or do you object to the definition of mapping_has_private? Yes that
still checks the private_list, but it would be trivial to convert it
over to checking a bit in the mapping now. I just didn't do it because
fsblock also uses the private_list.

> 
> > --- linux-2.6.orig/fs/fs-writeback.c
> > +++ linux-2.6/fs/fs-writeback.c
> > @@ -782,9 +782,15 @@ int generic_osync_inode(struct inode *in
> >  	if (what & OSYNC_DATA)
> >  		err = filemap_fdatawrite(mapping);
> >  	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
> > -		err2 = sync_mapping_buffers(mapping);
> > -		if (!err)
> > -			err = err2;
> > +		if (!mapping->a_ops->sync) {
> > +			err2 = sync_mapping_buffers(mapping);
> > +			if (!err)
> > +				err = err2;
> > +		} else {
> > +			err2 = mapping->a_ops->sync(mapping);
> > +			if (!err)
> > +				err = err2;
> > +		}
> >  	}
> >  	if (what & OSYNC_DATA) {
> >  		err2 = filemap_fdatawait(mapping);
> 
> I'd really prefer not having the default fallbacks, these kinds
> of implicit fallbacks make the code really hard to maintain over
> long term.

That seems to be the default way of adding callbacks, but
I agree I don't like it really and I don't like the existing
fallbacks in the tree.

 
> I also wonder if moving the filemap_fdatawrite/filemap_fdatawait
> into the method would help.  In fact it's surprisingly similar
> to ->fsync in many ways, that I wonder if these should be one
> operation.

It would be nice if possible. Do you have an fsync patchset
coming along?


> >   */
> >  void clear_inode(struct inode *inode)
> >  {
> > +	struct address_space *mapping = &inode->i_data;
> > +
> >  	might_sleep();
> > -	invalidate_inode_buffers(inode);
> > +	if (!mapping->a_ops->release)
> > +		invalidate_inode_buffers(inode);
> 
> That's a weird one.  The implict default shouldn't be in a different
> place from the method.

It really sucks in there. buffer.c even has a "FIXME: invalidate_inode
buffers should not be called in clear_inode"... Of course buffer.c is
never going to be fixed, but I didn't want to carry that over to
fsblock.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-03-01  2:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-25 10:48 [patch][rfc] mm: new address space calls Nick Piggin
2009-02-25 20:59 ` Chris Mason
2009-02-26  5:17   ` Nick Piggin
2009-02-26 13:21     ` Chris Mason
2009-02-27 11:26       ` Nick Piggin
2009-02-27 13:52         ` Chris Mason
2009-02-28  5:52           ` Nick Piggin
2009-02-28 23:19   ` Christoph Hellwig
2009-03-01  2:38     ` Nick Piggin
2009-02-28 23:24 ` Christoph Hellwig
2009-03-01  2:45   ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).