linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 000 of 4] Introduction: VFS documentation and tidy up
@ 2006-03-12 23:53 NeilBrown
  2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: NeilBrown @ 2006-03-12 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

The following 4 patches improve the documentation for vfs address_space
operations, and clean up some code oddities I found while writing it.

All have been sent the these lists before for review and only one
comment was received: a positive note about the documentation.

Please target them for 2.6.17.

Thanks,
NeilBrown


 [PATCH 001 of 4] Update some VFS documentation.
 [PATCH 002 of 4] Honour AOP_TRUNCATE_PAGE returns in page_symlink
 [PATCH 003 of 4] Make address_space_operations->sync_page return void.
 [PATCH 004 of 4] Make address_space_operations->invalidatepage return void

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 001 of 4] Update some VFS documentation.
  2006-03-12 23:53 [PATCH 000 of 4] Introduction: VFS documentation and tidy up NeilBrown
@ 2006-03-12 23:53 ` NeilBrown
  2006-03-13  0:22   ` Avishay Traeger
  2006-03-13  4:58   ` [PATCH 001 of 4] Update some VFS documentation Randy.Dunlap
  2006-03-12 23:53 ` [PATCH 002 of 4] Honour AOP_TRUNCATE_PAGE returns in page_symlink NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2006-03-12 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel


Flesh out the description of the address_space operations.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./Documentation/filesystems/vfs.txt |  216 ++++++++++++++++++++++++++++++++----
 1 file changed, 194 insertions(+), 22 deletions(-)

diff ./Documentation/filesystems/vfs.txt~current~ ./Documentation/filesystems/vfs.txt
--- ./Documentation/filesystems/vfs.txt~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./Documentation/filesystems/vfs.txt	2006-03-13 10:42:31.000000000 +1100
@@ -230,10 +230,14 @@ only called from a process context (i.e.
 or bottom half).
 
   alloc_inode: this method is called by inode_alloc() to allocate memory
- 	for struct inode and initialize it.
+ 	for struct inode and initialize it.  If this function is not
+ 	defined, a simple 'struct inode' is allocated.  Normally
+ 	alloc_inode will be used to allocate a larger structure which
+ 	contains a 'struct inode' embedded within it.
 
   destroy_inode: this method is called by destroy_inode() to release
-  	resources allocated for struct inode.
+  	resources allocated for struct inode.  It is only required of
+  	->alloc_inode was defined and simply does a deallocate.
 
   read_inode: this method is called to read a specific inode from the
         mounted filesystem.  The i_ino member in the struct inode is
@@ -443,14 +447,81 @@ otherwise noted.
 The Address Space Object
 ========================
 
-The address space object is used to identify pages in the page cache.
+The address space object is used to group and manage pages in the page
+cache.  It can be used to keep track of the pages in a file (or
+anything else) and also track the mapping of sections of the file into
+process address spaces.
+
+There are a number of distinct yet related services that an
+address-space can provide.  These include communicating memory
+pressure, page lookup by address, and keeping track of pages tagged as
+Dirty or Writeback.
+
+The first can be used independantly to the others.  The vm can try to
+either write dirty pages in order to clean them, or release clean
+pages in order to reuse them.  To do this it can call the ->writepage
+method on dirty pages, and ->releasepage on clean pages with
+PagePrivate set. Clean pages without PagePrivate and with no external
+references will be released without notice being given to the
+address_space.
+
+To achieve this functionality, pages need to be placed on an lru with
+lru_cache_add and mark_page_active needs to be called whenever the
+page is used.
+
+Pages are normally kept in a radix tree index by ->index. This tree
+maintains information about the PG_Dirty and PG_Writeback status of
+each page, so that pages with either of these flags can be found
+quickly.
+
+The Dirty tag is primarily used by mpage_writepages - the default
+->writepages method.  It uses the tag to find dirty pages to call
+->writepage on.  If mpage_writepages is not used (i.e. the address
+provides it's own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
+almost unused.  write_inode_now and sync_inode do use it (through
+__sync_single_inode) to check if ->writepages has been successful in
+writing out the whole address_space.
+
+The Writeback tag is used by filemap*wait* and sync_page* functions,
+though wait_on_page_writeback_range, to wait for all writeback to
+complete.  While waiting ->sync_page (if defined) will be called on
+each page that is found to require writeback
+
+An address_space handler may attach extra information to a page,
+typically using the 'private' field in the 'struct page'.  If such
+information is attached, the PG_Private flag should be set.  This will
+cause various mm routines to make extra calls into the address_space
+handler to deal with that data.
+
+An address space acts as an intermediate between storage and
+application.  Data is read into the address space a whole page at a
+time, and provided to the application either by copying of the page,
+or by memory-mapping the page.
+Data is written into the address space by the application, and then
+written-back to storage typically in whole pages, however the
+address_space has finner control of write sizes.
+
+The read process essentially only requires 'readpage'.  The write
+process is more complicated and uses prepare_write/commit_write or
+set_page_dirty to write data into the address_space, and writepage,
+sync_page, and writepages to writeback data to storage.
+
+Adding and removing pages to/from an address_space is protected by the
+inode's i_mutex.
+
+When data is written to a page, the PG_Dirty flag should be set.  It
+typically remains set until writepage asks for it to be written.  This
+should clear PG_Dirty and set PG_Writeback.  It can be actually
+written at any point after PG_Dirty is clear.  Once it is known to be
+safe, PG_Writeback is cleared.
 
+Writeback makes use of a writeback_control structure...
 
 struct address_space_operations
 -------------------------------
 
 This describes how the VFS can manipulate mapping of a file to page cache in
-your filesystem. As of kernel 2.6.13, the following members are defined:
+your filesystem. As of kernel 2.6.16, the following members are defined:
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -469,47 +540,148 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
+	/* migrate the contents of a page to the specified target */
+	int (*migratepage) (struct page *, struct page *);
 };
 
-  writepage: called by the VM write a dirty page to backing store.
+  writepage: called by the VM to write a dirty page to backing store.
+      This may happen for data integrity reason (i.e. 'sync'), or
+      to free up memory (flush).  The difference can be seen in
+      wbc->sync_mode.
+      The PG_Dirty flag has been cleared and PageLocked is true.
+      writepage should start writeout, should set PG_Writeback,
+      and should make sure the page is Unlocked, either synchronously
+      or asynchronously when the write operation completes.
+
+      If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
+      try too hard if there are problems, and may choose to write out a
+      different page from the mapping if that would be more
+      appropriate.  If it chooses not to start writeout, it should
+      return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
+      calling ->writepage on that page.
+
+      See the file "Locking" for more details.
 
   readpage: called by the VM to read a page from backing store.
+       The page will be Locked when readpage is called, and should be
+       unlocked and marked uptodate once the read completes.
+       If ->readpage discovers that it needs to unlock the page for
+       some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
+       In this case, the page will be re-located, re-locked and if
+       that all succeeds, ->readpage will be called again.
 
   sync_page: called by the VM to notify the backing store to perform all
   	queued I/O operations for a page. I/O operations for other pages
-	associated with this address_space object may also be performed.
+	associated with this address_space object may also be
+  	performed.
+	This function is optional and is called only for pages with
+  	PG_Writeback set while waiting for the writeback to complete.
 
   writepages: called by the VM to write out pages associated with the
-  	address_space object.
+  	address_space object.  If WBC_SYNC_ALL, then the
+  	writeback_control will specify a range of pages that must be
+  	written out.  If WBC_SYNC_NONE, then a nr_to_write is given
+	and that many pages should be written if possible.
+	If no ->writepages is given, then mpage_writepages is used
+  	instead.  This will choose pages from the addresspace that are
+  	tagged as DIRTY and will pass them to ->writepage.
 
   set_page_dirty: called by the VM to set a page dirty.
+        This is particularly needed if an address space attaches
+        private data to a page, and that data needs to be updated when
+        a page is dirtied.  This is called, for example, when a memory
+	mapped page gets modified.
+	If defined, it should set the PageDirty flag, and the
+        PAGECACHE_TAG_DIRTY tag in the radix tree.
 
   readpages: called by the VM to read pages associated with the address_space
-  	object.
+  	object. This is essentailly just a vector version of
+  	readpage.  Instead of just one page, several pages are
+  	requested.
+	readpages is only used for readahead, so read errors are
+  	ignored.  If anything goes wrong, feel free to give up.
 
   prepare_write: called by the generic write path in VM to set up a write
-  	request for a page.
-
-  commit_write: called by the generic write path in VM to write page to
-  	its backing store.
+  	request for a page.  This indicates to the address space that
+  	the given range of bytes are about to be written.  The
+  	address_space should check that the write will be able to
+  	complete, by allocating space if necessary and doing any other
+  	internal house keeping.  If the write will update parts some
+  	some basic-blocks on storage, then those blocks should be
+  	pre-read (if they haven't been read already) so that the
+  	update will not leave half-blocks that need to be written out.
+	The page will be locked.  If prepare_write wants to unlock the
+  	page it, like readpage, may do so and return
+  	AOP_TRUNCATED_PAGE.
+	In this case the prepare_write will be retried one the lock is
+  	regained.
+
+  commit_write: If prepare_write succeeds, new data will be copied
+        into the page and then commit_write will be called.  It will
+        typically update the size of the file (if appropriate) and
+        mark the inode as dirty, and do any other related housekeeping
+        operations.  It should avoid returning an error if possible -
+        errors should have been handled by prepare_write.
 
   bmap: called by the VFS to map a logical block offset within object to
-  	physical block number. This method is use by for the legacy FIBMAP
-	ioctl. Other uses are discouraged.
-
-  invalidatepage: called by the VM on truncate to disassociate a page from its
-  	address_space mapping.
-
-  releasepage: called by the VFS to release filesystem specific metadata from
-  	a page.
-
-  direct_IO: called by the VM for direct I/O writes and reads.
+  	physical block number. This method is used by for the FIBMAP
+  	ioctl and for working with swap-files.  To be able to swap to
+  	a file, the file must have as stable mapping to a block
+  	device.  The swap system does not go through the filesystem
+  	but instead uses bmap to find out where the blocks in the file
+  	are and uses those addresses directly.
+
+
+  invalidatepage: If a page has PagePrivate set, then invalidatepage
+        will be called when part or all of the page is to be removed
+	from the address space.  This generally corresponds either a
+	truncation or a complete invalidation of the address space
+	(in the latter case 'offset' will always be 0).
+	Any private data associated with the page should be updated
+	to reflect this truncation.  If offset is 0, then
+	the private data should be released, because the page
+	must be able to be completely discarded.  This may be done by
+        calling the ->releasepage function, but in this case the
+        release MUST succeed.
+
+  releasepage: releasepage is called on PagePrivate pages to indicate
+        that the page should be freed if possible.  ->releasepage
+        should remove any private data from the page and clear the
+        PagePrivate flag.  It may also remove the page from the
+        address_space.  If this fails for some reason, it may indicate
+        failure with a 0 return value.
+	This is used in two distinct though related cases.  The first
+        is when the VM finds a clean page with no active users and
+        wants to make it a free page.  If ->releasepage succeeds, the
+        page will be removed from the address_space and become free.
+
+	The second case if when a request has been made to invalidate
+        some or all pages in an address_space.  This can happen
+        through the fadvice(POSIX_FADV_DONTNEED) system call or by the
+        filesystem explicitly requesting it as nfs and 9fs do (when
+        they believe the cache may be out of date with storage) by
+        calling invalidate_inode_pages2().
+	If the filesystem makes such a call, and needs to be certain
+        that all pages are invalidated, then it's releasepage will
+        need to ensure this.  Possibly it can clear the PageUptodate
+        bit if it cannot free private data yet.
+
+  direct_IO: called by the generic read/write routines to perform
+        direct_IO - that is IO requests which bypass the page cache
+        and tranfer data directly between the storage and the
+        application's address space.
 
   get_xip_page: called by the VM to translate a block number to a page.
 	The page is valid until the corresponding filesystem is unmounted.
 	Filesystems that want to use execute-in-place (XIP) need to implement
 	it.  An example implementation can be found in fs/ext2/xip.c.
 
+  migrate_page:  This is used to compact the physical memory usage.
+        If the VM wants to relocate a page (maybe off a memory card
+        that is signalling imminent failure) it will pass a new page
+	and an old page to this function.  migrate_page should
+	transfer any private data across and update any references
+        that it has to the page.
 
 The File Object
 ===============

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 002 of 4] Honour AOP_TRUNCATE_PAGE returns in page_symlink
  2006-03-12 23:53 [PATCH 000 of 4] Introduction: VFS documentation and tidy up NeilBrown
  2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
@ 2006-03-12 23:53 ` NeilBrown
  2006-03-12 23:53 ` [PATCH 003 of 4] Make address_space_operations->sync_page return void NeilBrown
  2006-03-12 23:53 ` [PATCH 004 of 4] Make address_space_operations->invalidatepage " NeilBrown
  3 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2006-03-12 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel


As prepare_write, commit_write and readpage are allowed to return
AOP_TRUNCATE_PAGE, page_symlink should respond to them.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./fs/namei.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff ./fs/namei.c~current~ ./fs/namei.c
--- ./fs/namei.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/namei.c	2006-03-13 10:43:40.000000000 +1100
@@ -2628,19 +2628,31 @@ void page_put_link(struct dentry *dentry
 int page_symlink(struct inode *inode, const char *symname, int len)
 {
 	struct address_space *mapping = inode->i_mapping;
-	struct page *page = grab_cache_page(mapping, 0);
+	struct page *page;
 	int err = -ENOMEM;
 	char *kaddr;
 
+ retry:
+	page = grab_cache_page(mapping, 0);
 	if (!page)
 		goto fail;
 	err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
+	if (err == AOP_TRUNCATED_PAGE) {
+		page_cache_release(page);
+		goto retry;
+	}
 	if (err)
 		goto fail_map;
 	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(kaddr, symname, len-1);
 	kunmap_atomic(kaddr, KM_USER0);
-	mapping->a_ops->commit_write(NULL, page, 0, len-1);
+	err = mapping->a_ops->commit_write(NULL, page, 0, len-1);
+	if (err == AOP_TRUNCATED_PAGE) {
+		page_cache_release(page);
+		goto retry;
+	}
+	if (err)
+		goto fail_map;
 	/*
 	 * Notice that we are _not_ going to block here - end of page is
 	 * unmapped, so this will only try to map the rest of page, see
@@ -2650,7 +2662,8 @@ int page_symlink(struct inode *inode, co
 	 */
 	if (!PageUptodate(page)) {
 		err = mapping->a_ops->readpage(NULL, page);
-		wait_on_page_locked(page);
+		if (err != AOP_TRUNCATED_PAGE)
+			wait_on_page_locked(page);
 	} else {
 		unlock_page(page);
 	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 003 of 4] Make address_space_operations->sync_page return void.
  2006-03-12 23:53 [PATCH 000 of 4] Introduction: VFS documentation and tidy up NeilBrown
  2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
  2006-03-12 23:53 ` [PATCH 002 of 4] Honour AOP_TRUNCATE_PAGE returns in page_symlink NeilBrown
@ 2006-03-12 23:53 ` NeilBrown
  2006-03-12 23:53 ` [PATCH 004 of 4] Make address_space_operations->invalidatepage " NeilBrown
  3 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2006-03-12 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel


The only user ignores the return value, and the only 
instanace (block_sync_page) always returns 0...


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./fs/buffer.c                 |    3 +--
 ./fs/cifs/file.c              |    6 ++++--
 ./include/linux/buffer_head.h |    2 +-
 ./include/linux/fs.h          |    2 +-
 4 files changed, 7 insertions(+), 6 deletions(-)

diff ./fs/buffer.c~current~ ./fs/buffer.c
--- ./fs/buffer.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/buffer.c	2006-03-13 10:46:35.000000000 +1100
@@ -3024,7 +3024,7 @@ out:
 }
 EXPORT_SYMBOL(try_to_free_buffers);
 
-int block_sync_page(struct page *page)
+void block_sync_page(struct page *page)
 {
 	struct address_space *mapping;
 
@@ -3032,7 +3032,6 @@ int block_sync_page(struct page *page)
 	mapping = page_mapping(page);
 	if (mapping)
 		blk_run_backing_dev(mapping->backing_dev_info, page);
-	return 0;
 }
 
 /*

diff ./fs/cifs/file.c~current~ ./fs/cifs/file.c
--- ./fs/cifs/file.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/cifs/file.c	2006-03-13 10:46:35.000000000 +1100
@@ -1377,7 +1377,7 @@ int cifs_fsync(struct file *file, struct
 	return rc;
 }
 
-/* static int cifs_sync_page(struct page *page)
+/* static void cifs_sync_page(struct page *page)
 {
 	struct address_space *mapping;
 	struct inode *inode;
@@ -1391,16 +1391,18 @@ int cifs_fsync(struct file *file, struct
 		return 0;
 	inode = mapping->host;
 	if (!inode)
-		return 0; */
+		return; */
 
 /*	fill in rpages then 
 	result = cifs_pagein_inode(inode, index, rpages); */ /* BB finish */
 
 /*	cFYI(1, ("rpages is %d for sync page of Index %ld ", rpages, index));
 
+#if 0
 	if (rc < 0)
 		return rc;
 	return 0;
+#endif
 } */
 
 /*

diff ./include/linux/buffer_head.h~current~ ./include/linux/buffer_head.h
--- ./include/linux/buffer_head.h~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./include/linux/buffer_head.h	2006-03-13 10:46:35.000000000 +1100
@@ -203,7 +203,7 @@ int cont_prepare_write(struct page*, uns
 int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
-int block_sync_page(struct page *);
+void block_sync_page(struct page *);
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);

diff ./include/linux/fs.h~current~ ./include/linux/fs.h
--- ./include/linux/fs.h~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./include/linux/fs.h	2006-03-13 10:46:35.000000000 +1100
@@ -345,7 +345,7 @@ struct writeback_control;
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
-	int (*sync_page)(struct page *);
+	void (*sync_page)(struct page *);
 
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-12 23:53 [PATCH 000 of 4] Introduction: VFS documentation and tidy up NeilBrown
                   ` (2 preceding siblings ...)
  2006-03-12 23:53 ` [PATCH 003 of 4] Make address_space_operations->sync_page return void NeilBrown
@ 2006-03-12 23:53 ` NeilBrown
  2006-03-13 16:32   ` Dave Kleikamp
  3 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2006-03-12 23:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel


The return value of this function is never used, so let's
be honest and declare it as void.

Some places where invalidatepage returned 0, I have inserted
comments suggesting a BUG_ON.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./fs/afs/file.c               |    6 +++---
 ./fs/buffer.c                 |   22 +++++++++++-----------
 ./fs/ext3/inode.c             |    4 ++--
 ./fs/jbd/transaction.c        |   12 +++++-------
 ./fs/jfs/jfs_metapage.c       |    7 +++----
 ./fs/reiser4/as_ops.c         |   13 ++++++-------
 ./fs/reiser4/vfs_ops.h        |    2 +-
 ./fs/reiserfs/inode.c         |    8 +++++---
 ./fs/xfs/linux-2.6/xfs_aops.c |    4 ++--
 ./include/linux/buffer_head.h |    4 ++--
 ./include/linux/fs.h          |    2 +-
 ./include/linux/jbd.h         |    2 +-
 12 files changed, 42 insertions(+), 44 deletions(-)

diff ./fs/afs/file.c~current~ ./fs/afs/file.c
--- ./fs/afs/file.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/afs/file.c	2006-03-13 10:46:55.000000000 +1100
@@ -28,7 +28,7 @@ static int afs_file_release(struct inode
 #endif
 
 static int afs_file_readpage(struct file *file, struct page *page);
-static int afs_file_invalidatepage(struct page *page, unsigned long offset);
+static void afs_file_invalidatepage(struct page *page, unsigned long offset);
 static int afs_file_releasepage(struct page *page, gfp_t gfp_flags);
 
 struct inode_operations afs_file_inode_operations = {
@@ -212,7 +212,7 @@ int afs_cache_get_page_cookie(struct pag
 /*
  * invalidate part or all of a page
  */
-static int afs_file_invalidatepage(struct page *page, unsigned long offset)
+static void afs_file_invalidatepage(struct page *page, unsigned long offset)
 {
 	int ret = 1;
 
@@ -238,11 +238,11 @@ static int afs_file_invalidatepage(struc
 			if (!PageWriteback(page))
 				ret = page->mapping->a_ops->releasepage(page,
 									0);
+			/* possibly should BUG_ON(!ret); - neilb */
 		}
 	}
 
 	_leave(" = %d", ret);
-	return ret;
 } /* end afs_file_invalidatepage() */
 
 /*****************************************************************************/

diff ./fs/buffer.c~current~ ./fs/buffer.c
--- ./fs/buffer.c~current~	2006-03-13 10:46:35.000000000 +1100
+++ ./fs/buffer.c	2006-03-13 10:46:55.000000000 +1100
@@ -1603,11 +1603,10 @@ EXPORT_SYMBOL(try_to_release_page);
  * point.  Because the caller is about to free (and possibly reuse) those
  * blocks on-disk.
  */
-int block_invalidatepage(struct page *page, unsigned long offset)
+void block_invalidatepage(struct page *page, unsigned long offset)
 {
 	struct buffer_head *head, *bh, *next;
 	unsigned int curr_off = 0;
-	int ret = 1;
 
 	BUG_ON(!PageLocked(page));
 	if (!page_has_buffers(page))
@@ -1633,20 +1632,21 @@ int block_invalidatepage(struct page *pa
 	 * The get_block cached value has been unconditionally invalidated,
 	 * so real IO is not possible anymore.
 	 */
-	if (offset == 0)
-		ret = try_to_release_page(page, 0);
+	if (offset == 0) {
+		int ret = try_to_release_page(page, 0);
+		BUG_ON(!ret);
+	}
 out:
-	return ret;
+	return;
 }
 EXPORT_SYMBOL(block_invalidatepage);
 
-int do_invalidatepage(struct page *page, unsigned long offset)
+void do_invalidatepage(struct page *page, unsigned long offset)
 {
-	int (*invalidatepage)(struct page *, unsigned long);
-	invalidatepage = page->mapping->a_ops->invalidatepage;
-	if (invalidatepage == NULL)
-		invalidatepage = block_invalidatepage;
-	return (*invalidatepage)(page, offset);
+	void (*invalidatepage)(struct page *, unsigned long);
+	invalidatepage = page->mapping->a_ops->invalidatepage ? :
+		block_invalidatepage;
+	(*invalidatepage)(page, offset);
 }
 
 /*

diff ./fs/ext3/inode.c~current~ ./fs/ext3/inode.c
--- ./fs/ext3/inode.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/ext3/inode.c	2006-03-13 10:46:55.000000000 +1100
@@ -1583,7 +1583,7 @@ ext3_readpages(struct file *file, struct
 	return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
-static int ext3_invalidatepage(struct page *page, unsigned long offset)
+static void ext3_invalidatepage(struct page *page, unsigned long offset)
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
 
@@ -1593,7 +1593,7 @@ static int ext3_invalidatepage(struct pa
 	if (offset == 0)
 		ClearPageChecked(page);
 
-	return journal_invalidatepage(journal, page, offset);
+	journal_invalidatepage(journal, page, offset);
 }
 
 static int ext3_releasepage(struct page *page, gfp_t wait)

diff ./fs/jbd/transaction.c~current~ ./fs/jbd/transaction.c
--- ./fs/jbd/transaction.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/jbd/transaction.c	2006-03-13 10:46:55.000000000 +1100
@@ -1873,16 +1873,15 @@ zap_buffer_unlocked:
 }
 
 /** 
- * int journal_invalidatepage() 
+ * void journal_invalidatepage()
  * @journal: journal to use for flush... 
  * @page:    page to flush
  * @offset:  length of page to invalidate.
  *
  * Reap page buffers containing data after offset in page.
  *
- * Return non-zero if the page's buffers were successfully reaped.
  */
-int journal_invalidatepage(journal_t *journal, 
+void journal_invalidatepage(journal_t *journal,
 		      struct page *page, 
 		      unsigned long offset)
 {
@@ -1893,7 +1892,7 @@ int journal_invalidatepage(journal_t *jo
 	if (!PageLocked(page))
 		BUG();
 	if (!page_has_buffers(page))
-		return 1;
+		return;
 
 	/* We will potentially be playing with lists other than just the
 	 * data lists (especially for journaled data mode), so be
@@ -1916,11 +1915,10 @@ int journal_invalidatepage(journal_t *jo
 	} while (bh != head);
 
 	if (!offset) {
-		if (!may_free || !try_to_free_buffers(page))
-			return 0;
+		/* Maybe should BUG_ON !may_free - neilb */
+		try_to_free_buffers(page);
 		J_ASSERT(!page_has_buffers(page));
 	}
-	return 1;
 }
 
 /* 

diff ./fs/jfs/jfs_metapage.c~current~ ./fs/jfs/jfs_metapage.c
--- ./fs/jfs/jfs_metapage.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/jfs/jfs_metapage.c	2006-03-13 10:46:55.000000000 +1100
@@ -578,14 +578,13 @@ static int metapage_releasepage(struct p
 	return 0;
 }
 
-static int metapage_invalidatepage(struct page *page, unsigned long offset)
+static void metapage_invalidatepage(struct page *page, unsigned long offset)
 {
 	BUG_ON(offset);
 
-	if (PageWriteback(page))
-		return 0;
+	BUG_ON(PageWriteback(page));
 
-	return metapage_releasepage(page, 0);
+	metapage_releasepage(page, 0);
 }
 
 struct address_space_operations jfs_metapage_aops = {

diff ./fs/reiser4/as_ops.c~current~ ./fs/reiser4/as_ops.c
--- ./fs/reiser4/as_ops.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/reiser4/as_ops.c	2006-03-13 10:46:55.000000000 +1100
@@ -170,7 +170,7 @@ reiser4_readpages(struct file *file, str
  * @offset: starting offset for partial invalidation
  *
  */
-int reiser4_invalidatepage(struct page *page, unsigned long offset)
+void reiser4_invalidatepage(struct page *page, unsigned long offset)
 {
 	int ret = 0;
 	reiser4_context *ctx;
@@ -208,11 +208,11 @@ int reiser4_invalidatepage(struct page *
 	 * them. Check for this, and do nothing.
 	 */
 	if (get_super_fake(inode->i_sb) == inode)
-		return 0;
+		return;
 	if (get_cc_fake(inode->i_sb) == inode)
-		return 0;
+		return;
 	if (get_bitmap_fake(inode->i_sb) == inode)
-		return 0;
+		return;
 	assert("vs-1426", PagePrivate(page));
 	assert("vs-1427",
 	       page->mapping == jnode_get_mapping(jnode_by_page(page)));
@@ -222,7 +222,7 @@ int reiser4_invalidatepage(struct page *
 
 	ctx = init_context(inode->i_sb);
 	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
+		return;
 
 	node = jprivate(page);
 	spin_lock_jnode(node);
@@ -236,7 +236,7 @@ int reiser4_invalidatepage(struct page *
 		unhash_unformatted_jnode(node);
 		jput(node);
 		reiser4_exit_context(ctx);
-		return 0;
+		return;
 	}
 	spin_unlock_jnode(node);
 
@@ -265,7 +265,6 @@ int reiser4_invalidatepage(struct page *
 	}
 
 	reiser4_exit_context(ctx);
-	return ret;
 }
 
 /* help function called from reiser4_releasepage(). It returns true if jnode

diff ./fs/reiser4/vfs_ops.h~current~ ./fs/reiser4/vfs_ops.h
--- ./fs/reiser4/vfs_ops.h~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/reiser4/vfs_ops.h	2006-03-13 10:46:55.000000000 +1100
@@ -24,7 +24,7 @@ int reiser4_writepage(struct page *, str
 int reiser4_set_page_dirty(struct page *);
 int reiser4_readpages(struct file *, struct address_space *,
 		      struct list_head *pages, unsigned nr_pages);
-int reiser4_invalidatepage(struct page *, unsigned long offset);
+void reiser4_invalidatepage(struct page *, unsigned long offset);
 int reiser4_releasepage(struct page *, gfp_t);
 
 extern int reiser4_update_sd(struct inode *);

diff ./fs/reiserfs/inode.c~current~ ./fs/reiserfs/inode.c
--- ./fs/reiserfs/inode.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/reiserfs/inode.c	2006-03-13 10:46:55.000000000 +1100
@@ -2792,7 +2792,7 @@ static int invalidatepage_can_drop(struc
 }
 
 /* clm -- taken from fs/buffer.c:block_invalidate_page */
-static int reiserfs_invalidatepage(struct page *page, unsigned long offset)
+static void reiserfs_invalidatepage(struct page *page, unsigned long offset)
 {
 	struct buffer_head *head, *bh, *next;
 	struct inode *inode = page->mapping->host;
@@ -2831,10 +2831,12 @@ static int reiserfs_invalidatepage(struc
 	 * The get_block cached value has been unconditionally invalidated,
 	 * so real IO is not possible anymore.
 	 */
-	if (!offset && ret)
+	if (!offset && ret) {
 		ret = try_to_release_page(page, 0);
+		/* maybe should BUG_ON(!ret); - neilb */
+	}
       out:
-	return ret;
+	return;
 }
 
 static int reiserfs_set_page_dirty(struct page *page)

diff ./fs/xfs/linux-2.6/xfs_aops.c~current~ ./fs/xfs/linux-2.6/xfs_aops.c
--- ./fs/xfs/linux-2.6/xfs_aops.c~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./fs/xfs/linux-2.6/xfs_aops.c	2006-03-13 10:46:55.000000000 +1100
@@ -1370,14 +1370,14 @@ out_unlock:
 	return error;
 }
 
-STATIC int
+STATIC void
 linvfs_invalidate_page(
 	struct page		*page,
 	unsigned long		offset)
 {
 	xfs_page_trace(XFS_INVALIDPAGE_ENTER,
 			page->mapping->host, page, offset);
-	return block_invalidatepage(page, offset);
+	block_invalidatepage(page, offset);
 }
 
 /*

diff ./include/linux/buffer_head.h~current~ ./include/linux/buffer_head.h
--- ./include/linux/buffer_head.h~current~	2006-03-13 10:46:35.000000000 +1100
+++ ./include/linux/buffer_head.h	2006-03-13 10:46:55.000000000 +1100
@@ -192,8 +192,8 @@ extern int buffer_heads_over_limit;
  * address_spaces.
  */
 int try_to_release_page(struct page * page, gfp_t gfp_mask);
-int block_invalidatepage(struct page *page, unsigned long offset);
-int do_invalidatepage(struct page *page, unsigned long offset);
+void block_invalidatepage(struct page *page, unsigned long offset);
+void do_invalidatepage(struct page *page, unsigned long offset);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);

diff ./include/linux/fs.h~current~ ./include/linux/fs.h
--- ./include/linux/fs.h~current~	2006-03-13 10:46:35.000000000 +1100
+++ ./include/linux/fs.h	2006-03-13 10:46:55.000000000 +1100
@@ -364,7 +364,7 @@ struct address_space_operations {
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
-	int (*invalidatepage) (struct page *, unsigned long);
+	void (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, gfp_t);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);

diff ./include/linux/jbd.h~current~ ./include/linux/jbd.h
--- ./include/linux/jbd.h~current~	2006-03-09 17:29:35.000000000 +1100
+++ ./include/linux/jbd.h	2006-03-13 10:46:55.000000000 +1100
@@ -895,7 +895,7 @@ extern int	 journal_dirty_metadata (hand
 extern void	 journal_release_buffer (handle_t *, struct buffer_head *);
 extern int	 journal_forget (handle_t *, struct buffer_head *);
 extern void	 journal_sync_buffer (struct buffer_head *);
-extern int	 journal_invalidatepage(journal_t *,
+extern void	 journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 journal_stop(handle_t *);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 001 of 4] Update some VFS documentation.
  2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
@ 2006-03-13  0:22   ` Avishay Traeger
  2006-03-13  4:14     ` [PATCH 001 of 4] Update some VFS documentation fix Neil Brown
  2006-03-13  4:58   ` [PATCH 001 of 4] Update some VFS documentation Randy.Dunlap
  1 sibling, 1 reply; 14+ messages in thread
From: Avishay Traeger @ 2006-03-13  0:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Did a quick scan - some minor corrections in text.

On Mon, 2006-03-13 at 10:53 +1100, NeilBrown wrote:
<snip>

>    destroy_inode: this method is called by destroy_inode() to release
> -  	resources allocated for struct inode.
> +  	resources allocated for struct inode.  It is only required of

"of" -> "if"

> +  	->alloc_inode was defined and simply does a deallocate. 

"simply does a deallocate" -> "undoes anything done by ->alloc_inode"

<snip>

> +The first can be used independantly to the others.  The vm can try to

"independantly" -> "independently"

<snip>

> -  writepage: called by the VM write a dirty page to backing store.
> +  writepage: called by the VM to write a dirty page to backing store.
> +      This may happen for data integrity reason (i.e. 'sync'), or
> +      to free up memory (flush).  The difference can be seen in
> +      wbc->sync_mode.
> +      The PG_Dirty flag has been cleared and PageLocked is true.
> +      writepage should start writeout, should set PG_Writeback,
> +      and should make sure the page is Unlocked, either synchronously

"Unlocked" -> "unlocked"

<snip>

>  
>    sync_page: called by the VM to notify the backing store to perform all
>    	queued I/O operations for a page. I/O operations for other pages
> -	associated with this address_space object may also be performed.
> +	associated with this address_space object may also be
> +  	performed.

Why did this line get split?

<snip>

>    set_page_dirty: called by the VM to set a page dirty.
> +        This is particularly needed if an address space attaches
> +        private data to a page, and that data needs to be updated when
> +        a page is dirtied.  This is called, for example, when a memory
> +	mapped page gets modified.
> +	If defined, it should set the PageDirty flag, and the
> +        PAGECACHE_TAG_DIRTY tag in the radix tree.

Indentation is off by one.

>    readpages: called by the VM to read pages associated with the address_space
> -  	object.
> +  	object. This is essentailly just a vector version of

"essentailly" -> "essentially"

> +  	readpage.  Instead of just one page, several pages are
> +  	requested.
> +	readpages is only used for readahead, so read errors are
> +  	ignored.  If anything goes wrong, feel free to give up.
>  
>    prepare_write: called by the generic write path in VM to set up a write
> -  	request for a page.
> -
> -  commit_write: called by the generic write path in VM to write page to
> -  	its backing store.
> +  	request for a page.  This indicates to the address space that
> +  	the given range of bytes are about to be written.  The
> +  	address_space should check that the write will be able to
> +  	complete, by allocating space if necessary and doing any other
> +  	internal house keeping.  If the write will update parts some
> +  	some basic-blocks on storage, then those blocks should be
> +  	pre-read (if they haven't been read already) so that the
> +  	update will not leave half-blocks that need to be written out.

"some" appears twice in a row in this last sentence.  Anyway, it is
confusing.  I would re-word it to something like:
"If the write will update partial storage blocks, those blocks should be
read at this point (if they haven't been already) so that the updated
blocks can be properly written out."

<snip>


Avishay Traeger
http://www.fsl.cs.sunysb.edu/~avishay/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 001 of 4] Update some VFS documentation fix
  2006-03-13  0:22   ` Avishay Traeger
@ 2006-03-13  4:14     ` Neil Brown
  0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2006-03-13  4:14 UTC (permalink / raw)
  To: Avishay Traeger; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Sunday March 12, atraeger@cs.sunysb.edu wrote:
> Did a quick scan - some minor corrections in text.

Thanks. The following patch incorporated most of them.
I didn't fix the indenting oddity that you noticed.  The file is
wildly inconsistent in its uses of tabs or spaces for indenting.  That
should be fixed up in a separate patch.

Thanks.


-----------------
Corrections to recent updates for vfs.txt

Thanks Avishay Traeger <atraeger@cs.sunysb.edu>

Cc: Avishay Traeger <atraeger@cs.sunysb.edu>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./Documentation/filesystems/vfs.txt |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff ./Documentation/filesystems/vfs.txt~current~ ./Documentation/filesystems/vfs.txt
--- ./Documentation/filesystems/vfs.txt~current~	2006-03-13 10:58:13.000000000 +1100
+++ ./Documentation/filesystems/vfs.txt	2006-03-13 15:09:24.000000000 +1100
@@ -236,8 +236,9 @@ or bottom half).
  	contains a 'struct inode' embedded within it.
 
   destroy_inode: this method is called by destroy_inode() to release
-  	resources allocated for struct inode.  It is only required of
-  	->alloc_inode was defined and simply does a deallocate.
+  	resources allocated for struct inode.  It is only required if
+  	->alloc_inode was defined and simply undoes anything done by
+	->alloc_inode.
 
   read_inode: this method is called to read a specific inode from the
         mounted filesystem.  The i_ino member in the struct inode is
@@ -550,7 +551,7 @@ struct address_space_operations {
       wbc->sync_mode.
       The PG_Dirty flag has been cleared and PageLocked is true.
       writepage should start writeout, should set PG_Writeback,
-      and should make sure the page is Unlocked, either synchronously
+      and should make sure the page is unlocked, either synchronously
       or asynchronously when the write operation completes.
 
       If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
@@ -572,8 +573,8 @@ struct address_space_operations {
 
   sync_page: called by the VM to notify the backing store to perform all
   	queued I/O operations for a page. I/O operations for other pages
-	associated with this address_space object may also be
-  	performed.
+	associated with this address_space object may also be performed.
+
 	This function is optional and is called only for pages with
   	PG_Writeback set while waiting for the writeback to complete.
 
@@ -595,7 +596,7 @@ struct address_space_operations {
         PAGECACHE_TAG_DIRTY tag in the radix tree.
 
   readpages: called by the VM to read pages associated with the address_space
-  	object. This is essentailly just a vector version of
+  	object. This is essentially just a vector version of
   	readpage.  Instead of just one page, several pages are
   	requested.
 	readpages is only used for readahead, so read errors are
@@ -606,10 +607,10 @@ struct address_space_operations {
   	the given range of bytes are about to be written.  The
   	address_space should check that the write will be able to
   	complete, by allocating space if necessary and doing any other
-  	internal house keeping.  If the write will update parts some
-  	some basic-blocks on storage, then those blocks should be
+  	internal house keeping.  If the write will update parts of
+  	any basic-blocks on storage, then those blocks should be
   	pre-read (if they haven't been read already) so that the
-  	update will not leave half-blocks that need to be written out.
+  	updated blocks can be written out properly.
 	The page will be locked.  If prepare_write wants to unlock the
   	page it, like readpage, may do so and return
   	AOP_TRUNCATED_PAGE.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 001 of 4] Update some VFS documentation.
  2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
  2006-03-13  0:22   ` Avishay Traeger
@ 2006-03-13  4:58   ` Randy.Dunlap
  1 sibling, 0 replies; 14+ messages in thread
From: Randy.Dunlap @ 2006-03-13  4:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: akpm, linux-kernel, linux-fsdevel


a few more; you are too fast.
and thanks for doing this.

On Mon, 13 Mar 2006 10:53:16 +1100 NeilBrown wrote:

> @@ -443,14 +447,81 @@ otherwise noted.
>  The Address Space Object
>  ========================
>  
> +There are a number of distinct yet related services that an
> +address-space can provide.  These include communicating memory
> +pressure, page lookup by address, and keeping track of pages tagged as
> +Dirty or Writeback.
> +
> +The first can be used independantly to the others.  The vm can try to

s/vm/VM/

> +either write dirty pages in order to clean them, or release clean
> +pages in order to reuse them.  To do this it can call the ->writepage
> +method on dirty pages, and ->releasepage on clean pages with
> +PagePrivate set. Clean pages without PagePrivate and with no external
> +references will be released without notice being given to the
> +address_space.
> +
> +To achieve this functionality, pages need to be placed on an lru with

prefer s/lru/LRU/

> +lru_cache_add and mark_page_active needs to be called whenever the
> +page is used.
> +
> +Pages are normally kept in a radix tree index by ->index. This tree
> +maintains information about the PG_Dirty and PG_Writeback status of
> +each page, so that pages with either of these flags can be found
> +quickly.
> +
> +The Dirty tag is primarily used by mpage_writepages - the default
> +->writepages method.  It uses the tag to find dirty pages to call
> +->writepage on.  If mpage_writepages is not used (i.e. the address
> +provides it's own ->writepages) , the PAGECACHE_TAG_DIRTY tag is

its

> +almost unused.  write_inode_now and sync_inode do use it (through
> +__sync_single_inode) to check if ->writepages has been successful in
> +writing out the whole address_space.
> +
> +The Writeback tag is used by filemap*wait* and sync_page* functions,
> +though wait_on_page_writeback_range, to wait for all writeback to

s/though/through/ ??

> +complete.  While waiting ->sync_page (if defined) will be called on
> +each page that is found to require writeback

full stop.

> +An address_space handler may attach extra information to a page,
> +typically using the 'private' field in the 'struct page'.  If such
> +information is attached, the PG_Private flag should be set.  This will
> +cause various mm routines to make extra calls into the address_space

prefer s/mm/MM/ (or use VM as above)

> +handler to deal with that data.
> +
> +An address space acts as an intermediate between storage and
> +application.  Data is read into the address space a whole page at a
> +time, and provided to the application either by copying of the page,
> +or by memory-mapping the page.
> +Data is written into the address space by the application, and then
> +written-back to storage typically in whole pages, however the
> +address_space has finner control of write sizes.

finer

> +
> +The read process essentially only requires 'readpage'.  The write
> +process is more complicated and uses prepare_write/commit_write or
> +set_page_dirty to write data into the address_space, and writepage,
> +sync_page, and writepages to writeback data to storage.

> -  writepage: called by the VM write a dirty page to backing store.
> +  writepage: called by the VM to write a dirty page to backing store.
> +      This may happen for data integrity reason (i.e. 'sync'), or

reasons

> +      to free up memory (flush).  The difference can be seen in
> +      wbc->sync_mode.
> +      The PG_Dirty flag has been cleared and PageLocked is true.
> +      writepage should start writeout, should set PG_Writeback,
> +      and should make sure the page is Unlocked, either synchronously
> +      or asynchronously when the write operation completes.
> +
> +      If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> +      try too hard if there are problems, and may choose to write out a
> +      different page from the mapping if that would be more
> +      appropriate.  If it chooses not to start writeout, it should

any definition of appropriate?

> +      return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> +      calling ->writepage on that page.
> +
> +      See the file "Locking" for more details.
>  
>    readpage: called by the VM to read a page from backing store.
> +       The page will be Locked when readpage is called, and should be
> +       unlocked and marked uptodate once the read completes.
> +       If ->readpage discovers that it needs to unlock the page for
> +       some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
> +       In this case, the page will be re-located, re-locked and if

drop the hyphens

> +       that all succeeds, ->readpage will be called again.
>  
>    sync_page: called by the VM to notify the backing store to perform all
>    	queued I/O operations for a page. I/O operations for other pages
> -	associated with this address_space object may also be performed.
> +	associated with this address_space object may also be
> +  	performed.
> +	This function is optional and is called only for pages with
> +  	PG_Writeback set while waiting for the writeback to complete.
>  
>    writepages: called by the VM to write out pages associated with the
> -  	address_space object.
> +  	address_space object.  If WBC_SYNC_ALL, then the

If WBC_SYNC_ALL <what> ?

> +  	writeback_control will specify a range of pages that must be
> +  	written out.  If WBC_SYNC_NONE, then a nr_to_write is given

If WBC_SYNC_NONE <what> ?

> +	and that many pages should be written if possible.
> +	If no ->writepages is given, then mpage_writepages is used
> +  	instead.  This will choose pages from the addresspace that are
> +  	tagged as DIRTY and will pass them to ->writepage.
>  
> -  commit_write: called by the generic write path in VM to write page to
> -  	its backing store.
> +  	request for a page.  This indicates to the address space that
> +  	the given range of bytes are about to be written.  The

is about to be written.

> +  	address_space should check that the write will be able to
> +  	complete, by allocating space if necessary and doing any other
> +  	internal house keeping.  If the write will update parts some

housekeeping.

> +  	some basic-blocks on storage, then those blocks should be
> +  	pre-read (if they haven't been read already) so that the
> +  	update will not leave half-blocks that need to be written out.
> +	The page will be locked.  If prepare_write wants to unlock the
> +  	page it, like readpage, may do so and return
> +  	AOP_TRUNCATED_PAGE.
> +	In this case the prepare_write will be retried one the lock is
> +  	regained.
> +
> +  commit_write: If prepare_write succeeds, new data will be copied
> +        into the page and then commit_write will be called.  It will
> +        typically update the size of the file (if appropriate) and
> +        mark the inode as dirty, and do any other related housekeeping
> +        operations.  It should avoid returning an error if possible -
> +        errors should have been handled by prepare_write.
>  

> -  direct_IO: called by the VM for direct I/O writes and reads.
> +  	physical block number. This method is used by for the FIBMAP

by for ??

> +  	ioctl and for working with swap-files.  To be able to swap to
> +  	a file, the file must have as stable mapping to a block

must have a

> +  	device.  The swap system does not go through the filesystem
> +  	but instead uses bmap to find out where the blocks in the file
> +  	are and uses those addresses directly.
> +
> +
> +  invalidatepage: If a page has PagePrivate set, then invalidatepage
> +        will be called when part or all of the page is to be removed
> +	from the address space.  This generally corresponds either a

corresponds to

> +	truncation or a complete invalidation of the address space
> +	(in the latter case 'offset' will always be 0).
> +	Any private data associated with the page should be updated
> +	to reflect this truncation.  If offset is 0, then
> +	the private data should be released, because the page
> +	must be able to be completely discarded.  This may be done by
> +        calling the ->releasepage function, but in this case the
> +        release MUST succeed.
> +
> +  releasepage: releasepage is called on PagePrivate pages to indicate
> +        that the page should be freed if possible.  ->releasepage
> +        should remove any private data from the page and clear the
> +        PagePrivate flag.  It may also remove the page from the
> +        address_space.  If this fails for some reason, it may indicate
> +        failure with a 0 return value.
> +	This is used in two distinct though related cases.  The first
> +        is when the VM finds a clean page with no active users and
> +        wants to make it a free page.  If ->releasepage succeeds, the
> +        page will be removed from the address_space and become free.
> +
> +	The second case if when a request has been made to invalidate
> +        some or all pages in an address_space.  This can happen
> +        through the fadvice(POSIX_FADV_DONTNEED) system call or by the
> +        filesystem explicitly requesting it as nfs and 9fs do (when
> +        they believe the cache may be out of date with storage) by
> +        calling invalidate_inode_pages2().
> +	If the filesystem makes such a call, and needs to be certain
> +        that all pages are invalidated, then it's releasepage will

its

> +        need to ensure this.  Possibly it can clear the PageUptodate
> +        bit if it cannot free private data yet.
> +
> +  direct_IO: called by the generic read/write routines to perform
> +        direct_IO - that is IO requests which bypass the page cache
> +        and tranfer data directly between the storage and the

transfer

> +        application's address space.
>  

HTH.
---
~Randy
You can't do anything without having to do something else first.
-- Belefant's Law

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-12 23:53 ` [PATCH 004 of 4] Make address_space_operations->invalidatepage " NeilBrown
@ 2006-03-13 16:32   ` Dave Kleikamp
  2006-03-13 19:13     ` Dave Kleikamp
  2006-03-13 23:10     ` Neil Brown
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Kleikamp @ 2006-03-13 16:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Mon, 2006-03-13 at 10:53 +1100, NeilBrown wrote:
> diff ./fs/jfs/jfs_metapage.c~current~ ./fs/jfs/jfs_metapage.c
> --- ./fs/jfs/jfs_metapage.c~current~    2006-03-09 17:29:35.000000000
> +1100
> +++ ./fs/jfs/jfs_metapage.c     2006-03-13 10:46:55.000000000 +1100
> @@ -578,14 +578,13 @@ static int metapage_releasepage(struct p
>         return 0;
>  }
>  
> -static int metapage_invalidatepage(struct page *page, unsigned long
> offset)
> +static void metapage_invalidatepage(struct page *page, unsigned long
> offset)
>  {
>         BUG_ON(offset);
>  
> -       if (PageWriteback(page))
> -               return 0;
> +       BUG_ON(PageWriteback(page));

I'm a little concerned about adding a BUG_ON for something this function
used to allow, but it looks like the BUG_ON is valid.  I'm asking myself
why did I add the test for PageWriteback in the first place.

>  
> -       return metapage_releasepage(page, 0);
> +       metapage_releasepage(page, 0);
>  }
>  
>  struct address_space_operations jfs_metapage_aops = { 

I'll try to stress test jfs with these patches to see if I can trigger
the an oops here.
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-13 16:32   ` Dave Kleikamp
@ 2006-03-13 19:13     ` Dave Kleikamp
  2006-03-13 21:36       ` Andrew Morton
  2006-03-13 23:10     ` Neil Brown
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Kleikamp @ 2006-03-13 19:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Mon, 2006-03-13 at 10:32 -0600, Dave Kleikamp wrote:
> I'll try to stress test jfs with these patches to see if I can trigger
> the an oops here.

While stress testing on a jfs volume (dbench), I hit an assert in jbd:

Assertion failure in journal_invalidatepage() at fs/jbd/transaction.c:1920: "!page_has_buffers(page)"
------------[ cut here ]------------
kernel BUG at fs/jbd/transaction.c:1920!
invalid opcode: 0000 [#1]
PREEMPT 
last sysfs file: /block/hda/hda11/stat
Modules linked in: radeon irda crc_ccitt airo e1000 pcmcia yenta_socket rsrc_nonstatic pcmcia_core ntfs jfs rtc
CPU:    0
EIP:    0060:[<c01c2782>]    Not tainted VLI
EFLAGS: 00010282   (2.6.16-rc5-mm3-nb #1) 
EIP is at journal_invalidatepage+0xba/0xcb
eax: 00000069   ebx: db6e67f0   ecx: d5806dc8   edx: c0452aa3
esi: c1106620   edi: db6e67f0   ebp: db6e67f0   esp: d5806dc4
ds: 007b   es: 007b   ss: 0068
Process evolution (pid: 11320, threadinfo=d5806000 task=d5731570)
Stack: <0>c0452aa3 c041bc55 c0454d69 00000780 c04551ac 00001000 c1106620 00000000 
       c1106620 00000000 c01b4d10 df30b800 c1106620 00000000 c0141369 c1106620 
       00000000 00000001 c01414fb dbdca3e8 c1106620 00000000 ffffffff 00000000 
Call Trace:
 <c01b4d10> ext3_invalidatepage+0x2f/0x33   <c0141369> truncate_complete_page+0x1d/0x42
 <c01414fb> truncate_inode_pages_range+0xbe/0x27a   <c017977c> inotify_inode_is_dead+0x14/0x6e
 <c0413061> __mutex_lock_slowpath+0x25e/0x375   <c04130cd> __mutex_lock_slowpath+0x2ca/0x375
 <c017977c> inotify_inode_is_dead+0x14/0x6e   <c01b34d2> ext3_delete_inode+0x0/0xc9
 <c01416cc> truncate_inode_pages+0x15/0x19   <c01b34e8> ext3_delete_inode+0x16/0xc9
 <c01b34d2> ext3_delete_inode+0x0/0xc9   <c016cd2d> generic_delete_inode+0x89/0x10e
 <c016a193> dput+0x19e/0x1b7   <c0164a28> do_rename+0x135/0x16c
 <c0144d2a> unmap_vmas+0xc7/0x19e   <c014452b> free_pgtables+0x5b/0x65
 <c024c19b> strncpy_from_user+0x35/0x55   <c016147a> do_getname+0x3e/0x5c
 <c0164a9a> sys_renameat+0x3b/0x60   <c0164ad0> sys_rename+0x11/0x15
 <c0102c3b> syscall_call+0x7/0xb  
Code: e8 d7 6c f9 ff 58 8b 06 f6 c4 08 74 29 68 ac 51 45 c0 68 80 07 00 00 68 69 4d 45 c0 68 55 bc 41 c0 68 a3 2a 45 c0 e8 6a 83 f5 ff <0f> 0b 80 07 69 4d 45 c0 83 c4 14 58 5b 5e 5f 5d c3 55 31 ed 57 
 BUG: evolution/11320, lock held at task exit time!
 [dae25194] {inode_init_once}
.. held by:         evolution:11320 [d5731570, 115]
... acquired at:               lock_rename+0x8e/0x94

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-13 19:13     ` Dave Kleikamp
@ 2006-03-13 21:36       ` Andrew Morton
  2006-03-13 23:05         ` Neil Brown
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2006-03-13 21:36 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: neilb, linux-kernel, linux-fsdevel

Dave Kleikamp <shaggy@austin.ibm.com> wrote:
>
> On Mon, 2006-03-13 at 10:32 -0600, Dave Kleikamp wrote:
>  > I'll try to stress test jfs with these patches to see if I can trigger
>  > the an oops here.
> 
>  While stress testing on a jfs volume (dbench), I hit an assert in jbd:
> 
>  Assertion failure in journal_invalidatepage() at fs/jbd/transaction.c:1920: "!page_has_buffers(page)"

Yes, thanks, that assertion has become wrong.

--- devel/fs/jbd/transaction.c~make-address_space_operations-invalidatepage-return-void-jbd-fix	2006-03-13 13:33:12.000000000 -0800
+++ devel-akpm/fs/jbd/transaction.c	2006-03-13 13:33:12.000000000 -0800
@@ -1915,9 +1915,8 @@ void journal_invalidatepage(journal_t *j
 	} while (bh != head);
 
 	if (!offset) {
-		/* Maybe should BUG_ON !may_free - neilb */
-		try_to_free_buffers(page);
-		J_ASSERT(!page_has_buffers(page));
+		if (may_free && try_to_free_buffers(page))
+			J_ASSERT(!page_has_buffers(page));
 	}
 }
 

However I'm more inclined to drop the whole patch, really - having
->invalidatepage() return a success indication makes sense.  The fact that
we're currently not using that return value doesn't mean that we shouldn't,
didn't and won't.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-13 21:36       ` Andrew Morton
@ 2006-03-13 23:05         ` Neil Brown
  0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2006-03-13 23:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave Kleikamp, linux-kernel, linux-fsdevel

On Monday March 13, akpm@osdl.org wrote:
> Dave Kleikamp <shaggy@austin.ibm.com> wrote:
> >
> > On Mon, 2006-03-13 at 10:32 -0600, Dave Kleikamp wrote:
> >  > I'll try to stress test jfs with these patches to see if I can trigger
> >  > the an oops here.
> > 
> >  While stress testing on a jfs volume (dbench), I hit an assert in jbd:
> > 
> >  Assertion failure in journal_invalidatepage() at fs/jbd/transaction.c:1920: "!page_has_buffers(page)"
> 
> Yes, thanks, that assertion has become wrong.
> 
> --- devel/fs/jbd/transaction.c~make-address_space_operations-invalidatepage-return-void-jbd-fix	2006-03-13 13:33:12.000000000 -0800
> +++ devel-akpm/fs/jbd/transaction.c	2006-03-13 13:33:12.000000000 -0800
> @@ -1915,9 +1915,8 @@ void journal_invalidatepage(journal_t *j
>  	} while (bh != head);
>  
>  	if (!offset) {
> -		/* Maybe should BUG_ON !may_free - neilb */
> -		try_to_free_buffers(page);
> -		J_ASSERT(!page_has_buffers(page));
> +		if (may_free && try_to_free_buffers(page))
> +			J_ASSERT(!page_has_buffers(page));
>  	}
>  }
>  
> 
> However I'm more inclined to drop the whole patch, really - having
> ->invalidatepage() return a success indication makes sense.  The fact that
> we're currently not using that return value doesn't mean that we shouldn't,
> didn't and won't.

->invalidatepage is called either when truncating a file or when
purging the mapping of a file from the page cache.

The VM has waited for read or write back to complete and has ensured
that no new io will happen on the page.  The page, at least from
'offset' onwards, is idle.
Immediately after ->invalidatepage with offset==0 completes, the page
is removed from the pagecache.  It's a goner!

So I really don't think there is any sense for invalidation to be able
to fail.  The VM is saying "this page (or part of it) is going way.
Now."  The filesystem really has to let go.

I'm a little bit worried by this behaviour of journal_invalidatepage
in that it doesn't always free the buffers...
I guess it hasn't finished committing a write when a truncate that
discards that data comes along.  So it needs to hold on to the page to
write out those data blocks before forgetting about them.

But these machinations are completely "under-the-hood".  The VM really
doesn't need to know that ext3 is holding on to the page a bit longer.
A return value to tell the VM still doesn't make sense.

Note that ->releasepage does a similar thing but does have a return
value and here it is very meaningful.  Here the VM is say "I'd like
this page if you've finished with it".  It isn't that the file is
being truncated.  It is just a memory-reclamation action.  So a return
value makes sense.

Finally, having a return value leads developers to think they need to
return a value, which gives a wrong impression of what
->invalidatepage is for.

However, if you still don't like it.....

NeilBrown

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-13 16:32   ` Dave Kleikamp
  2006-03-13 19:13     ` Dave Kleikamp
@ 2006-03-13 23:10     ` Neil Brown
  2006-03-13 23:22       ` Dave Kleikamp
  1 sibling, 1 reply; 14+ messages in thread
From: Neil Brown @ 2006-03-13 23:10 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Monday March 13, shaggy@austin.ibm.com wrote:
> On Mon, 2006-03-13 at 10:53 +1100, NeilBrown wrote:
> > diff ./fs/jfs/jfs_metapage.c~current~ ./fs/jfs/jfs_metapage.c
> > --- ./fs/jfs/jfs_metapage.c~current~    2006-03-09 17:29:35.000000000
> > +1100
> > +++ ./fs/jfs/jfs_metapage.c     2006-03-13 10:46:55.000000000 +1100
> > @@ -578,14 +578,13 @@ static int metapage_releasepage(struct p
> >         return 0;
> >  }
> >  
> > -static int metapage_invalidatepage(struct page *page, unsigned long
> > offset)
> > +static void metapage_invalidatepage(struct page *page, unsigned long
> > offset)
> >  {
> >         BUG_ON(offset);
> >  
> > -       if (PageWriteback(page))
> > -               return 0;
> > +       BUG_ON(PageWriteback(page));
> 
> I'm a little concerned about adding a BUG_ON for something this function
> used to allow, but it looks like the BUG_ON is valid.  I'm asking myself
> why did I add the test for PageWriteback in the first place.

Yes.... as far as I can tell, ->invalidatepage is only called with
the page locked, and with Writeback clear, so PageWriteback can never
be true.  So the BUG_ON should be a no-op.

> 
> >  
> > -       return metapage_releasepage(page, 0);
> > +       metapage_releasepage(page, 0);
> >  }
> >  
> >  struct address_space_operations jfs_metapage_aops = { 
> 
> I'll try to stress test jfs with these patches to see if I can trigger
> the an oops here.

Thanks.  I'd be very interested if you do.
I got on oops with a similar bug_on in the new nfs_invalidatepage and
it turned out to be a bug in radixtree (which I had already found and
fixed, but not in that source tree).

NeilBrown

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 004 of 4] Make address_space_operations->invalidatepage return void
  2006-03-13 23:10     ` Neil Brown
@ 2006-03-13 23:22       ` Dave Kleikamp
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Kleikamp @ 2006-03-13 23:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Tue, 2006-03-14 at 10:10 +1100, Neil Brown wrote:
> On Monday March 13, shaggy@austin.ibm.com wrote:

> > I'm a little concerned about adding a BUG_ON for something this function
> > used to allow, but it looks like the BUG_ON is valid.  I'm asking myself
> > why did I add the test for PageWriteback in the first place.
> 
> Yes.... as far as I can tell, ->invalidatepage is only called with
> the page locked, and with Writeback clear, so PageWriteback can never
> be true.  So the BUG_ON should be a no-op.

Either I was being overly paranoid when I put in that test, or some code
previously called invalidatepage with Writeback set and it's since been
fixed.

> > I'll try to stress test jfs with these patches to see if I can trigger
> > the an oops here.
> 
> Thanks.  I'd be very interested if you do.
> I got on oops with a similar bug_on in the new nfs_invalidatepage and
> it turned out to be a bug in radixtree (which I had already found and
> fixed, but not in that source tree).

Aside from the jbd assert that Andrew fixed, my stress testing of jfs
was successful.  I ACK the jfs-specific part of the patch.
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2006-03-13 23:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-12 23:53 [PATCH 000 of 4] Introduction: VFS documentation and tidy up NeilBrown
2006-03-12 23:53 ` [PATCH 001 of 4] Update some VFS documentation NeilBrown
2006-03-13  0:22   ` Avishay Traeger
2006-03-13  4:14     ` [PATCH 001 of 4] Update some VFS documentation fix Neil Brown
2006-03-13  4:58   ` [PATCH 001 of 4] Update some VFS documentation Randy.Dunlap
2006-03-12 23:53 ` [PATCH 002 of 4] Honour AOP_TRUNCATE_PAGE returns in page_symlink NeilBrown
2006-03-12 23:53 ` [PATCH 003 of 4] Make address_space_operations->sync_page return void NeilBrown
2006-03-12 23:53 ` [PATCH 004 of 4] Make address_space_operations->invalidatepage " NeilBrown
2006-03-13 16:32   ` Dave Kleikamp
2006-03-13 19:13     ` Dave Kleikamp
2006-03-13 21:36       ` Andrew Morton
2006-03-13 23:05         ` Neil Brown
2006-03-13 23:10     ` Neil Brown
2006-03-13 23:22       ` Dave Kleikamp

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).