linux-fsdevel.vger.kernel.org archive mirror
* [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1
@ 2007-04-24  1:23 Nick Piggin
  2007-04-24  1:23 ` [patch 01/44] mm: revert KERNEL_DS buffered write optimisation Nick Piggin
                   ` (43 more replies)
  0 siblings, 44 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

Hi, these patches are against 2.6.21-rc6-mm1. Aside from OCFS2, there
were no major clashes between -mm and mainline diffs, which is nice.

These patches aim to solve the long-standing buffered write deadlocks, and
then go on to introduce a pair of new write a_op methods which allow the
deadlock to be solved without taking the performance hit of the
backwards-compatible solutions that use the old APIs.

Reiserfs (and Reiser4, in -mm) are the only filesystems left unconverted,
although there are a number of less common ones still untested.

Thanks,
Nick



* [patch 01/44] mm: revert KERNEL_DS buffered write optimisation
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 Nick Piggin
                   ` (42 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management,
	Neil Brown

[-- Attachment #1: mm-revert-nfsd-writev-opt.patch --]
[-- Type: text/plain, Size: 2023 bytes --]


Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |   32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1980,27 +1980,21 @@ generic_file_buffered_write(struct kiocb
 		/* Limit the size of the copy to the caller's write size */
 		bytes = min(bytes, count);
 
-		/* We only need to worry about prefaulting when writes are from
-		 * user-space.  NFSd uses vfs_writev with several non-aligned
-		 * segments in the vector, and limiting to one segment a time is
-		 * a noticeable performance for re-write
+		/*
+		 * Limit the size of the copy to that of the current segment,
+		 * because fault_in_pages_readable() doesn't know how to walk
+		 * segments.
 		 */
-		if (!segment_eq(get_fs(), KERNEL_DS)) {
-			/*
-			 * Limit the size of the copy to that of the current
-			 * segment, because fault_in_pages_readable() doesn't
-			 * know how to walk segments.
-			 */
-			bytes = min(bytes, cur_iov->iov_len - iov_base);
+		bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 */
+		fault_in_pages_readable(buf, bytes);
 
-			/*
-			 * Bring in the user page that we will copy from
-			 * _first_.  Otherwise there's a nasty deadlock on
-			 * copying from the same page as we're writing to,
-			 * without it being marked up-to-date.
-			 */
-			fault_in_pages_readable(buf, bytes);
-		}
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
 			status = -ENOMEM;

-- 



* [patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
  2007-04-24  1:23 ` [patch 01/44] mm: revert KERNEL_DS buffered write optimisation Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 Nick Piggin
                   ` (41 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management,
	Andrew Morton

[-- Attachment #1: mm-revert-buffered-write-zero-length-iov.patch --]
[-- Type: text/plain, Size: 1660 bytes --]

From: Andrew Morton <akpm@osdl.org>

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |    9 +--------
 mm/filemap.h |    4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,12 +2001,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -2036,8 +2030,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }

-- 



* [patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
  2007-04-24  1:23 ` [patch 01/44] mm: revert KERNEL_DS buffered write optimisation Nick Piggin
  2007-04-24  1:23 ` [patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 04/44] mm: clean up buffered write code Nick Piggin
                   ` (40 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management,
	Andrew Morton

[-- Attachment #1: mm-revert-buffered-write-deadlock-fix.patch --]
[-- Type: text/plain, Size: 2922 bytes --]

From: Andrew Morton <akpm@osdl.org>

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or
  a subsequent segment described an mmap of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once.  The other problem with the fix is that it doesn't fix all of the
locking problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.
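
For reference, the scenario above can be reproduced from userspace along the
following lines.  This is only a minimal sketch in the spirit of
writev-deadlock-demo.c, not Andrew's actual demo: the file name and sizes are
arbitrary, and as noted above it assumes blocksize < PAGE_CACHE_SIZE so the
partially written page is not uptodate while it is locked.  On an affected
kernel the writev() can deadlock in the fault handler; on a fixed kernel it
simply completes (possibly as a short write).

/*
 * The second iovec segment points into an mmap of the very page being
 * written, so the pagefault taken during the usercopy needs the page
 * lock that the buffered write path is already holding.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        const size_t page = sysconf(_SC_PAGESIZE);
        struct iovec iov[2];
        char pad[16];
        char *map;
        int fd;

        fd = open("deadlock-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, 2 * page) < 0) {
                perror("setup");
                return 1;
        }

        /* Map the file but don't touch it, so nothing is faulted in yet. */
        map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memset(pad, 'x', sizeof(pad));
        iov[0].iov_base = pad;          /* first segment: ordinary memory */
        iov[0].iov_len = sizeof(pad);
        iov[1].iov_base = map;          /* second segment: the target page */
        iov[1].iov_len = page / 2;

        /* Write into page 0 of the file, sourcing from a mapping of page 0. */
        if (writev(fd, iov, 2) < 0)
                perror("writev");
        return 0;
}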

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |   18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1971,21 +1971,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1993,7 +1986,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

-- 



* [patch 04/44] mm: clean up buffered write code
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (2 preceding siblings ...)
  2007-04-24  1:23 ` [patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 05/44] mm: debug write deadlocks Nick Piggin
                   ` (39 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management,
	Andrew Morton

[-- Attachment #1: mm-generic_file_buffered_write-cleanup.patch --]
[-- Type: text/plain, Size: 3417 bytes --]

From: Andrew Morton <akpm@osdl.org>

Rename some variables and fix some types.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |   35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1944,16 +1944,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1964,31 +1963,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -2019,7 +2020,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -2037,12 +2038,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}

-- 



* [patch 05/44] mm: debug write deadlocks
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (3 preceding siblings ...)
  2007-04-24  1:23 ` [patch 04/44] mm: clean up buffered write code Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 06/44] mm: trim more holes Nick Piggin
                   ` (38 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: mm-debug-write-deadlocks.patch --]
[-- Type: text/plain, Size: 1136 bytes --]


Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1984,6 +1984,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1991,6 +1992,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

-- 



* [patch 06/44] mm: trim more holes
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (4 preceding siblings ...)
  2007-04-24  1:23 ` [patch 05/44] mm: debug write deadlocks Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  6:07   ` Neil Brown
  2007-04-24  1:23 ` [patch 07/44] mm: buffered write cleanup Nick Piggin
                   ` (37 subsequent siblings)
  43 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: mm-trim-more-holes.patch --]
[-- Type: text/plain, Size: 3350 bytes --]


If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
the write operation may fail even though prepare_write has already
instantiated blocks past i_size, and those blocks were not being trimmed off.
Fix this, and consolidate the trimming into one place.
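
To make the intent explicit, the cleanup that the consolidated error path has
to perform amounts to roughly the following (a hypothetical helper written
only to illustrate the ordering; the patch itself uses a label inside the
write loop rather than a separate function):

/*
 * Illustration only (not part of the patch): after a failed
 * prepare_write()/commit_write(), drop the page and trim any blocks
 * that prepare_write() may have instantiated beyond i_size.  The
 * caller holds i_mutex, so inode->i_size can be read directly.
 */
static void write_aop_error_cleanup(struct inode *inode, struct page *page,
                                    loff_t pos, unsigned bytes, long status)
{
        if (status != AOP_TRUNCATED_PAGE)
                unlock_page(page);
        page_cache_release(page);

        if (pos + bytes > inode->i_size)
                vmtruncate(inode, inode->i_size);
}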

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |   80 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 40 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,22 +2001,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;
 
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
+			vmtruncate(inode, inode->i_size);
+		if (status == AOP_TRUNCATED_PAGE)
+			continue;
+		else
+			break;
+
 	} while (count);
 	*ppos = pos;
 

-- 



* [patch 07/44] mm: buffered write cleanup
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (5 preceding siblings ...)
  2007-04-24  1:23 ` [patch 06/44] mm: trim more holes Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 08/44] mm: write iovec cleanup Nick Piggin
                   ` (36 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: mm-buffered-write-cleanup.patch --]
[-- Type: text/plain, Size: 11225 bytes --]


Quite a bit of code is used to maintain these "cached pages" that are
pretty unlikely ever to be used: creating the spare page requires a narrow
race in which another task inserts a page at the same index while this
process is allocating one, and then a multi-page write into an uncached
part of the file is needed to actually make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should just be using the per-CPU LRU pagevec (which cuts down on both the
data and code-size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs,
which are persistent. Net result: 7.3 times fewer lru_lock acquisitions are
required to add the pages to pagecache for a bulk write (in 4K chunks).

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
 to clashes in -mm. What put them there, and why? ]
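
For context, add_to_page_cache_lru() (which the conversion below switches to)
batches the LRU insertion through the per-CPU pagevec.  Roughly, and
simplified from memory of the 2.6 sources rather than quoted verbatim, it
amounts to:

/*
 * Simplified sketch of add_to_page_cache_lru(): insert into the radix
 * tree as usual, then hand the page to the per-CPU LRU pagevec, which
 * is drained in batches and so takes lru_lock far less often than a
 * short-lived private pagevec would.
 */
static int add_to_page_cache_lru_sketch(struct page *page,
                struct address_space *mapping, pgoff_t index, gfp_t gfp_mask)
{
        int ret = add_to_page_cache(page, mapping, index, gfp_mask);

        if (ret == 0)
                lru_cache_add(page);    /* per-CPU pagevec */
        return ret;
}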

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/mpage.c     |   12 ----
 mm/filemap.c   |  144 ++++++++++++++++++++++-----------------------------------
 mm/readahead.c |   28 +++--------
 3 files changed, 66 insertions(+), 118 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -689,26 +689,22 @@ EXPORT_SYMBOL(probe_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -903,11 +899,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1084,23 +1078,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1110,8 +1101,6 @@ out:
 		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1605,35 +1594,28 @@ static struct page *__read_cache_page(st
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page)
-				return ERR_PTR(-ENOMEM);
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err < 0) {
+		page = page_cache_alloc_cold(mapping);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			if (err == -EEXIST)
+				goto repeat;
 			/* Presumably ENOMEM for radix tree node */
-			page_cache_release(cached_page);
 			return ERR_PTR(err);
 		}
-		page = cached_page;
-		cached_page = NULL;
 		err = filler(data, page);
 		if (err < 0) {
 			page_cache_release(page);
 			page = ERR_PTR(err);
 		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 
@@ -1711,40 +1693,6 @@ struct page *read_cache_page(struct addr
 EXPORT_SYMBOL(read_cache_page);
 
 /*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec.  This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
-			struct page **cached_page, struct pagevec *lru_pvec)
-{
-	int err;
-	struct page *page;
-repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
-			if (!*cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache(*cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err == 0) {
-			page = *cached_page;
-			page_cache_get(page);
-			if (!pagevec_add(lru_pvec, page))
-				__pagevec_lru_add(lru_pvec);
-			*cached_page = NULL;
-		}
-	}
-	return page;
-}
-
-/*
  * The logic we want is
  *
  *	if suid or (sgid and xgrp)
@@ -1938,6 +1886,33 @@ generic_file_direct_write(struct kiocb *
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+							pgoff_t index)
+{
+	int status;
+	struct page *page;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (likely(page))
+		return page;
+
+	page = page_cache_alloc(mapping);
+	if (!page)
+		return NULL;
+	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+	return page;
+}
+
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1948,15 +1923,10 @@ generic_file_buffered_write(struct kiocb
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
-	struct page	*page;
-	struct page	*cached_page = NULL;
-	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
 	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
-	pagevec_init(&lru_pvec, 0);
-
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
@@ -1968,6 +1938,7 @@ generic_file_buffered_write(struct kiocb
 	}
 
 	do {
+		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long maxlen;	/* Bytes remaining in current iovec */
@@ -1994,7 +1965,8 @@ generic_file_buffered_write(struct kiocb
 		fault_in_pages_readable(buf, maxlen);
 #endif
 
-		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -2062,9 +2034,6 @@ fs_write_aop_error:
 	} while (count);
 	*ppos = pos;
 
-	if (cached_page)
-		page_cache_release(cached_page);
-
 	/*
 	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
 	 */
@@ -2084,7 +2053,6 @@ fs_write_aop_error:
 	if (unlikely(file->f_flags & O_DIRECT) && written)
 		status = filemap_write_and_wait(mapping);
 
-	pagevec_lru_add(&lru_pvec);
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,33 +389,25 @@ mpage_readpages(struct address_space *ma
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page)) {
-				cond_resched();
-				__pagevec_lru_add(&lru_pvec);
-			}
-		} else {
-			page_cache_release(page);
 		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -230,30 +230,25 @@ int read_cache_pages(struct address_spac
 			int (*filler)(void *, struct page *), void *data)
 {
 	struct page *page;
-	struct pagevec lru_pvec;
 	int ret = 0;
 
-	pagevec_init(&lru_pvec, 0);
-
 	while (!list_empty(pages)) {
 		page = list_to_page(pages);
 		list_del(&page->lru);
-		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+		if (add_to_page_cache_lru(page, mapping,
+					page->index, GFP_KERNEL)) {
 			page_cache_release(page);
 			continue;
 		}
+		page_cache_release(page);
+
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page)) {
-			cond_resched();
-			__pagevec_lru_add(&lru_pvec);
-		}
-		if (ret) {
+		if (unlikely(ret)) {
 			put_pages_list(pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
 	}
-	pagevec_lru_add(&lru_pvec);
 	return ret;
 }
 
@@ -263,7 +258,6 @@ static int read_pages(struct address_spa
 		struct list_head *pages, unsigned nr_pages)
 {
 	unsigned page_idx;
-	struct pagevec lru_pvec;
 	int ret;
 
 	if (mapping->a_ops->readpages) {
@@ -273,21 +267,15 @@ static int read_pages(struct address_spa
 		goto out;
 	}
 
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page)) {
-				cond_resched();
-				__pagevec_lru_add(&lru_pvec);
-			}
-		} else
-			page_cache_release(page);
+		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	ret = 0;
 out:
 	return ret;

-- 



* [patch 08/44] mm: write iovec cleanup
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (6 preceding siblings ...)
  2007-04-24  1:23 ` [patch 07/44] mm: buffered write cleanup Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 09/44] mm: fix pagecache write deadlocks Nick Piggin
                   ` (35 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: mm-write-iov-cleanup.patch --]
[-- Type: text/plain, Size: 8574 bytes --]


Move some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because that gets more complex
in the next patch.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c     |   36 +++++--------------
 mm/filemap.h     |  104 +++++++++++++++++++++++++++----------------------------
 mm/filemap_xip.c |   17 +++-----
 3 files changed, 69 insertions(+), 88 deletions(-)

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t base, size_t bytes)
 {
 	char *kaddr;
-	int left;
+	size_t copied;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
+	}
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
-	return bytes - left;
+	return copied;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-			const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			 size_t base, size_t bytes)
 {
 	char *kaddr;
 	size_t copied;
 
-	kaddr = kmap_atomic(page, KM_USER0);
-	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-							 base, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
+	kaddr = kmap(page);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
 	}
+	kunmap(page);
 	return copied;
 }
 
 static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+						 size_t *basep, size_t bytes)
 {
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	while (bytes) {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
+	if (likely(nr_segs == 1)) {
+		*basep += bytes;
+	} else {
+		const struct iovec *iov = *iovp;
+		size_t base = *basep;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
 		}
+		*iovp = iov;
+		*basep = base;
 	}
-	*iovp = iov;
-	*basep = base;
 }
 #endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1930,12 +1930,7 @@ generic_file_buffered_write(struct kiocb
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
-	if (likely(nr_segs == 1))
-		buf = iov->iov_base + written;
-	else {
-		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
-		buf = cur_iov->iov_base + iov_offset;
-	}
+	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
 		struct page *page;
@@ -1945,6 +1940,7 @@ generic_file_buffered_write(struct kiocb
 		size_t bytes;		/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		buf = cur_iov->iov_base + iov_offset;
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
@@ -1976,13 +1972,10 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
-							buf, bytes);
-		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+		copied = filemap_copy_from_user(page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
 		flush_dcache_page(page);
+
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
@@ -1993,20 +1986,11 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status > 0)) /* filesystem did partial write */
 			copied = status;
 
-		if (likely(copied > 0)) {
-			written += copied;
-			count -= copied;
-			pos += copied;
-			buf += copied;
-			if (unlikely(nr_segs > 1)) {
-				filemap_set_next_iovec(&cur_iov,
-						&iov_offset, copied);
-				if (count)
-					buf = cur_iov->iov_base + iov_offset;
-			} else {
-				iov_offset += copied;
-			}
-		}
+		written += copied;
+		count -= copied;
+		pos += copied;
+		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -14,7 +14,6 @@
 #include <linux/uio.h>
 #include <linux/rmap.h>
 #include <asm/tlbflush.h>
-#include "filemap.h"
 
 /*
  * We do use our own empty page to avoid interference with other users
@@ -318,6 +317,7 @@ __xip_file_write(struct file *filp, cons
 		unsigned long index;
 		unsigned long offset;
 		size_t copied;
+		char *kaddr;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
@@ -325,14 +325,6 @@ __xip_file_write(struct file *filp, cons
 		if (bytes > count)
 			bytes = count;
 
-		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 */
-		fault_in_pages_readable(buf, bytes);
-
 		page = a_ops->get_xip_page(mapping,
 					   index*(PAGE_SIZE/512), 0);
 		if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
@@ -349,8 +341,13 @@ __xip_file_write(struct file *filp, cons
 			break;
 		}
 
-		copied = filemap_copy_from_user(page, offset, buf, bytes);
+		fault_in_pages_readable(buf, bytes);
+		kaddr = kmap_atomic(page, KM_USER0);
+		copied = bytes -
+			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);
+
 		if (likely(copied > 0)) {
 			status = copied;
 

-- 



* [patch 09/44] mm: fix pagecache write deadlocks
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (7 preceding siblings ...)
  2007-04-24  1:23 ` [patch 08/44] mm: write iovec cleanup Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 10/44] mm: buffered write iterator Nick Piggin
                   ` (34 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: mm-pagecache-write-deadlocks.patch --]
[-- Type: text/plain, Size: 8982 bytes --]


Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if the page is the same
2,8;2,8	- ABBA deadlock if the pages are different
2,6;b,f	- ABBA deadlock if the page is the same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present
    pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
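
In condensed form, the per-page copy strategy implemented below looks like
this (a sketch only: error handling, the re-check that the page did not
become uptodate after relocking, and the iovec bookkeeping are all omitted,
and the real code uses kmap_atomic slots for the bounce copy):

page = __grab_cache_page(mapping, index);

if (PageUptodate(page)) {
        /*
         * The destination is uptodate, so a short copy is harmless.
         * Copy with the page locked, but make the copy atomic so it
         * cannot take a pagefault.
         */
        status = a_ops->prepare_write(file, page, offset, offset + bytes);
        pagefault_disable();
        copied = filemap_copy_from_user_atomic(page, offset,
                                cur_iov, nr_segs, iov_offset, bytes);
        pagefault_enable();
} else {
        /*
         * Not uptodate: a short copy would expose garbage.  Drop the
         * page lock, copy the user data into a bounce page (faults are
         * fine here), then relock and copy via the kernel mapping.
         */
        unlock_page(page);
        src_page = alloc_page(GFP_KERNEL);
        copied = filemap_copy_from_user(src_page, offset,
                                cur_iov, nr_segs, iov_offset, bytes);
        lock_page(page);
        status = a_ops->prepare_write(file, page, offset, offset + bytes);
        memcpy(kmap(page) + offset, kmap(src_page) + offset, copied);
        kunmap(src_page);
        kunmap(page);
}
status = a_ops->commit_write(file, page, offset, offset + bytes);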

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 include/linux/pagemap.h |   11 +++-
 mm/filemap.c            |  114 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 104 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1933,11 +1933,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -1947,20 +1948,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -1968,32 +1979,104 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			src_page = alloc_page(GFP_KERNEL);
+			if (!src_page) {
+				page_cache_release(page);
+				status = -ENOMEM;
+				break;
+			}
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			copied = filemap_copy_from_user(src_page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
+			if (unlikely(copied == 0)) {
+				status = -EFAULT;
+				page_cache_release(page);
+				page_cache_release(src_page);
+				break;
+			}
+			bytes = copied;
+
+			lock_page(page);
+			/*
+			 * Can't handle the page going uptodate here, because
+			 * that means we would use non-atomic usercopies, which
+			 * zero out the tail of the page, which can cause
+			 * zeroes to become transiently visible. We could just
+			 * use a non-zeroing copy, but the APIs aren't too
+			 * consistent.
+			 */
+			if (unlikely(!page->mapping || PageUptodate(page))) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			void *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset, src + offset, bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2002,6 +2085,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2014,7 +2099,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -220,6 +220,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -239,19 +242,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

-- 



* [patch 10/44] mm: buffered write iterator
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (8 preceding siblings ...)
  2007-04-24  1:23 ` [patch 09/44] mm: fix pagecache write deadlocks Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 11/44] fs: fix data-loss on error Nick Piggin
                   ` (33 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: fs-buffered-write-iterator.patch --]
[-- Type: text/plain, Size: 11006 bytes --]


Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.
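
After the conversion, the buffered write loop drives the iterator roughly as
follows (a condensed sketch of the calling convention only; grabbing,
preparing and committing the pagecache page are elided, and "page" stands for
the locked pagecache page exactly as in the real loop below):

struct iov_iter i;

iov_iter_init(&i, iov, nr_segs, count, written);

while (iov_iter_count(&i)) {
        unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
        unsigned long bytes = min_t(unsigned long,
                        PAGE_CACHE_SIZE - offset, iov_iter_count(&i));
        size_t copied;

        /*
         * Pre-fault the current segment so the atomic copy can make
         * progress; a failure here means -EFAULT.
         */
        if (iov_iter_fault_in_readable(&i))
                break;

        /* ... __grab_cache_page() + ->prepare_write() ... */

        pagefault_disable();
        copied = iov_iter_copy_from_user_atomic(page, &i, offset, bytes);
        pagefault_enable();

        /* ... ->commit_write(), unlock and release the page ... */

        iov_iter_advance(&i, copied);
        pos += copied;
}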

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 include/linux/fs.h |   33 ++++++++++++
 mm/filemap.c       |  144 +++++++++++++++++++++++++++++++++++++++++++----------
 mm/filemap.h       |  103 -------------------------------------
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -398,6 +398,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+	const struct iovec *iov;
+	unsigned long nr_segs;
+	size_t iov_offset;
+	size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t count, size_t written)
+{
+	i->iov = iov;
+	i->nr_segs = nr_segs;
+	i->iov_offset = 0;
+	i->count = count + written;
+
+	iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+	return i->count;
+}
+
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
-#include "filemap.h"
+#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include "internal.h"
 
 /*
@@ -1740,8 +1740,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
 	size_t copied = 0, left = 0;
@@ -1764,6 +1763,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	BUG_ON(!in_atomic());
+	kaddr = kmap_atomic(page, KM_USER0);
+	if (likely(i->nr_segs == 1)) {
+		int left;
+		char __user *buf = i->iov->iov_base + i->iov_offset;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+						i->iov, i->iov_offset, bytes);
+	}
+	kunmap_atomic(kaddr, KM_USER0);
+
+	return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	kaddr = kmap(page);
+	if (likely(i->nr_segs == 1)) {
+		int left;
+		char __user *buf = i->iov->iov_base + i->iov_offset;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+						i->iov, i->iov_offset, bytes);
+	}
+	kunmap(page);
+	return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+	if (likely(i->nr_segs == 1)) {
+		i->iov_offset += bytes;
+	} else {
+		const struct iovec *iov = i->iov;
+		size_t base = i->iov_offset;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
+		}
+		i->iov = iov;
+		i->iov_offset = base;
+	}
+}
+
+void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+	BUG_ON(i->count < bytes);
+
+	__iov_iter_advance_iov(i, bytes);
+	i->count -= bytes;
+}
+
+int iov_iter_fault_in_readable(struct iov_iter *i)
+{
+	size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
+	char __user *buf = i->iov->iov_base + i->iov_offset;
+	return fault_in_pages_readable(buf, seglen);
+}
+
+/*
+ * Return the count of just the current iov_iter segment.
+ */
+size_t iov_iter_single_seg_count(struct iov_iter *i)
+{
+	const struct iovec *iov = i->iov;
+	if (i->nr_segs == 1)
+		return i->count;
+	else
+		return min(i->count, iov->iov_len - i->iov_offset);
+}
+
+/*
  * Performs necessary checks before doing a write
  *
  * Can adjust writing position or amount of bytes to write.
@@ -1923,30 +2026,22 @@ generic_file_buffered_write(struct kiocb
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
-	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_offset = 0;	   /* offset in the current iovec */
-	char __user	*buf;
+	struct iov_iter i;
 
-	/*
-	 * handle partial DIO write.  Adjust cur_iov if needed.
-	 */
-	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
+	iov_iter_init(&i, iov, nr_segs, count, written);
 
 	do {
 		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long seglen;	/* Bytes remaining in current iovec */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
-		buf = cur_iov->iov_base + iov_offset;
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
+		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_count(&i));
 
 		/*
 		 * a non-NULL src_page indicates that we're doing the
@@ -1954,10 +2049,6 @@ generic_file_buffered_write(struct kiocb
 		 */
 		src_page = NULL;
 
-		seglen = cur_iov->iov_len - iov_offset;
-		if (seglen > bytes)
-			seglen = bytes;
-
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1968,7 +2059,7 @@ generic_file_buffered_write(struct kiocb
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+		if (unlikely(iov_iter_fault_in_readable(&i))) {
 			status = -EFAULT;
 			break;
 		}
@@ -1999,8 +2090,8 @@ generic_file_buffered_write(struct kiocb
 			 * same reason as we can't take a page fault with a
 			 * page locked (as explained below).
 			 */
-			copied = filemap_copy_from_user(src_page, offset,
-					cur_iov, nr_segs, iov_offset, bytes);
+			copied = iov_iter_copy_from_user(src_page, &i,
+								offset, bytes);
 			if (unlikely(copied == 0)) {
 				status = -EFAULT;
 				page_cache_release(page);
@@ -2046,8 +2137,8 @@ generic_file_buffered_write(struct kiocb
 			 * really matter.
 			 */
 			pagefault_disable();
-			copied = filemap_copy_from_user_atomic(page, offset,
-					cur_iov, nr_segs, iov_offset, bytes);
+			copied = iov_iter_copy_from_user_atomic(page, &i,
+								offset, bytes);
 			pagefault_enable();
 		} else {
 			void *src, *dst;
@@ -2072,10 +2163,9 @@ generic_file_buffered_write(struct kiocb
 		if (src_page)
 			page_cache_release(src_page);
 
+		iov_iter_advance(&i, copied);
 		written += copied;
-		count -= copied;
 		pos += copied;
-		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
@@ -2099,7 +2189,7 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-	} while (count);
+	} while (iov_iter_count(&i));
 	*ppos = pos;
 
 	/*
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ /dev/null
@@ -1,103 +0,0 @@
-/*
- *	linux/mm/filemap.h
- *
- * Copyright (C) 1994-1999  Linus Torvalds
- */
-
-#ifndef __FILEMAP_H
-#define __FILEMAP_H
-
-#include <linux/types.h>
-#include <linux/fs.h>
-#include <linux/mm.h>
-#include <linux/highmem.h>
-#include <linux/uio.h>
-#include <linux/uaccess.h>
-
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
-					const struct iovec *iov,
-					size_t base,
-					size_t bytes);
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then return the number of
- * bytes which were copied.
- */
-static inline size_t
-filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
-			const struct iovec *iov, unsigned long nr_segs,
-			size_t base, size_t bytes)
-{
-	char *kaddr;
-	size_t copied;
-
-	kaddr = kmap_atomic(page, KM_USER0);
-	if (likely(nr_segs == 1)) {
-		int left;
-		char __user *buf = iov->iov_base + base;
-		left = __copy_from_user_inatomic_nocache(kaddr + offset,
-							buf, bytes);
-		copied = bytes - left;
-	} else {
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
-							iov, base, bytes);
-	}
-	kunmap_atomic(kaddr, KM_USER0);
-
-	return copied;
-}
-
-/*
- * This has the same sideeffects and return value as
- * filemap_copy_from_user_atomic().
- * The difference is that it attempts to resolve faults.
- */
-static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const struct iovec *iov, unsigned long nr_segs,
-			 size_t base, size_t bytes)
-{
-	char *kaddr;
-	size_t copied;
-
-	kaddr = kmap(page);
-	if (likely(nr_segs == 1)) {
-		int left;
-		char __user *buf = iov->iov_base + base;
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		copied = bytes - left;
-	} else {
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
-							iov, base, bytes);
-	}
-	kunmap(page);
-	return copied;
-}
-
-static inline void
-filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
-						 size_t *basep, size_t bytes)
-{
-	if (likely(nr_segs == 1)) {
-		*basep += bytes;
-	} else {
-		const struct iovec *iov = *iovp;
-		size_t base = *basep;
-
-		while (bytes) {
-			int copy = min(bytes, iov->iov_len - base);
-
-			bytes -= copy;
-			base += copy;
-			if (iov->iov_len == base) {
-				iov++;
-				base = 0;
-			}
-		}
-		*iovp = iov;
-		*basep = base;
-	}
-}
-#endif

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 11/44] fs: fix data-loss on error
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (9 preceding siblings ...)
  2007-04-24  1:23 ` [patch 10/44] mm: buffered write iterator Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  1:23 ` [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops Nick Piggin
                   ` (32 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: fs-dataloss-stop.patch --]
[-- Type: text/plain, Size: 1116 bytes --]


New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes the error-case code to zero out parts
of those buffers because it thinks they contain stale data; in fact they are
uptodate, so this is a data-loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
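
For illustration only, the per-buffer state transition argued for above can
be written as a small helper; this is a sketch, not part of the patch, and
example_fix_new_buffer is a made-up name:

#include <linux/mm.h>
#include <linux/buffer_head.h>

/*
 * Sketch: how a freshly allocated (buffer_new) block that maps into an
 * already-uptodate page should be handled.
 */
static void example_fix_new_buffer(struct page *page, struct buffer_head *bh)
{
	BUG_ON(!PageLocked(page));
	if (buffer_new(bh) && PageUptodate(page)) {
		clear_buffer_new(bh);		/* clear "new" first...        */
		set_buffer_uptodate(bh);	/* ...then mark it uptodate    */
		mark_buffer_dirty(bh);		/* and make sure it is written */
	}
}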

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/buffer.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1800,7 +1800,9 @@ static int __block_prepare_write(struct 
 				unmap_underlying_metadata(bh->b_bdev,
 							bh->b_blocknr);
 				if (PageUptodate(page)) {
+					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
+					mark_buffer_dirty(bh);
 					continue;
 				}
 				if (block_end > to || block_start < from) {

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (10 preceding siblings ...)
  2007-04-24  1:23 ` [patch 11/44] fs: fix data-loss on error Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24  6:59   ` Neil Brown
  2007-04-24  1:23 ` [patch 13/44] mm: restore KERNEL_DS optimisations Nick Piggin
                   ` (31 subsequent siblings)
  43 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: fs-new-write-aops.patch --]
[-- Type: text/plain, Size: 35617 bytes --]

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Mark Fasheh provided API design contributions, code review and fixes.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
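
To make the conversion pattern concrete, here is a hedged sketch of how a
hypothetical block-based filesystem would wire up the new operations using
the block_write_begin() and generic_write_end() helpers introduced below
(examplefs and examplefs_get_block are made-up names); the per-filesystem
conversions later in this series follow essentially this shape:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/buffer_head.h>

/* assumed to be the filesystem's usual get_block_t implementation */
extern int examplefs_get_block(struct inode *inode, sector_t iblock,
			       struct buffer_head *bh_result, int create);

static int examplefs_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, examplefs_get_block);
}

static int examplefs_write_begin(struct file *file,
			struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
{
	*pagep = NULL;	/* let block_write_begin grab and lock the page */
	return block_write_begin(file, mapping, pos, len, flags,
					pagep, fsdata, examplefs_get_block);
}

static const struct address_space_operations examplefs_aops = {
	.readpage	= examplefs_readpage,
	.write_begin	= examplefs_write_begin,
	.write_end	= generic_write_end,	/* commits buffers, updates i_size */
};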

 Documentation/filesystems/Locking |    9 -
 Documentation/filesystems/vfs.txt |   48 +++++++
 drivers/block/loop.c              |   77 ++++--------
 fs/buffer.c                       |  203 +++++++++++++++++++++++++++------
 fs/libfs.c                        |   44 +++++++
 fs/namei.c                        |   47 +------
 fs/splice.c                       |   70 +----------
 include/linux/buffer_head.h       |   10 +
 include/linux/fs.h                |   28 ++++
 include/linux/pagemap.h           |    2 
 mm/filemap.c                      |  233 ++++++++++++++++++++++++++++++++++----
 11 files changed, 561 insertions(+), 210 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -391,6 +391,8 @@ enum positive_aop_returns {
 	AOP_TRUNCATED_PAGE	= 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -451,6 +453,14 @@ struct address_space_operations {
 	 */
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
@@ -465,6 +475,18 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
@@ -1969,6 +1991,12 @@ extern int simple_prepare_write(struct f
 			unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
 				unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1950,6 +1950,93 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+
+	if (aops->write_begin) {
+		return aops->write_begin(file, mapping, pos, len, flags,
+							pagep, fsdata);
+	} else {
+		int ret;
+		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+		struct page *page;
+again:
+		page = __grab_cache_page(mapping, index);
+		*pagep = page;
+		if (!page)
+			return -ENOMEM;
+
+		if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
+			/*
+			 * There is no way to resolve a short write situation
+			 * for a !Uptodate page (except by the double copy
+			 * done by generic_perform_write_2copy in the caller).
+			 *
+			 * Instead, we have to bring it uptodate here.
+			 */
+			ret = aops->readpage(file, page);
+			page_cache_release(page);
+			if (ret) {
+				if (ret == AOP_TRUNCATED_PAGE)
+					goto again;
+				return ret;
+			}
+			goto again;
+		}
+
+		ret = aops->prepare_write(file, page, offset, offset+len);
+		if (ret) {
+			if (ret != AOP_TRUNCATED_PAGE)
+				unlock_page(page);
+			page_cache_release(page);
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+			if (ret == AOP_TRUNCATED_PAGE)
+				goto again;
+		}
+		return ret;
+	}
+}
+EXPORT_SYMBOL(pagecache_write_begin);
+
+int pagecache_write_end(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+	int ret;
+
+	if (aops->write_begin) {
+		ret = aops->write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	} else {
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+
+		flush_dcache_page(page);
+		ret = aops->commit_write(file, page, offset, offset+len);
+		unlock_page(page);
+		page_cache_release(page);
+		BUG_ON(ret == AOP_TRUNCATED_PAGE); /* can't deal with */
+
+		if (ret < 0) {
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		} else if (ret > 0)
+			ret = min_t(size_t, copied, ret);
+		else
+			ret = copied;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(pagecache_write_end);
+
 ssize_t
 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -1993,8 +2080,7 @@ EXPORT_SYMBOL(generic_file_direct_write)
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-static struct page *__grab_cache_page(struct address_space *mapping,
-							pgoff_t index)
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
 {
 	int status;
 	struct page *page;
@@ -2015,20 +2101,16 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__grab_cache_page);
 
-ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t pos, loff_t *ppos,
-		size_t count, ssize_t written)
+static ssize_t generic_perform_write_2copy(struct file *file,
+				struct iov_iter *i, loff_t pos)
 {
-	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	struct iov_iter i;
-
-	iov_iter_init(&i, iov, nr_segs, count, written);
+	struct inode *inode = mapping->host;
+	long status = 0;
+	ssize_t written = 0;
 
 	do {
 		struct page *src_page;
@@ -2041,7 +2123,7 @@ generic_file_buffered_write(struct kiocb
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
-						iov_iter_count(&i));
+						iov_iter_count(i));
 
 		/*
 		 * a non-NULL src_page indicates that we're doing the
@@ -2059,7 +2141,7 @@ generic_file_buffered_write(struct kiocb
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(&i))) {
+		if (unlikely(iov_iter_fault_in_readable(i))) {
 			status = -EFAULT;
 			break;
 		}
@@ -2090,7 +2172,7 @@ generic_file_buffered_write(struct kiocb
 			 * same reason as we can't take a page fault with a
 			 * page locked (as explained below).
 			 */
-			copied = iov_iter_copy_from_user(src_page, &i,
+			copied = iov_iter_copy_from_user(src_page, i,
 								offset, bytes);
 			if (unlikely(copied == 0)) {
 				status = -EFAULT;
@@ -2115,7 +2197,6 @@ generic_file_buffered_write(struct kiocb
 				page_cache_release(src_page);
 				continue;
 			}
-
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
@@ -2137,7 +2218,7 @@ generic_file_buffered_write(struct kiocb
 			 * really matter.
 			 */
 			pagefault_disable();
-			copied = iov_iter_copy_from_user_atomic(page, &i,
+			copied = iov_iter_copy_from_user_atomic(page, i,
 								offset, bytes);
 			pagefault_enable();
 		} else {
@@ -2163,9 +2244,9 @@ generic_file_buffered_write(struct kiocb
 		if (src_page)
 			page_cache_release(src_page);
 
-		iov_iter_advance(&i, copied);
-		written += copied;
+		iov_iter_advance(i, copied);
 		pos += copied;
+		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
@@ -2189,13 +2270,117 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-	} while (iov_iter_count(&i));
-	*ppos = pos;
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+static ssize_t generic_perform_write(struct file *file,
+				struct iov_iter *i, loff_t pos)
+{
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	long status = 0;
+	ssize_t written = 0;
+
+	do {
+		struct page *page;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long bytes;	/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
+		void *fsdata;
+
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
+		index = pos >> PAGE_CACHE_SHIFT;
+		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_count(i));
+
+again:
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
+		 */
+		if (unlikely(iov_iter_fault_in_readable(i))) {
+			status = -EFAULT;
+			break;
+		}
+
+		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+						&page, &fsdata);
+		if (unlikely(status))
+			break;
+
+		pagefault_disable();
+		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		pagefault_enable();
+		flush_dcache_page(page);
+
+		status = a_ops->write_end(file, mapping, pos, bytes, copied,
+						page, fsdata);
+		if (unlikely(status < 0))
+			break;
+		copied = status;
+
+		cond_resched();
+
+		if (unlikely(copied == 0)) {
+			/*
+			 * If we were unable to copy any data at all, we must
+			 * fall back to a single segment length write.
+			 *
+			 * If we didn't fallback here, we could livelock
+			 * because not all segments in the iov can be copied at
+			 * once without a pagefault.
+			 */
+			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_single_seg_count(i));
+			goto again;
+		}
+		iov_iter_advance(i, copied);
+		pos += copied;
+		written += copied;
+
+		balance_dirty_pages_ratelimited(mapping);
+
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos, loff_t *ppos,
+		size_t count, ssize_t written)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct inode *inode = mapping->host;
+	ssize_t status;
+	struct iov_iter i;
+
+	iov_iter_init(&i, iov, nr_segs, count, written);
+	if (a_ops->write_begin)
+		status = generic_perform_write(file, &i, pos);
+	else
+		status = generic_perform_write_2copy(file, &i, pos);
 
-	/*
-	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
-	 */
 	if (likely(status >= 0)) {
+		written += status;
+		*ppos = pos + status;
+
+		/*
+		 * For now, when the user asks for O_SYNC, we'll actually give
+		 * O_DSYNC
+		 */
 		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 			if (!a_ops->writepage || !is_sync_kiocb(iocb))
 				status = generic_osync_inode(inode, mapping,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1757,6 +1757,52 @@ recover:
 	goto done;
 }
 
+/*
+ * If a page has any new buffers, zero them out here, and mark them uptodate
+ * and dirty so they'll be written out (in order to prevent uninitialised
+ * block data from leaking). And clear the new bit.
+ */
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+	unsigned int block_start, block_end;
+	struct buffer_head *head, *bh;
+
+	BUG_ON(!PageLocked(page));
+	if (!page_has_buffers(page))
+		return;
+
+	bh = head = page_buffers(page);
+	block_start = 0;
+	do {
+		block_end = block_start + bh->b_size;
+
+		if (buffer_new(bh)) {
+			if (block_end > from && block_start < to) {
+				if (!PageUptodate(page)) {
+					unsigned start, end;
+					void *kaddr;
+
+					start = max(from, block_start);
+					end = min(to, block_end);
+
+					kaddr = kmap_atomic(page, KM_USER0);
+					memset(kaddr+start, 0, end - start);
+					flush_dcache_page(page);
+					kunmap_atomic(kaddr, KM_USER0);
+					set_buffer_uptodate(bh);
+				}
+
+				clear_buffer_new(bh);
+				mark_buffer_dirty(bh);
+			}
+		}
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+}
+EXPORT_SYMBOL(page_zero_new_buffers);
+
 static int __block_prepare_write(struct inode *inode, struct page *page,
 		unsigned from, unsigned to, get_block_t *get_block)
 {
@@ -1841,43 +1887,8 @@ static int __block_prepare_write(struct 
 		if (!buffer_uptodate(*wait_bh))
 			err = -EIO;
 	}
-	if (!err) {
-		bh = head;
-		do {
-			if (buffer_new(bh))
-				clear_buffer_new(bh);
-		} while ((bh = bh->b_this_page) != head);
-		return 0;
-	}
-	/* Error case: */
-	/*
-	 * Zero out any newly allocated blocks to avoid exposing stale
-	 * data.  If BH_New is set, we know that the block was newly
-	 * allocated in the above loop.
-	 */
-	bh = head;
-	block_start = 0;
-	do {
-		block_end = block_start+blocksize;
-		if (block_end <= from)
-			goto next_bh;
-		if (block_start >= to)
-			break;
-		if (buffer_new(bh)) {
-			void *kaddr;
-
-			clear_buffer_new(bh);
-			kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr+block_start, 0, bh->b_size);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-			set_buffer_uptodate(bh);
-			mark_buffer_dirty(bh);
-		}
-next_bh:
-		block_start = block_end;
-		bh = bh->b_this_page;
-	} while (bh != head);
+	if (unlikely(err))
+		page_zero_new_buffers(page, from, to);
 	return err;
 }
 
@@ -1902,6 +1913,7 @@ static int __block_commit_write(struct i
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
 		}
+		clear_buffer_new(bh);
 	}
 
 	/*
@@ -1916,6 +1928,123 @@ static int __block_commit_write(struct i
 }
 
 /*
+ * block_write_begin takes care of the basic task of block allocation and
+ * bringing partial write blocks uptodate first.
+ *
+ * If *pagep is not NULL, then block_write_begin uses the locked page
+ * at *pagep rather than allocating its own. In this case, the page will
+ * not be unlocked or deallocated on failure.
+ */
+int block_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int status = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned start, end;
+	int ownpage = 0;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	start = pos & (PAGE_CACHE_SIZE - 1);
+	end = start + len;
+
+	page = *pagep;
+	if (page == NULL) {
+		ownpage = 1;
+		page = __grab_cache_page(mapping, index);
+		if (!page) {
+			status = -ENOMEM;
+			goto out;
+		}
+		*pagep = page;
+	} else
+		BUG_ON(!PageLocked(page));
+
+	status = __block_prepare_write(inode, page, start, end, get_block);
+	if (unlikely(status)) {
+		ClearPageUptodate(page);
+
+		if (ownpage) {
+			unlock_page(page);
+			page_cache_release(page);
+
+			/*
+			 * prepare_write() may have instantiated a few blocks
+			 * outside i_size.  Trim these off again. Don't need
+			 * i_size_read because we hold i_mutex.
+			 */
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		}
+		goto out;
+	}
+
+out:
+	return status;
+}
+EXPORT_SYMBOL(block_write_begin);
+
+int block_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	unsigned start;
+
+	start = pos & (PAGE_CACHE_SIZE - 1);
+
+	if (unlikely(copied < len)) {
+		/*
+		 * The buffers that were written will now be uptodate, so we
+		 * don't have to worry about a readpage reading them and
+		 * overwriting a partial write. However if we have encountered
+		 * a short write and only partially written into a buffer, it
+		 * will not be marked uptodate, so a readpage might come in and
+		 * destroy our partial write.
+		 *
+		 * Do the simplest thing, and just treat any short write to a
+		 * non uptodate page as a zero-length write, and force the
+		 * caller to redo the whole thing.
+		 */
+		if (!PageUptodate(page))
+			copied = 0;
+
+		page_zero_new_buffers(page, start+copied, start+len);
+	}
+	flush_dcache_page(page);
+
+	/* This could be a short (even 0-length) commit */
+	__block_commit_write(inode, page, start, start+copied);
+
+	return copied;
+}
+EXPORT_SYMBOL(block_write_end);
+
+int generic_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	unlock_page(page);
+	mark_page_accessed(page); /* XXX: put this in caller? */
+	page_cache_release(page);
+
+	/*
+	 * No need to use i_size_read() here, the i_size
+	 * cannot change under us because we hold i_mutex.
+	 */
+	if (pos+copied > inode->i_size) {
+		i_size_write(inode, pos+copied);
+		mark_inode_dirty(inode);
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL(generic_write_end);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -202,6 +202,16 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_write_begin(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page **, void **, get_block_t*);
+int block_write_end(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page *, void *);
+int generic_write_end(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page *, void *);
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
 int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
 				loff_t *);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -98,6 +98,8 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -348,6 +348,26 @@ int simple_prepare_write(struct file *fi
 	return 0;
 }
 
+int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct page *page;
+	pgoff_t index;
+	unsigned from;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+
+	*pagep = page;
+
+	return simple_prepare_write(file, page, from, from+len);
+}
+
 int simple_commit_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
@@ -366,6 +386,28 @@ int simple_commit_write(struct file *fil
 	return 0;
 }
 
+int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	/* zero the stale part of the page if we did a short copy */
+	if (copied < len) {
+		void *kaddr = kmap_atomic(page, KM_USER0);
+		memset(kaddr + from + copied, 0, len - copied);
+		flush_dcache_page(page);
+		kunmap_atomic(kaddr, KM_USER0);
+	}
+
+	simple_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 /*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
@@ -639,6 +681,8 @@ EXPORT_SYMBOL(dcache_dir_open);
 EXPORT_SYMBOL(dcache_readdir);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
+EXPORT_SYMBOL(simple_write_begin);
+EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_commit_write);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c
+++ linux-2.6/drivers/block/loop.c
@@ -203,14 +203,13 @@ lo_do_transfer(struct loop_device *lo, i
  * do_lo_send_aops - helper for writing data to a loop device
  *
  * This is the fast version for backing filesystems which implement the address
- * space operations prepare_write and commit_write.
+ * space operations write_begin and write_end.
  */
 static int do_lo_send_aops(struct loop_device *lo, struct bio_vec *bvec,
-		int bsize, loff_t pos, struct page *page)
+		int bsize, loff_t pos, struct page *unused)
 {
 	struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *aops = mapping->a_ops;
 	pgoff_t index;
 	unsigned offset, bv_offs;
 	int len, ret;
@@ -222,67 +221,47 @@ static int do_lo_send_aops(struct loop_d
 	len = bvec->bv_len;
 	while (len > 0) {
 		sector_t IV;
-		unsigned size;
+		unsigned size, copied;
 		int transfer_result;
+		struct page *page;
+		void *fsdata;
 
 		IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
 		size = PAGE_CACHE_SIZE - offset;
 		if (size > len)
 			size = len;
-		page = grab_cache_page(mapping, index);
-		if (unlikely(!page))
+
+		ret = pagecache_write_begin(file, mapping, pos, size, 0,
+							&page, &fsdata);
+		if (ret)
 			goto fail;
-		ret = aops->prepare_write(file, page, offset,
-					  offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
+
 		transfer_result = lo_do_transfer(lo, WRITE, page, offset,
 				bvec->bv_page, bv_offs, size, IV);
-		if (unlikely(transfer_result)) {
-			char *kaddr;
+		copied = size;
+		if (unlikely(transfer_result))
+			copied = 0;
+
+		ret = pagecache_write_end(file, mapping, pos, size, copied,
+							page, fsdata);
+		if (ret < 0)
+			goto fail;
+		if (ret < copied)
+			copied = ret;
 
-			/*
-			 * The transfer failed, but we still write the data to
-			 * keep prepare/commit calls balanced.
-			 */
-			printk(KERN_ERR "loop: transfer error block %llu\n",
-			       (unsigned long long)index);
-			kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr + offset, 0, size);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
-		flush_dcache_page(page);
-		ret = aops->commit_write(file, page, offset,
-					 offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
 		if (unlikely(transfer_result))
-			goto unlock;
-		bv_offs += size;
-		len -= size;
+			goto fail;
+
+		bv_offs += copied;
+		len -= copied;
 		offset = 0;
 		index++;
-		pos += size;
-		unlock_page(page);
-		page_cache_release(page);
+		pos += copied;
 	}
 	ret = 0;
 out:
 	mutex_unlock(&mapping->host->i_mutex);
 	return ret;
-unlock:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	ret = -1;
 	goto out;
@@ -316,7 +295,7 @@ static int __do_lo_send_write(struct fil
  * do_lo_send_direct_write - helper for writing data to a loop device
  *
  * This is the fast, non-transforming version for backing filesystems which do
- * not implement the address space operations prepare_write and commit_write.
+ * not implement the address space operations write_begin and write_end.
  * It uses the write file operation which should be present on all writeable
  * filesystems.
  */
@@ -335,7 +314,7 @@ static int do_lo_send_direct_write(struc
  * do_lo_send_write - helper for writing data to a loop device
  *
  * This is the slow, transforming version for filesystems which do not
- * implement the address space operations prepare_write and commit_write.  It
+ * implement the address space operations write_begin and write_end.  It
  * uses the write file operation which should be present on all writeable
  * filesystems.
  *
@@ -774,7 +753,7 @@ static int loop_set_fd(struct loop_devic
 		 */
 		if (!file->f_op->sendfile)
 			goto out_putf;
-		if (aops->prepare_write && aops->commit_write)
+		if (aops->prepare_write || aops->write_begin)
 			lo_flags |= LO_FLAGS_USE_AOPS;
 		if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
 			lo_flags |= LO_FLAGS_READ_ONLY;
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2687,53 +2687,30 @@ int __page_symlink(struct inode *inode, 
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
+	void *fsdata;
 	int err;
 	char *kaddr;
 
 retry:
-	err = -ENOMEM;
-	page = find_or_create_page(mapping, 0, gfp_mask);
-	if (!page)
-		goto fail;
-	err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
+	err = pagecache_write_begin(NULL, mapping, 0, PAGE_CACHE_SIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
 	if (err)
-		goto fail_map;
+		goto fail;
+
 	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(kaddr, symname, len-1);
+	memset(kaddr+len-1, 0, PAGE_CACHE_SIZE-(len-1));
 	kunmap_atomic(kaddr, KM_USER0);
-	err = mapping->a_ops->commit_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
-	if (err)
-		goto fail_map;
-	/*
-	 * Notice that we are _not_ going to block here - end of page is
-	 * unmapped, so this will only try to map the rest of page, see
-	 * that it is unmapped (typically even will not look into inode -
-	 * ->i_size will be enough for everything) and zero it out.
-	 * OTOH it's obviously correct and should make the page up-to-date.
-	 */
-	if (!PageUptodate(page)) {
-		err = mapping->a_ops->readpage(NULL, page);
-		if (err != AOP_TRUNCATED_PAGE)
-			wait_on_page_locked(page);
-	} else {
-		unlock_page(page);
-	}
-	page_cache_release(page);
+
+	err = pagecache_write_end(NULL, mapping, 0, PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+							page, fsdata);
 	if (err < 0)
 		goto fail;
+	if (err < PAGE_CACHE_SIZE)
+		goto retry;
+
 	mark_inode_dirty(inode);
 	return 0;
-fail_map:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	return err;
 }
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -559,7 +559,7 @@ static int pipe_to_file(struct pipe_inod
 	struct address_space *mapping = file->f_mapping;
 	unsigned int offset, this_len;
 	struct page *page;
-	pgoff_t index;
+	void *fsdata;
 	int ret;
 
 	/*
@@ -569,49 +569,16 @@ static int pipe_to_file(struct pipe_inod
 	if (unlikely(ret))
 		return ret;
 
-	index = sd->pos >> PAGE_CACHE_SHIFT;
 	offset = sd->pos & ~PAGE_CACHE_MASK;
 
 	this_len = sd->len;
 	if (this_len + offset > PAGE_CACHE_SIZE)
 		this_len = PAGE_CACHE_SIZE - offset;
 
-find_page:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		ret = -ENOMEM;
-		page = page_cache_alloc_cold(mapping);
-		if (unlikely(!page))
-			goto out_ret;
-
-		/*
-		 * This will also lock the page
-		 */
-		ret = add_to_page_cache_lru(page, mapping, index,
-					    GFP_KERNEL);
-		if (unlikely(ret))
-			goto out;
-	}
-
-	ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
-	if (unlikely(ret)) {
-		loff_t isize = i_size_read(mapping->host);
-
-		if (ret != AOP_TRUNCATED_PAGE)
-			unlock_page(page);
-		page_cache_release(page);
-		if (ret == AOP_TRUNCATED_PAGE)
-			goto find_page;
-
-		/*
-		 * prepare_write() may have instantiated a few blocks
-		 * outside i_size.  Trim these off again.
-		 */
-		if (sd->pos + this_len > isize)
-			vmtruncate(mapping->host, isize);
-
-		goto out_ret;
-	}
+	ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (unlikely(ret))
+		goto out;
 
 	if (buf->page != page) {
 		/*
@@ -621,35 +588,14 @@ find_page:
 		char *dst = kmap_atomic(page, KM_USER1);
 
 		memcpy(dst + offset, src + buf->offset, this_len);
-		flush_dcache_page(page);
 		kunmap_atomic(dst, KM_USER1);
 		buf->ops->unmap(pipe, buf, src);
 	}
 
-	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
-	if (ret) {
-		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			goto find_page;
-		}
-		if (ret < 0)
-			goto out;
-		/*
-		 * Partial write has happened, so 'ret' already initialized by
-		 * number of bytes written, Where is nothing we have to do here.
-		 */
-	} else
-		ret = this_len;
-	/*
-	 * Return the number of bytes written and mark page as
-	 * accessed, we are now done!
-	 */
-	mark_page_accessed(page);
-	balance_dirty_pages_ratelimited(mapping);
+	ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+
 out:
-	page_cache_release(page);
-	unlock_page(page);
-out_ret:
+
 	return ret;
 }
 
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -178,15 +178,18 @@ prototypes:
 locking rules:
 	All except set_page_dirty may block
 
-			BKL	PageLocked(page)
+			BKL	PageLocked(page)	i_sem
 writepage:		no	yes, unlocks (see below)
 readpage:		no	yes, unlocks
 sync_page:		no	maybe
 writepages:		no
 set_page_dirty		no	no
 readpages:		no
-prepare_write:		no	yes
-commit_write:		no	yes
+prepare_write:		no	yes			yes
+commit_write:		no	yes			yes
+write_begin:		no	locks the page		yes
+write_end:		no	yes, unlocks		yes
+perform_write:		no	n/a			yes
 bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -534,6 +534,14 @@ struct address_space_operations {
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+	ssize_t (*perform_write)(struct file *, struct address_space *mapping,
+				struct iov_iter *i, loff_t pos);
 	sector_t (*bmap)(struct address_space *, sector_t);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
@@ -629,6 +637,46 @@ struct address_space_operations {
         operations.  It should avoid returning an error if possible -
         errors should have been handled by prepare_write.
 
+  write_begin: This is intended as a replacement for prepare_write. Called
+        by the generic buffered write code to ask the filesystem to prepare
+        to write len bytes at the given offset in the file. flags is a field
+        for AOP_FLAG_xxx flags, described in include/linux/fs.h.
+
+        The filesystem must return the locked pagecache page for the caller
+        to write into.
+
+        A void * may be returned in fsdata, which then gets passed into
+        write_end.
+
+        Returns < 0 on failure, in which case all cleanup must be done and
+        write_end not called. 0 on success, in which case write_end must
+        be called.
+
+  write_end: After a successful write_begin, and data copy, write_end must
+        be called. len is the original len passed to write_begin, and copied
+        is the amount that was able to be copied (they must be equal if
+        write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
+
+        The filesystem must take care of unlocking the page and dropping its
+        refcount, and updating i_size.
+
+        Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+        that were able to be copied into pagecache.
+
+  perform_write: This is a single-call, bulk version of write_begin/write_end
+        operations. It is only used in the buffered write path (write_begin
+        must still be implemented), and not for in-kernel writes to pagecache.
+        It takes an iov_iter structure, which provides a descriptor for the
+        source data (and has associated iov_iter_xxx helpers to operate on
+        that data). There are also file, mapping, and pos arguments, which
+        specify the destination of the data.
+
+        Returns < 0 on failure if nothing was written out, otherwise returns
+        the number of bytes copied into pagecache.
+
+        fs/libfs.c provides a reasonable template to start with, demonstrating
+        iov_iter routines, and iteration over the destination pagecache.
+
   bmap: called by the VFS to map a logical block offset within object to
   	physical block number. This method is used by the FIBMAP
   	ioctl and for working with swap-files.  To be able to swap to

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 13/44] mm: restore KERNEL_DS optimisations
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (11 preceding siblings ...)
  2007-04-24  1:23 ` [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops Nick Piggin
@ 2007-04-24  1:23 ` Nick Piggin
  2007-04-24 10:43   ` Christoph Hellwig
  2007-04-24  1:24 ` [patch 14/44] implement simple fs aops Nick Piggin
                   ` (30 subsequent siblings)
  43 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Linux Memory Management

[-- Attachment #1: fs-kernel_ds-opt.patch --]
[-- Type: text/plain, Size: 1572 bytes --]

Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path.

This may be a pretty questionable gain in most cases, especially after the
legacy 2copy write path is removed, but it doesn't cost much.

Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
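
For context, a hedged sketch of the kind of kernel-internal caller this
optimisation is aimed at (NFSD being the main real user); example_kernel_write
is an illustrative name, not an existing function:

#include <linux/fs.h>
#include <linux/uaccess.h>

/*
 * Illustrative only: a writer that feeds kernel memory into the buffered
 * write path under KERNEL_DS. Such copies cannot fault, which is why
 * generic_perform_write() may pass AOP_FLAG_UNINTERRUPTIBLE down to
 * ->write_begin for them.
 */
static ssize_t example_kernel_write(struct file *file, const char *buf,
				    size_t count, loff_t *pos)
{
	mm_segment_t old_fs = get_fs();
	ssize_t ret;

	set_fs(KERNEL_DS);
	ret = vfs_write(file, (const char __user *)buf, count, pos);
	set_fs(old_fs);

	return ret;
}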

 mm/filemap.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2157,7 +2157,7 @@ static ssize_t generic_perform_write_2co
 		 * cannot take a pagefault with the destination page locked.
 		 * So pin the source page to copy it.
 		 */
-		if (!PageUptodate(page)) {
+		if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
 			unlock_page(page);
 
 			src_page = alloc_page(GFP_KERNEL);
@@ -2282,6 +2282,13 @@ static ssize_t generic_perform_write(str
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
+	unsigned int flags = 0;
+
+	/*
+	 * Copies from kernel address space cannot fail (NFSD is a big user).
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
 	do {
 		struct page *page;
@@ -2313,7 +2320,7 @@ again:
 			break;
 		}
 
-		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
 		if (unlikely(status))
 			break;

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 14/44] implement simple fs aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (12 preceding siblings ...)
  2007-04-24  1:23 ` [patch 13/44] mm: restore KERNEL_DS optimisations Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 15/44] block_dev convert to new aops Nick Piggin
                   ` (29 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-simple-aops.patch --]
[-- Type: text/plain, Size: 6104 bytes --]

Implement new aops for some of the simpler filesystems.

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/configfs/inode.c   |    4 ++--
 fs/hugetlbfs/inode.c  |   16 ++++++++++------
 fs/ramfs/file-mmu.c   |    4 ++--
 fs/ramfs/file-nommu.c |    4 ++--
 fs/sysfs/inode.c      |    4 ++--
 mm/shmem.c            |   35 ++++++++++++++++++++++++++++-------
 6 files changed, 46 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1109,7 +1109,7 @@ static int shmem_getpage(struct inode *i
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
 	 * in under swappage, which is then assigned to filepage.
-	 * But shmem_prepare_write passes in a locked filepage,
+	 * But shmem_write_begin passes in a locked filepage,
 	 * which may be found not uptodate by other callers too,
 	 * and may need to be copied from the swappage read in.
 	 */
@@ -1454,14 +1454,35 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_prepare_write, but it
+ * Normally tmpfs makes no use of shmem_write_begin, but it
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+shmem_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	*pagep = NULL;
+	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+}
+
+static int
+shmem_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
-	return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+	struct inode *inode = mapping->host;
+
+	set_page_dirty(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+
+	if (pos+copied > inode->i_size)
+		i_size_write(inode, pos+copied);
+
+	return copied;
 }
 
 static ssize_t
@@ -2358,8 +2379,8 @@ static const struct address_space_operat
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-	.prepare_write	= shmem_prepare_write,
-	.commit_write	= simple_commit_write,
+	.write_begin	= shmem_write_begin,
+	.write_end	= shmem_write_end,
 #endif
 	.migratepage	= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -40,8 +40,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -20,8 +20,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
Index: linux-2.6/fs/ramfs/file-mmu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-mmu.c
+++ linux-2.6/fs/ramfs/file-mmu.c
@@ -29,8 +29,8 @@
 
 const struct address_space_operations ramfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write,
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de
 
 const struct address_space_operations ramfs_aops = {
 	.readpage		= simple_readpage,
-	.prepare_write		= simple_prepare_write,
-	.commit_write		= simple_commit_write,
+	.write_begin		= simple_write_begin,
+	.write_end		= simple_write_end,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -159,15 +159,19 @@ static int hugetlbfs_readpage(struct fil
 	return -EINVAL;
 }
 
-static int hugetlbfs_prepare_write(struct file *file,
-			struct page *page, unsigned offset, unsigned to)
+static int hugetlbfs_write_begin(struct file *file,
+			struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
 	return -EINVAL;
 }
 
-static int hugetlbfs_commit_write(struct file *file,
-			struct page *page, unsigned offset, unsigned to)
+static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
+	BUG();
 	return -EINVAL;
 }
 
@@ -539,8 +543,8 @@ static void hugetlbfs_destroy_inode(stru
 
 static const struct address_space_operations hugetlbfs_aops = {
 	.readpage	= hugetlbfs_readpage,
-	.prepare_write	= hugetlbfs_prepare_write,
-	.commit_write	= hugetlbfs_commit_write,
+	.write_begin	= hugetlbfs_write_begin,
+	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 15/44] block_dev convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (13 preceding siblings ...)
  2007-04-24  1:24 ` [patch 14/44] implement simple fs aops Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 16/44] rd " Nick Piggin
                   ` (28 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-blkdev-aops.patch --]
[-- Type: text/plain, Size: 1799 bytes --]

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/block_dev.c |   26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -378,14 +378,26 @@ static int blkdev_readpage(struct file *
 	return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return block_prepare_write(page, from, to, blkdev_get_block);
+static int blkdev_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				blkdev_get_block);
 }
 
-static int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int blkdev_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	return block_commit_write(page, from, to);
+	int ret;
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return ret;
 }
 
 /*
@@ -1333,8 +1345,8 @@ const struct address_space_operations de
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= blkdev_prepare_write,
-	.commit_write	= blkdev_commit_write,
+	.write_begin	= blkdev_write_begin,
+	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
 	.direct_IO	= blkdev_direct_IO,
 };

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 16/44] rd convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (14 preceding siblings ...)
  2007-04-24  1:24 ` [patch 15/44] block_dev convert to new aops Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24 10:46   ` Christoph Hellwig
  2007-04-24  1:24 ` [patch 17/44] ext2 " Nick Piggin
                   ` (27 subsequent siblings)
  43 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-rd-aops.patch --]
[-- Type: text/plain, Size: 5768 bytes --]

Also clean up various little things.

I've removed the comment from akpm: now that make_page_uptodate is only
called from two places, it is easy to see that the buffers are in an
uptodate state at the time of the call. It was actually fine before my patch
as well, because the memset is equivalent to reading from disk; however, it
is now more explicit where the updates come from.

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 drivers/block/rd.c |  125 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 73 insertions(+), 52 deletions(-)

Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -104,50 +104,60 @@ static void make_page_uptodate(struct pa
 		struct buffer_head *head = bh;
 
 		do {
-			if (!buffer_uptodate(bh)) {
-				memset(bh->b_data, 0, bh->b_size);
-				/*
-				 * akpm: I'm totally undecided about this.  The
-				 * buffer has just been magically brought "up to
-				 * date", but nobody should want to be reading
-				 * it anyway, because it hasn't been used for
-				 * anything yet.  It is still in a "not read
-				 * from disk yet" state.
-				 *
-				 * But non-uptodate buffers against an uptodate
-				 * page are against the rules.  So do it anyway.
-				 */
+			if (!buffer_uptodate(bh))
 				 set_buffer_uptodate(bh);
-			}
 		} while ((bh = bh->b_this_page) != head);
-	} else {
-		memset(page_address(page), 0, PAGE_CACHE_SIZE);
 	}
-	flush_dcache_page(page);
 	SetPageUptodate(page);
 }
 
 static int ramdisk_readpage(struct file *file, struct page *page)
 {
-	if (!PageUptodate(page))
+	if (!PageUptodate(page)) {
+		memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
 		make_page_uptodate(page);
+	}
 	unlock_page(page);
 	return 0;
 }
 
-static int ramdisk_prepare_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
-	if (!PageUptodate(page))
-		make_page_uptodate(page);
+static int ramdisk_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct page *page;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 	return 0;
 }
 
-static int ramdisk_commit_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
+static int ramdisk_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	if (!PageUptodate(page)) {
+		if (copied != PAGE_CACHE_SIZE) {
+			void *dst;
+			unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+			unsigned to = from + copied;
+
+			dst = kmap_atomic(page, KM_USER0);
+			memset(dst, 0, from);
+			memset(dst + to, 0, PAGE_CACHE_SIZE - to);
+			flush_dcache_page(page);
+			kunmap_atomic(dst, KM_USER0);
+		}
+		make_page_uptodate(page);
+	}
+
 	set_page_dirty(page);
-	return 0;
+	unlock_page(page);
+	page_cache_release(page);
+	return copied;
 }
 
 /*
@@ -191,8 +201,8 @@ static int ramdisk_set_page_dirty(struct
 
 static const struct address_space_operations ramdisk_aops = {
 	.readpage	= ramdisk_readpage,
-	.prepare_write	= ramdisk_prepare_write,
-	.commit_write	= ramdisk_commit_write,
+	.write_begin	= ramdisk_write_begin,
+	.write_end	= ramdisk_write_end,
 	.writepage	= ramdisk_writepage,
 	.set_page_dirty	= ramdisk_set_page_dirty,
 	.writepages	= ramdisk_writepages,
@@ -201,13 +211,14 @@ static const struct address_space_operat
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
 				struct address_space *mapping)
 {
-	pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
+	loff_t pos = sector << 9;
 	unsigned int vec_offset = vec->bv_offset;
-	int offset = (sector << 9) & ~PAGE_CACHE_MASK;
 	int size = vec->bv_len;
 	int err = 0;
 
 	do {
+		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+		unsigned offset = pos & ~PAGE_CACHE_MASK;
 		int count;
 		struct page *page;
 		char *src;
@@ -216,40 +227,50 @@ static int rd_blkdev_pagecache_IO(int rw
 		count = PAGE_CACHE_SIZE - offset;
 		if (count > size)
 			count = size;
-		size -= count;
-
-		page = grab_cache_page(mapping, index);
-		if (!page) {
-			err = -ENOMEM;
-			goto out;
-		}
 
-		if (!PageUptodate(page))
-			make_page_uptodate(page);
+		if (rw == WRITE) {
+			err = pagecache_write_begin(NULL, mapping, pos, count,
+							0, &page, NULL);
+			if (err)
+				goto out;
 
-		index++;
+			src = kmap_atomic(vec->bv_page, KM_USER0) + vec_offset;
+			dst = kmap_atomic(page, KM_USER1) + offset;
+		} else {
+again:
+			page = __grab_cache_page(mapping, index);
+			if (!page) {
+				err = -ENOMEM;
+				goto out;
+			}
+			if (!PageUptodate(page)) {
+				mapping->a_ops->readpage(NULL, page);
+				goto again;
+			}
 
-		if (rw == READ) {
 			src = kmap_atomic(page, KM_USER0) + offset;
 			dst = kmap_atomic(vec->bv_page, KM_USER1) + vec_offset;
-		} else {
-			src = kmap_atomic(vec->bv_page, KM_USER0) + vec_offset;
-			dst = kmap_atomic(page, KM_USER1) + offset;
 		}
-		offset = 0;
-		vec_offset += count;
+
 
 		memcpy(dst, src, count);
 
 		kunmap_atomic(src, KM_USER0);
 		kunmap_atomic(dst, KM_USER1);
 
-		if (rw == READ)
+		if (rw == READ) {
 			flush_dcache_page(vec->bv_page);
-		else
-			set_page_dirty(page);
-		unlock_page(page);
-		put_page(page);
+			unlock_page(page);
+			page_cache_release(page);
+		} else {
+			flush_dcache_page(page);
+			pagecache_write_end(NULL, mapping, pos, count,
+							count, page, NULL);
+		}
+
+		pos += count;
+		vec_offset += count;
+		size -= count;
 	} while (size);
 
  out:

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 17/44] ext2 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (15 preceding siblings ...)
  2007-04-24  1:24 ` [patch 16/44] rd " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 18/44] ext3 " Nick Piggin
                   ` (26 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, linux-ext4

[-- Attachment #1: fs-ext2-aops.patch --]
[-- Type: text/plain, Size: 7165 bytes --]

Cc: linux-ext4@vger.kernel.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/ext2/dir.c   |   47 +++++++++++++++++++++++++++++------------------
 fs/ext2/ext2.h  |    3 +++
 fs/ext2/inode.c |   24 +++++++++++++-----------
 3 files changed, 45 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
 	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,ext2_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return nobh_prepare_write(page,from,to,ext2_get_block);
+	*pagep = NULL;
+	return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= ext2_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_nobh_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_nobh_prepare_write,
-	.commit_write		= nobh_commit_write,
+	/* XXX: todo */
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,6 +22,7 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
@@ -61,12 +62,14 @@ ext2_last_byte(struct inode *inode, unsi
 	return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
+
 	dir->i_version++;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -412,16 +415,18 @@ ino_t ext2_inode_by_name(struct inode * 
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 			struct page *page, struct inode *inode)
 {
-	unsigned from = (char *) de - (char *) page_address(page);
-	unsigned to = from + le16_to_cpu(de->rec_len);
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *) de - (char *) page_address(page);
+	unsigned len = le16_to_cpu(de->rec_len);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ext2_write_begin(NULL, page->mapping, pos, len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = cpu_to_le32(inode->i_ino);
-	ext2_set_de_type (de, inode);
-	err = ext2_commit_chunk(page, from, to);
+	ext2_set_de_type(de, inode);
+	err = ext2_commit_chunk(page, pos, len);
 	ext2_put_page(page);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
@@ -444,7 +449,7 @@ int ext2_add_link (struct dentry *dentry
 	unsigned long npages = dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 
 	/*
@@ -497,9 +502,10 @@ int ext2_add_link (struct dentry *dentry
 	return -EINVAL;
 
 got_it:
-	from = (char*)de - (char*)page_address(page);
-	to = from + rec_len;
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	pos = (page->index << PAGE_CACHE_SHIFT) +
+		(char*)de - (char*)page_address(page);
+	err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+							&page, NULL);
 	if (err)
 		goto out_unlock;
 	if (de->inode) {
@@ -509,10 +515,10 @@ got_it:
 		de = de1;
 	}
 	de->name_len = namelen;
-	memcpy (de->name, name, namelen);
+	memcpy(de->name, name, namelen);
 	de->inode = cpu_to_le32(inode->i_ino);
 	ext2_set_de_type (de, inode);
-	err = ext2_commit_chunk(page, from, to);
+	err = ext2_commit_chunk(page, pos, rec_len);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
 	mark_inode_dirty(dir);
@@ -537,6 +543,7 @@ int ext2_delete_entry (struct ext2_dir_e
 	char *kaddr = page_address(page);
 	unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
 	unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+	loff_t pos;
 	ext2_dirent * pde = NULL;
 	ext2_dirent * de = (ext2_dirent *) (kaddr + from);
 	int err;
@@ -553,13 +560,15 @@ int ext2_delete_entry (struct ext2_dir_e
 	}
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
+	pos = (page->index << PAGE_CACHE_SHIFT) + from;
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ext2_write_begin(NULL, page->mapping, pos, to - from, 0,
+							&page, NULL);
 	BUG_ON(err);
 	if (pde)
-		pde->rec_len = cpu_to_le16(to-from);
+		pde->rec_len = cpu_to_le16(to - from);
 	dir->inode = 0;
-	err = ext2_commit_chunk(page, from, to);
+	err = ext2_commit_chunk(page, pos, to - from);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
 	EXT2_I(inode)->i_flags &= ~EXT2_BTREE_FL;
 	mark_inode_dirty(inode);
@@ -582,7 +591,9 @@ int ext2_make_empty(struct inode *inode,
 
 	if (!page)
 		return -ENOMEM;
-	err = mapping->a_ops->prepare_write(NULL, page, 0, chunk_size);
+
+	err = __ext2_write_begin(NULL, page->mapping, 0, chunk_size, 0,
+							&page, NULL);
 	if (err) {
 		unlock_page(page);
 		goto fail;
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h
+++ linux-2.6/fs/ext2/ext2.h
@@ -133,6 +133,9 @@ extern int ext2_get_block(struct inode *
 extern void ext2_truncate (struct inode *);
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata);
 
 /* ioctl.c */
 extern int ext2_ioctl (struct inode *, struct file *, unsigned int,

-- 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 18/44] ext3 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (16 preceding siblings ...)
  2007-04-24  1:24 ` [patch 17/44] ext2 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 19/44] ext4 " Nick Piggin
                   ` (25 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, linux-ext4, Badari Pulavarty

[-- Attachment #1: fs-ext3-aops.patch --]
[-- Type: text/plain, Size: 8747 bytes --]

Cc: linux-ext4@vger.kernel.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>


Various fixes and improvements

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/ext3/inode.c |  136 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 88 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h
 	return ext3_journal_get_write_access(handle, bh);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned from, to;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 
 retry:
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
 	handle = ext3_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
 		ret = PTR_ERR(handle);
 		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext3_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext3_get_block);
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext3_get_block);
 	if (ret)
-		goto prepare_write_failed;
+		goto write_begin_failed;
 
 	if (ext3_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+write_begin_failed:
+	if (ret) {
 		ext3_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+	}
 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
 	return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
 	int err = journal_dirty_data(handle, bh);
 	if (err)
 		ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext3_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = pos + copied;
 		if (new_i_size > EXT3_I(inode)->i_disksize)
 			EXT3_I(inode)->i_disksize = new_i_size;
-		ret = generic_commit_write(file, page, from, to);
+		copied = generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+		if (copied < 0)
+			ret = copied;
+	} else {
+		unlock_page(page);
+		page_cache_release(page);
 	}
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext3_writeback_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext3_writeback_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = pos + copied;
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
 
-	if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-		ret = nobh_commit_write(file, page, from, to);
-	else
-		ret = generic_commit_write(file, page, from, to);
+	copied = generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	if (copied < 0)
+		ret = copied;
 
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext3_journalled_commit_write(struct file *file,
-			struct page *page, unsigned from, unsigned to)
+static int ext3_journalled_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	int partial = 0;
-	loff_t pos;
+	unsigned from, to;
 
-	/*
-	 * Here we duplicate the generic_commit_write() functionality
-	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
+	if (copied < len) {
+		if (!PageUptodate(page))
+			copied = 0;
+		page_zero_new_buffers(page, from+copied, to);
+	}
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
-				to, &partial, commit_write_fn);
+				to, &partial, write_end_fn);
 	if (!partial)
 		SetPageUptodate(page);
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
+	unlock_page(page);
+	page_cache_release(page);
+	if (pos+copied > inode->i_size)
+		i_size_write(inode, pos+copied);
 	EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
 	if (inode->i_size > EXT3_I(inode)->i_disksize) {
 		EXT3_I(inode)->i_disksize = inode->i_size;
@@ -1285,10 +1324,11 @@ static int ext3_journalled_commit_write(
 		if (!ret)
 			ret = ret2;
 	}
+
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
 /*
@@ -1546,7 +1586,7 @@ static int ext3_journalled_writepage(str
 			PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
 
 		err = walk_page_buffers(handle, page_buffers(page), 0,
-				PAGE_CACHE_SIZE, NULL, commit_write_fn);
+				PAGE_CACHE_SIZE, NULL, write_end_fn);
 		if (ret == 0)
 			ret = err;
 		EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
@@ -1706,8 +1746,8 @@ static const struct address_space_operat
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_ordered_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext3_prepare_write,
-	.commit_write	= ext3_ordered_commit_write,
+	.write_begin	= ext3_write_begin,
+	.write_end	= ext3_ordered_write_end,
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
@@ -1720,8 +1760,8 @@ static const struct address_space_operat
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_writeback_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext3_prepare_write,
-	.commit_write	= ext3_writeback_commit_write,
+	.write_begin	= ext3_write_begin,
+	.write_end	= ext3_writeback_write_end,
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
@@ -1734,8 +1774,8 @@ static const struct address_space_operat
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_journalled_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext3_prepare_write,
-	.commit_write	= ext3_journalled_commit_write,
+	.write_begin	= ext3_write_begin,
+	.write_end	= ext3_journalled_write_end,
 	.set_page_dirty	= ext3_journalled_set_page_dirty,
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,

-- 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 19/44] ext4 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (17 preceding siblings ...)
  2007-04-24  1:24 ` [patch 18/44] ext3 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 20/44] xfs " Nick Piggin
                   ` (24 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, linux-ext4, Badari Pulavarty

[-- Attachment #1: fs-ext4-aops.patch --]
[-- Type: text/plain, Size: 8845 bytes --]

Cc: linux-ext4@vger.kernel.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/ext4/inode.c |  147 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
 	return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+ 	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+ 	struct page *page;
+ 	pgoff_t index;
+ 	unsigned from, to;
+
+ 	index = pos >> PAGE_CACHE_SHIFT;
+ 	from = pos & (PAGE_CACHE_SIZE - 1);
+ 	to = from + len;
 
 retry:
-	handle = ext4_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
+ 	page = __grab_cache_page(mapping, index);
+ 	if (!page)
+ 		return -ENOMEM;
+ 	*pagep = page;
+
+  	handle = ext4_journal_start(inode, needed_blocks);
+  	if (IS_ERR(handle)) {
+ 		unlock_page(page);
+ 		page_cache_release(page);
+  		ret = PTR_ERR(handle);
+  		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext4_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext4_get_block);
-	if (ret)
-		goto prepare_write_failed;
 
-	if (ext4_should_journal_data(inode)) {
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext4_get_block);
+
+	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+
+	if (ret) {
 		ext4_journal_stop(handle);
+ 		unlock_page(page);
+ 		page_cache_release(page);
+	}
+
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
 	int err = jbd2_journal_dirty_data(handle, bh);
 	if (err)
 		ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext4_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = pos + copied;
 		if (new_i_size > EXT4_I(inode)->i_disksize)
 			EXT4_I(inode)->i_disksize = new_i_size;
-		ret = generic_commit_write(file, page, from, to);
+		copied = generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+		if (copied < 0)
+			ret = copied;
+	} else {
+		unlock_page(page);
+		page_cache_release(page);
 	}
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext4_writeback_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_writeback_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = pos + copied;
 	if (new_i_size > EXT4_I(inode)->i_disksize)
 		EXT4_I(inode)->i_disksize = new_i_size;
 
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_commit_write(file, page, from, to);
-	else
-		ret = generic_commit_write(file, page, from, to);
+	copied = generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	if (copied < 0)
+		ret = copied;
 
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext4_journalled_commit_write(struct file *file,
-			struct page *page, unsigned from, unsigned to)
+static int ext4_journalled_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	int partial = 0;
-	loff_t pos;
+	unsigned from, to;
 
-	/*
-	 * Here we duplicate the generic_commit_write() functionality
-	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
+	if (copied < len) {
+		if (!PageUptodate(page))
+			copied = 0;
+		page_zero_new_buffers(page, from+copied, to);
+	}
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
-				to, &partial, commit_write_fn);
+				to, &partial, write_end_fn);
 	if (!partial)
 		SetPageUptodate(page);
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
+	unlock_page(page);
+	page_cache_release(page);
+	if (pos+copied > inode->i_size)
+		i_size_write(inode, pos+copied);
 	EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
 	if (inode->i_size > EXT4_I(inode)->i_disksize) {
 		EXT4_I(inode)->i_disksize = inode->i_size;
@@ -1284,10 +1322,11 @@ static int ext4_journalled_commit_write(
 		if (!ret)
 			ret = ret2;
 	}
+
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
 /*
@@ -1545,7 +1584,7 @@ static int ext4_journalled_writepage(str
 			PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
 
 		err = walk_page_buffers(handle, page_buffers(page), 0,
-				PAGE_CACHE_SIZE, NULL, commit_write_fn);
+				PAGE_CACHE_SIZE, NULL, write_end_fn);
 		if (ret == 0)
 			ret = err;
 		EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
@@ -1705,8 +1744,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_ordered_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_ordered_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_ordered_write_end,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,
 	.releasepage	= ext4_releasepage,
@@ -1719,8 +1758,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_writeback_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_writeback_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_writeback_write_end,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,
 	.releasepage	= ext4_releasepage,
@@ -1733,8 +1772,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_journalled_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_journalled_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_journalled_write_end,
 	.set_page_dirty	= ext4_journalled_set_page_dirty,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 20/44] xfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (18 preceding siblings ...)
  2007-04-24  1:24 ` [patch 19/44] ext4 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 21/44] fs: new cont helpers Nick Piggin
                   ` (23 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, xfs-masters

[-- Attachment #1: fs-xfs-aops.patch --]
[-- Type: text/plain, Size: 3034 bytes --]

Cc: xfs-masters@oss.sgi.com
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/xfs/linux-2.6/xfs_aops.c |   19 ++++++++++++-------
 fs/xfs/linux-2.6/xfs_lrw.c  |   35 ++++++++++++-----------------------
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1414,13 +1414,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
 	struct file		*file,
-	struct page		*page,
-	unsigned int		from,
-	unsigned int		to)
+	struct address_space	*mapping,
+	loff_t			pos,
+	unsigned		len,
+	unsigned		flags,
+	struct page		**pagep,
+	void			**fsdata)
 {
-	return block_prepare_write(page, from, to, xfs_get_blocks);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+								xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1474,8 +1479,8 @@ const struct address_space_operations xf
 	.sync_page		= block_sync_page,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
-	.prepare_write		= xfs_vm_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= xfs_vm_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= xfs_vm_bmap,
 	.direct_IO		= xfs_vm_direct_IO,
 	.migratepage		= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
 	loff_t			pos,	/* offset in file		*/
 	size_t			count)	/* size of data to zero		*/
 {
-	unsigned		bytes;
 	struct page		*page;
 	struct address_space	*mapping;
 	int			status;
 
 	mapping = ip->i_mapping;
 	do {
-		unsigned long index, offset;
+		unsigned offset, bytes;
+		void *fsdata;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
-		status = -ENOMEM;
-		page = grab_cache_page(mapping, index);
-		if (!page)
-			break;
-
-		status = mapping->a_ops->prepare_write(NULL, page, offset,
-							offset + bytes);
+		status = pagecache_write_begin(NULL, mapping, pos, bytes,
+					AOP_FLAG_UNINTERRUPTIBLE,
+					&page, &fsdata);
 		if (status)
-			goto unlock;
+			break;
 
 		memclear_highpage_flush(page, offset, bytes);
 
-		status = mapping->a_ops->commit_write(NULL, page, offset,
-							offset + bytes);
-		if (!status) {
-			pos += bytes;
-			count -= bytes;
-		}
-
-unlock:
-		unlock_page(page);
-		page_cache_release(page);
-		if (status)
-			break;
+		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+					page, fsdata);
+		WARN_ON(status <= 0); /* can't return less than zero! */
+		pos += bytes;
+		count -= bytes;
+		status = 0;
 	} while (count);
 
 	return (-status);

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 21/44] fs: new cont helpers
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (19 preceding siblings ...)
  2007-04-24  1:24 ` [patch 20/44] xfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 22/44] fat convert to new aops Nick Piggin
                   ` (22 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, hirofumi

[-- Attachment #1: fs-cont-aops.patch --]
[-- Type: text/plain, Size: 11579 bytes --]

Rework the generic block "cont" routines to handle the new aops.
Supporting cont_prepare_write would take quite a lot of code to support,
so remove it instead (and we later convert all filesystems to use it).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.

Cc: hirofumi@mail.parknet.co.jp
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/buffer.c                 |  204 +++++++++++++++++++++-----------------------
 include/linux/buffer_head.h |    5 -
 include/linux/fs.h          |    1 
 mm/filemap.c                |    5 +
 4 files changed, 110 insertions(+), 105 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2027,6 +2027,7 @@ int generic_write_end(struct file *file,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
+	struct inode *inode = mapping->host;
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
 	unlock_page(page);
@@ -2041,6 +2042,8 @@ int generic_write_end(struct file *file,
 		i_size_write(inode, pos+copied);
 		mark_inode_dirty(inode);
 	}
+
+	return copied;
 }
 EXPORT_SYMBOL(generic_write_end);
 
@@ -2142,14 +2145,14 @@ int block_read_full_page(struct page *pa
 }
 
 /* utility function for filesystems that need to do work on expanding
- * truncates.  Uses prepare/commit_write to allow the filesystem to
+ * truncates.  Uses filesystem pagecache writes to allow the filesystem to
  * deal with the hole.  
  */
-static int __generic_cont_expand(struct inode *inode, loff_t size,
-				 pgoff_t index, unsigned int offset)
+int generic_cont_expand_simple(struct inode *inode, loff_t size)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
+	void *fsdata;
 	unsigned long limit;
 	int err;
 
@@ -2162,146 +2165,141 @@ static int __generic_cont_expand(struct 
 	if (size > inode->i_sb->s_maxbytes)
 		goto out;
 
-	err = -ENOMEM;
-	page = grab_cache_page(mapping, index);
-	if (!page)
-		goto out;
-	err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
-	if (err) {
-		/*
-		 * ->prepare_write() may have instantiated a few blocks
-		 * outside i_size.  Trim these off again.
-		 */
-		unlock_page(page);
-		page_cache_release(page);
-		vmtruncate(inode, inode->i_size);
+	err = pagecache_write_begin(NULL, mapping, size, 0,
+				AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
+				&page, &fsdata);
+	if (err)
 		goto out;
-	}
 
-	err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+	err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+	BUG_ON(err > 0);
 
-	unlock_page(page);
-	page_cache_release(page);
-	if (err > 0)
-		err = 0;
 out:
 	return err;
 }
 
 int generic_cont_expand(struct inode *inode, loff_t size)
 {
-	pgoff_t index;
 	unsigned int offset;
 
 	offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
 
 	/* ugh.  in prepare/commit_write, if from==to==start of block, we
-	** skip the prepare.  make sure we never send an offset for the start
-	** of a block
-	*/
+	 * skip the prepare.  make sure we never send an offset for the start
+	 * of a block.
+	 * XXX: actually, this should be handled in those filesystems by
+	 * checking for the AOP_FLAG_CONT_EXPAND flag.
+	 */
 	if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
 		/* caller must handle this extra byte. */
-		offset++;
+		size++;
 	}
-	index = size >> PAGE_CACHE_SHIFT;
-
-	return __generic_cont_expand(inode, size, index, offset);
+	return generic_cont_expand_simple(inode, size);
 }
 
-int generic_cont_expand_simple(struct inode *inode, loff_t size)
+int cont_expand_zero(struct file *file, struct address_space *mapping,
+			loff_t pos, loff_t *bytes)
 {
-	loff_t pos = size - 1;
-	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
-
-	/* prepare/commit_write can handle even if from==to==start of block. */
-	return __generic_cont_expand(inode, size, index, offset);
-}
-
-/*
- * For moronic filesystems that do not allow holes in file.
- * We may have to extend the file.
- */
-
-int cont_prepare_write(struct page *page, unsigned offset,
-		unsigned to, get_block_t *get_block, loff_t *bytes)
-{
-	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
-	struct page *new_page;
-	pgoff_t pgpos;
-	long status;
-	unsigned zerofrom;
 	unsigned blocksize = 1 << inode->i_blkbits;
+	struct page *page;
+	void *fsdata;
+	pgoff_t index, curidx;
+	loff_t curpos;
+	unsigned zerofrom, offset, len;
 	void *kaddr;
+	int err = 0;
 
-	while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
-		status = -ENOMEM;
-		new_page = grab_cache_page(mapping, pgpos);
-		if (!new_page)
-			goto out;
-		/* we might sleep */
-		if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
-			unlock_page(new_page);
-			page_cache_release(new_page);
-			continue;
-		}
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
+	index = pos >> PAGE_CACHE_SHIFT;
+	offset = pos & ~PAGE_CACHE_MASK;
+
+	while (index > (curidx = (curpos = *bytes)>>PAGE_CACHE_SHIFT)) {
+		zerofrom = curpos & ~PAGE_CACHE_MASK;
 		if (zerofrom & (blocksize-1)) {
 			*bytes |= (blocksize-1);
 			(*bytes)++;
 		}
-		status = __block_prepare_write(inode, new_page, zerofrom,
-						PAGE_CACHE_SIZE, get_block);
-		if (status)
-			goto out_unmap;
-		kaddr = kmap_atomic(new_page, KM_USER0);
-		memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
-		flush_dcache_page(new_page);
+		len = PAGE_CACHE_SIZE - zerofrom;
+
+		err = pagecache_write_begin(file, mapping, curpos, len,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+		if (err)
+			goto out;
+		kaddr = kmap_atomic(page, KM_USER0);
+		memset(kaddr+zerofrom, 0, len);
+		flush_dcache_page(page);
 		kunmap_atomic(kaddr, KM_USER0);
-		generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
-		unlock_page(new_page);
-		page_cache_release(new_page);
+		err = pagecache_write_end(file, mapping, curpos, len, len,
+						page, fsdata);
+		if (err < 0)
+			goto out;
+		BUG_ON(err != len);
+		err = 0;
 	}
 
-	if (page->index < pgpos) {
-		/* completely inside the area */
-		zerofrom = offset;
-	} else {
-		/* page covers the boundary, find the boundary offset */
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-
+	/* page covers the boundary, find the boundary offset */
+	if (index == curidx) {
+		zerofrom = curpos & ~PAGE_CACHE_MASK;
 		/* if we will expand the thing last block will be filled */
-		if (to > zerofrom && (zerofrom & (blocksize-1))) {
+		if (offset <= zerofrom) {
+			goto out;
+		}
+		if (zerofrom & (blocksize-1)) {
 			*bytes |= (blocksize-1);
 			(*bytes)++;
 		}
+		len = offset - zerofrom;
 
-		/* starting below the boundary? Nothing to zero out */
-		if (offset <= zerofrom)
-			zerofrom = offset;
-	}
-	status = __block_prepare_write(inode, page, zerofrom, to, get_block);
-	if (status)
-		goto out1;
-	if (zerofrom < offset) {
+		err = pagecache_write_begin(file, mapping, curpos, len,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+		if (err)
+			goto out;
 		kaddr = kmap_atomic(page, KM_USER0);
-		memset(kaddr+zerofrom, 0, offset-zerofrom);
+		memset(kaddr+zerofrom, 0, len);
 		flush_dcache_page(page);
 		kunmap_atomic(kaddr, KM_USER0);
-		__block_commit_write(inode, page, zerofrom, offset);
+		err = pagecache_write_end(file, mapping, curpos, len, len,
+						page, fsdata);
+		if (err < 0)
+			goto out;
+		BUG_ON(err != len);
+		err = 0;
 	}
-	return 0;
-out1:
-	ClearPageUptodate(page);
-	return status;
-
-out_unmap:
-	ClearPageUptodate(new_page);
-	unlock_page(new_page);
-	page_cache_release(new_page);
 out:
-	return status;
+	return err;
+}
+
+/*
+ * For moronic filesystems that do not allow holes in file.
+ * We may have to extend the file.
+ */
+int cont_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block, loff_t *bytes)
+{
+	struct inode *inode = mapping->host;
+	unsigned blocksize = 1 << inode->i_blkbits;
+	unsigned zerofrom;
+	int err;
+
+	err = cont_expand_zero(file, mapping, pos, bytes);
+	if (err)
+		goto out;
+
+	zerofrom = *bytes & ~PAGE_CACHE_MASK;
+	if (pos+len > *bytes && zerofrom & (blocksize-1)) {
+		*bytes |= (blocksize-1);
+		(*bytes)++;
+	}
+
+	*pagep = NULL;
+	err = block_write_begin(file, mapping, pos, len,
+				flags, pagep, fsdata, get_block);
+out:
+	return err;
 }
 
 int block_prepare_write(struct page *page, unsigned from, unsigned to,
@@ -3160,7 +3158,7 @@ EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
-EXPORT_SYMBOL(cont_prepare_write);
+EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
 EXPORT_SYMBOL(file_fsync);
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -213,8 +213,9 @@ int generic_write_end(struct file *, str
 				struct page *, void *);
 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
 int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
-int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
-				loff_t *);
+int cont_write_begin(struct file *, struct address_space *, loff_t,
+			unsigned, unsigned, struct page **, void **,
+			get_block_t *, loff_t *);
 int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -392,6 +392,7 @@ enum positive_aop_returns {
 };
 
 #define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
+#define AOP_FLAG_CONT_EXPAND		0x0002 /* called from cont_expand */
 
 /*
  * oh the beauties of C type declarations.
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1789,6 +1789,7 @@ size_t iov_iter_copy_from_user_atomic(st
 
 	return copied;
 }
+EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
 /*
  * This has the same sideeffects and return value as
@@ -1815,6 +1816,7 @@ size_t iov_iter_copy_from_user(struct pa
 	kunmap(page);
 	return copied;
 }
+EXPORT_SYMBOL(iov_iter_copy_from_user);
 
 static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 {
@@ -1846,6 +1848,7 @@ void iov_iter_advance(struct iov_iter *i
 	__iov_iter_advance_iov(i, bytes);
 	i->count -= bytes;
 }
+EXPORT_SYMBOL(iov_iter_advance);
 
 int iov_iter_fault_in_readable(struct iov_iter *i)
 {
@@ -1853,6 +1856,7 @@ int iov_iter_fault_in_readable(struct io
 	char __user *buf = i->iov->iov_base + i->iov_offset;
 	return fault_in_pages_readable(buf, seglen);
 }
+EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
 /*
  * Return the count of just the current iov_iter segment.
@@ -1865,6 +1869,7 @@ size_t iov_iter_single_seg_count(struct 
 	else
 		return min(i->count, iov->iov_len - i->iov_offset);
 }
+EXPORT_SYMBOL(iov_iter_single_seg_count);
 
 /*
  * Performs necessary checks before doing a write

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 22/44] fat convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (20 preceding siblings ...)
  2007-04-24  1:24 ` [patch 21/44] fs: new cont helpers Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 23/44] adfs " Nick Piggin
                   ` (21 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, hirofumi

[-- Attachment #1: fs-fat-aops.patch --]
[-- Type: text/plain, Size: 2164 bytes --]

Cc: hirofumi@mail.parknet.co.jp
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/fat/inode.c |   27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -140,19 +140,24 @@ static int fat_readpages(struct file *fi
 	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
 }
 
-static int fat_prepare_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int fat_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return cont_prepare_write(page, from, to, fat_get_block,
-				  &MSDOS_I(page->mapping->host)->mmu_private);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				fat_get_block,
+				&MSDOS_I(mapping->host)->mmu_private);
 }
 
-static int fat_commit_write(struct file *file, struct page *page,
-			    unsigned from, unsigned to)
+static int fat_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *pagep, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
-	int err = generic_commit_write(file, page, from, to);
-	if (!err && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
+	struct inode *inode = mapping->host;
+	int err;
+	err = generic_write_end(file, mapping, pos, len, copied, pagep, fsdata);
+	if (!(err < 0) && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
 		inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 		MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
 		mark_inode_dirty(inode);
@@ -201,8 +206,8 @@ static const struct address_space_operat
 	.writepage	= fat_writepage,
 	.writepages	= fat_writepages,
 	.sync_page	= block_sync_page,
-	.prepare_write	= fat_prepare_write,
-	.commit_write	= fat_commit_write,
+	.write_begin	= fat_write_begin,
+	.write_end	= fat_write_end,
 	.direct_IO	= fat_direct_IO,
 	.bmap		= _fat_bmap
 };

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 23/44] adfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (21 preceding siblings ...)
  2007-04-24  1:24 ` [patch 22/44] fat convert to new aops Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 24/44] affs " Nick Piggin
                   ` (20 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, rmk

[-- Attachment #1: fs-adfs-aops.patch --]
[-- Type: text/plain, Size: 1446 bytes --]

Cc: rmk@arm.linux.org.uk
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/adfs/inode.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
 	return block_read_full_page(page, adfs_get_block);
 }
 
-static int adfs_prepare_write(struct file *file, struct page *page, unsigned int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return cont_prepare_write(page, from, to, adfs_get_block,
-		&ADFS_I(page->mapping->host)->mmu_private);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				adfs_get_block,
+				&ADFS_I(mapping->host)->mmu_private);
 }
 
 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
 	.readpage	= adfs_readpage,
 	.writepage	= adfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= adfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= adfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= _adfs_bmap
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 24/44] affs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (22 preceding siblings ...)
  2007-04-24  1:24 ` [patch 23/44] adfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 25/44] hfs " Nick Piggin
                   ` (19 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, zippel

[-- Attachment #1: fs-affs-aops.patch --]
[-- Type: text/plain, Size: 5998 bytes --]

Cc: zippel@linux-m68k.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/affs/file.c |  106 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 58 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c
+++ linux-2.6/fs/affs/file.c
@@ -395,25 +395,33 @@ static int affs_writepage(struct page *p
 {
 	return block_write_full_page(page, affs_get_block, wbc);
 }
+
 static int affs_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page, affs_get_block);
 }
-static int affs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, affs_get_block,
-		&AFFS_I(page->mapping->host)->mmu_private);
+
+static int affs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				affs_get_block,
+				&AFFS_I(mapping->host)->mmu_private);
 }
+
 static sector_t _affs_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,affs_get_block);
 }
+
 const struct address_space_operations affs_aops = {
 	.readpage = affs_readpage,
 	.writepage = affs_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = affs_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = affs_write_begin,
+	.write_end = generic_write_end,
 	.bmap = _affs_bmap
 };
 
@@ -603,58 +611,65 @@ affs_readpage_ofs(struct file *file, str
 	return err;
 }
 
-static int affs_prepare_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	u32 size, offset;
-	u32 tmp;
+static int affs_write_begin_ofs(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+	struct page *page;
+	pgoff_t index;
 	int err = 0;
 
-	pr_debug("AFFS: prepare_write(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to);
-	offset = page->index << PAGE_CACHE_SHIFT;
-	if (offset + from > AFFS_I(inode)->mmu_private) {
-		err = affs_extent_file_ofs(inode, offset + from);
+	pr_debug("AFFS: write_begin(%u, %llu, %llu)\n", (u32)inode->i_ino, (unsigned long long)pos, (unsigned long long)pos + len);
+	if (pos > AFFS_I(inode)->mmu_private) {
+		/* XXX: this probably leaves a too-big i_size in case of
+		 * failure. Should really be updating i_size at write_end time
+		 */
+		err = affs_extent_file_ofs(inode, pos);
 		if (err)
 			return err;
 	}
-	size = inode->i_size;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 
 	if (PageUptodate(page))
 		return 0;
 
-	if (from) {
-		err = affs_do_readpage_ofs(file, page, 0, from);
-		if (err)
-			return err;
-	}
-	if (to < PAGE_CACHE_SIZE) {
-		char *kaddr = kmap_atomic(page, KM_USER0);
-
-		memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-		flush_dcache_page(page);
-		kunmap_atomic(kaddr, KM_USER0);
-		if (size > offset + to) {
-			if (size < offset + PAGE_CACHE_SIZE)
-				tmp = size & ~PAGE_CACHE_MASK;
-			else
-				tmp = PAGE_CACHE_SIZE;
-			err = affs_do_readpage_ofs(file, page, to, tmp);
-		}
+	/* XXX: inefficient but safe in the face of short writes */
+	err = affs_do_readpage_ofs(file, page, 0, PAGE_CACHE_SIZE);
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
 	}
 	return err;
 }
 
-static int affs_commit_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
+static int affs_write_end_ofs(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *bh, *prev_bh;
 	char *data;
 	u32 bidx, boff, bsize;
+	unsigned from, to;
 	u32 tmp;
 	int written;
 
-	pr_debug("AFFS: commit_write(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to);
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = pos + len;
+	/*
+	 * XXX: not sure if this can handle short copies (len < copied), but
+	 * we don't have to, because the page should always be uptodate here,
+	 * due to write_begin.
+	 */
+
+	pr_debug("AFFS: write_end(%u, %llu, %llu)\n", (u32)inode->i_ino, (unsigned long long)pos, (unsigned long long)pos + len);
 	bsize = AFFS_SB(sb)->s_data_blksize;
 	data = page_address(page);
 
@@ -765,8 +780,8 @@ const struct address_space_operations af
 	.readpage = affs_readpage_ofs,
 	//.writepage = affs_writepage_ofs,
 	//.sync_page = affs_sync_page_ofs,
-	.prepare_write = affs_prepare_write_ofs,
-	.commit_write = affs_commit_write_ofs
+	.write_begin = affs_write_begin_ofs,
+	.write_end = affs_write_end_ofs
 };
 
 /* Free any preallocated blocks. */
@@ -809,18 +824,13 @@ affs_truncate(struct inode *inode)
 	if (inode->i_size > AFFS_I(inode)->mmu_private) {
 		struct address_space *mapping = inode->i_mapping;
 		struct page *page;
-		u32 size = inode->i_size - 1;
+		void *fsdata;
+		u32 size = inode->i_size;
 		int res;
 
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size = (size & (PAGE_CACHE_SIZE - 1)) + 1;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
+		res = mapping->a_ops->write_begin(NULL, mapping, size, 0, 0, &page, &fsdata);
 		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
-		unlock_page(page);
-		page_cache_release(page);
+			res = mapping->a_ops->write_end(NULL, mapping, size, 0, 0, page, fsdata);
 		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == AFFS_I(inode)->mmu_private)

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 25/44] hfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (23 preceding siblings ...)
  2007-04-24  1:24 ` [patch 24/44] affs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 26/44] hfsplus " Nick Piggin
                   ` (18 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, zippel

[-- Attachment #1: fs-hfs-aops.patch --]
[-- Type: text/plain, Size: 3099 bytes --]

Cc: zippel@linux-m68k.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/hfs/extent.c |   19 ++++++++-----------
 fs/hfs/inode.c  |   20 ++++++++++++--------
 2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -34,10 +34,14 @@ static int hfs_readpage(struct file *fil
 	return block_read_full_page(page, hfs_get_block);
 }
 
-static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, hfs_get_block,
-				  &HFS_I(page->mapping->host)->phys_size);
+static int hfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hfs_get_block,
+				&HFS_I(mapping->host)->phys_size);
 }
 
 static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
@@ -118,8 +122,8 @@ const struct address_space_operations hf
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
 	.releasepage	= hfs_releasepage,
 };
@@ -128,8 +132,8 @@ const struct address_space_operations hf
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
 	.direct_IO	= hfs_direct_IO,
 	.writepages	= hfs_writepages,
Index: linux-2.6/fs/hfs/extent.c
===================================================================
--- linux-2.6.orig/fs/hfs/extent.c
+++ linux-2.6/fs/hfs/extent.c
@@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino
 	       (long long)HFS_I(inode)->phys_size, inode->i_size);
 	if (inode->i_size > HFS_I(inode)->phys_size) {
 		struct address_space *mapping = inode->i_mapping;
+		void *fsdata;
 		struct page *page;
 		int res;
 
+		/* XXX: Can use generic_cont_expand? */
 		size = inode->i_size - 1;
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size+1, 0,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+		if (!res) {
+			res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+					page, fsdata);
+		}
 		if (res)
 			inode->i_size = HFS_I(inode)->phys_size;
-		unlock_page(page);
-		page_cache_release(page);
-		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFS_I(inode)->phys_size)
 		return;

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 26/44] hfsplus convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (24 preceding siblings ...)
  2007-04-24  1:24 ` [patch 25/44] hfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 27/44] hpfs " Nick Piggin
                   ` (17 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, zippel

[-- Attachment #1: fs-hfsplus-aops.patch --]
[-- Type: text/plain, Size: 3167 bytes --]

Cc: zippel@linux-m68k.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/hfsplus/extents.c |   21 +++++++++------------
 fs/hfsplus/inode.c   |   20 ++++++++++++--------
 2 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -26,10 +26,14 @@ static int hfsplus_writepage(struct page
 	return block_write_full_page(page, hfsplus_get_block, wbc);
 }
 
-static int hfsplus_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, hfsplus_get_block,
-		&HFSPLUS_I(page->mapping->host).phys_size);
+static int hfsplus_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hfsplus_get_block,
+				&HFSPLUS_I(mapping->host).phys_size);
 }
 
 static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
@@ -113,8 +117,8 @@ const struct address_space_operations hf
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfsplus_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfsplus_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
 	.releasepage	= hfsplus_releasepage,
 };
@@ -123,8 +127,8 @@ const struct address_space_operations hf
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfsplus_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfsplus_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
 	.direct_IO	= hfsplus_direct_IO,
 	.writepages	= hfsplus_writepages,
Index: linux-2.6/fs/hfsplus/extents.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/extents.c
+++ linux-2.6/fs/hfsplus/extents.c
@@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
 	if (inode->i_size > HFSPLUS_I(inode).phys_size) {
 		struct address_space *mapping = inode->i_mapping;
 		struct page *page;
-		u32 size = inode->i_size - 1;
+		void *fsdata;
+		u32 size = inode->i_size;
 		int res;
 
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size, 0,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
 		if (res)
-			inode->i_size = HFSPLUS_I(inode).phys_size;
-		unlock_page(page);
-		page_cache_release(page);
+			return;
+		res = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+		if (res < 0)
+			return;
 		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFSPLUS_I(inode).phys_size)

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 27/44] hpfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (25 preceding siblings ...)
  2007-04-24  1:24 ` [patch 26/44] hfsplus " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 28/44] bfs " Nick Piggin
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, mikulas

[-- Attachment #1: fs-hpfs-aops.patch --]
[-- Type: text/plain, Size: 1645 bytes --]

Cc: mikulas@artax.karlin.mff.cuni.cz
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/hpfs/file.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/hpfs/file.c
===================================================================
--- linux-2.6.orig/fs/hpfs/file.c
+++ linux-2.6/fs/hpfs/file.c
@@ -86,25 +86,33 @@ static int hpfs_writepage(struct page *p
 {
 	return block_write_full_page(page,hpfs_get_block, wbc);
 }
+
 static int hpfs_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,hpfs_get_block);
 }
-static int hpfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page,from,to,hpfs_get_block,
-		&hpfs_i(page->mapping->host)->mmu_private);
+
+static int hpfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hpfs_get_block,
+				&hpfs_i(mapping->host)->mmu_private);
 }
+
 static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,hpfs_get_block);
 }
+
 const struct address_space_operations hpfs_aops = {
 	.readpage = hpfs_readpage,
 	.writepage = hpfs_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = hpfs_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = hpfs_write_begin,
+	.write_end = generic_write_end,
 	.bmap = _hpfs_bmap
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 28/44] bfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (26 preceding siblings ...)
  2007-04-24  1:24 ` [patch 27/44] hpfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 29/44] qnx4 " Nick Piggin
                   ` (15 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, tigran

[-- Attachment #1: fs-bfs-aops.patch --]
[-- Type: text/plain, Size: 1341 bytes --]

Cc: tigran@aivazian.fsnet.co.uk
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/bfs/file.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/bfs/file.c
===================================================================
--- linux-2.6.orig/fs/bfs/file.c
+++ linux-2.6/fs/bfs/file.c
@@ -145,9 +145,13 @@ static int bfs_readpage(struct file *fil
 	return block_read_full_page(page, bfs_get_block);
 }
 
-static int bfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int bfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, bfs_get_block);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags,
+					pagep, fsdata, bfs_get_block);
 }
 
 static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
@@ -159,8 +163,8 @@ const struct address_space_operations bf
 	.readpage	= bfs_readpage,
 	.writepage	= bfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= bfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= bfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= bfs_bmap,
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 29/44] qnx4 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (27 preceding siblings ...)
  2007-04-24  1:24 ` [patch 28/44] bfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 30/44] nfs " Nick Piggin
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, al

[-- Attachment #1: fs-qnx4-aops.patch --]
[-- Type: text/plain, Size: 1694 bytes --]

Cc: al@alarsen.net
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/qnx4/inode.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
 	return block_write_full_page(page,qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
-{
-	struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
-	return cont_prepare_write(page, from, to, qnx4_get_block,
-				  &qnx4_inode->mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				qnx4_get_block,
+				&qnx4_inode->mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
 	.readpage	= qnx4_readpage,
 	.writepage	= qnx4_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= qnx4_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= qnx4_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= qnx4_bmap
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 30/44] nfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (28 preceding siblings ...)
  2007-04-24  1:24 ` [patch 29/44] qnx4 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 31/44] smb " Nick Piggin
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, trond.myklebust

[-- Attachment #1: fs-nfs-aops.patch --]
[-- Type: text/plain, Size: 2763 bytes --]

Cc: trond.myklebust@fys.uio.no
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/nfs/file.c |   49 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -282,27 +282,50 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
- * This does the "real" work of the write. The generic routine has
- * allocated the page, locked it, done all the page alignment stuff
- * calculations etc. Now we should just copy the data from user
- * space and write it back to the real medium..
+ * This does the "real" work of the write. We must allocate and lock the
+ * page to be sent back to the generic routine, which then copies the
+ * data from user space.
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	return nfs_flush_incompatible(file, page);
+static int nfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	int ret;
+	pgoff_t index;
+	struct page *page;
+	index = pos >> PAGE_CACHE_SHIFT;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	ret = nfs_flush_incompatible(file, page);
+	if (ret) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return ret;
 }
 
-static int nfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int nfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	long status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	int status;
 
 	lock_kernel();
-	status = nfs_updatepage(file, page, offset, to-offset);
+	status = nfs_updatepage(file, page, offset, copied);
 	unlock_kernel();
-	return status;
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return status < 0 ? status : copied;
 }
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -330,8 +353,8 @@ const struct address_space_operations nf
 	.set_page_dirty = nfs_set_page_dirty,
 	.writepage = nfs_writepage,
 	.writepages = nfs_writepages,
-	.prepare_write = nfs_prepare_write,
-	.commit_write = nfs_commit_write,
+	.write_begin = nfs_write_begin,
+	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 31/44] smb convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (29 preceding siblings ...)
  2007-04-24  1:24 ` [patch 30/44] nfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 32/44] ocfs2: " Nick Piggin
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-smbfs-aops.patch --]
[-- Type: text/plain, Size: 1931 bytes --]

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/smbfs/file.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===================================================================
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -290,29 +290,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page, 
-			     unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }
 
-static int smb_commit_write(struct file *file, struct page *page,
-			    unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
 	int status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 
-	status = -EFAULT;
 	lock_kernel();
-	status = smb_updatepage(file, page, offset, to-offset);
+	status = smb_updatepage(file, page, offset, copied);
 	unlock_kernel();
+
+	if (!status) {
+		if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		status = copied;
+	}
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	return status;
 }
 
 const struct address_space_operations smb_file_aops = {
 	.readpage = smb_readpage,
 	.writepage = smb_writepage,
-	.prepare_write = smb_prepare_write,
-	.commit_write = smb_commit_write
+	.write_begin = smb_write_begin,
+	.write_end = smb_write_end,
 };
 
 /* 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 32/44] ocfs2: convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (30 preceding siblings ...)
  2007-04-24  1:24 ` [patch 31/44] smb " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 33/44] gfs2 " Nick Piggin
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-ocfs2-aops.patch --]
[-- Type: text/plain, Size: 38577 bytes --]

From: Mark Fasheh <mark.fasheh@oracle.com>

Fix up ocfs2 to use ->write_begin and ->write_end. This lets us dump a large
amount of code which was implementing our own write path while preserving
the nice locking rules that were gained by moving away from ->prepare_write.

It makes use of the context back pointer to store information related to the
write which the vfs normally doesn't know about. Most importantly this is an
array of zero'd pages which might have to be written out for an allocating
write. Of note is that I also stick the journal handle on there. Ocfs2 could
use current->journal_info for that, but I think it's much cleaner to just pass
that around as a file system specific context.
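
For anyone not yet familiar with the new hooks, the pattern being described
is roughly the following (an illustrative sketch only, not ocfs2 code: the
foo_* names are made up, and the locking, allocation and journalling steps
that ocfs2 hangs off the same structure are only hinted at in comments; the
real context is struct ocfs2_write_ctxt in the aops.c hunks below):

struct foo_write_ctxt {
	struct page	*page;		/* page handed back to the generic code */
	unsigned int	did_alloc;	/* example of fs-private state */
};

static int foo_write_begin(struct file *file, struct address_space *mapping,
			   loff_t pos, unsigned len, unsigned flags,
			   struct page **pagep, void **fsdata)
{
	struct foo_write_ctxt *wc;
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;

	wc = kzalloc(sizeof(*wc), GFP_NOFS);
	if (!wc)
		return -ENOMEM;

	wc->page = __grab_cache_page(mapping, index);
	if (!wc->page) {
		kfree(wc);
		return -ENOMEM;
	}

	/*
	 * A real filesystem would also take its cluster/meta locks, start a
	 * transaction, and bring the page uptodate (read or zero the parts
	 * outside [pos, pos + len)) before returning.
	 */
	*pagep = wc->page;	/* the generic code copies user data into this */
	*fsdata = wc;		/* handed back to us untouched in ->write_end */
	return 0;
}

static int foo_write_end(struct file *file, struct address_space *mapping,
			 loff_t pos, unsigned len, unsigned copied,
			 struct page *page, void *fsdata)
{
	struct foo_write_ctxt *wc = fsdata;

	/* commit the copied range; i_size update, journalling etc. go here */
	flush_dcache_page(page);
	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
		SetPageUptodate(page);
	set_page_dirty(page);

	unlock_page(page);
	page_cache_release(page);
	kfree(wc);
	return copied;
}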

I tested this on a couple nodes and things seem to be running smoothly.

A couple of notes:

* The ocfs2 write context is probably a bit big. I'm much more concerned
  with readability though as Ocfs2 has much more baggage to carry around
  than other file systems.

* A ton of code was deleted :) The patch adds a bunch too, but that's mostly
  getting the old stuff to flow with ->write_begin. Some assumptions about
  the size of the write that were made with my previous implementation were
  no longer true (this is good).

* I could probably clean this up some more, but I'd be fine if the patch
  went upstream as-is. Diff seems to have mangled this patch file enough
  that it's probably much easier to read once applied.

* This doesn't use ->perform_write (yet), so stuff is still being copied one
  page at a time. I _think_ things are pretty reasonably set up to allow
  larger writes though...
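
To make the "copied one page at a time" point concrete, the loop that the
generic buffered-write path drives over these hooks has roughly the shape
below (a simplified sketch, not the actual generic code: the hypothetical
helper assumes a kernel-space source buffer, while the real path works on
iovecs, prefaults the user source and uses atomic kmaps to avoid the
deadlock this series is about, retries short copies, and throttles dirty
pages):

static ssize_t write_one_page_at_a_time(struct file *file, const char *buf,
					size_t count, loff_t pos)
{
	struct address_space *mapping = file->f_mapping;
	ssize_t written = 0;

	while (count) {
		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
		unsigned bytes = min_t(size_t, PAGE_CACHE_SIZE - offset, count);
		struct page *page;
		void *fsdata;
		char *kaddr;
		int status;

		status = pagecache_write_begin(file, mapping, pos, bytes,
					       0, &page, &fsdata);
		if (status)
			return written ? written : status;

		/* the page comes back locked and prepared; copy the data in */
		kaddr = kmap(page);
		memcpy(kaddr + offset, buf, bytes);
		kunmap(page);
		flush_dcache_page(page);

		/* ->write_end unlocks and releases the page for us */
		status = pagecache_write_end(file, mapping, pos, bytes,
					     bytes, page, fsdata);
		if (status < 0)
			return written ? written : status;

		pos += status;
		buf += status;
		count -= status;
		written += status;
	}

	return written;
}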

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

 fs/ocfs2/aops.c |  779 +++++++++++++++++++++++++++++++-------------------------
 fs/ocfs2/aops.h |   52 ---
 fs/ocfs2/file.c |  246 -----------------
 3 files changed, 453 insertions(+), 624 deletions(-)

Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c
+++ linux-2.6/fs/ocfs2/aops.c
@@ -677,6 +677,8 @@ int ocfs2_map_page_blocks(struct page *p
 	     bh = bh->b_this_page, block_start += bsize) {
 		block_end = block_start + bsize;
 
+		clear_buffer_new(bh);
+
 		/*
 		 * Ignore blocks outside of our i/o range -
 		 * they may belong to unallocated clusters.
@@ -691,9 +693,8 @@ int ocfs2_map_page_blocks(struct page *p
 		 * For an allocating write with cluster size >= page
 		 * size, we always write the entire page.
 		 */
-
-		if (buffer_new(bh))
-			clear_buffer_new(bh);
+		if (new)
+			set_buffer_new(bh);
 
 		if (!buffer_mapped(bh)) {
 			map_bh(bh, inode->i_sb, *p_blkno);
@@ -754,217 +755,187 @@ next_bh:
 	return ret;
 }
 
+#if (PAGE_CACHE_SIZE >= OCFS2_MAX_CLUSTERSIZE)
+#define OCFS2_MAX_CTXT_PAGES	1
+#else
+#define OCFS2_MAX_CTXT_PAGES	(OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE)
+#endif
+
+#define OCFS2_MAX_CLUSTERS_PER_PAGE	(PAGE_CACHE_SIZE / OCFS2_MIN_CLUSTERSIZE)
+
 /*
- * This will copy user data from the buffer page in the splice
- * context.
- *
- * For now, we ignore SPLICE_F_MOVE as that would require some extra
- * communication out all the way to ocfs2_write().
+ * Describe the state of a single cluster to be written to.
  */
-int ocfs2_map_and_write_splice_data(struct inode *inode,
-				  struct ocfs2_write_ctxt *wc, u64 *p_blkno,
-				  unsigned int *ret_from, unsigned int *ret_to)
-{
-	int ret;
-	unsigned int to, from, cluster_start, cluster_end;
-	char *src, *dst;
-	struct ocfs2_splice_write_priv *sp = wc->w_private;
-	struct pipe_buffer *buf = sp->s_buf;
-	unsigned long bytes, src_from;
-	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+struct ocfs2_write_cluster_desc {
+	u32		c_cpos;
+	u32		c_phys;
+	/*
+	 * Give this a unique field because c_phys eventually gets
+	 * filled.
+	 */
+	unsigned	c_new;
+};
 
-	ocfs2_figure_cluster_boundaries(osb, wc->w_cpos, &cluster_start,
-					&cluster_end);
+struct ocfs2_write_ctxt {
+	/* Logical cluster position / len of write */
+	u32				w_cpos;
+	u32				w_clen;
 
-	from = sp->s_offset;
-	src_from = sp->s_buf_offset;
-	bytes = wc->w_count;
+	struct ocfs2_write_cluster_desc	w_desc[OCFS2_MAX_CLUSTERS_PER_PAGE];
 
-	if (wc->w_large_pages) {
-		/*
-		 * For cluster size < page size, we have to
-		 * calculate pos within the cluster and obey
-		 * the rightmost boundary.
-		 */
-		bytes = min(bytes, (unsigned long)(osb->s_clustersize
-				   - (wc->w_pos & (osb->s_clustersize - 1))));
-	}
-	to = from + bytes;
+	/*
+	 * This is true if page_size > cluster_size.
+	 *
+	 * It triggers a set of special cases during write which might
+	 * have to deal with allocating writes to partial pages.
+	 */
+	unsigned int			w_large_pages;
 
-	if (wc->w_this_page_new)
-		ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode,
-					    cluster_start, cluster_end, 1);
-	else
-		ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode,
-					    from, to, 0);
-	if (ret) {
-		mlog_errno(ret);
-		goto out;
+	/*
+	 * Pages involved in this write.
+	 *
+	 * w_target_page is the page being written to by the user.
+	 *
+	 * w_pages is an array of pages which always contains
+	 * w_target_page, and in the case of an allocating write with
+	 * page_size < cluster size, it will contain zero'd and mapped
+	 * pages adjacent to w_target_page which need to be written
+	 * out so that future reads from that region will get
+	 * zeros.
+	 */
+	struct page			*w_pages[OCFS2_MAX_CTXT_PAGES];
+	unsigned int			w_num_pages;
+	struct page			*w_target_page;
+
+	/*
+	 * ocfs2_write_end() uses this to know what the real range to
+	 * write in the target should be.
+	 */
+	unsigned int			w_target_from;
+	unsigned int			w_target_to;
+
+	/*
+	 * We could use journal_current_handle() but this is cleaner,
+	 * IMHO -Mark
+	 */
+	handle_t			*w_handle;
+
+	struct buffer_head		*w_di_bh;
+};
+
+static void ocfs2_free_write_ctxt(struct ocfs2_write_ctxt *wc)
+{
+	int i;
+
+	for(i = 0; i < wc->w_num_pages; i++) {
+		if (wc->w_pages[i] == NULL)
+			continue;
+
+		unlock_page(wc->w_pages[i]);
+		mark_page_accessed(wc->w_pages[i]);
+		page_cache_release(wc->w_pages[i]);
 	}
 
-	BUG_ON(from > PAGE_CACHE_SIZE);
-	BUG_ON(to > PAGE_CACHE_SIZE);
-	BUG_ON(from > osb->s_clustersize);
-	BUG_ON(to > osb->s_clustersize);
-
-	src = buf->ops->map(sp->s_pipe, buf, 1);
-	dst = kmap_atomic(wc->w_this_page, KM_USER1);
-	memcpy(dst + from, src + src_from, bytes);
-	kunmap_atomic(wc->w_this_page, KM_USER1);
-	buf->ops->unmap(sp->s_pipe, buf, src);
+	brelse(wc->w_di_bh);
+	kfree(wc);
+}
 
-	wc->w_finished_copy = 1;
+static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
+				  struct ocfs2_super *osb, loff_t pos,
+				  unsigned len)
+{
+	struct ocfs2_write_ctxt *wc;
 
-	*ret_from = from;
-	*ret_to = to;
-out:
+	wc = kzalloc(sizeof(struct ocfs2_write_ctxt), GFP_NOFS);
+	if (!wc)
+		return -ENOMEM;
 
-	return bytes ? (unsigned int)bytes : ret;
+	wc->w_cpos = pos >> osb->s_clustersize_bits;
+	wc->w_clen = ocfs2_clusters_for_bytes(osb->sb, len);
+
+	if (unlikely(PAGE_CACHE_SHIFT > osb->s_clustersize_bits))
+		wc->w_large_pages = 1;
+	else
+		wc->w_large_pages = 0;
+
+	*wcp = wc;
+
+	return 0;
 }
 
 /*
- * This will copy user data from the iovec in the buffered write
- * context.
+ * Only called when we have a failure during allocating write to write
+ * zero's to the newly allocated region.
  */
-int ocfs2_map_and_write_user_data(struct inode *inode,
-				  struct ocfs2_write_ctxt *wc, u64 *p_blkno,
-				  unsigned int *ret_from, unsigned int *ret_to)
+static void ocfs2_write_failure(struct inode *inode,
+				struct ocfs2_write_ctxt *wc,
+				loff_t user_pos, unsigned user_len)
 {
-	int ret;
-	unsigned int to, from, cluster_start, cluster_end;
-	unsigned long bytes, src_from;
-	char *dst;
-	struct ocfs2_buffered_write_priv *bp = wc->w_private;
-	const struct iovec *cur_iov = bp->b_cur_iov;
-	char __user *buf;
-	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
-
-	ocfs2_figure_cluster_boundaries(osb, wc->w_cpos, &cluster_start,
-					&cluster_end);
-
-	buf = cur_iov->iov_base + bp->b_cur_off;
-	src_from = (unsigned long)buf & ~PAGE_CACHE_MASK;
+	int i;
+	unsigned from, to;
+	struct page *tmppage;
 
-	from = wc->w_pos & (PAGE_CACHE_SIZE - 1);
+	page_zero_new_buffers(wc->w_target_page, user_pos, user_len);
 
-	/*
-	 * This is a lot of comparisons, but it reads quite
-	 * easily, which is important here.
-	 */
-	/* Stay within the src page */
-	bytes = PAGE_SIZE - src_from;
-	/* Stay within the vector */
-	bytes = min(bytes,
-		    (unsigned long)(cur_iov->iov_len - bp->b_cur_off));
-	/* Stay within count */
-	bytes = min(bytes, (unsigned long)wc->w_count);
-	/*
-	 * For clustersize > page size, just stay within
-	 * target page, otherwise we have to calculate pos
-	 * within the cluster and obey the rightmost
-	 * boundary.
-	 */
 	if (wc->w_large_pages) {
-		/*
-		 * For cluster size < page size, we have to
-		 * calculate pos within the cluster and obey
-		 * the rightmost boundary.
-		 */
-		bytes = min(bytes, (unsigned long)(osb->s_clustersize
-				   - (wc->w_pos & (osb->s_clustersize - 1))));
+		from = wc->w_target_from;
+		to = wc->w_target_to;
 	} else {
-		/*
-		 * cluster size > page size is the most common
-		 * case - we just stay within the target page
-		 * boundary.
-		 */
-		bytes = min(bytes, PAGE_CACHE_SIZE - from);
-	}
-
-	to = from + bytes;
-
-	if (wc->w_this_page_new)
-		ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode,
-					    cluster_start, cluster_end, 1);
-	else
-		ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode,
-					    from, to, 0);
-	if (ret) {
-		mlog_errno(ret);
-		goto out;
+		from = 0;
+		to = PAGE_CACHE_SIZE;
 	}
 
-	BUG_ON(from > PAGE_CACHE_SIZE);
-	BUG_ON(to > PAGE_CACHE_SIZE);
-	BUG_ON(from > osb->s_clustersize);
-	BUG_ON(to > osb->s_clustersize);
-
-	dst = kmap(wc->w_this_page);
-	memcpy(dst + from, bp->b_src_buf + src_from, bytes);
-	kunmap(wc->w_this_page);
-
-	/*
-	 * XXX: This is slow, but simple. The caller of
-	 * ocfs2_buffered_write_cluster() is responsible for
-	 * passing through the iovecs, so it's difficult to
-	 * predict what our next step is in here after our
-	 * initial write. A future version should be pushing
-	 * that iovec manipulation further down.
-	 *
-	 * By setting this, we indicate that a copy from user
-	 * data was done, and subsequent calls for this
-	 * cluster will skip copying more data.
-	 */
-	wc->w_finished_copy = 1;
+	for(i = 0; i < wc->w_num_pages; i++) {
+		tmppage = wc->w_pages[i];
 
-	*ret_from = from;
-	*ret_to = to;
-out:
+		if (ocfs2_should_order_data(inode))
+			walk_page_buffers(wc->w_handle, page_buffers(tmppage),
+					  from, to, NULL,
+					  ocfs2_journal_dirty_data);
 
-	return bytes ? (unsigned int)bytes : ret;
+		block_commit_write(tmppage, from, to);
+	}
 }
 
-/*
- * Map, fill and write a page to disk.
- *
- * The work of copying data is done via callback.  Newly allocated
- * pages which don't take user data will be zero'd (set 'new' to
- * indicate an allocating write)
- *
- * Returns a negative error code or the number of bytes copied into
- * the page.
- */
-int ocfs2_write_data_page(struct inode *inode, handle_t *handle,
-			  u64 *p_blkno, struct page *page,
-			  struct ocfs2_write_ctxt *wc, int new)
+static int ocfs2_prepare_page_for_write(struct inode *inode, u64 *p_blkno,
+					struct ocfs2_write_ctxt *wc,
+					struct page *page, u32 cpos,
+					loff_t user_pos, unsigned user_len,
+					int new)
 {
-	int ret, copied = 0;
-	unsigned int from = 0, to = 0;
+	int ret;
+	unsigned int map_from = 0, map_to = 0;
 	unsigned int cluster_start, cluster_end;
-	unsigned int zero_from = 0, zero_to = 0;
+	unsigned int user_data_from = 0, user_data_to = 0;
 
-	ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), wc->w_cpos,
+	ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
 					&cluster_start, &cluster_end);
 
-	if ((wc->w_pos >> PAGE_CACHE_SHIFT) == page->index
-	    && !wc->w_finished_copy) {
-
-		wc->w_this_page = page;
-		wc->w_this_page_new = new;
-		ret = wc->w_write_data_page(inode, wc, p_blkno, &from, &to);
-		if (ret < 0) {
+	if (page == wc->w_target_page) {
+		map_from = user_pos & (PAGE_CACHE_SIZE - 1);
+		map_to = map_from + user_len;
+
+		if (new)
+			ret = ocfs2_map_page_blocks(page, p_blkno, inode,
+						    cluster_start, cluster_end,
+						    new);
+		else
+			ret = ocfs2_map_page_blocks(page, p_blkno, inode,
+						    map_from, map_to, new);
+		if (ret) {
 			mlog_errno(ret);
 			goto out;
 		}
 
-		copied = ret;
-
-		zero_from = from;
-		zero_to = to;
+		user_data_from = map_from;
+		user_data_to = map_to;
 		if (new) {
-			from = cluster_start;
-			to = cluster_end;
+			map_from = cluster_start;
+			map_to = cluster_end;
 		}
+
+		wc->w_target_from = map_from;
+		wc->w_target_to = map_to;
 	} else {
 		/*
 		 * If we haven't allocated the new page yet, we
@@ -973,11 +944,11 @@ int ocfs2_write_data_page(struct inode *
 		 */
 		BUG_ON(!new);
 
-		from = cluster_start;
-		to = cluster_end;
+		map_from = cluster_start;
+		map_to = cluster_end;
 
 		ret = ocfs2_map_page_blocks(page, p_blkno, inode,
-					    cluster_start, cluster_end, 1);
+					    cluster_start, cluster_end, new);
 		if (ret) {
 			mlog_errno(ret);
 			goto out;
@@ -996,108 +967,84 @@ int ocfs2_write_data_page(struct inode *
 	 */
 	if (new && !PageUptodate(page))
 		ocfs2_clear_page_regions(page, OCFS2_SB(inode->i_sb),
-					 wc->w_cpos, zero_from, zero_to);
+					 cpos, user_data_from, user_data_to);
 
 	flush_dcache_page(page);
 
-	if (ocfs2_should_order_data(inode)) {
-		ret = walk_page_buffers(handle,
-					page_buffers(page),
-					from, to, NULL,
-					ocfs2_journal_dirty_data);
-		if (ret < 0)
-			mlog_errno(ret);
-	}
-
-	/*
-	 * We don't use generic_commit_write() because we need to
-	 * handle our own i_size update.
-	 */
-	ret = block_commit_write(page, from, to);
-	if (ret)
-		mlog_errno(ret);
 out:
-
-	return copied ? copied : ret;
+	return ret;
 }
 
 /*
- * Do the actual write of some data into an inode. Optionally allocate
- * in order to fulfill the write.
- *
- * cpos is the logical cluster offset within the file to write at
- *
- * 'phys' is the physical mapping of that offset. a 'phys' value of
- * zero indicates that allocation is required. In this case, data_ac
- * and meta_ac should be valid (meta_ac can be null if metadata
- * allocation isn't required).
+ * This function will only grab one cluster's worth of pages.
  */
-static ssize_t ocfs2_write(struct file *file, u32 phys, handle_t *handle,
-			   struct buffer_head *di_bh,
-			   struct ocfs2_alloc_context *data_ac,
-			   struct ocfs2_alloc_context *meta_ac,
-			   struct ocfs2_write_ctxt *wc)
+static int ocfs2_grab_pages_for_write(struct address_space *mapping,
+				      struct ocfs2_write_ctxt *wc,
+				      u32 cpos, loff_t user_pos, int new)
 {
-	int ret, i, numpages = 1, new;
-	unsigned int copied = 0;
-	u32 tmp_pos;
-	u64 v_blkno, p_blkno;
-	struct address_space *mapping = file->f_mapping;
+	int ret = 0, i;
+	unsigned long start, target_index, index;
 	struct inode *inode = mapping->host;
-	unsigned long index, start;
-	struct page **cpages;
 
-	new = phys == 0 ? 1 : 0;
+	target_index = user_pos >> PAGE_CACHE_SHIFT;
 
 	/*
 	 * Figure out how many pages we'll be manipulating here. For
 	 * non allocating write, we just change the one
 	 * page. Otherwise, we'll need a whole clusters worth.
 	 */
-	if (new)
-		numpages = ocfs2_pages_per_cluster(inode->i_sb);
-
-	cpages = kzalloc(sizeof(*cpages) * numpages, GFP_NOFS);
-	if (!cpages) {
-		ret = -ENOMEM;
-		mlog_errno(ret);
-		return ret;
-	}
-
-	/*
-	 * Fill our page array first. That way we've grabbed enough so
-	 * that we can zero and flush if we error after adding the
-	 * extent.
-	 */
 	if (new) {
-		start = ocfs2_align_clusters_to_page_index(inode->i_sb,
-							   wc->w_cpos);
-		v_blkno = ocfs2_clusters_to_blocks(inode->i_sb, wc->w_cpos);
+		wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb);
+		start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos);
 	} else {
-		start = wc->w_pos >> PAGE_CACHE_SHIFT;
-		v_blkno = wc->w_pos >> inode->i_sb->s_blocksize_bits;
+		wc->w_num_pages = 1;
+		start = target_index;
 	}
 
-	for(i = 0; i < numpages; i++) {
+	for(i = 0; i < wc->w_num_pages; i++) {
 		index = start + i;
 
-		cpages[i] = grab_cache_page(mapping, index);
-		if (!cpages[i]) {
+		wc->w_pages[i] = grab_cache_page(mapping, index);
+		if (!wc->w_pages[i]) {
 			ret = -ENOMEM;
 			mlog_errno(ret);
 			goto out;
 		}
+
+		if (index == target_index)
+			wc->w_target_page = wc->w_pages[i];
 	}
+out:
+	return ret;
+}
+
+/*
+ * Prepare a single cluster of the file for writing.
+ */
+static int ocfs2_write_cluster(struct address_space *mapping,
+			       u32 phys, struct ocfs2_alloc_context *data_ac,
+			       struct ocfs2_alloc_context *meta_ac,
+			       struct ocfs2_write_ctxt *wc, u32 cpos,
+			       loff_t user_pos, unsigned user_len)
+{
+	int ret, i, new;
+	u64 v_blkno, p_blkno;
+	struct inode *inode = mapping->host;
+
+	new = phys == 0 ? 1 : 0;
 
 	if (new) {
+		u32 tmp_pos;
+
 		/*
 		 * This is safe to call with the page locks - it won't take
 		 * any additional semaphores or cluster locks.
 		 */
-		tmp_pos = wc->w_cpos;
+		tmp_pos = cpos;
 		ret = ocfs2_do_extend_allocation(OCFS2_SB(inode->i_sb), inode,
-						 &tmp_pos, 1, di_bh, handle,
-						 data_ac, meta_ac, NULL);
+						 &tmp_pos, 1, wc->w_di_bh,
+						 wc->w_handle, data_ac,
+						 meta_ac, NULL);
 		/*
 		 * This shouldn't happen because we must have already
 		 * calculated the correct meta data allocation required. The
@@ -1114,103 +1061,132 @@ static ssize_t ocfs2_write(struct file *
 			mlog_errno(ret);
 			goto out;
 		}
+
+		v_blkno = ocfs2_clusters_to_blocks(inode->i_sb, cpos);
+	} else {
+		v_blkno = user_pos >> inode->i_sb->s_blocksize_bits;
 	}
 
+	/*
+	 * The only reason this should fail is due to an inability to
+	 * find the extent added.
+	 */
 	ret = ocfs2_extent_map_get_blocks(inode, v_blkno, &p_blkno, NULL,
 					  NULL);
 	if (ret < 0) {
-
-		/*
-		 * XXX: Should we go readonly here?
-		 */
-
-		mlog_errno(ret);
+		ocfs2_error(inode->i_sb, "Corrupt extent for inode %llu, "
+			    "at logical block %llu",
+			    (unsigned long long)OCFS2_I(inode)->ip_blkno,
+			    (unsigned long long)v_blkno);
 		goto out;
 	}
 
 	BUG_ON(p_blkno == 0);
 
-	for(i = 0; i < numpages; i++) {
-		ret = ocfs2_write_data_page(inode, handle, &p_blkno, cpages[i],
-					    wc, new);
-		if (ret < 0) {
-			mlog_errno(ret);
-			goto out;
-		}
+	for(i = 0; i < wc->w_num_pages; i++) {
+		int tmpret;
 
-		copied += ret;
+		tmpret = ocfs2_prepare_page_for_write(inode, &p_blkno, wc,
+						      wc->w_pages[i], cpos,
+						      user_pos, user_len, new);
+		if (tmpret) {
+			mlog_errno(tmpret);
+			if (ret == 0)
+				ret = tmpret;
+		}
 	}
 
+	/*
+	 * We only have cleanup to do in case of allocating write.
+	 */
+	if (ret && new)
+		ocfs2_write_failure(inode, wc, user_pos, user_len);
+
 out:
-	for(i = 0; i < numpages; i++) {
-		unlock_page(cpages[i]);
-		mark_page_accessed(cpages[i]);
-		page_cache_release(cpages[i]);
-	}
-	kfree(cpages);
 
-	return copied ? copied : ret;
+	return ret;
 }
 
-static void ocfs2_write_ctxt_init(struct ocfs2_write_ctxt *wc,
-				  struct ocfs2_super *osb, loff_t pos,
-				  size_t count, ocfs2_page_writer *cb,
-				  void *cb_priv)
+/*
+ * ocfs2_write_end() wants to know which parts of the target page it
+ * should complete the write on. It's easiest to compute them ahead of
+ * time when a more complete view of the write is available.
+ */
+static void ocfs2_set_target_boundaries(struct ocfs2_super *osb,
+					struct ocfs2_write_ctxt *wc,
+					loff_t pos, unsigned len, int alloc)
 {
-	wc->w_count = count;
-	wc->w_pos = pos;
-	wc->w_cpos = wc->w_pos >> osb->s_clustersize_bits;
-	wc->w_finished_copy = 0;
+	struct ocfs2_write_cluster_desc *desc;
 
-	if (unlikely(PAGE_CACHE_SHIFT > osb->s_clustersize_bits))
-		wc->w_large_pages = 1;
-	else
-		wc->w_large_pages = 0;
+	wc->w_target_from = pos & (PAGE_CACHE_SIZE - 1);
+	wc->w_target_to = wc->w_target_from + len;
+
+	if (alloc == 0)
+		return;
+
+	/*
+	 * Allocating write - we may have different boundaries based
+	 * on page size and cluster size.
+	 *
+	 * NOTE: We can no longer compute one value from the other as
+	 * the actual write length and user provided length may be
+	 * different.
+	 */
 
-	wc->w_write_data_page = cb;
-	wc->w_private = cb_priv;
+	if (wc->w_large_pages) {
+		/*
+		 * We only care about the 1st and last cluster within
+		 * our range and whether they are holes or not. Either
+		 * value may be extended out to the start/end of a
+		 * newly allocated cluster.
+		 */
+		desc = &wc->w_desc[0];
+		if (desc->c_new)
+			ocfs2_figure_cluster_boundaries(osb,
+							desc->c_cpos,
+							&wc->w_target_from,
+							NULL);
+
+		desc = &wc->w_desc[wc->w_clen - 1];
+		if (desc->c_new)
+			ocfs2_figure_cluster_boundaries(osb,
+							desc->c_cpos,
+							NULL,
+							&wc->w_target_to);
+	} else {
+		wc->w_target_from = 0;
+		wc->w_target_to = PAGE_CACHE_SIZE;
+	}
 }
 
-/*
- * Write a cluster to an inode. The cluster may not be allocated yet,
- * in which case it will be. This only exists for buffered writes -
- * O_DIRECT takes a more "traditional" path through the kernel.
- *
- * The caller is responsible for incrementing pos, written counts, etc
- *
- * For file systems that don't support sparse files, pre-allocation
- * and page zeroing up until cpos should be done prior to this
- * function call.
- *
- * Callers should be holding i_sem, and the rw cluster lock.
- *
- * Returns the number of user bytes written, or less than zero for
- * error.
- */
-ssize_t ocfs2_buffered_write_cluster(struct file *file, loff_t pos,
-				     size_t count, ocfs2_page_writer *actor,
-				     void *priv)
+static int ocfs2_write_begin(struct file *file, struct address_space *mapping,
+			     loff_t pos, unsigned len, unsigned flags,
+			     struct page **pagep, void **fsdata)
 {
-	int ret, credits = OCFS2_INODE_UPDATE_CREDITS;
-	ssize_t written = 0;
-	u32 phys;
-	struct inode *inode = file->f_mapping->host;
+	int ret, i, credits = OCFS2_INODE_UPDATE_CREDITS;
+	unsigned int num_clusters = 0, clusters_to_alloc = 0;
+	u32 phys = 0;
+	struct ocfs2_write_ctxt *wc;
+	struct inode *inode = mapping->host;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
-	struct buffer_head *di_bh = NULL;
 	struct ocfs2_dinode *di;
 	struct ocfs2_alloc_context *data_ac = NULL;
 	struct ocfs2_alloc_context *meta_ac = NULL;
 	handle_t *handle;
-	struct ocfs2_write_ctxt wc;
+	struct ocfs2_write_cluster_desc *desc;
 
-	ocfs2_write_ctxt_init(&wc, osb, pos, count, actor, priv);
+	ret = ocfs2_alloc_write_ctxt(&wc, osb, pos, len);
+	if (ret) {
+		mlog_errno(ret);
+		return ret;
+	}
 
-	ret = ocfs2_meta_lock(inode, &di_bh, 1);
+	ret = ocfs2_meta_lock(inode, &wc->w_di_bh, 1);
 	if (ret) {
 		mlog_errno(ret);
 		goto out;
 	}
-	di = (struct ocfs2_dinode *)di_bh->b_data;
+	di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
 
 	/*
 	 * Take alloc sem here to prevent concurrent lookups. That way
@@ -1221,23 +1197,60 @@ ssize_t ocfs2_buffered_write_cluster(str
 	 */
 	down_write(&OCFS2_I(inode)->ip_alloc_sem);
 
-	ret = ocfs2_get_clusters(inode, wc.w_cpos, &phys, NULL, NULL);
-	if (ret) {
-		mlog_errno(ret);
-		goto out_meta;
+	for (i = 0; i < wc->w_clen; i++) {
+		desc = &wc->w_desc[i];
+		desc->c_cpos = wc->w_cpos + i;
+
+		if (num_clusters == 0) {
+			ret = ocfs2_get_clusters(inode, desc->c_cpos, &phys,
+						 &num_clusters, NULL);
+			if (ret) {
+				mlog_errno(ret);
+				goto out_meta;
+			}
+		} else if (phys) {
+			/*
+			 * Only increment phys if it doesn't describe
+			 * a hole.
+			 */
+			phys++;
+		}
+
+		desc->c_phys = phys;
+		if (phys == 0) {
+			desc->c_new = 1;
+			clusters_to_alloc++;
+		}
+
+		num_clusters--;
 	}
 
-	/* phys == 0 means that allocation is required. */
-	if (phys == 0) {
-		ret = ocfs2_lock_allocators(inode, di, 1, &data_ac, &meta_ac);
+	/*
+	 * We set w_target_from, w_target_to here so that
+	 * ocfs2_write_end() knows which range in the target page to
+	 * write out. An allocation requires that we write the entire
+	 * cluster range.
+	 */
+	if (clusters_to_alloc > 0) {
+		/*
+		 * XXX: We are stretching the limits of
+		 * ocfs2_lock_allocators(). It greatly over-estimates
+		 * the work to be done.
+		 */
+		ret = ocfs2_lock_allocators(inode, di, clusters_to_alloc,
+					    &data_ac, &meta_ac);
 		if (ret) {
 			mlog_errno(ret);
 			goto out_meta;
 		}
 
-		credits = ocfs2_calc_extend_credits(inode->i_sb, di, 1);
+		credits = ocfs2_calc_extend_credits(inode->i_sb, di,
+						    clusters_to_alloc);
+
 	}
 
+	ocfs2_set_target_boundaries(osb, wc, pos, len, clusters_to_alloc);
+
 	ret = ocfs2_data_lock(inode, 1);
 	if (ret) {
 		mlog_errno(ret);
@@ -1251,36 +1264,50 @@ ssize_t ocfs2_buffered_write_cluster(str
 		goto out_data;
 	}
 
-	written = ocfs2_write(file, phys, handle, di_bh, data_ac,
-			      meta_ac, &wc);
-	if (written < 0) {
-		ret = written;
+	wc->w_handle = handle;
+
+	/*
+	 * We don't want this to fail in ocfs2_write_end(), so do it
+	 * here.
+	 */
+	ret = ocfs2_journal_access(handle, inode, wc->w_di_bh,
+				   OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret) {
 		mlog_errno(ret);
 		goto out_commit;
 	}
 
-	ret = ocfs2_journal_access(handle, inode, di_bh,
-				   OCFS2_JOURNAL_ACCESS_WRITE);
+	/*
+	 * Fill our page array first. That way we've grabbed enough so
+	 * that we can zero and flush if we error after adding the
+	 * extent.
+	 */
+	ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos,
+					 clusters_to_alloc);
 	if (ret) {
 		mlog_errno(ret);
-		goto out_commit;
+		goto out_commit;
 	}
 
-	pos += written;
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
+	for (i = 0; i < wc->w_clen; i++) {
+		desc = &wc->w_desc[i];
+
+		ret = ocfs2_write_cluster(mapping, desc->c_phys, data_ac,
+					  meta_ac, wc, desc->c_cpos, pos, len);
+		if (ret) {
+			mlog_errno(ret);
+			goto out_commit;
+		}
 	}
-	inode->i_blocks = ocfs2_inode_sector_count(inode);
-	di->i_size = cpu_to_le64((u64)i_size_read(inode));
-	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-	di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec);
-	di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec);
 
-	ret = ocfs2_journal_dirty(handle, di_bh);
-	if (ret)
-		mlog_errno(ret);
+	if (data_ac)
+		ocfs2_free_alloc_context(data_ac);
+	if (meta_ac)
+		ocfs2_free_alloc_context(meta_ac);
 
+	*pagep = wc->w_target_page;
+	*fsdata = wc;
+	return 0;
 out_commit:
 	ocfs2_commit_trans(osb, handle);
 
@@ -1292,18 +1319,92 @@ out_meta:
 	ocfs2_meta_unlock(inode, 1);
 
 out:
-	brelse(di_bh);
+	ocfs2_free_write_ctxt(wc);
+
 	if (data_ac)
 		ocfs2_free_alloc_context(data_ac);
 	if (meta_ac)
 		ocfs2_free_alloc_context(meta_ac);
+	return ret;
+}
+
+static int ocfs2_write_end(struct file *file, struct address_space *mapping,
+			   loff_t pos, unsigned len, unsigned copied,
+			   struct page *page, void *fsdata)
+{
+	int i;
+	unsigned from, to, start = pos & (PAGE_CACHE_SIZE - 1);
+	struct inode *inode = mapping->host;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_write_ctxt *wc = fsdata;
+	struct ocfs2_dinode *di = (struct ocfs2_dinode *)wc->w_di_bh->b_data;
+	handle_t *handle = wc->w_handle;
+	struct page *tmppage;
+
+	if (unlikely(copied < len)) {
+		if (!PageUptodate(wc->w_target_page))
+			copied = 0;
+
+		page_zero_new_buffers(wc->w_target_page, start+copied,
+				      start+len);
+	}
+	flush_dcache_page(wc->w_target_page);
+
+	for(i = 0; i < wc->w_num_pages; i++) {
+		tmppage = wc->w_pages[i];
+
+		if (tmppage == wc->w_target_page) {
+			from = wc->w_target_from;
+			to = wc->w_target_to;
+
+			BUG_ON(from > PAGE_CACHE_SIZE ||
+			       to > PAGE_CACHE_SIZE ||
+			       to < from);
+		} else {
+			/*
+			 * Pages adjacent to the target (if any) imply
+			 * a hole-filling write in which case we want
+			 * to flush their entire range.
+			 */
+			from = 0;
+			to = PAGE_CACHE_SIZE;
+		}
+
+		if (ocfs2_should_order_data(inode))
+			walk_page_buffers(wc->w_handle, page_buffers(tmppage),
+					  from, to, NULL,
+					  ocfs2_journal_dirty_data);
+
+		block_commit_write(tmppage, from, to);
+	}
+
+	pos += copied;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		mark_inode_dirty(inode);
+	}
+	inode->i_blocks = ocfs2_inode_sector_count(inode);
+	di->i_size = cpu_to_le64((u64)i_size_read(inode));
+	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec);
+	di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec);
+
+	ocfs2_journal_dirty(handle, wc->w_di_bh);
+
+	ocfs2_commit_trans(osb, handle);
+	ocfs2_data_unlock(inode, 1);
+	up_write(&OCFS2_I(inode)->ip_alloc_sem);
+	ocfs2_meta_unlock(inode, 1);
+	ocfs2_free_write_ctxt(wc);
 
-	return written ? written : ret;
+	return copied;
 }
 
 const struct address_space_operations ocfs2_aops = {
 	.readpage	= ocfs2_readpage,
 	.writepage	= ocfs2_writepage,
+	.write_begin	= ocfs2_write_begin,
+	.write_end	= ocfs2_write_end,
 	.bmap		= ocfs2_bmap,
 	.sync_page	= block_sync_page,
 	.direct_IO	= ocfs2_direct_IO,
Index: linux-2.6/fs/ocfs2/aops.h
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.h
+++ linux-2.6/fs/ocfs2/aops.h
@@ -42,58 +42,6 @@ int walk_page_buffers(	handle_t *handle,
 			int (*fn)(	handle_t *handle,
 					struct buffer_head *bh));
 
-struct ocfs2_write_ctxt;
-typedef int (ocfs2_page_writer)(struct inode *, struct ocfs2_write_ctxt *,
-				u64 *, unsigned int *, unsigned int *);
-
-ssize_t ocfs2_buffered_write_cluster(struct file *file, loff_t pos,
-				     size_t count, ocfs2_page_writer *actor,
-				     void *priv);
-
-struct ocfs2_write_ctxt {
-	size_t				w_count;
-	loff_t				w_pos;
-	u32				w_cpos;
-	unsigned int			w_finished_copy;
-
-	/* This is true if page_size > cluster_size */
-	unsigned int			w_large_pages;
-
-	/* Filler callback and private data */
-	ocfs2_page_writer		*w_write_data_page;
-	void				*w_private;
-
-	/* Only valid for the filler callback */
-	struct page			*w_this_page;
-	unsigned int			w_this_page_new;
-};
-
-struct ocfs2_buffered_write_priv {
-	char				*b_src_buf;
-	const struct iovec		*b_cur_iov; /* Current iovec */
-	size_t				b_cur_off; /* Offset in the
-						    * current iovec */
-};
-int ocfs2_map_and_write_user_data(struct inode *inode,
-				  struct ocfs2_write_ctxt *wc,
-				  u64 *p_blkno,
-				  unsigned int *ret_from,
-				  unsigned int *ret_to);
-
-struct ocfs2_splice_write_priv {
-	struct splice_desc		*s_sd;
-	struct pipe_buffer		*s_buf;
-	struct pipe_inode_info		*s_pipe;
-	/* Neither offset value is ever larger than one page */
-	unsigned int			s_offset;
-	unsigned int			s_buf_offset;
-};
-int ocfs2_map_and_write_splice_data(struct inode *inode,
-				    struct ocfs2_write_ctxt *wc,
-				    u64 *p_blkno,
-				    unsigned int *ret_from,
-				    unsigned int *ret_to);
-
 /* all ocfs2_dio_end_io()'s fault */
 #define ocfs2_iocb_is_rw_locked(iocb) \
 	test_bit(0, (unsigned long *)&iocb->private)
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c
+++ linux-2.6/fs/ocfs2/file.c
@@ -1306,116 +1306,6 @@ out:
 	return ret;
 }
 
-static inline void
-ocfs2_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
-{
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	do {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
-		}
-	} while (bytes);
-	*iovp = iov;
-	*basep = base;
-}
-
-static struct page * ocfs2_get_write_source(struct ocfs2_buffered_write_priv *bp,
-					    const struct iovec *cur_iov,
-					    size_t iov_offset)
-{
-	int ret;
-	char *buf;
-	struct page *src_page = NULL;
-
-	buf = cur_iov->iov_base + iov_offset;
-
-	if (!segment_eq(get_fs(), KERNEL_DS)) {
-		/*
-		 * Pull in the user page. We want to do this outside
-		 * of the meta data locks in order to preserve locking
-		 * order in case of page fault.
-		 */
-		ret = get_user_pages(current, current->mm,
-				     (unsigned long)buf & PAGE_CACHE_MASK, 1,
-				     0, 0, &src_page, NULL);
-		if (ret == 1)
-			bp->b_src_buf = kmap(src_page);
-		else
-			src_page = ERR_PTR(-EFAULT);
-	} else {
-		bp->b_src_buf = buf;
-	}
-
-	return src_page;
-}
-
-static void ocfs2_put_write_source(struct ocfs2_buffered_write_priv *bp,
-				   struct page *page)
-{
-	if (page) {
-		kunmap(page);
-		page_cache_release(page);
-	}
-}
-
-static ssize_t ocfs2_file_buffered_write(struct file *file, loff_t *ppos,
-					 const struct iovec *iov,
-					 unsigned long nr_segs,
-					 size_t count,
-					 ssize_t o_direct_written)
-{
-	int ret = 0;
-	ssize_t copied, total = 0;
-	size_t iov_offset = 0;
-	const struct iovec *cur_iov = iov;
-	struct ocfs2_buffered_write_priv bp;
-	struct page *page;
-
-	/*
-	 * handle partial DIO write.  Adjust cur_iov if needed.
-	 */
-	ocfs2_set_next_iovec(&cur_iov, &iov_offset, o_direct_written);
-
-	do {
-		bp.b_cur_off = iov_offset;
-		bp.b_cur_iov = cur_iov;
-
-		page = ocfs2_get_write_source(&bp, cur_iov, iov_offset);
-		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			goto out;
-		}
-
-		copied = ocfs2_buffered_write_cluster(file, *ppos, count,
-						      ocfs2_map_and_write_user_data,
-						      &bp);
-
-		ocfs2_put_write_source(&bp, page);
-
-		if (copied < 0) {
-			mlog_errno(copied);
-			ret = copied;
-			goto out;
-		}
-
-		total += copied;
-		*ppos = *ppos + copied;
-		count -= copied;
-
-		ocfs2_set_next_iovec(&cur_iov, &iov_offset, copied);
-	} while(count);
-
-out:
-	return total ? total : ret;
-}
-
 static int ocfs2_check_iovec(const struct iovec *iov, size_t *counted,
 			     unsigned long *nr_segs)
 {
@@ -1452,7 +1342,7 @@ static ssize_t ocfs2_file_aio_write(stru
 				    loff_t pos)
 {
 	int ret, direct_io, appending, rw_level, have_alloc_sem  = 0;
-	int can_do_direct, sync = 0;
+	int can_do_direct;
 	ssize_t written = 0;
 	size_t ocount;		/* original count */
 	size_t count;		/* after file limit checks */
@@ -1468,12 +1358,6 @@ static ssize_t ocfs2_file_aio_write(stru
 	if (iocb->ki_left == 0)
 		return 0;
 
-	ret = ocfs2_check_iovec(iov, &ocount, &nr_segs);
-	if (ret)
-		return ret;
-
-	count = ocount;
-
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	appending = file->f_flags & O_APPEND ? 1 : 0;
@@ -1517,33 +1401,22 @@ relock:
 		rw_level = -1;
 
 		direct_io = 0;
-		sync = 1;
 		goto relock;
 	}
 
-	if (!sync && ((file->f_flags & O_SYNC) || IS_SYNC(inode)))
-		sync = 1;
-
-	/*
-	 * XXX: Is it ok to execute these checks a second time?
-	 */
-	ret = generic_write_checks(file, ppos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out;
-
-	/*
-	 * Set pos so that sync_page_range_nolock() below understands
-	 * where to start from. We might've moved it around via the
-	 * calls above. The range we want to actually sync starts from
-	 * *ppos here.
-	 *
-	 */
-	pos = *ppos;
-
 	/* communicate with ocfs2_dio_end_io */
 	ocfs2_iocb_set_rw_locked(iocb);
 
 	if (direct_io) {
+		ret = ocfs2_check_iovec(iov, &ocount, &nr_segs);
+		if (ret)
+			goto out_dio;
+
+		ret = generic_write_checks(file, ppos, &count,
+					   S_ISBLK(inode->i_mode));
+		if (ret)
+			goto out_dio;
+
 		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
 						    ppos, count, ocount);
 		if (written < 0) {
@@ -1551,14 +1424,8 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		written = ocfs2_file_buffered_write(file, ppos, iov, nr_segs,
-						    count, written);
-		if (written < 0) {
-			ret = written;
-			if (ret != -EFAULT || ret != -ENOSPC)
-				mlog_errno(ret);
-			goto out;
-		}
+		written = generic_file_aio_write_nolock(iocb, iov, nr_segs,
+							*ppos);
 	}
 
 out_dio:
@@ -1588,98 +1455,12 @@ out_sems:
 	if (have_alloc_sem)
 		up_read(&inode->i_alloc_sem);
 
-	if (written > 0 && sync) {
-		ssize_t err;
-
-		err = sync_page_range_nolock(inode, file->f_mapping, pos, count);
-		if (err < 0)
-			written = err;
-	}
-
 	mutex_unlock(&inode->i_mutex);
 
 	mlog_exit(ret);
 	return written ? written : ret;
 }
 
-static int ocfs2_splice_write_actor(struct pipe_inode_info *pipe,
-				    struct pipe_buffer *buf,
-				    struct splice_desc *sd)
-{
-	int ret, count, total = 0;
-	ssize_t copied = 0;
-	struct ocfs2_splice_write_priv sp;
-
-	ret = buf->ops->pin(pipe, buf);
-	if (ret)
-		goto out;
-
-	sp.s_sd = sd;
-	sp.s_buf = buf;
-	sp.s_pipe = pipe;
-	sp.s_offset = sd->pos & ~PAGE_CACHE_MASK;
-	sp.s_buf_offset = buf->offset;
-
-	count = sd->len;
-	if (count + sp.s_offset > PAGE_CACHE_SIZE)
-		count = PAGE_CACHE_SIZE - sp.s_offset;
-
-	do {
-		/*
-		 * splice wants us to copy up to one page at a
-		 * time. For pagesize > cluster size, this means we
-		 * might enter ocfs2_buffered_write_cluster() more
-		 * than once, so keep track of our progress here.
-		 */
-		copied = ocfs2_buffered_write_cluster(sd->file,
-						      (loff_t)sd->pos + total,
-						      count,
-						      ocfs2_map_and_write_splice_data,
-						      &sp);
-		if (copied < 0) {
-			mlog_errno(copied);
-			ret = copied;
-			goto out;
-		}
-
-		count -= copied;
-		sp.s_offset += copied;
-		sp.s_buf_offset += copied;
-		total += copied;
-	} while (count);
-
-	ret = 0;
-out:
-
-	return total ? total : ret;
-}
-
-static ssize_t __ocfs2_file_splice_write(struct pipe_inode_info *pipe,
-					 struct file *out,
-					 loff_t *ppos,
-					 size_t len,
-					 unsigned int flags)
-{
-	int ret, err;
-	struct address_space *mapping = out->f_mapping;
-	struct inode *inode = mapping->host;
-
-	ret = __splice_from_pipe(pipe, out, ppos, len, flags,
-				 ocfs2_splice_write_actor);
-	if (ret > 0) {
-		*ppos += ret;
-
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-			if (err)
-				ret = err;
-		}
-	}
-
-	return ret;
-}
-
 static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
 				       struct file *out,
 				       loff_t *ppos,
@@ -1709,8 +1490,7 @@ static ssize_t ocfs2_file_splice_write(s
 		goto out_unlock;
 	}
 
-	/* ok, we're done with i_size and alloc work */
-	ret = __ocfs2_file_splice_write(pipe, out, ppos, len, flags);
+	ret = generic_file_splice_write_nolock(pipe, out, ppos, len, flags);
 
 out_unlock:
 	ocfs2_rw_unlock(inode, 1);

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 33/44] gfs2 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (31 preceding siblings ...)
  2007-04-24  1:24 ` [patch 32/44] ocfs2: " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes Nick Piggin
                   ` (10 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-gfs2-aops.patch --]
[-- Type: text/plain, Size: 9224 bytes --]

From: Steven Whitehouse <swhiteho@redhat.com>

(needs a SOB)

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>

 fs/gfs2/ops_address.c |  209 +++++++++++++++++++++++++++++---------------------
 1 file changed, 125 insertions(+), 84 deletions(-)

Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -17,6 +17,7 @@
 #include <linux/mpage.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
+#include <linux/swap.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/lm_interface.h>
 
@@ -337,45 +338,49 @@ out_unlock:
 }
 
 /**
- * gfs2_prepare_write - Prepare to write a page to a file
+ * gfs2_write_begin - Begin to write to a file
  * @file: The file to write to
- * @page: The page which is to be prepared for writing
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The mapping in which to write
+ * @pos: The file offset at which to start writing
+ * @len: Length of the write
+ * @flags: Various flags
+ * @pagep: Pointer to return the page
+ * @fsdata: Pointer to return fs data (unused by GFS2)
  *
  * Returns: errno
  */
 
-static int gfs2_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int gfs2_write_begin(struct file *file, struct address_space *mapping,
+			    loff_t pos, unsigned len, unsigned flags,
+			    struct page **pagep, void **fsdata)
 {
-	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
-	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
+	struct gfs2_inode *ip = GFS2_I(mapping->host);
+	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
 	unsigned int data_blocks, ind_blocks, rblocks;
 	int alloc_required;
 	int error = 0;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + from;
-	loff_t end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 	struct gfs2_alloc *al;
-	unsigned int write_len = to - from;
-
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned to = from + len;
+	struct page *page;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME|LM_FLAG_TRY_1CB, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME, &ip->i_gh);
 	error = gfs2_glock_nq_atime(&ip->i_gh);
-	if (unlikely(error)) {
-		if (error == GLR_TRYFAILED) {
-			unlock_page(page);
-			error = AOP_TRUNCATED_PAGE;
-			yield();
-		}
+	if (unlikely(error))
 		goto out_uninit;
-	}
 
-	gfs2_write_calc_reserv(ip, write_len, &data_blocks, &ind_blocks);
+	error = -ENOMEM;
+	page = __grab_cache_page(mapping, index);
+	*pagep = page;
+	if (!page)
+		goto out_unlock;
+
+	gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
 
-	error = gfs2_write_alloc_required(ip, pos, write_len, &alloc_required);
+	error = gfs2_write_alloc_required(ip, pos, len, &alloc_required);
 	if (error)
-		goto out_unlock;
+		goto out_putpage;
 
 
 	ip->i_alloc.al_requested = 0;
@@ -407,7 +412,7 @@ static int gfs2_prepare_write(struct fil
 		goto out;
 
 	if (gfs2_is_stuffed(ip)) {
-		if (end > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
+		if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
 			error = gfs2_unstuff_dinode(ip, page);
 			if (error == 0)
 				goto prepare_write;
@@ -429,6 +434,10 @@ out_qunlock:
 out_alloc_put:
 			gfs2_alloc_put(ip);
 		}
+out_putpage:
+		page_cache_release(page);
+		if (pos + len > ip->i_inode.i_size)
+			vmtruncate(&ip->i_inode, ip->i_inode.i_size);
 out_unlock:
 		gfs2_glock_dq_m(1, &ip->i_gh);
 out_uninit:
@@ -439,96 +448,128 @@ out_uninit:
 }
 
 /**
- * gfs2_commit_write - Commit write to a file
+ * gfs2_stuffed_write_end - Write end for stuffed files
+ * @inode: The inode
+ * @dibh: The buffer_head containing the on-disk inode
+ * @pos: The file position
+ * @len: The length of the write
+ * @copied: How much was actually copied by the VFS
+ * @page: The page
+ *
+ * This copies the data from the page into the inode block after
+ * the inode data structure itself.
+ *
+ * Returns: errno
+ */
+static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
+				  loff_t pos, unsigned len, unsigned copied,
+				  struct page *page)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	u64 to = pos + copied;
+	void *kaddr;
+	unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode);
+	struct gfs2_dinode *di = (struct gfs2_dinode *)dibh->b_data;
+
+	BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode)));
+	kaddr = kmap_atomic(page, KM_USER0);
+	memcpy(buf + pos, kaddr + pos, copied);
+	memset(kaddr + pos + copied, 0, len - copied);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+	unlock_page(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+
+	if (inode->i_size < to) {
+		i_size_write(inode, to);
+		ip->i_di.di_size = inode->i_size;
+		di->di_size = cpu_to_be64(inode->i_size);
+		mark_inode_dirty(inode);
+	}
+
+	brelse(dibh);
+	gfs2_trans_end(sdp);
+	gfs2_glock_dq(&ip->i_gh);
+	gfs2_holder_uninit(&ip->i_gh);
+	return copied;
+}
+
+/**
+ * gfs2_write_end
  * @file: The file to write to
- * @page: The page containing the data
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The address space to write to
+ * @pos: The file position
+ * @len: The length of the data
+ * @copied:
+ * @page: The page that has been written
+ * @fsdata: The fsdata (unused in GFS2)
+ *
+ * The main write_end function for GFS2. We have a separate one for
+ * stuffed files as they are slightly different, otherwise we just
+ * put our locking around the VFS provided functions.
  *
  * Returns: errno
  */
 
-static int gfs2_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int gfs2_write_end(struct file *file, struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
 {
 	struct inode *inode = page->mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	int error = -EOPNOTSUPP;
 	struct buffer_head *dibh;
 	struct gfs2_alloc *al = &ip->i_alloc;
 	struct gfs2_dinode *di;
+	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned int to = from + len;
+	int ret;
 
-	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_locked_by_me(ip->i_gl)))
-                goto fail_nounlock;
+	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == 0);
 
-	error = gfs2_meta_inode_buffer(ip, &dibh);
-	if (error)
-		goto fail_endtrans;
+	ret = gfs2_meta_inode_buffer(ip, &dibh);
+	if (unlikely(ret)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto failed;
+	}
 
 	gfs2_trans_add_bh(ip->i_gl, dibh, 1);
-	di = (struct gfs2_dinode *)dibh->b_data;
 
-	if (gfs2_is_stuffed(ip)) {
-		u64 file_size;
-		void *kaddr;
-
-		file_size = ((u64)page->index << PAGE_CACHE_SHIFT) + to;
+	if (gfs2_is_stuffed(ip))
+		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
 
-		kaddr = kmap_atomic(page, KM_USER0);
-		memcpy(dibh->b_data + sizeof(struct gfs2_dinode) + from,
-		       kaddr + from, to - from);
-		kunmap_atomic(kaddr, KM_USER0);
+	if (sdp->sd_args.ar_data == GFS2_DATA_ORDERED || gfs2_is_jdata(ip))
+		gfs2_page_add_databufs(ip, page, from, to);
 
-		SetPageUptodate(page);
+	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
 
-		if (inode->i_size < file_size) {
-			i_size_write(inode, file_size);
+	if (likely(ret >= 0)) {
+		copied = ret;
+		if  ((pos + copied) > inode->i_size) {
+			di = (struct gfs2_dinode *)dibh->b_data;
+			ip->i_di.di_size = inode->i_size;
+			di->di_size = cpu_to_be64(inode->i_size);
 			mark_inode_dirty(inode);
 		}
-	} else {
-		if (sdp->sd_args.ar_data == GFS2_DATA_ORDERED ||
-		    gfs2_is_jdata(ip))
-			gfs2_page_add_databufs(ip, page, from, to);
-		error = generic_commit_write(file, page, from, to);
-		if (error)
-			goto fail;
-	}
-
-	if (ip->i_di.di_size < inode->i_size) {
-		ip->i_di.di_size = inode->i_size;
-		di->di_size = cpu_to_be64(inode->i_size);
 	}
 
 	brelse(dibh);
 	gfs2_trans_end(sdp);
+failed:
 	if (al->al_requested) {
 		gfs2_inplace_release(ip);
 		gfs2_quota_unlock(ip);
 		gfs2_alloc_put(ip);
 	}
-	unlock_page(page);
-	gfs2_glock_dq_m(1, &ip->i_gh);
-	lock_page(page);
+	gfs2_glock_dq(&ip->i_gh);
 	gfs2_holder_uninit(&ip->i_gh);
-	return 0;
-
-fail:
-	brelse(dibh);
-fail_endtrans:
-	gfs2_trans_end(sdp);
-	if (al->al_requested) {
-		gfs2_inplace_release(ip);
-		gfs2_quota_unlock(ip);
-		gfs2_alloc_put(ip);
-	}
-	unlock_page(page);
-	gfs2_glock_dq_m(1, &ip->i_gh);
-	lock_page(page);
-	gfs2_holder_uninit(&ip->i_gh);
-fail_nounlock:
-	ClearPageUptodate(page);
-	return error;
+	return ret;
 }
 
 /**
@@ -797,8 +838,8 @@ const struct address_space_operations gf
 	.readpage = gfs2_readpage,
 	.readpages = gfs2_readpages,
 	.sync_page = block_sync_page,
-	.prepare_write = gfs2_prepare_write,
-	.commit_write = gfs2_commit_write,
+	.write_begin = gfs2_write_begin,
+	.write_end = gfs2_write_end,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (32 preceding siblings ...)
  2007-04-24  1:24 ` [patch 33/44] gfs2 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 35/44] ecryptfs convert to new aops Nick Piggin
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh

[-- Attachment #1: fs-no-AOP_TRUNCATED_PAGE.patch --]
[-- Type: text/plain, Size: 6545 bytes --]

prepare_write/commit_write no longer return AOP_TRUNCATED_PAGE now that
OCFS2 and GFS2 have been converted to the new aops, so the callers can
be simplified accordingly.
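
To make the simplification concrete, here is a minimal sketch of a
write_begin/write_end pair under the new API, showing why the write path
no longer needs a retry return code: the aop unlocks and releases the
page itself on failure, and write_end simply reports how many bytes it
accepted. The example_* names are hypothetical and are not taken from
any patch in this series.

#include <linux/fs.h>
#include <linux/pagemap.h>

static int example_write_begin(struct file *file, struct address_space *mapping,
			       loff_t pos, unsigned len, unsigned flags,
			       struct page **pagep, void **fsdata)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	struct page *page;

	page = __grab_cache_page(mapping, index);	/* returned locked */
	if (!page)
		return -ENOMEM;
	/*
	 * Filesystem-specific preparation (block allocation, read-in of a
	 * partial page, ...) would go here.  On failure the aop cleans up
	 * after itself -- unlock_page(page); page_cache_release(page);
	 * return err; -- instead of returning AOP_TRUNCATED_PAGE.
	 */
	*pagep = page;
	return 0;
}

static int example_write_end(struct file *file, struct address_space *mapping,
			     loff_t pos, unsigned len, unsigned copied,
			     struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;

	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
		SetPageUptodate(page);
	if (pos + copied > inode->i_size)
		i_size_write(inode, pos + copied);
	unlock_page(page);
	page_cache_release(page);
	return copied;		/* bytes accepted, never a retry code */
}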

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 Documentation/filesystems/vfs.txt |    6 -----
 fs/ecryptfs/mmap.c                |   39 +++++++++-----------------------------
 include/linux/fs.h                |    2 -
 mm/filemap.c                      |   21 +++++++-------------
 4 files changed, 20 insertions(+), 48 deletions(-)

Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -619,11 +619,7 @@ struct address_space_operations {
   	any basic-blocks on storage, then those blocks should be
   	pre-read (if they haven't been read already) so that the
   	updated blocks can be written out properly.
-	The page will be locked.  If prepare_write wants to unlock the
-  	page it, like readpage, may do so and return
-  	AOP_TRUNCATED_PAGE.
-	In this case the prepare_write will be retried one the lock is
-  	regained.
+	The page will be locked.
 
 	Note: the page _must not_ be marked uptodate in this function
 	(or anywhere else) unless it actually is uptodate right now. As
Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -412,11 +412,9 @@ out:
 	return rc;
 }
 
-static
-void ecryptfs_release_lower_page(struct page *lower_page, int page_locked)
+static void ecryptfs_release_lower_page(struct page *lower_page)
 {
-	if (page_locked)
-		unlock_page(lower_page);
+	unlock_page(lower_page);
 	page_cache_release(lower_page);
 }
 
@@ -437,7 +435,6 @@ static int ecryptfs_write_inode_size_to_
 	const struct address_space_operations *lower_a_ops;
 	u64 file_size;
 
-retry:
 	header_page = grab_cache_page(lower_inode->i_mapping, 0);
 	if (!header_page) {
 		ecryptfs_printk(KERN_ERR, "grab_cache_page for "
@@ -448,11 +445,7 @@ retry:
 	lower_a_ops = lower_inode->i_mapping->a_ops;
 	rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
 	if (rc) {
-		if (rc == AOP_TRUNCATED_PAGE) {
-			ecryptfs_release_lower_page(header_page, 0);
-			goto retry;
-		} else
-			ecryptfs_release_lower_page(header_page, 1);
+		ecryptfs_release_lower_page(header_page);
 		goto out;
 	}
 	file_size = (u64)i_size_read(inode);
@@ -466,11 +459,7 @@ retry:
 	if (rc < 0)
 		ecryptfs_printk(KERN_ERR, "Error commiting header page "
 				"write\n");
-	if (rc == AOP_TRUNCATED_PAGE) {
-		ecryptfs_release_lower_page(header_page, 0);
-		goto retry;
-	} else
-		ecryptfs_release_lower_page(header_page, 1);
+	ecryptfs_release_lower_page(header_page);
 	lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
 	mark_inode_dirty_sync(inode);
 out:
@@ -573,16 +562,11 @@ retry:
 							  byte_offset,
 							  region_bytes);
 	if (rc) {
-		if (rc == AOP_TRUNCATED_PAGE) {
-			ecryptfs_release_lower_page(*lower_page, 0);
-			goto retry;
-		} else {
-			ecryptfs_printk(KERN_ERR, "prepare_write for "
-				"lower_page_index = [0x%.16x] failed; rc = "
-				"[%d]\n", lower_page_index, rc);
-			ecryptfs_release_lower_page(*lower_page, 1);
-			(*lower_page) = NULL;
-		}
+		ecryptfs_printk(KERN_ERR, "prepare_write for "
+			"lower_page_index = [0x%.16x] failed; rc = "
+			"[%d]\n", lower_page_index, rc);
+		ecryptfs_release_lower_page(*lower_page);
+		(*lower_page) = NULL;
 	}
 out:
 	return rc;
@@ -598,19 +582,16 @@ ecryptfs_commit_lower_page(struct page *
 			   struct file *lower_file, int byte_offset,
 			   int region_size)
 {
-	int page_locked = 1;
 	int rc = 0;
 
 	rc = lower_inode->i_mapping->a_ops->commit_write(
 		lower_file, lower_page, byte_offset, region_size);
-	if (rc == AOP_TRUNCATED_PAGE)
-		page_locked = 0;
 	if (rc < 0) {
 		ecryptfs_printk(KERN_ERR,
 				"Error committing write; rc = [%d]\n", rc);
 	} else
 		rc = 0;
-	ecryptfs_release_lower_page(lower_page, page_locked);
+	ecryptfs_release_lower_page(lower_page);
 	return rc;
 }
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1987,8 +1987,11 @@ again:
 			ret = aops->readpage(file, page);
 			page_cache_release(page);
 			if (ret) {
-				if (ret == AOP_TRUNCATED_PAGE)
-					goto again;
+				/*
+				 * ret cannot be AOP_TRUNCATED_PAGE, because
+				 * the only filesystems that return that from
+				 * ->readpage actually use ->write_begin
+				 */
 				return ret;
 			}
 			goto again;
@@ -1996,13 +1999,10 @@ again:
 
 		ret = aops->prepare_write(file, page, offset, offset+len);
 		if (ret) {
-			if (ret != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
+			unlock_page(page);
 			page_cache_release(page);
 			if (pos + len > inode->i_size)
 				vmtruncate(inode, inode->i_size);
-			if (ret == AOP_TRUNCATED_PAGE)
-				goto again;
 		}
 		return ret;
 	}
@@ -2027,7 +2027,6 @@ int pagecache_write_end(struct file *fil
 		ret = aops->commit_write(file, page, offset, offset+len);
 		unlock_page(page);
 		page_cache_release(page);
-		BUG_ON(ret == AOP_TRUNCATED_PAGE); /* can't deal with */
 
 		if (ret < 0) {
 			if (pos + len > inode->i_size)
@@ -2258,8 +2257,7 @@ static ssize_t generic_perform_write_2co
 		continue;
 
 fs_write_aop_error:
-		if (status != AOP_TRUNCATED_PAGE)
-			unlock_page(page);
+		unlock_page(page);
 		page_cache_release(page);
 		if (src_page)
 			page_cache_release(src_page);
@@ -2271,10 +2269,7 @@ fs_write_aop_error:
 		 */
 		if (pos + bytes > inode->i_size)
 			vmtruncate(inode, inode->i_size);
-		if (status == AOP_TRUNCATED_PAGE)
-			continue;
-		else
-			break;
+		break;
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -378,7 +378,7 @@ struct iattr {
  *  			trying again.  The aop will be taking reasonable
  *  			precautions not to livelock.  If the caller held a page
  *  			reference, it should drop it before retrying.  Returned
- *  			by readpage(), prepare_write(), and commit_write().
+ *  			by readpage().
  *
  * address_space_operation functions return these large constants to indicate
  * special semantics to the caller.  These are much larger than the bytes in a

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 35/44] ecryptfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (33 preceding siblings ...)
  2007-04-24  1:24 ` [patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 36/44] fuse " Nick Piggin
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Filesystems, Mark Fasheh, mhalcrow, phillip, ecryptfs-devel

[-- Attachment #1: fs-ecryptfs-aops.patch --]
[-- Type: text/plain, Size: 17190 bytes --]

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Cc: mhalcrow@us.ibm.com
Cc: phillip@hellewell.homeip.net
Cc: ecryptfs-devel@lists.sourceforge.net
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/ecryptfs/crypto.c          |   32 +++---
 fs/ecryptfs/ecryptfs_kernel.h |    4 
 fs/ecryptfs/mmap.c            |  213 +++++++++++++++++++-----------------------
 3 files changed, 119 insertions(+), 130 deletions(-)

Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -36,26 +36,6 @@
 
 struct kmem_cache *ecryptfs_lower_page_cache;
 
-/**
- * ecryptfs_get1page
- *
- * Get one page from cache or lower f/s, return error otherwise.
- *
- * Returns unlocked and up-to-date page (if ok), with increased
- * refcnt.
- */
-static struct page *ecryptfs_get1page(struct file *file, int index)
-{
-	struct dentry *dentry;
-	struct inode *inode;
-	struct address_space *mapping;
-
-	dentry = file->f_path.dentry;
-	inode = dentry->d_inode;
-	mapping = inode->i_mapping;
-	return read_mapping_page(mapping, index, (void *)file);
-}
-
 static
 int write_zeros(struct file *file, pgoff_t index, int start, int num_zeros);
 
@@ -360,17 +340,14 @@ out:
 /**
  * Called with lower inode mutex held.
  */
-static int fill_zeros_to_end_of_page(struct page *page, unsigned int to)
+static int fill_zeros_to_end_of_page(struct page *page, loff_t new_isize)
 {
-	struct inode *inode = page->mapping->host;
 	int end_byte_in_page;
 	char *page_virt;
 
-	if ((i_size_read(inode) / PAGE_CACHE_SIZE) != page->index)
+	if ((new_isize >> PAGE_CACHE_SHIFT) != page->index)
 		goto out;
-	end_byte_in_page = i_size_read(inode) % PAGE_CACHE_SIZE;
-	if (to > end_byte_in_page)
-		end_byte_in_page = to;
+	end_byte_in_page = new_isize % PAGE_CACHE_SIZE;
 	page_virt = kmap_atomic(page, KM_USER0);
 	memset((page_virt + end_byte_in_page), 0,
 	       (PAGE_CACHE_SIZE - end_byte_in_page));
@@ -380,16 +357,35 @@ out:
 	return 0;
 }
 
-static int ecryptfs_prepare_write(struct file *file, struct page *page,
-				  unsigned from, unsigned to)
+static int ecryptfs_write_begin(struct file *file,struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
+	struct page *page;
+	pgoff_t index;
 	int rc = 0;
 
-	if (from == 0 && to == PAGE_CACHE_SIZE)
-		goto out;	/* If we are writing a full page, it will be
-				   up to date. */
-	if (!PageUptodate(page))
-		rc = ecryptfs_do_readpage(file, page, page->index);
+	index = pos >> PAGE_CACHE_SHIFT;
+	page = __grab_cache_page(mapping, index);
+	if (!page) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * If we are writing a full page (with no possibility of a short
+	 * write), it will be guaranteed to end up being uptodate at
+	 * write_end-time
+	 */
+	if (flags & AOP_FLAG_UNINTERRUPTIBLE && len == PAGE_CACHE_SIZE)
+		goto out;
+	if (!PageUptodate(page)) {
+		rc = ecryptfs_do_readpage(file, page, index);
+		if (rc) {
+			unlock_page(page);
+			page_cache_release(page);
+		}
+	}
 out:
 	return rc;
 }
@@ -412,12 +408,6 @@ out:
 	return rc;
 }
 
-static void ecryptfs_release_lower_page(struct page *lower_page)
-{
-	unlock_page(lower_page);
-	page_cache_release(lower_page);
-}
-
 /**
  * ecryptfs_write_inode_size_to_header
  *
@@ -431,23 +421,17 @@ static int ecryptfs_write_inode_size_to_
 {
 	int rc = 0;
 	struct page *header_page;
+	void *fsdata;
 	char *header_virt;
-	const struct address_space_operations *lower_a_ops;
+	struct address_space *lower_mapping = lower_inode->i_mapping;
 	u64 file_size;
 
-	header_page = grab_cache_page(lower_inode->i_mapping, 0);
-	if (!header_page) {
-		ecryptfs_printk(KERN_ERR, "grab_cache_page for "
-				"lower_page_index 0 failed\n");
-		rc = -EINVAL;
-		goto out;
-	}
-	lower_a_ops = lower_inode->i_mapping->a_ops;
-	rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
-	if (rc) {
-		ecryptfs_release_lower_page(header_page);
+	rc = pagecache_write_begin(lower_file, lower_mapping, 0, sizeof(u64),
+					AOP_FLAG_UNINTERRUPTIBLE,
+					&header_page, &fsdata);
+	if (rc)
 		goto out;
-	}
+
 	file_size = (u64)i_size_read(inode);
 	ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n", file_size);
 	file_size = cpu_to_be64(file_size);
@@ -455,13 +439,17 @@ static int ecryptfs_write_inode_size_to_
 	memcpy(header_virt, &file_size, sizeof(u64));
 	kunmap_atomic(header_virt, KM_USER0);
 	flush_dcache_page(header_page);
-	rc = lower_a_ops->commit_write(lower_file, header_page, 0, 8);
-	if (rc < 0)
+
+	rc = pagecache_write_end(lower_file, lower_mapping, 0, sizeof(u64),
+					sizeof(u64), header_page, fsdata);
+	if (rc != sizeof(u64)) {
 		ecryptfs_printk(KERN_ERR, "Error commiting header page "
 				"write\n");
-	ecryptfs_release_lower_page(header_page);
+		if (rc > 0)
+			rc = -EINVAL; /* XXX: can we do better? */
+	}
 	lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	mark_inode_dirty_sync(inode); /* XXX: lower_inode? */
 out:
 	return rc;
 }
@@ -544,31 +532,21 @@ ecryptfs_write_inode_size_to_metadata(st
 int ecryptfs_get_lower_page(struct page **lower_page, struct inode *lower_inode,
 			    struct file *lower_file,
 			    unsigned long lower_page_index, int byte_offset,
-			    int region_bytes)
+			    int region_bytes, void **fsdata)
 {
-	int rc = 0;
+	int rc;
+	struct address_space *lower_mapping = lower_inode->i_mapping;
+	loff_t pos = (lower_page_index << PAGE_CACHE_SHIFT) + byte_offset;
 
-retry:
-	*lower_page = grab_cache_page(lower_inode->i_mapping, lower_page_index);
-	if (!(*lower_page)) {
-		rc = -EINVAL;
-		ecryptfs_printk(KERN_ERR, "Error attempting to grab "
-				"lower page with index [0x%.16x]\n",
-				lower_page_index);
-		goto out;
-	}
-	rc = lower_inode->i_mapping->a_ops->prepare_write(lower_file,
-							  (*lower_page),
-							  byte_offset,
-							  region_bytes);
+	rc = pagecache_write_begin(lower_file, lower_mapping, pos, region_bytes,
+				AOP_FLAG_UNINTERRUPTIBLE, /* XXX: ok? */
+				lower_page, fsdata);
 	if (rc) {
-		ecryptfs_printk(KERN_ERR, "prepare_write for "
+		ecryptfs_printk(KERN_ERR, "pagecache_write_begin for "
 			"lower_page_index = [0x%.16x] failed; rc = "
 			"[%d]\n", lower_page_index, rc);
-		ecryptfs_release_lower_page(*lower_page);
 		(*lower_page) = NULL;
 	}
-out:
 	return rc;
 }
 
@@ -580,18 +558,21 @@ out:
 int
 ecryptfs_commit_lower_page(struct page *lower_page, struct inode *lower_inode,
 			   struct file *lower_file, int byte_offset,
-			   int region_size)
+			   int region_size, void *fsdata)
 {
-	int rc = 0;
+	int rc;
+	struct address_space *lower_mapping = lower_inode->i_mapping;
+	loff_t pos = (lower_page->index << PAGE_CACHE_SHIFT) + byte_offset;
 
-	rc = lower_inode->i_mapping->a_ops->commit_write(
-		lower_file, lower_page, byte_offset, region_size);
-	if (rc < 0) {
+	rc = pagecache_write_end(lower_file, lower_mapping, pos, region_size,
+					region_size, lower_page, fsdata);
+	if (rc != region_size) {
 		ecryptfs_printk(KERN_ERR,
 				"Error committing write; rc = [%d]\n", rc);
+		if (rc > 0)
+			rc = -EINVAL;
 	} else
 		rc = 0;
-	ecryptfs_release_lower_page(lower_page);
 	return rc;
 }
 
@@ -606,9 +587,10 @@ int ecryptfs_copy_page_to_lower(struct p
 {
 	int rc = 0;
 	struct page *lower_page;
+	void *fsdata;
 
 	rc = ecryptfs_get_lower_page(&lower_page, lower_inode, lower_file,
-				     page->index, 0, PAGE_CACHE_SIZE);
+				     page->index, 0, PAGE_CACHE_SIZE, &fsdata);
 	if (rc) {
 		ecryptfs_printk(KERN_ERR, "Error attempting to get page "
 				"at index [0x%.16x]\n", page->index);
@@ -618,7 +600,7 @@ int ecryptfs_copy_page_to_lower(struct p
 	memcpy((char *)page_address(lower_page), page_address(page),
 	       PAGE_CACHE_SIZE);
 	rc = ecryptfs_commit_lower_page(lower_page, lower_inode, lower_file,
-					0, PAGE_CACHE_SIZE);
+					0, PAGE_CACHE_SIZE, fsdata);
 	if (rc)
 		ecryptfs_printk(KERN_ERR, "Error attempting to commit page "
 				"at index [0x%.16x]\n", page->index);
@@ -629,31 +611,37 @@ out:
 struct kmem_cache *ecryptfs_xattr_cache;
 
 /**
- * ecryptfs_commit_write
+ * ecryptfs_write_end
  * @file: The eCryptfs file object
- * @page: The eCryptfs page
- * @from: Ignored (we rotate the page IV on each write)
- * @to: Ignored
+ * @mapping: The eCryptfs address_space
+ * @pos: The start of the write
+ * @len: The length passed to write_begin (unused)
+ * @copied: The actual amount copied
+ * @page: The eCryptfs page returned by write_begin
+ * @fsdata: Filesystem private data (unused)
  *
  * This is where we encrypt the data and pass the encrypted data to
  * the lower filesystem.  In OpenPGP-compatible mode, we operate on
  * entire underlying packets.
  */
-static int ecryptfs_commit_write(struct file *file, struct page *page,
-				 unsigned from, unsigned to)
+static int ecryptfs_write_end(struct file *file,
+					struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	struct ecryptfs_page_crypt_context ctx;
-	loff_t pos;
+	loff_t isize;
 	struct inode *inode;
 	struct inode *lower_inode;
 	struct file *lower_file;
 	struct ecryptfs_crypt_stat *crypt_stat;
 	int rc;
 
-	inode = page->mapping->host;
+	inode = mapping->host;
+	isize = inode->i_size; /* i_mutex is held */
 	lower_inode = ecryptfs_inode_to_lower(inode);
 	lower_file = ecryptfs_file_to_lower(file);
-	mutex_lock(&lower_inode->i_mutex);
+	mutex_lock(&lower_inode->i_mutex); /* XXX: put this in write_begin? */
 	crypt_stat = &ecryptfs_inode_to_private(file->f_path.dentry->d_inode)
 				->crypt_stat;
 	if (crypt_stat->flags & ECRYPTFS_NEW_FILE) {
@@ -664,8 +652,8 @@ static int ecryptfs_commit_write(struct 
 		ecryptfs_printk(KERN_DEBUG, "Not a new file\n");
 	ecryptfs_printk(KERN_DEBUG, "Calling fill_zeros_to_end_of_page"
 			"(page w/ index = [0x%.16x], to = [%d])\n", page->index,
-			to);
-	rc = fill_zeros_to_end_of_page(page, to);
+			max(isize, pos+copied));
+	rc = fill_zeros_to_end_of_page(page, max(isize, pos+copied));
 	if (rc) {
 		ecryptfs_printk(KERN_WARNING, "Error attempting to fill "
 				"zeros in page with index = [0x%.16x]\n",
@@ -682,11 +670,10 @@ static int ecryptfs_commit_write(struct 
 		goto out;
 	}
 	inode->i_blocks = lower_inode->i_blocks;
-	pos = (page->index << PAGE_CACHE_SHIFT) + to;
-	if (pos > i_size_read(inode)) {
-		i_size_write(inode, pos);
+	if (pos + copied > isize) {
+		i_size_write(inode, pos + copied);
 		ecryptfs_printk(KERN_DEBUG, "Expanded file size to "
-				"[0x%.16x]\n", i_size_read(inode));
+				"[0x%.16x]\n", pos + copied);
 	}
 	rc = ecryptfs_write_inode_size_to_metadata(lower_file, lower_inode,
 						   inode, file->f_dentry,
@@ -702,6 +689,9 @@ out:
 	else
 		SetPageUptodate(page);
 	mutex_unlock(&lower_inode->i_mutex);
+	unlock_page(page);
+	page_cache_release(page);
+
 	return rc;
 }
 
@@ -722,32 +712,31 @@ int write_zeros(struct file *file, pgoff
 	int rc = 0;
 	struct page *tmp_page;
 	char *tmp_page_virt;
-
-	tmp_page = ecryptfs_get1page(file, index);
-	if (IS_ERR(tmp_page)) {
-		ecryptfs_printk(KERN_ERR, "Error getting page at index "
-				"[0x%.16x]\n", index);
-		rc = PTR_ERR(tmp_page);
-		goto out;
-	}
-	rc = ecryptfs_prepare_write(file, tmp_page, start, start + num_zeros);
+	void *fsdata;
+	struct address_space *mapping = file->f_path.dentry->d_inode->i_mapping;
+	loff_t pos = (index << PAGE_CACHE_SHIFT) + start;
+
+	rc = pagecache_write_begin(file, mapping, pos, num_zeros,
+					AOP_FLAG_UNINTERRUPTIBLE,
+					&tmp_page, &fsdata);
 	if (rc) {
 		ecryptfs_printk(KERN_ERR, "Error preparing to write zero's "
 				"to remainder of page at index [0x%.16x]\n",
 				index);
-		page_cache_release(tmp_page);
 		goto out;
 	}
 	tmp_page_virt = kmap_atomic(tmp_page, KM_USER0);
 	memset(((char *)tmp_page_virt + start), 0, num_zeros);
 	kunmap_atomic(tmp_page_virt, KM_USER0);
 	flush_dcache_page(tmp_page);
-	rc = ecryptfs_commit_write(file, tmp_page, start, start + num_zeros);
-	if (rc < 0) {
+	rc = pagecache_write_end(file, mapping, pos, num_zeros, num_zeros,
+					tmp_page, fsdata);
+	if (rc != num_zeros) {
 		ecryptfs_printk(KERN_ERR, "Error attempting to write zero's "
 				"to remainder of page at index [0x%.16x]\n",
 				index);
-		page_cache_release(tmp_page);
+		if (rc > 0)
+			rc = -EINVAL;
 		goto out;
 	}
 	rc = 0;
@@ -795,8 +784,8 @@ static void ecryptfs_sync_page(struct pa
 struct address_space_operations ecryptfs_aops = {
 	.writepage = ecryptfs_writepage,
 	.readpage = ecryptfs_readpage,
-	.prepare_write = ecryptfs_prepare_write,
-	.commit_write = ecryptfs_commit_write,
+	.write_begin = ecryptfs_write_begin,
+	.write_end = ecryptfs_write_end,
 	.bmap = ecryptfs_bmap,
 	.sync_page = ecryptfs_sync_page,
 };
Index: linux-2.6/fs/ecryptfs/crypto.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/crypto.c
+++ linux-2.6/fs/ecryptfs/crypto.c
@@ -375,7 +375,8 @@ ecryptfs_extent_to_lwr_pg_idx_and_offset
 static int ecryptfs_write_out_page(struct ecryptfs_page_crypt_context *ctx,
 				   struct page *lower_page,
 				   struct inode *lower_inode,
-				   int byte_offset_in_page, int bytes_to_write)
+				   int byte_offset_in_page, int bytes_to_write,
+				   void *fsdata)
 {
 	int rc = 0;
 
@@ -383,7 +384,7 @@ static int ecryptfs_write_out_page(struc
 		rc = ecryptfs_commit_lower_page(lower_page, lower_inode,
 						ctx->param.lower_file,
 						byte_offset_in_page,
-						bytes_to_write);
+						bytes_to_write, fsdata);
 		if (rc) {
 			ecryptfs_printk(KERN_ERR, "Error calling lower "
 					"commit; rc = [%d]\n", rc);
@@ -407,7 +408,7 @@ static int ecryptfs_read_in_page(struct 
 				 struct page **lower_page,
 				 struct inode *lower_inode,
 				 unsigned long lower_page_idx,
-				 int byte_offset_in_page)
+				 int byte_offset_in_page, void **fsdata)
 {
 	int rc = 0;
 
@@ -419,13 +420,12 @@ static int ecryptfs_read_in_page(struct 
 					     lower_page_idx,
 					     byte_offset_in_page,
 					     (PAGE_CACHE_SIZE
-					      - byte_offset_in_page));
+					      - byte_offset_in_page), fsdata);
 		if (rc) {
 			ecryptfs_printk(
-				KERN_ERR, "Error attempting to grab, map, "
-				"and prepare_write lower page with index "
+				KERN_ERR, "Error in ecryptfs_get_lower_page "
+				"lower page with index "
 				"[0x%.16x]; rc = [%d]\n", lower_page_idx, rc);
-			goto out;
 		}
 	} else {
 		*lower_page = grab_cache_page(lower_inode->i_mapping,
@@ -436,10 +436,9 @@ static int ecryptfs_read_in_page(struct 
 				KERN_ERR, "Error attempting to grab and map "
 				"lower page with index [0x%.16x]; rc = [%d]\n",
 				lower_page_idx, rc);
-			goto out;
 		}
 	}
-out:
+
 	return rc;
 }
 
@@ -475,6 +474,8 @@ int ecryptfs_encrypt_page(struct ecryptf
 	int lower_byte_offset = 0;
 	int orig_byte_offset = 0;
 	int num_extents_per_page;
+	void *fsdata;
+
 #define ECRYPTFS_PAGE_STATE_UNREAD    0
 #define ECRYPTFS_PAGE_STATE_READ      1
 #define ECRYPTFS_PAGE_STATE_MODIFIED  2
@@ -503,10 +504,9 @@ int ecryptfs_encrypt_page(struct ecryptf
 		if (prior_lower_page_idx != lower_page_idx
 		    && page_state == ECRYPTFS_PAGE_STATE_MODIFIED) {
 			rc = ecryptfs_write_out_page(ctx, lower_page,
-						     lower_inode,
-						     orig_byte_offset,
-						     (PAGE_CACHE_SIZE
-						      - orig_byte_offset));
+					lower_inode, orig_byte_offset,
+					(PAGE_CACHE_SIZE - orig_byte_offset),
+					fsdata);
 			if (rc) {
 				ecryptfs_printk(KERN_ERR, "Error attempting "
 						"to write out page; rc = [%d]"
@@ -519,7 +519,7 @@ int ecryptfs_encrypt_page(struct ecryptf
 		    || page_state == ECRYPTFS_PAGE_STATE_WRITTEN) {
 			rc = ecryptfs_read_in_page(ctx, &lower_page,
 						   lower_inode, lower_page_idx,
-						   lower_byte_offset);
+						   lower_byte_offset, &fsdata);
 			if (rc) {
 				ecryptfs_printk(KERN_ERR, "Error attempting "
 						"to read in lower page with "
@@ -571,8 +571,8 @@ int ecryptfs_encrypt_page(struct ecryptf
 	}
 	BUG_ON(orig_byte_offset != 0);
 	rc = ecryptfs_write_out_page(ctx, lower_page, lower_inode, 0,
-				     (lower_byte_offset
-				      + crypt_stat->extent_size));
+				(lower_byte_offset + crypt_stat->extent_size),
+				fsdata);
 	if (rc) {
 		ecryptfs_printk(KERN_ERR, "Error attempting to write out "
 				"page; rc = [%d]\n", rc);
Index: linux-2.6/fs/ecryptfs/ecryptfs_kernel.h
===================================================================
--- linux-2.6.orig/fs/ecryptfs/ecryptfs_kernel.h
+++ linux-2.6/fs/ecryptfs/ecryptfs_kernel.h
@@ -503,11 +503,11 @@ int ecryptfs_write_inode_size_to_metadat
 int ecryptfs_get_lower_page(struct page **lower_page, struct inode *lower_inode,
 			    struct file *lower_file,
 			    unsigned long lower_page_index, int byte_offset,
-			    int region_bytes);
+			    int region_bytes, void **fsdata);
 int
 ecryptfs_commit_lower_page(struct page *lower_page, struct inode *lower_inode,
 			   struct file *lower_file, int byte_offset,
-			   int region_size);
+			   int region_size, void *fsdata);
 int ecryptfs_copy_page_to_lower(struct page *page, struct inode *lower_inode,
 				struct file *lower_file);
 int ecryptfs_do_readpage(struct file *file, struct page *page,

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 36/44] fuse convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (34 preceding siblings ...)
  2007-04-24  1:24 ` [patch 35/44] ecryptfs convert to new aops Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 37/44] hostfs " Nick Piggin
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Miklos Szeredi

[-- Attachment #1: fs-fuse-aops.patch --]
[-- Type: text/plain, Size: 2988 bytes --]

[mszeredi]
 - don't send zero-length write requests
 - it is not legal for the filesystem to report zero bytes written
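
For clarity, a compact sketch of the contract those two points imply for
a write_end implementation; example_send_write() is a hypothetical
stand-in for the transport call, so this is an illustration rather than
the patch itself:

static int example_fuse_write_end(struct file *file, struct address_space *mapping,
				  loff_t pos, unsigned len, unsigned copied,
				  struct page *page, void *fsdata)
{
	int res = 0;

	if (copied) {
		/* rule 1: never issue a zero-length write request */
		res = example_send_write(file, pos, copied, page);
		if (res == 0)
			res = -EIO;	/* rule 2: "zero bytes written" is not a legal reply */
	}
	unlock_page(page);
	page_cache_release(page);
	return res;
}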

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

 fs/fuse/file.c |   48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/fuse/file.c
===================================================================
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -443,22 +443,25 @@ static size_t fuse_send_write(struct fus
 	return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
-			      unsigned offset, unsigned to)
-{
-	/* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-			     unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+			       loff_t pos, unsigned count, struct page *page)
 {
 	int err;
 	size_t nres;
-	unsigned count = to - offset;
-	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t pos = page_offset(page) + offset;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	struct fuse_req *req;
 
 	if (is_bad_inode(inode))
@@ -474,20 +477,35 @@ static int fuse_commit_write(struct file
 	nres = fuse_send_write(req, file, inode, pos, count);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
-	if (!err && nres != count)
+	if (!err && !nres)
 		err = -EIO;
 	if (!err) {
-		pos += count;
+		pos += nres;
 		spin_lock(&fc->lock);
 		if (pos > inode->i_size)
 			i_size_write(inode, pos);
 		spin_unlock(&fc->lock);
 
-		if (offset == 0 && to == PAGE_CACHE_SIZE)
+		if (count == PAGE_CACHE_SIZE)
 			SetPageUptodate(page);
 	}
 	fuse_invalidate_attr(inode);
-	return err;
+	return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	int res = 0;
+
+	if (copied)
+		res = fuse_buffered_write(file, inode, pos, copied, page);
+
+	unlock_page(page);
+	page_cache_release(page);
+	return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -816,8 +834,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
-	.prepare_write	= fuse_prepare_write,
-	.commit_write	= fuse_commit_write,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= fuse_set_page_dirty,
 	.bmap		= fuse_bmap,

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 37/44] hostfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (35 preceding siblings ...)
  2007-04-24  1:24 ` [patch 36/44] fuse " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-27 16:11   ` Jeff Dike
  2007-04-24  1:24 ` [patch 38/44] jffs2 " Nick Piggin
                   ` (6 subsequent siblings)
  43 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Jeff Dike

[-- Attachment #1: fs-hostfs-aops.patch --]
[-- Type: text/plain, Size: 3320 bytes --]

This also gets rid of a lot of now-unnecessary read_file calls, and
optimises the full-page write case by marking a !uptodate page uptodate.
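
The full-page optimisation mentioned above is the usual write_end
pattern: if the copy covered the whole page, every byte of the page is
now defined, so it can be flagged uptodate without first reading it back
from the host. A sketch of the check (not the patch itself):

	/* in write_end, after `copied' bytes were written into a locked page */
	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
		SetPageUptodate(page);
	unlock_page(page);
	page_cache_release(page);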

Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/hostfs/hostfs_kern.c |   70 +++++++++++++++++++-----------------------------
 1 file changed, 28 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -461,56 +461,42 @@ int hostfs_readpage(struct file *file, s
 	return(err);
 }
 
-int hostfs_prepare_write(struct file *file, struct page *page,
-			 unsigned int from, unsigned int to)
+int hostfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	char *buffer;
-	long long start, tmp;
-	int err;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	start = (long long) page->index << PAGE_CACHE_SHIFT;
-	buffer = kmap(page);
-	if(from != 0){
-		tmp = start;
-		err = read_file(FILE_HOSTFS_I(file)->fd, &tmp, buffer,
-				from);
-		if(err < 0) goto out;
-	}
-	if(to != PAGE_CACHE_SIZE){
-		start += to;
-		err = read_file(FILE_HOSTFS_I(file)->fd, &start, buffer + to,
-				PAGE_CACHE_SIZE - to);
-		if(err < 0) goto out;
-	}
-	err = 0;
- out:
-	kunmap(page);
-	return(err);
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
+	return 0;
 }
 
-int hostfs_commit_write(struct file *file, struct page *page, unsigned from,
-		 unsigned to)
+int hostfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
-	char *buffer;
-	long long start;
-	int err = 0;
+	void *buffer;
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+	int err;
 
-	start = (((long long) page->index) << PAGE_CACHE_SHIFT) + from;
 	buffer = kmap(page);
-	err = write_file(FILE_HOSTFS_I(file)->fd, &start, buffer + from,
-			 to - from);
-	if(err > 0) err = 0;
-
-	/* Actually, if !err, write_file has added to-from to start, so, despite
-	 * the appearance, we are comparing i_size against the _last_ written
-	 * location, as we should. */
+	err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+	kunmap(page);
+
+	if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
+	unlock_page(page);
+	page_cache_release(page);
 
-	if(!err && (start > inode->i_size))
-		inode->i_size = start;
+	/* If err > 0, write_file has added err to pos, so we are comparing
+	 * i_size against the last byte written.
+	 */
+	if (err > 0 && (pos > inode->i_size))
+		inode->i_size = pos;
 
-	kunmap(page);
 	return(err);
 }
 
@@ -518,8 +504,8 @@ static const struct address_space_operat
 	.writepage 	= hostfs_writepage,
 	.readpage	= hostfs_readpage,
 	.set_page_dirty = __set_page_dirty_nobuffers,
-	.prepare_write	= hostfs_prepare_write,
-	.commit_write	= hostfs_commit_write
+	.write_begin	= hostfs_write_begin,
+	.write_end	= hostfs_write_end,
 };
 
 static int init_inode(struct inode *inode, struct dentry *dentry)

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 38/44] jffs2 convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (36 preceding siblings ...)
  2007-04-24  1:24 ` [patch 37/44] hostfs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 39/44] cifs " Nick Piggin
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, dwmw2, jffs-dev

[-- Attachment #1: fs-jffs2-aops.patch --]
[-- Type: text/plain, Size: 7474 bytes --]

Cc: dwmw2@infradead.org
Cc: jffs-dev@axis.com
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/jffs2/file.c |  105 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 66 insertions(+), 39 deletions(-)

Index: linux-2.6/fs/jffs2/file.c
===================================================================
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -21,10 +21,12 @@
 #include <linux/jffs2.h>
 #include "nodelist.h"
 
-static int jffs2_commit_write (struct file *filp, struct page *pg,
-			       unsigned start, unsigned end);
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-				unsigned start, unsigned end);
+static int jffs2_write_end(struct file *filp, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *pg, void *fsdata);
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata);
 static int jffs2_readpage (struct file *filp, struct page *pg);
 
 int jffs2_fsync(struct file *filp, struct dentry *dentry, int datasync)
@@ -67,8 +69,8 @@ const struct inode_operations jffs2_file
 const struct address_space_operations jffs2_file_address_operations =
 {
 	.readpage =	jffs2_readpage,
-	.prepare_write =jffs2_prepare_write,
-	.commit_write =	jffs2_commit_write
+	.write_begin =	jffs2_write_begin,
+	.write_end =	jffs2_write_end,
 };
 
 static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg)
@@ -121,15 +123,23 @@ static int jffs2_readpage (struct file *
 	return ret;
 }
 
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-				unsigned start, unsigned end)
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	struct inode *inode = pg->mapping->host;
+	struct page *pg;
+	struct inode *inode = mapping->host;
 	struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
-	uint32_t pageofs = pg->index << PAGE_CACHE_SHIFT;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	uint32_t pageofs = pos & (PAGE_CACHE_SIZE - 1);
 	int ret = 0;
 
-	D1(printk(KERN_DEBUG "jffs2_prepare_write()\n"));
+	pg = __grab_cache_page(mapping, index);
+	if (!pg)
+		return -ENOMEM;
+	*pagep = pg;
+
+	D1(printk(KERN_DEBUG "jffs2_write_begin()\n"));
 
 	if (pageofs > inode->i_size) {
 		/* Make new hole frag from old EOF to new page */
@@ -144,7 +154,7 @@ static int jffs2_prepare_write (struct f
 		ret = jffs2_reserve_space(c, sizeof(ri), &alloc_len,
 					  ALLOC_NORMAL, JFFS2_SUMMARY_INODE_SIZE);
 		if (ret)
-			return ret;
+			goto out_page;
 
 		down(&f->sem);
 		memset(&ri, 0, sizeof(ri));
@@ -174,7 +184,7 @@ static int jffs2_prepare_write (struct f
 			ret = PTR_ERR(fn);
 			jffs2_complete_reservation(c);
 			up(&f->sem);
-			return ret;
+			goto out_page;
 		}
 		ret = jffs2_add_full_dnode_to_inode(c, f, fn);
 		if (f->metadata) {
@@ -183,65 +193,79 @@ static int jffs2_prepare_write (struct f
 			f->metadata = NULL;
 		}
 		if (ret) {
-			D1(printk(KERN_DEBUG "Eep. add_full_dnode_to_inode() failed in prepare_write, returned %d\n", ret));
+			D1(printk(KERN_DEBUG "Eep. add_full_dnode_to_inode() failed in write_begin, returned %d\n", ret));
 			jffs2_mark_node_obsolete(c, fn->raw);
 			jffs2_free_full_dnode(fn);
 			jffs2_complete_reservation(c);
 			up(&f->sem);
-			return ret;
+			goto out_page;
 		}
 		jffs2_complete_reservation(c);
 		inode->i_size = pageofs;
 		up(&f->sem);
 	}
 
-	/* Read in the page if it wasn't already present, unless it's a whole page */
-	if (!PageUptodate(pg) && (start || end < PAGE_CACHE_SIZE)) {
+	/*
+	 * Read in the page if it wasn't already present. Cannot optimize away
+	 * the whole page write case until jffs2_write_end can handle the
+	 * case of a short-copy.
+	 */
+	if (!PageUptodate(pg)) {
 		down(&f->sem);
 		ret = jffs2_do_readpage_nolock(inode, pg);
 		up(&f->sem);
+		if (ret)
+			goto out_page;
 	}
-	D1(printk(KERN_DEBUG "end prepare_write(). pg->flags %lx\n", pg->flags));
+	D1(printk(KERN_DEBUG "end write_begin(). pg->flags %lx\n", pg->flags));
+	return ret;
+
+out_page:
+	unlock_page(pg);
+	page_cache_release(pg);
 	return ret;
 }
 
-static int jffs2_commit_write (struct file *filp, struct page *pg,
-			       unsigned start, unsigned end)
+static int jffs2_write_end(struct file *filp, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *pg, void *fsdata)
 {
 	/* Actually commit the write from the page cache page we're looking at.
 	 * For now, we write the full page out each time. It sucks, but it's simple
 	 */
-	struct inode *inode = pg->mapping->host;
+	struct inode *inode = mapping->host;
 	struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
 	struct jffs2_sb_info *c = JFFS2_SB_INFO(inode->i_sb);
 	struct jffs2_raw_inode *ri;
+	unsigned start = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned end = start + copied;
 	unsigned aligned_start = start & ~3;
 	int ret = 0;
 	uint32_t writtenlen = 0;
 
-	D1(printk(KERN_DEBUG "jffs2_commit_write(): ino #%lu, page at 0x%lx, range %d-%d, flags %lx\n",
+	D1(printk(KERN_DEBUG "jffs2_write_end(): ino #%lu, page at 0x%lx, range %d-%d, flags %lx\n",
 		  inode->i_ino, pg->index << PAGE_CACHE_SHIFT, start, end, pg->flags));
 
+	/* We need to avoid deadlock with page_cache_read() in
+	   jffs2_garbage_collect_pass(). So the page must be
+	   up to date to prevent page_cache_read() from trying
+	   to re-lock it. */
+	BUG_ON(!PageUptodate(pg));
+
 	if (end == PAGE_CACHE_SIZE) {
-		if (!start) {
-			/* We need to avoid deadlock with page_cache_read() in
-			   jffs2_garbage_collect_pass(). So we have to mark the
-			   page up to date, to prevent page_cache_read() from
-			   trying to re-lock it. */
-			SetPageUptodate(pg);
-		} else {
-			/* When writing out the end of a page, write out the 
-			   _whole_ page. This helps to reduce the number of
-			   nodes in files which have many short writes, like
-			   syslog files. */
-			start = aligned_start = 0;
-		}
+		/* When writing out the end of a page, write out the
+		   _whole_ page. This helps to reduce the number of
+		   nodes in files which have many short writes, like
+		   syslog files. */
+		start = aligned_start = 0;
 	}
 
 	ri = jffs2_alloc_raw_inode();
 
 	if (!ri) {
-		D1(printk(KERN_DEBUG "jffs2_commit_write(): Allocation of raw inode failed\n"));
+		D1(printk(KERN_DEBUG "jffs2_write_end(): Allocation of raw inode failed\n"));
+		unlock_page(pg);
+		page_cache_release(pg);
 		return -ENOMEM;
 	}
 
@@ -289,11 +313,14 @@ static int jffs2_commit_write (struct fi
 		/* generic_file_write has written more to the page cache than we've
 		   actually written to the medium. Mark the page !Uptodate so that
 		   it gets reread */
-		D1(printk(KERN_DEBUG "jffs2_commit_write(): Not all bytes written. Marking page !uptodate\n"));
+		D1(printk(KERN_DEBUG "jffs2_write_end(): Not all bytes written. Marking page !uptodate\n"));
 		SetPageError(pg);
 		ClearPageUptodate(pg);
 	}
 
-	D1(printk(KERN_DEBUG "jffs2_commit_write() returning %d\n",start+writtenlen==end?0:ret));
-	return start+writtenlen==end?0:ret;
+	D1(printk(KERN_DEBUG "jffs2_write_end() returning %d\n",
+					writtenlen > 0 ? writtenlen : ret));
+	unlock_page(pg);
+	page_cache_release(pg);
+	return writtenlen > 0 ? writtenlen : ret;
 }

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 39/44] cifs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (37 preceding siblings ...)
  2007-04-24  1:24 ` [patch 38/44] jffs2 " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 40/44] ufs " Nick Piggin
                   ` (4 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, sfrench, samba-technical

[-- Attachment #1: fs-cifs-aops.patch --]
[-- Type: text/plain, Size: 6271 bytes --]

Convert to new aops, and fix security hole where page is set uptodate
before contents are uptodate.

Cc: sfrench@samba.org
Cc: samba-technical@lists.samba.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/cifs/file.c |   89 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 51 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper
 
 	/* want handles we can use to read with first
 	   in the list so we do not have to walk the
-	   list to search for one in prepare_write */
+	   list to search for one in write_begin */
 	if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
 		list_add_tail(&pCifsFile->flist, 
 			      &pCifsInode->openFileList);
@@ -1358,40 +1358,37 @@ static int cifs_writepage(struct page* p
 	return rc;
 }
 
-static int cifs_commit_write(struct file *file, struct page *page,
-	unsigned offset, unsigned to)
+static int cifs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
 	int xid;
 	int rc = 0;
-	struct inode *inode = page->mapping->host;
-	loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct inode *inode = mapping->host;
+	loff_t position = pos + copied;
 	char *page_data;
 
 	xid = GetXid();
-	cFYI(1, ("commit write for page %p up to position %lld for %d", 
-		 page, position, to));
+	cFYI(1, ("write end for page %p at pos %lld, copied %d",
+		 page, pos, copied));
 	spin_lock(&inode->i_lock);
 	if (position > inode->i_size) {
 		i_size_write(inode, position);
 	}
 	spin_unlock(&inode->i_lock);
+	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
+
 	if (!PageUptodate(page)) {
-		position =  ((loff_t)page->index << PAGE_CACHE_SHIFT) + offset;
-		/* can not rely on (or let) writepage write this data */
-		if (to < offset) {
-			cFYI(1, ("Illegal offsets, can not copy from %d to %d",
-				offset, to));
-			FreeXid(xid);
-			return rc;
-		}
+		unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
+
 		/* this is probably better than directly calling
 		   partialpage_write since in this function the file handle is
 		   known which we might as well	leverage */
 		/* BB check if anything else missing out of ppw
 		   such as updating last write time */
 		page_data = kmap(page);
-		rc = cifs_write(file, page_data + offset, to-offset,
-				&position);
+		rc = cifs_write(file, page_data + offset, copied, &pos);
 		if (rc > 0)
 			rc = 0;
 		/* else if (rc < 0) should we set writebehind rc? */
@@ -1399,9 +1396,12 @@ static int cifs_commit_write(struct file
 	} else {	
 		set_page_dirty(page);
 	}
-
 	FreeXid(xid);
-	return rc;
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return rc < 0 ? rc : copied;
 }
 
 int cifs_fsync(struct file *file, struct dentry *dentry, int datasync)
@@ -1928,34 +1928,47 @@ int is_size_safe_to_change(struct cifsIn
 		return 1;
 }
 
-static int cifs_prepare_write(struct file *file, struct page *page,
-	unsigned from, unsigned to)
+static int cifs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
 	int rc = 0;
 	loff_t i_size;
 	loff_t offset;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 
-	cFYI(1, ("prepare write for page %p from %d to %d",page,from,to));
+	cFYI(1, ("write begin for page %p at pos %lld, length %d",
+		 page, pos, len));
 	if (PageUptodate(page))
 		return 0;
 
-	/* If we are writing a full page it will be up to date,
-	   no need to read from the server */
-	if ((to == PAGE_CACHE_SIZE) && (from == 0)) {
-		SetPageUptodate(page);
+	/* If we are writing a full page it will become up to date,
+	   no need to read from the server (although we may encounter a
+	   short copy, so write_end has to handle this) */
+	if (len == PAGE_CACHE_SIZE)
 		return 0;
-	}
 
-	offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
-	i_size = i_size_read(page->mapping->host);
+	offset = index << PAGE_CACHE_SHIFT;
+	i_size = i_size_read(mapping->host);
+
+	if (offset >= i_size) {
+		void *kaddr;
+		unsigned from, to;
 
-	if ((offset >= i_size) ||
-	    ((from == 0) && (offset + to) >= i_size)) {
 		/*
 		 * We don't need to read data beyond the end of the file.
 		 * zero it, and set the page uptodate
 		 */
-		void *kaddr = kmap_atomic(page, KM_USER0);
+		from = pos & (PAGE_CACHE_SIZE - 1);
+		to = from + len;
+
+		kaddr = kmap_atomic(page, KM_USER0);
 
 		if (from)
 			memset(kaddr, 0, from);
@@ -1971,12 +1984,12 @@ static int cifs_prepare_write(struct fil
 		/* we could try using another file handle if there is one -
 		   but how would we lock it to prevent close of that handle
 		   racing with this read? In any case
-		   this will be written out by commit_write so is fine */
+		   this will be written out by write_end so is fine */
 	}
 
 	/* we do not need to pass errors back 
 	   e.g. if we do not have read access to the file 
-	   because cifs_commit_write will do the right thing.  -- shaggy */
+	   because cifs_write_end will do the right thing.  -- shaggy */
 
 	return 0;
 }
@@ -1986,8 +1999,8 @@ const struct address_space_operations ci
 	.readpages = cifs_readpages,
 	.writepage = cifs_writepage,
 	.writepages = cifs_writepages,
-	.prepare_write = cifs_prepare_write,
-	.commit_write = cifs_commit_write,
+	.write_begin = cifs_write_begin,
+	.write_end = cifs_write_end,
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */
@@ -2002,8 +2015,8 @@ const struct address_space_operations ci
 	.readpage = cifs_readpage,
 	.writepage = cifs_writepage,
 	.writepages = cifs_writepages,
-	.prepare_write = cifs_prepare_write,
-	.commit_write = cifs_commit_write,
+	.write_begin = cifs_write_begin,
+	.write_end = cifs_write_end,
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 40/44] ufs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (38 preceding siblings ...)
  2007-04-24  1:24 ` [patch 39/44] cifs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 41/44] udf " Nick Piggin
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, dushistov

[-- Attachment #1: fs-ufs-aops.patch --]
[-- Type: text/plain, Size: 6386 bytes --]

Cc: dushistov@mail.ru
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/ufs/dir.c   |   50 +++++++++++++++++++++++++++++++-------------------
 fs/ufs/inode.c |   23 +++++++++++++++++++----
 2 files changed, 50 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/ufs/inode.c
===================================================================
--- linux-2.6.orig/fs/ufs/inode.c
+++ linux-2.6/fs/ufs/inode.c
@@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa
 {
 	return block_write_full_page(page,ufs_getfrag_block,wbc);
 }
+
 static int ufs_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,ufs_getfrag_block);
 }
-static int ufs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __ufs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,ufs_getfrag_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				ufs_getfrag_block);
 }
+
+static int ufs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t ufs_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,ufs_getfrag_block);
 }
+
 const struct address_space_operations ufs_aops = {
 	.readpage = ufs_readpage,
 	.writepage = ufs_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = ufs_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = ufs_write_begin,
+	.write_end = generic_write_end,
 	.bmap = ufs_bmap
 };
 
Index: linux-2.6/fs/ufs/dir.c
===================================================================
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -38,12 +38,14 @@ static inline int ufs_match(struct super
 	return !memcmp(name, de->d_name, len);
 }
 
-static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
+
 	dir->i_version++;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -81,16 +83,20 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
 		  struct page *page, struct inode *inode)
 {
-	unsigned from = (char *) de - (char *) page_address(page);
-	unsigned to = from + fs16_to_cpu(dir->i_sb, de->d_reclen);
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *) de - (char *) page_address(page);
+	unsigned len = fs16_to_cpu(dir->i_sb, de->d_reclen);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ufs_write_begin(NULL, page->mapping, pos, len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
+
 	de->d_ino = cpu_to_fs32(dir->i_sb, inode->i_ino);
 	ufs_set_de_type(dir->i_sb, de, inode->i_mode);
-	err = ufs_commit_chunk(page, from, to);
+
+	err = ufs_commit_chunk(page, pos, len);
 	ufs_put_page(page);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);
@@ -312,7 +318,7 @@ int ufs_add_link(struct dentry *dentry, 
 	unsigned long npages = ufs_dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 
 	UFSD("ENTER, name %s, namelen %u\n", name, namelen);
@@ -367,9 +373,10 @@ int ufs_add_link(struct dentry *dentry, 
 	return -EINVAL;
 
 got_it:
-	from = (char*)de - (char*)page_address(page);
-	to = from + rec_len;
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char*)de - (char*)page_address(page);
+	err = __ufs_write_begin(NULL, page->mapping, pos, rec_len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err)
 		goto out_unlock;
 	if (de->d_ino) {
@@ -386,7 +393,7 @@ got_it:
 	de->d_ino = cpu_to_fs32(sb, inode->i_ino);
 	ufs_set_de_type(sb, de, inode->i_mode);
 
-	err = ufs_commit_chunk(page, from, to);
+	err = ufs_commit_chunk(page, pos, rec_len);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 
 	mark_inode_dirty(dir);
@@ -509,6 +516,7 @@ int ufs_delete_entry(struct inode *inode
 	char *kaddr = page_address(page);
 	unsigned from = ((char*)dir - kaddr) & ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
 	unsigned to = ((char*)dir - kaddr) + fs16_to_cpu(sb, dir->d_reclen);
+	loff_t pos;
 	struct ufs_dir_entry *pde = NULL;
 	struct ufs_dir_entry *de = (struct ufs_dir_entry *) (kaddr + from);
 	int err;
@@ -532,13 +540,16 @@ int ufs_delete_entry(struct inode *inode
 	}
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
+
+	pos = (page->index << PAGE_CACHE_SHIFT) + from;
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ufs_write_begin(NULL, mapping, pos, to - from,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	if (pde)
-		pde->d_reclen = cpu_to_fs16(sb, to-from);
+		pde->d_reclen = cpu_to_fs16(sb, to - from);
 	dir->d_ino = 0;
-	err = ufs_commit_chunk(page, from, to);
+	err = ufs_commit_chunk(page, pos, to - from);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 out:
@@ -559,14 +570,15 @@ int ufs_make_empty(struct inode * inode,
 
 	if (!page)
 		return -ENOMEM;
-	kmap(page);
-	err = mapping->a_ops->prepare_write(NULL, page, 0, chunk_size);
+
+	err = __ufs_write_begin(NULL, mapping, 0, chunk_size,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err) {
 		unlock_page(page);
 		goto fail;
 	}
 
-
+	kmap(page);
 	base = (char*)page_address(page);
 	memset(base, 0, PAGE_CACHE_SIZE);
 
@@ -584,10 +596,10 @@ int ufs_make_empty(struct inode * inode,
 	de->d_reclen = cpu_to_fs16(sb, chunk_size - UFS_DIR_REC_LEN(1));
 	ufs_set_de_namlen(sb, de, 2);
 	strcpy (de->d_name, "..");
+	kunmap(page);
 
 	err = ufs_commit_chunk(page, 0, chunk_size);
 fail:
-	kunmap(page);
 	page_cache_release(page);
 	return err;
 }

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 41/44] udf convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (39 preceding siblings ...)
  2007-04-24  1:24 ` [patch 40/44] ufs " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 42/44] sysv " Nick Piggin
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, bfennema

[-- Attachment #1: fs-udf-aops.patch --]
[-- Type: text/plain, Size: 3402 bytes --]

Convert udf to new aops. Also seem to have fixed pagecache corruption in
udf_adinicb_commit_write -- page was marked uptodate when it is not. Also,
fixed the silly setup where prepare_write was doing a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.

Cc: bfennema@falcon.csc.calpoly.edu
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/udf/file.c  |   32 +++++++++++++-------------------
 fs/udf/inode.c |   11 +++++++----
 2 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/udf/file.c
===================================================================
--- linux-2.6.orig/fs/udf/file.c
+++ linux-2.6/fs/udf/file.c
@@ -73,34 +73,28 @@ static int udf_adinicb_writepage(struct 
 	return 0;
 }
 
-static int udf_adinicb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int udf_adinicb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	kmap(page);
-	return 0;
-}
-
-static int udf_adinicb_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	char *kaddr = page_address(page);
+	struct inode *inode = mapping->host;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	char *kaddr;
 
+	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset,
-		kaddr + offset, to - offset);
-	mark_inode_dirty(inode);
-	SetPageUptodate(page);
-	kunmap(page);
-	/* only one page here */
-	if (to > inode->i_size)
-		inode->i_size = to;
-	return 0;
+		kaddr + offset, copied);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	return simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 
 const struct address_space_operations udf_adinicb_aops = {
 	.readpage		= udf_adinicb_readpage,
 	.writepage		= udf_adinicb_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= udf_adinicb_prepare_write,
-	.commit_write		= udf_adinicb_commit_write,
+	.write_begin		= simple_write_begin,
+	.write_end		= udf_adinicb_write_end,
 };
 
 static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c
+++ linux-2.6/fs/udf/inode.c
@@ -122,9 +122,12 @@ static int udf_readpage(struct file *fil
 	return block_read_full_page(page, udf_get_block);
 }
 
-static int udf_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int udf_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, udf_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				udf_get_block);
 }
 
 static sector_t udf_bmap(struct address_space *mapping, sector_t block)
@@ -136,8 +139,8 @@ const struct address_space_operations ud
 	.readpage		= udf_readpage,
 	.writepage		= udf_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= udf_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= udf_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= udf_bmap,
 };
 

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 42/44] sysv convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (40 preceding siblings ...)
  2007-04-24  1:24 ` [patch 41/44] udf " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 43/44] minix " Nick Piggin
  2007-04-24  1:24 ` [patch 44/44] jfs " Nick Piggin
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, hch

[-- Attachment #1: fs-sysv-aops.patch --]
[-- Type: text/plain, Size: 6120 bytes --]

Cc: hch@infradead.org
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/sysv/dir.c   |   45 +++++++++++++++++++++++++--------------------
 fs/sysv/itree.c |   23 +++++++++++++++++++----
 2 files changed, 44 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/sysv/itree.c
===================================================================
--- linux-2.6.orig/fs/sysv/itree.c
+++ linux-2.6/fs/sysv/itree.c
@@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p
 {
 	return block_write_full_page(page,get_block,wbc);
 }
+
 static int sysv_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,get_block);
 }
-static int sysv_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __sysv_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				get_block);
 }
+
+static int sysv_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return __sysv_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,get_block);
 }
+
 const struct address_space_operations sysv_aops = {
 	.readpage = sysv_readpage,
 	.writepage = sysv_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = sysv_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = sysv_write_begin,
+	.write_end = generic_write_end,
 	.bmap = sysv_bmap
 };
Index: linux-2.6/fs/sysv/dir.c
===================================================================
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -37,12 +37,13 @@ static inline unsigned long dir_pages(st
 	return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = (struct inode *)page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
 
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -186,7 +187,7 @@ int sysv_add_link(struct dentry *dentry,
 	unsigned long npages = dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 
 	/* We take care of directory expansion in the same loop */
@@ -212,16 +213,17 @@ int sysv_add_link(struct dentry *dentry,
 	return -EINVAL;
 
 got_it:
-	from = (char*)de - (char*)page_address(page);
-	to = from + SYSV_DIRSIZE;
+	pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char*)de - (char*)page_address(page);
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err)
 		goto out_unlock;
 	memcpy (de->name, name, namelen);
 	memset (de->name + namelen, 0, SYSV_DIRSIZE - namelen - 2);
 	de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);
 out_page:
@@ -238,15 +240,15 @@ int sysv_delete_entry(struct sysv_dir_en
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = (char*)page_address(page);
-	unsigned from = (char*)de - kaddr;
-	unsigned to = from + SYSV_DIRSIZE;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
 	int err;
 
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __sysv_write_begin(NULL, mapping, pos, SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = 0;
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
 	dir_put_page(page);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
@@ -263,12 +265,13 @@ int sysv_make_empty(struct inode *inode,
 
 	if (!page)
 		return -ENOMEM;
-	kmap(page);
-	err = mapping->a_ops->prepare_write(NULL, page, 0, 2 * SYSV_DIRSIZE);
+	err = __sysv_write_begin(NULL, mapping, 0, 2 * SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err) {
 		unlock_page(page);
 		goto fail;
 	}
+	kmap(page);
 
 	base = (char*)page_address(page);
 	memset(base, 0, PAGE_CACHE_SIZE);
@@ -280,9 +283,9 @@ int sysv_make_empty(struct inode *inode,
 	de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), dir->i_ino);
 	strcpy(de->name,"..");
 
+	kunmap(page);
 	err = dir_commit_chunk(page, 0, 2 * SYSV_DIRSIZE);
 fail:
-	kunmap(page);
 	page_cache_release(page);
 	return err;
 }
@@ -336,16 +339,18 @@ not_empty:
 void sysv_set_link(struct sysv_dir_entry *de, struct page *page,
 	struct inode *inode)
 {
-	struct inode *dir = (struct inode*)page->mapping->host;
-	unsigned from = (char *)de-(char*)page_address(page);
-	unsigned to = from + SYSV_DIRSIZE;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *)de-(char*)page_address(page);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __sysv_write_begin(NULL, mapping, pos, SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
 	dir_put_page(page);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 43/44] minix convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (41 preceding siblings ...)
  2007-04-24  1:24 ` [patch 42/44] sysv " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  2007-04-24  1:24 ` [patch 44/44] jfs " Nick Piggin
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, Mark Fasheh, Andries Brouwer

[-- Attachment #1: fs-minix-aops.patch --]
[-- Type: text/plain, Size: 5827 bytes --]

Cc: Andries Brouwer <Andries.Brouwer@cwi.nl>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/minix/dir.c   |   43 +++++++++++++++++++++++++------------------
 fs/minix/inode.c |   23 +++++++++++++++++++----
 2 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -348,24 +348,39 @@ static int minix_writepage(struct page *
 {
 	return block_write_full_page(page, minix_get_block, wbc);
 }
+
 static int minix_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,minix_get_block);
 }
-static int minix_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __minix_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,minix_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				minix_get_block);
 }
+
+static int minix_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return __minix_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,minix_get_block);
 }
+
 static const struct address_space_operations minix_aops = {
 	.readpage = minix_readpage,
 	.writepage = minix_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = minix_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = minix_write_begin,
+	.write_end = generic_write_end,
 	.bmap = minix_bmap
 };
 
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -9,6 +9,7 @@
  */
 
 #include "minix.h"
+#include <linux/buffer_head.h>
 #include <linux/highmem.h>
 #include <linux/smp_lock.h>
 
@@ -48,11 +49,12 @@ static inline unsigned long dir_pages(st
 	return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = (struct inode *)page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -220,7 +222,7 @@ int minix_add_link(struct dentry *dentry
 	char *kaddr, *p;
 	minix_dirent *de;
 	minix3_dirent *de3;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 	char *namx = NULL;
 	__u32 inumber;
@@ -272,9 +274,9 @@ int minix_add_link(struct dentry *dentry
 	return -EINVAL;
 
 got_it:
-	from = p - (char*)page_address(page);
-	to = from + sbi->s_dirsize;
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	pos = (page->index << PAGE_CACHE_SHIFT) + p - (char*)page_address(page);
+	err = __minix_write_begin(NULL, page->mapping, pos, sbi->s_dirsize,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err)
 		goto out_unlock;
 	memcpy (namx, name, namelen);
@@ -285,7 +287,7 @@ got_it:
 		memset (namx + namelen, 0, sbi->s_dirsize - namelen - 2);
 		de->inode = inode->i_ino;
 	}
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, sbi->s_dirsize);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);
 out_put:
@@ -302,15 +304,16 @@ int minix_delete_entry(struct minix_dir_
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = page_address(page);
-	unsigned from = (char*)de - kaddr;
-	unsigned to = from + minix_sb(inode->i_sb)->s_dirsize;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+	unsigned len = minix_sb(inode->i_sb)->s_dirsize;
 	int err;
 
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __minix_write_begin(NULL, mapping, pos, len,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err == 0) {
 		de->inode = 0;
-		err = dir_commit_chunk(page, from, to);
+		err = dir_commit_chunk(page, pos, len);
 	} else {
 		unlock_page(page);
 	}
@@ -330,7 +333,8 @@ int minix_make_empty(struct inode *inode
 
 	if (!page)
 		return -ENOMEM;
-	err = mapping->a_ops->prepare_write(NULL, page, 0, 2 * sbi->s_dirsize);
+	err = __minix_write_begin(NULL, mapping, 0, 2 * sbi->s_dirsize,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err) {
 		unlock_page(page);
 		goto fail;
@@ -421,17 +425,20 @@ not_empty:
 void minix_set_link(struct minix_dir_entry *de, struct page *page,
 	struct inode *inode)
 {
-	struct inode *dir = (struct inode*)page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	struct minix_sb_info *sbi = minix_sb(dir->i_sb);
-	unsigned from = (char *)de-(char*)page_address(page);
-	unsigned to = from + sbi->s_dirsize;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *)de-(char*)page_address(page);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+
+	err = __minix_write_begin(NULL, mapping, pos, sbi->s_dirsize,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err == 0) {
 		de->inode = inode->i_ino;
-		err = dir_commit_chunk(page, from, to);
+		err = dir_commit_chunk(page, pos, sbi->s_dirsize);
 	} else {
 		unlock_page(page);
 	}

-- 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [patch 44/44] jfs convert to new aops
  2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
                   ` (42 preceding siblings ...)
  2007-04-24  1:24 ` [patch 43/44] minix " Nick Piggin
@ 2007-04-24  1:24 ` Nick Piggin
  43 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Filesystems, jfs-discussion, shaggy, Mark Fasheh

[-- Attachment #1: fs-jfs-aops.patch --]
[-- Type: text/plain, Size: 2361 bytes --]

Cc: shaggy@austin.ibm.com
Cc: jfs-discussion@lists.sourceforge.net
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/jfs/inode.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c
+++ linux-2.6/fs/jfs/inode.c
@@ -256,7 +256,7 @@ int jfs_get_block(struct inode *ip, sect
 
 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return nobh_writepage(page, jfs_get_block, wbc);
+	return block_write_full_page(page, jfs_get_block, wbc);
 }
 
 static int jfs_writepages(struct address_space *mapping,
@@ -276,10 +276,13 @@ static int jfs_readpages(struct file *fi
 	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
 }
 
-static int jfs_prepare_write(struct file *file,
-			     struct page *page, unsigned from, unsigned to)
-{
-	return nobh_prepare_write(page, from, to, jfs_get_block);
+static int jfs_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				jfs_get_block);
 }
 
 static sector_t jfs_bmap(struct address_space *mapping, sector_t block)
@@ -303,8 +306,8 @@ const struct address_space_operations jf
 	.writepage	= jfs_writepage,
 	.writepages	= jfs_writepages,
 	.sync_page	= block_sync_page,
-	.prepare_write	= jfs_prepare_write,
-	.commit_write	= nobh_commit_write,
+	.write_begin	= jfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= jfs_bmap,
 	.direct_IO	= jfs_direct_IO,
 };
@@ -357,7 +360,7 @@ void jfs_truncate(struct inode *ip)
 {
 	jfs_info("jfs_truncate: size = 0x%lx", (ulong) ip->i_size);
 
-	nobh_truncate_page(ip->i_mapping, ip->i_size);
+	block_truncate_page(ip->i_mapping, ip->i_size, jfs_get_block);
 
 	IWRITE_LOCK(ip, RDWRLOCK_NORMAL);
 	jfs_truncate_nolock(ip, ip->i_size);

-- 



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 06/44] mm: trim more holes
  2007-04-24  1:23 ` [patch 06/44] mm: trim more holes Nick Piggin
@ 2007-04-24  6:07   ` Neil Brown
  2007-04-24  6:17     ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Neil Brown @ 2007-04-24  6:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tuesday April 24, npiggin@suse.de wrote:
> 
> If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> we may have failed the write operation despite prepare_write having
> instantiated blocks past i_size. Fix this, and consolidate the trimming into
> one place.
> 
..
> @@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
>  						cur_iov, iov_offset, bytes);
>  		flush_dcache_page(page);
>  		status = a_ops->commit_write(file, page, offset, offset+bytes);
> -		if (status == AOP_TRUNCATED_PAGE) {
> -			page_cache_release(page);
> -			continue;
> +		if (unlikely(status < 0))
> +			goto fs_write_aop_error;
> +		if (unlikely(copied != bytes)) {
> +			status = -EFAULT;
> +			goto fs_write_aop_error;
>  		}

It isn't clear to me that you are handling the case
       status == AOP_TRUNCATED_PAGE
here.  AOP_TRUNCATED_PAGE is > 0 (0x80001 to be precise)

Maybe ->commit_write cannot return AOP_TRUNCATED_PAGE.  If that is
true, then a comment to that effect (i.e. that the old code was wrong)
in the change log might ease review. 

Or did I miss something?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 06/44] mm: trim more holes
  2007-04-24  6:07   ` Neil Brown
@ 2007-04-24  6:17     ` Nick Piggin
  0 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  6:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tue, Apr 24, 2007 at 04:07:35PM +1000, Neil Brown wrote:
> On Tuesday April 24, npiggin@suse.de wrote:
> > 
> > If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> > we may have failed the write operation despite prepare_write having
> > instantiated blocks past i_size. Fix this, and consolidate the trimming into
> > one place.
> > 
> ..
> > @@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
> >  						cur_iov, iov_offset, bytes);
> >  		flush_dcache_page(page);
> >  		status = a_ops->commit_write(file, page, offset, offset+bytes);
> > -		if (status == AOP_TRUNCATED_PAGE) {
> > -			page_cache_release(page);
> > -			continue;
> > +		if (unlikely(status < 0))
> > +			goto fs_write_aop_error;
> > +		if (unlikely(copied != bytes)) {
> > +			status = -EFAULT;
> > +			goto fs_write_aop_error;
> >  		}
> 
> It isn't clear to me that you are handling the case
>        status == AOP_TRUNCATED_PAGE
> here.  AOP_TRUNCATED_PAGE is > 0 (0x80001 to be precise)

Yes, you are right there.


> Maybe ->commit_write cannot return AOP_TRUNCATED_PAGE.  If that is
> true, then a comment to that effect (i.e. that the old code was wrong)
> in the change log might easy review. 
> 
> Or did I miss something?

Actually, it seems that the old ocfs2 code (in mainline, not -mm) can
return AOP_TRUNCATED_PAGE from commit_write.

So that line should be changed to
+           if (unlikely(status < 0 || status == AOP_TRUNCATED_PAGE)) 

Although we get rid of it in a subsequent patch anyway.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
  2007-04-24  1:23 ` [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops Nick Piggin
@ 2007-04-24  6:59   ` Neil Brown
  2007-04-24  7:23     ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Neil Brown @ 2007-04-24  6:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tuesday April 24, npiggin@suse.de wrote:
> +  write_begin: This is intended as a replacement for prepare_write. Called
> +        by the generic buffered write code to ask the filesystem to prepare
> +        to write len bytes at the given offset in the file. flags is a field
> +        for AOP_FLAG_xxx flags, described in include/linux/mm.h.

Putting "This is intended as a replacement.." there sees a bit
dangerous.  It could well accidentally remain when the documentation
for prepare_write gets removed.  I would make it a separate paragraph
and flesh it out.  And include text from prepare_write before that
gets removed.

   write_begin:
         This is intended as a replacement for prepare_write.  The key 
         differences being that:
		- it returns a locked page (in *pagep) rather than
                  being given a pre-locked page:
		- it can pass arbitrary state to write_end rather than
		  having to hide stuff in some filesystem-internal
	          data structure 
		- The (largely undocumented) flags option.

         Called by the generic buffered write code to ask an
         address_space to prepare to write len bytes at the given
         offset in the file.

	 The address_space should check that the write will be able to
  	 complete, by allocating space if necessary and doing any other
  	 internal housekeeping.  If the write will update parts of any
  	 basic-blocks on storage, then those blocks should be pre-read
  	 (if they haven't been read already) so that the updated blocks
  	 can be written out properly.
	 The possible flags are listed in include/linux/fs.h (not
  	 mm.h) and include
		AOP_FLAG_UNINTERRUPTIBLE:
			It is unclear how this should be used.  No
		  	current code handles it.

(together with the rest...)
> +
> +        The filesystem must return the locked pagecache page for the caller
> +        to write into.
> +
> +        A void * may be returned in fsdata, which then gets passed into
> +        write_end.
> +
> +        Returns < 0 on failure, in which case all cleanup must be done and
> +        write_end not called. 0 on success, in which case write_end must
> +        be called.


As you are not including perform_write in the current patchset, maybe
it is best not to include the documentation yet either?

> +  perform_write: This is a single-call, bulk version of write_begin/write_end
> +        operations. It is only used in the buffered write path (write_begin
> +        must still be implemented), and not for in-kernel writes to pagecache.
> +        It takes an iov_iter structure, which provides a descriptor for the
> +        source data (and has associated iov_iter_xxx helpers to operate on
> +        that data). There are also file, mapping, and pos arguments, which
> +        specify the destination of the data.
> +
> +        Returns < 0 on failure if nothing was written out, otherwise returns
> +        the number of bytes copied into pagecache.
> +
> +        fs/libfs.c provides a reasonable template to start with, demonstrating
> +        iov_iter routines, and iteration over the destination pagecache.
> +

NeilBrown

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
  2007-04-24  6:59   ` Neil Brown
@ 2007-04-24  7:23     ` Nick Piggin
  2007-04-24  7:49       ` Neil Brown
  0 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24  7:23 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tue, Apr 24, 2007 at 04:59:47PM +1000, Neil Brown wrote:
> On Tuesday April 24, npiggin@suse.de wrote:
> > +  write_begin: This is intended as a replacement for prepare_write. Called
> > +        by the generic buffered write code to ask the filesystem to prepare
> > +        to write len bytes at the given offset in the file. flags is a field
> > +        for AOP_FLAG_xxx flags, described in include/linux/mm.h.
> 
> Putting "This is intended as a replacement.." there sees a bit
> dangerous.  It could well accidentally remain when the documentation
> for prepare_write gets removed.  I would make it a separate paragraph
> and flesh it out.  And include text from prepare_write before that
> gets removed.
> 
>    write_begin:
>          This is intended as a replacement for prepare_write.  The key 
>          differences being that:
> 		- it returns a locked page (in *pagep) rather than
>                   being given a pre-locked page:
> 		- it can pass arbitrary state to write_end rather than
> 		  having to hide stuff in some filesystem-internal
> 	          data structure 
> 		- The (largely undocumented) flags option.
> 
>          Called by  the generic bufferred write code to ask an
>          address_space to prepare to write len bytes at the given
>          offset in the file.
> 
> 	 The address_space should check that the write will be able to
>   	 complete, by allocating space if necessary and doing any other
>   	 internal housekeeping.  If the write will update parts of any
>   	 basic-blocks on storage, then those blocks should be pre-read
>   	 (if they haven't been read already) so that the updated blocks
>   	 can be written out properly.
> 	 The possible flags are listed in include/linux/fs.h (not
>   	 mm.h) and include
> 		AOP_FLAG_UNINTERRUPTIBLE:
> 			It is unclear how this should be used.  No
> 		  	current code handles it.

Yeah, reasonable points. I'll do an incremental patch to clean up
some of the documentation.

BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
an initial read or other sequence they might be using to handle the
case of a short write. ecryptfs uses it, others can too.

For buffered writes, this doesn't get passed in (unless they are
coming from kernel space), so I was debating whether to have it at
all.  However, in the previous API, _nobody_ had to worry about
short writes, so this flag means I avoid making an API that is
fundamentally less performant in some situations.
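
To make that concrete, here is a rough sketch (not from the patchset;
example_readpage_locked is a made-up stand-in for whatever a filesystem
uses to bring a locked page uptodate) of how a write_begin might test
the flag:

	static int example_write_begin(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page;
		int ret;

		page = __grab_cache_page(mapping, index);
		if (!page)
			return -ENOMEM;
		*pagep = page;

		if (PageUptodate(page))
			return 0;

		/*
		 * A whole-page overwrite needs no read-in, but only if the
		 * copy is guaranteed not to be short -- otherwise write_end
		 * would be left with a partially-copied, non-uptodate page.
		 */
		if (len == PAGE_CACHE_SIZE && (flags & AOP_FLAG_UNINTERRUPTIBLE))
			return 0;

		/* Bring the page uptodate so a short or partial copy is safe. */
		ret = example_readpage_locked(mapping->host, page);
		if (ret) {
			unlock_page(page);
			page_cache_release(page);
		}
		return ret;
	}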

> 
> (together with the rest...)
> > +
> > +        The filesystem must return the locked pagecache page for the caller
> > +        to write into.
> > +
> > +        A void * may be returned in fsdata, which then gets passed into
> > +        write_end.
> > +
> > +        Returns < 0 on failure, in which case all cleanup must be done and
> > +        write_end not called. 0 on success, in which case write_end must
> > +        be called.
> 
> 
> As you are not including perform_write in the current patchset, maybe
> it is best not to include the documentation yet either?

Right, missed that, thanks!

Nick

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
  2007-04-24  7:23     ` Nick Piggin
@ 2007-04-24  7:49       ` Neil Brown
  2007-04-24 10:37         ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Neil Brown @ 2007-04-24  7:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tuesday April 24, npiggin@suse.de wrote:
> 
> BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
> an initial read or other sequence they might be using to handle the
> case of a short write. ecryptfs uses it, others can too.
> 
> For buffered writes, this doesn't get passed in (unless they are
> coming from kernel space), so I was debating whether to have it at
> all.  However, in the previous API, _nobody_ had to worry about
> short writes, so this flag means I avoid making an API that is
> fundamentally less performant in some situations.

Ahhh I think I get it now.

  In general, the address_space must cope with the possibility that
  fewer than the expected number of bytes is copied.  This may leave
  parts of the page with invalid data.  This can be handled by
  pre-loading the page with valid data, however this may cause a
  significant performance cost.
  The write_begin/write_end interface provides two mechanisms by which
  this case can be handled more efficiently (a rough sketch of the second
  follows the list below).
  1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
    not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
    If that is set, inefficient preparation can be avoided.  However the
    most common write paths will never set this flag.
  2/ The return from write_end can declare that fewer bytes have been
    accepted. e.g. part of the page may have been loaded from backing
    store, overwriting some of the newly written bytes.  If this
    return value is reduced, a new write_begin/write_end cycle
    may be called to attempt to write the bytes again.
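
  A rough sketch of mechanism 2 (generic, not lifted from any particular
  filesystem; i_size and other metadata updates are omitted) might look
  like:

	static int example_write_end(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned copied,
			struct page *page, void *fsdata)
	{
		if (!PageUptodate(page)) {
			/*
			 * Short copy onto a non-uptodate page: the uncopied
			 * tail is garbage, so accept nothing and let the
			 * caller run another write_begin/write_end cycle.
			 */
			if (copied < len)
				copied = 0;
			else if (len == PAGE_CACHE_SIZE)
				SetPageUptodate(page);
		}

		if (copied)
			set_page_dirty(page);
		unlock_page(page);
		page_cache_release(page);
		return copied;
	}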

Also
+  write_end: After a successful write_begin, and data copy, write_end must
+        be called. len is the original len passed to write_begin, and copied
+        is the amount that was able to be copied (they must be equal if
+        write_begin was called with intr == 0).
+

That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being
set".
And "that was able to be copied" is misleading, as the copy is not done in
write_end.  Maybe "that was accepted".

It seems to make sense now.  I might try re-reviewing the patches based
on this improved understanding.... only a public holiday looms :-)

NeilBrown

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
  2007-04-24  7:49       ` Neil Brown
@ 2007-04-24 10:37         ` Nick Piggin
  0 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24 10:37 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tue, Apr 24, 2007 at 05:49:48PM +1000, Neil Brown wrote:
> On Tuesday April 24, npiggin@suse.de wrote:
> > 
> > BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
> > an initial read or other sequence they might be using to handle the
> > case of a short write. ecryptfs uses it, others can too.
> > 
> > For buffered writes, this doesn't get passed in (unless they are
> > coming from kernel space), so I was debating whether to have it at
> > all.  However, in the previous API, _nobody_ had to worry about
> > short writes, so this flag means I avoid making an API that is
> > fundamentally less performant in some situations.
> 
> Ahhh I think I get it now.
> 
>   In general, the address_space must cope with the possibility that
>   fewer than the expected number of bytes is copied.  This may leave
>   parts of the page with invalid data.  This can be handled by
>   pre-loading the page with valid data, however this may cause a
>   significant performance cost.

Right. Bringing the page uptodate at write_begin-time is probably
the simplest way to handle it. However, more sophisticated schemes
are possible. For example, the generic block routines can recover
at write_end-time, and probably can't make use of the flag to do
things much better...


>   The write_begin/write_end interface provide two mechanism by which
>   this case can be handled more efficiently.
>   1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
>     not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
>     If that is set, inefficient preparation can be avoided.  However the
>     most common write paths will never set this flag.

Yes, loop, nfsd, and filesystem-specific pagecache modification
(eg. ext2 directories) are probably the main things that use it.


>   2/ The return from write_end can declare that fewer bytes have been
>     accepted. e.g. part of the page may have been loaded from backing
>     store, overwriting some of the newly written bytes.  If this
>     return value is reduced, a new write_begin/write_end cycle
>     may be called to attempt to write the bytes again.

Yeah, although you'd have to be careful not to overwrite things if
the page is uptodate (because uptodate *really* means uptodate --
ie.  it is the only thing we have to synchronise buffered reads from
returning the data to userspace).


> 
> Also
> +  write_end: After a successful write_begin, and data copy, write_end must
> +        be called. len is the original len passed to write_begin, and copied
> +        is the amount that was able to be copied (they must be equal if
> +        write_begin was called with intr == 0).
> +
> 
> That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being
> set".
> And "that was able to be copied" is misleading, as the copy is not done in
> write_end.  Maybe "that was accepted".

Thanks, very good eyes and good suggestions.

Actually I'm a bit worried about this copied vs accepted thing -- we've
already copied some number of bytes into the pagecache by the time write_end
is called. So if the filesystem accepts less and the pagecache page is marked
uptodate, then the pagecache is now out of sync with the filesystem. There
are a few places where it looks like we get this wrong... but that's for a
future patch :P
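
For reference, the generic caller does roughly this per iteration
(paraphrased, not verbatim; copy_from_user_to_page stands in for however
the bytes actually get copied), which is where a smaller-than-copied
return value would bite:

	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
							&page, &fsdata);
	if (status)
		break;

	/* copy the user data into the (locked) pagecache page */
	copied = copy_from_user_to_page(page, offset, buf, bytes);
	flush_dcache_page(page);

	status = a_ops->write_end(file, mapping, pos, bytes, copied,
							page, fsdata);
	/*
	 * If write_end accepts less than 'copied' while the page is
	 * uptodate, buffered reads can already see bytes the filesystem
	 * never accepted -- the inconsistency worried about above.
	 */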

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 13/44] mm: restore KERNEL_DS optimisations
  2007-04-24  1:23 ` [patch 13/44] mm: restore KERNEL_DS optimisations Nick Piggin
@ 2007-04-24 10:43   ` Christoph Hellwig
  2007-04-24 11:03     ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2007-04-24 10:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
> Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
> path.
> 
> This may be a pretty questionable gain in most cases, especially after the
> legacy 2copy write path is removed, but it doesn't cost much.

Well, it gets removed later and sets a bad precedent.  Instead of
adding hacks we should have proper methods for kernel-space read/writes.
Especially as the latter are a lot simpler and most of the magic
in this patch series is not needed.  I'll start this work once
your patch series is in.

In general there seems to be a lot of stuff in the earlier patches
that just goes away later and doesn't make much sense in the series.
Is there a good reason not to simply consolidate out those changes
completely?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24  1:24 ` [patch 16/44] rd " Nick Piggin
@ 2007-04-24 10:46   ` Christoph Hellwig
  2007-04-24 11:05     ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2007-04-24 10:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Linux Filesystems, Mark Fasheh

> +	page = __grab_cache_page(mapping, index);

btw, __grab_cache_page should probably get a more descriptive and
non-__-prefixed name now that it's used all over the place.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 13/44] mm: restore KERNEL_DS optimisations
  2007-04-24 10:43   ` Christoph Hellwig
@ 2007-04-24 11:03     ` Nick Piggin
  0 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24 11:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Linux Filesystems, Mark Fasheh,
	Linux Memory Management

On Tue, Apr 24, 2007 at 11:43:18AM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
> > Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
> > path.
> > 
> > This may be a pretty questionable gain in most cases, especially after the
> > legacy 2copy write path is removed, but it doesn't cost much.
> 
> Well, it gets removed later and sets a bad precedence.  Instead of
> adding hacks we should have proper methods for kernel-space read/writes.
> Especially as the latter are a lot simpler and most of the magic
> in this patch series is not needed.  I'll start this work once
> your patch series is in.

It was removed earlier and put back in here. I agree it isn't so
important, but again it does help that the patchset introduces no
obvious regression. You could remove it in your patchset?


> In general there seems to be a lot of stuff in the earlier patches
> that just goes away later and doesn't make much sense in the series.
> Is there a good reason not to simply consolidate out those changes
> completely?

I guess the first half of the patchset -- the slow deadlock fix for
the old prepare_write path -- came about because that's the only
reasonable way I could find to fix it. I initially thought it would
take a lot longer to convert all filesystems and that we might want
to stay compatible for a while, which is why I wanted to ensure that
was working.

Basically I can't really see which ones you think I should merge and
still be able to retain a working kernel.

Granted there are a couple of bugfixes and some slightly orthogonal
cleanups in there, but I just thought I'd submit them in the same
series because it was a little easier for me.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 10:46   ` Christoph Hellwig
@ 2007-04-24 11:05     ` Nick Piggin
  2007-04-24 11:11       ` Christoph Hellwig
  0 siblings, 1 reply; 61+ messages in thread
From: Nick Piggin @ 2007-04-24 11:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > +	page = __grab_cache_page(mapping, index);
> 
> btw, __grab_cache_page should probably get a more descriptive and
> non-__-prefixed name now that it's used all over the place.

Agreed. Suggestions? ;)


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 11:05     ` Nick Piggin
@ 2007-04-24 11:11       ` Christoph Hellwig
  2007-04-24 11:16         ` Nick Piggin
  0 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2007-04-24 11:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > +	page = __grab_cache_page(mapping, index);
> > 
> > btw, __grab_cache_page should probably get a more descriptive and
> > non-__-prefixed name now that it's used all over the place.
> 
> Agreed. Suggestions? ;)

find_or_create_cache_page, given that it's like find_or_create_page +
add_to_page_cache?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 11:11       ` Christoph Hellwig
@ 2007-04-24 11:16         ` Nick Piggin
  2007-04-24 11:18           ` Christoph Hellwig
  2007-04-24 11:42           ` Neil Brown
  0 siblings, 2 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24 11:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > > +	page = __grab_cache_page(mapping, index);
> > > 
> > > btw, __grab_cache_page should probably get a more descriptive and
> > > non-__-prefixed name now that it's used all over the place.
> > 
> > Agreed. Suggestions? ;)
> 
> find_or_create_cache_page, given that it's like find_or_create_page +
> add_to_page_cache?

find_or_create_page adds to page cache as well, though :P

All those random little slightly different allocators scattered over
filemap.c are a bit annoying. Basically I think it would be better
to have a single variant that takes a gfp_mask for both the pagecache
page allocation and the radix-tree insertion. Then several things can
be converted to use it.

I was going to try doing that after this patchset. Or do you think it
would be better to get the __grab_cache_page name right in the
first place?
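
A rough sketch of the single variant described above; the name
grab_cache_page_gfp and its signature are hypothetical, not something
in this patchset or in mainline:

/* Hypothetical unified helper: one gfp for the page allocation, one
 * for the radix-tree insertion.  Simplified -- no -EEXIST retry. */
struct page *grab_cache_page_gfp(struct address_space *mapping,
				 pgoff_t index,
				 gfp_t page_gfp, gfp_t radix_gfp)
{
	struct page *page = find_lock_page(mapping, index);

	if (!page) {
		page = __page_cache_alloc(page_gfp);
		if (page && add_to_page_cache_lru(page, mapping, index,
						  radix_gfp)) {
			page_cache_release(page);
			page = NULL;
		}
	}
	return page;	/* locked on success */
}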

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 11:16         ` Nick Piggin
@ 2007-04-24 11:18           ` Christoph Hellwig
  2007-04-24 11:20             ` Nick Piggin
  2007-04-24 11:42           ` Neil Brown
  1 sibling, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2007-04-24 11:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
> > find_or_create_cache_page, given that it's like find_or_create_page +
> > add_to_page_cache?
> 
> find_or_create_page adds to page cache as well, though :P
> 
> All those random little slightly different allocators scattered over
> filemap.c are a bit annoying. Basically I think it would be better
> to have a single variant that takes a gfp_mask for both the pagecache
> page allocation and the radix-tree insertion. Then several things can
> be converted to use it.
> 
> I was going to try doing that after this patchset. Or do you think it
> would be better to get the __grab_cache_page name right in the
> first place?

If you plan to fix things up afterwards, there's no point in doing
the renaming first.  Would be nice to get both into the same release
in the end, though :)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 11:18           ` Christoph Hellwig
@ 2007-04-24 11:20             ` Nick Piggin
  0 siblings, 0 replies; 61+ messages in thread
From: Nick Piggin @ 2007-04-24 11:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 12:18:11PM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
> > I was going to try doing that after this patchset. Or do you think it
> > would be better to get the __grab_cache_page name right in the
> > first place?
> 
> If you plan to fix things up afterwards, there's no point in doing
> the renaming first.  Would be nice to get both into the same release
> in the end, though :)

OK, will do.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 16/44] rd convert to new aops
  2007-04-24 11:16         ` Nick Piggin
  2007-04-24 11:18           ` Christoph Hellwig
@ 2007-04-24 11:42           ` Neil Brown
  1 sibling, 0 replies; 61+ messages in thread
From: Neil Brown @ 2007-04-24 11:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Andrew Morton, Linux Filesystems, Mark Fasheh

On Tuesday April 24, npiggin@suse.de wrote:
> On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
> > On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> > > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > > > +	page = __grab_cache_page(mapping, index);
> > > > 
> > > > btw, __grab_cache_page should probably get a more descriptive and
> > > > non-__-prefixed name now that it's used all over the place.
> > > 
> > > Agreed. Suggestions? ;)
> > 
> > find_or_create_cache_page, given that it's like find_or_create_page +
> > add_to_page_cache?
> 
> find_or_create_page adds to page cache as well, though :P

I would really like to see the word 'lock' in there, as it does
return a locked page.  When I first saw __grab_cache_page recently,
there was an unlock_page afterwards and I couldn't see where the lock
happened, so I was confused for a little while.

NeilBrown
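
The calling pattern being described looks roughly like this
(illustrative only, error handling trimmed, helper and era-specific
kmap calls assumed): the page comes back locked from the helper, so
the unlock_page() has no visible lock_page() counterpart in the caller.

static int write_one_locked_page(struct address_space *mapping,
				 pgoff_t index, const void *src, size_t len)
{
	struct page *page;
	void *dst;

	page = __grab_cache_page(mapping, index);
	if (!page)
		return -ENOMEM;

	dst = kmap_atomic(page, KM_USER0);
	memcpy(dst, src, len);	/* assumes len <= PAGE_CACHE_SIZE */
	kunmap_atomic(dst, KM_USER0);

	set_page_dirty(page);
	unlock_page(page);	/* pairs with the lock taken inside the helper */
	page_cache_release(page);
	return 0;
}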

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [patch 37/44] hostfs convert to new aops
  2007-04-24  1:24 ` [patch 37/44] hostfs " Nick Piggin
@ 2007-04-27 16:11   ` Jeff Dike
  0 siblings, 0 replies; 61+ messages in thread
From: Jeff Dike @ 2007-04-27 16:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Linux Filesystems, Mark Fasheh

On Tue, Apr 24, 2007 at 11:24:23AM +1000, Nick Piggin wrote:
> This also gets rid of a lot of useless read_file stuff. And also
> optimises the full page write case by marking a !uptodate page uptodate.
> 
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Looks good.

Acked-by: Jeff Dike <jdike@linux.intel.com>
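
As an aside, the full-page-write optimisation mentioned in the quoted
changelog amounts to roughly the following (a sketch, not hostfs's
actual code): if the write covers the entire page, the old contents
never need to be read in, so the page can simply be marked uptodate
once the copy lands.

	/* Sketch: a write that fills the whole page makes it uptodate. */
	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
		SetPageUptodate(page);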

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2007-04-27 16:16 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-24  1:23 [patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1 Nick Piggin
2007-04-24  1:23 ` [patch 01/44] mm: revert KERNEL_DS buffered write optimisation Nick Piggin
2007-04-24  1:23 ` [patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 Nick Piggin
2007-04-24  1:23 ` [patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 Nick Piggin
2007-04-24  1:23 ` [patch 04/44] mm: clean up buffered write code Nick Piggin
2007-04-24  1:23 ` [patch 05/44] mm: debug write deadlocks Nick Piggin
2007-04-24  1:23 ` [patch 06/44] mm: trim more holes Nick Piggin
2007-04-24  6:07   ` Neil Brown
2007-04-24  6:17     ` Nick Piggin
2007-04-24  1:23 ` [patch 07/44] mm: buffered write cleanup Nick Piggin
2007-04-24  1:23 ` [patch 08/44] mm: write iovec cleanup Nick Piggin
2007-04-24  1:23 ` [patch 09/44] mm: fix pagecache write deadlocks Nick Piggin
2007-04-24  1:23 ` [patch 10/44] mm: buffered write iterator Nick Piggin
2007-04-24  1:23 ` [patch 11/44] fs: fix data-loss on error Nick Piggin
2007-04-24  1:23 ` [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops Nick Piggin
2007-04-24  6:59   ` Neil Brown
2007-04-24  7:23     ` Nick Piggin
2007-04-24  7:49       ` Neil Brown
2007-04-24 10:37         ` Nick Piggin
2007-04-24  1:23 ` [patch 13/44] mm: restore KERNEL_DS optimisations Nick Piggin
2007-04-24 10:43   ` Christoph Hellwig
2007-04-24 11:03     ` Nick Piggin
2007-04-24  1:24 ` [patch 14/44] implement simple fs aops Nick Piggin
2007-04-24  1:24 ` [patch 15/44] block_dev convert to new aops Nick Piggin
2007-04-24  1:24 ` [patch 16/44] rd " Nick Piggin
2007-04-24 10:46   ` Christoph Hellwig
2007-04-24 11:05     ` Nick Piggin
2007-04-24 11:11       ` Christoph Hellwig
2007-04-24 11:16         ` Nick Piggin
2007-04-24 11:18           ` Christoph Hellwig
2007-04-24 11:20             ` Nick Piggin
2007-04-24 11:42           ` Neil Brown
2007-04-24  1:24 ` [patch 17/44] ext2 " Nick Piggin
2007-04-24  1:24 ` [patch 18/44] ext3 " Nick Piggin
2007-04-24  1:24 ` [patch 19/44] ext4 " Nick Piggin
2007-04-24  1:24 ` [patch 20/44] xfs " Nick Piggin
2007-04-24  1:24 ` [patch 21/44] fs: new cont helpers Nick Piggin
2007-04-24  1:24 ` [patch 22/44] fat convert to new aops Nick Piggin
2007-04-24  1:24 ` [patch 23/44] adfs " Nick Piggin
2007-04-24  1:24 ` [patch 24/44] affs " Nick Piggin
2007-04-24  1:24 ` [patch 25/44] hfs " Nick Piggin
2007-04-24  1:24 ` [patch 26/44] hfsplus " Nick Piggin
2007-04-24  1:24 ` [patch 27/44] hpfs " Nick Piggin
2007-04-24  1:24 ` [patch 28/44] bfs " Nick Piggin
2007-04-24  1:24 ` [patch 29/44] qnx4 " Nick Piggin
2007-04-24  1:24 ` [patch 30/44] nfs " Nick Piggin
2007-04-24  1:24 ` [patch 31/44] smb " Nick Piggin
2007-04-24  1:24 ` [patch 32/44] ocfs2: " Nick Piggin
2007-04-24  1:24 ` [patch 33/44] gfs2 " Nick Piggin
2007-04-24  1:24 ` [patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes Nick Piggin
2007-04-24  1:24 ` [patch 35/44] ecryptfs convert to new aops Nick Piggin
2007-04-24  1:24 ` [patch 36/44] fuse " Nick Piggin
2007-04-24  1:24 ` [patch 37/44] hostfs " Nick Piggin
2007-04-27 16:11   ` Jeff Dike
2007-04-24  1:24 ` [patch 38/44] jffs2 " Nick Piggin
2007-04-24  1:24 ` [patch 39/44] cifs " Nick Piggin
2007-04-24  1:24 ` [patch 40/44] ufs " Nick Piggin
2007-04-24  1:24 ` [patch 41/44] udf " Nick Piggin
2007-04-24  1:24 ` [patch 42/44] sysv " Nick Piggin
2007-04-24  1:24 ` [patch 43/44] minix " Nick Piggin
2007-04-24  1:24 ` [patch 44/44] jfs " Nick Piggin
